EPFL & ETHZ
09.09.18 - 13.09.18, Magliaso, Ticino, Switzerland
How to make your code readable, maintainable, and scalable?
How to share code and data in a sustainable way?
How to create software that will be useful to others?
|16:30 - 17:30||Welcome and keynote: The motivation behind reproducible science. Why does it matter? — Luc Henry (EPFL) (Video, Slides)|
|09:00 - 10:00||A software development workflow for academic research — Wenzel Jakob (EPFL) (Slides)|
|10:00 - 11:00||The philosophy of reproducible quantitative methods — Christie Bahlai (Kent State University) (Video, Slides)|
|11:00 - 11:30||Break|
|11:30 - 12:30||Research from an engineering viewpoint: Software development process and best practices — Filip Pavetić (Google)|
|12:30 - 13:30||Lunch|
|13:30 - 14:30||Advanced git workshop — Sourabh Lal (EPFL, GitHub Campus Expert) (Video, Slides)|
|14:30 - 18:00||Tools for reproducible research (Workshop) — Tim Head (Wild Tree Tech) (Video, Slides)|
|09:00 - 10:00||Introduction to the Renku platform — Eric Bouillet (SDSC) (Slides)|
|10:00 - 10:45||The journal as a medium for publishing software — Kate Keahey (University of Chicago)|
|10:45 - 11:00||Break|
|11:00 - 11:30||Automatic tools for reproducible research — Kate Keahey (University of Chicago)|
|11:30 - 12:30||Techniques and guidelines for reporting reproducible and statistically sound results — Torsten Hoefler (ETH Zürich) (Video, Slides)|
|12:30 - 13:30||Lunch|
|13:30 - 15:00||Discussion Panel|
|15:00 - 15:30||Break|
|15:30 - 17:30||Data management in research — Ana Sesartic Petrus & Malin Michelle Ziehmer (Digital Curation Office, ETH Zürich) (Video, Slides, Workshop notes 1, 2)|
|17:30 - 18:00||Practical data management — Anna Krystalli (University of Sheffield) (dataspice, tutorial)|
|09:00 - 10:00||Publishing and maintaining open data — Bastian Greshake Tzovaras (Video, Slides)|
|10:00 - 11:00||Tools for reproducibility in Statistics and Machine Learning — Heidi Seibold (LMU Munich) (Video, Slides)|
|11:00 - 11:30||Break|
|11:30 - 12:30||Project work|
|12:30 - 13:30||Lunch|
|13:30 - 18:00||How to write the perfect reproducible paper (Workshop) — Christie Bahlai (Kent State University) & Anna Krystalli (University of Sheffield)|
|09:00 - 11:00||Presentations|
|11:00 - 13:00||Good-bye apero and lunch|
Christie is an Assistant Professor at Kent State University and head of the Bahlai Lab. She is an applied quantitative ecologist and population ecologist who uses approaches from data science to help solve problems in conservation, sustainability, and ecosystem management. She combines a background in physics and organismal ecology with influences from the tech sector and conservation NGOs to ask questions and build tools addressing problems in population ecology.
Christie is a strong advocate of open science and has taught workshops and classes on open data management and data analysis, such as this workshop, to push for higher reproducibility standards in science. During this summer school, she will lead a workshop exploring the reproducibility of papers prepared using open science methods and talk about some of the philosophies and motivations behind these approaches.
Eric Bouillet received his PhD in Electrical Engineering from Columbia University, New York, NY, in June 1999. He worked at the IBM T.J. Watson Research Center, Hawthorne, NY, from June 2004, and at the IBM Smarter City Technical Centre, Dublin, from October 2010 to August 2016. At IBM he worked on scalable data stream analytics applied to a number of fields, including finance, law enforcement, telecommunications, environmental monitoring, intelligent transport systems, and aircraft reliability control systems. He is currently Head of Engineering at the Swiss Data Science Center (SDSC).
Eric will share with us his experience building large-scale analytics systems and the challenges that arise from an engineering point of view. He will also introduce Renku, a platform developed at the SDSC that facilitates collaboration and reproducibility for data scientists.
Bastian is a biologist-turned-bioinformatician. In 2011 he co-founded openSNP, an open data repository that allows individuals to donate their personal genomes into the public domain. Since then over 4,100 datasets have been donated through openSNP, making it the world's largest database of its kind. After finishing his PhD in Bioinformatics, he joined the Open Humans Foundation. As its Director of Research, Bastian facilitates participatory research projects and empowers individuals to take control over sharing their personal data. In addition, Bastian mentors the next generation of Open Leaders with Mozilla and serves on the Board of the Open Bioinformatics Foundation.
While our algorithms become more and more powerful, they also become hungrier and hungrier for data. The large-scale success of machine learning these days is thus not only a function of better algorithms and open source software packages, but also of ever-growing training data sets. For this reason, reproducibility depends not only on good software engineering practices, but also on open data. Bastian will share his experiences in making data open and maintaining it.
Tim is an independent software developer and teacher. He works primarily on data science tools (such as Binder) and teaches machine-learning courses for programmers, scientists and engineers. He has worked with international organisations in Geneva, small startups in Zurich and academics around the world. Tim trained as an experimental physicist and worked at CERN and EPFL for several years.
Tim will introduce the Binder project to us. Binder drastically lowers the bar to sharing and re-using software. For a user, trying out someone else's work requires nothing more than clicking a single link. For an author, preparing a Binder-ready project is much easier than supporting many different platforms, and for many projects it involves little additional work.
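What "Binder-ready" means in practice can be sketched with a minimal example. Binder builds an executable environment from standard dependency files checked into the repository; a `requirements.txt` is the simplest case (the package pins below are purely illustrative):

```
# requirements.txt: pin exact versions so Binder rebuilds the same environment
numpy==1.15.1
matplotlib==2.2.3
```

Visitors then launch the repository through mybinder.org with a single link; nothing needs to be installed locally.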
Luc Henry spent around ten years exploring science in various ways and places across Europe. In 2015, he was the Managing Editor of the European science magazine Technologist. He then worked at the Swiss National Science Foundation before taking up his current position as an advisor to the President of EPFL. Luc holds a PhD in chemical biology from the University of Oxford.
Luc has been interested in all aspects of open science since his early days as a researcher. In his presentation, he will give an overview of how science is changing, and why the transparency and reproducibility of research results are embedded in a broader movement to make knowledge more accessible and useful.
Torsten is an Associate Professor of Computer Science at ETH Zürich, Switzerland. Before joining ETH, he led the performance modeling and simulation efforts of parallel petascale applications for the NSF-funded Blue Waters project at NCSA/UIUC. He is also a key member of the Message Passing Interface (MPI) Forum, where he chairs the "Collective Operations and Topologies" working group. His research interests revolve around the central topic of "Performance-centric System Design" and include scalable networks, parallel programming techniques, and performance modeling.
Torsten has been pushing for reproducibility in the field of high-performance computing for years. During the summer school he will teach techniques and guidelines for reporting reproducible and statistically sound results.
Wenzel is an assistant professor leading the Realistic Graphics Lab at EPFL's School of Computer and Communication Sciences. His research mostly revolves around light transport simulations and material appearance models. The overarching goal of this work is to improve the accuracy and fidelity of light transport simulations in computer graphics, and to make the underlying models accurate and fast enough for predictive design and manufacturing applications.
Wenzel will introduce us to the software development workflow used by his group, which enforces maintainable and reproducible development practices without losing the agility that is needed to make progress on a challenging research problem. He will step through all parts of the stack including Jupyter notebooks, C++14, pybind11, continuous integration and Git submodules in a hands-on manner. He will also share his experiences in creating projects that ultimately became relatively widely used and some of the positive and unexpected fallout that this has created.
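One building block of such a workflow, pinning a dependency as a Git submodule so every clone builds against the exact same commit, can be sketched as follows. The repository and file names here are hypothetical stand-ins; in Wenzel's stack the submodule would be a library such as pybind11:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# "Upstream" dependency repository (a stand-in for a library like pybind11)
git init -q dep
(cd dep \
  && git config user.email demo@example.com && git config user.name Demo \
  && echo "// header" > lib.h \
  && git add lib.h && git commit -qm "dep v1")

# Research project that pins the dependency at an exact commit
git init -q project && cd project
git config user.email demo@example.com && git config user.name Demo
git -c protocol.file.allow=always submodule add -q "$tmp/dep" ext/dep
git commit -qm "Pin dep as a submodule"
git submodule status   # prints the exact commit the project builds against
```

Collaborators who clone the project later run `git submodule update --init` and get the recorded dependency version, which keeps builds reproducible across machines.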
Kate is a Scientist at Argonne National Laboratory and a Senior Fellow at the Computation Institute at the University of Chicago. She is one of the pioneers of infrastructure cloud computing. She created the Nimbus project, recognized as the first open source Infrastructure-as-a-Service implementation, and continues to work on research aligning cloud computing concepts with the needs of scientific datacenters and applications.
During the summer school, Kate will share her view on challenges arising for researchers practicing data-intensive science, as well as her experience and motivation in creating and maintaining large-scale open-source projects supporting such endeavors.
Anna is a Research Software Engineer at the University of Sheffield, helping researchers do more with their code and data. With a background in computational ecology, Anna is interested in open source methodological innovation in science, and in building community and capacity around modern open science tools and practices. She is active in vibrant open science communities: she was a member of the inaugural cohort of the Mozilla Open Leaders network and a mentor on its subsequent training rounds, contributes to rOpenSci, and co-organizes the Sheffield R users group.
During the summer school, Anna will teach principles and best practices for conducting reproducible science, focusing on data and workflow management.
Sourabh is a master's student at EPFL studying Computer Science and specializing in Information Systems. He is also the founder of LauzHack, EPFL's student hackathon. Prior to coming to Switzerland he studied at Jacobs University in Germany and Carnegie Mellon University in the USA. Alongside his studies, Sourabh has gained experience working at several companies, including Logitech, CERN and Fraunhofer. Sourabh is currently a GitHub Campus Expert, a role that enables him to help empower student developers at EPFL.
During his workshop, Sourabh will introduce some of Git/GitHub's slightly more advanced features that make it easier to collaborate and contribute towards open source development.
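As a taste of the kind of feature such a workshop typically covers, the hypothetical session below uses `git commit --fixup` together with `git rebase --autosquash` to fold a late correction into the commit it belongs to, keeping the published history clean (the repository and file names are made up for the demo):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name Demo

echo "v1" > notes.txt
git add notes.txt && git commit -qm "Add notes"

# A later correction, marked as a fixup of the "Add notes" commit
echo "v2" > notes.txt
git add notes.txt && git commit -q --fixup HEAD

# Autosquash reorders and squashes the fixup without opening an editor
GIT_SEQUENCE_EDITOR=: git rebase -qi --autosquash --root
git log --oneline   # history now shows a single "Add notes" commit
```

The same mechanism scales to real collaboration: reviewers see one coherent commit per logical change instead of a trail of "fix typo" commits.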
Filip is a software engineer at Google Zürich working on video analysis at YouTube. His most recent work is oriented around developing machine learning models and integrating them into complex systems. Before that he did two internships at Facebook, where he worked on spam detection and infrastructure, and a research internship at EPFL.
Filip will describe the experience and difficulties of applying research from an engineering viewpoint. Additionally, he will talk about the day-to-day software development process, covering practices in coding style, code review and testing.
Heidi is a statistician at LMU Munich, where she works on statistical methods for personalized medicine. She is a core member of OpenML, assistant editor at the Journal of Statistical Software (responsible for reproducibility checks) and a member of the LMU Open Science Center. Heidi is passionate about open science, open source and reproducible research.
Heidi's talk will be about reproducibility in Statistics and Machine Learning. She will cover tools and workflows to tackle the challenges in this field, drawing on examples from her own research.
Ana and Malin are responsible for data management planning and teaching at the Digital Curation Office, ETH Library, ETH Zurich. Before joining the library, they both worked in research and received their PhDs in Climate Sciences, Ana from ETH Zurich and Malin from the University of Bern. During their research careers, they were confronted with the daily challenges of data management in modelling (Ana) and experimental (Malin) research. This experience strongly motivated them to engage in training researchers at every career stage on data management and data management planning.
Ana and Malin will interactively introduce what data management essentially is and why it concerns all of us. Because the research data lifecycle is likely to outlive a researcher's project at a particular institution, a data management plan is a strict necessity: it describes measures for active research data management, data sharing, and the long-term preservation of research data for potential re-use in the future.
Computational Science & Engineering Lab
Audiovisual Communications Lab
Computational Science & Engineering Lab
Digital Epidemiology Lab