Managing Open and Reproducible Computational Projects

The Managing Open and Reproducible Computational Projects training material covers best practices for managing and supervising computational projects in biology and related fields, spanning data science methods, analysis, interpretation, and reporting. Through the lessons in this training, researchers will deepen their understanding of rigorous scientific methods and learn to guide their integration when designing reproducible, transparent and collaborative computational projects. The guidance on managing and supervising early career researchers who conduct computational (data-driven or data-informed) research will also help ensure transparency and research integrity throughout project design, methodology, analysis, interpretation and reporting.

This training material is developed under the Data Science for Biomedical Scientists project. It extensively reuses The Turing Way chapters and builds on The Carpentries and Open Life Science practices. The project is hosted by the Tools, practices and systems (TPS) research team, and all materials are shared under the CC-BY 4.0 License. Although the training course is tailored to the biomedical sciences community, the materials are generally transferable and directly relevant to data science projects in other domains. Anyone interested in collaborating on or improving this material is welcome to connect with the development team on GitHub (see the repository).

Funding Acknowledgement: The first iteration of this project was funded by The Alan Turing Institute - AI for Science and Government (ASG) Research Programme from October 2021 to March 2022.

Prerequisites

This resource is designed for experimental biologists, biomedical researchers and adjacent communities, with a focus on two key professional/career groups:

Group leaders or lab managers without prior experience in data science or in managing computational projects

Postdocs and lab scientists (next-generation senior leaders) interested in enabling the integration of computational science into biosciences

In defining the scope of this project for our target audience, we make some assumptions about the learner groups:

Our learners have a good understanding of designing or contributing to a scientific project throughout its lifecycle.

They have a computational project in mind for which funding and research ethics approval have been received.

We also assume that a research team, of any size, is already at least partially established.

This lesson is developed alongside the Introduction to Data Science and AI for senior researchers lesson. Learners are encouraged to work through that lesson to learn about data science and AI/ML practices relevant to life science domains, where the best practices for Managing Open and Reproducible Computational Projects can be applied in practice.

Schedule

	Setup	Download files required for the lesson
00:00	1. Introduction to this course	What is the purpose of this training? Who is the target audience? What will they learn by the end of this training?
00:10	2. Better and faster research!	How does this training relate to your work? What are the benefits of using data science skills? What can go wrong when working with data and code? What are the challenges for teams and management? Are there procedures and protocols that can help?
00:45	3. Setting up a computational project	How to set up a computational project? What main concerns and challenges exist and how to address them? How to create a project repository for sharing and collaboration, with the intention to release?
01:05	4. Research Data Management	What is considered research data? How to start building a research data management plan? What are the FAIR principles for data management? Why care about documentation and metadata standards?
01:25	5. Version Control and Open Science Practices	How is a version control system relevant for biomedical research? How to maintain a history of contributions and contributors? How to apply open science practices to work transparently and collaborate openly?
01:35	6. Method selection	How can I build on existing work? What resources/infrastructure should I consider? Is machine learning suitable for my project?
01:35	7. Data analysis and results	What is the role of data wrangling? What are good practices for keeping code in check? How to use data visualisation for insight and communication?
01:55	8. Implementing tools and methods	How to manage and oversee tasks and track the progress of your projects on GitHub? How do collaborative practices help ensure code quality, testing and reuse? What is literate programming and how does it help with early communication, testing and collaboration?
02:15	9. Code Review	What are the main objectives and best practices for reviewing code? What are the differences between synchronous and asynchronous code reviews? How can group leaders facilitate a collaborative environment for code review?
02:25	10. Publication and release	Why should I make my research objects available? What open source tools to use for applying data science practices in bioscience? How to get your research work cited and invite more contributions to your project?
02:35	Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.