The ability of others to understand and repeat a study (i.e. its replicability) is one of the core pillars of the scientific method. Historically, the ability of others to understand what was done (and therefore to do the same research again) was typically provided by the Methods section of a publication. With this description in hand, it was thought, others would be able to spin up their own study and see how closely their results matched those reported in the original.
Although in practice this doesn't always happen, it is exactly how the most important scientific research gets used. Discovered in 2012, the gene-editing technology CRISPR-Cas9 is already one of the most important discoveries in the history of Biology (it led to the 2020 Nobel Prize in Chemistry for its authors, Jennifer Doudna and Emmanuelle Charpentier), making it possible to edit specific genetic code to add or remove information. The rapid adoption and use of CRISPR-Cas9 to edit genes owes much to its being presented in a reproducible way, with a detailed supplemental section in the main paper that provided all the information needed to try it at home.
In contemporary biology, reproducible research has come to mean that all necessary information to repeat a study is made available, including a description of the methods used, the data collected, and the computer code used for analysis and production of published graphics.
In this Lab we'll discuss contemporary ways to store and manage code and data, so that others (including your future self) can figure out what was done.
A project is a collection of files (i.e. code, data, notes, figures, etc.) required to accomplish a goal, which could be a course's report, a poster, a scientific manuscript, etc.
The scientific process is naturally incremental, and many projects start as random notes, some code, and some data. As you incrementally clean, explore and refine your data and code, you keep saving your files with slightly different names to keep track of your progress, and eventually all these files have a tendency to end up all mixed together in a "semi-chaotic" state, like the example below:
This "semi-chaotic" method of project organization is terrible for "reproducible research"! It is really hard to figure out which version of your data was processed by which version of your code to produce which version of your figures, tables and results. It is hard to work with the contents of a "semi-chaotic" project and, as it grows, it becomes harder and harder to remember which versions are the "good ones", causing progressively more and more headaches. You should AVOID doing this!
A good project layout should:
Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:
Treat data as read only: This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with it interactively (e.g., in Excel), where it can be modified, means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.
Data Cleaning: In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing these scripts in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets.
Treat generated output as disposable: Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts.
There are lots of different ways to manage this output. Having an output folder with different sub-directories for each separate analysis makes things easier later, since many analyses are exploratory and don’t end up being used in the final project, and some analyses get shared between projects.
Separate function definition and application: One of the more effective ways to work with R is to start by writing the code you want to run directly in a .R script, and then running the selected lines (either using the keyboard shortcuts in RStudio or clicking the “Run” button) in the interactive R console.
When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to keep these in two separate folders: one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.
Version Control your code: A version control system is a program that "tracks changes" to a set of files over time, so that you can go back in time to any previous "version" of your work. We will learn the details of version control later in this lab (see section below).
Below is an example of a better filing scheme. Note that in this particular case...
the data is common to many projects, thus it is kept outside of the projects directory.
the user made several .R files containing functions that are used across many projects, thus they are also kept outside of the projects directory (i.e. inside the my_functions directory).
Each individual project and the my_functions directory are "version controlled" repositories and therefore, each has its own README.md and .gitignore files (more on this below).
my_functions/
│   README.md
│   .gitignore
│   my_stats.R
│   my_geospatial.R
│   my_fitting.R
│
data/
├───raw_data/
│   │   datafile1.csv
│   │   datafile2.csv
│   │   ...
└───clean_data/
    │   datafile1.csv
    │   datafile2.csv
    │   ...

projects/
│
├───project1/
│   │   README.md
│   │   .gitignore
│   ├───docs/
│   │       notebook.md
│   ├───results/
│   │       summarized_results.csv
│   │       plot1.png
│   │       plot2.png
│   └───analyses/
│           sightings_analysis.R
│           plots.R
│
└───project2/
    │   README.md
    │   .gitignore
    │   ...
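As a rough sketch, the skeleton of such a layout can be created from the terminal in a few commands. The temporary root folder and the empty placeholder files are assumptions made for illustration only; in practice you would run the mkdir lines wherever you keep your work.

```shell
# Hypothetical root folder for this sketch; any location works.
root="$(mktemp -d)"
cd "$root"

# Shared functions and data live outside the individual projects.
mkdir -p my_functions data/raw_data data/clean_data

# Each project gets its own docs, results and analyses folders.
mkdir -p projects/project1/docs projects/project1/results projects/project1/analyses

# Empty placeholder files for the version-controlled directories.
touch my_functions/README.md my_functions/.gitignore
touch projects/project1/README.md projects/project1/.gitignore

# A placeholder raw data file, made read-only so it cannot be edited by accident.
touch data/raw_data/datafile1.csv
chmod a-w data/raw_data/datafile1.csv

# List everything that was created.
find . | sort
```

Removing write permission is one simple way to act on the "treat data as read only" principle; it does not replace keeping a backup of the raw data.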
Other resources about Scientific Project Management:
It may seem like a trivial thing, but how you name your files, and the directories you put them in, is a big deal. Programming languages need to be able to read file names easily. More importantly, YOU need to be able to read file names easily. If you do not name your files in a descriptive manner, a few months later you may forget the name of one file that you urgently need (e.g. the one that makes the graph for your final assignment)... thus wasting many hours opening every file until you find it. Effective naming is important!
Bad names
A few things to avoid:
Good names
A few things to include:
Effective naming is important for the readability of file names and directory names (as explained above), and also for column headings, variable names, and really any "object" that you or your computer may need to read.
Effective naming should follow the three principles outlined by Jenny Bryan.
Names should be:
Computers search and process text using tools such as regular expressions, a standardized syntax for describing text patterns. Without them, most kinds of search on your computer or on the internet would be far less powerful. In the context of naming, names that are easy to match with regular expressions avoid spaces, punctuation, accents, and reliance on upper versus lower case.
Great file names look like they're oversharing:
Awful file names are coy:
From the first list, we know exactly how and when the data were collected, and we can use built-in R functions to do some computing for us. The second is close to useless.
By naming things well, your ability to find a specific file later goes way, way up.
For example, if we want to search through a load of files from years ago, there is a big difference in encountering:
versus
Naming things well means it is easy to figure out what something is based on its name.
Humans separate words with spaces. However, machines do not like that. The naming conventions below (from Bååth) can help you name items (i.e. files, directories, variables, etc.) in a way that avoids using spaces:
underscore_separated - All letters are lower case and multiple words are separated by an underscore as in seq_along or package_version.
lowerCamelCase - Single word names consist of lower case letters and in names consisting of more than one word all, except the first word, are capitalized as in colMeans or suppressPackageStartupMessages.
UpperCamelCase - All words are capitalized both when the name consists of a single word, as in Vectorize, or multiple words, as in NextMethod.
Take your pick, but keep it informative, and be consistent.
Default ordering is the order in which the files in a directory are listed when you look at them. In most file browsers, names starting with an underscore come first, then names starting with a number, then the rest alphabetically. So, if you want a specific file to always be at the top of your directory, you can start its name with an underscore.
Bryan suggests a few key points:
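Points 1 & 2 help keep things in logical order, either by date or the order you want them in. Point 3 just means that single-digit numbers will not fall in the correct order next to larger numbers unless they have a leading zero. For example, with hypothetical files named file1.csv, file2.csv and file10.csv, a plain alphabetical sort puts file10.csv before file2.csv, which isn't the behaviour we expected. Far more logical is file01.csv, file02.csv, file10.csv.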
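The effect of left-padding can be checked directly in a terminal. This is a small sketch with made-up file names; LC_ALL=C asks sort for a plain character-by-character ordering so that the result does not depend on your computer's language settings.

```shell
# Hypothetical file names without leading zeros, sorted character by character.
unpadded="$(printf '%s\n' file1.csv file2.csv file10.csv | LC_ALL=C sort)"
echo "$unpadded"
# file1.csv
# file10.csv
# file2.csv

# The same numbers, left-padded to two digits with printf's %02d format.
padded="$(for i in 1 2 10; do printf 'file%02d.csv\n' "$i"; done | LC_ALL=C sort)"
echo "$padded"
# file01.csv
# file02.csv
# file10.csv
```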
Version control systems are programs that record changes to a set of files over time so that you can recall specific versions later. Version control is like an unlimited ‘undo’: if you screw things up or lose files, you can easily recover them.
Version control systems start with a base version of the document and then record changes you make each step of the way. You can think of it as a recording of your progress: you can rewind to start at the base document and play back each change you made, eventually arriving at your most recent version.
Once you think of changes as separate from the document itself, you can then think about “playing back” different sets of changes on the base document, ultimately resulting in different versions of that document.
Version control systems are particularly useful during collaborations. For example, two users can make independent sets of changes on the same document.
Then, the changes can be incorporated into the same base document.
There are many version control programs. You can see a full list here. However, BY FAR, the most popular one is Git, which will be the focus of the next section.
Git is a free and open source distributed version control system. It was designed by Linus Torvalds, the inventor of Linux. Git is currently the most popular version control system.
GitHub is a company that provides FREE hosting services for public Git repositories, thus drastically boosting collaboration and development of open-source projects. Under the hood, GitHub servers run Git.
You need BOTH Git and GitHub. You need Git installed on your computer to create local repositories to keep your code version controlled, and GitHub is the website where you upload your repositories so that they are remotely backed up, and accessible for collaboration with others. You need both, Git + GitHub, to:
Enforce reproducible science: the easy shareability of code on GitHub means that our scientific analyses can be readily reproduced by others (including your future self), from data manipulation all the way to final figures. This is a BIG DEAL and a crucial part of contemporary science.
Promote collaborative science: Nowadays, GitHub is one of the main ways to collaborate on scientific code.
Hunt for jobs: A GitHub account is increasingly important when applying for jobs (along with a CV). Employers increasingly want to see your public repositories in GitHub as a way to evaluate your coding work.
You will need a GitHub account for this lab. If you don't have one already, sign up for a free GitHub account here. While creating your account, if you like, you can add a personal touch by choosing a profile picture, and add other information, like bio, location, website, etc.
You may also want to sign up for the free student pack, which gives you access to extra stuff like secret repos.
From the GitHub website you can manage your repositories and track progress on projects. You can also follow other coders (as on twitter) and see what they're working on - many people keep their work in public repositories.
When you are done, your GitHub account should look like the one below...
On your local computer (e.g. your laptop at home or your desktop at school or work), you can create and manage git repositories using:
Commands on a terminal (also known as command line or shell). This method is essential when working on remote clusters, supercomputers, Amazon-cloud, servers, or any other remote computer that you connect to, from your personal computer, via the internet. Working on remote computers is very common when working with "big data" (e.g. genome sequencing, numerical modelling, etc.).
A git GUI Client, which is user-friendly software to visualize and interact with repositories. These are great on your personal computer; however, they are difficult (or impossible) to install on "remote" computers. There are many git GUI Clients (see list of other recommended clients). In this Lab we will use GitKraken. Feel free to try other GUI Clients.
In this lab we will simultaneously learn how to use git in the terminal, as well as with GitKraken.
GitKraken should already be installed on the Lab computers. However, if you want to follow along on your personal laptop, download GitKraken here. Open the set up wizard, which should eventually take you to a sign in page. The easiest thing to do is sign in with your GitHub account so that they connect automatically.
During installation, you will need to configure GitKraken by providing your Name, email (choose the same email as your GitHub account), and even an Avatar.
GitKraken comes with git included in the installation package. Therefore, you do not need to install git separately if you are only going to use GitKraken. In this lab, we ask you to have both installed because we want you to learn both ways (i.e. GUI and command line) to interact with repositories.
Git should already be installed on the Lab computers. However, if you want to follow along on your personal laptop, download Git from here: https://git-scm.com/download/win and follow the default instructions in the wizard for set up. Atlassian and happygitwithr are also great resources for download and set up instructions on either Mac or Windows.
To work with base Git, you need to type commands in a terminal (i.e. the "black screen" window). See below on how to find the terminal in your computer.
Where is my Terminal??!!
In Mac and Linux, you can simply interact with Git using your computer's terminal. To open your terminal simply search for "Terminal" in your Mac's search bar (i.e. Spotlight).
Windows does not come with a Unix-style terminal; however, Git comes with a dedicated terminal called Git Bash, which should now be installed on your machine. You should be able to choose Git Bash from the list of programmes in your start menu (or search for "Git Bash" in the task bar).
To start off, you need to configure git by telling it the name and email address to record with your work (use the same email as your GitHub account), by using the config command with the user.name and user.email settings.
You can check which identity Git is using at any time with git config --list
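A minimal sketch of this configuration step is shown below. The name and email are placeholders, not real values; replace them with your own details.

```shell
# "Your Name" and "you@example.com" are placeholders; use your own details.
# --global stores the setting for every repository on this computer.
git config --global user.name "Your Name"
git config --global user.email "you@example.com"

# Print the stored values back.
git config --global user.name
git config --global user.email

# Or list every setting Git currently sees.
git config --list
```

These settings only label your commits with a name and email; signing in to GitHub itself happens later, when you first push or pull.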
Now that you have your GitHub account and GitKraken/Git installed in your local computer, it is time to get to work. Let's do our first repository!
NOTE: Repo = repository
It is way easier to create your first repo in GitHub and then clone it to your local computer.
In your browser, go to GitHub, go into the repositories area of your GitHub account, and press the button to create a new repository
Type mynewrepo as the name for your first repo
If you are using sensitive data or for some reason prefer others not to see your code, you can choose to make your repo private. For now, keep the repository public so that we can see it.
Adding a README file is useful because you can describe the contents of the repo, aims of your project, and anything else you may want to remember later. The README file will always be visible on your repo's GitHub page.
Every repository has a dedicated URL which allows you to access its contents remotely. To see your new repo's URL, go to the repository's page on GitHub and press
Now that you have your new remote repository in GitHub (i.e. mynewrepo), you can "make a copy" on your local computer. The term in Git to do that is "clone". Note that you only need to clone a repository once. Afterwards you can keep your local and remote repositories synchronized using the "pull" and "push" commands, as explained in sections below.
Follow either of the instructions below (i.e. GitKraken or Terminal), to clone your remote mynewrepo repository into your local computer.
Using GitKraken:
Go to File > Clone Repo. A "Repository Management" window will appear.
Select GitHub.com
Click the Browse button and select the location where you want to create your repository; in this case, let's choose: Desktop
Open the Repository to clone pull down menu and select the mynewrepo that you created in the step above
Clone the repo!
Using the Terminal:
In the terminal, navigate to your Desktop
In GitHub, click the code button, and copy the HTTPS URL of your repo
Type the git clone command followed by the URL of your repo, to clone your mynewrepo repository. Make sure you use double quotation marks " " around the URL; single quotes ' ' are not treated as quotes by every shell. The executed command should look like below but with YOUR username:
git clone "https://github.com/YOUR-USERNAME/mynewrepo.git"
Regardless of whether you used GitKraken or the terminal, you should now have a directory on the desktop called mynewrepo. Using your file explorer, open it. In it, you should see something like below:
Desktop/
│ └───mynewrepo/
│ │ └───.git
│ │ README.md
Congratulations! Now you have your first repo mynewrepo synced on your local computer and in GitHub.
Note that inside your mynewrepo directory there is a hidden directory called .git (you may need to activate "See hidden directories" to be able to see it). In this .git hidden directory, Git saves all the changes and instructions to recreate any version of your work.
Also, if you made a README.md file during the creation of your repo in GitHub, you should see that README.md file in your local mynewrepo repo.
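If you want to rehearse cloning without touching GitHub, the sketch below uses a bare repository in a temporary folder as a stand-in for the remote. That stand-in is an assumption made so the example runs offline; with a real repository you would give git clone the HTTPS URL instead of a local path.

```shell
# Stand-in for GitHub: an empty "bare" repository in a temporary folder.
work="$(mktemp -d)"
git init --quiet --bare "$work/remote.git"

# Clone it. The last argument is the name of the local directory to create.
# Git warns that the repository is empty, which is expected here.
git clone --quiet "$work/remote.git" "$work/mynewrepo"

# The hidden .git directory is what makes mynewrepo a repository.
ls -a "$work/mynewrepo"

# "origin" is the name Git gives to the place you cloned from.
git -C "$work/mynewrepo" remote -v
```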
Now that you have set up your first repository on both your local machine and GitHub, it is time to start moving information between the two (local and remote) repositories. However, first we must talk a bit about the anatomy and typical workflow in Git repositories. There are several "repository parts" and there are very specific actions or "commands" to move information from one part to another. These "repository parts" and "commands" are the same whether you are working with GitKraken or with plain Git, and are shown in the diagram below:
Inside your mynewrepo directory, you should see all the files in your "Working Directory". For now, you should only have README.md. The contents of "Staging Area" and "Local Repository" are not obvious; you can only see their contents using a GUI (e.g. GitKraken) or by using git commands on the terminal to query their contents. This is because the instructions on how to recreate the contents of the "Staging Area" and "Local Repository" are hidden inside the .git hidden directory.
Note that most people would refer to "version controlled directories" as if they were the same as "Local Repositories" (e.g. mynewrepo directory = mynewrepo repository). Technically this is not correct, because many times a few files from the "Working Directory" are on purpose not tracked and thus excluded from the "Local Repository" area (see example below in the ".gitignore file" section). However, it is so convenient to call "version controlled directories" simply repos, that people just do it with the understanding that it is not perfectly accurate.
The "commands" or actions required to move information across the different repo parts, are:
In reality, you will use some of these commands many times during the day, and some other commands once or twice per day. In a typical day, you may...
Below is a diagram representation of the typical workday explained above.
Once your repository exists on your computer, you can add new files or modify existing ones as you would normally. Once you are finished with a given task, remember to save your changes back to GitHub by using the git commands add, commit and push. However, as we mentioned above, the contents of the "Staging Area" and "Local Repository" are not obvious; you can only see their contents using a GUI (e.g. GitKraken) or by using the git command status on the terminal to query their contents. Below we explain how to check status with both methods (i.e. GitKraken and Terminal). Take a look first, then we'll practice with your mynewrepo repository.
In GitKraken, simply take a look at the "Unstaged Files" panel.
In the Terminal, type git status
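Here is a small sketch of what a status check reports when a new file appears. The throwaway repository and the file name notes.txt are made up for the example; in the lab you would run git status inside mynewrepo.

```shell
# Throwaway repository for this sketch (hypothetical path and file name).
repo="$(mktemp -d)/status_demo"
git init --quiet "$repo"
cd "$repo"

# A brand-new file is "untracked": Git sees it but is not recording it yet.
echo "some notes" > notes.txt

# Full, human-readable report.
git status

# Compact report: "??" marks untracked files.
git status --short
# ?? notes.txt
```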
To see how this works, first we need to create a new file in our mynewrepo repository...
Now, let's check "status" using both methods:
Check the status of your mynewrepo repository using GitKraken (see instructions above)
Check the status of your mynewrepo repository using the Terminal (see instructions above)
The next step is to add files to the staging area, which is a list that tells git which file versions you intend to commit to your local repository.
Let's take a look at how to do this, then we'll practice...
Note that staged files now show in the "Staged Files" panel
You can stage each file individually...
...or you can stage all the unstaged files all at once.
Files which have been added are now staged and will appear in green if you run a status check.
If for some reason you decide you don't want to commit the thing you have just staged, you can unstage files using the reset command
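The staging commands described above can be tried out in a scratch repository. This is only a sketch: the repository, the file names and the placeholder identity are assumptions for illustration, and git add . is just one of several ways to stage everything at once.

```shell
# Scratch repository with one initial commit (hypothetical names throughout).
repo="$(mktemp -d)/staging_demo"
git init --quiet "$repo"
cd "$repo"
git config user.name "Your Name"          # placeholder identity, this repo only
git config user.email "you@example.com"
echo "# demo" > README.md
git add README.md
git commit --quiet -m "Add README"

# Two new, untracked files.
echo "a" > file_a.txt
echo "b" > file_b.txt

# Stage one file by name...
git add file_a.txt
# ...or stage everything that changed in the current directory.
git add .

# Take file_b.txt back out of the staging area; the file itself is untouched.
git reset --quiet file_b.txt

# "A" marks a staged new file, "??" an untracked one.
git status --short
```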
Stage your new file in your mynewrepo repository using one of the two methods above (i.e. GitKraken or Terminal)
The .gitignore file is a text file that tells Git which files or folders to ignore in a project.
For example, consider having code that produces graphs. You want to version control your code, but you do not want to version control your graphs, since they can be easily made again from the code. In this case you can configure your code so that your output graphs are saved in a new directory called /graphs, then you can write /graphs in the .gitignore file, so that Git automatically ignores those graphs, keeping them out of your version controlled repository.
Below is a sample .gitignore file with a few items typically ignored in projects written in R. Note that a leading / anchors a pattern to the directory that contains the .gitignore file. Also, * is a wildcard, used here to exclude all files with a given extension.
# History files
.Rhistory
.Rapp.history
# Output files from R CMD build
/*.tar.gz
# Temporary files created by R markdown
*.utf8.md
*.knit.md
When you create a new repository in GitHub or GitKraken, there is an option to include an automatically generated .gitignore file, where you select a programming language (e.g. R) and GitHub or GitKraken will return a .gitignore file pre-populated with the most common extensions and paths to ignore for that specific language.
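To see a .gitignore rule in action, the sketch below recreates the graphs example in a scratch repository. The file names are made up, and git check-ignore is used only as a convenient way to ask Git which rule applies to a file.

```shell
# Scratch repository (hypothetical path and file names).
repo="$(mktemp -d)/ignore_demo"
git init --quiet "$repo"
cd "$repo"

# Ignore the graphs directory at the top level of the repo.
echo "/graphs" > .gitignore

# A script we want to track and a figure we can always regenerate.
mkdir graphs
echo "plot(1:10)" > plots.R
echo "placeholder, not a real image" > graphs/plot1.png

# Only .gitignore and plots.R are reported; everything under graphs is skipped.
git status --short

# Ask Git which rule is responsible for ignoring a given file.
git check-ignore -v graphs/plot1.png
```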
The commit step is really important for version control because (1) this is the step where you actually "save" or "freeze" the latest version that you were working on, and (2) this is where you write notes to yourself or your collaborators about the latest changes that you were working on. You MUST include a message in order to make a successful commit. Again, remember to use double quotation marks.
First take a look at the instructions on how to commit with GitKraken and the Terminal, then we'll practice.
Commit a single file:
... or Commit all staged files at once:
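Both kinds of commit can be sketched in a scratch repository as follows. The repository, file names, commit messages and identity are placeholders chosen for illustration.

```shell
# Scratch repository with one initial commit (hypothetical names throughout).
repo="$(mktemp -d)/commit_demo"
git init --quiet "$repo"
cd "$repo"
git config user.name "Your Name"          # placeholder identity, this repo only
git config user.email "you@example.com"
echo "# demo" > README.md
git add README.md
git commit --quiet -m "Add README"

# Stage two new files.
echo "first" > file_a.txt
echo "second" > file_b.txt
git add file_a.txt file_b.txt

# Commit a single file: naming it after the message commits only that file.
git commit --quiet -m "Add file_a" file_a.txt

# Commit everything that is still staged (here, file_b.txt).
git commit --quiet -m "Add file_b"

# One line per commit, newest first.
git log --oneline
```

Short messages that say what changed and why make the log far more useful when you come back to it later.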
Commit your staged changes in your mynewrepo repository using one of the two methods above (i.e. GitKraken or Terminal)
You must "PUSH" to send the latest changes from your local repository to your remote repository (i.e. GitHub). Pushing has the potential to overwrite changes, so caution should be taken when pushing.
First take a look at the instructions below, then we'll practice...
You may be asked for your GitHub password.
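In the Terminal, type git push
You may be asked for your GitHub credentials (GitHub now expects a personal access token rather than your account password when you connect over HTTPS).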
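The sketch below rehearses a push end to end. So that it runs without a network connection or a GitHub login, a bare repository in a temporary folder plays the role of GitHub; that stand-in, the file contents and the identity are assumptions for illustration.

```shell
# Stand-in remote plus a local clone of it (hypothetical paths).
work="$(mktemp -d)"
git init --quiet --bare "$work/remote.git"
git clone --quiet "$work/remote.git" "$work/mynewrepo"
cd "$work/mynewrepo"
git config user.name "Your Name"          # placeholder identity, this repo only
git config user.email "you@example.com"

# Make a first commit on a branch called main.
echo "# mynewrepo" > README.md
git add README.md
git commit --quiet -m "Add README"
git branch -M main

# First push of a branch: name the remote (origin) and the branch, and use -u
# so that later pushes and pulls from this branch need no arguments.
git push --quiet -u origin main

# From now on a plain push is enough.
echo "more text" >> README.md
git commit --quiet -am "Extend README"
git push --quiet

# The remote now holds the same latest commit as the local repository.
git log --oneline origin/main
```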
Push the latest changes from your local mynewrepo repository, to your remote repo in GitHub, using one of the two methods above (i.e. GitKraken or Terminal)
Pull is the opposite of push in git. You use pull to download the current version of a repo from GitHub onto your local machine (you don't need to use clone again after the first time you copy your repo onto your computer). It's good practice to run pull at the beginning of every work session if you are collaborating with others on a piece of code, as they may have changed something since the last time you viewed the file and you want to make sure that you are working on the most up to date version.
You may be asked for your GitHub password.
In the Terminal, type git pull
You may be asked for your GitHub credentials (a personal access token rather than your account password).
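To see a pull actually bring something down, you need a second copy of the repository that pushes a change first. The sketch below simulates that with two local clones of a stand-in remote; the folder names "collaborator" and "mine" and all file contents are made up for the example.

```shell
# Stand-in remote shared by two clones (hypothetical paths).
work="$(mktemp -d)"
git init --quiet --bare "$work/remote.git"

# Collaborator's copy: make a commit and push it to the shared remote.
git clone --quiet "$work/remote.git" "$work/collaborator"
cd "$work/collaborator"
git config user.name "Collaborator"       # placeholder identity
git config user.email "collab@example.com"
echo "# mynewrepo" > README.md
git add README.md
git commit --quiet -m "Add README"
git branch -M main
git push --quiet -u origin main

# Your copy, cloned at this point in time.
git clone --quiet --branch main "$work/remote.git" "$work/mine"

# Meanwhile the collaborator keeps working and pushes again.
echo "a new line" >> README.md
git commit --quiet -am "Extend README"
git push --quiet

# Start of your work session: bring your copy up to date.
cd "$work/mine"
git pull --quiet
cat README.md
```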
Pull the latest changes from GitHub into your local mynewrepo repository, using one of the two methods above (i.e. GitKraken or Terminal).
Note that you will probably get a message saying that there is nothing to "pull", since both repositories (i.e. local and remote) were synchronized in the previous step.
One of the most important principles in Git is the notion of branching, which allows developers (including you) to work on different aspects of a coding project without affecting other people's work. A branch is a copy of the files in a repository that can be edited and tested independently from the main body of a project, and then later merged back into the main version. Branches are essential for collaborative work!
Branches can also be very useful even if you are the only one working on your repository (i.e. no collaboration). Sometimes you may want to work on several features or sub-sections of your project, where each feature can have its own branch, thus ensuring that no sub-project impacts the work on the others.
Branching can be initiated either from GitHub's website, or from your computer (GitKraken or Git Bash).
Here are the instructions on how to start a new branch from GitHub...
Go to your mynewrepo webpage in GitHub
Type mybranch as the name for your new branch
Voilà! You just made your first branch!
Note that your "main" button now says "mybranch", because you are now inside your new branch.
You can view and edit branches from your local machine using GitKraken or the branch and checkout commands on Git Bash.
Check out the mybranch branch
You are done! Note that the center panel says mybranch rather than main
To list existing branch names, type git branch
To switch branches, type git checkout followed by the name of the branch
Note: if you did not pull down the latest version of your repo from GitHub after making the new branch, Git will not know about it yet. Run git pull to sync up your local repo with the GitHub version; the new branch will then show up in git branch -a and can be checked out by name.
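Listing and switching branches can be sketched in a scratch repository as below. In this sketch the branch is created locally with git branch rather than on GitHub, and the repository and identity are placeholders.

```shell
# Scratch repository with one commit on main (hypothetical names throughout).
repo="$(mktemp -d)/branch_demo"
git init --quiet "$repo"
cd "$repo"
git config user.name "Your Name"          # placeholder identity, this repo only
git config user.email "you@example.com"
echo "# demo" > README.md
git add README.md
git commit --quiet -m "Add README"
git branch -M main

# Create a branch called mybranch (locally here, rather than on GitHub).
git branch mybranch

# List local branches; the current one is marked with *.
# Add -a to also list branches that exist on a remote.
git branch

# Switch to mybranch, then confirm which branch is checked out.
git checkout --quiet mybranch
git rev-parse --abbrev-ref HEAD
# mybranch
```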
Switch to the mybranch branch. Once you switch, any changes you commit & push to your local repo and your GitHub repo will be saved under the mybranch branch, not the main branch.
You can also make branches directly in your local computer and then push them out to your remote repository.
You can also use checkout to create new branches directly from the command line. In this case however, Git doesn't automatically know which branch on GitHub your new branch corresponds to. You have to explicitly connect the new branch with an 'upstream' branch, usually a branch of the same name on the remote, the first time you push it.
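A sketch of creating a branch locally and connecting it to an upstream branch is given below. A bare repository in a temporary folder stands in for GitHub so the example runs offline; that stand-in, the file contents and the identity are assumptions for illustration.

```shell
# Stand-in remote plus a local clone with one pushed commit (hypothetical paths).
work="$(mktemp -d)"
git init --quiet --bare "$work/remote.git"
git clone --quiet "$work/remote.git" "$work/mynewrepo"
cd "$work/mynewrepo"
git config user.name "Your Name"          # placeholder identity, this repo only
git config user.email "you@example.com"
echo "# mynewrepo" > README.md
git add README.md
git commit --quiet -m "Add README"
git branch -M main
git push --quiet -u origin main

# Create my_second_branch and switch to it in one step.
git checkout --quiet -b my_second_branch

# Do some work on the new branch.
echo "some text" > test_file.txt
git add test_file.txt
git commit --quiet -m "Add test file"

# The first push has to say where the branch goes; --set-upstream records the link.
git push --quiet --set-upstream origin my_second_branch

# Show each local branch and, in square brackets, the remote branch it tracks.
git branch -vv
```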
Make a new branch in your local mynewrepo Repository. Call the new branch my_second_branch.
Make sure that your new branch (my_second_branch) is checked out... it should, since you just made it. Then, make a new text file in your working directory. Name the new text file test_file.txt and add some text inside the file. You can use RStudio to make the text file (Go to File > New File > Text File)
Once you decide you are happy with the changes you've made in your new branch, whether it's adding a new script or testing out a new functionality, you can merge your branch back into the main branch using merge.
mybranch branch
To merge the current branch with the upstream branch, type:
Sometimes you may want to close side branches after merging with the main branch, much like closing out an issue in project management software.
To close a branch, type:
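One common way to merge a finished branch into main and then close it is sketched below, in a scratch repository with placeholder names. Here "closing" is taken to mean deleting the branch with git branch -d, which only succeeds once the branch has been merged.

```shell
# Scratch repository with main and a side branch (hypothetical names throughout).
repo="$(mktemp -d)/merge_demo"
git init --quiet "$repo"
cd "$repo"
git config user.name "Your Name"          # placeholder identity, this repo only
git config user.email "you@example.com"
echo "# demo" > README.md
git add README.md
git commit --quiet -m "Add README"
git branch -M main

# Work done on the side branch.
git checkout --quiet -b my_second_branch
echo "some text" > test_file.txt
git add test_file.txt
git commit --quiet -m "Add test file"

# Merging happens INTO the branch you are on, so switch to main first.
git checkout --quiet main
git merge --quiet my_second_branch

# The branch's work is now part of main, so the branch can be closed.
git branch -d my_second_branch

# Only main is left, and it contains test_file.txt.
git branch
ls
```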
If this explanation isn't quite enough, you can also try GitHub's amazing interactive tutorial
Merge my_second_branch to the main branch of your mynewrepo repository.
So far, we have only described (in some depth) how to keep track of changes and versions of your own files, which is a very useful skill. However, the main strength of Git/GitHub is the capability of doing collaborations. You can have multiple people, in different continents, working on the same code. This can be complicated and is beyond the scope of this lab. However, here we will give you a small introduction to collaborations in Git/GitHub and we'll point you to a 5 min YouTube video that shows you how this is done in practice.
The core of collaborations lies in something called "Pull Requests"!
If you own a repository and want to upload some changes, you would simply do a push. However, if you do not own a repository, you cannot push to it (thank goodness! Can you imagine if anybody could just push (i.e. upload) stuff to your repositories?). Luckily, there is an alternative, a GitHub feature called Pull Request, where you can politely ask the owner of a repository to take a look at your suggested changes and, if all looks good, then the owner can "pull your requested" changes and merge them with their main repository.
In many cases, between the Pull Request and the merge there is a lot of discussion done in GitHub's built-in commenting tool. The whole process looks a bit like the diagram below:
Now, to be able to open Pull Requests, you need to either:
A Fork is similar to a Branch. However, Forks are independent copies of the original repository. If the original repository is deleted, all its Branches would be deleted too, but any Fork would remain in existence.
Take a look at this 5 minute YouTube video from Jake Vanderplas, where he demonstrates a Fork and Pull request in a simple real-life collaboration: https://www.youtube.com/watch?v=rgbCcBNZcdQ
Now it is time to do an exercise requiring "Forking" a public repository and creating a "Pull Request". The objective of the exercise is to make a "phytoplankton plot", and to contribute your plot to a collaborator's repository via a "Pull Request":
If everything goes well, at the end, there will be one plot for every student in the class, inside the repo's submitted_plots directory.
Some parts of this lab were borrowed from: