The ability of others to understand and repeat a study (i.e. its replicability) is one of the core pillars of the scientific method. Historically, the information others needed to understand what was done (and therefore to do the same research again) was provided by the Methods section of a publication. With this description in hand, it was thought, others would be able to spin up their own study and see how closely their results matched those reported in the original.

Although in practice this doesn't always happen, it is exactly how the most important scientific research gets used. Discovered in 2012, the gene-editing technology CRISPR-Cas9 is already one of the most important discoveries in the history of biology (it led to the 2020 Nobel Prize in Chemistry for its authors, Jennifer Doudna and Emmanuelle Charpentier), making it possible to edit specific genetic code to add or remove information. The rapid adoption and use of CRISPR-Cas9 to edit genes is largely due to its being presented in a reproducible way, with a detailed supplemental section in the main paper that provided all the information needed to try it at home.

In contemporary biology, reproducible research has come to mean that all necessary information to repeat a study is made available, including a description of the methods used, the data collected, and the computer code used for analysis and production of published graphics.

In this Lab we'll discuss contemporary ways to store and manage code and data, so that others (including your future self) can figure out what was done.

Project Management¶

A project is a collection of files (e.g. code, data, notes, figures, etc.) required to accomplish a goal, which could be a course report, a poster, a scientific manuscript, etc.

The scientific process is naturally incremental, and many projects start as random notes, some code, and some data. As you incrementally clean, explore and refine your data and code, you keep saving your files with slightly different names to keep track of your progress, and eventually all these files have a tendency to end up all mixed together in a "semi-chaotic" state, like the example below:

This "semi-chaotic" method of project organization is terrible for "reproducible research"! It is really hard to figure out which version of your data was processed by which version of your code to produce which version of your figures, tables and results. It is hard to work with the contents of a "semi-chaotic" project and, as it grows, it becomes harder and harder to remember which versions are the "good ones", causing more and more headaches. You should AVOID doing this!

A good project layout should:

ensure the integrity of your data;
make it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
allow you to easily upload your code with your manuscript submission;
make it easier to pick the project back up after a break;
ultimately, make your life easier!

Best practices for project organization¶

Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:

Treat data as read only: This is probably the most important goal of setting up a project. Data are typically time consuming and/or expensive to collect. Working with them interactively (e.g., in Excel), where they can be modified, means you are never sure of where the data came from, or how they have been modified since collection. It is therefore a good idea to treat your data as “read-only”.

Data Cleaning: In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing these scripts in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets.

Treat generated output as disposable: Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts.

There are lots of different ways to manage this output. Having an output folder with a separate sub-directory for each analysis makes things easier later, since many analyses are exploratory and don’t end up being used in the final project, and some analyses get shared between projects.

Separate function definition and application: One of the more effective ways to work with R is to start by writing the code you want to run directly in a .R script, and then running the selected lines (either using the keyboard shortcuts in RStudio or clicking the “Run” button) in the interactive R console.

When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to keep these in two separate folders: one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.

Version Control your code: Version control is a program that "tracks changes" to a set of files over time, so that you can go back in time to any previous "version" of your work. We will learn the details of version control later in this lab (see section below).

Below is an example of a better filing scheme. Note that in this particular case...

the data is common to many projects, thus it is kept outside of the projects directory.
the user made several .R files containing functions that are used across many projects, thus they are also kept outside of the projects directory (i.e. inside the my_functions directory).
Each individual project and the my_functions directory are "version controlled" repositories and therefore, each has its own README.md and .gitignore files (more on this below).

my_functions/
│   │   README.md
|   |   .gitignore
|   |   my_stats.R
|   |   my_geospatial.R
|   |   my_fitting.R
|    
data/
|   └───raw_data/
│   |          datafile1.csv
│   |          datafile2.csv
|   |          ...
|   └───clean_data/
│   |          datafile1.csv
│   |          datafile2.csv
|   |          ...
|
projects/
    |
    └───project1
    │   │   README.md
    |   |   .gitignore
    |   └───docs
    |   |      notebook.md
    |   └───results
    |   |      summarized_results.csv
    |   |      plot1.png
    |   |      plot2.png
    |   └───analyses
    |          sightings_analysis.R
    |          plots.R
    │   
    └───project2
    │   │   README.md
    |   |   .gitignore
    |       ...
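As a rough sketch, a layout like the one above could be created from a terminal as shown below. The scratch location (from `mktemp -d`), the sample CSV contents, and the use of `chmod 444` to approximate "read-only" raw data are illustrative assumptions, not a prescribed setup.

```shell
# Sketch: build the example layout in a scratch directory.
root=$(mktemp -d)

mkdir -p "$root/my_functions"
mkdir -p "$root/data/raw_data" "$root/data/clean_data"
mkdir -p "$root/projects/project1/docs" \
         "$root/projects/project1/results" \
         "$root/projects/project1/analyses"

# Placeholder raw data file (stands in for a real data set).
printf 'id,value\n1,3.2\n' > "$root/data/raw_data/datafile1.csv"

# Remove write permission so the raw data cannot be edited by accident.
chmod 444 "$root/data/raw_data/datafile1.csv"

# Show the permissions of the protected file.
ls -l "$root/data/raw_data/datafile1.csv"
```

On Windows the same idea applies, but the read-only flag is set through the file's properties rather than `chmod`.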


You just got a "raw data file". You open it to take a first look and discover that there are typos in the first 3 rows. What do you do?

Which of the two projects below is better? ...see choices in Brightspace

Naming conventions¶

It may seem like a trivial thing, but how you name your files, and the directories you put them in, is a big deal. Programming languages need to be able to read file names easily. More importantly, YOU need to be able to read file names easily. If you do not name your files in a descriptive manner, a few months later you may forget the name of one file that you urgently need (e.g. the one that makes the graph for your final assignment)... thus wasting many hours opening every file until you find it. Effective naming is important!

Bad names

A few things to avoid:

non-specific names (unless in a clearly defined directory): e.g. abstract.docx, figure1.jpg, etc.
S P A C E S: e.g. Figure 1.jpg, Data for BIO 1000.csv
commas: test_data,trial1.csv
punctuation of any kind: big_result!.csv, awesomess:).jpg
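The character-related items in this list can be checked automatically. The sketch below uses a hypothetical helper, `is_safe_name` (not part of any tool), that accepts only letters, digits, underscores, hyphens and dots. It cannot tell whether a name is descriptive enough; that part is still up to you.

```shell
# Use plain ASCII rules for the character ranges below.
export LC_ALL=C

# Hypothetical helper: succeeds (exit status 0) when a file name contains
# only letters, digits, underscores, hyphens and dots.
is_safe_name() {
  case "$1" in
    "") return 1 ;;                    # empty name
    *[!A-Za-z0-9._-]*) return 1 ;;     # any other character: space, comma, "!", ...
    *) return 0 ;;
  esac
}

for name in "Figure 1.jpg" "test_data,trial1.csv" "big_result!.csv" "unicorn_meristics.csv"; do
  if is_safe_name "$name"; then
    echo "ok:    $name"
  else
    echo "avoid: $name"
  fi
done
```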

Good names

A few things to include:

dates: e.g. trout_draft_05_May_17.doc, unicorn_data_05_05_16.csv
detailed descriptors: e.g. figure_1_trout_draft2.png, unicorn_meristics.csv

Effective naming is important not only for the readability of file names and directory names (as explained above), but also for column headings, variable names, and really any "object" that you or your computer may need to read.

Effective naming should follow the three principles outlined by Jenny Bryan.

Names should be:

Machine readable
Human readable
Plays well with default ordering

Machine readability¶

Machine-readable names are easy for programs to search and split into parts, for example with regular expressions (a standardized syntax for matching patterns in text). In the context of naming, this means avoiding spaces, punctuation and accents, not relying on upper versus lower case to tell names apart, and using separators consistently (in the examples below, "_" separates fields and "-" separates words within a field).

Great file names look like they're oversharing:

2012-07-07_FINPRINT_Aruba-LionCay-T1.csv
2012-07-07_FINPRINT_Aruba-LionCay-T2.csv
2012-07-07_FINPRINT_Aruba-LionCay-T3.csv
2012-07-07_FINPRINT_Aruba-TigerCay-T1.csv
2012-07-07_FINPRINT_Aruba-TigerCay-T2.csv
2012-07-07_FINPRINT_Aruba-TigerCay-T3.csv
2012-07-07_FINPRINT_Aruba-OcelotCay-T1.csv
2012-07-07_FINPRINT_Aruba-OcelotCay-T2.csv
2012-07-07_FINPRINT_Aruba-OcelotCay-T3.csv

Awful file names are coy:

transect1.csv
transect2.csv
Transect2.csv
partialreef.csv
quickone.csv

From the first list, we know exactly how and when the data were collected, and we can use built-in R functions to do some computing for us. The second is close to useless.
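To illustrate what "do some computing for us" can look like, here is a minimal shell sketch that splits one of the well-named files into its parts (the same idea applies to R's string functions). The field names used here (date, project, country, site, transect) are an interpretation of the example names, not something the file names themselves state.

```shell
# Split one of the file names above into its parts.
name="2012-07-07_FINPRINT_Aruba-LionCay-T1.csv"
base="${name%.csv}"                 # drop the extension

# "_" separates the three main fields: date, project, location
IFS=_ read -r date project location <<EOF
$base
EOF

# "-" separates the parts of the location field
IFS=- read -r country site transect <<EOF
$location
EOF

echo "date=$date project=$project country=$country site=$site transect=$transect"
```

Nothing comparable is possible with names such as quickone.csv, because the information was never put into the name.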

Human readability¶

By naming things well, your ability to find a specific file later goes way, way up.

For example, if we want to search through a load of files from years ago, there is a big difference between encountering:

first_attempt.r
goodone.r
meh.r
goodone41.r
goodone42.r
goodone43.r
goodone424.r

versus

B1_basic_analysis.r
B2_added_hierarchy.r
B3_added_hierarchy_and_covariates.r
B4_full_Bayesian.r
B5_full_Frequentist.r

Naming things well means it is easy to figure out what something is based on its name.

Humans separate words with spaces. However, machines do not like that. The naming conventions below (from Bååth) can help you name items (i.e. files, directories, variables, etc.) in a way that avoids using spaces:

underscore_separated - All letters are lower case and multiple words are separated by an underscore as in seq_along or package_version.
lowerCamelCase - Single word names consist of lower case letters and, in names consisting of more than one word, all words except the first are capitalized, as in colMeans or suppressPackageStartupMessages.
UpperCamelCase - All words are capitalized both when the name consists of a single word, as in Vectorize, or multiple words, as in NextMethod.

Take your pick, but keep it informative, and be consistent.

Plays well with default ordering¶

Default ordering is the order in which files appear when you list the contents of a directory. In many file browsers, names that start with an underscore come first, then names that start with a number, then the rest in alphabetical order (the exact rules depend on the program doing the listing). So, if you want a specific file to always be at the top of your directory, you can start its name with an underscore.

Bryan suggests a few key points:

Put something numeric first
Use YYYY-MM-DD for dates
Use leading zeros

Points 1 & 2 help keep things in logical order, either by date or the order you want them in. Point 3 just means that single-digit numbers will not fall in the correct order next to 10 and above unless they have a leading zero. For example:

10_final_figures.R
1_initial_data_wrangling.R
2_model_fitting.R
...

This isn't the behaviour we expected. Far more logical is:

01_initial_data_wrangling.R
02_model_fitting.R
...
10_final_figures.R
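The effect of leading zeros can be reproduced in a terminal. This sketch assumes a plain byte-wise sort (`LC_ALL=C sort`); file browsers may apply their own rules, so treat it as an illustration rather than a description of every program.

```shell
# Sort both naming schemes with a plain byte-wise sort.
unpadded=$(printf '%s\n' 1_initial_data_wrangling.R 2_model_fitting.R 10_final_figures.R | LC_ALL=C sort)
padded=$(printf '%s\n' 10_final_figures.R 01_initial_data_wrangling.R 02_model_fitting.R | LC_ALL=C sort)

echo "Without leading zeros:"
echo "$unpadded"

echo "With leading zeros:"
echo "$padded"
```

Without leading zeros, 10_final_figures.R is listed first; with them, it is listed last, as intended.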

Select all the files from the list below that have BAD file names.

Version Control¶

Version control systems are programs that record changes to a set of files over time so that you can recall specific versions later. Version control is like an unlimited ‘undo’: if you screw things up or lose files, you can easily recover them.

Version control systems start with a base version of the document and then record the changes you make each step of the way. You can think of it as a recording of your progress: you can rewind to start at the base document and play back each change you made, eventually arriving at your most recent version.

Once you think of changes as separate from the document itself, you can then think about “playing back” different sets of changes on the base document, ultimately resulting in different versions of that document.

Version control systems are particularly useful during collaborations. For example, two users can make independent sets of changes on the same document.

Then, the changes can be incorporated into the same base document.

There are many version control programs. You can see a full list here. However, BY FAR, the most popular is Git, which will be the focus of the next section.

Git and GitHub¶

Git is a free and open source distributed version control system. It was designed by Linus Torvalds, the inventor of Linux. Git is currently the most popular version control system.

GitHub is a company that provides FREE hosting services for public Git repositories, thus drastically boosting collaboration and development of open-source projects. Under the hood, GitHub servers run Git.

You need BOTH Git and GitHub. You need Git installed on your computer to create local repositories that keep your code version controlled, and GitHub is the website where you upload your repositories so that they are remotely backed up and accessible for collaboration with others. You need both, Git + GitHub, to:

Enforce reproducible science: the easy shareability of code on GitHub means that our scientific analyses can be readily reproduced by others (including your future self), from data manipulation all the way to final figures. This is a BIG DEAL and a crucial part of contemporary science.
Promote collaborative science: Nowadays, GitHub is one of the main ways to collaborate on scientific code.
Hunt for jobs: A GitHub account is increasingly important when applying for jobs (along with a CV). Employers increasingly want to see your public repositories in GitHub as a way to evaluate your coding work.

Making a GitHub account¶

You will need a GitHub account for this lab. If you don't have one already, sign up for a free GitHub account here. While creating your account, if you like, you can add a personal touch by choosing a profile picture, and add other information, like bio, location, website, etc.

You may also want to sign up for the free student pack, which gives you access to extra stuff like private repos.

From the GitHub website you can manage your repositories and track progress on projects. You can also follow other coders (as on Twitter) and see what they're working on; many people keep their work in public repositories.

When you are done, your GitHub account should look like the one below...


If you don't have one already, sign up for a free GitHub account.

Git and GitKraken in your local computer¶

In your local computer (e.g. your laptop at home or your desktop at school or work), you can create and manage git repositories using:

Commands on a terminal (also known as command line or shell). This method is essential when working on remote clusters, supercomputers, Amazon-cloud, servers, or any other remote computer that you connect to from your personal computer via the internet. Working on remote computers is very common when working with "big data" (e.g. genome sequencing, numerical modelling, etc.).
A git GUI Client, which is user-friendly software to visualize and interact with repositories. GUI Clients are great on your personal computer; however, they are difficult (or impossible) to install on "remote" computers. There are many git GUI Clients (see list of other recommended clients). In this Lab we will use GitKraken. Feel free to try other GUI Clients.

In this lab we will simultaneously learn how to do git in the terminal, as well as with GitKraken.

GitKraken - Installation and setup¶

GitKraken should already be installed on the Lab computers. However, if you want to follow along on your personal laptop, download GitKraken here. Open the setup wizard, which should eventually take you to a sign-in page. The easiest thing to do is to sign in with your GitHub account so that the two connect automatically.

During installation, you will need to configure GitKraken by providing your name, email (choose the same email as your GitHub account), and even an avatar.

GitKraken comes with git included in the installation package. Therefore, you do not need to install git separately if you are only going to use GitKraken. In this lab, we ask you to have both installed because we want you to learn both ways (i.e. GUI and command line) to interact with repositories.

Git - Installation and setup¶

Git should already be installed in the Lab computers. However, if you want to follow along in your personal laptop, download Git from here: https://git-scm.com/download/win and follow the default instructions in the wizard for set up. Atlassian and happygitwithr are also great resources for download and set up instructions in either Mac or Windows.

To work with base Git, you need to type commands in a terminal (i.e. the "black screen" window). See below on how to find the terminal in your computer.

Where is my Terminal??!!

In Mac and Linux, you can simply interact with Git using your computer's terminal. To open your terminal simply search for "Terminal" in your Mac's search bar (i.e. Spotlight).

Windows does not come with a Bash terminal; however, Git for Windows comes with a dedicated terminal called Git Bash, which should now be installed on your machine. You should be able to choose Git Bash from the list of programmes in your Start menu (or search for "Git Bash" in the task bar).

To start off, you need to configure Git by telling it the name and email address to attach to your work (use the email address of your GitHub account), using the config command:

git config --global user.name "Your Username"

git config --global user.email "Your GitHub email address"

You can check your current Git configuration at any time using

git config --global --list


My first remote repo¶

Now that you have your GitHub account and GitKraken/Git installed on your local computer, it is time to get to work. Let's create our first repository!

NOTE: Repo = repository

It is way easier to create your first repo in GitHub and then clone it to your local computer.

In your browser, go to GitHub, go into the repositories area of your GitHub account, and press the [New] button.

On the create a new repository page, write mynewrepo as the name for your first repo
Make sure the "Public" tab is selected
Select the tab "Add a README file"
Click [Create repository].


If you are using sensitive data or for some reason prefer others not to see your code, you can choose to make your repo private. For now, keep the repository public so that we can see it.

Adding a README file is useful because you can describe the contents of the repo, aims of your project, and anything else you may want to remember later. The README file will always be visible on your repo's GitHub page.

Every repository has a dedicated URL which allows you to access its contents remotely. To see your new repo's URL, go to the repository's page on GitHub and press the code button.

Cloning locally your remote repo¶

Now that you have your new remote repository in GitHub (i.e. mynewrepo), you can "make a copy" on your local computer. The term in Git for doing that is "clone". Note that you only need to clone a repository once. Afterwards you can keep your local and remote repositories synchronized using the "pull" and "push" commands, as explained in the sections below.

Follow either of the instructions below (i.e. GitKraken or Terminal), to clone your remote mynewrepo repository into your local computer.

Using GitKraken

Click on File > Clone Repo. A "Repository Management" window will appear.

In the middle column of the "Repository Management" window, select GitHub.com
In the "Clone a Repo" section of the "Repository Management" window, click on the Browse button and select the location where you want to create your repository. In this case, let's choose: Desktop
Then, click on the "Repository to clone" pull-down menu and select the mynewrepo that you created in the step above
Click Clone the repo!

Using the Terminal (i.e. plain Git)

Open "Git Bash" or your Mac's terminal
Navigate to the place where you want to clone the repository. In this case, let's choose: Desktop
cd Desktop
In GitHub, go to your repository's page, press the code button, and copy the HTTPS URL of your repository.
In Git Bash (or your Mac's terminal), use the git clone command and the URL you copied to clone your mynewrepo repository. Put the URL in double quotation marks " " rather than single quotes ' ', which some Windows shells do not treat as quotes. The executed command should look like the one below, but with YOUR username:
git clone "https://github.com/username/mynewrepo.git"

Regardless of whether you used GitKraken or the terminal, you should now have a directory on the desktop called mynewrepo. Open it using your file explorer. In it, you should see something like the following:

Desktop/
│   └───mynewrepo/
│   │           └───.git
│   │           README.md

Congratulations! Now you have your first repo mynewrepo synced in your local computer and in GitHub.

Note that inside your mynewrepo directory there is a hidden directory called .git (you may need to activate "See hidden directories" to be able to see it). In this hidden .git directory, Git saves all the changes and instructions needed to recreate any version of your work.

Also, if you made a README.md file during the creation of your repo in GitHub, you should see that README.md file in your local mynewrepo repo.

Git Workflow¶

Now that you have set up your first repository both on your local machine and in GitHub, it is time to start moving information between the two (local and remote) repositories. However, first we must talk a bit about the anatomy and typical workflow of Git repositories. There are several "repository parts" and there are very specific actions or "commands" to move information from one part to another. These "repository parts" and "commands" are the same whether you are working with GitKraken or with plain Git, and are shown in the diagram below:

Inside your mynewrepo directory, you should see all the files in your "Working Directory". For now, you should only have README.md. The contents of "Staging Area" and "Local Repository" are not obvious; you can only see their contents using a GUI (e.g. GitKraken) or by using git commands on the terminal to query their contents. This is because the instructions on how to recreate the contents of the "Staging Area" and "Local Repository" are hidden inside the .git hidden directory.

Note that most people refer to "version controlled directories" as if they were the same as "Local Repositories" (e.g. mynewrepo directory = mynewrepo repository). Technically this is not correct, because a few files from the "Working Directory" are often left untracked on purpose and thus excluded from the "Local Repository" area (see the example below in the ".gitignore file" section). However, it is so convenient to call "version controlled directories" simply repos that people just do it, with the understanding that it is not perfectly accurate.

The "commands" or actions required to move information across the different repo parts, are:

Add: Earmarks files from the Working Directory, which are then considered to be "in the Staging Area"
Commit: "Saves" changes in all the "earmarked" files in the Staging Area, to the Local Repository (HEAD) (or whatever branch or version is currently checked out)
Push: Moves all committed changes from the Local Repository to the Remote Repository. After a "push", both repositories are "synchronized".
Pull: Moves all committed changes from the Remote Repository to the Local Repository. After a "pull", both repositories are "synchronized".
Checkout: Uses the instructions in the Local Repository to recreate the latest version of the Working Directory. Note that you can checkout previous versions or other branches, to recreate the Working Directory of whatever version or branch you decided to checkout.

In reality, you will use some of these commands many times during the day, and some other commands once or twice per day. In a typical day, you may...

start your day by "pulling" your remote repo (to make sure you are synched with GitHub)
then work, work, work
then "save" your work (i.e. add/commit) to your local repository
then work, work, work
then "save" again your work (i.e. add/commit) to your local repository
then work, work, work
then "save" again your work (i.e. add/commit) to your local repository
then end your day by "pushing" your local repo to GitHub (to make sure your computer and GitHub are synched again)
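The day described above can be rehearsed end to end without touching GitHub. In this sketch a local "bare" repository (remote.git) stands in for GitHub; the scratch directory, the notes.txt file, the example identity and the commit messages are all made up for illustration.

```shell
# Scratch area; "remote.git" plays the role of the GitHub repository.
work=$(mktemp -d)
cd "$work"

# One-off setup (on GitHub this is "create repo", followed by a clone).
git init -q seed
cd seed
git config user.name "Example User"
git config user.email "user@example.com"
echo "# mynewrepo" > README.md
git add README.md
git commit -q -m "Add README"
git branch -M main                      # make sure the branch is called "main"
cd ..
git clone -q --bare seed remote.git     # stand-in for the remote repo
git clone -q remote.git mynewrepo       # your local repo
cd mynewrepo
git config user.name "Example User"
git config user.email "user@example.com"

# Start of the day: pull
git pull -q origin main

# Work, then "save" (add + commit)
echo "first result" > notes.txt
git add notes.txt
git commit -q -m "Add first result to notes"

# More work, then "save" again
echo "second result" >> notes.txt
git add notes.txt
git commit -q -m "Add second result to notes"

# End of the day: push
git push -q origin main

# List the commits now stored in the local repository
git log --oneline
```

With a real GitHub repository, the setup block is replaced by the clone step from the previous section; the pull, add/commit and push steps stay the same.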

Below is a diagram of the typical workday explained above.

Which of the following Git day-work sequences is correct?

Which of the following is NOT a reason to use Git and GitHub?

If you want to "earmark" a file from the "Working Directory" so that it is included in the "Staging Area", you need to use:

If you want to send the latest changes from your "local repository" to your "remote repository" (i.e. GitHub), you need to use:

If you want to "save" the changes in all the files in the "Staging Area", to the "Local Repository", you need to use:

If you want to use the instructions in the "Local Repository" to recreate a particular version of the "Working Directory", you need to use:

If you want to move all committed changes from the "Remote Repository" to the "Local Repository" (to synchronize both repos), you need to use:

If you want to copy your "Remote Repository" to your local machine (i.e. to make a "Local" copy of the repository), you need to use:

Check Status of repo¶

Once your repository exists on your computer, you can add new files or modify existing ones as you would normally. Once you are finished with a given task, remember to save your changes back to GitHub by using the git commands add, commit and push. However, as we mentioned above, the contents of the "Staging Area" and "Local Repository" are not obvious; you can only see them using a GUI (e.g. GitKraken) or by using the git status command on the terminal. Below we explain how to check status with both methods (i.e. GitKraken and Terminal). Take a look first, then we'll practice with your mynewrepo repository.

In GitKraken

Simply take a look at the "Unstaged Files" panel in GitKraken

In the Terminal (i.e. plain Git)

Type:

git status


To see how this works, first we need to create a new file in our mynewrepo repository...

Create a new text file called Hello_World.txt and save it into your git repo.

Now, let's check "status" using both methods:

Check the status of your mynewrepo repository using GitKraken (see instructions above)
Check the status of your mynewrepo repository using the Terminal (see instructions above)

Stage Files¶

The next step is to add files to the staging area, which is a list that tells git which file versions you intend to commit to your local repository.

Let's take a look at how to do this, then we'll practice...

In GitKraken

Click on the "plus sign" beside the files you want to stage, or
Simply click on "Stage all changes" to stage all the unstaged files

Note that staged files now show in the "Staged Files" panel

In the Terminal (i.e. plain Git)

You can stage each file individually...

git add filename

...or you can stage all the unstaged files all at once.

git add --all

Files which have been added are now staged and will appear in green if you run a status check.


If for some reason you decide you don't want to commit the thing you have just staged, you can unstage files using the reset command

git reset filename
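The staging and unstaging commands above can be tried safely in a throwaway repository. The sketch below is an illustration only: the scratch repository, the example identity and the file contents are made up, and `git status --short` is used as a compact alternative to the full status output.

```shell
# Throwaway repository with one initial commit.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "# demo" > README.md
git add README.md
git commit -q -m "Add README"

# A new file starts out untracked: short status marks it with "??".
echo "Hello" > Hello_World.txt
before=$(git status --short)

# After "git add" it is staged: short status marks it with "A".
git add Hello_World.txt
staged=$(git status --short)

# "git reset" takes it out of the staging area again; the file itself is untouched.
git reset -q Hello_World.txt
unstaged=$(git status --short)

printf 'before add:  %s\nafter add:   %s\nafter reset: %s\n' "$before" "$staged" "$unstaged"
```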

Add any new file(s) to the staging area of your mynewrepo repository using one of the two methods above (i.e. GitKraken or Terminal)

.gitignore file¶

The .gitignore file is a text file that tells Git which files or folders to ignore in a project.

For example, consider having code that produces graphs. You want to version control your code, but you do not want to version control your graphs, since they can be easily made again from the code. In this case you can configure your code so that your output graphs are saved in a new directory called /graphs, then you can write /graphs in the .gitignore file, so that Git automatically ignores those graphs, keeping them out of your version controlled repository.

Below is a sample .gitignore file with a few items that are typically ignored in projects written in R. Note that a leading / anchors a pattern to the directory containing the .gitignore file. Also, * is a wildcard, used here to exclude all files with a given extension.

# History files
.Rhistory
.Rapp.history

# Output files from R CMD build
/*.tar.gz

# Temporary files created by R markdown
*.utf8.md
*.knit.md

When you create a new repository in GitHub or GitKraken, there is an option to include an automatically generated .gitignore file, where you select a programming language (e.g. R) and GitHub or GitKraken will return a .gitignore file pre-populated with the most common extensions and paths to ignore for that specific language.
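To check whether a .gitignore rule does what you expect, Git provides the `git check-ignore` command. The sketch below builds a throwaway repository using the /graphs example from above; the file names and contents are placeholders chosen for illustration.

```shell
# Throwaway repository with a small .gitignore.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Ignore the top-level graphs directory and R markdown temporary files.
printf '%s\n' '/graphs' '*.knit.md' > .gitignore

mkdir graphs
echo "placeholder" > graphs/plot1.png
echo "placeholder" > report.knit.md
echo "x <- 1" > analysis.R

# "git check-ignore -q" exits with status 0 when the path is ignored.
for path in graphs/plot1.png report.knit.md analysis.R; do
  if git check-ignore -q "$path"; then
    echo "ignored:     $path"
  else
    echo "not ignored: $path"
  fi
done
```

Only analysis.R is reported as not ignored, so it is the only one of the three files that `git add --all` would stage (along with the .gitignore file itself).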

Commit changes¶

The commit step is really important for version control because (1) this is the step where you actually "save" or "freeze" the latest version that you were working on, and (2) this is where you write notes to yourself or your collaborators about the latest changes. You MUST include a message in order to make a successful commit. Again, remember to use double quotation marks.

First take a look at the instructions on how to commit in GitKraken and the Terminal, then we'll practice.

In GitKraken

Write a message in the "Commit Message" panel. It is better if your message describes the new changes.
Click on the "Commit changes to # files" button

In the Terminal (i.e. plain Git)

Commit a single file:

git commit filename -m "message here"

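... or commit all staged files at once (the -a flag also includes any changes to files that Git already tracks, even if you have not staged them):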

git commit -a -m "message here"

Commit changes to your local mynewrepo repository using one of the two methods above (i.e. GitKraken or Terminal)

Push to remote repo¶

You must "PUSH" to send the latest changes from your local repository to your remote repository (i.e. GitHub). Pushing has the potential to overwrite changes, so take care when pushing.

First take a look at the instructions below, then we'll practice...

In GitKraken

Simply click on the "Push" button:

You may be asked for your GitHub password.

In the Terminal (i.e. plain Git)

Type:

git push

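You may be asked for your GitHub credentials. Note that GitHub no longer accepts your account password for Git operations over HTTPS; use a personal access token in its place.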


Push changes from your local mynewrepo repository, to your remote repo in GitHub, using one of the two methods above (i.e. GitKraken or Terminal)

Pull from remote repo¶

Pull is the opposite of push in git. You use pull to download the current version of a repo from GitHub onto your local machine (you don't need to use clone again after the first time you copy your repo onto your computer). It's good practice to run pull at the beginning of every work session if you are collaborating with others on a piece of code, as they may have changed something since the last time you viewed the file and you want to make sure that you are working on the most up to date version.

In GitKraken

Simply click on the "Pull" button:

You may be asked for your GitHub password.

In the Terminal (i.e. plain Git)

Type:

git pull

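You may be asked for your GitHub credentials. Note that GitHub no longer accepts your account password for Git operations over HTTPS; use a personal access token in its place.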

If you have not done so yet...

Pull changes from your remote repo in GitHub, to your local mynewrepo repository, using one of the two methods above (i.e. GitKraken or Terminal).

Note that you probably will get a message saying that there is nothing to "pull", since both repositories (i.e. local and remote) were synchronized in the previous step.

If you want to see the contents of the "Staging Area" and the "Local Repository", you need to:

What command do you need to "stage" a file?

What is a ".gitignore" file used for?

What do you ALWAYS have to provide when committing changes to a repository?

The "Push" command is used when...

The "Pull" command is used when...

True or False: You need to "clone" a remote repository every time you want to synchronize it with a local repository

Branching¶

One of the most important principles in Git is the notion of branching, which allows developers (including you) to work on different aspects of a coding project without affecting each other's work. A branch is an independent line of development: you can think of it as a copy of the files in a repository that can be edited and tested independently from the main body of a project, and then later merged back into the main branch. Branches are essential for collaborative work!

Branches can also be very useful even if you are the only one working on your repository (i.e. no collaboration). Sometimes you may want to work on several features or sub-sections of your project, where each feature can have its own branch, thus ensuring that no sub-project impacts the work on the others.

Make a branch on GitHub¶

Branching can be initiated either from GitHub's website, or from your computer (GitKraken or Git Bash).

Here are the instructions for starting a new branch from GitHub:

Go to your mynewrepo webpage in GitHub
Press the "main" button
In the "Find or create branch..." field, write mybranch as the name for your new branch
Press [Enter]

Create a branch on GitHub using the instructions above.

Voilà! You just made your first branch!

Note that your "main" button now says "mybranch", because you are now inside your new branch.

Checking out a Branch¶

You can view and edit branches from your local machine using GitKraken, or using the branch and checkout commands in Git Bash.

In GitKraken

In GitKraken, click on the "Pull" button to sync with the remote repo
Then, find in the left panel your mybranch branch
Right-mouse click... or click on the three horizontal dots
Click on "Checkout origin/mybranch"

You are done! Note that the center panel says mybranch rather than main

In the Terminal (i.e. plain Git)

To list existing branch names (the -a flag also lists branches that so far exist only on the remote), type:

git branch -a

To switch branches, type:

git checkout branchname


Note: If you did not pull down the latest version of your repo from GitHub after making the new branch, it will not appear in the list of branch names. Run git pull to sync up your local repo with the GitHub version.

Check out your new mybranch branch.

Once you switch, any changes you commit to your local repo and push to your GitHub repo will be saved under the mybranch branch, not the main branch.
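The sequence "branch created on the remote, then pulled and checked out locally" can be rehearsed offline. In this sketch a bare repository stands in for GitHub, and creating the branch directly inside it stands in for clicking through the GitHub website; those stand-ins, the file name and the identity are assumptions for illustration.

```shell
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"        # stand-in for the GitHub repo

git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"
git config user.email "student@example.com"
git remote add origin "$work/remote.git"
echo "hello" > README.md
git add README.md
git commit -q -m "Add README"
git push -q -u origin main

# Stand-in for creating "mybranch" on the GitHub website.
git --git-dir="$work/remote.git" branch mybranch main

git pull -q                                  # also downloads the new remote branch
git branch -a                                # lists main and remotes/origin/mybranch
git checkout -q mybranch                     # creates a local mybranch tracking origin/mybranch
git rev-parse --abbrev-ref HEAD              # prints: mybranch
```

Because exactly one remote has a branch called mybranch, git checkout creates the matching local branch and links it to the remote one automatically.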

Make a new local branch¶

You can also make branches directly on your local computer and then push them out to your remote repository.

In GitKraken

Click on the "Branch" button
A writable field will appear... write the branch name you want
Click [Enter]
If you want to upload the new branch to your remote repo, click "Push", then "Submit"

In the Terminal (i.e. plain Git)

You can also use checkout to create new branches directly from the command line. In this case, however, the new branch exists only in your local repository, and Git doesn't automatically know which branch on the remote it corresponds to. The first time you push, you have to explicitly connect the new branch with an 'upstream' branch, usually a branch with the same name on the remote (origin).

To create a new branch:

git checkout -b branchname

To push and set upstream branch:

git push --set-upstream origin branchname
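A minimal offline rehearsal of these two commands is sketched below. The bare repository standing in for GitHub, the branch name new_idea, the file names and the identity are all hypothetical.

```shell
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"        # stand-in for the GitHub repo

git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"
git config user.email "student@example.com"
git remote add origin "$work/remote.git"
echo "hello" > README.md
git add README.md
git commit -q -m "Add README"
git push -q -u origin main

# Create the branch locally and switch to it in one step.
git checkout -q -b new_idea
echo "draft" > idea.txt
git add idea.txt
git commit -q -m "Start new idea"

# Publish the branch and remember where it lives on the remote.
git push -q --set-upstream origin new_idea

git rev-parse --abbrev-ref '@{upstream}'     # prints: origin/new_idea
```

After the upstream is set once, a plain git push or git pull on that branch knows which remote branch to talk to.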

Using either method (i.e. GitKraken or Terminal), make a new branch in your mynewrepo repository. Call the new branch my_second_branch.
Make sure your new branch (i.e. my_second_branch) is checked out... it should be, since you just made it. Then, make a new text file in your working directory. Name the new text file test_file.txt and add some text inside the file. You can use RStudio to make the text file (Go to File > New File > Text File)
Add, commit and push your changes

Merge Branches¶

Once you decide you are happy with the changes you've made in your new branch, whether it's adding a new script or testing out a new functionality, you can merge your branch back into the main branch using merge.

In GitKraken

Find your mybranch branch in the left panel
Right-mouse click... or click on the three horizontal dots
Click on "Merge mybranch into main". Note that the option of merging is only available if there are differences between the branch to be merged and main

In the Terminal (i.e. plain Git)

To merge a branch into the branch you currently have checked out (so first check out main if that is where the changes should go), type:

git merge branchname

Sometimes you may want to delete side branches after merging them into the main branch, much like closing out an issue in project management software.

To delete a branch, type:

git branch -d branchname
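The order of the steps matters: you check out the branch that should receive the changes, merge the other branch into it, and only then delete the side branch. The sketch below runs through this in a throwaway repository; the branch name side_branch, the file names and the identity are invented for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"
git config user.email "student@example.com"

echo "main work" > main.txt
git add main.txt
git commit -q -m "Start project"

# Do some work on a side branch.
git checkout -q -b side_branch
echo "side work" > side.txt
git add side.txt
git commit -q -m "Add side work"

git checkout -q main                         # the branch that receives the changes
git merge -q side_branch                     # bring side_branch into main
git branch -d side_branch                    # -d refuses to delete unmerged work
ls                                           # main.txt and side.txt are both present
```

Using -d rather than the forced -D is a deliberate safety choice here, since Git will stop you from deleting a branch whose commits have not been merged.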

If this explanation isn't quite enough, you can also try GitHub's amazing interactive tutorial

Using either method (i.e. GitKraken or Terminal), merge your new branch my_second_branch to the main branch of your mynewrepo repository.
Push your changes

In Git, what is a "Branch"?

There are many applications where "branches" are used. Which of the following is NOT an application where you would use "branches"?

If you want to "stage" a file from the "Working Directory" so that it is included in the "Staging Area", you need to use:

If you want to sync your local and remote repositories, where you send the latest changes from your "local repository" to your "remote repository", you need to use:

If you want to "save" changes from the "Staging Area" to the "Local Repository", you need to use:

If you want to use the instructions in the "Local Repository" to recreate a particular version of the "Working Directory", you need to use:

If you want to use the instructions in the "Local Repository" to recreate the contents of a particular "Branch" in the "Working Directory", you need to use:

If you want to move the latest changes from the "Remote Repository" to the "Local Repository" (to synchronize both repos), you need to use:

If you want to make a copy of your "Remote Repository" (i.e. to make a "Local" copy of the repository), you need to use:

When you want to re-integrate the contents of a "branch" with the contents of the main trunk, you need to use:

Collaborations (Forks and Pull Requests)¶

So far, we have only described (in some depth) how to keep track of changes and versions of your own files, which is a very useful skill. However, the main strength of Git/GitHub is its support for collaboration. You can have multiple people, on different continents, working on the same code. This can be complicated and is beyond the scope of this lab. However, here we will give you a small introduction to collaborations in Git/GitHub and we'll point you to a 5 min YouTube video that shows you how this is done in practice.

The core of collaboration lies in something called "Pull Requests"!

If you own a repository and want to upload some changes, you would simply do a push. However, if you do not own a repository, you cannot push to it (thank goodness! Can you imagine if anybody could just push (i.e. upload) stuff to your repositories?). Luckily, there is an alternative, a GitHub feature called a Pull Request, where you can politely ask the owner of a repository to take a look at your suggested changes and, if all looks good, then the owner can "pull your requested" changes and merge them with their main repository.

In many cases, between the Pull Request and the merge there is a lot of discussion in GitHub's built-in commenting tool. The whole process looks a bit like the diagram below:

Now, to be able to open Pull Requests, you need to either:

Have access (granted by the owner) to create branches in the repository you want to suggest changes to
or, you can Fork the repository you want to suggest changes to (which can be any public repository in GitHub)

A Fork is similar to a Branch. However, Forks are independent copies of the original repository. If the original repository is deleted, all its Branches would be deleted too, but any Fork would remain in existence.
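The independence of a fork can be illustrated offline. Forks and Pull Requests are GitHub features rather than Git commands, so the sketch below only imitates them: one bare repository stands in for the owner's repository, and a bare clone of it stands in for your fork. All directory names, file names and identities are hypothetical.

```shell
work="$(mktemp -d)"

# "original.git" stands in for the owner's repository on GitHub.
git init -q --bare "$work/original.git"
git --git-dir="$work/original.git" symbolic-ref HEAD refs/heads/main
git init -q "$work/owner"
cd "$work/owner"
git symbolic-ref HEAD refs/heads/main
git config user.name "Owner"
git config user.email "owner@example.com"
echo "original text" > README.md
git add README.md
git commit -q -m "Start project"
git push -q "$work/original.git" main

# Forking on GitHub makes an independent copy under your account;
# a bare clone plays that role here.
git clone -q --bare "$work/original.git" "$work/fork.git"

# Clone your fork, make a change and push it back to the fork.
git clone -q "$work/fork.git" "$work/me"
cd "$work/me"
git config user.name "Contributor"
git config user.email "contributor@example.com"
echo "suggested fix" >> README.md
git commit -q -a -m "Suggest a fix"
git push -q

# The fork has the new commit; the original does not.
git --git-dir="$work/fork.git" rev-list --count main       # prints: 2
git --git-dir="$work/original.git" rev-list --count main   # prints: 1
```

The original repository stays unchanged until its owner accepts a Pull Request, which is exactly the step that happens on GitHub rather than on the command line.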

Take a look at this 5 minute YouTube video from Jake Vanderplas, where he demonstrates a Fork and Pull request in a simple real-life collaboration: https://www.youtube.com/watch?v=rgbCcBNZcdQ

True or False: If you find a public repository in GitHub (from a stranger), and you see that there are some really useful R functions in this public repository, you can just "clone" the repo to your local machine and start using those functions.

True or False: If you see an error in a public repository in GitHub (from a stranger), you can just "clone" the repo to your local machine, fix the error and then "push" the changes to correct the error in the public repository in GitHub.

If you see an error in a public repository in GitHub (from a stranger), you can send corrections to the owner of the public repository in GitHub, using:

What is the difference between a "branch" and a "fork"?

When you want to re-integrate the contents of a "branch" with the contents of the main trunk, you need to use:

If you are the owner of a repo, and a collaborator just submitted a "pull request" that you want to include in your repo, you need to use:

Now it is time to do an exercise requiring "Forking" a public repository and creating a "Pull Request". The objective of the exercise is to make a "phytoplankton plot", and to contribute your plot to a collaborator's repository via a "Pull Request":

Go to the following repository: https://github.com/Diego-Ibarra/phytoplankton_plot
"Fork" the repository. This will make a copy of the repository in your own GitHub account.
Clone your forked repository to your local computer.
Take a look inside your cloned repository. You'll see there are:
- an R file: make_phytoplankton_plot.R
- a directory named data, which contains a data file called phytoplankton_data.csv
- a directory named submitted_plots. Note that, depending on when you forked the repo, there may be some plots already in this directory.
- a README.md file and a LICENSE file
Use RStudio to open the file in your cloned repository called make_phytoplankton_plot.R
Run the whole make_phytoplankton_plot.R file using the button... a plot should appear in the "Plot panel"
In the "Plot panel", click on the "Export" button and save your plot as FirstName_LastName.png (e.g. John_Smith.png) inside the submitted_plots directory within your cloned repository.
Add/commit/push your cloned repository
In GitHub, open a "Pull Request" asking the original owner of the "phytoplankton_plot" repo to accept your changes (i.e. the new plot that you made).

If everything goes well, at the end, there will be one plot for every student in the class, inside the repo's submitted_plots directory.
