LAB 5: Advanced data manipulation

BIO3782: Biologist's Toolkit (Dalhousie University)


Setup of workspace

Make sure all required files are in the working directory:

As in previous labs, we'll try to simulate "real-life" coding, where:

  1. Sometimes you want to type or copy-paste directly into a .r file, using RStudio's script editor, to build a script file that can be run as a whole to accomplish a series of tasks. This file will allow you to save your work so that you can review it (or expand it) at a later date. In the script editor you can execute sections as you type them in, by highlighting them and clicking Run. We'll guide you when to use the script editor by displaying the following before a "code" cell:



  1. Some other times you want to quickly do "one-off" queries or inspections that you do not want to be part of your script file. In this case, type or copy-paste code into RStudio's console and press [enter]. We'll guide you when to use the console by displaying the following before a "code" cell:




Data manipulation

The path between data collection and the final plot or analysis in a publication is, to say the least, complicated and convoluted. Raw data files contain data organized in a way that preserves all the information that was available during collection, often in a format that facilitates data entry. For each different research objective (whether it is to test a hypothesis, make a plot, or run an analysis), the raw data often needs to be filtered, split, joined or merged with other data, reshaped, mutated, summarized, and/or put through many other processes and transformations. All of these intermediate steps, which are necessary to get the raw data into a shape or format suitable for the final analysis, are collectively known as data manipulation.

The tidyverse

The tidyverse package is an "umbrella-package" (or "bundle-package") that installs tidyr, dplyr, and several other packages useful for data analysis, such as ggplot2, tibble, etc.

The tidyverse package tries to address 3 common issues that arise when doing data analysis with some of the built-in functions that come with R:

  1. The results from a base R function sometimes depend on the type of data.

  2. Using R expressions in a non-standard way, which can be confusing for new learners.

  3. Hidden arguments, having default operations that new learners are not aware of.

Here we will only install and load the tidyverse package. In the following sections we will learn how to use many functions from several packages included in the tidyverse.

You probably already have the tidyverse package installed; if so, skip this step.

To install the tidyverse package...
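If you need to install it, a single call from the console does it (one-time, requires an internet connection):

```r
# Download and install tidyverse and all of its dependencies (only needed once)
install.packages("tidyverse")
```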


To load the tidyverse package...
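Loading the bundle makes all of its packages available in the current session:

```r
# Loads dplyr, tidyr, readr, ggplot2, tibble, and friends
library(tidyverse)
```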


You should also download the dplyr cheat sheet for reference as well.

What are dplyr and tidyr?

Both dplyr and tidyr are packages included in the tidyverse package. Both should already be loaded (from when you ran library(tidyverse)) and ready to go...

dplyr is a package for making tabular data manipulation easier. It pairs nicely with tidyr which enables you to swiftly convert between different data formats for plotting and analysis.

The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized to run very fast (they were written in a compiled language, like C++).

The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups (e.g., a time period, an experimental unit like a plot or a batch number). Moving back and forth between these formats is non-trivial, and tidyr gives you tools for this and more sophisticated data manipulation.

The data

Here we will work with data collected to evaluate the cumulative effects of drought and disease on the silver tree (Leucadendron argenteum). The data includes measured physiological responses of the plant such as photosynthetic oxygen evolution and electrical conductivity.

First, let's download the data.

You should already have the data files Silver Tree Study photosynthetis.csv and Silver tree study conductance.csv inside Desktop/Lab5, and have set up that folder as your working directory. If you haven't, take a look at the beginning of this lab.

First, we'll read in our data using the read_csv() function from the tidyverse package readr.
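A sketch of the read-in step, assuming the two CSV files sit in your working directory under the exact names given above:

```r
# read_csv() comes from readr, loaded as part of the tidyverse
photosynthesis <- read_csv("Silver Tree Study photosynthetis.csv")
conductivity   <- read_csv("Silver tree study conductance.csv")
```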


Let's take a look at the photosynthesis dataset using the head(), tail() and str() functions.
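For example (assuming the data frame is named photosynthesis, as above):

```r
head(photosynthesis)   # first 6 rows by default
tail(photosynthesis)   # last 6 rows
```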




How many rows did the function head() display?



What date was the first row collected on?



What is the function to display the last few rows of a dataframe (opposite of head())?

Now let's take a look at the structure of the data frame.
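The structure is shown with a single call:

```r
# Displays dimensions, column names, column classes, and a preview of values
str(photosynthesis)
```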




How many columns/variables are there in photosynthesis?



What class of variable is Species?



TRUE or FALSE: The function str() shows the structure of the dataframe.

Let's make a plot to quickly assess Photosynthesis as a function of pathogen species.


As you can see, the first bar (i.e. Both Pathogens) shows smaller Photosynthesis than the Control and than trees with only the exotic pathogen or the indigenous pathogen.

Now let's do the same with conductivity.




Take a look at the conductivity datasets using the head(), tail() and str(). Then answer the following questions.






How many observations are in conductivity?



How many columns/variables are there in conductivity?



What class of variable is Conductance?



What is the last value in the Conductance column?

Let's make a plot to assess Conductance as a function of pathogen species.


When looking at conductance, the pattern is even stronger than with photosynthesis. Here too, the first bar (i.e. Both Pathogens) shows smaller conductance than the Control and than trees with only the exotic pathogen or the indigenous pathogen.

Since the experiment was run with two different instruments, we ended up with two different datasets of paired measurements. We can combine both datasets into one more useful dataset.

Combining two or more datasets

There are several functions that allow you to combine multiple datasets. Here we will discuss several join functions and also the merge() function.

Join

Join functions allow you to connect multiple tables based on common variables (columns). There are many types of join functions included in dplyr. Here we'll review a few, but you can see the full list here.

Most join functions follow the basic syntax below:

joinFunctionName(dataFrame1, dataFrame2, by = "variable names")

However, most of the time you will use the "joined" product multiple times; therefore, you probably want to "save" the joined table in a new variable:

newJoinedDataFrame <- joinFunctionName(dataFrame1, dataFrame2, by = "variable names")

left_join

The most useful of the join functions is left_join(), which takes a table (say table 1) and then adds all matching records from table 2. The resulting table will show all records from table 1 (the left-hand table), and matching records from table 2 (the right-hand table).

The specific syntax of left_join() is:

dataFrame3 <- left_join(dataFrame1, dataFrame2, by = "variable names")

Graphically, it would look like this:

However, to make the graphics a bit simpler, we won't save the resulting "joined" data frame into the df3 variable. Instead, we will only show the command run over the resulting joined table. The new, simpler graphic looks like this:

From now on, we will be using the simpler graphics in all the examples below.


About left_join(): note that R will automatically detect which columns the two tables have in common IF they are called the same thing. If not, you have to feed the matching column names into the left_join() function.

We can create a new dataframe with all the information from both photosynthesis and conductivity by joining the two dataframes using the qualifying variables in the columns from "Sample type" to "Obs".
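A sketch of the join. The by list below spells out the shared identifier columns from "Sample type" through "Obs"; the exact names and order are an assumption, so adjust them to match your files:

```r
combined_data <- left_join(photosynthesis, conductivity,
                           by = c("Sample type",
                                  "Days after inoculation",
                                  "Date",
                                  "Trial",
                                  "Treatment",
                                  "Species",
                                  "Plant Number",
                                  "Isolate Number",
                                  "Unique Sample Number",
                                  "Licor",   # assumed to sit between "Sample type" and "Obs"
                                  "Obs"))
```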


Let's look at the combined dataset below using the head() and str() functions:


We can see that left_join combined photosynthesis and conductivity data by all the criteria columns we specified above. We basically get photosynthesis back, but with the addition of variables Conductance and Ci, which are unique to conductivity.




How many observations(rows) does `combined_data` have?



How many columns does `combined_data` have?

inner_join

The inner_join() function returns all rows from dataframe 1 where there are matching values in dataframe 2, and all columns from dataframe 1 and dataframe 2. If there are multiple matches between them, all combinations of the matches are returned. Basically, the result shows only 'matched' records, where there is a matching value in both tables. Instances where there are no matching values are thrown out.



Then,


Notice that we get repeated variables Licor.x and Licor.y. We didn't specify that Licor was a criterion for joining, and since it appears in both datasets, the join creates new variables for it: one from each dataset.



What variables would be duplicated if we ran the following code?

inner_join(photosynthesis, conductivity,
           by = c("Sample type",
                  "Days after inoculation",
                  "Date",
                  "Trial",
                  "Treatment",
                  "Species",
                  "Plant Number",
                  "Isolate Number",
                  "Unique Sample Number"))


full_join

full_join() will return all rows and all columns from both dataframe 1 and dataframe 2. It returns NA for instances where there are no matching values.
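A sketch, using the two data frames from earlier; with no by argument, the join matches on all shared column names:

```r
combined_full <- full_join(photosynthesis, conductivity)
```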






Take a look at combined_full using the head(), tail() and str(). Then answer the following questions.






How many observations/rows does the variable Ci have in the combined_full dataset?



Are the variables (or columns) of combined_full and combined_data the same?

Notice that this returns pretty much the same dataset as the one we created with left_join(). We didn't have to specify any of the variables to join by in this case. We get all variables and rows from both photosynthesis and conductivity.

anti_join

anti_join() returns all rows from dataframe 1 that do not have matching values in dataframe 2; it removes every record that has a match. This basically filters your dataset. Note that anti_join() is essentially the opposite of inner_join().
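A sketch; as with full_join(), omitting by matches on all shared column names:

```r
# Keep only the rows of photosynthesis that have NO match in conductivity
combined_anti <- anti_join(photosynthesis, conductivity)
```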






Take a look at combined_anti using the head(), tail() and str(). Then answer the following questions.






How many variables does combined_anti have?



How many rows does combined_anti have?

Notice that we effectively returned all the variables of photosynthesis without any data! This happened because conductivity had all the same variables, and these measurements were paired with those of photosynthesis (taken from the same sample at the same time). There were no instances where the variables didn't match, so the anti_join() function could not return any observations.

Merge

Another way to combine datasets is to use the function merge(). By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by the by, by.x and by.y arguments. The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each. This basically merges two datasets horizontally.




Use the function's help file to look at the syntax for merge().




Let's recreate the combined_data dataset using merge instead.
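A minimal sketch; by default, merge() matches on every column name the two data frames share, which is what we want here:

```r
combined_data <- merge(photosynthesis, conductivity)
```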


Now, lets take a look at combined_data





What does the argument "by" in the merge function do?



How many variables does combined_data have?



What extra variable does combined_data have that conductivity does not?



Are the variable classes of combined_data created using left_join and merge the same?




Let us download the task datasets task_data_fox_size.csv and task_data_fox_weight.csv

Using Joins or merge, create a new dataset called fox_data combining data from both task_data_fox_size.csv and task_data_fox_weight.csv






How many rows does fox_data have?



How many columns does fox_data have?



Using the summary() function, what is the median value of avfood?



What class of variable is ID?

Next, we're going to learn some of the most common dplyr functions:


Selecting columns

If we only want to keep certain columns of data, we can use select(). The first argument of this function is the dataset, and the subsequent arguments are the columns to keep. You can read all the details on how to use select() in its help file. However, the basic syntax is:

newSelectedData <- select(data, column names)

Graphically, select() looks like this:


In the following code, we will select the columns Sample type, Photosynthesis, and Conductance.
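A sketch of the call; note that column names containing spaces must be wrapped in backticks:

```r
sml_data <- select(combined_data, `Sample type`, Photosynthesis, Conductance)
```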


Let's take a look at sml_data:


Then,




How many variables (or columns) are in sml_data?

Selecting multiple columns with :

We can also use : to select multiple columns from left to right.
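For example, to keep every column from `Sample type` through Species (the endpoints here are illustrative; use whichever two columns bracket the range you want):

```r
sml_data_select <- select(combined_data, `Sample type`:Species)
```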


Let's take a look at sml_data_select:




How many variables (or columns) are in sml_data_select?

Omitting columns during select

To select all columns except certain ones, put a "-" in front of the variable to exclude it. In the following code, we want to exclude the columns "Licor" and "Obs".
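The exclusion looks like this:

```r
sml_data_omit <- select(combined_data, -Licor, -Obs)
```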


Let's take a look at sml_data_omit:




How many variables (or columns) are in sml_data_omit?



Using the "combined_data" dataset and the "-", what would the code look like to select all columns except Species?

Filtering rows

What if we only want to select the data from one Species? You can also choose rows based on a specific criterion using the filter() function. You can read all the details on how to use filter() in its help file. However, the basic syntax is:

filteredData <- filter(dataset, selection criteria)

Graphically, filter() looks like this:

In the following code, we want to select all data from Indigenous Pathogen Species only.

You could do this following the same approach as above...

Step 1: create a new filtered_data variable:
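A sketch of the first step:

```r
filtered_data <- filter(combined_data, Species == "Indigenous Pathogen")
```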


Step 2: Use the new filtered_data variable:



...or we can skip the intermediate variable and "nest" one function within another (see example below). This is preferable if you are only going to use the "filtered" data once. If you are going to use the filtered data multiple times, it is better to create a separate variable, like filtered_data. We'll talk more about nesting functions in a section below.

For now, let's filter combined_data and take a look, using nested functions:




How many rows are in the original combined_data dataset?



How many rows are in filter(combined_data, Species == "Indigenous Pathogen")?

We can also filter by multiple criteria. Say we want Obs 1 data from the Indigenous Pathogen Species only. We can use the filter() function and expand our selection criteria.
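Comma-separated conditions inside filter() are combined with AND:

```r
filter(combined_data, Species == "Indigenous Pathogen", Obs == 1)
```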




How many rows are in filter(combined_data, Species == "Indigenous Pathogen", Obs ==1)?


Let's take a look at our Species data to see if we can clean up our dataset. We note that Species is a character column, for which the levels() function returns NULL (i.e. empty, or nothing).


Using the function as.factor() we can convert Species entries into factors:
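The conversion overwrites the character column in place:

```r
combined_data$Species <- as.factor(combined_data$Species)
```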


Now, we can use the levels() function to examine the factor levels:


For simplicity, let's create a new object data_filtered and select only rows where Photosynthesis < 200 and Species is not equal to "Both Pathogens". Let's take a look at the first 4 rows.
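A sketch of that filtering step:

```r
data_filtered <- filter(combined_data,
                        Photosynthesis < 200,
                        Species != "Both Pathogens")
head(data_filtered, n = 4)
```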


Now, let's take a look at data_filtered:


Then,


Then,


Note that "Both Pathogens" still appears as a level of Species in data_filtered but all the observations have been removed. Let's compare Photosynthesis vs Conductance between combined_data and data_filtered graphically.


Then,


In data_filtered, outlying data points (Photosynthesis > 200) have been filtered out, making it easier to visualize our data.



What does != in the filter() function mean?



How many levels does Species have in data_filtered?



TRUE or FALSE: "Both Pathogens" in data_filtered has 0 observations because they did not take measurements?




Create a dataset called male_fox_data wherein only data from male foxes are included.






How many rows does male_fox_data have?

Sequential operations

What if you want to select and filter at the same time? In other words, what if you need to do something to your data, get the output and do something else with it, then take that output and do something else with it again...

There are three ways to do this:

  1. use intermediate steps,
  2. use nested functions, or
  3. use pipes

Intermediate steps

This is when you save the output of each intermediate step into a variable, and then feed that intermediate variable as the input of the next step.

Example:

What if we wanted to select only Sample type, Photosynthesis and Conductance from the Control and Indigenous Pathogen Species? We could create multiple objects.

  1. First, filter for species Control and Indigenous Pathogen and save the output to an intermediate variable (i.e. species_filtered)

  2. Then apply select() to the intermediate variable species_filtered to create your final output: species_sml

We can use %in% inside filter() to specify a list of parameters/levels we want to keep.

Graphically, what we want to do would look like this:

The actual code to do it is below:
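A sketch of the two steps, assuming we start from the data_filtered object created earlier:

```r
# Step 1: keep only the Control and Indigenous Pathogen rows
species_filtered <- filter(data_filtered,
                           Species %in% c("Control", "Indigenous Pathogen"))

# Step 2: keep only the columns we need
species_sml <- select(species_filtered,
                      `Sample type`, Photosynthesis, Conductance)
```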


Now, let's take a look at species_sml:


Then,


This is readable, but it is clunky and can clutter up your workspace with lots of small objects that you'd have to keep track of.

Nested functions

Another method would be to use nested functions. This is when you write one function inside of another function, like this:
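The same two steps from the previous section, collapsed into one nested call (a sketch, under the same assumptions):

```r
species_sml <- select(
  filter(data_filtered, Species %in% c("Control", "Indigenous Pathogen")),
  `Sample type`, Photosynthesis, Conductance)
```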


Let's take a look:


Then,


See, same results as above...

R evaluates the expression from the inside out (in this case, filtering, then selecting). I would avoid this method since it makes debugging your code a nightmare! It is also hard to read!

Pipes

A better method is to use pipes. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr.

If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.

We will re-create the species_sml dataset using pipes. First, we want to filter data from the data_filtered dataset, keeping rows where Species is either Indigenous Pathogen or Control, then select only the Sample type, Species, Photosynthesis and Conductance columns. We will use the pipe to send the data_filtered dataset through the filter() then select() functions.
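The piped version reads top to bottom, one operation per line:

```r
species_sml <- data_filtered %>%
  filter(Species %in% c("Control", "Indigenous Pathogen")) %>%
  select(`Sample type`, Species, Photosynthesis, Conductance)
```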


You can also use pipes to take a look at your data:


However, I find it simpler to use the "normal" approach:


See, same results as above...

Now, let's make a plot also using pipes:


Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don't need to explicitly include the data frame as an argument to the filter() and select() functions any more.




Using the fox_data dataset, create a new object fox_data_cleaned that contains only the columns ID, Sex, avfood, from female foxes only.






How many rows are there in fox_data_cleaned?



What is the median weight of female foxes in the fox_data_cleaned dataset?


Reshaping dataframes

Useful links:

  1. Pivot tutorial
  2. pivot_longer
  3. pivot_wider

Sometimes you may need to alter the layout of tabular datasets while preserving the values and relationships contained within. This can also be very useful when plotting data from very large datasets with multiple columns. You may have heard of spread() and gather(). These are functions with similar but more limited functionality, and they were replaced by pivot_wider() and pivot_longer().

pivot_wider

pivot_wider() moves rows into columns (long to wide format). Values in the classifier/parameter column (here, Species) become new column names, and their respective values get copied into the reshaped data. Cells with no observations get coded with a value of NA.

Let's use pivot_wider() to turn Species factors into column names by changing the dataset from long to wide format.
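A sketch of the reshape; names_from names the column whose values become new column headers, and values_from names the column whose values fill them:

```r
data_wide <- data_filtered %>%
  pivot_wider(names_from = Species, values_from = Photosynthesis)
```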


Let's take a look at data_wide:



Notice that some columns have NA values. pivot_wider() inserts NA wherever a value doesn't exist in the data.

Let's try to plot the data from the Control, Exotic Pathogen, and Indigenous Pathogen columns by date. In the wide format we will have to create 3 separate plots and use a plotting package like cowplot to achieve what we want. Wide format is nice for people who are not used to seeing data separated into value (i.e. mg/L O2) and classifier/parameter (i.e. oxygen) columns (like what you'll see in a little bit).

First, if you have not yet installed cowplot, you may have to:
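The one-time install:

```r
install.packages("cowplot")
```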


Then, load the cowplot package:
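Loading is one call:

```r
library(cowplot)
```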


Then, do the plot...




How many observations does data_wide have?

pivot_longer

In this situation we are gathering the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.

pivot_longer() moves columns into rows (wide to long format). pivot_longer() takes several arguments, but the most important ones for this exercise are names_to and values_to. names_to creates a new classifier/criteria column, taking its entries from your column names. values_to takes the values from those columns and creates a new column of values classified by the factors created by names_to.

Let's use pivot_longer to transform data_wide back into data_filtered.
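A sketch of the reverse reshape; dropping the NA cells that pivot_wider() created (via values_drop_na) is an assumption made here so the result matches data_filtered:

```r
data_long <- data_wide %>%
  pivot_longer(cols = c(`Control`, `Exotic Pathogen`, `Indigenous Pathogen`),
               names_to = "Species",
               values_to = "Photosynthesis",
               values_drop_na = TRUE)
```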


Let's take a look:



We have basically created a column of values called Photosynthesis and a classifier/parameter column called Species. Let's try and plot Photosynthesis for all three species by date again.


We got a similar graphical result with fewer lines of code! Long format is great when you need to plot data with multiple factor levels.



How many rows does data_long have?



TRUE or FALSE: data_wide and data_long have the same number of observations because pivot_wider fills missing observations with NA?


Mutate

We can use mutate() to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns.

For example, we can convert Photosynthesis oxygen measurements from mg/L to ug/L by creating a new column called Photosynthesis_ppb
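Since 1 mg/L = 1000 ug/L, the new column is just the old one scaled:

```r
unit_conversion <- combined_data %>%
  mutate(Photosynthesis_ppb = Photosynthesis * 1000)
```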



Let's take a look at unit_conversion:


Then,


Let's make a plot to make sure it worked. That is, that the relationship between Photosynthesis and Photosynthesis_ppb is linear and with a slope of 1000:


Let's see some summary statistics:


Notice that summary statistics after the mutation are 1000 times the original.

We can also use mutate to change variable level names.

First, let's check original levels:


Now, let's create a new object change_labels with the new level names:


Let's check levels again:


We can also use mutate() to change variable types. Here, we will convert some variables from character to factors:


We can use the function sapply() to see the column variable classes:
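To make the pattern concrete, here it is demonstrated on R's built-in iris dataset; on the lab data you would call sapply(combined_data, class) the same way:

```r
# Apply class() to every column; returns a named character vector
sapply(iris, class)
```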





In fox_data, create a new column called 'Consumption' which will contain values of avfood divided by weight, i.e. Consumption = avfood/weight






What is the mean Consumption?


group_by() and summarize() functions

Many data analysis tasks can be approached using the split-apply-combine paradigm:

  1. split the data into groups
  2. apply some analysis to each group
  3. combine the results.

dplyr makes this very easy through the use of the group_by() function. group_by() takes an existing data frame and converts it into a "grouped" data frame where operations are performed "by group".

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics.

In the example below, we compute the mean and standard deviation of Photosynthesis measurements by Treatment.
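A sketch of the grouped summary (na.rm = TRUE guards against missing values):

```r
combined_data %>%
  group_by(Treatment) %>%
  summarize(mean_photo = mean(Photosynthesis, na.rm = TRUE),
            sd_photo   = sd(Photosynthesis, na.rm = TRUE))
```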




What would the code inside the summarise function look like if we wanted to get the mean of Conductance?

Let's take a look at our data graphically. We notice that the solid horizontal lines representing our average Wet and Drought values don't line up with the boxplot median. This indicates that our data might not be normally distributed. We may want to use the median() function to find the midpoint of the data instead.


We can also group by multiple columns.
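Extra grouping columns are simply listed in group_by(); for example, by both Treatment and Species (a sketch using the median, as suggested above):

```r
combined_data %>%
  group_by(Treatment, Species) %>%
  summarize(median_photo = median(Photosynthesis, na.rm = TRUE))
```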





What would the code inside the group_by function look like if we wanted to group by both Date, Treatment and Species?

We can also arrange the values in order of median photosynthesis using the arrange() function. To sort in descending order, we need to add the desc() function.
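A sketch, piping the grouped summary straight into arrange() (median_photo is a placeholder name for the summary column):

```r
combined_data %>%
  group_by(Treatment, Species) %>%
  summarize(median_photo = median(Photosynthesis, na.rm = TRUE)) %>%
  arrange(desc(median_photo))
```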


Here is a table of some useful functions you can use with summarize.



Using the help section, what would your arrange function code look like to arrange median in ascending order?




In fox_data, find the mean weight by sex of the fox population.






What is the mean weight of Male foxes?


Counting observations

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). Let's summarise the number of measurements made for each plant on each day.
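A sketch of the count, assuming the plant identifier column is `Plant Number` as in the join criteria earlier:

```r
combined_data %>%
  count(`Plant Number`, Date)
```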


We can also use summarise() and n() with group_by() to attain similar results.
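The equivalent group_by() version (same column-name assumptions as above):

```r
combined_data %>%
  group_by(`Plant Number`, Date) %>%
  summarise(count = n())
```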


There should be at least 10 measurements per plant per day (count >= 10). Say we wanted to see how many plants there were per Treatment and Species; we could use the function n_distinct() to count unique instances and remove duplicates.
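A sketch of the distinct-plant count per Treatment and Species:

```r
combined_data %>%
  group_by(Treatment, Species) %>%
  summarise(n_plants = n_distinct(`Plant Number`))
```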


We can proceed with arrange() to sort the table according to a number of criteria so that we have a better comparison. For instance, we might want to arrange the table above in:

  1. by Treatment and,
  2. in descending order of the count:


Let's see how many samples were taken for each species and treatment for each value of Days after inoculation.




What is the difference between the functions n() and n_distinct()?

Preliminary data examination

Now that we've got the basics down, let's play around with our data.

Mean per plant

Let's take a look at the mean photosynthesis value per plant while keeping our other variables of interest. Here we are calculating an average for each plant. Note that we did not use the same plants every time we took measurements (1DroughtControl1 was only measured on days 9 and 22). We could only measure <30 plants per day, so we randomly selected them from the treatments each time we went to measure.



Each plant is a sampling unit, so we want to conduct our analysis with a single value for each plant. If we include each measurement as a separate observation, we are incorporating pseudoreplication, because the measurements were not independent (10 measurements were made on a single plant; therefore, each measurement is dependent on the plant).

Mean per group

Ok now that we have the means per plant, we can look at the means per trial, treatment, etc.


If there was only one plant in a group, the standard deviation (sd) could not be calculated because there is no variation. Some of the data might be missing, and we would need to address this in a real analysis.

We can plot the mean values from each Species and each Treatment on each day they were measured.


This plot reveals some concerns: only control plants were measured on day 3, and no control plants were measured on day 5. Only one group of plants was measured on day 36. There are lots of errors in the data, demonstrating the importance of exploring your data graphically!

Let's see if we can add estimates of error to our measurements.



Exporting data

Now that you have learned how to use dplyr to extract information from or summarize your raw data, you may want to export these new data sets to share them with your collaborators or for archival.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), we are going to create a new folder, data_cleaned, in our working directory that will store this generated dataset. We don't want to write generated datasets in the same directory as our raw data; it's good practice to keep them separate. The folder for raw data should only contain the raw, unaltered data, and should be left alone to make sure we don't delete or modify it. In contrast, our script will generate the contents of the data_cleaned directory, so even if the files it contains are deleted, we can always re-generate them.

Let's create a new folder/directory called data_cleaned using the dir.create function.

Now that our dataset and the data_cleaned folder are ready, we can save the dataset as a CSV file in that folder.
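A sketch of both steps; the object and output file names here (data_filtered, data_filtered.csv) are placeholders for whichever dataset you want to export:

```r
# Create the folder (only needed once; warns if it already exists)
dir.create("data_cleaned")

# Write the cleaned dataset as a CSV file inside that folder
write_csv(data_filtered, "data_cleaned/data_filtered.csv")
```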

This is the end of the lab.


Code below is for formatting of this lab. Do not alter!