LAB 6: SQL databases and R

BIO3782: Biologist's Toolkit (Dalhousie University)


Setup of workspace

Make sure the required files are in the working directory:

As in previous labs, we'll try to simulate "real-life" coding by using the tags below to indicate when to use RStudio's Console and when to use the Editor:







What are databases?

A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS). Together, the data and the DBMS, along with the applications that are associated with them, are referred to as a database system, often shortened to just a database.

Data within the most common types of databases in operation today is typically modeled in rows and columns in a series of tables, making processing and querying efficient with minimal duplication of information. Below is a diagram of a typical database, where each box represents a table (similar to an MS-Excel spreadsheet) and the arrows show how the different tables are linked to each other:

The data within databases can then be easily accessed, managed, modified, updated, controlled, and organized, all with minimal duplication of information. Most databases (e.g., Oracle, Salesforce, MySQL, IBM DB2; see full list HERE) use structured query language (SQL) for writing and querying data.

Check out this brief history of databases on Wikipedia.

What to do about very large datasets?

So far, we have dealt with small datasets that easily fit into your computer's memory. But what about datasets that are too large for your computer's memory to handle as a whole? In this case, you need to store your data in a format that can be opened in small pieces, which includes databases (note that there are other options, like NetCDF files and ERDDAP servers). Connecting to the database allows you to retrieve only the chunks needed for the current analysis. Even better, many large datasets are already available in public or private databases. You can query them without having to download the data first.

R can connect to almost any existing database type. Most common database types have R packages that allow you to connect to them (e.g., RSQLite, RMySQL, etc.). Furthermore, the dplyr package, in conjunction with dbplyr, supports connecting to the widely-used open source databases sqlite, mysql and postgresql, as well as Google’s bigquery, and it can also be extended to other database types.

The dplyr package now has a generalized SQL back end for talking to databases, and the new dbplyr package translates R code into database-specific variants. SQL variants are supported for the following databases: Oracle, Microsoft SQL Server, PostgreSQL, Amazon Redshift, Apache Hive, and Apache Impala. Interfacing with databases using dplyr focuses on retrieving and analyzing datasets by generating SELECT SQL statements, but it doesn't modify the database itself. dplyr does not offer functions to UPDATE or DELETE entries. If you need these functionalities, you will need to use additional R packages (e.g., RSQLite).

What is Structured Query Language (SQL)?

SQL is a programming language used by nearly all relational databases to query, manipulate, and define data, and to provide access control. SQL was first developed at IBM in the 1970s, with Oracle as a major contributor, which led to the implementation of the SQL ANSI standard. SQL has since spurred many extensions from companies such as IBM, Oracle, and Microsoft. Although SQL is still widely used today, new programming languages are beginning to appear.

SQLite is the most widely deployed database in the world. It is included on Android, iPhone and iOS devices and in the Firefox, Chrome and Safari web browsers. Apple and Microsoft include it in their macOS and Windows 10 operating systems, respectively, and many other products include SQLite as well. It is extremely easy to use and can be of great value to developers who need a database available but want to avoid the overhead often associated with installing and configuring an external database. In this demonstration, we will download and install the RSQLite package, which integrates SQLite into R running in RStudio.

Here we will demonstrate how to interact with a database using dplyr, using both dplyr's verb syntax and SQL syntax.

Connecting to databases

We can point R to a database using the dplyr, dbplyr and RSQLite packages. We will create a new, empty SQLite database where we can store the mammals data. SQLite has a rather simple data storage mechanism: all data for a database is stored within a single file. The name of this file must be specified when the database is created, and a connection to this database is returned and used in subsequent commands to access and manipulate the data and data structures within the database.

Let's install the dbplyr and RSQLite packages:
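For example:

install.packages("dbplyr")
install.packages("RSQLite")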


Then, load the required packages....
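For example:

library(dplyr)
library(dbplyr)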


Then, connect to the database.
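A sketch of this step (assuming the portal_mammals.sqlite file is in your working directory, as set up above):

mammals <- DBI::dbConnect(RSQLite::SQLite(), "portal_mammals.sqlite")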


This command uses two packages that help dbplyr and dplyr talk to the SQLite database. DBI is not something that you'll use directly as a user. It allows R to send commands to databases irrespective of the database management system used. The RSQLite package allows R to interface with SQLite databases.

This command does not load the data into the R session (as the read_csv() function did). Instead, it merely instructs R to connect to the SQLite database contained in the portal_mammals.sqlite file.

Using a similar approach, you could connect to many other database management systems that are supported by R including MySQL, PostgreSQL, BigQuery, etc.

Let's take a closer look at the mammals database we just connected to.
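One way to do this is with the src_dbi() function (referred to in the questions below):

src_dbi(mammals)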


Just like a spreadsheet with multiple worksheets, a SQLite database can contain multiple tables. In this case, three of them are listed in the tbls row in the output above: plots, species, and surveys.

Let's take a look to see what is inside the surveys table:
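For example:

tbl(mammals, "surveys")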

As you can see, the surveys table has ?? rows and 9 columns. We do not know the number of rows (i.e., the ??) because tbl() does not read the entire dataset; it only looks at a small sample to give you an idea of what is inside. However, tbl() gives you enough information to figure out that each row is probably an observation, including the date (i.e., year, month and day), IDs for record, plot and species, and 3 columns of data: sex, hindfoot_length and weight.

Let's take a look at the species table:
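For example:

tbl(mammals, "species")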


Here we have 4 columns and an unknown number of rows (??). However, note that the species table has species_id, which is also found in the surveys table. This is a clever way to minimize duplication of information. In the main table, surveys, you only need to enter ONE value per observation (i.e., species_id) to be able to identify the genus, species and taxa of the observation (after linking the surveys table with the species table).




Take a look inside the plots table.
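For example:

tbl(mammals, "plots")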







What type of file is portal_mammals?



What does the src: from the src_dbi function tell you?



How many columns does the plots table have?

Now that we know we can connect to the database, let's explore how to get the data from its tables into R.

Querying with SQL syntax

To connect to tables within a database, you can use the tbl() function from dplyr. This function can also be used to send SQL queries to the database. That is, between quotes " ", you send a command in the SQL programming language.

To demonstrate this functionality, let's select the columns "year", "species_id", and "plot_id" from the surveys table.
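This is the query referred to in the questions below:

tbl(mammals, sql("SELECT year, species_id, plot_id FROM surveys"))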



What class of object does tbl() return?


How many columns are in the table created using: tbl(mammals, sql("SELECT year, species_id, plot_id FROM surveys"))

Querying with dplyr syntax

We can do something similar using the dplyr package. First, we select the table on which to do the operations by creating the surveys object, and then we use the standard dplyr syntax as if it were a data frame.
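A sketch of the dplyr version (surveys becomes a lazy reference to the database table):

surveys <- tbl(mammals, "surveys")
surveys %>%
  select(year, species_id, plot_id)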



What class of object is surveys?

Let's take a look at how SQL handles instructions from dplyr using the show_query() function.
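For example, appending show_query() to the pipeline above:

surveys %>%
  select(year, species_id, plot_id) %>%
  show_query()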


The surveys object behaves like a data frame. Several functions that can be used with data frames can also be used on tables from a database. The output looks just like a regular data frame. However, the columns plot_type, taxa, genus, and species are missing. These are now located in the tables plots and species which we will join together in a moment.

Some dplyr functions won't work the way we expect them to. For example, let's check to see how many rows surveys has with the nrow() function.
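For example:

nrow(surveys)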


nrow() returns NA. You may have also noticed that the surveys output included ?? indicating the number of rows was unknown. The reason for this behavior highlights a key difference between using dplyr on datasets in memory (e.g. loaded into your R session via read_csv()) and those provided by a database.

To understand it, we will take a closer look at how dplyr communicates with our SQLite database.

SQL translation

Relational databases typically use the special-purpose language, Structured Query Language (SQL), to manage and query data.

For example, the following SQL query returns the first 10 rows from the surveys table.

SELECT *
FROM `surveys`
LIMIT 10

We will use dplyr's show_query() function to show which SQL commands are actually sent to the database.
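For example, asking for the first 10 rows of surveys:

show_query(head(surveys, n = 10))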


The output shows the actual SQL query sent to the database; it matches our manually constructed SELECT statement above.


What does show_query do?

Behind the scenes, dplyr:

  1. translates your R code into SQL
  2. submits it to the database
  3. translates the database's response into an R data frame

Instead of having to formulate the SQL query ourselves - and having to mentally switch back and forth between R and SQL syntax - we can delegate this translation to dplyr, which in turn doesn't do the real work of subsetting the table, either. Instead, it merely sends the query to the database, waits for its response and returns it to us.

R connects to the database and downloads a bare minimum of information on fields, data types, etc. — enough to allow manipulation of the object without physical download of the data. R never gets to see the full surveys table - and that's why it could not tell us how many rows it contains.

Most filtering, mutating, or summarizing operations only perform simple mathematical operations. These operations are very similar between R and SQL, so they are easy for dplyr's SQL translation system to translate. To see for yourself what's happening, you can use translate_sql(). The function translate_sql() is built on top of R’s parsing engine and has been carefully designed to generate correct SQL. It also protects you against SQL injection attacks by correctly escaping the strings and variable names needed by the database that you’re connecting to.

Let's take a look at some differences between R and SQL.
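For example (a sketch; depending on your dbplyr version, you may need to supply con = simulate_dbi() explicitly):

translate_sql(3)    # translated as 3.0: R treats plain numbers as floating point (reals)
translate_sql(3L)   # translated as 3: the L suffix produces an SQL integer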

dplyr can translate many different query types into SQL allowing us to select() specific columns, filter() rows, or join tables. You can now manipulate surveys in the same way as you would manipulate other tables in R.

To see this in action, let's compose a few queries with dplyr.


In SQL, is 3 an integer or real number?


What does translate_sql do?

Simple database queries

First, let's only request rows of the surveys table in which weight is less than 5 and keep only the species_id, sex, and weight columns.
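A sketch of this query:

surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)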


Executing this command will return a table showing 10 rows and the requested species_id, sex and weight columns. But wait, what does the last line mean?

This indicates that R is only showing us the first 10 rows that match our criterion. It does not retrieve the whole dataset.

It does not return a dataframe object! The str() call returns a list of tables instead.


When working with databases, dplyr tries to be as lazy as possible.


Why does the object simple only return 10 rows and not the whole dataset?

When you construct a dplyr query, you can connect multiple verbs into a single pipeline. For example, we combined the filter() and select() verbs using the %>% pipe.
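For example, storing the combined pipeline in an object (here called data_subset):

data_subset <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)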


Now let's take a look at the structure of data_subset
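For example:

str(data_subset)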


Notice that the str() call returns a list instead of a dataframe. The select(species_id, sex, weight) command wasn't executed by R but was sent to the database instead. R doesn't retrieve the full set of results - instead it only retrieves the first 10 results from the database by default.


Why does the object data_subset only return 10 rows?


What class of object is data_subset?

To instruct R to retrieve all of the query results from the database, we add the collect() command to our pipe. It indicates that our database query is finished: get the final results and load them into the R session.
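For example:

data_subset <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight) %>%
  collect()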


Let's take a look at data_subset again.
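For example:

str(data_subset)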


Notice the str() call now returns the structure of a dataframe. Now we have all 32208 rows that match our query in a dataframe and can continue to work with them exclusively in R, without communicating with the database.

Let's look at a histogram of species_id separated by sex.
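A sketch using ggplot2 (counting observations of each species_id, with one panel per sex, is one reasonable reading of this step):

library(ggplot2)
ggplot(data_subset, aes(x = species_id)) +
  geom_bar() +
  facet_wrap(~ sex)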



How many observations does data_subset have?


What class of variable is sex?


TRUE or FALSE: collect() retrieves all queries from our database?

Complex database queries

dplyr enables database queries across one or multiple database tables, using the same single- and multiple-table verbs you encountered previously. This means you can use the same commands regardless of whether you interact with a remote database or local dataset.

This is a really useful feature if you work with large datasets.

Being able to use SQL queries directly can be useful if your collaborators have already put together complex queries to prepare the dataset that you need for your analysis.

To illustrate how to use dplyr with these complex queries, we are going to join the plots and surveys tables. The plots table in the database contains information about the different plots surveyed by the researchers. To access it, we point the tbl() command to it.
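For example:

plots <- tbl(mammals, "plots")
plots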



What class of object is plots?


TRUE or FALSE: Using this line of code:

tbl(mammals, "plots")

You will be able to see all the observations of the dataset?

The plot_id column also features in the surveys table.


Because plot_id is listed in both tables, we can use it to look up matching records, and join the two tables.

Remember from the previous lab, if we have two tables named x and y with a common column called "ID", we can join them using 'join' functions, two of which are described and illustrated below.

In both forms of join, if there are multiple matches between x and y, all combinations of the matches are returned.

We want to join the two tables plots and surveys.

To extract all surveys for the first plot, which has plot_id = 1, we can do:
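A sketch of this query, joining on the shared plot_id column:

plots %>%
  filter(plot_id == 1) %>%
  inner_join(surveys, by = "plot_id")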


If we want the full set of 1,995 observations, we can add the function collect() instead. This also converts the table into a data frame without needing to use the as.data.frame() function.
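For example:

full_plots <- plots %>%
  filter(plot_id == 1) %>%
  inner_join(surveys, by = "plot_id") %>%
  collect()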


Let's examine full_plots.
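For example:

str(full_plots)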


Now we can treat full_plots like we would any dataframe object in R. Let's plot hindfoot_length vs weight over time by species.
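A sketch using ggplot2 (colouring points by species_id is one way to separate species):

ggplot(full_plots, aes(x = weight, y = hindfoot_length, color = species_id)) +
  geom_point()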


Now let's plot hindfoot_length vs weight over time by year and sex for data between the years 1977 - 1980.
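A sketch (filtering to 1977-1980 and faceting by sex and year):

full_plots %>%
  filter(year %in% 1977:1980) %>%
  ggplot(aes(x = weight, y = hindfoot_length, color = species_id)) +
  geom_point() +
  facet_grid(sex ~ year)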


We can also use other dplyr functions like mutate() on database objects. Let's use the table object joined_plots and create a column of hindfoot_length/weight. First we'll remove rows where weight is NA using filter(!is.na(weight)). Next we'll add a new column called "ratio" and convert the table into a dataframe using as.data.frame(). Finally, we'll use head() to display the first 6 rows of the data frame.
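A sketch of this pipeline (assuming joined_plots is the lazy table created when plots and surveys were joined above):

joined_plots %>%
  filter(!is.na(weight)) %>%             # drop rows with missing weight
  mutate(ratio = hindfoot_length / weight) %>%
  as.data.frame() %>%                    # collects the results into a dataframe
  head()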


All these manipulations occur without physical download of the data, by translating your code into SQL in the background. Since data download is often the most time consuming step, this allows you to think about how much work you want to get done on the server before you pull the data. When you are ready to pull the data, you just use collect(). This will send the background compiled SQL query to the database and execute it.


What does the function head() do?


inner_join returns:


TRUE or FALSE: left_join will remove all entries without observations (no data).



What would the code that returns the number of rodents observed in each plot in each year look like in dplyr syntax? Hint: Connect to the species table and write a query that joins the species and survey tables together to exclude all non-rodents. The query should return counts of rodents by year.



What would the code that returns the total number of rodents in each genus caught in the different plot types look like in dplyr syntax?

Hint: Write a query that joins the species, plot, and survey tables together. The query should return counts of genus by plot type.

Diving deeper into databases

Now that we have a basic idea of how R and SQL talk to each other, let's take a look at connecting to different databases.

Example 1: Treebase

Let's take a look at the TreeBASE database. We will use the treebase package described in Boettiger & Lang, 2012. The treebase package queries TreeBASE's API directly rather than the treebase.org website. TreeBASE is a repository of phylogenetic information, specifically user-submitted phylogenetic trees and the data used to generate them.

TreeBASE provides two APIs to query the database, one which searches by the metadata associated with different publications (called OAI-PMH), and another which queries the phylogenies directly (called Phylo-ws). They have somewhat redundant functions, but for our purposes the second one returns the actual data, while the first returns metadata.

Let's install the treebase package directly from github.

First, you may need to install the devtools library (unless you already have it installed):
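For example:

install.packages("devtools")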


Then, use devtools to install the treebase package directly from GitHub:
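For example (assuming the package lives in the ropensci GitHub organization):

devtools::install_github("ropensci/treebase")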


Then let's load the required packages into our workspace.
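For example (ape is loaded here on the assumption that we will need its tree functions later in this example):

library(treebase)
library(ape)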


We start with some queries of the metadata directly without downloading any trees. Let’s search the database for data from the author Huelsenbeck.

Queries are made with the search_treebase() function. The first argument is the keyword used in the query, such as an author's name, and the second argument indicates the type of query (i.e., "author").
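For example (storing the results in an object, here called hb_trees):

hb_trees <- search_treebase("Huelsenbeck", by = "author")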


Data discovery involves searching for existing data that meets certain desired characteristics. The Web repository uses separate interfaces (APIs) to access metadata describing the publications associated with the data entered (i.e. publisher, year of publication, etc.) and a different interface to describe the metadata associated with an individual phylogeny (i.e. the number of taxa or the kind of tree). The treebase package can query these individual sources of metadata separately.

We can also look at all the available metadata from a certain date range in Treebase using the download_metadata() function.

Let's get all submissions from 2009 - present.
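A sketch following the treebase vignette (check ?download_metadata for the exact meaning of the by argument):

meta <- download_metadata("2009-01-01", by = "until")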


Let's look at trends in the growth of the database over time.
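A sketch (assuming each metadata record carries a date field, as in the treebase vignette):

dates <- sapply(meta, function(x) as.numeric(x$date))
hist(dates, main = "Growth of TreeBASE", xlab = "Year")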


We can also show authors with the most submissions in that date range.
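A sketch (assuming each record carries a creator field listing its authors):

authors <- unlist(sapply(meta, function(x) x$creator))
head(sort(table(authors), decreasing = TRUE))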


What if we look at journal submissions by volume? First, let's extract the publisher metadata from the meta object.
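A sketch (assuming the journal name is stored in each record's publisher field):

journals <- sapply(meta, function(x) x$publisher)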


Then let's "unpack" the list of journals and sort them, keeping only the last 5 journals.


Finally, let's plot the object J as a barplot.
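For example:

barplot(as.numeric(J), names.arg = names(J), las = 2, cex.names = 0.7)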


Tests across many phylogenies

A standard test of the constant rate of diversification is the gamma statistic of Pybus & Harvey (2000), which tests the null hypothesis that the rates of speciation and extinction are constant. Under the null hypothesis, the gamma statistic is normally distributed about 0; values larger than 0 indicate that internal nodes are closer to the tip than expected, while values smaller than 0 indicate nodes farther from the tip than expected.

We will collect all phylogenetic trees from TreeBASE and select those with branch length data that we can time‐calibrate using tools available in R. We can then calculate the distribution of this statistic for all available trees and compare these results with those from the analyses mentioned above.

For certain applications, we may wish to download all the available phylogenies from TreeBASE. Using the cache_treebase() function allows a user to download a local copy of all trees. Because direct database dumps are not currently available from treebase.org, this function has intentional delays to avoid overtaxing the TreeBASE servers and will take a full day to run.

treebase <- cache_treebase()

Once run, the cache is saved compactly in memory where it can be easily and quickly restored. For convenience, we will load the treebase.Rdata object which contains a copy of all the data already cached, which can be loaded into memory.
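For example (assuming treebase.Rdata is in your working directory):

load("treebase.Rdata")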


Note that the newly loaded treebase is a list (i.e., not a data frame). Therefore, common functions for data frames, like ncol() and nrow(), do not work. You need to use length() or str() instead.



How many elements/rows does treebase have?



What does treebase contain?



What is a cache of the database?



Why would you want to create a cache of the database?

We will only be able to use those phylogenies that include branch length data, which we can identify with the have_branchlength() function in the treebase package. We drop those that do not have branch lengths from the data set.

For simplicity, we will use the first 200 entries of the database only. Let's load the cached data.
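A sketch of this step:

treebase200 <- treebase[1:200]                                # first 200 phylogenies only
branchlengths <- treebase200[have_branchlength(treebase200)]  # keep trees with branch lengths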




What type/class of object is branchlengths?



How many elements/rows does branchlengths have?

This analysis will require ultrametric trees (branch lengths proportional to time, rather than to the nucleotide substitution rate). As most of these phylogenies are calibrated with branch length proportional to mutational step, we must time‐calibrate each of them first. The following function drops trees that cannot meet the assumptions of the time‐calibration function.
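A sketch of this function, following the treebase vignette (chronoMPL() and multi2di() come from the ape package; drop_nontrees() comes from treebase):

timetree <- function(tree) {
  check.na <- try(sum(is.na(tree$edge.length)) > 0)
  if (is(check.na, "try-error") || check.na)
    NULL                                   # drop trees with missing branch lengths
  else
    try(chronoMPL(multi2di(tree)))         # time-calibrate the tree
}
tt <- drop_nontrees(sapply(branchlengths, timetree))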


Let's take a look at our cleaned data tt.
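For example:

str(tt)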




What does the "11" in "List of 11" represent?



What type of variable is type?

At this point, we have 199 time‐calibrated phylogenies over which we will apply the diversification rate analysis.
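For example, using gammaStat() from the ape package:

gammas <- sapply(tt, gammaStat)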


Let's take a look at gammas.
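For example:

head(gammas)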




Why does gammas have 199 rows?

Now let's see what the gamma distribution looks like.
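For example:

hist(gammas, main = "Distribution of the gamma statistic", xlab = "gamma")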


The overall distribution appears slightly skewed towards positive values. This could indicate an increasing rate of speciation or constant extinction rates. While differences in sampling may account for much of the spread observed, the position and identity of outlier phylogenies could suggest new hypotheses and potential directions for further exploration.

Example 2: Fishbase

Let's take a look at the FishBase database. We will use the rfishbase package described in Boettiger et al. (2012)

The rfishbase package queries this API directly rather than the FishBase.org website. This reduces load on the FishBase web servers and increases both the performance and the breadth of data available. rfishbase functions are primarily aimed at facilitating queries for specific data across a given list of many species. This is a task that is common to much scientific research and tedious to perform on very large datasets.

Let's install the rfishbase package directly from github.


Then let's use the install_github() function to download and install the rfishbase package.
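For example (assuming the package lives in the ropensci GitHub organization):

devtools::install_github("ropensci/rfishbase")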


.....then load the package.
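For example:

library(rfishbase)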


Accessing specific versions of the database

rfishbase relies on periodic cache releases, but we can specify which version of the database to access. We will set the version of the database to "19.04" by setting an environment variable.
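For example:

Sys.setenv(FISHBASE_VERSION = "19.04")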


Building a species list

Let's assemble a good list of species we are interested in. Almost all functions in rfishbase take a list (character vector) of species scientific names. You can also read in a list of names from any existing data you are working with. When providing your own species list, you should always begin by validating the names. Taxonomy is a moving target, and this will help align the scientific names you are using with the names used by FishBase, and alert you to any potential issues.
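For example (a hypothetical two-species list; "Salmo trutta" is the brown trout, misspelled in the question below):

fish <- validate_names(c("Oreochromis niloticus", "Salmo trutta"))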




What does validate_names return when you misspell a name? (e.g., validate_names(c("Salbo Frutta")))

We can also collect information about all species in a particular taxonomic group, such as a Genus, Family or Order using the species_list() function.
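For example (Labroides is the genus used in the rfishbase documentation; any genus works):

fish <- species_list(Genus = "Labroides")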




How many entries/values does the previous command return?

rfishbase also recognizes common names. When a common name refers to multiple species, all matching species are returned. We can use the function common_to_sci to see all records matching the common name.
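For example (the common name "trout" is a hypothetical choice):

fish <- common_to_sci("trout")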




How many observations does fish have?



How many distinct Species does fish have?
HINT: use the function n_distinct()



How many distinct common names does fish have?
HINT: use the function n_distinct()

Getting data

With a species list in place, we are ready to query fishbase for data. Note that if you have a very long list of species, it is always a good idea to try out your intended functions with a subset of that list first to make sure everything is working.

The species() function returns a table containing some of the information found on the summary or homepage for a species on fishbase.org. rfishbase functions always return tidy data tables: rows are observations (e.g., a species, or individual samples from a species) and columns are variables (fields).
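A sketch (assuming fish is the table returned by common_to_sci() above, with a Species column):

species(fish$Species)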


Discovering data

Because rfishbase accesses the back-end database, it does not always line up with the web display. Frequently, rfishbase functions will return more information than is available on the web versions of these tables. Some information found on the summary homepage for a species is not available from the species summary function, but must be extracted from a different table. For instance, the species' Resilience information is not one of the fields in the species summary table, despite appearing on the species homepage of fishbase.org. To discover which table this information is in, we can use the rfishbase function list_fields, which will list all tables with a field matching the query string:
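For example:

list_fields("Resilience")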


This shows us that this information appears on the stocks table. Let's create an object resil with Resilience information.
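A sketch (the exact fields requested are an assumption):

resil <- stocks(fish$Species, fields = c("SpecCode", "Resilience"))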




What is the value of the 10th row in the resil table?



How many distinct types/levels of Resilience are there in resil?
HINT: Use the unique() function



What are the distinct types/levels of Resilience in resil?
HINT: Use the unique() function

We may only be interested in the PriceCateg (price category) and the Vulnerability of the species. Let's create the object dat by querying for our full species list, asking for only these fields to be returned:
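A sketch (keeping SpecCode and Species as identifier columns is an assumption):

dat <- species(fish$Species, fields = c("SpecCode", "Species", "PriceCateg", "Vulnerability"))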




What is the price category (PriceCateg) of the last observation in dat?



How many observations does dat have?

Working in R, it is easy to query additional tables and combine the results with the data we have collected so far.
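For example (a sketch; joining on the shared SpecCode column is an assumption based on the questions below):

combined_data <- left_join(dat, resil, by = "SpecCode")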


Let's take a look at combined_data
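For example:

str(combined_data)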



What type of object is combined_data?



What type of variable is PriceCateg?



How many distinct entries are there in the SpecCode variable of combined_data?



What is the SpecCode of the 8th entry in combined_data?



How many observations in combined_data have an unknown PriceCateg? HINT: You'll need to use the functions group_by and summarize



How many observations would you have in combined_data if you only want species with Medium resilience? HINT: Use nrow() to count the number of observations.



How many distinct SpecCode entries would you have in combined_data if you only want species with Medium resilience?

Using databases to improve performance and decision-making

With massive data collection from the Internet of Things transforming life and industry across the globe, we have access to more data than ever before. Forward-thinking organizations can now use databases to go beyond basic data storage and transactions to analyze vast quantities of data from multiple systems. Using database and other computing and business intelligence tools, organizations can now leverage the data they collect to run more efficiently, enable better decision-making, and become more agile and scalable. By having direct control over the ability to create and use databases, users gain control and autonomy while still maintaining important security standards.

This is the end of the lab.


Code below is for formatting of this lab. Do not alter!