# Lab settings - please ignore
options(repr.plot.width=7, repr.plot.height=4, repr.plot.res=250 ) # Make plots a reasonable size
BIO3782: Biologist's Toolkit (Dalhousie University)
Make sure the required files are in the working directory:
As in previous labs, we'll try to simulate "real-life" coding by using the tags below to indicate when to use RStudio's script editor and when to use the console:
A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS). Together, the data and the DBMS, along with the applications that are associated with them, are referred to as a database system, often shortened to just a database.
Data within the most common types of databases in operation today is typically modeled in rows and columns in a series of tables, making processing and data querying efficient while minimizing duplication of information. Below is a diagram of a typical database, where each box represents a table (similar to an MS-Excel spreadsheet) and the arrows show how the different tables are linked to each other:
The data within databases can then be easily accessed, managed, modified, updated, controlled, and organized, all with minimal duplication of information. Most databases (e.g., Oracle, Salesforce, MySQL, IBM DB2; see full list HERE) use structured query language (SQL) for writing and querying data.
Check out this brief history of databases on Wikipedia.
So far, we have dealt with small datasets that easily fit into your computer's memory. But what about datasets that are too large for your computer's memory to handle as a whole? In this case, you need to store your data in a format that can be opened in small pieces, which includes databases (note that there are other options, like NetCDF files and ERDDAP servers). Connecting to the database allows you to retrieve only the chunks needed for the current analysis. Even better, many large datasets are already available in public or private databases. You can query them without having to download the data first.
R can connect to almost any existing database type. Most common database types have R packages that allow you to connect to them (e.g., RSQLite, RMySQL, etc.). Furthermore, the dplyr package, in conjunction with dbplyr, supports connecting to the widely-used open source databases sqlite, mysql and postgresql, as well as Google's bigquery, and it can also be extended to other database types.
The dplyr package now has a generalized SQL back end for talking to databases, and the new dbplyr package translates R code into database-specific SQL variants. SQL variants are supported for the following databases: Oracle, Microsoft SQL Server, PostgreSQL, Amazon Redshift, Apache Hive, and Apache Impala. Interfacing with databases using dplyr focuses on retrieving and analyzing datasets by generating SELECT SQL statements, but it doesn't modify the database itself. dplyr does not offer functions to UPDATE or DELETE entries. If you need these functionalities, you will need to use additional R packages (e.g., RSQLite).
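For reference, here is a minimal sketch (using a throwaway in-memory database, not this lab's data) of how UPDATE and DELETE statements can be issued through the DBI package; the animals table and its columns are made up purely for illustration:
library(DBI)
# Throwaway in-memory SQLite database, for illustration only
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "animals", data.frame(id = 1:3, weight = c(10, NA, 30)))
# UPDATE: dbExecute() runs the statement and returns the number of rows changed
dbExecute(con, "UPDATE animals SET weight = 0 WHERE weight IS NULL")
# DELETE: remove rows matching a condition
dbExecute(con, "DELETE FROM animals WHERE id = 3")
dbReadTable(con, "animals")
dbDisconnect(con)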
SQL is a programming language used by nearly all relational databases to query, manipulate, and define data, and to provide access control. SQL was first developed at IBM in the 1970s, with Oracle as a major contributor, which led to the implementation of the SQL ANSI standard. SQL has since spurred many extensions from companies such as IBM, Oracle, and Microsoft. Although SQL is still widely used today, new programming languages are beginning to appear.
SQLite is the most widely deployed database in the world. It is included on Android and iOS devices and in the Firefox, Chrome and Safari web browsers. Apple and Microsoft include it in their macOS and Windows 10 operating systems respectively, and many other products include SQLite. It is extremely easy to use, and can be of great value to developers who need a database available but want to avoid the overhead often associated with installing and configuring an external database. In this demonstration, we will download and install the RSQLite package, which will integrate SQLite into R running in RStudio.
Here we will demonstrate how to interact with a database using dplyr, using both dplyr's verb syntax and the SQL syntax.
We can point R to a database using the dplyr, dbplyr and RSQLite packages. Here we will connect to a SQLite database containing the mammals data. SQLite has a rather simple data storage mechanism: all data for a database is stored within a single file. The name of this file must be specified when the database is created, and a connection to this database is returned and used in subsequent commands to access and manipulate the data and data structures within the database.
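As a quick illustration (the file name below is just an example; in this lab we will connect to an existing file instead, once the packages below are installed), connecting to a file that does not exist yet is what creates a new, empty database:
# Connecting to a non-existent file creates a new, empty SQLite database
new_db <- DBI::dbConnect(RSQLite::SQLite(), "my_new_database.sqlite")
DBI::dbListTables(new_db) # character(0): no tables yet
DBI::dbDisconnect(new_db)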
Let's install the dbplyr and RSQLite packages:
install.packages(c("dbplyr", "RSQLite"))
Then, load the required packages:
library(dplyr)
library(dbplyr)
library(ggplot2)
Then, connect to the database.
mammals <- DBI::dbConnect(RSQLite::SQLite(), "data_raw/portal_mammals.sqlite")
This command uses two packages that help dbplyr and dplyr talk to the SQLite database. DBI is not something that you'll use directly as a user. It allows R to send commands to databases irrespective of the database management system used. The RSQLite package allows R to interface with SQLite databases.
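That said, nothing stops you from using DBI directly for quick checks. For example, this sketch asks SQLite itself to count the rows of the surveys table; the counting happens inside the database, not in R:
DBI::dbGetQuery(mammals, "SELECT COUNT(*) AS n FROM surveys")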
The dbConnect() command above does not load the data into the R session (as the read_csv() function did). Instead, it merely instructs R to connect to the SQLite database contained in the portal_mammals.sqlite file.
Using a similar approach, you could connect to many other database management systems that are supported by R including MySQL, PostgreSQL, BigQuery, etc.
Let's take a closer look at the mammals database we just connected to.
src_dbi(mammals)
src:  sqlite 3.34.1 [C:\Users\Diego\Documents\6.Biologist's Toolkit\25Jan\biol3782\week6\data_raw\portal_mammals.sqlite]
tbls: plots, species, surveys
Just like a spreadsheet with multiple worksheets, a SQLite database can contain multiple tables. In this case, three of them are listed in the tbls row in the output above: plots, species and surveys.
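You can also list the tables straight from the connection with DBI, a one-line alternative to reading them off the src_dbi() output:
DBI::dbListTables(mammals) # "plots" "species" "surveys"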
Let's take a look to see what is inside the surveys table:
tbl(mammals,"surveys")
# Source:   table<surveys> [?? x 9]
# Database: sqlite 3.34.1 [C:\Users\Diego\Documents\6.Biologist's
#   Toolkit\25Jan\biol3782\week6\data_raw\portal_mammals.sqlite]
   record_id month   day  year plot_id species_id sex   hindfoot_length weight
       <int> <int> <int> <int>   <int> <chr>      <chr>           <int>  <int>
 1         1     7    16  1977       2 NL         M                  32     NA
 2         2     7    16  1977       3 NL         M                  33     NA
 3         3     7    16  1977       2 DM         F                  37     NA
 4         4     7    16  1977       7 DM         M                  36     NA
 5         5     7    16  1977       3 DM         M                  35     NA
 6         6     7    16  1977       1 PF         M                  14     NA
 7         7     7    16  1977       2 PE         F                  NA     NA
 8         8     7    16  1977       1 DM         M                  37     NA
 9         9     7    16  1977       1 DM         F                  34     NA
10        10     7    16  1977       6 PF         F                  20     NA
# ... with more rows
As you can see, the surveys table has ?? rows and 9 columns. We do not know the number of rows (i.e. ??) because the tbl function does not look at the entire dataset; it only looks at a small sample to give you an idea of what is inside. However, tbl gives you enough information to figure out that each row is probably an "observation", including the date (i.e. year, month and day), IDs for "record", "plot" and "species", and 3 columns of data: sex, hindfoot_length and weight.
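Another quick way to inspect the columns without downloading the table is dplyr's glimpse(), sketched below; like tbl(), it cannot tell you the total number of rows:
# Peek at the column names and types; the row count remains unknown
glimpse(tbl(mammals, "surveys"))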
Let's take a look at the species table:
tbl(mammals,"species")
# Source:   table<species> [?? x 4]
# Database: sqlite 3.34.1 [C:\Users\Diego\Documents\6.Biologist's
#   Toolkit\25Jan\biol3782\week6\data_raw\portal_mammals.sqlite]
   species_id genus            species         taxa
   <chr>      <chr>            <chr>           <chr>
 1 AB         Amphispiza       bilineata       Bird
 2 AH         Ammospermophilus harrisi         Rodent
 3 AS         Ammodramus       savannarum      Bird
 4 BA         Baiomys          taylori         Rodent
 5 CB         Campylorhynchus  brunneicapillus Bird
 6 CM         Calamospiza      melanocorys     Bird
 7 CQ         Callipepla       squamata        Bird
 8 CS         Crotalus         scutalatus      Reptile
 9 CT         Cnemidophorus    tigris          Reptile
10 CU         Cnemidophorus    uniparens       Reptile
# ... with more rows
Here we have 4 columns and an unknown (??) number of rows. However, note that the species table has species_id, which is also found in the surveys table. This is a clever way to minimize duplication of information. In the main table, surveys, you only need to enter ONE value per observation (i.e. species_id) to be able to identify the genus, species and taxa of the observation (after linking the surveys table with the species table).
Take a look inside the plots table.
Now that we know we can connect to the database, let's explore how to get the data from its tables into R.
To connect to tables within a database, you can use the tbl() function from dplyr. This function can be used to send SQL queries to the database. That is, between quotes " ", you send a command in the SQL programming language.
To demonstrate this functionality, let's select the columns "year", "species_id", and "plot_id" from the surveys table.
tbl(mammals, sql("SELECT year, species_id, plot_id FROM surveys"))
We can do something similar using the dplyr package. First, we select the table on which to do the operations by creating the surveys object, and then we use the standard dplyr syntax as if it were a data frame.
surveys <- tbl(mammals, "surveys")
surveys %>%
select(year, species_id, plot_id)
Let's take a look at how SQL handles instructions from dplyr using the show_query() function.
surveys %>%
select(year, species_id, plot_id) %>%
show_query()
<SQL> SELECT `year`, `species_id`, `plot_id` FROM `surveys`
The surveys object behaves like a data frame. Several functions that can be used with data frames can also be used on tables from a database. The output looks just like a regular data frame. However, the columns plot_type, taxa, genus, and species are missing. These are now located in the tables plots and species, which we will join together in a moment.
Some dplyr functions won't work the way we expect them to. For example, let's check to see how many rows surveys has with the nrow() function.
nrow(surveys)
nrow() returns NA. You may have also noticed that the surveys output included ??, indicating that the number of rows was unknown. The reason for this behavior highlights a key difference between using dplyr on datasets in memory (e.g. loaded into your R session via read_csv()) and those provided by a database.
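If you do need a row count, you can ask the database to do the counting instead, as in this sketch; dplyr translates count() into a SELECT COUNT(*) query that runs inside SQLite:
# The counting happens in the database, so this works even for huge tables
surveys %>%
count()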
To understand why surveys behaves this way, we will take a closer look at how dplyr communicates with our SQLite database.
Relational databases typically use the special-purpose language, Structured Query Language (SQL), to manage and query data.
For example, the following SQL query returns the first 10 rows from the surveys table:
SELECT * FROM surveys LIMIT 10
We will use dplyr's show_query() function to show which SQL commands are actually sent to the database.
show_query(head(surveys, n = 10))
<SQL> SELECT * FROM `surveys` LIMIT 10
The output shows the actual SQL query sent to the database; it matches our manually constructed SELECT statement above.
Behind the scenes, dplyr:
- translates your R code into SQL;
- submits it to the database;
- translates the database's response into an R data frame.
Instead of having to formulate the SQL query ourselves - and having to mentally switch back and forth between R and SQL syntax - we can delegate this translation to dplyr, which in turn doesn't do the real work of subsetting the table, either. Instead, it merely sends the query to the database, waits for its response and returns it to us.
R connects to the database and downloads a bare minimum of information on fields, data types, etc. — enough to allow manipulation of the object without physically downloading the data. R never gets to see the full surveys table, and that's why it could not tell us how many rows it contains.
There are two components to dplyr's SQL translation system:
- translation of vector expressions, like x * y + 10
- translation of whole verbs, like mutate() or summarize()
Most filtering, mutating, or summarizing operations only perform simple mathematical operations. These operations are very similar between R and SQL, so they're easy to translate. To see what's happening yourself, you can use translate_sql(). The function translate_sql() is built on top of R's parsing engine and has been carefully designed to generate correct SQL. It also protects you against SQL injection attacks by correctly escaping the strings and variable names needed by the database that you're connecting to.
Let's take a look at some differences between R and SQL. In SQL, " and ' aren't the same.
# In SQLite, variable names are escaped by double quotes
translate_sql(x)
#> <SQL> `x`
# And strings are escaped by single quotes
translate_sql("x")
#> <SQL> 'x'
translate_sql(x == 1 && (y < 2 || z > 3))
#> <SQL> `x` = 1.0 AND (`y` < 2.0 OR `z` > 3.0)
translate_sql(x ^ 2 < 10)
#> <SQL> POWER(`x`, 2.0) < 10.0
translate_sql(x %% 2 == 10)
#> <SQL> `x` % 2.0 = 10.0
translate_sql(substr(x, 5, 10))
#> <SQL> SUBSTR(`x`, 5, 6)
translate_sql(log(x, 10))
#> <SQL> LOG(10.0, `x`)
translate_sql(1)
#> <SQL> 1.0
translate_sql(1L)
#> <SQL> 1
translate_sql(if (x > 5) "big" else "small")
#> <SQL> CASE WHEN (`x` > 5.0) THEN ('big') WHEN NOT(`x` > 5.0) THEN ('small') END
dplyr can translate many different query types into SQL, allowing us to select() specific columns, filter() rows, or join tables. You can now manipulate surveys in the same way as you would manipulate other tables in R.
To see this in action, let's compose a few queries with dplyr.
First, let's only request rows of the surveys table in which weight is less than 5, and keep only the species_id, sex, and weight columns.
simple <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
simple
# Source:   lazy query [?? x 3]
# Database: sqlite 3.34.1 [C:\Users\Diego\Documents\6.Biologist's
#   Toolkit\25Jan\biol3782\week6\data_raw\portal_mammals.sqlite]
   species_id sex   weight
   <chr>      <chr>  <int>
 1 PF         M          4
 2 PF         F          4
 3 PF         NA         4
 4 PF         F          4
 5 PF         F          4
 6 RM         M          4
 7 RM         F          4
 8 RM         M          4
 9 RM         M          4
10 RM         M          4
# ... with more rows
Executing this command will return a table showing 10 rows and the requested species_id, sex and weight columns. But wait, what does the last line mean?
# ... with more rows
This indicates that R is only showing us the first 10 rows that fit our criterion. It does not retrieve the whole dataset.
It does not return a dataframe object! The str() call returns a list instead.
str(simple)
List of 2
 $ src:List of 2
  ..$ con  :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
  .. .. ..@ ptr                :<externalptr>
  .. .. ..@ dbname             : chr "C:\\Users\\Diego\\Documents\\6.Biologist's Toolkit\\25Jan\\biol3782\\week6\\data_raw\\portal_mammals.sqlite"
  .. .. ..@ loadable.extensions: logi TRUE
  .. .. ..@ flags              : int 70
  .. .. ..@ vfs                : chr ""
  .. .. ..@ ref                :<environment: 0x000000003582f388>
  .. .. ..@ bigint             : chr "integer64"
  .. .. ..@ extended_types     : logi FALSE
  ..$ disco: NULL
  ..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
 $ ops:List of 4
  ..$ name: chr "select"
  ..$ x   :List of 4
  .. ..$ name: chr "filter"
  .. ..$ x   :List of 2
  .. .. ..$ x   : 'ident' chr "surveys"
  .. .. ..$ vars: chr [1:9] "record_id" "month" "day" "year" ...
  .. .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
  .. ..$ dots:List of 1
  .. .. ..$ : language ~weight < 5
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x000000003a47f470>
  .. ..$ args: list()
  .. ..- attr(*, "class")= chr [1:3] "op_filter" "op_single" "op"
  ..$ dots: list()
  ..$ args:List of 1
  .. ..$ vars:List of 3
  .. .. ..$ species_id: symbol species_id
  .. .. ..$ sex       : symbol sex
  .. .. ..$ weight    : symbol weight
  ..- attr(*, "class")= chr [1:3] "op_select" "op_single" "op"
 - attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
When working with databases, dplyr tries to be as lazy as possible.
When you construct a dplyr query, you can connect multiple verbs into a single pipeline. For example, we combined the filter() and select() verbs using the %>% pipe.
data_subset <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
data_subset
# Source:   lazy query [?? x 3]
# Database: sqlite 3.34.1 [C:\Users\Diego\Documents\6.Biologist's
#   Toolkit\25Jan\biol3782\week6\data_raw\portal_mammals.sqlite]
   species_id sex   weight
   <chr>      <chr>  <int>
 1 PF         M          4
 2 PF         F          4
 3 PF         NA         4
 4 PF         F          4
 5 PF         F          4
 6 RM         M          4
 7 RM         F          4
 8 RM         M          4
 9 RM         M          4
10 RM         M          4
# ... with more rows
Now let's take a look at the structure of data_subset.
str(data_subset)
Notice that the str() call returns a list instead of a dataframe. The select(species_id, sex, weight) command wasn't executed by R but was sent to the database instead. R doesn't retrieve the full set of results - instead, it only retrieves the first 10 results from the database by default.
To instruct R to retrieve all of the query results from the database, we add the collect() command to our pipe. It indicates that our database query is finished: get the final results and load them into the R session. (Note that this time we filter for weights greater than 5.)
data_subset <- surveys %>%
filter(weight > 5) %>%
select(species_id, sex, weight) %>%
collect()
data_subset %>%
head()
species_id | sex | weight |
---|---|---|
<chr> | <chr> | <int> |
DM | M | 40 |
DM | M | 48 |
DM | F | 29 |
DM | F | 46 |
DM | M | 36 |
DO | F | 52 |
Let's take a look at data_subset again.
str(data_subset)
tibble [32,208 x 3] (S3: tbl_df/tbl/data.frame)
 $ species_id: chr [1:32208] "DM" "DM" "DM" "DM" ...
 $ sex       : chr [1:32208] "M" "M" "F" "F" ...
 $ weight    : int [1:32208] 40 48 29 46 36 52 8 22 35 7 ...
Notice that the str() call now returns the structure of a dataframe. Now we have all 32,208 rows that match our query in a dataframe and can continue to work with them exclusively in R, without communicating with the database.
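For example, nrow() now behaves as expected:
nrow(data_subset) # 32208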
Let's look at a histogram of species_id separated by sex.
data_subset %>%
na.omit() %>%
ggplot(aes(species_id)) +
geom_bar(aes(fill = sex), position = "dodge", color = "black") +
theme_classic()
dplyr enables database queries across one or multiple database tables, using the same single- and multiple-table verbs you encountered previously. This means you can use the same commands regardless of whether you interact with a remote database or a local dataset.
This is a really useful feature if you work with large datasets.
Being able to use SQL queries directly can be useful if your collaborators have already put together complex queries to prepare the dataset that you need for your analysis.
To illustrate how to use dplyr with these complex queries, we are going to join the plots and surveys tables. The plots table in the database contains information about the different plots surveyed by the researchers. To access it, we point the tbl() command to it.
plots <- tbl(mammals, "plots")
plots
The plot_id column also features in the surveys table.
surveys
# Source:   table<surveys> [?? x 9]
# Database: sqlite 3.34.1 [C:\Users\Diego\Documents\6.Biologist's
#   Toolkit\25Jan\biol3782\week6\data_raw\portal_mammals.sqlite]
   record_id month   day  year plot_id species_id sex   hindfoot_length weight
       <int> <int> <int> <int>   <int> <chr>      <chr>           <int>  <int>
 1         1     7    16  1977       2 NL         M                  32     NA
 2         2     7    16  1977       3 NL         M                  33     NA
 3         3     7    16  1977       2 DM         F                  37     NA
 4         4     7    16  1977       7 DM         M                  36     NA
 5         5     7    16  1977       3 DM         M                  35     NA
 6         6     7    16  1977       1 PF         M                  14     NA
 7         7     7    16  1977       2 PE         F                  NA     NA
 8         8     7    16  1977       1 DM         M                  37     NA
 9         9     7    16  1977       1 DM         F                  34     NA
10        10     7    16  1977       6 PF         F                  20     NA
# ... with more rows
Because plot_id is listed in both tables, we can use it to look up matching records and join the two tables.
Remember from the previous lab, if we have two tables named x and y with a common column called "ID", we can join them using 'join' functions, two of which are described and illustrated below.
inner_join(): This returns all rows from x where there are matching values in y, and all columns from x and y.
left_join(): This returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.
In both forms of join, if there are multiple matches between x and y, all combinations of the matches are returned.
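As a sketch, a left join of plots and surveys with the key spelled out explicitly would look like this (passing by= also silences the 'Joining, by = ...' message you will see below):
plots %>%
left_join(surveys, by = "plot_id")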
We want to join the two tables plots and surveys.
To extract all surveys for the first plot, which has plot_id = 1, we can do:
joined_plots <- plots %>%
filter(plot_id == 1) %>%
inner_join(surveys)
joined_plots %>%
as.data.frame() %>%
head()
Joining, by = "plot_id"
plot_id | plot_type | record_id | month | day | year | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|---|
<int> | <chr> | <int> | <int> | <int> | <int> | <chr> | <chr> | <int> | <int> | |
1 | 1 | Spectab exclosure | 6 | 7 | 16 | 1977 | PF | M | 14 | NA |
2 | 1 | Spectab exclosure | 8 | 7 | 16 | 1977 | DM | M | 37 | NA |
3 | 1 | Spectab exclosure | 9 | 7 | 16 | 1977 | DM | F | 34 | NA |
4 | 1 | Spectab exclosure | 78 | 8 | 19 | 1977 | PF | M | 16 | 9 |
5 | 1 | Spectab exclosure | 80 | 8 | 19 | 1977 | DS | M | 48 | NA |
6 | 1 | Spectab exclosure | 218 | 9 | 13 | 1977 | PF | M | 13 | 4 |
If we want the full set of 1,995 observations, we can add the function collect() instead. This also converts the table into a data frame without needing to use the as.data.frame() function.
full_plots <- plots %>%
filter(plot_id == 1) %>%
inner_join(surveys) %>%
collect()
Joining, by = "plot_id"
Let's examine full_plots.
str(full_plots)
tibble [1,995 x 10] (S3: tbl_df/tbl/data.frame)
 $ plot_id        : int [1:1995] 1 1 1 1 1 1 1 1 1 1 ...
 $ plot_type      : chr [1:1995] "Spectab exclosure" "Spectab exclosure" "Spectab exclosure" "Spectab exclosure" ...
 $ record_id      : int [1:1995] 6 8 9 78 80 218 222 239 263 270 ...
 $ month          : int [1:1995] 7 7 7 8 8 9 9 9 10 10 ...
 $ day            : int [1:1995] 16 16 16 19 19 13 13 13 16 16 ...
 $ year           : int [1:1995] 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
 $ species_id     : chr [1:1995] "PF" "DM" "DM" "PF" ...
 $ sex            : chr [1:1995] "M" "M" "F" "M" ...
 $ hindfoot_length: int [1:1995] 14 37 34 16 48 13 52 48 37 36 ...
 $ weight         : int [1:1995] NA NA NA 9 NA 4 NA NA 40 38 ...
Now we can treat full_plots like we would any dataframe object in R. Let's plot hindfoot_length vs weight over time by species.
full_plots %>%
na.omit() %>%
ggplot() +
geom_point(aes(x = weight, y = hindfoot_length, fill = species_id), shape = 21) +
geom_abline(aes(intercept = 0, slope = 1)) +
facet_wrap( ~year) +
theme_classic() +
scale_x_continuous(breaks=c(0, 100, 200))
Now let's plot hindfoot_length vs weight by year and sex for data between the years 1977 and 1980.
full_plots %>%
filter(year >= 1977 & year <= 1980) %>%
na.omit() %>%
ggplot() +
geom_point(aes(x = weight, y = hindfoot_length, fill = species_id), shape = 21) +
geom_abline(aes(intercept = 0, slope = 1)) +
facet_grid(year ~ sex) +
theme_classic() +
scale_x_continuous(breaks=c(0, 100, 200)) +
scale_y_continuous(breaks=c(0, 30, 60))
We can also use other dplyr functions, like mutate(), on database objects. Let's use the table object joined_plots and create a column of hindfoot_length/weight. First, we'll remove rows where weight is NA using filter(!is.na(weight)). Next, we'll add a new column called "ratio" and convert the table into a dataframe using as.data.frame(). Finally, we'll use head() to display the first 6 rows of the data frame.
joined_plots %>%
filter(!is.na(weight)) %>%
mutate(ratio = hindfoot_length/weight) %>%
as.data.frame() %>%
head()
plot_id | plot_type | record_id | month | day | year | species_id | sex | hindfoot_length | weight | ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <chr> | <int> | <int> | <int> | <int> | <chr> | <chr> | <int> | <int> | <int> | |
1 | 1 | Spectab exclosure | 78 | 8 | 19 | 1977 | PF | M | 16 | 9 | 1 |
2 | 1 | Spectab exclosure | 218 | 9 | 13 | 1977 | PF | M | 13 | 4 | 3 |
3 | 1 | Spectab exclosure | 263 | 10 | 16 | 1977 | DM | M | 37 | 40 | 0 |
4 | 1 | Spectab exclosure | 270 | 10 | 16 | 1977 | DM | F | 36 | 38 | 0 |
5 | 1 | Spectab exclosure | 285 | 10 | 16 | 1977 | DM | M | 36 | 53 | 0 |
6 | 1 | Spectab exclosure | 352 | 11 | 12 | 1977 | DM | M | 36 | 40 | 0 |
Notice that the new ratio column contains truncated integer values (e.g. 16/9 gives 1): the division was carried out by SQLite, where dividing two integers yields an integer result. All these manipulations occur without physically downloading the data, by translating your code into SQL in the background. Since data download is often the most time-consuming step, this allows you to think about how much work you want to get done on the server before you pull the data. When you are ready to pull the data, you just use collect(). This will send the SQL query compiled in the background to the database and execute it.
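At any point before collecting, you can inspect the SQL that dplyr has composed so far. For the joined_plots query above, a sketch like the following prints a SELECT statement containing an INNER JOIN (the exact formatting depends on your dbplyr version):
joined_plots %>%
show_query()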
Now that we have a basic idea of how R and SQL talk to each other, let's take a look at connecting to different databases.
Let's take a look at the TreeBASE database. We will use the treebase package described in Boettiger & Lang (2012). TreeBASE is a repository of phylogenetic information, specifically user-submitted phylogenetic trees and the data used to generate them.
TreeBASE provides two APIs to query the database: one searches by the metadata associated with different publications (called OAI-PMH), and the other queries the phylogenies directly (called Phylo-ws). They have somewhat redundant functions, but for our purposes the second one returns the actual data, while the first returns metadata. The treebase package queries these APIs directly rather than going through the treebase.org website.
Let's install the treebase package directly from GitHub. First, you may need to install the devtools package (unless you already have it installed):
install.packages("devtools")
Then, use devtools to install the treebase package directly from GitHub:
library(devtools)
install_github("ropensci/treebase")
Then let's load the required packages into our workspace.
library(treebase)
library(ggplot2)
library(tidyverse)
We start with some queries of the metadata directly, without downloading any trees. Let's search the database for data from the author Huelsenbeck. Queries are run with search_treebase(): the first argument is the keyword used in the query, such as an author's name, and the second argument indicates the type of query (i.e. "author").
Phylogenies_from_Huelsenbeck <- search_treebase("Huelsenbeck", "author")
str(Phylogenies_from_Huelsenbeck)
List of 2
 $ :List of 9
  ..$ edge     : int [1:61, 1:2] 33 33 34 35 36 37 37 36 38 38 ...
  ..$ Nnode    : int 30
  ..$ tip.label: chr [1:32] "Ibalia" "Isocolus" "Aulacidea" "Panteliella" ...
  ..$ S.id     : chr "1070"
  ..$ Tr.id    : chr "5846"
  ..$ type     : chr "Single"
  ..$ kind     : chr "Species Tree"
  ..$ quality  : chr "Unrated"
  ..$ ntax     : chr "32"
  ..- attr(*, "class")= chr "phylo"
  ..- attr(*, "order")= chr "cladewise"
 $ :List of 9
  ..$ edge     : int [1:93, 1:2] 51 51 52 53 53 52 54 55 56 57 ...
  ..$ Nnode    : int 44
  ..$ tip.label: chr [1:50] "Acraea_andromacha" "Actinote_genitrix" "Actinote_stratonice" "Anthocharis_midea" ...
  ..$ S.id     : chr "1915"
  ..$ Tr.id    : chr "4975"
  ..$ type     : chr "Consensus"
  ..$ kind     : chr "Species Tree"
  ..$ quality  : chr "Unrated"
  ..$ ntax     : chr "50"
  ..- attr(*, "class")= chr "phylo"
  ..- attr(*, "order")= chr "cladewise"
Data discovery involves searching for existing data that meets certain desired characteristics. The web repository uses one interface (API) to access the metadata describing the publications associated with the data deposited (i.e. publisher, year of publication, etc.) and a different interface for the metadata associated with an individual phylogeny (i.e. the number of taxa or the kind of tree). The treebase package can query these individual sources of metadata separately.
We can also look at all the available metadata in TreeBASE from a certain date range using the download_metadata() function. Let's get all submissions from 2009 to present.
# get all trees from a certain deposition date forwards
meta <- download_metadata("2009-01-01", by="until")
# extract some metadata (e.g. publication date)
dates <- sapply(meta, function(x) as.numeric(x$date))
Let's look at trends in the growth of the database over time.
hist(dates, main="TreeBase growth", xlab="Year")
We can also show authors with the most submissions in that date range.
authors <- sapply(meta, function(x){
index <- grep( "creator", names(x))
x[index]
})
a <- as.factor(unlist(authors))
head(summary(a))
What if we look at journal submissions by volume? First, let's extract the publisher metadata from the meta object.
journals <- sapply(meta, function(x) x$publisher)
Then let's "unpack" the list of journals and sort them, keeping only the last 5 journals.
J <- tail(sort(table(as.factor(unlist(journals)))),5)
Finally, let's plot the object J
as a barplot.
b <- barplot(as.numeric(J))
text(b, names(J), srt=70, pos=4, xpd=T)
A standard test of the constant rate of diversification is the gamma statistic of Pybus & Harvey (2000), which tests the null hypothesis that the rates of speciation and extinction are constant. Under the null hypothesis, the gamma statistic is normally distributed about 0; values larger than 0 indicate that internal nodes are closer to the tips than expected, while values smaller than 0 indicate nodes farther from the tips than expected.
We will collect phylogenetic trees from TreeBASE and select those with branch length data that we can time-calibrate using tools available in R. We can then calculate the distribution of this statistic across the available trees and compare the results with the expectation under the null hypothesis.
For certain applications, we may wish to download all the available phylogenies from TreeBASE. Using the cache_treebase() function allows a user to download a local copy of all trees. Because direct database dumps are not currently available from treebase.org, this function has intentional delays to avoid overtaxing the TreeBASE servers and will take a full day to run.
treebase <- cache_treebase()
Once run, the cache is saved compactly in memory, where it can be easily and quickly restored. For convenience, we will load the treebase.RData object, which contains a copy of all the data already cached.
load("data_raw/treebase.RData")
Note that the newly loaded treebase object is a list (i.e. not a data frame). Therefore, common functions for data frames, like ncol() and nrow(), do not work. You need to use length() or str() instead.
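For example, a quick sketch to check the size and contents of the cache (the exact number of trees depends on when the cache was built):
length(treebase) # number of cached phylogenies
str(treebase[[1]]) # structure of the first tree (a "phylo" object)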
We will only be able to use those phylogenies that include branch length data, which we can determine with the have_branchlength() function in the treebase package. We drop those that do not from the data set.
For simplicity, we will use only the first 200 of these entries. Let's subset the cached data.
have <- have_branchlength(treebase)
branchlengths <- treebase[have][c(1:200)]
This analysis will require ultrametric trees (branch lengths proportional to time, rather than to the nucleotide substitution rate). As most of these phylogenies are calibrated with branch length proportional to mutational step, we must time‐calibrate each of them first. The following function drops trees that cannot meet the assumptions of the time‐calibration function.
#time calibration function
timetree <- function(tree){
try(chronoMPL(multi2di(tree)),
silent=TRUE)}
#cleaned data
tt <- drop_nontrees(sapply(branchlengths,timetree))
Let's take a look at our cleaned data tt.
str(tt)
At this point, we have 199 time‐calibrated phylogenies over which we will apply the diversification rate analysis.
gammas <- sapply(tt, gammaStat)
Let's take a look at gammas.
str(gammas)
num [1:199] 1.24 5.05 3.58 13.38 2.42 ...
Now let's see what the gamma distribution looks like.
qplot(gammas, binwidth = 0.5)+
xlab('gamma statistic')+
theme_classic()
Warning message: "Removed 7 rows containing non-finite values (stat_bin)."
The overall distribution appears slightly skewed towards positive values. This could indicate increasing speciation rates or constant, nonzero extinction rates. While differences in sampling may account for much of the spread observed, the position and identity of outlier phylogenies could suggest new hypotheses and potential directions for further exploration.
Let's take a look at the FishBase database. We will use the rfishbase package described in Boettiger et al. (2012). The rfishbase package queries the FishBase API directly rather than the FishBase.org website. This reduces load on the FishBase web servers and increases both the performance and the breadth of data available. rfishbase functions are primarily aimed at facilitating queries for specific data across a given list of many species. This is a task that is common to much scientific research and tedious to perform on very large datasets.
Let's install the rfishbase package. This time, instead of using install_github(), we can install it from the rOpenSci package repository with install.packages():
install.packages("rfishbase",
repos = c("http://packages.ropensci.org", "http://cran.rstudio.com"),
type="source")
...then load the package.
library('rfishbase')
rfishbase relies on periodic cache releases, but we can specify which version of the database to access. We will set the version of the database to "19.04" by setting the FISHBASE_VERSION option.
options(FISHBASE_VERSION="19.04")
Let's assemble a list of species we are interested in. Almost all functions in rfishbase take a list (character vector) of species scientific names. You can also read in a list of names from any existing data you are working with. When providing your own species list, you should always begin by validating the names. Taxonomy is a moving target, and this will help align the scientific names you are using with the names used by FishBase, and alert you to any potential issues.
validate_names(c("Oreochromis niloticus", "Salmo trutta"))
We can also collect information about all species in a particular taxonomic group, such as a Genus, Family or Order, using the species_list() function.
species_list(Genus = "Labroides")
rfishbase also recognizes common names. When a common name refers to multiple species, all matching species are returned. We can use the function common_to_sci() to see all records matching a common name.
fish <- common_to_sci("trout")
fish %>%
head()
Species | ComName | Language | SpecCode |
---|---|---|---|
<chr> | <chr> | <chr> | <dbl> |
Salmo obtusirostris | Adriatic trout | English | 6210 |
Schizothorax richardsonii | Alawan snowtrout | English | 8705 |
Schizopyge niger | Alghad snowtrout | English | 24454 |
Salvelinus fontinalis | American brook trout | English | 246 |
Salmo trutta | Amu-Darya trout | English | 238 |
Salmo kottelati | Antalya trout | English | 67602 |
With a species list in place, we are ready to query fishbase for data. Note that if you have a very long list of species, it is always a good idea to try out your intended functions with a subset of that list first to make sure everything is working.
The species() function returns a table containing some of the information found on the summary or homepage for a species on fishbase.org. rfishbase functions always return tidy data tables: rows are observations (e.g. a species, or individual samples from a species) and columns are variables (fields).
species(fish$Species[1:2])
SpecCode | Species | Genus | SpeciesRefNo | Author | FBname | PicPreferredName | PicPreferredNameM | PicPreferredNameF | PicPreferredNameJ | ... | Profile | PD50 | Emblematic | Entered | DateEntered | Modified | DateModified | Expert | DateChecked | TS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | ... | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
6210 | Salmo obtusirostris | Salmo | 59043 | (Heckel, 1851) | Adriatic trout | Saobt_u0.jpg | NA | NA | NA | ... | NA | 0.5 | 0 | 22 | 712800000 | 10 | 1278892800 | 1 | 763084800 | NA |
8705 | Schizothorax richardsonii | Schizothorax | 4832 | (Gray, 1832) | Snowtrout | Scric_u1.jpg | NA | NA | NA | ... | NA | 0.5 | 0 | 1 | 757987200 | 2 | 1336521600 | 1 | 843696000 | NA |
Because rfishbase accesses the back-end database, it does not always line up with the web display. Frequently, rfishbase functions will return more information than is available on the web versions of these tables. Some information found on the summary homepage for a species is not available from the species() summary function, but must be extracted from a different table. For instance, the species Resilience information is not one of the fields in the species summary table, despite appearing on the species homepage of fishbase.org. To discover which table this information is in, we can use the rfishbase function list_fields(), which will list all tables with a field matching the query string:
list_fields("Resilience")
table |
---|
<chr> |
stocks |
This shows us that this information appears in the stocks table. Let's create an object resil with the Resilience information.
resil <- stocks(fish$Species, fields="Resilience")
resil %>%
head()
Resilience |
---|
<chr> |
Medium |
Medium |
Medium |
Medium |
NA |
NA |
We may only be interested in the PriceCateg (price category) and the Vulnerability of the species. Let's create the object dat by querying for our full species list, asking for only these fields to be returned:
dat <- species(fish$Species, fields=c("SpecCode", "PriceCateg", "Vulnerability"))
dat %>%
head()
SpecCode | PriceCateg | Vulnerability |
---|---|---|
<dbl> | <chr> | <dbl> |
6210 | very high | 46.98 |
8705 | unknown | 34.78 |
24454 | unknown | 46.76 |
246 | very high | 43.37 |
238 | very high | 59.96 |
67602 | NA | 33.71 |
Working in R, it is easy to query additional tables and combine the results with the data we have collected so far.
combined_data <- merge(dat, resil)
combined_data %>%
head()
SpecCode | PriceCateg | Vulnerability | Resilience | |
---|---|---|---|---|
<dbl> | <chr> | <dbl> | <chr> | |
1 | 6210 | very high | 46.98 | Medium |
2 | 8705 | unknown | 34.78 | Medium |
3 | 24454 | unknown | 46.76 | Medium |
4 | 246 | very high | 43.37 | Medium |
5 | 238 | very high | 59.96 | Medium |
6 | 67602 | NA | 33.71 | Medium |
Let's take a look at combined_data. Note the very large number of rows in the output below: because dat and resil share no common column, merge() returns every combination of their rows (a Cartesian product). For a one-to-one match, you could request an identifying field such as SpecCode in both queries and merge by it.
str(combined_data)
'data.frame': 124135 obs. of 4 variables:
 $ SpecCode     : num 6210 8705 24454 246 238 ...
 $ PriceCateg   : chr "very high" "unknown" "unknown" "very high" ...
 $ Vulnerability: num 47 34.8 46.8 43.4 60 ...
 $ Resilience   : chr "Medium" "Medium" "Medium" "Medium" ...
combined_data %>%
ggplot(aes(x = Vulnerability)) +
geom_bar(aes(fill = Resilience), position = "dodge") +
facet_wrap( ~PriceCateg) +
scale_y_log10() +
theme_classic() +
theme(legend.position = "bottom")
Warning message: "Ignoring unknown parameters: binwidth, bins, pad"
With massive data collection from the Internet of Things transforming life and industry across the globe, we have access to more data than ever before. Forward-thinking organizations can now use databases to go beyond basic data storage and transactions to analyze vast quantities of data from multiple systems. Using database and other computing and business intelligence tools, organizations can now leverage the data they collect to run more efficiently, enable better decision-making, and become more agile and scalable. By having direct control over the ability to create and use databases, users gain control and autonomy while still maintaining important security standards.
Code below is for formatting of this lab. Do not alter!
cssFile <- '../css/custom.css'
IRdisplay::display_html(readChar(cssFile, file.info(cssFile)$size))
IRdisplay::display_html("<style>body {counter-reset: question_num;}.Q::before {counter-increment: question_num;
content: 'QUESTION ' counter(question_num) '\\A'; white-space: pre; }</style>")