LAB 8: Temporal data and string manipulation plus Web scraping!

BIO3782: Biologist's Toolkit (Dalhousie University)


Setup of workspace

Make sure the required files are in the working directory:

As in previous labs, we'll try to simulate "real-life" coding by using the tags below to indicate when to use RStudio's script editor and when to use the console:







Let's load up some packages that we'll need.
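Something like the following, assuming dplyr, stringr, rvest and ggplot2, which are all used later in this lab:

    library(dplyr)     # data manipulation (mutate(), etc.)
    library(stringr)   # string helpers (str_extract())
    library(rvest)     # web scraping
    library(ggplot2)   # plotting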


Temporal data

Time is a surprisingly difficult thing to get right. The problem is that people store time in many different ways. French Canada uses 24-hour notation, while English-speaking Canada tends to use 12-hour notation. In the US, there is the habit of using month-day-year rather than day-month-year. Time zones are a pain, and many countries observe daylight saving time for part of the year. And that is just the start of the problems. So, let's begin with some notation.

Date coding notation in R

There are several functions to manipulate dates, times, or both. It is generally recommended to stick to the simplest level you need: use dates if you just have dates, and times only if you have times. The date-only function is as.Date().

Because there are so many ways to keep track of time, you often need to specify the specific date-time notation associated with your data, so that R can handle and convert dates and times into a standardized format. In as.Date(), the notation is:

Code Value
%d Day of the month (decimal number)
%m Month (decimal number)
%b Month (abbreviated)
%B Month (full name)
%y Year (2 digit)
%Y Year (4 digit)

For any date you might have, a date object can be created by passing a string into the as.Date() function and specifying what it looks like:
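For example, using the notation described in the note below:

    as.Date('15/1/2021', format='%d/%m/%Y', tz='AST')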

Note that in the format= argument you have to specify the date-time notation that matches your data, including any characters used to separate the elements of your date (e.g. -, /, _, etc.). The specific notation in this case is %d for "day", then a /, then %m for "month", then another /, and finally %Y for year. Also, in this case, the time zone tz= is AST for Atlantic Standard Time.

Date objects

Let's make the following two variables:
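    date1 <- '15/1/2021'                                         # a plain character string
    date2 <- as.Date('15/1/2021', format='%d/%m/%Y', tz='AST')   # a Date object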


Now let's print the two variables to screen by doing...
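    date1
    # [1] "15/1/2021"
    date2
    # [1] "2021-01-15"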


As you can see, both return something that looks like a date. Our brain quickly deduces that '15/1/2021' and 2021-01-15 are both dates and, in fact, the same date in different formats. However, R does not see date1 and date2 the same way your brain does. R sees date1 as simply a string of characters, like 'hello world', while R understands that date2 is a date object that contains temporal information. This is evident if we query the class of those variables:
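    class(date1)
    # [1] "character"
    class(date2)
    # [1] "Date"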


Dates are weird because they are a mixture of units of uneven size: 12 months containing different numbers of days, which themselves vary by year. Let's say you want to know how many days there are between a series of sampling dates. Calculating this by hand would be a nightmare, unless you cleverly stored your data in a date object, which allows you to do all kinds of "temporal math". We can determine the length of time between two dates by using the diff() function.

First let's create a vector of dates called sample_dates.
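A sketch with illustrative dates (the lab's original dates may differ):

    sample_dates <- as.Date(c('2021-01-15', '2021-02-03', '2021-03-21', '2021-05-02'))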


Next, let's compute the difference between each date.
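    diff(sample_dates)
    # Time differences in days
    # [1] 19 46 42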


It's a good idea to specify the time zone where the time was collected, as this anchors time to a universal standard; things can get ambiguous quickly (try sampling reefs on a trip across the Pacific and keeping track of what time zone you're in, or even what hemisphere, once you come back and look at the data). Time zones are a common pitfall, as names we use locally may not apply universally (everyone wants Eastern Standard Time, it seems).

The full list of time zones is long and can also be printed in R. Below is how to get the first 6 time zones:
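Base R keeps the list in OlsonNames():

    head(OlsonNames())   # first 6 of the ~600 recognized time zone names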




How many days passed between each of the assassinations of JFK (November 22, 1963), Malcolm X (February 21, 1965), and Martin Luther King Jr.(April 4, 1968) in the 1960's?

We can also create a column/vector of dates. We'll use the function seq(), which creates a sequence of things.
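Based on the description below:

    my_dates <- seq(date2, length = 20, by = 'week')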


We have just created a vector of dates for 20 weeks starting from January 15, 2021. Let's check to see if the dates are separated weekly (by 7 days), by computing the time difference between each pair of elements in my_dates:
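    diff(my_dates)   # should print 7 days between each pair of dates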


In most date systems, dates are really stored as integers, with some specific day in history being day zero. Excel famously uses 1 January 1900 as its origin, and wrongly treats 1900 as a leap year so as to maintain the Microsoft obsession with backward compatibility. In R, 1 January 1970 is day zero, following the tradition of Unix.

To look at the integer day format for a datetime from the my_dates vector we created, we can use the following functions:
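Two equivalent ways in base R (the lab's original code may differ):

    as.numeric(my_dates)   # days since 1970-01-01
    unclass(my_dates)      # same numbers, exposed by stripping the Date class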




How many days would pass between the first and last dates if we ran this string of code seq(date2, length=20, by='day')?

In addition to turning dates into numbers, R can also turn dates into words, via a few convenience functions:
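For example, applied to the date2 object from earlier (all base R):

    weekdays(date2)    # "Friday"
    months(date2)      # "January"
    quarters(date2)    # "Q1"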




There is also the julian() function, which returns the number of days since time zero. 'Julian' is a nod to the Julian calendar declared by Julius Caesar in 46 BC; astronomers adopted a related day count that begins on Monday, January 1, 4713 BC (a date that precedes recorded history, chosen because three astronomical cycles coincided then), and the name stuck.
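    julian(date2)   # days since R's default origin, 1970-01-01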


It returns integers similar to the integer date format we talked about earlier.


Datetime objects

If you also have times in your data (datetime data), you can create a POSIX object. The name POSIX is an acronym for Portable Operating System Interface, which is a set of standards for maintaining compatibility between computer systems. POSIX notation adds additional codes to how things are specified:

Code Meaning
%a Abbreviated weekday
%A Full weekday
%b Abbreviated month
%B Full month
%c Locale-specific date and time
%d Decimal day of the month
%H Decimal hours (24 hour)
%I Decimal hours (12 hour)
%j Decimal day of the year
%m Decimal month
%M Decimal minute
%p Locale-specific AM/PM
%S Decimal second
%U Decimal week of the year (starting on Sunday)
%w Decimal weekday (0=Sunday)
%W Decimal week of the year (starting on Monday)
%x Locale-specific date
%X Locale-specific time
%y 2-digit year
%Y 4-digit year
%z Offset from GMT
%Z Time zone (character)



These codes reflect the various time components and conventions that people use globally.

In R, the POSIX conversions for datetime objects are handled by two functions:

  1. as.POSIXct() creates an atomic object of the number of seconds since time zero (ct = calendar time)
  2. as.POSIXlt() creates a list of time attributes (lt = list time)

Let's take a look at the difference between the two by creating two objects using similar dates.
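A sketch with illustrative timestamps (the lab's original values may differ):

    time1 <- as.POSIXct('2021-01-15 10:30:00', tz = 'America/Halifax')
    time2 <- as.POSIXlt('2021-01-16 10:30:00', tz = 'America/Halifax')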


Here is the object we created with POSIXct
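    unclass(time1)   # a single number: seconds since 1970-01-01 (plus a tzone attribute)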


Here is the object we created with POSIXlt
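    unclass(time2)   # a list of parts: $sec, $min, $hour, $mday, $mon, $year, ...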


Because lists carry more computational overhead, unless you need the separate time components, the best course is to stick with as.POSIXct(), where all the conversions are handled behind the scenes. POSIXct objects also work a little more intuitively than as.Date objects.

Let's take a look at the difference between the two time objects we created (time1,time2).
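Using the illustrative objects above:

    time2 - time1
    # Time difference of 1 days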


The fact that these work in seconds means you can add to them coherently, provided you convert your increments to seconds first.
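For example, adding one day's worth of seconds:

    time1 + 60 * 60 * 24   # 24 hours (86400 seconds) later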




Use POSIXct to determine the number of seconds Apollo 11 took between takeoff (July 16, 1969, 13:32:00) and landing on the Moon (July 20, 1969, 20:17:40).

POSIXct objects will keep track of daylight saving time, which is applied "willy-nilly" among provinces, states, and countries.


strptime()

Finally, there is the strptime() function, an internal workhorse that takes a string and converts it into a time data type.

Let's create a dataframe called events.
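A sketch with illustrative events; tibble() (which loads with dplyr) gives the column-type printout described below:

    events <- tibble(
      event = c('sample A', 'sample B', 'sample C'),
      time  = c('15/1/2021 10:00', '15/1/2021 14:30', '16/1/2021 09:15')
    )
    events   # note the <chr> under the time column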


Notice that the time variable is a character instead of a date (see the <chr> below the time column title?). We could use mutate() and as.Date() to change it into the right format, or we could use the strptime() function instead.
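A sketch, using the format string that matches the illustrative data above; wrapping strptime() in as.POSIXct() keeps the column atomic:

    events <- mutate(events, time = as.POSIXct(strptime(time, format = '%d/%m/%Y %H:%M')))
    events   # time now shows <dttm>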


You can see now it shows <dttm> below the time column title.

The problem with strptime is that it makes some assumptions that might mess things up for you if they go undetected. For example...


and...


... returns the results in different units (first in seconds, then in hours). This might mess you up if you're scripting to extract times like we do below.
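A sketch of the gotcha, with hypothetical timestamps: the difference between two strptime() results picks its own units depending on the size of the gap:

    t0 <- strptime('15/1/2021 10:00:00', format = '%d/%m/%Y %H:%M:%S')
    t0 - strptime('15/1/2021 09:59:30', format = '%d/%m/%Y %H:%M:%S')
    # Time difference of 30 secs
    t0 - strptime('15/1/2021 05:00:00', format = '%d/%m/%Y %H:%M:%S')
    # Time difference of 5 hours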





Most often, we won't create dates and times by hand. Instead we will import them from a flat file. Here we can download daily wind and rainfall data for London.

For your convenience, we have provided the data from 2017 in the Weather_Data_2017.csv file.

Using the Date.and.Time column timestamp, calculate the average length of time (in hours) between gale force (i.e. >34 knots) maximum gust records.
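A sketch of one approach. The gust column name (Max.Gust) and the timestamp format are assumptions; check names(weather) and the raw file against your data:

    weather <- read.csv('Weather_Data_2017.csv')
    weather$Date.and.Time <- as.POSIXct(weather$Date.and.Time, format = '%Y-%m-%d %H:%M')  # format assumed
    gale <- weather[weather$Max.Gust > 34, ]                      # gale-force records (>34 knots)
    mean(as.numeric(diff(gale$Date.and.Time), units = 'hours'))  # average gap, forced into hours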






How many observations from Weather_Data_2017.csv have hourly maximum gusts > 34 knots?



What is the highest recorded hourly maximum gust?



What is the length of time (in hours) between gale force (i.e. >34 kts) maximum gust records?




Using Weather_Data_2017.csv, plot hourly maximum gust through time.
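A minimal ggplot sketch, with the same assumed column names as above:

    ggplot(weather, aes(x = Date.and.Time, y = Max.Gust)) +
      geom_line() +
      labs(x = 'Date', y = 'Hourly maximum gust (knots)')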




Your plot should look a bit like the one below:

String Manipulation

A surprising amount of what people do with computers involves text: searching for and manipulating strings within a programming language. In biology, the area with the major lock on text manipulation is bioinformatics. As the name implies, bioinformatics deals with biological information, especially the analysis of DNA, RNA and protein sequences. The challenges, and the scientific opportunities, of analyzing this information are incredible. In its simplest form, we can represent DNA/RNA and protein as text -- either nucleic acids or amino acids. Each base or amino acid is represented as a single letter (e.g. A/C/G/T for DNA). Stored in the sequence of nucleic and amino acids are all the instructions to create life. So strings are important.

Among the simplest but most crucial attributes of a string is its length. We'll use the nchar() function for that.
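For example, with a made-up DNA snippet:

    dna <- 'ATGGCGTATTTAAGC'
    nchar(dna)
    # [1] 15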


We can also take a slice of a string. To grab a section of a string by the positions of its letters, R has the substr() function.
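    substr(dna, 1, 3)   # characters 1 through 3
    # [1] "ATG"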


We can also split a string using the strsplit() function. We will use the character "a" to separate the strings.
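An illustrative example (the lab's original string may differ):

    strsplit('banana', 'a')
    # [[1]]
    # [1] "b" "n" "n"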


Notice here that the 'a' has disappeared. If we want to keep that 'a', we need to take a slice at the 'a' position. We can use the gregexpr() function to find the position of 'a'.
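    gregexpr('a', 'banana')   # positions 2, 4 and 6, wrapped in a list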


For reasons known only to the original R programmers, this returns a list object, with a number at the beginning, followed by the position we're looking for. So to use this to get that index number, we need to index the list.
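    gregexpr('a', 'banana')[[1]][1]   # position of the first 'a'
    # [1] 2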


Let's explore string manipulation using the mlb2017_pitching.txt dataset. First let's load the data.
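Assuming the file is comma-separated (check the raw file to confirm the delimiter):

    mlb_pitching <- read.csv('mlb2017_pitching.txt')
    head(mlb_pitching)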


Next, let's clean it up a little by extracting only the first and last names of each of the players. For that, we need the help of regular expressions.

Regular expressions

A Regular expression (or regex) is a sequence of characters that specifies a "search pattern" to be used against a dataset made of characters. Regular expressions are really useful for sifting through and subsetting large textual datasets (like genetic sequences). Using them can impart superhero-like qualities:

The power of regular expressions lies in the ability to include "wildcards" as part of the search pattern. The most common are shown below:

Special character (or wildcard) What does it do?
. matches any single character
* matches the preceding character 0 or more times
+ matches the preceding character 1 or more times
? matches the preceding character 0 or 1 times
\ suppresses special meanings. Use this if you want to search for one of the special characters in your string
^ matches the beginning of the string
$ matches the end of the string
[] matches any of the characters inside the square brackets
[^] matches any characters except those inside the brackets
{n} matches the preceding character exactly n times
{n,m} matches the preceding character between n and m times
\n new line
\t tab

Below are some of the things you can do with regular expressions:

Find values in strings/vectors that match your desired pattern or sequence

Let's go back to our example with baseball players. We were about to clean up the "Name" column by extracting only the first and last names of each of the players. We can use the str_extract() function for that. str_extract() extracts matching patterns from a string.
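A sketch; the pattern (two runs of letters separated by a space) is an assumption about how the names are formatted:

    str_extract(mlb_pitching$Name, '[A-Za-z]+ [A-Za-z]+')   # shown without overwriting the column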


We can also find all instances of a pattern. For instance, we can find all the players with the name "Jim". We will use grep() for this. grep() returns the position of each instance of the search string, so we can also use them inside an indexing statement to find other values.

Note that we will be using the ^ character to indicate the beginning of the string (i.e. we do not want names that merely contain "Jim" somewhere inside, like "Jimenez").
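    grep('^Jim', mlb_pitching$Name)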


Notice it returned the row indices of the players named Jim. To return the actual entries, we will have to combine grep() with some indexing.
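    mlb_pitching$Name[grep('^Jim', mlb_pitching$Name)]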


More powerful than just finding the "Jims" is figuring out quantities. For example, what proportion of players are in their 30s?
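One way to do it (see the note below about what grep() matches here):

    length(grep('^3', mlb_pitching$Age)) / nrow(mlb_pitching)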



grep finds and matches the strings in the Age column that begin with "3".



How many distinct pitchers are there in mlb_pitching?



What proportion of players are over 40yrs old?



Who are the players over 40yrs old?

Return logical vectors that match your pattern

The grepl() function will return TRUE if conditions are satisfied, and FALSE if not. Here you have to use the standard escape character \ to stop * from acting as a special character:
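For example, searching for a literal asterisk (in an R string, the backslash itself must be doubled):

    grepl('\\*', mlb_pitching$Name)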


With this, you can then do typical boolean indexing.

Find and replace values that match your pattern

If you or your data provider has a spelling problem, you can correct them on the fly with the gsub() find and replace function. For example, we can change the name of all the players called "Tyler" to "Superman".
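    gsub('Tyler', 'Superman', mlb_pitching$Name)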


Probably a more useful thing for this data is filtering out the parts of the string we don't want. Let's get rid of all the text after the backslashes.
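The following removes the backslash and everything after it (a literal backslash is written \\\\ in an R regex):

    mlb_pitching$Name <- gsub('\\\\.*', '', mlb_pitching$Name)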




Why are there three "Al Alburquerque" entries in the MLB pitching data?

Web scraping

Among the more functional and powerful things R can do is pull down information from the web and process it for use. The library to do this is rvest, created (again) by Hadley Wickham. This is a deep topic that requires some insight into HTML, the tag-driven markup language that powers most of the web.

What is Web scraping?

Web scraping is a technique for converting data presented in an unstructured format on the web (HTML tags) into a structured format that can easily be accessed and used. Almost all major programming languages provide ways of performing web scraping.



Ways to scrape data

There are several ways of scraping data from the web. Some of the popular ways are:

  1. Human Copy-Paste: This is a slow but effective way of scraping data from the web, in which humans themselves analyze and copy the data to local storage.
  2. Text pattern matching: Another simple yet powerful approach to extract information from the web is by using regular expression matching facilities of programming languages (we learned regular expressions in R in a section above).
  3. API Interface: Many websites like Facebook, Twitter, LinkedIn, etc. provide public and/or private APIs which can be called using standard code to retrieve data in a prescribed format.
  4. DOM Parsing: By using web browsers, programs can retrieve the dynamic content generated by client-side scripts. It is also possible to parse web pages into a DOM tree, based on which programs can retrieve parts of these pages.

We'll use the DOM parsing approach during the course of this tutorial and rely on the CSS selectors of the webpage to find the relevant fields which contain the desired information. But before we begin, there are a few prerequisites one needs in order to proficiently scrape data from any website.

Understanding a web page

Before we can start learning how to scrape a web page, we need to understand how a web page itself is structured.

From a user perspective, a web page has text, images and links all organized in a way that is aesthetically pleasing and easy to read. But the web page itself is written in specific coding languages that are then interpreted by our web browsers. When we're web scraping, we’ll need to deal with the actual contents of the web page itself: the code before it’s interpreted by the browser.

If you want to see "the code" of this website (i.e. Lab 8), simply press Ctrl + u, or Command + u on a Mac (this should work in most modern browsers).

The main languages used to build web pages are called Hypertext Markup Language (HTML), Cascading Style Sheets (CSS) and Javascript. HTML gives a web page its actual structure and content. CSS gives a web page its style and look, including details like fonts and colors. Javascript gives a webpage functionality.

In this tutorial, we’ll focus mostly on how to use R web scraping to read the HTML and CSS that make up a web page.

HTML

Unlike R, HTML is not a programming language. Instead, it’s called a markup language — it describes the content and structure of a web page. HTML is organized using tags, which are surrounded by <> symbols. Different tags perform different functions. Together, many tags will form and contain the content of a web page.

The text below is a legitimate HTML document.
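A minimal reconstruction of such a document, based on the tags discussed below:

    <html>
        <head>
        </head>
        <body>
            <p>Here's a paragraph of text!</p>
            <p>Here's a second paragraph of text!</p>
        </body>
    </html>

If we were to save this as a .html file and open it using a web browser, we would see a page saying: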

Here's a paragraph of text!
Here's a second paragraph of text!

Notice that each of the tags is "paired", in the sense that each one is accompanied by another with a similar name. That is to say, the opening <html> tag is paired with the closing tag </html>, and together they indicate the beginning and end of the HTML document. The same applies to <body> and <p>.

The <html> </html> tags specify the beginning and end of the HTML content. The <head> </head> and <body> </body> tags add more structure to the document, specifying the beginning and end of the header and the main body of the file, respectively. The <p> </p> tags are what we use in HTML to designate paragraphs.

There are many, many tags in HTML, but we won’t be able to cover all of them in this tutorial. If interested, you can check out this site. The important takeaway is to know that tags have particular names (html, body, p, etc.) to make them identifiable in an HTML document.

Having opening and closing tags (e.g. <p> </p>) is important, because it allows tags to be nested within each other. The <body> and <head> tags are nested within <html>, and <p> is nested within <body>. This nesting gives HTML a "tree-like" structure:

This tree-like structure will inform how we look for certain tags when we're using R for web scraping, so it’s important to keep it in mind. If a tag has other tags nested within it, we would refer to the containing tag as the parent and each of the tags within it as the “children”. If there is more than one child in a parent, the child tags are collectively referred to as “siblings”. These notions of parent, child and siblings give us an idea of the hierarchy of the tags.

CSS

Whereas HTML provides the content and structure of a web page, CSS provides information about how a web page should be styled. Without CSS, a web page is dreadfully plain. Here's a simple HTML document without CSS that demonstrates this.

When we say styling, we are referring to a wide, wide range of things. Styling can refer to the attributes (e.g. color, size, position, font, alignment, etc.) of particular HTML elements. Like HTML, the scope of CSS material is so large that we can’t cover every possible concept in the language. If you’re interested, you can learn more here.

Two concepts we do need to learn before we delve into the R web scraping code are classes and ids.

First, let's talk about classes. If we were making a website, there would often be times when we'd want similar elements of a website to look the same. For example, we might want a number of items in a list to all appear in the same color, red.

We could accomplish that by directly inserting some CSS that contains the color information into each line of text's HTML tag, like so:
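For example (reconstructed from the description below):

    <p style="color:red">Here's a paragraph of text!</p>
    <p style="color:red">Here's a second paragraph of text!</p>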

The style text indicates that we are trying to apply CSS to the <p> tags. Inside the quotes, we see a key-value pair “color:red”. color refers to the color of the text in the <p> tags, while red describes what the color should be.

If we wanted to change the color of that text, we'd have to change each line one by one.

Instead of repeating this style text in all of these <p> tags, we can replace it with a class selector:
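    <p class="red-text">Here's a paragraph of text!</p>
    <p class="red-text">Here's a second paragraph of text!</p>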

With the class selector, we can better indicate that these <p> tags are related in some way. In a separate CSS file, we can create the red-text class and define how it looks by writing:
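    .red-text {
        color: red;
    }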

Combining these two elements into a single web page will produce the same effect as the first set of red <p> tags, but it allows us to make quick changes more easily.

In this tutorial, of course, we're interested in web scraping, not building a web page. But when we're web scraping, we'll often need to select a specific class of HTML tags, so we need to understand the basics of how CSS classes work.

Similarly, we may often want to scrape specific data that's identified using an id. CSS ids are used to give a single element an identifiable name, much like how a class helps define a class of elements.

If an id is attached to an HTML tag, it makes it easier for us to identify this tag when we are performing our actual web scraping with R.
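For example, with a hypothetical id:

    <p id="special-paragraph">Here's a paragraph of text!</p>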

Don't worry if you don't quite understand classes and ids yet; it'll become clearer when we start manipulating the code.

Web scraping the IMDb website

There are several R libraries designed to take HTML and CSS and be able to traverse them to look for particular tags. The library we’ll use is rvest.

In this tutorial, we’ll use R for scraping the data for the most popular feature films of 2019 from the IMDb website.

We'll get a number of features for each of the 100 most popular feature films released in 2019. We'll also look at the most common problems one might face while scraping data from the internet, owing to the lack of consistency in website code, and at how to solve these problems.

If you don't have rvest installed yet, install it first:
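    install.packages('rvest')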


Then, load the library:
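    library(rvest)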


Let's specify a url for the desired website, one that loads the first 100 titles of 2019.

Now, as new films are added to the imdb website, the content returned by the url query may vary over time. We purposely chose a year in the past (2019) to avoid this, but changes can still happen (in fact, changes happened just last week). Therefore, I downloaded a copy of the imdb page to my github account, and we will use this copy for the web scraping exercises of this lab. Below, as a comment, you can see the actual url you could use if you want to perform the web scraping on the live imdb site.
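A sketch; the 2019 snapshot filename here is an assumption, by analogy with the 2016 copy linked in the task section:

    # Snapshot copy (filename assumed, by analogy with the 2016 copy used in the task below)
    url <- 'https://raw.githubusercontent.com/Diego-Ibarra/biol3782/main/week8/imdb_100titles_2019.html'

    # Live site (results will change over time):
    # url <- 'http://www.imdb.com/search/title?count=100&release_date=2019,2019&title_type=feature'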


Next, let's read the html code from the website.
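    webpage <- read_html(url)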


Now, we’ll be scraping the following data from this website.

Here’s a screenshot that contains how all these fields are arranged.

Rank

Now, we will start by scraping the Rank field. For that, we'll use the SelectorGadget browser extension to get the specific CSS selectors that enclose the rankings. You can click on the extension in your browser and select the rankings field with the cursor.

To see the html code in Google Chrome, you can go to Options -> More tools -> Developer tools, or hit Ctrl + Shift + I (on Windows).

First, let's select the ranking. Highlight the "1." beside "Captain Marvel" and hit Ctrl + Shift + I, or right-click and select "Inspect". This should take you to the rank CSS on the source page.

Once you are sure that you have made the right selections, you need to copy the corresponding CSS selector. In our case, it's text-primary.

Once you know the CSS selector that contains the rankings, you can use this simple R code to get all the rankings.

We will use the html_nodes() function to extract pieces of data out of the HTML document using a CSS selector. You can use the help section (?html_nodes) to take a look at the syntax.
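    rank_data_html <- html_nodes(webpage, '.text-primary')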


Next, we will use html_text() to extract the text content from these html nodes.
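    rank_data <- html_text(rank_data_html)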


Now let's see if it pulled out the rankings.
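    head(rank_data)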


Once you have the data, make sure it is in the desired format. In our case, we should convert the text to numeric.
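    rank_data <- as.numeric(rank_data)
    head(rank_data)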


Now we can select all the titles. You can visually inspect that all the titles are selected.

Title

Let's scrape all the titles using the lister-item-header a CSS tag.
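    title_data_html <- html_nodes(webpage, '.lister-item-header a')
    title_data <- html_text(title_data_html)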


Let's have a look at the first 6 titles.
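    head(title_data)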


In the following code, we will do the same thing to scrape the Description, Runtime, Genre, Rating, Metascore, Votes, Gross_Earning_in_Mil, Director and Actor data.

Notice that the web scraping code is relatively similar across the fields, but the CSS tags are different.

Description


Let's scrape film Description data from the webpage.
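A sketch; the selector here is an assumption, so confirm it with SelectorGadget on your copy of the page:

    description_data <- html_text(html_nodes(webpage, '.ratings-bar+ .text-muted'))   # selector assumed
    head(description_data)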


Notice that each entry begins with "\n". We will need to remove that using the gsub function.
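    description_data <- gsub('\n', '', description_data)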

We will do similar things to the next set of data.

Runtime


Let's scrape the website data for Runtime.
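Again, the CSS selector is an assumption to confirm with SelectorGadget:

    runtime_data <- html_text(html_nodes(webpage, '.text-muted .runtime'))   # selector assumed
    head(runtime_data)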


To make it easier to deal with later, let's remove "min" and convert the data to numeric.
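    runtime_data <- as.numeric(gsub(' min', '', runtime_data))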


Genre


Let's scrape the website data for Genre.
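As above, the selector is an assumption:

    genre_data <- html_text(html_nodes(webpage, '.genre'))   # selector assumed
    head(genre_data)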


Notice we have "\n" and excess spaces. Let's remove those to clean up the data.
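    genre_data <- gsub('\n', '', genre_data)   # drop newlines
    genre_data <- gsub(' ', '', genre_data)    # drop excess spaces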


Now let's only take the first genre of each movie and convert the data from characters to factors.
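    genre_data <- gsub(',.*', '', genre_data)   # keep only the first genre listed
    genre_data <- as.factor(genre_data)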


Rating


Let's scrape the website data for Rating.
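Selector assumed, as above:

    rating_data <- html_text(html_nodes(webpage, '.ratings-imdb-rating strong'))   # selector assumed
    head(rating_data)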


Notice that the rating data comes in as characters (because that's what html_text() returns!). Let's change it to numeric data.
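    rating_data <- as.numeric(rating_data)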


Votes


Let's scrape the website data for Votes. We will apply similar processes as above. First we'll read in the data, remove extra characters then convert it to numeric.
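A sketch; the selector is an assumption:

    votes_data <- html_text(html_nodes(webpage, '.sort-num_votes-visible span:nth-child(2)'))   # selector assumed
    votes_data <- gsub(',', '', votes_data)   # drop thousands separators
    votes_data <- as.numeric(votes_data)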


Director


Let's scrape the website data for data on Directors. First we'll read in the data then convert it from characters to factors.
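A sketch; the selector is an assumption:

    directors_data <- html_text(html_nodes(webpage, '.text-muted+ p a:nth-child(1)'))   # selector assumed
    directors_data <- as.factor(directors_data)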


Actor


Let's scrape the website data for names of Actors. First we'll read in the data then convert it from characters to factors.
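A sketch; the selector is an assumption:

    actors_data <- html_text(html_nodes(webpage, '.lister-item-content .ghost+ a'))   # selector assumed
    actors_data <- as.factor(actors_data)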


Metascore


Let's read in and examine the data.
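A sketch; the selector is an assumption:

    metascore_data <- html_text(html_nodes(webpage, '.metascore'))   # selector assumed
    head(metascore_data)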


This is great, but because metascore_data is made up of characters, we cannot do math or calculate statistics with this data. See what happens when we try to compute descriptive statistics:
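    summary(metascore_data)   # only Length/Class/Mode for character data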


As you can see, the summary() function cannot return the min, max, median, mean, etc., because these metrics cannot be computed on characters (i.e. "letters").

We should convert the characters to numeric!
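    metascore_data <- as.numeric(metascore_data)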


Let's take a look at the summary of metascore_data again...
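    summary(metascore_data)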


Great! We computed descriptive statistics!


Gross


Here there is the problem that not all entries have a reported gross earning. We want those missing values to be filled with NaN. We will use the html_node() function for this (see ?html_nodes for the difference between html_node() and html_nodes()). Note that instead of using webpage as input, we will use the output of html_nodes(webpage, '.lister-item-content') as input:
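A sketch; the inner selector is an assumption:

    gross_data <- html_nodes(webpage, '.lister-item-content') %>%
        html_node('.ghost~ .text-muted+ span') %>%   # inner selector assumed; missing nodes come back as NA
        html_text()
    length(gross_data)   # 100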


Note that we got 100 elements, with NaN where gross earnings are missing.

Let's clean up the data by eliminating $ and M, and converting to numeric values.
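    gross_data <- as.numeric(gsub('[$M]', '', gross_data))   # '[$M]' matches either character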


Now let's check the statistics summary:
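    summary(gross_data)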


Now that we have successfully scraped all 11 features for the 100 most popular feature films released in 2019, let's combine them to create a data frame and inspect its structure.
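A sketch of the combination; the column names here are mine:

    movies_df <- data.frame(
        Rank = rank_data, Title = title_data, Description = description_data,
        Runtime = runtime_data, Genre = genre_data, Rating = rating_data,
        Metascore = metascore_data, Votes = votes_data,
        Gross_Earning_in_Mil = gross_data,
        Director = directors_data, Actor = actors_data
    )
    str(movies_df)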


Analyzing scraped data from the web

Once you have the data, you can perform several tasks, like analyzing the data, drawing inferences from it, training machine learning models over it, etc. I have gone on to create some interesting visualizations out of the data we have just scraped.

Let's take a look at the distribution of movies by runtime and genre.
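A ggplot2 sketch, assuming the movies_df built above:

    ggplot(movies_df, aes(x = Runtime, fill = Genre)) +
        geom_histogram(bins = 30)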


What about runtime vs rating?
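    ggplot(movies_df, aes(x = Runtime, y = Rating, size = Votes, colour = Genre)) +
        geom_point()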


What about runtime vs earnings?
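    ggplot(movies_df, aes(x = Runtime, y = Gross_Earning_in_Mil, size = Rating, colour = Genre)) +
        geom_point()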


Now you have a fair idea of the problems you might come across when working with times, strings and web scraping, and how you can work your way around them. As most of the data on the web is present in an unstructured format, web scraping is a really handy skill for any data scientist.

Now let's get you to play around with data on your own!

Your lab task: Analyzing film data from 2016!




Scrape data from IMDB on the top 100 movies released in 2016 and answer the questions that follow. Use the url posted below.

As in the example above, to ensure that the content does not change, the url is a copy of the 2016 imdb results saved in my github:

https://raw.githubusercontent.com/Diego-Ibarra/biol3782/main/week8/imdb_100titles_2016.html



If you are curious and want to run your code again on the live imdb site, use http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature However, you will get different results depending on the updates and changes that imdb makes to its database. DO NOT use this url to compute the answers for your Brightspace quiz; you will get the wrong answers if you do!



HINTS

For this task you will need...




Remember to check your datasets! You can use str() and dim() or just look at your raw data to make sure your cleaned data is in the format you want.






What was the highest ranked film by popularity of 2016?



What is the CSS selector we use to scrape ranking data?



What is the title of the sixth highest ranked film of 2016?



What is the title of the lowest ranked film of 2016?



How many observations are there in the webscraped title data object you created?



In the 19th most popular film of 2016, who was the story about?



Where was the 17th most popular film of 2016 set?



What is the CSS selector we use to scrape runtime data?



How many films have no metascore data?



How many films are missing gross data values?



After combining all the scraped datasets, (forming movies_df in our example), how many observations are there?



After combining all the scraped datasets, (forming movies_df in our example), how many variables are there?



What is the runtime of the 56th film in the combined dataset (i.e. movies_df)?



What is the genre of the 73rd film in the combined dataset (i.e. movies_df)?



How many unique directors are there in the films list?



Which of the folks in the choices below directed more than one film?



How many text characters does the description of the 50th most popular film contain?
HINT: You can use the string functions we learned earlier



How many directors have first names beginning with the letter "J"?
HINT: You can use the string functions we learned earlier



How many directors have first names whose 2nd letter is "e"?
HINT: You can use the string functions we learned earlier



How many movies on this list did Lily James appear in?



Which of the films Lily James appeared was the highest rated?



Which movie from which Genre had the 3rd longest runtime?



In the Runtime of 130-160 mins, which genre has the highest votes?



In the Runtime of 130-160 mins, who directed the film with the highest votes?



Across all genres, which genre has the highest total gross earnings (combination of all the films) in runtime 100 to 120 mins?



In the runtime 100 to 120 mins, how much did Horror movies earn?



How many Crime films with runtimes between 100 to 120 mins earned >$100M?



What was the most profitable film genre in 2016?



Was the film with the highest number of Votes also the most profitable?



Which genre made large profits but didn't necessarily garner the most votes?



If you were to make a movie, what genre would you choose to make the most profit?