The most difficult scenario for web scraping is when data are hidden behind multiple pages that can only be accessed by entering information into web forms. There are a few approaches that might work in these cases, with varying degrees of difficulty and reliability, but often the best method is to use [Selenium](https://en.wikipedia.org/wiki/Selenium_(software)). Selenium automates web browsing sessions and was originally designed for testing purposes. You can simulate clicks, enter information into web forms, add waiting time between clicks, and so on. To learn how it works, we will scrape a heavily JavaScript-driven [website of 2017 General Election results](https://www.theguardian.com/politics/ng-interactive/2017/jun/08/live-uk-election-results-in-full-2017). (You could also download this information from government websites, but this is an example.)

```{r}
url <- 'https://www.theguardian.com/politics/ng-interactive/2017/jun/08/live-uk-election-results-in-full-2017'
```

As you can see, the information we want to scrape is displayed dynamically once you type something into the search field. By checking the website source, you can confirm that the information is not in the raw `html` but is rendered dynamically as you interact with the page.

The first step is to load RSelenium and the other packages we need. Then, we will start a browser running in the background. I will use Firefox, but Chrome should also work.

```{r}
library(RSelenium)
library(tidyverse)
library(rvest)
library(xml2)
library(netstat)
```

We first start the Selenium server with Firefox:

```{r}
# Start the Selenium server:
rD <- rsDriver(browser = c("firefox"),
               verbose = FALSE,
               port = netstat::free_port(random = TRUE),
               chromever = NULL)
driver <- rD[["client"]] # note: rD$client is an alternative but equivalent way to get the driver client

# Navigate to the selected URL address
driver$navigate(url)
```

This should open a browser window (in Firefox) with the specified URL. Here's how we would check that it worked:

```{r}
# Get the source code of the page
src <- driver$getPageSource()

# and see its first 1000 characters
substr(src[[1]], 1, 1000)
```

First things first: the following code will remove the cookie banner at the bottom. (_This is just to show you how to switch between frames; the script would run successfully even with the cookie banner._)

```{r}
# The cookie pop-up window is on frame one
driver$switchToFrame(1)

# We need to click on the "Yes, I'm happy" button:
# 1. Use a command to locate the button on the page
accept_button <- driver$findElement(using = "xpath",
                                    value = "/html/body/div/div[2]/div[3]/div/div/button[1]")

# 2. Click on the button:
accept_button$clickElement()

# Switch back to the default frame -- if this does not work, try driver$switchToFrame(NA)
driver$switchToFrame(NULL)
```

Let's assume we want to see the results for a particular constituency. We can feed in a post code or a constituency name and check the results. First, we identify the elements we are trying to interact with. Then, we send the text to the field, followed by an "enter" key press.

```{r}
# 1. identify the node for the search input
#    (the class name of the search box is left blank here -- find it by inspecting the page)
search_field <- driver$findElement(using = 'class name', value = '______________')

# 2. send the post code
search_field$sendKeysToElement(list("WC2A 2AE"))

# 3. this is the tricky part: we need to wait until a suggestion shows up.
#    The while() loop checks the suggestion box; if it contains no text
#    (nchar() == 0), there are no suggestions yet and we wait one second.
while(driver$findElement(using = 'class name',
                         value = 'ge-lookup__suggestions')$getElementText() %>%
      nchar() == 0) {
  Sys.sleep(1)
  # print("Waiting")
}
# the while() loop runs until you get some suggestions!

# 4. press "Enter"
search_field$sendKeysToElement(list(key = "enter"))
```
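One caveat: the busy-wait loop above will spin forever if the page never returns suggestions (for example, after a typo in the search text). Below is a minimal sketch of a bounded alternative; the helper name `wait_for_suggestions()` and the default timeout are our own choices, not part of RSelenium.

```{r}
# Bounded version of the wait (sketch): give up after `timeout` seconds instead
# of looping forever. Helper name and timeout value are illustrative choices.
wait_for_suggestions <- function(driver, timeout = 15) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    suggestions <- driver$findElement(using = 'class name',
                                      value = 'ge-lookup__suggestions')$getElementText()
    if (nchar(suggestions[[1]]) > 0) return(invisible(TRUE))  # suggestions appeared
    Sys.sleep(1)                                              # otherwise wait and retry
  }
  warning("No suggestions appeared within ", timeout, " seconds")
  invisible(FALSE)
}

# Usage (instead of the while() loop above):
# wait_for_suggestions(driver)
```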
Now that we have the results table displayed, we can scrape the constituency name and the table.

```{r}
## get the constituency name
## (again, the class name of the element holding the name is left blank -- inspect the page to find it)
const_name <- driver$findElement(using = 'class name',
                                 value = '_____________')$getElementText()

## get the div with the result information
res_div <- driver$findElement(using = 'class name', value = 'ge-result')

## what we can do here is identify the root node where the results are displayed,
## hand the html from the browser over to rvest,
## and use the familiar html_table() function

## get the html of the table, then parse it using rvest's html_table()
results_html <- read_html(res_div$getElementAttribute('innerHTML')[[1]])
results_table <- html_table(results_html)[[1]]
names(results_table)[c(1, 5)] <- c('tmp', 'tmp2')
```

The first column of the table was supposed to contain the party. That information does not come through, because the cells contain only empty `<span>` tags. We can still extract it from the class attribute attached to those tags.

```{r}
# the code here finds the span in the first column of each table row and extracts
# the value of its class attribute
party_class <- results_html %>%
  html_elements(xpath = "//tr/td[1]/span") %>%
  html_attr("class")
print(party_class)

# there is some extra text here. Remove it using a `str_replace_*()` function.
party <- str_replace_all(party_class, ".+-", "")
print(party)
```

Now, let's create a `data.frame`.

```{r}
results_table <- results_table %>%
  mutate(constituency = const_name[[1]],                  # new variable with the constituency name
         party = party,                                   # new variable with the party names
         votes = str_replace_all(votes, "\\D+", "") %>%   # "\\D+" -> drop everything that is not a digit
           as.numeric()) %>%                              # make it numeric
  select(c(2, 3, 6, 7))                                   # select the columns to keep

print(results_table)
```

--------

We have now identified the necessary steps to get the data, so we can go over the list of constituency names and get __all__ candidate data.

First, combine the previous steps into a function that takes a constituency name as input and returns the corresponding table:

```{r}
get_results_by_const <- function(const_name, sec = 4){

  # 1. Setup ------------------------------------------------------------------
  # URL
  url <- 'https://www.theguardian.com/politics/ng-interactive/2017/jun/08/live-uk-election-results-in-full-2017'

  # Scroll the page down by sending a space key to the body
  webElem <- driver$findElement("css", "body")
  webElem$sendKeysToElement(list(key = "space"))

  # 2. Search -----------------------------------------------------------------
  # Locate the search field and type in the constituency name
  search_field <- driver$findElement(using = 'class name', value = '______________')
  search_field$sendKeysToElement(list(const_name))

  # Wait for the suggestions (sleep for 1 second at a time)
  while(driver$findElement(using = 'class name',
                           value = 'ge-lookup__suggestions')$getElementText() %>%
        nchar() == 0) {
    Sys.sleep(1)
  }

  # Press ENTER to run the search
  search_field$sendKeysToElement(list(key = "enter"))

  # 3. Extract results --------------------------------------------------------
  # Get the div with the result information (i.e. the table with the results)
  res_div <- driver$findElement(using = 'class name', value = 'ge-result')

  # Get the html of the table, then parse it using rvest's html_table()
  results_html <- read_html(res_div$getElementAttribute('innerHTML')[[1]])
  results_table <- html_table(results_html)[[1]]

  # Rename the blank columns
  names(results_table)[c(1, 5)] <- c('tmp', 'tmp2')

  # Extract the political party information
  party_class <- results_html %>%
    html_elements(xpath = "//tr/td[1]/span") %>%
    html_attr("class")
  party <- sub(".+-", "", party_class)

  # 4. Collect everything in one dataframe -------------------------------------
  results_table <- results_table %>%
    mutate(constituency = const_name) %>%
    mutate(party = party) %>%
    mutate(votes = str_replace_all(votes, "\\D", "") %>% as.numeric()) %>%
    select(constituency, party, candidates, votes, `%`)

  # Don't overload the server: wait for a few seconds (default sec = 4)
  Sys.sleep(sec)

  # Return the dataset
  return(results_table)
}
```
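Scraping many pages in a row rarely goes perfectly: a single constituency whose page fails to load would stop the whole loop with an error. One option, sketched below under the assumption that you simply want to skip failures and inspect them later, is to wrap the function in `tryCatch()` (the wrapper name `safe_get_results()` is ours):

```{r}
# Fault-tolerant wrapper (sketch): return NULL instead of stopping the whole run
# when one constituency fails; bind_rows() silently drops the NULLs later.
safe_get_results <- function(const_name, ...) {
  tryCatch(
    get_results_by_const(const_name, ...),
    error = function(e) {
      message("Failed for ", const_name, ": ", conditionMessage(e))
      NULL
    }
  )
}
```

If you use it, the final step below would call `lapply(const[1:10], safe_get_results)` instead.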
Second, we need constituency names. We can get them from Wikipedia!

```{r}
url_const <- "https://en.wikipedia.org/wiki/United_Kingdom_Parliament_constituencies"

const <- read_html(url_const) %>%
  html_table(fill = TRUE) %>%
  ## Note that (as of the current Wikipedia article revision) this is just the English
  ## constituencies -- we could also get the Scottish, Welsh, and Northern Irish
  ## constituencies by finding their tables on the same page -- you can do this as a
  ## challenge if you want!
  .[[5]] %>%
  select(Constituency) %>%
  .[[1]]

# Print out the first twenty
const[1:20]
```

Finally, we feed the constituency names to the `get_results_by_const()` function. Sit back and relax: Selenium will do all the work for you. Let's just take the first ten constituencies (to spare the Guardian's servers). As a challenge, you could instead call the function on a random set of ten (this is also good practice, as it tests that your code works across cases rather than just for the first ten).

```{r}
data_all <- lapply(const[1:10], get_results_by_const) %>%
  bind_rows()

head(data_all)
```

Close the session:

```{r}
driver$close()
rD$server$stop()

# close the associated Java processes if necessary:
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
```
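The `taskkill` command only exists on Windows. If you are on macOS or Linux, a rough equivalent is sketched below; the `"selenium"` pattern passed to `pkill` is an assumption about how the server process is named on your machine, so check with `ps` before running it (force-killing matching processes is a blunt instrument).

```{r}
# Platform-aware forced cleanup (sketch): run the command that matches your OS.
# The "selenium" pattern for pkill is an assumption -- verify the process name first.
if (.Platform$OS.type == "windows") {
  system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
} else {
  system("pkill -f selenium", intern = FALSE)
}
```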