The most difficult scenario for web scraping is when data is hidden behind multiple pages that can only be accessed by entering information into web forms. A few approaches might work in these cases, with varying degrees of difficulty and reliability, but often the best method is to use [Selenium](https://en.wikipedia.org/wiki/Selenium_(software)). Selenium automates web browsing sessions and was originally designed for testing purposes. You can simulate clicks, enter information into web forms, add some waiting time between clicks, etc. To learn how it works, we will scrape a heavily JavaScript-driven [website of 2017 General Election results](https://www.theguardian.com/politics/ng-interactive/2017/jun/08/live-uk-election-results-in-full-2017). (You could download this information from government websites, but well, this is an example.)

```{r}
url <- 'https://www.theguardian.com/politics/ng-interactive/2017/jun/08/live-uk-election-results-in-full-2017'
```

As you can see, the information we want to scrape is displayed dynamically when you type into the search field. By checking the page source, you can confirm that the information is not in the `html` but rendered dynamically in the browser.

The first step is to load RSelenium. Then, we will start a browser running in the background. I will use Firefox, but Chrome should also work.

```{r}
library(RSelenium)
library(tidyverse)
library(rvest)
library(xml2)
library(netstat)
```

We first start the Selenium server with Firefox:

```{r}
# Start the Selenium server:
rD <- rsDriver(browser = c("firefox"), verbose = F,
               port = netstat::free_port(random = TRUE), chromever = NULL)
driver <- rD[["client"]]  # the client object that controls the browser (equivalent to rD$client)

# Navigate to the selected URL address
driver$navigate(url)
```

This should open a browser window (in Firefox) with the specified URL.
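One practical note: the server keeps running in the background after your script finishes, which is a common reason for "port already in use" errors on the next `rsDriver()` call. A minimal cleanup sketch, using the `driver` and `rD` objects created above, which you would run at the end of a session:

```{r}
# Close the browser window controlled by the client ...
driver$close()
# ... and stop the Selenium server process running in the background
rD$server$stop()
```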
Here's how we would check that it worked:

```{r}
# Get the source code of the page (returned as a list of length one)
src <- driver$getPageSource()
# and see its first 1000 characters
substr(src[[1]], 1, 1000)
```

First things first: the following code will remove the cookie banner at the bottom. (_This is just to show you how to switch between frames — the script would run successfully even with the cookie banner._)

```{r}
# The cookie pop-up window is in frame one
driver$switchToFrame(1)

# We need to click on the "Yes, I'm happy" button:
# 1. Locate the button on the page
accept_button <- driver$findElement(using = "xpath",
                                    value = "/html/body/div/div[2]/div[3]/div/div/button[1]")
# 2. Click on the button:
accept_button$clickElement()

# Switch back to the default frame -- if this does not work, try driver$switchToFrame(NA)
driver$switchToFrame(NULL)
```

Let's assume we want to see the results of a constituency. We can feed a post code or a constituency name into the search field and check the results. First, let's identify the element that we're trying to interact with. Then, send the text to the field followed by an "enter" key input.

```{r}
# 1. identify the node for the search input
search_field <- driver$findElement(using = 'class name', value = '______________')
# 2. send the post code
search_field$sendKeysToElement(list("WC2A 2AE"))
# 3. this is a tricky part: we need to wait until a suggestion shows up
while (driver$findElement(using = 'class name',        # get suggestions
                          value = 'ge-lookup__suggestions')$getElementText() %>%
       nchar() == 0) {       # count the number of characters and compare to 0
  Sys.sleep(1)               # if it is 0, there are no suggestions yet and we
  # print("Waiting")         # need to wait -> use Sys.sleep(1)
}
# the while() loop runs until you get some suggestions!
# 4. press "Enter"
search_field$sendKeysToElement(list(key = "enter"))
```

Now that we have the results table displayed, we will scrape the name of the constituency and the table.
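One caveat about the `while()` loop above: if no suggestion ever appears (say, because of a typo in the post code), it blocks forever. A small, hypothetical helper — plain R, not part of RSelenium — implements the same polling pattern with a timeout:

```{r}
# Poll `condition` (a function returning TRUE/FALSE) every `interval` seconds
# until it returns TRUE or `timeout` seconds have elapsed.
wait_until <- function(condition, timeout = 10, interval = 1) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    if (isTRUE(condition())) return(TRUE)  # condition met
    Sys.sleep(interval)                    # wait before polling again
  }
  FALSE  # timed out
}
```

With this helper, step 3 above could be written as `wait_until(function() nchar(driver$findElement(using = 'class name', value = 'ge-lookup__suggestions')$getElementText()[[1]]) > 0)`, and you can react (retry, stop, log) when it returns `FALSE` instead of hanging.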
```{r}
## get the constituency name
const_name <- driver$findElement(using = 'class name',
                                 value = '_____________')$getElementText()

## get the div with the result information
res_div <- driver$findElement(using = 'class name', value = 'ge-result')

## What we can do here is identify the root node where the results are
## displayed, hand the html from the browser over to rvest, and use the
## familiar html_table() function.

## get the html of the div, then parse it using rvest's html_table()
results_html <- read_html(res_div$getElementAttribute('innerHTML')[[1]])
results_table <- html_table(results_html)[[1]]
names(results_table)[c(1, 5)] <- c('tmp', 'tmp2')
```

The first column of the table was supposed to be the party. But that information is not coming through, because it's just blank `