MY472 Data for Data Scientists
Michaelmas Term 2023
Instructors
Office hour slots to be booked via LSE’s StudentHub
- Thomas Robinson, Department of Methodology.
- Dan de Kadt, Department of Methodology.
Assignments
Assignment | Type | Due date |
---|---|---|
1 | Formative problem set | 12 October 2023, 4pm |
2 | Summative problem set | 2 November 2023, 4pm |
3 | Summative problem set | 7 December 2023, 4pm |
4 | Take home assessment | 10 January 2024, 4pm |
Quick links to topics
Week | Topic | Lecturer |
---|---|---|
1 | Introduction | Thomas Robinson |
2 | Tabular data | Thomas Robinson |
3 | Data visualisation | Thomas Robinson |
4 | Textual data | Thomas Robinson |
5 | HTML, CSS, and scraping static pages | Dan de Kadt |
6 | Reading week | |
7 | XML, RSS, and scraping non-static pages | Dan de Kadt |
8 | Working with APIs | Dan de Kadt |
9 | Creating and managing databases | Dan de Kadt |
10 | Interacting with online databases | Dan de Kadt |
11 | Cloud computing | Thomas Robinson |
Detailed course schedule
Please note that links to slides and code scripts will be added or updated in advance of each week’s teaching.
1. Introduction
In the first week, we will introduce some basic concepts of how data is recorded and stored, and we will also review R fundamentals. Because the course relies fundamentally on GitHub, a collaborative code and data sharing platform, we will also discuss the use of git and GitHub.
Lecture
- Slides
- Code: A plain R script, a first R markdown example, and a recap on vectors, lists, data frames
Seminar
- Review of Git/GitHub basics discussed in lecture
- Branches, merges, and pull requests
- Guide on GitHub, collaboration and pull requests
Readings
- Wickham, Hadley. n.d. Advanced R, 2nd ed. Ch. 3, Names and values; Ch. 4, Vectors; and Ch. 5, Subsetting (Ch. 2-3 of the print edition).
- GitHub Guides, especially: “Understanding the GitHub Flow”, “Hello World”, and “Getting Started with GitHub Pages”.
- GitHub. “Markdown Syntax” (a cheatsheet).
Additional readings
- Lake, P. and Crowther, P. 2013. Concise Guide to Databases: A Practical Introduction. London: Springer-Verlag. Chapter 1, Data, an Organizational Asset.
- Nelson, Meghan. 2015. “An Intro to Git and GitHub for Beginners (Tutorial).”
- McGlone, Jim. “Creating and Hosting a Personal Site on GitHub: A step-by-step beginner’s guide to creating a personal website and blog using Jekyll and hosting it for free using GitHub Pages.”
2. Tabular data
This week discusses processing tabular data in R with functions from the tidyverse, after some further review of R fundamentals.
Lecture
- Slides
- Code: Conditionals, loops, and functions, data processing in R, industrial production dataset, and industrial production and unemployment dataset
Seminar
- Code: Dplyr exercises, solution
Reading
- Wickham, Hadley and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly. Part II Wrangle, Tibbles, Data Import, Tidy Data (Ch. 7-9 of the print edition).
- The Tidyverse collection of packages for R.
Assignment 1
This is a formative assignment, and is due 12 October 2023 by 4pm. You must submit your response as a knitted .html file via the Moodle page.
3. Data visualisation
The lecture this week will offer an overview of the principles of exploratory data analysis through (good) data visualisation. In the coding session and seminars, we will practise producing our own graphs using ggplot2.
Lecture
- Slides
- Lecture code: Anscombe, ggplot2 walkthrough
- Data: Congressional Facebook posts, unemployment data
- Further reference code: ggplot2 basics, ggplot2 scales, axes, and legends
Seminar
- Code: Exercises in visualisation, solution
- Graphic to replicate: Unemployment rates
Reading
- Wickham, Hadley and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly. Data visualization, Graphics for communication (Ch. 1 and 22 of the print edition).
Further reading
- Hughes, A. (2015) “Visualizing inequality: How graphical emphasis shapes public opinion” Research and Politics.
- Tufte, E. (2002) “The visual display of quantitative information”.
4. Textual data
We will learn how to work with unstructured data in the form of text and discuss character encoding, search and replace with regular expressions, and elementary quantitative textual analysis.
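The three ideas named above (character encoding, search and replace with regular expressions, and simple word counts) can be sketched in a few lines. The course itself works in R; the example below is only an illustrative sketch in Python using the standard library, and the sample sentence and variable names are invented for the purpose.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for this sketch.
text = "The LSE was founded in 1895. MY472 runs in Michaelmas Term 2023."

# Character encoding: one character can occupy more than one byte in UTF-8.
word = "caf\u00e9"                      # "café" with a precomposed é
n_chars = len(word)                     # number of characters
n_bytes = len(word.encode("utf-8"))     # number of bytes once encoded

# Search: find every standalone four-digit number.
years = re.findall(r"\b\d{4}\b", text)

# Replace: mask those numbers with a placeholder.
masked = re.sub(r"\b\d{4}\b", "<YEAR>", text)

# Elementary quantitative analysis: lowercase, tokenise, count words.
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)

print(n_chars, n_bytes)
print(years)
print(masked)
print(counts.most_common(3))
```

Note that the pattern `\b\d{4}\b` does not match the digits in "MY472", because they form only three digits and are attached to letters.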
Lecture
- Slides
- Code: Regular expressions in R, text analysis, parsing pdfs
- Data: Sample texts, Keynes’ “General Theory” cover
Seminar
- Code: Exercises in text analysis, solution
- Data: UoL institutions
Reading
- Kenneth Benoit. July 16, 2019. “Text as Data: An Overview.” Forthcoming in Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage.
Further reading
- Wickham, Hadley and Garrett Grolemund. 2017. Chapter 14.
- Regular expressions cheat sheet
- Regular expressions in R vignette
5. HTML, CSS, and scraping static pages
This week we cover the basics of web scraping for tables and unstructured data from static pages. We will also discuss the client-server model.
Lecture
Seminar
Reading
- Lazer, David, and Jason Radford. 2017. “Data Ex Machina: Introduction to Big Data.” Annual Review of Sociology 43(1): 19–39.
- Howe, Shay. 2015. Learn to Code HTML and CSS: Develop and Style Websites. New Riders. Chs 1-8.
- Kingl, Arvid. 2018. Web Scraping in R: rvest Tutorial.
Further reading
- Munzert, Simon, Christian Rubba, Peter Meissner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Hoboken, NJ/Chichester, UK: Wiley & Sons. Ch. 2-4, 9.
- Severance, Charles Russell. 2015. Introduction to Networking: How the Internet Works. Charles Severance, 2015.
- Duckett, Jon. 2011. HTML and CSS: Design and Build Websites. New York: Wiley.
Assignment 2
This is a summative assignment, and is due 2 November 2023 by 4pm. You must submit your response as a knitted .html file via the Moodle page.
6. Reading week
7. XML, RSS, and scraping non-static pages
Continuing from the material covered in Week 5, we will turn to more advanced topics in web scraping. These include scraping documents in XML (such as RSS feeds) and scraping websites with non-static components using Selenium.
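To give a sense of what scraping an XML document involves, here is a minimal sketch that parses a small RSS feed held in a string. The course works in R; this Python version uses only the standard library, and the feed content and URLs are invented placeholders rather than a real feed. Scraping non-static pages with Selenium needs a browser driver and is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document, invented for this sketch.
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

# Parse the string into an element tree and walk every <item> node.
root = ET.fromstring(rss)
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in root.iter("item")
]

for entry in items:
    print(entry["title"], entry["link"])
```

Because XML is a tree of named elements, the same pattern (find the repeating node, then pull out its children) carries over to most feeds.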
Lecture
Seminar
Reading
Further reading
- Mozilla Developer Web Docs. What is JavaScript.
- Web Scraping with R and PhantomJS.
- Mozilla Developer Web Docs. A First Splash into JavaScript.
8. Working with APIs
This week discusses how to work with Application Programming Interfaces (APIs) that offer developers and researchers access to data in a structured format.
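As a rough illustration of what "data in a structured format" means here, the sketch below builds a query URL and parses a JSON response body. It is written in Python with the standard library, whereas the course uses R. The endpoint, the parameter names, and the response are all hypothetical, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and query parameters, invented for this sketch.
base_url = "https://api.example.com/v1/posts"
params = {"q": "data science", "limit": 2}
request_url = f"{base_url}?{urlencode(params)}"

# Hypothetical JSON body of the kind such an API might return.
response_body = (
    '{"results": [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}],'
    ' "next_page": null}'
)

# Parse the JSON string into nested dictionaries and lists.
payload = json.loads(response_body)
texts = [post["text"] for post in payload["results"]]

print(request_url)
print(texts)
```

Real APIs typically also require authentication and impose rate limits, which this sketch leaves out.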
Lecture
Seminar
Reading
- Barberá & Steinert-Threlkeld. 2018. “How to use social media data for political science research”. In The Sage handbook of research methods in political science and international relations, pages 404-423.
Further reading
- Ruths and Pfeffer. 2014. Social media for large studies of behavior. Science.
Assignment 3
This is a summative assignment, and is due 7 December 2023 by 4pm. You must submit your response as a knitted .html file via the Moodle page.
Assignment 3 problem set, ivyleague.csv
9. Creating and managing databases
This session will offer an introduction to relational databases: structure, logic, and main types. We will learn how to write SQL, a language designed to query this type of database and widely used in industry, and how to run it from R using the DBI package.
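The following minimal sketch shows the relational idea in practice: two tables linked by a key, queried with a SQL join and an aggregate. It uses Python's built-in `sqlite3` module and an in-memory database as a stand-in; in the course the same SQL would be sent from R through the DBI package. The table names, column names, and rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database, so nothing is written to disk.
con = sqlite3.connect(":memory:")

# Two hypothetical tables linked by students.id = marks.student_id.
con.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
con.execute(
    "CREATE TABLE marks ("
    "student_id INTEGER REFERENCES students(id), "
    "assignment INTEGER, mark REAL)"
)
con.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
con.executemany(
    "INSERT INTO marks VALUES (?, ?, ?)",
    [(1, 2, 68.0), (1, 3, 72.0), (2, 2, 75.0)],
)

# Join the tables and compute each student's mean mark.
rows = con.execute(
    """
    SELECT s.name, AVG(m.mark) AS mean_mark
    FROM students AS s
    JOIN marks AS m ON m.student_id = s.id
    GROUP BY s.name
    ORDER BY s.name
    """
).fetchall()
con.close()

print(rows)
```

The query text itself is plain SQL, so it is the part that transfers directly to other database systems and client libraries.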
Lecture
Seminar
- Code: SQL exercises, solution
Reading
- Beaulieu. 2009. Learning SQL. O’Reilly. (Chapters 1, 3, 4, 5, 8)
Further reading
- Stephens et al. 2009. Teach yourself SQL in one hour a day. Sams Publishing.
10. NoSQL and cloud databases
This week covers how to set up and use relational databases in the cloud, and the fundamentals of a document-based NoSQL database.
Lecture
Seminar
- Code: Exercises BigQuery, SQL joins, SQL subqueries, solution BigQuery, solution joins, solution subqueries
Reading
- Beaulieu. 2009. Learning SQL. O’Reilly. (Chapter 2)
- Hows, Membrey, and Plugge. 2014. MongoDB Basics. Apress. (Chapter 1)
- Tigani and Naidu. 2017. Google BigQuery Analytics. Wiley. (Chapters 1-3)
Further reading
11. Cloud computing and containerization
This week we will focus on the setup of computation environments run outside our host system. We will introduce cloud computing and discuss why it is relevant to data scientists. We will then introduce the concept of containerization and the Docker platform. We will set up different instances in the cloud and on our own local machines, and study cloud computing through an example of Shiny dashboards.
Lecture
- Slides
- Connecting to the instance with Windows via PuTTY
- Code: Prime number finder, installing R packages on an EC2 instance, Dockerfile
- Optional code: Using storage outside the EC2 instance, Parallel computing
Seminar
- Code: Exercises in shiny
Reading
- Rajaraman, V. 2014. “Cloud Computing.” Resonance 19(3): 242–58.
- AWS: What is cloud computing.
- Azure: Developer guide.
Further reading
- Ruparelia, Nayan. 2016. Cloud Computing. MIT Press. Ch. 1-3.
- Botta, Alessio, Walter De Donato, Valerio Persico, and Antonio Pescapé. 2016. “Integration of Cloud Computing and Internet of Things: A Survey.” Future Generation Computer Systems 56: 684–700.
Assignment 4
This is a summative assignment. There are two parts to the assignment, the first part of which is due 10 January 2024 by 4pm. Please read the instructions very carefully: