MY472 Data for Data Scientists
Michaelmas Term 2022
Prerequisites
All students are required to complete the preparatory course ‘R Advanced for Methodology’ early in Michaelmas Term, ideally in weeks 0 and 1. You will find the link to the preparatory course on the Moodle page of MY472.
Instructors
Office hour slots to be booked via LSE’s StudentHub
- Friedrich Geiecke, Department of Methodology. Office hours: Tuesdays 5-7pm (book via Student Hub)
- Yuhao Qian, Department of Economics.
Course information
- Lecture:
- Tuesdays 9-11am, MAR.2.08
- Classes:
- Thursdays 1-2pm, NAB.2.08
- Thursdays 2-3pm, CBG.2.05
- Thursdays 5-6pm, NAB.2.16
No lectures or classes will take place during (Reading) Week 6.
Quick links to topics
Week | Date | Topic |
---|---|---|
1 | 27 Sep | Introduction |
2 | 4 Oct | Tabular data |
3 | 11 Oct | Data visualisation |
4 | 18 Oct | Textual data |
5 | 25 Oct | HTML, CSS, and scraping static pages |
6 | 1 Nov | Reading week |
7 | 7 Nov | XML, RSS, and scraping non-static pages |
8 | 14 Nov | Working with APIs |
9 | 21 Nov | Creating and managing databases |
10 | 28 Nov | Interacting with online databases |
11 | 5 Dec | Cloud computing |
Course description
This course covers the principles of collecting, processing, and storing data with R. It also covers workflow management for typical data transformation and cleaning projects, frequently the starting point and most time-consuming part of any data science project. We use a project-based learning approach towards the study of computation and some group-based collaboration, essential parts of modern data science work. We also make frequent use of version control and collaboration tools such as Git and GitHub.
We will begin by discussing Git and R fundamentals, and continue with an introduction to reshaping data in R. Afterwards it follows an overview of visualisation with ggplot2. We then learn how to work with unstructured data in the form of text. We will continue with a discussion of common data types on the internet such as markup languages (e.g. HTML and XML) and study the fundamentals of acquisition of data from the internet through scraping of websites. We will then download data from web APIs. Afterwards we will discuss databases, especially relational databases. Students will be introduced to SQL through SQLite, and programming assignments in this unit of the course will be designed to ensure that students learn to create, populate and query an SQL database. We will then discuss NoSQL using MongoDB for comparison. The course will be concluded with a discussion of cloud computing. Students will first learn the basics of cloud computing that can serve various purposes such data analysis and then how to set up a cloud computing environment through Amazon Web Services, a popular cloud platform.
Assessment
Summative assignments
Four term-time assignment (50%) and one final assignment (50%).
Assessment criteria
Assignments will be marked using the following criteria:
-
70–100: Very Good to Excellent (Distinction). Perceptive, focused use of a good depth of material with a critical edge. Original ideas or structure of argument.
-
60–69: Good (Merit). Perceptive understanding of the issues plus a coherent well-read and stylish treatment though lacking originality
-
50–59: Satisfactory (Pass). A “correct” answer based largely on lecture material. Little detail or originality but presented in adequate framework. Small factual errors allowed.
-
30–49: Unsatisfactory (Fail) and 0–29: Unsatisfactory (Bad fail). Based entirely on lecture material but unstructured and with increasing error component. Concepts are disordered or flawed. Poor presentation. Errors of concept and scope or poor in knowledge, structure and expression.
Some of the assignments will involve shorter questions, to which the answers can be relatively unambiguously coded as (fully or partially) correct or incorrect. In the marking, these questions may be further broken down into smaller steps and marked step by step. The final mark is then a function of the proportion of parts of the questions which have been answered correctly. In such marking, the principle of partial credit is observed as far as feasible. This means that an answer to a part of a question will be treated as correct when it is correct conditional on answers to other parts of the question, even if those other parts have been answered incorrectly.
Detailed course schedule
Schedule
1. Introduction
In the first week, we will introduce some basic concepts of how data is recorded and stored, and we will also review R fundamentals. Because the course relies fundamentally on GitHub, a collaborative code and data sharing platform, we will also discuss the use of git and GitHub.
Lecture
- Slides
- Code: A plain R script, a first R markdown example, and a recap on vectors, lists, data frames
- Seminar
Seminar
- Review of Git/GitHub basics discussed in lecture
- Branches, merges, and pull requests
Readings
- Wickham, Hadley. Nd. Advanced R, 2nd ed. Ch 3, Names and values, Chapter 4, Vectors, and Chapter 5, Subsetting. (Ch. 2-3 of the print edition),
- GitHub Guides, especially: “Understanding the GitHub Flow”, “Hello World”, and “Getting Started with GitHub Pages”.
- GitHub. “Markdown Syntax” (a cheatsheet).
Additional readings
- Lake, P. and Crowther, P. 2013. Concise guide to databases: A Practical Introduction. London: Springer-Verlag. Chapter 1, Data, an Organizational Asset
- Nelson, Meghan. 2015. “An Intro to Git and GitHub for Beginners (Tutorial).”
- Jim McGlone, “Creating and Hosting a Personal Site on GitHub A step-by-step beginner’s guide to creating a personal website and blog using Jekyll and hosting it for free using GitHub Pages.”.
2. Tabular data
This week discusses processing tabular data in R with functions from the tidyverse
after some further review of R fundamentals.
Lecture
- Slides
- Code: Conditionals, loops, and functions, data processing in R, industrial production dataset, and industrial production and unemployment dataset
Seminar
- Code: Dplyr exercises, solution
Reading
- Wickham, Hadley and Garett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly. Part II Wrangle, Tibbles, Data Import, Tidy Data (Ch. 7-9 of the print edition).
- The Tidyverse collection of packages for R.
Assignment 1: Processing data in R (practise assignment)
- GitHub Classroom link available via Moodle on Monday, 3 October
- Deadline on Friday, 14 October, 2pm
3. Data visualisation
The lecture this week will offer an overview of the principles of exploratory data analysis through (good) data visualization. In the coding session and seminars, we will practice producing our own graphs using ggplot2.
Lecture
Seminar
Reading
- Wickham, Hadley and Garett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly. Data visualization, Graphics for communication (Ch. 1 and 22 of the print edition).
Further reading
- Hughes, A. (2015) “Visualizing inequality: How graphical emphasis shapes public opinion” Research and Politics.
- Tufte, E. (2002) “The visual display of quantitative information”.
4. Textual data
We will learn how to work with unstructured data in the form of text and discuss character encoding, search and replace with regular expressions, and elementary quantitative textual analysis.
Lecture
Seminar
Reading
- Kenneth Benoit. July 16, 2019. “Text as Data: An Overview” Forthcoming in Cuirini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage.
Further reading
- Wickham, Hadley and Garett Grolemund. 2017, Chapter 14
- Regular expressions cheat sheet
- Regular expressions in R vignette
Assignment 2: Data visualisation
- GitHub Classroom link available via Moodle on Monday, 17 October
- Deadline on Friday, 28 October, 2pm
5. HTML, CSS, and scraping static pages
This week we cover the basics of web scraping for tables and unstructured data from static pages. We will also discuss the client-server model.
Lecture
Seminar
Reading
- Lazer, David, and Jason Radford. 2017. “Data Ex Machina: Introduction to Big Data.” Annual Review of Sociology 43(1): 19–39.
- Howe, Shay. 2015. Learn to Code HTML and CSS: Develop and Style Websites. New Riders. Chs 1-8.
- Kingl, Arvid. 2018. Web Scraping in R: rvest Tutorial.
Further reading
- Munzert, Simon, Christian Rubba, Peter Meissner, and Dominic Nyhuis D. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Hoboken, NJ/Chichester, UK:Wiley & Sons. Ch. 2-4, 9.
- Severance, Charles Russell. 2015. Introduction to Networking: How the Internet Works. Charles Severance, 2015.
- Duckett, Jon. 2011. HTML and CSS: Design and Build Websites. New York: Wiley.
6. Reading week
Assignment 3: Web scraping
- GitHub Classroom link available via Moodle on Monday, 31 October
- Deadline on Friday, 11 November, 2pm
7. XML, RSS, and scraping non-static pages
Continuing from the material covered in Week 5, we will learn the advanced topics in scraping the web. The topics include the scraping documents in XML (such as RSS), and scraping websites with non-static components with Selenium.
Lecture
Seminar
Reading
Further reading
- Mozilla Developer Web Docs. What is JavaScript.
- Web Scraping with R and PhantomJS.
- Mozilla Developer Web Docs. A First Splash into JavaScript.
8. Working with APIs
This week discusses how to work with Application Programming Interfaces (APIs) that offer developers and researchers access to data in a structured format. Our running examples will be the New York Times API and the Twitter API.
Lecture
- Slides
- Code: JSON in R, NYT API, Twitter REST API
Seminar
- Code: Twitter streaming API
Reading
- Steinert-Threlkeld. 2018. Twitter as Data. Cambridge University Press.
Further reading
- Ruths and Pfeffer. 2014. Social media for large studies of behavior. Science.
Assignment 4: APIs
- GitHub Classroom link available via Moodle on Tuesday, 15 November
- Deadline on Monday, 28 November, 11am
9. Creating and managing databases
This session will offer an introduction to relational databases: structure, logic, and main types. We will learn how to write SQL code, a language designed to query this type of databases that is currently employed by many companies; and how to use it from R using the DBI package.
Lecture
Seminar
- Code: SQL exercises, solution
Reading
- Beaulieu. 2009. Learning SQL. O’Reilly. (Chapters 1, 3, 4, 5, 8)
Further reading
- Stephens et al. 2009. Teach yourself SQL in one hour a day. Sam’s Publishing.
10. NoSQL and cloud databases
This week covers how to set up and use relational databases in the cloud and fundamentals of a document based NoSQL database.
Lecture
Seminar
- Code: Exercises BigQuery, SQL joins, SQL subqueries, solution BigQuery, solution joins, solution subqueries
Required
- Beaulieu. 2009. Learning SQL. O’Reilly. (Chapters 2)
- Hows, Membrey, and Plugge. 2014. MongoDB Basics. Apress. (Chapter 1)
- Tigani and Naidu. 2017. Google BigQuery Analytics. Weily. (Chapters 1-3)
Further reading
Assignment 5: Databases
- GitHub Classroom link available via Moodle on Monday, 28 November
- Deadline on Friday, 9 December, 2pm
11. Cloud computing
In this week, we focus on the setup of computation environments on the Internet. We will introduce the cloud computing concepts and learn why the big shift to the cloud computing is occurring in the industry and how it is relevant to data scientists. We will then set up different instances in the cloud and study cloud computing through an example of continuous scraping.
Lecture
- Slides
- Connecting to the instance with Windows via PuTTY
- Code: Hello world, continuous scraping within R only, installing R packages on the EC2 instance, continuous scraping via a schedule
- Optional code: Using storage outside the EC2 instance
Seminar
Reading
- Rajaraman, V. 2014. “Cloud Computing.” Resonance 19(3): 242–58.
- AWS: What is cloud computing.
- Azure: Developer guide.
Further reading
- Puparelia, Nayan. 2016. “Cloud Computing.” MIT Press. Ch. 1-3.
- Botta, Alessio, Walter De Donato, Valerio Persico, and Antonio Pescapé. 2016. “Integration of Cloud Computing and Internet of Things: A Survey.” Future Generation Computer Systems 56: 684–700.
Take-home exam
- GitHub Classroom link available via Moodle on Wednesday, 14 December
- Deadline on Monday, 16 January, 2pm