MY472 Data for Data Scientists

Michaelmas Term 2023

Main course repo

Moodle page

Instructors

Office hour slots to be booked via LSE’s StudentHub

Assignments

No. | Type | Due date
1 | Formative problem set | 12 October 2023, 4pm
2 | Summative problem set | 2 November 2023, 4pm
3 | Summative problem set | 7 December 2023, 4pm
4 | Take home assessment | 10 January 2024, 4pm

Week | Topic | Lecturer
1 | Introduction | Thomas Robinson
2 | Tabular data | Thomas Robinson
3 | Data visualisation | Thomas Robinson
4 | Textual data | Thomas Robinson
5 | HTML, CSS, and scraping static pages | Dan de Kadt
6 | Reading week |
7 | XML, RSS, and scraping non-static pages | Dan de Kadt
8 | Working with APIs | Dan de Kadt
9 | Creating and managing databases | Dan de Kadt
10 | Interacting with online databases | Dan de Kadt
11 | Cloud computing | Thomas Robinson

Detailed course schedule

Please note that links to slides and code scripts will be added or updated in advance of each week’s teaching.

1. Introduction

In the first week, we will introduce some basic concepts of how data is recorded and stored, and we will also review R fundamentals. Because the course relies fundamentally on GitHub, a collaborative code and data sharing platform, we will also discuss the use of git and GitHub.

Lecture
Seminar
Guide on GitHub, collaboration and pull requests

YouTube video by Tom and Dan

Readings
Additional readings

2. Tabular data

After some further review of R fundamentals, this week covers how to process tabular data in R with functions from the tidyverse.

Lecture
Seminar
Reading

Assignment 1

This is a formative assignment, and is due 12 October 2023 by 4pm. You must submit your response as a knitted .html file via the Moodle page.

Assignment 1 problem set

Template RMarkdown repository

3. Data visualisation

The lecture this week will offer an overview of the principles of exploratory data analysis through (good) data visualisation. In the coding session and seminars, we will practise producing our own graphs using ggplot2.

Lecture
Seminar
Reading
Further reading

4. Textual data

We will learn how to work with unstructured data in the form of text and discuss character encoding, search and replace with regular expressions, and elementary quantitative textual analysis.

Lecture
Seminar
Reading
Further reading

5. HTML, CSS, and scraping static pages

This week we cover the basics of web scraping for tables and unstructured data from static pages. We will also discuss the client-server model.

Lecture
Seminar
Reading
Further reading

Assignment 2

This is a summative assignment, and is due 2 November 2023 by 4pm. You must submit your response as a knitted .html file via the Moodle page.

Assignment 2 problem set

6. Reading week

7. XML, RSS, and scraping non-static pages

Continuing from the material covered in Week 5, we will learn more advanced web scraping topics. These include scraping XML documents (such as RSS feeds) and using Selenium to scrape websites with non-static components.

Lecture
Seminar
Reading
Further reading

8. Working with APIs

This week discusses how to work with Application Programming Interfaces (APIs) that offer developers and researchers access to data in a structured format.

Lecture
Seminar
Reading
Further reading

Assignment 3

This is a summative assignment, and is due 7 December 2023 by 4pm. You must submit your response as a knitted .html file via the Moodle page.

Assignment 3 problem set, ivyleague.csv

9. Creating and managing databases

This session will offer an introduction to relational databases: structure, logic, and main types. We will learn how to write SQL, a language designed to query this type of database and currently used by many companies, and how to use it from R with the DBI package.

Lecture
Seminar
Reading
Further reading

10. NoSQL and cloud databases

This week covers how to set up and use relational databases in the cloud, and the fundamentals of a document-based NoSQL database.

Lecture
Seminar
Required
Further reading

11. Cloud computing and containerization

This week we will focus on the setup of computation environments run outside our host system. We will introduce cloud computing and discuss why it is relevant to data scientists. We will then introduce the concept of containerization and the Docker platform. We will set up different instances in the cloud and on our own local machines, and study cloud computing through an example of Shiny dashboards.

Lecture
Seminar
Reading
Further reading

Assignment 4

This is a summative assignment. There are two parts to the assignment, the first part of which is due 10 January 2024 by 4pm. Please read the instructions very carefully:

Final Assignment