MY472 Data for Data Scientists

Michaelmas Term 2022

Main course repo

Prerequisites

All students are required to complete the preparatory course ‘R Advanced for Methodology’ early in Michaelmas Term, ideally in weeks 0 and 1. You will find the link to the preparatory course on the Moodle page of MY472.

Instructors

Office hour slots to be booked via LSE’s StudentHub

Course information

No lectures or classes will take place during (Reading) Week 6.

| Week | Date | Topic |
|------|------|-------|
| 1 | 27 Sep | Introduction |
| 2 | 4 Oct | The shape of data |
| 3 | 11 Oct | Data visualisation |
| 4 | 18 Oct | Textual data |
| 5 | 25 Oct | HTML and CSS |
| 6 | 1 Nov | Reading week |
| 7 | 7 Nov | XML, RSS, and scraping non-static pages |
| 8 | 14 Nov | Working with APIs |
| 9 | 21 Nov | Creating and managing databases |
| 10 | 28 Nov | Interacting with online databases |
| 11 | 5 Dec | Cloud computing |

Course description

This course covers the principles of collecting, processing, and storing data with R. It also covers workflow management for typical data transformation and cleaning projects, which are frequently the starting point and the most time-consuming part of any data science project. We take a project-based learning approach to the study of computation, with some group-based collaboration; both are essential parts of modern data science work. We also make frequent use of version control and collaboration tools such as Git and GitHub.

We will begin by discussing Git and R fundamentals, and continue with an introduction to reshaping data in R. An overview of visualisation with ggplot2 follows. We then learn how to work with unstructured data in the form of text. We will continue with a discussion of common data formats on the internet, such as markup languages (e.g. HTML and XML), and study the fundamentals of acquiring data from the internet by scraping websites. We will then download data from web APIs. Afterwards we will discuss databases, especially relational databases. Students will be introduced to SQL through SQLite, and programming assignments in this unit of the course will be designed to ensure that students learn to create, populate and query an SQL database. We will then discuss NoSQL, using MongoDB for comparison. The course will conclude with a discussion of cloud computing. Students will first learn the basics of cloud computing, which can serve various purposes such as data analysis, and then how to set up a cloud computing environment through Amazon Web Services, a popular cloud platform.

Assessment

Summative assignments

Four term-time assignments (50%) and one final assignment (50%).

Assessment criteria

Assignments will be marked using the following criteria:

Some of the assignments will involve shorter questions, to which the answers can be relatively unambiguously coded as (fully or partially) correct or incorrect. In the marking, these questions may be further broken down into smaller steps and marked step by step. The final mark is then a function of the proportion of parts of the questions which have been answered correctly. In such marking, the principle of partial credit is observed as far as feasible. This means that an answer to a part of a question will be treated as correct when it is correct conditional on answers to other parts of the question, even if those other parts have been answered incorrectly.

Detailed course schedule

1. Introduction

In the first week, we will introduce some basic concepts of how data is recorded and stored, and we will also review R fundamentals. Because the course relies heavily on GitHub, a collaborative code and data sharing platform, we will also discuss the use of Git and GitHub.

Lecture
Class
Readings
Additional readings

2. The shape of data

After some further review of R fundamentals, this week discusses data processing and manipulation in R using functions from the tidyverse.

Lecture
Class
Reading
Mock assignment: Processing data in R

3. Data visualisation

The lecture this week will offer an overview of the principles of exploratory data analysis through (good) data visualisation. In the coding session and seminars, we will practise producing our own graphs using ggplot2.

Lecture
Class
Reading
Further reading

4. Textual data

We will learn how to work with unstructured data in the form of text and discuss character encoding, search and replace with regular expressions, and elementary quantitative textual analysis.
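As a rough illustration of the search-and-replace and simple counting mentioned above, here is a minimal base R sketch. The example strings and variable names are hypothetical and are not taken from the course materials.

```r
# Hypothetical example strings (an assumption for illustration only).
texts <- c("Data   science in 2022", "R  is fun!", "No digits here")

# Search: flag the strings that contain at least one digit.
has_digit <- grepl("[0-9]", texts)

# Replace: collapse any run of whitespace into a single space.
cleaned <- gsub("[[:space:]]+", " ", texts)

# Elementary quantitative analysis: count the words in each cleaned string.
word_counts <- lengths(strsplit(cleaned, " ", fixed = TRUE))

print(has_digit)
print(cleaned)
print(word_counts)
```

The POSIX class `[[:space:]]` is used here rather than `\\s` only because it behaves the same way across R's regular expression engines; either would likely work for this input.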

Lecture
Class
Reading
Further reading
Assignment 1: Data visualisation

5. HTML and CSS

This week, we will learn how to obtain data from the internet. We cover basic web scraping to turn web data into text or numbers, and we will also discuss the client-server model.

Lecture
Class
Reading
Further reading

6. Reading week

Assignment 2: Web scraping

7. XML, RSS, and scraping non-static pages

Continuing from the material covered in Week 5, we will learn advanced topics in web scraping. These include scraping documents in XML (such as RSS feeds) and scraping websites with non-static components using Selenium.

Lecture
Class
Reading
Further reading

8. Working with APIs

This week discusses how to work with Application Programming Interfaces (APIs) that offer developers and researchers access to data in a structured format. Our running examples will be the New York Times API and the Twitter API.

Lecture
Class
Reading
Further reading
Assignment 3: APIs

9. Creating and managing databases

This session will offer an introduction to relational databases: structure, logic, and main types. We will learn how to write SQL, a language designed to query this type of database and currently used by many companies, and how to use it from R with the DBI package.
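To give a flavour of using SQL from R with DBI, the sketch below creates, populates and queries a small SQLite database. It assumes the RSQLite package is installed as the database driver; the table, its columns and its values are hypothetical.

```r
library(DBI)

# Open a temporary in-memory SQLite database (assumes RSQLite is installed).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical data, made up for illustration.
students <- data.frame(
  id = 1:3,
  name = c("Ana", "Ben", "Chloe"),
  programme = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# Create and populate a table from the data frame.
dbWriteTable(con, "students", students)

# Query the table with SQL: number of students per programme.
result <- dbGetQuery(
  con,
  "SELECT programme, COUNT(*) AS n
   FROM students
   GROUP BY programme
   ORDER BY programme"
)
print(result)

# Close the connection when finished.
dbDisconnect(con)
```

An in-memory database is used so that the example leaves no file behind; passing a file path instead of `":memory:"` would store the database on disk.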

Lecture
Class
Reading
Further reading

10. NoSQL and online databases

This week covers how to set up and use relational databases in the cloud, and the fundamentals of a document-based NoSQL database.

Lecture
Class
Required reading
Assignment 4: Databases

11. Cloud computing

This week, we focus on setting up computation environments on the internet. We will introduce cloud computing concepts and learn why the big shift to cloud computing is occurring in industry and how it is relevant to data scientists. We will then set up different instances in the cloud and study cloud computing through an example of continuous scraping.

Lecture
Class
Reading
Further reading

Take-home exam