MY472 Data for Data Scientists

Michaelmas Term 2021

Main course repo

Prerequisites

All students are required to complete the preparatory course ‘R Advanced for Methodology’ early in Michaelmas Term, ideally in weeks 0 and 1. You will be auto-enrolled into the R course when enrolling into MY472 on Moodle.

Instructors

Office hour slots to be booked via LSE’s StudentHub

Course information

No lectures or classes will take place during (Reading) Week 6.

Week Date Topic
1 27 Sep Introduction to data
2 4 Oct The shape of data
3 11 Oct HTML and CSS
4 18 Oct XML, RSS, and scraping non-static pages
5 25 Oct Working with APIs
6 1 Nov Reading week
7 8 Nov Textual data
8 15 Nov Data visualisation
9 22 Nov Creating and managing databases
10 29 Nov Interacting with online databases
11 6 Dec Cloud computing

Course description

This course will cover the principles of digital methods for collecting, processing, and storing data. The course will also cover workflow management for typical data transformation and cleaning projects, which are frequently the starting point and the most time-consuming part of any data science project. We take a project-based learning approach to the study of computation, with some group-based collaboration; both are essential ingredients of modern data science work. We will also make frequent use of version control and group collaboration tools such as git and GitHub.

We begin by discussing concepts in fundamental data types, and how data is stored and recorded electronically. We continue with an introduction to R Markdown and the reshaping of data in R. This is followed by a discussion of various common data types on the internet, such as markup languages (e.g. HTML and XML) and JSON. Students also study the fundamentals of acquiring and managing data from the internet, both by scraping websites and by accessing the APIs of online databases and social network services.

After the reading week, we will learn how to work with unstructured data in the form of text. We then continue with an overview of the principles of exploratory data analysis through data visualisation, for example using R’s ggplot2. Next, we will cover database design, especially relational databases, using examples across a variety of fields. Students are introduced to SQL through MySQL, and programming assignments in this unit of the course will be designed to ensure that students learn to create, populate and query an SQL database. We will then introduce NoSQL using MongoDB and the JSON data format for comparison. For both types of database, students will be encouraged to work with data relevant to their own interests as they learn to create, populate and query data. The course concludes with a discussion of cloud computing. Students will first learn the basics of cloud computing, which can serve various purposes such as data analysis, and then how to set up a cloud computing environment through Amazon Web Services, a popular cloud platform.

Assessment

Formative coursework

Students will be expected to produce five weekly, structured problem sets. Each has an initial component that is started in the staff-led lab sessions and is then completed by the student outside of class. These problem sets do not require submission and are not marked, but model solutions will be provided after class. One or more of these problem sets will be completed in collaboration with other students.

Summative assignments

Five term-time assignments (50%) and one final assignment (50%).

Assessment criteria

Assignments will be marked using the following criteria:

Some of the assignments will involve shorter questions, to which the answers can be relatively unambiguously coded as (fully or partially) correct or incorrect. In the marking, these questions may be further broken down into smaller steps and marked step by step. The final mark is then a function of the proportion of parts of the questions which have been answered correctly. In such marking, the principle of partial credit is observed as far as feasible. This means that an answer to a part of a question will be treated as correct when it is correct conditional on answers to other parts of the question, even if those other parts have been answered incorrectly. For example, if one part builds on the result of an earlier part, a correct method applied to an incorrect earlier result can still earn credit for the later part.

Detailed course schedule

1. Introduction to data

In the first week, we will introduce some basic concepts of how data is recorded and stored, and we will also review R fundamentals. Because the course relies fundamentally on GitHub, a collaborative code and data sharing platform, we will also discuss the use of git and GitHub.

Lecture
Class
Required reading

2. The shape of data

After some further review of R fundamentals, this week discusses data processing and manipulation in R using functions from the tidyverse.

Lecture
Class
Required reading
Assignment 1: Data cleaning in R

3. HTML and CSS

From week 3 to week 5, we will learn how to get data from the internet. This week we cover basic web scraping to turn web data into text or numbers. We will also cover the client-server model.

Lecture
Class
Required reading

4. XML, RSS, and scraping non-static pages

Continuing from the material covered in Week 3, we will learn more advanced topics in web scraping. These include scraping documents in XML (such as RSS) and using Selenium to scrape websites with non-static components.

Lecture
Class
Required reading
Assignment 2: Web scraping

5. Working with APIs

This week discusses how to work with Application Programming Interfaces (APIs) that offer developers and researchers access to data in a structured format. Our running examples will be the New York Times API and the Twitter API.

Lecture
Class
Required reading
Assignment 3: APIs

6. Reading week

7. Textual data

We will learn how to work with unstructured data in the form of text and discuss character encoding, search and replace with regular expressions, and elementary quantitative textual analysis.

Lecture
Class
Required reading

8. Data visualisation

The lecture this week will offer an overview of the principles of exploratory data analysis through (good) data visualisation. In the coding session and seminars, we will practise producing our own graphs using ggplot2.

Lecture
Class
Required reading
Assignment 4: Data visualization

9. Creating and managing databases

This session will offer an introduction to relational databases: structure, logic, and main types. We will learn how to write SQL, a language designed to query this type of database and currently employed by many companies, and how to use it from R with the DBI package.

Lecture
Class
Required reading

10. NoSQL and online databases

This week covers how to set up and use relational databases in the cloud, and the fundamentals of a document-based NoSQL database.

Lecture
Class
Required reading
Assignment 5: Databases

11. Cloud computing

This week, we focus on setting up computation environments on the internet. We will introduce cloud computing concepts and learn why the big shift to cloud computing is occurring in industry and how it is relevant to data scientists. We will then set up different instances in the cloud and study cloud computing through an example of continuous scraping.

Lecture
Class
Required reading

Take-home exam