MY472 Data for Data Scientists
Michaelmas Term 2019
Office hour slots to be booked via LSE’s StudentHub.
Kenneth Benoit, Department of Methodology. Office hours: Tuesdays 15:30–17:00, Wednesdays 10:00–11:00, COL.8.11.
Milena Tsvetkova, Department of Methodology. Office hours: Fridays 10:00–12:00, COL 8.03 (only weeks 10, 11).
- Lectures on Tuesdays 09:00–11:00 in CBG.2.01
- Classes on:
- Thursdays 9:30–11:00, FAW.4.02
- Fridays 15:00–16:30, FAW.4.02
No lectures or classes will take place during (Reading) Week 6.
Quick links to topics
| Week | Date | Lecturer | Topic |
|------|--------|----------|-----------------------------------|
| 1 | 1 Oct | KB | Introduction to data |
| 2 | 8 Oct | KB | The shape of data |
| 3 | 15 Oct | KB | Cloud computing |
| 4 | 22 Oct | KB | HTML and CSS |
| 5 | 29 Oct | KB | Using data from the Internet |
| 6 | 5 Nov | – | Reading week |
| 7 | 12 Nov | KB | Working with APIs |
| 8 | 19 Nov | KB | Textual data |
| 9 | 26 Nov | KB | Data visualisation |
| 10 | 3 Dec | MT | Creating and managing databases |
| 11 | 10 Dec | MT | Interacting with online databases |
This course will cover the principles of digital methods for storing and structuring data, including data types, relational and non-relational database design, and query languages. Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of interest. The course will also cover workflow management for typical data transformation and cleaning projects, frequently the starting point and most time-consuming part of any data science project. The course uses a project-based learning approach towards the study of performance computation and group-based collaboration, essential ingredients of modern data science projects. The coverage of data sharing will include key skills in online publishing, including the elements of web design, web technologies and web programming, as well as the use of revision-control and group-collaboration tools such as GitHub.
In this course, we introduce principles and applications of the electronic storage, structuring, manipulation, transformation, extraction, and dissemination of data. This includes data types, database design, database implementation, and data analysis through structured queries. Through joining operations, we will also cover the challenges of data linkage and how to combine datasets from different sources. We begin by discussing fundamental data types and how data is stored and recorded electronically.
Cloud computing and online collaboration tools form the second part of this course, along with the technologies that underlie them. Students will first learn the basics of cloud computing, which serves purposes such as secure hosting of webpages and web services and on-demand computation for data analysis, and will then learn how to set up a cloud-computing environment through Amazon Web Services, a popular cloud platform. Collaboration, and the dissemination and submission of course assignments, will use GitHub, the popular code repository and version-control system. The course also provides an in-depth look at common data formats on the Internet, including markup languages (e.g. HTML and XML) and JSON. Students will also study the fundamentals of acquiring and managing data from the Internet, both by scraping websites and by accessing the APIs of online databases and social network services.
In the third part of the course, we will cover data management and the basic methodology of data analysis. We will cover database design, especially relational databases, using substantive examples across a variety of fields. Students are introduced to SQL through MySQL, and programming assignments in this unit of the course are designed to ensure that students learn to create, populate and query an SQL database. We will introduce NoSQL using MongoDB and the JSON data format for comparison. For both types of database, students will be encouraged to work with data relevant to their own interests as they learn to create, populate and query data. Next, we will step through a complete workflow including data cleaning and transformation, illustrating many of the practical challenges faced at the outset of any data analysis or data science project. The course concludes with a discussion of performance issues in computation, with a particular focus on parallel computing.
This class is supported by DataCamp, the most intuitive learning platform for data science. Learn R, Python and SQL the way you learn best, through a combination of short expert videos and hands-on-the-keyboard exercises. Take more than 100 courses by expert instructors on topics such as importing data, data visualization or machine learning, and learn faster through immediate and personalised feedback on every exercise.
Students will be expected to complete five structured weekly problem sets, each begun in the staff-led lab sessions and finished outside of class. Answers should be formatted and submitted for assessment. One or more of these problem sets will be completed in collaboration with other students.
Take-home exam (50%) and in-class assessment (50%).
The marked problem sets make up the in-class assessment, providing 50% of the final mark.
Assignments will be marked using the following criteria:
70–100: Very Good to Excellent (Distinction). Perceptive, focused use of a good depth of material with a critical edge. Original ideas or structure of argument.
60–69: Good (Merit). Perceptive understanding of the issues, plus a coherent, well-read and stylish treatment, though lacking originality.
50–59: Satisfactory (Pass). A “correct” answer based largely on lecture material. Little detail or originality but presented in adequate framework. Small factual errors allowed.
30–49: Unsatisfactory (Fail) and 0–29: Unsatisfactory (Bad fail). Based entirely on lecture material, but unstructured and with an increasing error component. Concepts are disordered or flawed. Poor presentation. Errors of concept and scope, or poor knowledge, structure and expression.
Some of the assignments will involve shorter questions, to which the answers can be relatively unambiguously coded as (fully or partially) correct or incorrect. In the marking, these questions may be further broken down into smaller steps and marked step by step. The final mark is then a function of the proportion of question parts which have been answered correctly. In such marking, the principle of partial credit is observed as far as feasible. This means that an answer to a part of a question will be treated as correct when it is correct conditional on answers to other parts of the question, even if those other parts have been answered incorrectly.
Detailed Course Schedule
1. Introduction to data
In the first week, we will introduce the basic concepts of the course, including how data is recorded, stored, and shared. Because the course relies fundamentally on GitHub, a collaborative code and data sharing platform, we will introduce the use of git and GitHub, using the lab session to guide students through setting up an account and subscribing to the course organisation and assignments.
This week will also introduce basic data types, in a language-agnostic manner, from the perspective of machine implementations through to high-level programming languages. We will then focus on how basic data types are implemented in R.
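As a concrete illustration of the week's theme, here is a minimal sketch of basic data types. It is written in Python so it is self-contained; the lab work itself uses R, where the rough analogues are integer, double, character, and logical vectors, and lists.

```python
# Basic scalar data types, as implemented in a high-level language
x_int = 42            # integer
x_float = 42.0        # double-precision floating point
x_str = "forty-two"   # character string
x_bool = True         # logical / Boolean

# Python's type() is the rough analogue of R's typeof()
print(type(x_int).__name__)    # int
print(type(x_float).__name__)  # float

# A heterogeneous container: a Python list behaves like an R list,
# holding values of different types side by side
record = [x_int, x_float, x_str, x_bool]
print([type(v).__name__ for v in record])
```

Note that `42` and `42.0` print identically but are stored differently at the machine level, which is exactly the distinction this week draws out.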
- Lecture slides
- git and GitHub notes
- R example: Introduction to RMarkdown (also available as .Rmd source)
- R example: vectors, lists, data frames
- Wickham, Hadley. N.d. Advanced R, 2nd ed. Chapter 3, Names and values; Chapter 4, Vectors; and Chapter 5, Subsetting (Ch. 2-3 of the print edition).
- GitHub Guides, especially: “Understanding the GitHub Flow”, “Hello World”, and “Getting Started with GitHub Pages”.
- GitHub. “Markdown Syntax” (a cheatsheet).
- Lake, P. and Crowther, P. 2013. Concise guide to databases: A Practical Introduction. London: Springer-Verlag. Chapter 1, Data, an Organizational Asset
- Nelson, Meghan. 2015. “An Intro to Git and GitHub for Beginners (Tutorial).”
- Jim McGlone, “Creating and Hosting a Personal Site on GitHub: A step-by-step beginner’s guide to creating a personal website and blog using Jekyll and hosting it for free using GitHub Pages.”
- Installing git and setting up an account on GitHub
- How to complete and submit assignments using GitHub Classroom
- Forking and correcting a broken RMarkdown file
- Cloning a website repository, modifying it, and publishing a personal webpage
2. The shape of data
This week moves beyond the rectangular format common in statistical datasets, modeled on a spreadsheet, to cover relational structures and the concept of database normalization. We will also cover ways to restructure data from “wide” to “long” format, within strictly rectangular data structures. Additional topics concerning text encoding, date formats, and sparse matrix formats are also covered.
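The wide-to-long restructuring described above can be sketched in a few lines. In class this is done in R with the reshape2 package; the plain-Python version below, using invented country/year data, shows the same idea:

```python
# Wide format: one row per country, one column per year (hypothetical data)
wide = [
    {"country": "UK", "2018": 100, "2019": 110},
    {"country": "FR", "2018": 90,  "2019": 95},
]

def melt(rows, id_col, value_name="value"):
    """Turn every non-id column into an (id, variable, value) row,
    mirroring what reshape2::melt() does in R."""
    long_rows = []
    for row in rows:
        for key, val in row.items():
            if key != id_col:
                long_rows.append({id_col: row[id_col],
                                  "variable": key,
                                  value_name: val})
    return long_rows

# Long format: one row per (country, year) pair
long = melt(wide, "country")
print(len(long))  # 4 rows: 2 countries x 2 years
```

The long format is often what plotting and modelling functions expect, which is why this reshaping step appears at the start of so many analyses.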
- Wickham, Hadley and Garett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly. Part II Wrangle, Tibbles, Data Import, Tidy Data (Ch. 7-9 of the print edition).
- The reshape2 package for R.
Lab: Reshaping data in R
Assignment 1: Data cleaning in R. Deadline: October 18.
3. Cloud computing
In this week, we focus on setting up computation environments on the Internet. We will introduce cloud-computing concepts and learn why the big shift to the cloud is occurring in industry and how it is relevant to us as data scientists. In the lab, we will have an introduction to cloud environment setup using Amazon Web Services. We will sign up for an account, launch a cloud computing environment, create a webpage, and set up a statistical computing environment.
- Rajaraman, V. 2014. “Cloud Computing.” Resonance 19(3): 242–58.
- AWS: What is cloud computing.
- Azure: Developer guide.
- Ruparelia, Nayan. 2016. Cloud Computing. MIT Press. Ch. 1-3.
- Botta, Alessio, Walter De Donato, Valerio Persico, and Antonio Pescapé. 2016. “Integration of Cloud Computing and Internet of Things: A Survey.” Future Generation Computer Systems 56: 684–700.
Lab: Working with AWS
- Set up an AWS account (link from Moodle for AWS Educate free account)
- Secure the account
- Configure an EC2 instance
- Work with the EC2 instance
- Log in to the EC2 Linux console
- Set up a web server
- Install R and some packages
- Stop the instance
- Link to the GitHub Classroom
4. HTML and CSS
- Lazer, David, and Jason Radford. 2017. “Data Ex Machina: Introduction to Big Data.” Annual Review of Sociology 43(1): 19–39.
- Howe, Shay. 2015. Learn to Code HTML and CSS: Develop and Style Websites. New Riders. Chs 1-8.
- Kingl, Arvid. 2018. Web Scraping in R: rvest Tutorial.
- Munzert, Simon, Christian Rubba, Peter Meissner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Hoboken, NJ/Chichester, UK: Wiley & Sons. Ch. 2-4, 9.
- Severance, Charles Russell. 2015. Introduction to Networking: How the Internet Works. Self-published.
- Duckett, Jon. 2011. HTML and CSS: Design and Build Websites. New York: Wiley.
Lab: Web scraping 1
- Scraping tables
- Scraping unstructured data
Assignment 2: Web scraping
- Link to the GitHub classroom. Deadline: Friday, November 1.
5. Using data from the Internet
Continuing from the material covered in Week 4, we will cover advanced topics in web scraping. These include scraping documents in XML (such as RSS feeds), scraping websites that require authentication, and scraping websites with non-static components.
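As a taste of XML handling, here is a minimal sketch using Python's standard-library XML parser on an invented RSS-style snippet; the course labs use R packages such as xml2 for the same task, and a real feed would be fetched over HTTP before parsing:

```python
import xml.etree.ElementTree as ET

# A minimal RSS-style document (contents invented for illustration)
rss = """<rss><channel>
  <item><title>First post</title><link>https://example.com/1</link></item>
  <item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""

# Parse the string into an element tree, then walk the <item> nodes
root = ET.fromstring(rss)
titles = [item.findtext("title") for item in root.iter("item")]
print(titles)  # ['First post', 'Second post']
```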
- Gollapudi, Sai Swapna. 2018. Learn Web Scraping and Browser Automation Using RSelenium in R.
- Wickham, Hadley. 2015. Parse and process XML (and HTML) with xml2
- Schouwenaars, Filip. 2015. Web Scraping with R and PhantomJS.
Lab: Group work on first five weeks
- Coming soon
6. Reading week
7. Working with APIs
How to work with Application Programming Interfaces (APIs), which offer developers and researchers access to data in a structured format. Our running examples will be the New York Times API and the Twitter API.
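Most APIs return JSON. The sketch below, in Python with an invented payload (not an actual New York Times or Twitter response), shows the typical pattern of parsing a response body and extracting the fields of interest:

```python
import json

# An invented JSON payload mimicking the shape of an API response;
# in practice this string would be the body of an HTTP GET request
response_body = """{
  "status": "OK",
  "docs": [
    {"headline": "Example story", "pub_date": "2019-10-01"},
    {"headline": "Another story", "pub_date": "2019-10-02"}
  ]
}"""

# Deserialise the JSON text into nested Python dicts and lists
data = json.loads(response_body)
headlines = []
if data["status"] == "OK":
    headlines = [doc["headline"] for doc in data["docs"]]
print(headlines)
```

The R workflow in class is analogous: fetch, parse the JSON, then flatten the nested structure into a data frame.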
- Steinert-Threlkeld. 2018. Twitter as Data. Cambridge University Press.
- Ruths and Pfeffer. 2014. Social media for large studies of behavior. Science.
- Interacting with the New York Times API
- Interacting with Twitter’s REST and Streaming API
Assignment 3: APIs
8. Textual data
We will learn how to work with unstructured data in the form of text, and how to deal with format conversion, encoding problems, and serialization. We will also cover search and replace operations using regular expressions, as well as the most common textual data types in R and Python.
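Search-and-replace with regular expressions can be previewed with a short example. This sketch uses Python's re module with a deliberately simplified email pattern; R's grepl/gsub and the stringr package work analogously:

```python
import re

text = "Contact: alice@example.com, bob@example.org"

# Search: extract every email address (simplified pattern for illustration)
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
print(emails)  # ['alice@example.com', 'bob@example.org']

# Replace: redact the local part of each address, keeping the domain
redacted = re.sub(r"[\w.]+@", "***@", text)
print(redacted)  # Contact: ***@example.com, ***@example.org
```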
- Benoit, Kenneth. 2019. “Text as Data: An Overview.” Forthcoming in Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage.
- Group work with textual data.
9. Data visualisation
The lecture this week will offer an overview of the principles of exploratory data analysis through (good) data visualization. In the seminars, we will practice producing our own graphs using ggplot2.
- Wickham, Hadley and Garett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly. Data visualization, Graphics for communication (Ch. 1 and 22 of the print edition).
- Hughes, A. (2015) “Visualizing inequality: How graphical emphasis shapes public opinion” Research and Politics.
- Tufte, E. (2002) “The visual display of quantitative information”.
- Data visualization with ggplot2.
- Github Classroom: TBC
10. Creating and managing databases
This session will offer an introduction to relational databases: their structure, logic, and main types. We will learn how to write SQL, a query language for this type of database that is currently employed by most tech companies, and how to use it from R via the DBI package.
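As a preview of the create-populate-query cycle, here is a self-contained Python sketch using an in-memory SQLite database as a stand-in for the MySQL server used in class; the table and data are invented, but the SQL statements carry over almost unchanged:

```python
import sqlite3

# In-memory SQLite database: nothing to install or connect to
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create and populate a table (hypothetical social-media data)
cur.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, likes INTEGER)"
)
cur.executemany(
    "INSERT INTO posts (author, likes) VALUES (?, ?)",
    [("alice", 10), ("bob", 3), ("alice", 7)],
)

# Query: total likes per author
cur.execute(
    "SELECT author, SUM(likes) FROM posts GROUP BY author ORDER BY author"
)
rows = cur.fetchall()
print(rows)  # [('alice', 17), ('bob', 3)]
conn.close()
```

In R, the same pattern runs through DBI: `dbConnect()`, `dbWriteTable()`, and `dbGetQuery()` with the identical SQL string.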
- Beaulieu. 2009. Learning SQL. O’Reilly. (Chapters 1, 3, 4, 5, 8)
- Stephens et al. 2009. Teach Yourself SQL in One Hour a Day. Sams Publishing.
- Analyzing public Facebook data in a SQLite database
11. Interacting with online databases
This week, we will dive deeper into databases. In particular, this week covers the following topics: how to set up and use relational databases in the cloud, how to perform big data analytics with data warehousing services (e.g. Google BigQuery), and the fundamentals of NoSQL databases.
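The contrast between relational tables and NoSQL document stores can be sketched without any database at all. Below, a Python list of dicts mimics a MongoDB collection of schema-free JSON documents; the collection name and fields are invented for illustration:

```python
# A document "collection" as a list of dicts, mimicking how a NoSQL
# store such as MongoDB holds schema-free JSON documents
posts = [
    {"author": "alice", "likes": 10, "tags": ["data", "sql"]},
    {"author": "bob",   "likes": 3},                 # no "tags" field: allowed
    {"author": "alice", "likes": 7, "tags": ["nosql"]},
]

# The rough analogue of db.posts.find({"author": "alice"}) in MongoDB
alice_posts = [doc for doc in posts if doc.get("author") == "alice"]
print(len(alice_posts))  # 2

# Unlike rows in a relational table, documents need not share a schema,
# so missing fields are handled with a default rather than NULL columns
print([doc.get("tags", []) for doc in posts])
```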
- Beaulieu. 2009. Learning SQL. O’Reilly. (Chapter 2)
- Hows, Membrey, and Plugge. 2014. MongoDB Basics. Apress. (Chapter 1)
- Tigani and Naidu. 2017. Google BigQuery Analytics. Wiley. (Chapters 1-3)
- MongoDB Basics on edX
- Analyzing Big Data in less time with Google BigQuery on YouTube
- SQL JOINs, subqueries, and BigQuery
Assignment 5: Databases.
- Deadline: December 19, 15:00.
Take-home exam deadline: January 17, 14:00.