General Overview

STAT 430: Topics in Applied Statistics is a topics course in the Department of Statistics at the University of Illinois.

A new course, Data Science Programming Methods, was first taught in the Spring 2019 term. It will be offered again in the Fall 2019 term. The instructor is Dirk Eddelbuettel, who also designed the course.

Brief Description

Statistics and Data Science are focused on making sense of data – and face an ever-increasing demand for their work. Yet at the same time, data sets increase in size and scope. Proper tooling is essential to meet these challenges, and as applied work in data analysis is in effect applied computational work, we will learn the computational tools and programming methods to meet these data science challenges. Proficiency at the shell, familiarity with git version control, sufficient understanding of SQL, and of course acquiring actual expertise in R programming are the goals of this course to prepare students for the coming computational challenges. We will use RStudio Cloud instances so students are not required to install and maintain all required components. Prior programming experience (in R or another language) will certainly be helpful, but is not a formal requirement for taking the course.

Objectives

A 2018 report by the National Academies of Sciences, Engineering, and Medicine stated:

Data science is emerging as a field that is revolutionizing science and industries alike. Work across nearly all domains is becoming more data driven, affecting both the jobs that are available and the skills that are required. As more data and ways of analyzing them become available, more aspects of the economy, society, and daily life will become dependent on data.

This course introduces key concepts for computational literacy in a data science context:

  • Basic shell operations: a core building block for all computing systems
  • Git for version control: for versioning source code, write-ups, and much more, such as social computing
  • R Programming: getting familiar with the best language and environment for _programming with data_
  • Reproducible computing: Markdown is at the core of this and incredibly versatile
  • Further extensions such as Docker, C++ and a little Rcpp

This course is fast paced, and we cover a considerable amount of material. It may still be a little rough at the edges, as it is only the second iteration of this course and is delivered online.

Format

  • Lectures, generally as (on-line) slides along with short videos
  • Self-study, with plenty of reading and coding to do
  • Six homeworks administered via the PrairieLearn system
  • Six quizzes in the CBTF facility, also using the PrairieLearn system
  • A self-directed group project demonstrating data science programming
  • On-line office hours as well as on-campus office hours

Note that the CBTF tests generally require an on-campus presence.