Stat 597A Introduction to Statistical Computing

John Staudenmayer (Office LGRT 1435K, Phone 545 0999)

Office hours: Wed and Fri 11AM, or by appointment.

jstauden at math.umass.edu

www.math.umass.edu/~jstauden/stat597A.html

 

Textbook

The Art of R Programming, by Norman Matloff. We will use the book a lot. Additional readings will be assigned during the semester.

 

Software & Prerequisites

This class will mostly be about using R to do modern statistical analyses and data management. (You need a computer with R installed! RStudio is strongly recommended.) We will learn to use R to run existing programs, and we will also learn to write our own programs and functions. Effective programming strategies and principles will be emphasized.

This class does not assume a lot of programming background. We will cover the core ideas of programming (functions, objects, data structures, flow control, input and output, debugging, logical design and abstraction) through writing code to assist in numerical and graphical statistical analyses. If you already know a lot about programming, much of the class may be review. The class will assume that students know the basic concepts of statistical thinking (with data) and basic probability.
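
As a small illustration of how several of these ideas fit together, the sketch below defines a function, stores data in a vector, and uses a loop with a conditional. The function name count_above and the example values are made up for illustration; they are not course material.

```r
# Hypothetical example: count how many values in a numeric vector
# exceed a threshold.
count_above <- function(x, threshold) {
  n <- 0
  for (value in x) {           # flow control: loop over the elements
    if (value > threshold) {   # flow control: conditional
      n <- n + 1
    }
  }
  n                            # the last expression is the return value
}

scores <- c(3.2, 7.5, 1.1, 9.8, 6.4)   # data structure: a numeric vector

count_above(scores, 5)   # explicit loop
sum(scores > 5)          # vectorized equivalent, usually preferred in R
```

Both calls give the same count; the second form shows the vectorized style that R code tends to favor over explicit loops.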

 

Classes & Assignments

Classes will consist of lectures and labs. A tentative schedule and topic list appear below. I will try to post the lecture notes on the web after class. The labs will consist of group activities done on your computers during class time. Each lab will have an assignment to be handed in. Problem sets will also be assigned approximately weekly.

 

All assignments must be turned in electronically, via email. All assignments will involve writing a combination of code and actual prose. You must submit your assignment in a format that allows for the combination of the two, and the automatic execution of all your code. The easiest way to do this is to use R Markdown (http://rmarkdown.rstudio.com).
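
To make the format concrete, here is a minimal sketch that writes a tiny R Markdown file from R and then renders it. The file name, title, and contents are hypothetical, not an actual assignment. The rendering step assumes the rmarkdown package and pandoc are installed (RStudio bundles pandoc) and is skipped otherwise.

```r
# Sketch: create a minimal R Markdown file, then (optionally) render it.
# The file name and contents are hypothetical examples.
fence <- strrep("`", 3)   # three backticks open and close a code chunk

rmd_lines <- c(
  "---",
  "title: \"Problem Set 1\"",
  "output: html_document",
  "---",
  "",
  "The mean of the first ten integers is computed below.",
  "",
  paste0(fence, "{r}"),
  "mean(1:10)",
  fence
)

rmd_file <- file.path(tempdir(), "ps1-example.Rmd")
writeLines(rmd_lines, rmd_file)

# Rendering runs every chunk and weaves the results into the output document.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

In practice you would write the .Rmd file directly in an editor such as RStudio; the point is that the prose and the code live in one file, and rendering re-runs all of the code.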

 

Projects

There will be a midterm project (assigned 10/23 and due 10/30) and a final project (assigned 11/16 and due before the end of final exams). The midterm project will be done individually, and the final project will be done in small assigned groups. The last few classes will consist of project presentations.

 

Grading

Approximately weekly problem sets and labs: 30%

Midterm Project: 30%

Final Project: 40%

 

Some R Resources

In addition to the textbook, there are many online resources for learning about R and working with it:

•   The official intro, "An Introduction to R", available online in HTML and PDF

•   John Verzani, "simpleR", in PDF

•   Google R Style Guide offers some rules for naming, spacing, etc., which are generally good ideas

•   Quick-R. This is primarily aimed at those who already know a commercial statistics package like SAS, SPSS or Stata, but it's very clear and well-organized, and others may find it useful as well.

•   Patrick Burns, The R Inferno. "If you are using R and you think you're in hell, this is a map for you."

•   Thomas Lumley, "R Fundamentals and Programming Techniques" (large PDF)

•   The website Software Carpentry is not specifically R related, but contains a lot of valuable advice and information on scientific programming.

 

Important Note

A lot of this class (and this information!) is based on a class at Carnegie Mellon:

Shalizi, C. R. and Thomas, A. C. (2014), "Statistical Computing 36-350: Beginning to Advanced Techniques in R", http://www.stat.cmu.edu/~cshalizi/statcomp/14

 

Tentative Schedule / Class Outline

 

Data types and data structures

Lecture 1 (Sept 9): Simple data types and structures

Lecture 1 R Markdown File

Lecture 2 (Sept 11): Bigger data structures

Lecture 2 R Markdown File

Problem set 1 (due 9/18)

Lab 1 R Markdown File (Sept 14)

 

Flow control and looping

Lecture 3 (Sept 16): Data Frames and Control

Lecture 3 R Markdown File

Lab 2 R Markdown File (Sept 18)

 

Text

Lecture 4 (Sept 21): Text basics

Lecture 4 R Markdown File

Text file for lecture code.

Problem set 2 (due 9/25)

Lecture 5 (Sept. 23): Regular expressions

Lecture 5 R Markdown File

Lab 3 R Markdown File (Sept 25)

rich.html

Problem set 3 will not occur this week.

 

Writing and calling functions

Lecture 6 (Sept 28): Writing functions

Lecture 6 R Markdown File

gmp.dat (for example)

Lecture 7 (Sept 30): Multiple functions

Lecture 7 R Markdown File

Lab 4 R Markdown File (Oct 2)

 

Data from elsewhere

Lecture 8 (Oct 7): Getting data

Lecture 8 R Markdown File

huh.txt

Lab 5 R Markdown File (Oct 9)

Lab 5 pdf (Oct 9)

wtid-report.csv

 

Fitting and using statistical models

Lecture 10 (Oct 13): Random number generation

Lecture 10 R Markdown File

Lab 6 R Markdown File (Oct 16)

Lab 6 pdf (Oct 16)

Midterm R Markdown File (due Oct 30)

Midterm (due Oct 30)

Top 250 data (use for problem 2 if you get stuck on problem 1)

 

Changing Shapes

Lecture 11 (Oct 19): Timing code and a start at apply() functions

Lecture 11 R Markdown File

Lecture 12 (Oct 21): More with the apply family

Lecture 12 R Markdown File

strikes.csv

Lab 7 R Markdown File (Oct 23)

Lab 7 pdf (Oct 23)

debt.csv

10/26 plyr: Lecture 13 R Markdown File

10/28 plyr: Lecture 14 R Markdown File

Lab 8 R Markdown File (Oct 30)

Lab 8 pdf (Oct 30)

Mid-semester project due Oct 30

 

Functions of functions, and optimization

Lecture 15 (Nov 4): Functions as objects

11/04 functions as objects: Lecture 15 R Markdown File

Lab 9 R Markdown File (Nov 6)

Lab 9 pdf (Nov 6)

11/09 Optimization: Lecture 16 R Markdown File

 

More optimization

Lab 10 pdf (Nov 16)

11/18 Optimization Example: Lecture 17 R Presentation File

11/25 Databases and R: R Presentation File

Baseball.db