Welcome to part 2 of STA 380, a course on machine learning in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.
Instructors:
- Dr. James Scott. Office hours on M T W, 12:30 to 1:15 PM, WEL 5.228G.
- Dr. David Puelz. Office hours on M T W, 4:00 to 4:45 PM, CBA 6.444.
The exercises are available here. They are due Sunday, August 18th at 11:59 PM, U.S. Central Time. Pace yourself over the next few weeks, and start early on the first couple of problems!
Slides: The data scientist's toolbox
Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.
Readings:
- Introduction to RMarkdown
- RMarkdown tutorial
- Introduction to GitHub
- Getting started with GitHub Desktop
- Jeff Leek's guide to sharing data
Your assignment after the first class day is to get yourself up and running on GitHub, if you're not already.
Slides: Some fun topics in probability
Optional reference: Chapter 1 of these course notes. The notes as a whole contain a lot more technical material, but Chapter 1 covers the basics of what every data scientist should know about probability.
Topics: plotting pitfalls; the grammar of graphics; data visualization with R.
Slides:
R materials:
- Lessons 4-6 of Data Science in R: A Gentle Introduction. You'll find lesson 5 a bit basic, so feel free to breeze through it. The main things you need to take away from lesson 5 are the use of pipes (`%>%`) and the `summarize` function.
- Some R examples can be found in datavis_intro.R and nycflights_wrangle.R.
Intro to neural network slides here. Jupyter notebooks here.
Basics of clustering; K-means clustering; hierarchical clustering; spectral clustering
Slides: Introduction to clustering.
Scripts and data:
Readings:
- ISL Sections 10.1 and 10.3 or Elements Chapter 14.3 (more advanced)
- K-means++ original paper or simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization.
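
To make the k-means++ idea concrete, here is a minimal sketch of the seeding step only. It is written in plain Python rather than R so that it runs with no packages, and it is not taken from the course scripts; the function names `sq_dist` and `kmeans_pp_init` are made up for illustration. The first center is chosen uniformly at random, and each later center is sampled with probability proportional to its squared distance from the nearest center chosen so far.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers from `points` using k-means++ style seeding.

    Hypothetical helper for illustration; `rng` is an optional
    random.Random instance so that results can be reproduced.
    """
    rng = rng or random.Random()
    # First center: a uniformly random data point.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform.
            centers.append(rng.choice(points))
            continue
        # Far-away points are proportionally more likely to be picked.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
    print(kmeans_pp_init(pts, 3, random.Random(0)))
```

The returned centers would then be handed to an ordinary k-means loop. Because already-chosen points have zero weight, the seeds tend to spread across the data instead of landing in the same cluster.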
Principal component analysis (PCA). T-distributed stochastic neighbor embedding (tSNE).
Slides: Introduction to PCA and tSNE
Scripts and data for class:
- pca_intro.R
- nbc.R, nbc_showdetails.csv, nbc_pilotsurvey.csv
- congress109.R, congress109.csv, and congress109members.csv
- ercot_PCA.R, ercot.zip
- tSNE.ipynb
Readings:
- ISL Section 10.2 for the basics or Elements Chapter 14.5 (more advanced)
- Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor models, beyond what we covered in class.
Networks and association rule mining.
Slides: Intro to networks. Note: these slides refer to "lastfm.R" but this is the same thing as "playlists.R" below.
Software you'll need:
- Gephi, a great piece of software for exploring graphs
- The Gephi quick-start tutorial
Scripts and data:
- medici.R and medici.txt
- playlists.R and playlists.csv
- microfi.R, microfi_households.csv, and microfi_edges.txt.
Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).
Slides:
Scripts and data:
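
As a small illustration of the TF-IDF topic listed above, the sketch below computes one common variant of the weights: term frequency as a within-document proportion, multiplied by the log of (number of documents / number of documents containing the term). Other variants exist (smoothed or differently normalized), and this is not necessarily the one used in class. It is plain Python rather than R, and the function name `tf_idf` is a made-up helper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    `docs` is a list of tokenized documents (lists of strings).
    Hypothetical helper using tf = count / doc length and
    idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


if __name__ == "__main__":
    docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
    for weights in tf_idf(docs):
        print(weights)
```

A term that appears in every document (here "the") gets weight zero, which is the point of the IDF factor: it downweights words that do not help distinguish one document from another.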
Treatment effects; multi-armed bandits and Thompson sampling; high-dimensional treatment effects with the lasso.
Slides:
Scripts and data:
- mab.R and Ads_CTR_Optimisation.csv
- abortion.R and abortion.dat
- smallbeer.R and smallbeer.csv
- hockey.R and all files in `data/hockey/`
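
For the multi-armed bandit topic in this unit, here is a minimal sketch of Thompson sampling with Bernoulli rewards and uniform Beta(1, 1) priors. It is plain Python rather than R, it simulates its own rewards instead of reading Ads_CTR_Optimisation.csv, and it is not a transcription of mab.R; the function name `thompson_sampling` and the `true_rates` argument are assumptions made for illustration.

```python
import random


def thompson_sampling(true_rates, n_rounds, rng=None):
    """Simulate Thompson sampling on a Bernoulli bandit.

    `true_rates` holds each arm's success probability, which the
    algorithm never sees directly (hypothetical simulation input).
    Returns per-arm success and failure counts.
    """
    rng = rng or random.Random()
    n_arms = len(true_rates)
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its Beta posterior.
        draws = [rng.betavariate(wins[a] + 1, losses[a] + 1) for a in range(n_arms)]
        # Play the arm whose sampled rate is highest.
        arm = max(range(n_arms), key=lambda a: draws[a])
        # Observe a simulated reward and update that arm's counts.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses


if __name__ == "__main__":
    wins, losses = thompson_sampling([0.05, 0.5], 2000, random.Random(1))
    print("pulls per arm:", [w + l for w, l in zip(wins, losses)])
```

Sampling from the posterior, rather than always playing the arm with the best estimate so far, is what keeps the algorithm exploring arms it is still uncertain about while steadily shifting pulls toward the better one.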