Merge branch 'master' into patch-2
brohrer authored Jan 24, 2019
2 parents 81cbe93 + 6e69a76 commit 6f70e54
Showing 3 changed files with 23 additions and 13 deletions.
19 changes: 12 additions & 7 deletions curriculum_topics_draft.md → curriculum_roadmap.md
@@ -1,31 +1,33 @@
# Draft -- Data science curriculum roadmap -- Draft
# Data science curriculum roadmap

We venture to suggest a curriculum roadmap only after receiving multiple requests. As a group, we have spent the vast majority of our time in industry, although many of us have had spent time in one academic capacity or another. What follows is a set of broad recommendations, and it will inevitably require a lot of adjustments in each case for implementation. Given that caveat, here are our curriculum recommendations.
We venture to suggest a curriculum roadmap after receiving multiple requests for one from academic partners. As a group, we have spent the vast majority of our time in industry, although many of us have spent time in one academic capacity or another. What follows is a set of broad recommendations, and it will inevitably require a lot of adjustments in each implementation. Given that caveat, here are our curriculum recommendations.

### More application than theory

We want to lead by emphasizing that the single most important factor in preparing students to apply their knowledge in an industry setting is application-centric learning. Working with realistic data to answer realistic questions is their best preparation. It grounds abstract concepts in hands-on experience, and it teaches data mechanics and data intuition at the same time, something that is impossible to do in isolation.

With that in mind, there are definitely a list of topics that prepare one well to practice data science.
With that as a foundation, we present a list of topics that prepare one well to practice data science.

## Curriculum archetypes

The types of data science and data centric academic programs closely mirror [the major skill areas](what_DS_do.md) we identified in our work. There are programs that emphasize **engineering**, programs that emphasize **analytics**, and programs that emphasize **modeling**. The distinction between these is that analytics focuses on the question of what can we learn from our data, modeling focuses on the problem of estimating data we wish we had, and engineering focuses on how to make it all run faster, more efficiently, and more robustly.
The types of data science and data-centric academic programs closely mirror [the major skill areas](what_DS_do.md) we have identified in our work. There are programs that emphasize **engineering**, programs that emphasize **analytics**, and programs that emphasize **modeling**. The distinction between these is that analytics focuses on the question of what we can learn from our data, modeling focuses on the problem of estimating data we wish we had, and engineering focuses on how to make it all run faster, more efficiently, and more robustly.

There are also **general data science programs** that cover all these areas to some degree. In addition, there are quite a few **domain-specific programs**, which teach a subset of engineering, analytics, and modeling skills specific to a given field.

![Data program archetypes](program_archetypes.png)

The curriculum recommendations for each of these program archetypes will be different. However, all of them will share a core of foundational topics. Then analytics, engineering, and modeling-centric programs will have additional topic areas of their own. A general curriculum will include some aspects of the analytics, engineering, and modeling curricula, although perhaps not to the same depth. It is common for students to self select courses from any combination of the three areas.
The curriculum recommendations for each of these program archetypes will be different. However, all of them will share some core topics. Then analytics, engineering, and modeling-centric programs will have additional topic areas of their own. A general curriculum will include some aspects of the analytics, engineering, and modeling curricula, although perhaps not to the same depth. It is common for students to self-select courses from any combination of the three areas.

Curricula for domain specific programs look similar to a general program, except that topics, and even entire courses, will be focused on specific skills common to the area. For instance, an actuarial-focused data analytics program would likely include software tools most commonly used in insurance companies, time series and rare-event prediction algorithms, and visualization methods that are accepted throughout the insurance industry. The student can best practice their skills through a project based on real domain specific data. Hands-on projects or internships are highly recommended. When designing the programs, institutions may also consider offering interdisciplinary degrees and programs. Domain specific programs often combine courses from multiple departments or colleges.
Curricula for domain-specific programs look similar to a general program, except that topics, and even entire courses, will be focused on specific skills common to the area. For instance, an actuarial-focused data analytics program would likely include software tools most commonly used in insurance companies, time series and rare-event prediction algorithms, and visualization methods that are accepted throughout the insurance industry. Students can best practice their skills through a project based on real domain-specific data. Hands-on projects or internships are highly recommended. When designing the programs, institutions may also consider offering interdisciplinary degrees and programs. Domain-specific programs often combine courses from multiple departments or colleges.

Here are the major topics we suggest including in each area, with some of the particularly important subtopics enumerated.

## Foundational topics
* Programming (a short illustrative sketch follows this list)
* File and data manipulation
* Scripting
* Plotting
* Basic database queries
* Probability and statistics
* Probability distributions
* Hypothesis testing
@@ -34,6 +36,9 @@ Here are the major topics we suggest including in each area, with some of the particularly important subtopics enumerated.
* Algebra
* Data ethics
* Data interpretation and communication
* Presentation
* Technical writing
* Data concepts for non-technical audiences

## Analytics topics
* Advanced statistics
@@ -63,7 +68,7 @@ Here are the major topics we suggest including in each area, with some of the particularly important subtopics enumerated.
* Data structures
* Database design
* Data modeling
* Database queries (SQL)
* Advanced database queries (see the sketch after this list)
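
As a small, hedged illustration of database design, data modeling, and a more advanced query (a join with grouping and filtering), the sketch below uses Python's built-in `sqlite3` module. The `customers`/`orders` schema and its values are invented for illustration.

```python
# A small sketch of database design plus a more advanced query
# (join + aggregation + filtering), using SQLite so it runs anywhere.
# The customers/orders schema and its rows are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'east'), (2, 'west');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 12.5);
    """
)

# Advanced query: join the tables, aggregate per region, and keep only
# regions whose total spending exceeds a threshold.
rows = con.execute(
    """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o JOIN customers AS c ON o.customer_id = c.id
    GROUP BY c.region
    HAVING SUM(o.amount) > 15
    ORDER BY total DESC
    """
).fetchall()

for region, n_orders, total in rows:
    print(region, n_orders, total)
```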

## Modeling topics
* Linear algebra
8 changes: 4 additions & 4 deletions use_cases.md
@@ -32,18 +32,18 @@
* **Privacy Assurance**
* **Intrusion Detection**
* **Phishing Detection**
* **Malware Prediction**
* **Unified Host and Network Dataset**
* **Malware Target Prediction**
* **Malware Classification**

### Relevant data sets
* **[KDD CUP 99](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)**: The task is to develop a network intrusion detector that protects a computer network from unauthorized users, possibly including insiders. The learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between _bad_ connections, called intrusions or attacks, and _good_ normal connections (a minimal classifier sketch appears after this list). The 1998 DARPA Intrusion Detection Evaluation Program, prepared and managed by MIT Lincoln Labs, surveyed and evaluated research in intrusion detection. It provided a standard set of data to be audited, including a wide variety of intrusions simulated in a military network environment. The 1999 KDD intrusion detection contest uses a version of this dataset.
* **[NSL-KDD](https://www.unb.ca/cic/datasets/nsl.html)**: NSL-KDD is a data set proposed to solve some of the inherent problems of the KDD'99 data set. Although this new version still suffers from some of the problems discussed by McHugh and may not perfectly represent existing real networks, its authors believe that, given the lack of public data sets for network-based IDSs, it can still serve as an effective benchmark for comparing intrusion detection methods. Furthermore, the number of records in the NSL-KDD train and test sets is reasonable, which makes it affordable to run experiments on the complete set without randomly selecting a small portion. Consequently, evaluation results from different research work will be consistent and comparable.
* **[UNSW-NB15](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/)**: This data set is an advancement over the two data sets mentioned above, KDD CUP 99 and NSL-KDD. It captures more realistic features and far more instances than either.
* **[Phishing Websites](https://archive.ics.uci.edu/ml/datasets/phishing+websites)**: In this dataset, the authors shed light on features that have proved sound and effective in predicting phishing websites, and they propose some new features.
* **[Malware Prediction](https://www.kaggle.com/c/microsoft-malware-prediction/)**
* **[Malware Target Prediction](https://www.kaggle.com/c/microsoft-malware-prediction/)**: This Kaggle dataset challenges users to predict whether a machine will soon be hit with malware.
* **[Malware Classification](https://github.com/EndgameInc/ember)**: Static features (computed without executing the file), designed with input from domain experts, are extracted from malicious, benign, and _unlabeled_ samples and used to classify a future test set.
* **[Unified Host and Network Dataset](https://csr.lanl.gov/data/2017.html)**: This dataset contains a subset of (anonymized) network and computer events collected from the Los Alamos National Laboratory enterprise network over the course of approximately 90 days. It is useful because the computer host and network data are co-occurring.
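
As a rough sketch of the intrusion detection task described for KDD CUP 99 and NSL-KDD above, the snippet below trains a classifier to separate intrusions from normal connections. It assumes the data has already been downloaded to a local `kddcup.csv` file with a `label` column in which normal traffic is marked `normal`; the file name, column name, and the choice of a random forest are assumptions made for this example, not part of the dataset documentation.

```python
# A minimal sketch of the intrusion detection task: train a classifier
# that separates "bad" connections from normal ones.
# Assumes a local kddcup.csv with a "label" column (an assumption).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("kddcup.csv")                  # hypothetical local copy
X = df.drop(columns=["label"])
y = (df["label"] != "normal").astype(int)       # 1 = intrusion, 0 = normal

# One-hot encode the categorical fields (protocol, service, flag, ...)
# and pass the numeric fields through unchanged.
categorical = X.select_dtypes(include="object").columns.tolist()
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = Pipeline([
    ("pre", preprocess),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```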


## Economic development

## Finance
9 changes: 7 additions & 2 deletions what_DS_do.md
Expand Up @@ -11,7 +11,7 @@ Data science skill groups can be broken out as analysis, modeling, engineering,
Here are some representative skills in each group.


## **Analysis** – the process of turning raw information into knowledge that can be acted on.
## **Analysis** – the process of turning raw information into insights that can be acted on.

### Domain knowledge

@@ -26,13 +26,18 @@ Their experience lets them **anticipate how things can fail**, catch effects tha
Even after a problem is defined and a good quantitative question is formulated, **gathering the data** can require quite a bit of work and creativity.
It can involve searching through known repositories, interviewing colleagues, delving deeply into documentation, and **sorting through data stores** to identify the relevant portions. If the data doesn't yet exist, research can also involve **designing and conducting experiments** to collect it.

### Exploration

It's common for a data scientist to be presented with data of an unknown nature. Before it can be used to answer questions, it is first necessary to find out what type of information it contains. Exploration is the art of delving into a new data set to get a sense of its quality and extent. This is typically done by a combination of aggregation, slicing, and visualization. The emphasis is on making a quick survey, rather than diving into rigorous or exhaustive analysis. Often during this process insights emerge serendipitously.
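
As one possible sketch of such a quick survey, the snippet below sizes up an unfamiliar table with pandas through aggregation, slicing, and a coarse visualization. The file name `new_data.csv` is a placeholder.

```python
# A quick, non-exhaustive survey of an unfamiliar data set:
# extent, quality, aggregation, slicing, and a coarse visualization.
# "new_data.csv" is a placeholder file name.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("new_data.csv")

# Extent and quality: shape, column types, and how much is missing.
print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head(10))

# Aggregation: summary statistics for the numeric columns.
print(df.describe())

# Slicing: peek at a handful of records to get a feel for them.
print(df.sample(5, random_state=0))

# Visualization: a coarse histogram for every numeric column.
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.savefig("exploration_histograms.png")
```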

### Interpretation

Even after it is collected, data is not yet useful. Interpretation is the art of crossing the gap between stacks of numbers and what they actually signify.
**Summarization and aggregation** along appropriate dimensions are often required. Carefully selecting what to report is critical to clarity. Knowing what to omit is as important as anything else.
Answering the original quantitative question can require **statistical tools** like hypothesis testing, A/B testing, and confidence intervals.
**Visualization** - turning data into a picture - is a powerful way to convey the message behind a table of numbers.
For some use cases, a carefully constructed plot is all that is necessary to answer the question.
**Communication** is the act of bridging the gap between the analyst and the decision maker. Good data storytelling requires a keen focus on three points: 1) Who is your audience? 2) What do you want your audience to know or do? and 3) How can you use data to help make your point?
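
As a small worked example of the statistical tools mentioned above, the sketch below runs a two-proportion z-test and builds a 95% confidence interval for a hypothetical A/B test. All counts are invented for illustration.

```python
# A hypothetical A/B test: did variant B convert better than variant A?
# All counts are made up for illustration.
from math import sqrt

from scipy.stats import norm

conversions_a, visitors_a = 120, 2400   # variant A (control)
conversions_b, visitors_b = 155, 2380   # variant B (treatment)

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Hypothesis test: two-proportion z-test with a pooled standard error.
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se_pool
p_value = 2 * (1 - norm.cdf(abs(z)))

# 95% confidence interval for the difference in conversion rates.
se_diff = sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)
low, high = (p_b - p_a) - 1.96 * se_diff, (p_b - p_a) + 1.96 * se_diff

print(f"lift = {p_b - p_a:.4f}, z = {z:.2f}, p = {p_value:.4f}")
print(f"95% CI for the lift: [{low:.4f}, {high:.4f}]")
```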


## **Modeling** – the process of using the data we have to estimate the data we wish we had.
Expand Down
