Merge pull request swcarpentry#79 from morgantaschuk/improving-sql-ta…

…schuk Improving (!) SQL lesson
abbycabs · May 5, 2015 · 5dd6ac3 · 5dd6ac3
2 parents 287a009 + bb20266
commit 5dd6ac3
Show file tree

Hide file tree

Showing 4 changed files with 155 additions and 162 deletions.
diff --git a/00-setup.md b/00-setup.md
@@ -5,47 +5,54 @@ subtitle: Setting up
 minutes: 15
 ---
 
-### Installation
+For the SQL lessons, 
+we will use the [SQLite](https://www.sqlite.org/) relational database management system, 
+which comes pre-installed on most operating systems. 
+While these instructions are specific to SQLite,
+most other database management systems
+(e.g. MySQL, Oracle, PostGreSQL)
+have similar functions for loading data and performing basic operations.
 
-#### Windows
 
-Download the [sqlite3 program](http://www.sqlite.org/download.html).
+First, download and install SQLite if it is not already installed on your operating system:
 
-#### Mac OS X
-
-<code>sqlite3</code> comes pre-installed on Mac OS X.
-
-#### Linux
-
-<code>sqlite3</code> comes pre-installed on Linux.
-
-### Downloading the example database
+* Windows: Download the [SQLite program](http://www.sqlite.org/download.html).
+* Mac OS X: <code>sqlite3</code> comes pre-installed on Mac OS X.
+* Linux: <code>sqlite3</code> comes pre-installed on Linux.
 
 Create a directory where you will carry out the exercises for this lesson, and
-change to it using the <code>cd</code> command.
-
-Download the file [survey.db](http://files.software-carpentry.org/survey.db) into this
+change to it using the <code>cd</code> command. Download the file [survey.db](http://files.software-carpentry.org/survey.db) into this
 directory.
 
-### Testing the installation
+    $ mkdir swc_sql 
+    $ cd swc_sql
+    $ wget http://files.software-carpentry.org/survey.db
 
-#### Command line
 
-Type
+First, load the example database into SQLite. 
+On the shell command line, type
 
     sqlite3 survey.db
 
-Then type the SQLite command <code>.tables</code> to list the tables in the database.
+This command instructs SQLite to load the database in the `survey.db` file.
+
 You should see something like the following.
 
     SQLite version 3.8.8 2015-01-16 12:08:06
     Enter ".help" for usage hints.
+    sqlite>
+
+For a list of useful system commands, enter <code>.help</code>.
+
+All SQLite-specific commands are prefixed with a . to distinguish them from SQL commands. 
+Type <code>.tables</code> to list the tables in the database. 
+
     sqlite> .tables
     Person   Site     Survey   Visited
 
 
-Type the following SQL <code>SELECT</code> command. This <code>SELECT</code> statement
-selects all (*) rows from the Site table.
+Type the following SQL <code>SELECT</code> command. 
+This <code>SELECT</code> statement selects all (*) rows from the Site table.
 
 <code>select * from Site;</code>
 
@@ -56,8 +63,10 @@ Complete your SQL statement with a semicolon.
     DR-3|-47.15|-126.72
     MSK-4|-48.87|-123.4
 
-You can change some SQLite settings to make the output easier to read. First, set the
-output mode to display left-aligned columns. Then turn on the display of column headers.
+You can change some SQLite settings to make the output easier to read. 
+First, 
+set the output mode to display left-aligned columns. 
+Then turn on the display of column headers.
 
     sqlite> .mode column
     sqlite> .header on
@@ -69,9 +78,6 @@ output mode to display left-aligned columns. Then turn on the display of column
     MSK-4       -48.87      -123.4
 
 
-#### Helpful Commands
+To exit SQLite and return to the shell command line, 
+you can use either `.quit` or `.exit`.
 
-* For a list of useful system commands, enter <code>.help</code>
-* To exit sqlite and return to the shell command line, you can use either
-	* <code>.quit</code> *or*
-	* <code>.exit</code>
diff --git a/01-select.md b/01-select.md
@@ -11,9 +11,9 @@ minutes: 30
 > *   Write a query to select all values for specific fields from a single table.
 
 A [relational database](reference.html#relational-database)
-is a way to store and manipulate information
-that is arranged as [tables](reference.html#table).
-Each table has columns (also known as [fields](reference.html#field)) which describe the data,
+is a way to store and manipulate information. 
+Databases are arranged as [tables](reference.html#table).
+Each table has columns (also known as [fields](reference.html#field)) that describe the data,
 and rows (also known as [records](reference.html#record)) which contain the data.
 
 When we are using a spreadsheet,
@@ -39,8 +39,8 @@ that we can then use as a starting point for further queries.
 
 Queries are written in a language called [SQL](reference.html#sql),
 which stands for "Structured Query Language".
-SQL provides hundreds of different ways to analyze and recombine data;
-we will only look at a handful,
+SQL provides hundreds of different ways to analyze and recombine data.
+We will only look at a handful of queries,
 but that handful accounts for most of what scientists do.
 
 The tables below show the database we will use in our examples:

diff --git a/02-sort-dup.md b/02-sort-dup.md
@@ -9,59 +9,71 @@ minutes: 30
 > *   Write queries that display results in a particular order.
 > *   Write queries that eliminate duplicate values from data.
 
+
+In beginning our examination of the Antarctic data, we want to know:
+
+* what kind of quantity measurements were taken at each site; 
+* which scientists took measurements on the expedition;
+* the sites where each scientist took measurements
+
+
+To determine which measurements were taken at each site, 
+we can examine the `Survey` table.
 Data is often redundant,
 so queries often return redundant information.
 For example,
 if we select the quantitites that have been measured
-from the `survey` table,
+from the `Survey` table,
 we get this:
 
 ~~~ {.sql}
 SELECT quant FROM Survey;
 ~~~
 
-|quant|
-|-----|
-|rad  |
-|sal  |
-|rad  |
-|sal  |
-|rad  |
-|sal  |
-|temp |
-|rad  |
-|sal  |
-|temp |
-|rad  |
-|temp |
-|sal  |
-|rad  |
-|sal  |
-|temp |
-|sal  |
-|rad  |
-|sal  |
-|sal  |
-|rad  |
-
-We can eliminate the redundant output
-to make the result more readable
-by adding the `DISTINCT` keyword
+<table>
+<tr><th>quant</th></tr>
+<tr><td>rad  </td></tr>
+<tr><td>sal  </td></tr>
+<tr><td>rad  </td></tr>
+<tr><td>sal  </td></tr>
+<tr><td>rad  </td></tr>
+<tr><td>sal  </td></tr>
+<tr><td>temp </td></tr>
+<tr><td>rad  </td></tr>
+<tr><td>sal  </td></tr>
+<tr><td>temp </td></tr>
+<tr><td>rad  </td></tr>
+<tr><td>temp </td></tr>
+<tr><td>sal  </td></tr>
+<tr><td>rad  </td></tr>
+<tr><td>sal  </td></tr>
+<tr><td>temp </td></tr>
+<tr><td>sal  </td></tr>
+<tr><td>rad  </td></tr>
+<tr><td>sal  </td></tr>
+<tr><td>sal  </td></tr>
+<tr><td>rad  </td></tr>
+</table>
+
+This result makes it difficult to see all of the different types of `quant` in the Survey table.
+We can eliminate the redundant output to make the result more readable by adding the `DISTINCT` keyword
 to our query:
 
 ~~~ {.sql}
 SELECT DISTINCT quant FROM Survey;
 ~~~
 
-|quant|
-|-----|
-|rad  |
-|sal  |
-|temp |
+<table>
+<tr><th>quant</th></tr>
+<tr><td>rad  </td></tr>
+<tr><td>sal  </td></tr>
+<tr><td>temp </td></tr>
+</table>
 
-If we select more than one column --- for example,
-both the survey site ID and the quantity measured --- then
-the distinct pairs of values are returned:
+If we want to determine which sites have which quant measurement, 
+we can use the `DISTINCT` keyword on multiple columns.
+If we select more than one column,
+the distinct *pairs* of values are returned:
 
 ~~~ {.sql}
 SELECT DISTINCT taken, quant FROM Survey;
@@ -90,17 +102,16 @@ SELECT DISTINCT taken, quant FROM Survey;
 |844  |rad  |
 
 Notice in both cases that duplicates are removed
-even if they didn't appear to be adjacent in the database.
-Again,
-it's important to remember that rows aren't actually ordered:
-they're just displayed that way.
+even if the rows they come from didn't appear to be adjacent in the database table.
 
+
+Our next task is to identify the scientists on the expedition by looking at the `Person` table.
 As we mentioned earlier,
 database records are not stored in any particular order.
 This means that query results aren't necessarily sorted,
 and even if they are,
 we often want to sort them in a different way,
-e.g., by the name of the project instead of by the name of the scientist.
+e.g., by their identifier instead of by their personal name.
 We can do this in SQL by adding an `ORDER BY` clause to our query:
 
 ~~~ {.sql}
@@ -122,7 +133,7 @@ from least to greatest).
 We can sort in the opposite order using `DESC` (for "descending"):
 
 ~~~ {.sql}
-SELECT * FROM person ORDER BY ident DESC;
+SELECT * FROM Person ORDER BY ident DESC;
 ~~~
 
 |ident  |personal |family  |
@@ -136,61 +147,66 @@ SELECT * FROM person ORDER BY ident DESC;
 (And if we want to make it clear that we're sorting in ascending order,
 we can use `ASC` instead of `DESC`.)
 
+
+In order to look at which scientist measured quantities at each site, 
+we can look again at the `Survey` table.
 We can also sort on several fields at once.
 For example,
 this query sorts results first in ascending order by `taken`,
 and then in descending order by `person`
 within each group of equal `taken` values:
 
 ~~~ {.sql}
-SELECT taken, person FROM Survey ORDER BY taken ASC, person DESC;
+SELECT taken, person, quant FROM Survey ORDER BY quant ASC, person DESC;
 ~~~
-
-|taken|person|
-|-----|------|
-|619  |dyer  |
-|619  |dyer  |
-|622  |dyer  |
-|622  |dyer  |
-|734  |pb    |
-|734  |pb    |
-|734  |lake  |
-|735  |pb    |
-|735  |-null-|
-|735  |-null-|
-|751  |pb    |
-|751  |pb    |
-|751  |lake  |
-|752  |roe   |
-|752  |lake  |
-|752  |lake  |
-|752  |lake  |
-|837  |roe   |
-|837  |lake  |
-|837  |lake  |
-|844  |roe   |
-
-This is easier to understand if we also remove duplicates:
+|taken|person|quant|
+|-----|------|-----|
+|619  |dyer  |rad  |
+|619  |dyer  |sal  |
+|622  |dyer  |rad  |
+|622  |dyer  |sal  |
+|734  |pb    |rad  |
+|734  |pb    |temp |
+|734  |lake  |sal  |
+|735  |pb    |rad  |
+|735  |-null-|sal  |
+|735  |-null-|temp |
+|751  |pb    |rad  |
+|751  |pb    |temp |
+|751  |lake  |sal  |
+|752  |roe   |sal  |
+|752  |lake  |rad  |
+|752  |lake  |sal  |
+|752  |lake  |temp |
+|837  |roe   |sal  |
+|837  |lake  |rad  |
+|837  |lake  |sal  |
+|844  |roe   |rad  |
+
+This query gives us a good idea of which scientist was at which site, 
+and what measurements they performed while they were there.
+
+Looking at the table, 
+it seems like some scientists specialized in certain kinds of measurements. 
+We can examine which scientists performed which measurements by selecting the appropriate columns and removing duplicates. 
 
 ~~~ {.sql}
-SELECT DISTINCT taken, person FROM Survey ORDER BY taken ASC, person DESC;
+SELECT DISTINCT quant, person FROM Survey ORDER BY quant ASC;
 ~~~
 
-|taken|person|
+|quant|person|
 |-----|------|
-|619  |dyer  |
-|622  |dyer  |
-|734  |pb    |
-|734  |lake  |
-|735  |pb    |
-|735  |-null-|
-|751  |pb    |
-|751  |lake  |
-|752  |roe   |
-|752  |lake  |
-|837  |roe   |
-|837  |lake  |
-|844  |roe   |
+|rad  |dyer  |
+|rad  |pb    |
+|rad  |lake  |
+|rad  |roe   |
+|sal  |dyer  |
+|sal  |lake  |
+|sal  |-null-|
+|sal  |roe   |
+|temp |pb    |
+|temp |-null-|
+|temp |lake  |
 
 > ## Finding Distinct Dates {.challenge}
 >