Skip to content

Commit

Permalink
More database lessons for novices. To do:
Browse files Browse the repository at this point in the history
*   `group by` in aggregation
*   data hygiene
*   explain the `%%sqlite` magic
  • Loading branch information
gvwilson committed Nov 10, 2013
1 parent 61ae6ff commit 2446029
Show file tree
Hide file tree
Showing 7 changed files with 2,086 additions and 0 deletions.
313 changes: 313 additions & 0 deletions 04-calc.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,313 @@
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Calculating New Values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After carefully re-reading the expedition logs,\n",
"we realize that the radiation measurements they report\n",
"may need to be corrected upward by 5%.\n",
"Rather than modifying the stored data,\n",
"we can do this calculation on the fly\n",
"as part of our query:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%load_ext sqlitemagic"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select 1.05 * reading from Survey where quant='rad';"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table>\n",
"<tr><td>10.311</td></tr>\n",
"<tr><td>8.19</td></tr>\n",
"<tr><td>8.8305</td></tr>\n",
"<tr><td>7.581</td></tr>\n",
"<tr><td>4.5675</td></tr>\n",
"<tr><td>2.2995</td></tr>\n",
"<tr><td>1.533</td></tr>\n",
"<tr><td>11.8125</td></tr>\n",
"</table>"
],
"metadata": {},
"output_type": "display_data",
"text": [
"<IPython.core.display.HTML at 0x1023c4e50>"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we run the query,\n",
"the expression `1.05 * reading` is evaluated for each row.\n",
"Expressions can use any of the fields,\n",
"all of usual arithmetic operators,\n",
"and a variety of common functions.\n",
"(Exactly which ones depends on which database manager is being used.)\n",
"For example,\n",
"we can convert temperature readings from Fahrenheit to Celsius\n",
"and round to two decimal places:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp';"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table>\n",
"<tr><td>734</td><td>-29.72</td></tr>\n",
"<tr><td>735</td><td>-32.22</td></tr>\n",
"<tr><td>751</td><td>-28.06</td></tr>\n",
"<tr><td>752</td><td>-26.67</td></tr>\n",
"</table>"
],
"metadata": {},
"output_type": "display_data",
"text": [
"<IPython.core.display.HTML at 0x1023cbd90>"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also combine values from different fields,\n",
"for example by using the string concatenation operator `||`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select personal || ' ' || family from Person;"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table>\n",
"<tr><td>William Dyer</td></tr>\n",
"<tr><td>Frank Pabodie</td></tr>\n",
"<tr><td>Anderson Lake</td></tr>\n",
"<tr><td>Valentina Roerich</td></tr>\n",
"<tr><td>Frank Danforth</td></tr>\n",
"</table>"
],
"metadata": {},
"output_type": "display_data",
"text": [
"<IPython.core.display.HTML at 0x1023c46d0>"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> It may seem strange to use `personal` and `family` as field names\n",
"> instead of `first` and `last`,\n",
"> but it's a necessary first step toward handling cultural differences.\n",
"> For example,\n",
"> consider the following rules:\n",
"\n",
"<table>\n",
" <tr> <th>Full Name</th> <th>Alphabetized Under</th> <th>Reason</th> </tr>\n",
" <tr> <td>Liu Xiaobo</td> <td>Liu</td> <td>Chinese family names come first</td> </tr>\n",
" <tr> <td> Leonardo da Vinci</td> <td>Leonardo</td> <td>\"da Vinci\" just means \"from Vinci\"</td> </tr>\n",
" <tr> <td> Catherine de Medici</td> <td>Medici</td> <td>family name</td> </tr>\n",
" <tr> <td> Jean de La Fontaine</td> <td>La Fontaine</td> <td>family name is \"La Fontaine\"</td> </tr>\n",
" <tr> <td> Juan Ponce de Leon</td> <td>Ponce de Leon</td> <td>full family name is \"Ponce de Leon\"</td> </tr>\n",
" <tr> <td> Gabriel Garcia Marquez</td> <td>Garcia Marquez</td> <td>double-barrelled Spanish surnames</td> </tr>\n",
" <tr> <td> Wernher von Braun</td> <td>von <em>or</em> Braun</td> <td>depending on whether he was in Germany or the US</td> </tr>\n",
" <tr> <td> Elizabeth Alexandra May Windsor</td> <td>Elizabeth</td> <td>monarchs alphabetize by the name under which they reigned</td> </tr>\n",
" <tr> <td> Thomas a Beckett</td> <td>Thomas</td> <td>and saints according to the names by which they were canonized</td> </tr>\n",
"</table>\n",
"\n",
"> Clearly,\n",
"> even a two-part division into \"personal\" and \"family\"\n",
"> isn't enough..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Challenges\n",
"\n",
"1. After further reading,\n",
" we realize that Valentina Roerich\n",
" was reporting salinity as percentages.\n",
" Write a query that returns all of her salinity measurements\n",
" from the `Survey` table\n",
" with the values divided by 100.\n",
"\n",
"2. The `union` operator combines the results of two queries:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select * from Person where ident='dyer' union select * from Person where ident='roe';"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table>\n",
"<tr><td>dyer</td><td>William</td><td>Dyer</td></tr>\n",
"<tr><td>roe</td><td>Valentina</td><td>Roerich</td></tr>\n",
"</table>"
],
"metadata": {},
"output_type": "display_data",
"text": [
"<IPython.core.display.HTML at 0x1023cbd50>"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `union` to create a consolidated list of salinity measurements\n",
"in which Roerich's, and only Roerich's,\n",
"have been corrected as described in the previous challenge.\n",
"The output should be something like:\n",
"\n",
"<table>\n",
" <tr> <td>619</td> <td>0.13</td> </tr>\n",
" <tr> <td>622</td> <td>0.09</td> </tr>\n",
" <tr> <td>734</td> <td>0.05</td> </tr>\n",
" <tr> <td>751</td> <td>0.1</td> </tr>\n",
" <tr> <td>752</td> <td>0.09</td> </tr>\n",
" <tr> <td>752</td> <td>0.416</td> </tr>\n",
" <tr> <td>837</td> <td>0.21</td> </tr>\n",
" <tr> <td>837</td> <td>0.225</td> </tr>\n",
"</table>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. The site identifiers in the `Visited` table have two parts\n",
" separated by a '-':"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select distinct site from Visited;"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table>\n",
"<tr><td>DR-1</td></tr>\n",
"<tr><td>DR-3</td></tr>\n",
"<tr><td>MSK-4</td></tr>\n",
"</table>"
],
"metadata": {},
"output_type": "display_data",
"text": [
"<IPython.core.display.HTML at 0x1023c4e50>"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some major site identifiers are two letters long and some are three.\n",
"The \"in string\" function `instr(X, Y)`\n",
"returns the 1-based index of the first occurrence of string Y in string X,\n",
"or 0 if Y does not exist in X.\n",
"The substring function `substr(X, I)`\n",
"returns the substring of X starting at index I.\n",
"Use these two functions to produce a list of unique major site identifiers.\n",
"(For this data,\n",
"the list should contain only \"DR\" and \"MSK\")."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Next Steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"FIXME"
]
}
],
"metadata": {}
}
]
}
Loading

0 comments on commit 2446029

Please sign in to comment.