forked from swcarpentry/sql-novice-survey
-
Notifications
You must be signed in to change notification settings - Fork 0
/
08-hygiene.html
165 lines (165 loc) · 7.41 KB
/
08-hygiene.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Databases and SQL</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h1 class="title">Databases and SQL</h1>
<h2 class="subtitle">Data Hygiene</h2>
<div id="learning-objectives" class="objectives panel panel-warning">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Explain what an atomic value is.</li>
<li>Distinguish between atomic and non-atomic values.</li>
<li>Explain why every value in a database should be atomic.</li>
<li>Explain what a primary key is and why every record should have one.</li>
<li>Identify primary keys in database tables.</li>
<li>Explain why database entries should not contain redundant information.</li>
<li>Identify redundant information in databases.</li>
</ul>
</div>
</div>
<p>Now that we have seen how joins work, we can see why the relational model is so useful and how best to use it. The first rule is that every value should be <a href="reference.html#atomic">atomic</a>, i.e., not contain parts that we might want to work with separately. We store personal and family names in separate columns instead of putting the entire name in one column so that we don't have to use substring operations to get the name's components. More importantly, we store the two parts of the name separately because splitting on spaces is unreliable: just think of a name like "Eloise St. Cyr" or "Jan Mikkel Steubart".</p>
<p>The second rule is that every record should have a unique primary key. This can be a serial number that has no intrinsic meaning, one of the values in the record (like the <code>ident</code> field in the <code>Person</code> table), or even a combination of values: the triple <code>(taken, person, quant)</code> from the <code>Survey</code> table uniquely identifies every measurement.</p>
<p>The third rule is that there should be no redundant information. For example, we could get rid of the <code>Site</code> table and rewrite the <code>Visited</code> table like this:</p>
<table>
<thead>
<tr class="header">
<th align="left">ident</th>
<th align="left">lat</th>
<th align="left">long</th>
<th align="left">dated</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">619</td>
<td align="left">-49.85</td>
<td align="left">-128.57</td>
<td align="left">1927-02-08</td>
</tr>
<tr class="even">
<td align="left">622</td>
<td align="left">-49.85</td>
<td align="left">-128.57</td>
<td align="left">1927-02-10</td>
</tr>
<tr class="odd">
<td align="left">734</td>
<td align="left">-47.15</td>
<td align="left">-126.72</td>
<td align="left">1939-01-07</td>
</tr>
<tr class="even">
<td align="left">735</td>
<td align="left">-47.15</td>
<td align="left">-126.72</td>
<td align="left">1930-01-12</td>
</tr>
<tr class="odd">
<td align="left">751</td>
<td align="left">-47.15</td>
<td align="left">-126.72</td>
<td align="left">1930-02-26</td>
</tr>
<tr class="even">
<td align="left">752</td>
<td align="left">-47.15</td>
<td align="left">-126.72</td>
<td align="left">-null-</td>
</tr>
<tr class="odd">
<td align="left">837</td>
<td align="left">-48.87</td>
<td align="left">-123.40</td>
<td align="left">1932-01-14</td>
</tr>
<tr class="even">
<td align="left">844</td>
<td align="left">-49.85</td>
<td align="left">-128.57</td>
<td align="left">1932-03-22</td>
</tr>
</tbody>
</table>
<p>In fact, we could use a single table that recorded all the information about each reading in each row, just as a spreadsheet would. The problem is that it's very hard to keep data organized this way consistent: if we realize that the date of a particular visit to a particular site is wrong, we have to change multiple records in the database. What's worse, we may have to guess which records to change, since other sites may also have been visited on that date.</p>
<p>The fourth rule is that the units for every value should be stored explicitly. Our database doesn't do this, and that's a problem: Roerich's salinity measurements are several orders of magnitude larger than anyone else's, but we don't know if that means she was using parts per million instead of parts per thousand, or whether there actually was a saline anomaly at that site in 1932.</p>
<p>Stepping back, data and the tools used to store it have a symbiotic relationship: we use tables and joins because it's efficient, provided our data is organized a certain way, but organize our data that way because we have tools to manipulate it efficiently if it's in a certain form. As anthropologists say, the tool shapes the hand that shapes the tool.</p>
<div id="identifying-atomic-values" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Identifying Atomic Values</h2>
</div>
<div class="panel-body">
<p>Which of the following are atomic values? Which are not? Why?</p>
<ul>
<li>New Zealand</li>
<li>87 Turing Avenue</li>
<li>January 25, 1971</li>
<li>the XY coordinate (0.5, 3.3)</li>
</ul>
</div>
</div>
<div id="identifying-a-primary-key" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Identifying a Primary Key</h2>
</div>
<div class="panel-body">
<p>What is the primary key in this table? I.e., what value or combination of values uniquely identifies a record?</p>
<table>
<thead>
<tr class="header">
<th align="left">latitude</th>
<th align="left">longitude</th>
<th align="left">date</th>
<th align="left">temperature</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">57.3</td>
<td align="left">-22.5</td>
<td align="left">2015-01-09</td>
<td align="left">-14.2</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/sql-novice-survey">Source</a>
<a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>