forked from swcarpentry/sql-novice-survey
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05-null.html
353 lines (353 loc) · 13.9 KB
/
05-null.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Databases and SQL</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">Databases and SQL</h1></a>
<h2 class="subtitle">Missing Data</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Explain how databases represent missing information.</li>
<li>Explain the three-valued logic databases use when manipulating missing information.</li>
<li>Write queries that handle missing information correctly.</li>
</ul>
</div>
</section>
<p>Real-world data is never complete — there are always holes. Databases represent these holes using a special value called <code>null</code>. <code>null</code> is not zero, <code>False</code>, or the empty string; it is a one-of-a-kind value that means “nothing here”. Dealing with <code>null</code> requires a few special tricks and some careful thinking.</p>
<p>To start, let’s have a look at the <code>Visited</code> table. There are eight records, but #752 doesn’t have a date — or rather, its date is null:</p>
<pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> * <span class="kw">FROM</span> Visited;</code></pre>
<table>
<thead>
<tr class="header">
<th align="left">ident</th>
<th align="left">site</th>
<th align="left">dated</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">619</td>
<td align="left">DR-1</td>
<td align="left">1927-02-08</td>
</tr>
<tr class="even">
<td align="left">622</td>
<td align="left">DR-1</td>
<td align="left">1927-02-10</td>
</tr>
<tr class="odd">
<td align="left">734</td>
<td align="left">DR-3</td>
<td align="left">1930-01-07</td>
</tr>
<tr class="even">
<td align="left">735</td>
<td align="left">DR-3</td>
<td align="left">1930-01-12</td>
</tr>
<tr class="odd">
<td align="left">751</td>
<td align="left">DR-3</td>
<td align="left">1930-02-26</td>
</tr>
<tr class="even">
<td align="left">752</td>
<td align="left">DR-3</td>
<td align="left">-null-</td>
</tr>
<tr class="odd">
<td align="left">837</td>
<td align="left">MSK-</td>
<td align="left">1932-01-14</td>
</tr>
<tr class="even">
<td align="left">844</td>
<td align="left">DR-1</td>
<td align="left">1932-03-22</td>
</tr>
</tbody>
</table>
<p>Null doesn’t behave like other values. If we select the records that come before 1930:</p>
<pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> * <span class="kw">FROM</span> Visited <span class="kw">WHERE</span> dated<<span class="st">'1930-01-01'</span>;</code></pre>
<table>
<thead>
<tr class="header">
<th align="left">ident</th>
<th align="left">site</th>
<th align="left">dated</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">619</td>
<td align="left">DR-1</td>
<td align="left">1927-02-08</td>
</tr>
<tr class="even">
<td align="left">622</td>
<td align="left">DR-1</td>
<td align="left">1927-02-10</td>
</tr>
</tbody>
</table>
<p>we get two results, and if we select the ones that come during or after 1930:</p>
<pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> * <span class="kw">FROM</span> Visited <span class="kw">WHERE</span> dated>=<span class="st">'1930-01-01'</span>;</code></pre>
<table>
<thead>
<tr class="header">
<th align="left">ident</th>
<th align="left">site</th>
<th align="left">dated</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">734</td>
<td align="left">DR-3</td>
<td align="left">1930-01-07</td>
</tr>
<tr class="even">
<td align="left">735</td>
<td align="left">DR-3</td>
<td align="left">1930-01-12</td>
</tr>
<tr class="odd">
<td align="left">751</td>
<td align="left">DR-3</td>
<td align="left">1930-02-26</td>
</tr>
<tr class="even">
<td align="left">837</td>
<td align="left">MSK-</td>
<td align="left">1932-01-14</td>
</tr>
<tr class="odd">
<td align="left">844</td>
<td align="left">DR-1</td>
<td align="left">1932-03-22</td>
</tr>
</tbody>
</table>
<p>we get five, but record #752 isn’t in either set of results. The reason is that <code>null<'1930-01-01'</code> is neither true nor false: null means, “We don’t know,” and if we don’t know the value on the left side of a comparison, we don’t know whether the comparison is true or false. Since databases represent “don’t know” as null, the value of <code>null<'1930-01-01'</code> is actually <code>null</code>. <code>null>='1930-01-01'</code> is also null because we can’t answer to that question either. And since the only records kept by a <code>WHERE</code> are those for which the test is true, record #752 isn’t included in either set of results.</p>
<p>Comparisons aren’t the only operations that behave this way with nulls. <code>1+null</code> is <code>null</code>, <code>5*null</code> is <code>null</code>, <code>log(null)</code> is <code>null</code>, and so on. In particular, comparing things to null with = and != produces null:</p>
<pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> * <span class="kw">FROM</span> Visited <span class="kw">WHERE</span> dated=NULL;</code></pre>
<p>produces no output, and neither does:</p>
<pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> * <span class="kw">FROM</span> Visited <span class="kw">WHERE</span> dated!=NULL;</code></pre>
<p>To check whether a value is <code>null</code> or not, we must use a special test <code>IS NULL</code>:</p>
<pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> * <span class="kw">FROM</span> Visited <span class="kw">WHERE</span> dated <span class="kw">IS</span> <span class="kw">NULL</span>;</code></pre>
<table>
<thead>
<tr class="header">
<th align="left">ident</th>
<th align="left">site</th>
<th align="left">dated</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">752</td>
<td align="left">DR-3</td>
<td align="left">-null-</td>
</tr>
</tbody>
</table>
<p>or its inverse <code>IS NOT NULL</code>:</p>
<pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> * <span class="kw">FROM</span> Visited <span class="kw">WHERE</span> dated <span class="kw">IS</span> <span class="kw">NOT</span> <span class="kw">NULL</span>;</code></pre>
<table>
<thead>
<tr class="header">
<th align="left">ident</th>
<th align="left">site</th>
<th align="left">dated</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">619</td>
<td align="left">DR-1</td>
<td align="left">1927-02-08</td>
</tr>
<tr class="even">
<td align="left">622</td>
<td align="left">DR-1</td>
<td align="left">1927-02-10</td>
</tr>
<tr class="odd">
<td align="left">734</td>
<td align="left">DR-3</td>
<td align="left">1930-01-07</td>
</tr>
<tr class="even">
<td align="left">735</td>
<td align="left">DR-3</td>
<td align="left">1930-01-12</td>
</tr>
<tr class="odd">
<td align="left">751</td>
<td align="left">DR-3</td>
<td align="left">1930-02-26</td>
</tr>
<tr class="even">
<td align="left">837</td>
<td align="left">MSK-</td>
<td align="left">1932-01-14</td>
</tr>
<tr class="odd">
<td align="left">844</td>
<td align="left">DR-1</td>
<td align="left">1932-03-22</td>
</tr>
</tbody>
</table>
<p>Null values can cause headaches wherever they appear. For example, suppose we want to find all the salinity measurements that weren’t taken by Lake. It’s natural to write the query like this:</p>
<pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> * <span class="kw">FROM</span> Survey <span class="kw">WHERE</span> quant=<span class="st">'sal'</span> <span class="kw">AND</span> person!=<span class="st">'lake'</span>;</code></pre>
<table>
<thead>
<tr class="header">
<th align="left">taken</th>
<th align="left">person</th>
<th align="left">quant</th>
<th align="left">reading</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">619</td>
<td align="left">dyer</td>
<td align="left">sal</td>
<td align="left">0.13</td>
</tr>
<tr class="even">
<td align="left">622</td>
<td align="left">dyer</td>
<td align="left">sal</td>
<td align="left">0.09</td>
</tr>
<tr class="odd">
<td align="left">752</td>
<td align="left">roe</td>
<td align="left">sal</td>
<td align="left">41.6</td>
</tr>
<tr class="even">
<td align="left">837</td>
<td align="left">roe</td>
<td align="left">sal</td>
<td align="left">22.5</td>
</tr>
</tbody>
</table>
<p>but this query filters omits the records where we don’t know who took the measurement. Once again, the reason is that when <code>person</code> is <code>null</code>, the <code>!=</code> comparison produces <code>null</code>, so the record isn’t kept in our results. If we want to keep these records we need to add an explicit check:</p>
<pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> * <span class="kw">FROM</span> Survey <span class="kw">WHERE</span> quant=<span class="st">'sal'</span> <span class="kw">AND</span> (person!=<span class="st">'lake'</span> <span class="kw">OR</span> person <span class="kw">IS</span> <span class="kw">NULL</span>);</code></pre>
<table>
<thead>
<tr class="header">
<th align="left">taken</th>
<th align="left">person</th>
<th align="left">quant</th>
<th align="left">reading</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">619</td>
<td align="left">dyer</td>
<td align="left">sal</td>
<td align="left">0.13</td>
</tr>
<tr class="even">
<td align="left">622</td>
<td align="left">dyer</td>
<td align="left">sal</td>
<td align="left">0.09</td>
</tr>
<tr class="odd">
<td align="left">735</td>
<td align="left">-null-</td>
<td align="left">sal</td>
<td align="left">0.06</td>
</tr>
<tr class="even">
<td align="left">752</td>
<td align="left">roe</td>
<td align="left">sal</td>
<td align="left">41.6</td>
</tr>
<tr class="odd">
<td align="left">837</td>
<td align="left">roe</td>
<td align="left">sal</td>
<td align="left">22.5</td>
</tr>
</tbody>
</table>
<p>We still have to decide whether this is the right thing to do or not. If we want to be absolutely sure that we aren’t including any measurements by Lake in our results, we need to exclude all the records for which we don’t know who did the work.</p>
<p>In contrast to arithmetic or Boolean operators, aggregation functions that combine multiple values, such as <code>min</code>, <code>max</code> or <code>avg</code>, <em>ignore</em> <code>null</code> values. In the majority of cases, this is a desirable output: for example, unknown values are thus not affecting our data when we are averaging it. Aggregation functions will be addressed in more detail in <a href="06-agg.html">the next section</a>.</p>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Sorting by Known Date</h2>
</div>
<div class="panel-body">
<p>Write a query that sorts the records in <code>Visited</code> by date, omitting entries for which the date is not known (i.e., is null).</p>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>NULL in a Set</h2>
</div>
<div class="panel-body">
<p>What do you expect the query:</p>
<pre class="sourceCode sql"><code class="sourceCode sql"><span class="kw">SELECT</span> * <span class="kw">FROM</span> Visited <span class="kw">WHERE</span> dated <span class="kw">IN</span> (<span class="st">'1927-02-08'</span>, <span class="kw">NULL</span>);</code></pre>
<p>to produce? What does it actually produce?</p>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Pros and Cons of Sentinels</h2>
</div>
<div class="panel-body">
<p>Some database designers prefer to use a <a href="reference.html#sentinel-value">sentinel value</a>) to mark missing data rather than <code>null</code>. For example, they will use the date “0000-00-00” to mark a missing date, or -1.0 to mark a missing salinity or radiation reading (since actual readings cannot be negative). What does this simplify? What burdens or risks does it introduce?</p>
</div>
</section>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/sql-novice-survey">Source</a>
<a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>