Updated assignment 5, local version.
lintool committed Feb 19, 2017
1 parent 564c49c commit 519f72e
Showing 4 changed files with 130 additions and 67 deletions.
130 changes: 99 additions & 31 deletions assignment5.html
<h3>Assignment 5: Data Warehousing <small>due 8:30am March 2</small></h3>
will help you make sense of the queries you are running.</p>

<p>The TPC-H benchmark comes with a data generator, and we have
generated some data for you:

<ul>

<li><a href="assignments/TPC-H-0.1-TXT.tar.gz"><code>TPC-H-0.1-TXT.tar.gz</code></a>:
the plain-text version of the data</li>

<li><a href="assignments/TPC-H-0.1-PARQUET.tar.gz"><code>TPC-H-0.1-PARQUET.tar.gz</code></a>:
the Parquet version of the data</li>

</ul>

<p>Download and unpack the datasets above. In the first part of the
assignment where you will be working with Spark locally, you will run
your queries against both datasets.</p>

<p>For the plain-text version of the data, you will see a number of
text files, each corresponding to a table in the TPC-H schema. The
files are delimited by <code>|</code>. You'll notice that some of the
fields, especially the text fields, are gibberish&mdash;that's normal,
since the contents are randomly generated.</p>
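
<p>As a rough sketch (not part of the assignment spec), a table in the
plain-text format can be loaded into an RDD by splitting on the
delimiter. The file name <code>lineitem.tbl</code> and the existence of
a <code>SparkContext</code> named <code>sc</code> are assumptions here:</p>

<pre>
// Hedged sketch: load the pipe-delimited lineitem table as an RDD of field arrays.
// split takes a regex, so the | delimiter must be escaped.
val lineitem = sc.textFile("TPC-H-0.1-TXT/lineitem.tbl")
  .map(line => line.split("\\|"))
</pre>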

<p>The Parquet data contains the same records, encoded in the Parquet
columnar format.</p>
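
<p>A corresponding sketch for the Parquet side, assuming a
<code>SparkSession</code> and one Parquet directory per table (the
directory name below is illustrative):</p>

<pre>
import org.apache.spark.sql.SparkSession

// Hedged sketch: read one table from the Parquet version of the data.
val sparkSession = SparkSession.builder.getOrCreate()
val lineitemDF = sparkSession.read.parquet("TPC-H-0.1-PARQUET/lineitem")
// Same data as the text version, but as Rows with typed columns.
val lineitemRDD = lineitemDF.rdd
</pre>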

<p>Implement the following seven SQL queries, running on
the <code>TPC-H-0.1-TXT</code> plain-text data as well as
the <code>TPC-H-0.1-PARQUET</code> Parquet data. For each query you
will write a program that takes the option <code>--text</code> to work
with the plain-text data and <code>--parquet</code> to work with the
Parquet data. More details will be provided below.</p>
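
<p>One way to handle these options is Scallop, which is used elsewhere
in the course code; the exact option set below is an assumption read
off the commands in this assignment, not a prescribed interface:</p>

<pre>
import org.rogach.scallop._

// Hedged sketch of the command-line interface each query program needs.
class Conf(args: Seq[String]) extends ScallopConf(args) {
  val input = opt[String](descr = "input directory", required = true)
  val date = opt[String](descr = "date predicate")
  val text = opt[Boolean](descr = "run against the plain-text data")
  val parquet = opt[Boolean](descr = "run against the Parquet data")
  verify()
}

// Inside main, branch on the flags:
// val args = new Conf(argv)
// if (args.text()) { /* read TPC-H-0.1-TXT */ }
// else if (args.parquet()) { /* read TPC-H-0.1-PARQUET */ }
</pre>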

<p>Each SQL query is accompanied by a written description of what the
query does; if there are any ambiguities in the language, you can
always assume that the SQL query is correct. Each of your queries will
be a separate Spark program. Put your code in the package
<code>ca.uwaterloo.cs.bigdata2017w.assignment5</code>, in the
same repo that you've been working in all semester. Since you'll be
writing Scala code, your source files should go into
</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q1 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
but there must be a line with the answer in <b>exactly</b> the above
format.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q1 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>In both cases, the value of the
<code>--input</code> argument is the directory that contains the
data (either in plain text or in Parquet). The value of the <code>--date</code> argument
corresponds to the <code>l_shipdate</code> predicate in the where
clause in the SQL query. You need to anticipate dates of the
form <code>YYYY-MM-DD</code>, <code>YYYY-MM</code>, or
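
<p>A sketch of one way to honor all three date forms is a simple
prefix match on <code>l_shipdate</code>; the column position is an
assumption about the TPC-H layout, and <code>lineitem</code> is the
RDD sketched earlier:</p>

<pre>
// Hedged sketch: "1996-01-01", "1996-01", and "1996" are all prefixes of
// matching l_shipdate values (assumed to be field 10 of lineitem, 0-indexed).
val date = "1996-01"
val matching = lineitem.filter(fields => fields(10).startsWith(date))
println("count: " + matching.count())
</pre>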
</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q2 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q2 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>In the design of this data warehouse, the <code>lineitem</code>
and <code>orders</code> tables are not likely to fit in
memory. Therefore, the only scalable join approach is the reduce-side
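
<p>A minimal sketch of the reduce-side idea, assuming
<code>lineitem</code> and <code>orders</code> RDDs loaded as shown
earlier; the projected fields are illustrative, not the required
ones:</p>

<pre>
// Hedged sketch: key both large tables by order key and let the shuffle
// bring matching records together; neither table is held in memory.
val lineitemByOrder = lineitem.map(fields => (fields(0), fields(10))) // (l_orderkey, l_shipdate)
val ordersByOrder = orders.map(fields => (fields(0), fields(6)))      // (o_orderkey, o_clerk)
val joined = lineitemByOrder.join(ordersByOrder)                      // reduce-side join
</pre>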
</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q3 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q3 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>In the design of this data warehouse, it is assumed that
the <code>part</code> and <code>supplier</code> tables will fit in
memory. Therefore, it is possible to implement a hash join. For this
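
<p>A hedged sketch of the map-side hash join this enables: collect the
small table into a map on the driver, broadcast it, and probe it while
scanning <code>lineitem</code> (field positions are assumptions):</p>

<pre>
// Hedged sketch: part fits in memory, so build a hash map from it and
// broadcast the map to every executor instead of shuffling lineitem.
val partNames = sc.broadcast(
  sc.textFile("TPC-H-0.1-TXT/part.tbl")
    .map(_.split("\\|"))
    .map(fields => (fields(0), fields(1)))  // (p_partkey, p_name)
    .collectAsMap())

val joined = lineitem
  .map(fields => (fields(1), fields(10)))   // (l_partkey, l_shipdate)
  .flatMap { case (partkey, shipdate) =>
    partNames.value.get(partkey).map(name => (name, shipdate)) }
</pre>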
</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q4 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q4 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>Implement this query with different join techniques as you see
fit. You can assume that the <code>lineitem</code>
and <code>orders</code> tables will not fit in memory, but you can
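
<p>A sketch of one reasonable mix of techniques under those
assumptions: a reduce-side join between the two large tables, with
broadcast lookups into the small ones (the tables and field positions
here are illustrative):</p>

<pre>
// Hedged sketch: shuffle-join lineitem with orders, then resolve the
// customer's nation through broadcast hash maps of the small tables;
// customer and nation are assumed loaded like lineitem above.
val custNation = sc.broadcast(
  customer.map(fields => (fields(0), fields(3))).collectAsMap()) // c_custkey -> c_nationkey
val nationName = sc.broadcast(
  nation.map(fields => (fields(0), fields(1))).collectAsMap())   // n_nationkey -> n_name

val result = lineitem.map(fields => (fields(0), fields(10)))     // (l_orderkey, l_shipdate)
  .join(orders.map(fields => (fields(0), fields(1))))            // reduce-side: adds o_custkey
  .map { case (_, (shipdate, custkey)) =>
    (nationName.value(custNation.value(custkey)), shipdate) }
</pre>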
Generate this graph for your boss.</p>

<p>First, write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q5 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --text
</pre>

<p>the raw data necessary for the graph will be printed to stdout (on the
tuple per line, where each tuple is comma-delimited and surrounded by
parentheses.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q5 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --parquet
</pre>

<p>Next, create this actual graph: use whatever tool you are
comfortable with, e.g., Excel, gnuplot, etc.</p>

</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q6 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q6 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>Implement this query as efficiently as you can, using all of the
optimizations we discussed in lecture. You will only get full points
for this question if you exploit all the optimization opportunities
</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q7 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
parentheses. Here you can assume that the date argument is only in the
format <code>YYYY-MM-DD</code> and that it is a valid date.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q7 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>Implement this query as efficiently as you can, using all of the
optimizations we discussed in lecture. Plan joins as you see fit,
keeping in mind the above assumptions about what will and will not fit in
Binary file modified assignments/TPC-H-0.1-PARQUET.tar.gz
67 changes: 31 additions & 36 deletions assignments/check_assignment5_public_linux.py
100644 → 100755
from subprocess import call

def check_a5(u):
    """Run assignment5 in linux environment"""
    call(["mvn", "clean", "package"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q1",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q1",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q2",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q2",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q3",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q3",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q4",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q4",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q5",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q5",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q6",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q6",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q7",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q7",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

if __name__ == "__main__":
    try:
        if len(sys.argv) < 2:
            print "usage: "+sys.argv[0]+" [github-username]"
            exit(1)
        check_a5(sys.argv[1])
    except Exception as e:
        print(e)
Binary file added assignments/tpc-h_v2.17.1.pdf