Updated assignment 5, local version.
lintool committed Feb 19, 2017
1 parent 564c49c commit 519f72e
Showing 4 changed files with 130 additions and 67 deletions.
130 changes: 99 additions & 31 deletions assignment5.html
<h3>Assignment 5: Data Warehousing <small>due 8:30am March 2</small></h3>
will help you make sense of the queries you are running.</p>

<p>The TPC-H benchmark comes with a data generator, and we have
generated some data for you:

<ul>

<li><a href="assignments/TPC-H-0.1-TXT.tar.gz"><code>TPC-H-0.1-TXT.tar.gz</code></a>:
the plain-text version of the data</li>

<li><a href="assignments/TPC-H-0.1-PARQUET.tar.gz"><code>TPC-H-0.1-PARQUET.tar.gz</code></a>:
the Parquet version of the data</li>

</ul>

<p>Download and unpack the datasets above. In the first part of the
assignment where you will be working with Spark locally, you will run
your queries against both datasets.</p>

<p>For the plain-text version of the data, you will see a number of
text files, each corresponding to a table in the TPC-H schema. The
files are delimited by <code>|</code>. You'll notice that some of the
fields, especially the text fields, are gibberish&mdash;that's normal,
since the contents are randomly generated.</p>
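
<p>As a rough sketch (not part of the assignment spec), a table in the
plain-text format can be loaded into an RDD by splitting on the
delimiter. The file name <code>lineitem.tbl</code> and the existence of
a <code>SparkContext</code> named <code>sc</code> are assumptions here:</p>

<pre>
// Hedged sketch: load the pipe-delimited lineitem table as an RDD of field arrays.
// split takes a regex, so the | delimiter must be escaped.
val lineitem = sc.textFile("TPC-H-0.1-TXT/lineitem.tbl")
  .map(line => line.split("\\|"))
</pre>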

<p>The Parquet data contains the same records, encoded in the Parquet
columnar format.</p>
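
<p>A corresponding sketch for the Parquet side, assuming a
<code>SparkSession</code> and one Parquet directory per table (the
directory name below is illustrative):</p>

<pre>
import org.apache.spark.sql.SparkSession

// Hedged sketch: read one table from the Parquet version of the data.
val sparkSession = SparkSession.builder.getOrCreate()
val lineitemDF = sparkSession.read.parquet("TPC-H-0.1-PARQUET/lineitem")
// Same data as the text version, but as Rows with typed columns.
val lineitemRDD = lineitemDF.rdd
</pre>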

<p>Implement the following seven SQL queries, running on
the <code>TPC-H-0.1-TXT</code> plain-text data as well as
the <code>TPC-H-0.1-PARQUET</code> Parquet data. For each query you
will write a program that takes the option <code>--text</code> to work
with the plain-text data and <code>--parquet</code> to work with the
Parquet data. More details will be provided below.</p>
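
<p>One way to handle these options is Scallop, which is used elsewhere
in the course code; the exact option set below is an assumption read
off the commands in this assignment, not a prescribed interface:</p>

<pre>
import org.rogach.scallop._

// Hedged sketch of the command-line interface each query program needs.
class Conf(args: Seq[String]) extends ScallopConf(args) {
  val input = opt[String](descr = "input directory", required = true)
  val date = opt[String](descr = "date predicate")
  val text = opt[Boolean](descr = "run against the plain-text data")
  val parquet = opt[Boolean](descr = "run against the Parquet data")
  verify()
}

// Inside main, branch on the flags:
// val args = new Conf(argv)
// if (args.text()) { /* read TPC-H-0.1-TXT */ }
// else if (args.parquet()) { /* read TPC-H-0.1-PARQUET */ }
</pre>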

<p>Each SQL query is accompanied by a written description of what the
query does; if there are any ambiguities in the language, you can
always assume that the SQL query is correct. Each of your queries will
be a separate Spark program. Put your code in the package
<code>ca.uwaterloo.cs.bigdata2017w.assignment5</code>, in the
same repo that you've been working in all semester. Since you'll be
writing Scala code, your source files should go into
</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q1 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
but there must be a line with the answer in <b>exactly</b> the above
format.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q1 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>In both cases, the value of the
<code>--input</code> argument is the directory that contains the
data (either in plain text or in Parquet). The value of the <code>--date</code> argument
corresponds to the <code>l_shipdate</code> predicate in the where
clause in the SQL query. You need to anticipate dates of the
form <code>YYYY-MM-DD</code>, <code>YYYY-MM</code>, or
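
<p>A sketch of one way to honor all three date forms is a simple
prefix match on <code>l_shipdate</code>; the column position is an
assumption about the TPC-H layout, and <code>lineitem</code> is the
RDD sketched earlier:</p>

<pre>
// Hedged sketch: "1996-01-01", "1996-01", and "1996" are all prefixes of
// matching l_shipdate values (assumed to be field 10 of lineitem, 0-indexed).
val date = "1996-01"
val matching = lineitem.filter(fields => fields(10).startsWith(date))
println("count: " + matching.count())
</pre>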
</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q2 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q2 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>In the design of this data warehouse, the <code>lineitem</code>
and <code>orders</code> tables are not likely to fit in
memory. Therefore, the only scalable join approach is the reduce-side
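
<p>A minimal sketch of the reduce-side idea, assuming
<code>lineitem</code> and <code>orders</code> RDDs loaded as shown
earlier; the projected fields are illustrative, not the required
ones:</p>

<pre>
// Hedged sketch: key both large tables by order key and let the shuffle
// bring matching records together; neither table is held in memory.
val lineitemByOrder = lineitem.map(fields => (fields(0), fields(10))) // (l_orderkey, l_shipdate)
val ordersByOrder = orders.map(fields => (fields(0), fields(6)))      // (o_orderkey, o_clerk)
val joined = lineitemByOrder.join(ordersByOrder)                      // reduce-side join
</pre>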
</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q3 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q3 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>In the design of this data warehouse, it is assumed that
the <code>part</code> and <code>supplier</code> tables will fit in
memory. Therefore, it is possible to implement a hash join. For this
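
<p>A hedged sketch of the map-side hash join this enables: collect the
small table into a map on the driver, broadcast it, and probe it while
scanning <code>lineitem</code> (field positions are assumptions):</p>

<pre>
// Hedged sketch: part fits in memory, so build a hash map from it and
// broadcast the map to every executor instead of shuffling lineitem.
val partNames = sc.broadcast(
  sc.textFile("TPC-H-0.1-TXT/part.tbl")
    .map(_.split("\\|"))
    .map(fields => (fields(0), fields(1)))  // (p_partkey, p_name)
    .collectAsMap())

val joined = lineitem
  .map(fields => (fields(1), fields(10)))   // (l_partkey, l_shipdate)
  .flatMap { case (partkey, shipdate) =>
    partNames.value.get(partkey).map(name => (name, shipdate)) }
</pre>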
</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q4 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q4 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>Implement this query with different join techniques as you see
fit. You can assume that the <code>lineitem</code>
and <code>orders</code> tables will not fit in memory, but you can
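
<p>A sketch of one reasonable mix of techniques under those
assumptions: a reduce-side join between the two large tables, with
broadcast lookups into the small ones (the tables and field positions
here are illustrative):</p>

<pre>
// Hedged sketch: shuffle-join lineitem with orders, then resolve the
// customer's nation through broadcast hash maps of the small tables;
// customer and nation are assumed loaded like lineitem above.
val custNation = sc.broadcast(
  customer.map(fields => (fields(0), fields(3))).collectAsMap()) // c_custkey -> c_nationkey
val nationName = sc.broadcast(
  nation.map(fields => (fields(0), fields(1))).collectAsMap())   // n_nationkey -> n_name

val result = lineitem.map(fields => (fields(0), fields(10)))     // (l_orderkey, l_shipdate)
  .join(orders.map(fields => (fields(0), fields(1))))            // reduce-side: adds o_custkey
  .map { case (_, (shipdate, custkey)) =>
    (nationName.value(custNation.value(custkey)), shipdate) }
</pre>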
Generate this graph for your boss.</p>

<p>First, write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q5 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --text
</pre>

<p>the raw data necessary for the graph will be printed to stdout (on the
tuple per line, where each tuple is comma-delimited and surrounded by
parentheses.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q5 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --parquet
</pre>

<p>Next, create this actual graph: use whatever tool you are
comfortable with, e.g., Excel, gnuplot, etc.</p>

</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q6 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q6 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>Implement this query as efficiently as you can, using all of the
optimizations we discussed in lecture. You will only get full points
for this question if you exploit all the optimization opportunities
</pre>

<p>Write a program such that when we execute the following
command (on the plain-text data):</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q7 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
parentheses. Here you can assume that the date argument is only in the
format <code>YYYY-MM-DD</code> and that it is a valid date.</p>

<p>And the Parquet version:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q7 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>

<p>Implement this query as efficiently as you can, using all of the
optimizations we discussed in lecture. Plan joins as you see fit,
keeping in mind the above assumptions about what will and will not fit in
Binary file modified assignments/TPC-H-0.1-PARQUET.tar.gz
67 changes: 31 additions & 36 deletions assignments/check_assignment5_public_linux.py
100644 → 100755
from subprocess import call

def check_a5(u):
    """Run assignment5 in linux environment"""
    call(["mvn", "clean", "package"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q1",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q1",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q2",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q2",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q3",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q3",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q4",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q4",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q5",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q5",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q6",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q6",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q7",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-TXT", "--date", "1996-01-01", "--text"])

    call(["spark-submit", "--class", "ca.uwaterloo.cs.bigdata2017w.assignment5.Q7",
          "target/bigdata2017w-0.1.0-SNAPSHOT.jar", "--input", "TPC-H-0.1-PARQUET", "--date", "1996-01-01", "--parquet"])

if __name__ == "__main__":
    try:
        if len(sys.argv) < 2:
            print "usage: "+sys.argv[0]+" [github-username]"
            exit(1)
        check_a5(sys.argv[1])
    except Exception as e:
        print(e)
Binary file added assignments/tpc-h_v2.17.1.pdf