Merge pull request #21 from Victor0118/update_paper
delete sift and code in paper
tuzhucheng authored Dec 16, 2017
2 parents 7d03ca4 + a7b378d commit 3c665b2
Showing 2 changed files with 10 additions and 30 deletions.
paper/largelsh.tex (39 changes: 10 additions & 29 deletions)
@@ -360,17 +360,17 @@ \subsubsection{LSH algorithm for k-Nearest Neighbor search for Classification}
\Output{Prediction List of labels $L$}
$L \leftarrow \infty$

-\For(\tcp*[f]{for each query sample}){$q=1, 2, \ldots, n$ }{
-\For(\tcp*[f]{for each hash table}){$i=1, 2, \ldots, m$}{
-compute $h_{q, i} = \mathsf{g}_i(Q_q)$ \\
-find candidates $C_i = findHash(H, h_{q, i})$
+\For(\tcp*[f]{for each query sample}){$q\leftarrow 1, 2, \ldots, n$ }{
+\For(\tcp*[f]{for each hash table}){$i\leftarrow 1, 2, \ldots, m$}{
+compute $h_{q, i} \leftarrow \mathsf{g}_i(Q_q)$ \\
+find candidates $C_i \leftarrow findHash(H, h_{q, i})$
}
-collect candidates C = $\bigcup\limits_{i=1}^{m} C_{i}$ \\
+collect candidates $C \leftarrow \bigcup\limits_{i=1}^{m} C_{i}$ \\

\eIf{$length(C)\neq 0$}
{
-find best candidates by some distance matric $C^{\prime} = findTop(C, min(k, length(C)))$ \\
-get the majority $l_q = \arg \max\limits_{c \in C^{\prime}} freq(label_c)$
+find best candidates by some distance metric $C^{\prime} \leftarrow findTop(C, min(k, length(C)))$ \\
+get the majority $l_q \leftarrow \arg \max\limits_{c \in C^{\prime}} freq(label_c)$
}
{
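As a companion to the pseudocode in this hunk, here is a minimal Scala sketch of the same classification loop: hash each query into every table, union the candidate buckets, rank candidates by distance, and take the majority label. All names here (LshKnnSketch, hashFns, tables, euclidean) are illustrative, not taken from the repository.

object LshKnnSketch {
  type Vec = Array[Double]

  def euclidean(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // H is modeled as one Map per hash table: signature -> training indices.
  def classify(queries: Seq[Vec],
               train: IndexedSeq[(Vec, Int)],     // (features, label)
               hashFns: Seq[Vec => String],       // one g_i per table
               tables: Seq[Map[String, Seq[Int]]],
               k: Int): Seq[Option[Int]] =
    queries.map { q =>
      // union the candidate buckets over the m hash tables
      val candidates = hashFns.zip(tables).flatMap { case (g, t) =>
        t.getOrElse(g(q), Seq.empty)
      }.distinct
      if (candidates.isEmpty) None                // no candidate found
      else {
        // top-min(k, |C|) candidates by distance, then majority label
        val top = candidates
          .sortBy(i => euclidean(q, train(i)._1))
          .take(math.min(k, candidates.size))
        Some(top.groupBy(i => train(i)._2).maxBy(_._2.size)._1)
      }
    }
}
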
@@ -430,22 +430,8 @@ \subsection{Distributed LSH through Spark}
operations, the outer query set $Q$ cannot be an RDD or DataFrame. Instead, we collect the query points
in groups to the Spark cluster driver (instead of the whole set, to avoid OutOfMemory errors) and do a
Scala parallel map over the query set groups to take advantage of the parallel collections built into the
-Scala language despite the fact we cannot use Spark's parallelism. For more details, one can refer to
-the \texttt{SparkLSHv2.scala} class in the code. \\
-
-{\tt \small
-\begin{verbatim}
-Q.map(vec => {
-  // T is the training set
-  // vec is the feature vector of a query example
-  ann = model.approxNearestNeighbors(T, vec, k)
-  pred = ann.select("label")
-            .groupBy("label").count
-            .sort(desc("label")).first
-  return prediction
-})
-\end{verbatim}
-}
+Scala language, despite the fact that we cannot use Spark's parallelism.


\begin{figure}[H]
\center
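The listing that previously illustrated this paragraph is deleted by this commit; for reference, here is a sketch of the grouped collect-and-parallel-map pattern it described, assuming a fitted BucketedRandomProjectionLSHModel from Spark MLlib (whose approxNearestNeighbors API the deleted listing also called). The predict helper, groupSize, and the Double-typed "label" column are assumptions, not repository code.

import org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Sketch: batch the queries, then use a Scala parallel collection (.par)
// on the driver to issue the per-query kNN jobs, since the queries
// cannot themselves live in an RDD or DataFrame here.
def predict(model: BucketedRandomProjectionLSHModel,
            train: DataFrame,         // hashed training set
            queries: Array[Vector],   // query features, already on the driver
            k: Int,
            groupSize: Int = 64): Array[Double] =
  queries.grouped(groupSize).flatMap { group =>
    group.par.map { vec =>
      val ann = model.approxNearestNeighbors(train, vec, k)
      ann.select("label")
        .groupBy("label").count()
        .sort(desc("count"))          // most frequent label first
        .first().getDouble(0)         // assumes a Double "label" column
    }.toArray
  }.toArray
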
@@ -479,12 +465,6 @@ \subsubsection{SVHN}
SVHN\footnote{\url{http://ufldl.stanford.edu/housenumbers/}}\cite{netzer2011reading} is a real-world color image dataset obtained from house numbers in Google
Street View images. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped 32-by-32 digits), but incorporates an order of magnitude more labeled data (73,257 digit images for training and 26,032 digit images for testing) and comes from a significantly harder, unsolved, real-world problem (recognizing digits and numbers in natural scene images).

-\subsubsection{SIFT}
-SIFT1M\footnote{\url{http://corpus-texmex.irisa.fr/}}\cite{jegou2011product} is created to evaluate the quality of approximate nearest neighbors search algorithm and contains the 128-dimensional SIFT descriptor of 1,000,000 base vectors and 10,000 query vectors.
-
-\subsection{Task}
-\subsubsection{Nearest Neighbor Search}
-\subsubsection{Image Classification}
\subsection{Parameters}
\subsubsection{Hardware Configuration}

@@ -510,6 +490,7 @@ \subsubsection{Parameters in Distributed LSH}
\item Bucket Length ($w$): determines the length of each line segment and is used to control the average size of hash buckets (and thus the number of buckets). The larger $w$ is, the more likely close points are projected into the same bucket, but a larger $w$ also increases the number of false positives.
\item Bias ($b$): controls the center of the hash values and is always set to 0 in our experiments.
\item Number of Hash Tables ($M$): is also the number of randomly generated lines in the data space. Increasing $M$ helps reduce the randomness introduced by random projection (RP).
+\item Number of Nearest Neighbors ($k$): a parameter for the classification task when the kNN classifier is applied. We mainly tried $k \in \left\{1, 5, 9\right\}$ in our experiments to show the effect of $k$.
\end{itemize}

\section{Results}
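For concreteness, the parameters listed above map onto Spark MLlib's BucketedRandomProjectionLSH roughly as sketched below. The values are placeholders rather than the experiment settings, and the bias $b$ has no counterpart in this API, which is consistent with it being fixed at 0.

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.sql.DataFrame

// Placeholder values; not the configuration used in the experiments.
def buildModel(trainDF: DataFrame) = {
  val lsh = new BucketedRandomProjectionLSH()
    .setBucketLength(2.0)      // w: controls the average bucket size
    .setNumHashTables(5)       // M: number of random projection lines
    .setInputCol("features")
    .setOutputCol("hashes")
  lsh.fit(trainDF)             // k enters later, at query time, via
                               // model.approxNearestNeighbors(trainDF, q, k)
}
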
src/main/scala/largelsh/SparkLSH.scala (1 change: 0 additions & 1 deletion)
@@ -19,7 +19,6 @@ class SparkLSHConf(arguments: Seq[String]) extends ScallopConf(arguments) {
verify()
}

-
object SparkLSH {

def main(args: Array[String]) {
