Merge pull request #21 from Victor0118/update_paper
delete sift and code in paper
tuzhucheng authored Dec 16, 2017
2 parents 7d03ca4 + a7b378d commit 3c665b2
Showing 2 changed files with 10 additions and 30 deletions.
paper/largelsh.tex (39 changes: 10 additions & 29 deletions)
@@ -360,17 +360,17 @@ \subsubsection{LSH algorithm for k-Nearest Neighbor search for Classification}
\Output{Prediction List of labels $L$}
$L \leftarrow \infty$

-\For(\tcp*[f]{for each query sample}){$q=1, 2, \ldots, n$ }{
-\For(\tcp*[f]{for each hash table}){$i=1, 2, \ldots, m$}{
-compute $h_{q, i} = \mathsf{g}_i(Q_q)$ \\
-find candidates $C_i = findHash(H, h_{q, i})$
+\For(\tcp*[f]{for each query sample}){$q\leftarrow 1, 2, \ldots, n$ }{
+\For(\tcp*[f]{for each hash table}){$i\leftarrow 1, 2, \ldots, m$}{
+compute $h_{q, i} \leftarrow \mathsf{g}_i(Q_q)$ \\
+find candidates $C_i \leftarrow findHash(H, h_{q, i})$
}
-collect candidates C = $\bigcup\limits_{i=1}^{m} C_{i}$ \\
+collect candidates $C \leftarrow \bigcup\limits_{i=1}^{m} C_{i}$ \\

\eIf{$length(C)\neq 0$}
{
-find best candidates by some distance matric $C^{\prime} = findTop(C, min(k, length(C)))$ \\
-get the majority $l_q = \arg \max\limits_{c \in C^{\prime}} freq(label_c)$
+find best candidates by some distance metric $C^{\prime} \leftarrow findTop(C, min(k, length(C)))$ \\
+get the majority $l_q \leftarrow \arg \max\limits_{c \in C^{\prime}} freq(label_c)$
}
{
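As a companion to the pseudocode in this hunk, here is a minimal Scala sketch of the same classification loop: hash each query into every table, union the candidate buckets, rank candidates by distance, and take the majority label. All names here (LshKnnSketch, hashFns, tables, euclidean) are illustrative, not taken from the repository.

object LshKnnSketch {
  type Vec = Array[Double]

  def euclidean(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // H is modeled as one Map per hash table: signature -> training indices.
  def classify(queries: Seq[Vec],
               train: IndexedSeq[(Vec, Int)],     // (features, label)
               hashFns: Seq[Vec => String],       // one g_i per table
               tables: Seq[Map[String, Seq[Int]]],
               k: Int): Seq[Option[Int]] =
    queries.map { q =>
      // union the candidate buckets over the m hash tables
      val candidates = hashFns.zip(tables).flatMap { case (g, t) =>
        t.getOrElse(g(q), Seq.empty)
      }.distinct
      if (candidates.isEmpty) None                // no candidate found
      else {
        // top-min(k, |C|) candidates by distance, then majority label
        val top = candidates
          .sortBy(i => euclidean(q, train(i)._1))
          .take(math.min(k, candidates.size))
        Some(top.groupBy(i => train(i)._2).maxBy(_._2.size)._1)
      }
    }
}
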
@@ -430,22 +430,8 @@ \subsection{Distributed LSH through Spark}
operations, the outer query set $Q$ cannot be an RDD or DataFrame. Instead, we collect the query points
in groups to the Spark cluster driver (instead of the whole set, to avoid OutOfMemory errors) and do a
Scala parallel map over the query set groups to take advantage of the parallel collections built into the
-Scala language despite the fact we cannot use Spark's parallelism. For more details, one can refer to
-the \texttt{SparkLSHv2.scala} class in the code. \\
-
-{\tt \small
-\begin{verbatim}
-Q.map(vec => {
-  // T is the training set
-  // vec is the feature vector of a query example
-  ann = model.approxNearestNeighbors(T, vec, k)
-  pred = ann.select("label")
-            .groupBy("label").count
-            .sort(desc("label")).first
-  return prediction
-})
-\end{verbatim}
-}
+Scala language, despite the fact that we cannot use Spark's parallelism.


\begin{figure}[H]
\center
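The listing that previously illustrated this paragraph is deleted by this commit; for reference, here is a sketch of the grouped collect-and-parallel-map pattern it described, assuming a fitted BucketedRandomProjectionLSHModel from Spark MLlib (whose approxNearestNeighbors API the deleted listing also called). The predict helper, groupSize, and the Double-typed "label" column are assumptions, not repository code.

import org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Sketch: batch the queries, then use a Scala parallel collection (.par)
// on the driver to issue the per-query kNN jobs, since the queries
// cannot themselves live in an RDD or DataFrame here.
def predict(model: BucketedRandomProjectionLSHModel,
            train: DataFrame,         // hashed training set
            queries: Array[Vector],   // query features, already on the driver
            k: Int,
            groupSize: Int = 64): Array[Double] =
  queries.grouped(groupSize).flatMap { group =>
    group.par.map { vec =>
      val ann = model.approxNearestNeighbors(train, vec, k)
      ann.select("label")
        .groupBy("label").count()
        .sort(desc("count"))          // most frequent label first
        .first().getDouble(0)         // assumes a Double "label" column
    }.toArray
  }.toArray
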
@@ -479,12 +465,6 @@ \subsubsection{SVHN}
SVHN\footnote{\url{http://ufldl.stanford.edu/housenumbers/}}\cite{netzer2011reading} is a real-world color image dataset obtained from house numbers in Google
Street View images. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped 32-by-32 digits), but incorporates an order of magnitude more labeled data (73,257 digit images for training and 26,032 digit images for testing) and comes from a significantly harder, unsolved, real-world problem (recognizing digits and numbers in natural scene images).

-\subsubsection{SIFT}
-SIFT1M\footnote{\url{http://corpus-texmex.irisa.fr/}}\cite{jegou2011product} is created to evaluate the quality of approximate nearest neighbors search algorithm and contains the 128-dimensional SIFT descriptor of 1,000,000 base vectors and 10,000 query vectors.
-
-\subsection{Task}
-\subsubsection{Nearest Neighbor Search}
-\subsubsection{Image Classification}
\subsection{Parameters}
\subsubsection{Hardware Configuration}

@@ -510,6 +490,7 @@ \subsubsection{Parameters in Distributed LSH}
\item Bucket Length ($w$): determines the length of each line segment and is used to control the average size of hash buckets (and thus the number of buckets). The larger $w$ is, the more likely close points are projected into the same bucket, but a larger $w$ also increases the number of false positives.
\item Bias ($b$): controls the center of the hash values and is always set to 0 in our experiments.
\item Number of Hash Tables ($M$): is also the number of randomly generated lines in the data space. Increasing $M$ helps reduce the randomness introduced by random projection (RP).
+\item Number of Nearest Neighbors ($k$): a parameter for the classification task when the kNN classifier is applied. We mainly tried $k \in \left\{1, 5, 9\right\}$ in our experiments to show the effect of $k$.
\end{itemize}

\section{Results}
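For concreteness, the parameters listed above map onto Spark MLlib's BucketedRandomProjectionLSH roughly as sketched below. The values are placeholders rather than the experiment settings, and the bias $b$ has no counterpart in this API, which is consistent with it being fixed at 0.

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.sql.DataFrame

// Placeholder values; not the configuration used in the experiments.
def buildModel(trainDF: DataFrame) = {
  val lsh = new BucketedRandomProjectionLSH()
    .setBucketLength(2.0)      // w: controls the average bucket size
    .setNumHashTables(5)       // M: number of random projection lines
    .setInputCol("features")
    .setOutputCol("hashes")
  lsh.fit(trainDF)             // k enters later, at query time, via
                               // model.approxNearestNeighbors(trainDF, q, k)
}
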
src/main/scala/largelsh/SparkLSH.scala (1 change: 0 additions & 1 deletion)
@@ -19,7 +19,6 @@ class SparkLSHConf(arguments: Seq[String]) extends ScallopConf(arguments) {
verify()
}

-
object SparkLSH {

def main(args: Array[String]) {
