final report

ppwwyyxx · Jan 3, 2014 · b4b139e
1 parent 7d25bdf
commit b4b139e

Showing 46 changed files with 1,117 additions and 4 deletions.
diff --git a/doc/Final-Report-Complete/Makefile b/doc/Final-Report-Complete/Makefile
@@ -0,0 +1,32 @@
+TARGET=report
+TEX=xelatex -shell-escape
+BIBTEX=biber
+READER=mupdf
+
+all: rebuild
+
+rebuild  output/$(TARGET).pdf: *.tex *.bib output
+	cd output && rm -f *.tex *.bib && ln -fs ../*.tex ../*.bib ../img .
+	pgrep -x $(firstword $(TEX)) >/dev/null || (cd output && $(TEX) $(TARGET).tex && $(BIBTEX) $(TARGET)) #&& $(TEX) $(TARGET).tex
+
+output:
+	mkdir output
+	cd output && rm -f data res src && ln -s ../img .
+
+view: output/$(TARGET).pdf
+	$(READER) output/$(TARGET).pdf &
+	(inotifywait -mqe CLOSE_WRITE output/report.pdf | while read; do killall -SIGHUP mupdf; done)
+
+clean:
+	rm -rf output
+
+run: view
+
+dist: output/$(TARGET).pdf
+	rm -rf paper
+	mkdir paper
+	cp output/$(TARGET).pdf paper/
+	7z a -tzip paper.zip paper
+	rm -rf paper
+
+.PHONY: all view clean rebuild dist run
diff --git a/doc/Final-Report-Complete/algorithm.tex b/doc/Final-Report-Complete/algorithm.tex
@@ -0,0 +1,31 @@
+\section{Algorithms}
+	In this section we present our approach to the speaker recognition problem.
+
+    An utterance of a user is collected during the enrollment procedure.
+    Further processing of the utterance proceeds in the following steps:
+    \subsection{VAD}
+        Signals must first be filtered to remove the silent parts; otherwise the
+        training might be seriously biased. Therefore \textbf{Voice Activity Detection} (VAD) must
+        be performed first.
+
+        We observed that the provided corpus is nearly noise-free.
+        Therefore we use a simple energy-based approach
+        to remove the silent parts: we discard every frame whose average
+        energy is below 0.01 times the average energy of the whole utterance.
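+
+        In symbols (the notation is ours and only a sketch of the rule above, assuming that
+        ``average energy'' means the mean squared amplitude): let $x_i[n]$ be the $L$ samples of frame $i$,
+        $E_i$ its average energy and $\bar{E}$ the average energy of the whole utterance. Frame $i$ is kept
+        if and only if
+        \[ E_i \ge 0.01\,\bar{E}, \qquad E_i = \frac{1}{L}\sum_{n=0}^{L-1} x_i[n]^2 . \]
+        For example, if $\bar{E} = 0.2$, every frame with $E_i < 0.002$ is discarded.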
+
+        This energy-based method works well on the database, but not
+        in the GUI.
+        In the GUI we use the LTSD (Long-Term Spectral Divergence) \cite{ltsd1}\cite{ltsd2}
+        algorithm, together with the noise reduction technique from SoX \cite{sox}, to obtain better results.
+
+        The LTSD algorithm splits an utterance into overlapping frames and scores each frame by
+        the probability that it contains voice activity. These scores are accumulated
+        to extract all the intervals with voice activity. The principle of LTSD is illustrated below:
+
+        \begin{figure}[H]
+          \centering
+          \includegraphics[width=0.6\textwidth]{img/ltsd.png}
+        \end{figure}
+
+        \input{feature}
+        \input{model}
diff --git a/doc/Final-Report-Complete/dataset.tex b/doc/Final-Report-Complete/dataset.tex
@@ -0,0 +1,14 @@
+\section{Dataset}
+	The dataset provided by the teacher comprises 102 speakers, of whom 60 are
+	female and the rest are male, with three different speaking styles: Spontaneous,
+	Reading and Whisper. Summary statistics are as follows:
+	\begin{table}[!ht]
+		\centering
+		\begin{tabular}{|c|c|c|c|}
+			\hline
+			& Spontaneous & Reading & Whisper \\\hline
+			Average Duration & 202s & 205s & 221s \\\hline
+			Female Average Duration & 205s & 202s & 217s \\\hline
+			Male Average Duration & 200s & 203s & 223s \\\hline
+		\end{tabular}
+	\end{table}
diff --git a/doc/Final-Report-Complete/feature.tex b/doc/Final-Report-Complete/feature.tex
@@ -0,0 +1,100 @@
+%File: feature.tex
+%Date: Fri Jan 03 17:40:07 2014 +0800
+%Author: Yuxin Wu <ppwwyyxxc@gmail.com>
+
+\subsection{Feature Extraction}
+%We extract \textbf{Mel-frequency cepstral coefficients} and \textbf{Linear Predictive
+%Coding} features using following parameter are found to be
+%optimal, according to our experiments in \secref{result}:
+%\begin{itemize}
+%\item Common parameters:
+%\begin{itemize}
+%\item Frame size: 32ms
+%\item Frame shift: 16ms
+%\item Preemphasis coefficient: 0.95
+%\end{itemize}
+%\item MFCC parameters:
+%\begin{itemize}
+%\item number of cepstral coefficient: 15
+%\item number of filter banks: 55
+%\item maximal frequency of the filter bank: 6000
+%\end{itemize}
+%\item LPC Parameters:
+%\begin{itemize}
+%\item number of coefficient: 23
+%\end{itemize}
+%\end{itemize}
+
+%and then concatenate the two feature vectors of the same frame forming
+%a larger feature vector of 15 + 23 = 38 dimension.
+
+\subsubsection{MFCC}
+\label{sec:mfcc}
+\textbf{Mel-Frequency Cepstral Coefficients} (MFCC) are a representation of the short-term power spectrum of a sound,
+based on a linear cosine transform of a log power spectrum on a nonlinear mel-scale of frequency \cite{mfcc}.
+MFCC is the most widely used feature in Automatic Speech Recognition (ASR), and it can also be applied to the speaker recognition task.
+
+
+The process of extracting MFCC features is as follows:
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=\textwidth]{img/MFCC.png}
+\end{figure}
+
+First, the input speech is divided into successive short-time frames of length $L$;
+neighboring frames overlap by $R$.
+These frames are then windowed by a Hamming window, as shown in \figref{framming}.
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=0.7\textwidth]{img/MFCC-windowing-frames.png}
+  \caption{Framing and Windowing \label{fig:framming}}
+\end{figure}
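+
+As a rough sketch of the framing arithmetic (the symbols $T$ and $w$ and the sampling rate below are
+our own illustrative assumptions, not values taken from the experiments): the frame shift is $L - R$, so a
+signal of $T \ge L$ samples yields
+\[ \left\lfloor \frac{T - L}{L - R} \right\rfloor + 1 \]
+full frames, and each frame is multiplied sample by sample with the Hamming window
+\[ w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{L - 1}\right), \qquad n = 0, 1, \ldots, L - 1 . \]
+For instance, assuming a 16 kHz sampling rate, 32 ms frames ($L = 512$) shifted by 16 ms ($R = 256$)
+give $\lfloor (16000 - 512) / 256 \rfloor + 1 = 61$ frames for one second of audio.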
+
+Then we perform the Discrete Fourier Transform (DFT) on the windowed signals to compute their spectra.
+For each of the $N$ discrete frequency bands we get a complex number $X[k]$ representing the
+magnitude and phase of that frequency component in the original signal.
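+
+For reference, a sketch of the standard definition (here $x[n]$ denotes the $N$ samples of one windowed frame
+and $j$ the imaginary unit; both symbols are our own notation):
+\[ X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-2\pi j k n / N}, \qquad k = 0, 1, \ldots, N - 1 , \]
+so that $|X[k]|$ is the magnitude and $\arg X[k]$ the phase of the $k$-th frequency component.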
+
+Human hearing is not equally sensitive to all frequency bands; in particular,
+it has lower resolution at higher frequencies.
+Scaling methods such as the Mel-scale warp the frequency axis to better fit human auditory perception.
+The Mel-scale is approximately linear below 1 kHz and logarithmic above 1 kHz, as shown below:
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=0.5\textwidth]{img/mel-scale.png}
+\end{figure}
+
+In MFCC, the Mel-scale is applied to the spectra of the signals.
+The Mel-scale warping is given by:
+\[ M(f) = 2595 \log_{10}\left(1 + \dfrac{f}{700}\right) \]
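+
+As a quick sanity check of this formula (a worked example added for illustration only):
+\[ M(700) = 2595 \log_{10} 2 \approx 781, \qquad M(1000) = 2595 \log_{10}\left(\dfrac{17}{7}\right) \approx 1000, \]
+so 1 kHz maps to roughly 1000 mel. Inverting the expression gives the frequency for a given mel value:
+\[ f = 700 \left(10^{M/2595} - 1\right). \]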
+
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=0.5\textwidth]{img/bank.png}
+  \caption{Filter Banks (6 filters) \label{fig:bank}}
+\end{figure}
+Then, we apply the bank of filters according to the Mel-scale on the spectrum,
+calculate the logarithm of the energy under each filter by $E_i[m] = \log \left(\sum_{k=0}^{N-1}{|X_i[k]|^2 H_m[k]}\right)$ (where $i$ indexes the frame and $H_m$ is the $m$-th filter) and apply the Discrete
+Cosine Transform (DCT) on $E_i[m]\ (m = 1, 2, \cdots, M)$ to get an array $c_i$:
+\[ c_i[n] = \sum_{m=1}^{M}{E_i[m]\cos\left(\dfrac{\pi n}{M}\left(m - \dfrac{1}{2}\right)\right)} \]
+
+Then, the first $k$ terms in $c_i$ can be used as features for training.
+The value of $k$ varies between cases; we further discuss the choice of $k$ in \secref{result}.
+
+\subsubsection{LPC}
+\textbf{Linear predictive coding} (LPC) is a tool used mostly in audio signal processing and speech
+processing for representing the spectral envelope of a
+digital speech signal in compressed form, using the information of a linear predictive model \cite{lpc}.
+
+The basic assumption in LPC is that,
+    in a short period, the $n$-th sample is a linear combination of the previous $p$ samples:
+    $ \hat{x}(n) = \sum_{i=1}^p a_i x(n-i)$.
+    Therefore, to estimate the coefficients $a_i$, we have to minimize the squared error
+    $ \text{E}\left[ \left(\hat{x}(n) - x(n)\right)^2 \right]$.
+    This optimization can be done by the Levinson-Durbin algorithm \cite{levinson-durbin}.
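+
+    To make this step concrete (a standard derivation sketched here with our own notation $R(j)$,
+    assuming the signal is stationary within a frame): setting the derivative of the squared error
+    with respect to each $a_j$ to zero gives the normal equations
+    \[ \sum_{i=1}^{p} a_i R(|i - j|) = R(j), \qquad j = 1, \ldots, p, \]
+    where $R(j) = \text{E}\left[ x(n)\, x(n-j) \right]$ is the autocorrelation of the signal.
+    The coefficient matrix is Toeplitz, which is what allows the Levinson-Durbin recursion to solve
+    the system in $O(p^2)$ operations instead of the $O(p^3)$ of general Gaussian elimination.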
+
+    Therefore, we first split the input signal into frames, as is done in MFCC feature extraction (\secref{mfcc}).
+    Then we calculate the LPC coefficients of order $k$ (that is, $p = k$ in the model above) for the signal in each frame.
+    Since the coefficients are a compressed description of the original audio signal,
+    they are also a good feature for speech/speaker recognition.
+    The choice of $k$ will also be discussed in \secref{result}.
+
diff --git a/doc/Final-Report-Complete/gui.tex b/doc/Final-Report-Complete/gui.tex
@@ -0,0 +1,78 @@
+\section{GUI}
+The GUI contains the following tabs:
+\begin{itemize}
+  \item \textbf{Enrollment} \\
+
+    \begin{figure}[H]
+      \centering
+      \includegraphics[width=0.8\textwidth]{img/enrollment.png}
+    \end{figure}
+
+    A new user starts by clicking the
+    Enrollment tab. New users can provide personal information
+    such as name, sex, and age, then upload a personal avatar to
+    build up their own data. Existing users can choose themselves from
+    the user list and update their information.
+
+    Next, the user needs to provide an utterance for
+    the enrollment and training process.
+
+    There are two ways to enroll a user:
+    \begin{itemize}
+      \item \textbf{Enroll by Recording}
+        Click Record to start talking, and click Stop to stop
+        and save. There is no restriction on the content of the utterance,
+        but it is highly recommended that the user speak long enough
+        to provide sufficient information for the enrollment.
+
+      \item \textbf{Enroll from Wav Files}
+        The user can upload a pre-recorded voice of a speaker (*.wav recommended).
+        The system accepts the given voice, and the enrollment of the speaker is done.
+    \end{itemize}
+
+    The user can train, dump or load his/her voice features after enrollment.
+
+  \item \textbf{Recognition of a user} \\
+    \begin{figure}[H]
+      \centering
+      \includegraphics[width=0.8\textwidth]{img/recognition.png}
+    \end{figure}
+
+    An enrolled user presents or records an utterance;
+    the system tells who the person is and shows the user's avatar.
+    Recognition of multiple pre-recorded files can be done as well.
+
+  \item \textbf{Conversation Recognition Mode} \\
+    \begin{figure}[H]
+      \centering
+      \includegraphics[width=0.8\textwidth]{img/conversation.png}
+      \caption{\label{fig:}}
+    \end{figure}
+
+    In Conversation Recognition mode, multiple users can have a conversation
+    together near the microphone. The recording procedure is the same as above.
+    The system continuously collects voice data and determines
+    who is speaking at the moment. The current speaker's avatar is shown
+    on screen; if no avatar is available, the name is shown instead. The conversation
+    audio can be downloaded and saved.
+    There are several ways to visualize the speaker distribution in the
+    conversation.
+    \begin{itemize}
+      \item \textbf{Conversation log}
+        A detailed log, including start time, stop time,
+        current speaker of each period is generated.
+      \item \textbf{Conversation flow graph}
+        \begin{figure}[H]
+          \centering
+          \includegraphics[width=0.8\textwidth]{img/conversationgraph.png}
+        \end{figure}
+
+        A timeline of the conversation is shown as a number of
+        talking-clouds joined together, labeled with start time, stop time
+        and users' avatars. Different users are shown
+        in different colors. The timeline flows to the left dynamically
+        as time elapses, which visualizes the course of the conversation.
+        This functionality is still under development.
+    \end{itemize}
+
+\end{itemize}
diff --git a/doc/Final-Report-Complete/img/50.trimed.png b/doc/Final-Report-Complete/img/50.trimed.png
diff --git a/doc/Final-Report-Complete/img/MFCC-mel-filterbank.png b/doc/Final-Report-Complete/img/MFCC-mel-filterbank.png
@@ -0,0 +1 @@
+../../Presentation/res/MFCC-mel-filterbank.png
diff --git a/doc/Final-Report-Complete/img/MFCC-windowing-frames.png b/doc/Final-Report-Complete/img/MFCC-windowing-frames.png
@@ -0,0 +1 @@
+../../Presentation/res/MFCC-windowing-frames.png
diff --git a/doc/Final-Report-Complete/img/MFCC.png b/doc/Final-Report-Complete/img/MFCC.png
@@ -0,0 +1 @@
+../../Presentation/res/MFCC.png
diff --git a/doc/Final-Report-Complete/img/a0.png b/doc/Final-Report-Complete/img/a0.png
@@ -0,0 +1 @@
+../../Presentation/res/a0.png
diff --git a/doc/Final-Report-Complete/img/a1.png b/doc/Final-Report-Complete/img/a1.png
@@ -0,0 +1 @@
+../../Presentation/res/a1.png
diff --git a/doc/Final-Report-Complete/img/all.trimed.png b/doc/Final-Report-Complete/img/all.trimed.png
diff --git a/doc/Final-Report-Complete/img/bank.png b/doc/Final-Report-Complete/img/bank.png
diff --git a/doc/Final-Report-Complete/img/conversation.png b/doc/Final-Report-Complete/img/conversation.png
diff --git a/doc/Final-Report-Complete/img/conversationgraph.png b/doc/Final-Report-Complete/img/conversationgraph.png
diff --git a/doc/Final-Report-Complete/img/crbm.pdf b/doc/Final-Report-Complete/img/crbm.pdf
@@ -0,0 +1 @@
+../../Presentation/res/crbm.pdf
diff --git a/doc/Final-Report-Complete/img/enrollment.png b/doc/Final-Report-Complete/img/enrollment.png
diff --git a/doc/Final-Report-Complete/img/gmm-compare.pdf b/doc/Final-Report-Complete/img/gmm-compare.pdf
@@ -0,0 +1 @@
+../../Presentation/res/gmm-compare.pdf
diff --git a/doc/Final-Report-Complete/img/gmm.png b/doc/Final-Report-Complete/img/gmm.png
@@ -0,0 +1 @@
+../../Presentation/res/gmm.png
diff --git a/doc/Final-Report-Complete/img/lpc-frame-len.pdf b/doc/Final-Report-Complete/img/lpc-frame-len.pdf
@@ -0,0 +1 @@
+../../Presentation/res/lpc-frame-len.pdf
diff --git a/doc/Final-Report-Complete/img/lpc-nceps.pdf b/doc/Final-Report-Complete/img/lpc-nceps.pdf
@@ -0,0 +1 @@
+../../Presentation/res/lpc-nceps.pdf
diff --git a/doc/Final-Report-Complete/img/ltsd.png b/doc/Final-Report-Complete/img/ltsd.png
@@ -0,0 +1 @@
+../../Presentation/res/ltsd.png
diff --git a/doc/Final-Report-Complete/img/mel-scale.png b/doc/Final-Report-Complete/img/mel-scale.png
@@ -0,0 +1 @@
+../../Presentation/res/mel-scale.png
diff --git a/doc/Final-Report-Complete/img/mfcc-frame-len.pdf b/doc/Final-Report-Complete/img/mfcc-frame-len.pdf
@@ -0,0 +1 @@
+../../Presentation/res/mfcc-frame-len.pdf
diff --git a/doc/Final-Report-Complete/img/mfcc-nceps.pdf b/doc/Final-Report-Complete/img/mfcc-nceps.pdf
@@ -0,0 +1 @@
+../../Presentation/res/mfcc-nceps.pdf
diff --git a/doc/Final-Report-Complete/img/mfcc-nfilter.pdf b/doc/Final-Report-Complete/img/mfcc-nfilter.pdf
@@ -0,0 +1 @@
+../../Presentation/res/mfcc-nfilter.pdf
diff --git a/doc/Final-Report-Complete/img/nmixture.pdf b/doc/Final-Report-Complete/img/nmixture.pdf
@@ -0,0 +1 @@
+../../Presentation/res/nmixture.pdf
diff --git a/doc/Final-Report-Complete/img/performance.pdf b/doc/Final-Report-Complete/img/performance.pdf
@@ -0,0 +1 @@
+../../Presentation/res/performance.pdf
diff --git a/doc/Final-Report-Complete/img/rbm-original.png b/doc/Final-Report-Complete/img/rbm-original.png
@@ -0,0 +1 @@
+../../Presentation/res/rbm-original.png
diff --git a/doc/Final-Report-Complete/img/rbm-reconstruct.png b/doc/Final-Report-Complete/img/rbm-reconstruct.png
@@ -0,0 +1 @@
+../../Presentation/res/rbm-reconstruct.png
diff --git a/doc/Final-Report-Complete/img/reading.pdf b/doc/Final-Report-Complete/img/reading.pdf
@@ -0,0 +1 @@
+../../Presentation/res/reading.pdf
diff --git a/doc/Final-Report-Complete/img/recognition.png b/doc/Final-Report-Complete/img/recognition.png
diff --git a/doc/Final-Report-Complete/img/spont.pdf b/doc/Final-Report-Complete/img/spont.pdf
@@ -0,0 +1 @@
+../../Presentation/res/spont.pdf
diff --git a/doc/Final-Report-Complete/img/time-comp-small.pdf b/doc/Final-Report-Complete/img/time-comp-small.pdf
@@ -0,0 +1 @@
+../../Presentation/res/time-comp-small.pdf
diff --git a/doc/Final-Report-Complete/img/time-comp.pdf b/doc/Final-Report-Complete/img/time-comp.pdf
diff --git a/doc/Final-Report-Complete/img/whisper.pdf b/doc/Final-Report-Complete/img/whisper.pdf
@@ -0,0 +1 @@
+../../Presentation/res/whisper.pdf
diff --git a/doc/Final-Report-Complete/implementation.tex b/doc/Final-Report-Complete/implementation.tex
@@ -0,0 +1,4 @@
+%File: implementation.tex
+%Date: Fri Jan 03 17:18:14 2014 +0800
+%Author: Yuxin Wu <ppwwyyxxc@gmail.com>
+
diff --git a/doc/Final-Report-Complete/intro.tex b/doc/Final-Report-Complete/intro.tex
@@ -0,0 +1,45 @@
+%File: intro.tex
+%Date: Fri Jan 03 17:03:58 2014 +0800
+%Author: Yuxin Wu <ppwwyyxxc@gmail.com>
+
+
+\section{Introduction}
+\textbf{Speaker recognition} is the identification of the person who is speaking by characteristics
+of their voice (voice biometrics), also called voice recognition \cite{SRwiki}.
+
+\textbf{Speaker recognition} tasks can be classified with respect to different criteria:
+text-dependent or text-independent, and verification (deciding whether the person is who he or she claims to be) or
+identification (deciding who the person is by his or her voice) \cite{SRwiki}.
+
+Speech is a complicated signal produced as a result of several transformations occurring at
+different levels: semantic, linguistic and acoustic.
+Differences in these transformations may lead to differences in the acoustic properties of the signals.
+The recognizability of a speaker can be affected not only by the linguistic message
+but also by the age, health, emotional state and effort level of the speaker.
+Background noise and the performance of the recording device also interfere
+with the classification process.
+
+Speaker recognition is an important part of Human-Computer Interaction (HCI).
+As the trend towards wearable computers shows,
+the Voice User Interface (VUI) has become a vital part of such computers.
+As these devices are particularly small, they are more likely to be lost or stolen.
+In these scenarios, speaker recognition is not only a good means of HCI,
+but also combines seamless interaction with the computer with a security safeguard
+for when the device is lost.
+The need for personal identity validation will become more acute in the future.
+Speaker verification may be essential in business telecommunications.
+Telephone banking and telephone reservation services will develop rapidly
+once secure means of authentication become available.
+
+Also, the identity of a speaker is quite often at issue in court cases.
+A crime victim may have heard but not seen the perpetrator,
+but claim to recognize the perpetrator as someone whose voice was previously familiar;
+or there may be recordings of a criminal whose identity is unknown.
+Speaker recognition techniques may provide a reliable scientific determination.
+
+Furthermore, these techniques can be used in environments which demand high security.
+They can be combined with other biometrics to form a multi-modal authentication system.
+
+In this task, we have built a proof-of-concept text-independent speaker recognition system with
+GUI support. It is fast and accurate according to our tests on a large corpus,
+and the GUI program requires only a very short utterance to respond quickly.
diff --git a/doc/Final-Report-Complete/mint-defs.tex b/doc/Final-Report-Complete/mint-defs.tex
@@ -0,0 +1,18 @@
+% $File: mint-defs.tex
+% $Date: Thu Sep 26 22:11:33 2013 +0800
+% $Author: Xinyu Zhou <zxytim@gmail.com>
+
+\newcommand{\inputmintedConfigured}[3][]{\inputminted[fontsize=\footnotesize,
+	label=#3,linenos,frame=lines,framesep=0.8em,tabsize=4,#1]{#2}{#3}}
+
+\newcommand{\txtsrc}[2][]{\inputmintedConfigured[#1]{text}{#2}}
+\newcommand{\txtsrcpart}[4][]{\txtsrc[firstline=#3,firstnumber=#3,lastline=#4,#1]{#2}}
+
+\newcommand{\cppsrc}[2][]{\inputmintedConfigured[#1]{cpp}{#2}}
+\newcommand{\cppsrcpart}[4][]{\cppsrc[firstline=#3,firstnumber=#3,lastline=#4,#1]{#2}}
+
+\newcommand{\javasrc}[2][]{\inputmintedConfigured[#1]{java}{#2}}
+\newcommand{\javasrcpart}[4][]{\javasrc[firstline=#3,firstnumber=#3,lastline=#4,#1]{#2}}
+
+\newcommand{\matlabsrc}[2][]{\inputmintedConfigured[#1]{matlab}{#2}}
+\newcommand{\matlabsrcpart}[4][]{\matlabsrc[firstline=#3,firstnumber=#3,lastline=#4,#1]{#2}}