Merge branch 'master' of git.net9.org:ppwwyyxx/speaker-recognition
zxytim committed Jan 3, 2014
2 parents eab5791 + deb7d3e commit 98c6efd
Showing 492 changed files with 163,753 additions and 2,608 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -0,0 +1,3 @@
## Introduction

This is a speaker-recognition system with a GUI, serving as an SRT project for the course *Signal Processing* (Fall 2013) at Tsinghua University.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Binary file added doc/06-Final-Report.pdf
Binary file not shown.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
32 changes: 32 additions & 0 deletions doc/Final-Report-Complete/Makefile
@@ -0,0 +1,32 @@
TARGET=report
TEX=xelatex -shell-escape
BIBTEX=biber
READER=mupdf

all: rebuild

rebuild output/$(TARGET).pdf: *.tex *.bib output
cd output && rm -f *.tex *.bib && ln -fs ../*.tex ../*.bib ../img .
pgrep -a $(TEX) || cd output && $(TEX) $(TARGET).tex && $(BIBTEX) $(TARGET) #&& $(TEX) $(TARGET).tex

output:
mkdir output
cd output && rm -f data res src && ln -s ../img .

view: output/$(TARGET).pdf
$(READER) output/$(TARGET).pdf &
(inotifywait -mqe CLOSE_WRITE output/report.pdf | while read; do killall -SIGHUP mupdf; done)

clean:
rm -rf output

run: view

dist: output/$(TARGET).pdf
rm -rf paper
mkdir paper
cp output/$(TARGET).pdf paper/
7z a -tzip paper.zip paper
rm -rf paper

.PHONY: all view clean rebuild dist
31 changes: 31 additions & 0 deletions doc/Final-Report-Complete/algorithm.tex
@@ -0,0 +1,31 @@
\section{Algorithms}
In this section we present our approach to tackling the speaker recognition problem.

An utterance of a user is collected during the enrollment procedure.
Further processing of the utterance follows these steps:
\subsection{VAD}
Signals must first be filtered to rule out the silent parts, otherwise the
training might be seriously biased. Therefore \textbf{Voice Activity Detection (VAD)} must
be performed first.

We observed that the corpus provided is nearly noise-free.
Therefore we use a simple energy-based approach
to remove the silent parts: we simply discard the frames whose average
energy is below 0.01 times the average energy of the whole utterance.
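As a rough illustration, this energy-based silence removal can be sketched in Python. Only the 0.01 threshold comes from the report; the frame length and non-overlapping framing are our assumptions:

```python
import numpy as np

def remove_silence(signal, frame_len=512, threshold=0.01):
    """Energy-based VAD sketch: drop frames whose average energy is below
    `threshold` times the average energy of the whole utterance."""
    avg_energy = np.mean(signal ** 2)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    voiced = [f for f in frames if np.mean(f ** 2) >= threshold * avg_energy]
    return np.concatenate(voiced) if voiced else np.array([])
```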

This energy-based method is found to work well on the database, but not
in the GUI.
In the GUI we use the LTSD (Long-Term Spectral Divergence) algorithm \cite{ltsd1}\cite{ltsd2},
together with the noise reduction technique from SoX \cite{sox}, to obtain better results.

The LTSD algorithm splits an utterance into overlapping frames and scores each frame with
the probability that it contains voice activity. These probabilities are then accumulated
to extract all the intervals with voice activity. The principle of LTSD is illustrated below:

\begin{figure}[H]
\centering
\includegraphics[width=0.6\textwidth]{img/ltsd.png}
\end{figure}

\input{feature}
\input{model}
File renamed without changes.
100 changes: 100 additions & 0 deletions doc/Final-Report-Complete/feature.tex
@@ -0,0 +1,100 @@
%File: feature.tex
%Date: Fri Jan 03 17:40:07 2014 +0800
%Author: Yuxin Wu <ppwwyyxxc@gmail.com>

\subsection{Feature Extraction}
%We extract \textbf{Mel-frequency cepstral coefficients} and \textbf{Linear Predictive
%Coding} features using following parameter are found to be
%optimal, according to our experiments in \secref{result}:
%\begin{itemize}
%\item Common parameters:
%\begin{itemize}
%\item Frame size: 32ms
%\item Frame shift: 16ms
%\item Preemphasis coefficient: 0.95
%\end{itemize}
%\item MFCC parameters:
%\begin{itemize}
%\item number of cepstral coefficient: 15
%\item number of filter banks: 55
%\item maximal frequency of the filter bank: 6000
%\end{itemize}
%\item LPC Parameters:
%\begin{itemize}
%\item number of coefficient: 23
%\end{itemize}
%\end{itemize}

%and then concatenate the two feature vectors of the same frame forming
%a larger feature vector of 15 + 23 = 38 dimension.

\subsubsection{MFCC}
\label{sec:mfcc}
\textbf{Mel-Frequency Cepstral Coefficients (MFCC)} are a representation of the short-term power spectrum of a sound,
based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency \cite{mfcc}.
MFCC is the most widely used feature in Automatic Speech Recognition (ASR), and it can also be applied to the speaker recognition task.


The process to extract MFCC features is as follows:
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{img/MFCC.png}
\end{figure}

First, the input speech is divided into successive short-time frames of length $L$,
where neighboring frames overlap by $R$.
The frames are then windowed by a Hamming window, as shown in \figref{framming}.
\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{img/MFCC-windowing-frames.png}
\caption{Framing and Windowing \label{fig:framming}}
\end{figure}

Then we perform the Discrete Fourier Transform (DFT) on the windowed signals to compute their spectra.
For each of the $N$ discrete frequency bands we get a complex number $X[k]$ representing
the magnitude and phase of that frequency component in the original signal.
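The framing, windowing, and DFT steps can be sketched with NumPy as follows; the frame length and shift here are illustrative placeholders, not the report's tuned parameters:

```python
import numpy as np

def frame_spectra(signal, frame_len=512, frame_shift=256):
    """Split a signal into overlapping frames, apply a Hamming window,
    and compute each frame's magnitude spectrum via the DFT."""
    window = np.hamming(frame_len)
    starts = range(0, len(signal) - frame_len + 1, frame_shift)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    # rfft returns the frame_len//2 + 1 non-redundant complex coefficients X[k]
    return np.abs(np.fft.rfft(frames, axis=1))
```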

Human hearing is not equally sensitive to all frequency bands; in particular,
it has lower resolution at higher frequencies.
Scaling methods like the mel scale aim to warp the frequency domain to better fit human auditory perception.
The mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz, as shown below:
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{img/mel-scale.png}
\end{figure}

In MFCC, the mel scale is applied to the spectra of the signals.
The mel-scale warping is given by:
\[ M(f) = 2595 \log_{10}\left(1 + \dfrac{f}{700}\right) \]
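For illustration, the warping formula and its inverse in Python (the inverse is our own addition, useful when placing filter-bank edges uniformly on the mel scale):

```python
import math

def hz_to_mel(f):
    """Mel-scale warping: M(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse warping, for mapping mel-spaced points back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

For example, 1000 Hz maps to roughly 1000 mel, which is how the scale is calibrated.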

\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{img/bank.png}
\caption{Filter Banks (6 filters) \label{fig:bank}}
\end{figure}
Then we apply the bank of mel-scale filters to the spectrum,
calculate the logarithm of the energy under each filter, $E_i[m] = \log \left(\sum_{k=0}^{N-1}{|X_i[k]|^2 H_m[k]}\right)$, and apply the Discrete
Cosine Transform (DCT) to $E_i[m]\ (m = 1, 2, \cdots, M)$ to get an array $c_i$:
\[ c_i[n] = \sum_{m=1}^{M}{E_i[m]\cos\left(\dfrac{\pi n}{M}\left(m - \dfrac{1}{2}\right)\right)} \]

Then, the first $k$ terms of $c_i$ can be used as features for subsequent training.
The value of $k$ varies from case to case; we further discuss the choice of $k$ in \secref{result}.
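A minimal NumPy sketch of the log-energy and DCT steps, assuming a precomputed mel filter bank `H` of shape (M filters, N frequency bins); constructing the triangular filters themselves is omitted:

```python
import numpy as np

def mfcc_from_spectrum(power_spectrum, filter_bank, n_ceps):
    """Given one frame's power spectrum |X[k]|^2 and a filter bank H,
    compute log energies E[m] (m = 1..M) and apply the DCT
    c[n] = sum_m E[m] * cos(pi * n / M * (m - 1/2)),
    keeping the first n_ceps coefficients."""
    M = filter_bank.shape[0]
    E = np.log(filter_bank @ power_spectrum)
    m = np.arange(1, M + 1)                 # m = 1..M as in the formula
    n = np.arange(n_ceps)[:, None]
    return (np.cos(np.pi * n / M * (m - 0.5)) * E).sum(axis=1)
```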

\subsubsection{LPC}
\textbf{Linear Predictive Coding (LPC)} is a tool used mostly in audio signal processing and speech
processing for representing the spectral envelope of a digital speech signal
in compressed form, using the information of a linear predictive model \cite{lpc}.

The basic assumption in LPC is that,
over a short period, the $n$-th sample is a linear combination of the previous $p$ samples:
\[ \hat{x}(n) = \sum_{i=1}^{p} a_i x(n-i) \]
Therefore, to estimate the coefficients $a_i$, we minimize the squared error
$\mathrm{E}\left[ \left(\hat{x}(n) - x(n)\right)^2\right]$.
This optimization can be carried out by the Levinson-Durbin algorithm \cite{levinson-durbin}.

Therefore, we first split the input signal into frames, as is done in MFCC feature extraction (\secref{mfcc}).
Then we calculate the order-$k$ LPC coefficients for the signal in each frame.
Since the coefficients are a compressed description of the original audio signal,
they are also a good feature for speech/speaker recognition.
The choice of $k$ is likewise discussed further in \secref{result}.
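A sketch of LPC estimation via the Levinson-Durbin recursion on a frame's autocorrelation; this is our own illustrative implementation, not the project's code:

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC coefficients a_1..a_p so that x(n) ~ sum_i a_i * x(n-i),
    by running the Levinson-Durbin recursion on the autocorrelation."""
    n = len(frame)
    # Autocorrelation lags r[0..order]
    r = np.array([np.dot(frame[:n - i], frame[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)   # prediction-error polynomial A(z), a[0] = 1
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this order
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return -a[1:]  # flip sign to match the predictor convention above
```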

File renamed without changes.
File renamed without changes
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/MFCC-mel-filterbank.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/MFCC-windowing-frames.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/MFCC.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/a0.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/a1.png
File renamed without changes
File renamed without changes
File renamed without changes
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/crbm.pdf
File renamed without changes
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/gmm-compare.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/gmm.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/lpc-frame-len.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/lpc-nceps.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/ltsd.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mel-scale.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mfcc-frame-len.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mfcc-nceps.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mfcc-nfilter.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/nmixture.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/performance.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/rbm-original.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/rbm-reconstruct.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/reading.pdf
File renamed without changes
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/spont.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/time-comp-small.pdf
File renamed without changes.
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/whisper.pdf
5 changes: 5 additions & 0 deletions doc/Final-Report-Complete/implementation.tex
@@ -0,0 +1,5 @@
%File: implementation.tex
%Date: Fri Jan 03 18:37:07 2014 +0800
%Author: Yuxin Wu <ppwwyyxxc@gmail.com>

\section{Implementation}
45 changes: 45 additions & 0 deletions doc/Final-Report-Complete/intro.tex
@@ -0,0 +1,45 @@
%File: intro.tex
%Date: Fri Jan 03 17:03:58 2014 +0800
%Author: Yuxin Wu <ppwwyyxxc@gmail.com>


\section{Introduction}
\textbf{Speaker recognition} is the identification of the person who is speaking from characteristics
of their voice (voice biometrics), also called voice recognition \cite{SRwiki}.

A \textbf{speaker recognition} task can be classified with respect to different criteria:
text-dependent or text-independent; verification (deciding whether the person is who they claim to be) or
identification (deciding who the person is from their voice) \cite{SRwiki}.

Speech is a complicated signal produced as a result of several transformations occurring at
different levels: semantic, linguistic, and acoustic.
Differences in these transformations may lead to differences in the acoustic properties of the signals.
The recognizability of a speaker can be affected not only by the linguistic message
but also by the age, health, emotional state, and effort level of the speaker.
Background noise and the performance of the recording device also interfere with
the classification process.

Speaker recognition is an important part of Human-Computer Interaction (HCI).
As the trend toward wearable computing shows,
the Voice User Interface (VUI) has become a vital part of such computers.
Because these devices are particularly small, they are more likely to be lost or stolen.
In these scenarios, speaker recognition provides not only good HCI,
but also a combination of seamless interaction with the computer and a security guard
when the device is lost.
The need for personal identity validation will become more acute in the future.
Speaker verification may be essential in business telecommunications.
Telephone banking and telephone reservation services will develop rapidly
once secure means of authentication are available.

Also, the identity of a speaker is quite often at issue in court cases.
A crime victim may have heard but not seen the perpetrator,
yet claim to recognize the perpetrator as someone whose voice was previously familiar;
or there may be recordings of a criminal whose identity is unknown.
Speaker recognition techniques may provide a reliable scientific determination.

Furthermore, these techniques can be used in environments that demand high security.
They can be combined with other biometrics to form a multi-modal authentication system.

In this project, we have built a proof-of-concept text-independent speaker recognition system with
GUI support. It is fast and accurate, based on our tests on a large corpus,
and the GUI program requires only a very short utterance to respond quickly.
File renamed without changes.