final report
ppwwyyxx committed Jan 3, 2014
1 parent 7d25bdf commit b4b139e
Showing 46 changed files with 1,117 additions and 4 deletions.
32 changes: 32 additions & 0 deletions doc/Final-Report-Complete/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
TARGET=report
TEX=xelatex -shell-escape
BIBTEX=biber
READER=mupdf

all: rebuild

rebuild output/$(TARGET).pdf: *.tex *.bib output
	cd output && rm -f *.tex *.bib && ln -fs ../*.tex ../*.bib ../img .
	pgrep -a $(TEX) || cd output && $(TEX) $(TARGET).tex && $(BIBTEX) $(TARGET) #&& $(TEX) $(TARGET).tex

output:
	mkdir output
	cd output && rm -f data res src && ln -s ../img .

view: output/$(TARGET).pdf
	$(READER) output/$(TARGET).pdf &
	(inotifywait -mqe CLOSE_WRITE output/$(TARGET).pdf | while read; do killall -SIGHUP $(READER); done)

clean:
	rm -rf output

run: view

dist: output/$(TARGET).pdf
	rm -rf paper
	mkdir paper
	cp output/$(TARGET).pdf paper/
	7z a -tzip paper.zip paper
	rm -rf paper

.PHONY: all view clean rebuild dist
31 changes: 31 additions & 0 deletions doc/Final-Report-Complete/algorithm.tex
@@ -0,0 +1,31 @@
\section{Algorithms}
In this section we present our approach to the speaker recognition problem.

An utterance of a user is collected during the enrollment procedure.
The utterance is then processed in the following steps:
\subsection{VAD}
Signals must first be filtered to remove the silent parts; otherwise the
training might be seriously biased. Therefore \textbf{Voice Activity Detection} (VAD) must
be performed first.

We observed that the provided corpus is nearly noise-free.
Therefore we use a simple energy-based approach
to remove the silent parts: we discard every frame whose average
energy is below 0.01 times the average energy of the whole utterance.
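
The energy-based rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function name \texttt{remove\_silence}, the non-overlapping framing and the default frame length are hypothetical; only the 0.01 threshold ratio is taken from the text.

```python
def remove_silence(samples, frame_len=512, ratio=0.01):
    """Drop frames whose average energy is below `ratio` times the
    average energy of the whole utterance.

    `frame_len` and the non-overlapping framing are illustrative
    assumptions; the 0.01 ratio is the value quoted in the text.
    """
    if not samples:
        return []
    # Average energy (mean squared amplitude) of the whole utterance.
    total_energy = sum(s * s for s in samples) / len(samples)
    threshold = ratio * total_energy

    voiced = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        frame_energy = sum(s * s for s in frame) / len(frame)
        # Keep only frames that reach the energy threshold.
        if frame_energy >= threshold:
            voiced.extend(frame)
    return voiced
```

Because the threshold is relative to the utterance itself, such a rule needs no calibration, but it can only be expected to work when the background is close to silent, as in the corpus described above.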

This energy-based method works well on the database, but not
in the GUI.
In the GUI we use the LTSD (Long-Term Spectral Divergence) algorithm \cite{ltsd1}\cite{ltsd2},
together with the noise reduction technique from SoX \cite{sox}, to obtain better results.

The LTSD algorithm splits an utterance into overlapping frames and gives each frame a score for
the probability that it contains voice activity. These probabilities are accumulated
to extract all the intervals with voice activity. The principle of LTSD is illustrated below:

\begin{figure}[H]
\centering
\includegraphics[width=0.6\textwidth]{img/ltsd.png}
\end{figure}

\input{feature}
\input{model}
14 changes: 14 additions & 0 deletions doc/Final-Report-Complete/dataset.tex
@@ -0,0 +1,14 @@
\section{Dataset}
The dataset provided by the teacher comprises 102 speakers, of which 60 are
female and the rest are male, in three different speaking styles: Spontaneous,
Reading and Whisper. Summary statistics are as follows:
\begin{table}[!ht]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
& Spontaneous & Reading & Whisper \\\hline
Average Duration & 202s & 205s & 221s \\\hline
Female Average Duration & 205s & 202s & 217s \\\hline
Male Average Duration & 200s & 203s & 223s \\\hline
\end{tabular}
\end{table}
100 changes: 100 additions & 0 deletions doc/Final-Report-Complete/feature.tex
@@ -0,0 +1,100 @@
%File: feature.tex
%Date: Fri Jan 03 17:40:07 2014 +0800
%Author: Yuxin Wu <ppwwyyxxc@gmail.com>

\subsection{Feature Extraction}
%We extract \textbf{Mel-frequency cepstral coefficients} and \textbf{Linear Predictive
%Coding} features using following parameter are found to be
%optimal, according to our experiments in \secref{result}:
%\begin{itemize}
%\item Common parameters:
%\begin{itemize}
%\item Frame size: 32ms
%\item Frame shift: 16ms
%\item Preemphasis coefficient: 0.95
%\end{itemize}
%\item MFCC parameters:
%\begin{itemize}
%\item number of cepstral coefficient: 15
%\item number of filter banks: 55
%\item maximal frequency of the filter bank: 6000
%\end{itemize}
%\item LPC Parameters:
%\begin{itemize}
%\item number of coefficient: 23
%\end{itemize}
%\end{itemize}

%and then concatenate the two feature vectors of the same frame forming
%a larger feature vector of 15 + 23 = 38 dimension.

\subsubsection{MFCC}
\label{sec:mfcc}
\textbf{Mel-Frequency Cepstral Coefficients} (MFCC) are a representation of the short-term power spectrum of a sound,
based on a linear cosine transform of a log power spectrum on a nonlinear mel-scale of frequency \cite{mfcc}.
MFCCs are the most widely used features in Automatic Speech Recognition (ASR), and they can also be applied to the speaker recognition task.


The process of extracting MFCC features is as follows:
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{img/MFCC.png}
\end{figure}

First, the input speech is divided into successive short-time frames of length $L$,
with neighboring frames overlapping by $R$.
The frames are then windowed with a Hamming window, as shown in \figref{framming}.
\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{img/MFCC-windowing-frames.png}
\caption{Framing and Windowing \label{fig:framming}}
\end{figure}
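
The framing and windowing step can be sketched as below. This is a minimal illustration, not the project's code: the names \texttt{hamming} and \texttt{frame\_signal} are hypothetical, lengths are given in samples rather than milliseconds, and dropping an incomplete trailing frame is a simplifying assumption.

```python
import math


def hamming(length):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(samples, frame_len, overlap):
    """Split `samples` into Hamming-windowed frames of `frame_len`
    samples, where neighboring frames share `overlap` samples.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption; zero-padding is another common choice).
    """
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frame_len")
    step = frame_len - overlap  # frame shift in samples
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

With a frame length $L$ and an overlap $R$, the frame shift is $L - R$; for example, ten samples with $L = 4$ and $R = 2$ yield four frames starting at samples 0, 2, 4 and 6.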

Then, we perform the Discrete Fourier Transform (DFT) on the windowed signals to compute their spectra.
For each of the $N$ discrete frequency bands we get a complex number $X[k]$ representing
the magnitude and phase of that frequency component in the original signal.

Human hearing is not equally sensitive to all frequency bands; in particular,
it has lower resolution at higher frequencies.
Scaling methods such as the Mel-scale warp the frequency axis to better fit human auditory perception.
They are approximately linear below 1 kHz and logarithmic above 1 kHz, as shown below:
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{img/mel-scale.png}
\end{figure}

In MFCC, the Mel-scale is applied to the spectra of the signals.
The Mel-scale warping is given by:
\[ M(f) = 2595 \log_{10}\left(1 + \dfrac{f}{700}\right) \]
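
The warping formula and its inverse can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and placing filter centers at equal spacing on the mel axis between 0 Hz and a maximal frequency is a common convention that we assume here rather than take from the text.

```python
import math


def hz_to_mel(f_hz):
    """M(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, f_max_hz):
    """Center frequencies in Hz of `num_filters` filters spaced
    evenly on the mel scale between 0 Hz and `f_max_hz`
    (an assumed, commonly used layout)."""
    step = hz_to_mel(f_max_hz) / (num_filters + 1)
    return [mel_to_hz(step * (i + 1)) for i in range(num_filters)]
```

Equal spacing in mel gives narrow, closely spaced filters at low frequencies and wide, sparse filters at high frequencies, which matches the lower resolution of hearing at higher frequencies.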

\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{img/bank.png}
\caption{Filter Banks (6 filters) \label{fig:bank}}
\end{figure}
Then, we apply the bank of filters spaced according to the Mel-scale to the spectrum of each frame $i$,
calculate the logarithm of the energy under each filter by $E_i[m] = \log \left(\sum_{k=0}^{N-1}{|X_i[k]|^2 H_m[k]}\right)$, and apply the Discrete
Cosine Transform (DCT) to $E_i[m]\ (m = 1, 2, \cdots, M)$ to get an array $c_i$:
\[ c_i[n] = \sum_{m=1}^{M}{E_i[m]\cos\left(\dfrac{\pi n}{M}\left(m - \dfrac{1}{2}\right)\right)} \]

Then, the first $k$ terms of $c_i$ can be used as features for training.
The value of $k$ varies between cases; we discuss the choice of $k$ in \secref{result}.
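
The last two steps, log filter-bank energies followed by the DCT, can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, the filters are passed in as plain weight lists $H_m[k]$, the index convention is $m = 1, \ldots, M$ with the offset $m - \frac{1}{2}$, and the small floor inside the logarithm is our own guard against empty bands.

```python
import math


def log_filterbank_energies(power_spectrum, filters):
    """E[m] = log(sum_k |X[k]|^2 * H_m[k]) for every filter H_m.

    `power_spectrum` holds |X[k]|^2; `filters` is a list of weight
    lists, one per filter, each as long as the spectrum.
    """
    energies = []
    for weights in filters:
        total = sum(p * w for p, w in zip(power_spectrum, weights))
        # Floor to avoid log(0) for an empty band (an assumption).
        energies.append(math.log(max(total, 1e-12)))
    return energies


def dct_cepstrum(log_energies, num_coeffs):
    """c[n] = sum_{m=1..M} E[m] * cos(pi * n / M * (m - 1/2)),
    returned for n = 0 .. num_coeffs - 1."""
    M = len(log_energies)
    return [sum(e * math.cos(math.pi * n / M * (m - 0.5))
                for m, e in enumerate(log_energies, start=1))
            for n in range(num_coeffs)]
```

Note that $c[0]$ is simply the sum of the log energies, so it mostly reflects overall loudness; the higher coefficients describe the shape of the spectral envelope.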

\subsubsection{LPC}
\textbf{Linear predictive coding} (LPC) is a tool used mostly in audio signal processing and speech
processing for representing the spectral envelope of a
digital speech signal in compressed form, using the information of a linear predictive model \cite{lpc}.

The basic assumption of LPC is that,
within a short period, the $n$th sample is a linear combination of the previous $p$ samples:
$ \hat{x}(n) = \sum_{i=1}^{p} a_i x(n-i)$.
Therefore, to estimate the coefficients $a_i$, we minimize the expected squared error
$ \text{E}\left[ \left(\hat{x}(n) - x(n)\right)^2 \right]$.
This optimization can be done with the Levinson-Durbin algorithm \cite{levinson-durbin}.
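
A compact sketch of this estimation is given below. It uses the autocorrelation method, which is one standard way to set up the least-squares problem that the Levinson-Durbin recursion solves; whether the project uses exactly this variant is an assumption, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """r[k] = sum_n x[n] * x[n - k] for k = 0 .. max_lag."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_j a_j * r[|i - j|] = r[i] (i = 1 .. order) for the
    predictor coefficients a_1 .. a_order.

    Returns (coefficients, residual prediction error).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err == 0.0:
            break  # signal is perfectly predictable (or all zero)
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


def lpc(frame, order):
    """Order-`order` LPC coefficients of one frame."""
    coeffs, _ = levinson_durbin(autocorrelation(frame, order), order)
    return coeffs
```

The recursion exploits the Toeplitz structure of the autocorrelation matrix, so the coefficients of order $p$ are obtained in $O(p^2)$ operations instead of the $O(p^3)$ of a general linear solve.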

We therefore first split the input signal into frames, as is done in MFCC feature extraction (\secref{mfcc}).
Then we calculate the order-$k$ LPC coefficients for the signal in each frame (that is, $p = k$ in the model above).
Since the coefficients are a compressed description of the original audio signal,
they are also a good feature for speech/speaker recognition.
The choice of $k$ is discussed further in \secref{result}.

78 changes: 78 additions & 0 deletions doc/Final-Report-Complete/gui.tex
@@ -0,0 +1,78 @@
\section{GUI}
The GUI contains the following tabs:
\begin{itemize}
\item \textbf{Enrollment} \\

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{img/enrollment.png}
\end{figure}

A new user starts by clicking the
Enrollment tab. New users can provide personal information
such as name, sex, and age, and then upload a personal avatar to
build up their own profile. Existing users can choose themselves from
the user list and update their information.

Next, the user needs to provide an utterance for
the enrollment and training process.

There are two ways to enroll a user:
\begin{itemize}
\item \textbf{Enroll by Recording}
Click Record to start talking, and click Stop to stop
and save. There is no restriction on the content of the utterance,
but it is highly recommended that the user speak long enough
to provide sufficient information for the enrollment.

\item \textbf{Enroll from Wav Files}
The user can upload a pre-recorded voice sample of a speaker (*.wav recommended).
The system accepts the given recording, and the enrollment of the speaker is done.
\end{itemize}

The user can train, dump or load his/her voice features after enrollment.

\item \textbf{Recognition of a user} \\
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{img/recognition.png}
\end{figure}

An enrolled user presents or records an utterance;
the system tells who the person is and shows the user's avatar.
Recognition of multiple pre-recorded files can be done as well.

\item \textbf{Conversation Recognition Mode} \\
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{img/conversation.png}
\caption{\label{fig:}}
\end{figure}

In Conversation Recognition mode, multiple users can have a conversation
together near the microphone. The recording procedure is the same as above.
The system continuously collects voice data and determines
who is speaking at the moment. The current speaker's avatar is shown
on screen; if no avatar is available, the name is shown instead. The conversation
audio can be downloaded and saved.
There are several ways to visualize the distribution of speakers in the
conversation:
\begin{itemize}
\item \textbf{Conversation log}
A detailed log is generated, including the start time, stop time
and current speaker of each period.
\item \textbf{Conversation flow graph}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{img/conversationgraph.png}
\end{figure}

A timeline of the conversation is shown as a series of connected
talking-clouds, labeled with start time, stop time
and the users' avatars. Different users are shown
in different colors. The timeline flows to the left dynamically
as time elapses. This functionality is still under development.
\end{itemize}

\end{itemize}
Binary file added doc/Final-Report-Complete/img/50.trimed.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/MFCC-mel-filterbank.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/MFCC-windowing-frames.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/MFCC.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/a0.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/a1.png
Binary file added doc/Final-Report-Complete/img/all.trimed.png
Binary file added doc/Final-Report-Complete/img/bank.png
Binary file added doc/Final-Report-Complete/img/conversation.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/crbm.pdf
Binary file added doc/Final-Report-Complete/img/enrollment.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/gmm-compare.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/gmm.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/lpc-frame-len.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/lpc-nceps.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/ltsd.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mel-scale.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mfcc-frame-len.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mfcc-nceps.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mfcc-nfilter.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/nmixture.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/performance.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/rbm-original.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/rbm-reconstruct.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/reading.pdf
Binary file added doc/Final-Report-Complete/img/recognition.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/spont.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/time-comp-small.pdf
Binary file added doc/Final-Report-Complete/img/time-comp.pdf
Binary file not shown.
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/whisper.pdf
4 changes: 4 additions & 0 deletions doc/Final-Report-Complete/implementation.tex
@@ -0,0 +1,4 @@
%File: implementation.tex
%Date: Fri Jan 03 17:18:14 2014 +0800
%Author: Yuxin Wu <ppwwyyxxc@gmail.com>

45 changes: 45 additions & 0 deletions doc/Final-Report-Complete/intro.tex
@@ -0,0 +1,45 @@
%File: intro.tex
%Date: Fri Jan 03 17:03:58 2014 +0800
%Author: Yuxin Wu <ppwwyyxxc@gmail.com>


\section{Introduction}
\textbf{Speaker recognition}, also called voice recognition, is the identification of the person who is speaking by characteristics
of their voice (voice biometrics) \cite{SRwiki}.

\textbf{Speaker recognition} tasks can be classified with respect to different criteria:
text-dependent or text-independent, and verification (deciding whether a person is who he or she claims to be) or
identification (deciding who a person is from his or her voice) \cite{SRwiki}.

Speech is a complicated signal produced as a result of several transformations occurring at
different levels: semantic, linguistic and acoustic.
Differences in these transformations may lead to differences in the acoustic properties of the signals.
The recognizability of a speaker can be affected not only by the linguistic message
but also by the age, health, emotional state and effort level of the speaker.
Background noise and the performance of the recording device also interfere with
the classification process.

Speaker recognition is an important part of Human-Computer Interaction (HCI).
With the growing use of wearable computers,
the Voice User Interface (VUI) has become a vital part of such devices.
As these devices are particularly small, they are more likely to be lost or stolen.
In these scenarios, speaker recognition not only serves as a good HCI,
but also combines seamless interaction with the computer with a security safeguard
when the device is lost.
The need for personal identity validation will become more acute in the future.
Speaker verification may be essential in business telecommunications.
Telephone banking and telephone reservation services will develop rapidly
once secure means of authentication are available.

Also, the identity of a speaker is quite often at issue in court cases.
A crime victim may have heard but not seen the perpetrator,
but claim to recognize the perpetrator as someone whose voice was previously familiar;
or there may be recordings of a criminal whose identity is unknown.
Speaker recognition techniques may provide a reliable scientific determination.

Furthermore, these techniques can be used in environments that demand high security.
They can be combined with other biometrics to form a multi-modal authentication system.

In this task, we have built a proof-of-concept text-independent speaker recognition system with
GUI support. It is fast and accurate according to our tests on a large corpus,
and the GUI program requires only a very short utterance to respond quickly.
18 changes: 18 additions & 0 deletions doc/Final-Report-Complete/mint-defs.tex
@@ -0,0 +1,18 @@
% $File: mint-defs.tex
% $Date: Thu Sep 26 22:11:33 2013 +0800
% $Author: Xinyu Zhou <zxytim@gmail.com>

\newcommand{\inputmintedConfigured}[3][]{\inputminted[fontsize=\footnotesize,
label=#3,linenos,frame=lines,framesep=0.8em,tabsize=4,#1]{#2}{#3}}

\newcommand{\txtsrc}[2][]{\inputmintedConfigured[#1]{text}{#2}}
\newcommand{\txtsrcpart}[4][]{\txtsrc[firstline=#3,firstnumber=#3,lastline=#4,#1]{#2}}

\newcommand{\cppsrc}[2][]{\inputmintedConfigured[#1]{cpp}{#2}}
\newcommand{\cppsrcpart}[4][]{\cppsrc[firstline=#3,firstnumber=#3,lastline=#4,#1]{#2}}

\newcommand{\javasrc}[2][]{\inputmintedConfigured[#1]{java}{#2}}
\newcommand{\javasrcpart}[4][]{\javasrc[firstline=#3,firstnumber=#3,lastline=#4,#1]{#2}}

\newcommand{\matlabsrc}[2][]{\inputmintedConfigured[#1]{matlab}{#2}}
\newcommand{\matlabsrcpart}[4][]{\matlabsrc[firstline=#3,firstnumber=#3,lastline=#4,#1]{#2}}