Merge branch 'master' of git.net9.org:ppwwyyxx/speaker-recognition
zxytim committed Jan 3, 2014
2 parents eab5791 + deb7d3e commit 98c6efd
Showing 492 changed files with 163,753 additions and 2,608 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -0,0 +1,3 @@
## Introduction

This is a speaker-recognition system with a GUI, serving as an SRT project for the course *Signal Processing* (Fall 2013) at Tsinghua University.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Binary file added doc/06-Final-Report.pdf
Binary file not shown.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
32 changes: 32 additions & 0 deletions doc/Final-Report-Complete/Makefile
@@ -0,0 +1,32 @@
TARGET=report
TEX=xelatex -shell-escape
BIBTEX=biber
READER=mupdf

all: rebuild

rebuild output/$(TARGET).pdf: *.tex *.bib output
cd output && rm -f *.tex *.bib && ln -fs ../*.tex ../*.bib ../img .
pgrep -a $(TEX) || cd output && $(TEX) $(TARGET).tex && $(BIBTEX) $(TARGET) #&& $(TEX) $(TARGET).tex

output:
mkdir output
cd output && rm -f data res src && ln -s ../img .

view: output/$(TARGET).pdf
$(READER) output/$(TARGET).pdf &
(inotifywait -mqe CLOSE_WRITE output/report.pdf | while read; do killall -SIGHUP mupdf; done)

clean:
rm -rf output

run: view

dist: output/$(TARGET).pdf
rm -rf paper
mkdir paper
cp output/$(TARGET).pdf paper/
7z a -tzip paper.zip paper
rm -rf paper

.PHONY: all view clean rebuild dist
31 changes: 31 additions & 0 deletions doc/Final-Report-Complete/algorithm.tex
@@ -0,0 +1,31 @@
\section{Algorithms}
In this section we present our approach to tackling the speaker recognition problem.

An utterance of a user is collected during the enrollment procedure.
Further processing of the utterance follows these steps:
\subsection{VAD}
Signals must first be filtered to rule out the silent parts, otherwise the
training might be seriously biased. Therefore \textbf{Voice Activity Detection (VAD)} must
be performed first.

We observed that the corpus provided is nearly noise-free.
Therefore we use a simple energy-based approach
to remove the silent parts: we simply discard the frames whose average
energy is below 0.01 times the average energy of the whole utterance.
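As a rough illustration, this energy-based silence removal can be sketched in Python. Only the 0.01 threshold comes from the report; the frame length and non-overlapping framing are our assumptions:

```python
import numpy as np

def remove_silence(signal, frame_len=512, threshold=0.01):
    """Energy-based VAD sketch: drop frames whose average energy is below
    `threshold` times the average energy of the whole utterance."""
    avg_energy = np.mean(signal ** 2)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    voiced = [f for f in frames if np.mean(f ** 2) >= threshold * avg_energy]
    return np.concatenate(voiced) if voiced else np.array([])
```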

This energy-based method is found to work well on the database, but not
in the GUI.
In the GUI we use the LTSD (Long-Term Spectral Divergence) algorithm \cite{ltsd1}\cite{ltsd2},
together with the noise reduction technique from SoX \cite{sox}, to obtain better results.

The LTSD algorithm splits an utterance into overlapping frames and scores each frame with
the probability that it contains voice activity. These probabilities are then accumulated
to extract all the intervals with voice activity. The principle of LTSD is illustrated below:

\begin{figure}[H]
\centering
\includegraphics[width=0.6\textwidth]{img/ltsd.png}
\end{figure}

\input{feature}
\input{model}
File renamed without changes.
100 changes: 100 additions & 0 deletions doc/Final-Report-Complete/feature.tex
@@ -0,0 +1,100 @@
%File: feature.tex
%Date: Fri Jan 03 17:40:07 2014 +0800
%Author: Yuxin Wu <ppwwyyxxc@gmail.com>

\subsection{Feature Extraction}
%We extract \textbf{Mel-frequency cepstral coefficients} and \textbf{Linear Predictive
%Coding} features using following parameter are found to be
%optimal, according to our experiments in \secref{result}:
%\begin{itemize}
%\item Common parameters:
%\begin{itemize}
%\item Frame size: 32ms
%\item Frame shift: 16ms
%\item Preemphasis coefficient: 0.95
%\end{itemize}
%\item MFCC parameters:
%\begin{itemize}
%\item number of cepstral coefficient: 15
%\item number of filter banks: 55
%\item maximal frequency of the filter bank: 6000
%\end{itemize}
%\item LPC Parameters:
%\begin{itemize}
%\item number of coefficient: 23
%\end{itemize}
%\end{itemize}

%and then concatenate the two feature vectors of the same frame forming
%a larger feature vector of 15 + 23 = 38 dimension.

\subsubsection{MFCC}
\label{sec:mfcc}
\textbf{Mel-Frequency Cepstral Coefficients (MFCC)} are a representation of the short-term power spectrum of a sound,
based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency \cite{mfcc}.
MFCC is the most widely used feature in Automatic Speech Recognition (ASR), and it can also be applied to the speaker recognition task.


The process to extract MFCC features is as follows:
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{img/MFCC.png}
\end{figure}

First, the input speech is divided into successive short-time frames of length $L$,
where neighboring frames overlap by $R$.
The frames are then windowed by a Hamming window, as shown in \figref{framming}.
\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{img/MFCC-windowing-frames.png}
\caption{Framing and Windowing \label{fig:framming}}
\end{figure}

Then we perform the Discrete Fourier Transform (DFT) on the windowed signals to compute their spectra.
For each of the $N$ discrete frequency bands we get a complex number $X[k]$ representing
the magnitude and phase of that frequency component in the original signal.
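The framing, windowing, and DFT steps can be sketched with NumPy as follows; the frame length and shift here are illustrative placeholders, not the report's tuned parameters:

```python
import numpy as np

def frame_spectra(signal, frame_len=512, frame_shift=256):
    """Split a signal into overlapping frames, apply a Hamming window,
    and compute each frame's magnitude spectrum via the DFT."""
    window = np.hamming(frame_len)
    starts = range(0, len(signal) - frame_len + 1, frame_shift)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    # rfft returns the frame_len//2 + 1 non-redundant complex coefficients X[k]
    return np.abs(np.fft.rfft(frames, axis=1))
```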

Human hearing is not equally sensitive to all frequency bands; in particular,
it has lower resolution at higher frequencies.
Scaling methods like the mel scale aim to warp the frequency domain to better fit human auditory perception.
The mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz, as shown below:
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{img/mel-scale.png}
\end{figure}

In MFCC, the mel scale is applied to the spectra of the signals.
The mel-scale warping is given by:
\[ M(f) = 2595 \log_{10}\left(1 + \dfrac{f}{700}\right) \]
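For illustration, the warping formula and its inverse in Python (the inverse is our own addition, useful when placing filter-bank edges uniformly on the mel scale):

```python
import math

def hz_to_mel(f):
    """Mel-scale warping: M(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse warping, for mapping mel-spaced points back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

For example, 1000 Hz maps to roughly 1000 mel, which is how the scale is calibrated.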

\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{img/bank.png}
\caption{Filter Banks (6 filters) \label{fig:bank}}
\end{figure}
Then we apply the bank of mel-scale filters to the spectrum,
calculate the logarithm of the energy under each filter, $E_i[m] = \log \left(\sum_{k=0}^{N-1}{|X_i[k]|^2 H_m[k]}\right)$, and apply the Discrete
Cosine Transform (DCT) to $E_i[m]\ (m = 1, 2, \cdots, M)$ to get an array $c_i$:
\[ c_i[n] = \sum_{m=1}^{M}{E_i[m]\cos\left(\dfrac{\pi n}{M}\left(m - \dfrac{1}{2}\right)\right)} \]

Then, the first $k$ terms of $c_i$ can be used as features for subsequent training.
The value of $k$ varies from case to case; we further discuss the choice of $k$ in \secref{result}.
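A minimal NumPy sketch of the log-energy and DCT steps, assuming a precomputed mel filter bank `H` of shape (M filters, N frequency bins); constructing the triangular filters themselves is omitted:

```python
import numpy as np

def mfcc_from_spectrum(power_spectrum, filter_bank, n_ceps):
    """Given one frame's power spectrum |X[k]|^2 and a filter bank H,
    compute log energies E[m] (m = 1..M) and apply the DCT
    c[n] = sum_m E[m] * cos(pi * n / M * (m - 1/2)),
    keeping the first n_ceps coefficients."""
    M = filter_bank.shape[0]
    E = np.log(filter_bank @ power_spectrum)
    m = np.arange(1, M + 1)                 # m = 1..M as in the formula
    n = np.arange(n_ceps)[:, None]
    return (np.cos(np.pi * n / M * (m - 0.5)) * E).sum(axis=1)
```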

\subsubsection{LPC}
\textbf{Linear Predictive Coding (LPC)} is a tool used mostly in audio signal processing and speech
processing for representing the spectral envelope of a digital speech signal
in compressed form, using the information of a linear predictive model \cite{lpc}.

The basic assumption in LPC is that,
over a short period, the $n$-th sample is a linear combination of the previous $p$ samples:
\[ \hat{x}(n) = \sum_{i=1}^{p} a_i x(n-i) \]
Therefore, to estimate the coefficients $a_i$, we minimize the squared error
$\mathrm{E}\left[ \left(\hat{x}(n) - x(n)\right)^2\right]$.
This optimization can be carried out by the Levinson-Durbin algorithm \cite{levinson-durbin}.

Therefore, we first split the input signal into frames, as is done in MFCC feature extraction (\secref{mfcc}).
Then we calculate the order-$k$ LPC coefficients for the signal in each frame.
Since the coefficients are a compressed description of the original audio signal,
they are also a good feature for speech/speaker recognition.
The choice of $k$ is likewise discussed further in \secref{result}.
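A sketch of LPC estimation via the Levinson-Durbin recursion on a frame's autocorrelation; this is our own illustrative implementation, not the project's code:

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC coefficients a_1..a_p so that x(n) ~ sum_i a_i * x(n-i),
    by running the Levinson-Durbin recursion on the autocorrelation."""
    n = len(frame)
    # Autocorrelation lags r[0..order]
    r = np.array([np.dot(frame[:n - i], frame[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)   # prediction-error polynomial A(z), a[0] = 1
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this order
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return -a[1:]  # flip sign to match the predictor convention above
```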

File renamed without changes.
File renamed without changes
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/MFCC-mel-filterbank.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/MFCC-windowing-frames.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/MFCC.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/a0.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/a1.png
File renamed without changes
File renamed without changes
File renamed without changes
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/crbm.pdf
File renamed without changes
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/gmm-compare.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/gmm.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/lpc-frame-len.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/lpc-nceps.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/ltsd.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mel-scale.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mfcc-frame-len.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mfcc-nceps.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/mfcc-nfilter.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/nmixture.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/performance.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/rbm-original.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/rbm-reconstruct.png
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/reading.pdf
File renamed without changes
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/spont.pdf
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/time-comp-small.pdf
File renamed without changes.
1 change: 1 addition & 0 deletions doc/Final-Report-Complete/img/whisper.pdf
5 changes: 5 additions & 0 deletions doc/Final-Report-Complete/implementation.tex
@@ -0,0 +1,5 @@
%File: implementation.tex
%Date: Fri Jan 03 18:37:07 2014 +0800
%Author: Yuxin Wu <ppwwyyxxc@gmail.com>

\section{Implementation}
45 changes: 45 additions & 0 deletions doc/Final-Report-Complete/intro.tex
@@ -0,0 +1,45 @@
%File: intro.tex
%Date: Fri Jan 03 17:03:58 2014 +0800
%Author: Yuxin Wu <ppwwyyxxc@gmail.com>


\section{Introduction}
\textbf{Speaker recognition} is the identification of the person who is speaking from characteristics
of their voice (voice biometrics), also called voice recognition \cite{SRwiki}.

A \textbf{speaker recognition} task can be classified with respect to different criteria:
text-dependent or text-independent; verification (deciding whether the person is who they claim to be) or
identification (deciding who the person is from their voice) \cite{SRwiki}.

Speech is a complicated signal produced as a result of several transformations occurring at
different levels: semantic, linguistic, and acoustic.
Differences in these transformations may lead to differences in the acoustic properties of the signals.
The recognizability of a speaker can be affected not only by the linguistic message
but also by the age, health, emotional state, and effort level of the speaker.
Background noise and the performance of the recording device also interfere with
the classification process.

Speaker recognition is an important part of Human-Computer Interaction (HCI).
As the trend toward wearable computing shows,
the Voice User Interface (VUI) has become a vital part of such computers.
Because these devices are particularly small, they are more likely to be lost or stolen.
In these scenarios, speaker recognition provides not only good HCI,
but also a combination of seamless interaction with the computer and a security guard
when the device is lost.
The need for personal identity validation will become more acute in the future.
Speaker verification may be essential in business telecommunications.
Telephone banking and telephone reservation services will develop rapidly
once secure means of authentication are available.

Also, the identity of a speaker is quite often at issue in court cases.
A crime victim may have heard but not seen the perpetrator,
yet claim to recognize the perpetrator as someone whose voice was previously familiar;
or there may be recordings of a criminal whose identity is unknown.
Speaker recognition techniques may provide a reliable scientific determination.

Furthermore, these techniques can be used in environments that demand high security.
They can be combined with other biometrics to form a multi-modal authentication system.

In this project, we have built a proof-of-concept text-independent speaker recognition system with
GUI support. It is fast and accurate, based on our tests on a large corpus,
and the GUI program requires only a very short utterance to respond quickly.
File renamed without changes.