Commit
Merge branch 'master' of git.net9.org:ppwwyyxx/speaker-recognition
Showing 492 changed files with 163,753 additions and 2,608 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
## Introduction | ||
|
||
This is a speaker recognition system with a GUI, developed as an SRT project for the course *Signal Processing* (fall 2013) at Tsinghua University. |
File renamed without changes.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
TARGET=report | ||
TEX=xelatex -shell-escape | ||
BIBTEX=biber | ||
READER=mupdf | ||
|
||
all: rebuild | ||
|
||
rebuild output/$(TARGET).pdf: *.tex *.bib output | ||
cd output && rm -f *.tex *.bib && ln -fs ../*.tex ../*.bib ../img . | ||
pgrep -x $(firstword $(TEX)) > /dev/null || (cd output && $(TEX) $(TARGET).tex && $(BIBTEX) $(TARGET)) #&& $(TEX) $(TARGET).tex | ||
|
||
output: | ||
mkdir output | ||
cd output && rm -f data res src && ln -s ../img . | ||
|
||
view: output/$(TARGET).pdf | ||
$(READER) output/$(TARGET).pdf & | ||
(inotifywait -mqe CLOSE_WRITE output/report.pdf | while read; do killall -SIGHUP mupdf; done) | ||
|
||
clean: | ||
rm -rf output | ||
|
||
run: view | ||
|
||
dist: output/$(TARGET).pdf | ||
rm -rf paper | ||
mkdir paper | ||
cp output/$(TARGET).pdf paper/ | ||
7z a -tzip paper.zip paper | ||
rm -rf paper | ||
|
||
.PHONY: all view clean rebuild dist |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
\section{Algorithms} | ||
In this section we present our approach to the speaker recognition problem. | ||
|
||
An utterance of a user is collected during the enrollment procedure. | ||
The utterance is then processed in the following steps: | ||
\subsection{VAD} | ||
Silent segments must first be filtered out of the signal; otherwise the | ||
training might be seriously biased. Therefore \textbf{Voice Activity Detection} (VAD) must | ||
be performed first. | ||
|
||
We observed that the provided corpus is nearly noise-free. | ||
Therefore we use a simple energy-based approach | ||
to remove the silent parts: we remove every frame whose average | ||
energy is below 0.01 times the average energy of the whole utterance. | ||
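
The energy threshold described above can be sketched in a few lines of Python. This is a minimal illustration and not the project's code: the function name `remove_silence`, the use of non-overlapping frames, and the definition of average energy as the mean squared amplitude are all assumptions made for the example.

```python
def remove_silence(samples, frame_len, threshold_ratio=0.01):
    """Keep only frames whose mean energy reaches threshold_ratio times
    the mean energy of the whole utterance (hypothetical helper)."""
    if not samples:
        return []
    # Mean energy of the whole utterance, taken here as mean squared amplitude.
    utterance_energy = sum(s * s for s in samples) / len(samples)
    kept = []
    # Non-overlapping frames are an assumption made to keep the sketch short.
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        frame_energy = sum(s * s for s in frame) / len(frame)
        if frame_energy >= threshold_ratio * utterance_energy:
            kept.extend(frame)
    return kept
```

For example, with `frame_len=4` an utterance made of four silent samples, four samples of amplitude 1.0 and four samples of amplitude 0.001 keeps only the middle frame.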
|
||
This energy-based method is found to work well on the database, but not | ||
in the GUI. | ||
In the GUI we use the LTSD (Long-Term Spectral Divergence) \cite{ltsd1}\cite{ltsd2} | ||
algorithm, together with the noise reduction technique from SOX \cite{sox}, to obtain better results. | ||
|
||
The LTSD algorithm splits an utterance into overlapping frames and gives each frame a score for | ||
the probability that it contains voice activity. These probabilities are accumulated | ||
to extract all the intervals with voice activity. The following picture illustrates the principle of LTSD: | ||
|
||
\begin{figure}[H] | ||
\centering | ||
\includegraphics[width=0.6\textwidth]{img/ltsd.png} | ||
\end{figure} | ||
|
||
\input{feature} | ||
\input{model} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
%File: feature.tex | ||
%Date: Fri Jan 03 17:40:07 2014 +0800 | ||
%Author: Yuxin Wu <ppwwyyxxc@gmail.com> | ||
|
||
\subsection{Feature Extraction} | ||
%We extract \textbf{Mel-frequency cepstral coefficients} and \textbf{Linear Predictive | ||
%Coding} features using following parameter are found to be | ||
%optimal, according to our experiments in \secref{result}: | ||
%\begin{itemize} | ||
%\item Common parameters: | ||
%\begin{itemize} | ||
%\item Frame size: 32ms | ||
%\item Frame shift: 16ms | ||
%\item Preemphasis coefficient: 0.95 | ||
%\end{itemize} | ||
%\item MFCC parameters: | ||
%\begin{itemize} | ||
%\item number of cepstral coefficient: 15 | ||
%\item number of filter banks: 55 | ||
%\item maximal frequency of the filter bank: 6000 | ||
%\end{itemize} | ||
%\item LPC Parameters: | ||
%\begin{itemize} | ||
%\item number of coefficient: 23 | ||
%\end{itemize} | ||
%\end{itemize} | ||
|
||
%and then concatenate the two feature vectors of the same frame forming | ||
%a larger feature vector of 15 + 23 = 38 dimension. | ||
|
||
\subsubsection{MFCC} | ||
\label{sec:mfcc} | ||
\textbf{Mel-Frequency Cepstral Coefficients} (MFCC) are a representation of the short-term power spectrum of a sound, | ||
based on a linear cosine transform of a log power spectrum on a nonlinear mel-scale of frequency \cite{mfcc}. | ||
MFCC are the most widely used features in Automatic Speech Recognition (ASR), and they can also be applied to the speaker recognition task. | ||
|
||
|
||
The process of extracting MFCC features is as follows: | ||
\begin{figure}[H] | ||
\centering | ||
\includegraphics[width=\textwidth]{img/MFCC.png} | ||
\end{figure} | ||
|
||
First, the input speech is divided into successive short-time frames of length $L$, | ||
with neighboring frames overlapping by $R$. | ||
The frames are then windowed with a Hamming window, as shown in \figref{framming}. | ||
\begin{figure}[H] | ||
\centering | ||
\includegraphics[width=0.7\textwidth]{img/MFCC-windowing-frames.png} | ||
\caption{Framing and Windowing \label{fig:framming}} | ||
\end{figure} | ||
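
A small Python sketch of framing and windowing follows, assuming that $L$ and $R$ are measured in samples and that trailing samples which do not fill a whole frame are dropped. The function names are hypothetical and the Hamming coefficients 0.54 and 0.46 are the conventional ones, not values taken from the project.

```python
import math

def hamming_window(length):
    """Conventional Hamming window: 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_signal(samples, frame_len, overlap):
    """Split samples into frames of frame_len samples (L) that overlap by
    `overlap` samples (R), and apply a Hamming window to each frame."""
    hop = frame_len - overlap
    if hop <= 0:
        raise ValueError("overlap must be smaller than frame_len")
    window = hamming_window(frame_len)
    frames = []
    # Trailing samples that do not fill a whole frame are dropped (assumption).
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames
```

With ten samples, `frame_len=4` and `overlap=2`, the hop is two samples and four frames are produced.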
|
||
Then, we perform the Discrete Fourier Transform (DFT) on the windowed signals to compute their spectra. | ||
For each of the $N$ discrete frequency bands we get a complex number $X[k]$ representing | ||
the magnitude and phase of that frequency component in the original signal. | ||
|
||
Human hearing is not equally sensitive to all frequency bands; in particular, | ||
it has lower resolution at higher frequencies. | ||
Scaling methods like the Mel-scale aim to scale the frequency domain to better fit human auditory perception. | ||
The Mel-scale is approximately linear below 1 kHz and logarithmic above 1 kHz, as shown below: | ||
\begin{figure}[H] | ||
\centering | ||
\includegraphics[width=0.5\textwidth]{img/mel-scale.png} | ||
\end{figure} | ||
|
||
In MFCC, the Mel-scale is applied to the spectra of the signals. | ||
The Mel-scale warping is given by: | ||
\[ M(f) = 2595 \log_{10}\left(1 + \dfrac{f}{700}\right) \] | ||
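
As a quick numerical check of the warping formula, the sketch below implements $M(f)$ and its algebraic inverse in Python. The helper `mel_filter_centers`, which spaces filter centers evenly on the Mel axis, is a hypothetical illustration of how such a warping is commonly used to place filters; it is not taken from the project.

```python
import math

def hz_to_mel(f):
    """Mel-scale warping M(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Algebraic inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(num_filters, f_max):
    """Center frequencies in Hz of num_filters filters spaced evenly on the
    Mel axis between 0 Hz and f_max (hypothetical helper)."""
    step = hz_to_mel(f_max) / (num_filters + 1)
    return [mel_to_hz(step * (i + 1)) for i in range(num_filters)]
```

Under this formula 1000 Hz maps to roughly 1000 mel, and the centers returned by `mel_filter_centers` grow further apart in Hz as frequency increases, which reflects the lower resolution at higher frequencies.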
|
||
\begin{figure}[H] | ||
\centering | ||
\includegraphics[width=0.5\textwidth]{img/bank.png} | ||
\caption{Filter Banks (6 filters) \label{fig:bank}} | ||
\end{figure} | ||
Then, we apply a bank of filters spaced according to the Mel-scale to the spectrum, | ||
calculate the logarithm of the energy under each filter by $E_i[m] = \log (\sum_{k=0}^{N-1}{|X_i[k]|^2 H_m[k]}) $ and apply the Discrete | ||
Cosine Transform (DCT) to $E_i[m]\ (m = 1, 2, \cdots, M) $ to get an array $c_i $: | ||
\[ c_i[n] = \sum_{m=1}^{M}{E_i[m]\cos\left(\dfrac{\pi n}{M}\left(m - \dfrac{1}{2}\right)\right)} \] | ||
|
||
Then, the first $k$ terms in $c_i $ can be used as features for training. | ||
The value of $k$ varies between cases; we discuss the choice of $k$ further in \secref{result}. | ||
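
The last two steps can be sketched in Python as follows. The sketch takes the filter index as $m = 1, \ldots, M$, matching the definition of $E_i[m]$, so that the cosine argument uses $m - \frac{1}{2}$. The function names and the representation of a filter as a list of weights $H_m[k]$ are assumptions made for the example.

```python
import math

def log_filterbank_energy(spectrum, filter_weights):
    """E[m] = log(sum_k |X[k]|^2 * H_m[k]) for one filter.
    spectrum holds the complex DFT values X[k]; filter_weights holds H_m[k]."""
    energy = sum(abs(x) ** 2 * h for x, h in zip(spectrum, filter_weights))
    return math.log(energy)  # raises ValueError if the energy is zero

def cepstral_coefficients(log_energies, num_coeffs):
    """First num_coeffs terms of c[n] = sum_{m=1..M} E[m] * cos(pi*n/M * (m - 1/2))."""
    M = len(log_energies)
    coeffs = []
    for n in range(num_coeffs):
        total = 0.0
        for m in range(1, M + 1):
            # log_energies is zero-based, so E[m] is stored at index m - 1.
            total += log_energies[m - 1] * math.cos(math.pi * n / M * (m - 0.5))
        coeffs.append(total)
    return coeffs
```

A useful sanity check is that constant log energies give a non-zero $c[0]$ and zero for every higher coefficient, since the cosine terms cancel.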
|
||
\subsubsection{LPC} | ||
\textbf{Linear predictive coding} is a tool used mostly in audio signal processing and speech | ||
processing for representing the spectral envelope of a | ||
digital signal of speech in compressed form, using the information of a linear predictive model.\cite{lpc} | ||
|
||
The basic assumption in LPC is that, | ||
in a short period, the $n$th sample is a linear combination of the previous $p$ samples: | ||
$ \hat{x}(n) = \sum_{i=1}^pa_i x(n-i)$. | ||
Therefore, to estimate the coefficients $ a_i$, we have to minimize the squared error | ||
$ \text{E}\left[ \left( x(n) - \hat{x}(n) \right)^2 \right]$. | ||
This optimization can be done with the Levinson-Durbin algorithm \cite{levinson-durbin}. | ||
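
The Levinson-Durbin recursion can be sketched in Python as below. It is a textbook formulation under the autocorrelation method, offered as an illustration and not as the project's implementation; the function names and the biased autocorrelation estimate are assumptions.

```python
def autocorrelation(frame, max_lag):
    """r[lag] = sum_n frame[n] * frame[n - lag] for lag = 0..max_lag."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for a_1..a_p in x_hat(n) = sum_i a_i * x(n - i) from the
    autocorrelation values r[0..order]. Returns (coefficients, error)."""
    a = [0.0] * (order + 1)  # a[0] is unused so that a[i] matches a_i
    error = r[0]             # prediction error of the order-0 predictor
    for i in range(1, order + 1):
        if error == 0.0:
            break  # the signal is already predicted exactly
        # Reflection coefficient for order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error
```

For the autocorrelation sequence `[1.0, 0.5, 0.25]`, which corresponds to a first-order process, `levinson_durbin(r, 2)` gives the coefficients 0.5 and 0.0 and a residual error of 0.75.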
|
||
Therefore, we first split the input signal into frames, as is done in MFCC feature extraction (\secref{mfcc}). | ||
Then we calculate the order-$p$ LPC coefficients for the signal in each frame. | ||
Since the coefficients are a compressed description of the original audio signal, | ||
they are also a good feature for speech/speaker recognition. | ||
The choice of $p$ is also discussed further in \secref{result}. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/MFCC-mel-filterbank.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/MFCC-windowing-frames.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/MFCC.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/a0.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/a1.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/crbm.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/gmm-compare.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/gmm.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/lpc-frame-len.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/lpc-nceps.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/ltsd.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/mel-scale.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/mfcc-frame-len.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/mfcc-nceps.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/mfcc-nfilter.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/nmixture.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/performance.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/rbm-original.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/rbm-reconstruct.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/reading.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/spont.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/time-comp-small.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../Presentation/res/whisper.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
%File: implementation.tex | ||
%Date: Fri Jan 03 18:37:07 2014 +0800 | ||
%Author: Yuxin Wu <ppwwyyxxc@gmail.com> | ||
|
||
\section{Implementation} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
%File: intro.tex | ||
%Date: Fri Jan 03 17:03:58 2014 +0800 | ||
%Author: Yuxin Wu <ppwwyyxxc@gmail.com> | ||
|
||
|
||
\section{Introduction} | ||
\textbf{Speaker recognition} is the identification of the person who is speaking by the characteristics | ||
of their voice (voice biometrics); it is also called voice recognition \cite{SRwiki}. | ||
|
||
\textbf{Speaker recognition} tasks can be classified with respect to different criteria: | ||
text-dependent or text-independent, and verification (deciding whether the person is who they claim to be) or | ||
identification (deciding who the person is by their voice) \cite{SRwiki}. | ||
|
||
Speech is a complicated signal produced as a result of several transformations occurring at | ||
different levels: semantic, linguistic and acoustic. | ||
Differences in these transformations may lead to differences in the acoustic properties of the signals. | ||
The recognizability of a speaker can be affected not only by the linguistic message | ||
but also by the age, health, emotional state and effort level of the speaker. | ||
Background noise and the performance of the recording device also interfere with | ||
the classification process. | ||
|
||
Speaker recognition is an important part of Human-Computer Interaction (HCI). | ||
As the trend toward wearable computers shows, | ||
the Voice User Interface (VUI) has become a vital part of such computers. | ||
As these devices are particularly small, they are more likely to be lost or stolen. | ||
In these scenarios, speaker recognition not only provides a good HCI, | ||
but also combines seamless interaction with the computer and a security guard | ||
when the device is lost. | ||
The need for personal identity validation will become more acute in the future. | ||
Speaker verification may be essential in business telecommunications. | ||
Telephone banking and telephone reservation services will develop rapidly | ||
once secure means of authentication are available. | ||
|
||
Also, the identity of a speaker is quite often at issue in court cases. | ||
A crime victim may have heard but not seen the perpetrator, | ||
but claim to recognize the perpetrator as someone whose voice was previously familiar; | ||
or there may be recordings of a criminal whose identity is unknown. | ||
Speaker recognition techniques may bring a reliable scientific determination. | ||
|
||
Furthermore, these techniques can be used in environments that demand high security. | ||
They can be combined with other biometrics to form a multi-modal authentication system. | ||
|
||
In this task, we have built a proof-of-concept text-independent speaker recognition system with | ||
GUI support. It is fast and accurate, based on our tests on a large corpus. | ||
The GUI program requires only a very short utterance to respond quickly. |