<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Vijayaditya Peddinti</title>
<meta name="keywords" content="">
<meta name="description" content="">
<link rel="stylesheet" href="./index_files/screen.css" type="text/css" media="screen, projection">
<link rel="stylesheet" href="./index_files/people.css" type="text/css" media="screen, projection">
<link rel="stylesheet" href="./index_files/print.css" type="text/css" media="print">
<!--[if lte IE 8]>
<link rel="stylesheet"
href="/styles/ie.css" type="text/css" media="screen, projection" />
<![endif]-->
</head>
<body>
<div id="container" class="container">
<div id="contact-info" class="span-10">
<img src="./index_files/91_person-original.jpg" width="200" alt="Vijayaditya Peddinti" class="profile-photo">
<br>
<h3><strong><span style="font-size: 16px;"><a href="http://vijaypeddinti.com#aboutme">About me</a><br></span></strong>
<strong><span style="font-size: 16px;"><a href="http://vijaypeddinti.com#academics">Academics</a><br></span></strong>
<strong><span style="font-size: 16px;"><a href="http://vijaypeddinti.com#publications">Publications</a><br></span></strong>
<strong><span style="font-size: 16px;"><a href="http://vijaypeddinti.com#experience">Experience</a><br></span></strong>
<strong><span style="font-size: 16px;"><a href="http://vijaypeddinti.com#projects">Projects</a><br></span></strong>
<strong><span style="font-size: 16px;"><a href="https://github.com/vijayaditya/vijayaditya.github.io/raw/master/resume/resume.pdf">Resume</a></span></strong>
</h3>
<br>
<br>
<address>
<span style="font-size: 14px;">
<strong><span style="text-decoration: underline;">
<span style="font-family: mceinline;"><span style="font-family: mceinline;">Contact Info:
<br><br></span></span></span></strong><span style="font-family: mceinline;">vijay [dot] p [at] jhu [dot] edu<br></span></span></address>
<p><span style="font-size: 16px;"><br></span></p>
<h1><span style="font-size: 18px;"><br></span><a href="http://www.clsp.jhu.edu/about-clsp" target="_blank"><span style="font-size: 18px;"><img title="logo" src="./index_files/clsp.gif" alt="clsp logo" width="88" height="95"></span></a></h1> </div>
<div class="span-35 prepend-top-2">
<div id="content">
<h1>Vijayaditya Peddinti</h1>
<p><a name="aboutme"></a></p>
<h3>About me</h3>
<p><span style="text-align: justify; color: #222222; font-size: 14px;">
I graduated from the PhD program of the Electrical and Computer engineering department at Johns Hopkins University.
I am currently a research scientist @ Google Speech.
<p>
Previously I worked in the</span><a style="text-align: justify; font-size: 14px;" href="http://www.clsp.jhu.edu/"> Center for Language and Speech Processing</a> on acoustic models for speech recognition, with <a href="http://www.danielpovey.com/" target="_blank">Dan Povey</a> and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a>.
I <a style="text-align: justify; font-size: 14px;" href="https://github.com/kaldi-asr/kaldi/commits?author=vijayaditya"> contribute</a> to the acoustic modelling code in Kaldi
<a href="http://kaldi-asr.org"><img style="border: 0px solid ; width: 17px; height: 20px;" alt="" src="http://kaldi-asr.org/kaldi_logo.png" /></a> project. <br>
<p>
I had previously worked with Hynek Hermansky, on distortion invariant feature design for acoustic models.
I worked in <a style="text-align: justify; font-size: 14px;" href="http://speech.iiit.ac.in/">Speech and Vision Lab</a><span style="text-align: justify; color: #222222; font-size: 14px;"> at IIIT-Hyd with </span><a style="text-align: justify; font-size: 14px;" href="https://sites.google.com/site/kishoreprahallad/">Kishore Prahallad</a><span style="text-align: justify; color: #222222; font-size: 14px;">, on efficient back-off strategies for quality speech synthesis, for my Masters (by research)</span></p>
<p><strong>Research Interests:</strong> Speech Recognition, Machine Learning</p>
<hr>
<p><a name="academics"></a></p>
<h3>Academics</h3>
<ul>
<br>
<li><strong>Johns Hopkins University, </strong>Maryland, US<br>
PhD in Electrical and Computer Engineering, 2011 - 2017;</li>
<li><strong>International Institute of Information Technology</strong>, Hyderabad, India<br>
Master of Science (by Research) in Computer Science, 2011
<br />
<span style="font-style: italic;">Thesis: </span>
<a style="text-align: justify; font-size: 14px; font-style: italic;" href="https://github.com/vijayaditya/vijayaditya.github.io/raw/master/thesis_softcopy.pdf">
Synthesis of missing units in Telugu text-to-speech system</a>
</li>
<li><strong>Dhirubhai Ambani Institute of Information and Communication Technology</strong>, Gandhinagar, India<br>Bachelor of Technology in Information and Communication Technology, 2007</li>
</ul>
<hr>
<!-- Publications -->
<p><a name="publications"></a></p>
<h3>Publications</h3>
<h3 style="text-align: start;"><br></h3>
<ul style="text-align: start;">
</ul> <div id="publications" class="prepend-top-1">
<p align="right"><a href="http://www.clsp.jhu.edu/_dataproviders/bibtex_export.php?author=91" title="Export Vijayaditya Peddinti's Publications to BibTeX"><img src="./index_files/bibtex.png" alt=""></a> <a class="expand cursor compress"></a></p>
<h3>2016</h3>
<!-- Paper 1-->
<a id="peddinti2016ami"></a>
<div id="pub-1052" class="publications">
<p>
<a class="title-link" href="http://www.danielpovey.com/files/2016_interspeech_ami.pdf" target="_blank">
Far-field ASR without parallel data</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, Vimal Manohar, Yiming Wang, <a href="http://www.danielpovey.com/" target="_blank">Daniel Povey</a> and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a><br>
<em>Submitted to Interspeech, 2016</em></p>
<p align="right"><a id="abstract-1052" class="cursor" rel="toggle" title="Far-field ASR without parallel data">[abstract]</a> <a id="bib-1052" class="cursor" rel="toggle" title="Far-field ASR without parallel data">[bib]</a></p><div id="abstract-1052" class="hide abstract" style="display: none;">
<h3>Abstract</h3>
In far-field speech recognition systems, training
acoustic models with alignments generated from parallel
close-talk microphone data provides significant
improvements. However it is not practical to assume the
availability of large corpora of parallel close-talk
microphone data, for training. In this paper we
explore methods to reduce the performance gap between
far-field ASR systems trained with alignments from
distant microphone data and those trained with
alignments from parallel close-talk microphone data.
These methods include the use of a lattice-free
sequence objective function which tolerates minor
mis-alignment errors; and the use of data selection
techniques to discard badly aligned data. We present
results on single distant microphone and multiple
distant microphone scenarios of the AMI LVCSR task. We
identify prominent causes of alignment errors in AMI
data.
</div>
<div id="bib-1052" class="hide bib align-left">
@inproceedings{peddinti2016ami,<br>
author = {Peddinti, Vijayaditya and Manohar, Vimal and Wang, Yiming and Povey, Daniel and Khudanpur, Sanjeev}, <br>
title = {Far-field ASR without parallel data}, <br>
booktitle = {Submitted to Interspeech}<br>
}</div>
</div>
<!-- Paper 2-->
<a id="povey2016"></a>
<div id="pub-1051" class="publications alt">
<p><a class="title-link" href="http://www.danielpovey.com/files/2016_interspeech_mmi.pdf" target="_blank">
Purely sequence-trained neural networks for ASR based on lattice-free MMI</a><br>
<a href="http://www.danielpovey.com/" target="_blank">Daniel Povey</a>, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, Daniel Galvez, Pegah Ghahrmani, Vimal Manohar, Yiming Wang, Xingyu Na and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a><br>
<em>Submitted to Interspeech, 2016</em></p>
<p align="right"><a id="abstract-1051" class="cursor" rel="toggle" title="Purely sequence-trained neural networks for ASR based on lattice-free MMI">[abstract]</a> <a id="bib-1051" class="cursor" rel="toggle" title="Purely sequence-trained neural networks for ASR based on lattice-free MMI">[bib]</a></p><div id="abstract-1051" class="hide abstract" style="display: none;"><h3>Abstract</h3>
In this paper we describe a method to perform sequence-
discriminative training of neural network acoustic
models without the need for frame-level cross-entropy
pre-training. We use the lattice-free version of the
maximum mutual information (MMI) criterion. To make its
computation feasible we use a phone n-gram language
model, in place of the word language model. To further
reduce its space and time complexity we compute the
objective function using neural network outputs at one
third the standard frame rate. These changes enable us
to perform the computation for the forward-backward
algorithm on GPUs. Further the reduced output
frame-rate also provides a significant speed-up during
decoding. We present results on 5 different LVCSR
tasks with training data ranging from 100 to 2100
hours. Models trained with this lattice-free MMI
criterion provide a relative word error rate reduction
of ∼ 15%, over those trained with cross-entropy
objective function, and ∼ 8%, over those trained with
cross-entropy and sMBR objective functions. A further
reduction of ∼ 2.5%, relative, can be obtained by fine
tuning these models with the word-lattice based sMBR
objective function.
</div>
<div id="bib-1051" class="hide bib align-left">
@inproceedings{povey2016,<br>
author = {Povey, Daniel and Peddinti, Vijayaditya and Galvez, Daniel and Ghahremani, Pegah and Manohar, Vimal and Wang, Yiming and Na, Xingyu and Khudanpur, Sanjeev}, <br>
title = {Purely sequence-trained neural networks for ASR based on lattice-free MMI}, <br>
booktitle = {Submitted to Interspeech}<br>
}</div>
</div>
<!-- 2015 publications -->
<h3>2015</h3>
<!-- Paper 1-->
<a id="peddinti2015reverb"></a>
<div id="pub-1049" class="publications">
<p>
<a href="http://www.dni.gov/index.php/newsroom/press-releases/210-press-releases-2015/1252-iarpa-announces-winners-of-its-aspire-challenge" target="_blank"> <strong> <font color=red> Winner of the IARPA ASpIRE challenge [press announcement] </font> </strong> </a><br><br>
<a class="title-link" href="http://www.danielpovey.com/files/2015_interspeech_aspire.pdf" target="_blank">
Reverberation robust acoustic modeling using i-vectors with time delay neural networks</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, <a href="http://www.clsp.jhu.edu/~guoguo/" target="_blank">Guoguo Chen</a>, <a href="http://www.danielpovey.com/" target="_blank">Daniel Povey</a> and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a><br>
<em>Proceedings of Interspeech, 2015</em></p>
<p align="right"><a id="abstract-1049" class="cursor" rel="toggle" title="Reverberation robust acoustic modeling using with time delay neural networks">[abstract]</a> <a id="bib-1049" class="cursor" rel="toggle" title="Reverberation robust acoustic modeling using with time delay neural networks">[bib]</a></p><div id="abstract-1049" class="hide abstract" style="display: none;">
<h3>Abstract</h3>In reverberant environments there are long term interactions between speech and corrupting sources. In this paper a time delay neural network (TDNN) architecture, capable of learning long term temporal relationships and translation invariant representations, is used for reverberation robust acoustic modeling. Further, iVectors are used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing 10% relative improvement in word error rate. By sub-sampling the outputs at TDNN layers across time steps, training time is reduced. Using a parallel training algorithm we show that the TDNN can be trained on ~ 5500 hours of speech data in 3 days using up to 32 GPUs. The TDNN is shown to provide results competitive with state of the art systems in the IARPA ASpIRE challenge, with 27.7% WER on the dev test set.</div>
<div id="bib-1049" class="hide bib align-left">
@inproceedings{peddinti2015reverb,<br>
author = {Peddinti, Vijayaditya and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev}, <br>
title = {Reverberation robust acoustic modeling using i-vectors with time delay neural networks}, <br>
booktitle = {Proceedings of Interspeech}<br>
}</div>
</div>
<!-- Paper 2-->
<a id="ko2015augmentation"></a>
<div id="pub-1050" class="publications alt">
<p><a class="title-link" href="http://www.danielpovey.com/files/2015_interspeech_augmentation.pdf" target="_blank">
Audio Augmentation for Speech Recognition</a><br>
Tom Ko, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, <a href="http://www.danielpovey.com/" target="_blank">Daniel Povey</a> and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a><br>
<em>Proceedings of Interspeech, 2015</em></p>
<p align="right"><a id="abstract-1050" class="cursor" rel="toggle" title="Audio Augmentation for Speech Recognition">[abstract]</a> <a id="bib-1050" class="cursor" rel="toggle" title="Audio Augmentation for Speech Recognition">[bib]</a></p><div id="abstract-1050" class="hide abstract" style="display: none;"><h3>Abstract</h3>Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods which directly process the raw signal. The method we particularly recommend is to change the speed of the audio signal, producing 3 versions of the original signal with speed factors of 0.9, 1.0 and 1.1. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.</div>
<div id="bib-1050" class="hide bib align-left">
@inproceedings{ko2015augmentation,<br>
author = {Tom Ko and Peddinti, Vijayaditya and Povey, Daniel and Khudanpur, Sanjeev}, <br>
title = {Audio Augmentation for Speech Recognition}, <br>
booktitle = {Proceedings of Interspeech}<br>
}</div>
</div>
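<p><em>A minimal sketch of the speed-perturbation step described in the abstract above (an illustration only, not the released Kaldi recipe; it assumes the sox command-line tool is installed):</em></p>
<pre>
# Create 0.9x / 1.0x / 1.1x speed-perturbed copies of a wav file with sox.
# sox's "speed" effect resamples the signal, so tempo and pitch change together.
import subprocess

def perturb_speed(wav_in, factors=(0.9, 1.0, 1.1)):
    outputs = []
    for factor in factors:
        wav_out = wav_in.replace(".wav", "_sp%.1f.wav" % factor)
        subprocess.check_call(["sox", wav_in, wav_out, "speed", str(factor)])
        outputs.append(wav_out)
    return outputs

# Example: perturb_speed("utt0001.wav") yields three training copies of the utterance.
</pre>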
<!-- Paper 3-->
<a id="peddinti2015multisplice"></a>
<div id="pub-1048" class="publications">
<p><a class="title-link" href="http://www.danielpovey.com/files/2015_interspeech_multisplice.pdf" target="_blank">
<strong> <font color="red"> Best paper award </font> </strong><br><br>
A time delay neural network architecture for efficient modeling of long temporal contexts</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, <a href="http://www.danielpovey.com/" target="_blank">Daniel Povey</a> and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a><br>
<em>Proceedings of Interspeech, 2015</em></p>
<p align="right"><a id="abstract-1048" class="cursor" rel="toggle" title="A time delay neural network architecture for efficient modeling of long temporal contexts">[abstract]</a> <a id="bib-1048" class="cursor" rel="toggle" title="A time delay neural network architecture for efficient modeling of long temporal contexts">[bib]</a></p><div id="abstract-1048" class="hide abstract" style="display: none;"><h3>Abstract</h3>Recurrent neural network architectures have been shown to efficiently model long term temporal dependencies between acoustic events. However the training time of recurrent networks is higher than feedforward networks due to the sequential nature of the learning algorithm. In this paper we propose a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs. The network uses sub-sampling to reduce computation during training. On the Switchboard task we show a relative improvement of 6% over the baseline DNN model. We present results on several LVCSR tasks with training data ranging from 3 to 1800 hours to show the effectiveness of the TDNN architecture in learning wider temporal dependencies in both small and large data scenarios.</div>
<div id="bib-1048" class="hide bib align-left">
@inproceedings{peddinti2015multisplice,<br>
author = {Peddinti, Vijayaditya and Povey, Daniel and Khudanpur, Sanjeev}, <br>
title = {A time delay neural network architecture for efficient modeling of long temporal contexts}, <br>
booktitle = {Proceedings of Interspeech}, <br>
publisher = {ISCA}<br>
}</div>
</div>
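<p><em>A minimal numpy sketch of the spliced temporal computation described in the abstract above (an illustration only, not the Kaldi implementation): each layer forms its output at frame t from input frames at a few fixed offsets, e.g. {t - d, t, t + d}, so stacking layers with growing offsets widens the temporal context without recurrence. In the full model, higher layers evaluate only the frames needed by the layer above (the sub-sampling mentioned in the abstract), which this sketch omits for clarity.</em></p>
<pre>
import numpy as np

def tdnn_layer(x, weight, bias, offset):
    # x: (num_frames, input_dim); weight: (3 * input_dim, output_dim); bias: (output_dim,)
    # Each output frame is an affine + ReLU function of the input frames at
    # {t - offset, t, t + offset}; larger offsets in higher layers widen the context.
    num_frames = x.shape[0]
    outputs = []
    for t in range(offset, num_frames - offset):
        spliced = np.concatenate([x[t - offset], x[t], x[t + offset]])
        outputs.append(np.maximum(spliced.dot(weight) + bias, 0.0))
    return np.stack(outputs)
</pre>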
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
<!-- 2014 publications -->
<h3>2014</h3>
<!-- Paper 4-->
<a id="peddinti2014"></a>
<div id="pub-1057" class="publications alt">
<p><a class="title-link" href="http://www.mirlab.org/conference_papers/International_Conference/ICASSP%202014/papers/p210-peddinti.pdf" target="_blank">
Deep Scattering Spectrum with deep neural networks</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, T. Sainath, S. Maymon, B. Ramabhadran, D. Nahamoo and <a href="http://www.linkedin.com/pub/vaibhava-goel/0/54a/190" target="_blank">Vaibhava Goel</a><br>
<em>Proceedings of ICASSP, 2014</em></p>
<p align="right"><a id="abstract-1057" class="cursor" rel="toggle" title="Deep Scattering Spectrum with deep neural networks">[abstract]</a> <a id="bib-1057" class="cursor" rel="toggle" title="Deep Scattering Spectrum with deep neural networks">[bib]</a></p>
<div id="abstract-1057" class="hide abstract" style="display: none;"><h3>Abstract</h3>State-of-the-art convolutional neural networks (CNNs) typically use a log-mel spectral representation of the speech signal. However, this representation is limited by the spectro-temporal resolution afforded by log-mel filter-banks. A novel technique known as Deep Scattering Spectrum (DSS) addresses this limitation and preserves higher resolution information, while ensuring time warp stability, through the cascaded application of the wavelet-modulus operator. The first order scatter is equivalent to log-mel features and standard CNN modeling techniques can directly be used with these features. However the higher order scatter, which preserves the higher resolution information, presents new challenges in modeling. This paper explores how to effectively use DSS features with CNN acoustic models. Specifically, we identify the effective normalization, neural network topology and regularization techniques to effectively model higher order scatter. The use of these higher order scatter features, in conjunction with CNNs, results in relative improvement of 7% compared to log-mel features on TIMIT, providing a phonetic error rate (PER) of 17.4%, one of the lowest reported PERs to date on this task.</div>
<div id="bib-1057" class="hide bib align-left">
@inproceedings{peddinti2014,<br>
author = {Peddinti, Vijayaditya and T. Sainath and S. Maymon and B. Ramabhadran and D. Nahamoo and Goel, Vaibhava}, <br>
title = {Deep Scattering Spectrum with deep neural networks}, <br>
booktitle = {Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on}, <br>
pages = {210-214}<br>
}</div>
</div>
<!-- Paper 5-->
<a id="schatz-peddinti-cao-bach-hermansky-dupoux:is2014c"></a>
<div id="pub-988" class="publications">
<p><a class="title-link" href="https://hal.archives-ouvertes.fr/hal-00918599/document" target="_blank">
Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise</a><br>
Thomas Schatz, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, <a href="" target="_blank">Yuan Cao</a>, Francis Bach, <a href="http://www.clsp.jhu.edu/~hynek/" target="_blank">Hynek Hermansky</a> and Emmanuel Dupoux<br>
<em>Proceedings of Interspeech, 2014</em></p>
<p align="right"><a id="bib-988" class="cursor" rel="toggle" title="Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise">[bib]</a></p>
<div id="bib-988" class="hide bib align-left">
@inproceedings{schatz-peddinti-cao-bach-hermansky-dupoux:is2014c,<br>
author = {Thomas Schatz and Peddinti, Vijayaditya and Cao, Yuan and Francis Bach and Hermansky, Hynek and Emmanuel Dupoux}, <br>
title = {Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise}, <br>
booktitle = {Proc. of INTERSPEECH}<br>
}</div>
</div>
<!-- Paper 6-->
<a id="sainath2014deep"></a>
<div id="pub-1055" class="publications alt">
<p><a class="title-link" href="http://ttic.uchicago.edu/~haotang/speech/IS140389.pdf" target="_blank">
Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks</a><br>
Tara N Sainath, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, Brian Kingsbury, Petr Fousek, Bhuvana Ramabhadran and David Nahamoo<br>
<em>Proceedings of Interspeech, 2014</em></p>
<p align="right"><a id="abstract-1055" class="cursor" rel="toggle" title="Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks">[abstract]</a> <a id="bib-1055" class="cursor" rel="toggle" title="Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks">[bib]</a></p><div id="abstract-1055" class="hide abstract" style="display: none;">
<h3>Abstract</h3>Log-mel filterbank features, which are commonly used features for CNNs, can remove higher-resolution information from the speech signal. A novel technique, known as Deep Scattering Spectrum (DSS), addresses this issue and looks to preserve this information. DSS features have shown promise on TIMIT, both for classification and recognition. In this paper, we extend the use of DSS features for LVCSR tasks. First, we explore the optimal multi-resolution time and frequency scattering operations for LVCSR tasks. Next, we explore techniques to reduce the dimension of the DSS features. We also incorporate speaker adaptation techniques into the DSS features. Results on a 50 and 430 hour English Broadcast News task show that the DSS features provide between a 4-7% relative improvement in WER over log-mel features, within a state-of-the-art CNN framework which incorporates speaker-adaptation and sequence training. Finally, we show that DSS features are similar to multi-resolution log-mel + MFCCs, and similar improvements can be obtained with this representation.</div>
<div id="bib-1055" class="hide bib align-left">
@inproceedings{sainath2014deep,<br>
author = {Tara N Sainath and Peddinti, Vijayaditya and Brian Kingsbury and Petr Fousek and Bhuvana Ramabhadran and David Nahamoo}, <br>
title = {Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks}, <br>
publisher = {ISCA}, <br>
url = {http://ttic.uchicago.edu/~haotang/speech/IS140389.pdf}<br>
}</div>
</div>
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
<!-- 2013 publications -->
<h3>2013</h3>
<!-- Paper 7-->
<a id="schatz-peddinti-bach-jansen-hermansky-dupoux:is2013"></a>
<div id="pub-998" class="publications">
<p><a class="title-link" href="https://hal.archives-ouvertes.fr/hal-00918599/document" target="_blank">
Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline</a><br>
Thomas Schatz, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, Francis Bach, <a href="http://www.clsp.jhu.edu/~ajansen/" target="_blank">Aren Jansen</a>, <a href="http://www.clsp.jhu.edu/~hynek/" target="_blank">Hynek Hermansky</a> and Emmanuel Dupoux<br>
<em>Proceedings of Interspeech, 2013</em></p>
<p align="right"><a id="bib-998" class="cursor" rel="toggle" title="Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline">[bib]</a></p>
<div id="bib-998" class="hide bib align-left">
@inproceedings{schatz-peddinti-bach-jansen-hermansky-dupoux:is2013,<br>
author = {Thomas Schatz and Peddinti, Vijayaditya and Francis Bach and Jansen, Aren and Hermansky, Hynek and Emmanuel Dupoux}, <br>
title = {Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline}, <br>
booktitle = {Proc. INTERSPEECH}<br>
}</div>
</div>
<!-- Paper 8-->
<a id="jansen-dupoux-goldwater-johnson-khudanpur-church-feldman-hermansky-metze-rose-seltzer-clark-mcgraw-varadarajan-bennett-borschinger-chiu-dunbar-fourtassi-harwath-lee-levin-norouzain-peddinti-richardson-schatz-thomas:icassp2013"></a>
<div id="pub-1005" class="publications alt">
<p><a class="title-link" href="http://repository.cmu.edu/cgi/viewcontent.cgi?article=1095&context=lti" target="_blank">
A Summary Of The 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition</a><br>
<a href="http://www.clsp.jhu.edu/~ajansen/" target="_blank">Aren Jansen</a>, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a>, <a href="http://www.clsp.jhu.edu/~kchurch/" target="_blank">Kenneth Church</a>, Naomi Feldman, <a href="http://www.clsp.jhu.edu/~hynek/" target="_blank">Hynek Hermansky</a>, Florian Metze, Richard Rose, Michael Seltzer, Pascal Clark, Ian Mcgraw, <a href="http://sites.google.com/site/balakrishnanvaradarajan/" target="_blank">Balakrishnan Varadarajan</a>, Erin Bennett, Benjamin Borschinger, Justin Chiu, Ewan Dunbar, Abdellah Fourtassi, David Harwath, Chia-Ying Lee, <a href="" target="_blank">Keith Levin</a>, Atta Norouzain, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, Rachael Richardson, Thomas Schatz and <a href="http://www.clsp.jhu.edu/~samuel/" target="_blank">Samuel Thomas</a><br>
<em>Proceedings of ICASSP, 2013</em></p>
<p align="right"><a id="bib-1005" class="cursor" rel="toggle" title="A Summary Of The 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition">[bib]</a></p>
<div id="bib-1005" class="hide bib align-left">
@inproceedings{jansen-dupoux-goldwater-johnson-khudanpur-church-feldman-hermansky-metze-rose-seltzer-clark-mcgraw-varadarajan-bennett-borschinger-chiu-dunbar-fourtassi-harwath-lee-levin-norouzain-peddinti-richardson-schatz-thomas:icassp2013,<br>
author = {Jansen, Aren and Emmanuel Dupoux and Sharon Goldwater and Mark Johnson and Khudanpur, Sanjeev and Church, Kenneth and Naomi Feldman and Hermansky, Hynek and Florian Metze and Richard Rose and Michael Seltzer and Pascal Clark and Ian Mcgraw and Varadarajan, Balakrishnan and Erin Bennett and Benjamin Borschinger and Justin Chiu and Ewan Dunbar and Abdellah Fourtassi and David Harwath and Chia-Ying Lee and Levin, Keith and Atta Norouzain and Peddinti, Vijayaditya and Rachael Richardson and Thomas Schatz and Thomas, Samuel}, <br>
title = {A Summary Of The 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition}, <br>
booktitle = {Proc. ICASSP}, <br>
address = {Vancouver, Canada}<br>
}</div>
</div>
<!-- Paper 9-->
<a id="hermansky-variani-peddinti:icassp2013"></a>
<div id="pub-1006" class="publications">
<p><a class="title-link" href="http://hltcoe.jhu.edu/uploads/publications/papers/16660_slides.pdf" target="_blank">
Mean Temporal Distance: Predicting ASR Error from Temporal Properties of Speech Signal</a><br>
<a href="http://www.clsp.jhu.edu/~hynek/" target="_blank">Hynek Hermansky</a>, <a href="http://www.clsp.jhu.edu/~variani/" target="_blank">Ehsan Variani</a> and <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a><br>
<em>Proceedings of ICASSP, 2013</em></p>
<p align="right"><a id="bib-1006" class="cursor" rel="toggle" title="Mean Temporal Distance: Predicting ASR Error from Temporal Properties of Speech Signal">[bib]</a></p>
<div id="bib-1006" class="hide bib align-left">
@inproceedings{hermansky-variani-peddinti:icassp2013,<br>
author = {Hermansky, Hynek and Variani, Ehsan and Peddinti, Vijayaditya}, <br>
title = {Mean Temporal Distance: Predicting ASR Error from Temporal Properties of Speech Signal}, <br>
booktitle = {Proc. ICASSP}, <br>
address = {Vancouver, Canada}<br>
}</div>
</div>
<!-- Paper 10-->
<a id="peddinti2013filterbank"></a>
<div id="pub-1007" class="publications alt">
<p><a class="title-link" href="http://mahe.ece.jhu.edu/uploads/publications/papers/16653_slides.pdf" target="_blank">
Filter-Bank Optimization for Frequency Domain Linear Prediction</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a> and <a href="http://www.clsp.jhu.edu/~hynek/" target="_blank">Hynek Hermansky</a><br>
<em>Proceedings of ICASSP, 2013</em></p>
<p align="right"><a id="abstract-1007" class="cursor" rel="toggle" title="Filter-Bank Optimization for Frequency Domain Linear Prediction">[abstract]</a> <a id="bib-1007" class="cursor" rel="toggle" title="Filter-Bank Optimization for Frequency Domain Linear Prediction">[bib]</a></p>
<div id="abstract-1007" class="hide abstract" style="display: none;">
<h3>Abstract</h3>
The sub-band Frequency Domain Linear Prediction (FDLP) technique estimates autoregressive models of Hilbert envelopes of subband signals, from segments of discrete cosine transform (DCT) of a speech signal, using windows. Shapes of the windows and their positions on the cosine transform of the signal determine implied filtering of the signal. Thus, the choices of shape, position and number of these windows can be critical for the performance of the FDLP technique. So far, we have used Gaussian or rectangular windows. In this paper asymmetric cochlear-like filters are being studied. Further, a frequency differentiation operation, that introduces an additional set of parameters describing local spectral slope in each frequency sub-band, is introduced to increase the robustness of sub-band envelopes in noise. The performance gains achieved by these changes are reported in a variety of additive noise conditions, with an average relative improvement of 8.04% in phoneme recognition accuracy.</div>
<div id="bib-1007" class="hide bib align-left">
@inproceedings{peddinti2013filterbank,<br>
author = {Peddinti, Vijayaditya and Hermansky, Hynek}, <br>
title = {Filter-Bank Optimization for Frequency Domain Linear Prediction}, <br>
booktitle = {Proceedings of ICASSP}, <br>
address = {Vancouver, Canada}, <br>
publisher = {IEEE}, <br>
pages = {7102 - 7106}<br>
}</div>
</div>
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
<!-- 2011 publications -->
<h3>2011</h3>
<!-- Paper 11-->
<a id="peddinti2011"></a>
<div id="pub-1056" class="publications">
<p><a class="title-link" href="http://ravi.iiit.ac.in/%7Espeech/publications/C47.pdf" target="_blank">
Significance of vowel epenthesis in Telugu text-to-speech synthesis</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a> and K. Prahallad<br>
<em>Proceedings of ICASSP, 2011</em></p>
<p align="right"><a id="abstract-1056" class="cursor" rel="toggle" title="Significance of vowel epenthesis in Telugu text-to-speech synthesis">[abstract]</a> <a id="bib-1056" class="cursor" rel="toggle" title="Significance of vowel epenthesis in Telugu text-to-speech synthesis">[bib]</a></p>
<div id="abstract-1056" class="hide abstract" style="display: none;">
<h3>Abstract</h3>
Unit selection synthesis inventories have coverage issues, which lead to missing syllable or diphone units. In the conventional back-off strategy of substituting the missing unit with approximate unit(s), the rules for approximate matching are hard to derive. In this paper we propose a back-off strategy for Telugu TTS systems emulating native speaker intuition. It uses reduced vowel insertion in complex consonant clusters to replace missing units. The inserted vowel identity is determined using a rule-set adapted from L2 (second language) acquisition research in Telugu, reducing the effort required in preparing the rule-set. Subjective evaluations show that the proposed back-off method performs better than the conventional methods.
</div>
<div id="bib-1056" class="hide bib align-left">
@inproceedings{peddinti2011,<br>
author = {Peddinti, Vijayaditya and K. Prahallad}, <br>
title = {Significance of vowel epenthesis in Telugu text-to-speech synthesis}, <br>
booktitle = {Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on}, <br>
pages = {5348-5351}<br>
}</div>
</div>
<!-- Paper 12-->
<a id="peddinti2011exploiting"></a>
<div id="pub-1059" class="publications alt">
<p><a class="title-link" href="http://ravi.iiit.ac.in/%7Espeech/publications/C48.pdf" target="_blank">
Exploiting Phone-Class Specific Landmarks for Refinement of Segment Boundaries in TTS Databases</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a> and Kishore Prahallad<br>
<em>Proceedings of Interspeech, 2011</em></p>
<p align="right"><a id="abstract-1059" class="cursor" rel="toggle" title="Exploiting Phone-Class Specific Landmarks for Refinement of Segment Boundaries in TTS Databases">[abstract]</a> <a id="bib-1059" class="cursor" rel="toggle" title="Exploiting Phone-Class Specific Landmarks for Refinement of Segment Boundaries in TTS Databases">[bib]</a></p>
<div id="abstract-1059" class="hide abstract" style="display: none;">
<h3>Abstract</h3>
High accuracy speech segmentation methods invariably depend on manually labelled data. However under-resourced languages do not have annotated speech corpora required for training these segmentors. In this paper we propose a boundary refinement technique which uses knowledge of phone-class specific subband energy events, in place of manual labels, to guide the refinement process. The use of this knowledge enables proper placement of boundaries in regions with multiple spectral discontinuities in close proximity. It also helps in the correction of large alignment errors. The proposed refinement technique provides boundaries with an accuracy of 82% within 20ms of actual boundary. Combining the proposed technique with iterative isolated HMM training technique boosts the accuracy to 89%, without the use of any manually labelled data.</div>
<div id="bib-1059" class="hide bib align-left">
@inproceedings{peddinti2011exploiting,<br>
author = {Peddinti, Vijayaditya and Kishore Prahallad}, <br>
title = {Exploiting Phone-Class Specific Landmarks for Refinement of Segment Boundaries in TTS Databases}, <br>
booktitle = {Proceedings of Interspeech 2011}<br>
}</div>
</div>
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
</div>
</div>
<h1>
<hr>
</h1>
<p><a name="experience"></a></p>
<h3>Experience</h3>
<ul>
<li><strong>Participant, JSALT-2015 Workshop </strong>Jul '15 - Aug '15
<a href="http://www.clsp.jhu.edu/workshops/15-workshop/far-field-enhancement-and-recognition-in-mismatched-settings/" > [homepage] </a>
<a href="https://www.youtube.com/watch?v=OncfrRwZPs8&#t=54m20s" > [video] </a>
</li>
<li><strong>Research Intern, Microsoft Research </strong>
<br> Mentor: <a href="http://research.microsoft.com/en-us/people/mseltzer/" target="_blank">Mike Seltzer</a> Sept '14 - Dec '14</li>
<li><strong>Research Intern, IBM T.J. Watson Research Center</strong>
<br> Mentor: <a href="https://sites.google.com/site/tsainath/" target="_blank">Tara Sainath</a> May '13 - Aug '13</li>
<li><strong>Participant, JSALT-2014 Workshop </strong>July '14 - August '14
<a href="http://www.clsp.jhu.edu/workshops/14-workshop/asr-machines-that-know-when-they-do-not-know/" > [homepage] </a>
<a href="https://vijayaditya.github.io/other_files/jsalt_14.pdf" > [pdf] </a>
<a href="https://www.youtube.com/watch?v=RAgQe0EsPwA#t=15m38s" > [video] </a>
</li>
<li><strong>Participant, Zero resource workshop </strong>July '12
<a href="http://www.clsp.jhu.edu/workshops/12-workshop/zero-resource-speech-technologies-and-models-of-early-language-acquisition/" > [homepage] </a>
<a href="http://www.clsp.jhu.edu/workshops/12-workshop/zero-resource-speech-technologies-and-models-of-early-language-acquisition/combined-final-presentation">[pdf]</a>
<a href="http://webcast.jhu.edu/Mediasite/Play/12a4b40e7b524bec93d273ca66d453e11d">[video1]</a>
<a href="http://webcast.jhu.edu/mediasite/Viewer/?peid=e864d9ab0270404f9b09ec664cee1ebf1d">[video2]</a> </li>
<li><strong>Research Assistant, JHU</strong>, September '11 - </li>
<li><strong>Teaching Assistant, JHU</strong><br>
Course: Processing of Audio and Visual Signals (Instructor: Prof. Hynek Hermansky)<br>
Course: Speech and Audio processing by humans and machines (Instructor: Prof. Hynek Hermansky)</li>
<li><strong>Analytics Intern, I-Labs, 24/7 Customer, Bangalore</strong>, Jan '11 - Jul '11<br>Part of the text and data mining team. Developed a prototype for event detection in Twitter, Facebook, forum and customer-care chat data.</li>
<li><strong>Research Assistant, IIIT-Hyderabad, India</strong>, Dec '08 - Dec '10<br>Worked on the <em>Indian Language TTS (Ministry of Commn. and Info. Tech., India)</em> and <em>Indian Language Data Collection (LDC-IL) </em>projects</li>
<li><strong>Technical Associate, TechMahindra Ltd., </strong>Jul '07 - Jul '08</li>
</ul>
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
<hr>
<p><a name="projects"></a></p>
<h3>Projects</h3>
<ul>
<li><strong>Robust Automatic Transcription of Speech (RATS):</strong><br> DARPA project </li>
<li><strong>Indian Language TTS,<br></strong><em>Funded by Ministry of Commn. &amp; Info. Tech., India (MCIT)</em><br>Involved in the development of a text-to-speech (TTS) synthesizer for Telugu as part of the Indian Language TTS Consortium. Developed an algorithm for automatic segmentation of audio databases (published in Interspeech, 2011), designed a back-off strategy for missing units (published in ICASSP, 2011), and implemented a syllable-based synthesizer in the Festival framework.</li>
<li><strong>Indian Language Data Collection</strong><br><em>Funded by Linguistic Data Consortium for Indian Languages (LDC-IL)<br></em>Worked on automatic generation of phonetic alignments for Telugu audio data with erroneous transcripts, as part of the Indian Language Data Collection project, which collected 500 hours of speech data each in Telugu, Kannada and English.</li>
<li><strong>Temporal Event Detection in Social Media Streams</strong><br>At I-Labs, 24/7 Customer<br>Developed an algorithm for event detection in volume time series built from multiple data streams such as microblogs (e.g. Twitter), social networks (e.g. Facebook) and chats (from customer service centers).</li>
</ul>
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
<hr>
<div id="footer">
<p>© Vijayaditya Peddinti, 2015. All rights reserved.</p>
</div>
</div>
<div class="clear"></div>
</div>
<!-- Java Script -->
<script type="text/javascript" src="./index_files/jquery.min.js"></script>
<script type="text/javascript">
// Shade every other publication entry.
$('.publications:odd').addClass('alt');
// The header-level link toggles all abstracts at once. It carries both the
// "expand" and "compress" classes, so bind the handler only once; binding it
// separately for each class would fire the toggle twice per click and undo itself.
$('a.expand').bind('click', function () {
$('div.abstract').toggle('blind');
$(this).toggleClass('compress');
});
// Per-paper [abstract] and [bib] links show/hide the div that shares their id.
$("a[rel=toggle]").bind('click', function () {
var id = $(this).attr('id');
$('div#' + id).toggle('blind');
});
</script>
</body></html>