<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Vijayaditya Peddinti</title>
<meta name="keywords" content="">
<meta name="description" content="">
<link rel="stylesheet" href="./index_files/screen.css" type="text/css" media="screen, projection">
<link rel="stylesheet" href="./index_files/people.css" type="text/css" media="screen, projection">
<link rel="stylesheet" href="./index_files/print.css" type="text/css" media="print">
<!--[if lte IE 8]>
<link rel="stylesheet"
href="/styles/ie.css" type="text/css" media="screen, projection" />
<![endif]-->
</head>
<body>
<div id="container" class="container">
<div id="contact-info" class="span-10">
<img src="./index_files/91_person-original.jpg" width="200" alt="Vijayaditya Peddinti" class="profile-photo">
<br>
<h3><strong><span style="font-size: 16px;"><a href="http://vijaypeddinti.com#aboutme">About me</a><br></span></strong>
<strong><span style="font-size: 16px;"><a href="http://vijaypeddinti.com#academics">Academics</a><br></span></strong>
<strong><span style="font-size: 16px;"><a href="http://vijaypeddinti.com#publications">Publications</a><br></span></strong>
<strong><span style="font-size: 16px;"><a href="http://vijaypeddinti.com#experience">Experience</a><br></span></strong>
<strong><span style="font-size: 16px;"><a href="http://vijaypeddinti.com#projects">Projects</a><br></span></strong>
<strong><span style="font-size: 16px;"><a href="https://github.com/vijayaditya/vijayaditya.github.io/raw/master/resume/resume.pdf">Resume</a></span></strong>
</h3>
<br>
<br>
<address>
<span style="font-size: 14px;">
<strong><span style="text-decoration: underline;">
<span style="font-family: mceinline;"><span style="font-family: mceinline;">Contact Info:
<br><br></span></span></span></strong><span style="font-family: mceinline;">vijay [dot] p [at] jhu [dot] edu<br></span></span></address>
<p><span style="font-size: 16px;"><br></span></p>
<h1><span style="font-size: 18px;"><br></span><a href="http://www.clsp.jhu.edu/about-clsp" target="_blank"><span style="font-size: 18px;"><img title="logo" src="./index_files/clsp.gif" alt="clsp logo" width="88" height="95"></span></a></h1> </div>
<div class="span-35 prepend-top-2">
<div id="content">
<h1>Vijayaditya Peddinti</h1>
<p><a name="aboutme"></a></p>
<h3>About me</h3>
<p><span style="text-align: justify; color: #222222; font-size: 14px;">
I graduated from the PhD program of the Electrical and Computer engineering department at Johns Hopkins University.
I am currently a research scientist @ Google Speech.
<p>
Previously I worked in the</span><a style="text-align: justify; font-size: 14px;" href="http://www.clsp.jhu.edu/"> Center for Language and Speech Processing</a> on acoustic models for speech recognition, with <a href="http://www.danielpovey.com/" target="_blank">Dan Povey</a> and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a>.
I <a style="text-align: justify; font-size: 14px;" href="https://github.com/kaldi-asr/kaldi/commits?author=vijayaditya"> contribute</a> to the acoustic modelling code in Kaldi
<a href="http://kaldi-asr.org"><img style="border: 0px solid ; width: 17px; height: 20px;" alt="" src="http://kaldi-asr.org/kaldi_logo.png" /></a> project. <br>
<p>
I had previously worked with Hynek Hermansky, on distortion invariant feature design for acoustic models.
I worked in <a style="text-align: justify; font-size: 14px;" href="http://speech.iiit.ac.in/">Speech and Vision Lab</a><span style="text-align: justify; color: #222222; font-size: 14px;"> at IIIT-Hyd with </span><a style="text-align: justify; font-size: 14px;" href="https://sites.google.com/site/kishoreprahallad/">Kishore Prahallad</a><span style="text-align: justify; color: #222222; font-size: 14px;">, on efficient back-off strategies for quality speech synthesis, for my Masters (by research)</span></p>
<p><strong>Research Interests:</strong> Speech Recognition, Machine Learning</p>
<hr>
<p><a name="academics"></a></p>
<h3>Academics</h3>
<ul>
<br>
<li><strong>Johns Hopkins University, </strong>Maryland, US<br>
PhD in Electrical and Computer Engineering, 2011 - 2017;</li>
<li><strong>International Institute of Information Technology</strong>, Hyderabad, India<br>
Master of Science (by Research) in Computer Science, 2011
<br />
<span style="font-style: italic;">Thesis: </span>
<a style="text-align: justify; font-size: 14px; font-style: italic;" href="https://github.com/vijayaditya/vijayaditya.github.io/raw/master/thesis_softcopy.pdf">
Synthesis of missing units in Telugu text-to-speech system</a>
</li>
<li><strong>Dhirubhai Ambani Institute of Information and Communication Technology</strong>, Gandhinagar, India<br>Bachelor of Technology in Information and Communication Technology, 2007</li>
</ul>
<hr>
<!-- Publications -->
<p><a name="publications"></a></p>
<h3>Publications</h3>
<h3 style="text-align: start;"><br></h3>
<ul style="text-align: start;">
</ul> <div id="publications" class="prepend-top-1">
<p align="right"><a href="http://www.clsp.jhu.edu/_dataproviders/bibtex_export.php?author=91" title="Export Vijayaditya Peddinti's Publications to BibTeX"><img src="./index_files/bibtex.png" alt=""></a> <a class="expand cursor compress"></a></p>
<h3>2016</h3>
<!-- Paper 1-->
<a id="peddinti2016ami"></a>
<div id="pub-1052" class="publications">
<p>
<a class="title-link" href="http://www.danielpovey.com/files/2016_interspeech_ami.pdf" target="_blank">
Far-field ASR without parallel data</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, Vimal Manohar, Yiming Wang, <a href="http://www.danielpovey.com/" target="_blank">Daniel Povey</a> and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a><br>
<em>Submitted to Interspeech, 2016</em></p>
<p align="right"><a id="abstract-1052" class="cursor" rel="toggle" title="Far-field ASR without parallel data">[abstract]</a> <a id="bib-1052" class="cursor" rel="toggle" title="Far-field ASR without parallel data">[bib]</a></p><div id="abstract-1052" class="hide abstract" style="display: none;">
<h3>Abstract</h3>
In far-field speech recognition systems, training
acoustic models with alignments generated from parallel
close-talk microphone data provides significant
improvements. However it is not practical to assume the
availability of large corpora of parallel close-talk
microphone data, for training. In this paper we
explore methods to reduce the performance gap between
far-field ASR systems trained with alignments from
distant microphone data and those trained with
alignments from parallel close-talk microphone data.
These methods include the use of a lattice-free
sequence objective function which tolerates minor
mis-alignment errors; and the use of data selection
techniques to discard badly aligned data. We present
results on single distant microphone and multiple
distant microphone scenarios of the AMI LVCSR task. We
identify prominent causes of alignment errors in AMI
data.
</div>
<div id="bib-1052" class="hide bib align-left">
@inproceedings{peddinti2016ami,<br>
author = {Peddinti, Vijayaditya and Manohar, Vimal and Wang, Yiming and Povey, Daniel and Khudanpur, Sanjeev}, <br>
title = {Far-field ASR without parallel data}, <br>
booktitle = {Submitted to Interspeech}<br>
}</div>
</div>
<!-- Paper 2-->
<a id="povey2016"></a>
<div id="pub-1051" class="publications alt">
<p><a class="title-link" href="http://www.danielpovey.com/files/2016_interspeech_mmi.pdf" target="_blank">
Purely sequence-trained neural networks for ASR based on lattice-free MMI</a><br>
<a href="http://www.danielpovey.com/" target="_blank">Daniel Povey</a>, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, Daniel Galvez, Pegah Ghahrmani, Vimal Manohar, Yiming Wang, Xingyu Na and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a><br>
<em>Submitted to Interspeech, 2016</em></p>
<p align="right"><a id="abstract-1051" class="cursor" rel="toggle" title="Purely sequence-trained neural networks for ASR based on lattice-free MMI">[abstract]</a> <a id="bib-1051" class="cursor" rel="toggle" title="Purely sequence-trained neural networks for ASR based on lattice-free MMI">[bib]</a></p><div id="abstract-1051" class="hide abstract" style="display: none;"><h3>Abstract</h3>
In this paper we describe a method to perform sequence-
discriminative training of neural network acoustic
models without the need for frame-level cross-entropy
pre-training. We use the lattice-free version of the
maximum mutual information (MMI) criterion. To make its
computation feasible we use a phone n-gram language
model, in place of the word language model. To further
reduce its space and time complexity we compute the
objective function using neural network outputs at one
third the standard frame rate. These changes enable us
to perform the computation for the forward-backward
algorithm on GPUs. Further the reduced output
frame-rate also provides a significant speed-up during
decoding. We present results on 5 different LVCSR
tasks with training data ranging from 100 to 2100
hours. Models trained with this lattice-free MMI
criterion provide a relative word error rate reduction
of ∼ 15%, over those trained with cross-entropy
objective function, and ∼ 8%, over those trained with
cross-entropy and sMBR objective functions. A further
reduction of ∼ 2.5%, relative, can be obtained by fine
tuning these models with the word-lattice based sMBR
objective function.
</div>
<div id="bib-1051" class="hide bib align-left">
@inproceedings{povey2016,<br>
author = {Povey, Daniel and Peddinti, Vijayaditya and Galvez, Daniel and Ghahremani, Pegah and Manohar, Vimal and Wang, Yiming and Na, Xingyu and Khudanpur, Sanjeev}, <br>
title = {Purely sequence-trained neural networks for ASR based on lattice-free MMI}, <br>
booktitle = {Submitted to Interspeech}<br>
}</div>
</div>
<!-- 2015 publications -->
<h3>2015</h3>
<!-- Paper 1-->
<a id="peddinti2015reverb"></a>
<div id="pub-1049" class="publications">
<p>
<a href="http://www.dni.gov/index.php/newsroom/press-releases/210-press-releases-2015/1252-iarpa-announces-winners-of-its-aspire-challenge" target="_blank"> <strong> <font color=red> Winner of the IARPA ASpIRE challenge [press announcement] </font> </strong> </a><br><br>
<a class="title-link" href="http://www.danielpovey.com/files/2015_interspeech_aspire.pdf" target="_blank">
Reverberation robust acoustic modeling using i-vectors with time delay neural networks</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, <a href="http://www.clsp.jhu.edu/~guoguo/" target="_blank">Guoguo Chen</a>, <a href="http://www.danielpovey.com/" target="_blank">Daniel Povey</a> and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a><br>
<em>Proceedings of Interspeech, 2015</em></p>
<p align="right"><a id="abstract-1049" class="cursor" rel="toggle" title="Reverberation robust acoustic modeling using with time delay neural networks">[abstract]</a> <a id="bib-1049" class="cursor" rel="toggle" title="Reverberation robust acoustic modeling using with time delay neural networks">[bib]</a></p><div id="abstract-1049" class="hide abstract" style="display: none;">
<h3>Abstract</h3>In reverberant environments there are long term interactions between speech and corrupting sources. In this paper a time delay neural network (TDNN) architecture, capable of learning long term temporal relationships and translation invariant representations, is used for reverberation robust acoustic modeling. Further, iVectors are used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing 10% relative improvement in word error rate. By sub-sampling the outputs at TDNN layers across time steps, training time is reduced. Using a parallel training algorithm we show that the TDNN can be trained on ~ 5500 hours of speech data in 3 days using up to 32 GPUs. The TDNN is shown to provide results competitive with state of the art systems in the IARPA ASpIRE challenge, with 27.7% WER on the dev test set.</div>
<div id="bib-1049" class="hide bib align-left">
@inproceedings{peddinti2015reverb,<br>
author = {Peddinti, Vijayaditya and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev}, <br>
title = {Reverberation robust acoustic modeling using i-vectors with time delay neural networks}, <br>
booktitle = {Proceedings of Interspeech}<br>
}</div>
</div>
<!-- Paper 2-->
<a id="ko2015augmentation"></a>
<div id="pub-1050" class="publications alt">
<p><a class="title-link" href="http://www.danielpovey.com/files/2015_interspeech_augmentation.pdf" target="_blank">
Audio Augmentation for Speech Recognition</a><br>
Tom Ko, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, <a href="http://www.danielpovey.com/" target="_blank">Daniel Povey</a> and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a><br>
<em>Proceedings of Interspeech, 2015</em></p>
<p align="right"><a id="abstract-1050" class="cursor" rel="toggle" title="Audio Augmentation for Speech Recognition">[abstract]</a> <a id="bib-1050" class="cursor" rel="toggle" title="Audio Augmentation for Speech Recognition">[bib]</a></p><div id="abstract-1050" class="hide abstract" style="display: none;"><h3>Abstract</h3>Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods which directly process the raw signal. The method we particularly recommend is to change the speed of the audio signal, producing 3 versions of the original signal with speed factors of 0.9, 1.0 and 1.1. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.</div>
<div id="bib-1050" class="hide bib align-left">
@inproceedings{ko2015augmentation,<br>
author = {Tom Ko and Peddinti, Vijayaditya and Povey, Daniel and Khudanpur, Sanjeev}, <br>
title = {Audio Augmentation for Speech Recognition}, <br>
booktitle = {Proceedings of Interspeech}<br>
}</div>
</div>
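<p><em>A minimal sketch of the speed-perturbation step described in the abstract above (an illustration only, not the released Kaldi recipe; it assumes the sox command-line tool is installed):</em></p>
<pre>
# Create 0.9x / 1.0x / 1.1x speed-perturbed copies of a wav file with sox.
# sox's "speed" effect resamples the signal, so tempo and pitch change together.
import subprocess

def perturb_speed(wav_in, factors=(0.9, 1.0, 1.1)):
    outputs = []
    for factor in factors:
        wav_out = wav_in.replace(".wav", "_sp%.1f.wav" % factor)
        subprocess.check_call(["sox", wav_in, wav_out, "speed", str(factor)])
        outputs.append(wav_out)
    return outputs

# Example: perturb_speed("utt0001.wav") yields three training copies of the utterance.
</pre>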
<!-- Paper 3-->
<a id="peddinti2015multisplice"></a>
<div id="pub-1048" class="publications">
<p><a class="title-link" href="http://www.danielpovey.com/files/2015_interspeech_multisplice.pdf" target="_blank">
<strong> <font color="red"> Best paper award </font> </strong><br><br>
A time delay neural network architecture for efficient modeling of long temporal contexts</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, <a href="http://www.danielpovey.com/" target="_blank">Daniel Povey</a> and <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a><br>
<em>Proceedings of Interspeech, 2015</em></p>
<p align="right"><a id="abstract-1048" class="cursor" rel="toggle" title="A time delay neural network architecture for efficient modeling of long temporal contexts">[abstract]</a> <a id="bib-1048" class="cursor" rel="toggle" title="A time delay neural network architecture for efficient modeling of long temporal contexts">[bib]</a></p><div id="abstract-1048" class="hide abstract" style="display: none;"><h3>Abstract</h3>Recurrent neural network architectures have been shown to efficiently model long term temporal dependencies between acoustic events. However the training time of recurrent networks is higher than feedforward networks due to the sequential nature of the learning algorithm. In this paper we propose a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs. The network uses sub-sampling to reduce computation during training. On the Switchboard task we show a relative improvement of 6% over the baseline DNN model. We present results on several LVCSR tasks with training data ranging from 3 to 1800 hours to show the effectiveness of the TDNN architecture in learning wider temporal dependencies in both small and large data scenarios.</div>
<div id="bib-1048" class="hide bib align-left">
@inproceedings{peddinti2015multisplice,<br>
author = {Peddinti, Vijayaditya and Povey, Daniel and Khudanpur, Sanjeev}, <br>
title = {A time delay neural network architecture for efficient modeling of long temporal contexts}, <br>
booktitle = {Proceedings of Interspeech}, <br>
publisher = {ISCA}<br>
}</div>
</div>
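<p><em>A minimal numpy sketch of the spliced temporal computation described in the abstract above (an illustration only, not the Kaldi implementation): each layer forms its output at frame t from input frames at a few fixed offsets, e.g. {t - d, t, t + d}, so stacking layers with growing offsets widens the temporal context without recurrence. In the full model, higher layers evaluate only the frames needed by the layer above (the sub-sampling mentioned in the abstract), which this sketch omits for clarity.</em></p>
<pre>
import numpy as np

def tdnn_layer(x, weight, bias, offset):
    # x: (num_frames, input_dim); weight: (3 * input_dim, output_dim); bias: (output_dim,)
    # Each output frame is an affine + ReLU function of the input frames at
    # {t - offset, t, t + offset}; larger offsets in higher layers widen the context.
    num_frames = x.shape[0]
    outputs = []
    for t in range(offset, num_frames - offset):
        spliced = np.concatenate([x[t - offset], x[t], x[t + offset]])
        outputs.append(np.maximum(spliced.dot(weight) + bias, 0.0))
    return np.stack(outputs)
</pre>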
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
<!-- 2014 publications -->
<h3>2014</h3>
<!-- Paper 4-->
<a id="peddinti2014"></a>
<div id="pub-1057" class="publications alt">
<p><a class="title-link" href="http://www.mirlab.org/conference_papers/International_Conference/ICASSP%202014/papers/p210-peddinti.pdf" target="_blank">
Deep Scattering Spectrum with deep neural networks</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, T. Sainath, S. Maymon, B. Ramabhadran, D. Nahamoo and <a href="http://www.linkedin.com/pub/vaibhava-goel/0/54a/190" target="_blank">Vaibhava Goel</a><br>
<em>Proceedings of ICASSP, 2014</em></p>
<p align="right"><a id="abstract-1057" class="cursor" rel="toggle" title="Deep Scattering Spectrum with deep neural networks">[abstract]</a> <a id="bib-1057" class="cursor" rel="toggle" title="Deep Scattering Spectrum with deep neural networks">[bib]</a></p>
<div id="abstract-1057" class="hide abstract" style="display: none;"><h3>Abstract</h3>State-of-the-art convolutional neural networks (CNNs) typically use a log-mel spectral representation of the speech signal. However, this representation is limited by the spectro-temporal resolution afforded by log-mel filter-banks. A novel technique known as Deep Scattering Spectrum (DSS) addresses this limitation and preserves higher resolution information, while ensuring time warp stability, through the cascaded application of the wavelet-modulus operator. The first order scatter is equivalent to log-mel features and standard CNN modeling techniques can directly be used with these features. However the higher order scatter, which preserves the higher resolution information, presents new challenges in modeling. This paper explores how to effectively use DSS features with CNN acoustic models. Specifically, we identify the effective normalization, neural network topology and regularization techniques to effectively model higher order scatter. The use of these higher order scatter features, in conjunction with CNNs, results in relative improvement of 7% compared to log-mel features on TIMIT, providing a phonetic error rate (PER) of 17.4%, one of the lowest reported PERs to date on this task.</div>
<div id="bib-1057" class="hide bib align-left">
@inproceedings{peddinti2014,<br>
author = {Peddinti, Vijayaditya and T. Sainath and S. Maymon and B. Ramabhadran and D. Nahamoo and Goel, Vaibhava}, <br>
title = {Deep Scattering Spectrum with deep neural networks}, <br>
booktitle = {Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on}, <br>
pages = {210-214}<br>
}</div>
</div>
<!-- Paper 5-->
<a id="schatz-peddinti-cao-bach-hermansky-dupoux:is2014c"></a>
<div id="pub-988" class="publications">
<p><a class="title-link" href="https://hal.archives-ouvertes.fr/hal-00918599/document" target="_blank">
Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise</a><br>
Thomas Schatz, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, <a href="" target="_blank">Yuan Cao</a>, Francis Bach, <a href="http://www.clsp.jhu.edu/~hynek/" target="_blank">Hynek Hermansky</a> and Emmanuel Dupoux<br>
<em>Proceedings of Interspeech, 2014</em></p>
<p align="right"><a id="bib-988" class="cursor" rel="toggle" title="Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise">[bib]</a></p>
<div id="bib-988" class="hide bib align-left">
@inproceedings{schatz-peddinti-cao-bach-hermansky-dupoux:is2014c,<br>
author = {Thomas Schatz and Peddinti, Vijayaditya and Cao, Yuan and Francis Bach and Hermansky, Hynek and Emmanuel Dupoux}, <br>
title = {Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise}, <br>
booktitle = {Proc. of INTERSPEECH}<br>
}</div>
</div>
<!-- Paper 6-->
<a id="sainath2014deep"></a>
<div id="pub-1055" class="publications alt">
<p><a class="title-link" href="http://ttic.uchicago.edu/~haotang/speech/IS140389.pdf" target="_blank">
Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks</a><br>
Tara N Sainath, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, Brian Kingsbury, Petr Fousek, Bhuvana Ramabhadran and David Nahamoo<br>
<em>Proceedings of Interspeech, 2014</em></p>
<p align="right"><a id="abstract-1055" class="cursor" rel="toggle" title="Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks">[abstract]</a> <a id="bib-1055" class="cursor" rel="toggle" title="Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks">[bib]</a></p><div id="abstract-1055" class="hide abstract" style="display: none;">
<h3>Abstract</h3>Log-mel filterbank features, which are commonly used features for CNNs, can remove higher-resolution information from the speech signal. A novel technique, known as Deep Scattering Spectrum (DSS), addresses this issue and looks to preserve this information. DSS features have shown promise on TIMIT, both for classification and recognition. In this paper, we extend the use of DSS features for LVCSR tasks. First, we explore the optimal multi-resolution time and frequency scattering operations for LVCSR tasks. Next, we explore techniques to reduce the dimension of the DSS features. We also incorporate speaker adaptation techniques into the DSS features. Results on a 50 and 430 hour English Broadcast News task show that the DSS features provide between a 4-7% relative improvement in WER over log-mel features, within a state-of-the-art CNN framework which incorporates speaker-adaptation and sequence training. Finally, we show that DSS features are similar to multi-resolution log-mel + MFCCs, and similar improvements can be obtained with this representation.</div>
<div id="bib-1055" class="hide bib align-left">
@inproceedings{sainath2014deep,<br>
author = {Tara N Sainath and Peddinti, Vijayaditya and Brian Kingsbury and Petr Fousek and Bhuvana Ramabhadran and David Nahamoo}, <br>
title = {Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks}, <br>
publisher = {ISCA}, <br>
url = {http://ttic.uchicago.edu/~haotang/speech/IS140389.pdf}<br>
}</div>
</div>
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
<!-- 2013 publications -->
<h3>2013</h3>
<!-- Paper 7-->
<a id="schatz-peddinti-bach-jansen-hermansky-dupoux:is2013"></a>
<div id="pub-998" class="publications">
<p><a class="title-link" href="https://hal.archives-ouvertes.fr/hal-00918599/document" target="_blank">
Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline</a><br>
Thomas Schatz, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, Francis Bach, <a href="http://www.clsp.jhu.edu/~ajansen/" target="_blank">Aren Jansen</a>, <a href="http://www.clsp.jhu.edu/~hynek/" target="_blank">Hynek Hermansky</a> and Emmanuel Dupoux<br>
<em>Proceedings of Interspeech, 2013</em></p>
<p align="right"><a id="bib-998" class="cursor" rel="toggle" title="Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline">[bib]</a></p>
<div id="bib-998" class="hide bib align-left">
@inproceedings{schatz-peddinti-bach-jansen-hermansky-dupoux:is2013,<br>
author = {Thomas Schatz and Peddinti, Vijayaditya and Francis Bach and Jansen, Aren and Hermansky, Hynek and Emmanuel Dupoux}, <br>
title = {Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline}, <br>
booktitle = {Proc. INTERSPEECH}<br>
}</div>
</div>
<!-- Paper 8-->
<a id="jansen-dupoux-goldwater-johnson-khudanpur-church-feldman-hermansky-metze-rose-seltzer-clark-mcgraw-varadarajan-bennett-borschinger-chiu-dunbar-fourtassi-harwath-lee-levin-norouzain-peddinti-richardson-schatz-thomas:icassp2013"></a>
<div id="pub-1005" class="publications alt">
<p><a class="title-link" href="http://repository.cmu.edu/cgi/viewcontent.cgi?article=1095&context=lti" target="_blank">
A Summary Of The 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition</a><br>
<a href="http://www.clsp.jhu.edu/~ajansen/" target="_blank">Aren Jansen</a>, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, <a href="http://www.clsp.jhu.edu/~sanjeev/" target="_blank">Sanjeev Khudanpur</a>, <a href="http://www.clsp.jhu.edu/~kchurch/" target="_blank">Kenneth Church</a>, Naomi Feldman, <a href="http://www.clsp.jhu.edu/~hynek/" target="_blank">Hynek Hermansky</a>, Florian Metze, Richard Rose, Michael Seltzer, Pascal Clark, Ian Mcgraw, <a href="http://sites.google.com/site/balakrishnanvaradarajan/" target="_blank">Balakrishnan Varadarajan</a>, Erin Bennett, Benjamin Borschinger, Justin Chiu, Ewan Dunbar, Abdellah Fourtassi, David Harwath, Chia-Ying Lee, <a href="" target="_blank">Keith Levin</a>, Atta Norouzain, <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a>, Rachael Richardson, Thomas Schatz and <a href="http://www.clsp.jhu.edu/~samuel/" target="_blank">Samuel Thomas</a><br>
<em>Proceedings of ICASSP, 2013</em></p>
<p align="right"><a id="bib-1005" class="cursor" rel="toggle" title="A Summary Of The 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition">[bib]</a></p>
<div id="bib-1005" class="hide bib align-left">
@inproceedings{jansen-dupoux-goldwater-johnson-khudanpur-church-feldman-hermansky-metze-rose-seltzer-clark-mcgraw-varadarajan-bennett-borschinger-chiu-dunbar-fourtassi-harwath-lee-levin-norouzain-peddinti-richardson-schatz-thomas:icassp2013,<br>
author = {Jansen, Aren and Emmanuel Dupoux and Sharon Goldwater and Mark Johnson and Khudanpur, Sanjeev and Church, Kenneth and Naomi Feldman and Hermansky, Hynek and Florian Metze and Richard Rose and Michael Seltzer and Pascal Clark and Ian Mcgraw and Varadarajan, Balakrishnan and Erin Bennett and Benjamin Borschinger and Justin Chiu and Ewan Dunbar and Abdellah Fourtassi and David Harwath and Chia-Ying Lee and Levin, Keith and Atta Norouzain and Peddinti, Vijayaditya and Rachael Richardson and Thomas Schatz and Thomas, Samuel}, <br>
title = {A Summary Of The 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition}, <br>
booktitle = {Proc. ICASSP}, <br>
address = {Vancouver, Canada}<br>
}</div>
</div>
<!-- Paper 9-->
<a id="hermansky-variani-peddinti:icassp2013"></a>
<div id="pub-1006" class="publications">
<p><a class="title-link" href="http://hltcoe.jhu.edu/uploads/publications/papers/16660_slides.pdf" target="_blank">
Mean Temporal Distance: Predicting ASR Error from Temporal Properties of Speech Signal</a><br>
<a href="http://www.clsp.jhu.edu/~hynek/" target="_blank">Hynek Hermansky</a>, <a href="http://www.clsp.jhu.edu/~variani/" target="_blank">Ehsan Variani</a> and <a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a><br>
<em>Proceedings of ICASSP, 2013</em></p>
<p align="right"><a id="bib-1006" class="cursor" rel="toggle" title="Mean Temporal Distance: Predicting ASR Error from Temporal Properties of Speech Signal">[bib]</a></p>
<div id="bib-1006" class="hide bib align-left">
@inproceedings{hermansky-variani-peddinti:icassp2013,<br>
author = {Hermansky, Hynek and Variani, Ehsan and Peddinti, Vijayaditya}, <br>
title = {Mean Temporal Distance: Predicting ASR Error from Temporal Properties of Speech Signal}, <br>
booktitle = {Proc. ICASSP}, <br>
address = {Vancouver, Canada}<br>
}</div>
</div>
<!-- Paper 10-->
<a id="peddinti2013filterbank"></a>
<div id="pub-1007" class="publications alt">
<p><a class="title-link" href="http://mahe.ece.jhu.edu/uploads/publications/papers/16653_slides.pdf" target="_blank">
Filter-Bank Optimization for Frequency Domain Linear Prediction</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a> and <a href="http://www.clsp.jhu.edu/~hynek/" target="_blank">Hynek Hermansky</a><br>
<em>Proceedings of ICASSP, 2013</em></p>
<p align="right"><a id="abstract-1007" class="cursor" rel="toggle" title="Filter-Bank Optimization for Frequency Domain Linear Prediction">[abstract]</a> <a id="bib-1007" class="cursor" rel="toggle" title="Filter-Bank Optimization for Frequency Domain Linear Prediction">[bib]</a></p>
<div id="abstract-1007" class="hide abstract" style="display: none;">
<h3>Abstract</h3>
The sub-band Frequency Domain Linear Prediction (FDLP) technique estimates autoregressive models of Hilbert envelopes of subband signals, from segments of discrete cosine transform (DCT) of a speech signal, using windows. Shapes of the windows and their positions on the cosine transform of the signal determine implied filtering of the signal. Thus, the choices of shape, position and number of these windows can be critical for the performance of the FDLP technique. So far, we have used Gaussian or rectangular windows. In this paper asymmetric cochlear-like filters are being studied. Further, a frequency differentiation operation, that introduces an additional set of parameters describing local spectral slope in each frequency sub-band, is introduced to increase the robustness of sub-band envelopes in noise. The performance gains achieved by these changes are reported in a variety of additive noise conditions, with an average relative improvement of 8.04% in phoneme recognition accuracy.</div>
<div id="bib-1007" class="hide bib align-left">
@inproceedings{peddinti2013filterbank,<br>
author = {Peddinti, Vijayaditya and Hermansky, Hynek}, <br>
title = {Filter-Bank Optimization for Frequency Domain Linear Prediction}, <br>
booktitle = {Proceedings of ICASSP}, <br>
address = {Vancouver, Canada}, <br>
publisher = {IEEE}, <br>
pages = {7102 - 7106}<br>
}</div>
</div>
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
<!-- 2011 publications -->
<h3>2011</h3>
<!-- Paper 11-->
<a id="peddinti2011"></a>
<div id="pub-1056" class="publications">
<p><a class="title-link" href="http://ravi.iiit.ac.in/%7Espeech/publications/C47.pdf" target="_blank">
Significance of vowel epenthesis in Telugu text-to-speech synthesis</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a> and K. Prahallad<br>
<em>Proceedings of ICASSP, 2011</em></p>
<p align="right"><a id="abstract-1056" class="cursor" rel="toggle" title="Significance of vowel epenthesis in Telugu text-to-speech synthesis">[abstract]</a> <a id="bib-1056" class="cursor" rel="toggle" title="Significance of vowel epenthesis in Telugu text-to-speech synthesis">[bib]</a></p>
<div id="abstract-1056" class="hide abstract" style="display: none;">
<h3>Abstract</h3>
Unit selection synthesis inventories have coverage issues, which lead to missing syllable or diphone units. In the conventional back-off strategy of substituting the missing unit with approximate unit(s), the rules for approximate matching are hard to derive. In this paper we propose a back-off strategy for Telugu TTS systems emulating native speaker intuition. It uses reduced vowel insertion in complex consonant clusters to replace missing units. The inserted vowel identity is determined using a rule-set adapted from L2 (second language) acquisition research in Telugu, reducing the effort required in preparing the rule-set. Subjective evaluations show that the proposed back-off method performs better than the conventional methods.
</div>
<div id="bib-1056" class="hide bib align-left">
@inproceedings{peddinti2011,<br>
author = {Peddinti, Vijayaditya and K. Prahallad}, <br>
title = {Significance of vowel epenthesis in Telugu text-to-speech synthesis}, <br>
booktitle = {Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on}, <br>
pages = {5348-5351}<br>
}</div>
</div>
<!-- Paper 12-->
<a id="peddinti2011exploiting"></a>
<div id="pub-1059" class="publications alt">
<p><a class="title-link" href="http://ravi.iiit.ac.in/%7Espeech/publications/C48.pdf" target="_blank">
Exploiting Phone-Class Specific Landmarks for Refinement of Segment Boundaries in TTS Databases</a><br>
<a href="http://vijaypeddinti.com" target="_blank">Vijayaditya Peddinti</a> and Kishore Prahallad<br>
<em>Proceedings of Interspeech, 2011</em></p>
<p align="right"><a id="abstract-1059" class="cursor" rel="toggle" title="Exploiting Phone-Class Specific Landmarks for Refinement of Segment Boundaries in TTS Databases">[abstract]</a> <a id="bib-1059" class="cursor" rel="toggle" title="Exploiting Phone-Class Specific Landmarks for Refinement of Segment Boundaries in TTS Databases">[bib]</a></p>
<div id="abstract-1059" class="hide abstract" style="display: none;">
<h3>Abstract</h3>
High accuracy speech segmentation methods invariably depend on manually labelled data. However under-resourced languages do not have annotated speech corpora required for training these segmentors. In this paper we propose a boundary refinement technique which uses knowledge of phone-class specific subband energy events, in place of manual labels, to guide the refinement process. The use of this knowledge enables proper placement of boundaries in regions with multiple spectral discontinuities in close proximity. It also helps in the correction of large alignment errors. The proposed refinement technique provides boundaries with an accuracy of 82% within 20ms of actual boundary. Combining the proposed technique with iterative isolated HMM training technique boosts the accuracy to 89%, without the use of any manually labelled data.</div>
<div id="bib-1059" class="hide bib align-left">
@inproceedings{peddinti2011exploiting,<br>
author = {Peddinti, Vijayaditya and Kishore Prahallad}, <br>
title = {Exploiting Phone-Class Specific Landmarks for Refinement of Segment Boundaries in TTS Databases}, <br>
booktitle = {Proceedings of Interspeech 2011}<br>
}</div>
</div>
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
</div>
</div>
<h1>
<hr>
</h1>
<p><a name="experience"></a></p>
<h3>Experience</h3>
<ul>
<li><strong>Participant, JSALT-2015 Workshop </strong>Jul '15 - Aug '15
<a href="http://www.clsp.jhu.edu/workshops/15-workshop/far-field-enhancement-and-recognition-in-mismatched-settings/" > [homepage] </a>
<a href="https://www.youtube.com/watch?v=OncfrRwZPs8&#t=54m20s" > [video] </a>
</li>
<li><strong>Research Intern, Microsoft Research </strong>
<br> Mentor: <a href="http://research.microsoft.com/en-us/people/mseltzer/" target="_blank">Mike Seltzer</a> Sept '14 - Dec '14</li>
<li><strong>Research Intern, IBM T.J. Watson Research Center</strong>
<br> Mentor: <a href="https://sites.google.com/site/tsainath/" target="_blank">Tara Sainath</a> May '13 - Aug '13</li>
<li><strong>Participant, JSALT-2014 Workshop </strong>July '14 - August '14
<a href="http://www.clsp.jhu.edu/workshops/14-workshop/asr-machines-that-know-when-they-do-not-know/" > [homepage] </a>
<a href="https://vijayaditya.github.io/other_files/jsalt_14.pdf" > [pdf] </a>
<a href="https://www.youtube.com/watch?v=RAgQe0EsPwA#t=15m38s" > [video] </a>
</li>
<li><strong>Participant, Zero resource workshop </strong>July '12
<a href="http://www.clsp.jhu.edu/workshops/12-workshop/zero-resource-speech-technologies-and-models-of-early-language-acquisition/" > [homepage] </a>
<a href="http://www.clsp.jhu.edu/workshops/12-workshop/zero-resource-speech-technologies-and-models-of-early-language-acquisition/combined-final-presentation">[pdf]</a>
<a href="http://webcast.jhu.edu/Mediasite/Play/12a4b40e7b524bec93d273ca66d453e11d">[video1]</a>
<a href="http://webcast.jhu.edu/mediasite/Viewer/?peid=e864d9ab0270404f9b09ec664cee1ebf1d">[video2]</a> </li>
<li><strong>Research Assistant, JHU</strong>, September '11 - </li>
<li><strong>Teaching Assistant, JHU</strong><br>
Course: Processing of Audio and Visual Signals (Instructor: Prof. Hynek Hermansky)<br>
Course: Speech and Audio processing by humans and machines (Instructor: Prof. Hynek Hermansky)</li>
<li><strong>Analytics Intern, I-Labs, 24/7 Customer, Bangalore</strong>, Jan '11 - Jul '11<br>Part of the text and data mining team. Developed a prototype for event detection in Twitter, Facebook, forum and customer-care chat data.</li>
<li><strong>Research Assistant, IIIT-Hyderabad, India</strong>, Dec '08 - Dec '10<br>Worked on the <em>Indian Language TTS (Ministry of Commn. and Info. Tech., India)</em> and <em>Indian Language Data Collection (LDC-IL) </em>projects</li>
<li><strong>Technical Associate, TechMahindra Ltd., </strong>Jul '07 - Jul '08</li>
</ul>
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
<hr>
<p><a name="projects"></a></p>
<h3>Projects</h3>
<ul>
<li><strong>Robust Automatic Transcription of Speech (RATS):</strong><br> DARPA project </li>
<li><strong>Indian Language TTS,<br></strong><em>Funded by Ministry of Commn. &amp; Info. Tech., India (MCIT)</em><br>Involved in the development of a text-to-speech (TTS) synthesizer for Telugu as part of the Indian Language TTS Consortium. Developed an algorithm for automatic segmentation of audio databases (published in Interspeech, 2011), designed a back-off strategy for missing units (published in ICASSP, 2011), and implemented a syllable-based synthesizer in the Festival framework.</li>
<li><strong>Indian Language Data Collection</strong><br><em>Funded by Linguistic Data Consortium for Indian Languages (LDC-IL)<br></em>Worked on automatic generation of phonetic alignments for Telugu audio data with erroneous transcripts, as part of the Indian Language Data Collection project, which collected 500 hours of speech data each in Telugu, Kannada and English.</li>
<li><strong>Temporal Event Detection in Social Media Streams</strong><br>At I-Labs, 24/7 Customer<br>Developed an algorithm for event detection in volume time series built from multiple data streams such as microblogs (e.g. Twitter), social networks (e.g. Facebook) and chats (from customer service centers).</li>
</ul>
<!-- Back to top option -->
<p align="right"><small><a href="http://vijaypeddinti.com#">Back to Top</a></small></p>
<hr>
<div id="footer">
<p>© Vijayaditya Peddinti, 2015. All rights reserved.</p>
</div>
</div>
<div class="clear"></div>
</div>
<!-- Java Script -->
<script type="text/javascript" src="./index_files/jquery.min.js"></script>
<script type="text/javascript">
// Shade every other publication entry.
$('.publications:odd').addClass('alt');
// The header-level link toggles all abstracts at once. It carries both the
// "expand" and "compress" classes, so bind the handler only once; binding it
// separately for each class would fire the toggle twice per click and undo itself.
$('a.expand').bind('click', function () {
$('div.abstract').toggle('blind');
$(this).toggleClass('compress');
});
// Per-paper [abstract] and [bib] links show/hide the div that shares their id.
$("a[rel=toggle]").bind('click', function () {
var id = $(this).attr('id');
$('div#' + id).toggle('blind');
});
</script>
</body></html>