
GigaSpeech recipe #120

Merged
merged 48 commits into k2-fsa:master from wgb14:gigaspeech_recipe on Apr 14, 2022

Conversation


@wgb14 wgb14 commented Nov 14, 2021

Features:

  • support BPE based lang
  • chunked feature extraction with GPU by default (replacing the initial on-the-fly feature extraction)

TODO:

  • support phone based lang
    • cmudict2lexicon
    • g2p
    • download pretrained lexicon and LM
  • language model pruning
  • debugging and running experiments

# Therefore, we sacrifice some storage for the ability to
# precompute features on shorter chunks,
# without memory blow-ups.
cut_set = cut_set.compute_and_store_features(
Collaborator

I thought Piotr was suggesting compute_and_store_features_batch
See lhotse-speech/lhotse#452 (comment)

Collaborator

Yes -- you'd do best to leverage the changes made by @glynpu here: #100

Contributor Author

@wgb14 wgb14 Nov 16, 2021

@csukuangfj @pzelasko Does this method also work for the XL set? And we no longer need on-the-fly feature extraction, nor pre-shuffling by default, right?

Collaborator

I am testing the pre-computed-features interface with the "L" and "XL" subsets, as suggested by Piotr here: lhotse-speech/lhotse#452

Collaborator

This method should work for XL -- Liyong ran into some issues before, but I suspect it was related to data prep; we should see soon.

You won't need on-the-fly extraction, but it still allows you to save a lot of disk space. You don't need pre-shuffling anymore though -- Lhotse samplers implement a streaming shuffle variant for lazily-opened CutSets (it works with zero code changes when shuffle=True is passed to a sampler).
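
For reference, a minimal sketch of that usage (the manifest path and max_duration value below are placeholders, and exact names may vary across lhotse versions):

    from lhotse import CutSet
    from lhotse.dataset import DynamicBucketingSampler

    # Open a large manifest lazily; cuts are streamed rather than held in memory.
    cuts = CutSet.from_jsonl_lazy("data/fbank/cuts_XL.jsonl.gz")  # placeholder path

    # With shuffle=True, the sampler applies the streaming shuffle for lazily-opened
    # CutSets, so no pre-shuffling of the manifest is required.
    sampler = DynamicBucketingSampler(cuts, max_duration=200, shuffle=True)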

Contributor Author

Changed to chunked feature extraction with GPU by default here, but got an OOM error while running on the grid:
[screenshot of the OOM error]
Any insights?

Collaborator

Is it possible that the default value of --batch-duration, which is 600, is too large?

Collaborator

That is not about GPU memory; it's an error during fork, which is likely due to a limit on virtual memory. This is probably about the number of dataloader processes (if that's an option), and about the overall size of the manifest.

Collaborator

Yeah it's your CPU memory that's blown up. You can try to decrease num workers (there is an option), but I think Dan is right about the manifest size. There is a way to make it load the manifest lazily, but I suggest you start with splitting the cut manifest into N splits and read->process->save each of them sequentially, and then recombine (you can even do that with bash commands gunzip, split, cat, gzip). I will add some code to support CUDA extraction with streaming manifest reads and writes later.
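
A rough Python sketch of that split -> process -> save -> recombine idea (the paths, the split count, and the process() placeholder are hypothetical):

    from lhotse import CutSet, combine

    def process(piece: CutSet) -> CutSet:
        # Placeholder for the per-piece work, e.g. feature extraction.
        return piece

    cuts = CutSet.from_file("data/fbank/cuts_XL_raw.jsonl.gz")  # hypothetical path

    processed = []
    for i, piece in enumerate(cuts.split(num_splits=1000)):
        piece = process(piece)
        piece.to_file(f"data/fbank/cuts_XL.{i + 1}.jsonl.gz")
        processed.append(piece)

    # Recombine the processed pieces into a single manifest.
    combine(processed).to_file("data/fbank/cuts_XL.jsonl.gz")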

Collaborator

... actually, if you're using speed perturbation or other augmentations, ditching them might solve your issues. You can still use MUSAN, SpecAugment, etc. in Dataset later.
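
For reference, a minimal sketch of applying those augmentations at Dataset time instead (the MUSAN manifest path is a placeholder, and argument names may differ slightly between lhotse versions):

    from lhotse import CutSet
    from lhotse.dataset import CutMix, K2SpeechRecognitionDataset, SpecAugment

    cuts_musan = CutSet.from_file("data/fbank/cuts_musan.jsonl.gz")  # placeholder path

    # MUSAN mixing and SpecAugment are applied on the fly inside the Dataset,
    # so the precomputed features themselves can stay augmentation-free.
    dataset = K2SpeechRecognitionDataset(
        cut_transforms=[CutMix(cuts=cuts_musan, snr=(10, 20), prob=0.5)],
        input_transforms=[SpecAugment()],
    )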

# Note:
# we support very efficient "chunked" feature reads with
# the argument `storage_type=ChunkedLilcomHdf5Writer`,
# but we don't support efficient data augmentation and
Collaborator

            # but we don't support efficient data augmentation and
            # feature computation for long recordings yet.

Is it still the case with kaldifeat?

Collaborator

No, kaldifeat solves the problem of feature computation, but data augmentation is still an issue. This may actually be another reason why @wgb14 observed OOM (I remember running into this before).

# precompute features on shorter chunks,
# without memory blow-ups.
if torch.cuda.is_available():
logging.info("GPU detected, do the CUDA extraction.")
Collaborator

I think this line can be moved to where the extractor is created, i.e., to line 143.

help="Number of dataloading workers used for reading the audio.",
)
parser.add_argument(
"--batch-duration",
Collaborator

Also, there is another parameter:
https://github.com/lhotse-speech/lhotse/blob/master/lhotse/features/kaldifeat.py#L158

    chunk_size: Optional[int] = 1000

You can use

extractor = KaldifeatFbank(
            KaldifeatFbankConfig(device="cuda", chunk_size=<some_value>),
        )

to limit the chunk size. I am not sure whether your OOM error is related to these two parameters.

Collaborator

1000 is a reasonable default, you shouldn't get OOM with that value.

Collaborator

@danpovey danpovey Nov 17, 2021

[posting on web as email unreliable] There are also commands in lhotse to split and recombine manifests. I forget the names/invocation but Piotr answered my question on this in either this repo or lhotse's repo.

@danpovey
Collaborator

danpovey commented Nov 17, 2021 via email

@csukuangfj
Collaborator

I forget the names/invocation but Piotr answered my question on this in either this
repo or lhotse's repo.

It is in lhotse-speech/lhotse#452 (comment)

@csukuangfj
Collaborator

csukuangfj commented Nov 28, 2021

Looks like there has not been much progress on the GigaSpeech recipe for more than 2 weeks.

I just made a PR wgb14#1
to compute the features by splitting the manifests before extraction and combining them afterwards.

It seems to resolve the OOM issue. The expected time to extract the features of the XL subset is about 2 days using a single GPU, I think. If you use more GPUs, it should decrease the time linearly.
(Note: After speed perturbation, the XL subset contains 30k hours of data. The GPU is idle most of the time, so I think computation is not the bottleneck.)

csukuangfj and others added 2 commits November 28, 2021 15:10
Compute features for GigaSpeech by splitting the manifest.
@csukuangfj
Collaborator

The screenshot below compares the speed of feature extraction between CUDA and CPU on 1 of the 1000 pieces of the XL subset.

  • CUDA: 3 minutes 13 seconds = 193 seconds
  • CPU: 7 minutes 19 seconds = 439 seconds
  • 439 / 193 = 2.2746

[screenshot comparing CUDA vs. CPU extraction time]

@csukuangfj
Collaborator

csukuangfj commented Nov 28, 2021

The following screenshot shows the memory consumption when extracting features of the XL subset on CUDA with --num-workers=20 --batch-duration=600.

I believe there will be no OOM anymore. Otherwise, we would have to use a larger number of splits, e.g., 2000 instead of 1000.
(Note: Splitting into only 100 pieces still causes OOM.)


[screenshot of memory consumption during extraction]

@csukuangfj
Collaborator

Also, note that we don't need to limit the number of decoding threads used by ffmpeg.

diff --git a/lhotse/audio.py b/lhotse/audio.py
index 2190dc9..ca623cf 100644
--- a/lhotse/audio.py
+++ b/lhotse/audio.py
@@ -1437,7 +1437,8 @@ def read_opus_ffmpeg(
     :return: a tuple of audio samples and the sampling rate.
     """
     # Construct the ffmpeg command depending on the arguments passed.
-    cmd = f"ffmpeg -threads 1"
+    #  cmd = f"ffmpeg -threads 1"
+    cmd = f"ffmpeg"
     sampling_rate = 48000
     # Note: we have to add offset and duration options (-ss and -t) BEFORE specifying the input
     #       (-i), otherwise ffmpeg will decode everything and trim afterwards...
@@ -1452,7 +1453,8 @@ def read_opus_ffmpeg(
         cmd += f" -ar {force_opus_sampling_rate}"
         sampling_rate = force_opus_sampling_rate
     # Read audio samples directly as float32.
-    cmd += " -f f32le -threads 1 pipe:1"
+    #  cmd += " -f f32le -threads 1 pipe:1"
+    cmd += " -f f32le pipe:1"
     # Actual audio reading.
     proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
     raw_audio = proc.stdout

@pzelasko Shall we revert https://github.com/lhotse-speech/lhotse/pull/481/files ?

@danpovey
Collaborator

Great!
Regarding limiting ffmpeg threads, can we see whether it makes a difference to speed before reverting that?
Sometimes when you run something in multiple processes, having it use multiple threads can be slower, depending on the mechanism it uses.

@csukuangfj
Collaborator

Regarding limiting ffmpeg threads, can we see whether it makes a difference to speed before reverting that?

Will compare the speed with/without multiple decoding threads for ffmpeg.

logging.info(f"device: {device}")

for i in range(num_splits):
idx = f"{i + 1}".zfill(num_digits)
Contributor Author

Got an error:

AssertionError: No such path: data/fbank/XL_split_1000/cuts_XL_raw.0001.jsonl.gz

So I modified this line to idx = f"{i + 1}"

Contributor Author

@wgb14 wgb14 Nov 28, 2021

So in my experiment it's data/fbank/XL_split_1000/cuts_XL_raw.1.jsonl.gz, but from what @csukuangfj posted it's data/fbank/XL_split_1000/cuts_XL_raw.0001.jsonl.gz.
How did this inconsistency arise?

Collaborator

How did this inconsistency arise?

Please use the latest master of lhotse.
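
For illustration, the two filenames differ only in whether the split index is zero-padded (newer lhotse apparently writes the zero-padded form):

    num_digits = len(str(1000))  # 4 digits for 1000 splits
    i = 0
    print(f"cuts_XL_raw.{i + 1}.jsonl.gz")                         # cuts_XL_raw.1.jsonl.gz
    print(f"cuts_XL_raw.{str(i + 1).zfill(num_digits)}.jsonl.gz")  # cuts_XL_raw.0001.jsonl.gz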

@wgb14
Contributor Author

wgb14 commented Nov 28, 2021

This time I didn't get a CPU OOM, but got a CUDA OOM error while processing split 137/1000:

RuntimeError: CUDA out of memory. Tried to allocate 8.07 GiB (GPU 0; 22.41 GiB total capacity; 12.49 GiB already allocated; 7.77 GiB free; 14.00 GiB reserved in total by PyTorch)

I believe that 22 GB of GPU memory is already more than most commonly used GPUs have, so I'll reduce the value of some params. Which one should I reduce, --num-workers=20 or --batch-duration=600?

@csukuangfj
Collaborator

This time I didn't get a CPU OOM, but got a CUDA OOM error while processing split 137/1000:

Please install the latest kaldifeat. There was a bug in it: it was not using chunk_size when computing features, which caused CUDA OOM for long utterances.

I just fixed it in csukuangfj/kaldifeat#22


BTW, ./prepare.sh --stage 6 will continue the extraction from where it stopped.
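
A guess at how that resumption behaviour can be implemented (not necessarily the actual prepare.sh logic; paths and naming are hypothetical): skip any split whose output manifest already exists.

    from pathlib import Path

    num_splits = 1000  # hypothetical
    for i in range(num_splits):
        out = Path(f"data/fbank/cuts_XL.{i + 1:04d}.jsonl.gz")  # hypothetical per-split output
        if out.is_file():
            continue  # already extracted in a previous run
        # ... extract features for split i and write the manifest to `out` ...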

@csukuangfj
Collaborator

I believe that 22 GB of GPU memory is already more than most commonly used GPUs have

Because utterances in GigaSpeech are several hours long and chunkwise extraction was previously disabled by mistake.

After that fix, it should not use that much GPU memory anymore.

@wgb14
Contributor Author

wgb14 commented Apr 8, 2022

Updated results (using HLG decoding + n-gram LM rescoring + attention decoder rescoring):

Dev Test
WER 10.5

Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:

ngram_lm_scale attention_scale
0.3 1.5

Got a much better number. But it seems that this pair of scales is no longer the best. I'll redo the param scanning and update numbers later.

@chenguoguo

This is great news. Guanbo, once you get all the numbers and clean up the recipe, you can update the leaderboard here: https://github.com/SpeechColab/GigaSpeech#leaderboard

@wgb14
Contributor Author

wgb14 commented Apr 11, 2022

Updating results:
Results using HLG decoding + n-gram LM rescoring + attention decoder rescoring:

Dev Test
WER 10.47 10.58

Results using HLG decoding + whole lattice rescoring:

Dev Test
WER 10.51 10.62

@dophist

dophist commented Apr 12, 2022

Updating results: Results using HLG decoding + n-gram LM rescoring + attention decoder rescoring:

Dev Test
WER 10.47 10.58
Results using HLG decoding + whole lattice rescoring:

Dev Test
WER 10.51 10.62

This is the best number I've ever seen for GigaSpeech, great.

@danpovey
Collaborator

Guanbo, if you want to run an RNN-T recipe, please now use the setup in pruned_transducer_stateless2 from librispeech. This converges much faster than the old setup.
The only options you might want to change are:
--lr-epochs (reduce from 6 to some number less than about half the number of epochs you plan to train, e.g. 2 or 3).
You might want to add --use-fp16=True to use half-precision to speed up training...
and you can adjust --max-duration accordingly. --use-fp16=True allows, on our GPUs, to increase max-duration from 300 to 550. (However it will converge slower near the start of training if the max-duration is too high, so this is a tradeoff).

@csukuangfj
Collaborator

@wgb14

Could you also upload the training log exp/log/log-xxxx-0 in the result directory?
I want to compare the training time across batches as I find that it is slow to do on-the-fly feature extraction with GigaSpeech during training.

@wgb14
Contributor Author

wgb14 commented Apr 12, 2022

@wgb14

Could you also upload the training log exp/log/log-xxxx-0 in the result directory? I want to compare the training time across batches as I find that it is slow to do on-the-fly feature extraction with GigaSpeech during training.

log-train-2022-04-01-02-04-11-0.txt

I used precomputed features for training. It took about 5 minutes per 500 batches, with --max-duration 120 --num-workers 1.

@wgb14
Contributor Author

wgb14 commented Apr 12, 2022

Guanbo, if you want to run an RNN-T recipe, please now use the setup in pruned_transducer_stateless2 from librispeech. This converges much faster than the old setup. The only options you might want to change are: --lr-epochs (reduce from 6 to some number less than about half the number of epochs you plan to train, e.g. 2 or 3). You might want to add --use-fp16=True to use half-precision to speed up training... and you can adjust --max-duration accordingly. --use-fp16=True allows, on our GPUs, to increase max-duration from 300 to 550. (However it will converge slower near the start of training if the max-duration is too high, so this is a tradeoff).

Thanks for all the suggestions! I'll open a PR if I get resources to train an RNN-T model.

@csukuangfj
Collaborator

Is it ready for merge?

@chenguoguo

chenguoguo commented Apr 13, 2022 via email

@wgb14
Contributor Author

wgb14 commented Apr 13, 2022

Is it ready for merge?

Ready for merge now

print(f"{text} {uttid_field}", file=fo)

# GigaSpeech's uttid conforms to swb
os.system(f"sclite -r {REF} trn -h {HYP} trn -i swb | tee {RESULT}")
Collaborator

Is there any documentation about how to install sclite?

I think k2 is using https://github.com/pzelasko/kaldialign to compute the edit distance and it also outputs different types of errors.
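
For reference, a minimal sketch of error counting with kaldialign (the reference/hypothesis strings are made up):

    from kaldialign import edit_distance

    ref = "the quick brown fox jumps".split()
    hyp = "the quick brown box jumps high".split()

    # Returns counts of insertions, deletions, substitutions, and the total edit distance.
    stats = edit_distance(ref, hyp)
    print(stats)  # e.g. {'ins': 1, 'del': 0, 'sub': 1, 'total': 2}
    print(f"WER = {stats['total'] / len(ref):.2%}")  # 40.00% for this toy example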

Contributor Author

We don't use sclite for scoring in icefall; we only use the function asr_text_post_processing in decode.py.

Contributor Author

@wgb14 wgb14 Apr 14, 2022

k2 and sclite return the same WER numbers.

Collaborator

It seems the "-i swb" option specifies the "utterance-id type", which apparently means: "the utterance id is made up of a speaker code, followed by a hyphen or underscore, followed by an utterance number."
I think that if we only care about the final WER, sclite is probably doing the same thing as kaldialign and may not be necessary.
Installing sctk is not exactly trivial, so it would be good to get rid of this dependency.

Collaborator

Thanks.

Collaborator

I find that this code is only executed when this script is invoked from the command line.
Its core part is asr_text_post_processing, which is imported in decode.py.

Contributor Author

Yes, I agree. The main function of this file isn't used in our scripts; it is just an option. I can delete these lines if they introduce any confusion.

@danpovey
Collaborator

I'm OK with keeping the sclite stuff in, for documentation purposes, as long as it's totally optional and this is made clear to the user.

@csukuangfj
Collaborator

Thanks! I am merging it.

@csukuangfj csukuangfj merged commit 5fe58de into k2-fsa:master Apr 14, 2022
csukuangfj added a commit that referenced this pull request Apr 22, 2022
…er model (#327)

* Fix torch.nn.Embedding error for torch below 1.8.0

* Changes to fbank computation, use lilcom chunky writer

* Add min in q,k,v of attention

* Remove learnable offset, use relu instead.

* Experiments based on SpecAugment change

* Merge specaug change from Mingshuang.

* Use much more aggressive SpecAug setup

* Fix to num_feature_masks bug I introduced; reduce max_frames_mask_fraction 0.4->0.3

* Change p=0.5->0.9, mask_fraction 0.3->0.2

* Change p=0.9 to p=0.8 in SpecAug

* Fix num_time_masks code; revert 0.8 to 0.9

* Change max_frames from 0.2 to 0.15

* Remove ReLU in attention

* Adding diagnostics code...

* Refactor/simplify ConformerEncoder

* First version of rand-combine iterated-training-like idea.

* Improvements to diagnostics (RE those with 1 dim

* Add pelu to this good-performing setup..

* Small bug fixes/imports

* Add baseline for the PeLU expt, keeping only the small normalization-related changes.

* pelu_base->expscale, add 2xExpScale in subsampling, and in feedforward units.

* Double learning rate of exp-scale units

* Combine ExpScale and swish for memory reduction

* Add import

* Fix backprop bug

* Fix bug in diagnostics

* Increase scale on Scale from 4 to 20

* Increase scale from 20 to 50.

* Fix duplicate Swish; replace norm+swish with swish+exp-scale in convolution module

* Reduce scale from 50 to 20

* Add deriv-balancing code

* Double the threshold in brelu; slightly increase max_factor.

* Fix exp dir

* Convert swish nonlinearities to ReLU

* Replace relu with swish-squared.

* Restore ConvolutionModule to state before changes; change all Swish,Swish(Swish) to SwishOffset.

* Replace norm on input layer with scale of 0.1.

* Extensions to diagnostics code

* Update diagnostics

* Add BasicNorm module

* Replace most normalizations with scales (still have norm in conv)

* Change exp dir

* Replace norm in ConvolutionModule with a scaling factor.

* use nonzero threshold in DerivBalancer

* Add min-abs-value 0.2

* Fix dirname

* Change min-abs threshold from 0.2 to 0.5

* Scale up pos_bias_u and pos_bias_v before use.

* Reduce max_factor to 0.01

* Fix q*scaling logic

* Change max_factor in DerivBalancer from 0.025 to 0.01; fix scaling code.

* init 1st conv module to smaller variance

* Change how scales are applied; fix residual bug

* Reduce min_abs from 0.5 to 0.2

* Introduce in_scale=0.5 for SwishExpScale

* Fix scale from 0.5 to 2.0 as I really intended..

* Set scaling on SwishExpScale

* Add identity pre_norm_final for diagnostics.

* Add learnable post-scale for mha

* Fix self.post-scale-mha

* Another rework, use scales on linear/conv

* Change dir name

* Reduce initial scaling of modules

* Bug-fix RE bias

* Cosmetic change

* Reduce initial_scale.

* Replace ExpScaleRelu with DoubleSwish()

* DoubleSwish fix

* Use learnable scales for joiner and decoder

* Add max-abs-value constraint in DerivBalancer

* Add max-abs-value

* Change dir name

* Remove ExpScale in feedforward layes.

* Reduce max-abs limit from 1000 to 100; introduce 2 DerivBalancer modules in conv layer.

* Make DoubleSwish more memory efficient

* Reduce constraints from deriv-balancer in ConvModule.

* Add warmup mode

* Remove max-positive constraint in deriv-balancing; add second DerivBalancer in conv module.

* Add some extra info to diagnostics

* Add deriv-balancer at output of embedding.

* Add more stats.

* Make epsilon in BasicNorm learnable, optionally.

* Draft of 0mean changes..

* Rework of initialization

* Fix typo

* Remove dead code

* Modifying initialization from normal->uniform; add initial_scale when initializing

* bug fix re sqrt

* Remove xscale from pos_embedding

* Remove some dead code.

* Cosmetic changes/renaming things

* Start adding some files..

* Add more files..

* update decode.py file type

* Add remaining files in pruned_transducer_stateless2

* Fix diagnostics-getting code

* Scale down pruned loss in warmup mode

* Reduce warmup scale on pruned loss form 0.1 to 0.01.

* Remove scale_speed, make swish deriv more efficient.

* Cosmetic changes to swish

* Double warm_step

* Fix bug with import

* Change initial std from 0.05 to 0.025.

* Set also scale for embedding to 0.025.

* Remove logging code that broke with newer Lhotse; fix bug with pruned_loss

* Add norm+balancer to VggSubsampling

* Incorporate changes from master into pruned_transducer_stateless2.

* Add max-abs=6, debugged version

* Change 0.025,0.05 to 0.01 in initializations

* Fix balancer code

* Whitespace fix

* Reduce initial pruned_loss scale from 0.01 to 0.0

* Increase warm_step (and valid_interval)

* Change max-abs from 6 to 10

* Change how warmup works.

* Add changes from master to decode.py, train.py

* Simplify the warmup code; max_abs 10->6

* Make warmup work by scaling layer contributions; leave residual layer-drop

* Fix bug

* Fix test mode with random layer dropout

* Add random-number-setting function in dataloader

* Fix/patch how fix_random_seed() is imported.

* Reduce layer-drop prob

* Reduce layer-drop prob after warmup to 1 in 100

* Change power of lr-schedule from -0.5 to -0.333

* Increase model_warm_step to 4k

* Change max-keep-prob to 0.95

* Refactoring and simplifying conformer and frontend

* Rework conformer, remove some code.

* Reduce 1st conv channels from 64 to 32

* Add another convolutional layer

* Fix padding bug

* Remove dropout in output layer

* Reduce speed of some components

* Initial refactoring to remove unnecessary vocab_size

* Fix RE identity

* Bug-fix

* Add final dropout to conformer

* Remove some un-used code

* Replace nn.Linear with ScaledLinear in simple joiner

* Make 2 projections..

* Reduce initial_speed

* Use initial_speed=0.5

* Reduce initial_speed further from 0.5 to 0.25

* Reduce initial_speed from 0.5 to 0.25

* Change how warmup is applied.

* Bug fix to warmup_scale

* Fix test-mode

* Remove final dropout

* Make layer dropout rate 0.075, was 0.1.

* First draft of model rework

* Various bug fixes

* Change learning speed of simple_lm_proj

* Revert transducer_stateless/ to state in upstream/master

* Fix to joiner to allow different dims

* Some cleanups

* Make training more efficient, avoid redoing some projections.

* Change how warm-step is set

* First draft of new approach to learning rates + init

* Some fixes..

* Change initialization to 0.25

* Fix type of parameter

* Fix weight decay formula by adding 1/1-beta

* Fix weight decay formula by adding 1/1-beta

* Fix checkpoint-writing

* Fix to reading scheudler from optim

* Simplified optimizer, rework somet things..

* Reduce model_warm_step from 4k to 3k

* Fix bug in lambda

* Bug-fix RE sign of target_rms

* Changing initial_speed from 0.25 to 01

* Change some defaults in LR-setting rule.

* Remove initial_speed

* Set new scheduler

* Change exponential part of lrate to be epoch based

* Fix bug

* Set 2n rule..

* Implement 2o schedule

* Make lrate rule more symmetric

* Implement 2p version of learning rate schedule.

* Refactor how learning rate is set.

* Fix import

* Modify init (#301)

* update icefall/__init__.py to import more common functions.

* update icefall/__init__.py

* make imports style consistent.

* exclude black check for icefall/__init__.py in pyproject.toml.

* Minor fixes for logging (#296)

* Minor fixes for logging

* Minor fix

* Fix dir names

* Modify beam search to be efficient with current joienr

* Fix adding learning rate to tensorboard

* Fix docs in optim.py

* Support mix precision training on the reworked model (#305)

* Add mix precision support

* Minor fixes

* Minor fixes

* Minor fixes

* Tedlium3 pruned transducer stateless (#261)

* update tedlium3-pruned-transducer-stateless-codes

* update README.md

* update README.md

* add fast beam search for decoding

* do a change for RESULTS.md

* do a change for RESULTS.md

* do a fix

* do some changes for pruned RNN-T

* Add mix precision support

* Minor fixes

* Minor fixes

* Updating RESULTS.md; fix in beam_search.py

* Fix rebase

* Code style check for librispeech pruned transducer stateless2 (#308)

* Update results for tedlium3 pruned RNN-T (#307)

* Update README.md

* Fix CI errors. (#310)

* Add more results

* Fix tensorboard log location

* Add one more epoch of full expt

* fix comments

* Add results for mixed precision with max-duration 300

* Changes for pretrained.py (tedlium3 pruned RNN-T) (#311)

* GigaSpeech recipe (#120)

* initial commit

* support download, data prep, and fbank

* on-the-fly feature extraction by default

* support BPE based lang

* support HLG for BPE

* small fix

* small fix

* chunked feature extraction by default

* Compute features for GigaSpeech by splitting the manifest.

* Fixes after review.

* Split manifests into 2000 pieces.

* set audio duration mismatch tolerance to 0.01

* small fix

* add conformer training recipe

* Add conformer.py without pre-commit checking

* lazy loading and use SingleCutSampler

* DynamicBucketingSampler

* use KaldifeatFbank to compute fbank for musan

* use pretrained language model and lexicon

* use 3gram to decode, 4gram to rescore

* Add decode.py

* Update .flake8

* Delete compute_fbank_gigaspeech.py

* Use BucketingSampler for valid and test dataloader

* Update params in train.py

* Use bpe_500

* update params in decode.py

* Decrease num_paths while CUDA OOM

* Added README

* Update RESULTS

* black

* Decrease num_paths while CUDA OOM

* Decode with post-processing

* Update results

* Remove lazy_load option

* Use default `storage_type`

* Keep the original tolerance

* Use split-lazy

* black

* Update pretrained model

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Add LG decoding (#277)

* Add LG decoding

* Add log weight pushing

* Minor fixes

* Support computing RNN-T loss with torchaudio (#316)

* Support modified beam search decoding for streaming inference with Emformer model.

* Formatted imports.

* Update results for torchaudio RNN-T. (#322)

* Fixed streaming decoding codes for emformer model.

* Fixed docs.

* Sorted imports for transducer_emformer/streaming_feature_extractor.py

* Minor fix for transducer_emformer/streaming_feature_extractor.py

Co-authored-by: pkufool <wkang@pku.org.cn>
Co-authored-by: Daniel Povey <dpovey@gmail.com>
Co-authored-by: Mingshuang Luo <37799481+luomingshuang@users.noreply.github.com>
Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>
Co-authored-by: Guo Liyong <guonwpu@qq.com>
Co-authored-by: Wang, Guanbo <wgb14@outlook.com>
yaozengwei added a commit that referenced this pull request May 15, 2022
* Remove ReLU in attention

* Adding diagnostics code...

* Refactor/simplify ConformerEncoder

* First version of rand-combine iterated-training-like idea.

* Improvements to diagnostics (RE those with 1 dim

* Add pelu to this good-performing setup..

* Small bug fixes/imports

* Add baseline for the PeLU expt, keeping only the small normalization-related changes.

* pelu_base->expscale, add 2xExpScale in subsampling, and in feedforward units.

* Double learning rate of exp-scale units

* Combine ExpScale and swish for memory reduction

* Add import

* Fix backprop bug

* Fix bug in diagnostics

* Increase scale on Scale from 4 to 20

* Increase scale from 20 to 50.

* Fix duplicate Swish; replace norm+swish with swish+exp-scale in convolution module

* Reduce scale from 50 to 20

* Add deriv-balancing code

* Double the threshold in brelu; slightly increase max_factor.

* Fix exp dir

* Convert swish nonlinearities to ReLU

* Replace relu with swish-squared.

* Restore ConvolutionModule to state before changes; change all Swish,Swish(Swish) to SwishOffset.

* Replace norm on input layer with scale of 0.1.

* Extensions to diagnostics code

* Update diagnostics

* Add BasicNorm module

* Replace most normalizations with scales (still have norm in conv)

* Change exp dir

* Replace norm in ConvolutionModule with a scaling factor.

* use nonzero threshold in DerivBalancer

* Add min-abs-value 0.2

* Fix dirname

* Change min-abs threshold from 0.2 to 0.5

* Scale up pos_bias_u and pos_bias_v before use.

* Reduce max_factor to 0.01

* Fix q*scaling logic

* Change max_factor in DerivBalancer from 0.025 to 0.01; fix scaling code.

* init 1st conv module to smaller variance

* Change how scales are applied; fix residual bug

* Reduce min_abs from 0.5 to 0.2

* Introduce in_scale=0.5 for SwishExpScale

* Fix scale from 0.5 to 2.0 as I really intended..

* Set scaling on SwishExpScale

* Add identity pre_norm_final for diagnostics.

* Add learnable post-scale for mha

* Fix self.post-scale-mha

* Another rework, use scales on linear/conv

* Change dir name

* Reduce initial scaling of modules

* Bug-fix RE bias

* Cosmetic change

* Reduce initial_scale.

* Replace ExpScaleRelu with DoubleSwish()

* DoubleSwish fix

* Use learnable scales for joiner and decoder

* Add max-abs-value constraint in DerivBalancer

* Add max-abs-value

* Change dir name

* Remove ExpScale in feedforward layes.

* Reduce max-abs limit from 1000 to 100; introduce 2 DerivBalancer modules in conv layer.

* Make DoubleSwish more memory efficient

* Reduce constraints from deriv-balancer in ConvModule.

* Add warmup mode

* Remove max-positive constraint in deriv-balancing; add second DerivBalancer in conv module.

* Add some extra info to diagnostics

* Add deriv-balancer at output of embedding.

* Add more stats.

* Make epsilon in BasicNorm learnable, optionally.

* Draft of 0mean changes..

* Rework of initialization

* Fix typo

* Remove dead code

* Modifying initialization from normal->uniform; add initial_scale when initializing

* bug fix re sqrt

* Remove xscale from pos_embedding

* Remove some dead code.

* Cosmetic changes/renaming things

* Start adding some files..

* Add more files..

* update decode.py file type

* Add remaining files in pruned_transducer_stateless2

* Fix diagnostics-getting code

* Scale down pruned loss in warmup mode

* Reduce warmup scale on pruned loss form 0.1 to 0.01.

* Remove scale_speed, make swish deriv more efficient.

* Cosmetic changes to swish

* Double warm_step

* Fix bug with import

* Change initial std from 0.05 to 0.025.

* Set also scale for embedding to 0.025.

* Remove logging code that broke with newer Lhotse; fix bug with pruned_loss

* Add norm+balancer to VggSubsampling

* Incorporate changes from master into pruned_transducer_stateless2.

* Add max-abs=6, debugged version

* Change 0.025,0.05 to 0.01 in initializations

* Fix balancer code

* Whitespace fix

* Reduce initial pruned_loss scale from 0.01 to 0.0

* Increase warm_step (and valid_interval)

* Change max-abs from 6 to 10

* Change how warmup works.

* Add changes from master to decode.py, train.py

* Simplify the warmup code; max_abs 10->6

* Make warmup work by scaling layer contributions; leave residual layer-drop

* Fix bug

* Fix test mode with random layer dropout

* Add random-number-setting function in dataloader

* Fix/patch how fix_random_seed() is imported.

* Reduce layer-drop prob

* Reduce layer-drop prob after warmup to 1 in 100

* Change power of lr-schedule from -0.5 to -0.333

* Increase model_warm_step to 4k

* Change max-keep-prob to 0.95

* Refactoring and simplifying conformer and frontend

* Rework conformer, remove some code.

* Reduce 1st conv channels from 64 to 32

* Add another convolutional layer

* Fix padding bug

* Remove dropout in output layer

* Reduce speed of some components

* Initial refactoring to remove unnecessary vocab_size

* Fix RE identity

* Bug-fix

* Add final dropout to conformer

* Remove some un-used code

* Replace nn.Linear with ScaledLinear in simple joiner

* Make 2 projections..

* Reduce initial_speed

* Use initial_speed=0.5

* Reduce initial_speed further from 0.5 to 0.25

* Reduce initial_speed from 0.5 to 0.25

* Change how warmup is applied.

* Bug fix to warmup_scale

* Fix test-mode

* Remove final dropout

* Make layer dropout rate 0.075, was 0.1.

* First draft of model rework

* Various bug fixes

* Change learning speed of simple_lm_proj

* Revert transducer_stateless/ to state in upstream/master

* Fix to joiner to allow different dims

* Some cleanups

* Make training more efficient, avoid redoing some projections.

* Change how warm-step is set

* First draft of new approach to learning rates + init

* Some fixes..

* Change initialization to 0.25

* Fix type of parameter

* Fix weight decay formula by adding 1/1-beta

* Fix weight decay formula by adding 1/1-beta

* Fix checkpoint-writing

* Fix to reading scheudler from optim

* Simplified optimizer, rework somet things..

* Reduce model_warm_step from 4k to 3k

* Fix bug in lambda

* Bug-fix RE sign of target_rms

* Changing initial_speed from 0.25 to 01

* Change some defaults in LR-setting rule.

* Remove initial_speed

* Set new scheduler

* Change exponential part of lrate to be epoch based

* Fix bug

* Set 2n rule..

* Implement 2o schedule

* Make lrate rule more symmetric

* Implement 2p version of learning rate schedule.

* Refactor how learning rate is set.

* Fix import

* Modify init (#301)

* update icefall/__init__.py to import more common functions.

* update icefall/__init__.py

* make imports style consistent.

* exclude black check for icefall/__init__.py in pyproject.toml.

* Minor fixes for logging (#296)

* Minor fixes for logging

* Minor fix

* Fix dir names

* Modify beam search to be efficient with current joienr

* Fix adding learning rate to tensorboard

* Fix docs in optim.py

* Support mix precision training on the reworked model (#305)

* Add mix precision support

* Minor fixes

* Minor fixes

* Minor fixes

* Tedlium3 pruned transducer stateless (#261)

* update tedlium3-pruned-transducer-stateless-codes

* update README.md

* update README.md

* add fast beam search for decoding

* do a change for RESULTS.md

* do a change for RESULTS.md

* do a fix

* do some changes for pruned RNN-T

* Add mix precision support

* Minor fixes

* Minor fixes

* Updating RESULTS.md; fix in beam_search.py

* Fix rebase

* Code style check for librispeech pruned transducer stateless2 (#308)

* Update results for tedlium3 pruned RNN-T (#307)

* Update README.md

* Fix CI errors. (#310)

* Add more results

* Fix tensorboard log location

* Add one more epoch of full expt

* fix comments

* Add results for mixed precision with max-duration 300

* Changes for pretrained.py (tedlium3 pruned RNN-T) (#311)

* GigaSpeech recipe (#120)

* initial commit

* support download, data prep, and fbank

* on-the-fly feature extraction by default

* support BPE based lang

* support HLG for BPE

* small fix

* small fix

* chunked feature extraction by default

* Compute features for GigaSpeech by splitting the manifest.

* Fixes after review.

* Split manifests into 2000 pieces.

* set audio duration mismatch tolerance to 0.01

* small fix

* add conformer training recipe

* Add conformer.py without pre-commit checking

* lazy loading and use SingleCutSampler

* DynamicBucketingSampler

* use KaldifeatFbank to compute fbank for musan

* use pretrained language model and lexicon

* use 3gram to decode, 4gram to rescore

* Add decode.py

* Update .flake8

* Delete compute_fbank_gigaspeech.py

* Use BucketingSampler for valid and test dataloader

* Update params in train.py

* Use bpe_500

* update params in decode.py

* Decrease num_paths while CUDA OOM

* Added README

* Update RESULTS

* black

* Decrease num_paths while CUDA OOM

* Decode with post-processing

* Update results

* Remove lazy_load option

* Use default `storage_type`

* Keep the original tolerance

* Use split-lazy

* black

* Update pretrained model

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Add LG decoding (#277)

* Add LG decoding

* Add log weight pushing

* Minor fixes

* Support computing RNN-T loss with torchaudio (#316)

* Update results for torchaudio RNN-T. (#322)

* Fix some typos. (#329)

* fix fp16 option in example usage (#332)

* Support averaging models with weight tying. (#333)

* Support specifying iteration number of checkpoints for decoding. (#336)

See also #289

* Modified conformer with multi datasets (#312)

* Copy files for editing.

* Use librispeech + gigaspeech with modified conformer.

* Support specifying number of workers for on-the-fly feature extraction.

* Feature extraction code for GigaSpeech.

* Combine XL splits lazily during training.

* Fix warnings in decoding.

* Add decoding code for GigaSpeech.

* Fix decoding the gigaspeech dataset.

We have to use the decoder/joiner networks for the GigaSpeech dataset.

* Disable speed perturbe for XL subset.

* Compute the Nbest oracle WER for RNN-T decoding.

* Minor fixes.

* Minor fixes.

* Add results.

* Update results.

* Update CI.

* Update results.

* Fix style issues.

* Update results.

* Fix style issues.

* Update results. (#340)

* Update results.

* Typo fixes.

* Validate generated manifest files. (#338)

* Validate generated manifest files. (#338)

* Save batch to disk on OOM. (#343)

* Save batch to disk on OOM.

* minor fixes

* Fixes after review.

* Fix style issues.

* Fix decoding for gigaspeech in the libri + giga setup. (#345)

* Model average (#344)

* First upload of model average codes.

* minor fix

* update decode file

* update .flake8

* rename pruned_transducer_stateless3 to pruned_transducer_stateless4

* change epoch number counter starting from 1 instead of 0

* minor fix of pruned_transducer_stateless4/train.py

* refactor the checkpoint.py

* minor fix, update docs, and modify the epoch number to count from 1 in the pruned_transducer_stateless4/decode.py

* update author info

* add docs of the scaling in function average_checkpoints_with_averaged_model

* Save batch to disk on exception. (#350)

* Bug fix (#352)

* Keep model_avg on cpu (#348)

* keep model_avg on cpu

* explicitly convert model_avg to cpu

* minor fix

* remove device convertion for model_avg

* modify usage of the model device in train.py

* change model.device to next(model.parameters()).device for decoding

* assert params.start_epoch>0

* assert params.start_epoch>0, params.start_epoch

* Do some changes for aishell/ASR/transducer stateless/export.py (#347)

* do some changes for aishell/ASR/transducer_stateless/export.py

* Support decoding with averaged model when using --iter (#353)

* support decoding with averaged model when using --iter

* minor fix

* monir fix of copyright date

* Stringify torch.__version__ before serializing it. (#354)

* Run decode.py in GitHub actions. (#356)

* Ignore padding frames during RNN-T decoding. (#358)

* Ignore padding frames during RNN-T decoding.

* Fix outdated decoding code.

* Minor fixes.

* Support --iter in export.py (#360)

* GigaSpeech RNN-T experiments (#318)

* Copy RNN-T recipe from librispeech

* flake8

* flake8

* Update params

* gigaspeech decode

* black

* Update results

* syntax highlight

* Update RESULTS.md

* typo

* Update decoding script for gigaspeech and remove duplicate files. (#361)

* Validate that there are no OOV tokens in BPE-based lexicons. (#359)

* Validate that there are no OOV tokens in BPE-based lexicons.

* Typo fixes.

* Decode gigaspeech in GitHub actions (#362)

* Add CI for gigaspeech.

* Update results for libri+giga multi dataset setup. (#363)

* Update results for libri+giga multi dataset setup.

* Update GigaSpeech reults (#364)

* Update decode.py

* Update export.py

* Update results

* Update README.md

* Fix GitHub CI for decoding GigaSpeech dev/test datasets (#366)

* modify .flake8

* minor fix

* minor fix

Co-authored-by: Daniel Povey <dpovey@gmail.com>
Co-authored-by: Wei Kang <wkang@pku.org.cn>
Co-authored-by: Mingshuang Luo <37799481+luomingshuang@users.noreply.github.com>
Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>
Co-authored-by: Guo Liyong <guonwpu@qq.com>
Co-authored-by: Wang, Guanbo <wgb14@outlook.com>
Co-authored-by: whsqkaak <whsqkaak@naver.com>
Co-authored-by: pehonnet <pe.honnet@gmail.com>
@wgb14 wgb14 deleted the gigaspeech_recipe branch May 17, 2022 05:19