
URL fixes and URL check fixes #2692

Merged: 28 commits, Sep 24, 2024
24d791e
Fix URLs to `unstable-v0.6`, use relative link for `recipes/`
asumagic Sep 19, 2024
ef6f3ba
Relative link in readme for PERFORMANCE.md
asumagic Sep 19, 2024
6886524
URL to Spectral Clustering in README dead, point to web archive copy
asumagic Sep 19, 2024
10c3118
Replace dead URL to `wham_noise.zip`
asumagic Sep 19, 2024
bc70db6
check_url.yaml rework, regex support, parallel, expanded scope
asumagic Sep 19, 2024
ebbf402
Tutorial URL fix
asumagic Sep 19, 2024
d032298
Fix links to inference code that was changed in 1.0
asumagic Sep 19, 2024
a4715d5
Web archive BPE_Gage.pdf
asumagic Sep 19, 2024
19dc35f
Remove broken link to PyTorch doc in tutorial
asumagic Sep 19, 2024
60d7757
Fix link to doc in quaternion tutorial
asumagic Sep 19, 2024
5e1909f
Fix link to papers in tutorial
asumagic Sep 19, 2024
7790cf9
Fix Colab/GitHub URL for asr-metrics.ipynb
asumagic Sep 19, 2024
13ac765
Fix more tutorial dead links to the web archive
asumagic Sep 19, 2024
9ae77d6
Update ESC-50 dataset link and ignore dead URL false positive
asumagic Sep 19, 2024
5b87f65
Ignore dead URL false positive in DNS
asumagic Sep 19, 2024
95a5042
Be more verbose about URL check errors
asumagic Sep 19, 2024
294d085
Ignore URL check true positive for urbansounddataset
asumagic Sep 19, 2024
5188b4a
Fix format string typo
asumagic Sep 19, 2024
8054439
Fix formatting
asumagic Sep 19, 2024
db9a5c7
Add the web archive to ignored URLs for URL checks
asumagic Sep 19, 2024
61ff779
Add arXiv to ignored URLs for URL checks
asumagic Sep 19, 2024
0a03ede
Disable TLS verification in URL checks
asumagic Sep 19, 2024
09cb6f4
Add kaggle to URL exclusion regex
asumagic Sep 19, 2024
247fe7f
VoxLingua107 pre-compiled shards are dead, add warning + ignore check
asumagic Sep 19, 2024
c63f089
Undo broken ')' handling for URL check, just ignore one URL for now
asumagic Sep 19, 2024
d9e6637
Fix URL and typo in speech-classification-from-scratch
asumagic Sep 19, 2024
4ad3006
Formatting
asumagic Sep 19, 2024
a7f06eb
Ignore TLS verify=False warning in URL check
asumagic Sep 19, 2024
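(Aside: the `check_url.yaml` rework these commits describe — regex-based exclusions, parallel fetching, relaxed TLS verification — can be sketched roughly as below. This is a sketch, not the actual checker; the pattern list and function names are made up for illustration.)

```python
import concurrent.futures
import re
import ssl
import urllib.request

# Hosts that rate-limit or block bots, so dead-link results are unreliable
# (arXiv, the Web Archive, kaggle). Example patterns, not the real list.
IGNORED_URL_PATTERNS = [
    r"https?://arxiv\.org/",
    r"https?://web\.archive\.org/",
    r"https?://(www\.)?kaggle\.com/",
]

def should_check(url):
    """True unless the URL matches one of the exclusion regexes."""
    return not any(re.match(pat, url) for pat in IGNORED_URL_PATTERNS)

def check_url(url, timeout=10):
    """Probe one URL; TLS verification relaxed, as one commit describes."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return url, resp.status < 400
    except Exception:
        return url, False

def check_all(urls, workers=16):
    """Probe the non-excluded URLs in parallel with a thread pool."""
    todo = [u for u in urls if should_check(u)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(check_url, todo))

print(should_check("https://arxiv.org/abs/1804.10959"))  # False (excluded)
print(should_check("https://speechbrain.github.io/"))    # True
```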
48 changes: 24 additions & 24 deletions README.md

Large diffs are not rendered by default.

@@ -27,7 +27,7 @@
"\n",
"Do you have a large dataset stored in a shared filesystem, and you want to use it for training a neural network? Is this dataset so large that it doesn't even fit into the local SSD of your computation nodes? If so, this tutorial will walk you through all the needed steps to manage reading large files from a shared filesystem.\n",
"\n",
-"In many compute clusters, the main data storage is a network filesystem (NFS), for example [Lustre](https://en.wikipedia.org/wiki/Lustre_(file_system)). The NFS can serve many users concurrently and provide high data throughput from a single file. However, opening or listing many different files is slow - and doing so may slow the whole system down for everyone, not just the offending user. Speech datasets usually consist of very many small recordings. Reading every file again and again is exactly the kind of data IO that can slow down an NFS.\n",
+"In many compute clusters, the main data storage is a network filesystem (NFS), for example [Lustre](https://en.wikipedia.org/wiki/Lustre_(file_system)). <!-- ignore-url-check --> The NFS can serve many users concurrently and provide high data throughput from a single file. However, opening or listing many different files is slow - and doing so may slow the whole system down for everyone, not just the offending user. Speech datasets usually consist of very many small recordings. Reading every file again and again is exactly the kind of data IO that can slow down an NFS.\n",
"\n",
"One solution is to copy the dataset into the **local SSD** of the computing node. This can be done relatively efficiently by compressing the dataset into a single file (e.g. `dataset.tar.gz`), copying it into the local node, and finally, uncompressing (untarring) the file. Reading files from the local SSD is very efficient and does not harm the performance of the shared filesystem.\n",
"The standard SpeechBrain data IO works well in this case, see [this tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html).\n",
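(Aside: the compress-copy-untar workflow this tutorial cell describes can be sketched with the standard library alone. The file names and temp directories below are made up for illustration; real datasets would live on the NFS and the node-local SSD.)

```python
import tarfile
import tempfile
from pathlib import Path

# Hypothetical layout: many small "recordings" on the shared filesystem.
shared_fs = Path(tempfile.mkdtemp(prefix="shared_"))
for i in range(5):
    (shared_fs / f"utt{i}.txt").write_text(f"fake audio {i}")

# 1) Compress the dataset into a single archive: one big file is the
#    access pattern an NFS handles well.
archive = shared_fs.parent / "dataset.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(shared_fs, arcname="dataset")

# 2) Copy and untar onto the node-local SSD (here: another temp dir),
#    then read the many small files from fast local storage.
local_ssd = Path(tempfile.mkdtemp(prefix="local_"))
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(local_ssd)

extracted = sorted(p.name for p in (local_ssd / "dataset").iterdir())
print(extracted)  # the five utt*.txt files, now on local storage
```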
@@ -140,7 +140,7 @@
"\n",
"### 2. Using the `EndoderDecoderASR` interface\n",
"\n",
-"The [EncoderDecoderASR class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py#L353). interface allows you to decouple your trained model from the training recipe and to infer (or encode) on any new audio file in few lines of code. If you are not interested in ASR, you'll find many other interfaces to fit your purpose in the `interfaces.py` file. This solution must be preferred if you intend to deploy your model in a production fashion i.e. if you plan to use your model a lot and in a stable way. Of course, this will require you to slightly rework the yaml.\n",
+"The [EncoderDecoderASR class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/ASR.py). interface allows you to decouple your trained model from the training recipe and to infer (or encode) on any new audio file in few lines of code. If you are not interested in ASR, you'll find many other interfaces to fit your purpose in the `interfaces.py` file. This solution must be preferred if you intend to deploy your model in a production fashion i.e. if you plan to use your model a lot and in a stable way. Of course, this will require you to slightly rework the yaml.\n",
"\n",
"The class has the following methods:\n",
"\n",
@@ -441,7 +441,7 @@
"\n",
"While the `EncoderDecoderASR` class has been designed to be as generic as possible, your might require a more complex inference scheme that better fits your needs. In this case, you have to develop your own interface. To do so, follow these steps:\n",
"\n",
-"1. Create your custom interface inheriting from `Pretrained` (code [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)):\n",
+"1. Create your custom interface inheriting from `Pretrained` (code [in this file](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/interfaces.py)):\n",
"\n",
"\n",
"```python\n",
@@ -499,11 +499,11 @@
"\n",
"As you can see, this formalism is extremely flexible and enables you to create a holistic interface that can be used to do anything you want with your pretrained model.\n",
"\n",
-"We provide different generic interfaces for E2E ASR, speaker recognition, source separation, speech enhancement, etc. Please have a look [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py) if interested!\n",
+"We provide different generic interfaces for E2E ASR, speaker recognition, source separation, speech enhancement, etc. Please have a look [here](https://github.com/speechbrain/speechbrain/tree/develop/speechbrain/inference) if interested!\n",
"\n",
"\n",
"## General Pretraining Inference\n",
-"In some cases, users might want to develop their inference interface in an external file. This can be done using the [foreign class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py#L28).\n",
+"In some cases, users might want to develop their inference interface in an external file. This can be done using the [foreign class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/interfaces.py).\n",
"You can take a look at the example reported [here](https://huggingface.co/speechbrain/emotion-recognition-wav2vec2-IEMOCAP):\n",
"\n",
"\n",
2 changes: 1 addition & 1 deletion docs/tutorials/advanced/text-tokenizer.ipynb
@@ -41,7 +41,7 @@
"\n",
"\n",
"SpeechBrain currently relies on a custom integration of the [*SentencePiece tokenizer*](https://github.com/google/sentencepiece) which treats the input as a raw input stream. The following tokenizer algorithms are supported:\n",
-"1. [BPE](https://www.derczynski.com/papers/archive/BPE_Gage.pdf).\n",
+"1. [BPE](https://web.archive.org/web/20230319172720/https://www.derczynski.com/papers/archive/BPE_Gage.pdf).\n",
"2. [Unigram](https://arxiv.org/pdf/1804.10959.pdf) (Subword Regularization).\n",
"\n",
"\n",
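(Aside: the core BPE step from Gage's paper that this cell links to — repeatedly replace the most frequent adjacent symbol pair with a new symbol — fits in a few lines. This is a toy sketch of the algorithm, not SentencePiece's implementation; for simplicity, pair counting here includes overlaps.)

```python
from collections import Counter

def most_frequent_pair(seq):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair, new_symbol):
    """Replace non-overlapping occurrences of `pair` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if tuple(seq[i:i + 2]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

seq = list("aaabdaaabac")        # the classic example from Gage's paper
best = most_frequent_pair(seq)   # ('a', 'a') is the most frequent pair
seq = merge_pair(seq, best, "Z")
print("".join(seq))              # "ZabdZabac"
```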
2 changes: 1 addition & 1 deletion docs/tutorials/basics/data-loading-pipeline.ipynb
@@ -118,7 +118,7 @@
},
"source": [
"### Dataset\n",
-"The role of the Dataset is to produce single data points. Typically they are loaded off the disk, but they could also come from some more complex source or in some cases just from RAM. You can write your own Dataset subclass or sometimes you can use a standardized class, such as [this](https://pytorch.org/docs/stable/torchvision/datasets.html#datasetfolder). The training, validation, and test subsets get their own Dataset instances.\n",
+"The role of the Dataset is to produce single data points. Typically they are loaded off the disk, but they could also come from some more complex source or in some cases just from RAM. You can write your own Dataset subclass or sometimes you can use a standardized class. The training, validation, and test subsets get their own Dataset instances.\n",
"\n",
"The Dataset interface is simple; it implements\n",
"`__getitem__` and usually also `__len__`. Usually, \"map-style\" Datasets are used, but it's worth noting that PyTorch also has a notion of [IterableDataset](https://pytorch.org/docs/stable/data.html#iterable-style-datasets)s.\n",
@@ -560,7 +560,7 @@
"1. Compose a real-valued matrix from the different weight components\n",
"2. Apply a matrix product between the input and this rotation matrix!\n",
"\n",
-"[Check the code!](http://www.darnault-parcollet.fr/Parcollet/hiddennoshare/speechbrain.github.io/documentation/speechbrain.nnet.quaternion_networks.q_ops.html#speechbrain.nnet.quaternion_networks.q_ops.quaternion_linear_rotation_op)\n",
+"[Check the code!](https://speechbrain.readthedocs.io/en/latest/API/speechbrain.nnet.quaternion_networks.q_ops.html#speechbrain.nnet.quaternion_networks.q_ops.quaternion_linear_rotation_op)\n",
"\n",
"### Turning a quaternion layer into a spinor layer\n",
"\n",
2 changes: 1 addition & 1 deletion docs/tutorials/preprocessing/speech-features.ipynb
@@ -392,7 +392,7 @@
},
"source": [
"## References\n",
-"[1] P. Mermelstein (1976), \"Distance measures for speech recognition, psychological and instrumental,\" in Pattern Recognition and Artificial Intelligence. [ArXiv](http://www.haskins.yale.edu/sr/SR047/SR047_07.pdf)\n",
+"[1] P. Mermelstein (1976), \"Distance measures for speech recognition, psychological and instrumental,\" in Pattern Recognition and Artificial Intelligence. [pdf (Web Archive)](https://web.archive.org/web/20200714014004/http://www.haskins.yale.edu/sr/SR047/SR047_07.pdf)\n",
"\n",
"[2] X. Huang, A. Acero (Author), H.-W. Hon, \"Spoken Language Processing: A Guide to Theory, Algorithm and System Development Paperback – 2001\n",
"\n",
4 changes: 2 additions & 2 deletions docs/tutorials/tasks/asr-metrics.ipynb
@@ -12,9 +12,9 @@
"<!-- This cell is automatically updated by tools/tutorial-cell-updater.py -->\n",
"<!-- The contents are initialized from tutorials/notebook-header.md -->\n",
"\n",
-"[<img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>](https://colab.research.google.com/github/speechbrain/speechbrain/blob/develop/docs/tutorials/tasks/pr2451-new-metrics.ipynb)\n",
+"[<img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>](https://colab.research.google.com/github/speechbrain/speechbrain/blob/develop/docs/tutorials/tasks/asr-metrics.ipynb)\n",
"to execute or view/download this notebook on\n",
-"[GitHub](https://github.com/speechbrain/speechbrain/tree/develop/docs/tutorials/tasks/pr2451-new-metrics.ipynb)"
+"[GitHub](https://github.com/speechbrain/speechbrain/tree/develop/docs/tutorials/tasks/asr-metrics.ipynb)"
]
},
{
4 changes: 2 additions & 2 deletions docs/tutorials/tasks/speech-classification-from-scratch.ipynb
@@ -871,7 +871,7 @@
"source": [
"## Step 3: Inference\n",
"\n",
-"At this point, we can use the trained classifier to perform **predictions on new data**. Speechbrain made available some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)) such as the `EncoderClassifier` one that can make inference easier. The class can also be used to extract some embeddings at the output of the encoder.\n",
+"At this point, we can use the trained classifier to perform **predictions on new data**. Speechbrain made available some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/classifiers.py)) such as the `EncoderClassifier` one that can make inference easier. The class can also be used to extract some embeddings at the output of the encoder.\n",
"\n",
"Let's see first how can we used it to load our best xvector model (trained on Voxceleb and stored on HuggingFace) to compute some embeddings and perform a speaker classification:\n"
]
@@ -1256,7 +1256,7 @@
"\n",
"### Use the EncoderClassifier interface on your model\n",
"\n",
-"The [EncoderClassidier class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py#L591) takes a pre-trained model and performs inference on it with the following methods:\n",
+"The [EncoderClassifier class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/classifiers.py) takes a pre-trained model and performs inference on it with the following methods:\n",
"\n",
"- **encode_batch**: applies the encoder to an input batch and returns some encoded embeddings.\n",
"- **classify_batch**: performs a full classification step and returns the output probabilities of the classifier, the best score, the index of the best class, and its label in text format (see example above).\n",
6 changes: 3 additions & 3 deletions docs/tutorials/tasks/speech-recognition-from-scratch.ipynb
@@ -158,7 +158,7 @@
"source": [
"We encourage the readers not familiar enough with speech recognition to gain more familiarity with this technology before moving on. Beyond scientific papers, online you can find amazing tutorials and blog posts, such as:\n",
"- [An Intuitive Explanation of Connectionist Temporal Classification](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)\n",
-"- [Connectionist Temporal Classification](https://machinelearning-blog.com/2018/09/05/753/)\n",
+"- [Connectionist Temporal Classification](https://web.archive.org/web/20211017041333/https://machinelearning-blog.com/2018/09/05/753/)\n",
"- [Sequence-to-sequence learning with Transducers](https://lorenlugosch.github.io/posts/2020/11/transducer/)\n",
"- [Understanding Encoder-Decoder Sequence to Sequence Model](https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346)\n",
"- [What is a Transformer?](https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04)\n",
@@ -1845,7 +1845,7 @@
"source": [
"## Step 5: Inference\n",
"\n",
-"At this point, we can use the trained speech recognizer. For this type of ASR model, speechbrain made available some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)) such as the `EncoderDecoderASR` one that can make inference easier. For instance, we can transcribe an audio file with a pre-trained model hosted in our [HuggingFace repository](https://huggingface.co/speechbrain) in solely 4 lines of code:\n"
+"At this point, we can use the trained speech recognizer. For this type of ASR model, speechbrain made available some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/ASR.py)) such as the `EncoderDecoderASR` one that can make inference easier. For instance, we can transcribe an audio file with a pre-trained model hosted in our [HuggingFace repository](https://huggingface.co/speechbrain) in solely 4 lines of code:\n"
]
},
{
@@ -2174,7 +2174,7 @@
"\n",
"While the `EncoderDecoderASR` class has been designed to be as generic as possible, your might require a more complex inference scheme that better fits your needs. In this case, you have to develop your own interface. To do so, follow these steps:\n",
"\n",
-"1. Create your custom interface inheriting from `Pretrained` (code [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)):\n",
+"1. Create your custom interface inheriting from `Pretrained` (code [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/interfaces.py)):\n",
"\n",
"\n",
"```python\n",
2 changes: 1 addition & 1 deletion recipes/Aishell1Mix/prepare_data.py
@@ -74,7 +74,7 @@ def prepare_aishell1mix(
if not os.path.exists(wham_dir):
print("Download Wham noise dataset into %s" % datapath)
urlretrieve(
-            "https://storage.googleapis.com/whisper-public/wham_noise.zip",
+            "https://my-bucket-a8b4b49c25c811ee9a7e8bba05fa24c7.s3.amazonaws.com/wham_noise.zip",
os.path.join(datapath, "wham_noise.zip"),
reporthook=reporthook,
)
4 changes: 1 addition & 3 deletions recipes/DNS/dns_download.py
@@ -164,9 +164,7 @@
"datasets_fullband.dev_testset_000.tar.bz2",
]

-AZURE_URL = (
-    "https://dns4public.blob.core.windows.net/dns4archive/datasets_fullband"
-)
+AZURE_URL = "https://dns4public.blob.core.windows.net/dns4archive/datasets_fullband"  # noqa ignore-url-check

# Impulse response and Blind testset
OTHER_URLS = {
6 changes: 3 additions & 3 deletions recipes/ESC50/esc50_prepare.py
@@ -2,7 +2,7 @@
Creates data manifest files for ESC50
If the data does not exist in the specified --data_folder, we download the data automatically.

-https://github.com/karoldvl/ESC-50/
+https://github.com/karolpiczak/ESC-50/

Authors:
* Cem Subakan 2022, 2023
@@ -25,7 +25,7 @@

logger = get_logger(__name__)

-ESC50_DOWNLOAD_URL = "https://github.com/karoldvl/ESC-50/archive/master.zip"
+ESC50_DOWNLOAD_URL = "https://github.com/karolpiczak/ESC-50/archive/master.zip"
MODIFIED_METADATA_FILE_NAME = "esc50_speechbrain.csv"

ACCEPTABLE_FOLD_NUMS = [1, 2, 3, 4, 5]
@@ -49,7 +49,7 @@ def download_esc50(data_path):
# download the data
archive_path = fetch(
"master.zip",
-        "https://github.com/karoldvl/ESC-50/archive/",
+        "https://github.com/karolpiczak/ESC-50/archive/",  # noqa ignore-url-check
savedir=temp_path,
# URL, so will be fetched directly in the savedir anyway
local_strategy=LocalStrategy.COPY_SKIP_CACHE,
@@ -7,7 +7,7 @@ Created By
Justin Salamon*^, Christopher Jacoby* and Juan Pablo Bello*
* Music and Audio Research Lab (MARL), New York University, USA
^ Center for Urban Science and Progress (CUSP), New York University, USA
-http://serv.cusp.nyu.edu/projects/urbansounddataset
+http://serv.cusp.nyu.edu/projects/urbansounddataset (dead link? ignore-url-check)
http://cusp.nyu.edu/

Version 1.0
8 changes: 7 additions & 1 deletion recipes/VoxLingua107/lang_id/README.md
@@ -52,14 +52,20 @@ python create_wds_shards.py /data/voxlingua107/dev/ /data/voxlingua107_shards/de

### 2nd option: download the pre-compiled WebDataset shards

+> [!IMPORTANT]
+> As of 2024-09-19, according to the
+> [official website](https://bark.phon.ioc.ee/voxlingua107/), the pre-compiled
+> WebDataset shards are currently unavailable. As a result, this method is
+> likely broken. If you get a 503 error, it is because of that.

Download the shards:

```
# Select a place with around 1 TB of free space
cd /data/
mkdir voxlingua107_shards
cd voxlingua107_shards
-wget -r -nH --cut-dirs=4 --no-parent --reject="index.html*" http://bark.phon.ioc.ee/lw/korpused/voxlingua107/shards/
+wget -r -nH --cut-dirs=4 --no-parent --reject="index.html*" http://bark.phon.ioc.ee/lw/korpused/voxlingua107/shards/ # ignore-url-check
```

## Installing Extra Dependencies
2 changes: 1 addition & 1 deletion templates/speaker_id/README.md
@@ -28,4 +28,4 @@ Please reach out to the SpeechBrain
team if any errors are found or clarification is needed about how
parts of the template work. Good Luck!

-[For more information, please take a look into the "speaker-id from scratch" tutorial](https://speechbrain.readthedocs.io/en/latest/en/2685/tutorials/tasks/speech-classification-from-scratch.html)
+[For more information, please take a look into the "speaker-id from scratch" tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/tasks/speech-classification-from-scratch.html)