Hardcoded num_mels to 80? #166

Open
bzp83 opened this issue Jun 16, 2024 · 5 comments

Comments

@bzp83

bzp83 commented Jun 16, 2024

self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))

Hi, why is 80 hardcoded here? Should it match num_mels?

Thanks

@harsh40c

Hey, I tried this repo's code and ran into the same error. I used librosa instead of tacotron2 to generate the mel spectrograms, and mine have shape (128×387). Since 80 is hardcoded as shown above, changing it only there doesn't fix the error, because many other places would need to change as well. So I set n_mels to 80 when generating the mel spectrograms with librosa. That solves this error, but now I'm getting a cuDNN error, because the CUDA and cuDNN versions the repo uses are incompatible with my GPU (RTX 3090). With a newer PyTorch built for CUDA 11.1 and the matching cuDNN version, I get an error along the lines of "no kernel available"; with the old version I get CUDNN_EXECUTION_FAILED. If you have a solution for this, please tell me. As for your question: as I said, set n_mels to 80 when generating the spectrograms and the issue goes away.
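
To illustrate why a (128×387) spectrogram fails against the hardcoded 80: a 1-D convolution only accepts inputs whose channel count equals its in_channels. The pure-Python sketch below mimics that check together with the standard output-length formula. conv1d_output_shape is a hypothetical helper, not part of the repo or of PyTorch, and the 512 output channels are an assumption taken from the upsample_initial_channel value in the config shared later in this thread.

```python
def conv1d_output_shape(input_shape, in_channels, out_channels,
                        kernel_size, stride=1, padding=0):
    """Return (batch, out_channels, out_frames) for a Conv1d-style layer.

    Hypothetical helper for illustration only. It raises ValueError when
    the input channel count does not match in_channels, which is the
    situation described in this issue.
    """
    batch, channels, frames = input_shape
    if channels != in_channels:
        raise ValueError(
            f"expected {in_channels} input channels, got {channels}"
        )
    # Standard 1-D convolution length formula with dilation = 1.
    out_frames = (frames + 2 * padding - (kernel_size - 1) - 1) // stride + 1
    return (batch, out_channels, out_frames)


# conv_pre as written in the repo: 80 input channels, kernel 7, padding 3.
print(conv1d_output_shape((1, 80, 387), 80, 512, 7, padding=3))

# A 128-band mel spectrogram against the same layer is rejected.
try:
    conv1d_output_shape((1, 128, 387), 80, 512, 7, padding=3)
except ValueError as err:
    print("mismatch:", err)

# Sizing the layer from the number of mel bands accepts the same input.
print(conv1d_output_shape((1, 128, 387), 128, 512, 7, padding=3))
```

Under these assumptions, either the spectrogram is generated with 80 bands or the layer's input width follows the number of bands; mixing the two is what triggers the error.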

@bzp83
Author

bzp83 commented Jun 18, 2024

Yes... and to confuse me even more, VITS changes the HiFi-GAN code slightly and uses "initial_channel" (https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/models.py#L249) instead of a hardcoded 80... I'm having a hard time figuring it out.

Anyway, yes I solved the problem and it works great on my rtx4090:

1 - Update your requirements.txt to the list below. With no version pins, this installs the latest version of each package:

numpy
librosa
scipy
tensorboard
soundfile
matplotlib

2 - Install the latest PyTorch, e.g. for 2.3.1 with CUDA 12.1:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

3 - update your mel_spectrogram method in meldataset.py to:

def mel_spectrogram(
    y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False
):
    # Warn if the waveform falls outside the expected [-1, 1] range.
    if torch.min(y) < -1.0:
        print("min value is ", torch.min(y))
    if torch.max(y) > 1.0:
        print("max value is ", torch.max(y))

    # mel_basis and hann_window are the module-level caches in meldataset.py,
    # keyed by fmax / window size plus dtype and device.
    global mel_basis, hann_window
    dtype_device = str(y.dtype) + "_" + str(y.device)
    fmax_dtype_device = str(fmax) + "_" + dtype_device
    wnsize_dtype_device = str(win_size) + "_" + dtype_device
    if fmax_dtype_device not in mel_basis:
        # Newer librosa versions require keyword arguments here.
        mel = librosa_mel_fn(
            sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax
        )
        mel_basis[fmax_dtype_device] = torch.from_numpy(mel).type_as(y)
    if wnsize_dtype_device not in hann_window:
        hann_window[wnsize_dtype_device] = torch.hann_window(win_size).type_as(y)

    # Pad manually, since center=False disables torch.stft's own padding.
    y = torch.nn.functional.pad(
        y.unsqueeze(1),
        (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)),
        mode="reflect",
    )
    y = y.squeeze(1)

    # Newer PyTorch needs return_complex=True; view_as_real restores the
    # trailing (real, imag) dimension that the rest of the code expects.
    spec = torch.view_as_real(
        torch.stft(
            y,
            n_fft,
            hop_length=hop_size,
            win_length=win_size,
            window=hann_window[wnsize_dtype_device],
            center=center,
            pad_mode="reflect",
            normalized=False,
            onesided=True,
            return_complex=True,
        )
    )

    # Magnitude spectrogram; the epsilon keeps the sqrt away from zero.
    spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)

    # Project onto the mel filterbank, then apply the log compression.
    spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
    spec = spectral_normalize_torch(spec)

    return spec

that should do it!
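
The reflect padding in the function above adds (n_fft - hop_size) / 2 samples on each side, which with center=False should give one STFT frame per hop_size input samples. Below is a small sketch of that arithmetic. padded_frame_count is a hypothetical helper name, and the frame-count formula is the usual one for an uncentered STFT, so treat this as an illustration rather than a statement about the repo's internals.

```python
def padded_frame_count(num_samples, n_fft, hop_size):
    """Frames produced by an uncentered STFT after the manual padding.

    Hypothetical helper: mirrors the padding used in mel_spectrogram and
    the usual frame count 1 + (length - n_fft) // hop for center=False.
    """
    pad = (n_fft - hop_size) // 2          # samples added on each side
    padded_length = num_samples + 2 * pad  # length after reflect padding
    return 1 + (padded_length - n_fft) // hop_size


# Segment and STFT sizes taken from the 44100 Hz config in this thread.
frames = padded_frame_count(16384, n_fft=2048, hop_size=512)
print(frames, 16384 // 512)
```

When the segment length is a multiple of hop_size, the frame count equals num_samples // hop_size, so mel frames and audio samples stay aligned.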

@bzp83
Author

bzp83 commented Jun 18, 2024

By the way, I managed to train a model with 128 mels at 44100 Hz using the config below. I also had to change that hardcoded 80 to 128, or equivalently write self.conv_pre = weight_norm(Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3)), so I suspect it really is num_mels... but as I said, VITS uses initial_channel, which seems to be 192 in all the configs while num_mels is 80 😵

{
    "resblock": "1",
    "num_gpus": 0,
    "batch_size": 8,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999875,
    "seed": 1234,
    "upsample_rates": [8, 8, 2, 2, 2],
    "upsample_kernel_sizes": [16, 16, 4, 4, 4],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "segment_size": 16384,
    "num_mels": 128,
    "num_freq": 1025,
    "n_fft": 2048,
    "hop_size": 512,
    "win_size": 2048,
    "sampling_rate": 44100,
    "fmin": 0,
    "fmax": 22050,
    "fmax_for_loss": null,
    "num_workers": 16,
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321",
        "world_size": 1
    }
}
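
A few relationships in a config like this can be sanity-checked before training. The sketch below uses a hypothetical helper, check_config, which is not part of the repo. It checks that the product of upsample_rates equals hop_size (the generator has to turn each mel frame into hop_size audio samples), that fmax does not exceed the Nyquist frequency, and that num_freq matches n_fft // 2 + 1, the bin count of a one-sided STFT.

```python
import json
from math import prod

# Values copied from the config above; only the keys used by the checks.
CONFIG_TEXT = """
{
    "upsample_rates": [8, 8, 2, 2, 2],
    "num_mels": 128,
    "num_freq": 1025,
    "n_fft": 2048,
    "hop_size": 512,
    "sampling_rate": 44100,
    "fmax": 22050
}
"""


def check_config(h):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []

    # Each mel frame must be upsampled to exactly hop_size samples.
    total_upsampling = prod(h["upsample_rates"])
    if total_upsampling != h["hop_size"]:
        problems.append(
            f"prod(upsample_rates)={total_upsampling} != hop_size={h['hop_size']}"
        )

    # Mel filters cannot extend past the Nyquist frequency.
    if h["fmax"] is not None and h["fmax"] > h["sampling_rate"] / 2:
        problems.append(f"fmax={h['fmax']} exceeds Nyquist")

    # A one-sided STFT of size n_fft has n_fft // 2 + 1 frequency bins.
    if h["num_freq"] != h["n_fft"] // 2 + 1:
        problems.append(f"num_freq={h['num_freq']} != n_fft // 2 + 1")

    return problems


h = json.loads(CONFIG_TEXT)
print(check_config(h) or "config looks consistent")
```

These checks say nothing about audio quality; they only catch mismatches that would otherwise surface as shape errors during training.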

@harsh40c

Hey man, thanks for the solution, it worked. It just uses a lot of GPU memory, and since other training jobs are running on our server, I'll start training once the GPU is free. Hopefully it trains properly then. Anyway, thanks a bunch.

@Yaodada12

I'm also training a HiFi-GAN on 128-channel mels. How are your results, and how low did the mel loss get in the end?
