To use a NeMo model in .nemo
format, such as NeMo Megatron-GPT-20B, download and un-tar it:
tar xvf nemo_gpt20B_bf16_tp4.nemo
This will extract the model weights and the model config.
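The same extraction can be scripted with Python's standard-library tarfile module, which helps when the files should land in a dedicated directory. This is a minimal sketch: the helper name `extract_nemo` and the destination directory are illustrative and not part of NeMo.

```python
import tarfile
from pathlib import Path


def extract_nemo(archive_path, dest_dir):
    """Extract a .nemo archive (a tar file) into dest_dir.

    Returns the list of member names found in the archive.
    Hypothetical helper for illustration only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Default mode "r" detects compression automatically.
    with tarfile.open(archive_path) as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names
```

For example, `extract_nemo("nemo_gpt20B_bf16_tp4.nemo", "nemo_gpt20B")` would place the weights and the model config under `nemo_gpt20B/`, assuming the archive is in the current directory.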
Then set train.trainer_kwargs.pretrained_model
to the path of the directory containing the extracted parameters. The model hyperparameters in train.trainer_kwargs.megatron_cfg
must match the ones in the extracted model config.
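One way to catch a mismatch before launching is to compare the two sets of hyperparameters key by key. The sketch below works on plain dicts; the helper name `find_mismatches`, the key names, and the values are illustrative assumptions, not the real settings of any model.

```python
def find_mismatches(model_cfg, megatron_cfg, keys):
    """Return {key: (model_value, trainer_value)} for every key that differs.

    A key missing on either side is reported with None for that side.
    """
    mismatches = {}
    for key in keys:
        model_value = model_cfg.get(key)
        trainer_value = megatron_cfg.get(key)
        if model_value != trainer_value:
            mismatches[key] = (model_value, trainer_value)
    return mismatches


# Toy values for illustration only.
extracted = {"num_layers": 4, "hidden_size": 64, "num_attention_heads": 8}
trainer = {"num_layers": 4, "hidden_size": 128, "num_attention_heads": 8}

mismatches = find_mismatches(extracted, trainer, extracted.keys())
print(mismatches)  # {'hidden_size': (64, 128)}
```

In practice the two dicts would be loaded from the extracted model config and from the megatron_cfg section of the training config.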
To load a checkpoint, run:
python examples/ configs/nemo_configs/megatron_20b.yaml "/path/to/ilql_sentiments_logs/checkpoints"
To save checkpoints, ensure the following is set in the NeMo config:
explicit_log_dir: ilql_sentiments_logs
create_checkpoint_callback: True
To resume training, ensure the following is set in the NeMo config:
resume_if_exists: True
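A small pre-flight check can confirm that the checkpoint and resume settings are present before a long run starts. This sketch assumes the settings have already been read into a flat dict; in a real NeMo config they may sit inside a nested section, and the helper name `check_checkpoint_settings` is hypothetical.

```python
def check_checkpoint_settings(cfg):
    """Return a list of human-readable problems; an empty list means all is set."""
    problems = []
    if not cfg.get("explicit_log_dir"):
        problems.append("explicit_log_dir is not set")
    if cfg.get("create_checkpoint_callback") is not True:
        problems.append("create_checkpoint_callback should be True to save checkpoints")
    if cfg.get("resume_if_exists") is not True:
        problems.append("resume_if_exists should be True to resume training")
    return problems


settings = {
    "explicit_log_dir": "ilql_sentiments_logs",
    "create_checkpoint_callback": True,
    "resume_if_exists": False,
}
for problem in check_checkpoint_settings(settings):
    print(problem)
```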
Clone NeMo (currently only up to r1.15.0
is supported) and apex.
- install conda (or mamba/micromamba)
- srun into a compute node with a GPU (if running on an HPC cluster)
srun --pty bash -i
- copy the conda env export below and change the name and prefix
conda env create -f env.yaml
- install nemo
git clone
cd NeMo
git checkout r1.15.0
pip install '.[all]'
- install apex (or clone the github)
git clone
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./
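After both installs finish, a quick check that the packages are importable from the active environment can save a failed job submission. The sketch below uses the standard-library importlib and only looks the modules up without importing them; the helper name `missing_packages` is illustrative, and `nemo` and `apex` are the top-level module names the two installs are expected to provide.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["nemo", "apex"])
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("nemo and apex are both importable")
```

Run it inside the conda environment created above, on the same kind of node the training job will use.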
