Possible training bugs #11
Again, regarding point 2, the line ...

EDIT: Sorry about this comment, but I'm not sure yet about the actual batch size when using multi-GPU training: https://discuss.huggingface.co/t/what-is-my-batch-size/41390/2?u=cgr71ii . As far as I understand from the HF Accelerate documentation, if your batch size was 128, the actual batch size would be 128 multiplied by the number of GPUs.

EDIT 2: Yes, according to the response I received in the HF forum, the actual batch_size should be 256 if you used 2 GPUs with the current training script. I think this comment can be ignored, sorry.

EDIT 3: In #3 (comment) you mentioned that you used 8 GPUs. Then your actual batch size was 128 × 8 = 1024.
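For reference, here is a minimal sketch (not the repository's code; the per-device batch size of 128 is taken from the discussion above) of how the effective batch size scales with the number of processes under HF Accelerate:

```python
# Minimal sketch: effective batch size under HF Accelerate with one process
# per GPU. The per-device value of 128 is the one discussed above.
from accelerate import Accelerator

accelerator = Accelerator()

per_device_batch_size = 128
effective_batch_size = per_device_batch_size * accelerator.num_processes

# With CUDA_VISIBLE_DEVICES listing 8 GPUs this reports 128 * 8 = 1024.
print(f"processes: {accelerator.num_processes}, "
      f"effective batch size: {effective_batch_size}")
```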
Hi! I think I found another problem in the training code. When you run ...

EDIT: For this problem I've found 2 different fixes; I don't know whether they actually solve the problem, but they are:
Both options solve the problem that arises from the situation I explained in the original content of this comment. The problem is that I don't know how these changes affect the training.

EDIT 2: Sorry, I said that the length of the docids is 3 according to the paper, but that is only the example from Figure 1. Anyway, the paper states that "K = 512 for all datasets, with the length of the docid M being dependent on the number of documents present". Am I wrong in thinking that ...
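As an illustration of that quoted sentence, here is a hypothetical sketch of one way M could be derived from the corpus size for K = 512; this is only my reading of the statement, not necessarily what the paper or the code actually does:

```python
# Hypothetical rule: choose the smallest M such that K**M can address every
# document in the corpus. Illustration only; the paper may pick M differently.
def docid_length(num_docs: int, K: int = 512) -> int:
    M = 1
    while K ** M < num_docs:  # integer arithmetic avoids float log edge cases
        M += 1
    return M

print(docid_length(100_000))    # -> 2 (512**2 = 262,144 >= 100,000)
print(docid_length(3_000_000))  # -> 3 (512**3 > 3,000,000)
```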
Hi,
Yes, I train on a single node. In multi-GPU training, I simply test on a separate process that uses only one GPU. I have not checked accelerate.utils.broadcast_object_list, but this may be a better solution.
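For what it's worth, a minimal sketch of that idea, assuming safe_save returns a meaningful value only on the main process and None on the others (the actual return type in this repository may differ):

```python
# Sketch of broadcasting the main process' save result to the other processes.
# safe_save(...) is the call discussed in this thread; only the main process
# produces a real value here, the other processes start with None.
from accelerate import Accelerator
from accelerate.utils import broadcast_object_list

accelerator = Accelerator()

result = None
if accelerator.is_local_main_process:
    result = safe_save(...)  # hypothetical: e.g. a checkpoint path or tuple

# broadcast_object_list sends the picklable objects in the list from
# process 0 to every other process, in place.
payload = [result]
broadcast_object_list(payload, from_process=0)
result = payload[0]  # now identical on all processes
```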
First of all, I define 1 epoch as the model going through ...
I am not sure if the problem arises from your modification of the batch size. I add an extra ID to distinguish docs with the same IDs. This extra ID may be helpful when many documents have conflicting IDs, although ideally the conflict rate should be low. Besides learning the extra ID, the final step of training is still necessary, as it trains the model to memorize the first 3 semantic IDs in a teacher-forcing manner (using ...).
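To make the extra-ID idea concrete, here is a hedged sketch of one way conflicting docids can be disambiguated with an extra position (the actual scheme in the code may differ):

```python
# Hedged sketch: append one extra ID (a collision counter) so that documents
# sharing the same first 3 semantic IDs still end up with unique docids.
from collections import defaultdict

def add_extra_id(semantic_ids):
    """semantic_ids: list of tuples such as (12, 7, 430)."""
    seen = defaultdict(int)
    unique_ids = []
    for sid in semantic_ids:
        unique_ids.append(sid + (seen[sid],))  # extra ID breaks the tie
        seen[sid] += 1
    return unique_ids

print(add_extra_id([(12, 7, 430), (12, 7, 430), (5, 1, 2)]))
# -> [(12, 7, 430, 0), (12, 7, 430, 1), (5, 1, 2, 0)]
```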
Hi! First, thank you very much for your time and responses.

Regarding my modifications to the code, I agree that I might have added some lines that change the results and generate some of the problems that I identified. For that reason, I started again from a clean version of ...

The changes I added in ...
I don't know if I'm having these errors because I'm trying to execute ...

The command I execute:

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" accelerate launch run2.py --model_name "google-t5/t5-small" --code_num 512 --max_length 3 --train_data ../europarl/europarl.2000.genret.train.50perc.json --dev_data ../europarl/europarl.2000.genret.dev.50perc.json --corpus_data ../europarl/europarl.2000.genret.corpus.json --save_path out.sentences.europarl/model4

Executing it ...

Again, thank you.

PS: I'm aware that my data is not query/document pairs but parallel sentences. If I'm not wrong, this should not be a problem.

run2.zip
When I run the original code (I only modified the call to ...), I get the following error: ...
This is the error I tried to explain I was getting due to ...

The command I ran is the one specified in the README:

CUDA_VISIBLE_DEVICES="0" python3 original_run_test_min.py --model_name t5-base --code_num 512 --max_length 3 --train_data dataset/nq320k/train.json --dev_data dataset/nq320k/dev.json --corpus_data dataset/nq320k/corpus_lite.json --save_path out/model

Regarding the changes I mentioned in my last comment about the batch size: now I see that it's not a bug, but it's managed the way it is because the ...
@cgr71ii I am also running into this problem.
Dear authors,
First, thank you for sharing the code!
I've observed 2 possible bugs in the training loop, which I'd appreciate it if you could either confirm or explain the reasoning behind the relevant pieces of code.

1. Do you run the evaluation (the `test` and `test_dr` functions) on a single GPU during multi-GPU training? I managed to separate the testing part from the multi-GPU training using `accelerator.is_local_main_process`, but the method `safe_save` kept returning `None` for all the processes except the main process (this led to the loading of the different components failing for all the processes except the main process). I think this is a bug. I assumed that it is, and I managed to fix it using `accelerate.utils.broadcast_object_list`.

2. In `train` you set `epoch_step = len(data_loader) // in_batch_size`. If I'm not wrong, you manually split your dataset in `BiDataset` using `in_batch_size`, and use the variable `batch_size` with the PyTorch data loader. I think that `epoch_step` is supposed to contain the number of steps per epoch, but with respect to `batch_size`, not `in_batch_size`, because we are using the data loader. I observed in my debugging executions that not all the steps per epoch are executed, but only a part of the epoch. Specifically, only `len(data_loader) // in_batch_size` steps are executed, which is not 100% of the dataset (the higher `in_batch_size`, the fewer steps are executed). If I'm not wrong about this bug, it does not affect your conclusions, but your results might be even better because not all the training data is being used. The fix I applied, assuming this is a bug, is `len(data_loader) // batch_size`, which, since `batch_size = 1`, basically means that you see all the data batches.

Thank you!
EDIT: Regarding point 2, if I'm not wrong, `if in_batch_size != 1 and step > (epoch + 1) * epoch_step:` should be modified to `if in_batch_size != 1 and step >= (epoch + 1) * epoch_step:` in order to save the model once the epoch is executed (see the numerical sketch below).
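To make point 2 and the EDIT concrete, here is a hedged numerical sketch; the numbers are invented and the variable names follow the thread, not necessarily the repository:

```python
# Invented numbers, only to show why len(data_loader) // in_batch_size
# undercounts the steps per epoch when BiDataset already groups the examples
# into chunks of in_batch_size and the DataLoader uses batch_size = 1.
num_examples = 100_000
in_batch_size = 128
batch_size = 1

loader_steps = num_examples // in_batch_size  # chunks seen by the DataLoader

epoch_step_original = loader_steps // in_batch_size  # len(data_loader) // in_batch_size
epoch_step_proposed = loader_steps // batch_size     # len(data_loader) // batch_size

print(loader_steps)         # 781 loader steps available per epoch
print(epoch_step_original)  # 6   -> the epoch boundary is hit after ~1% of the data
print(epoch_step_proposed)  # 781 -> the whole loader counts as one epoch

# The EDIT's change from '>' to '>=' then makes the epoch-end save fire
# exactly when step == (epoch + 1) * epoch_step, rather than one step later.
```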