Failed to collect training data for sparsity predictor #11
Hello Authors,
I encountered the RuntimeError above, a tensor size mismatch, while trying to train the sparsity predictor as per the instructions in the README. The issue arises when running ./run_infer_opt_175b_collect_sp_data.sh; all relevant files are created successfully until the script hits the tensor size error.
Can you please provide additional information on how we can resolve this issue? Thank you.
I met the same error when tp=1. Hoping the authors will fix it.
Hi everyone! I am facing this issue right now as well, after trying to run the collection script. @zhaoyang-star Where is this parameter?
The same thing happens to me too when I try to run it.
@zhaoyang-star Is the parameter you mean top_p?
@2455DD I reached out to the author earlier this month, and she replied with this: "I believe there is a temporary work around suggested on GitHub by setting the top_p = 2 in get_data file." So you're correct: it is top_p in the get_data file.
I followed the instructions and modified the file DejaVu/Decentralized_FM_alpha/c4_train/get_data.py in the following way, changing its data = { ... } dictionary.
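In case it helps anyone else, here is a minimal, hypothetical sketch of what that kind of modification could look like. The field names other than top_p are illustrative placeholders rather than the actual contents of get_data.py; only the top_p = 2 value reflects the workaround discussed in this thread.

```python
import json

def build_request(prompt):
    # Hypothetical request entry; only top_p = 2 is the workaround reported
    # in this thread. The other fields are illustrative placeholders.
    return {
        "prompt": prompt,      # one C4 training paragraph
        "max_tokens": 0,
        "temperature": 0,
        "top_p": 2,            # workaround: set to 2 instead of the usual 1
        "echo": True,
        "logprobs": 1,
    }

if __name__ == "__main__":
    print(json.dumps(build_request("example text"), indent=2))
```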
I used the run_infer_opt_1_3_b_collect_sp_data.sh script from your forked repository, but I encountered the same size mismatch error. Did you run the script successfully?
Can confirm I am having a similar issue after applying that setting (the difference being that the mismatch happens at dimension 1 and not 3).
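For context, this is the generic PyTorch size-mismatch error; the snippet below is a minimal, self-contained reproduction, not the DejaVu code path, and only illustrates how the reported dimension index arises.

```python
import torch

# Broadcasting two tensors whose shapes disagree at dimension 3 raises the
# same class of RuntimeError reported above; change which axis disagrees and
# the reported dimension index changes accordingly.
a = torch.zeros(1, 12, 64, 16)
b = torch.zeros(1, 12, 64, 32)
try:
    _ = a + b
except RuntimeError as err:
    # e.g. "The size of tensor a (16) must match the size of tensor b (32)
    #       at non-singleton dimension 3"
    print(err)
```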
It works for me after making that change.
Hi,
I was trying to train the sparsity predictor by following the instructions in the README file, but encountered an error when running ./run_infer_opt_175b_collect_sp_data.sh. The following is the printed message:
I used WSL2 with Ubuntu 20.04 and CUDA 11.3.
Due to storage limitations, I changed some settings to use a smaller pretrained model (opt-125m) in the following files:
- DejaVu/Decentralized_FM_alpha/c4_train/get_data.py
- DejaVu/Decentralized_FM_alpha/convert_opt_checkpoint.py
- DejaVu/Decentralized_FM_alpha/run_infer_opt_175b_collect_sp_data.sh

The att_sp_x_0.mmap ~ att_sp_x_11.mmap, mlp_sp_x_0.mmap ~ mlp_sp_x_11.mmap, mlp_label_0.mmap ~ mlp_label_11.mmap, and score_norm_0.mmap ~ score_norm_11.mmap files are created successfully before the error occurs.
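If it is useful for narrowing this down, below is a rough sketch of how the generated .mmap files could be sanity-checked with NumPy. The dtype and row_width are assumptions (they depend on the model's hidden size and on how the collection script writes the files), so they would need to be adjusted to the actual configuration.

```python
import numpy as np

def inspect_mmap(path, dtype=np.float16, row_width=768):
    # Open the file as a flat memmap and report how many full rows of
    # `row_width` elements it holds; leftover elements hint at a shape
    # mismatch between writer and reader. dtype and row_width are assumed
    # values (e.g. float16 activations, opt-125m hidden size 768).
    flat = np.memmap(path, dtype=dtype, mode="r")
    rows, leftover = divmod(flat.size, row_width)
    msg = f"{path}: {flat.size} elements -> {rows} rows of {row_width}"
    if leftover:
        msg += f" (+{leftover} leftover elements, shapes likely disagree)"
    print(msg)

if __name__ == "__main__":
    # File names follow the pattern mentioned above.
    inspect_mmap("mlp_sp_x_0.mmap")
```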
Can you think of any possible reason or solution?
Thanks!