Description: I am encountering an issue with DGL's random_walk() function when running a script on the Graham cluster through SLURM (srun/sbatch). The function is supposed to return node IDs as part of its tensor output. When executed on Graham's compute nodes through SLURM, however, it returns very large integers that look like memory addresses. This behavior does not occur when the script is run locally on my machine or on Google Colab.
Environment:
Reproduction Steps:
Run the script with srun/sbatch on the Graham cluster.
Expected Behavior: The random_walk() function should return a tensor of node IDs, as it does when run on a local machine or the login node.
Actual Behavior: The function returns tensors containing very large integers, as shown below:
(test_py39) [rahit@gra-login1 modspy-data]$ srun --ntasks=1 --cpus-per-task=1 --time=3:00 --mem=500 python ./src/modspy_data/test_dgl.py
srun: job 14487954 queued and waiting for resources
srun: job 14487954 has been allocated resources
(tensor([[0, 1, 1, 0, 1, 1, 3],
        [7802034886504505161, 8028865303377573743, 563406901963619, 2987123997513744384, 225, 114293136, 47056890387456],
        [6866107348136439416, 8386095522570323780, 5795977025519175781, 7022329414053225321, 110416352208244, 161, 114293136],
        [47056890387456, 5782977472600960876, 7802034886504505161, 7237089388030031727, 7453010364987428197, 6485183463639119872, 97]]),
 tensor([1, 1, 0, 1, 1, 0, 1]))
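One way to confirm that these values are invalid, rather than large but legitimate IDs, is to compare every entry against the number of nodes in the graph. The sketch below is only an illustration: the helper name find_invalid_rows and the node count of 10 are assumptions, since the issue does not state the graph size. It treats -1 as valid because DGL pads a trace with -1 when a walk stops early. The rows are copied from the srun output above.

```python
from typing import List, Sequence


def find_invalid_rows(traces: Sequence[Sequence[int]], num_nodes: int) -> List[int]:
    """Return the indices of rows that hold a value which cannot be a node ID."""
    bad_rows = []
    for i, row in enumerate(traces):
        # A valid entry is a node ID in [0, num_nodes) or the -1 padding value.
        if any(v != -1 and not (0 <= v < num_nodes) for v in row):
            bad_rows.append(i)
    return bad_rows


# Trace rows copied from the srun output in this report.
traces = [
    [0, 1, 1, 0, 1, 1, 3],
    [7802034886504505161, 8028865303377573743, 563406901963619,
     2987123997513744384, 225, 114293136, 47056890387456],
    [6866107348136439416, 8386095522570323780, 5795977025519175781,
     7022329414053225321, 110416352208244, 161, 114293136],
    [47056890387456, 5782977472600960876, 7802034886504505161,
     7237089388030031727, 7453010364987428197, 6485183463639119872, 97],
]

NUM_NODES = 10  # hypothetical graph size; the real value is not given in the issue

print(find_invalid_rows(traces, NUM_NODES))  # prints [1, 2, 3]
```

With a real result, the same check could be run on traces.tolist() and g.num_nodes(). Only the first seed's row passes here, which suggests the remaining rows were never written by the sampler.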
Troubleshooting Done:
Questions/Support Needed:
random_walk() or other functions when used in a distributed environment such as a SLURM-based HPC cluster?
Additional Information:
(test_py39) [rahit@gra-login1 modspy-data]$ module list
Currently Loaded Modules:
  1) CCconfig          3) imkl/2020.1.217 (math)   5) gcccore/.9.3.0 (H)   7) ucx/1.8.0          9) openmpi/4.0.3 (m)  11) python/3.9.6 (t)   13) protobuf/3.21.3 (t)
  2) gentoo/2020 (S)   4) StdEnv/2020 (S)          6) gcc/9.3.0 (t)        8) libfabric/1.10.1  10) libffi/3.3         12) cmake/3.27.7 (t)
Where:
  S: Module is Sticky, requires --force to unload or purge
  m: MPI implementations / Implémentations MPI
  math: Mathematical libraries / Bibliothèques mathématiques
  t: Tools for development / Outils de développement
  H: Hidden Module
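Because the result differs between the login node and the compute nodes, it may help to record the runtime environment from inside the job and compare it with the same snapshot taken on the login node. The following is a minimal sketch and not part of the original script. The function name environment_snapshot is hypothetical, and the distribution names "dgl" and "torch" are assumptions that may differ for CUDA-specific builds. SLURM_JOB_ID, SLURM_CPUS_PER_TASK and OMP_NUM_THREADS are standard environment variables, reported as null when unset.

```python
import json
import os
import platform
import sys
from importlib import metadata


def environment_snapshot(packages=("dgl", "torch")):
    """Collect host, interpreter, SLURM and package details as a dict."""
    snapshot = {
        "hostname": platform.node(),
        "python": sys.version.split()[0],
        "executable": sys.executable,
    }
    # Scheduler and threading variables that can differ between nodes.
    for var in ("SLURM_JOB_ID", "SLURM_CPUS_PER_TASK", "OMP_NUM_THREADS"):
        snapshot[var] = os.environ.get(var)
    # Installed versions; None means the distribution was not found.
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = None
    return snapshot


if __name__ == "__main__":
    print(json.dumps(environment_snapshot(), indent=2))
```

Running this once directly on the login node and once through srun gives two JSON documents that can be diffed to spot a different interpreter, package version, or thread setting.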