
random_walk() producing large integers, possibly memory addresses, on SLURM-based HPC environment #6946

Open
rahit opened this issue Jan 14, 2024 · 2 comments
Labels: bug:unconfirmed (May be a bug. Need further investigation.), stale-issue

rahit commented Jan 14, 2024

Environment:

  • Cluster: Graham, a SLURM-based HPC cluster operated by the Digital Research Alliance of Canada (The Alliance)
  • Python Version: Python 3.9.6
  • DGL Version: 1.1.1+computecanada
  • PyTorch Version: 2.0.1+computecanada

Description: I am encountering an issue with DGL's random_walk() function when running a script on the Graham cluster through SLURM (srun/sbatch). The function is supposed to return node IDs as part of its tensor output; however, when executed on Graham's compute nodes through SLURM, it returns very large integers that look like memory addresses. This behavior is not observed when the script is run locally on my machine, on Google Colab, or on Graham's login node.

Reproduction Steps:

  1. Create a heterograph using the toy example from DGL's official documentation (https://docs.dgl.ai/en/1.1.x/generated/dgl.sampling.random_walk.html):
    from dgl import heterograph
    from dgl.sampling import random_walk
    
    g2 = heterograph({
        ('user', 'follow', 'user'): ([0, 1, 1, 2, 3], [1, 2, 3, 0, 0]),
        ('user', 'view', 'item'): ([0, 0, 1, 2, 3, 3], [0, 1, 1, 2, 2, 1]),
        ('item', 'viewed-by', 'user'): ([0, 1, 1, 2, 2, 1], [0, 0, 1, 2, 3, 3])})
    
    print(random_walk(g2, [0, 1, 2, 0], metapath=['follow', 'view', 'viewed-by'] * 2))
  2. Execute the script using srun/sbatch on the Graham cluster. (A version of the script with an explicit node-ID sanity check is sketched right after this list.)
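
For reference, below is a minimal, self-contained sketch of the test script. The range check at the end is a hypothetical addition (not part of the original test_dgl.py): every value returned by random_walk() should either be -1 (an early-terminated walk) or a valid node ID for the node type at that step, so the check should pass on the login node and fail loudly on a compute node if the output is garbage.

    # Sketch of test_dgl.py with an added sanity check (the check itself is an
    # assumption, not part of the original script).
    import dgl
    from dgl.sampling import random_walk

    g2 = dgl.heterograph({
        ('user', 'follow', 'user'): ([0, 1, 1, 2, 3], [1, 2, 3, 0, 0]),
        ('user', 'view', 'item'): ([0, 0, 1, 2, 3, 3], [0, 1, 1, 2, 2, 1]),
        ('item', 'viewed-by', 'user'): ([0, 1, 1, 2, 2, 1], [0, 0, 1, 2, 3, 3])})

    traces, types = random_walk(g2, [0, 1, 2, 0],
                                metapath=['follow', 'view', 'viewed-by'] * 2)
    print(traces)
    print(types)

    # Each step of the walk has a known node type; -1 marks early termination,
    # anything else must be a valid node ID of that type.
    for step in range(traces.shape[1]):
        ntype = g2.ntypes[int(types[step])]
        col = traces[:, step]
        in_range = (col == -1) | ((col >= 0) & (col < g2.num_nodes(ntype)))
        assert bool(in_range.all()), f"step {step}: unexpected values {col.tolist()}"
    print("all returned node IDs are in range")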

Expected Behavior: The random_walk() function should return a tuple of tensors containing valid node IDs, as it does when the script is run on a local machine or on the login node:

    (test_py39) [rahit@gra-login1 modspy-data]$ python src/modspy_data/test_dgl.py 
    (tensor([[0, 1, 1, 0, 1, 1, 3],
            [1, 3, 2, 2, 0, 0, 0],
            [2, 0, 1, 1, 2, 2, 3],
            [0, 1, 1, 3, 0, 1, 1]]), tensor([1, 1, 0, 1, 1, 0, 1]))

Actual Behavior: The function returns tensors containing very large integers, as shown below:

    (test_py39) [rahit@gra-login1 modspy-data]$ srun --ntasks=1 --cpus-per-task=1 --time=3:00 --mem=500 python ./src/modspy_data/test_dgl.py
    srun: job 14487954 queued and waiting for resources
    srun: job 14487954 has been allocated resources
    (tensor([[                  0,                   1,                   1,
                               0,                   1,                   1,
                               3],
            [7802034886504505161, 8028865303377573743,     563406901963619,
             2987123997513744384,                 225,           114293136,
                  47056890387456],
            [6866107348136439416, 8386095522570323780, 5795977025519175781,
             7022329414053225321,     110416352208244,                 161,
                       114293136],
            [     47056890387456, 5782977472600960876, 7802034886504505161,
             7237089388030031727, 7453010364987428197, 6485183463639119872,
                              97]]), tensor([1, 1, 0, 1, 1, 0, 1]))

Troubleshooting Done:

  • Verified that the script runs as expected on local environments and the login node.
  • Checked for any discrepancies in the environment and DGL version between the local setup and the cluster.
  • Ensured that the Python and DGL environments are consistent. (A small diagnostic sketch for comparing the login-node and compute-node environments follows this list.)
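
One way to tighten the "environments are consistent" check is to log exactly which interpreter and which DGL/PyTorch builds each run picks up, and diff a login-node run against a compute-node run. A minimal sketch (my own diagnostic suggestion, not something DGL requires):

    # Hypothetical environment diagnostic: run once on the login node and once
    # under srun, then diff the two outputs line by line.
    import platform
    import sys

    import torch
    import dgl

    print("python     :", sys.version.replace("\n", " "))
    print("executable :", sys.executable)
    print("platform   :", platform.platform())
    print("torch      :", torch.__version__, "from", torch.__file__)
    print("dgl        :", dgl.__version__, "from", dgl.__file__)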

Questions/Support Needed:

  • Is there any known issue with DGL's random_walk() or other functions when used in a distributed environment such as a SLURM-based HPC cluster?
  • Could this be related to how memory is managed or accessed differently on the compute nodes via SLURM? (A byte-level look at the returned values is sketched after this list.)
  • Are there any additional configurations or environment settings I should consider for running DGL on a distributed system like Graham?
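
As a quick follow-up to the memory question above: node IDs in this toy graph are single digits, so a legitimate result occupies only the lowest byte of each int64. Below is a small sketch (purely illustrative; the values are copied from the compute-node output above) that dumps the raw bytes of a few of the suspicious integers, to see whether they look like small IDs, pointers, or unrelated memory contents:

    # Hypothetical check: inspect the raw little-endian bytes of the suspicious
    # int64 values returned on the compute node.
    import struct

    suspicious = [7802034886504505161, 8028865303377573743, 563406901963619]
    for value in suspicious:
        raw = struct.pack("<q", value)   # the 8 bytes of the signed 64-bit value
        print(f"{value:>20d}  hex={raw.hex(' ')}  bytes={raw!r}")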

Additional Information:

  • Modules Loaded:
    (test_py39) [rahit@gra-login1 modspy-data]$ module list
    
    Currently Loaded Modules:
      1) CCconfig          3) imkl/2020.1.217 (math)   5) gcccore/.9.3.0 (H)   7) ucx/1.8.0          9) openmpi/4.0.3 (m)  11) python/3.9.6 (t)  13) protobuf/3.21.3 (t)
      2) gentoo/2020 (S)   4) StdEnv/2020     (S)      6) gcc/9.3.0      (t)   8) libfabric/1.10.1  10) libffi/3.3         12) cmake/3.27.7 (t)
    
      Where:
       S:     Module is Sticky, requires --force to unload or purge
       m:     MPI implementations / Implémentations MPI
       math:  Mathematical libraries / Bibliothèques mathématiques
       t:     Tools for development / Outils de développement
   H:     Hidden Module
  • PyPI packages installed in the virtual environment:
    (test_py39) [rahit@gra-login1 modspy-data]$ pip list
    Package            Version
    ------------------ --------------------
    certifi            2023.11.17
    charset-normalizer 3.3.2
    dgl                1.1.1+computecanada
    filelock           3.13.1+computecanada
    idna               3.6
    Jinja2             3.1.2+computecanada
    MarkupSafe         2.1.3+computecanada
    mpmath             1.3.0+computecanada
    networkx           3.2.1+computecanada
    numpy              1.25.2+computecanada
    pip                23.0+computecanada
    psutil             5.9.5+computecanada
    requests           2.31.0+computecanada
    scipy              1.11.2+computecanada
    setuptools         46.1.3
    sympy              1.12+computecanada
    torch              2.0.1+computecanada
    tqdm               4.66.1+computecanada
    typing_extensions  4.8.0+computecanada
    urllib3            2.1.0+computecanada
    wheel              0.34.2
frozenbugs added the bug:unconfirmed (May be a bug. Need further investigation.) label on Jan 18, 2024

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
