
random_walk() producing large integers, possibly memory addresses, on SLURM-based HPC environment #6946

Open
rahit opened this issue Jan 14, 2024 · 2 comments
Labels: bug:unconfirmed (May be a bug. Need further investigation.), stale-issue

rahit commented Jan 14, 2024

Environment:

  • Cluster: Graham, a SLURM-based HPC cluster operated by the Digital Research Alliance of Canada (The Alliance)
  • Python Version: Python 3.9.6
  • DGL Version: 1.1.1+computecanada
  • PyTorch Version: 2.0.1+computecanada

Description: I am encountering an issue with DGL's random_walk() function when running a script on the Graham cluster through SLURM (srun/sbatch). The function is supposed to return node IDs as part of its tensor output; however, when executed on Graham's compute nodes through SLURM, it returns very large integers that look like memory addresses. This behavior is not observed when the script is run locally on my machine, on Google Colab, or on Graham's login node.

Reproduction Steps:

  1. Create a heterograph using the toy example from DGL's official documentation (https://docs.dgl.ai/en/1.1.x/generated/dgl.sampling.random_walk.html):
    from dgl import heterograph
    from dgl.sampling import random_walk
    
    g2 = heterograph({
        ('user', 'follow', 'user'): ([0, 1, 1, 2, 3], [1, 2, 3, 0, 0]),
        ('user', 'view', 'item'): ([0, 0, 1, 2, 3, 3], [0, 1, 1, 2, 2, 1]),
        ('item', 'viewed-by', 'user'): ([0, 1, 1, 2, 2, 1], [0, 0, 1, 2, 3, 3])})
    
    print(random_walk(g2, [0, 1, 2, 0], metapath=['follow', 'view', 'viewed-by'] * 2))
  2. Execute the script using srun/sbatch on the Graham cluster. (A version of the script with an explicit node-ID sanity check is sketched right after this list.)
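
For reference, below is a minimal, self-contained sketch of the test script. The range check at the end is a hypothetical addition (not part of the original test_dgl.py): every value returned by random_walk() should either be -1 (an early-terminated walk) or a valid node ID for the node type at that step, so the check should pass on the login node and fail loudly on a compute node if the output is garbage.

    # Sketch of test_dgl.py with an added sanity check (the check itself is an
    # assumption, not part of the original script).
    import dgl
    from dgl.sampling import random_walk

    g2 = dgl.heterograph({
        ('user', 'follow', 'user'): ([0, 1, 1, 2, 3], [1, 2, 3, 0, 0]),
        ('user', 'view', 'item'): ([0, 0, 1, 2, 3, 3], [0, 1, 1, 2, 2, 1]),
        ('item', 'viewed-by', 'user'): ([0, 1, 1, 2, 2, 1], [0, 0, 1, 2, 3, 3])})

    traces, types = random_walk(g2, [0, 1, 2, 0],
                                metapath=['follow', 'view', 'viewed-by'] * 2)
    print(traces)
    print(types)

    # Each step of the walk has a known node type; -1 marks early termination,
    # anything else must be a valid node ID of that type.
    for step in range(traces.shape[1]):
        ntype = g2.ntypes[int(types[step])]
        col = traces[:, step]
        in_range = (col == -1) | ((col >= 0) & (col < g2.num_nodes(ntype)))
        assert bool(in_range.all()), f"step {step}: unexpected values {col.tolist()}"
    print("all returned node IDs are in range")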

Expected Behavior: The random_walk() function should return a tuple of tensors containing valid node IDs, as it does when the script is run on a local machine or on the login node:

    (test_py39) [rahit@gra-login1 modspy-data]$ python src/modspy_data/test_dgl.py 
    (tensor([[0, 1, 1, 0, 1, 1, 3],
            [1, 3, 2, 2, 0, 0, 0],
            [2, 0, 1, 1, 2, 2, 3],
            [0, 1, 1, 3, 0, 1, 1]]), tensor([1, 1, 0, 1, 1, 0, 1]))

Actual Behavior: The function returns tensors containing very large integers, as shown below:

    (test_py39) [rahit@gra-login1 modspy-data]$ srun --ntasks=1 --cpus-per-task=1 --time=3:00 --mem=500 python ./src/modspy_data/test_dgl.py
    srun: job 14487954 queued and waiting for resources
    srun: job 14487954 has been allocated resources
    (tensor([[                  0,                   1,                   1,
                               0,                   1,                   1,
                               3],
            [7802034886504505161, 8028865303377573743,     563406901963619,
             2987123997513744384,                 225,           114293136,
                  47056890387456],
            [6866107348136439416, 8386095522570323780, 5795977025519175781,
             7022329414053225321,     110416352208244,                 161,
                       114293136],
            [     47056890387456, 5782977472600960876, 7802034886504505161,
             7237089388030031727, 7453010364987428197, 6485183463639119872,
                              97]]), tensor([1, 1, 0, 1, 1, 0, 1]))

Troubleshooting Done:

  • Verified that the script runs as expected on local environments and the login node.
  • Checked for any discrepancies in the environment and DGL version between the local setup and the cluster.
  • Ensured that the Python and DGL environments are consistent. (A small diagnostic sketch for comparing the login-node and compute-node environments follows this list.)
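
One way to tighten the "environments are consistent" check is to log exactly which interpreter and which DGL/PyTorch builds each run picks up, and diff a login-node run against a compute-node run. A minimal sketch (my own diagnostic suggestion, not something DGL requires):

    # Hypothetical environment diagnostic: run once on the login node and once
    # under srun, then diff the two outputs line by line.
    import platform
    import sys

    import torch
    import dgl

    print("python     :", sys.version.replace("\n", " "))
    print("executable :", sys.executable)
    print("platform   :", platform.platform())
    print("torch      :", torch.__version__, "from", torch.__file__)
    print("dgl        :", dgl.__version__, "from", dgl.__file__)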

Questions/Support Needed:

  • Is there any known issue with DGL's random_walk() or other functions when used in a distributed environment such as a SLURM-based HPC cluster?
  • Could this be related to how memory is managed or accessed differently on the compute nodes via SLURM? (A byte-level look at the returned values is sketched after this list.)
  • Are there any additional configurations or environment settings I should consider for running DGL on a distributed system like Graham?
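
As a quick follow-up to the memory question above: node IDs in this toy graph are single digits, so a legitimate result occupies only the lowest byte of each int64. Below is a small sketch (purely illustrative; the values are copied from the compute-node output above) that dumps the raw bytes of a few of the suspicious integers, to see whether they look like small IDs, pointers, or unrelated memory contents:

    # Hypothetical check: inspect the raw little-endian bytes of the suspicious
    # int64 values returned on the compute node.
    import struct

    suspicious = [7802034886504505161, 8028865303377573743, 563406901963619]
    for value in suspicious:
        raw = struct.pack("<q", value)   # the 8 bytes of the signed 64-bit value
        print(f"{value:>20d}  hex={raw.hex(' ')}  bytes={raw!r}")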

Additional Information:

  • Modules Loaded:
    (test_py39) [rahit@gra-login1 modspy-data]$ module list
    
    Currently Loaded Modules:
      1) CCconfig          3) imkl/2020.1.217 (math)   5) gcccore/.9.3.0 (H)   7) ucx/1.8.0          9) openmpi/4.0.3 (m)  11) python/3.9.6 (t)  13) protobuf/3.21.3 (t)
      2) gentoo/2020 (S)   4) StdEnv/2020     (S)      6) gcc/9.3.0      (t)   8) libfabric/1.10.1  10) libffi/3.3         12) cmake/3.27.7 (t)
    
      Where:
       S:     Module is Sticky, requires --force to unload or purge
       m:     MPI implementations / Implémentations MPI
       math:  Mathematical libraries / Bibliothèques mathématiques
       t:     Tools for development / Outils de développement
   H:     Hidden Module
  • PyPI packages installed in the virtual environment:
    (test_py39) [rahit@gra-login1 modspy-data]$ pip list
    Package            Version
    ------------------ --------------------
    certifi            2023.11.17
    charset-normalizer 3.3.2
    dgl                1.1.1+computecanada
    filelock           3.13.1+computecanada
    idna               3.6
    Jinja2             3.1.2+computecanada
    MarkupSafe         2.1.3+computecanada
    mpmath             1.3.0+computecanada
    networkx           3.2.1+computecanada
    numpy              1.25.2+computecanada
    pip                23.0+computecanada
    psutil             5.9.5+computecanada
    requests           2.31.0+computecanada
    scipy              1.11.2+computecanada
    setuptools         46.1.3
    sympy              1.12+computecanada
    torch              2.0.1+computecanada
    tqdm               4.66.1+computecanada
    typing_extensions  4.8.0+computecanada
    urllib3            2.1.0+computecanada
    wheel              0.34.2
frozenbugs added the bug:unconfirmed (May be a bug. Need further investigation.) label on Jan 18, 2024

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
