
add pytorch hooks #179

Merged (4 commits) on Jan 25, 2022
Conversation

@feifeibear (Contributor)

fix #175

@feifeibear feifeibear changed the title [WIP] add pytorch hooks add pytorch hooks Jan 21, 2022
@FrankLeeeee (Contributor)

Hi, thanks for your awesome PR. In colossalai/trainer/ophooks/__init__.py, I don't think it is necessary to expose TestOpHook to the user.

@feifeibear (Contributor, Author)

> Hi, thanks for your awesome PR. In colossalai/trainer/ophooks/__init__.py, I don't think it is necessary to expose TestOpHook to the user.

Sure, I will fix it.

@feifeibear (Contributor, Author)

Hey, I polished the code and added a simple GPU memory tracer built on the ophooks. It can dump the GPU memory usage curve over #niter iterations to a file. I hope you like the feature.

  1. The ophooks are applied to the Engine class, so I put them under the engine directory.
  2. I am not sure how to pass the ophooks to the engine constructor. The project seems to recommend initializing the hooks in a config file.
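
For readers unfamiliar with the design, the op-hook mechanism discussed here can be sketched in a framework-free way. All names below (BaseOpHook, MemTracerOpHook, run_forward) are illustrative stand-ins rather than the actual colossalai API; a real tracer would query torch.cuda.memory_allocated() from hooks registered on torch modules.

```python
# Minimal, framework-free sketch of the op-hook pattern in this PR.
# Names are illustrative, not the real colossalai classes.

class BaseOpHook:
    """Callbacks fired around each operator's forward execution."""
    def pre_fwd_exec(self, module):
        pass

    def post_fwd_exec(self, module):
        pass


class MemTracerOpHook(BaseOpHook):
    """Records a (hypothetical) memory reading after each forward op."""
    def __init__(self, mem_fn):
        self.mem_fn = mem_fn  # injected so the sketch stays testable
        self.curve = []       # the "memory usage curve" to dump later

    def post_fwd_exec(self, module):
        self.curve.append((module, self.mem_fn()))


def run_forward(modules, hooks):
    """Drive the hooks the way an engine would during one iteration."""
    for m in modules:
        for h in hooks:
            h.pre_fwd_exec(m)
        # ... the operator itself would execute here ...
        for h in hooks:
            h.post_fwd_exec(m)


# usage: fake memory readings stand in for torch.cuda.memory_allocated()
readings = iter([10, 30, 25])
tracer = MemTracerOpHook(lambda: next(readings))
run_forward(["linear1", "relu", "linear2"], [tracer])
print(tracer.curve)  # [('linear1', 10), ('relu', 30), ('linear2', 25)]
```

Injecting the memory-reading function keeps the sketch independent of CUDA while preserving the shape of the hook lifecycle.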

@FrankLeeeee (Contributor)

Hi, the ophooks look great to me. As for your second point, we initialize the trainer hook objects in the Python script and pass them to the trainer, instead of defining them in the config file. Could you tell us where you saw that usage? It might be a deprecated doc that we can update.
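
The recommended pattern, constructing hook objects in the script and handing them to the trainer, can be sketched as follows; Trainer and LogMetricHook are hypothetical stand-ins, not the real colossalai classes.

```python
# Illustrative sketch: hooks are built in the training script and passed
# in as ready-made objects, not named in a config file.

class LogMetricHook:
    """A hypothetical trainer hook with a single lifecycle callback."""
    def on_epoch_end(self, epoch):
        return f"epoch {epoch} done"


class Trainer:
    def __init__(self, hooks):
        self.hooks = list(hooks)  # hook objects arrive fully constructed

    def fit(self, epochs):
        logs = []
        for e in range(epochs):
            for h in self.hooks:
                logs.append(h.on_epoch_end(e))
        return logs


# usage: instantiate the hook in the script, then pass it to the trainer
trainer = Trainer(hooks=[LogMetricHook()])
print(trainer.fit(2))  # ['epoch 0 done', 'epoch 1 done']
```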

@FrankLeeeee FrankLeeeee self-requested a review January 25, 2022 02:34
@feifeibear (Contributor, Author)

> Hi, the ophooks look great to me. As for your second point, we initialize the trainer hook objects in the Python script and pass them to the trainer, instead of defining them in the config file. Could you tell us where you saw that usage? It might be a deprecated doc that we can update.

I saw APIs such as this in builder/builder.py:

def build_hooks(config, trainer):

Most of the functions in this file are not used in the project.
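
For context, a config-driven builder of that shape might look roughly like this. Only the build_hooks(config, trainer) signature comes from builder/builder.py; the registry and PrintHook class below are hypothetical.

```python
# Hypothetical sketch of a config-driven hook builder matching the
# build_hooks(config, trainer) signature seen in builder/builder.py.
# The registry and hook class are illustrative, not colossalai code.

HOOK_REGISTRY = {}


def register_hook(cls):
    """Map a class name to its class so configs can refer to it by name."""
    HOOK_REGISTRY[cls.__name__] = cls
    return cls


@register_hook
class PrintHook:
    def __init__(self, trainer, interval=1):
        self.trainer = trainer
        self.interval = interval


def build_hooks(config, trainer):
    """Instantiate every hook listed in the config, binding the trainer."""
    hooks = []
    for spec in config.get('hooks', []):
        spec = dict(spec)                    # copy so we can pop the type key
        cls = HOOK_REGISTRY[spec.pop('type')]
        hooks.append(cls(trainer, **spec))
    return hooks


# usage: hooks are declared as dicts in a config, then built by name
cfg = {'hooks': [{'type': 'PrintHook', 'interval': 10}]}
built = build_hooks(cfg, trainer='dummy_trainer')
print(type(built[0]).__name__, built[0].interval)  # PrintHook 10
```

This is the style the config-file approach implies; as noted above, the project has since moved to constructing hook objects directly in the training script.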

@FrankLeeeee (Contributor)

Noted, these became deprecated after we updated some APIs. Some code cleanup is needed.

@FrankLeeeee (Contributor)

Meanwhile, I saw some print statements in the op hooks. I would recommend using DistributedLogger instead, as print output may not be captured when logging is configured. Using the logger also lets users see the messages in their log text file.

General usage of the logger looks like this:

from colossalai.logging import get_dist_logger
from colossalai.context import ParallelMode


# this will get the root Python logger
logger = get_dist_logger()

# log on all ranks
logger.info("some message")

# log only on rank 0
logger.info("some messages", ranks=[0])

# log on rank 0 and rank 1
logger.info("some messages", ranks=[0, 1])

# log on all data parallel rank 0
logger.info("some messages", ranks=[0], parallel_mode=ParallelMode.DATA)

# save the log
logger.log_to_file('./logs')

You can find the API doc here.

@feifeibear (Contributor, Author)

The PR has been polished; print() has been replaced with the logger.

@FrankLeeeee (Contributor)

Awesome, thanks for the contribution :)

@FrankLeeeee FrankLeeeee merged commit 569357f into hpcaitech:main Jan 25, 2022
ver217 pushed a commit to ver217/ColossalAI that referenced this pull request Feb 14, 2022
* add pytorch hooks
fix hpcaitech#175

* remove licenses in src code

* add gpu memory tracer

* replacing print with logger in ophooks.
Closes: Add hooks before and after operators (#175)