
【NOMERGE】Profile load graph time #619

Closed · wants to merge 9 commits into from

Conversation

doombeaker
Contributor

No description provided.

image = base(
prompt=args.prompt,
height=args.height,
width=args.width,
num_inference_steps=args.n_steps,
output_type=OUTPUT_TYPE,
).images
flow._oneflow_internal.eager.Sync()
end_time = time.time()
print(f"{end_time-start_time}s elapsed: 1st infer")
@doombeaker doombeaker (Contributor Author) commented Feb 4, 2024

Run:

python examples/text_to_image_sdxl.py --compile_unet true --base /share_nfs/hf_models/stable-diffusion-xl-base-1.0

The result is odd:

Although load_graph itself takes about 10 s, the first inference still needs 54 s to produce an image, while the second inference takes 4 s.

So after the precompiled graph is loaded, and before the first sampling iteration even starts, some operation spends roughly 50 s.

I suspect this is what vyro has actually been complaining about when they say load_graph is slow: it is not that the load_graph function itself takes long, but that the time from the load_graph call to the first generated image does.

Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  4.43it/s]
Compiling unet with oneflow.
1.2790741920471191s elapsed: compile
Loading from graph
    2.3534035682678223s elpased: assign of_module time
    0.03800535202026367s elpased: get_oneflow_graph time
  2.3914899826049805s elpased: dpl_graph time
  7.607182264328003s elpased: load_graph time
9.998764038085938s elapsed: unet.load_graph
Warmup with running graphs...
  0%|                                                                                                                                                                                      | 0/30 [00:00<?, ?it/s]
    8.821487426757812e-06s elpased: assign of_module time
    0.03774738311767578s elpased: get_oneflow_graph time
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:53<00:00,  1.78s/it]
54.89510440826416s elapsed: 1st infer
Normal SDXL run...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:04<00:00,  7.40it/s]
4.648606061935425s elapsed: 2nd infer
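The "1st infer" / "2nd infer" numbers above come from simple wall-clock timing around a synchronized pipeline call. A minimal self-contained sketch of that pattern, with run_inference as a placeholder for the actual base(...) call and device sync:

```python
import time

def run_inference():
    # Placeholder for the real pipeline call; the actual script also
    # calls flow._oneflow_internal.eager.Sync() here so that queued
    # device work finishes before the clock is read.
    time.sleep(0.05)

start_time = time.time()
run_inference()
end_time = time.time()
print(f"{end_time - start_time}s elapsed: 1st infer")
```

Without the explicit sync, asynchronous device work can still be in flight when the clock is read, which understates the true latency.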

Collaborator

54.89510440826416s elapsed: 1st infer

This looks like it is compiling.

You can set ONEDIFF_DEBUG=1 to check whether a compile is actually being triggered.


@strint strint marked this pull request as ready for review February 4, 2024 08:27
@doombeaker
Contributor Author

Confirmed: with both the new (simple) script added in this PR and the script https://github.com/siliconflow/onediff/blob/main/examples/text_to_image_sdxl_save_load.py,

regardless of whether load_graph is used, the first call always enters this self._compile call, which compiles the whole graph:

https://github.com/siliconflow/oneflow/blob/876c6d6b258e5ada564a0a71f490918317d764ab/python/oneflow/nn/graph/graph.py#L294-L295

Two things need to be discussed and pinned down:

  1. Was there ever a working case where, after load_graph, subsequent calls used the loaded graph directly and the end-to-end run genuinely skipped compilation? If so, some intermediate commit broke the feature.
  2. What exactly decides whether a graph call triggers compilation? At the graph level there is a single check, "if not self._is_compiled", but oneflow_compile adds further wrapped checks (such as dpl_graph._is_raw_deployable_module); the inconsistency between them seems to be the cause of this problem.
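The second question can be illustrated with a tiny stand-in class (hypothetical names; GraphStub is not real oneflow code): if loading a graph does not also set the same flag that the call path checks, the first call recompiles even though a graph was loaded.

```python
class GraphStub:
    """Toy model of the compile-once guard discussed above."""

    def __init__(self):
        self._is_compiled = False
        self.compile_count = 0

    def _compile(self):
        # Stands in for the expensive end-to-end graph compilation.
        self.compile_count += 1
        self._is_compiled = True

    def load_graph(self, path, mark_compiled=True):
        # A correct load must also flip the flag that __call__ checks;
        # if a wrapper keeps a separate, out-of-sync flag, the first
        # call still pays the full compile cost.
        if mark_compiled:
            self._is_compiled = True

    def __call__(self, x):
        if not self._is_compiled:
            self._compile()
        return x


good = GraphStub()
good.load_graph("unet.graph")
good(0)
print(good.compile_count)  # 0: the loaded graph is reused

bad = GraphStub()
bad.load_graph("unet.graph", mark_compiled=False)
bad(0)
print(bad.compile_count)  # 1: first call recompiles despite the load
```

The sketch only models the flag logic; the real code paths (graph-level and oneflow_compile wrapper) each have their own checks, which is exactly the inconsistency in question.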

@isidentical
Contributor

This is also something we observed quite recently: the first inference after loading the graph takes 40-50 seconds, but I wasn't sure what happened. I think this is a new regression, but I can't really point to the particular commit.

@doombeaker
Contributor Author

doombeaker commented Feb 5, 2024

> this is also something we observed quite recently, the first inference after loading the graph takes 40-50 seconds but I wasn't sure what happened. I think this is a new regression, but can't really point into the particular commit.

sorry for that. I have confirmed that this branch introduced the bug:

if count != input_count:
    logger.warning(
        f"Module {type(self._deployable_module_model.oneflow_module)} input tensor count changed from {count} to {input_count}, will compile again."
    )
    self._deployable_module_dpl_graph = None
    self._load_graph_first_run = True
    self._deployable_module_input_count = input_count

we will fix it soon.

update: it has been fixed by #622
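A small stand-in (hypothetical DeployableStub, not the real onediff class) shows one plausible way this branch bites: assuming the recorded input count is never primed when a graph is loaded, the very first call sees a "changed" count, drops the loaded graph, and forces a full recompile.

```python
class DeployableStub:
    """Toy model of the input-count check quoted above."""

    def __init__(self):
        # Assumption for this sketch: loading a graph never primes
        # the recorded input count, so it starts as None.
        self._deployable_module_input_count = None
        self._deployable_module_dpl_graph = "loaded-graph"
        self._load_graph_first_run = False

    def check_inputs(self, *args):
        input_count = len(args)
        count = self._deployable_module_input_count
        if count != input_count:
            # The branch quoted above: drops the loaded graph and
            # forces a recompile on the next run.
            self._deployable_module_dpl_graph = None
            self._load_graph_first_run = True
            self._deployable_module_input_count = input_count


stub = DeployableStub()
stub.check_inputs("latents", "timestep")  # first call after load_graph
print(stub._deployable_module_dpl_graph)  # None: the loaded graph is gone
print(stub._load_graph_first_run)         # True: the next run recompiles
```

Under that assumption the symptom matches the logs exactly: load_graph itself is fast, but the first inference pays the full compile cost anyway.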

@ccssu ccssu mentioned this pull request Feb 5, 2024
ccssu added a commit that referenced this pull request Feb 5, 2024
@strint strint added the Request-new_feature (Request for a new feature) and Rsp-milestone labels Feb 7, 2024
@strint strint added this to the v0.12.1 milestone Feb 7, 2024
@strint
Collaborator

strint commented Feb 29, 2024

Fixed with #622

@strint strint closed this Feb 29, 2024