[WIP] What is a SEV
Suraj Subramanian edited this page Jul 24, 2023 · 1 revision
"OSS CI SEV" is the incident response process for PyTorch OSS CI. It covers incidents that break the HUD status, trunk health, PR health, or CI infrastructure stability. The goal of the `ci: sev` process is to maintain a healthy trunk for a better developer experience.
- [OSS] PyTorch Metrics Platform: https://metrics.pytorch.org/
- [FB Only] Green HUD Top Level Metrics: https://fburl.com/unidash/961dprzj
Create an issue that clearly states the scope and the impact area of the incident. Tag the issue with the `ci: sev` label so that it appears on the HUD: https://hud.pytorch.org/build2/pytorch-master
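As an illustration of the step above, here is a minimal sketch that prepares such an issue through the GitHub REST API ("create an issue" endpoint). Only the `ci: sev` label and the `pytorch/pytorch` repository come from this page; the helper name `build_sev_issue_request`, the body layout, and the token handling are assumptions for illustration, and the usual path is simply to file the issue in the GitHub UI.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"
SEV_LABEL = "ci: sev"  # label named on this page; makes the issue show up on the HUD


def build_sev_issue_request(title, scope, impact, token, repo="pytorch/pytorch"):
    """Build (but do not send) a request that opens a SEV issue.

    Hypothetical helper: the body layout below is an assumption,
    not an official PyTorch template.
    """
    payload = {
        "title": title,
        "body": f"## Scope\n{scope}\n\n## Impact\n{impact}\n",
        "labels": [SEV_LABEL],
    }
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    req = build_sev_issue_request(
        title="Example SEV title",
        scope="Example: which jobs or workflows are affected",
        impact="Example: who is blocked and how",
        token="<your-token>",
    )
    print(req.get_method(), req.full_url)
    # To actually file the issue: urllib.request.urlopen(req)
```

Keeping the request construction separate from sending it makes the payload easy to review before anything is posted.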
- Raise awareness. Showing SEV events on the HUD helps tree-hugger oncalls determine whether a given "test failure" is caused by a SEV or by flaky infrastructure.
- Notify the team that owns the affected tests.
- Escalate the issue with the `high priority` label if necessary.
- After the issue is resolved, simply close the issue (but don't remove the `ci: sev` label).
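For the closing step, a sketch of how a resolved SEV could be closed through the GitHub REST API ("update an issue" endpoint) while leaving its labels alone. The helper name `build_close_issue_request` and the issue number are hypothetical. The sketch relies on the assumption that fields omitted from the update payload are left unchanged, so the `ci: sev` label stays on the issue; closing the issue in the GitHub UI has the same effect.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_close_issue_request(issue_number, token, repo="pytorch/pytorch"):
    """Build (but do not send) a request that closes an issue.

    Hypothetical helper. The payload deliberately omits "labels",
    so existing labels such as "ci: sev" are not touched.
    """
    payload = {"state": "closed"}
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{repo}/issues/{issue_number}",
        data=json.dumps(payload).encode("utf-8"),
        method="PATCH",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # 12345 is a placeholder issue number; nothing is sent here.
    req = build_close_issue_request(12345, token="<your-token>")
    print(req.get_method(), req.full_url)
    # To actually close the issue: urllib.request.urlopen(req)
```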
- Gather the recent SEV issues: https://github.com/pytorch/pytorch/issues?q=is%3Aissue+label%3A%22ci%3A+sev%22+
- Summarize what we can do to prevent similar issues in the future
- Actionable Items
- Improved Detection
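The search link for recent SEV issues is an ordinary GitHub issue query with the label URL-encoded. The sketch below shows how such a link can be built, which may be handy when narrowing a review to open SEVs only. The function name `sev_issues_url` and the `only_open` flag are illustrative assumptions; `is:issue`, `is:open`, and `label:"..."` are standard GitHub search qualifiers.

```python
from urllib.parse import urlencode


def sev_issues_url(label="ci: sev", repo="pytorch/pytorch", only_open=False):
    """Return a GitHub issue-search URL for issues carrying `label`."""
    terms = ["is:issue", f'label:"{label}"']
    if only_open:
        terms.append("is:open")
    # urlencode percent-encodes ':' and '"' and turns spaces into '+'.
    return f"https://github.com/{repo}/issues?" + urlencode({"q": " ".join(terms)})


if __name__ == "__main__":
    print(sev_issues_url())
    print(sev_issues_url(only_open=True))
```

With the defaults, the result matches the link in the list above apart from that link's trailing `+`.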