GitHub Availability Report: September 2024
In September, we experienced three incidents that resulted in degraded performance across GitHub services.
September 16 21:11 UTC (lasting 57 minutes)
On September 16, 2024, between 21:11 UTC and 22:08 UTC, GitHub Actions and GitHub Pages services were degraded. Customers who deploy Pages from a source branch experienced delayed runs. We determined the root cause to be a misconfiguration in the service that manages runner connections, which led to CPU throttling and performance degradation in that service. Actions jobs experienced average delays of 23 minutes, with some jobs delayed by as much as 45 minutes. Over the course of the incident, 17% of runs were delayed by more than five minutes; at peak, as many as 80% of runs experienced delays exceeding five minutes.
We mitigated the incident by diverting runner connections away from the misconfigured nodes, starting at 21:16 UTC. In addition to correcting the configuration issue, we have improved our monitoring to reduce the risk of recurrence and to shorten our time to automated detection and mitigation of similar issues in the future.
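The report does not disclose the actual alerting configuration behind the improved monitoring. As a minimal sketch of the kind of check described above, the following Python snippet computes the fraction of recent runs whose queue-to-start delay exceeds five minutes and fires an alert above an assumed threshold; the threshold values are illustrative, not GitHub's.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

# Hypothetical thresholds for illustration only; the report does not
# state the actual alerting configuration.
DELAY_THRESHOLD = timedelta(minutes=5)
ALERT_FRACTION = 0.05  # alert if more than 5% of recent runs are delayed

@dataclass
class Run:
    queued_at: datetime
    started_at: datetime

def delayed_fraction(runs: Iterable[Run]) -> float:
    """Fraction of runs whose queue-to-start delay exceeds the threshold."""
    runs = list(runs)
    if not runs:
        return 0.0
    delayed = sum(1 for r in runs if r.started_at - r.queued_at > DELAY_THRESHOLD)
    return delayed / len(runs)

def should_alert(recent_runs: Iterable[Run]) -> bool:
    """Fire an alert when the delayed fraction crosses the alert threshold."""
    return delayed_fraction(recent_runs) > ALERT_FRACTION
```

A check like this, evaluated over a sliding window, is what shortens time to automated detection: it trips well before delays reach the 23-minute averages observed during the incident.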
September 24 08:20 UTC (lasting 44 minutes)
On September 24, 2024, from 08:20 UTC to 09:04 UTC, the GitHub Codespaces service experienced an interruption in network connectivity, leading to an approximately 25% error rate during the outage. We traced the cause to Source Network Address Translation (SNAT) port exhaustion following a deployment, which caused individual codespaces to lose their connection to the service. To mitigate the impact, we increased port allocations to provide enough buffer for the increased outbound connections that follow deployments. We will be scaling up our outbound connectivity in the near future, as well as adding improved monitoring of network capacity to prevent future regressions.
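SNAT port exhaustion occurs when concurrent outbound connections from behind a NAT gateway exceed the ports allocated to each instance, so new connections fail until existing ones close or time out. The report does not give GitHub's actual connection counts or allocations; the back-of-the-envelope sketch below uses made-up numbers to show how a post-deployment reconnect spike can exceed a fixed per-instance allocation, and how a larger allocation restores headroom.

```python
def snat_ports_exhausted(concurrent_outbound_connections: int,
                         ports_per_instance: int) -> bool:
    """Each outbound connection holds one SNAT port until it closes or
    times out; new connections fail once the allocation is used up."""
    return concurrent_outbound_connections > ports_per_instance

# Illustrative numbers only (not from the report): a post-deployment
# reconnect storm can briefly multiply steady-state connection counts.
steady_state = 800
reconnect_spike = 3 * steady_state

print(snat_ports_exhausted(steady_state, ports_per_instance=1024))     # False
print(snat_ports_exhausted(reconnect_spike, ports_per_instance=1024))  # True: exhausted
print(snat_ports_exhausted(reconnect_spike, ports_per_instance=4096))  # False: larger allocation gives buffer
```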
September 30 10:43 UTC (lasting 43 minutes)
On September 30, 2024, from 10:43 UTC to 11:26 UTC, GitHub Codespaces customers in the Central India region were unable to create new codespaces. Resumes were not impacted, and there was no impact to customers in other regions. We traced the cause to storage capacity constraints in the region and mitigated by temporarily redirecting create requests to other regions. We then added storage capacity to the region and routed traffic back. We also identified a bug that caused some available capacity not to be utilized, artificially constraining capacity and halting creations in the region prematurely. We have since fixed this bug so that available capacity scales as expected according to our capacity planning projections.
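The report does not describe the bug's mechanics, so the following is a purely hypothetical reconstruction of how a capacity check can halt creations while usable capacity still exists: the buggy check consults only part of the region's storage, whereas the fixed check considers all of it.

```python
from dataclasses import dataclass

@dataclass
class StoragePool:
    capacity_gb: int
    used_gb: int

    @property
    def available_gb(self) -> int:
        return self.capacity_gb - self.used_gb

# Hypothetical: the buggy check looks only at the first pool, so creations
# halt even though other pools in the region still have free capacity.
def has_capacity_buggy(pools: list[StoragePool], needed_gb: int) -> bool:
    return pools[0].available_gb >= needed_gb

# Fixed check: any pool with enough free space can serve the creation.
def has_capacity_fixed(pools: list[StoragePool], needed_gb: int) -> bool:
    return any(p.available_gb >= needed_gb for p in pools)

pools = [StoragePool(capacity_gb=10_000, used_gb=9_900),
         StoragePool(capacity_gb=10_000, used_gb=2_000)]
print(has_capacity_buggy(pools, needed_gb=500))  # False: creations halted prematurely
print(has_capacity_fixed(pools, needed_gb=500))  # True: region still has usable capacity
```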
Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.