Incident with Pull Requests: High percentage of 500sOn Monday March 31st, 2026, between 13:53 UTC and 21:23 UTC the Pull Requests service experienced elevated latency and failures. On average, the error rate was 0.15% and peaked at 0.28% of requests to the service. This was due to a change in garbage collection (GC) settings for a Go-based internal service that provides access to Git repository data. The changes caused more frequent GC activity and elevated CPU consumption on a subset of storage nodes, increasing latency and failure rates for some internal API operations.
We mitigated the incident by reverting the GC changes. To prevent future incidents and improve time to detection and mitigation, we are instrumenting additional metrics and alerting for GC-related behavior, improving our visibility into other signals that could cause degraded impact of this type, and updating our best practices and standards for garbage collection in Go-based services.
Mar 31, 15:05 - 21:23 UTC
Issues with metered billing report generationOn March 31, 2026, between 06:15 UTC and 15:30 UTC, the GitHub billing usage reports feature was degraded due to reduced server capacity. Customers requesting billing usage reports and loading the top usage by organization and repository on the billing overview and usage pages were impacted. The average error rate for usage report requests was 15%, peaking at 98% over an eight-minute window. For the billing pages, an average of 56% of requests failed to load the top usage cards. The root cause was an increase in billing usage report requests with large datasets, which exhausted the capacity of the nodes responsible for reporting data. There was no impact on billing charges.
We mitigated the incident by adjusting our auto-scaling thresholds to better meet our capacity needs. We are working to improve our metrics to reduce time to detection and mitigation for similar issues in the future.
Mar 31, 13:47 - 15:10 UTC
Elevated delays in Actions workflow runs and Pull Request status updatesOn March 30, 2026, between 10:11 UTC and 13:25 UTC, GitHub Actions experienced degraded performance. During this time, approximately 2.65% of workflow jobs triggered by pull request events experienced start delays exceeding 5 minutes. The issue was caused by replication lag on an internal database cluster used by Actions, which triggered write throttling in our database protection layer and slowed job queue processing.
The replication lag originated from planned maintenance to scale the internal database. Newly added database hosts triggered guardrails in the throttling layer, restricting write throughput. The incident was mitigated by excluding the new hosts from replication delay calculations.
To prevent recurrence, we have updated our maintenance procedures to ensure new hosts are excluded from throttling assessments during scaling operations. Additionally, we are investing in automation to streamline this type of maintenance activity.
Mar 30, 13:02 - 13:25 UTC
Incident with CopilotOn March 27, 2026, from 02:30 to 04:56 UTC, a misconfiguration in our rate limiting system caused users on Copilot Free, Student, Pro, and Pro+ plans to experience unexpected rate limit errors. The configuration that was incorrectly applied was intended solely for internal staff testing of rate-limiting experiences. Copilot Business and Copilot Enterprise accounts were not affected.
During this period, affected users received error messages instructing them to retry after a certain time. Approximately 32% of active Free users, 35% of active Student users, 46% of active Pro users, and 66% of active Pro+ users were affected.
After identifying the root cause, we reverted the change and restored the expected rate limits. We are reviewing our deployment and validation processes to help ensure configurations used for internal testing cannot be inadvertently applied to production environments.
Mar 27, 05:00 - 05:00 UTC
Disruption with some GitHub servicesThis incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
Mar 24, 20:18 - 20:56 UTC
Teams Github Notifications App is downOn March 24, 2026, between 15:57 UTC and 19:51 UTC, the Microsoft Teams Integration and Teams Copilot Integration services were degraded and unable to deliver GitHub event notifications to Microsoft Teams. On average, the error rate was 37.4% and peaked at 90.1% of requests to the service -- approximately 19% of all integration installs failed to receive GitHub-to-Teams notifications in this time period.
This was due to an outage at one of our upstream dependencies, which caused HTTP 500 errors and connection resets for our Teams integration.
We coordinated with the relevant service teams, and the issue was resolved at 19:51 UTC when the upstream incident was mitigated.
We are working to update observability and runbooks to reduce time to mitigation for issues like this in the future.
Mar 24, 16:59 - 19:51 UTC
Disruption with some GitHub servicesOn March 22, 2026, between 09:05 UTC and 10:02 UTC, users may have experienced intermittent errors and increased latency when performing Git http read operations. On average, the error rate was 3.84% and peaked at 15.55% of requests to the service. The issue was caused by elevated latency in an internal authentication service within one of our regional clusters. We mitigated the issue by redirecting traffic away from the affected cluster at 09:39 UTC, after which error rates returned to normal. The incident was fully resolved at 10:02 UTC.
We are working to scale the authentication service and reduce our time to detection and mitigation of issues like this one in the future.
Mar 22, 09:08 - 10:02 UTC
Disruption with Copilot Coding Agent SessionsOn March 19, 2026, between 01:05 UTC and 02:52 UTC, and again on March 20, 2026, between 00:42 UTC and 01:58 UTC, the Copilot Coding Agent service was degraded and users were unable to start new Copilot Agent sessions or view existing ones. During the first incident, the average error rate was ~53% and
peaked at ~93% of requests to the service. During the second incident, the average error rate was ~99%% and peaked at ~100%% of requests with significant retry amplification. Both incidents were caused by the same underlying system authentication issue that prevented the service from connecting to its
backing datastore.
We mitigated each incident by rotating the affected credentials, which restored connectivity and returned error rates to normal. The mitigation time was 01:24. The second occurrence was due to an incomplete remediation of the first.
We are implementing automated monitoring for credential lifecycle events and improving operational processes to reduce our time to detection and mitigation of issues like this one in the future.
Mar 20, 00:58 - 01:58 UTC
Git operations for users in the west coast are experiencing an increase in latencyOn March 19, 2026 between 16:10 UTC and 00:05 UTC (March 20), Git operations (clone, fetch, push) from the US west coast experienced elevated latency and degraded throughput. Users reported clone speeds dropping from typical speeds to under 1 MiB/s in extreme cases. The root cause was network transport link saturation at our Seattle edge site, where a fiber cut affecting our backbone transport resulted in saturation and packet loss. We had a planned scale-up in progress for the site that was accelerated to resolve the backbone capacity pressure. We also brought online additional edge capacity in a cloud region and redirected some users there. Current scale with the upgraded network capacity is sufficient to prevent reoccurrence, as we upgraded from 800Gbps to 3.2Tbps total capacity on this path. We will continue to monitor network health and respond to any further issues.
Mar 19, 16:25 - Mar 20, 00:05 UTC
Issues with Copilot Coding AgentOn March 19, 2026, between 01:05 UTC and 02:52 UTC, and again on March 20, 2026, between 00:42 UTC and 01:58 UTC, the Copilot Coding Agent service was degraded and users were unable to start new Copilot Agent sessions or view existing ones. During the first incident, the average error rate was ~53% and
peaked at ~93% of requests to the service. During the second incident, the average error rate was ~99%% and peaked at ~100%% of requests with significant retry amplification. Both incidents were caused by the same underlying system authentication issue that prevented the service from connecting to its
backing datastore.
We mitigated each incident by rotating the affected credentials, which restored connectivity and returned error rates to normal. The mitigation time was 01:24. The second occurrence was due to an incomplete remediation of the first.
We are implementing automated monitoring for credential lifecycle events and improving operational processes to reduce our time to detection and mitigation of issues like this one in the future.
Mar 19, 13:44 - 14:32 UTC
Disruption with Copilot Coding Agent sessionsOn March 19, 2026, between 01:05 UTC and 02:52 UTC, and again on March 20, 2026, between 00:42 UTC and 01:58 UTC, the Copilot Coding Agent service was degraded and users were unable to start new Copilot Agent sessions or view existing ones. During the first incident, the average error rate was ~53% and
peaked at ~93% of requests to the service. During the second incident, the average error rate was ~99%% and peaked at ~100%% of requests with significant retry amplification. Both incidents were caused by the same underlying system authentication issue that prevented the service from connecting to its
backing datastore.
We mitigated each incident by rotating the affected credentials, which restored connectivity and returned error rates to normal. The mitigation time was 01:24. The second occurrence was due to an incomplete remediation of the first.
We are implementing automated monitoring for credential lifecycle events and improving operational processes to reduce our time to detection and mitigation of issues like this one in the future.
Mar 19, 02:05 - 02:52 UTC
Disruption with some GitHub servicesOn March 19, 2026 between 16:10 UTC and 00:05 UTC (March 20), Git operations (clone, fetch, push) from the US west coast experienced elevated latency and degraded throughput. Users reported clone speeds dropping from typical speeds to under 1 MiB/s in extreme cases. The root cause was network transport link saturation at our Seattle edge site, where a fiber cut affecting our backbone transport resulted in saturation and packet loss. We had a planned scale-up in progress for the site that was accelerated to resolve the backbone capacity pressure. We also brought online additional edge capacity in a cloud region and redirected some users there. Current scale with the upgraded network capacity is sufficient to prevent reoccurrence, as we upgraded from 800Gbps to 3.2Tbps total capacity on this path. We will continue to monitor network health and respond to any further issues.
This was the same incident declared in https://www.githubstatus.com/incidents/xs6xtcv196g7
Mar 18, 22:36 - Mar 19, 01:44 UTC
Webhook delivery is delayedOn March 18, 2026, between 18:18 UTC and 19:46 UTC all webhook deliveries experienced elevated latency. During this time, average delivery latency increased from a baseline of approximately 5 seconds to a peak of approximately 160 seconds. This was due to resource constraints in the webhook delivery pipeline, which caused queue backlog growth and increased delivery latency. We mitigated the incident by shifting traffic and adding capacity, after which webhook delivery latency returned to normal. We are working to improve capacity management and detection in the webhook delivery pipeline to help prevent similar issues in the future.
Mar 18, 18:51 - 19:46 UTC
Errors starting and connecting to CodespacesOn 16 March 2026, between 14:16 UTC and 15:18 UTC, Codespaces users encountered a download failure error message when starting newly created or resumed codespaces. At peak, 96% of the created or resumed codespaces were impacted. Active codespaces with a running VSCode environment were not affected.
The error was a result of an API deployment issue with our VS Code remote experience dependency and was resolved by rolling back that deployment. We are working with our partners to reduce our incident engagement time, improve early detection before they impact our customers, and ensure safe rollout of similar changes in the future.
Mar 16, 15:01 - 15:28 UTC
Degraded performance for various servicesOn March 13, 2026, between 13:35 UTC and 16:02 UTC, a configuration change to an internal authorization service reduced its processing capacity below what was needed during peak traffic. This caused intermittent timeouts when other GitHub services checked user permissions, resulting in four to five waves of errors over roughly two hours and forty minutes. In total, 0.4% of users were denied access to actions they were authorized to perform.
The root cause was a resource right-sizing change deployed to the authorization service the previous day. It reduced CPU allocation below what was required at peak, causing the service's network gateway to throttle under load. Because the change was deployed after peak traffic on March 12, the reduced capacity wasn't surfaced until the next day's peak.
The incident was mitigated by manually scaling up the authorization service and reverting the configuration change.
To prevent recurrence, we are adding further resource utilization monitors across our entire stack to detect throttling and improving error handling so transient infrastructure timeouts are distinguished from authorization failures, enabling quicker detection of the root issue.
Mar 13, 15:12 - 16:15 UTC
Degraded Codespaces experienceOn March 12, 2026, between 01:00 UTC and 18:53 UTC, users saw failures downloading extensions within created or resumed codespaces. Users would see an error when attempting to use an extension within VS Code. Active codespaces with extensions already downloaded were not impacted.
The extensions download failures were the result of a change introduced in our extension dependency and was resolved by updating the configuration of how those changes affect requests from Codespaces. We are enhancing observability and alerting of critical issues within regular codespace operations to better detect and mitigate similar issues in the future.
Mar 12, 13:06 - 18:53 UTC
Actions failures to download (401 Unauthorized)On March 12, 2026 between 02:30 and 06:02 UTC some GitHub Apps were unable to mint server to server tokens, resulting in 401 Unauthorized errors. During the outage window, ~1.3% of requests resulted in 401 errors incorrectly. This manifested in GitHub Actions jobs failing to download tarballs, as well as failing to mint fine-grained tokens. During this period, approximately 5% of Actions jobs were impacted
The root cause was a failure with the authentication service’s token cache layer, a newly created secondary cache layer backed by Redis – caused by Kubernetes control plane instability, leading to an inability to read certain tokens which resulted in 401 errors. The mitigation was to fallback reads to the primary cache layer backed by mysql. As permanent mitigations, we have made changes to how we deploy redis to not rely on the Kubernetes control plane and maintain service availability during similar failure modes. We also improved alerting to reduce overall impact time from similar failures.
Mar 12, 04:46 - 06:02 UTC
Disruption with some GitHub servicesBetween 01:36 and 08:11 UTC on Thursday March 12, GitHub.com experienced elevated error rates across Git operations, web requests, and related services. During a planned infrastructure upgrade, a configuration issue caused newly provisioned Kubernetes nodes to run an incompatible version of etcd, which disrupted cluster consensus across several production clusters. This led to intermittent 5XX errors on git push, git clone, and page loads. Deployments were paused for the duration of the incident.
Once the incompatible nodes were identified, they were removed and cluster consensus was restored. A validation deploy confirmed all systems were healthy before normal operations resumed.
To prevent recurrence, we are adding programmatic enforcement of version compatibility during node replacements, implementing monitoring to detect split-brain conditions earlier, and updating our recovery tooling to reduce restoration time.
Mar 12, 01:54 - 02:45 UTC
Degraded experience with Copilot Code ReviewOn March 11, 2026, between 13:00 UTC and 15:23 UTC the Copilot Code Review service was degraded and experienced longer than average review times. On average, Copilot Code Review requests took 4 minutes and peaked at just under 8 minutes. This was due to hitting worker capacity limits and CPU throttling. We mitigated the incident by increasing partitions, and we are improving our resource monitoring to identify potential issues sooner.
Mar 11, 14:25 - 15:53 UTC
Incident with API RequestsOn March 11, 2026, between 14:25 UTC and 14:34 UTC, the REST API platform was degraded, resulting in increased error rates and request timeouts. REST API 5xx error rates peaked at ~5% during the incident window with two distinct spikes: the first impacting REST services broadly, and the second driven by sustained timeouts on a subset of endpoints.
The incident was caused by a performance degradation in our data layer, which resulted in increased query latency across dependent services. Most services recovered quickly after the initial spike, but resource contention caused sustained 5xx errors due to how certain endpoints responded to the degraded state.
A fix addressing the behavior that prolonged impact has already been shipped. We are continuing to work to resolve the primary contributing factor of the degradation and to implement safeguards against issues causing cascading impact in the future.
Mar 11, 14:37 - 15:02 UTC
Incident With WebhooksOn March 10, 2026, between 23:00 UTC and 23:40 UTC, the Webhooks service was degraded and ~6% of users experienced intermittent errors when accessing webhook delivery history, retrying webhook deliveries, and listing webhooks via the UI and API. Approximately 0.37% of requests resulted in errors, while at peak 0.5% of requests resulted in errors.
This was due to unhealthy infrastructure. We mitigated the incident by redeploying affected services, after which service health returned to normal.
We are working to improve detection of unhealthy infrastructure and strengthen service safeguards to reduce time to detect and mitigate similar issues in the future.
Mar 10, 23:00 - 23:00 UTC
Incident with WebhooksOn March 9, 2026, between 15:03 and 20:52 UTC, the Webhooks API experienced was degraded, resulted in higher average latency on requests and in certain cases error responses. Approximately 0.6% of total requests exceeded the normal latency threshold of 3s, while 0.4% of requests resulted in 500 errors. At peak, 2.0% experienced latency greater than 3 seconds and 2.8% of requests returned 500 errors.
The issue was caused by a noisy actor that led to resource contention on the Webhooks API service. We mitigated the issue initially by increasing CPU resources for the Webhooks API service, and ultimately applied lower rate limiting thresholds to the noisy actor to prevent further impact to other users.
We are working to improve monitoring to more quickly ascertain noisy traffic and will continue to improve our rate-limiting mechanisms to help prevent similar issues in the future.
Mar 9, 15:50 - 17:03 UTC
Incident with CodespacesOn March 9, 2026, between 01:23 UTC and 03:25 UTC, users attempting to create or resume codespaces in the Australia East region experienced elevated failures, peaking at a 100% failure rate for this region. Codespaces in other regions were not affected.
The create and resume failures were caused by degraded network connectivity between our control plane services and the VMs hosting the codespaces. This was resolved by redirecting traffic to an alternate site within the region. While we are addressing the core network infrastructure issue, we have also improved our observability of components in this area to improve detection. This will also enable our existing automated failovers to cover this failure mode. These changes will prevent or significantly reduce the time any similar incident causes user impact.
Mar 9, 03:04 - 03:51 UTC
Incident with WebhooksOn March 6, 2026, between 16:16 UTC and 23:28 UTC the Webhooks service was degraded and some users experienced intermittent errors when accessing webhook delivery histories, retrying webhook deliveries, and listing webhooks via the UI and API. On average, the error rate was 0.57% and peaked at approximately 2.73% of requests to the service. This was due to unhealthy infrastructure affecting a portion of webhook API traffic.
We mitigated the incident by redeploying affected services, after which service health returned to normal.
We are working to improve detection of unhealthy infrastructure and strengthen service safeguards to reduce time to detection and mitigation of issues like this one in the future.
Mar 6, 16:58 - 23:28 UTC
Actions is experiencing degraded availabilityOn March 5, between 22:39 and 23:55 UTC, Actions was degraded due to a repeat of an incident a few hours prior. In this case, a Redis cluster topology change made as a follow-up to the earlier incident caused a repeat of the earlier degradation of Actions jobs. Details of both incidents and the follow-ups are shared at https://www.githubstatus.com/incidents/g5gnt5l5hf56.
Mar 5, 22:53 - 23:55 UTC
Multiple services are affected, service degradationOn Mar 5, 2026, between 16:24 UTC and 19:30 UTC, Actions was degraded. During this time, 95% of workflow runs failed to start within 5 minutes with an average delay of 30 minutes and 10% workflow runs failed with an infrastructure error. This was due to Redis infrastructure updates that were being rolled out to production to improve our resiliency. These changes introduced a set of incorrect configuration change into our Redis load balancer causing internal traffic to be routed to an incorrect host leading to two incidents.
We mitigated this incident by correcting the misconfigured load balancer. Actions jobs were running successfully starting at 17:24 UTC. The remaining time until we closed the incident was burning through the queue of jobs.
We immediately rolled back the updates that were a contributing factor and have frozen all changes in this area until we have completed follow-up work from this. We are working to improve our automation to ensure incorrect configuration changes are not able to propagate through our infrastructure. We are also working on improved alerting to catch misconfigured load balancers before it becomes an incident. Additionally, we are updating the Redis client configuration in Actions to improve resiliency to brief cache interruptions.
Mar 5, 16:35 - 19:30 UTC
Disruption with some GitHub servicesOn March 5, 2026, between 12:53 UTC and 13:35 UTC, the Copilot mission control service was degraded. This resulted in empty responses returned for users' agent session lists across GitHub web surfaces. Impacted users were unable to see their lists of current and previous agent sessions in GitHub web surfaces. This was caused by an incorrect database query that falsely excluded records that have an absent field.
We mitigated the incident by rolling back the database query change. There were no data alterations nor deletions during the incident.
To prevent similar issues in the future, we're improving our monitoring depth to more easily detect degradation before changes are fully rolled out.
Mar 5, 01:13 - 01:30 UTC
Some OpenAI models degraded in CopilotOn March 5th, 2026, between approximately 00:26 and 00:44 UTC, the Copilot service experienced a degradation of the GPT 3.5 Codex model due to an issue with our upstream provider. Users encountered elevated error rates when using GPT 3.5 Codex, impacting approximately 30% of requests. No other models were impacted.
The issue was resolved by a mitigation put in place by our provider.
Mar 5, 00:47 - 01:13 UTC
Claude Opus 4.6 Fast not appearing for some Copilot usersOn March 3, 2026, between 19:44 UTC and 21:05 UTC, some GitHub Copilot users reported that the Claude Opus 4.6 Fast model was no longer available in their IDE model selection. After investigation, we confirmed that this was caused by enterprise administrators adjusting their organization's model policies, which correctly removed the model for users in those organizations. No users outside the affected organizations lost access.
We confirmed that the Copilot settings were functioning as designed, and all expected users retained access to the model. The incident was resolved once we verified that the change was intentional and no platform regression had occurred.
Mar 3, 20:31 - 21:11 UTC
Incident with all GitHub servicesOn March 3, 2026, between 18:46 UTC and 20:09 UTC, GitHub experienced a period of degraded availability impacting GitHub.com, the GitHub API, GitHub Actions, Git operations, GitHub Copilot, and other dependent services. At the peak of the incident, GitHub.com request failures reached approximately 40%. During the same period, approximately 43% of GitHub API requests failed. Git operations over HTTP had an error rate of approximately 6%, while SSH was not impacted. GitHub Copilot requests had an error rate of approximately 21%. GitHub Actions experienced less than 1% impact.
This incident shared the same underlying cause as an incident in early February where we saw a large volume of writes to the user settings caching mechanism. While deploying a change to reduce the burden of these writes, a bug caused every user’s cache to expire, get recalculated, and get rewritten. The increased load caused replication delays that cascaded down to all affected services. We mitigated this issue by immediately rolling back the faulty deployment.
We understand these incidents disrupted the workflows of developers. While we have made substantial, long-term investments in how GitHub is built and operated to improve resilience, we acknowledge we have more work to do. Getting there requires deep architectural work that is already underway, as well as urgent, targeted improvements. We are taking the following immediate steps:
- We have added a killswitch and improved monitoring to the caching mechanism to ensure we are notified before there is user impact and can respond swiftly.
- We are moving the cache mechanism to a dedicated host, ensuring that any future issues will solely affect services that rely on it.
Mar 3, 18:59 - 20:09 UTC
Delayed visibility of newly added issues on project boardsBetween March 2, 21:42 UTC and March 3, 05:54 UTC project board updates, including adding new issues, PRs, and draft items to boards, were delayed from 30 minutes to over 2 hours, as a large backlog of messages accumulated in the Projects data denormalization pipeline.
The incident was caused by an anomalously large event that required longer processing time than expected. Processing this message exceeded the Kafka consumer heartbeat timeout, triggering repeated consumer group rebalances. As a result, the consumer group was unable to make forward progress, creating head-of-line blocking that delayed processing of subsequent project board updates.
We mitigated the issue by deploying a targeted fix that safely bypassed the offending message and allowed normal message consumption to resume. Consumer group stability recovered at 04:10 UTC, after which the backlog began draining. All queued messages were fully processed by 05:53 UTC, returning project board updates to normal processing latency.
We have identified several follow-up improvements to reduce the likelihood and impact of similar incidents in the future, including improved monitoring and alerting, as well as introducing limits for unusually large project events.
Mar 2, 23:10 - Mar 3, 05:54 UTC
Incident with Pull Requests /pullsOn March 2nd, 2026, between 7:10 UTC and 22:04 UTC the pull requests service was degraded. Users navigating between tabs on the pull requests dashboard were met with 404 errors or blank pages.
This was due to a configuration change deployed on February 27th at 11:03 PM UTC. We mitigated the incident by reverting the change.
We’re working to improve monitoring for the page to automatically detect and alert us to routing failures.
Mar 2, 19:11 - 22:04 UTC