May 2026

Incident with Actions
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
May 20, 16:58 - 20:14 UTC
Actions is experiencing degraded availability
On May 15, 2026, from approximately 07:43 UTC to 08:48 UTC, GitHub Actions experienced a degradation that caused workflow runs to fail or experience delayed starts for a subset of customers. The incident was triggered by a planned failover of supporting infrastructure used by GitHub Actions. During that operation, an automated service discovery update did not propagate correctly, which caused traffic to be routed incorrectly and increased request timeouts in a core dependency for workflow orchestration.

At peak impact, 42% of Actions runs failed. Downstream services that depend on Actions workflow execution were also impacted, including GitHub Pages and Copilot cloud services. At 08:12 UTC, responders manually corrected the service discovery routing issue. Timeout and failure rates recovered shortly after, and we continued monitoring until full stabilization was confirmed across all affected services. The incident was marked resolved at 08:48 UTC.

To prevent recurrence, we are implementing failover guardrails that validate service discovery state before completing failover operations, strengthening pre-flight and post-flight verification checks, and improving dependency resilience to reduce timeout cascades during infrastructure events.
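
As an illustration of what a guardrail that validates service discovery state could look like, the sketch below compares the endpoints a failover is expected to advertise with what discovery actually reports. The `DiscoveryState` shape, the service name, and the addresses are all hypothetical and are not taken from GitHub's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoveryState:
    """Snapshot of what service discovery advertises (hypothetical shape)."""
    service: str
    endpoints: frozenset[str]


def validate_failover(expected: DiscoveryState, observed: DiscoveryState) -> list[str]:
    """Return a list of problems; an empty list means the failover may complete."""
    problems = []
    if expected.service != observed.service:
        problems.append(f"service mismatch: {observed.service!r}")
    missing = expected.endpoints - observed.endpoints
    stale = observed.endpoints - expected.endpoints
    if missing:
        problems.append(f"missing endpoints: {sorted(missing)}")
    if stale:
        problems.append(f"stale endpoints still advertised: {sorted(stale)}")
    return problems


# Made-up example: one endpoint did not propagate after the failover.
expected = DiscoveryState("orchestrator", frozenset({"10.0.1.5:443", "10.0.1.6:443"}))
observed = DiscoveryState("orchestrator", frozenset({"10.0.0.5:443", "10.0.1.6:443"}))
problems = validate_failover(expected, observed)
for problem in problems:
    print(problem)
```

A failover step would refuse to finish while `problems` is non-empty, which is the pre-flight and post-flight idea described above in miniature.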
May 15, 08:13 - 08:48 UTC
[Retroactive] Incident with GitHub.com
Beginning at 02:49 UTC on May 15, 2026 and lasting until 03:04 UTC, GitHub.com was unavailable for a subset of customers. This impact has been mitigated and normal service resumed. The issue was rooted in a sudden spike in traffic, with intermittent impact. We've identified the source of the traffic and prevented further disruption.
May 15, 02:30 - 02:30 UTC
Incident with CodeQL
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
May 13, 14:41 - 16:03 UTC
Incident with CodeQL, Webhooks, Notifications, and Slack Integration
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
May 12, 14:38 - 17:43 UTC
Incident with high errors on Git Operations
On May 11th, 2026, between 14:00 UTC and 14:33 UTC, HTTP-based Git read operations were degraded. On average, the error rate was 2.8% and peaked at 7.5% of requests to the service. This was due to resource exhaustion in a networking gateway between GitHub.com’s frontend service for Git operations and a dependency service that performs authentication and authorization. Following the initial spike, the frontend service became stuck in a degraded state in one of our data centers, increasing time to mitigation.

We mitigated the incident by scaling the networking gateway and re-deploying the frontend service.

To reduce our time to detection and mitigation in the future, we are adding auto-scaling to the networking gateway, and resolving a bug which caused the frontend service to remain degraded.
May 11, 14:25 - 14:33 UTC
CCR and CCA failing to start for PR comments
On May 7, 2026, between 04:12 UTC and 06:13 UTC, Copilot Cloud Agent and Copilot Code Review Agent sessions for pull requests were delayed or failed to start.

The issue was caused by follow-up recovery work from a separate Pull Requests incident (https://www.githubstatus.com/incidents/f5pb5d5mr9yh). As part of that recovery, we ran a large database migration, which caused replication delays on several replica hosts.

Although those replicas were not serving user traffic, our safeguards correctly treated the elevated replication lag as a signal to slow down writes to the affected database cluster. As a result, some pull request background processing was temporarily delayed. That processing is responsible for sending the internal events that Copilot agents use to begin work, so affected agents did not start until the database replicas caught up.

The system recovered once replication lag returned to normal and pull request processing resumed. We are reviewing how this safeguard interacts with recovery migrations so we can reduce the chance of similar secondary impact during future incident recovery work.
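
A safeguard of the kind described above can be pictured as a function that maps replica lag to a write delay. The sketch below is a simplified model under stated assumptions: the threshold, the delay cap, the scaling factor, and the replica names are invented for illustration and do not reflect the real safeguard's parameters.

```python
def write_delay_seconds(replica_lags, lag_threshold=5.0, max_delay=2.0, factor=0.1):
    """Return how long to pause before the next write batch.

    replica_lags maps a replica name to its replication lag in seconds.
    All numeric defaults are assumed values, chosen only for the example.
    """
    worst = max(replica_lags.values(), default=0.0)
    if worst <= lag_threshold:
        return 0.0  # replicas are keeping up, so no throttling
    # Slow writes in proportion to the excess lag, up to a fixed cap.
    return min(max_delay, (worst - lag_threshold) * factor)


healthy = write_delay_seconds({"replica-1": 1.0, "replica-2": 3.0})
lagging = write_delay_seconds({"replica-1": 1.0, "replica-2": 15.0})
capped = write_delay_seconds({"replica-1": 100.0})
print(healthy, lagging, capped)
```

Because the worst replica drives the delay, a single lagging replica slows writes for the whole cluster even if it serves no user traffic, which matches the secondary impact described in this incident.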
May 7, 05:02 - 06:56 UTC
Incident with Pull Requests
On May 6, 2026, between 15:12 UTC and 19:02 UTC, creation of new pull request review threads on GitHub.com failed. This included new line comments and file comments on pull requests. Existing PRs and previously created comments were unaffected.

This incident was caused by a 32-bit integer key reaching its maximum value in a Vitess lookup table used during PR thread creation. The primary table had been migrated to a 64-bit integer key, but the Vitess lookup table remained 32-bit. Once the values in the primary table passed the available 32-bit ID space in the lookup table, attempts to create new review threads began failing, resulting in a near-100% failure rate for new thread creation requests. We mitigated the issue by updating the impacted lookup table definitions across all shards to use 64-bit integer column types, increasing the available ID range and restoring normal operation. Service was fully restored once the schema changes completed globally.

To help prevent similar incidents, we are expanding existing monitoring of database columns to include Vitess lookup tables, so that we are notified in advance of any table that is approaching a column size limit. This work is intended to provide earlier detection of columns approaching size limits before customer impact occurs.
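
The kind of check this monitoring implies can be sketched as follows. The column names, the 80% alert threshold, and the assumption that the keys are signed integers are illustrative; the report does not state any of them.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit value (signedness is an assumption)
INT64_MAX = 2**63 - 1  # largest signed 64-bit value


def columns_near_limit(columns, threshold=0.8):
    """Return names of columns whose highest ID uses at least `threshold` of the range.

    `columns` is an iterable of (name, current_max_id, column_max) tuples.
    """
    return [
        name
        for name, current_max_id, column_max in columns
        if current_max_id / column_max >= threshold
    ]


# Hypothetical inventory: same ID values, different column widths.
inventory = [
    ("lookup_table.thread_id", 2_000_000_000, INT32_MAX),
    ("primary_table.thread_id", 2_000_000_000, INT64_MAX),
]
at_risk = columns_near_limit(inventory)
print(at_risk)
```

With these made-up numbers, only the 32-bit lookup column is flagged, even though the primary table holding the same IDs has ample room.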
May 6, 15:25 - 19:04 UTC
Disruption with some GitHub services
On May 6, 2026 between 11:02 UTC and 11:13 UTC, users were unable to start or view Copilot Cloud Agent or remote sessions. During this time, requests to the session API returned errors, preventing users from creating new sessions or viewing existing ones. The issue was caused by a configuration change to the service's network routing that inadvertently removed the ingress path for the service. The team reverted the change at 11:13 UTC which restored service. The incident remained open until 11:59 UTC while the team verified full recovery. We are taking steps to improve our deployment validation process to prevent similar configuration changes from impacting production traffic in the future.
May 6, 11:21 - 11:59 UTC
Incident with Actions, we are investigating reports of degraded availability
On May 6, 2026, from approximately 06:45 UTC to 09:15 UTC, GitHub Actions Standard Ubuntu hosted runners were degraded. 17.1% of jobs requesting a standard runner failed.

This was caused by an unexpected data shape in the allocation configuration data for standard runners. That data was introduced as part of post-incident remediation work for an incident the previous day and caused new allocations to be blocked as load ramped up for the day. Removing that data at 08:51 UTC allowed allocations to proceed and hosted runner pools to scale up and recover.

We are updating the filter logic for this allocation data to be resilient to abnormal data shapes and improving monitoring to alert when allocations are blocked, allowing the team to respond before customer impact starts.
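
One way to make filter logic tolerant of abnormal data shapes is to validate each entry and set aside the ones that do not fit, rather than letting a single bad record block every allocation. The entry format below (`pool` and `capacity` keys) is hypothetical, since the report does not describe the real configuration schema.

```python
def usable_allocations(entries):
    """Split allocation entries into (valid, rejected) without raising.

    A valid entry is assumed to be a dict with a string "pool" and a
    non-negative integer "capacity".
    """
    valid, rejected = [], []
    for entry in entries:
        capacity = entry.get("capacity") if isinstance(entry, dict) else None
        well_formed = (
            isinstance(entry, dict)
            and isinstance(entry.get("pool"), str)
            and isinstance(capacity, int)
            and not isinstance(capacity, bool)  # bool is a subclass of int
            and capacity >= 0
        )
        (valid if well_formed else rejected).append(entry)
    return valid, rejected


entries = [
    {"pool": "ubuntu-standard", "capacity": 500},
    {"pool": "ubuntu-standard", "capacity": None},  # unexpected shape
    ["ubuntu-standard", 500],                       # unexpected shape
]
valid, rejected = usable_allocations(entries)
print(len(valid), len(rejected))
```

Rejected entries can then feed an alert, so operators learn that allocations are degraded before all of them are blocked.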
May 6, 07:19 - 09:44 UTC
Increased Latency and Failures for SSH Git Operations
Between approximately 14:00 and 16:10 UTC on May 5, 2026, SSH-based Git operations experienced elevated latency and intermittent failures. On average, the error rate was 0.46% and peaked at 0.6% of SSH write requests. HTTP-based Git operations, including web UI and HTTPS clones, were not affected.

The impact was caused by reduced SSH capacity at one of our data center sites. During a period of high traffic, the remaining hosts became overloaded, leading to connection exhaustion and some failures for SSH-based operations.

Additional capacity was provisioned to expand SSH capacity and resolve the incident. The expanded capacity was fully online by 18:18 UTC.

To reduce the likelihood of similar incidents, we will implement faster scaling solutions for SSH infrastructure and improved alerting for host availability and capacity thresholds.
May 5, 16:49 - 18:35 UTC
Incident with Actions
On May 5, 2026, from approximately 13:22 UTC to 17:05 UTC, GitHub Actions hosted runners in the East US region were degraded. 13.5% of jobs requesting a standard runner failed and ~16% of requested Larger Runners with private networking pinned to East US failed or were delayed by more than 5 minutes. Copilot Code Review requests were also impacted. Approximately 8,500 code review requests timed out during this window. Affected users saw an error comment on their pull requests and were able to retry by re-requesting a review. Most runner requests were picked up by other regions automatically, but a portion of requests still routing to East US were impacted.

This was triggered by a scale-up operation for hosted runner VMs in the East US region. This is a regular operation, but the VM create load hit an internal rate limit when VM creates pull images from storage. Existing backoff logic was not triggered because of the response code returned in this case. The rate limiting and VM creation failures were mitigated by reducing load to allow for recovery and allowing queued work to be processed. By 15:34 UTC, queued and failed job assignments were mostly mitigated, with less than 0.5% of runner assignments impacted between 15:34 and full recovery at 17:05.

We are improving our system’s throttling behavior when limits occur, improving our controls to more quickly mitigate similar situations in the future, and reviewing all limits end-to-end for similar operations. We also immediately paused all scale and similar operations until these changes are in place and validated.
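
The failure mode described above, where backoff did not trigger because of the response code, can be illustrated with a small retry policy. The set of status codes and the timing constants are assumptions for this sketch; the report does not say which code the storage layer returned.

```python
# Status codes this sketch treats as "slow down and retry" (an assumed set).
RETRYABLE = frozenset({429, 500, 502, 503, 504})


def backoff_delay(status, attempt, base=1.0, cap=60.0):
    """Return seconds to wait before retrying, or None if no retry applies.

    Uses capped exponential backoff: base * 2**attempt, limited to `cap`.
    """
    if status not in RETRYABLE:
        return None
    return min(cap, base * 2 ** attempt)


print(backoff_delay(429, 3))   # throttled: wait before the fourth attempt
print(backoff_delay(503, 10))  # long outage: delay is capped
print(backoff_delay(409, 0))   # not in the set: no backoff is applied
```

If a rate limit surfaces with a code outside the retryable set, as `409` does in this example, callers keep sending requests at full speed, which is the gap the throttling improvements aim to close.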
May 5, 13:37 - 17:26 UTC
Incident with Issues and Webhooks
On 2026-05-04 at 3:37:17 PM UTC we detected increased latency on issues resulting in timeouts, and elevated 500 errors on webhooks. A scheduled workload drove high utilization on the primary host of a critical datastore, saturating the connection pool. We paused the job to mitigate the problem at 4:40:05 PM UTC and have implemented measures to prevent recurrence.
May 4, 15:45 - 16:40 UTC
Incomplete pull request results in repositories
On April 28, 2026, at approximately 14:07 UTC, GitHub received reports that pull requests were missing from search results across global and repository /pulls pages.

The issue was caused by a manually invoked repair job intended for a single repository, which was executed without the required safety flags. During execution of the repair job, the database query remained correctly scoped to the repo’s PR IDs. However, the Elasticsearch reconciliation logic did not apply the same scope. It interpreted the min and max PR IDs as a continuous range, causing unrelated PR documents across other repos to be marked for deletion. This resulted in the removal of 1,789,756,838 PR documents from the search index, approximately 49% of indexed PR documents.
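
The scope mismatch can be reproduced in a few lines. In this toy model, the index maps a PR document ID to its repository, and the repair targets a hypothetical repo "A"; the IDs and repo names are invented for illustration.

```python
# Search index: document ID -> repository (toy data).
index = {10: "A", 11: "B", 30: "C", 50: "A", 60: "A", 90: "A", 95: "B"}

# PR IDs that primary storage reports for repo "A" (60 no longer exists there).
db_ids = {10, 50, 90}

# Faulty reconciliation: treat min..max of the repo's IDs as one continuous
# range and mark every indexed document in that range that is not in db_ids.
lo, hi = min(db_ids), max(db_ids)
unscoped = {doc_id for doc_id in index if lo <= doc_id <= hi and doc_id not in db_ids}

# Scoped reconciliation: only consider documents that belong to repo "A".
scoped = {doc_id for doc_id, repo in index.items() if repo == "A" and doc_id not in db_ids}

print(sorted(unscoped))  # includes documents from repos "B" and "C"
print(sorted(scoped))    # only the genuinely stale document
```

The unscoped version marks documents 11 and 30 for deletion although they belong to other repositories, while the scoped version removes only document 60.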

Customer impact was limited to PR search and list discoverability. Primary storage was unaffected, and there was no impact to opening, updating, or merging PRs.

The issue was identified ~10 minutes after initial customer reports. Because it affected search index completeness rather than service availability, it was not caught by existing monitoring.

The root cause was a flaw in the search document repair framework: it allowed a scoped reconciliation to run without enforcing a matching Elasticsearch query scope. This created a destructive mismatch between the source-of-truth and the index. The issue was compounded by the ability to trigger the job from the production console without safety defaults. Prior testing focused only on safe backfill scenarios and did not cover this reconciliation path. Additionally, there was no automated detection for large-volume deletions in Elasticsearch.

We mitigated the incident through three parallel actions: (1) deployed a MySQL-backed search fallback for the most active repos by traffic to restore PR visibility for highly impacted users; (2) initiated a snapshot restore and reindex process to repopulate missing pull request documents in Elasticsearch; and (3) added a degradation notice on PR pages to inform users of incomplete search results while recovery was in progress. The incident was resolved on May 1, 2026 at 04:15 UTC, following completion and validation of the reindex process.

To prevent recurrence, we are prioritizing improvements to the repair framework and safeguards. These include enforcing scoped query alignment between primary storage and Elasticsearch, preventing destructive operations without explicit opt-in, strengthening guardrails for manual repair jobs, and evaluating restrictions on production console access.

In parallel, we are expanding automated test coverage for reconciliation safety invariants and introducing detection for anomalous deletion patterns in Elasticsearch so similar issues can be identified or blocked earlier.

We are committed to improving the safety and reliability of our repair systems and ensuring that operational workflows are resilient to both software defects and manual invocation risks.
Apr 28, 14:17 - May 1, 04:15 UTC

April 2026

Disruption with some GitHub services
On April 28, 2026, from approximately 12:41 UTC to 17:09 UTC, GitHub Actions jobs using Standard Ubuntu 22 and Ubuntu 24 hosted runners experienced run start delays. Approximately 8% of hosted runner jobs using Ubuntu 22 and Ubuntu 24 experienced delays greater than 5 minutes or failures. Larger and self-hosted runners were not impacted.

This was caused by a performance regression introduced in the VM reimage process. That reimage delay lowered the overall capacity of runners available to pick up new jobs. This was mitigated with a rollback to a known good image version.

We are addressing the core issue with reimage performance and improving the granularity of reimage telemetry across our services and our compute provider to more quickly diagnose similar issues in the future. Finally, we are evaluating other rollout changes to automatically detect similar regressions.
Apr 28, 13:59 - 17:09 UTC
GitHub search is degraded
On April 27, 2026 between 16:15 UTC and 22:46 UTC, GitHub search services experienced degraded connectivity due to saturation of the load balancing tier deployed in front of our search infrastructure. This resulted in intermittent failures for services relying on our search data including Issues, Pull Requests, Projects, Repositories, Actions, Package Registry and Dependabot Alerts. Impact varied by search target, with services seeing up to 65% of searches timing out or returning an error between 16:15 UTC and 18:00 UTC.

We detected the drop in search results through our ongoing monitoring and declared an incident at 16:21 UTC when we determined the issues would not self-heal. We tracked the incident as mitigated as of 21:33 UTC and monitored the systems until 22:46 UTC when we declared the incident resolved. Our existing monitoring did not classify the increased scraping as a risk and this dimension of the incident was only discovered while working to mitigate.

The saturation was caused by a large influx of anonymous distributed scraping traffic that was crafted to avoid our public API rate limits. This scraping traffic made up 30% of the day’s total search traffic, but it was concentrated within a four-hour period. The traffic originated from over 600,000 unique IP addresses, with matching actor information across the board.

To mitigate, we immediately focused on relieving pressure from the load balancers while simultaneously working on scaling the load balancing tier, blocking the anomalous traffic and applying tuning to the balancers to fully resolve the incident.

Looking ahead, we’ve not only scaled the load balancer tier, but also applied optimizations to improve our connection handling and reuse, reducing the possibility that a saturation event like this can recur. We’ve also added new monitors and controls within the platform to allow us to restrict anonymous traffic to mitigate the impact to our registered users.

Apr 27, 16:31 - 22:46 UTC
Disruption with some GitHub services
On April 22, 2026 from 18:49 to 19:32 UTC, the Copilot Cloud Agent service began failing during session execution for users running the Agent HQ Codex agent. Codex agent sessions failed to start for all entry points (issue assignment, @copilot comment mentions). 0.5% of total Copilot Cloud Agent jobs were impacted (~2,000 failed jobs). Copilot and other agent sessions were unaffected.

This was caused by a model resolution mismatch in Codex agent sessions, resulting in an incompatible model being used at runtime. A mitigation was deployed to select a stable default model for Codex agent sessions.

We are working to harden the underlying model-resolution path so it correctly scopes to the requesting agent's supported models, to prevent similar failure modes in the future.
Apr 27, 16:48 - 19:02 UTC
Delays with Actions Jobs for Larger Runners using VNet Injection in the East US region
On April 24, 2026, from approximately 11:39 UTC to April 25, 2026 at 00:15 UTC, GitHub Actions experienced delays and timeouts for Larger Hosted Runner jobs using VNet injection in the East US region without a failover region configured. Standard and Self-hosted runners were not impacted. This was caused by backend failures in our compute provider’s provisioning, scaling, and update operations for VMs in the East US region and mitigated by a rollback across all affected Availability Zones. More detail is available at https://azure.status.microsoft/en-us/status/history/?trackingId=5GP8-W0G.

We are working to improve the reliability of our annotations for jobs impacted by regional issues and are adding system log notifications as an additional customer communication channel alongside annotations.

VNet Failover is also now in public preview, allowing customers to evacuate Larger Hosted Runners using VNet injection in cases like this.
Apr 24, 19:02 - Apr 25, 00:36 UTC
Incident with Pull Requests
On April 23, 2026, between 16:05 UTC and 20:43 UTC, the Pull Requests service experienced a regression affecting merge queue operations. PRs merged via merge queue using the squash merge method produced incorrect merge commits when the merge group contained more than one PR. In affected cases, changes from previously merged PRs and prior commits were inadvertently reverted by subsequent merges.

During the impact window 2,092 pull requests were affected. The issue did not affect pull requests merged outside of merge queue, nor merge queue groups using the merge or rebase methods.
It took approximately 3 hours and 33 minutes to identify the issue. The change completed deployment at approximately 16:05 UTC, and we became aware at 19:38 UTC following an increase in customer support inquiries. Because the issue affected merge commit correctness rather than availability, it was not detected by existing automated monitoring and was identified through customer reports.

The regression was introduced by a new code path that adjusted merge base computation for merge queue ref updates. This code path was intended to be gated behind a feature flag for an unreleased feature, but the gating was incomplete.

As a result, the new behavior was inadvertently applied to squash merge groups, producing an incorrect three-way merge. This caused subsequent squash merges to revert changes from earlier pull requests and, in some cases, changes between their starting points.

We mitigated the incident by reverting the code change and force-deploying the fix across all environments. After resolution, we identified affected repositories and sent targeted remediation instructions to repository administrators with step-by-step recovery guidance.

The regression was not identified during internal validation. Existing test coverage primarily exercised single-PR merge queue groups, which did not exhibit the faulty base-reference calculation. Because automated checks did not validate merge correctness for multi-PR squash groups, the defect surfaced only in production.

To prevent recurrence, GitHub is expanding test coverage for merge correctness validation. We are broadening automated coverage for merge queue operations, including regression checks that validate resulting Git contents across supported configurations, so issues affecting merge correctness are caught before reaching production.

We are committed to ensuring the correctness and reliability of merge queue operations. These actions will reduce the risk of similar regressions and improve confidence in future changes to the Pull Requests service.
Apr 23, 19:50 - 21:43 UTC
Disruption with users unable to start Claude and Codex agent task from the web
Between 18:45 and 19:42 UTC on April 23, users were unable to start new agent tasks using either Claude or Codex agent on github.com. This was caused by a code change to how Copilot mission control routes task creation requests. Ongoing agent tasks and other Copilot agent features were not affected. We mitigated the impact by reverting the breaking change. We are adding extra monitoring and integration test coverage for the task creation path to prevent future recurrence.
Apr 23, 19:28 - 19:42 UTC
Incident with multiple GitHub services
On April 23, 2026, between 16:03 UTC and 17:27 UTC, multiple GitHub services experienced elevated error rates and degraded performance due to DNS resolution failures originating from our DNS infrastructure in our VA3 datacenter. Approximately 5–7% of overall traffic was affected during the impact window:

- Webhooks: ~0.35% of API requests returned 5xx (peak ~0.39%). ~0.88% of requests exceeded 3s latency; at peak, >3s responses represented ~10% of Webhooks API traffic.

- Copilot Metrics: ~9% of Copilot Insights dashboard requests returned 5xx.

- Copilot cloud agents: ~10% of cloud agent sessions were affected and failing.

- Octoshift: 0.88% of active repo migrations failed and 79% saw elevated durations (avg. 5.2 min) during this period.

- Git Operations: averaged 1.25% errors over the duration of the incident, with a peak of 2.07% errors.

- Actions: Workflow run status updates experienced delays of up to ~8s over the duration of the incident window.

Our DNS infrastructure in VA3 entered a degraded state and began intermittently returning NXDOMAIN responses and timing out on lookups for both internal service discovery and external endpoints. This caused a cascading impact across the dependent services listed above.

We identified a specific load pattern under which our DNS resolvers began failing. The evidence points to a recently introduced traffic-balancing mechanism, rolled out progressively to support our growth, as the root cause. We have since reverted this change.

We are immediately prioritizing investments in a more controlled rollout and validation process, including a dedicated environment to safely shadow production DNS traffic and detect these failure modes before they can affect production.
Apr 23, 16:12 - 17:30 UTC
Investigating errors on GitHub
On April 23, 2026 between 14:30 UTC and 15:18 UTC multiple services were degraded on github.com. During this time approximately 1.5% of all web requests resulted in a 5xx status and unicorn pages for github.com users. We also saw elevated error rates across Actions workflow runs, Copilot, Codespaces and Packages, leading to degraded experiences during this timeframe. Codespaces impact peaked at 45% failures for create requests and 65% failures for resume requests. Packages impact was mainly Maven related with 50% failure rates in downloads and 70% failure rates in uploads. Actions experienced a peak of 8% of failed jobs and up to 85% of jobs impacted by run start delays of more than 5 minutes.

This was due to a configuration change to an internal billing service that led to a cache being overwhelmed and causing requests to time out. These timeouts cascaded across multiple services and eventually caused requests to queue up and exhaust web request workers.

This configuration change was reverted at 14:42 UTC and following this, all services began to see recovery immediately.

To prevent this situation in the future, we are taking steps to ensure that failures and timeouts in the billing service don’t cascade to other services causing impact. This includes implementing more aggressive timeouts on callers of these billing services, adding circuit breaker configurations for cache timeouts and using more resilient cache options. We have also decreased max request timeouts within the billing service that caused impact and added more capacity to our cache to prevent traffic spikes from having the same impact.
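
A circuit breaker of the kind mentioned above can be sketched as a small wrapper that stops calling a failing dependency for a while and returns a fallback instead. This is a generic, minimal version with assumed thresholds, not the configuration used by the billing service.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then allow a trial call after a pause."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the dependency entirely
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # success closes the circuit again
        return result


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
attempts = []


def cache_lookup():
    attempts.append(now[0])
    raise TimeoutError("cache timed out")


breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: now[0])
results = [breaker.call(cache_lookup, lambda: "default") for _ in range(3)]
now[0] = 11.0
results.append(breaker.call(lambda: "fresh value", lambda: "default"))
print(results, len(attempts))
```

In the demo the third call never reaches the cache because the circuit is already open, so a slow dependency stops consuming request workers while it recovers.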
Apr 23, 14:40 - 15:18 UTC
Disruption with some GitHub services
On April 22, 2026, between 09:00 UTC and 22:05 UTC, the Copilot coding agent and pull request comment event processing were degraded. During this period, approximately 0.5% of all pull request and issue comments (~23,000 invocations) mentioned @copilot and explicitly requested work from the Copilot coding agent, but were not acted upon.

Creating, viewing, and replying to pull request comments was unaffected, and other Copilot functionality continued to operate normally. The impact was limited to @copilot mentions on pull request comments not triggering Copilot coding agent runs, and to some downstream systems not receiving new pull request comment events during the impact window.

The cause was a serialization error that prevented pull request comment events from being published to downstream consumers, including the Copilot coding agent. This was related to the same class of issue as incident #4295 on April 20, affecting another event type.

We mitigated the incident by deploying a fix that restored event publishing, after which the Copilot coding agent and other downstream consumers resumed processing pull request comment events normally.

We are working to complete our audit of related event schemas, migrate remaining consumers to use the updated identifier fields, and improve monitoring to detect drops in publishing on critical event topics, to reduce our time to detection and mitigation of issues like this one in the future.
Apr 22, 19:53 - 22:43 UTC
Disruption with Copilot chat and Copilot Coding Agent
On April 22, 2026, between 15:16 UTC and 19:18 UTC, users experienced errors when interacting with Copilot Chat on github.com and Copilot Cloud Agent. During this time, users were unable to use Copilot Chat or Copilot Cloud Agent. Copilot Memory (in preview) was not available to Copilot agent sessions during this time. The issue was caused by an infrastructure configuration change that resulted in connectivity issues with our databases. The team identified the cause and restored connectivity to the database. Copilot Chat and Cloud Agent for github.com were restored by 18:16 UTC. Remaining regional deployments were restored incrementally, with full resolution at 19:18 UTC. We have taken steps to prevent similar infrastructure changes from causing these kinds of database connectivity issues in the future.
Apr 22, 15:35 - 19:18 UTC
Disruption with projects service
On April 21, 2026, between 13:35 UTC and 01:24 UTC the following day, the projects service was degraded. During this time period, projects may have been out of sync and users may have experienced delays in changes to projects and their items. Delays in reflected changes peaked at approximately 45 minutes. The delays were caused by serialization errors that failed events and triggered a flood of resyncs, overloading our event processing layers.

We mitigated the incident by speeding up processing time for incoming changes and otherwise waiting for all changes to be processed.

We are working to increase our capacity for processing updates to projects to reduce our time to mitigation of issues like this one in the future.
Apr 21, 15:03 - Apr 22, 01:24 UTC
Partial degradation for code scanning default setup and for code quality
On April 20, 2026 between 10:28 UTC and 15:04 UTC GitHub experienced degraded service for code scanning default setup, code quality, and project boards. Repair of affected project boards additionally lasted until April 21, 05:04 UTC.

During this time, code scanning default setup and code quality analyses were not triggered on newly opened pull requests. Additionally, newly created issues were not appearing on project boards.

The cause was a serialization error that prevented proper triggering of code scanning, code quality analyses, and project board updates.

We mitigated the issue by deploying a fix, restoring event publishing for code scanning and code quality. For project boards, an additional code change was deployed to update event consumers, followed by a reindex of affected project items.

We are working to prevent recurrence by strengthening our schema validations and improving monitoring for drops in publishing on critical hydro topics.
Apr 20, 13:28 - Apr 21, 05:04 UTC
Disruption with some GitHub services
On April 17, 2026, between 14:46 UTC and 15:12 UTC, users experienced a degraded web experience on GitHub.com. During this time, approximately 1.5% of web requests resulted in errors, with some users encountering slow page loads or failed requests. The issue was caused by capacity saturation of a caching component in one of our data center regions. We mitigated the issue by redirecting traffic to an unaffected region and rolling back a recent deployment. The incident was fully resolved at 15:18 UTC. We are taking steps to provide appropriate capacity for this caching path to prevent recurrence.
Apr 17, 14:56 - 15:18 UTC
Incident with Codespaces
On April 16, 2026 between 09:30 UTC and 17:15 UTC, users experienced failures when attempting to connect to GitHub Codespaces via the VS Code editor. During this time, approximately 40% of codespace start operations failed. Users connecting via SSH were not impacted.

The issue was caused by a failure in an upstream download service that prevented the VS Code Server from being retrieved during codespace startup. The impact was mitigated by implementing a workaround to use an alternative download path when the primary endpoint is degraded.

We are working with the upstream dependency to address the root cause of the download service failure, and we are improving our fallback mechanisms to reduce the impact of similar upstream failures in the future.
Apr 16, 15:06 - 18:28 UTC
Disruption with some GitHub services
On April 14, between 00:58 UTC and 06:08 UTC, GitHub Enterprise Cloud customers experienced 500 errors when attempting to access Copilot Insights pages which was caused by an authentication failure in our metrics pipeline. We fully mitigated the issue and validated the fix in production. Approximately 709 users were impacted. The total impact duration was approximately 5 hours and 10 minutes.

Our investigation determined the incident was caused by a change in a tenant credential, which caused authentication errors when retrieving the data required by our Copilot Insights pages.

We understand this disruption impacted customers' ability to access the Copilot Insights page. To prevent similar issues and reduce resolution time in the future, we are investing in improved diagnostics tooling to quickly identify the root cause of failures, enhanced monitoring, and alerting to detect issues at a more granular level.

GitHub is critical infrastructure for your work, your teams, and your businesses. We are focused on these remediations and continued reliability improvements for Copilot Insights and related metrics experiences.
Apr 14, 01:57 - 06:08 UTC
Incident with Pages
On Monday April 13th, 2026, between 18:53 UTC and 20:30 UTC, the GitHub Pages service experienced elevated error rates. On average, the error rate was 10.58% and peaked at 12.77% of requests to the service, resulting in approximately 17.5 million failed requests returning HTTP 500 errors. This was due to an automated DNS management tool (octodns) erroneously deleting a DNS record for a Pages backend storage host after its upstream data source intermittently failed to return the record, causing the tool to treat it as stale and remove it.

We mitigated the incident by re-creating the deleted DNS record. To prevent future incidents, we are implementing availability-zone-tolerant routing in the Pages frontend so that an unresolvable backend host triggers failover to healthy hosts rather than returning errors, adding safeguards to prevent automated deletion of DNS records owned by other systems, and improving logging and alerting for DNS resolution failures in the Pages serving path.
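One possible shape for a deletion safeguard is sketched below. It is not octodns code or configuration: the `Record` type, the `owner` tag, and the rule that a record must be missing from several consecutive sync runs before it is removed are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Record:
    name: str
    owner: str  # hypothetical tag naming the system that manages the record


def records_to_delete(existing: Iterable[Record],
                      source_names: Set[str],
                      missing_counts: Dict[str, int],
                      tool_owner: str,
                      min_missing_runs: int = 3) -> List[Record]:
    """Return records that are safe to delete on this sync run.

    A record qualifies only if this tool owns it AND the upstream source has
    omitted it for `min_missing_runs` consecutive runs, so a single
    intermittent upstream failure cannot trigger a deletion.
    """
    deletions: List[Record] = []
    for record in existing:
        if record.name in source_names:
            missing_counts.pop(record.name, None)  # seen again: reset the streak
            continue
        if record.owner != tool_owner:
            continue  # never delete records owned by other systems
        missing_counts[record.name] = missing_counts.get(record.name, 0) + 1
        if missing_counts[record.name] >= min_missing_runs:
            deletions.append(record)
    return deletions
```

The caller keeps `missing_counts` between runs; how that state would be persisted is left open here.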
Apr 13, 19:56 - 20:35 UTC
Disruption with some GitHub services
On April 13, 2026, between 14:41 UTC and 17:29 UTC, the Copilot service experienced degraded performance. All Copilot users were impacted by increased latency, and approximately 20% experienced request failures when interacting with Copilot Cloud Agent (CCA). On average, request latency increased to approximately 950ms. The GitHub User Dashboard also displayed intermittent errors loading Copilot quota information. CCA and the User Dashboard were impacted for approximately 2 hours and 56 minutes.

This was due to an infrastructure change that reduced the available compute capacity for a backend service responsible for Copilot rate limiting and quota management. The reduced capacity caused resource exhaustion under normal traffic load, leading to cascading failures in downstream request processing.

We mitigated the incident by increasing compute resources allocated to the affected service and scaling out the number of service instances to distribute load more effectively.

We are working to improve proactive capacity monitoring to detect resource degradation before it impacts users, reviewing retry and timeout configurations across dependent services to reduce amplification during degraded states, and evaluating connection management strategies to improve resilience under constrained resources.
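For readers unfamiliar with retry amplification, one common way to limit it is capped exponential backoff with random jitter. The sketch below shows that general technique; the parameter values are arbitrary assumptions and do not reflect the retry configuration of any GitHub service.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int,
                   base: float = 0.1,
                   cap: float = 5.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one randomized delay (in seconds) per retry attempt.

    The upper bound doubles with every attempt up to `cap`, and the actual
    delay is drawn uniformly below that bound so that many clients retrying
    at once do not hit a struggling service in lockstep.
    """
    if rng is None:
        rng = random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** attempt))
            for attempt in range(attempts)]


delays = backoff_delays(6, rng=random.Random(0))
```

A client would sleep for `delays[i]` before retry `i` and give up once the list is exhausted.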
Apr 13, 16:41 - 17:40 UTC
Problems with third-party Claude and Codex Agent sessions not being listed in the agents tab dashboard
On April 9, 2026, between 22:59 UTC and April 10, 2026, 13:24 UTC, the Copilot Mission Control service was degraded and did not display Claude and Codex Cloud Agent sessions in the agents tab dashboard. Customers were unable to see, list, or manage their third party agent sessions during this period. The underlying agent sessions continued to function normally. This was a visibility and management issue only, and no HTTP errors were generated. The API returned successful responses with incomplete results, with an average error rate of 0% and a maximum error rate of 0%. This was due to a code change that introduced a filter which inadvertently excluded third party agent sessions.

We mitigated the incident by reverting the problematic code change and deploying the fix to production.

We are working to add automated monitoring for dashboard content visibility and improve integration test coverage for third party agent session listing to reduce our time to detection and mitigation of issues like this one in the future.
Apr 10, 13:07 - 13:28 UTC
Disruption with some GitHub services
On April 9, 2026, between 16:05 UTC and 20:36 UTC, the Copilot cloud agent service was degraded, causing new agent sessions to be delayed or fail to start. Users who attempted to start Copilot cloud agent sessions during this period experienced jobs getting stuck in the queue, with wait times peaking at 54 minutes compared to the normal 15–40 seconds. On average, approximately 84% of requests to start agent sessions failed, peaking at 97.5% during the worst period.

This was due to an internal service exceeding API rate limits, compounded by a caching bug that persisted the rate-limited state beyond the actual rate limit window, causing recurring outage waves rather than a single recovery.

We mitigated the incident by deploying a configuration change to bypass the affected cache and shifting API traffic to an alternative authentication path that reduced rate limit exposure. We have since added automated monitoring and alerting for this failure mode, deployed per-endpoint rate limit controls, and added caching for high-traffic API calls to reduce overall load. We are also working on longer-term improvements to rate limit isolation and traffic management to prevent similar issues in the future.

This incident shared the same underlying root causes with an incident declared in the same time frame: https://www.githubstatus.com/incidents/zn1t56bfxdzg
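The caching failure mode described above, where a cached "rate limited" flag outlives the real rate limit window, can be avoided by storing an expiry alongside the flag. This is a minimal sketch under that assumption; the class name and the injectable clock are hypothetical and are not taken from the affected service.

```python
import time
from typing import Callable, Dict


class RateLimitedState:
    """Remembers which keys are rate limited, but never past the stated window."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._blocked_until: Dict[str, float] = {}

    def mark_limited(self, key: str, window_seconds: float) -> None:
        """Record that `key` is limited for the next `window_seconds`."""
        self._blocked_until[key] = self._clock() + window_seconds

    def is_limited(self, key: str) -> bool:
        """Return True only while the recorded window is still open."""
        deadline = self._blocked_until.get(key)
        if deadline is None:
            return False
        if self._clock() >= deadline:
            del self._blocked_until[key]  # window over: drop the stale entry
            return False
        return True
```

Passing the clock in as a parameter makes the expiry behavior easy to exercise in tests without sleeping.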
Apr 9, 16:20 - 20:36 UTC
Disruption with some GitHub services
On April 9, 2026, between 09:05 UTC and 19:05 UTC, the Copilot coding agent service was degraded and users experienced significant delays starting new agent sessions. Approximately 84% of new agent session requests were delayed across four separate outage waves, with queue wait times peaking at 54 minutes compared to a normal baseline of 15–40 seconds. On average, the error rate was 83.9% and peaked at 97.5% of requests to the service. Approximately 22,700 workflow creations were delayed or failed during the incident.

This was due to a bug in our rate limiting logic that incorrectly applied a rate limit globally across all users, rather than scoping it to the individual installation that triggered the limit. A contributing factor was a surge in API traffic from a client update that increased requests to an internal endpoint by 3–4x, which accelerated rate limit exhaustion.

We mitigated the incident by disabling the faulty rate limit caching mechanism via feature flag and updating our service to use per-installation credentials for API calls, ensuring rate limits are correctly scoped to individual installations.

We have since added automated monitoring and alerting to detect this failure mode proactively, deployed fixes to reduce unnecessary API traffic through caching improvements, and are continuing work to further isolate rate limit scoping across client types to prevent similar issues in the future.

This incident shared the same underlying root causes with an incident declared in the same time frame: https://www.githubstatus.com/incidents/2rqwxl8y7m0j
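The difference between a globally applied limit and a per-installation limit comes down to what the counter is keyed on. The sketch below uses a deliberately simplified single-window counter; `ScopedLimiter` and the installation IDs are hypothetical names for illustration.

```python
from collections import defaultdict
from typing import DefaultDict


class ScopedLimiter:
    """Counts requests per installation so one caller cannot exhaust another's budget."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        # One counter per installation ID rather than a single shared counter.
        self.counts: DefaultDict[str, int] = defaultdict(int)

    def allow(self, installation_id: str) -> bool:
        """Admit the request if this installation is still under its own limit."""
        if self.counts[installation_id] >= self.limit:
            return False
        self.counts[installation_id] += 1
        return True


limiter = ScopedLimiter(limit=2)
busy = [limiter.allow("installation-a") for _ in range(3)]
quiet = limiter.allow("installation-b")
```

Here the third request from `installation-a` is rejected while `installation-b` is still admitted. With a single shared counter, the busy installation would have blocked both.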
Apr 9, 09:50 - 10:15 UTC
Disruption with GitHub notifications
On April 9, 2026, between 03:22 UTC and 04:49 UTC, GitHub Notifications experienced degraded availability. During this time, approximately 45% of requests to the notifications service returned errors, with a peak error rate of approximately 54%, preventing affected users from successfully viewing or interacting with their notifications. The issue was identified and resolved, restoring the service to full availability.

We are working to improve our metrics to reduce time to detection and mitigation for similar issues in the future.
Apr 9, 04:42 - 04:57 UTC
Disruption with some GitHub services
Between 15:20 and 20:18 UTC on Thursday April 2, Copilot Cloud Agent entered a period of reduced performance. Due to an internal feature being developed for Copilot Code Review, the Copilot Cloud Agent infrastructure started to receive an increased number of jobs. This load eventually caused us to hit an internal rate limit, causing all work to suspend for an hour. During this hour, some new jobs would time out, while others would resume once rate limiting ended. Roughly 40% of jobs in this period were affected.

Once the cause of this rate limiting was identified, we were able to disable the new CCR feature via a feature flag. Once the jobs that were already in the queue were able to clear, we didn't see additional instances of rate limiting afterwards.
Apr 2, 17:49 - 21:48 UTC
Copilot Coding Agent failing to start some jobs
Between 15:20 and 20:18 UTC on Thursday April 2, Copilot Cloud Agent entered a period of reduced performance. Due to an internal feature being developed for Copilot Code Review, the Copilot Cloud Agent infrastructure started to receive an increased number of jobs. This load eventually caused us to hit an internal rate limit, causing all work to suspend for an hour. During this hour, some new jobs would time out, while others would resume once rate limiting ended. Roughly 40% of jobs in this period were affected.

Once the cause of this rate limiting was identified, we were able to disable the new CCR feature via a feature flag. Once the jobs that were already in the queue were able to clear, we didn't see additional instances of rate limiting afterwards.

This was the same incident declared in https://www.githubstatus.com/incidents/d96l71t3h63k
Apr 2, 16:18 - 16:30 UTC
Disruption with GitHub's code search
On April 1st, 2026 between 14:40 and 17:00 UTC the GitHub code search service had an outage which resulted in users being unable to perform searches.

The issue was initially caused by an upgrade to the code search Kafka cluster ZooKeeper instances which caused a loss of quorum. This resulted in application-level data inconsistencies which required the index to be reset to a point in time before the loss of quorum occurred. Meanwhile, an accidental deploy resulted in query services losing their shard-to-host mappings, which are typically propagated by Kafka.

We remediated the problem by performing rolling restarts in the Kafka cluster, allowing quorum to be reestablished. From there we were able to reset our index to a point in time before the inconsistencies occurred.

The team is working on ways to improve our time to respond and mitigate issues relating to Kafka in the future.
Apr 1, 15:02 - 23:45 UTC
GitHub audit logs are unavailable
On April 1, 2026, between 15:34 UTC and 16:02 UTC, our audit log service lost connectivity to its backing data store due to a failed credential rotation. During this 28-minute window, audit log history was unavailable via both the API and web UI. This resulted in 5xx errors for 4,297 API actors and 127 github.com users. Additionally, events created during this window were delayed by up to 29 minutes in github.com and event streaming. No audit log events were lost; all audit log events were ultimately written and streamed successfully. Customers using GitHub Enterprise Cloud with data residency were not impacted by this incident.

We were alerted to the infrastructure failure at 15:40 UTC — six minutes after onset — and resolved the issue by recycling the affected environment, restoring full service by 16:02 UTC. We are conducting a thorough review of our credential rotation process to strengthen its resiliency and prevent recurrence. In parallel, we are strengthening our monitoring capabilities to ensure faster detection and earlier visibility into similar issues going forward.
Apr 1, 16:06 - 16:10 UTC
Incident with Copilot
On April 1, 2026, between 07:29 and 12:41 UTC, some customers experienced elevated 5xx errors and increased latency when using GitHub Copilot features that rely on `/agents/sessions` endpoints (including creating or viewing agent sessions). The issue was caused by resource exhaustion in one of the Copilot backend services handling these requests, in turn, causing timeouts and failed requests. We mitigated the incident by increasing the service’s available compute resources and tuning its runtime concurrency settings. Service health returned to normal and the incident was fully resolved by 12:41 UTC.
Apr 1, 09:58 - 12:41 UTC

March 2026

Incident with Pull Requests: High percentage of 500s
On Tuesday March 31st, 2026, between 13:53 UTC and 21:23 UTC the Pull Requests service experienced elevated latency and failures. On average, the error rate was 0.15% and peaked at 0.28% of requests to the service. This was due to a change in garbage collection (GC) settings for a Go-based internal service that provides access to Git repository data. The changes caused more frequent GC activity and elevated CPU consumption on a subset of storage nodes, increasing latency and failure rates for some internal API operations.

We mitigated the incident by reverting the GC changes. To prevent future incidents and improve time to detection and mitigation, we are instrumenting additional metrics and alerting for GC-related behavior, improving our visibility into other signals that could cause degraded impact of this type, and updating our best practices and standards for garbage collection in Go-based services.
Mar 31, 15:05 - 21:23 UTC
Issues with metered billing report generation
On March 31, 2026, between 06:15 UTC and 15:30 UTC, the GitHub billing usage reports feature was degraded due to reduced server capacity. Customers requesting billing usage reports and loading the top usage by organization and repository on the billing overview and usage pages were impacted. The average error rate for usage report requests was 15%, peaking at 98% over an eight-minute window. For the billing pages, an average of 56% of requests failed to load the top usage cards. The root cause was an increase in billing usage report requests with large datasets, which exhausted the capacity of the nodes responsible for reporting data. There was no impact on billing charges.

We mitigated the incident by adjusting our auto-scaling thresholds to better meet our capacity needs. We are working to improve our metrics to reduce time to detection and mitigation for similar issues in the future.
Mar 31, 13:47 - 15:10 UTC
Elevated delays in Actions workflow runs and Pull Request status updates
On March 30, 2026, between 10:11 UTC and 13:25 UTC, GitHub Actions experienced degraded performance. During this time, approximately 2.65% of workflow jobs triggered by pull request events experienced start delays exceeding 5 minutes. The issue was caused by replication lag on an internal database cluster used by Actions, which triggered write throttling in our database protection layer and slowed job queue processing.

The replication lag originated from planned maintenance to scale the internal database. Newly added database hosts triggered guardrails in the throttling layer, restricting write throughput. The incident was mitigated by excluding the new hosts from replication delay calculations.

To prevent recurrence, we have updated our maintenance procedures to ensure new hosts are excluded from throttling assessments during scaling operations. Additionally, we are investing in automation to streamline this type of maintenance activity.
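The mitigation described above, excluding newly added hosts from the replication delay calculation, can be pictured roughly as follows. The host names, the lag values, and the one-second threshold are made up for the example and are not the values used by the database protection layer.

```python
from typing import Mapping, Set


def max_replication_lag(lag_by_host: Mapping[str, float],
                        excluded_hosts: Set[str]) -> float:
    """Return the worst lag (seconds) among hosts that count toward throttling."""
    lags = [lag for host, lag in lag_by_host.items()
            if host not in excluded_hosts]
    return max(lags, default=0.0)


def should_throttle(lag_by_host: Mapping[str, float],
                    excluded_hosts: Set[str],
                    threshold_seconds: float = 1.0) -> bool:
    """Throttle writes only when an in-service replica is lagging."""
    return max_replication_lag(lag_by_host, excluded_hosts) > threshold_seconds


lags = {"replica-1": 0.2, "replica-2": 0.4, "new-replica-3": 950.0}
throttle_naive = should_throttle(lags, excluded_hosts=set())
throttle_fixed = should_throttle(lags, excluded_hosts={"new-replica-3"})
```

A freshly provisioned replica that is still catching up dominates the naive calculation, while the version with exclusions only reacts to hosts that are actually serving traffic.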
Mar 30, 13:02 - 13:25 UTC
Incident with Copilot
On March 27, 2026, from 02:30 to 04:56 UTC, a misconfiguration in our rate limiting system caused users on Copilot Free, Student, Pro, and Pro+ plans to experience unexpected rate limit errors. The configuration that was incorrectly applied was intended solely for internal staff testing of rate-limiting experiences. Copilot Business and Copilot Enterprise accounts were not affected. During this period, affected users received error messages instructing them to retry after a certain time. Approximately 32% of active Free users, 35% of active Student users, 46% of active Pro users, and 66% of active Pro+ users were affected. After identifying the root cause, we reverted the change and restored the expected rate limits. We are reviewing our deployment and validation processes to help ensure configurations used for internal testing cannot be inadvertently applied to production environments.
Mar 27, 05:00 - 05:00 UTC
Disruption with some GitHub services
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
Mar 24, 20:18 - 20:56 UTC
Teams Github Notifications App is down
On March 24, 2026, between 15:57 UTC and 19:51 UTC, the Microsoft Teams Integration and Teams Copilot Integration services were degraded and unable to deliver GitHub event notifications to Microsoft Teams. On average, the error rate was 37.4% and peaked at 90.1% of requests to the service. Approximately 19% of all integration installs failed to receive GitHub-to-Teams notifications in this time period.

This was due to an outage at one of our upstream dependencies, which caused HTTP 500 errors and connection resets for our Teams integration.

We coordinated with the relevant service teams, and the issue was resolved at 19:51 UTC when the upstream incident was mitigated.

We are working to update observability and runbooks to reduce time to mitigation for issues like this in the future.
Mar 24, 16:59 - 19:51 UTC
Disruption with some GitHub services
On March 22, 2026, between 09:05 UTC and 10:02 UTC, users may have experienced intermittent errors and increased latency when performing Git HTTP read operations. On average, the error rate was 3.84% and peaked at 15.55% of requests to the service. The issue was caused by elevated latency in an internal authentication service within one of our regional clusters. We mitigated the issue by redirecting traffic away from the affected cluster at 09:39 UTC, after which error rates returned to normal. The incident was fully resolved at 10:02 UTC.

We are working to scale the authentication service and reduce our time to detection and mitigation of issues like this one in the future.
Mar 22, 09:08 - 10:02 UTC
Disruption with Copilot Coding Agent Sessions
On March 19, 2026, between 01:05 UTC and 02:52 UTC, and again on March 20, 2026, between 00:42 UTC and 01:58 UTC, the Copilot Coding Agent service was degraded and users were unable to start new Copilot Agent sessions or view existing ones. During the first incident, the average error rate was ~53% and peaked at ~93% of requests to the service. During the second incident, the average error rate was ~99% and peaked at ~100% of requests with significant retry amplification. Both incidents were caused by the same underlying system authentication issue that prevented the service from connecting to its backing datastore.

We mitigated each incident by rotating the affected credentials, which restored connectivity and returned error rates to normal. The mitigation time was 01:24. The second occurrence was due to an incomplete remediation of the first.

We are implementing automated monitoring for credential lifecycle events and improving operational processes to reduce our time to detection and mitigation of issues like this one in the future.
Mar 20, 00:58 - 01:58 UTC
Git operations for users in the west coast are experiencing an increase in latency
On March 19, 2026 between 16:10 UTC and 00:05 UTC (March 20), Git operations (clone, fetch, push) from the US west coast experienced elevated latency and degraded throughput. Users reported clone speeds dropping from typical speeds to under 1 MiB/s in extreme cases. The root cause was network transport link saturation at our Seattle edge site, where a fiber cut affecting our backbone transport resulted in saturation and packet loss. We had a planned scale-up in progress for the site that was accelerated to resolve the backbone capacity pressure. We also brought online additional edge capacity in a cloud region and redirected some users there. Current scale with the upgraded network capacity is sufficient to prevent reoccurrence, as we upgraded from 800Gbps to 3.2Tbps total capacity on this path. We will continue to monitor network health and respond to any further issues.
Mar 19, 16:25 - Mar 20, 00:05 UTC
Issues with Copilot Coding Agent
On March 19, 2026, between 01:05 UTC and 02:52 UTC, and again on March 20, 2026, between 00:42 UTC and 01:58 UTC, the Copilot Coding Agent service was degraded and users were unable to start new Copilot Agent sessions or view existing ones. During the first incident, the average error rate was ~53% and peaked at ~93% of requests to the service. During the second incident, the average error rate was ~99% and peaked at ~100% of requests with significant retry amplification. Both incidents were caused by the same underlying system authentication issue that prevented the service from connecting to its backing datastore.

We mitigated each incident by rotating the affected credentials, which restored connectivity and returned error rates to normal. The mitigation time was 01:24. The second occurrence was due to an incomplete remediation of the first.

We are implementing automated monitoring for credential lifecycle events and improving operational processes to reduce our time to detection and mitigation of issues like this one in the future.
Mar 19, 13:44 - 14:32 UTC
Disruption with Copilot Coding Agent sessions
On March 19, 2026, between 01:05 UTC and 02:52 UTC, and again on March 20, 2026, between 00:42 UTC and 01:58 UTC, the Copilot Coding Agent service was degraded and users were unable to start new Copilot Agent sessions or view existing ones. During the first incident, the average error rate was ~53% and peaked at ~93% of requests to the service. During the second incident, the average error rate was ~99% and peaked at ~100% of requests with significant retry amplification. Both incidents were caused by the same underlying system authentication issue that prevented the service from connecting to its backing datastore.

We mitigated each incident by rotating the affected credentials, which restored connectivity and returned error rates to normal. The mitigation time was 01:24. The second occurrence was due to an incomplete remediation of the first.

We are implementing automated monitoring for credential lifecycle events and improving operational processes to reduce our time to detection and mitigation of issues like this one in the future.
Mar 19, 02:05 - 02:52 UTC
Disruption with some GitHub services
On March 19, 2026 between 16:10 UTC and 00:05 UTC (March 20), Git operations (clone, fetch, push) from the US west coast experienced elevated latency and degraded throughput. Users reported clone speeds dropping from typical speeds to under 1 MiB/s in extreme cases. The root cause was network transport link saturation at our Seattle edge site, where a fiber cut affecting our backbone transport resulted in saturation and packet loss. We had a planned scale-up in progress for the site that was accelerated to resolve the backbone capacity pressure. We also brought online additional edge capacity in a cloud region and redirected some users there. Current scale with the upgraded network capacity is sufficient to prevent reoccurrence, as we upgraded from 800Gbps to 3.2Tbps total capacity on this path. We will continue to monitor network health and respond to any further issues.

This was the same incident declared in https://www.githubstatus.com/incidents/xs6xtcv196g7
Mar 18, 22:36 - Mar 19, 01:44 UTC
Webhook delivery is delayed
On March 18, 2026, between 18:18 UTC and 19:46 UTC all webhook deliveries experienced elevated latency. During this time, average delivery latency increased from a baseline of approximately 5 seconds to a peak of approximately 160 seconds. This was due to resource constraints in the webhook delivery pipeline, which caused queue backlog growth and increased delivery latency. We mitigated the incident by shifting traffic and adding capacity, after which webhook delivery latency returned to normal. We are working to improve capacity management and detection in the webhook delivery pipeline to help prevent similar issues in the future.
Mar 18, 18:51 - 19:46 UTC
Errors starting and connecting to Codespaces
On 16 March 2026, between 14:16 UTC and 15:18 UTC, Codespaces users encountered a download failure error message when starting newly created or resumed codespaces. At peak, 96% of the created or resumed codespaces were impacted. Active codespaces with a running VS Code environment were not affected.

The error was a result of an API deployment issue with our VS Code remote experience dependency and was resolved by rolling back that deployment. We are working with our partners to reduce our incident engagement time, detect issues earlier, before they impact our customers, and ensure safe rollout of similar changes in the future.
Mar 16, 15:01 - 15:28 UTC
Degraded performance for various services
On March 13, 2026, between 13:35 UTC and 16:02 UTC, a configuration change to an internal authorization service reduced its processing capacity below what was needed during peak traffic. This caused intermittent timeouts when other GitHub services checked user permissions, resulting in four to five waves of errors over roughly two hours and forty minutes. In total, 0.4% of users were denied access to actions they were authorized to perform.

The root cause was a resource right-sizing change deployed to the authorization service the previous day. It reduced CPU allocation below what was required at peak, causing the service's network gateway to throttle under load. Because the change was deployed after peak traffic on March 12, the reduced capacity wasn't surfaced until the next day's peak.

The incident was mitigated by manually scaling up the authorization service and reverting the configuration change.

To prevent recurrence, we are adding further resource utilization monitors across our entire stack to detect throttling and improving error handling so transient infrastructure timeouts are distinguished from authorization failures, enabling quicker detection of the root issue.
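The distinction drawn above between a transient infrastructure timeout and a genuine authorization failure can be expressed as a three-valued result instead of a boolean. This is a sketch of that general pattern only; the `Decision` enum and the `lookup` callable are hypothetical and do not describe the authorization service's API.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOWED = "allowed"
    DENIED = "denied"            # the service answered: user lacks permission
    UNAVAILABLE = "unavailable"  # the service did not answer in time


def check_permission(lookup: Callable[[str, str], bool],
                     user: str, action: str) -> Decision:
    """Map a permission lookup onto three outcomes instead of two."""
    try:
        return Decision.ALLOWED if lookup(user, action) else Decision.DENIED
    except TimeoutError:
        # A timeout says nothing about the user's permissions, so report it
        # separately; callers can retry or alert instead of denying access.
        return Decision.UNAVAILABLE


def timing_out_lookup(user: str, action: str) -> bool:
    """Stand-in for a throttled backend that never answers in time."""
    raise TimeoutError("authorization backend timed out")


result = check_permission(timing_out_lookup, "octocat", "push")
```

Surfacing `UNAVAILABLE` as its own outcome also gives monitoring a signal that points at the infrastructure rather than at user permissions.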
Mar 13, 15:12 - 16:15 UTC
Degraded Codespaces experience
On March 12, 2026, between 01:00 UTC and 18:53 UTC, users saw failures downloading extensions within created or resumed codespaces. Users would see an error when attempting to use an extension within VS Code. Active codespaces with extensions already downloaded were not impacted.

The extension download failures were the result of a change introduced in our extension dependency and were resolved by updating the configuration of how those changes affect requests from Codespaces. We are enhancing observability and alerting of critical issues within regular codespace operations to better detect and mitigate similar issues in the future.
Mar 12, 13:06 - 18:53 UTC
Actions failures to download (401 Unauthorized)
On March 12, 2026 between 02:30 and 06:02 UTC some GitHub Apps were unable to mint server-to-server tokens, resulting in 401 Unauthorized errors. During the outage window, ~1.3% of requests incorrectly resulted in 401 errors. This manifested in GitHub Actions jobs failing to download tarballs, as well as failing to mint fine-grained tokens. During this period, approximately 5% of Actions jobs were impacted.

The root cause was a failure in the authentication service’s token cache layer, a newly created secondary cache layer backed by Redis. The failure was caused by Kubernetes control plane instability and led to an inability to read certain tokens, which resulted in 401 errors. The mitigation was to fall back reads to the primary cache layer backed by MySQL. As permanent mitigations, we have changed how we deploy Redis so that it does not rely on the Kubernetes control plane and maintains service availability during similar failure modes. We also improved alerting to reduce overall impact time from similar failures.
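The fallback read path can be sketched as below. This is an assumption-laden illustration, not the authentication service's code: `secondary_get` and `primary_get` stand in for the Redis-backed and MySQL-backed lookups, and treating a connection failure like a cache miss is a design choice made for the example.

```python
from typing import Callable, Optional

Getter = Callable[[str], Optional[str]]


def lookup_token(token_id: str,
                 secondary_get: Getter,
                 primary_get: Getter) -> Optional[str]:
    """Read from the secondary cache first, falling back to the primary layer.

    A failing cache is treated like a miss rather than as "token not found",
    so an unhealthy cache cannot turn a valid token into a 401 by itself.
    """
    try:
        value = secondary_get(token_id)
    except ConnectionError:
        value = None
    if value is None:
        value = primary_get(token_id)
    return value


def unavailable_cache(token_id: str) -> Optional[str]:
    """Stand-in for a secondary cache whose backend cannot be reached."""
    raise ConnectionError("cache unavailable")


primary_store = {"t1": "token-data"}
token = lookup_token("t1", unavailable_cache, primary_store.get)
```

Only a miss in both layers results in `None`, which is the case a caller would translate into a 401.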
Mar 12, 04:46 - 06:02 UTC
Disruption with some GitHub services
Between 01:36 and 08:11 UTC on Thursday March 12, GitHub.com experienced elevated error rates across Git operations, web requests, and related services. During a planned infrastructure upgrade, a configuration issue caused newly provisioned Kubernetes nodes to run an incompatible version of etcd, which disrupted cluster consensus across several production clusters. This led to intermittent 5XX errors on git push, git clone, and page loads. Deployments were paused for the duration of the incident.

Once the incompatible nodes were identified, they were removed and cluster consensus was restored. A validation deploy confirmed all systems were healthy before normal operations resumed.

To prevent recurrence, we are adding programmatic enforcement of version compatibility during node replacements, implementing monitoring to detect split-brain conditions earlier, and updating our recovery tooling to reduce restoration time.
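A programmatic version check of the kind mentioned above might look like the following sketch. The compatibility rule used here (a node must match the cluster's major and minor version) is an assumption chosen for illustration, as are the node names and version strings; the actual rule depends on the software's upgrade policy.

```python
from typing import List, Mapping, Tuple


def major_minor(version: str) -> Tuple[int, ...]:
    """Parse 'v3.5.9' or '3.5.9' into (3, 5), ignoring the patch level."""
    return tuple(int(part) for part in version.lstrip("v").split(".")[:2])


def incompatible_nodes(cluster_version: str,
                       node_versions: Mapping[str, str]) -> List[str]:
    """Return the names of nodes whose major.minor differs from the cluster's."""
    expected = major_minor(cluster_version)
    return sorted(name for name, version in node_versions.items()
                  if major_minor(version) != expected)


blocked = incompatible_nodes(
    "3.5.9",
    {"node-a": "3.5.12", "node-b": "v3.4.27", "node-c": "3.5.9"},
)
```

A provisioning pipeline could refuse to admit any node in `blocked`, which in this example contains only `node-b`.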
Mar 12, 01:54 - 02:45 UTC
Degraded experience with Copilot Code Review
On March 11, 2026, between 13:00 UTC and 15:23 UTC the Copilot Code Review service was degraded and experienced longer than average review times. On average, Copilot Code Review requests took 4 minutes and peaked at just under 8 minutes. This was due to hitting worker capacity limits and CPU throttling. We mitigated the incident by increasing partitions, and we are improving our resource monitoring to identify potential issues sooner.
Mar 11, 14:25 - 15:53 UTC
Incident with API Requests
On March 11, 2026, between 14:25 UTC and 14:34 UTC, the REST API platform was degraded, resulting in increased error rates and request timeouts. REST API 5xx error rates peaked at ~5% during the incident window with two distinct spikes: the first impacting REST services broadly, and the second driven by sustained timeouts on a subset of endpoints.

The incident was caused by a performance degradation in our data layer, which resulted in increased query latency across dependent services. Most services recovered quickly after the initial spike, but resource contention caused sustained 5xx errors due to how certain endpoints responded to the degraded state.

A fix addressing the behavior that prolonged impact has already been shipped. We are continuing to work to resolve the primary contributing factor of the degradation and to implement safeguards against issues causing cascading impact in the future.
Mar 11, 14:37 - 15:02 UTC
Incident With Webhooks
On March 10, 2026, between 23:00 UTC and 23:40 UTC, the Webhooks service was degraded and ~6% of users experienced intermittent errors when accessing webhook delivery history, retrying webhook deliveries, and listing webhooks via the UI and API. Approximately 0.37% of requests resulted in errors, while at peak 0.5% of requests resulted in errors. This was due to unhealthy infrastructure. We mitigated the incident by redeploying affected services, after which service health returned to normal. We are working to improve detection of unhealthy infrastructure and strengthen service safeguards to reduce time to detect and mitigate similar issues in the future.
Mar 10, 23:00 - 23:00 UTC
Incident with Webhooks
On March 9, 2026, between 15:03 and 20:52 UTC, the Webhooks API was degraded, resulting in higher average latency on requests and, in certain cases, error responses. Approximately 0.6% of total requests exceeded the normal latency threshold of 3s, while 0.4% of requests resulted in 500 errors. At peak, 2.0% experienced latency greater than 3 seconds and 2.8% of requests returned 500 errors.

The issue was caused by a noisy actor that led to resource contention on the Webhooks API service. We mitigated the issue initially by increasing CPU resources for the Webhooks API service, and ultimately applied lower rate limiting thresholds to the noisy actor to prevent further impact to other users.

We are working to improve monitoring to more quickly ascertain noisy traffic and will continue to improve our rate-limiting mechanisms to help prevent similar issues in the future.
Mar 9, 15:50 - 17:03 UTC
Incident with Codespaces
On March 9, 2026, between 01:23 UTC and 03:25 UTC, users attempting to create or resume codespaces in the Australia East region experienced elevated failures, peaking at a 100% failure rate for this region. Codespaces in other regions were not affected.

The create and resume failures were caused by degraded network connectivity between our control plane services and the VMs hosting the codespaces. This was resolved by redirecting traffic to an alternate site within the region. While we are addressing the core network infrastructure issue, we have also improved our observability of components in this area to improve detection. This will also enable our existing automated failovers to cover this failure mode. These changes will prevent or significantly reduce the time any similar incident causes user impact.
Mar 9, 03:04 - 03:51 UTC
Incident with Webhooks
On March 6, 2026, between 16:16 UTC and 23:28 UTC the Webhooks service was degraded and some users experienced intermittent errors when accessing webhook delivery histories, retrying webhook deliveries, and listing webhooks via the UI and API. On average, the error rate was 0.57% and peaked at approximately 2.73% of requests to the service. This was due to unhealthy infrastructure affecting a portion of webhook API traffic.

We mitigated the incident by redeploying affected services, after which service health returned to normal.

We are working to improve detection of unhealthy infrastructure and strengthen service safeguards to reduce time to detection and mitigation of issues like this one in the future.
Mar 6, 16:58 - 23:28 UTC
Actions is experiencing degraded availability
On March 5, between 22:39 and 23:55 UTC, Actions was degraded due to a repeat of an incident a few hours prior. In this case, a Redis cluster topology change made as a follow-up to the earlier incident caused a repeat of the earlier degradation of Actions jobs. Details of both incidents and the follow-ups are shared at https://www.githubstatus.com/incidents/g5gnt5l5hf56.
Mar 5, 22:53 - 23:55 UTC
Multiple services are affected, service degradation
On March 5, 2026, between 16:24 UTC and 19:30 UTC, Actions was degraded. During this time, 95% of workflow runs failed to start within 5 minutes, with an average delay of 30 minutes, and 10% of workflow runs failed with an infrastructure error. This was due to Redis infrastructure updates that were being rolled out to production to improve our resiliency. These updates introduced a set of incorrect configuration changes into our Redis load balancer, causing internal traffic to be routed to an incorrect host and leading to two incidents.

We mitigated this incident by correcting the misconfigured load balancer. Actions jobs were running successfully starting at 17:24 UTC. The remaining time until we closed the incident was spent working through the queue of jobs.

We immediately rolled back the updates that were a contributing factor and have frozen all changes in this area until we have completed the follow-up work from this incident. We are working to improve our automation to ensure incorrect configuration changes are not able to propagate through our infrastructure. We are also working on improved alerting to catch misconfigured load balancers before they cause an incident. Additionally, we are updating the Redis client configuration in Actions to improve resiliency to brief cache interruptions.
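The write-up does not say which Redis client settings are being changed. One common way to tolerate brief cache interruptions is to retry failed calls with capped exponential backoff, sketched below under that assumption. The `call_with_retry` helper and its defaults are hypothetical and do not use any Redis library.

```python
import time


def call_with_retry(operation, attempts=4, base_delay=0.05, max_delay=1.0,
                    sleep=time.sleep,
                    retry_on=(ConnectionError, TimeoutError)):
    """Run `operation`, retrying transient failures with exponential backoff.

    Hypothetical helper: `operation` is any zero-argument callable, such as
    a cache read. The last failure is re-raised once `attempts` is exhausted.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)
            # Double the wait each time, but never exceed max_delay.
            delay = min(max_delay, delay * 2)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. A production version would typically add jitter so that many clients do not retry in lockstep.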
Mar 5, 16:35 - 19:30 UTC
Disruption with some GitHub services
On March 5, 2026, between 12:53 UTC and 13:35 UTC, the Copilot mission control service was degraded. This resulted in empty responses being returned for users' agent session lists, so impacted users were unable to see their current and previous agent sessions across GitHub web surfaces. This was caused by an incorrect database query that mistakenly excluded records in which a field was absent.
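The actual query is not given here. As one common way this class of bug arises in SQL, a `!=` filter silently drops rows where the column is NULL, because the comparison evaluates to NULL rather than true. The sketch below reproduces that with an in-memory SQLite database; the `sessions` table and `state` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO sessions (id, state) VALUES (?, ?)",
    [(1, "active"), (2, "archived"), (3, None)],  # row 3 has an absent field
)


def ids(query):
    """Return the ids selected by `query`, in order."""
    return [row[0] for row in conn.execute(query)]


# NULL != 'archived' evaluates to NULL, so row 3 is filtered out.
naive = ids("SELECT id FROM sessions WHERE state != 'archived' ORDER BY id")

# Handling NULL explicitly keeps records whose field is absent.
null_safe = ids(
    "SELECT id FROM sessions "
    "WHERE state IS NULL OR state != 'archived' ORDER BY id"
)

print(naive)      # [1]
print(null_safe)  # [1, 3]
```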

We mitigated the incident by rolling back the database query change. There were no data alterations nor deletions during the incident.

To prevent similar issues in the future, we're improving our monitoring depth to more easily detect degradation before changes are fully rolled out.
Mar 5, 01:13 - 01:30 UTC
Some OpenAI models degraded in Copilot
On March 5th, 2026, between approximately 00:26 and 00:44 UTC, the Copilot service experienced a degradation of the GPT 3.5 Codex model due to an issue with our upstream provider. Users encountered elevated error rates when using GPT 3.5 Codex, impacting approximately 30% of requests. No other models were impacted.

The issue was resolved by a mitigation put in place by our provider.
Mar 5, 00:47 - 01:13 UTC
Claude Opus 4.6 Fast not appearing for some Copilot users
On March 3, 2026, between 19:44 UTC and 21:05 UTC, some GitHub Copilot users reported that the Claude Opus 4.6 Fast model was no longer available in their IDE model selection. After investigation, we confirmed that this was caused by enterprise administrators adjusting their organization's model policies, which correctly removed the model for users in those organizations. No users outside the affected organizations lost access.

We confirmed that the Copilot settings were functioning as designed, and all expected users retained access to the model. The incident was resolved once we verified that the change was intentional and no platform regression had occurred.
Mar 3, 20:31 - 21:11 UTC
Incident with all GitHub services
On March 3, 2026, between 18:46 UTC and 20:09 UTC, GitHub experienced a period of degraded availability impacting GitHub.com, the GitHub API, GitHub Actions, Git operations, GitHub Copilot, and other dependent services. At the peak of the incident, GitHub.com request failures reached approximately 40%. During the same period, approximately 43% of GitHub API requests failed. Git operations over HTTP had an error rate of approximately 6%, while SSH was not impacted. GitHub Copilot requests had an error rate of approximately 21%. GitHub Actions experienced less than 1% impact.

This incident shared the same underlying cause as an incident in early February where we saw a large volume of writes to the user settings caching mechanism. While deploying a change to reduce the burden of these writes, a bug caused every user’s cache to expire, get recalculated, and get rewritten. The increased load caused replication delays that cascaded down to all affected services. We mitigated this issue by immediately rolling back the faulty deployment.

We understand these incidents disrupted the workflows of developers. While we have made substantial, long-term investments in how GitHub is built and operated to improve resilience, we acknowledge we have more work to do. Getting there requires deep architectural work that is already underway, as well as urgent, targeted improvements. We are taking the following immediate steps:

- We have added a killswitch and improved monitoring to the caching mechanism to ensure we are notified before there is user impact and can respond swiftly.
- We are moving the cache mechanism to a dedicated host, ensuring that any future issues will solely affect services that rely on it.
Mar 3, 18:59 - 20:09 UTC
Delayed visibility of newly added issues on project boards
Between March 2, 21:42 UTC and March 3, 05:54 UTC, project board updates, including adding new issues, PRs, and draft items to boards, were delayed by between 30 minutes and more than 2 hours as a large backlog of messages accumulated in the Projects data denormalization pipeline.

The incident was caused by an anomalously large event that required longer processing time than expected. Processing this message exceeded the Kafka consumer heartbeat timeout, triggering repeated consumer group rebalances. As a result, the consumer group was unable to make forward progress, creating head-of-line blocking that delayed processing of subsequent project board updates.
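The failure mode described above can be illustrated with a small simulation. This is a simplified model, not Kafka client code and not the Projects pipeline: the function name, its parameters, and the message costs are hypothetical. A single ordered partition is consumed; whenever processing takes longer than the liveness timeout, the group rebalances, the offset is never committed, and the same message is redelivered.

```python
def run_consumer(messages, timeout_s, max_rebalances=5, bypass_ids=frozenset()):
    """Simulate in-order consumption of (message_id, processing_seconds) pairs.

    The simulation stops after `max_rebalances` so that it terminates; a real
    consumer group would keep rebalancing until someone intervenes.
    """
    committed = []   # ids whose offsets were committed
    rebalances = 0
    offset = 0
    while offset < len(messages):
        msg_id, cost_s = messages[offset]
        if msg_id in bypass_ids:
            offset += 1          # targeted fix: skip the offending message
            continue
        if cost_s > timeout_s:
            rebalances += 1      # consumer considered dead, offset not committed
            if rebalances >= max_rebalances:
                break
            continue             # same message is redelivered after rebalance
        committed.append(msg_id)
        offset += 1
    return {
        "committed": committed,
        "rebalances": rebalances,
        "backlog": len(messages) - offset,
    }


messages = [("a", 1), ("big", 120), ("b", 1), ("c", 1)]
stuck = run_consumer(messages, timeout_s=30)
fixed = run_consumer(messages, timeout_s=30, bypass_ids={"big"})
```

In the `stuck` run, only `"a"` is committed and the three remaining messages sit behind `"big"`, which is the head-of-line blocking. In the `fixed` run, bypassing `"big"` lets the rest drain.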

We mitigated the issue by deploying a targeted fix that safely bypassed the offending message and allowed normal message consumption to resume. Consumer group stability recovered at 04:10 UTC, after which the backlog began draining. All queued messages were fully processed by 05:53 UTC, returning project board updates to normal processing latency.

We have identified several follow-up improvements to reduce the likelihood and impact of similar incidents in the future, including improved monitoring and alerting, as well as introducing limits for unusually large project events.
Mar 2, 23:10 - Mar 3, 05:54 UTC
Incident with Pull Requests /pulls
On March 2, 2026, between 07:10 UTC and 22:04 UTC, the pull requests service was degraded. Users navigating between tabs on the pull requests dashboard were met with 404 errors or blank pages.

This was due to a configuration change deployed on February 27 at 23:03 UTC. We mitigated the incident by reverting the change.

We’re working to improve monitoring for the page to automatically detect and alert us to routing failures.
Mar 2, 19:11 - 22:04 UTC