2020-10-8 & 9 Major outage
Incident Report for api.video
Postmortem

Incident summary

From October 8, 2020 at approximately 21:00 UTC until the morning of October 9, 2020 at 8:20 UTC, api.video suffered a major failure impacting its entire service. Intermittent issues with live streaming continued until 10:30 UTC.

The event was triggered by a progressive, major failure at one of our DNS providers at 21:00 UTC on October 8, 2020.

The event was detected by our support team at 6:00 UTC. The tech team started working on it by 6:30 UTC.

The incident affected all users.

The incident cleared after workarounds were implemented, and the situation has been considered stable since 8:20 UTC.

Leadup

Today, to deliver all our content (live streams, VOD files, player assets, ...), we rely on a CDN in front of a platform split across continents (North America & France).

To load-balance traffic from the CDN towards the closest platform, we rely on a GSLB, a load balancer operating at the DNS level. We also use this GSLB within our platform to geo-route traffic between geographic dependencies (databases, keystore, ...).
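
Conceptually, the routing decision a GSLB makes at the DNS level can be sketched as follows. This is a simplified illustration, not our actual configuration; the region names and hostnames are hypothetical.

```python
# Minimal sketch of DNS-level geo-routing: given the region of the
# resolving client, answer with the CNAME target of the closest platform.
# All names below are illustrative placeholders.

PLATFORMS = {
    "NA": "origin-na.example.net",  # North American platform
    "EU": "origin-eu.example.net",  # French platform
}

def route(client_region: str) -> str:
    """Return the hostname the GSLB would answer for this region."""
    # Unknown regions fall back to the European platform.
    return PLATFORMS.get(client_region, PLATFORMS["EU"])
```

When the provider serving these answers goes down entirely, every hostname behind it stops resolving at once, which is exactly the failure mode described below.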

We used to do this through our hosting provider's local load balancers, but moved away from that setup on June 14 due to repeated major issues with the service. We selected PerfOps, created by Dmitriy Akulov (the creator of jsDelivr), as our GSLB provider.

Fault

On the evening of October 8, all of their domains, including perfops.net (the brand domain) and flexbalancers.net (the technical one), went down: the former no longer has any existing DNS records, while the latter is no longer declared by any registrar. At the time of this postmortem, we have still had no communication with their executive or technical teams.

Due to this outage, all our load-balanced DNS records became unresponsive.

Impact

Load-balanced DNS records include the VOD, live, and player-asset origins for the CDN. From then on, it was impossible to view videos.

The impact was global.

Detection

While we completely revised our monitoring in May 2020, we did not implement end-to-end probes. As a result, while each individual service was running, we had no indication that the service was interrupted for our users.
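
The kind of probe we were missing can be as simple as fetching a URL the way an end user would, rather than checking each component in isolation. A minimal sketch, assuming a hypothetical playback URL; this is not our monitoring code:

```python
import urllib.request

def probe(url: str, timeout: float = 5.0) -> bool:
    """End-to-end check: fetch the URL as a user would.

    Returns True only if the full chain (DNS resolution, TCP, TLS,
    HTTP) succeeds with a 200 response. A DNS outage like this
    incident fails here even when every backend service is healthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers DNS failures, refused connections, timeouts;
        # ValueError covers malformed URLs.
        return False
```

Run against a real delivery URL (e.g. a player asset) from outside the platform, such a probe would have alerted on the service interruption hours earlier.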

Our support team noticed the issues at 6:00 UTC on October 9, escalating the case to our CTO at 6:20 UTC after a series of tests.

Response

Our CTO started diagnosing the incident at 6:25 UTC, narrowing the issue down to the DNS level. Several providers could be involved: GoDaddy (our current registrar), PerfOps (our current GSLB), NSOne (the PerfOps replacement we are setting up), and CDNetworks (our current CDN).

Meanwhile, he reached out to the infrastructure team for assistance at 6:42 UTC. The infrastructure team was available and working on the incident at 7:00 UTC.

Recovery

As the issue was with PerfOps (unresponsive domains), and as the ongoing migration to NSOne is not production-ready yet, we decided on a workaround: replacing all the CNAMEs pointing to PerfOps' DNS records with aliases pointing to a single hostname on our end.
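
As an illustration (with hypothetical record names, not our actual zone), the workaround amounted to swapping the GSLB delegation for a direct alias:

```zone
; Before: delivery hostname delegated to the GSLB (unresponsive)
vod.example.com.    1800  IN  CNAME  lb-12345.flexbalancers.net.

; After: direct alias to a single origin hostname we control
vod.example.com.    1800  IN  CNAME  origin-eu.example.com.
```

This restores resolution at the cost of losing geo-routing: all traffic lands on one platform until the GSLB replacement is in place.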

The change was performed at the level of GoDaddy, our registrar. Due to how DNS caching works, and since the minimum TTL at GoDaddy is 30 minutes, any change started at 7:00 UTC would be effective within the hour (twice the TTL minus one second) in the worst-case scenario.
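
As a sanity check on that estimate, the worst-case propagation delay stated above works out as follows (a hypothetical helper, simply encoding the report's "twice the TTL minus one second" estimate):

```python
def worst_case_propagation(ttl_seconds: int) -> int:
    """Worst-case seconds before a DNS change is seen everywhere,
    per the estimate above: twice the TTL minus one second."""
    return 2 * ttl_seconds - 1

# A 30-minute TTL (1800 s) gives 3599 s: just under one hour,
# consistent with "effective within the hour".
```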

At 8:00 UTC, some internal services were still encountering issues. We noticed errors in the workarounds (wrong records used) and implemented fixes at both the DNS level and the server level.

Timeline

All times are UTC.

21:00 - PerfOps' domains went down; DNS caching delayed the visible impact from an external point of view

6:00 - global incident noticed by our support team

6:20 - incident escalated to our CTO

6:25 - beginning of the diagnosis

6:42 - incident escalated to our Infrastructure Team

7:00 - DNS records are being updated

8:00 - few internal errors remained and got fixed

8:20 - incident closed

10:00 - live customers complain about service flapping; the infrastructure team reopens the incident and starts another diagnosis

10:07 - the single ingest server we are using as a workaround for the initial incident suffers from sustained load and recurring errors from a specific stream

10:10 - standard load balancing is implemented across all our ingest nodes for live streaming, and the suspicious live stream is killed

10:12 - as a quick fix, some customers are manually moved to specific ingest nodes while the load balancing is being set up

10:25 - the load-balancing is up & running

10:30 - service is ok for our live customers

Root cause

  1. Video delivery was out due to errors at the CDN level.
  2. Those errors were generated either by errors at the origin level (our geo-clustered platform), or DNS errors.
  3. The errors by our servers were DNS related.
  4. All DNS errors were related to a provider outage.
  5. Because we lack end-to-end testing, no alert was raised by our monitoring to the tech team.
  6. Because we felt confident in the GSLB provider, and due to the urgency of leaving the former load-balancing provider, we did not vet them thoroughly enough.
  7. Because we chose workarounds to resolve the situation quickly, some small errors appeared, which we fixed on the fly.

Backlog check

We have several items in our backlog, already in progress, that would have prevented this situation:

  1. implement end-to-end tests, to monitor the service as our end-users consume it
  2. replace PerfOps with a better performance-based GSLB service provider (NSOne)
  3. avoid external GSLB for internal dependencies: this concerns both how our systems address one another from service to service, and the various clusters we run with geo-replication instead of local replication where applicable
  4. setup a proper on-call scheduling to avoid any time frame without any tech person available

Recurrence

No previous incident was related to this root cause.

Lessons learned

This incident highlighted several areas for improvement. Above all, DNS remains critical to any internet service and to internal infrastructure communication; every element of it should be sufficiently redundant.

Accordingly, as we build clusters for all of our services, we should rely on multiple DNS providers at every level. A first step is to rely on a single first-class provider for each stage of the DNS resolution process, then move to multiple first-class providers for each stage.

Corrective actions

Although all these topics are already in our backlog, it is necessary to review their priorities and deadlines.

  • [ ] End-to-end tests to verify how we actually deliver the service to our users: deadline set to October 30
  • [ ] Replace PerfOps with NSOne to ensure stable DNS resolution: deadline set to October 13
  • [ ] Use internal DNS instead of public DNS for internal system communications: deadline set to October 16
  • [ ] Revamp internal clusters such as databases and storage: deadline set to December 25
  • [ ] Proper on-call scheduling: deadline set to October 30
  • [ ] Multiple DNS providers for GSLB & DNS resolution: 2021 Q1
Posted Oct 09, 2020 - 12:09 UTC

Resolved
This incident has been resolved.
Posted Oct 09, 2020 - 12:08 UTC
This incident affected: Live, Web services, Player, VOD, and User space.