From October 8, 2020 at approximately 21:00 UTC until the morning of October 9, 2020 at 8:20 UTC, api.video suffered a major failure impacting its entire service. Intermittent issues with live streaming continued until 10:30 UTC.
The event was triggered by a progressive, major failure at one of our DNS providers at 21:00 UTC on October 8, 2020.
The event was detected by our support team at 6:00 UTC on October 9. The tech team started working on it by 6:30 UTC.
The incident affected all users.
The incident was cleared after workarounds were implemented, and the situation has been considered stable since 8:20 UTC.
Today, to deliver all our content (live streams, VOD files, player assets, ...), we rely on a CDN in front of a platform split across two regions (North America and France).
To load-balance traffic from the CDN towards the closest platform, we rely on a GSLB, a load balancer operating at the DNS level. We also use this GSLB within our platform to geo-route traffic between geographically distributed dependencies (databases, keystore, ...).
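To make this more concrete, here is a minimal sketch, from a client's point of view, of what DNS-level load balancing looks like: the same hostname can resolve to a different regional origin depending on where the query appears to come from. The hostname and resolvers below are hypothetical placeholders, not our production records, and querying public resolvers is only a rough approximation of a real client's location.

```python
# Illustrative only: a GSLB answers based on (roughly) where the DNS query
# originates, so the same record can map to different regional origins.
import dns.resolver  # dnspython

HOSTNAME = "vod-origin.example.api.video"  # hypothetical GSLB-managed record
RESOLVERS = {
    "resolver A": "8.8.8.8",  # Google Public DNS
    "resolver B": "9.9.9.9",  # Quad9
}

for label, resolver_ip in RESOLVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [resolver_ip]
    try:
        answer = resolver.resolve(HOSTNAME, "A")
        ips = ", ".join(rr.address for rr in answer)
        print(f"{label} ({resolver_ip}): {HOSTNAME} -> {ips}")
    except Exception as exc:
        print(f"{label} ({resolver_ip}): resolution failed ({exc})")
```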
We used to do this with our hosting provider's local load balancers, but we moved away from that setup on June 14 due to repeated major issues with the service. We selected PerfOps, created by Dmitriy Akulov, the man behind jsdelivr, as our GSLB provider.
On the evening of October 8, all of their domains, including perfops.net (the brand domain) and flexbalancers.net (the technical one), went down: the former no longer has any existing DNS records, while the latter is no longer declared with any registrar. At the time of this post-mortem, we have still had no communication from either their executive team or their technical team.
Due to this outage, all of our load-balanced DNS records became unresponsive. These records include the VOD, live, and player asset origins used by the CDN. From then on, it was impossible to view videos.
The impact was global.
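For context, here is a rough sketch of how the two failure modes can be told apart from the outside; the domain names are the ones mentioned above, but the logic is illustrative rather than the exact commands we ran.

```python
import dns.resolver  # dnspython

def diagnose(domain: str) -> None:
    """Distinguish 'zone exists but is empty' from 'name no longer exists'."""
    try:
        answer = dns.resolver.resolve(domain, "A")
        print(f"{domain}: OK -> " + ", ".join(rr.address for rr in answer))
    except dns.resolver.NXDOMAIN:
        # The name does not exist at all: consistent with a domain that is
        # no longer declared at any registrar (the flexbalancers.net case).
        print(f"{domain}: NXDOMAIN - the name no longer exists")
    except dns.resolver.NoAnswer:
        # The zone answers but holds no record of this type: consistent with
        # a zone whose records were removed (the perfops.net case).
        print(f"{domain}: the zone answers but returned no records")
    except dns.resolver.NoNameservers:
        print(f"{domain}: no authoritative nameserver answered")

for name in ("perfops.net", "flexbalancers.net"):
    diagnose(name)
```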
While we completely revised our monitoring in May 2020, we did not implement end-to-end probes. As a result, although each individual service was running, we had no visibility into the service interruption experienced by our users.
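An end-to-end probe can be as simple as the sketch below: fetch a public asset the same way a viewer's browser would, and raise an alert on any failure. The URL is a hypothetical placeholder, and the alerting is reduced to a print statement.

```python
import urllib.request

PROBE_URL = "https://example.api.video/player/health-check.js"  # hypothetical URL
TIMEOUT_S = 10

def probe(url: str) -> bool:
    """Fetch the asset end to end (DNS, CDN, origin) like a real viewer would."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            return resp.status == 200
    except Exception as exc:
        print(f"end-to-end probe failed for {url}: {exc}")
        return False

if not probe(PROBE_URL):
    # In a real setup this would page the on-call engineer, not print.
    print("ALERT: users are likely unable to load player assets")
```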
Our support team noticed the issues at 6:00 UTC on October 9 and, after a series of tests, escalated the case to our CTO at 6:20 UTC.
Our CTO started diagnosing the incident at 6:25 UTC and pinpointed issues at the DNS level. Several providers could have been involved: GoDaddy (our current registrar), PerfOps (our current GSLB), NSOne (the replacement for PerfOps we are setting up), and CDNetworks (our current CDN).
Meanwhile, at 6:42 UTC, he reached out to the infrastructure team for assistance on the issue. The infrastructure team was available and working on the incident at 7:00 UTC.
As the issue was with PerfOps (unresponsive domains), and as the ongoing work with NSOne was not yet ready for production, we decided to go with a workaround: replacing every CNAME pointing to PerfOps' DNS records with an alias to a single hostname on our end.
The change was performed at the level of GoDaddy, our registrar. Due to how DNS caching works, and as the minimum TTL allowed by GoDaddy is 30 minutes, any change started at 7:00 UTC would be effective within the hour (twice the TTL minus one second) in the worst-case scenario.
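A hedged sketch of how such a change can be verified while it propagates: check that the public CNAME no longer points at a PerfOps-hosted name, and read the TTL still attached to the answer. The record name below is a hypothetical placeholder, not one of our real records.

```python
import dns.resolver  # dnspython

RECORD = "vod.example.api.video"    # hypothetical load-balanced record
OLD_SUFFIX = "flexbalancers.net."   # PerfOps' technical domain

answer = dns.resolver.resolve(RECORD, "CNAME")
for rr in answer:
    target = str(rr.target)
    status = "still on PerfOps" if target.endswith(OLD_SUFFIX) else "moved to our own alias"
    print(f"{RECORD} -> {target} ({status}), remaining TTL: {answer.rrset.ttl}s")

# With GoDaddy's minimum TTL of 30 minutes, an answer cached just before the
# 7:00 UTC change can keep serving the old chain for up to roughly twice the
# TTL minus one second, i.e. just under an hour.
```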
At 8:00 UTC, some internal services were still encountering issues. We noticed errors in the workarounds (wrong records used) and implemented fixes at both the DNS level and the server level.
All times are UTC.
21:00 - PerfOps' domains went down; DNS propagation (cached records expiring progressively) slowed down the impact from an external point of view
6:00 - global incident noticed by our support team
6:20 - incident escalated to our CTO
6:25 - beginning of the diagnosis
6:42 - incident escalated to our Infrastructure Team
7:00 - DNS records are being updated
8:00 - a few internal errors remained and were fixed
8:20 - incident closed
10:00 - live customers complain about service flapping; the infrastructure team reopens the incident and starts another diagnosis
10:07 - the single ingest server we are using as a workaround to the initial incident suffers from the accumulated load and from recurrent errors on a specific stream
10:10 - standard load balancing is implemented across all our live-streaming ingest nodes, and the suspicious live stream is killed
10:12 - as a quick fix, some customers are manually moved to specific ingest nodes while the load balancing is being set up
10:25 - the load balancing is up & running
10:30 - service is OK for our live customers
We have several items in our backlog, already in progress, that would have avoided this situation.
No previous incident was related to this root cause.
From this incident, we have identified several aspects to improve. DNS remains key to any internet service and to internal infrastructure communication; every element of it should be sufficiently redundant.
With this in mind, just as we build clusters for every service, we should benefit from multiple DNS providers at every level. A first step is to rely on a single first-class provider for each stage of the DNS resolution process, then to move to multiple first-class providers for each stage.
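As a first sanity check in that direction, something like the sketch below can list the nameservers delegated for a zone and flag when they all belong to a single provider. Grouping by the nameserver's parent domain is a simplification, and this is an illustration of the idea rather than our actual tooling.

```python
import dns.resolver  # dnspython

ZONE = "api.video"  # the zone to audit

answer = dns.resolver.resolve(ZONE, "NS")
nameservers = sorted(str(rr.target).rstrip(".") for rr in answer)
# Cheap proxy for "provider": the parent domain of each nameserver.
providers = {ns.split(".", 1)[1] for ns in nameservers}

print(f"{ZONE}: {len(nameservers)} nameserver(s) across "
      f"{len(providers)} provider domain(s): {sorted(providers)}")
if len(providers) < 2:
    print("WARNING: a single DNS provider exposes the zone to a provider-wide outage")
```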
Although all these topics are already in our backlog, it is necessary to review their priorities and deadlines.