Ingestion Issue

Incident Report for Scout

Postmortem

On 2020/05/31 we experienced a short network outage that prevented our ZooKeeper and Kafka nodes from reaching each other. When connectivity was restored, stale ZooKeeper data prevented the Kafka brokers from initiating a proper leader election for topic partitions, which in turn prevented Kafka producers from producing to a majority of partitions. A manual leader election was attempted but failed to correct the issue. We then began a rolling restart of the entire Kafka cluster, which ultimately resolved it. Later versions of Kafka handle this particular failure better, and we anticipate moving to a recent version to avoid entering this failure mode again.
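
For readers who want a concrete picture of the failure mode, the following is a minimal sketch of how leaderless partitions could be spotted from a snapshot of partition metadata. It is illustrative only and is not the tooling used during the incident. The topic names and the snapshot layout are hypothetical; the one Kafka-specific assumption is that partition metadata reports a leader id of -1 when no leader is elected.

```python
from typing import Dict, List

# Kafka partition metadata reports leader id -1 when no leader is elected.
NO_LEADER = -1


def leaderless_partitions(leaders: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Return, per topic, the partition indexes that have no elected leader.

    `leaders` maps a topic name to a list of leader broker ids,
    indexed by partition number (a hypothetical snapshot format).
    """
    result: Dict[str, List[int]] = {}
    for topic, partition_leaders in leaders.items():
        missing = [i for i, leader in enumerate(partition_leaders) if leader == NO_LEADER]
        if missing:
            result[topic] = missing
    return result


def majority_unavailable(leaders: Dict[str, List[int]]) -> bool:
    """True when more than half of all partitions have no leader."""
    total = sum(len(p) for p in leaders.values())
    missing = sum(len(p) for p in leaderless_partitions(leaders).values())
    return total > 0 and missing * 2 > total


# Hypothetical snapshot: topic names and broker ids are made up for illustration.
snapshot = {
    "agent-metrics": [-1, -1, 2, -1],
    "agent-traces": [1, -1],
}

print(leaderless_partitions(snapshot))
print(majority_unavailable(snapshot))
```

Producers cannot write to a partition without a leader, so a check of this shape would flag the condition described above before ingestion gaps become visible to customers.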

Posted Jun 01, 2020 - 15:47 MDT

Resolved

Ingestion for all customers has been operating normally since 10:20 AM MT. Some customers will have partial or no data for the period from 8:50 AM to 10:20 AM MT. We will follow up with more information about the cause of the outage.

Posted May 31, 2020 - 11:44 MDT

Monitoring

We have restarted several of our ingestion Kafka servers, and ingestion appears to be recovering.

Data should begin appearing on your dashboard again.

Posted May 31, 2020 - 10:22 MDT

Investigating

We are investigating an issue with the ingestion of agent data.

Posted May 31, 2020 - 09:46 MDT

This incident affected: Application Monitoring.