Important: Please ensure you have access to production so that you are able to handle incidents!
Background
Team Pipeline has committed to processing "standard" events end-to-end within 30 seconds (p95) and "buffered" events within 1m30s (p95).
It's important for customers that their data is queryable soon after being sent so they can debug their implementation and get the feeling that PostHog is working correctly.
To calculate how we're doing with respect to our commitment, we have a Celery task that every 60 seconds calculates lag based on an event that we emit every minute from a plugin.
As a result of this, our metric includes 60 second error bars, meaning we actually target 1m30s for "standard" events and 2m30s for "buffered" events.
Also, beyond measuring how fast we are, the ingestion lag metric being high might also be an indication that we're not processing events at all, so it's important that we treat it seriously.
Debugging
There are many things that can lead to ingestion lag being high, so it's very important to rule out parts of the system completely in order to determine the root cause.
The first thing to establish is if we're processing events with a delay or not processing events at all. A third option to consider later on is if we're processing some events and dropping others.
You can do this in many ways:
- Look at the live events tab 
- Send an event and see if it appears on live events 
- Check ClickHouse for the latest event - _timestamp, and ingestion rate per minuteSQLSELECTtoStartOfMinute(timestamp) ts,count(*)FROM eventsWHERE timestamp > now() - INTERVAL 2 HOUR AND timestamp < now()GROUP BY tsORDER BY ts- Please note that - timestampreflects the time the event happened, not when the event was ingested. Thus it is possible that this metric is off if users are sending us events in the past or the future. However, it generally is a great representation of things. You can use- _timestampto get ingestion time timestamps, but note that it is a column that's not in our sorting key and thus filtering on it leads to expensive (and thus slow) queries.
If we're not processing events, then things are very serious. However, even then there are levels to the issue.
The worst case scenario is if events are being dropped by the capture endpoint (posthog/api/capture.py). This is because these events are forever lost.
It's a smaller but still serious issue if the plugin server is unable to process events or ClickHouse unable to consume them. These events will remain in Kafka until the messages older than the specified retention for the topic.
To determine this, there are a few useful resources:
- Graph: Kafka Production Rate (US Cloud link)
- Graph: Events received per minute (US Cloud link)
- Graph: HTTP Response Codes (US Cloud link)
- Sentry errors for /e,/batchetc endpoints (US Cloud link)
- Plugin server logs (US Cloud link)
Actions
Restarting the ingestion pods
A reasonably safe operation to perform is restarting the plugin server ingestion pods. This doesn't cause event loss and can be done by triggering a deploy or running a command like the following:
It's ok to do this when you're not fully sure what's wrong with the ingestion pipeline.
Scaling the ingestion pods
If you suspect that we're struggling due to high load, then you can scale vertically or horizontally.
To scale vertically (i.e. give the pods more resources), you should update the requests and limits for CPU and memory on the relevant section of the posthog-cloud-infra repo.
To scale horizontally (i.e. add more pods), you should update the min and max hpa values on the relevant section of the posthog-cloud-infra repo.
Dealing with hot partitions
If you see from the Kafka consumer lag graph (US Cloud link) that our lag seems to stem from only one or a few partitions, that's likely because we have a hot partition.
We partition the events_plugin_ingestion topic by <team_id>:<distinct_id>, meaning we're likely getting a burst of events for a given person in a short period of time, causing a hot partition. 
There are a few possible mitigation steps for this:
- Wait. If the burst is temporary we should recover in time.
- Discover the problematic IDs and force events being sent to them to be randomly partitioned.
In order to do #2, you should first try to discover what IDs could be causing problems. Here's a useful query to run on ClickHouse:
If an ID stands out, you can do some further digging with:
Running the above query on Metabase gives you a very useful visualization, where you can see if events for this ID have increased significantly recently or if the volume is always high.
Often times this graph will give you a clear indication of if an ID caused the hot partition or not.
If you've established that there's likely an ID causing a hot partition, it's worth peeking into the properties for the person it's associated to:
If it looks like the properties are all / mostly generic (e.g. initial referrers, OS, properties PostHog sets by default) or the hot partition problem is very serious (a lot of lag), you can then copy the approach from this PR to force the events for the ID to be randomly partitioned.
We check the person properties because if a lot of custom properties are being used random partitioning may cause properties to get inconsistent values.
Dropping and recreating ClickHouse "ingestion tables"
If you've established that ClickHouse is not ingesting data correctly (via Kafka consumer lag graphs, ClickHouse logs, etc), then it could be worth recreating the relevant Kafka table + materialized view pairs that are responsible for ingesting data.
In order to do this, first get the schema for each table:
Then you should first drop the materialized view (remove the comment to run on all nodes):
And later the Kafka table:
And recreate them in reverse order using the schemas you copied earlier. i.e. Kafka table first, then materialized view. Make sure to add the ON CLUSTER clause if you dropped the tables on the whole cluster.
Restarting the ClickHouse server on a node
Important: You should never do this if you're not comfortable doing so from experience operating ClickHouse! Please loop in someone to do this in case you feel like ClickHouse has gotten into a bad state (from looking at logs, metrics, etc).
In extreme situations a ClickHouse node may have gotten into a bad state. In that case, restarting the ClickHouse server might help. To do this, you should SSH into the node and run:
Quick reference
A list of relevant resources / information that is useful for debugging ingestion lag.
Kafka topics
- events_plugin_ingestion: Events sent from the Django events service to the plugin server. Consumed by a group poorly named- clickhouse-ingestion.
- clickhouse_events_json: Events sent from the plugin server to ClickHouse. Consumed by a group called- group1(and also by a group called- clickhouse-plugin-server-asyncfor the purpose of exports, but that's irrelevant here).
- conversion_events_buffer: Events that should wait (60s by default) to be processed. Produced to from the plugin server and consumed by the plugin server using a group called- ingester.
Tools
- https://metrics.posthog.net (Grafana)
- https://grafana.us-east-1.prod.posthog.dev (Grafana, requires VPN)
- https://grafana.eu-central-1.prod.posthog.dev (Grafana, requires VPN access)
- https://gregor.posthog.net (Kafka manager)
- https://metabase.posthog.net (Metabase)
- Sentry
- AWS console / CLI
- kubectl