Important: Please ensure you have access to production so that you are able to handle incidents!
Background
Team Pipeline has committed to processing "standard" events end-to-end within 30 seconds (p95) and "buffered" events within 1m30s (p95).
It's important for customers that their data is queryable soon after being sent, so they can debug their implementation and feel confident that PostHog is working correctly.
To calculate how we're doing with respect to our commitment, we have a Celery task that every 60 seconds calculates lag based on an event that we emit every minute from a plugin.
As a result, our metric includes 60-second error bars, meaning we actually target 1m30s for "standard" events and 2m30s for "buffered" events.
Beyond measuring how fast we are, a high ingestion lag metric might also be an indication that we're not processing events at all, so it's important that we treat it seriously.
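The details live in the Celery task, but a minimal sketch of the idea is the following ClickHouse query (the heartbeat event name here is a placeholder, not the real one):

```sql
-- Sketch only: how stale is the most recent heartbeat event we can query?
-- 'ingestion_lag_heartbeat' is a placeholder; use the actual event emitted by the plugin.
SELECT dateDiff('second', max(timestamp), now()) AS approx_lag_seconds
FROM events
WHERE event = 'ingestion_lag_heartbeat'
  AND timestamp > now() - INTERVAL 1 HOUR
```

Because the heartbeat only fires once a minute and the task only checks once a minute, the measured value can overstate the true lag by up to ~60 seconds, which is where the error bars above come from.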
Debugging
There are many things that can lead to ingestion lag being high, so it's very important to rule out parts of the system completely in order to determine the root cause.
The first thing to establish is whether we're processing events with a delay or not processing events at all. A third option to consider later on is whether we're processing some events and dropping others.
You can do this in many ways:
- Look at the live events tab
- Send an event and see if it appears on live events
- Check ClickHouse for the latest event `_timestamp`, and ingestion rate per minute:

```sql
SELECT
    toStartOfMinute(timestamp) ts,
    count(*)
FROM events
WHERE timestamp > now() - INTERVAL 2 HOUR AND timestamp < now()
GROUP BY ts
ORDER BY ts
```

Please note that `timestamp` reflects the time the event happened, not when the event was ingested. Thus it is possible that this metric is off if users are sending us events in the past or the future. However, it generally is a great representation of things. You can use `_timestamp` to get ingestion time timestamps, but note that it is a column that's not in our sorting key and thus filtering on it leads to expensive (and thus slow) queries.
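If you want ingestion-time numbers anyway, a sketch like the one below keeps the filter on `timestamp` (so the sorting key is still used) while reading `_timestamp` for the lag itself:

```sql
-- Approximate per-minute ingestion lag: _timestamp is the insert time in ClickHouse,
-- so _timestamp - timestamp approximates lag for events with accurate client timestamps.
SELECT
    toStartOfMinute(_timestamp) AS ingested_minute,
    count(*) AS events,
    quantile(0.95)(dateDiff('second', timestamp, _timestamp)) AS p95_lag_seconds
FROM events
WHERE timestamp > now() - INTERVAL 2 HOUR AND timestamp < now()
GROUP BY ingested_minute
ORDER BY ingested_minute
```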
If we're not processing events, then things are very serious. However, even then there are levels to the issue.
The worst case scenario is events being dropped by the capture endpoint (`posthog/api/capture.py`), because those events are lost forever.
It's a smaller but still serious issue if the plugin server is unable to process events or ClickHouse is unable to consume them. These events will remain in Kafka until they become older than the specified retention for the topic.
To determine this, there are a few useful resources:
- Graph: Kafka Production Rate (US Cloud link)
- Graph: Events received per minute (US Cloud link)
- Graph: HTTP Response Codes (US Cloud link)
- Sentry errors for the `/e`, `/batch`, etc. endpoints (US Cloud link)
- Plugin server logs (US Cloud link)
Actions
Restarting the ingestion pods
A reasonably safe operation to perform is restarting the plugin server ingestion pods. This doesn't cause event loss and can be done by triggering a deploy or running a command like the following:
```bash
kubectl rollout restart deployment/plugins-ingestion -n posthog
```
It's ok to do this when you're not fully sure what's wrong with the ingestion pipeline.
Scaling the ingestion pods
If you suspect that we're struggling due to high load, then you can scale vertically or horizontally.
To scale vertically (i.e. give the pods more resources), you should update the requests and limits for CPU and memory on the relevant section of the `posthog-cloud-infra` repo.
To scale horizontally (i.e. add more pods), you should update the min and max `hpa` values on the relevant section of the `posthog-cloud-infra` repo.
Dealing with hot partitions
If you see from the Kafka consumer lag graph (US Cloud link) that our lag seems to stem from only one or a few partitions, that's likely because we have a hot partition.
We partition the `events_plugin_ingestion` topic by `<team_id>:<distinct_id>`, meaning we're likely getting a burst of events for a given person in a short period of time, causing a hot partition.
There are a few possible mitigation steps for this:
- Wait. If the burst is temporary we should recover in time.
- Discover the problematic IDs and force events sent with them to be randomly partitioned.
In order to do #2, you should first try to discover what IDs could be causing problems. Here's a useful query to run on ClickHouse:
```sql
SELECT
    team_id,
    distinct_id,
    count(*) AS c
FROM events
WHERE
    timestamp > now() - INTERVAL 1 HOUR AND
    timestamp < now()
GROUP BY
    team_id,
    distinct_id
ORDER BY c DESC
LIMIT 20
```
If an ID stands out, you can do some further digging with:
```sql
SELECT
    toStartOfMinute(timestamp) ts,
    count(*)
FROM events
WHERE
    team_id = <team_id> AND
    distinct_id = <distinct_id> AND
    timestamp > now() - INTERVAL 1 DAY AND
    timestamp < now()
GROUP BY ts
```
Running the above query on Metabase gives you a very useful visualization, where you can see if events for this ID have increased significantly recently or if the volume is always high.
Oftentimes this graph will give you a clear indication of whether an ID caused the hot partition or not.
If you've established that there's likely an ID causing a hot partition, it's worth peeking into the properties of the person it's associated with:
```sql
WITH p AS (
    SELECT
        id,
        properties
    FROM person
    WHERE team_id = <team_id>
),
pdi AS (
    SELECT person_id
    FROM person_distinct_id2
    WHERE
        team_id = <team_id> AND
        distinct_id = <distinct_id>
)
SELECT p.properties
FROM p
JOIN pdi ON p.id = pdi.person_id
```
If it looks like the properties are all / mostly generic (e.g. initial referrers, OS, properties PostHog sets by default) or the hot partition problem is very serious (a lot of lag), you can then copy the approach from this PR to force the events for the ID to be randomly partitioned.
We check the person properties because, if a lot of custom properties are being used, random partitioning may cause those properties to get inconsistent values (events for the same person would no longer be processed in order).
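As a complementary check, the sketch below uses ClickHouse's JSON functions on the event `properties` column to list which property keys the hot ID is actually sending; mostly default `$`-prefixed keys suggests random partitioning is low-risk:

```sql
-- Which property keys does the hot ID send? Lots of custom (non-$) keys warrants
-- more caution before switching it to random partitioning.
SELECT
    arrayJoin(JSONExtractKeys(properties)) AS property_key,
    count(*) AS c
FROM events
WHERE
    team_id = <team_id> AND
    distinct_id = <distinct_id> AND
    timestamp > now() - INTERVAL 1 DAY AND
    timestamp < now()
GROUP BY property_key
ORDER BY c DESC
LIMIT 50
```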
Dropping and recreating ClickHouse "ingestion tables"
If you've established that ClickHouse is not ingesting data correctly (via Kafka consumer lag graphs, ClickHouse logs, etc), then it could be worth recreating the relevant Kafka table + materialized view pairs that are responsible for ingesting data.
In order to do this, first get the schema for each table:
```sql
-- save the results somewhere
SHOW CREATE TABLE kafka_events_json;
SHOW CREATE TABLE events_json_mv;
```
Then you should first drop the materialized view (remove the comment to run on all nodes):
```sql
DROP VIEW events_json_mv -- ON CLUSTER 'posthog'
```
And later the Kafka table:
```sql
DROP TABLE kafka_events_json -- ON CLUSTER 'posthog'
```
And recreate them in reverse order using the schemas you copied earlier, i.e. Kafka table first, then materialized view. Make sure to add the `ON CLUSTER` clause if you dropped the tables on the whole cluster.
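Once recreated, a quick sanity check (a sketch, assuming the `_timestamp` column mentioned earlier) is to confirm that new rows are flowing into `events` again:

```sql
-- The latest insert time should start advancing again shortly
-- if the Kafka table + materialized view pair is consuming correctly.
SELECT max(_timestamp) AS latest_insert
FROM events
WHERE timestamp > now() - INTERVAL 1 HOUR AND timestamp < now()
```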
Restarting the ClickHouse server on a node
Important: You should never do this if you're not comfortable doing so from experience operating ClickHouse! Please loop someone in if you feel like ClickHouse has gotten into a bad state (from looking at logs, metrics, etc.).
In extreme situations a ClickHouse node may have gotten into a bad state. In that case, restarting the ClickHouse server might help. To do this, you should SSH into the node and run:
```bash
sudo systemctl restart clickhouse-server
```
Quick reference
A list of relevant resources / information that is useful for debugging ingestion lag.
Kafka topics
- `events_plugin_ingestion`: Events sent from the Django events service to the plugin server. Consumed by a group poorly named `clickhouse-ingestion`.
- `clickhouse_events_json`: Events sent from the plugin server to ClickHouse. Consumed by a group called `group1` (and also by a group called `clickhouse-plugin-server-async` for the purpose of exports, but that's irrelevant here).
- `conversion_events_buffer`: Events that should wait (60s by default) to be processed. Produced to and consumed by the plugin server, using a group called `ingester`.
Tools
- https://metrics.posthog.net (Grafana)
- https://grafana.us-east-1.prod.posthog.dev (Grafana, requires VPN)
- https://grafana.eu-central-1.prod.posthog.dev (Grafana, requires VPN access)
- https://gregor.posthog.net (Kafka manager)
- https://metabase.posthog.net (Metabase)
- Sentry
- AWS console / CLI
- `kubectl`