Working with JSON

Last updated:

At PostHog, we store arbitrary payloads users send us for further analysis as JSON. As such, it's critical we do a good job at storing and analyzing this data.

This document covers:

  • Storing JSON in Strings and operations on them
  • Why and how to compress this data
  • Materialized columns
  • Alternative solutions: JSON data type, arrays

JSON Strings

At PostHog, we store JSON data as VARCHAR (or String) columns.

Relevant properties are then parsed out from the String columns at query-time using JSONExtract functions.

This has the following problems:

  1. These columns end up really large even after compression, meaning slow I/O
  2. It requires CPU to parse properties
  3. Data is not stored optimally. As an example, JSON keys are frequently repeated and numbers are stored as strings.

Compressing JSON

Luckily, JSON compresses really well, speeding up reading this data from disk.

By default our JSON columns are compressed by the LZ4 algorithm. See benchmarks for more information and benchmarks.

Materialized columns

ClickHouse has support for Materialized columns which are columns calculated dynamically based off of other columns.

We leverage them to dynamically create new columns for frequently-queried JSON keys to speed up queries as each materialized column is stored the same way as normal columns and requires less resources to read and parse.

Read more in our blog and in this guide for PostHog specific details.

Operational notes

After adding a materialized column, it is only populated for new data and on merges. When querying old data, this can introduce performance regressions, so forcing the column to be written to disk, even for historical data, is recommended.

Materialized columns may cause issues during operations - e.g. they can make copying data between tables painful. It's sometimes worth considering dropping them before large operations.

Alternative solutions

Arrays

Uber published an article on their logging, popularizing the idea to store JSON data as arrays: one for keys, one for values.

However internal benchmarking showed that in our use-case the improvement wasn't big enough to be worth the investment (yet)..

Semi-structured data / JSON data type

In 2022, ClickHouse released support for semi-structured data.

However after testing we encountered several fundamental problems which make this feature unusable in our case until they are resolved: 1, 2, 3, 4, and 5

Next in the ClickHouse manual: Query performance

Questions?

Was this page useful?

Next article

Query performance

This document goes over: What tools are available to understand and measure query performance Importance of page cache General tips and tricks for performant queries Tooling clickhouse-client clickhouse-client is a command-line application for running queries against ClickHouse. When executing queries, it details progress, execution time, how many rows and gigabytes of data were processed, and how much CPU was used. system.query_log ClickHouse saves all queries it runs into system.query_log…

Read next article