Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse
AI-Generated Summary: This is an automated summary created using AI. For the full details and context, please read the original post.
Hidden Bottleneck in ClickHouse Exposed: A Story of Lock Contention in Query Planning
Cloudflare, a heavy user of ClickHouse, recently saw its billing pipeline slow down. The team checked all the usual suspects without finding the cause, which turned out to be lock contention in query planning, a bottleneck hidden in ClickHouse's internals. The issue surfaced after the team redesigned one of its largest ClickHouse tables, adding a column to the partitioning key to enable per-tenant retention.
Key Technical Details
- Cloudflare uses ClickHouse to store over a hundred petabytes of data across a few dozen clusters.
- They built a system called "Ready-Analytics" to simplify onboarding for internal teams, which stores data in a single, massive table with a namespace and standard schema.
- The table was originally partitioned by day, with a retention policy that drops partitions older than 31 days.
- The redesign added a new column to the partitioning key, allowing per-namespace retention.
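The retention behavior described above can be sketched in a few lines. This is a minimal illustration, not Cloudflare's implementation: it assumes the new partitioning column is the namespace, so each partition is identified by a (namespace, day) pair. The function name `partitions_to_drop`, the namespace names, and the per-namespace retention values are all hypothetical; only the 31-day default comes from the summary.

```python
from datetime import date, timedelta

# From the summary: partitions older than 31 days are dropped by default.
DEFAULT_RETENTION_DAYS = 31


def partitions_to_drop(partitions, today, retention_days_by_namespace):
    """Return the (namespace, day) partitions that are older than their retention.

    Namespaces without an explicit setting fall back to the default retention.
    """
    expired = []
    for namespace, day in partitions:
        retention = retention_days_by_namespace.get(namespace, DEFAULT_RETENTION_DAYS)
        if (today - day).days > retention:
            expired.append((namespace, day))
    return expired


# Hypothetical data: three namespaces with partitions of different ages.
today = date(2025, 1, 31)
partitions = [
    ("billing", today - timedelta(days=90)),
    ("billing", today - timedelta(days=40)),
    ("debug_logs", today - timedelta(days=10)),
    ("debug_logs", today - timedelta(days=3)),
    ("other", today - timedelta(days=32)),
]
retention = {"billing": 365, "debug_logs": 7}  # hypothetical per-namespace settings
expired = partitions_to_drop(partitions, today, retention)
```

With a day-only partitioning key, one retention value has to apply to every namespace in a partition. Once the namespace is part of the key, each namespace's partitions can be dropped on their own schedule.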
Practical Implications for Developers
- When redesigning tables or adding new columns to the partitioning key, developers should be aware of potential lock contention in query planning.
- Query performance in ClickHouse depends on the number of data parts and on how queries filter them, so changes to the partitioning scheme need careful planning.
- Developers should consider using the max-min fairness algorithm to manage disk utilization and automatically "share" available space.
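For readers unfamiliar with max-min fairness, the following is a minimal sketch of the textbook allocation procedure. The summary does not say how Cloudflare applies it, so the function name `max_min_fair_share`, the tenant names, and the capacity figures are hypothetical. Demands are assumed to be non-negative.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` across `demands` using max-min fairness.

    Demands are served from smallest to largest. Each one receives either its
    full demand or an equal share of what is left, whichever is smaller, so
    capacity unused by small consumers flows to the larger ones.
    """
    allocation = {}
    remaining = float(capacity)
    ordered = sorted(demands.items(), key=lambda item: item[1])
    for index, (name, demand) in enumerate(ordered):
        # Equal split of the remaining capacity among consumers not yet served.
        fair_share = remaining / (len(ordered) - index)
        allocation[name] = min(float(demand), fair_share)
        remaining -= allocation[name]
    return allocation


# Hypothetical example: 100 units of disk shared by three namespaces.
shares = max_min_fair_share(100, {"small": 10, "medium": 40, "large": 80})
```

In this example the two smaller namespaces get everything they ask for, and the largest one receives the capacity that is left over rather than its full demand.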
Important Facts and Figures
- The Ready-Analytics table has grown to over 2 PiB of data and ingests millions of rows per second.
- The redesign added a new column to the partitioning key, increasing the total number of data parts in the table.
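A rough back-of-the-envelope sketch shows why a wider partitioning key increases the part count. It assumes the new key column is the namespace and that every namespace writes data every day, which makes the result an upper bound on partitions. The function name `active_partitions` and the namespace settings are hypothetical. Since each partition holds at least one data part, the number of parts grows at least as fast.

```python
def active_partitions(retention_days_by_namespace):
    """Upper bound on live partitions: one per namespace per retained day."""
    return sum(retention_days_by_namespace.values())


# Day-only key: one partition per day, roughly 31 live at a time.
before = active_partitions({"all_namespaces": 31})

# (namespace, day) key with hypothetical per-namespace retention settings.
after = active_partitions({"billing": 365, "debug_logs": 7, "other": 31})
```

Even with only three namespaces, the number of live partitions in this illustration rises from 31 to 403.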