
Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

James Morrison, Christian Endres
ClickHouse Engineering Performance Database Open Source

AI-Generated Summary: This is an automated summary created using AI. For the full details and context, please read the original post.

Hidden Bottleneck in ClickHouse Exposed: A Story of Lock Contention in Query Planning

Cloudflare, a heavy user of ClickHouse, recently hit a performance problem in its billing pipeline. The usual suspects were checked and ruled out; the cause turned out to be lock contention in query planning, a bottleneck hidden in ClickHouse's internals. It surfaced after the team redesigned one of its largest ClickHouse tables, adding a column to the partitioning key to enable per-tenant retention.

Key Technical Details

  • Cloudflare uses ClickHouse to store over a hundred petabytes of data across a few dozen clusters.
  • They built a system called "Ready-Analytics" to simplify onboarding for internal teams, which stores data in a single, massive table with a namespace and standard schema.
  • The table is partitioned by day, with a retention policy that drops partitions older than 31 days.
  • The redesign added a new column to the partitioning key, allowing per-namespace retention.
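
To make the effect of the redesign concrete, here is a minimal Python sketch. It is not Cloudflare's implementation: it assumes the added key column is the namespace, so partitions become (namespace, day) pairs, and the namespace names and retention values are hypothetical. It shows how the partition count multiplies and how a per-namespace retention pass might select partitions to drop.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 31  # the table-wide retention mentioned above


def partitions_to_drop(partitions, retention_days, today):
    """Return the (namespace, day) partitions older than their retention.

    `partitions` is an iterable of (namespace, day) tuples and
    `retention_days` maps namespace -> days to keep. Both are
    hypothetical stand-ins for the real partition metadata.
    """
    expired = []
    for namespace, day in partitions:
        keep = retention_days.get(namespace, DEFAULT_RETENTION_DAYS)
        if day < today - timedelta(days=keep):
            expired.append((namespace, day))
    return expired


today = date(2025, 1, 31)
days = [today - timedelta(days=n) for n in range(40)]

# Old scheme: one partition per day.
day_partitions = days

# Assumed new scheme: one partition per (namespace, day).
namespaces = ["billing", "logs"]  # hypothetical namespaces
partitions = [(ns, d) for ns in namespaces for d in days]

retention = {"billing": 31, "logs": 7}  # hypothetical per-namespace policy
expired = partitions_to_drop(partitions, retention, today)
```

With two namespaces and 40 days of data, the day-only scheme has 40 partitions and the (namespace, day) scheme has 80. That multiplication is the cost paid for being able to expire each namespace on its own schedule.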

Practical Implications for Developers

  • When redesigning a table or extending its partitioning key, watch for lock contention in query planning, which may not show up in the usual performance checks.
  • ClickHouse's performance can be affected by the number of data parts and by how queries filter them, so changes to the partitioning scheme need careful planning.
  • For per-namespace retention, an allocation scheme such as max-min fairness can manage disk utilization by automatically "sharing" the available space.
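
The max-min fairness idea in the last point can be sketched in a few lines of Python. This is the textbook form of the algorithm, not necessarily how Cloudflare applies it; the function name, the namespace names, and the TiB figures are all hypothetical. Demands are served smallest first, each receiving at most an equal split of whatever capacity is left, so small consumers get everything they ask for and the rest is shared among the large ones.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` across `demands` using max-min fairness.

    `demands` maps a consumer name to the amount it wants. Consumers are
    processed from the smallest demand to the largest; each is granted
    either its full demand or an equal share of the remaining capacity,
    whichever is smaller.
    """
    allocation = {}
    remaining = float(capacity)
    pending = sorted(demands.items(), key=lambda item: item[1])
    for index, (name, demand) in enumerate(pending):
        fair_share = remaining / (len(pending) - index)
        granted = min(demand, fair_share)
        allocation[name] = granted
        remaining -= granted
    return allocation


# Hypothetical example: 100 TiB of disk shared by three namespaces.
shares = max_min_fair_share(100, {"small": 10, "medium": 40, "large": 80})
```

In this example the two smaller namespaces receive their full demand (10 and 40) and the largest receives the remaining 50, less than the 80 it asked for.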

Important Facts and Figures

  • Cloudflare stores over a hundred petabytes of data across a few dozen clusters.
  • The Ready-Analytics table has grown to over 2 PiB of data and ingests millions of rows per second.
  • Adding a column to the partitioning key increased the total number of data parts in the table.
