ClickHouse Crypto Candle Storage Guide

Q: How much disk space do crypto candles take in ClickHouse?

ClickHouse compresses data extremely well — typically 5-10x better than uncompressed. A full history of 1-minute candles for 200 trading pairs from Binance (going back 3 years) usually fits in 2-5 GB after compression. You can verify compression ratios with the system.columns table.

Q: Can I use ClickHouse Cloud instead of self-hosting?

Yes, ClickHouse Cloud is a managed option that removes infrastructure overhead. It supports the same SQL dialect and engines as self-hosted. For solo traders or small teams, it's often more practical than maintaining your own server, though costs can add up at high ingest rates.

Q: Is ClickHouse better than TimescaleDB for candle data?

Both work well, but ClickHouse generally wins on raw analytical query speed and compression for pure OHLCV workloads. TimescaleDB is built on Postgres and is easier to adopt if your team already uses Postgres. For a greenfield trading infrastructure, most quant teams choose ClickHouse at scale.

Q: How do I handle multiple exchanges like Bybit, OKX, and Binance in the same table?

Include an 'exchange' column as the first field in your ORDER BY key. This physically co-locates all data for a given exchange on disk, making per-exchange filters extremely fast. Use LowCardinality(String) for the exchange column to enable dictionary compression.

Q: What timeframes should I store — all of them or just 1-minute?

Store only 1-minute candles and resample on the fly with SQL. Storing multiple timeframes duplicates data and creates consistency headaches when you need to update a historical candle. The resampling query is fast enough in ClickHouse that pre-computing timeframes rarely pays off.

Q: How do I connect my Python trading bot to ClickHouse?

Use the clickhouse-connect library (pip install clickhouse-connect) — it's the official Python client maintained by ClickHouse Inc. and supports both inserts and queries. For async workloads, asynch is a solid alternative built on aiohttp.

◈ Contents

→ What Is ClickHouse and Why Does It Matter for Trading?
→ Designing a Candle Storage Schema That Actually Scales
→ Ingesting Candle Data from Exchanges at Scale
→ Querying Candles: Aggregations, Resampling, and Indicators
→ Deduplication and Data Quality for Reliable Backtests
→ Frequently Asked Questions
→ Putting It All Together

Every time you look at a chart on Binance or Bybit, you're looking at candlestick data — millions of rows of open, high, low, close, and volume (OHLCV) records compressed into a visual story. Behind that chart is a database working hard to fetch and aggregate those rows in milliseconds. For most traders, that database is someone else's problem. But if you're building your own trading bot, backtesting engine, or signal platform, the database becomes your problem — and your competitive advantage. ClickHouse is the tool serious quant traders and data engineers reach for when the data gets big and the queries need to stay fast.

What Is ClickHouse and Why Does It Matter for Trading?

ClickHouse is an open-source columnar database built by Yandex for analytical workloads. Unlike traditional row-based databases like PostgreSQL or MySQL — which store each database record as a full row — ClickHouse stores data column by column. For trading data, this distinction is massive. When you want to calculate the average closing price of BTC/USDT over the last 90 days, a row-based database has to read every column of every row even though you only care about the 'close' column. ClickHouse reads only the column you asked for, skipping everything else.

Think of it like a spreadsheet. A row-based database is like scanning every cell in every row to answer a question about one column. A columnar database goes straight to that column and reads only what it needs. For OHLCV data with hundreds of millions of rows across dozens of trading pairs and timeframes, this translates directly to query times measured in milliseconds rather than seconds.

Key Takeaway: ClickHouse is purpose-built for analytical queries over large datasets. It scans only the columns you request, making it 10-100x faster than Postgres for typical trading data queries like aggregations, resampling, and range scans.

Designing a Candle Storage Schema That Actually Scales

The schema is where most people make mistakes. A naïve approach copies whatever the exchange API returns and dumps it into a table. That works at small scale. At 500 million rows across 200 trading pairs pulling from Binance, OKX, and Bybit simultaneously, the wrong schema becomes a bottleneck that no hardware upgrade can fix.

A solid candle table for ClickHouse typically looks like this. The engine choice and sort key are critical — they determine how data is physically stored on disk and how fast range queries perform.

CREATE TABLE candles
(
    exchange    LowCardinality(String),
    symbol      LowCardinality(String),
    timeframe   LowCardinality(String),
    open_time   DateTime64(3, 'UTC'),
    open        Float64,
    high        Float64,
    low         Float64,
    close       Float64,
    volume      Float64,
    close_time  DateTime64(3, 'UTC')
)
ENGINE = MergeTree()
ORDER BY (exchange, symbol, timeframe, open_time)
PARTITION BY toYYYYMM(open_time)
TTL open_time + INTERVAL 2 YEAR;

A few decisions worth explaining here. LowCardinality(String) is a ClickHouse optimization for columns with a small number of distinct values — exchange names like 'binance' or 'okx', symbols like 'BTCUSDT', and timeframes like '1m' or '4h' are perfect candidates. It compresses them into dictionary encoding automatically, cutting storage and speeding up filtering significantly. The ORDER BY clause defines the primary index — queries that filter or sort by exchange, symbol, timeframe, and then open_time will be extremely fast because that's exactly how rows are physically ordered on disk. The TTL line automatically deletes data older than 2 years, keeping the table from growing forever.

Key Takeaway: Your ORDER BY is your most important schema decision. Design it around your most frequent query patterns — usually filtering by symbol and timeframe, then scanning a time range. Get this right and your queries stay fast even at billions of rows.

Ingesting Candle Data from Exchanges at Scale

Fetching and storing candles efficiently requires thinking about both the write path and the read path. On the write side, ClickHouse performs best with batch inserts — sending thousands of rows at once rather than one row at a time. Single-row inserts create too many small parts on disk, which ClickHouse has to merge in the background, adding overhead and slowing things down.

A typical ingestion pattern pulls historical candles from the exchange REST API to backfill, then switches to WebSocket streams for live data. Binance provides WebSocket kline streams that push a new candle event every second during an active candle. OKX and Bybit offer similar WebSocket channels. You buffer incoming events in memory, then flush to ClickHouse in batches every few seconds.

import clickhouse_connect
import asyncio
from collections import defaultdict

client = clickhouse_connect.get_client(host='localhost', port=8123)
buffer = defaultdict(list)

async def flush_to_clickhouse():
    while True:
        await asyncio.sleep(5)  # flush every 5 seconds
        for key, rows in list(buffer.items()):
            if rows:
                client.insert(
                    'candles',
                    rows,
                    column_names=['exchange','symbol','timeframe',
                                  'open_time','open','high','low',
                                  'close','volume','close_time']
                )
                buffer[key].clear()

def on_kline_event(exchange, event):
    k = event['k']
    if k['x']:  # only store closed candles
        buffer[f"{exchange}_{k['s']}"].append([
            exchange, k['s'], k['i'],
            int(k['t']), float(k['o']), float(k['h']),
            float(k['l']), float(k['c']), float(k['v']),
            int(k['T'])
        ])

Notice the `if k['x']` check — this filters to only closed candles, avoiding duplicate writes for in-progress candles that update every second. For live signal generation, platforms like VoiceOfChain use a similar pattern: closed candles are committed to ClickHouse immediately, while the current live candle is held in memory and used for real-time calculations without polluting the historical store.

Querying Candles: Aggregations, Resampling, and Indicators

Once the data is in ClickHouse, the real power emerges. You can resample candles on the fly — turning 1-minute candles into 4-hour candles with a single query, without storing the resampled data separately. You can calculate rolling averages, detect volume spikes, and join candle data against other tables, all inside the database before results hit your application.

-- Resample 1m candles into 4h candles for BTCUSDT on Bybit
SELECT
    toStartOfInterval(open_time, INTERVAL 4 HOUR) AS ts,
    argMin(open, open_time)   AS open,
    max(high)                 AS high,
    min(low)                  AS low,
    argMax(close, open_time)  AS close,
    sum(volume)               AS volume
FROM candles
WHERE exchange = 'bybit'
  AND symbol = 'BTCUSDT'
  AND timeframe = '1m'
  AND open_time >= now() - INTERVAL 30 DAY
GROUP BY ts
ORDER BY ts;

This query scans 30 days of 1-minute candles — about 43,200 rows — and returns 4-hour candles in under 50 milliseconds on typical hardware. The argMin and argMax functions are ClickHouse-specific: they return the open price from the earliest row in each group, and the close price from the latest row, which is exactly what you need when resampling OHLC data. Running this same query against PostgreSQL on the same dataset typically takes 2-10 seconds — the difference between a snappy UI and a frustrating one.

For more advanced use cases, ClickHouse supports materialized views — you can define a view that automatically aggregates 1-minute candles into hourly candles as new data arrives, keeping pre-computed candles ready for fast reads without any application-level ETL pipeline.

Key Takeaway: Store 1-minute candles as your base granularity and resample on the fly with SQL. This keeps storage simple while giving you any timeframe on demand — 3m, 15m, 2h, 12h, whatever your strategy needs.

Deduplication and Data Quality for Reliable Backtests

Anyone who has run a backtest against bad data knows the pain. Your strategy looks profitable in testing, you go live on Binance or KuCoin, and reality disagrees. Bad candle data — duplicates, gaps, or incorrectly stored live candles — is one of the most common sources of backtest overfitting and false signals.

ClickHouse offers a ReplacingMergeTree engine specifically for deduplication. Instead of MergeTree, you declare the table with ReplacingMergeTree and ClickHouse will automatically merge duplicate rows with the same primary key, keeping only the latest version. This is useful when your ingestion pipeline might re-send candles due to restarts or network retries.

-- Use ReplacingMergeTree for automatic deduplication
CREATE TABLE candles
(
    exchange   LowCardinality(String),
    symbol     LowCardinality(String),
    timeframe  LowCardinality(String),
    open_time  DateTime64(3, 'UTC'),
    open       Float64,
    high       Float64,
    low        Float64,
    close      Float64,
    volume     Float64,
    close_time DateTime64(3, 'UTC'),
    inserted_at DateTime DEFAULT now()
)
ENGINE = ReplacingMergeTree(inserted_at)
ORDER BY (exchange, symbol, timeframe, open_time)
PARTITION BY toYYYYMM(open_time);

With ReplacingMergeTree, if you insert the same BTC 1-minute candle twice (which happens often when your WebSocket reconnects), ClickHouse keeps the newer version based on the inserted_at column. Important caveat: deduplication happens during background merges, not instantly. For queries where you need guaranteed deduplication right now, add FINAL to your SELECT: `SELECT ... FROM candles FINAL WHERE ...`. This forces ClickHouse to deduplicate before returning results, at the cost of slightly slower queries.

Always validate candle count per day against expected count (1440 for 1m candles)
Alert on gaps longer than 2 candles — exchange downtime or ingestion failure
Compare volume totals against exchange-reported daily volume as a sanity check
Store raw exchange responses separately for 30 days so you can replay on schema changes
Never use unfinished (live) candles in backtests — filter by close_time < now()

Frequently Asked Questions

How much disk space do crypto candles take in ClickHouse?

ClickHouse compresses data extremely well — typically 5-10x better than uncompressed. A full history of 1-minute candles for 200 trading pairs from Binance (going back 3 years) usually fits in 2-5 GB after compression. You can verify compression ratios with the system.columns table.

Can I use ClickHouse Cloud instead of self-hosting?

Yes, ClickHouse Cloud is a managed option that removes infrastructure overhead. It supports the same SQL dialect and engines as self-hosted. For solo traders or small teams, it's often more practical than maintaining your own server, though costs can add up at high ingest rates.

Is ClickHouse better than TimescaleDB for candle data?

Both work well, but ClickHouse generally wins on raw analytical query speed and compression for pure OHLCV workloads. TimescaleDB is built on Postgres and is easier to adopt if your team already uses Postgres. For a greenfield trading infrastructure, most quant teams choose ClickHouse at scale.

How do I handle multiple exchanges like Bybit, OKX, and Binance in the same table?

Include an 'exchange' column as the first field in your ORDER BY key. This physically co-locates all data for a given exchange on disk, making per-exchange filters extremely fast. Use LowCardinality(String) for the exchange column to enable dictionary compression.

What timeframes should I store — all of them or just 1-minute?

Store only 1-minute candles and resample on the fly with SQL. Storing multiple timeframes duplicates data and creates consistency headaches when you need to update a historical candle. The resampling query is fast enough in ClickHouse that pre-computing timeframes rarely pays off.

How do I connect my Python trading bot to ClickHouse?

Use the clickhouse-connect library (pip install clickhouse-connect) — it's the official Python client maintained by ClickHouse Inc. and supports both inserts and queries. For async workloads, asynch is a solid alternative built on aiohttp.

Putting It All Together

ClickHouse is not the easiest database to get started with — it has its own dialect, its own engine options, and its own performance tuning knobs. But for anyone serious about building trading infrastructure that scales, it is the right tool. The ability to resample any timeframe on demand, run aggregations over years of tick-level data in seconds, and ingest from multiple exchanges like Binance, Bybit, OKX, and Bitget simultaneously into a single coherent dataset makes it uniquely suited to the demands of systematic trading.

The stack that serious algo traders and signal platforms like VoiceOfChain use typically combines ClickHouse for historical OHLCV storage, a message queue like NATS or Kafka for real-time candle streaming, and a lightweight in-memory cache for the current live candle. This separation of concerns lets each layer do what it does best: ClickHouse handles deep historical queries, the message queue handles low-latency distribution, and the cache handles sub-second signal generation on incomplete candles.

Start simple: one table, one exchange, 1-minute candles. Get comfortable with the schema, the insert patterns, and the query syntax. Then expand to multiple exchanges and add materialized views for your most common aggregations. The learning curve is real but so is the payoff — when your backtest runs against 3 years of multi-exchange data in under a second, you'll understand why teams build around ClickHouse and don't look back.

Key Takeaway: Start with MergeTree or ReplacingMergeTree, store 1-minute candles only, and resample on query. Add materialized views only after you identify specific slow queries. Premature optimization of a ClickHouse schema causes the same problems as anywhere else.

◈ more on this topic

⌘ api Kraken API Documentation for Crypto Traders: Essentials and Examples

ClickHouse for Crypto Candle Storage: The Trader's Guide