
Mean Reversion Backtest in Python for Crypto Traders

Build a complete mean reversion trading strategy in Python, backtest it on real crypto data from Binance, and calculate Sharpe ratio, drawdown, and win rate — all from scratch.

Uncle Solieditor · voc · 05.05.2026
◈   Contents
  1. → What Is Mean Reversion and Why It Works in Crypto
  2. → Fetching Crypto Market Data with Python and CCXT
  3. → Building Mean Reversion Signals with Bollinger Bands and Z-Score
  4. → Running the Backtest and Calculating Performance Metrics
  5. → Position Sizing with the Kelly Criterion
  6. → Avoiding Overfitting and Validating Before Going Live
  7. → Frequently Asked Questions
  8. → Conclusion

Most retail traders lose money chasing momentum. Experienced quants often go the other direction — building systems around the statistical tendency of prices to revert to their historical mean. Mean reversion works because markets overshoot. Panic selling pushes Bitcoin 15% below its 20-day average; algorithmic buyers step in and the price snaps back. Capture that snap consistently and you have an edge. This guide walks through building, backtesting, and evaluating a mean reversion strategy in Python using real crypto market data — no hand-waving, no black boxes.

What Is Mean Reversion and Why It Works in Crypto

Mean reversion is the statistical hypothesis that asset prices oscillate around a long-term average, and extreme deviations are temporary. In traditional markets this process plays out slowly — over weeks or months. In crypto, it can resolve in hours, which is both the opportunity and the danger. The key mathematical tool is the z-score: how many standard deviations the current price sits from its rolling mean. A z-score of -2 means price is two standard deviations below average — statistically unusual and often followed by a bounce. Bollinger Bands visualize this by drawing envelopes two standard deviations above and below a moving average. When price touches the lower band, mean reversion traders go long expecting a return to the middle band.
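To make the z-score concrete, here is a minimal worked sketch in plain Python. The price list is made up for illustration and `rolling_z_score` is a hypothetical helper name, not part of any library. It uses the sample standard deviation, which is also the default in pandas.

```python
from statistics import mean, stdev

def rolling_z_score(prices, window=20):
    """Z-score of the latest price against the trailing `window` prices (inclusive)."""
    if len(prices) < window:
        raise ValueError("need at least `window` prices")
    recent = prices[-window:]
    mu = mean(recent)
    sigma = stdev(recent)  # sample standard deviation (ddof=1)
    if sigma == 0:
        return 0.0  # flat window: no deviation to measure
    return (recent[-1] - mu) / sigma

# Hypothetical prices: quiet around 100, then a sharp drop on the last candle
prices = [100.0, 101.0, 99.0, 100.0, 101.0, 99.0, 100.0, 101.0, 99.0, 94.0]

z = rolling_z_score(prices, window=10)
lower_band = mean(prices) - 2 * stdev(prices)  # lower Bollinger Band, 2 std devs

print(f"z-score: {z:.2f}")
print(f"lower band: {lower_band:.2f}, last price below band: {prices[-1] < lower_band}")
```

A z-score below -2 and a close below the lower Bollinger Band are the same condition when both use the same window and the same two-standard-deviation width, which is why the two tools are used interchangeably here.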

Not all crypto pairs mean-revert equally well. Stablecoin pairs like USDC/USDT are almost perfectly mean-reverting by design. Major pairs like BTC/USDT and ETH/USDT mean-revert on shorter timeframes (1h-4h) but trend on longer ones. Altcoins can mean-revert violently but also gap down on bad news and never recover. Your backtest will tell you which pairs and timeframes have historically favored this approach — that is the whole point of building this system before risking real capital.

Fetching Crypto Market Data with Python and CCXT

The CCXT library gives you a unified API to pull historical OHLCV data from over 100 exchanges. The same code that fetches data from Binance works with Bybit, OKX, and Bitget — you just change the exchange name. This matters for backtesting because different exchanges have different liquidity profiles, fee structures, and slippage. A strategy that looks excellent on Binance's deep BTC/USDT market might perform worse on the same pair on KuCoin, where the order book is thinner and spreads are wider.

import ccxt
import pandas as pd
import numpy as np

# Swap 'binance' for 'bybit', 'okx', 'bitget', etc. — same API
exchange = ccxt.binance({'enableRateLimit': True})

def fetch_ohlcv(symbol='BTC/USDT', timeframe='1h', limit=1000):
    ohlcv = exchange.fetch_ohlcv(symbol, timeframe=timeframe, limit=limit)
    df = pd.DataFrame(
        ohlcv,
        columns=['timestamp', 'open', 'high', 'low', 'close', 'volume']
    )
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    df.set_index('timestamp', inplace=True)
    return df

df = fetch_ohlcv('BTC/USDT', '1h', 1000)
print(f'Loaded {len(df)} candles: {df.index[0]} to {df.index[-1]}')
print(df[['open', 'high', 'low', 'close', 'volume']].tail(3))
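One caveat: 1000 hourly candles cover only about six weeks. A longer history usually means paging through several requests with the `since` argument of `fetch_ohlcv`. The sketch below is one way to do that. `fetch_ohlcv_history` is a hypothetical helper, the candle length is passed in as `timeframe_ms` rather than derived, and the page size of 1000 is an assumption that varies by exchange.

```python
def fetch_ohlcv_history(exchange, symbol, timeframe, since_ms, until_ms,
                        timeframe_ms=3_600_000, page_size=1000):
    """Page through exchange.fetch_ohlcv from since_ms up to (not including) until_ms.

    `exchange` is any object with a CCXT-style fetch_ohlcv method.
    Returns a list of [timestamp, open, high, low, close, volume] rows.
    """
    rows = []
    cursor = since_ms
    while cursor < until_ms:
        page = exchange.fetch_ohlcv(symbol, timeframe=timeframe,
                                    since=cursor, limit=page_size)
        if not page:
            break  # the exchange has no more candles to return
        # Keep only candles inside the requested range (also avoids duplicates)
        rows.extend(r for r in page if cursor <= r[0] < until_ms)
        next_cursor = page[-1][0] + timeframe_ms
        if next_cursor <= cursor:
            break  # no forward progress: stop instead of looping forever
        cursor = next_cursor
    return rows
```

With the `exchange` object defined above, a call might look like `fetch_ohlcv_history(exchange, 'BTC/USDT', '1h', exchange.parse8601('2024-01-01T00:00:00Z'), exchange.milliseconds())`. The resulting rows can be passed to `pd.DataFrame` with the same column names used in `fetch_ohlcv`.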

Building Mean Reversion Signals with Bollinger Bands and Z-Score

The core of the strategy is computing z-scores and generating buy signals when price falls too far below its rolling mean. The window size (typically 20 periods) and the entry threshold (typically ±2 standard deviations) are your primary parameters. On 1-hour BTC/USDT data from Binance, a 20-period window represents roughly 20 hours of price action — enough to capture short-term deviations while filtering out tick noise. The exit signal fires when price returns to the mean (z-score crosses zero) rather than waiting for the upper band, which keeps holding time short and reduces exposure to sudden trend reversals.

def add_bollinger_bands(df, window=20, num_std=2.0):
    df['sma'] = df['close'].rolling(window).mean()
    df['std'] = df['close'].rolling(window).std()
    df['upper'] = df['sma'] + num_std * df['std']
    df['lower'] = df['sma'] - num_std * df['std']
    # Z-score: signed distance from mean in std dev units
    df['z_score'] = (df['close'] - df['sma']) / df['std']
    return df

def generate_signals(df, entry_z=-2.0, exit_z=0.0):
    df['signal'] = 0
    # Long entry: price is unusually far below the mean
    df.loc[df['z_score'] < entry_z, 'signal'] = 1
    # Exit: price has reverted back to or above the mean
    df.loc[df['z_score'] > exit_z, 'signal'] = -1
    return df

df = add_bollinger_bands(df, window=20, num_std=2.0)
df = generate_signals(df, entry_z=-2.0, exit_z=0.0)

buy_signals = (df['signal'] == 1).sum()
print(f'Buy signals: {buy_signals} ({buy_signals / len(df) * 100:.1f}% of candles)')
print(df[['close', 'sma', 'z_score', 'signal']].tail(5))

Tip: If buy signals fire on fewer than 1% of candles, loosen entry_z to -1.5. If they fire constantly, tighten it to -2.5. You need at least 30-50 completed trades in your backtest for the statistics to mean anything, so aim for a signal frequency of 2-5%.
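Keep in mind that signal candles and trades are not the same count: several consecutive candles below the entry threshold belong to a single trade. The following sketch shows the difference on a made-up signal list. `signals_to_positions` and `count_entries` are hypothetical helper names, and the one-candle execution delay used in the backtest is ignored here for simplicity.

```python
def signals_to_positions(signals):
    """Turn per-candle signals (1 = enter, -1 = exit, 0 = nothing) into a 0/1 position list."""
    position = 0
    positions = []
    for s in signals:
        if s == 1 and position == 0:
            position = 1      # open a long only when flat
        elif s == -1 and position == 1:
            position = 0      # close only when holding
        positions.append(position)
    return positions

def count_entries(positions):
    """Count transitions from flat (0) to long (1)."""
    entries = 0
    prev = 0
    for p in positions:
        if p == 1 and prev == 0:
            entries += 1
        prev = p
    return entries

# Hypothetical signals: two entry candles in a row, an exit, then one more round trip
signals = [0, 1, 1, 0, -1, -1, 1, 0, -1, 0]
positions = signals_to_positions(signals)

print("raw buy signals:", signals.count(1))
print("actual entries :", count_entries(positions))
```

To run it on the DataFrame built above, pass `df['signal'].tolist()` as the input.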

Running the Backtest and Calculating Performance Metrics

A basic event-driven backtest loops through every candle, checks the previous candle's signal, and enters or exits at the current candle's open price. Using the previous candle's signal to enter at the next open is critical — entering at the same close that generated the signal introduces look-ahead bias, one of the most common mistakes that makes paper results look better than live performance will ever be. The performance metrics that matter most for a mean reversion strategy are Sharpe ratio (risk-adjusted return), maximum drawdown (worst peak-to-trough loss you would have endured), and win rate (mean reversion strategies typically win often but their losses can be larger than individual wins).

def backtest(df, initial_capital=10_000, fee=0.001):
    """Event-driven backtest. Uses prev candle signal to enter at next open."""
    capital = initial_capital
    position = 0.0
    trades = []
    equity_curve = []

    for i in range(1, len(df)):
        prev_signal = df['signal'].iloc[i - 1]  # Signal from last candle
        entry_price = df['open'].iloc[i]          # Execute at current open

        if prev_signal == 1 and position == 0:
            shares = (capital * 0.95) / entry_price
            cost = shares * entry_price * (1 + fee)
            if cost <= capital:
                position = shares
                capital -= cost
                trades.append({'type': 'buy', 'price': entry_price, 'idx': i})

        elif prev_signal == -1 and position > 0:
            proceeds = position * entry_price * (1 - fee)
            capital += proceeds
            trades.append({'type': 'sell', 'price': entry_price, 'idx': i})
            position = 0.0

        equity_curve.append(capital + position * df['close'].iloc[i])

    if position > 0:  # Close any open position at last price
        capital += position * df['close'].iloc[-1] * (1 - fee)

    return capital, trades, pd.Series(equity_curve, index=df.index[1:])


def calculate_metrics(equity, trades, initial_capital=10_000, fee=0.001):
    returns = equity.pct_change().dropna()

    # Annualized Sharpe (hourly data = 8760 periods/year, since crypto trades 24/7)
    # Guard against a flat equity curve, where the standard deviation is zero
    vol = returns.std()
    sharpe = (returns.mean() / vol) * np.sqrt(8760) if vol > 0 else 0.0

    # Max Drawdown
    rolling_max = equity.cummax()
    max_dd = ((equity - rolling_max) / rolling_max).min() * 100

    # Win Rate: a trade counts as a win only if it is profitable after fees on both sides
    buys  = [t for t in trades if t['type'] == 'buy']
    sells = [t for t in trades if t['type'] == 'sell']
    pairs = list(zip(buys, sells))
    wins  = sum(1 for b, s in pairs
                if s['price'] * (1 - fee) > b['price'] * (1 + fee))
    win_rate = wins / len(pairs) * 100 if pairs else 0

    total_return = (equity.iloc[-1] - initial_capital) / initial_capital * 100

    print(f'Total Return : {total_return:.2f}%')
    print(f'Sharpe Ratio : {sharpe:.2f}')
    print(f'Max Drawdown : {max_dd:.2f}%')
    print(f'Win Rate     : {win_rate:.1f}%')
    print(f'Total Trades : {len(pairs)}')
    return {'return': total_return, 'sharpe': sharpe, 'max_dd': max_dd, 'win_rate': win_rate}

final_cap, trades, equity = backtest(df)
metrics = calculate_metrics(equity, trades)

Benchmark thresholds for mean reversion backtest results on BTC/USDT 1H

Metric                | Poor  | Acceptable | Strong
Sharpe Ratio          | < 0.5 | 0.5 – 1.5  | > 1.5
Max Drawdown          | > 30% | 15% – 30%  | < 15%
Win Rate              | < 50% | 50% – 65%  | > 65%
Total Trades (sample) | < 20  | 20 – 100   | > 100
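To see how two of these numbers come about, here is a small worked sketch on a made-up five-point equity curve, in plain Python so the arithmetic is easy to follow. The helper names are illustrative. The `periods_per_year` argument is the part to watch: 8760 fits hourly candles, while 4-hour candles would need 2190 and daily candles 365.

```python
import math
from statistics import mean, stdev

def max_drawdown_pct(equity):
    """Worst peak-to-trough decline of an equity curve, in percent (negative number)."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, (value - peak) / peak)
    return worst * 100

def annualized_sharpe(equity, periods_per_year=8760):
    """Mean over standard deviation of per-period returns, scaled to one year."""
    rets = [b / a - 1 for a, b in zip(equity, equity[1:])]
    sd = stdev(rets)
    if sd == 0:
        return 0.0
    return mean(rets) / sd * math.sqrt(periods_per_year)

# Hypothetical equity curve: a gain, a dip from 10,200 to 9,690, then a recovery
equity = [10_000, 10_200, 9_690, 9_900, 10_500]

print(f"max drawdown: {max_drawdown_pct(equity):.1f}%")
print(f"Sharpe (hourly scaling): {annualized_sharpe(equity, 8760):.2f}")
print(f"Sharpe (daily scaling) : {annualized_sharpe(equity, 365):.2f}")
```

The same returns produce very different Sharpe ratios under different scaling, so the annualization factor has to match the candle timeframe before comparing against the table above.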

Position Sizing with the Kelly Criterion

Fixed position sizing — always using 95% of capital — is fine for running backtests but dangerous in live trading. The Kelly Criterion gives you a mathematically optimal fraction of capital to risk per trade based on your historical win rate and average win-to-loss ratio. In practice, traders use half-Kelly to account for estimation error and model drift. Crypto markets shift regimes constantly: a strategy that posted a 62% win rate during a sideways 2023 market may drop to 51% during a 2024 trend-driven rally. Always apply a hard cap regardless of what the formula suggests. On Bybit and OKX, you set this as a fixed percentage of account balance per order.

def kelly_criterion(win_rate: float, avg_win: float, avg_loss: float,
                    max_risk: float = 0.02) -> float:
    """
    Returns recommended fraction of capital to risk per trade.
    win_rate : historical win probability, e.g. 0.58
    avg_win  : mean profit per winning trade (dollars)
    avg_loss : mean loss per losing trade (positive, dollars)
    max_risk : hard cap — never risk more than this regardless of Kelly
    """
    if avg_loss == 0:
        return max_risk

    b = avg_win / avg_loss     # payoff ratio
    p, q = win_rate, 1 - win_rate
    kelly = (b * p - q) / b   # Full Kelly
    half_kelly = kelly * 0.5  # Use half for robustness

    return float(np.clip(half_kelly, 0, max_risk))


def extract_trade_stats(trades):
    buys  = [t for t in trades if t['type'] == 'buy']
    sells = [t for t in trades if t['type'] == 'sell']
    pnls  = [s['price'] - b['price'] for b, s in zip(buys, sells)]

    wins   = [p for p in pnls if p > 0]
    losses = [abs(p) for p in pnls if p <= 0]

    win_rate = len(wins) / len(pnls)    if pnls   else 0.0
    avg_win  = float(np.mean(wins))     if wins   else 0.0
    avg_loss = float(np.mean(losses))   if losses else 1.0

    return win_rate, avg_win, avg_loss


wr, aw, al = extract_trade_stats(trades)
size = kelly_criterion(wr, aw, al, max_risk=0.02)
print(f'Win Rate     : {wr*100:.1f}%')
print(f'Avg Win/Loss : ${aw:.2f} / ${al:.2f}')
print(f'Position Size: {size*100:.2f}% of capital per trade')

Warning: Never trade full Kelly. It is mathematically optimal but causes brutal drawdowns when your win rate estimate is even slightly off. Half-Kelly gives up roughly 25% of the expected long-run growth rate in exchange for about half the volatility, a trade worth making every time.
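The trade-off between full and half Kelly can be checked numerically. The sketch below assumes the textbook binary-bet model, in which a win pays `b` times the stake and a loss costs the whole stake. The 58% win rate and 1.2 payoff ratio are made-up inputs, and `log_growth` is an illustrative helper name.

```python
import math

def kelly_fraction(p, b):
    """Full-Kelly fraction for a binary bet: win b per unit staked with probability p."""
    return (b * p - (1 - p)) / b

def log_growth(f, p, b):
    """Expected log growth per bet when staking fraction f of capital."""
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)

# Hypothetical inputs: 58% win rate, average win 1.2x the average loss
p, b = 0.58, 1.2

full = kelly_fraction(p, b)
half = full / 2
ratio = log_growth(half, p, b) / log_growth(full, p, b)

print(f"full Kelly: {full:.3f}, half Kelly: {half:.3f}")
print(f"half Kelly keeps {ratio:.0%} of the full-Kelly growth rate")
print(f"growth at double Kelly: {log_growth(2 * full, p, b):.4f}")
```

Staking twice the Kelly fraction pushes expected growth to roughly zero or below, which is why an overestimated win rate is so costly and why a hard cap such as `max_risk` is a sensible backstop.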

Avoiding Overfitting and Validating Before Going Live

The most dangerous mistake in backtesting is optimizing parameters until historical results look amazing, then discovering the strategy barely works in live markets. Overfitting is invisible in your backtest output — you only find out when real money is on the line. The standard defense is a strict train/test split: optimize your parameters on the first 70% of data, then evaluate the final version on the held-out 30% without touching anything. If performance drops significantly on the test set, your parameters are overfit to noise. Also test across different market regimes — the 2022 bear market and the 2024 bull run behave completely differently, and a robust strategy should survive both.
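A chronological split is simple to get right as long as the data is never shuffled. Here is a minimal sketch with hypothetical helper names. It works on any sliceable sequence of candles, and for a DataFrame the equivalent is `df.iloc[:cut]` and `df.iloc[cut:]`. What counts as a "significant" drop is a judgment call, so the helper only reports the size of the drop.

```python
def chronological_split(rows, train_frac=0.7):
    """Split time-ordered rows into an earlier training part and a later test part."""
    if not 0 < train_frac < 1:
        raise ValueError("train_frac must be between 0 and 1")
    cut = round(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def degradation(train_score, test_score):
    """Fractional drop from in-sample to out-of-sample score (positive = worse)."""
    if train_score <= 0:
        raise ValueError("train_score must be positive")
    return (train_score - test_score) / train_score

# Hypothetical example: 1000 candles identified by their position in time
candles = list(range(1000))
train, test = chronological_split(candles, train_frac=0.7)

print(f"train: {len(train)} candles, test: {len(test)} candles")
print(f"Sharpe 1.2 in-sample vs 0.6 out-of-sample: {degradation(1.2, 0.6):.0%} drop")
```

Every candle in the test part comes after every candle in the training part, which is the property a random split would destroy.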

Parameter stability is a second key validation test. Run your backtest across a grid of window sizes and entry thresholds. A strategy that only works precisely at window=20 and entry_z=-2.0 is suspicious. A strategy that performs reasonably across window values of 15 to 25, and entry thresholds of -1.8 to -2.2, is likely capturing a real market inefficiency rather than curve-fitted noise. Once you pass both tests, paper trade on Bybit or OKX (both have free paper trading with live market data) for at least two to four weeks before committing capital. Pair that with VoiceOfChain's real-time signal feed to cross-reference your algorithmic entries with on-chain whale activity and exchange flow data that pure price-based models will never capture on their own.
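The grid check described above can be sketched as a small sweep helper. `sweep` and `stable_fraction` are hypothetical names, and `toy_score` is a stand-in with a made-up shape. In real use the scoring function would rerun the indicator, signal, and backtest steps on a copy of the data and return a metric such as the Sharpe ratio.

```python
from itertools import product

def sweep(score_fn, windows, entry_zs):
    """Evaluate score_fn(window, entry_z) over every combination in the grid."""
    return {(w, z): score_fn(w, z) for w, z in product(windows, entry_zs)}

def stable_fraction(results, threshold):
    """Share of grid cells whose score is at least `threshold`."""
    good = sum(1 for score in results.values() if score >= threshold)
    return good / len(results)

def toy_score(window, entry_z):
    """Stand-in for a real backtest: best at (20, -2.0), degrading gently around it."""
    return 1.0 - 0.05 * abs(window - 20) - abs(entry_z + 2.0)

results = sweep(toy_score, windows=[15, 20, 25], entry_zs=[-1.8, -2.0, -2.2])

for (w, z), score in sorted(results.items()):
    print(f"window={w:2d} entry_z={z:+.1f} score={score:.2f}")
print(f"cells scoring >= 0.5: {stable_fraction(results, 0.5):.0%}")
```

A broad plateau of acceptable scores suggests a real effect. A single sharp peak surrounded by poor cells is the pattern to distrust.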

Frequently Asked Questions

Does mean reversion work in crypto bull markets?
Mean reversion strategies typically underperform during strong trends because price keeps extending away from the mean rather than snapping back. The fix is adding a trend filter: for example, only taking mean reversion longs when the 200-period moving average is flat or rising, so the strategy stops buying dips while the broader trend is down. This reduces trade frequency but cuts losses from catching falling knives in trending markets.
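A minimal sketch of such a trend filter follows, under the assumption that longs should be blocked while the long moving average is falling (the falling-knife case). `trend_allows_long` is a hypothetical helper, and the 24-candle lookback used to measure the slope is an arbitrary illustrative choice.

```python
def trend_allows_long(closes, window=200, lookback=24, max_decline=0.0):
    """True when the `window`-period SMA has not fallen over the last `lookback` candles.

    lookback must be at least 1. max_decline is the tolerated fractional
    drop in the SMA (0.0 means any decline blocks new longs).
    """
    if lookback < 1:
        raise ValueError("lookback must be at least 1")
    if len(closes) < window + lookback:
        return False  # not enough history to judge the trend
    sma_now = sum(closes[-window:]) / window
    sma_then = sum(closes[-window - lookback:-lookback]) / window
    slope = sma_now / sma_then - 1
    return slope >= -max_decline

# Hypothetical price paths: a steady climb and a steady decline
rising = [float(p) for p in range(1, 301)]
falling = [float(p) for p in range(300, 0, -1)]

print("rising market allows longs :", trend_allows_long(rising))
print("falling market allows longs:", trend_allows_long(falling))
```

In a backtest this check would be combined with the entry condition, so a long is opened only when the z-score is below the threshold and the filter returns True.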
How much historical data do I need for a reliable backtest?
Aim for at least 50-100 completed round-trip trades in your backtest. Otherwise your win rate and Sharpe estimates carry too much statistical variance to be meaningful. For BTC/USDT on a 1-hour timeframe with a tight entry threshold, this typically requires 3-12 months of data. A single 1000-candle request covers only about six weeks of hourly data, so a longer history means combining several requests. Fewer trades mean your results could look good or bad purely by luck.
Can I automate this strategy live on Binance or Bybit?
Yes. The CCXT library used here for data fetching also handles live order placement on Binance, Bybit, OKX, Bitget, Gate.io, and most major exchanges using the same function signatures. Once you have a validated strategy, you replace the backtest loop with a scheduler (APScheduler or a cron job) that runs every candle close, generates signals, and places real orders via the exchange API.
What is the difference between mean reversion and arbitrage?
Arbitrage exploits price differences for the same asset across two markets — buying BTC on Coinbase and simultaneously selling on Binance when a gap appears — and is theoretically risk-free when executed fast enough. Mean reversion is a statistical directional bet that a single asset's price will return to its own historical average. It involves real directional risk and can lose money if price continues moving away from the mean.
Why does my backtest look great but live performance is poor?
The most common culprits are look-ahead bias (using close price both to generate the signal and to enter the trade), overfitted parameters, and underestimated transaction costs. Always use the previous candle's signal to enter at the next candle's open, model realistic fees (0.1% on Binance spot), and add 0.05-0.1% for slippage. Together these small adjustments can cut apparent backtest returns by 30-50%, and the reduced figure is the realistic one.
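The effect of fees and slippage on a single round trip can be worked out directly. This sketch assumes a 0.1% fee per side and 0.075% slippage per side (the midpoint of the range above), with slippage modeled as a worse fill price on both entry and exit. The function names are illustrative.

```python
def breakeven_move(fee=0.001, slippage=0.00075):
    """Gross price move a long round trip needs just to net zero after costs."""
    buy_cost = (1 + slippage) * (1 + fee)    # fill above the quote, plus fee
    sell_keep = (1 - slippage) * (1 - fee)   # fill below the quote, minus fee
    return buy_cost / sell_keep - 1

def net_return(gross_move, fee=0.001, slippage=0.00075):
    """Return of a long round trip after fees and slippage on both sides."""
    buy_cost = (1 + slippage) * (1 + fee)
    sell_keep = (1 - slippage) * (1 - fee)
    return (1 + gross_move) * sell_keep / buy_cost - 1

print(f"breakeven move     : {breakeven_move():.3%}")
print(f"net on a 1% bounce : {net_return(0.01):.3%}")
print(f"net with zero costs: {net_return(0.01, fee=0, slippage=0):.3%}")
```

Under these assumptions a mean reversion bounce has to clear about a third of a percent before it earns anything, which matters a great deal for a strategy whose typical winner is small.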
Is z-score or RSI better for mean reversion entry signals?
Both work but measure different things. Z-score measures deviation from a rolling mean in standardized units, so a z-score of -2 means the same thing regardless of asset price. RSI is a momentum oscillator bounded between 0 and 100 that is easier to explain and is built into the charting tools of exchanges like Bybit and Gate.io. For quantitative backtesting, z-score gives more precise and consistent entry thresholds; RSI is preferable if you need the signal visible on a standard trading terminal.

Conclusion

Mean reversion backtesting in Python is one of the most transferable skills in algorithmic crypto trading. The framework built here — CCXT data fetching, z-score signal generation, event-driven backtesting, Kelly position sizing — is not a toy. You can extend it to pairs trading across correlated assets, scale it to sweep dozens of symbols simultaneously on Binance and Bybit, or layer in on-chain filters from VoiceOfChain to improve signal quality. The discipline that separates profitable quants from expensive hobbyists is always the same: test out-of-sample, model realistic fees, and resist the temptation to optimize until your results look perfect. A great strategy looks mediocre in a backtest and consistent in live trading — that is the target.
