WebSocket Failover for Crypto Bots: Never Miss a Trade
Learn how to build resilient WebSocket failover logic for crypto trading bots — with real Python code for Binance, Bybit, and OKX connections.
Your bot is live, catching signals from VoiceOfChain, and then — silence. The WebSocket dropped. By the time it reconnects, you've missed a 3% move on BTC. This is not a hypothetical. It happens every week to traders running bots without proper failover logic. A WebSocket connection to Binance, Bybit, or OKX is not a utility pipe — it's a fragile, stateful stream that drops under load spikes, during exchange maintenance windows, and when your ISP has a bad night. The difference between a professional bot and an amateur one is not the strategy. It's how the bot behaves when things go wrong.
Every major exchange imposes hard limits on WebSocket connections. Binance, for example, closes connections after 24 hours and has documented a server ping every 3 minutes: if your bot doesn't answer with a pong within 10 minutes, you're disconnected without warning. Exact intervals differ between Binance products and have changed over time, so check the current docs for the stream you use. Bybit puts the heartbeat on the client side instead and recommends sending a ping every 20 seconds. OKX has been known to silently drop connections during high-volatility periods when their infrastructure is under stress.
Beyond exchange-side limits, there are network-layer issues: NAT timeouts on cloud servers, TCP keepalive misconfigurations, and transient DNS failures. If your bot is hosted on AWS or a VPS, NAT gateways and intermediate routers will silently drop TCP connections once they have been idle past the device's timeout, often just a few minutes, even on a stream that was busy shortly before. Neither side is told, so the result is a bot that thinks it's connected but is actually listening to nothing.
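As a rough illustration of the TCP keepalive point, here is a minimal sketch that turns on OS-level keepalive probes for a socket. The function name `enable_tcp_keepalive` and the timing values are assumptions chosen for illustration, and the fine-tuning options only exist on some platforms (Linux exposes them), so the code applies each one only if it is available.

```python
import socket


def enable_tcp_keepalive(sock: socket.socket, idle: int = 60,
                         interval: int = 15, probes: int = 4) -> None:
    """Ask the OS to probe an idle connection so NAT entries stay warm."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tuning constants are platform-specific, so apply each one
    # only if this Python build exposes it.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the OS gives up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

How a pre-configured socket gets handed to your WebSocket library varies, so check its documentation. Application-level pings serve a similar purpose and are usually the simpler option.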
A naive reconnect is a while loop that catches an exception and calls connect() again. That works for 10 minutes in a demo. In production, it creates thundering herd problems: every bot instance hits the exchange simultaneously after a shared outage, triggering rate limits. Real failover logic needs exponential backoff, jitter, and a separate health-check task that detects silent drops before your order logic notices.

    import asyncio
    import json
    import logging
    import random
    import time

    import websockets

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    BINANCE_WS_URL = "wss://stream.binance.com:9443/ws/btcusdt@trade"
    MAX_RETRIES = 10   # caps the backoff exponent; the loop itself retries forever
    BASE_DELAY = 1.0   # seconds
    MAX_DELAY = 60.0   # seconds


    class WebSocketClient:
        def __init__(self, url: str):
            self.url = url
            self.ws = None
            self.last_message_time = time.time()
            self.running = True

        async def on_message(self, message: str):
            self.last_message_time = time.time()
            data = json.loads(message)
            price = data.get("p")  # trade price, sent as a string
            logger.info(f"BTC trade: ${price}")
            # Feed into your signal logic here

        async def health_check(self):
            """Detects silent drops: no message for 30s = reconnect."""
            while self.running:
                await asyncio.sleep(10)
                if self.ws and time.time() - self.last_message_time > 30:
                    logger.warning("Silent drop detected, forcing reconnect")
                    await self.ws.close()

        async def connect(self):
            attempt = 0
            # Keep a reference so the task is not garbage-collected mid-run
            health_task = asyncio.create_task(self.health_check())
            try:
                while self.running:
                    try:
                        delay = min(BASE_DELAY * (2 ** attempt) + random.uniform(0, 1), MAX_DELAY)
                        if attempt > 0:
                            logger.info(f"Reconnect attempt {attempt}, waiting {delay:.1f}s")
                            await asyncio.sleep(delay)
                        async with websockets.connect(self.url, ping_interval=20, ping_timeout=10) as ws:
                            self.ws = ws
                            # Start a fresh silence window for the health check
                            self.last_message_time = time.time()
                            logger.info(f"Connected to {self.url}")
                            async for message in ws:
                                attempt = 0  # reset only once data is actually flowing
                                await self.on_message(message)
                        # Clean close by the server: back off instead of hot-looping
                        attempt = min(attempt + 1, MAX_RETRIES)
                    except (websockets.ConnectionClosed, OSError) as e:
                        logger.error(f"Connection lost: {e}")
                        attempt = min(attempt + 1, MAX_RETRIES)
                    except Exception as e:
                        logger.error(f"Unexpected error: {e}")
                        attempt = min(attempt + 1, MAX_RETRIES)
                    finally:
                        self.ws = None
            finally:
                health_task.cancel()


    async def main():
        client = WebSocketClient(BINANCE_WS_URL)
        await client.connect()


    if __name__ == "__main__":
        asyncio.run(main())

Always add jitter (random.uniform) to your backoff delay. Without it, all bot instances reconnect at exactly the same moment after a shared outage — this hammers the exchange and gets your IP rate-limited or temporarily banned.
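To make the backoff schedule easier to reason about and test, the delay calculation can be pulled out into a small pure function. The sketch below is one way to do it; the name `backoff_delay` and its parameters are hypothetical. It reproduces the additive jitter used in the reconnect loop above and also offers "full jitter" (a uniform draw over the whole window), a common alternative that is not part of the original code.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  full_jitter: bool = False, rng=random) -> float:
    """Seconds to wait before reconnect number `attempt` (0-based)."""
    window = min(cap, base * (2 ** attempt))
    if full_jitter:
        # Full jitter: pick uniformly from [0, window] to spread clients widely.
        return rng.uniform(0.0, window)
    # Additive jitter: exponential delay plus up to one second, capped.
    return min(cap, window + rng.uniform(0.0, 1.0))


# Example: print one possible schedule for the first six attempts.
for n in range(6):
    print(n, round(backoff_delay(n), 2))
```

Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests.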
Single-exchange bots have a single point of failure. If Binance goes into maintenance or Bybit's WebSocket API lags (both have happened during major market moves), your bot goes blind. The solution is a multi-source architecture: subscribe to the same market data from two exchanges simultaneously, and use whichever stream is healthiest. For BTC/USDT, Binance and Bybit prices typically track each other closely, so either works as a fallback for signal generation. Small cross-venue differences do exist, so don't treat the fallback price as an execution price on the other venue.

    import asyncio
    import json
    import time
    from dataclasses import dataclass, field
    from typing import Optional

    import websockets

    STALE_AFTER = 10.0  # seconds without an update before a stream is ignored (assumed value)


    @dataclass
    class StreamState:
        name: str
        url: str
        last_price: Optional[float] = None
        last_update: float = field(default_factory=time.time)
        healthy: bool = False


    # List order doubles as priority order
    streams = [
        StreamState("Binance", "wss://stream.binance.com:9443/ws/btcusdt@trade"),
        StreamState("Bybit", "wss://stream.bybit.com/v5/public/spot"),
        StreamState("OKX", "wss://ws.okx.com:8443/ws/v5/public"),
    ]
    active_price = {"value": None, "source": None}


    def best_price() -> Optional[float]:
        """Return the price from the first connected, non-stale stream."""
        now = time.time()
        for s in streams:
            fresh = now - s.last_update <= STALE_AFTER
            if s.healthy and fresh and s.last_price is not None:
                active_price["value"] = s.last_price
                active_price["source"] = s.name
                return s.last_price
        return None


    async def binance_listener(state: StreamState):
        while True:
            try:
                async with websockets.connect(state.url, ping_interval=20) as ws:
                    state.healthy = True
                    async for msg in ws:
                        data = json.loads(msg)
                        state.last_price = float(data["p"])
                        state.last_update = time.time()
            except Exception as e:
                print(f"{state.name} stream error: {e}")
            # Reached on errors and on clean closes alike
            state.healthy = False
            await asyncio.sleep(5)


    async def price_consumer():
        while True:
            price = best_price()
            if price is not None:
                print(f"BTC: ${price:.2f} (via {active_price['source']})")
            await asyncio.sleep(1)


    async def main():
        # Run all stream listeners + consumer concurrently
        await asyncio.gather(
            binance_listener(streams[0]),
            price_consumer(),
            # Bybit and OKX need their own listeners: both expect a subscribe
            # message after connecting and use different message formats
        )


    if __name__ == "__main__":
        asyncio.run(main())

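The Bybit and OKX listeners cannot reuse the Binance parser as-is. As far as I know, both public endpoints expect a subscribe message after connecting and wrap trades in their own envelope. The sketch below shows what the subscription payloads and price extractors might look like; the function names are hypothetical, and the message shapes reflect my understanding of the v5 public trade channels, so verify them against each exchange's current documentation.

```python
import json
from typing import Optional

# Subscription requests to send right after connecting (symbols are examples).
BYBIT_SUBSCRIBE = json.dumps({"op": "subscribe", "args": ["publicTrade.BTCUSDT"]})
OKX_SUBSCRIBE = json.dumps(
    {"op": "subscribe", "args": [{"channel": "trades", "instId": "BTC-USDT"}]}
)


def parse_bybit_trade(raw: str) -> Optional[float]:
    """Return the latest trade price from a Bybit publicTrade message, else None."""
    msg = json.loads(raw)
    if not str(msg.get("topic", "")).startswith("publicTrade."):
        return None  # subscription acks, pongs, other topics
    trades = msg.get("data") or []
    return float(trades[-1]["p"]) if trades else None


def parse_okx_trade(raw: str) -> Optional[float]:
    """Return the latest trade price from an OKX trades message, else None."""
    if raw == "pong":
        return None  # plain-text reply to a plain-text "ping"
    msg = json.loads(raw)
    if msg.get("arg", {}).get("channel") != "trades":
        return None
    trades = msg.get("data") or []
    return float(trades[-1]["px"]) if trades else None
```

A listener for either exchange would send the subscribe payload once per connection, then pass each incoming message through the matching parser and update its `StreamState` only when the result is not `None`.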
Platforms like VoiceOfChain aggregate signals from multiple sources and normalize them — so when you're consuming pre-processed trading signals rather than raw feeds, the multi-source problem is partially handled upstream. But for bots that need raw order book data from specific exchanges like Gate.io or KuCoin, you still need this failover layer in your own code.
Price feeds are public, but order updates and account data require authenticated WebSocket streams. The challenge here is that authentication tokens on Binance expire after 60 minutes (listenKey rotation), while Bybit's private WebSocket uses API key signing with a timestamp. When your private stream drops, you can't just reconnect — you have to re-authenticate first. If your listenKey expired during the outage, the reconnect will silently succeed but you'll stop receiving order updates.

    import asyncio
    import json
    import os

    import aiohttp
    import websockets

    # Creating and refreshing a listenKey needs only the API key, no signature
    API_KEY = os.environ["BINANCE_API_KEY"]
    BASE_REST = "https://api.binance.com"
    WS_BASE = "wss://stream.binance.com:9443/ws"
    HEADERS = {"X-MBX-APIKEY": API_KEY}

    # Shared holder so the keepalive task always pings the current key
    current = {"listen_key": None}


    async def get_listen_key(session: aiohttp.ClientSession) -> str:
        """Create or refresh Binance listenKey for private stream."""
        async with session.post(f"{BASE_REST}/api/v3/userDataStream", headers=HEADERS) as r:
            r.raise_for_status()
            data = await r.json()
            return data["listenKey"]


    async def keep_listen_key_alive(session: aiohttp.ClientSession):
        """Ping every 30 min to prevent 60-min expiry."""
        while True:
            await asyncio.sleep(1800)  # 30 minutes
            try:
                async with session.put(
                    f"{BASE_REST}/api/v3/userDataStream",
                    headers=HEADERS,
                    params={"listenKey": current["listen_key"]},
                ) as r:
                    r.raise_for_status()
            except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                print(f"listenKey keepalive failed: {e}")


    async def private_stream():
        async with aiohttp.ClientSession() as session:
            current["listen_key"] = await get_listen_key(session)
            # Keep a reference so the task is not garbage-collected
            keepalive = asyncio.create_task(keep_listen_key_alive(session))
            try:
                while True:
                    try:
                        ws_url = f"{WS_BASE}/{current['listen_key']}"
                        async with websockets.connect(ws_url, ping_interval=20) as ws:
                            async for msg in ws:
                                event = json.loads(msg)
                                if event.get("e") == "executionReport":
                                    order_id = event["i"]
                                    status = event["X"]
                                    print(f"Order {order_id} → {status}")
                    except Exception as e:
                        print(f"Private stream dropped: {e}")
                    await asyncio.sleep(3)
                    # Re-fetch listenKey before reconnecting: the old one may have expired
                    try:
                        current["listen_key"] = await get_listen_key(session)
                    except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                        print(f"listenKey refresh failed, retrying with old key: {e}")
            finally:
                keepalive.cancel()


    if __name__ == "__main__":
        asyncio.run(private_stream())

On Bybit and OKX, private WebSocket auth uses HMAC-signed timestamps, not listenKeys. Re-auth is a signed login message that your bot sends immediately after every (re)connect, before subscribing to private channels, which is actually simpler because there is no server-side key to keep alive. Binance's listenKey system is the most error-prone because the key expires server-side and your bot gets no notification.
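For illustration, the signed login messages might be built as below. This is a sketch based on my reading of the Bybit v5 and OKX v5 WebSocket docs, not code from either exchange: the function names and the `ttl_ms` parameter are hypothetical, and the signing strings (`GET/realtime` plus an expiry for Bybit, timestamp plus `GET/users/self/verify` for OKX) should be checked against the current documentation before use.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def bybit_auth_message(api_key: str, api_secret: str,
                       now: Optional[float] = None, ttl_ms: int = 10_000) -> str:
    """Build the auth op to send on a fresh Bybit private connection."""
    now = time.time() if now is None else now
    expires = int(now * 1000) + ttl_ms  # millisecond timestamp in the near future
    signature = hmac.new(
        api_secret.encode(), f"GET/realtime{expires}".encode(), hashlib.sha256
    ).hexdigest()
    return json.dumps({"op": "auth", "args": [api_key, expires, signature]})


def okx_login_message(api_key: str, api_secret: str, passphrase: str,
                      now: Optional[float] = None) -> str:
    """Build the login op to send on a fresh OKX private connection."""
    now = time.time() if now is None else now
    timestamp = str(int(now))  # Unix time in seconds, as a string
    prehash = timestamp + "GET" + "/users/self/verify"
    digest = hmac.new(api_secret.encode(), prehash.encode(), hashlib.sha256).digest()
    return json.dumps({"op": "login", "args": [{
        "apiKey": api_key,
        "passphrase": passphrase,
        "timestamp": timestamp,
        "sign": base64.b64encode(digest).decode(),
    }]})
```

Because the signature embeds a timestamp, build a fresh message inside the reconnect loop rather than caching one at startup.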
Failover code handles known failure modes. Unknown failures — bugs in your reconnect logic, memory leaks, asyncio task exceptions being swallowed silently — require external monitoring. The minimum viable monitoring setup for a 24/7 trading bot is: a health endpoint that returns 200 only when the WebSocket is connected, and an external watchdog that hits that endpoint every 60 seconds. If the health check fails three times in a row, you get a Telegram alert.
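The "three failures in a row" rule is easy to get subtly wrong, for example by alerting on every failed probe after the third. A minimal sketch of the decision logic follows, with hypothetical names (`health_status`, `Watchdog`) and an assumed 30-second silence threshold. It deliberately leaves out the HTTP server and the Telegram call so the logic can be tested on its own.

```python
from dataclasses import dataclass


def health_status(connected: bool, seconds_since_last_message: float,
                  max_silence: float = 30.0) -> int:
    """HTTP status the bot's health endpoint would return."""
    ok = connected and seconds_since_last_message <= max_silence
    return 200 if ok else 503


@dataclass
class Watchdog:
    failures_to_alert: int = 3
    consecutive_failures: int = 0
    alerted: bool = False

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True only when an alert should fire."""
        if healthy:
            # Recovery re-arms the alert for the next incident.
            self.consecutive_failures = 0
            self.alerted = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_to_alert and not self.alerted:
            self.alerted = True  # fire once per incident, not on every probe
            return True
        return False
```

The external watchdog would call `record(status == 200)` after each 60-second probe and send the alert whenever it returns `True`.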
A dead man's switch complements this: your bot sends a heartbeat to an external URL every minute. If the heartbeat stops for 5 minutes, the switch fires an alert. This catches scenarios where your process crashes entirely and the health endpoint is unreachable. Services like Cronitor or a simple Telegram bot webhook work well for this. Traders running bots on Binance Futures or Bybit perpetuals where positions stay open overnight especially need this — an unmonitored bot with an open leveraged position is a liability.
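Here is a rough sketch of both halves of a dead man's switch using only the standard library. `send_heartbeat` is what the bot would call once a minute against whatever check-in URL your monitoring service gives you, and `DeadMansSwitch` models the receiving side's timeout logic. Both names are hypothetical, and hosted services implement the receiving side for you.

```python
import time
import urllib.request


def send_heartbeat(url: str, timeout: float = 5.0) -> bool:
    """Bot side: hit the check-in URL; never let a failure crash the bot."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


class DeadMansSwitch:
    """Monitor side: fires when no heartbeat arrives within the grace period."""

    def __init__(self, grace_seconds: float = 300.0, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock
        self.last_beat = clock()

    def beat(self) -> None:
        self.last_beat = self.clock()

    def fired(self) -> bool:
        return self.clock() - self.last_beat > self.grace_seconds
```

Injecting the clock keeps the timeout logic testable without waiting five real minutes.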
| Feature | Why It Matters | Complexity |
|---|---|---|
| Exponential backoff with jitter | Prevents thundering herd on shared outages | Low |
| Silent drop detection (message timeout) | Catches connections that appear open but are dead | Low |
| listenKey rotation (Binance) | Prevents private stream expiry after 60 min | Medium |
| Multi-source price feeds | Survives single-exchange outages | Medium |
| External health check + alerting | Catches bugs in your failover code itself | Medium |
| Dead man's switch heartbeat | Catches total process crashes | Low |
Every exchange will drop your connection eventually. Binance will hit its 24-hour limit. Bybit will lag during a CPI print. OKX will blip during one of its rolling infrastructure upgrades. The bots that keep running through all of it are the ones built assuming failure is normal, not exceptional. The code patterns here (exponential backoff, health-check tasks, listenKey rotation, multi-source feeds) are not over-engineering. They are the baseline for anything running unattended with real money. Start with the reconnect loop, add silent drop detection, and wire up an external health alert before your first live deployment. The strategy can be mediocre and still make money. A bot that goes blind at 3am and misses a liquidation cascade will erase weeks of edge in one night.