{"uuid": "d527ba0f-0b83-4cda-b2f5-c04d7a1367a1", "vulnerability_lookup_origin": "1a89b78e-f703-45f3-bb86-59eb712668bd", "author": "9f56dd64-161d-43a6-b9c3-555944290a09", "vulnerability": "GHSA-rj6c-83wx-jxf2", "type": "seen", "source": "https://gist.github.com/zmanian/08280e428b3f9b90551a2fa74a4b1a40", "content": "# Getting zebrad off a wedged initial sync \u2014 a 2026-05-29 field report\n\nThis is a write-up of debugging a zebrad mainnet initial sync that wedged\nrepeatedly at ~42% on otherwise perfectly-spec'd hardware. The summary up\nfront: it's a real upstream bug ([ZcashFoundation/zebra#5709][issue]) and the\nZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT knob (NOT the two knobs the\nname implies you'd reach for) is the fix that actually moves the needle.\n\nIf you're hitting the same symptom, scroll to \"What actually worked.\" If\nyou want the diagnostic walk, read on.\n\n## Symptom\n\nzebrad on a clean, well-provisioned host stalls during initial sync. The\nlog pattern is identical across restarts:\n\n1. Process starts, connects to ~30 peers, kicks off `extending tips` with\n   `in_flight` climbing toward `lookahead_limit=1000`.\n2. Within 30-90 seconds, `in_flight` saturates at 999.\n3. Sync goes **completely silent** \u2014 no warnings, no retries, no log lines\n   from the syncer task. The `estimated progress` task continues emitting\n   once per minute showing `time_since_last_state_block` climbing.\n4. After ~8 minutes zebrad's internal verifier timeout fires and the cycle\n   restarts. ~1.6K-8K new blocks land in the burst. Stall again. Repeat.\n\nSteady-state effective rate: ~14K blocks/hr. Burst rate during the active\n~30-90s window: 70K-150K blocks/hr.\n\nIt looks for all the world like a peer or resource problem. It is neither.\n\n## What it is NOT\n\nWe ruled all of these out empirically. Save yourself the time:\n\n- **CPU bound.** Container CPU sits at 0.01-0.07% across 24 cores during\n  the stall. 
We resized c3-standard-8 \u2192 c4-standard-16 \u2192 c4-standard-24.\n  No change in stall behavior.\n- **Disk bound.** `iostat -x 5` shows the disk at <1% utilization with\n  queue depth ~0. We migrated pd-ssd \u2192 hyperdisk-balanced (250GB, 15K\n  provisioned IOPS, 600 MB/s throughput). The 15K IOPS sat completely\n  unused. (Note for cost-conscious operators: don't burn money on\n  hyperdisk-extreme for this workload until you've ruled out the actual\n  bug \u2014 IOPS is not the bottleneck.)\n- **Peer count.** `ss -tn` inside the container's netns shows 30 ESTAB\n  connections to peers on :8233. `Recv-Q`/`Send-Q` are all zero \u2014 peers\n  are alive but no data is flowing. Bandwidth drops to keepalive-only:\n  ~700 B/s in, ~2 KB/s out.\n- **Stale peer cache.** Wiping `/var/lib/zebrad-cache/network/mainnet.peers`\n  produces another burst, then re-stalls in the same pattern.\n- **The two knobs whose names suggest they'd help.** Lowering both\n  `ZEBRA_SYNC__DOWNLOAD_CONCURRENCY_LIMIT` (50 \u2192 25) and\n  `ZEBRA_NETWORK__PEERSET_INITIAL_TARGET_SIZE` (50 \u2192 25) had no effect.\n  `in_flight` still saturated at 996-999. These knobs do not bound `in_flight`.\n- **Zebrad version.** We went 4.4.1 \u2192 4.5.1 hoping the 4.5.0 security fix\n  for \"peer inventory registry poisoning on sync restart\"\n  ([GHSA-rj6c-83wx-jxf2]) would address it. Same stall pattern on 4.5.1.\n\n## What's actually happening\n\nPer [issue #5709][issue]: zebrad downloads blocks out of height order from\npeers, but the checkpoint verifier needs them strictly contiguous. When\none block in the lowest checkpoint range is late or missing, every\nalready-downloaded block above it parks in the verifier holding a queue\nslot. `in_flight` pins near the configured ceiling. 
The syncer stops\nrequesting anything new and goes idle until the verifier's 8-minute\ninternal timeout fires.\n\nThe ceiling on `in_flight` is **`checkpoint_verify_concurrency_limit`**\n(default 1000), not the two knobs whose names sound relevant. The\nverifier is what holds the slots.\n\n## What actually worked\n\nThree changes, in order of impact:\n\n### 1. Lower `checkpoint_verify_concurrency_limit` to its minimum (400)\n\n```\nZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT=400\n```\n\nThis caps the blast radius of each stall to ~1 checkpoint range instead\nof ~2.5. After applying, `in_flight` saturates at 399 instead of 999.\nWe immediately saw sync rates of 12-18K blocks/minute (720K-1M blocks/hr\ninstantaneous), with `time_since_last_state_block=0s` continuously and\nCPU jumping to ~90%. The runbook says you can also test 500 and compare.\n\n### 2. Disable TCP slow-start-after-idle\n\nThe kernel's `net.ipv4.tcp_slow_start_after_idle=1` default resets the\ncongestion window after every idle interval. zebrad fetches one block\nper peer with idle gaps; every fetch starts cold on long-haul links.\n\nOn the host:\n```\necho 'net.ipv4.tcp_slow_start_after_idle=0' > /etc/sysctl.d/99-zebra.conf\nsysctl --system\n```\n\nOr inside the container's netns via Docker:\n```\n--sysctl net.ipv4.tcp_slow_start_after_idle=0\n```\n\n**Gotcha:** `sysctl --system` reapplies all `/etc/sysctl.d/*.conf` files,\nincluding any system defaults that set `net.ipv4.ip_forward=0`. Docker\nsets `ip_forward=1` at daemon start to enable container outbound traffic.\nIf `sysctl --system` reverts it, every container loses external\nconnectivity and DNS stops resolving. We hit this exact regression mid-\ndebug. Add `net.ipv4.ip_forward=1` to your config file alongside the\nslow-start setting so the system file precedence keeps it in place.\n\n### 3. 
Set `external_addr` to your actual public IP\n\n```\nZEBRA_NETWORK__EXTERNAL_ADDR=:8233\n```\n\nWithout this, zebrad advertises `[::]:8233`, which other nodes drop from\ntheir peer pools. You see fewer responsive peers and more stragglers.\n\n### 4. (Safety net) Auto-bounce watchdog\n\nA small Python systemd service that polls zebrad's `estimated progress`\nlog line every 30s and runs `systemctl restart zebrad-docker` whenever\n`time_since_last_state_block` exceeds 90s. With items 1-3 in place this\nshould fire rarely; without them it'll keep sync inching forward at a\nrespectable ~30-90K blocks/hr average even while the underlying bug\npersists. Source at the end of this gist.\n\n### Things you should NOT also do\n\n- Bump `download_concurrency_limit` or `peerset_initial_target_size` (wrong\n  layer; we tested both directions, no effect).\n- Upgrade 4.4.1 \u2192 4.5.1 expecting a sync fix (the sync/verify code is\n  byte-identical between them \u2014 upgrade anyway for the security fixes,\n  but not for this bug).\n- Buy bigger instances or faster disks. The stall is not resource-bound.\n  We went all the way to c4-standard-24 with 15K IOPS hyperdisk-balanced\n  and saw zero impact until we set `checkpoint_verify_concurrency_limit=400`.\n- Touch `max_connections_per_ip` unless you have confirmed evidence your\n  peers are sharing IPs.\n\n## Setting `ZEBRA_NETWORK__INITIAL_MAINNET_PEERS` via env var\n\nIt doesn't work. The figment env-var deserializer in zebrad 4.4.x / 4.5.x\nrejects both CSV and JSON-array values:\n```\ninvalid type: string ..., expected a set for key network.initial_mainnet_peers\n```\n\nThis is [zebra#10658][peers]. To pin known-good peers you have to mount a\nTOML config file into the container. We deferred this; the other changes\nwere enough to get healthy sync.\n\n## What we wish zebrad would log\n\nThe single most expensive thing about this debug was that **the syncer\ntask emits nothing during the stall**. 
No WARN, no peer-eviction event,\nno \"X requests in flight for Y seconds.\" If `in_flight` is saturated and\nstate hasn't advanced for >N seconds, zebrad should log at WARN level\nwith the peer ID and block hash of the oldest in-flight request. That\nwould have collapsed the diagnostic time from hours to minutes.\n\nA documented note that `lookahead_limit` (gated by\n`checkpoint_verify_concurrency_limit`) is the actual ceiling on\n`in_flight` \u2014 not `download_concurrency_limit` \u2014 would also have saved\nconsiderable time.\n\n## Final config (us-central1-a, c4-standard-24)\n\n`/etc/bedrock/zebra.env`:\n```\nZEBRA_DOCKER_IMAGE=zfnd/zebra:4.5.1\nZEBRA_SYNC__FULL_VERIFY_CONCURRENCY_LIMIT=40\nZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT=400\nZEBRA_NETWORK__EXTERNAL_ADDR=:8233\n```\n\n`/etc/sysctl.d/99-zebra.conf`:\n```\nnet.ipv4.tcp_slow_start_after_idle=0\nnet.ipv4.ip_forward=1\n```\n\nsystemd unit excerpt (the env-var pass-through is required because\n`docker run` only inherits vars you explicitly `-e`):\n```\nExecStart=/usr/bin/docker run --name zebra --rm \\\n  -p 8233:8233 \\\n  -p 8232:8232 \\\n  -v /var/lib/zebrad-cache:/home/zebra/.cache/zebra \\\n  --dns=8.8.8.8 --dns=1.1.1.1 \\\n  -e ZEBRA_NETWORK__NETWORK=Mainnet \\\n  -e ZEBRA_RPC__LISTEN_ADDR=0.0.0.0:8232 \\\n  -e ZEBRA_RPC__ENABLE_COOKIE_AUTH=false \\\n  -e ZEBRA_SYNC__FULL_VERIFY_CONCURRENCY_LIMIT \\\n  -e ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT \\\n  -e ZEBRA_NETWORK__EXTERNAL_ADDR \\\n  ${ZEBRA_DOCKER_IMAGE}\n```\n\n## Watchdog script\n\n```python\n#!/usr/bin/env python3\n\"\"\"Zebrad sync stall watchdog. 
Polls progress logs; restarts on stall.\"\"\"\nimport logging, re, signal, subprocess, sys, time\n\nSTALL_THRESHOLD_SEC = 90\nPOLL_INTERVAL_SEC = 30\nSETTLE_SEC = 75\nSERVICE = \"zebrad-docker\"\nDURATION_RE = re.compile(r\"time_since_last_state_block=(?:(\\d+)m)?(?:\\s*(\\d+)s)?\")\n\nlogging.basicConfig(level=logging.INFO, format=\"%(asctime)s %(levelname)s %(message)s\")\n\ndef parse_duration_sec(line):\n    m = DURATION_RE.search(line)\n    if not m: return None\n    return int(m.group(1) or 0) * 60 + int(m.group(2) or 0)\n\ndef latest_progress_line():\n    try:\n        out = subprocess.check_output(\n            [\"docker\", \"logs\", \"zebra\", \"--since\", \"5m\", \"--tail\", \"20\"],\n            stderr=subprocess.STDOUT, timeout=15\n        ).decode(errors=\"replace\")\n    except subprocess.SubprocessError as e:\n        logging.warning(\"docker logs failed: %s\", e)\n        return None\n    for line in reversed(out.splitlines()):\n        if \"estimated progress\" in line:\n            return line\n    return None\n\ndef restart_zebrad():\n    logging.warning(\"Restarting %s\", SERVICE)\n    subprocess.run([\"systemctl\", \"restart\", SERVICE], check=False, timeout=180)\n\ndef handle_term(*_): sys.exit(0)\n\ndef main():\n    signal.signal(signal.SIGTERM, handle_term)\n    signal.signal(signal.SIGINT, handle_term)\n    logging.info(\"watchdog started threshold=%ds\", STALL_THRESHOLD_SEC)\n    while True:\n        line = latest_progress_line()\n        if line is None:\n            time.sleep(POLL_INTERVAL_SEC); continue\n        secs = parse_duration_sec(line)\n        if secs is None:\n            time.sleep(POLL_INTERVAL_SEC); continue\n        if secs >= STALL_THRESHOLD_SEC:\n            logging.warning(\"stall age=%ds, restarting\", secs)\n            restart_zebrad()\n            time.sleep(SETTLE_SEC); continue\n        logging.info(\"ok: stall age=%ds\", secs)\n        time.sleep(POLL_INTERVAL_SEC)\n\nif __name__ == \"__main__\":\n    
main()\n```\n\nsystemd unit:\n```ini\n[Unit]\nDescription=Zebrad sync stall watchdog\nAfter=zebrad-docker.service\nWants=zebrad-docker.service\n\n[Service]\nType=simple\nExecStart=/usr/bin/python3 /usr/local/bin/zebra-watchdog.py\nRestart=always\nRestartSec=10\nStandardOutput=journal\nStandardError=journal\n\n[Install]\nWantedBy=multi-user.target\n```\n\n## Acknowledgments\n\nThe diagnostic call-out to issue #5709 and the\n`checkpoint_verify_concurrency_limit=400` recommendation came from a\nsecond agent who'd already worked the problem. Without that pointer this\nwould have taken many more hours and possibly a fresh-sync from genesis\nto get past.\n\n[issue]: https://github.com/ZcashFoundation/zebra/issues/5709\n[peers]: https://github.com/ZcashFoundation/zebra/issues/10658\n[GHSA-rj6c-83wx-jxf2]: https://github.com/ZcashFoundation/zebra/security/advisories/GHSA-rj6c-83wx-jxf2\n", "creation_timestamp": "2026-05-30T04:43:13.000000Z"}