Runbooks | BeeDifferent

Runbooks

Step-by-step procedures for things that break or need to happen on a schedule

When something breaks or needs to happen, the answer is usually in one of these. Each runbook has: symptom → diagnosis steps → fix → verification → follow-up notes. Copy-pasteable commands where useful.

If you can’t find what you need here, search the hub (top-right search box, or ⌘K) — relevant info is probably scattered across the homelab/, automation/, or claude/ sections.

When X breaks, look at…

Symptom	Runbook
Doc-sync report is 300 bytes of “401 / socket closed”	Fix doc-sync auth failure
4 AM briefing email didn’t arrive / `morning-briefing.md` is stale	Daily briefing didn’t arrive
ha-mcp / farm services unreachable	Farm network is down
Recordings from JPR not turning into diary entries or tasks	Dictation not processing
Sonarr/Radarr/Lidarr key got exposed	Rotate *arr API keys
MCP changes don’t show up in Claude	Reload Claude Desktop / MCPs
Expected a cron-driven thing to run, didn’t see it	Cron job didn’t fire
`deploy-vps.sh` aborted; site is on a stale tree	Recover from a broken deploy
Pasted a secret somewhere I shouldn’t have	Credential leaked in chat
`~/Sync/ED/` not converging across machines	Syncthing not converging
Seedbox not draining / Kuma “Cron: Seedbox Sync” down / `/mnt/seedbox` looks empty	Seedbox sync stalled

Adding a runbook

Create content/runbooks/<short-slug>/_index.md with frontmatter like:

---
title: "Short imperative title"
page_links:
  - { label: "Related service", url: "/homelab/whatever/" }
---

Then write four sections — Symptom, Diagnose, Fix, Verify — each wrapped in the theme’s section shortcode (open with section id="..." title="...", close with /section). Add the new runbook to the sidebar_sections list above and the symptom table.

The runbook section is also picked up by Pagefind search, so even if you forget where you filed it, the search will find it.

Credential leaked in chat (or anywhere it shouldn't be)

Treat the credential as compromised

Anything pasted into a Claude conversation, a chat thread, a screenshot, a public/semi-public doc, or a git repo — even briefly — should be treated as leaked. Editing the doc to remove the value does NOT unleak it. The only fix is rotation.

1. Triage (within 5 min)

Decide blast radius:

Risk	Examples	Rotate within
Critical — public-facing, full account access	Anthropic API key, Cloudflare API token, root SSH keys, Gmail OAuth	Immediately
High — service that can reach money / customer data / others’ systems	Stripe key, Gotify token (if used for alerts that gate decisions), database master pwd	Today
Medium — service that’s mostly homelab-internal	Sonarr/Radarr/Lidarr API keys, internal app passwords, indexer credentials	This week (logged in TASKS.md)
Low — read-only or already-public-equivalent	Public RSS feeds, public bookmarks	Note but don’t rush

2. Record it

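Add an entry to the Security/Credentials section under Active in ~/Sync/ED/TASKS.md with enough of the value (a truncated prefix, as in the example below) that future-you knows what to rotate. Do not paste the full secret: TASKS.md is Syncthing-replicated, so that would leak it again. Include the consumer map (where the credential is used).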

Example (this happened on 2026-05-25 with the *arr keys):

- [ ] **Rotate Sonarr/Radarr/Lidarr API keys** — they were hardcoded in
  ~/Sync/ED/skills/arr-media-management/SKILL.md (Syncthing-replicated).
  Old values: d792444549..., b117993eb50..., 3dc17d20ca664...
  Consumers: Prowlarr (settings → apps), Recyclarr (yaml), homelab-config (none — extracts at runtime via arr-briefing-data.py)

3. Rotate

Service-specific procedures live in dedicated runbooks where they’re complicated:

Rotate *arr API keys
Anthropic API key (the “Fix” section of doc-sync auth-fail is also a rotation procedure)

For simple cases: log in to the service, generate a new credential, save the new value to ~/Sync/ED/SECRETS.md, update consumers, test.

4. Restart anything holding the old credential

For containerized services: docker restart <name> after env var update. For Mac launchd: kickstart the job. For Claude Desktop: full quit + relaunch.

If the leak was into a git repo:

# Find every commit containing the value
git -C ~/Sync/ED log -p --all -S 'leaked-value-here' | head -30

# Remove from history (heavy — use only when necessary, force-pushes break clones)
# Prefer rotating + accepting that the historical value is exposed but inert.

For the homelab-config repo (private but synced), rotating the underlying credential is usually enough — historical exposure of a now-invalid key is not a real risk.

5. Clean up

Once consumers are updated and the new credential works:

Remove the rotation entry from TASKS.md
Update SECRETS.md with the new value + last-rotated date
If the leak was a class of mistake (hardcoded in a SKILL, committed in a config), add a defense:
- Pre-commit hook to scan for sk-, <ApiKey>, etc.
- Bundle behavioral rule against the pattern
- Linter for the file type
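A pre-commit scan of the kind listed above could start from something like the following. This is a minimal sketch, not an existing hook: the function name and both patterns are illustrative assumptions (an Anthropic-style `sk-ant-` prefix and a populated `<ApiKey>` element), and real keys may need broader patterns.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive rule set):
# an Anthropic-style key prefix and a populated *arr <ApiKey> element.
SECRET_PATTERNS = {
    "anthropic-key": re.compile(r"sk-ant-[A-Za-z0-9_-]{8,}"),
    "arr-api-key": re.compile(r"<ApiKey>[0-9a-fA-F]{16,}</ApiKey>"),
}


def find_secrets(text):
    """Return (line_number, pattern_name) for each line that looks like it holds a secret."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

A hook would run this over each staged file and exit non-zero when the list is non-empty. A placeholder such as `<ApiKey>...</ApiKey>` in documentation is deliberately not flagged.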

Cron job didn't fire

Diagnose by surface

The homelab has five scheduling surfaces. The “didn’t fire” question depends on which one.

Surface	Where it lives	How to check
Mac launchd	`~/Library/LaunchAgents/*.plist`	`launchctl list | grep -i <name>` shows last exit code; `tail -50 ~/Library/Logs/<job>.log`
Mac user cron	`crontab -l` on Mac Studio	`tail -50 /tmp/cron-*.log` if the job writes there; otherwise add `MAILTO=""` and re-run
CT100 cron	`pct exec 100 -- crontab -l`	Errors route to Gotify via `cron-gotify-wrapper.sh` (priority 5 → Telegram). Check the Telegram bot.
hpve cron	`crontab -l` on pve as root	Same wrapper as CT100 — errors → Gotify
Cowork scheduled tasks	Cowork Settings → Scheduled Tasks	`lastRunAt` timestamp on each task; audit log in `~/Library/Application Support/Claude/local-agent-mode-sessions/...`

launchd didn't fire

# Show PID (column 1; - means not running) and last exit status (column 2)
launchctl list | grep -i com.bee

# Manually kickstart (run now)
launchctl kickstart -k gui/$(id -u)/com.bee.<job-name>

# If kickstart errors with "Could not find specified service", the plist isn't loaded:
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.bee.<job-name>.plist

# If the plist has a typo, log will say so
log show --predicate 'subsystem == "com.apple.xpc.launchd"' --info --last 1h | grep com.bee.<job-name>

Common launchd gotcha: the plist points at a script that doesn’t exist (e.g., com.bee.rebuild-mcp-venvs.plist references ~/scripts/rebuild-mcp-venvs.sh — make sure that file exists and is executable). The launchd job loads fine but every run silently fails.
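The gotcha above can be checked mechanically. The sketch below reads a plist with the standard-library `plistlib` and reports absolute paths in `Program` / `ProgramArguments` that do not exist, plus a first program that is not executable. The function name is hypothetical. Relative arguments and flags are skipped, so treat the result as a hint rather than proof the job is healthy.

```python
import os
import plistlib


def unusable_paths(plist_path):
    """Return (path, reason) for absolute paths in a launchd plist that cannot be run or found."""
    with open(plist_path, "rb") as fh:
        job = plistlib.load(fh)
    args = ([job["Program"]] if "Program" in job else []) + list(job.get("ProgramArguments", []))
    problems = []
    for index, arg in enumerate(args):
        path = str(arg)
        if not os.path.isabs(path):
            continue  # flags and bare words are not checked
        if not os.path.exists(path):
            problems.append((path, "missing"))
        elif index == 0 and not os.access(path, os.X_OK):
            problems.append((path, "not executable"))
    return problems
```

Example use: `unusable_paths(os.path.expanduser("~/Library/LaunchAgents/com.bee.rebuild-mcp-venvs.plist"))` returns an empty list when every referenced path is present.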

cron didn't fire

# On CT100 or hpve — list crontab
crontab -l

# The cron-gotify-wrapper.sh captures stderr to Gotify. If you don't see an
# error, the job ran cleanly OR the wrapper isn't installed.
# Verify wrapping:
crontab -l | head -10  # should see /usr/local/bin/cron-gotify-wrapper.sh <cmd>

# Validate the cron expression
# mcp__cron-validator__cron_validate("0 4 * * *")

Cowork scheduled task didn't fire

# Find today's session for the task
python3 << 'PY'
import datetime, glob, json, os

base = os.path.expanduser('~/Library/Application Support/Claude/local-agent-mode-sessions')
today = datetime.date.today()
task = 'daily-briefing'  # change as needed

def ts(ms):
    # Session timestamps are epoch milliseconds
    return datetime.datetime.fromtimestamp(ms / 1000)

for f in glob.glob(f'{base}/*/*/local_*.json'):
    try:
        with open(f) as fh:
            d = json.load(fh)
    except (OSError, ValueError):
        continue  # unreadable or malformed session file
    if d.get('scheduledTaskId') != task or 'createdAt' not in d:
        continue
    if ts(d['createdAt']).date() != today:
        continue
    # Print session start -> last activity, then the session file path
    print(ts(d['createdAt']), '->', ts(d.get('lastActivityAt', d['createdAt'])), f)
PY

If no session for today appeared, the Cowork scheduler didn’t fire — check Cowork Settings → Scheduled Tasks. If a session DID appear but the expected output is missing, read its audit.jsonl to see where it got stuck.

Fix

Most often the fix is one of:

Re-bootstrap the launchd plist after editing
Re-run manually to confirm the job works, then wait for next scheduled run
Update the Cowork task definition (Settings → Scheduled Tasks → Edit) if the cron expression is wrong

For broken behavior despite the job firing, see the specific runbooks (doc-sync auth fail, daily briefing not arriving, dictation not processing).

Daily briefing didn't arrive

Symptom

It’s after 4:30 AM and you don’t see:

Email from beedifferent5455@gmail.com with subject “Daily Briefing — …”
A current entry in ~/Sync/ED/morning-briefing.md (mtime should be today)
A Drafts note tagged “briefing”

Diagnose

# Did the scheduled task actually fire?
# (Check Cowork scheduled-tasks list; lastRunAt should be today ~04:01)

# Did the briefing write a file?
ls -la ~/Sync/ED/morning-briefing.md
# stale mtime = the task never ran, or failed before the Write step

# Pull the audit log for today's 04:01 session
python3 << 'PY'
import datetime, glob, json, os

base = os.path.expanduser('~/Library/Application Support/Claude/local-agent-mode-sessions')
today = datetime.date.today()
for f in glob.glob(f'{base}/*/*/local_*.json'):
    try:
        with open(f) as fh:
            d = json.load(fh)
    except (OSError, ValueError):
        continue  # unreadable or malformed session file
    if d.get('scheduledTaskId') != 'daily-briefing' or 'createdAt' not in d:
        continue
    # createdAt is epoch milliseconds
    if datetime.datetime.fromtimestamp(d['createdAt'] / 1000).date() != today:
        continue
    sess_dir = f[:-len('.json')]  # session directory sits next to the .json file
    audit = os.path.join(sess_dir, 'audit.jsonl')
    print('Session:', sess_dir)
    if os.path.exists(audit):
        print('  audit.jsonl:', os.path.getsize(audit), 'bytes')
    else:
        print('  audit.jsonl: missing')
PY

Read the last 20 events of that session’s audit.jsonl to see where it got stuck.
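One way to pull those last events is sketched below. It assumes only that audit.jsonl is a JSON Lines file (one JSON document per line). The event fields are not documented here, so the helper returns the parsed records as-is; its name is hypothetical.

```python
import json
from collections import deque


def last_events(audit_path, count=20):
    """Return the last `count` records from a JSON Lines file, oldest first."""
    tail = deque(maxlen=count)
    with open(audit_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                tail.append(json.loads(line))
            except json.JSONDecodeError:
                # Keep unparseable lines visible instead of dropping them silently
                tail.append({"unparsed": line})
    return list(tail)
```

Printing each record with `json.dumps(event)[:200]` keeps the output scannable when events carry large payloads.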

Common failure modes:

Failure	Tell
Shell-quoting hell on *arr extraction	Many retries with `osascript → ssh → pct → docker → curl` strings, never reaches the Write step. This should NOT happen now: Section 1 uses `~/scripts/arr-briefing-data.py`. If you see this, the SKILL.md was reverted.
MCP unavailable	`homelab-snapshot` MCP errored / not loaded. Fallback is to Read `~/Sync/ED/.homelab-snapshot.json` directly.
Rate limit / API error	`rate_limit_event` records in audit, or HTTP errors. Wait or check anthropic.com status.
Session timed out / token exhausted	Session ran 7+ minutes and stopped mid-thought before reaching the Write step.

Fix

If today’s run failed:

Force a manual rerun — Cowork sidebar → Scheduled Tasks → daily-briefing → Run now.
While it runs, watch for the same failure pattern. If arr-briefing-data.py is the culprit, test it standalone: python3 ~/scripts/arr-briefing-data.py --hours 24.
If ~/Sync/ED/.homelab-snapshot.json is missing/stale, the launchd job (com.bee.homelab-snapshot) didn’t run — kick it: launchctl kickstart -k gui/$(id -u)/com.bee.homelab-snapshot.

Verify

After the rerun:

ls -la ~/Sync/ED/morning-briefing.md     # mtime = today
ls -la ~/Sync/ED/todays-briefing.md       # mtime = today
head -20 ~/Sync/ED/morning-briefing.md
# inbox: subject "Daily Briefing — <today>"

The SKILL is hardened to write explicit “No dictation in the last 24 hours” and “No diary entry for this date” lines when those sources are empty — so even a no-content day produces a useful email, not silence.

Dictation not processing

Symptom

You recorded a JPR voice memo on Apple Watch / iPhone but:

It hasn’t appeared in ~/Sync/ED/dictation/processed/
~/Sync/ED/daily-diary.md is stale
No tasks from dash commands made it to TASKS.md

The hourly process-dictation Cowork task runs at :04 past every hour — recordings should appear within ~70 minutes of capture.

Diagnose

1. Did iCloud sync the file from your watch/phone to Mac?

ls -la ~/Library/Mobile\ Documents/iCloud~com~openplanetsoftware~just-press-record/Documents/$(date +%Y-%m-%d)/

If the day folder is empty or the file is .something.icloud (placeholder), iCloud hasn’t downloaded it. Force the download:

brctl download ~/Library/Mobile\ Documents/iCloud~com~openplanetsoftware~just-press-record/Documents/$(date +%Y-%m-%d)/

The pipeline runs brctl download automatically before scanning, but only on day directories it can already see — a never-synced day won’t be scanned.
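To spot placeholders without reading the listing by eye, a small helper along these lines could be used. It assumes the `.something.icloud` naming described above; the function name is hypothetical.

```python
import os


def icloud_placeholders(day_dir):
    """List entries in day_dir that look like not-yet-downloaded iCloud placeholders."""
    return sorted(
        name for name in os.listdir(day_dir)
        if name.startswith(".") and name.endswith(".icloud")
    )
```

A non-empty result means `brctl download` still has work to do for that day folder.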

2. Did the pipeline script run?

tail -20 /tmp/dictation-run.log

If you see “No new dictation files found” right after your recording was made, JPR-on-watch didn’t sync in time. Give iCloud 5-15 minutes to deliver the file; the next hourly run (:04 past the hour) should then catch it.

3. Was the recording skipped?

The script skips files < 50 KB (accidental taps). Confirm size:

ls -la ~/Library/Mobile\ Documents/iCloud~com~openplanetsoftware~just-press-record/Documents/$(date +%Y-%m-%d)/*.m4a

4. Was there an anomaly?

Recordings > 2h with < 5 min of detected speech trigger a Gotify alert. Check ~/Sync/ED/dictation/processed/ for an ## ⚠ ANOMALY entry — that means transcription ran but the audio was flagged as a left-running mic.
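The skip and anomaly rules from steps 3 and 4 can be written down as a small decision function. This is a sketch of the rules as described here, not the pipeline's actual code: the names are hypothetical, and "50 KB" is assumed to mean 50 × 1024 bytes.

```python
SKIP_BELOW_BYTES = 50 * 1024          # assumption: "50 KB" means 50 KiB
ANOMALY_MIN_DURATION_S = 2 * 60 * 60  # recordings longer than 2 h ...
ANOMALY_MAX_SPEECH_S = 5 * 60         # ... with under 5 min of detected speech


def classify_recording(size_bytes, duration_s, speech_s):
    """Return 'skip', 'anomaly', or 'process' for one recording."""
    if size_bytes < SKIP_BELOW_BYTES:
        return "skip"      # accidental tap
    if duration_s > ANOMALY_MIN_DURATION_S and speech_s < ANOMALY_MAX_SPEECH_S:
        return "anomaly"   # likely a left-running mic
    return "process"
```

For example, a three-hour recording with one minute of speech classifies as `anomaly`, while a ten-minute memo classifies as `process`.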

Fix

Force a manual run:

/bin/zsh ~/Sync/ED/dictation/process-dictation.sh

This handles iCloud download, transcription (Whisper large-v3 via mlx-whisper), chunking long recordings, and parsing dash commands. Watch the output for errors.

If transcription itself failed, check the venv:

ls -la ~/venvs/whisper-asr/bin/python3
~/venvs/whisper-asr/bin/python3 -c "import mlx_whisper; print(mlx_whisper.__version__)"

~/scripts/rebuild-mcp-venvs.sh doesn’t cover whisper-asr, so if the venv is broken, rebuild it manually:

python3.12 -m venv ~/venvs/whisper-asr
source ~/venvs/whisper-asr/bin/activate
pip install mlx-whisper

Verify

After the manual run:

# A processed report should exist with today's date prefix
ls -lat ~/Sync/ED/dictation/processed/ | head -3

# daily-diary.md should reflect today's content if there was any narrative
ls -la ~/Sync/ED/daily-diary.md
head -20 ~/Sync/ED/daily-diary.md

Date rule: all artifacts use the recording’s start time from the JPR folder name (Documents/YYYY-MM-DD/HH-MM-SS.m4a), never date +%Y-%m-%d or file mtime. A recording started at 11:50 PM May 21 finishing at 12:30 AM May 22 belongs to May 21 even when picked up by the 01:04 AM May 22 cron.
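The date rule amounts to parsing the start time out of the path and ignoring the clock. A minimal sketch, assuming the `Documents/YYYY-MM-DD/HH-MM-SS.m4a` layout shown above (the function name is hypothetical, and the year in the example is illustrative):

```python
import re
from datetime import datetime

# Matches the trailing "YYYY-MM-DD/HH-MM-SS.m4a" part of a JPR path
_JPR_PATH = re.compile(r"(\d{4}-\d{2}-\d{2})/(\d{2}-\d{2}-\d{2})\.m4a$")


def recording_start(path):
    """Return the recording's start datetime, taken from the folder and file name only."""
    match = _JPR_PATH.search(path)
    if match is None:
        raise ValueError(f"not a JPR recording path: {path}")
    return datetime.strptime(" ".join(match.groups()), "%Y-%m-%d %H-%M-%S")
```

`recording_start("Documents/2026-05-21/23-50-00.m4a").date()` gives May 21 regardless of when the file is processed.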

Farm network is down

Symptom

Home Assistant unreachable (http://192.168.0.10:8123 from Mac times out)
ha-mcp shows Server transport closed unexpectedly in ~/Library/Logs/Claude/mcp-server-ha-mcp.log
fpve.netbird.cloud not pingable from home
The MCP health monitor sends Gotify alerts about ha-mcp failure counts

Diagnose from home

# Can the home Proxmox reach the farm Proxmox?
ssh pve 'ping -c 3 -W 2 fpve.netbird.cloud'
# 100% packet loss = NetBird mesh is broken between hpve and fpve

# Is the NetBird mesh API up?
curl -s -o /dev/null -w 'HTTP %{http_code}\n' https://app.netbird.io

# Check NetBird peer status (requires netbird MCP loaded)
# mcp__netbird__list_peers — fpve should appear with status: online

If fpve.netbird.cloud is the only unreachable peer:

NetBird daemon on fpve is down or disconnected, OR
Farm’s Starlink/Omada lost internet, OR
fpve itself is powered off / kernel-panicked

You can’t diagnose from home if the host is unreachable. You need farm-LAN access.

Fix (farm-side)

When physically on the farm or when LAN reachability returns:

# 1. Verify fpve is up
ssh root@192.168.0.191 'uptime; netbird status'

# 2. If NetBird daemon is dead, restart it
ssh root@192.168.0.191 'systemctl restart netbird; netbird status'

# 3. If fpve had a reboot, the IPv6 sysctls aren't persistent — re-apply
ssh root@192.168.0.191 'sysctl -w net.ipv6.conf.vmbr0.accept_ra=2 net.ipv6.conf.vmbr0.autoconf=1 net.ipv6.conf.vmbr0.accept_ra_defrtr=1 net.ipv6.conf.vmbr0.accept_ra_pinfo=1'

# 4. Check HA is up
ssh root@192.168.0.191 'pct exec 100 -- ping -c 2 192.168.0.10'
# HA itself is a separate VM/CT; should respond independently of fpve uptime

Verify from home

# Mesh restored
ssh pve 'ping -c 3 -W 2 fpve.netbird.cloud'

# HA reachable
curl -s -o /dev/null -w 'HTTP %{http_code}\n' --max-time 5 http://192.168.0.10:8123

# Restart Claude Desktop so ha-mcp reconnects
osascript -e 'tell application "Claude" to quit'
sleep 2
open -a Claude

In a new Cowork session, smoke-test ha-mcp: ask it to list HA areas. Should return 13 areas (Barn, Garage, Kitchen, etc.).

Fix doc-sync auth failure

Symptom

Morning email arrives but ~/Sync/ED/.doc-sync-log/YYYY-MM-DD.md is ~300 bytes containing one of:

Failed to authenticate. API Error: 401 The socket connection was closed unexpectedly...
Failed to authenticate. API Error: 401 <html><head><title>502 Bad Gateway</title>...cloudflare</html>
The Gotify alert: ⚠️ Doc-Sync YYYY-MM-DD — AUTH FAILED

The CLI mis-reports the real cause as “401” regardless of the underlying issue. The most likely cause is a corrupted API key file, not an actual auth-server problem.

Diagnose

# Inspect the key file — look for any garbage prefix
head -c 25 ~/.config/anthropic-api-key
echo

# Check file size — a clean key is exactly 108 bytes (no trailing newline)
wc -c ~/.config/anthropic-api-key

If the file starts with anything other than sk-ant-api03-, it’s corrupted. The historical bug was a literal -n prefix from a manual echo -n "$KEY" > file in a shell where -n was printed instead of treated as a flag.
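The same checks can be run in one pass over the file's raw bytes. The sketch below uses the prefix and the 108-byte size stated above. The function name is hypothetical, and the checks say nothing about whether the API still accepts the key.

```python
EXPECTED_PREFIX = b"sk-ant-api03-"
EXPECTED_SIZE = 108  # per this runbook: a clean key file is exactly 108 bytes


def key_file_problems(raw):
    """Return human-readable problems found in the raw bytes of the key file."""
    problems = []
    if not raw.startswith(EXPECTED_PREFIX):
        problems.append(f"unexpected prefix: {raw[:25]!r}")
    if raw != raw.rstrip():
        problems.append("trailing whitespace or newline")
    if len(raw) != EXPECTED_SIZE:
        problems.append(f"size is {len(raw)} bytes, expected {EXPECTED_SIZE}")
    return problems
```

Feed it the bytes from opening `~/.config/anthropic-api-key` in binary mode (`"rb"`); an empty list means the file at least has the right shape.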

Test the key directly against the API:

KEY=$(cat ~/.config/anthropic-api-key)
curl -s -o /dev/null -w 'HTTP %{http_code}\n' \
    -H "x-api-key: $KEY" \
    -H 'anthropic-version: 2023-06-01' \
    https://api.anthropic.com/v1/models

HTTP 200 = key works. HTTP 401 = key is bad.

Fix

Rewrite the file cleanly with printf (which doesn’t have the -n ambiguity):

# Back up the broken version first
cp ~/.config/anthropic-api-key ~/.config/anthropic-api-key.bak.$(date +%Y%m%d-%H%M%S)

# Get the key from your password manager / wherever it lives, then:
printf '%s' 'sk-ant-api03-...' > ~/.config/anthropic-api-key
chmod 600 ~/.config/anthropic-api-key

# Verify
ls -la ~/.config/anthropic-api-key   # should show 108 bytes, mode 600
head -c 25 ~/.config/anthropic-api-key  # should start with sk-ant-api03-

Never use echo -n to write the key — different shells handle -n differently and some write it as a literal prefix.

Verify

Run doc-sync manually with yesterday’s date:

~/Sync/ED/skills/doc-sync/scripts/run.sh
tail -30 ~/Sync/ED/.doc-sync-log/.last-run.log

You should see Auth precheck OK (HTTP 200) in the log and the report should be 5–20 KB (not 300 bytes).

Why this is already hardened

As of 2026-05-25, run.sh has two defenses:

Strip on load — sed -E 's/^-n[[:space:]]+//; s/[[:space:]]+$//' removes a stray -n prefix and any trailing whitespace before using the key.
Step 0 fail-fast precheck — single /v1/models request with --max-time 15. If it returns non-200, writes the actual cause to the day’s report, Gotify-alerts at priority 8, and exits 1 instead of burning 5+ minutes on a 1 MB prompt.

So the failure mode going forward should be a clear error inside 15 seconds, not a silent stub at 3 AM.
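For reference, the strip-on-load behaviour is easy to reproduce outside the shell. This is a Python equivalent of the sed expression for a single-line key file, offered as an illustration rather than as what run.sh executes; the function name is hypothetical.

```python
import re


def clean_key(raw):
    """Drop a stray leading '-n ' and any trailing whitespace, as the sed expression does."""
    key = re.sub(r"^-n\s+", "", raw)
    return re.sub(r"\s+$", "", key)
```

Only a leading `-n` followed by whitespace is removed, so a key that merely contains `-n` elsewhere is left alone.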

Recover from a broken Bee Hub deploy

Symptom

~/Library/Logs/bee-hub-deploy.log shows Hugo build FAILED (rc=1). Aborting deploy.
Live site at hub.edmd.me looks stale (your latest changes are missing)
No Gotify alert fires for a Hugo build failure at present: Hugo’s stderr isn’t wrapped, and only validator failures alert

The good news: the deploy script has strict mode (--panicOnWarning) and the exit-code check aborts on any Hugo error before rsyncing. So a broken local build doesn’t propagate to live. The live site stays on the last good tree until you fix it.

Find the error

cd ~/Sync/ED/homelab/bee_hub
/opt/homebrew/bin/hugo --panicOnWarning --printPathWarnings 2>&1 | tail -20

Hugo’s strict-mode errors are usually one of:

Error	Likely cause
`shortcode "section" must be closed or self-closed`	Missing close tag for a `section` shortcode (forgot the `/section` line)
`failed to extract shortcode "<name>": shortcode "<name>" not found`	Typo in shortcode name
`failed to render shortcode "<name>"`	Bad params or unclosed nested block
`parse failed`	YAML frontmatter syntax error — usually a missing quote or bad indent
`duplicate path warning`	Two pages compile to the same URL — check `slug:` overrides
`template render error`	A custom layout/partial references something that doesn’t exist

The error message includes a file path and line number. Jump straight there.

Fix

Edit the file, fix the issue, run Hugo again locally before redeploying:

cd ~/Sync/ED/homelab/bee_hub
/opt/homebrew/bin/hugo --panicOnWarning --printPathWarnings 2>&1 | tail -5
# Want to see "Total in NNNms" and no ERROR lines

Once it builds clean, re-run the full deploy:

zsh ~/Sync/ED/homelab/bee_hub/deploy-vps.sh
tail -5 ~/Library/Logs/bee-hub-deploy.log

Should end with Deploy complete.

Emergency rollback

If a broken change went live despite the strict-mode protection (shouldn’t happen now), these are the deploy targets the live site is served from:

VPS (public): root@100.123.69.155:/var/www/bee-hub/
CT103 (internal): root@192.168.8.54:/var/www/bee-hub/

To roll back to a previous Hugo build:

# Move current tree aside, rebuild from a known-good git commit, redeploy
cd ~/Sync/ED/homelab/bee_hub
git stash             # set aside in-progress edits
git log --oneline -20 # find a good commit
git checkout <hash>
zsh deploy-vps.sh     # rebuilds + rsyncs the old tree
git checkout main     # come back to current
git stash pop         # restore edits

The deploy targets are rsync targets, not git checkouts on the remote — so “rollback” means re-deploying an older local build over the top.

Verify live state

curl -s -o /dev/null -w 'public: %{http_code} %{size_download} bytes\n' https://hub.edmd.me/
curl -s -o /dev/null -w 'internal: %{http_code} %{size_download} bytes\n' http://192.168.8.54/

# spot-check a known page
curl -s https://hub.edmd.me/runbooks/ | grep -c 'Runbooks'

Both should return 200 and a non-trivial size. The index regen log should also confirm: Wrote /Users/bee/Sync/ED/BEE_HUB_INDEX.md (322 pages, ...).

Reload Claude Desktop / MCPs

When to do this

Edited ~/Sync/ED/config/claude_desktop_config.json (added/removed/modified an MCP)
Edited an MCP server’s source code in ~/.mcp-servers/<name>/
Edited a SKILL.md and need it to appear in available_skills (also requires ~/scripts/sync-cowork-snapshot.sh first — see below)
A specific MCP shows Server transport closed unexpectedly in ~/Library/Logs/Claude/mcp-server-<name>.log

Procedure

Step 1 — quit fully (not just close the window):

Cmd-Q from the Claude Desktop menu bar

If you see “Claude Desktop is still running” indicators (tray icon, dock bounce), kill from terminal:

osascript -e 'tell application "Claude" to quit'
# or, harder:
pkill -f 'Claude.app/Contents/MacOS/Claude'

Step 2 — relaunch. Open /Applications/Claude.app or open -a Claude.

Step 3 — verify MCPs loaded. Open a Cowork session. The first system reminder of any session lists deferred MCP tools. Look for the MCP you expected.

If you only changed a SKILL, the change won’t appear in the new session unless you also:

~/scripts/sync-cowork-snapshot.sh

The Cowork session snapshot is keyed on stable UUIDs and is REUSED across sessions: Cmd-Q doesn’t rebuild it, and plugin reinstall doesn’t rebuild it. Only the explicit rsync in ~/scripts/sync-cowork-snapshot.sh does.

Verify

For a specific MCP, do a one-tool call in a new Cowork session. e.g. for tana-local: mcp__tana-local__list_workspaces. For memory: mcp__memory__memory_stats.

For SKILL availability: open a session and type /skill — the new SKILL should appear in the dropdown.

Gotchas

Tana startup race: if you launch Claude Desktop before Tana is fully open, tana-local will ECONNREFUSED 127.0.0.1:8262 and bail. Launch Tana first, then Claude. The MCP doesn’t auto-reconnect — restart Claude Desktop after the race.
ha-mcp connection failures: check that NetBird has the 192.168.0.0/24 route enabled and that fpve.netbird.cloud is reachable. The MCP itself wires correctly; the typical failure is the upstream HA being unreachable.
MCP venvs on MacBook may be stale after a Syncthing pull. The com.bee.rebuild-mcp-venvs launchd watcher handles it on file change, but you can force it: launchctl kickstart -k gui/$(id -u)/com.bee.rebuild-mcp-venvs.

Rotate Sonarr / Radarr / Lidarr API keys

When to run this

A key leaked into chat, into a doc, or into the homelab-config git repo.
A consumer (briefing helper, Prowlarr, etc.) started rejecting auth.
Routine rotation.

Fix

The *arr API keys live inside each container’s /config/config.xml (<ApiKey>...</ApiKey>). Rotating = generate a new key, write it back, restart the container, update any consumer that hardcoded the old value.

Generate new keys and rotate one at a time (so you can verify between):

# Generate a 32-char hex key
NEW_KEY=$(openssl rand -hex 16)
echo "$NEW_KEY"

# SSH to pve, edit Sonarr config.xml inside CT100
ssh pve "pct exec 100 -- bash -c 'sed -i.bak \"s|<ApiKey>[^<]*</ApiKey>|<ApiKey>$NEW_KEY</ApiKey>|\" /var/lib/docker/volumes/sonarr_config/_data/config.xml || docker exec sonarr sed -i.bak \"s|<ApiKey>[^<]*</ApiKey>|<ApiKey>$NEW_KEY</ApiKey>|\" /config/config.xml'"

# Restart Sonarr
ssh pve "pct exec 100 -- docker restart sonarr"

# Wait for it to come back, then verify the new key works
sleep 10
ssh pve "pct exec 100 -- curl -s 'http://localhost:8989/api/v3/system/status?apikey=$NEW_KEY'" | head -c 200

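Repeat for Radarr (port 7878) and Lidarr (port 8686), generating a fresh key for each and substituting the container name and config path. Lidarr’s API is v1, so its status check is /api/v1/system/status rather than /api/v3/system/status.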

Update consumers

The daily-briefing helper (~/scripts/arr-briefing-data.py) extracts keys at runtime from config.xml via SSH, so it doesn’t need updating after rotation — that’s the whole point of not hardcoding.

What does need updating:

Consumer	Where to update
Prowlarr	http://192.168.8.100:9696 → Settings → Apps → Sonarr/Radarr/Lidarr → API Key field
Recyclarr	`/opt/recyclarr/recyclarr.yml` — set new API keys, restart container
Any hardcoded use	`grep -rlE 'd792444549|b117993eb50|3dc17d20ca664' ~/Sync/ED ~/scripts` to find leftovers from the May 2026 leak
Bee Hub docs	None should hardcode — but search anyway: `grep -rln '<ApiKey>' ~/Sync/ED/homelab/bee_hub/content`

Verify

Run the briefing helper and confirm it pulls real data:

python3 ~/scripts/arr-briefing-data.py --hours 168 | head -30

Should return JSON with non-empty sonarr.imports, etc. If you get {"error": "api key: ..."}, the key didn’t actually rotate or the container hasn’t restarted.

Record

Update ~/Sync/ED/SECRETS.md with the new keys (or note that they’re extracted at runtime, depending on the entry). If this was a leak-driven rotation, remove the rotation entry from TASKS.md once consumers are updated.

Seedbox sync stalled / not draining

The seedbox→homelab pull is the single most misleading thing in the stack to diagnose, because the same dataset is mounted under two names and LXC processes show up in the host’s process list. Read this before concluding anything is “broken.” (Reconciled 2026-06-13.)

Symptom

Uptime Kuma monitor 57 “Cron: Seedbox Sync” shows DOWN, or a Gotify alert “no successful seedbox sync in Xh”.
The seedbox quota isn’t dropping, or /mnt/seedbox “looks empty.”
An rsync from …/completed/ to /mnt/seedbox/ appears to be running but moving nothing.

Diagnose

The real sync runs on CT100, not pve. It’s /usr/local/bin/seedbox-sync.sh on CT100, cron */15, pulling delgross@ismene.usbx.me:~/downloads/{completed,transmission,complete}/ into $DEST=/mnt/seedbox.

Two traps that make a healthy sync look broken:

Dual mount name. The ingest ZFS dataset nvmepool/ingest is mounted at /mnt/seedbox inside CT100 and at /nvmepool/ingest on the pve host — same data (~1.3 TB), two names. The pve-host /mnt/seedbox is a dead empty dir (just a stale Music-ImportFailed stub). Looking there and seeing “empty” tells you nothing.
LXC procs in the host ps. CT100’s rsync/cron processes appear in pve’s ps -eo with host PIDs and a /mnt/seedbox target that means the live dataset in CT100’s namespace but a dead dir on the host. Always run cat /proc/<pid>/cgroup — lxc/100 = it’s CT100’s process, not pve’s.
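The cgroup check in the second trap can be scripted. The sketch below extracts the container ID from the text of /proc/<pid>/cgroup. It assumes the `lxc/<id>` path segment mentioned above; the exact cgroup line layout varies by host, so the sample strings are illustrative, and the function name is hypothetical.

```python
import re


def owning_container(cgroup_text):
    """Return the LXC container ID named in /proc/<pid>/cgroup content, or None for a host process."""
    match = re.search(r"(?:^|/)lxc/(\d+)(?:/|$)", cgroup_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None
```

Comparing the result with `100` avoids a substring match accidentally treating container 1000 as CT100.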

Don’t trust “rsync is running” — sample throughput: cat /sys/class/net/eth0/statistics/rx_bytes twice 5s apart inside CT100. A real pull is ~6–13 MB/s; a true stall is ~0.
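The two-sample throughput check can be sketched as follows, to be run inside CT100. It assumes the eth0 counter path given above and takes 1 MB as 1,000,000 bytes. The function names are hypothetical, and counter wrap-around is ignored.

```python
import time

RX_PATH = "/sys/class/net/eth0/statistics/rx_bytes"


def read_rx_bytes(path=RX_PATH):
    """Read the interface's cumulative received-bytes counter."""
    with open(path) as fh:
        return int(fh.read().strip())


def throughput_mb_s(first, second, seconds):
    """Convert two rx_bytes samples taken `seconds` apart into MB/s."""
    return (second - first) / seconds / 1_000_000


def sample_throughput(interval=5.0, read=read_rx_bytes, sleep=time.sleep):
    """Take two samples `interval` seconds apart and return the rate in MB/s."""
    first = read()
    sleep(interval)
    return throughput_mb_s(first, read(), interval)
```

Against the figures above, a result in the 6 to 13 range indicates a real pull and a result near 0 indicates a stall. The `read` and `sleep` parameters exist only so the function can be exercised without a live interface.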

Note: a 1+ TB backlog legitimately runs up to 6 h per leg, during which the */15 cron just logs “flock held, skipping.” That is normal, not failure.

Fix

Never add a second sync on pve. It writes to /nvmepool/ingest = the same dataset CT100 syncs, and both use --remove-source-files → they race and can corrupt mid-transfer. (Claude did exactly this in error on 2026-06-13; reverted.)
NZBGet client + remote-path-mapping host must be 192.168.8.100 (the CT100 tunnel), not 192.168.8.221/pve. A wrong host silently fails grabs and breaks import path resolution in Bookshelf/Chaptarr too.
The progress heartbeat (added to seedbox-sync.sh 2026-06-13) pushes Kuma 57 every 5 min based on real eth0 RX, so a healthy multi-hour drain stays green and a true stall flags DOWN within the 30-min interval. If 57 is flapping during a normal drain, confirm the heartbeat block is present in the script.
If genuinely wedged: check the NZBGet/Transmission tunnels (autossh-* on CT100) and the seedbox daemon, then let the */15 cron pick up — --partial resumes, nothing is lost.

Verify

Inside CT100: throughput sample shows MB/s, not KB/s.
Seedbox quota -s is dropping; ~/downloads/nzbget/completed/Movies shrinking.
Kuma monitor 57 returns UP (it carries an OK moved NNN MB message on success).

Syncthing not converging

Symptom

File you saved on Mac Studio not appearing on MacBook (or vice versa)
Syncthing UI shows “Out of Sync” for a folder with a large file count behind it
New SKILL edits aren’t showing up on MacBook even after waiting

Diagnose

Open both UIs side-by-side:

Mac Studio: http://127.0.0.1:8384
Proxmox (hub): http://192.168.8.221:8384
MacBook: http://127.0.0.1:8384 on the MacBook itself, or via ssh -L 8385:127.0.0.1:8384 macbook from Studio

Check:

Indicator	Meaning
Folder shows “Out of Sync”	Real sync work pending; let it run (or kick it)
Folder shows “Up to Date” on both ends but file is missing	`.stignore` is filtering the file — check
One device offline	Bring it online or wait
Folder shows huge queue size on one end	Probably blocked behind an excluded path that previously synced

`.stignore` check

~/Sync/ED/.stignore excludes large subtrees from syncing (life_archive/data/, homelab/paperless-ngx/, etc.). If a file you want isn’t replicating, it might be under an excluded path.

cat ~/Sync/ED/.stignore

# Confirm whether a specific path is excluded
syncthing cli --home="$HOME/Library/Application Support/Syncthing" check ignore ~/Sync/ED/relative/path/to/file
# OR, simpler: just see if the file's parent is in stignore

To add an exclusion (e.g., a new bloat source):

# Edit ~/Sync/ED/.stignore — add the relative path (no leading slash)
echo 'new/bloat/path' >> ~/Sync/ED/.stignore
# Syncthing watches .stignore and reloads automatically

To remove an exclusion: edit the file, save. Next scan will pick up the previously-excluded content.

Force a rescan

When Syncthing thinks everything’s in sync but a file is clearly missing, force a rescan on the source device:

# Find the folder ID in the Syncthing UI URL or via the API
SYNC_API="$(grep -oE '<apikey>[^<]+' ~/Library/Application\ Support/Syncthing/config.xml | head -1 | sed 's/<apikey>//')"
curl -s -X POST -H "X-API-Key: $SYNC_API" 'http://localhost:8384/rest/db/scan?folder=<folder-id>'

For the Studio’s claude-ed folder, the typical sequence after a heavy-edit session is:

# Trigger rescan, then wait a minute and check that Proxmox picked it up
launchctl kickstart -k gui/$(id -u)/com.beedifferent.syncthing  # restart Syncthing if its index is stuck

Conflict files

If both devices edited the same file while disconnected, Syncthing keeps both as conflict copies. Find them:

find ~/Sync/ED -name '*.sync-conflict-*' | head -20

Resolve: review each, pick the right version, delete the loser, let Syncthing replicate the winner.
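When reviewing, it helps to pair each conflict copy with the file it competes with. The sketch below strips the conflict marker from a file name. It assumes the marker takes the form `.sync-conflict-<date>-<time>` with an optional device suffix, inserted before the extension; that is my understanding of Syncthing's naming, not something stated in this runbook. The function name is hypothetical.

```python
import re

# Assumed marker shape: .sync-conflict-YYYYMMDD-HHMMSS with an optional -DEVICEID suffix
_CONFLICT = re.compile(r"\.sync-conflict-\d{8}-\d{6}(?:-[A-Z0-9]+)?")


def original_name(conflict_name):
    """Strip the conflict marker, returning the name of the file the copy competes with."""
    return _CONFLICT.sub("", conflict_name, count=1)
```

Diffing each conflict copy against `original_name(...)` in the same directory shows what the losing side changed before you delete it.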
