[chart/redis-ha] split-brain-fix does nothing when sentinel can't find a master #397
Description
We ran into a situation where our redis-ha cluster got into a split-brain state during a node disruption, and the fix-split-brain.sh sidecar didn't do anything about it. After digging into the script, I think I found why.
What's happening
When sentinel can't agree on a master (quorum is broken), sentinel get-master-addr-by-name returns an empty string. The main loop in fix-split-brain.sh checks for two cases:
```sh
identify_master   # sets $MASTER via `sentinel get-master-addr-by-name`
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    # "I'm supposed to be master" - check that local redis agrees
elif [ "${MASTER}" ]; then
    # "Someone else is master" - check that local redis is replicating from the right node
fi
# Nothing here for when $MASTER is empty
```
When `$MASTER` comes back empty:
- The first `if` is false (`""` doesn't equal our announce IP)
- The `elif` is also false (an empty string is falsy in shell)
- So the script just... sleeps and loops. No log, no warning, no recovery attempt.
This is exactly the scenario where you'd most want the split-brain fix to kick in, but it's completely inert.
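The dead-end can be demonstrated in isolation with plain `sh` (the `ANNOUNCE_IP` value below is a stand-in; the real script reads it from the pod environment):

```sh
#!/bin/sh
# Minimal demo: with an empty $MASTER, neither branch of the
# split-brain check runs, so the sidecar does nothing at all.
MASTER=""
ANNOUNCE_IP="10.0.0.5"

if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    echo "branch: i-am-master"
elif [ "${MASTER}" ]; then
    echo "branch: someone-else-is-master"
else
    echo "branch: none - this is where the real script silently loops"
fi
```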
Why this matters
- No visibility - There's nothing in the logs to tell you sentinel lost quorum. The sidecar just silently keeps looping.
- No recovery - Redis nodes stay in whatever broken state they're in. We had to manually intervene.
- False sense of security - The container stays up and passes health checks, so everything looks fine from a monitoring perspective.
How to reproduce
- Deploy redis-ha with 3 replicas and `splitBrainDetection.enabled: true`
- Wait for things to stabilize
- Break quorum - e.g. kill 2 of 3 sentinel processes, or partition the network so sentinels can't talk to each other
- Watch the split-brain-fix container logs: nothing gets printed for the empty-master case
- Redis nodes may now be in an inconsistent state with no automatic recovery
What I'd expect instead
At minimum, the script should log a warning when sentinel returns empty so operators know something is wrong. Ideally it should also:
- Check the local redis role as a diagnostic
- After some number of consecutive empty responses, try `sentinel reset` to kick off re-election
- Maybe write a status file that a readiness probe could check
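The status-file idea could be sketched like this (the path, file contents, and function name are hypothetical, not something the chart has today):

```sh
#!/bin/sh
# Hypothetical status file a readiness probe could inspect.
STATUS_FILE="${STATUS_FILE:-/tmp/split-brain-status}"

record_master_state() {
    # $1 is whatever `sentinel get-master-addr-by-name` returned
    if [ -z "$1" ]; then
        echo "no-master" > "$STATUS_FILE"
    else
        echo "ok" > "$STATUS_FILE"
    fi
}

record_master_state ""   # simulate sentinel returning an empty reply
cat "$STATUS_FILE"       # prints: no-master
```

A readiness probe could then fail the container with something like `grep -q '^ok$' /tmp/split-brain-status`, which would at least surface the broken-quorum state in `kubectl get pods`.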
Suggested fix
Add an else branch:
```sh
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    # existing logic...
elif [ "${MASTER}" ]; then
    # existing logic...
else
    EMPTY_MASTER_COUNT=$((EMPTY_MASTER_COUNT + 1))
    echo "$(date) WARNING: sentinel returned no master (attempt ${EMPTY_MASTER_COUNT}). Quorum may be broken."
    redis_role
    echo "  Local redis role: ${ROLE:-unknown}"
    if [ "${EMPTY_MASTER_COUNT}" -ge "${MAX_EMPTY_MASTER_RETRIES}" ]; then
        echo "$(date) ERROR: No master from sentinel after ${EMPTY_MASTER_COUNT} checks. Resetting sentinel."
        redis-cli -h "${SERVICE}" -p "${SENTINEL_PORT}" sentinel reset "${MASTER_GROUP}" || true
        EMPTY_MASTER_COUNT=0
    fi
fi
```
This could be controlled with a new values.yaml param like `splitBrainDetection.maxEmptyMasterRetries: 5`.
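For concreteness, the knob could sit next to the existing flag in values.yaml (the parameter name here just follows the suggestion above and is hypothetical):

```yaml
splitBrainDetection:
  enabled: true
  # Hypothetical new setting: number of consecutive empty
  # get-master-addr-by-name replies before attempting `sentinel reset`.
  maxEmptyMasterRetries: 5
```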
Related issues
This feels like part of a pattern with silent failures in this script:
- [chart/redis-ha][BUG] K8s Cluster upgrade causes split brain #121 - the original split-brain bug that motivated the fix
- [chart/redis-ha] split-brain-fix.sh is executed on "sh" but uses "==" instead of "=" for comparison #229 - the `==` vs `=` POSIX bug that silently broke the comparisons
- [chart/redis-ha][BUG] split-brain-fix causes unnecessary master shutdown during failover #383 - the race condition where a newly promoted master gets shut down because `identify_redis_master()` returns empty on masters
In each case, the script either does nothing or does the wrong thing, and there's no log output to help you figure out what happened.
Environment
- redis-ha 4.35.10
- Kubernetes 1.28
- 3 replicas, default sentinel config