[chart/redis-ha] split-brain-fix does nothing when sentinel can't find a master #397

@kishoregv

Description

We ran into a situation where our redis-ha cluster got into a split-brain state during a node disruption, and the fix-split-brain.sh sidecar did nothing about it. After digging into the script, I think I found why.

What's happening

When sentinel can't agree on a master (quorum is broken), sentinel get-master-addr-by-name returns an empty string. The main loop in fix-split-brain.sh checks for two cases:

identify_master   # sets $MASTER via sentinel get-master-addr-by-name

if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    # "I'm supposed to be master" - check that local redis agrees

elif [ "${MASTER}" ]; then
    # "Someone else is master" - check that local redis is replicating from the right node

fi
# Nothing here for when $MASTER is empty

When $MASTER comes back empty:

  • First if is false ("" doesn't equal our announce IP)
  • elif is also false (empty string is falsy in shell)
  • So the script just... sleeps and loops. No log, no warning, no recovery attempt.
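The fall-through is easy to demonstrate in isolation. This is a minimal reduction of the branch logic above (the `ANNOUNCE_IP` value is made up for the demo); with an empty `$MASTER`, neither condition matches and the loop body is a no-op:

```shell
#!/bin/sh
# Reduction of the fix-split-brain.sh branch logic, not the full script.
MASTER=""                 # what get-master-addr-by-name yields with no quorum
ANNOUNCE_IP="10.0.0.5"    # hypothetical pod announce IP
BRANCH="none"

if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    BRANCH="self-is-master"
elif [ "${MASTER}" ]; then
    BRANCH="other-is-master"
fi

echo "taken branch: $BRANCH"
```

Running it prints `taken branch: none` — the exact case where the sidecar should act is the one case with no branch.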

This is exactly the scenario where you'd most want the split-brain fix to kick in, but it's completely inert.

Why this matters

  • No visibility - There's nothing in the logs to tell you sentinel lost quorum. The sidecar just silently keeps looping.
  • No recovery - Redis nodes stay in whatever broken state they're in. We had to manually intervene.
  • False sense of security - The container stays up and passes health checks, so everything looks fine from a monitoring perspective.

How to reproduce

  1. Deploy redis-ha with 3 replicas and splitBrainDetection.enabled: true
  2. Wait for things to stabilize
  3. Break quorum - e.g. kill 2 of 3 sentinel processes, or partition the network so sentinels can't talk to each other
  4. Watch the split-brain-fix container logs: nothing gets printed for the empty-master case
  5. Redis nodes may now be in an inconsistent state with no automatic recovery

What I'd expect instead

At minimum, the script should log a warning when sentinel returns empty so operators know something is wrong. Ideally it should also:

  • Check the local redis role as a diagnostic
  • After some number of consecutive empty responses, try sentinel reset to kick off re-election
  • Maybe write a status file that a readiness probe could check
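The status-file idea could be as small as this sketch (the file path, the `ok`/`no-master` values, and the probe command are all hypothetical — nothing in the chart provides them today):

```shell
#!/bin/sh
# Sketch: the sidecar records its last known state; a probe reads it.
STATUS_FILE="${STATUS_FILE:-/tmp/split-brain-status}"

write_status() {
    # Overwrite with the latest state so the probe sees current health.
    echo "$1" > "$STATUS_FILE"
}

# Happy path would write "ok"; on an empty master the loop would write:
write_status "no-master"

# A readiness probe could then be:
#   exec: ["sh", "-c", "grep -qx ok /tmp/split-brain-status"]
grep -qx ok "$STATUS_FILE" && echo "ready" || echo "not ready"
```

With `no-master` in the file, the probe fails and Kubernetes takes the pod out of rotation instead of letting it pass health checks while broken.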

Suggested fix

Add an else branch:

# Initialize once, before the main loop:
EMPTY_MASTER_COUNT=0

if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    # existing logic...
elif [ "${MASTER}" ]; then
    # existing logic...
else
    EMPTY_MASTER_COUNT=$((EMPTY_MASTER_COUNT + 1))
    echo "$(date) WARNING: sentinel returned no master (attempt ${EMPTY_MASTER_COUNT}). Quorum may be broken."
    redis_role    # script helper; sets $ROLE from the local redis
    echo "  Local redis role: ${ROLE:-unknown}"

    if [ "${EMPTY_MASTER_COUNT}" -ge "${MAX_EMPTY_MASTER_RETRIES:-5}" ]; then
        echo "$(date) ERROR: No master from sentinel after ${EMPTY_MASTER_COUNT} checks. Resetting sentinel."
        redis-cli -h "${SERVICE}" -p "${SENTINEL_PORT}" sentinel reset "${MASTER_GROUP}" || true
        EMPTY_MASTER_COUNT=0
    fi
fi
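To sanity-check the proposed else branch, here is a stub harness that simulates three loop iterations with an empty master (`redis_role` and the sentinel reset call are stubbed out; the variable names mirror the script but are assumptions):

```shell
#!/bin/sh
# Harness for the proposed else branch; sentinel/redis calls are stubbed.
EMPTY_MASTER_COUNT=0
MAX_EMPTY_MASTER_RETRIES=3
MASTER=""                 # simulate sentinel returning no master
ANNOUNCE_IP="10.0.0.5"    # hypothetical pod announce IP

redis_role() { ROLE="slave"; }                      # stub: real helper asks local redis
sentinel_reset() { echo "sentinel reset issued"; }  # stub for the redis-cli call

i=0
while [ $i -lt 3 ]; do
    if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
        :   # existing "I'm master" logic
    elif [ "${MASTER}" ]; then
        :   # existing "someone else is master" logic
    else
        EMPTY_MASTER_COUNT=$((EMPTY_MASTER_COUNT + 1))
        echo "WARNING: no master (attempt ${EMPTY_MASTER_COUNT})"
        redis_role
        if [ "${EMPTY_MASTER_COUNT}" -ge "${MAX_EMPTY_MASTER_RETRIES}" ]; then
            sentinel_reset
            EMPTY_MASTER_COUNT=0
        fi
    fi
    i=$((i + 1))
done
```

Three consecutive empty responses produce three warnings, one reset, and a counter back at zero — i.e. the escalation fires exactly once per retry window instead of on every loop tick.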

Could be controlled with a new values.yaml param like splitBrainDetection.maxEmptyMasterRetries: 5.
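For reference, the knob might look like this in values.yaml (the `splitBrainDetection.enabled` key exists today; `maxEmptyMasterRetries` is the proposed addition and this shape is a sketch):

```yaml
splitBrainDetection:
  enabled: true
  # proposed: consecutive empty get-master-addr-by-name replies to
  # tolerate before issuing `sentinel reset`
  maxEmptyMasterRetries: 5
```

The chart would presumably surface it to the sidecar as an env var (e.g. `MAX_EMPTY_MASTER_RETRIES`), which is why the snippet above defaults it with `${MAX_EMPTY_MASTER_RETRIES:-5}`.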

Related issues

This feels like part of a pattern of silent failures in this script: in each case, the script either does nothing or does the wrong thing, and there's no log output to help you figure out what happened.

Environment

  • redis-ha 4.35.10
  • Kubernetes 1.28
  • 3 replicas, default sentinel config
