
Fix memory leak due to lingering messages in the message store even after removal of message from queues#109

Open
ashah4 wants to merge 115 commits into Netflix:master from ashah4:fix-memory-leak-in-hash-after-removal

Conversation


@ashah4 ashah4 commented Jun 9, 2025

Problem
When remove() is invoked for a particular message in a queue, it attempts to remove the message from all the shards of the queue, as well as from the unack queue. However, it removes the message from the message store ONLY if an entry existed in the message queue before removal.

This might not always be the case: a message may have been popped from the queue, in which case it is removed from the queue shard and added to the unack queue. If remove() is invoked in that situation, the entry in the message store hash never gets cleaned up, resulting in a memory leak.

This was observed in particular when Conductor's reconciler (Sweeper) had popped a workflow from the decider queue while, at the same time, another thread tried to remove the same workflow from the queue.

Changes in the PR
Updated the remove() function to remove the entry from the message store as long as the message was found in either the message queue or the unack queue. This ensures there is no memory leak due to lingering entries in the hash.
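The race and the fix can be sketched with a minimal in-memory model (hypothetical class and field names; the real implementation stores messages in Redis structures accessed through the Dyno client):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for one queue: a queue shard, an unack queue,
// and the message-store hash that maps message id -> payload.
class MiniQueue {
    final Set<String> queueShard = new HashSet<>();
    final Set<String> unackQueue = new HashSet<>();
    final Map<String, String> messageStore = new HashMap<>();

    void push(String id, String payload) {
        queueShard.add(id);
        messageStore.put(id, payload);
    }

    // pop() moves a message from the queue shard to the unack queue;
    // the payload stays in the message store until ack/remove.
    String pop() {
        Iterator<String> it = queueShard.iterator();
        if (!it.hasNext()) {
            return null;
        }
        String id = it.next();
        it.remove();
        unackQueue.add(id);
        return id;
    }

    // Fixed remove(): delete the payload if the id was found in EITHER
    // the queue shard OR the unack queue. The buggy version deleted the
    // payload only when the id was still in the queue shard, so a message
    // that had been popped (and thus moved to unack) leaked its payload.
    boolean remove(String id) {
        boolean inQueue = queueShard.remove(id);
        boolean inUnack = unackQueue.remove(id);
        if (inQueue || inUnack) {
            messageStore.remove(id);
            return true;
        }
        return false;
    }
}

public class Demo {
    public static void main(String[] args) {
        MiniQueue q = new MiniQueue();
        q.push("m1", "payload");
        q.pop();        // Sweeper pops: m1 now sits in the unack queue
        q.remove("m1"); // concurrent removal by another thread
        // With the fix the payload is gone; before it, this would leak.
        System.out.println(q.messageStore.containsKey("m1")); // prints false
    }
}
```

The key point is that removal from the hash is keyed off membership in any of the places the message id can legitimately live, not just the queue shard.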

v1r3n and others added 30 commits October 6, 2017 01:19
revert the oss plugin version
[debug] revert gradle version
Refactor queues to use java.time.Clock
gradle updates

Upgrade plugins

Updating Redis Pipeline Queue v2 with proper names
Remove some unused lines (solves Netflix#29)
smukil and others added 30 commits October 11, 2019 15:05
unsafePopWithMsgId() checks all 3 shards for a msgId, but only one of
them can have it. So make sure that the checks against the other
2 shards do not spam the logs with "Cannot find message with ID ...".
This API looks for a value matching a predicate in the hashmap and pops
the first message ID that matches, if one is found.

Also added an atomicPopWithMsgIdHelper() helper function to do a pop in one
round trip. This should only be used if the ring size of the underlying
Dynomite cluster is 1.

Testing: Tested with DynoQueueDemo
Using queues with DC_EACH_SAFE_QUORUM is quite expensive. The bulk
pop operation is meant to pop more messages within a single round trip.
unsafeBulkPop() allows bulk popping from all shards.

localGet() does a get() with a non-quorum connection.

TODO: unsafeBulkPop() will return nil if messageCount > size().
Fix this.
TODO 2: Do code cleanup.
Previously popWithMsgPredicate() would pop the first item it found
in the hashmap that matches the given predicate. With this patch, it
will obey the queueing priority.
Tests fail without this fix.
In some weird cases, we find the message in the queue but without a
payload in the hashmap. Although this is never expected, it seems to
happen, and this patch should stop the bleeding until we find out the
root cause.
Attempts to return the items present in the local queue shard
but not in the hashmap, if any.

(Ideally, we would not require this function, however, in some
configurations, especially with multi-region write traffic sharing
the same queue, we may find ourselves with stale items in the queue
shards)
Note: All items returned MUST be checked at the app level to see whether they have
already been processed before acting on them (e.g., removing them).
Upgrade nebula.netflixoss to replace bintray publication and update TravisCi secrets
Remove TravisCI and use Github Actions
