
Fix memory leak due to lingering messages in the message store even after removal of message from queues#109

Open
ashah4 wants to merge 115 commits into Netflix:master from ashah4:fix-memory-leak-in-hash-after-removal

Conversation


@ashah4 ashah4 commented Jun 9, 2025

Problem
When remove() is invoked for a particular message in a queue, it attempts to remove the message from all the shards of the queue, as well as from the unack queue. However, it removes the message from the message store ONLY if an entry existed in the message queue before removal.

This might not always be the case: a message may have been popped from the queue, in which case it is removed from the queue shard and added to the unack queue. If remove() is invoked in that situation, the entry in the message store hash never gets cleaned up, resulting in a memory leak.

This was observed in particular when Conductor's reconciler (Sweeper) had popped a workflow from the decider queue while, at the same time, another thread tried to remove the same workflow from the queue.

Changes in the PR
Updated the remove() function to remove the entry from the message store as long as the message was found in either the message queue or the unack queue. This ensures there is no memory leak due to lingering entries in the hash.
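The race and the fix can be sketched with a minimal in-memory model (hypothetical class and field names; the real implementation stores messages in Redis structures accessed through the Dyno client):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for one queue: a queue shard, an unack queue,
// and the message-store hash that maps message id -> payload.
class MiniQueue {
    final Set<String> queueShard = new HashSet<>();
    final Set<String> unackQueue = new HashSet<>();
    final Map<String, String> messageStore = new HashMap<>();

    void push(String id, String payload) {
        queueShard.add(id);
        messageStore.put(id, payload);
    }

    // pop() moves a message from the queue shard to the unack queue;
    // the payload stays in the message store until ack/remove.
    String pop() {
        Iterator<String> it = queueShard.iterator();
        if (!it.hasNext()) {
            return null;
        }
        String id = it.next();
        it.remove();
        unackQueue.add(id);
        return id;
    }

    // Fixed remove(): delete the payload if the id was found in EITHER
    // the queue shard OR the unack queue. The buggy version deleted the
    // payload only when the id was still in the queue shard, so a message
    // that had been popped (and thus moved to unack) leaked its payload.
    boolean remove(String id) {
        boolean inQueue = queueShard.remove(id);
        boolean inUnack = unackQueue.remove(id);
        if (inQueue || inUnack) {
            messageStore.remove(id);
            return true;
        }
        return false;
    }
}

public class Demo {
    public static void main(String[] args) {
        MiniQueue q = new MiniQueue();
        q.push("m1", "payload");
        q.pop();        // Sweeper pops: m1 now sits in the unack queue
        q.remove("m1"); // concurrent removal by another thread
        // With the fix the payload is gone; before it, this would leak.
        System.out.println(q.messageStore.containsKey("m1")); // prints false
    }
}
```

The key point is that removal from the hash is keyed off membership in any of the places the message id can legitimately live, not just the queue shard.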

v1r3n and others added 30 commits October 6, 2017 01:19
revert the oss plugin version
[debug] revert gradle version
Refactor queues to use java.time.Clock
gradle updates

Upgrade plugins

Updating Redis Pipeline Queue v2 with proper names
Remove some unused lines (solves Netflix#29)
smukil and others added 30 commits October 11, 2019 15:05
unsafePopWithMsgId() checks all 3 shards for a msgId, but only one of
them can have it. So make sure that the checks against the other
2 shards do not spam the logs with "Cannot find message with ID ...".
This API looks for a value matching a predicate in the hashmap and pops
the first message ID that matches, if one is found.

Also added an atomicPopWithMsgIdHelper() helper function to do a pop in one
round trip. This should only be used if the ring size of the underlying
Dynomite cluster is 1.

Testing: Tested with DynoQueueDemo
Using queues with DC_EACH_SAFE_QUORUM is quite expensive. The bulk
pop operation is meant to pop more messages within a single round trip.
unsafeBulkPop() allows bulk popping from all shards.

localGet() does a get() with a non-quorum connection.

TODO: unsafeBulkPop() will return nil if messageCount > size().
Fix this.
TODO 2: Do code cleanup.
Previously popWithMsgPredicate() would pop the first item it found
in the hashmap that matches the given predicate. With this patch, it
will obey the queueing priority.
Tests fail without this fix.
In some weird cases, we find the message in the queue but without a
payload in the hashmap. Although this is never expected, it seems to
happen, and this patch should stop the bleeding until we find out the
root cause.
Attempts to return the items present in the local queue shard
but not in the hashmap, if any.

(Ideally, we would not require this function, however, in some
configurations, especially with multi-region write traffic sharing
the same queue, we may find ourselves with stale items in the queue
shards)
Note: All items returned MUST be checked at the app level to see whether they have
already been processed before acting on them (e.g., removing them).
Upgrade nebula.netflixoss to replace bintray publication and update TravisCi secrets
Remove TravisCI and use Github Actions
