
feat: optimize rpc calls#5394

Open
sbackend123 wants to merge 9 commits into master from
feat/rpc-calls-optimisation

Conversation

@sbackend123
Contributor

@sbackend123 sbackend123 commented Mar 12, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Removed redundant RPC call patterns in the transaction flow. Added a cache layer for the BlockNumber RPC call.
Fixed lint issues in the new cache package and cleaned up minor lint findings in adjacent test/helper code.

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

#5388

Screenshots (if appropriate):

Screenshot from 2026-03-16 13-29-22

The hit ratio is most likely low because of high-load errors (EOF), which are fixed by this PR.

@sbackend123 sbackend123 changed the title RPC calls optimisation Feat: RPC calls optimisation Mar 12, 2026
@sbackend123 sbackend123 force-pushed the feat/rpc-calls-optimisation branch from f8ca4e2 to e55ac11 Compare March 13, 2026 07:54
@sbackend123 sbackend123 force-pushed the feat/rpc-calls-optimisation branch from e55ac11 to 04f6520 Compare March 13, 2026 08:00
@sbackend123 sbackend123 changed the title Feat: RPC calls optimisation feat: optimize rpc calls Mar 13, 2026
@sbackend123 sbackend123 marked this pull request as ready for review March 16, 2026 12:31
blocktime,
true,
c.config.GetUint64(optionNameMinimumGasTipCap),
blocktime-2,
Member

Just to note that the current blocktime on our CI is 1s, so this way caching is not used there... How could we achieve using it there?

Contributor Author

In deploy.go, blockTime is hardcoded. But if we are talking about CI, the cacheTTL is normally about 85% of the block time, so even if it is 1 second, we will get 850 ms.

Member

This is very strange, as block time is defined as const blocktime = 15 on line 16, which is actually 15 nanoseconds. As that is passed as pollingInterval, it looks to me like a very short time; I suppose the block time should be in minutes, not nanoseconds. Luckily, this is only for the deploy command; Bee has its own block time passed from options. I believe that line 16 in this file should look like const blocktime = 15 * time.Minute, which is not related to this PR.

Also, what happens if the blocktime value is 1 (if someone changes the constant) and we get a negative duration? And if the constant is changed to a much longer value like 15*time.Minute, the literal 2 here would subtract only two nanoseconds from that large value.

t.Helper()

var expSegments [][]byte
expSegments := make([][]byte, 0, len(exp))
Member

These are fine but feel like they belong in a separate PR.

Contributor Author

Yes, but the linter was failing, so I decided to fix everything. Next time I will create a separate PR.

pollingInterval time.Duration,
chainEnabled bool,
minimumGasTipCap uint64,
blockCacheTTl time.Duration,
Member

Typo blockCacheTTl -> blockCacheTTL

}
}

func NewMetrics() Metrics {
Member

Seems not used (dead code).

c.metrics.LoadErrors.Inc()
return val, err
}
c.Set(val, time.Now())
Member

Should we use the now argument here?

Member

What could happen if the loader takes a while?

Contributor Author

I do not think that is a problem, because there is a difference between the time when we request the value and the time when we set the new value if we had to load it.

@gacevicljubisa
Member

Suggestion: Instead of a fixed TTL (`blocktime*85/100`, `blocktime-2`) + BlockNumber RPC, use a single `HeaderByNumber(nil)` call periodically to get both the block number and timestamp, then extrapolate between syncs with zero RPC calls.

How it works:

  • Every ~20 blocks, call `HeaderByNumber(nil)` (latest) — returns block number + timestamp in one RPC call. Store as anchor point.
  • Between syncs, estimate the current block from pure math: `currentBlock = anchorBlock + (now - anchorTimestamp) / blockTime`. No RPC calls at all.
  • Cache TTL: `anchorTimestamp + (currentBlock - anchorBlock + 1) * blockTime - now`.

@@ -0,0 +1,107 @@
// Copyright 2025 The Swarm Authors. All rights reserved.
Member

2026

@@ -0,0 +1,189 @@
// Copyright 2025 The Swarm Authors. All rights reserved.
Member

2026

@@ -0,0 +1,10 @@
// Copyright 2025 The Swarm Authors. All rights reserved.
Member

2026

@@ -0,0 +1,84 @@
// Copyright 2025 The Swarm Authors. All rights reserved.
Member

2026

expiresAt time.Time

group singleflight.Group
key Key
Member

The key here is only used in metrics and as the key in c.group.Do, but that key is always the same, since the singleflight group and the key are the same for the same instance of the ExpiringSingleFlightCache type. I would suggest removing the Key type and just providing the metrics prefix string in the constructor NewExpiringSingleFlightCache.

}(i)
}

time.Sleep(50 * time.Millisecond)
Member

Could synctest be used here to avoid sleeping and a potentially flaky test if synchronization does not happen within 50ms?


c.metrics.Misses.Inc()

result, err, shared := c.group.Do(string(c.key), func() (any, error) {
Member

This group always calls the same key, so it can even be a static string.

b.metrics.TotalRPCCalls.Inc()
b.metrics.BlockNumberCalls.Inc()

blockNumber, err := b.backend.BlockNumber(ctx)
Member

The problem with golang.org/x/sync/singleflight is that it is not context.Context aware. The context passed by the first caller influences the end result for all other callers of this function. Meaning, if the first caller cancels or times out on the context, all other callers will receive that error even if they did not cancel and their own timeouts did not fire. This is why I created resenje.org/singleflight, which is context aware and will not terminate the execution of the function until all callers terminate their contexts. There are a few places where resenje.org/singleflight is used in bee; if you like, you can look at them and consider using it if you see it as important for this case. Basically, GetOrLoad would need to accept a context, pass it to singleflight.Do, and the context from the callback would be used in b.backend.BlockNumber.



c.value = value
c.valid = true
c.expiresAt = now.Add(c.ttl)
Member

Ideally, for block number caching, the expiration time should be the time when the next block number is expected, which is around the block time interval. This is to minimize the time during which the cache reports an older block number compared to the current one.

In this generalized caching implementation, it is hard to generalize such a requirement, as it is very specific to how the block number is increased (at some roughly fixed, configurable period). So the ideal situation would be to calculate the expected time of the next block and set expiresAt to that time, or a bit later, and to watch whether the block number has actually increased.

In the current implementation, the worst case is that the cache keeps a stale block number for up to the block time (15 minutes, for example, on some networks) after the actual block number has increased, if the value is set in the cache a moment before the new block. On average, statistically over a very large sample, that time would be half of the block time (7.5 minutes).

I am not sure if such precision is needed, and whether a delay of up to a whole block time for the new block number is acceptable. This is just my observation.

@gacevicljubisa
Member

gacevicljubisa commented Mar 24, 2026

I don't see how this cache meaningfully reduces RPC calls. The BlockNumber callers (postage listener, storage incentives, tx confirmation, API) run on independent staggered timers and rarely overlap within the same block window. With the TTL at 85% of block time, each caller almost always finds the cache expired by its next poll, resulting in a fresh RPC call anyway. The singleflight deduplication only helps when callers hit BlockNumber at the exact same instant, which is uncommon given the staggered schedules. On Gnosis mainnet the block time is only 5s, so the cache TTL would be ~4.25s — hardly enough to span multiple independent caller cycles.

If we do want to reduce BlockNumber RPC calls, a more effective approach would be to use a single HeaderByNumber(nil) call periodically (e.g. every ~20 blocks, or more) to get both the block number and timestamp, then extrapolate between syncs: currentBlock = anchorBlock + (now - anchorTimestamp) / blockTime (mentioned here).

The filterPendingTransactions refactor (eliminating redundant TransactionByHash calls in nextNonce) is a genuine improvement and worth keeping.

@janos wdyt?

@janos
Member

janos commented Mar 24, 2026

> I don't see how this cache meaningfully reduces RPC calls. The BlockNumber callers (postage listener, storage incentives, tx confirmation, API) run on independent staggered timers and rarely overlap within the same block window. With the TTL at 85% of block time, each caller almost always finds the cache expired by its next poll, resulting in a fresh RPC call anyway. The singleflight deduplication only helps when callers hit BlockNumber at the exact same instant, which is uncommon given the staggered schedules. On Gnosis mainnet the block time is only 5s, so the cache TTL would be ~4.25s — hardly enough to span multiple independent caller cycles.
>
> If we do want to reduce BlockNumber RPC calls, a more effective approach would be to use a single HeaderByNumber(nil) call periodically (e.g. every ~20 blocks, or more) to get both the block number and timestamp, then extrapolate between syncs: currentBlock = anchorBlock + (now - anchorTimestamp) / blockTime (mentioned here).
>
> The filterPendingTransactions refactor (eliminating redundant TransactionByHash calls in nextNonce) is a genuine improvement and worth keeping.
>
> @janos wdyt?

Yes, given that block numbers change very frequently, every 5 to 12s depending on the network (gnosis and sepolia), but not with exactly the same period every time. It is a very good suggestion to calculate the block number as you described, to reduce even further the frequency of the RPC call that gets the block number, and to estimate the block time via HeaderByNumber. In that case even singleflight is not needed, as internally block numbers will always be returned by their specific cache.

I would even go further and use the block time value calculated from HeaderByNumber instead of specifying it statically via options.

The consequence could be that it is required to get the block number at node startup, as both the block number and the block time are needed as known values. Maybe that is even good to do, and to exit the application if there are problems with the RPC endpoint when getting the block number on start. Just thinking.

@sbackend123
Contributor Author

sbackend123 commented Mar 25, 2026

> Suggestion: Instead of a fixed TTL (`blocktime*85/100`, `blocktime-2`) + BlockNumber RPC, use a single `HeaderByNumber(nil)` call periodically to get both the block number and timestamp, then extrapolate between syncs with zero RPC calls.
>
> How it works:
>
> * Every ~20 blocks, call `HeaderByNumber(nil)` (latest) — returns block number + timestamp in one RPC call. Store as anchor point.
> * Between syncs, estimate the current block from pure math: `currentBlock = anchorBlock + (now - anchorTimestamp) / blockTime`. No RPC calls at all.
> * Cache TTL: `anchorTimestamp + (currentBlock - anchorBlock + 1) * blockTime - now`.

For me this sounds a bit like overengineering on one side, and a less flexible implementation on the other (maybe I am missing something?). I also thought: what if we would like to cache another type of value?

> I don't see how this cache meaningfully reduces RPC calls. The BlockNumber callers (postage listener, storage incentives, tx confirmation, API) run on independent staggered timers and rarely overlap within the same block window. With the TTL at 85% of block time, each caller almost always finds the cache expired by its next poll, resulting in a fresh RPC call anyway. The singleflight deduplication only helps when callers hit BlockNumber at the exact same instant, which is uncommon given the staggered schedules. On Gnosis mainnet the block time is only 5s, so the cache TTL would be ~4.25s — hardly enough to span multiple independent caller cycles.
>
> If we do want to reduce BlockNumber RPC calls, a more effective approach would be to use a single HeaderByNumber(nil) call periodically (e.g. every ~20 blocks, or more) to get both the block number and timestamp, then extrapolate between syncs: currentBlock = anchorBlock + (now - anchorTimestamp) / blockTime (mentioned here).
>
> The filterPendingTransactions refactor (eliminating redundant TransactionByHash calls in nextNonce) is a genuine improvement and worth keeping.
>
> @janos wdyt?

> Yes, given that block numbers change very frequently, every 5 to 12s depending on the network (gnosis and sepolia), but not with exactly the same period every time. It is a very good suggestion to calculate the block number as you described, to reduce even further the frequency of the RPC call that gets the block number, and to estimate the block time via HeaderByNumber. In that case even singleflight is not needed, as internally block numbers will always be returned by their specific cache.
>
> I would even go further and use the block time value calculated from HeaderByNumber instead of specifying it statically via options.
>
> The consequence could be that it is required to get the block number at node startup, as both the block number and the block time are needed as known values. Maybe that is even good to do, and to exit the application if there are problems with the RPC endpoint when getting the block number on start. Just thinking.

I have a couple of concerns:

  1. Drift over time: block production is not perfectly uniform, so we can accumulate error between syncs, especially if the resync interval is relatively large.

  2. Risk of overshooting: extrapolation may produce a block number that has not actually been produced yet.

  3. With this approach we solve the problem for the block_number call only. If in the future we need to reduce other RPC calls, we might need to add something else.

I think there can be other corner cases that would be more difficult to catch, since debugging will become more complex.
