Summary
When asking for vault assets or storage maps in order to sync the client, the node doesn't deduplicate entries by default, so many repeated values are sent and always discarded by the client. Adding dedup logic reduces the bandwidth of these requests; the downside is just the added logic. The performance difference in the node is negligible, and very likely positive: we serialize fewer entries if we dedup, and the deduplication itself is almost free computationally.
The purpose of this issue is to determine whether it's worth making changes to the `SyncAccountVault` and `SyncAccountStorageMaps` endpoints so that they deduplicate what they return, or whether we should keep the logic as is and document the reason in the code, making it clear that these endpoints are not meant for getting deltas: returning delta-like data is a side effect, not their purpose.
Full Description
Currently the node doesn't deduplicate vault assets. We want to add deduplication, but the `is_latest` field is not reliable when the requested block is not at the chain tip, so we discarded that approach. We can still deduplicate another way, but regardless of the approach there are problems related to pagination:
If the response is too big, we may still return multiple values for the same key, because our pagination logic is stateless and depends only on the block number. It works like this: when the response is too big, we send results up to some block `X` (between `from` and `to`), and the user has to request again from block `X + 1` up to `to`, iterating until `X = to`. Between requests we could therefore return duplicates, which means the client has to deduplicate on its side anyway. An alternative would be stateful pagination using a cursor, but that may be unwanted, so we may have to accept that client-side deduplication remains necessary regardless of whether the node also deduplicates.
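To make the pagination caveat concrete, here is a minimal sketch of the client-side merge that remains necessary: because each page is cut by block number only, the same key can appear in more than one page, and the client keeps the value from the highest block. All type and field names here are illustrative, not the node's actual types.

```rust
use std::collections::HashMap;

// Hypothetical shape of one paginated entry: a key, its value, and the
// block at which that value was written.
#[derive(Clone, Debug, PartialEq)]
struct MapEntry {
    key: u64,
    value: u64,
    block_num: u32,
}

/// Merge entries from successive pages, keeping only the latest value per key.
fn merge_pages(pages: &[Vec<MapEntry>]) -> HashMap<u64, MapEntry> {
    let mut latest: HashMap<u64, MapEntry> = HashMap::new();
    for page in pages {
        for entry in page {
            latest
                .entry(entry.key)
                .and_modify(|e| {
                    // A later page may carry a newer write for the same key.
                    if entry.block_num > e.block_num {
                        *e = entry.clone();
                    }
                })
                .or_insert_with(|| entry.clone());
        }
    }
    latest
}

fn main() {
    // Page 1 covers blocks up to X; page 2 covers X + 1..=to and
    // updates key 7 again, producing a cross-page duplicate.
    let pages = vec![
        vec![
            MapEntry { key: 7, value: 100, block_num: 5 },
            MapEntry { key: 9, value: 200, block_num: 6 },
        ],
        vec![MapEntry { key: 7, value: 150, block_num: 12 }],
    ];
    let merged = merge_pages(&pages);
    assert_eq!(merged.len(), 2);
    assert_eq!(merged[&7].value, 150); // latest write wins
    assert_eq!(merged[&9].value, 200);
    println!("merged {} keys", merged.len());
}
```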
Benefits of deduplicating on the node side:
- Response payload is smaller, which makes it lighter on bandwidth.
- CPU processing improves slightly, as we serialize less data.
Cons of deduplicating on the node side:
- Added logic to node code, though it's a few lines.
- Because of the pagination constraints above we would still need dedup on the client side, so it's more code overall.
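For scale, the node-side change being weighed is roughly this: before serializing a response, collapse the collected updates so only the newest entry per key survives. This is a sketch under assumed types, not the endpoint's real implementation.

```rust
use std::collections::HashMap;

// Hypothetical entry as read from the node's store for one sync response.
#[derive(Clone, Debug)]
struct VaultUpdate {
    asset_key: u64,
    amount: u64,
    block_num: u32,
}

/// Node-side dedup: keep only the newest update per asset key within the
/// response's block range, so fewer entries are serialized.
fn dedup_for_response(mut updates: Vec<VaultUpdate>) -> Vec<VaultUpdate> {
    // Sort so the newest update for each key comes last...
    updates.sort_by_key(|u| (u.asset_key, u.block_num));
    let mut latest: HashMap<u64, VaultUpdate> = HashMap::new();
    for u in updates {
        latest.insert(u.asset_key, u); // ...and overwrites earlier ones.
    }
    latest.into_values().collect()
}

fn main() {
    let updates = vec![
        VaultUpdate { asset_key: 1, amount: 10, block_num: 4 },
        VaultUpdate { asset_key: 1, amount: 25, block_num: 9 },
        VaultUpdate { asset_key: 2, amount: 5, block_num: 7 },
    ];
    let deduped = dedup_for_response(updates);
    assert_eq!(deduped.len(), 2);
    let a1 = deduped.iter().find(|u| u.asset_key == 1).unwrap();
    assert_eq!(a1.amount, 25); // only the newest update per key survives
    println!("serialized {} entries", deduped.len());
}
```

As the cons note, this is only a few lines, and it removes duplicates within a single response but not across pages.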
Other Worries
Our current method returns deltas even though that's not its purpose (it's a side effect). The problem is that if the node has pruned history after `block_from`, we still return the "deltas", but starting from the earliest data the node has available. That's fine for syncing, but not if the caller actually wanted the deltas between those specific blocks, which is something we should warn the user about.
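The pruning caveat boils down to the following: when the requested starting block predates the node's earliest retained block, the response silently starts later than asked, so it is a sync snapshot rather than the delta the caller requested. A hedged sketch (`effective_from` and its signature are hypothetical):

```rust
/// Returns the block the response will actually start from, plus a flag
/// indicating that history before the requested block was pruned, in which
/// case the result is NOT a true delta from `requested_from`.
fn effective_from(requested_from: u32, earliest_available: u32) -> (u32, bool) {
    if requested_from < earliest_available {
        // Data before `earliest_available` is gone; the caller should be
        // warned that the response is not the delta they asked for.
        (earliest_available, true)
    } else {
        (requested_from, false)
    }
}

fn main() {
    // Requested from block 3, but everything before block 10 was pruned.
    assert_eq!(effective_from(3, 10), (10, true));
    // Nothing pruned in the requested range: the request is honored as-is.
    assert_eq!(effective_from(15, 10), (15, false));
    println!("ok");
}
```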
The deduplication change would get rid of this confusion. But the simplest thing we could do now is to clarify in a comment that the current implementation is not meant for getting deltas; that behavior is just a side effect of the request.