Streaming sync serialization #287

bjester wants to merge 10 commits into learningequality:release-v0.9.x
Conversation
rtibbles left a comment
Implementation makes sense to me, and I can follow the mapping from existing operation code to the new stream architecture. The minimal changes to the existing operations tests give confidence against regressions.
The only thing I got hung up on was the names of the abstract base classes!
```python
@abc.abstractmethod
def __call__(self, items: Iterable[Any]) -> Iterator:
    """Process the incoming iterable and yield output items."""
    raise NotImplementedError
```
Small inconsistency here and below where `raise NotImplementedError` is used? Fairly sure it's not needed in addition to the `abstractmethod` decorator?
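For reference, a minimal sketch of the point being made here, assuming nothing about the PR beyond the quoted `__call__` signature: with `@abc.abstractmethod`, the ABC machinery already refuses to instantiate a subclass that does not override the method, so the body can be just the docstring.

```python
import abc
from typing import Any, Iterable, Iterator


class StreamModule(abc.ABC):
    @abc.abstractmethod
    def __call__(self, items: Iterable[Any]) -> Iterator:
        """Process the incoming iterable and yield output items."""
        # No raise NotImplementedError needed: abc already prevents
        # instantiating any subclass that fails to override __call__.


class Doubler(StreamModule):
    def __call__(self, items: Iterable[Any]) -> Iterator:
        return (item * 2 for item in items)


# StreamModule() raises TypeError; the concrete subclass works fine.
assert list(Doubler()([1, 2, 3])) == [2, 4, 6]
```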
```rst
.. code-block:: python

    source.pipe(transform1).pipe(transform2).end(sink)
```
Not at all necessary, but the thought of being able to construct the pipeline with pipe operators amused me!

```python
source | transform1 | transform2 | sink
```
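Purely for fun, and not suggesting it for this PR: a self-contained sketch of how that `|` syntax could be supported by overloading `__or__` on a small wrapper class. All names here (`Stage`, the lambdas) are illustrative and are not part of the PR's API.

```python
from typing import Any, Callable, Iterable, Iterator


class Stage:
    """Illustrative wrapper that lets stream callables be chained with `|`."""

    def __init__(self, func: Callable[[Iterable[Any]], Iterator]):
        self.func = func

    def __or__(self, other: "Stage") -> "Stage":
        # source | transform composes left-to-right
        return Stage(lambda items: other.func(self.func(items)))

    def __call__(self, items: Iterable[Any]) -> Iterator:
        return self.func(items)


source = Stage(iter)
double = Stage(lambda items: (i * 2 for i in items))
stringify = Stage(lambda items: (str(i) for i in items))

pipeline = source | double | stringify
assert list(pipeline([1, 2, 3])) == ["2", "4", "6"]
```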
```python
    pass


class ReaderModule(abc.ABC):
```
Is there a reason this isn't a subclass of `StreamModule`?
```python
    pass


class PipelineModule(StreamModule):
```
Maybe `TransformModule`? Or, seeing that you use that more specifically below, `OperatorModule`?
```python
    pass


class Pipeline(ReaderModule):
```
I think the use of `Pipeline` here alongside `PipelineModule` above makes it feel more confusing to me.
```python
stores_to_update.append(created_store)

if stores_to_update:
    # TODO: bulk_update performs poorly-- is there a better way?
```
This library claims an 8x speed-up over `bulk_update`, but it also doesn't seem to be hugely well maintained, so it might be useful to look at for inspiration rather than usage! https://github.com/netzkolchose/django-fast-update
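For comparison, the cheapest mitigation without adding a dependency is chunking Django's built-in `bulk_update` via its `batch_size` argument. The sketch below is illustrative only; `Store` and the field names in the usage comment are assumptions, not necessarily the exact model and fields this PR updates.

```python
from itertools import islice

BATCH_SIZE = 750


def bulk_update_in_batches(model, instances, fields, batch_size=BATCH_SIZE):
    """Update instances in fixed-size chunks to keep each UPDATE statement small."""
    iterator = iter(instances)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        model.objects.bulk_update(batch, fields, batch_size=batch_size)


# Illustrative usage (model and field names assumed):
# bulk_update_in_batches(Store, stores_to_update, ["serialized", "deleted"])
```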
| "djangorestframework>3.10", | ||
| "django-ipware==4.0.2", | ||
| "requests", | ||
| "typing-extensions==4.1.1", |
I assume this was purposeful, but flagging that this is precisely the same version of `typing-extensions` that Kolibri bundles (although it's still not quite clear to me what requires it, as it's not a direct dependency).
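For context, this is the kind of backported feature `typing-extensions` provides on older Python versions; which specific features the PR imports is not visible in the hunk above, so the `Protocol` usage below is only an assumed illustration.

```python
from typing import Any, Iterable, Iterator

try:
    # Available in the stdlib from Python 3.8 onward
    from typing import Protocol
except ImportError:
    # Backport for older interpreters
    from typing_extensions import Protocol


class SupportsStreaming(Protocol):
    """Structural type for anything that can act as a stream step."""

    def __call__(self, items: Iterable[Any]) -> Iterator:
        ...
```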
Summary
- Explored `streamz`, which I unfortunately opted against because it uses `tornado`
- Refactored `_serialize_into_store` logic into individual classes built upon foundational stream utilities -- so much better for unit testing! (a rough illustrative sketch follows after this list)
- Added `typing-extensions` for backported future typing features
- Updated `MorangoProfileController` to use a `sync_filter` kwarg instead of `filter` -- it always bothered me that it shadowed the built-in
- Replaced `_serialize_into_store` with the new `serialize_into_store` streaming replacement
- Avoided `bulk_update` as Django was observed to spend excessive time with it
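A rough, self-contained sketch of the shape described above, using the `StreamModule` and `Pipeline` names and the `pipe(...).end(...)` chaining style quoted in the review hunks. The internals here are assumptions for illustration, not the PR's actual implementation (in the PR, for example, `Pipeline` itself derives from `ReaderModule`).

```python
import abc
from typing import Any, Callable, Iterable, Iterator, List


class StreamModule(abc.ABC):
    """One small, independently unit-testable step in the serialization stream."""

    @abc.abstractmethod
    def __call__(self, items: Iterable[Any]) -> Iterator:
        """Process the incoming iterable and yield output items."""


class Pipeline:
    """Chains modules in the source.pipe(t1).pipe(t2).end(sink) style."""

    def __init__(self, source: Iterable[Any]):
        self.source = source
        self.modules: List[StreamModule] = []

    def pipe(self, module: StreamModule) -> "Pipeline":
        self.modules.append(module)
        return self

    def end(self, sink: Callable[[Iterable[Any]], Any]) -> Any:
        items: Iterable[Any] = self.source
        for module in self.modules:
            items = module(items)
        return sink(items)


class OnlyDirty(StreamModule):
    """Example transform: keep only records flagged as dirty."""

    def __call__(self, items: Iterable[Any]) -> Iterator:
        return (item for item in items if item.get("dirty"))


records = [{"id": 1, "dirty": True}, {"id": 2, "dirty": False}]
assert Pipeline(records).pipe(OnlyDirty()).end(list) == [{"id": 1, "dirty": True}]
```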
Improvements

The changes were evaluated by installing the local version into Kolibri. A dedicated command was created within Kolibri to run solely the serialization step, and then the performance of that command was benchmarked.
Further investigation will be required to determine how to reduce the increased duration.
Case 1: existing large dataset
Kolibri was launched with a pre-existing database containing data for about 18,000 users.
Case 2: artificial 500 users
Kolibri's `generateuserdata` command was used to generate data for 500 users, which is the maximum the command currently supports.

Case 3: large dataset reduced -- 1000 users
Since the `generateuserdata` command currently can only generate up to 500 users, the existing large dataset was trimmed down to 1000 users. After manually deleting the other users, `kolibri manage` was executed (a no-op) to trigger Kolibri's FK integrity check, which deletes the broken records. Note that this probably takes longer due to the deletions, which provides additional insight into the process, even though the deletion processing has not really changed.

Case 4: large dataset reduced -- 5000 users
Again, the existing large dataset was trimmed down, this time to 5000 users. The same situation applies with regard to deletion behavior as in Case 3.
How AI was used
TODO
Reviewer guidance
Issues addressed
Closes #192