feat: remove ligrec parallelize#1125
Conversation
Force-pushed 9fe8f25 to 4a60ef3
Codecov Report ❌ Patch coverage is
Additional details and impacted files

```
@@ Coverage Diff @@
##             main    #1125      +/-   ##
==========================================
- Coverage   74.05%   73.90%   -0.16%
==========================================
  Files          39       39
  Lines        6495     6510      +15
  Branches     1122     1122
==========================================
+ Hits         4810     4811       +1
- Misses       1230     1249      +19
+ Partials      455      450       -5
```
Force-pushed d1f752c to 6230aed
for more information, see https://pre-commit.ci
```python
@njit(nogil=True, cache=True)
```
Why not parallel=True + prange? Because this is being run in a thread pool? Why not just make every individual step parallel?
https://numba.pydata.org/numba-doc/dev/user/parallel.html?highlight=njit#explicit-parallel-loops
Would this require rewriting into a reduction of some sort to prevent overlapping writes?
I see in the benchmarks that the speedups with more jobs is not really scaling linearly, which is not what I would expect.
> I see in the benchmarks that the speedups with more jobs is not really scaling linearly, which is not what I would expect.
That's a good point, worth investigating.
> Why not `parallel=True` + `prange`? Because this is being run in a thread pool? Why not just make every individual step parallel?
> https://numba.pydata.org/numba-doc/dev/user/parallel.html?highlight=njit#explicit-parallel-loops
> Would this require rewriting into a reduction of some sort to prevent overlapping writes?
For keeping the progress bar. It also fits the current code more naturally. I agree that it's better and faster, as in the sepal PR, but this pattern helped me keep the progress bar and avoid duplicate code in a straightforward way. For example, I had to create a function that densifies per thread, and pushing this step to Python reduced the duplicate code.
Two options at a high level:
- Refactor for the `numba` itself to be parallel and work on "batches"
- Does https://github.com/mortacious/numba-progress work?
```python
def _worker(t: int) -> NDArrayA:
    local_counts = np.zeros((n_inter, n_cpairs), dtype=np.int64)
    rs = np.random.RandomState(None if seed is None else t + seed)
    perm = clustering.copy()
    for _ in range(chunk_sizes[t]):
        rs.shuffle(perm)
        _score_permutation(
            data_arr,
            perm,
            inv_counts,
            mean_obs,
            interactions,
            interaction_clusters,
            valid,
            local_counts,
        )
        pbar.update(1)
    return local_counts
```
Why can't this also be numba-ified with an outer loop of some sort? Why do we still need a thread pool? I thought "one giant kernel" was the goal.
Is shuffling not parallelizable? Certainly there are ways around this, like argsort + random indices or something? Other than that, I don't really see why the `range(chunk_sizes[t])` couldn't be parallelized. Is it the validity of `local_counts`? Seems like there should be ways around this.
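The "argsort + random indices" idea could look like this pure-numpy sketch (names are illustrative): drawing an independent row of random sort keys per permutation makes each shuffle independent of the others, so the permutations can be generated, and later consumed, in parallel.

```python
import numpy as np

rs = np.random.RandomState(0)
clustering = np.array([10, 20, 30, 40])

# one row of random sort keys per permutation
keys = rs.random_sample((3, clustering.size))
# argsort over each row yields 3 independent permutations of `clustering`
perms = clustering[np.argsort(keys, axis=1)]
assert all(sorted(row.tolist()) == [10, 20, 30, 40] for row in perms)
```

Note this produces a different permutation stream than sequential `rs.shuffle`, which is exactly the reproducibility concern discussed below.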
To have a responsive progress bar and to have the same shuffling results as the old version.
Could you explain a bit more:
- Why is the "same results" thing a hard blocker? `clustering` seems small, so copy+shuffle should be cheap as a pre-processing step, i.e., do all the "shuffle" stuff ahead of time / outside `numba`.
- Would you expect a giant kernel to be faster? My gut is "yes" given Severin's experience/our experience with `co_occurrence`, but I'm all ears.
> `clustering` seems small so copy+shuffle should be cheap as a pre-processing step

`perm = clustering.copy()` copies per thread. To allocate ahead of time you'd need `n_perms * len(clustering)`, so it can blow up if the user gives a high `n_perms`. You could also shuffle in place per thread, but that changes the behaviour and the results.
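To put that blow-up in numbers (hypothetical sizes, not from the PR): pre-allocating every permutation ahead of time costs `n_perms * len(clustering)` elements, e.g.:

```python
import numpy as np

n_perms = 1000          # user-controlled, can be large
n_cells = 1_000_000     # len(clustering)
itemsize = np.dtype(np.int32).itemsize  # cluster labels stored as int32

bytes_needed = n_perms * n_cells * itemsize
print(bytes_needed / 1e9)  # 4.0 GB for these sizes
```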
> Would you expect a giant kernel to be faster? My gut is "yes" given Severin's experience/our experience with `co_occurrence`, but I'm all ears
For sure.
@timtreis How many permutations are realistic? I would think the memory hit can't be that high here.
Separate from our "internal knowledge", I have to wonder if this is worth trying out and seeing where it breaks/degrades. If there's a way to do this that doesn't involve changing results but gets more speed at the risk of higher memory usage, I think we should understand that tradeoff.
This might be a nice way to retain the progress bar without depending on the linked package: instead of explicit threads each working on a "chunk", we would redo the algorithm so that we sequentially proceed over the "chunks" but process each "chunk" at once using parallelized numba instead of `for _ in range(chunk_sizes[t])`. This could help alleviate the memory concerns by allocating only `clustering * chunk_sizes[t]` amount of data instead of the full `n_perms * len(clustering)`.
Do I have this right?
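A sketch of that chunked scheme (all names and the stand-in kernel are illustrative, not the PR's code): only one chunk of pre-shuffled permutations is alive at a time, so peak memory is `chunk_size * len(clustering)` rather than `n_perms * len(clustering)`.

```python
import numpy as np

def run_chunked(clustering, n_perms, chunk_size, seed=0):
    rs = np.random.RandomState(seed)
    perm = clustering.copy()
    total = 0.0
    for start in range(0, n_perms, chunk_size):
        k = min(chunk_size, n_perms - start)
        # allocate only k permutations, not all n_perms at once
        perms = np.empty((k, clustering.size), dtype=clustering.dtype)
        for i in range(k):
            rs.shuffle(perm)  # same sequential RandomState stream as before
            perms[i] = perm
        # a parallel numba kernel would consume the whole chunk here;
        # a plain sum stands in for it
        total += perms.sum()
        # pbar.update(k) would keep the progress bar responsive per chunk
    return total
```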
I will have a look at these questions by doing some runs. I didn't fully understand some parts but will come back to you. But to be clear: would not reproducing the old results be a dealbreaker for you? I didn't understand your stance on this.
> But to be clear: would not reproducing old results be a dealbreaker for you? I didn't understand your stance on this.
That's not really my call here. That being said, what I was trying to propose was a way of using numba's `parallel=True` mechanism without breaking backwards-compatible reproducibility. It seems like all the reproducibility concerns boil down to the shuffling, so I'm proposing basically front-loading the shuffling so that the kernel can run in parallel on multiple pre-shuffled permutations.
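A small demonstration of that proposal (illustrative names): because the shuffles still happen sequentially off the same legacy `RandomState`, front-loading them yields exactly the permutations the old per-iteration loop would have produced, and the kernel can then consume the rows in parallel.

```python
import numpy as np

clustering = np.arange(8)

def sequential_perms(seed, n):
    # old pattern: shuffle immediately before each kernel call
    rs = np.random.RandomState(seed)
    perm = clustering.copy()
    out = []
    for _ in range(n):
        rs.shuffle(perm)
        out.append(perm.copy())
    return np.stack(out)

def frontloaded_perms(seed, n):
    # proposed pattern: do all shuffles up front, then run the kernel
    # in parallel over the rows
    rs = np.random.RandomState(seed)
    perm = clustering.copy()
    perms = np.empty((n, clustering.size), dtype=clustering.dtype)
    for i in range(n):
        rs.shuffle(perm)
        perms[i] = perm
    return perms

# identical permutation streams => identical results downstream
assert (sequential_perms(42, 5) == frontloaded_perms(42, 5)).all()
```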
```python
def _worker(t: int) -> NDArrayA:
    local_counts = np.zeros((n_inter, n_cpairs), dtype=np.int64)
    rs = np.random.RandomState(None if seed is None else t + seed)
```
Why do we need to use the old RandomState? Can we get the same result by using Generator?
> Can we get the same result by using Generator?

I am not sure, but isn't this what Phil tried to do in his PR, by using the legacy `RandomState` internally by default when no rng was given? I think he also said the reproducibility guarantees for `Generator` are different from `RandomState`'s (I verified it and it's correct): `Generator` doesn't guarantee reproducibility across different versions of numpy.
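For the record, this matches numpy's documented compatibility policy: the legacy `RandomState` stream is frozen across numpy versions, while `Generator` streams are only guaranteed stable within a fixed numpy version. A quick illustration (the `RandomState(0)` values below are the well-known fixed stream):

```python
import numpy as np

# legacy RandomState: this exact sequence is guaranteed across numpy versions
rs = np.random.RandomState(0)
assert rs.randint(0, 10, 5).tolist() == [5, 0, 3, 3, 7]

# Generator: reproducible for a fixed numpy version, but the stream
# is allowed to change between numpy versions
gen = np.random.default_rng(0)
vals = gen.integers(0, 10, 5)  # no cross-version guarantee
```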
benchmark code
main results
pr results
Results compared:
Both faster and cleaner code; this removes `parallelize`.
Update: the reason main is faster when `n_jobs=1` is that main also sets `numba_parallel=True`, so it's still numba-parallel even though it's one process.