Verify indirect deps reachable on incremental run#20735
Verify indirect deps reachable on incremental run#20735ilevkivskyi wants to merge 2 commits intopython:masterfrom
Conversation
This comment has been minimized.
This comment has been minimized.
|
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅ |
|
Ping on this one. |
JukkaL
left a comment
There was a problem hiding this comment.
This looks good, must I'm a little worried by performance impact in pathological cases with bad dependency graphs. Maybe it's not going to be a problem, I'd like to at least think a bit about potentially more efficient approaches.
|
|
||
| def verify_transitive(self, from_scc_id: int, to_scc_id: int) -> bool: | ||
| """Verify that one SCC is a (transitive) dependency of another.""" | ||
| if (from_scc_id, to_scc_id) in self.verify_transitive_cache: |
There was a problem hiding this comment.
Minor: use := with .get to avoid constructing tuple twice and performing lookup twice?
| ) | ||
|
|
||
| def verify_transitive(self, from_scc_id: int, to_scc_id: int) -> bool: | ||
| """Verify that one SCC is a (transitive) dependency of another.""" |
There was a problem hiding this comment.
Maybe a name like is_transitive_dep would be clearer, or is_transitive_scc_dep?
| results, | ||
| ) | ||
|
|
||
| def verify_transitive(self, from_scc_id: int, to_scc_id: int) -> bool: |
There was a problem hiding this comment.
Will be the cache be O(n**2) in size in the worst case? I wonder if there is a more efficient algorithm if that's the case.
There was a problem hiding this comment.
Yes. It will be quadratic in number of SCCs in worst case. I optimized for time vs space. It is a tough tradeoff. I considered few other things:
- Using SCCs processed so far: not correct because we may accidentally process unrelated SCCs, thus causing false negatives.
- Include indirect deps when ordering SCCs: not correct for roughly the same reason.
- Do not use caching: horrible time performance
O(V * E), i.e.O(n**3)worst case. - Track imports delta, i.e. only track which dependencies changed since last build: non-trivial and error-prone, there are too many edge cases to consider.
Feel free to think about this, but I would prefer to not postpone this for long, this is a correctness issue that can cause crashes. So far we have been lucky that this didn't come often, bu there is a risk that some popular library will change their import structure hitting this issue.
In the meantime I will explore a bit more, maybe there are some niche algorithms for our specific situation, where in 99.9% there is a path in DAG between two points.
There was a problem hiding this comment.
Actually I have an idea: we can track whether SCCs are "dependency-fresh" (even if not interface-fresh), i.e. as we process them, check if all direct dependencies are the same as from cache. Essentially check if import delta is "strongly" zero, and then only fall back to full check if something changed. This will not help the worst case, but will help a lot with average case, where one simply edits some files without much changes to import structure.
There was a problem hiding this comment.
Correction: not "strongly" zero, but "strongly" positive (some extra new deps will do no harm, we just need to be sure we didn't loose any).
Fixes #19477
This is another surprising correctness hole in the indirect dependencies logic. We implicitly assume everywhere that indirect dependencies are subset of transitive ones. Although it is true when the indirect dependencies are initially calculated, there is no guarantee this will stay true after an incremental update where some import structure has changed.
Currently, we only handle the situations where an indirect dependency is removed from the graph completely, but if it is still in the graph, but now brought in by a different dependent, this may cause out-of-order cache loading, and thus a bug or a crash. It is interesting that although in theory it is kind of a big deal, in practice it is very hard to create such scenario. So far I found only two complicated scenarios (see tests).
Implementation note: although this additional check is needed for rare edge cases, it is quite computationally expensive. I tried few options (including only computing import deltas), but the only viable/robust option seems to be an honest recursive check with some caching. I tested this on
torch(it has ~1000 small SCCs plus single SCC with around 1000 modules), and the performance penalty is negligible there. So hopefully it should be fine even for large code bases.