Documentation for LD matrix methods by apragsdale · Pull Request #3416 · tskit-dev/tskit

apragsdale · 2026-03-06T00:41:26Z

This largely pulls material from an existing, open PR to complete minimal documentation for the LD matrix methods currently available in tskit. I've made edits for clarity from the original PR, and removed some material that is possibly confusing or more information than needed for a user.

codecov · 2026-03-06T01:22:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.92%. Comparing base (1835ea3) to head (fa7b3ab).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3416   +/-   ##
=======================================
  Coverage   91.92%   91.92%           
=======================================
  Files          37       37           
  Lines       32153    32153           
  Branches     5143     5143           
=======================================
  Hits        29556    29556           
  Misses       2264     2264           
  Partials      333      333

| Flag | Coverage Δ |
|---|---|
| C | `82.70% <ø> (ø)` |
| c-python | `77.34% <ø> (ø)` |
| python-tests | `96.40% <ø> (ø)` |
| python-tests-no-jit | `33.22% <ø> (ø)` |

| Components | Coverage Δ |
|---|---|
| Python API | `98.69% <ø> (ø)` |
| Python C interface | `91.23% <ø> (ø)` |
| C library | `88.86% <ø> (ø)` |

jeromekelleher

I've had a quick read through and generally looks great. I've spotted a few typos and left a few take-it-or-leave-it comments.

I haven't thought through the details at all, though, and I think it needs a careful review from @petrelharp for that.

docs/stats.md

python/tskit/trees.py

apragsdale · 2026-03-06T14:08:13Z

Thanks so much for the quick review, @jeromekelleher. I agree with your suggestions and will fix things up accordingly.

lkirk · 2026-03-09T21:20:29Z

I think this will also need a changelog entry, since adding documentation will announce that this API is now public. Does that happen here or in a later part of the release pipeline?

petrelharp · 2026-03-10T05:07:51Z

That could happen here!

jeromekelleher · 2026-03-10T09:16:17Z

We can add to the changelog here or not, whichever is easier. We do need a review from you though @petrelharp as I'm not on top of the stats details.

jeromekelleher · 2026-03-10T09:31:59Z

I've created a milestone for version 1.1, which will basically be this new LD API plus the new metadata codec. Would be good to get it shipped in the next week or so if possible.

petrelharp

See suggested edits (and feel free to say I'm wrong on any).

docs/stats.md

python/tskit/trees.py

petrelharp · 2026-03-14T14:15:48Z

python/tskit/trees.py

+        Similarly, in the branch mode, the ``positions`` argument specifies
+        loci for which the expectation for the two-locus statistic is computed
+        over pairs of trees at those positions. LD stats are computed between
+        trees whose ``[start, end)`` contains the given position (such that
+        repeats of trees are possible). Similar to the site mode, a nested list


Suggested change

-        Similarly, in the branch mode, the ``positions`` argument specifies
-        loci for which the expectation for the two-locus statistic is computed
-        over pairs of trees at those positions. LD stats are computed between
-        trees whose ``[start, end)`` contains the given position (such that
-        repeats of trees are possible). Similar to the site mode, a nested list
+        Similarly, in the branch mode, the ``positions`` argument specifies
+        genomic coordinates at which the expectation for the two-locus statistic
+        is computed, given the local tree structure. This defaults to computing
+        the LD for each pair of distinct trees, which is equivalent to passing in
+        the leftmost coordinates of each tree's span (since intervals are closed on
+        the left and open on the right). Similar to the site mode, a nested list

We'll need to adjust this given the "expectation" discussion above.

python/tskit/trees.py

petrelharp · 2026-03-14T14:26:49Z

This is great, thanks! Remaining things:

- This should explain more precisely how branch mode is computed and what its interpretation is as an expected value.
- How does it work with multiallelic sites? What exactly are the weighting schemes?
- How does polarization work?

Background:

- weighting: see "normalization for two-locus, multiallelic stats?" (#2816)
- I still can't find the original issue where we laid out the conceptual framework for this. Do you know where that is, @apragsdale?

petrelharp · 2026-03-14T15:09:47Z

Okay, just checking the details of branch mode:

- the result is calculated in compute_two_tree_branch_stat without further weighting
- and then the work is done by compute_two_tree_branch_state_update, which weights f(wAb,waB,wAB) by the product of the lengths of the two branches in question

Okay, so: for a given pair of branches (on the same or different trees), the sample counts wAb, waB, wAB are the same for any pair of mutations falling on those branches; so the expected contribution to average LD determined by the summary function f per unit sequence length of this pair of branches is f(w_{Ab}, w_{aB}, w_{AB}, n) multiplied by the lengths of the two branches.

So, for computation, I think we should say very clearly somewhere that branch mode works by summing over all pairs of branches the value of f( ) multiplied by the lengths of the two branches.
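As a rough illustration of that summation (a sketch only, not the tskit implementation): the snippet below represents each branch as a hypothetical `(length, samples_below)` pair, derives the haplotype counts from the two sample sets, and sums `f` times the product of the two branch lengths. The branch layout, the name `branch_pair_sum`, and the choice of D² as the summary function are all assumptions made for the example.

```python
from itertools import product


def d_squared(w_Ab, w_aB, w_AB, n):
    """Example summary function: D^2 from haplotype counts among n samples."""
    p_AB = w_AB / n
    p_A = (w_AB + w_Ab) / n
    p_B = (w_AB + w_aB) / n
    return (p_AB - p_A * p_B) ** 2


def branch_pair_sum(left_branches, right_branches, n, f=d_squared):
    """Sum f(w_Ab, w_aB, w_AB, n) * len_left * len_right over all branch pairs.

    Each branch is a hypothetical (length, samples_below) tuple, where
    samples_below is the set of sample IDs that would carry a mutation
    falling on that branch.
    """
    total = 0.0
    for (len_a, below_a), (len_b, below_b) in product(left_branches, right_branches):
        w_AB = len(below_a & below_b)  # samples below both branches
        w_Ab = len(below_a - below_b)  # samples below the left branch only
        w_aB = len(below_b - below_a)  # samples below the right branch only
        total += f(w_Ab, w_aB, w_AB, n) * len_a * len_b
    return total


# Toy input: one branch per tree, both directly above sample 0, with n = 2.
left = [(2.0, {0})]
right = [(3.0, {0})]
result = branch_pair_sum(left, right, n=2)  # D = 0.25, so 0.0625 * 2 * 3 = 0.375
```

Passing a different `f` swaps in another summary function without changing the weighting by branch lengths.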

petrelharp · 2026-03-14T15:25:20Z

As for interpretation (ie in what sense is it the expected value):

First off, I'm confused about a dumb silly thing. If I simulate a short tree sequence and compute ld_matrix (with r2 and site mode) I get a bunch of nans. Why? (This is confusing because the denominator for r2 is p_A * p_B * (1 - p_A) * (1 - p_B); and msprime only generates mutations with 0 < p < 1.)

The nan thing makes me think I'm missing some basic and important thing. But if my understanding above on how it is computed is correct here are some equivalent options for what branch mode means:

- Put down two Poisson processes of mutations, one on each tree, and compute the stat between each pair of mutations; add those up across pairs of mutations; then branch mode is the expected value of this sum.
- Pick two mutations uniformly, one on each tree; compute the stat; then branch mode is the expected stat multiplied by the total branch lengths of each tree.
- Compute the average stat across all mutations using a large number of infinite-sites mutations at rate mu per bp. The result (for large mu) is mu multiplied by the sum across pairs of trees of the branch stat for those trees multiplied by the spans of the two trees.
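One way to see why the first option matches the branch-length weighting, written as a sketch under the assumption that the two mutation processes are independent Poisson processes with unit rate per unit branch length. The symbols are introduced here for illustration: $N_b$ is the number of mutations on branch $b$, $\ell(b)$ is its length, $S$ is the sum of the stat over pairs of mutations, and the counts $w_{Ab}, w_{aB}, w_{AB}$ depend on the pair of branches $(b_1, b_2)$.

```math
\begin{aligned}
S &= \sum_{b_1 \in T_1} \sum_{b_2 \in T_2} N_{b_1} N_{b_2}\, f(w_{Ab}, w_{aB}, w_{AB}, n), \\
\mathbb{E}[S] &= \sum_{b_1 \in T_1} \sum_{b_2 \in T_2} \mathbb{E}[N_{b_1}]\, \mathbb{E}[N_{b_2}]\, f(w_{Ab}, w_{aB}, w_{AB}, n) \\
&= \sum_{b_1 \in T_1} \sum_{b_2 \in T_2} \ell(b_1)\, \ell(b_2)\, f(w_{Ab}, w_{aB}, w_{AB}, n).
\end{aligned}
```

The second line uses independence of the two processes, and the third uses $\mathbb{E}[N_b] = \ell(b)$ for a unit-rate Poisson process.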

petrelharp · 2026-03-14T15:33:06Z

For the multiallelic question, the computations happen here and here; based on where they're called from I'm assuming that the first function is just a quicker special-purpose version of the second one for both biallelic sites.

(ps I'm assuming you know this all, but I forget everything and it's nice to go look at the details)

So, it is summing over all pairs of alleles but then normalised.

docs/stats.md

petrelharp · 2026-03-14T15:50:39Z

I think we just need a few more paragraphs, documenting exactly what's being calculated, as above. Can you take a stab at this @apragsdale?

apragsdale · 2026-03-14T15:55:45Z

> I think we just need a few more paragraphs, documenting exactly what's being calculated, as above. Can you take a stab at this @apragsdale?

Hi @petrelharp - yes, of course! I'll tackle these today and this weekend.

Thank you for such detailed comments. With some earlier rounds of revision on these docs, I had gone back and forth with including too much vs too little detail, and probably ended up falling on the too-little side of it.

I also agree the nan calculations are confusing, and I remember @lkirk and I having discussions about that last year. I think we had a good explanation/work-around for it. I'll dig up those notes, because it will be important to document as well. I'm also returning to this from many months away from the code, so it will be good to remind myself of all of it as well, and make sure it's documented properly.

petrelharp · 2026-03-15T15:08:10Z

I hope it's easy to resurrect some of those earlier versions, then?

> I also agree the nan calculations are confusing,

I'm not sure if they are? I imagine it's just "0/0 = nan", but then I'm confused why that happens?

apragsdale · 2026-03-15T17:10:43Z

> First off, I'm confused about a dumb silly thing. If I simulate a short tree sequence and compute ld_matrix (with r2 and site mode) I get a bunch of nans. Why? (This is confusing because the denominator for r2 is p_A * p_B * (1 - p_A) * (1 - p_B); and msprime only generates mutations with 0 < p < 1.)

> I'm not sure if they are? I imagine it's just "0/0 = nan", but then I'm confused why that happens?

Peter, does this occur in site mode, as you say above? We shouldn't get nans if msprime is only generating mutations with 0 < p < 1 and we compute r^2 over all samples. However, if we specify a subsample as our sample set, the method still computes LD at all sites, and at some of those sites the mutation may not be present in the subsample, so p could equal 0 or 1. In that case, you'd get nans for any pair of sites that includes one or two such sites.

But I want to make sure that we are thinking about the same scenario that generates a nan.

In [23]: ts = msprime.sim_ancestry(10, population_size=1e4, recombination_rate=1e-8, sequence_length=1e4, random_seed=1)

In [24]: ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

In [28]: ts.ld_matrix()
Out[28]: 
array([[1.        , 0.00277008, 0.00277008, 0.00277008],
       [0.00277008, 1.        , 0.00277008, 0.00277008],
       [0.00277008, 0.00277008, 1.        , 0.00277008],
       [0.00277008, 0.00277008, 0.00277008, 1.        ]])

In [36]: ts.ld_matrix(sample_sets=range(12))
Out[36]: 
array([[1.        ,        nan,        nan, 0.00826446],
       [       nan,        nan,        nan,        nan],
       [       nan,        nan,        nan,        nan],
       [0.00826446,        nan,        nan, 1.        ]])

Edit: In any case, this behavior should be documented.
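To make the subsetting case concrete, here is a small self-contained sketch in plain Python with no tskit calls. The genotype rows and the helpers `derived_freq` and `r2_denominator` are made up for illustration. A site that is polymorphic across all samples can have frequency 0 or 1 within a subset, which makes the r² denominator `p_A * p_B * (1 - p_A) * (1 - p_B)` exactly zero.

```python
def derived_freq(genotypes, samples):
    """Frequency of the derived allele (coded 1) among the given sample indices."""
    return sum(genotypes[s] for s in samples) / len(samples)


def r2_denominator(p_A, p_B):
    """Denominator of r^2 for a pair of biallelic sites."""
    return p_A * p_B * (1 - p_A) * (1 - p_B)


# Hypothetical 0/1 genotypes for two sites across six samples.
site_a = [0, 0, 0, 0, 1, 1]
site_b = [0, 1, 0, 1, 0, 1]

all_samples = range(6)
subset = range(4)  # the derived allele at site_a is absent from this subset

denom_all = r2_denominator(
    derived_freq(site_a, all_samples), derived_freq(site_b, all_samples)
)
denom_subset = r2_denominator(
    derived_freq(site_a, subset), derived_freq(site_b, subset)
)
# denom_all is positive; denom_subset is zero, so D^2 / denom is 0/0 there.
```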

petrelharp · 2026-03-15T22:54:12Z

Here's what I did:

>>> ts = msprime.sim_ancestry(3, sequence_length=10, recombination_rate=0.1, random_seed=123)
>>> mts = msprime.sim_mutations(ts, rate=0.1, random_seed=456)
>>> mts.ld_matrix()
array([[1.   , 0.44 ,   nan, 0.44 , 0.44 , 0.125, 0.44 ],
       [0.44 , 1.   ,   nan, 1.   , 1.   , 0.1  , 1.   ],
       [  nan,   nan,   nan,   nan,   nan,   nan,   nan],
       [0.44 , 1.   ,   nan, 1.   , 1.   , 0.1  , 1.   ],
       [0.44 , 1.   ,   nan, 1.   , 1.   , 0.1  , 1.   ],
       [0.125, 0.1  ,   nan, 0.1  , 0.1  , 1.   , 0.1  ],
       [0.44 , 1.   ,   nan, 1.   , 1.   , 0.1  , 1.   ]])
>>> mts.num_sites
7
>>> mts.genotype_matrix()
array([[1, 0, 0, 0, 2, 0],
       [0, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 1],
       [0, 1, 1, 1, 1, 1],
       [0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0]], dtype=int32)
>>> mts.genotype_matrix().sum(axis=1)
array([3, 5, 0, 5, 5, 2, 1])

Oh! I see: "msprime does not simulate sites that aren't polymorphic", but it does simulate situations where back mutations create monomorphic sites despite there being mutations:

>>> mts.site(2)
Site(id=2, position=3.0, ancestral_state='A', mutations=[
 Mutation(id=3, site=2, node=0, derived_state='T', parent=-1, metadata=b'', time=1.549923340261233, edge=21, inherited_state='A'), 
 Mutation(id=4, site=2, node=0, derived_state='A', parent=3, metadata=b'', time=1.0122678066613313, edge=21, inherited_state='T')], metadata=b'')

Okay, that's all cleared up. I ran into this so quick because I had an unrealistic mutation rate. However, this would be a good thing to mention in the section on multiple alleles.
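The numbers in the session above can be reproduced by hand with a short sketch. It assumes (this is an assumption about the normalisation, not something stated in this thread) that r² for a pair of sites is the sum over allele pairs of the per-pair r², each weighted by the haplotype frequency of that allele pair, and that a zero denominator is reported as nan. The function name `r2_sites` is hypothetical; the genotype rows are copied from the `genotype_matrix()` output above.

```python
import math
from collections import Counter


def r2_sites(g_a, g_b):
    """r^2 between two sites, summed over observed allele pairs (sketch only)."""
    n = len(g_a)
    count_a, count_b = Counter(g_a), Counter(g_b)
    count_ab = Counter(zip(g_a, g_b))
    total = 0.0
    for a in count_a:
        for b in count_b:
            p_a, p_b = count_a[a] / n, count_b[b] / n
            p_ab = count_ab[(a, b)] / n
            denom = p_a * (1 - p_a) * p_b * (1 - p_b)
            if denom == 0:
                return math.nan  # a monomorphic site gives 0/0
            d = p_ab - p_a * p_b
            total += p_ab * d * d / denom  # weight by haplotype frequency
    return total


site0 = [1, 0, 0, 0, 2, 0]  # three alleles
site1 = [0, 1, 1, 1, 1, 1]
site2 = [0, 0, 0, 0, 0, 0]  # monomorphic after the back mutation
site5 = [0, 0, 1, 0, 1, 0]

r2_01 = r2_sites(site0, site1)
r2_05 = r2_sites(site0, site5)
r2_15 = r2_sites(site1, site5)
r2_02 = r2_sites(site0, site2)
```

Under that assumption the sketch gives 0.44, 0.125 and 0.1 for the first three pairs, matching the corresponding entries of the matrix printed above, and nan for any pair involving site 2.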

apragsdale mentioned this pull request Mar 6, 2026

LD matrix documentation #3353

Closed

apragsdale force-pushed the ld_matrix_docs branch from 271bcaa to e2f50d6 Compare March 6, 2026 00:48

jeromekelleher reviewed Mar 6, 2026

jeromekelleher requested a review from petrelharp March 6, 2026 09:41

apragsdale force-pushed the ld_matrix_docs branch from e2f50d6 to 7e2b62a Compare March 6, 2026 14:27

jeromekelleher added this to the 1.1.0 milestone Mar 10, 2026

apragsdale added 5 commits March 10, 2026 11:31

Edit trees.py to add docstring to ld_matrix()

10912c0

Add, edit, streamline two-locus docs

a2c59d5

add section title

51ab754

Address comments from Jerome

fd6e6f7

Add changelog change

bae4065

apragsdale force-pushed the ld_matrix_docs branch from 7e2b62a to bae4065 Compare March 10, 2026 16:42

petrelharp reviewed Mar 14, 2026

apragsdale and others added 3 commits March 15, 2026 11:49

Update docs/stats.md

ac15ff5

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

Update docs/stats.md

95d3b9e

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

Update docs/stats.md

c1e358a

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

apragsdale and others added 3 commits March 15, 2026 11:51

Update docs/stats.md

c86f85e

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

Apply suggestions from code review

7e4152c

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

Apply suggestions from code review

af8a725

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

apragsdale added 4 commits March 15, 2026 12:36

Add description of weighting schemes

0c0fb29

Add a bit more about polarization

ece62a9

try to clarify branch mode

b5efa32

example of subsetting causing nan values

fa7b3ab

4 participants
