Json struct alignment by petrelharp · Pull Request #3437 · tskit-dev/tskit

petrelharp · 2026-03-13T16:25:06Z

Edit: This is now ready. It

closes json+struct codec should provide a memory alignment guarantee for the binary component #3423, by (a) adding the padding bytes to the C function (tsk_json_struct_metadata_get_blob)
also closes json+struct codec should check for a length problem with != not < #3424, here

This is pretty straightforward. However, I got confused along the way, basically about "why aren't the C and python code being tested against each other"; and by the end I decided that it was all okay. Keep reading if you want the explanation.

Examining #3306 I see that the python code does not use the C function (tsk_json_struct_metadata_get_blob) at all, so we've got no good current way to test these against each other. Options that I can think of are:

remove the C code entirely
write an entirely new class of tests that reads in python-produced tree sequence files and then parses them in C and compares to what is expected
write a python method that uses the C function
not test that the C getter and the python getter do the same thing

However, (2) and (3) are ugly. (2) would break the nice clean one-way C->python arrangment we have. And (3) would leave us with two ways of doing the same thing. This leaves us with (1) or (4).

The motivation for putting the C code in here was that it's very annoying to write and it would help client code a lot to have it available. However, the setter is arguably more useful than the getter (since one does not actually decode the metadata in C), so for (4) it seems we'd still want to additionally write the setter. So, I'm leaning towards (1), out of laziness. We could perhaps document somewhere the example setter/getter code (copied from SLiM and here respectively), to make this easy for other client code, which was the goal.

Edit: see below for the resolution.

codecov · 2026-03-13T16:28:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.92%. Comparing base (17bacda) to head (728b7e6).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3437   +/-   ##
=======================================
  Coverage   91.92%   91.92%           
=======================================
  Files          37       37           
  Lines       32153    32169   +16     
  Branches     5143     5145    +2     
=======================================
+ Hits        29556    29572   +16     
  Misses       2264     2264           
  Partials      333      333

Flag	Coverage Δ
C	`82.72% <100.00%> (+0.01%)`	⬆️
c-python	`77.30% <0.00%> (-0.05%)`	⬇️
python-tests	`96.40% <100.00%> (+<0.01%)`	⬆️
python-tests-no-jit	`33.21% <0.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
Python API	`98.69% <100.00%> (+<0.01%)`	⬆️
Python C interface	`91.23% <ø> (ø)`
C library	`88.87% <100.00%> (+<0.01%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

petrelharp · 2026-03-13T16:41:13Z

Update: @bhaller tells me that

The problem with having a setter function is that it would require the client to prepare the JSON is a buffer, prepare the binary in a buffer, and then pass pointers to those buffers in to the setter function, which would set up the final aggregate buffer and copy the JSON and binary in. That is a bad design for performance reasons – it copies all the metadata (which might be quite large), wasting time but even more importantly causing the high-water memory usage for the process to double. I suppose tskit could provide a setter anyway, for simplicity for clients that don’t care about those performance issues; but SLiM wouldn’t use it, probably. If having the setter would make your life easier for the testing, I’d have no objection to adding it as long as there is no assertion that clients have to use the setter; the format used by the codec should remain public, and clients should remain free to write their data themselves.

petrelharp · 2026-03-13T18:32:33Z

After some more discussion with @bhaller:

Why do we want support for JSON+struct? We want to have support for JSON+struct metadata because it is extremely useful. The reason it is extremely useful is that sometimes client code will want to store large amounts of information that doesn't fit in the standard table format for some reason, and this is a very clever (thanks @benjeffery) way to do that, that is way more seamless than other alternatives.

Why have support for it at all in C? Other metadata parsing stuff is not in C at all, for good reason (we don't want to write or import a JSON parser or a struct parser, etcetera). However, the json and struct codecs are pretty straightforward for client C code to do themselves; this codec is annoying enough to set up (special header with magic bytes, etc) that it does make sense to provide a bit of help.

Well then, why don't we have a setter in C, then? I think that #3306 intended to provide a setter, but this fell through the cracks. So, I think it would be a good thing to have for the same reason as in the previous paragraph. However, I can't currently justify the time to actually impmlement and test it. (If we did provide a setter, then (with the note by @bhaller about memory requirements above in mind) we might want to have an API return a pointer to the binary metadata portion, so that the client code could fill it in, rather than passing in the pre-assembled binary portion, thus requiring an extra copy.)

So, I'm leaning towards option (4), "leave things the way they are". This is not perfect because (a) no testing that the C and python do the same thing, and (b) no setter in C; however, what we have is useful and well-tested, and furthermore, if the C and python were in disagreement we'd find out in obvious ways immediately.

bhaller · 2026-03-13T18:42:57Z

Sounds good, with the caveat that "leave things the way they are" still needs the alignment padding bytes to be added in the python code as we talked about, yes?

remove un-hittable error

petrelharp · 2026-03-16T04:06:31Z

Okay - this is ready to go. Could @jeromekelleher or @benjeffery take a (hopefully quick) look at this?

petrelharp force-pushed the json-struct-alignment branch 2 times, most recently from 219c1ba to c88027e Compare March 13, 2026 16:28

petrelharp added 2 commits March 15, 2026 09:34

memory alignment in tsk_json_struct_metadata_get_blob

3930cec

remove un-hittable error

python padding

eb137b8

petrelharp force-pushed the json-struct-alignment branch from 8912a96 to eb137b8 Compare March 15, 2026 16:34

python docs

728b7e6

petrelharp requested a review from benjeffery March 16, 2026 04:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Json struct alignment#3437

Json struct alignment#3437
petrelharp wants to merge 3 commits intotskit-dev:mainfrom
petrelharp:json-struct-alignment

petrelharp commented Mar 13, 2026 •

edited

Loading

Uh oh!

codecov bot commented Mar 13, 2026 •

edited

Loading

Uh oh!

petrelharp commented Mar 13, 2026

Uh oh!

petrelharp commented Mar 13, 2026

Uh oh!

bhaller commented Mar 13, 2026

Uh oh!

petrelharp commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

petrelharp commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

petrelharp commented Mar 13, 2026

Uh oh!

petrelharp commented Mar 13, 2026

Uh oh!

bhaller commented Mar 13, 2026

Uh oh!

petrelharp commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

petrelharp commented Mar 13, 2026 •

edited

Loading

codecov bot commented Mar 13, 2026 •

edited

Loading