Stitched AES-GCM for aarch64 #165

Open
brian-pane wants to merge 1 commit into ctz:main from brian-pane:aarch64-aes-gcm
Conversation

@brian-pane
Contributor

No description provided.

@codspeed-hq

codspeed-hq bot commented Apr 10, 2026

Merging this PR will not alter performance

✅ 155 untouched benchmarks


Comparing brian-pane:aarch64-aes-gcm (29e452c) with main (9685008)

Open in CodSpeed

@codecov

codecov bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.76%. Comparing base (9685008) to head (29e452c).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #165   +/-   ##
=======================================
  Coverage   99.76%   99.76%           
=======================================
  Files         184      184           
  Lines       50832    50918   +86     
=======================================
+ Hits        50714    50800   +86     
  Misses        118      118           

☔ View full report in Codecov by Sentry.

@brian-pane
Contributor Author

I made this as an attempt to speed up AES-GCM for larger input sizes on Arm (context: issue #163).

In my testing with cargo bench on a Mac M4 system, it helps a little bit:

aes128-gcm/aws-lc-rs/32B time:     [53.615 ns 53.731 ns 53.852 ns]
aes128-gcm/graviola/32B (main):    [90.274 ns 90.482 ns 90.703 ns]
aes128-gcm/graviola/32B (this PR): [90.605 ns 90.960 ns 91.409 ns]

aes128-gcm/aws-lc-rs/2KB time:     [242.13 ns 242.96 ns 243.86 ns]
aes128-gcm/graviola/2KB (main):    [299.82 ns 300.59 ns 301.45 ns]
aes128-gcm/graviola/2KB (this PR): [288.97 ns 289.53 ns 290.12 ns]

aes128-gcm/aws-lc-rs/8KB time:     [799.89 ns 803.41 ns 809.65 ns]
aes128-gcm/graviola/8KB (main):    [983.82 ns 990.83 ns 1.0001 µs]
aes128-gcm/graviola/8KB (this PR): [925.98 ns 928.91 ns 933.72 ns]

aes128-gcm/aws-lc-rs/16KB time:     [1.5428 µs 1.5448 µs 1.5466 µs]
aes128-gcm/graviola/16KB (main):    [1.8859 µs 1.8889 µs 1.8921 µs]
aes128-gcm/graviola/16KB (this PR): [1.7605 µs 1.7626 µs 1.7645 µs]
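
For readers unfamiliar with the term in the PR title: "stitching" interleaves the AES-CTR keystream computation with the GHASH authentication updates so that both execution pipelines stay busy instead of running the two phases back to back. A toy sketch of the loop structure only, with stand-in functions (real code uses AES rounds and carryless-multiply GHASH; nothing below is Graviola's implementation):

```rust
// Stand-in for the AES-CTR keystream block (real code: AES rounds on a counter).
fn toy_keystream(counter: u64) -> u64 {
    counter.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

// Stand-in for the GHASH update (real code: GF(2^128) multiply by H).
fn toy_ghash_update(acc: u64, block: u64) -> u64 {
    (acc ^ block).rotate_left(7)
}

// "Stitched" loop: while the keystream math for block i is in flight,
// perform the GHASH update for the previous ciphertext block.
fn stitched(plain: &[u64]) -> (Vec<u64>, u64) {
    let mut cipher = Vec::with_capacity(plain.len());
    let mut tag = 0u64;
    for (i, &p) in plain.iter().enumerate() {
        let ks = toy_keystream(i as u64);
        if let Some(&prev) = cipher.last() {
            tag = toy_ghash_update(tag, prev);
        }
        cipher.push(p ^ ks);
    }
    // Drain the final pending GHASH update.
    if let Some(&last) = cipher.last() {
        tag = toy_ghash_update(tag, last);
    }
    (cipher, tag)
}

fn main() {
    let (cipher, tag) = stitched(&[1, 2, 3, 4]);
    assert_eq!(cipher.len(), 4);
    assert_eq!(cipher[0], 1 ^ toy_keystream(0));
    println!("tag = {tag:#x}");
}
```

The payoff is largest on big inputs, which matches the benchmark table above: no change at 32B, a measurable win at 8KB and 16KB.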

Comment on lines +339 to +347
// Reverse the order of the bytes in each of the two 64-bit lanes in `u`.
let u = vrev64q_u8(u);
let u = vreinterpretq_u64_u8(u);

// Swap the locations of the two 64-bit lanes to finish reversing the bytes.
let lane0 = vgetq_lane_u64(u, 0);
let lane1 = vgetq_lane_u64(u, 1);
let reversed = vsetq_lane_u64(lane0, u, 1);
vsetq_lane_u64(lane1, reversed, 0)
Contributor Author

This is slow, but I haven't figured out a better alternative yet. I tried doing a shuffle operation, similar to what the x86_64 version does:

    use core::arch::aarch64::*;
    use core::mem;

    const SHUFFLE_MAP: uint8x16_t = unsafe { mem::transmute([15u8, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]) };
    let mut reversed: uint8x16_t;
    unsafe {
        core::arch::asm!(
            "tbl {reversed:v}.16B, {{ {u:v}.16B }}, {map:v}.16B",
            reversed = out(vreg) reversed,
            u = in(vreg) u,
            map = in(vreg) SHUFFLE_MAP,
        );
    }
    vreinterpretq_u64_u8(reversed)

but that ran even slower.
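
As a reference point for what the get/set-lane sequence computes: it is a full 16-byte reversal, which in portable Rust is equivalent to a byte swap of the 128-bit value loaded little-endian. A minimal portable sketch (not NEON; just the semantics — as a hedged aside, a single `vextq_u64(u, u, 1)` would also swap the two 64-bit lanes in one instruction, though it is not benchmarked here):

```rust
/// Portable reference for the 16-byte reversal the NEON sequence performs.
/// Loading little-endian, byte-swapping the 128-bit value, and storing it
/// back is the same transformation as vrev64q_u8 followed by swapping the
/// two 64-bit lanes.
fn reverse_16_bytes(bytes: [u8; 16]) -> [u8; 16] {
    u128::from_le_bytes(bytes).swap_bytes().to_le_bytes()
}

fn main() {
    let input: [u8; 16] = core::array::from_fn(|i| i as u8);
    let out = reverse_16_bytes(input);
    assert_eq!(out[0], 15);
    assert_eq!(out[15], 0);
}
```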

@brian-pane
Contributor Author

This patch assumes that the aarch64 target system is little-endian. Does Graviola support ARM running in big-endian mode?
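
Rust exposes target endianness at compile time, so if big-endian Arm is out of scope, one hedged option (a suggestion for how the crate might guard this assumption, not something in the PR) is a `compile_error!` guard:

```rust
// Refuse to build on big-endian targets instead of silently producing
// the wrong byte order. This guard is an assumption about how the crate
// might want to handle the question; it is not part of this PR.
#[cfg(not(target_endian = "little"))]
compile_error!("this AES-GCM path assumes a little-endian target");

fn main() {
    // Runtime double-check; cfg! is resolved at compile time.
    assert!(cfg!(target_endian = "little"));
    println!("little-endian target");
}
```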
