AMD ROCm support#1072

Open
benoit-cty wants to merge 41 commits intomasterfrom
feat/rocm

Conversation

@benoit-cty
Contributor

Description

Continuing #490

Related Issue

This PR resolves issue #178.

Motivation and Context

AMD GPUs are not yet supported.

How Has This Been Tested?

Tested on the Adastra supercomputer, with AMD MI250 GPUs.

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

Go over all the following points, and put an x in all the boxes that apply.

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@benoit-cty
Contributor Author

I managed to make CodeCarbon work on Adastra and upgraded the @IlyasMoutawwakil code to support more recent versions of the amdsmi package.

There is still work to do as the metrics are weird:

[codecarbon INFO @ 17:04:13] Energy consumed for all GPUs : 4.254300 kWh. Total GPU Power : 12572969.923258875 W
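One plausible explanation for the absurd power figure (an assumption, not confirmed in this thread) is a cumulative energy accumulator being read as if it were instantaneous power. A minimal sketch of the delta-based conversion, with made-up counter values and a hypothetical counter resolution in microjoules per tick:

```python
# Sketch: deriving average power from a cumulative energy counter, as exposed
# by accumulator-style GPU APIs. All numbers below are illustrative.

RESOLUTION_UJ = 15.3  # hypothetical counter resolution, microjoules per tick

def average_power_w(count_start, count_end, t_start_s, t_end_s,
                    resolution_uj=RESOLUTION_UJ):
    """Average power in watts between two counter samples."""
    delta_j = (count_end - count_start) * resolution_uj * 1e-6  # ticks -> joules
    delta_t_s = t_end_s - t_start_s
    return delta_j / delta_t_s

# Two samples one second apart: 20 million ticks at 15.3 uJ/tick -> 306 W.
print(average_power_w(1_000_000_000, 1_020_000_000, 0.0, 1.0))  # 306.0

# The bug pattern: treating the raw accumulator itself as a power reading
# produces a similarly absurd megawatt-scale figure to the log above.
print(1_020_000_000 * RESOLUTION_UJ * 1e-6)
```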

@benoit-cty benoit-cty force-pushed the feat/rocm branch 2 times, most recently from daf29f5 to 80e07e7 Compare March 4, 2026 19:53
@benoit-cty benoit-cty changed the title [Draft] AMD ROCm support AMD ROCm support Mar 5, 2026
@benoit-cty
Contributor Author

Here is the execution log for two MI250 on Adastra:
4693754-adastra-matrix-multi-gpu-err.log
4693754-adastra-matrix-multi-gpu-out.log

@benoit-cty
Contributor Author

Emissions of this PR: 0.8 kg CO2eq for all 96 test runs on Adastra.

@codecov

codecov bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 98.31461% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.35%. Comparing base (2282658) to head (cc1b4d6).

Files with missing lines Patch % Lines
codecarbon/core/gpu_nvidia.py 94.02% 4 Missing ⚠️
codecarbon/core/gpu_amd.py 99.32% 1 Missing ⚠️
codecarbon/core/gpu_device.py 97.72% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1072      +/-   ##
==========================================
+ Coverage   78.22%   80.35%   +2.13%     
==========================================
  Files          38       41       +3     
  Lines        3632     3868     +236     
==========================================
+ Hits         2841     3108     +267     
+ Misses        791      760      -31     


@benoit-cty benoit-cty marked this pull request as ready for review March 5, 2026 17:30
@benoit-cty benoit-cty requested a review from a team as a code owner March 5, 2026 17:30
IlyasMoutawwakil and others added 19 commits March 7, 2026 16:26
Remove warning for amdsmi.amdsmi_get_gpu_process_list

Debug detection

Fix Uninitialized

amdsmi_get_energy_count

fix Slurm log

Handle ROCR_VISIBLE_DEVICES

AMD debug

wip: AMD debug

wip: AMD debug

wip: AMD debug

wip: AMD debug

Handle AMDSMI_STATUS_NOT_INIT

Cleaning log

Cleaning log

Introduce a GPU index

Introduce a GPU index

debug ROCR_VISIBLE_DEVICES

debug ROCR_VISIBLE_DEVICES

debug ROCR_VISIBLE_DEVICES

wip: debug AMD

wip: debug AMD

amdsmi fallback

wip: debug AMD
Handle power and energy_accumulator

Adastra

Adastra Doc
Member

@SaboniAmine SaboniAmine left a comment


Really cool job, thanks Benoît!
A few comments, but as it seems to have been extensively tested, this might be ready to be shared with other users. Would you like to go through a pre-release on a specific tag from a branch to test on other devices? Maybe @prmths128 could give it a try?


return str_or_bytes

def emit_selection_warning(self) -> None:
Member


Is this method useful?

Contributor Author


Yes: for AMD MIxxx GPUs it warns you that power monitoring works only on the first GCD.
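The warning described here could look like the following sketch. The class and member names below (GCDS_PER_PACKAGE, is_primary_gcd) are illustrative assumptions, not the actual CodeCarbon implementation; the only fact taken from the thread is that MI250-class power monitoring works only on the first GCD of each package.

```python
import logging

logger = logging.getLogger("codecarbon")

class AMDGPUDeviceSketch:
    """Hypothetical per-device selection warning for MI250-class GPUs,
    where each package exposes two GCDs but only the first reports power."""

    GCDS_PER_PACKAGE = 2  # assumption: MI250 exposes 2 GCDs per package

    def __init__(self, index: int):
        self.index = index

    def is_primary_gcd(self) -> bool:
        # Even indices are the first GCD of their package in this sketch.
        return self.index % self.GCDS_PER_PACKAGE == 0

    def emit_selection_warning(self) -> None:
        if not self.is_primary_gcd():
            logger.warning(
                "GPU %d is a secondary GCD: power monitoring is only "
                "available on the first GCD of each package.",
                self.index,
            )

AMDGPUDeviceSketch(1).emit_selection_warning()  # emits a warning
AMDGPUDeviceSketch(0).emit_selection_warning()  # silent
```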

def set_GPU_tracking(self):
    logger.info("[setup] GPU Tracking...")
    if self.tracker._gpu_ids:
        if isinstance(self.tracker._gpu_ids, str):
Member


Are integer IDs explicitly excluded from this test? I recall an issue with mixed UUIDs / integer IDs, in the HF computing cluster if I'm not wrong. What would be the impact of this filter in that kind of case?

Contributor Author


GPT-5.3-Codex answer:

For mixed UUID + integer IDs, impact depends on how IDs are provided:

  1. As a single string (for example via env/config): it is parsed into string tokens first, then both numeric and UUID-like entries can be resolved downstream. Parsing path is here: resource_tracker.py:219, config.py:61.
  2. As a Python list containing mixed types (for example [0, "GPU-..."]): this branch in resource tracker skips parse_gpu_ids, and downstream resolver still handles both types. So this path is generally safe with the current code.
  3. If mixed lists were passed directly into parse_gpu_ids, that function currently only accepts list elements when all are integers: config.py:65. Otherwise it warns and effectively returns None: config.py:72.

So the current filter in resource tracker likely reduces risk in the mixed-list case by not forcing mixed lists through parse_gpu_ids.
The bigger potential risk in HF-like MIG environments is the sanitizer in parse_gpu_ids stripping characters outside alnum, hyphen, comma: config.py:61. If an ID format contains other separators, matching can fail and those GPUs may be ignored.
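The mixed-list concern above can be illustrated with a sketch of a parser that accepts both integer and UUID-style entries instead of rejecting non-integer lists. parse_gpu_ids is the real CodeCarbon function discussed here, but this variant, its name, and the UUID pattern are hypothetical:

```python
import re

# Hypothetical variant of a parse_gpu_ids-style helper that tolerates mixed
# integer / UUID-style entries; the "GPU-"/"MIG-" prefix pattern is assumed.
UUID_RE = re.compile(r"^(GPU|MIG)-[0-9a-fA-F-]+$")

def parse_mixed_gpu_ids(gpu_ids):
    """Normalise a comma-separated string or a list of GPU IDs into a list
    of ints (plain indices) and strings (UUID-like IDs)."""
    if isinstance(gpu_ids, str):
        gpu_ids = [token.strip() for token in gpu_ids.split(",") if token.strip()]
    parsed = []
    for entry in gpu_ids:
        if isinstance(entry, int):
            parsed.append(entry)
        elif isinstance(entry, str) and entry.isdigit():
            parsed.append(int(entry))
        elif isinstance(entry, str) and UUID_RE.match(entry):
            parsed.append(entry)
        else:
            raise ValueError(f"Unrecognised GPU id: {entry!r}")
    return parsed

print(parse_mixed_gpu_ids("0, 1, GPU-5a4e"))  # [0, 1, 'GPU-5a4e']
print(parse_mixed_gpu_ids([0, "GPU-5a4e"]))   # [0, 'GPU-5a4e']
```

Failing loudly on unrecognised entries, rather than silently returning None, is the design choice the reviewer's follow-up comment argues for.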

Member


I'd then extract the logic into a dedicated method that properly tests the parsing, escaping the GPU ID strings correctly when we detect a mixed-type ID array.

self._gpu_ids_resolved = True
return list(range(self.num_gpus))

def _emit_selection_warning_for_gpu_id(self, gpu_id: int) -> None:
Member


I guess the previously commented method (in abstract GPUDevice) is now deprecated, as those private methods exist?

Contributor Author


I confirmed this is not actually a deprecation path: the method on the base GPU device is the extension hook, and the private method in hardware is just dispatch by selected index. I’ll make that explicit in code by tightening the call path and adding a clear docstring.
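The hook-plus-dispatch relationship described here can be sketched as follows. The class and method names echo the snippets quoted in this review (GPUDevice, emit_selection_warning, _emit_selection_warning_for_gpu_id), but the bodies are illustrative assumptions, not the actual code:

```python
# Sketch of the call path discussed above: the public emit_selection_warning
# on the device is the extension hook, and the private method on the hardware
# side merely dispatches to the device selected by index.

class GPUDevice:
    """Base device: subclasses override the hook; default is no warning."""

    def emit_selection_warning(self):
        return None

class AMDGPUDevice(GPUDevice):
    def emit_selection_warning(self):
        # Illustrative message only; real wording lives in the device class.
        return "power monitoring works only on the first GCD"

class Hardware:
    def __init__(self, devices):
        self.devices = devices

    def _emit_selection_warning_for_gpu_id(self, gpu_id):
        # Pure dispatch by selected index; the logic lives in the device hook.
        return self.devices[gpu_id].emit_selection_warning()

hw = Hardware([GPUDevice(), AMDGPUDevice()])
print(hw._emit_selection_warning_for_gpu_id(1))  # the AMD warning
print(hw._emit_selection_warning_for_gpu_id(0))  # None
```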

@benoit-cty
Contributor Author

Thanks

I confirm it still works with NVIDIA GPUs.

Testing can be done with: uv pip install git+https://github.com/mlco2/codecarbon.git@feat/rocm
