AMD ROCm support#1072

Open
benoit-cty wants to merge 41 commits intomasterfrom
feat/rocm

Conversation

@benoit-cty
Contributor

Description

Continuing #490

Related Issue

This PR resolves issue #178.

Motivation and Context

AMD GPUs are not yet supported.

How Has This Been Tested?

Tested on the Adastra supercomputer, with AMD MI250 GPUs.

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

Go over all the following points, and put an x in all the boxes that apply.

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@benoit-cty
Contributor Author

I managed to make CodeCarbon work on Adastra and upgraded the @IlyasMoutawwakil code to support more recent versions of the amdsmi package.

There is still work to do as the metrics are weird:

[codecarbon INFO @ 17:04:13] Energy consumed for all GPUs : 4.254300 kWh. Total GPU Power : 12572969.923258875 W
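One plausible explanation for the absurd power figure (an assumption, not confirmed in this thread) is a cumulative energy accumulator being read as if it were instantaneous power. A minimal sketch of the delta-based conversion, with made-up counter values and a hypothetical counter resolution in microjoules per tick:

```python
# Sketch: deriving average power from a cumulative energy counter, as exposed
# by accumulator-style GPU APIs. All numbers below are illustrative.

RESOLUTION_UJ = 15.3  # hypothetical counter resolution, microjoules per tick

def average_power_w(count_start, count_end, t_start_s, t_end_s,
                    resolution_uj=RESOLUTION_UJ):
    """Average power in watts between two counter samples."""
    delta_j = (count_end - count_start) * resolution_uj * 1e-6  # ticks -> joules
    delta_t_s = t_end_s - t_start_s
    return delta_j / delta_t_s

# Two samples one second apart: 20 million ticks at 15.3 uJ/tick -> 306 W.
print(average_power_w(1_000_000_000, 1_020_000_000, 0.0, 1.0))  # 306.0

# The bug pattern: treating the raw accumulator itself as a power reading
# produces a similarly absurd megawatt-scale figure to the log above.
print(1_020_000_000 * RESOLUTION_UJ * 1e-6)
```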

@benoit-cty benoit-cty force-pushed the feat/rocm branch 2 times, most recently from daf29f5 to 80e07e7 Compare March 4, 2026 19:53
@benoit-cty benoit-cty changed the title [Draft] AMD ROCm support AMD ROCm support Mar 5, 2026
@benoit-cty
Contributor Author

Here is the execution log for two MI250 on Adastra:
4693754-adastra-matrix-multi-gpu-err.log
4693754-adastra-matrix-multi-gpu-out.log

@benoit-cty
Contributor Author

Emissions of this PR: 0.8 kg CO2eq for all 96 test runs on Adastra.

@codecov

codecov bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 98.31461% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.35%. Comparing base (2282658) to head (cc1b4d6).

Files with missing lines Patch % Lines
codecarbon/core/gpu_nvidia.py 94.02% 4 Missing ⚠️
codecarbon/core/gpu_amd.py 99.32% 1 Missing ⚠️
codecarbon/core/gpu_device.py 97.72% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1072      +/-   ##
==========================================
+ Coverage   78.22%   80.35%   +2.13%     
==========================================
  Files          38       41       +3     
  Lines        3632     3868     +236     
==========================================
+ Hits         2841     3108     +267     
+ Misses        791      760      -31     


@benoit-cty benoit-cty marked this pull request as ready for review March 5, 2026 17:30
@benoit-cty benoit-cty requested a review from a team as a code owner March 5, 2026 17:30
IlyasMoutawwakil and others added 19 commits March 7, 2026 16:26
Remove warning for amdsmi.amdsmi_get_gpu_process_list

Debug detection

Fix Uninitialized

amdsmi_get_energy_count

fix Slurm log

Handle ROCR_VISIBLE_DEVICES

AMD debug

wip: AMD debug

wip: AMD debug

wip: AMD debug

wip: AMD debug

Handle AMDSMI_STATUS_NOT_INIT

Cleaning log

Cleaning log

Introduce a GPU index

Introduce a GPU index

debug ROCR_VISIBLE_DEVICES

debug ROCR_VISIBLE_DEVICES

debug ROCR_VISIBLE_DEVICES

wip: debug AMD

wip: debug AMD

amdsmi fallback

wip: debug AMD
Handle power and energy_accumulator

Adastra

Adastra Doc
Member

@SaboniAmine SaboniAmine left a comment


Really cool job, thanks Benoît!
A few comments, but as it seems to have been extensively tested, this might be ready to be shared with other users. Would you like to go through a pre-release on a specific tag from a branch to test on other devices? Maybe @prmths128 could give it a try?


return str_or_bytes

def emit_selection_warning(self) -> None:
Member


Is this method useful?

Contributor Author


Yes: for AMD MIxxx GPUs it warns you that power monitoring works only on the first GCD.
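The warning described here could look like the following sketch. The class and member names below (GCDS_PER_PACKAGE, is_primary_gcd) are illustrative assumptions, not the actual CodeCarbon implementation; the only fact taken from the thread is that MI250-class power monitoring works only on the first GCD of each package.

```python
import logging

logger = logging.getLogger("codecarbon")

class AMDGPUDeviceSketch:
    """Hypothetical per-device selection warning for MI250-class GPUs,
    where each package exposes two GCDs but only the first reports power."""

    GCDS_PER_PACKAGE = 2  # assumption: MI250 exposes 2 GCDs per package

    def __init__(self, index: int):
        self.index = index

    def is_primary_gcd(self) -> bool:
        # Even indices are the first GCD of their package in this sketch.
        return self.index % self.GCDS_PER_PACKAGE == 0

    def emit_selection_warning(self) -> None:
        if not self.is_primary_gcd():
            logger.warning(
                "GPU %d is a secondary GCD: power monitoring is only "
                "available on the first GCD of each package.",
                self.index,
            )

AMDGPUDeviceSketch(1).emit_selection_warning()  # emits a warning
AMDGPUDeviceSketch(0).emit_selection_warning()  # silent
```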

def set_GPU_tracking(self):
    logger.info("[setup] GPU Tracking...")
    if self.tracker._gpu_ids:
        if isinstance(self.tracker._gpu_ids, str):
Member


Are integer IDs explicitly excluded from this test? I recall an issue with mixed UUIDs / integer IDs, in the HF computing cluster if I'm not wrong. What would be the impact of this filter in that kind of case?

Contributor Author


GPT-5.3-Codex answer:

For mixed UUID + integer IDs, impact depends on how IDs are provided:

  1. As a single string (for example via env/config): it is parsed into string tokens first, then both numeric and UUID-like entries can be resolved downstream. Parsing path is here: resource_tracker.py:219, config.py:61.
  2. As a Python list containing mixed types (for example [0, "GPU-..."]): this branch in resource tracker skips parse_gpu_ids, and downstream resolver still handles both types. So this path is generally safe with the current code.
  3. If mixed lists were passed directly into parse_gpu_ids, that function currently only accepts list elements when all are integers: config.py:65. Otherwise it warns and effectively returns None: config.py:72.

So the current filter in resource tracker likely reduces risk in the mixed-list case by not forcing mixed lists through parse_gpu_ids.
The bigger potential risk in HF-like MIG environments is the sanitizer in parse_gpu_ids stripping characters outside alnum, hyphen, comma: config.py:61. If an ID format contains other separators, matching can fail and those GPUs may be ignored.
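The mixed-list concern above can be illustrated with a sketch of a parser that accepts both integer and UUID-style entries instead of rejecting non-integer lists. parse_gpu_ids is the real CodeCarbon function discussed here, but this variant, its name, and the UUID pattern are hypothetical:

```python
import re

# Hypothetical variant of a parse_gpu_ids-style helper that tolerates mixed
# integer / UUID-style entries; the "GPU-"/"MIG-" prefix pattern is assumed.
UUID_RE = re.compile(r"^(GPU|MIG)-[0-9a-fA-F-]+$")

def parse_mixed_gpu_ids(gpu_ids):
    """Normalise a comma-separated string or a list of GPU IDs into a list
    of ints (plain indices) and strings (UUID-like IDs)."""
    if isinstance(gpu_ids, str):
        gpu_ids = [token.strip() for token in gpu_ids.split(",") if token.strip()]
    parsed = []
    for entry in gpu_ids:
        if isinstance(entry, int):
            parsed.append(entry)
        elif isinstance(entry, str) and entry.isdigit():
            parsed.append(int(entry))
        elif isinstance(entry, str) and UUID_RE.match(entry):
            parsed.append(entry)
        else:
            raise ValueError(f"Unrecognised GPU id: {entry!r}")
    return parsed

print(parse_mixed_gpu_ids("0, 1, GPU-5a4e"))  # [0, 1, 'GPU-5a4e']
print(parse_mixed_gpu_ids([0, "GPU-5a4e"]))   # [0, 'GPU-5a4e']
```

Failing loudly on unrecognised entries, rather than silently returning None, is the design choice the reviewer's follow-up comment argues for.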

Member


I'd then extract the logic into a dedicated method that properly tests the parsing, escaping the GPU ID strings correctly when we detect a mixed-type ID array.

self._gpu_ids_resolved = True
return list(range(self.num_gpus))

def _emit_selection_warning_for_gpu_id(self, gpu_id: int) -> None:
Member


I guess the previously commented method (in abstract GPUDevice) is now deprecated, as those private methods exist?

Contributor Author


I confirmed this is not actually a deprecation path: the method on the base GPU device is the extension hook, and the private method in hardware is just dispatch by selected index. I’ll make that explicit in code by tightening the call path and adding a clear docstring.
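The hook-plus-dispatch relationship described here can be sketched as follows. The class and method names echo the snippets quoted in this review (GPUDevice, emit_selection_warning, _emit_selection_warning_for_gpu_id), but the bodies are illustrative assumptions, not the actual code:

```python
# Sketch of the call path discussed above: the public emit_selection_warning
# on the device is the extension hook, and the private method on the hardware
# side merely dispatches to the device selected by index.

class GPUDevice:
    """Base device: subclasses override the hook; default is no warning."""

    def emit_selection_warning(self):
        return None

class AMDGPUDevice(GPUDevice):
    def emit_selection_warning(self):
        # Illustrative message only; real wording lives in the device class.
        return "power monitoring works only on the first GCD"

class Hardware:
    def __init__(self, devices):
        self.devices = devices

    def _emit_selection_warning_for_gpu_id(self, gpu_id):
        # Pure dispatch by selected index; the logic lives in the device hook.
        return self.devices[gpu_id].emit_selection_warning()

hw = Hardware([GPUDevice(), AMDGPUDevice()])
print(hw._emit_selection_warning_for_gpu_id(1))  # the AMD warning
print(hw._emit_selection_warning_for_gpu_id(0))  # None
```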

@benoit-cty
Contributor Author

Thanks

I confirm it still works with NVIDIA GPUs.

Testing can be done with: uv pip install git+https://github.com/mlco2/codecarbon.git@feat/rocm
