I managed to make CodeCarbon work on Adastra and upgraded @IlyasMoutawwakil's code to support more recent versions of the amdsmi package. There is still work to do as the metrics are weird:
Here is the execution log for two MI250 on Adastra:

Emissions of this PR: 0.8 kg CO2eq for all 96 test runs on Adastra.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1072 +/- ##
==========================================
+ Coverage 78.22% 80.35% +2.13%
==========================================
Files 38 41 +3
Lines 3632 3868 +236
==========================================
+ Hits 2841 3108 +267
+ Misses 791 760 -31
Remove warning for amdsmi.amdsmi_get_gpu_process_list
Debug detection
Fix Uninitialized amdsmi_get_energy_count
fix Slurm log
Handle ROCR_VISIBLE_DEVICES
AMD debug
wip: AMD debug
wip: AMD debug
wip: AMD debug
wip: AMD debug
Handle AMDSMI_STATUS_NOT_INIT
Cleaning log
Cleaning log
Introduce a GPU index
Introduce a GPU index
debug ROCR_VISIBLE_DEVICES
debug ROCR_VISIBLE_DEVICES
debug ROCR_VISIBLE_DEVICES
wip: debug AMD
wip: debug AMD
amdsmi fallback
wip: debug AMD
SaboniAmine left a comment
Really cool job, thanks Benoît!
A few comments, but as it seems to have been extensively tested, this might be ready to be shared with other users. Would you like to go through a pre-release on a specific tag from a branch to test on other devices? Maybe @prmths128 could give it a try?
        return str_or_bytes

    def emit_selection_warning(self) -> None:
Yes, for AMD MIxxx GPUs it will warn you that power monitoring works only on the first GCD.
    def set_GPU_tracking(self):
        logger.info("[setup] GPU Tracking...")
        if self.tracker._gpu_ids:
            if isinstance(self.tracker._gpu_ids, str):
Are integer IDs explicitly excluded from this test? I recall an issue with mixed UUIDs / integer IDs, in the HF computing cluster if I'm not wrong. What would be the impact of this filter in that kind of case?
GPT-5.3-Codex answer:
For mixed UUID + integer IDs, impact depends on how IDs are provided:
- As a single string (for example via env/config): it is parsed into string tokens first, then both numeric and UUID-like entries can be resolved downstream. Parsing path is here: resource_tracker.py:219, config.py:61.
- As a Python list containing mixed types (for example [0, "GPU-..."]): this branch in resource tracker skips parse_gpu_ids, and downstream resolver still handles both types. So this path is generally safe with the current code.
- If mixed lists were passed directly into parse_gpu_ids, that function currently only accepts list elements when all are integers: config.py:65. Otherwise it warns and effectively returns None: config.py:72.
So the current filter in resource tracker likely reduces risk in the mixed-list case by not forcing mixed lists through parse_gpu_ids.
The bigger potential risk in HF-like MIG environments is the sanitizer in parse_gpu_ids stripping characters outside alnum, hyphen, comma: config.py:61. If an ID format contains other separators, matching can fail and those GPUs may be ignored.
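To make the three cases above concrete, here is a minimal sketch of the behaviour as described in this answer. It is not CodeCarbon's actual `parse_gpu_ids`: the function name `parse_gpu_ids_sketch`, the regular expression, and the sample IDs are assumptions made for illustration only.

```python
import re
from typing import List, Optional, Union


def parse_gpu_ids_sketch(
    gpu_ids: Union[str, List[object], None],
) -> Optional[List[Union[int, str]]]:
    """Hypothetical stand-in for the parsing behaviour described above."""
    if gpu_ids is None:
        return None
    if isinstance(gpu_ids, str):
        # Sanitizer as described: keep only alphanumerics, hyphens and commas.
        cleaned = re.sub(r"[^A-Za-z0-9,-]", "", gpu_ids)
        # Tokens stay strings; numeric vs UUID resolution happens downstream.
        return [token for token in cleaned.split(",") if token]
    if isinstance(gpu_ids, list) and all(isinstance(i, int) for i in gpu_ids):
        return list(gpu_ids)
    # Mixed or unsupported lists: the described behaviour is warn and return None.
    return None


print(parse_gpu_ids_sketch("0, GPU-abc-123"))   # ['0', 'GPU-abc-123']
print(parse_gpu_ids_sketch([0, "GPU-abc-123"]))  # None
print(parse_gpu_ids_sketch("MIG-GPU-abc/1/0"))   # ['MIG-GPU-abc10']
```

The last call shows the risk mentioned above: a separator outside the allowed set is silently dropped, so the resulting token may no longer match any device ID.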
I'd then extract the logic into a dedicated method that tests the parsing correctly and properly escapes the GPU ID strings when a mixed-type ID array is detected.
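One possible shape for such a method, as a rough sketch only: the name `normalize_gpu_ids_sketch` and its exact rules are hypothetical and not part of this PR. It trims whitespace instead of stripping characters, so IDs containing other separators are left intact for the downstream resolver.

```python
from typing import Iterable, List, Union

GpuId = Union[int, str]


def normalize_gpu_ids_sketch(
    gpu_ids: Union[str, Iterable[GpuId], None],
) -> List[GpuId]:
    """Hypothetical helper: accept a string or a (possibly mixed) list of IDs."""
    if gpu_ids is None:
        return []
    # A single string is split on commas; any other iterable is used as-is.
    items = gpu_ids.split(",") if isinstance(gpu_ids, str) else list(gpu_ids)
    normalized: List[GpuId] = []
    for item in items:
        if isinstance(item, bool) or not isinstance(item, (int, str)):
            raise TypeError(f"Unsupported GPU id: {item!r}")
        if isinstance(item, int):
            normalized.append(item)
            continue
        token = item.strip()
        if not token:
            continue  # ignore empty entries such as a trailing comma
        # Plain ASCII digits become integer indices; anything else stays a string ID.
        is_index = token.isascii() and token.isdigit()
        normalized.append(int(token) if is_index else token)
    return normalized


print(normalize_gpu_ids_sketch([0, "GPU-abc", " 2 "]))  # [0, 'GPU-abc', 2]
print(normalize_gpu_ids_sketch("MIG-GPU-abc/1/0"))       # ['MIG-GPU-abc/1/0']
```

A helper like this would be easy to cover with unit tests for string, integer-only, and mixed inputs.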
            self._gpu_ids_resolved = True
            return list(range(self.num_gpus))

    def _emit_selection_warning_for_gpu_id(self, gpu_id: int) -> None:
I guess the previously commented method (in the abstract GPUDevice) is now deprecated, since those private methods exist?
I confirmed this is not actually a deprecation path: the method on the base GPU device is the extension hook, and the private method in hardware is just dispatch by selected index. I’ll make that explicit in code by tightening the call path and adding a clear docstring.
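The relationship described here can be sketched roughly as follows. All class names ending in `Sketch` are hypothetical, and the AMD subclass warns unconditionally as a simplification; only `emit_selection_warning` and `_emit_selection_warning_for_gpu_id` are names taken from the diff.

```python
import logging
from typing import List

logger = logging.getLogger("gpu_selection_sketch")


class GPUDeviceSketch:
    """Base device: exposes the extension hook, which does nothing by default."""

    def __init__(self, gpu_index: int) -> None:
        self.gpu_index = gpu_index

    def emit_selection_warning(self) -> None:
        """Extension hook: subclasses override this to warn about their limits."""


class AMDGPUDeviceSketch(GPUDeviceSketch):
    def emit_selection_warning(self) -> None:
        # Simplified: the real check for MIxxx hardware is not modelled here.
        logger.warning(
            "GPU %d: power monitoring works only on the first GCD.", self.gpu_index
        )


class AllGPUDevicesSketch:
    """Hardware-level container: only dispatches to the selected device."""

    def __init__(self, devices: List[GPUDeviceSketch]) -> None:
        self.devices = devices

    def _emit_selection_warning_for_gpu_id(self, gpu_id: int) -> None:
        # Dispatch by selected index; unknown indices are ignored.
        if 0 <= gpu_id < len(self.devices):
            self.devices[gpu_id].emit_selection_warning()
```

With this split, vendor-specific behaviour stays in the device classes, and the private method carries no logic of its own beyond index lookup.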
Thanks, I confirm it still works with NVIDIA GPUs. Test could be done with
Description
Continuing #490
Related Issue
Please link to the issue this PR resolves: [issue #178]
Motivation and Context
AMD GPUs are not yet supported.
How Has This Been Tested?
Tested on the Adastra supercomputer, with AMD MI250 GPUs.
Types of changes
What types of changes does your code introduce? Put an x in all the boxes that apply:
Checklist:
Go over all the following points, and put an x in all the boxes that apply.