
[feat] Remove unnecessary Slurm polling when retrieving job pending reason #3642

Merged
vkarak merged 3 commits into reframe-hpc:develop from vkarak:bugfix/slurm-rpc-load
Apr 28, 2026

Conversation

@vkarak
Contributor

@vkarak vkarak commented Mar 25, 2026

This PR reduces the Slurm polling needed to retrieve the pending reason of jobs.

The pending reasons of all pending jobs are now retrieved with a single command. Two knobs are also exposed to users, both as configuration options and as environment variables:

  1. slurm_job_cancel_reasons: A list of pending reasons that ReFrame checks for; a job that is pending for any of these reasons is cancelled proactively.
  2. slurm_pending_job_reason_poll_freq: Controls how frequently pending jobs are polled for their pending reasons (valid only for the Slurm backend).
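
As a rough illustration of the single-command approach, the sketch below queries the pending reason of several jobs with one `squeue` call and then picks the jobs whose reason is in a cancel list. It is not the code in `reframe/core/schedulers/slurm.py`: the function names and the example cancel reasons are hypothetical, and only the `squeue` flags (`-h`, `-j`, `-o` with `%i` and `%r`) are standard Slurm.

```python
import subprocess

# Hypothetical example values, standing in for slurm_job_cancel_reasons
CANCEL_REASONS = {'ReqNodeNotAvail', 'PartitionDown'}


def build_squeue_cmd(job_ids):
    """Build ONE squeue invocation that covers all the given job IDs."""
    # -h: no header, -j: comma-separated job list, -o: "<jobid>|<reason>"
    return ['squeue', '-h', '-j', ','.join(job_ids), '-o', '%i|%r']


def parse_pending_reasons(output):
    """Map each job ID to its reason from '<jobid>|<reason>' lines."""
    reasons = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue

        jobid, _, reason = line.partition('|')
        reasons[jobid] = reason

    return reasons


def jobs_to_cancel(reasons, cancel_reasons):
    """Return the sorted IDs of jobs whose reason is in cancel_reasons."""
    # Slurm may append details after a comma, e.g.
    # 'ReqNodeNotAvail, UnavailableNodes:nid001'; compare the first part only
    return sorted(
        jobid for jobid, reason in reasons.items()
        if reason.split(',')[0].strip() in cancel_reasons
    )


def poll_pending_reasons(job_ids):
    """Run squeue once for all jobs (requires a Slurm installation)."""
    if not job_ids:
        return {}

    proc = subprocess.run(build_squeue_cmd(job_ids), capture_output=True,
                          text=True, check=True)
    return parse_pending_reasons(proc.stdout)
```

With N pending jobs this issues one `squeue` call per poll rather than N, which is the kind of reduction the linked issue about Slurm's RPC rate limit asks for.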

Closes #3640.

@codecov

codecov Bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 50.84746% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.87%. Comparing base (4f021cf) to head (8881baf).
⚠️ Report is 4 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| reframe/core/schedulers/slurm.py | 50.00% | 27 Missing ⚠️ |
| reframe/core/schedulers/__init__.py | 33.33% | 2 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3642      +/-   ##
===========================================
+ Coverage    91.69%   91.87%   +0.18%     
===========================================
  Files           62       62              
  Lines        13745    13755      +10     
===========================================
+ Hits         12603    12638      +35     
+ Misses        1142     1117      -25     


@vkarak vkarak force-pushed the bugfix/slurm-rpc-load branch from 61c38ae to 45c5613 Compare March 25, 2026 16:52
@jack-morrison jack-morrison self-requested a review April 21, 2026 13:48
@github-project-automation github-project-automation Bot moved this from Todo to In Progress in ReFrame Backlog Apr 21, 2026
vkarak and others added 3 commits April 24, 2026 22:37
Co-authored-by: Jack Morrison <jack.morrison@cornelisnetworks.com>
Signed-off-by: Vasileios Karakasis <vkarak@gmail.com>
@vkarak vkarak force-pushed the bugfix/slurm-rpc-load branch from b2c3e02 to 8881baf Compare April 24, 2026 20:41
@jack-morrison jack-morrison self-requested a review April 28, 2026 17:24
@vkarak vkarak changed the title from "[feat] Aggregate polling for retrieving job pending reason" to "[feat] Remove unnecessary Slurm polling when retrieving job pending reason" Apr 28, 2026
@vkarak vkarak merged commit ea92563 into reframe-hpc:develop Apr 28, 2026
55 of 57 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in ReFrame Backlog Apr 28, 2026
@vkarak vkarak deleted the bugfix/slurm-rpc-load branch April 28, 2026 17:26

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Multiple squeue's in _cancel_if_blocked in reframe/core/schedulers/slurm.py are hitting slurm's RPC rate limit

2 participants