46 changes: 35 additions & 11 deletions README.md
@@ -1,5 +1,9 @@
# **E**arth **C**omputing **H**yperparameter **O**ptimization (ECHO): A distributed hyperparameter optimization package built with Optuna

<p align="center">
<img src="docs/echo_logo.png" alt="ECHO logo" width="400"/>
</p>

### Install

To install a stable version of ECHO from PyPI, use the following command:
@@ -97,7 +101,7 @@ pbs:
kernel: "ncar_pylib /glade/work/schreck/py37"
bash: ["module load ncarenv/1.3 gnu/8.3.0 openmpi/3.1.4 python/3.7.5 cuda/10.1"]
batch:
l: ["select=1:ncpus=8:ngpus=1:mem=128GB", "walltime=12:00:00"]
l: ["select=1:ncpus=8:ngpus=1:mem=128GB:gpu_type=a100_80gb", "walltime=12:00:00"]
A: "NAML0001"
q: "casper"
N: "echo_trial"
@@ -164,24 +168,41 @@ The save_path field sets the location where all generated data will be saved.
The log field allows you to save the logging details to a file in save_path; they will always be printed to stdout. If this field is removed, logging details will only be printed to stdout.
* log: boolean to save log.txt in save_path.

The subfields within "pbs" and slurm" should mostly be familiar to you. In this example there would be 10 jobs submitted to pbs queue and 15 jobs to the slurm queue. Most HPCs just use one or the other, so make sure to only speficy what your system supports. The kernel field is optional and can be any call(s) to activate a conda/python/ncar_pylib/etc environment. Additional snippets that you might need in your launch script can be added to the list in the "bash" field. For example, as in the example above, loading modules before training a model is required. Note that the bash options will be run in order, and before the kernel field. Remove or leave the kernel field blank if you do not need it.
The subfields within "pbs" and "slurm" should mostly be familiar to you. In this example there would be 10 jobs submitted to pbs queue and 15 jobs to the slurm queue. Most HPCs just use one or the other, so make sure to only specify what your system supports. The kernel field is optional and can be any call(s) to activate a conda/python/ncar_pylib/etc environment. Additional snippets that you might need in your launch script can be added to the list in the "bash" field. For example, as in the example above, loading modules before training a model is required. Note that the bash options will be run in order, and before the kernel field. Remove or leave the kernel field blank if you do not need it.

**Casper GPU type selection**: On NCAR Casper, the GPU architecture is specified using `gpu_type` inside the PBS select string. The exact values and their node configurations are:

| `gpu_type=` value | GPU | Notes |
|---|---|---|
| `v100_32gb` | V100 32 GB | |
| `a100_40gb` | A100 40 GB | Data & Viz queue |
| `a100_80gb` | A100 80 GB | 4 GPUs per node (ML/GPGPU) |
| `h100_80gb` | H100 80 GB | 4 GPUs per node |
| `gp100_16gb` | GP100 16 GB | Data & Viz queue |
| `l40_45gb` | L40 48 GB | Data & Viz queue |
| `mi300a` | MI300A 128 GB | Must request `ngpus=4`, exclusive-use only |

Example: `-l select=1:ncpus=8:ngpus=1:mem=128GB:gpu_type=a100_80gb`

See the [Casper node types documentation](https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/casper/casper-node-types/) for the authoritative list.

The subfields within the "optuna" field have the following functionality:
* storage: sqlite or mysql destination.
* study_name: The name of the study.
* storage_type: Choose "sqlite" or "maria" if a MariaDB is setup.
* If "sqlite", the storage field will automatically be appended to the save_path field (e.g. sql:///{save_path}/mlp.db)
* If "maria", specify the full path including username:password in the storage field (for example, mysql://user:pw@someserver.ucar.edu/optuna).
* storage_type: Choose the backend that matches your environment:
* `"sqlite"` — the storage field is joined to save_path automatically (e.g. `sqlite:///{save_path}/mlp.db`). Simple and works well for single-node runs.
* `"maria"` — specify a full MariaDB/MySQL URL in the storage field (e.g. `mysql://user:pw@someserver.ucar.edu/optuna`). Best for large distributed studies with many concurrent workers.
* `"nfs"` — uses Optuna's `JournalFileBackend` (a plain append-only file) instead of a relational database. The storage field is joined to save_path. **Recommended when SQLite locking is unreliable on your shared filesystem** (common on Lustre/GPFS mounts). Example: `storage: "study_journal.log"`, `storage_type: "nfs"`.
* objective: The path to the user-supplied objective class
* metric: The metric to be used to determine the model performance.
* direction: Indicates which direction the metric must go to represent improvement (pick from maximimize or minimize)
* direction: Indicates which direction the metric must go to represent improvement (pick from `maximize` or `minimize`). For **multi-objective optimization**, supply a list of directions (e.g. `direction: ["minimize", "maximize"]`) and a list of metrics (e.g. `metric: ["val_loss", "val_accuracy"]`). Multi-objective studies use `TPESampler` or `NSGAIISampler` and produce a Pareto front. Note: the `optuna.multi_objective` submodule was removed in optuna 4.0 — all multi-objective support now goes through the unified `create_study(directions=[...])` API, which ECHO uses automatically.
* n_trials: The number of trials in the study.
* gpu: Set to true to obtain the GPUs and their IDs
* sampler
+ type: Choose how optuna will do parameter estimation. The default choice both here and in optuna is the [Tree-structured Parzen Estimator Approach](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f), [e.g. TPESampler](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf). See the optuna documentation for the different options. For some samplers (e.g. GridSearch) additional fields may be included (e.g. search_space).
* parameters
+ type: Option to select an optuna trial setting. See the [optuna Trial documentation](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html?highlight=suggest#optuna.trial.Trial.suggest_uniform) for what is available. Currently, this package supports the available options from optuna: "categorical", "discrete_uniform", "float", "int", "loguniform", and "uniform".
+ settings: This dictionary field allows you to specify any settings that accompany the optuna trial type. In the example above, the named num_dense parameter is stated to be an integer with values ranging from 0 to 10. To see all the available options, consolt the [optuna Trial documentation](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html?highlight=suggest#optuna.trial.Trial.suggest_uniform)
+ type: Option to select an optuna trial setting. See the [optuna Trial documentation](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html) for what is available. Currently, this package supports: "categorical", "float", "int", "loguniform", "uniform", and "discrete_uniform". Note: "loguniform", "uniform", and "discrete_uniform" were removed in optuna 4.0; ECHO maps them internally to `suggest_float` for backward compatibility, but prefer "float" for new configs (use `log: True` in settings for log-uniform sampling).
+ settings: This dictionary field allows you to specify any settings that accompany the optuna trial type. In the example above, the named num_dense parameter is stated to be an integer with values ranging from 0 to 10. To see all the available options, consult the [optuna Trial documentation](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html).
* enqueue: [Optional] Adding this option will allow the user to add trials with pre-defined values when the study is first initialized; they will be run in order according to their id. Each entry must be structured as a dictionary with the parameter names exactly matching the hyperparameter names defined in the parameters field.
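
For readers who want to see what these fields translate to, here is a minimal Optuna sketch (optuna >= 4.0) covering the `"nfs"` journal storage, a multi-objective study, the `float`/`int` suggest calls, and an enqueued trial. The file path, study name, and parameter names are illustrative assumptions rather than values used by this repository; ECHO builds the equivalent objects for you from the YAML config, so this is only for orientation.

```python
import optuna
from optuna.storages import JournalStorage
from optuna.storages.journal import JournalFileBackend

# "nfs" storage_type: an append-only journal file instead of a relational database.
# The path below is hypothetical; ECHO joins the storage field onto save_path.
storage = JournalStorage(JournalFileBackend("/glade/work/username/echo/study_journal.log"))

# Multi-objective study: one direction per metric, using the unified create_study API.
study = optuna.create_study(
    study_name="example_study",
    storage=storage,
    directions=["minimize", "maximize"],  # e.g. val_loss, val_accuracy
    sampler=optuna.samplers.TPESampler(),
)

# The "enqueue" field corresponds to queueing trials with fixed parameter values.
study.enqueue_trial({"num_dense": 2, "learning_rate": 1e-3})

def objective(trial: optuna.Trial):
    # "int" parameter type
    num_dense = trial.suggest_int("num_dense", 0, 10)
    # "float" with log=True replaces the removed suggest_loguniform
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    # Placeholder metrics standing in for real training results.
    val_loss, val_accuracy = 1.0 / (num_dense + 1), 1.0 - learning_rate
    return val_loss, val_accuracy

study.optimize(objective, n_trials=10)
```

In practice you only edit the YAML fields described above; the sketch is just to make the mapping onto Optuna concrete.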

Lastly, the "log" field allows you to save the logging details to file; they will always be printed to stdout. If this field is removed, logging details will only be printed to stdout.
@@ -233,12 +254,15 @@ def custom_updates(trial, conf):
hyperparameters = conf["optuna"]["parameters"]

# Now update some via custom rules
num_dense = trial.suggest_discrete_uniform(**hyperparameters["num_dense"])
# Note: suggest_discrete_uniform was removed in optuna 4.0; use suggest_int or
# suggest_float(..., step=q) instead.
settings = hyperparameters["num_dense"]["settings"]
num_dense = trial.suggest_int(settings["name"], settings["low"], settings["high"])

# Update the config based on optuna's suggestion
conf["model"]["dense_hidden_dims"] = [1000 for k in range(num_dense)]
conf["model"]["dense_hidden_dims"] = [1000 for k in range(num_dense)]

return conf
return conf
```

The method should be called first thing in the custom Objective.train method (see the example Objective above). You may have noticed that the configuration (named conf) contains both hyperparameter and model fields. This package will copy the hyperparameter optuna field to the model configuration for convenience, so that we can reduce the total number of class and method dependencies (which helps me keep the code generalized). This occurs in the run.py script.
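
To make the calling order concrete, below is a minimal sketch of an objective whose train method applies custom_updates before anything else. The class layout, train signature, and returned dictionary are illustrative assumptions; consult the example Objective earlier in this README for the interface ECHO actually expects.

```python
import optuna

class MyObjective:
    """Illustrative sketch only; not ECHO's real base class."""

    def __init__(self, config: dict, metric: str = "val_loss"):
        self.config = config
        self.metric = metric

    def train(self, trial: optuna.Trial, conf: dict) -> dict:
        # Apply the user-defined rules first, so every downstream field
        # (including the copied "model" entries) reflects this trial's suggestions.
        conf = custom_updates(trial, conf)  # custom_updates as defined in the snippet above

        # ... build and train a model from conf["model"], then evaluate it ...
        val_loss = 0.0  # placeholder result

        # Report the metric named in the "optuna" section of the configuration.
        return {self.metric: val_loss}
```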
Binary file added docs/echo_logo.png
2 changes: 1 addition & 1 deletion echo/examples/keras/hyperparameter.yml
@@ -7,7 +7,7 @@ pbs:
gpus_per_node: 1
bash: ["source ~/.bashrc", "conda activate echo"]
batch:
l: ["select=1:ncpus=8:ngpus=1:mem=64GB", "walltime=12:00:00"]
l: ["select=1:ncpus=8:ngpus=1:mem=64GB:gpu_type=a100_80gb", "walltime=12:00:00"]
A: "NAML0001"
q: "casper"
N: "keras_example"
2 changes: 1 addition & 1 deletion echo/examples/torch/hyperparameter.yml
@@ -7,7 +7,7 @@ pbs:
gpus_per_node: 1
bash: ["source ~/.bashrc", "conda activate echo"]
batch:
l: ["select=1:ncpus=8:ngpus=1:mem=64GB", "walltime=12:00:00"]
l: ["select=1:ncpus=8:ngpus=1:mem=64GB:gpu_type=a100_80gb", "walltime=12:00:00"]
A: "NAML0001"
q: "casper"
N: "torch_example"
101 changes: 49 additions & 52 deletions echo/optimize.py
@@ -198,70 +198,67 @@ def fix_broken_study(


def generate_batch_commands(
hyper_config, batch_type, aiml_path, jobid, batch_commands = []
hyper_config, batch_type, aiml_path, jobid, batch_commands=None
) -> List[str]:
"""Build the list of shell commands that run echo-run inside a job script.

When ``gpus_per_node`` is set, each GPU gets exactly one process pinned to
it via ``CUDA_VISIBLE_DEVICES=<device>``. All processes are launched in
the background (``&``) and a single ``wait`` is appended so the job does
not exit until every worker finishes.

When no GPUs are present but ``tasks_per_worker > 1``, that many CPU-only
workers are launched in parallel the same way.

Parameters
----------
hyper_config : dict
Full hyperparameter config.
batch_type : str
``"pbs"`` or ``"slurm"`` — selects the sub-dict of ``hyper_config``.
aiml_path : str
The ``echo-run <hyper.yml> <model.yml>`` command string.
jobid : str
Scheduler job-ID variable (e.g. ``"$PBS_JOBID"`` or
``"$SLURM_JOB_ID"``).
batch_commands : list, optional
Accumulator list; header lines (shebang, #PBS/#SBATCH directives, bash
setup) are passed in here and the run commands are appended. Defaults
to a new empty list.

Returns
-------
list[str]
The complete script as a list of lines.
"""
if batch_commands is None:
batch_commands = []

# Check if "gpus_per_node" is specified in hyper_config[batch_type]
if "gpus_per_node" in hyper_config[batch_type]:
# Get the list of GPU devices, or convert a single integer to a list
gpus_per_node = list(range(hyper_config[batch_type]["gpus_per_node"]))

# Check if "tasks_per_worker" is specified in hyper_config[batch_type]
if (
"tasks_per_worker" in hyper_config[batch_type]
and hyper_config[batch_type]["tasks_per_worker"] > 1
):
# Warn about the experimental nature of tasks_per_worker
logging.warning(
"The tasks_per_worker is experimental; be advised that some runs may fail."
)
logging.warning(
"Check the log and stdout/err files if simulations are dying to see the errors."
# One process per GPU, each explicitly pinned via CUDA_VISIBLE_DEVICES.
# gpus_per_node is an integer; convert to a 0-based device list.
gpus = list(range(hyper_config[batch_type]["gpus_per_node"]))
for device in gpus:
batch_commands.append(
f"CUDA_VISIBLE_DEVICES={device} {aiml_path} -n {jobid} &"
)
batch_commands.append("wait")

# Loop over the specified number of trials
for copy in range(hyper_config[batch_type]["tasks_per_worker"]):
# Loop over each GPU device
for device in gpus_per_node:
# Append the command with CUDA_VISIBLE_DEVICES={device} to batch_commands
batch_commands.append(
f"CUDA_VISIBLE_DEVICES={device}, {aiml_path} -n {jobid} &"
)
# Allow some time between calling instances of run
batch_commands.append("sleep 0.5")
# Wait for all background jobs to finish
batch_commands.append("wait")
else:
# Loop over each GPU device without multiple trials
for device in gpus_per_node:
# Append the command with CUDA_VISIBLE_DEVICES={device} to batch_commands
batch_commands.append(
f"CUDA_VISIBLE_DEVICES={device}, {aiml_path} -n {jobid} &"
)
batch_commands.append("wait")
elif (
"tasks_per_worker" in hyper_config[batch_type]
and hyper_config[batch_type]["tasks_per_worker"] > 1
):
# Warn about the experimental nature of tasks_per_worker
logging.warning(
"The trails_per_job is experimental, be advised that some runs may fail."
)
# Multiple CPU-only workers sharing the node.
logging.warning(
"Check the log and stdout/err files if simulations are dying to see the errors."
"tasks_per_worker without gpus_per_node: launching %d CPU workers in parallel.",
hyper_config[batch_type]["tasks_per_worker"],
)
# Loop over the specified number of trials
for copy in range(hyper_config[batch_type]["tasks_per_worker"]):
# Append the command to batch_commands
batch_commands.append(
f"{aiml_path} -n {jobid} &"
)
# Allow some time between calling instances of run
batch_commands.append("sleep 0.5")
# Wait for all background jobs to finish
for _ in range(hyper_config[batch_type]["tasks_per_worker"]):
batch_commands.append(f"{aiml_path} -n {jobid} &")
batch_commands.append("wait")

else:
# Append the default command to batch_commands
# Single worker — no GPU pinning, no background execution needed.
batch_commands.append(f"{aiml_path} -n {jobid}")

return batch_commands
@@ -305,7 +302,7 @@ def prepare_pbs_launch_script(hyper_config: str, model_config: str) -> List[str]
pbs_options.append(f"#PBS -{arg} {val}")
elif arg in ["o", "e"]:
if val != "/dev/null":
_val = os.path.append(hyper_config["save_path"], val)
_val = os.path.join(hyper_config["save_path"], val)
# info?
pbs_options.append(f"#PBS -{arg} {_val}")
else: