diff --git a/README.md b/README.md index eb79d93..a776c1b 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,9 @@ # **E**arth **C**omputing **H**yperparameter **O**ptimization (ECHO): A distributed hyperparameter optimization package built with Optuna +

+ ECHO logo +

+ ### Install To install a stable version of ECHO from PyPI, use the following command: @@ -97,7 +101,7 @@ pbs: kernel: "ncar_pylib /glade/work/schreck/py37" bash: ["module load ncarenv/1.3 gnu/8.3.0 openmpi/3.1.4 python/3.7.5 cuda/10.1"] batch: - l: ["select=1:ncpus=8:ngpus=1:mem=128GB", "walltime=12:00:00"] + l: ["select=1:ncpus=8:ngpus=1:mem=128GB:gpu_type=a100_80gb", "walltime=12:00:00"] A: "NAML0001" q: "casper" N: "echo_trial" @@ -164,24 +168,41 @@ The save_path field sets the location where all generated data will be saved. The log field allows you to save the logging details to a file in save_path; they will always be printed to stdout. If this field is removed, logging details will only be printed to stdout. * log: boolean to save log.txt in save_path. -The subfields within "pbs" and slurm" should mostly be familiar to you. In this example there would be 10 jobs submitted to pbs queue and 15 jobs to the slurm queue. Most HPCs just use one or the other, so make sure to only speficy what your system supports. The kernel field is optional and can be any call(s) to activate a conda/python/ncar_pylib/etc environment. Additional snippets that you might need in your launch script can be added to the list in the "bash" field. For example, as in the example above, loading modules before training a model is required. Note that the bash options will be run in order, and before the kernel field. Remove or leave the kernel field blank if you do not need it. +The subfields within "pbs" and "slurm" should mostly be familiar to you. In this example, 10 jobs would be submitted to the pbs queue and 15 jobs to the slurm queue. Most HPCs just use one or the other, so make sure to only specify what your system supports. The kernel field is optional and can be any call(s) to activate a conda/python/ncar_pylib/etc. environment. Additional snippets that you might need in your launch script can be added to the list in the "bash" field; in the example above, they load the modules required before training a model. Note that the bash options will be run in order, and before the kernel field. Remove the kernel field or leave it blank if you do not need it. + +**Casper GPU type selection**: On NCAR Casper, the GPU architecture is specified using `gpu_type` inside the PBS select string. The exact values and their node configurations are: + +| `gpu_type=` value | GPU | Notes | +|---|---|---| +| `v100_32gb` | V100 32 GB | | +| `a100_40gb` | A100 40 GB | Data & Viz queue | +| `a100_80gb` | A100 80 GB | 4 GPUs per node (ML/GPGPU) | +| `h100_80gb` | H100 80 GB | 4 GPUs per node | +| `gp100_16gb` | GP100 16 GB | Data & Viz queue | +| `l40_45gb` | L40 48 GB | Data & Viz queue | +| `mi300a` | MI300A 128 GB | Must request `ngpus=4`, exclusive-use only | + +Example: `-l select=1:ncpus=8:ngpus=1:mem=128GB:gpu_type=a100_80gb` + +See the [Casper node types documentation](https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/casper/casper-node-types/) for the authoritative list. The subfields within the "optuna" field have the following functionality: * storage: sqlite or mysql destination. * study_name: The name of the study. -* storage_type: Choose "sqlite" or "maria" if a MariaDB is setup. - * If "sqlite", the storage field will automatically be appended to the save_path field (e.g. sql:///{save_path}/mlp.db) - * If "maria", specify the full path including username:password in the storage field (for example, mysql://user:pw@someserver.ucar.edu/optuna).
+* storage_type: Choose the backend that matches your environment: + * `"sqlite"` — the storage field is joined to save_path automatically (e.g. `sqlite:///{save_path}/mlp.db`). Simple and works well for single-node runs. + * `"maria"` — specify a full MariaDB/MySQL URL in the storage field (e.g. `mysql://user:pw@someserver.ucar.edu/optuna`). Best for large distributed studies with many concurrent workers. + * `"nfs"` — uses Optuna's `JournalFileBackend` (a plain append-only file) instead of a relational database. The storage field is joined to save_path. **Recommended when SQLite locking is unreliable on your shared filesystem** (common on Lustre/GPFS mounts). Example: `storage: "study_journal.log"`, `storage_type: "nfs"`. * objective: The path to the user-supplied objective class * metric: The metric to be used to determine the model performance. -* direction: Indicates which direction the metric must go to represent improvement (pick from maximimize or minimize) +* direction: Indicates which direction the metric must go to represent improvement (pick from `maximize` or `minimize`). For **multi-objective optimization**, supply a list of directions (e.g. `direction: ["minimize", "maximize"]`) and a list of metrics (e.g. `metric: ["val_loss", "val_accuracy"]`). Multi-objective studies use `TPESampler` or `NSGAIISampler` and produce a Pareto front. Note: the `optuna.multi_objective` submodule was removed in optuna 4.0 — all multi-objective support now goes through the unified `create_study(directions=[...])` API, which ECHO uses automatically. * n_trials: The number of trials in the study. * gpu: Set to true to obtain the GPUs and their IDs * sampler + type: Choose how optuna will do parameter estimation. The default choice both here and in optuna is the [Tree-structured Parzen Estimator Approach](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f), [e.g. TPESampler](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf). See the optuna documentation for the different options. For some samplers (e.g. GridSearch) additional fields may be included (e.g. search_space). * parameters - + type: Option to select an optuna trial setting. See the [optuna Trial documentation](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html?highlight=suggest#optuna.trial.Trial.suggest_uniform) for what is available. Currently, this package supports the available options from optuna: "categorical", "discrete_uniform", "float", "int", "loguniform", and "uniform". - + settings: This dictionary field allows you to specify any settings that accompany the optuna trial type. In the example above, the named num_dense parameter is stated to be an integer with values ranging from 0 to 10. To see all the available options, consolt the [optuna Trial documentation](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html?highlight=suggest#optuna.trial.Trial.suggest_uniform) + + type: Option to select an optuna trial setting. See the [optuna Trial documentation](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html) for what is available. Currently, this package supports: "categorical", "float", "int", "loguniform", "uniform", and "discrete_uniform". 
Note: "loguniform", "uniform", and "discrete_uniform" were removed from optuna 4.0 — ECHO maps them internally to `suggest_float` for backward compatibility, but prefer "float" for new configs (use `log: True` in settings for log-uniform sampling). + + settings: This dictionary field allows you to specify any settings that accompany the optuna trial type. In the example above, the named num_dense parameter is stated to be an integer with values ranging from 0 to 10. To see all the available options, consult the [optuna Trial documentation](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html). * enqueue: [Optional] Adding this option will allow the user to add trials with pre-defined values when the study is first initialized, that will be run in order according to their id. Each entry added must be structured as a dictionary with the paramater names exactly matching all the hyperparameter name field in the parameters field. Lastly, the "log" field allows you to save the logging details to file; they will always be printed to stdout. If this field is removed, logging details will only be printed to stdout. @@ -233,12 +254,15 @@ def custom_updates(trial, conf): hyperparameters = conf["optuna"]["parameters"] # Now update some via custom rules - num_dense = trial.suggest_discrete_uniform(**hyperparameters["num_dense"]) + # Note: suggest_discrete_uniform was removed in optuna 4.0; use suggest_int or + # suggest_float(..., step=q) instead. + settings = hyperparameters["num_dense"]["settings"] + num_dense = trial.suggest_int(settings["name"], settings["low"], settings["high"]) # Update the config based on optuna's suggestion - conf["model"]["dense_hidden_dims"] = [1000 for k in range(num_dense)] + conf["model"]["dense_hidden_dims"] = [1000 for k in range(num_dense)] - return conf + return conf ``` The method should be called first thing in the custom Objective.train method (see the example Objective above). You may have noticed that the configuration (named conf) contains both hyperparameter and model fields. This package will copy the hyperparameter optuna field to the model configuration for convenience, so that we can reduce the total number of class and method dependencies (which helps me keep the code generalized). This occurs in the run.py script. 
diff --git a/docs/echo_logo.png b/docs/echo_logo.png new file mode 100644 index 0000000..8fc4af1 Binary files /dev/null and b/docs/echo_logo.png differ diff --git a/echo/examples/keras/hyperparameter.yml b/echo/examples/keras/hyperparameter.yml index e2ad7ee..ec3ac41 100644 --- a/echo/examples/keras/hyperparameter.yml +++ b/echo/examples/keras/hyperparameter.yml @@ -7,7 +7,7 @@ pbs: gpus_per_node: 1 bash: ["source ~/.bashrc", "conda activate echo"] batch: - l: ["select=1:ncpus=8:ngpus=1:mem=64GB", "walltime=12:00:00"] + l: ["select=1:ncpus=8:ngpus=1:mem=64GB:gpu_type=a100_80gb", "walltime=12:00:00"] A: "NAML0001" q: "casper" N: "keras_example" diff --git a/echo/examples/torch/hyperparameter.yml b/echo/examples/torch/hyperparameter.yml index 488772d..cfb3822 100644 --- a/echo/examples/torch/hyperparameter.yml +++ b/echo/examples/torch/hyperparameter.yml @@ -7,7 +7,7 @@ pbs: gpus_per_node: 1 bash: ["source ~/.bashrc", "conda activate echo"] batch: - l: ["select=1:ncpus=8:ngpus=1:mem=64GB", "walltime=12:00:00"] + l: ["select=1:ncpus=8:ngpus=1:mem=64GB:gpu_type=a100_80gb", "walltime=12:00:00"] A: "NAML0001" q: "casper" N: "torch_example" diff --git a/echo/optimize.py b/echo/optimize.py index 8fb9dd7..a57039e 100755 --- a/echo/optimize.py +++ b/echo/optimize.py @@ -198,70 +198,67 @@ def fix_broken_study( def generate_batch_commands( - hyper_config, batch_type, aiml_path, jobid, batch_commands = [] + hyper_config, batch_type, aiml_path, jobid, batch_commands=None ) -> List[str]: + """Build the list of shell commands that run echo-run inside a job script. + + When ``gpus_per_node`` is set, each GPU gets exactly one process pinned to + it via ``CUDA_VISIBLE_DEVICES``. All processes are launched in + the background (``&``) and a single ``wait`` is appended so the job does + not exit until every worker finishes. + + When no GPUs are present but ``tasks_per_worker > 1``, that many CPU-only + workers are launched in parallel the same way. + + Parameters + ---------- + hyper_config : dict + Full hyperparameter config. + batch_type : str + ``"pbs"`` or ``"slurm"`` — selects the sub-dict of ``hyper_config``. + aiml_path : str + The ``echo-run`` command string. + jobid : str + Scheduler job-ID variable (e.g. ``"$PBS_JOBID"`` or + ``"$SLURM_JOB_ID"``). + batch_commands : list, optional + Accumulator list; header lines (shebang, #PBS/#SBATCH directives, bash + setup) are passed in here and the run commands are appended. Defaults + to a new empty list. + + Returns + ------- + list[str] + The complete script as a list of lines. + """ + if batch_commands is None: + batch_commands = [] - # Check if "gpus_per_node" is specified in hyper_config[batch_type] if "gpus_per_node" in hyper_config[batch_type]: - # Get the list of GPU devices, or convert a single integer to a list - gpus_per_node = list(range(hyper_config[batch_type]["gpus_per_node"])) - - # Check if "tasks_per_worker" is specified in hyper_config[batch_type] - if ( - "tasks_per_worker" in hyper_config[batch_type] - and hyper_config[batch_type]["tasks_per_worker"] > 1 - ): - # Warn about the experimental nature of tasks_per_worker - logging.warning( - "The tasks_per_worker is experimental; be advised that some runs may fail." - ) - logging.warning( - "Check the log and stdout/err files if simulations are dying to see the errors." + # One process per GPU, each explicitly pinned via CUDA_VISIBLE_DEVICES. + # gpus_per_node is an integer; convert to a 0-based device list. 
+ gpus = list(range(hyper_config[batch_type]["gpus_per_node"])) + for device in gpus: + batch_commands.append( + f"CUDA_VISIBLE_DEVICES={device} {aiml_path} -n {jobid} &" ) + batch_commands.append("wait") - # Loop over the specified number of trials - for copy in range(hyper_config[batch_type]["tasks_per_worker"]): - # Loop over each GPU device - for device in gpus_per_node: - # Append the command with CUDA_VISIBLE_DEVICES={device} to batch_commands - batch_commands.append( - f"CUDA_VISIBLE_DEVICES={device}, {aiml_path} -n {jobid} &" - ) - # Allow some time between calling instances of run - batch_commands.append("sleep 0.5") - # Wait for all background jobs to finish - batch_commands.append("wait") - else: - # Loop over each GPU device without multiple trials - for device in gpus_per_node: - # Append the command with CUDA_VISIBLE_DEVICES={device} to batch_commands - batch_commands.append( - f"CUDA_VISIBLE_DEVICES={device}, {aiml_path} -n {jobid} &" - ) - batch_commands.append("wait") elif ( "tasks_per_worker" in hyper_config[batch_type] and hyper_config[batch_type]["tasks_per_worker"] > 1 ): - # Warn about the experimental nature of tasks_per_worker - logging.warning( - "The trails_per_job is experimental, be advised that some runs may fail." ) + # Multiple CPU-only workers sharing the node. logging.warning( - "Check the log and stdout/err files if simulations are dying to see the errors." + "tasks_per_worker without gpus_per_node: launching %d CPU workers in parallel.", hyper_config[batch_type]["tasks_per_worker"], ) - # Loop over the specified number of trials - for copy in range(hyper_config[batch_type]["tasks_per_worker"]): - # Append the command to batch_commands - batch_commands.append( - f"{aiml_path} -n {jobid} &" ) - # Allow some time between calling instances of run - batch_commands.append("sleep 0.5") - # Wait for all background jobs to finish + for _ in range(hyper_config[batch_type]["tasks_per_worker"]): + batch_commands.append(f"{aiml_path} -n {jobid} &") batch_commands.append("wait") + else: - # Append the default command to batch_commands + # Single worker — no GPU pinning, no background execution needed. batch_commands.append(f"{aiml_path} -n {jobid}") return batch_commands @@ -305,7 +302,7 @@ def prepare_pbs_launch_script(hyper_config: str, model_config: str) -> List[str] pbs_options.append(f"#PBS -{arg} {val}") elif arg in ["o", "e"]: if val != "/dev/null": - _val = os.path.append(hyper_config["save_path"], val) + _val = os.path.join(hyper_config["save_path"], val) pbs_options.append(f"#PBS -{arg} {_val}") else: diff --git a/echo/src/config.py b/echo/src/config.py index f7b1f3f..55208a6 100644 --- a/echo/src/config.py +++ b/echo/src/config.py @@ -6,7 +6,14 @@ from typing import Dict from echo.src.pruners import pruners from echo.src.samplers import samplers -from optuna.storages import JournalStorage, JournalFileStorage +try: + # optuna >= 4.0: JournalFileBackend lives in optuna.storages.journal + from optuna.storages.journal import JournalFileBackend + from optuna.storages import JournalStorage +except ImportError: + # optuna < 4.0: the class was called JournalFileStorage + from optuna.storages import JournalStorage + from optuna.storages import JournalFileStorage as JournalFileBackend warnings.filterwarnings("ignore") @@ -16,6 +23,23 @@ def configure_storage(hyper_config): + """Build and return an Optuna storage backend from the hyperparameter config. + + Parameters + ---------- + hyper_config : dict + Full hyperparameter config. 
Reads ``save_path``, + ``optuna.storage_type``, and ``optuna.storage``. + + Returns + ------- + str or JournalStorage + - ``"sqlite"`` → ``sqlite:///{save_path}/{storage}`` URL string. + - ``"maria"`` → the raw ``storage`` string (full MySQL URL). + - ``"nfs"`` → a :class:`optuna.storages.JournalStorage` backed by a + :class:`JournalFileBackend` at ``{save_path}/{storage}``. Use this + for shared network filesystems where SQLite locking is unreliable. + """ # Set up storage db save_path = hyper_config["save_path"] storage_type = hyper_config["optuna"]["storage_type"] @@ -27,11 +51,28 @@ def configure_storage(hyper_config): storage = storage elif storage_type == "nfs": storage = os.path.join(save_path, storage) - storage = JournalStorage(JournalFileStorage(storage)) + storage = JournalStorage(JournalFileBackend(storage)) return storage def configure_sampler(hyper_config): + """Instantiate and return the Optuna sampler specified in the config. + + If no ``sampler`` key is present under ``optuna``, defaults to + :class:`optuna.samplers.TPESampler` for both single- and multi-objective + studies. + + Parameters + ---------- + hyper_config : dict + Full hyperparameter config. Reads ``optuna.direction`` and + ``optuna.sampler`` (optional). + + Returns + ------- + optuna.samplers.BaseSampler + Configured sampler instance. + """ direction = hyper_config["optuna"]["direction"] single_objective = isinstance(direction, str) if "sampler" not in hyper_config["optuna"]: @@ -39,15 +80,30 @@ def configure_sampler(hyper_config): if single_objective: # single-objective logger.warning("\tUsing the default TPESampler class.") sampler = optuna.samplers.TPESampler() - else: # multi-objective equivalent of TPESampler - logger.warning("\tUsing the default MOTPEMultiObjectiveSampler class.") - sampler = optuna.multi_objective.samplers.MOTPEMultiObjectiveSampler() + else: # multi-objective: optuna 3+ uses TPESampler natively + logger.warning("\tUsing the default TPESampler class (multi-objective).") + sampler = optuna.samplers.TPESampler() else: sampler = samplers(hyper_config["optuna"]["sampler"]) return sampler def configure_pruner(hyper_config): + """Instantiate and return the Optuna pruner specified in the config. + + If no ``pruner`` key is present under ``optuna``, defaults to + :class:`optuna.pruners.NopPruner` (no pruning). + + Parameters + ---------- + hyper_config : dict + Full hyperparameter config. Reads ``optuna.pruner`` (optional). + + Returns + ------- + optuna.pruners.BasePruner + Configured pruner instance. + """ if "pruner" not in hyper_config["optuna"]: logger.warning("No pruner was supplied in the hyperparameter config file.") logger.warning("\tUsing the default NopPruner class (no pruning).") @@ -58,6 +114,21 @@ def configure_pruner(hyper_config): def recursive_config_reader(_dict: Dict[str, str], path: bool = None): + """Yield ``(key_path, value)`` for every leaf node in a nested dict. + + Parameters + ---------- + _dict : dict + Arbitrarily nested dictionary. + path : list, optional + Accumulated key path (used internally for recursion). + + Yields + ------ + tuple[list[str], any] + ``(key_path, leaf_value)`` where ``key_path`` is a list of keys + leading to the leaf from the top-level dict. + """ if path is None: path = [] @@ -71,6 +142,18 @@ def recursive_config_reader(_dict: Dict[str, str], path: bool = None): def recursive_update(nested_keys, dictionary, update): + """Update a nested dictionary at the location specified by ``nested_keys``. 
+ + Parameters + ---------- + nested_keys : list[str] + Ordered list of keys forming the path to the target leaf, e.g. + ``["optimizer", "learning_rate"]``. + dictionary : dict + The dict to update in-place. + update : any + The new value to assign at the leaf. + """ if isinstance(dictionary, dict) and len(nested_keys) > 1: recursive_update(nested_keys[1:], dictionary[nested_keys[0]], update) else: @@ -78,6 +161,26 @@ def config_check(hyper_config, model_config, file_check=False): + """Validate the hyperparameter and model configuration dictionaries. + + Raises ``AssertionError`` for any missing or invalid field. Can optionally + load configs from file paths instead of pre-parsed dicts. + + Parameters + ---------- + hyper_config : dict or str + Hyperparameter config dict, or a file path if ``file_check=True``. + model_config : dict or str + Model config dict, or a file path if ``file_check=True``. + file_check : bool, optional + If True, treat both arguments as file paths and load them. Also + verifies that the files exist. Default is False. + + Returns + ------- + bool + Always returns True on success; raises on failure. + """ if file_check: assert os.path.isfile( diff --git a/echo/src/pruners.py b/echo/src/pruners.py index 573f033..dc4e783 100644 --- a/echo/src/pruners.py +++ b/echo/src/pruners.py @@ -1,4 +1,3 @@ -from optuna.pruners.__init__ import __all__ as supported_pruners from typing import Dict import logging import optuna @@ -9,13 +8,45 @@ logger = logging.getLogger(__name__) +SUPPORTED_PRUNERS = [ + "BasePruner", + "HyperbandPruner", + "MedianPruner", + "NopPruner", + "PatientPruner", + "PercentilePruner", + "SuccessiveHalvingPruner", + "ThresholdPruner", +] + def pruners(pruner): + """Instantiate an Optuna pruner from a config dict. + + Parameters + ---------- + pruner : dict + Must contain a ``"type"`` key matching one of ``SUPPORTED_PRUNERS``. + All remaining keys are forwarded as keyword arguments to the pruner + constructor. **Note:** the ``"type"`` key is popped from an internal + copy, so the caller's dict is not mutated. + + Returns + ------- + optuna.pruners.BasePruner + Configured pruner instance. + + Raises + ------ + AssertionError + If ``pruner["type"]`` is not in ``SUPPORTED_PRUNERS``. + """ + pruner = dict(pruner) # copy so the caller's dict is not mutated _type = pruner.pop("type") assert ( - _type in supported_pruners - ), f"Pruner {_type} is not valid. Select from {supported_pruners}" + _type in SUPPORTED_PRUNERS + ), f"Pruner {_type} is not valid. Select from {SUPPORTED_PRUNERS}" if _type == "BasePruner": return optuna.pruners.BasePruner(**pruner) @@ -36,6 +67,30 @@ class KerasPruningCallback(object): + """Keras-compatible callback that integrates Optuna trial pruning. + + On each epoch end, the monitored metric is reported to the trial. If the + trial's pruner decides the trial should be stopped, :class:`optuna.TrialPruned` + is raised, which Optuna treats as a pruned (but not failed) trial. + + Parameters + ---------- + trial : optuna.Trial + The current Optuna trial. + monitor : str + The metric key to read from the Keras ``logs`` dict (e.g. + ``"val_loss"``). + interval : int, optional + Epoch interval at which to report; currently unused (every epoch is + always reported). Default is 1. + + Examples + -------- + .. 
code-block:: python + + callbacks = [KerasPruningCallback(trial, monitor="val_loss")] + model.fit(..., callbacks=callbacks) + """ def __init__(self, trial, monitor, interval=1): self.trial = trial self.monitor = monitor @@ -63,4 +118,4 @@ def on_epoch_end(self, epoch, logs=None): self.trial.report(current_score, step=epoch) if self.trial.should_prune(): message = "Trial was pruned at epoch {}.".format(epoch) - raise optuna.structs.TrialPruned(message) + raise optuna.TrialPruned(message) diff --git a/echo/src/reporting.py b/echo/src/reporting.py index 8434139..87a40b6 100644 --- a/echo/src/reporting.py +++ b/echo/src/reporting.py @@ -5,15 +5,6 @@ import pandas as pd from collections import Counter -import collections -from typing import Any -from typing import DefaultDict -from typing import Dict -from typing import List -from typing import Set -from typing import Tuple -from optuna.trial._state import TrialState -from optuna import multi_objective import optuna warnings.filterwarnings("ignore") @@ -81,93 +72,6 @@ def trial_report(study): return state_histo -def to_df(study): - attrs = ( - "number", - "values", - # "intermediate_values", - "datetime_start", - "datetime_complete", - "params", - "user_attrs", - "system_attrs", - "state", - ) - multi_index = False - - trials = study.get_trials(deepcopy=False) - - attrs_to_df_columns = collections.OrderedDict() - for attr in attrs: - if attr.startswith("_"): - # Python conventional underscores are omitted in the dataframe. - df_column = attr[1:] - else: - df_column = attr - attrs_to_df_columns[attr] = df_column - - # column_agg is an aggregator of column names. - # Keys of column agg are attributes of `FrozenTrial` such as 'trial_id' and 'params'. - # Values are dataframe columns such as ('trial_id', '') and ('params', 'n_layers'). - column_agg: DefaultDict[str, Set] = collections.defaultdict(set) - non_nested_attr = "" - - def _create_record_and_aggregate_column( - trial: "optuna.trial.FrozenTrial", - ) -> Dict[Tuple[str, str], Any]: - - n_objectives = len(study.directions) - trial = multi_objective.trial.FrozenMultiObjectiveTrial( - n_objectives, - trial, - )._trial - - record = {} - for attr, df_column in attrs_to_df_columns.items(): - value = getattr(trial, attr) - if isinstance(value, TrialState): - # Convert TrialState to str and remove the common prefix. - value = str(value).split(".")[-1] - if isinstance(value, dict): - for nested_attr, nested_value in value.items(): - record[(df_column, nested_attr)] = nested_value - column_agg[attr].add((df_column, nested_attr)) - elif isinstance(value, list): - # Expand trial.values. - for nested_attr, nested_value in enumerate(value): - record[(df_column, nested_attr)] = nested_value - column_agg[attr].add((df_column, nested_attr)) - elif attr == "values": - # trial.values should be None when the trial's state is FAIL or PRUNED. 
- assert value is None - if value is None: - value = [None for k in range(study.n_objectives)] - - for nested_attr, nested_value in enumerate(value): - record[(df_column, nested_attr)] = nested_value - column_agg[attr].add((df_column, nested_attr)) - else: - record[(df_column, non_nested_attr)] = value - column_agg[attr].add((df_column, non_nested_attr)) - return record - - records = [_create_record_and_aggregate_column(trial) for trial in trials] - - columns: List[Tuple[str, str]] = sum( - (sorted(column_agg[k]) for k in attrs if k in column_agg), [] - ) - - df = pd.DataFrame(records, columns=pd.MultiIndex.from_tuples(columns)) - - if not multi_index: - # Flatten the `MultiIndex` columns where names are concatenated with underscores. - # Filtering is required to omit non-nested columns avoiding unwanted trailing - # underscores. - df.columns = [ - "_".join(filter(lambda c: c, map(lambda c: str(c), col))) for col in columns - ] - return df - def study_report(study, hyper_config): n_trials = hyper_config["optuna"]["n_trials"] diff --git a/echo/src/samplers.py b/echo/src/samplers.py index d3ab244..105624a 100644 --- a/echo/src/samplers.py +++ b/echo/src/samplers.py @@ -1,14 +1,18 @@ -from optuna.samplers._tpe.sampler import TPESampler -from optuna.samplers._tpe.multi_objective_sampler import MOTPESampler -from optuna.samplers._search_space import IntersectionSearchSpace -from optuna.samplers._search_space import intersection_search_space -from optuna.samplers._random import RandomSampler -from optuna.samplers._partial_fixed import PartialFixedSampler -from optuna.samplers import NSGAIISampler -from optuna.samplers._grid import GridSampler -from optuna.samplers._cmaes import CmaEsSampler -from optuna.samplers._base import BaseSampler -from optuna.samplers.__init__ import __all__ as supported_samplers +import optuna +from optuna.samplers import ( + TPESampler, + RandomSampler, + GridSampler, + CmaEsSampler, + NSGAIISampler, + PartialFixedSampler, + BaseSampler, +) +try: + # optuna >= 4.0 moved these to optuna.search_space + from optuna.search_space import IntersectionSearchSpace, intersection_search_space +except ImportError: + from optuna.samplers import IntersectionSearchSpace, intersection_search_space import logging import warnings @@ -17,13 +21,48 @@ logger = logging.getLogger(__name__) +SUPPORTED_SAMPLERS = [ + "TPESampler", + "GridSampler", + "RandomSampler", + "CmaEsSampler", + "NSGAIISampler", + "PartialFixedSampler", + "BaseSampler", + "IntersectionSearchSpace", + "intersection_search_space", +] + def samplers(sampler): + """Instantiate an Optuna sampler from a config dict. + + Parameters + ---------- + sampler : dict + Must contain a ``"type"`` key matching one of ``SUPPORTED_SAMPLERS``. + All remaining keys are forwarded as keyword arguments to the sampler + constructor. **Note:** the ``"type"`` key is popped from an internal + copy, so the caller's dict is not mutated. + + Returns + ------- + optuna.samplers.BaseSampler + Configured sampler instance. + + Raises + ------ + AssertionError + If ``sampler["type"]`` is not in ``SUPPORTED_SAMPLERS``. + OSError + If ``type == "GridSampler"`` and ``"search_space"`` is not provided. + """ + sampler = dict(sampler) # copy so the caller's dict is not mutated _type = sampler.pop("type") assert ( - _type in supported_samplers - ), f"Sampler {_type} is not valid. Select from {supported_samplers}" + _type in SUPPORTED_SAMPLERS + ), f"Sampler {_type} is not valid. 
Select from {SUPPORTED_SAMPLERS}" if _type == "TPESampler": return TPESampler(**sampler) @@ -38,8 +77,6 @@ def samplers(sampler): return CmaEsSampler(**sampler) if _type == "IntersectionSearchSpace": return IntersectionSearchSpace(**sampler) - if _type == "MOTPESampler": - return MOTPESampler(**sampler) if _type == "BaseSampler": return BaseSampler(**sampler) if _type == "NSGAIISampler": diff --git a/echo/src/trial_suggest.py b/echo/src/trial_suggest.py index d49702d..ddddbd6 100644 --- a/echo/src/trial_suggest.py +++ b/echo/src/trial_suggest.py @@ -18,6 +18,42 @@ def trial_suggest_loader(trial, config): + """Call the appropriate ``trial.suggest_*`` method for a parameter config. + + Supports both current Optuna 4.x types and legacy aliases for backward + compatibility with older YAML configs. + + Parameters + ---------- + trial : optuna.Trial + The active trial object. + config : dict + Parameter config with keys: + + - ``"type"`` : one of ``supported_trials`` + - ``"settings"`` : dict of keyword arguments forwarded to the suggest + method (e.g. ``name``, ``low``, ``high``, ``choices``, ``step``). + + Legacy types are remapped internally: + + ==================== ========================================== + YAML type Optuna call + ==================== ========================================== + ``"loguniform"`` ``suggest_float(..., log=True)`` + ``"uniform"`` ``suggest_float(...)`` + ``"discrete_uniform"`` ``suggest_float(..., step=q)`` + ==================== ========================================== + + Returns + ------- + float, int, or str + The value suggested by Optuna for this trial. + + Raises + ------ + AssertionError + If ``config["type"]`` is not in ``supported_trials``. + """ _type = config["type"] @@ -28,12 +64,21 @@ def trial_suggest_loader(trial, config): if _type == "categorical": return trial.suggest_categorical(**config["settings"]) if _type == "discrete_uniform": - return int(trial.suggest_discrete_uniform(**config["settings"])) + # suggest_discrete_uniform was removed in optuna 4.0; map to suggest_float with step + settings = dict(config["settings"]) + q = settings.pop("q", None) + if q is not None: + settings["step"] = q + return float(trial.suggest_float(**settings)) if _type == "float": return float(trial.suggest_float(**config["settings"])) if _type == "int": return int(trial.suggest_int(**config["settings"])) if _type == "loguniform": - return float(trial.suggest_loguniform(**config["settings"])) + # suggest_loguniform was removed in optuna 4.0; map to suggest_float with log=True + settings = dict(config["settings"]) + settings["log"] = True + return float(trial.suggest_float(**settings)) if _type == "uniform": - return float(trial.suggest_uniform(**config["settings"])) + # suggest_uniform was removed in optuna 4.0; map to suggest_float + return float(trial.suggest_float(**config["settings"])) diff --git a/echo/tests/test_echo.py b/echo/tests/test_echo.py index 0a08e89..2f59972 100644 --- a/echo/tests/test_echo.py +++ b/echo/tests/test_echo.py @@ -1,7 +1,26 @@ -from echo.src.config import config_check +from echo.src.config import ( + config_check, + configure_storage, + configure_sampler, + configure_pruner, + recursive_config_reader, + recursive_update, +) +from echo.src.trial_suggest import trial_suggest_loader +from echo.src.base_objective import BaseObjective +from echo.src.reporting import successful_trials, get_sec +from echo.src.pruners import KerasPruningCallback +from echo.optimize import ( + prepare_pbs_launch_script, + prepare_slurm_launch_script, + 
generate_batch_commands, +) # import tensorflow as tf +import optuna +import pytest import warnings import yaml +import sys import os warnings.filterwarnings("ignore") @@ -45,3 +64,657 @@ def test_read_torch_config(): model_config = yaml.load(f, Loader=yaml.FullLoader) config_check(hyper_config, model_config) + + +# ── PBS / SLURM launch script helpers ──────────────────────────────────────── + +@pytest.fixture +def cli_args(monkeypatch): + """Patch sys.argv so that prepare_*_launch_script doesn't crash under pytest. + + Both functions read sys.argv[1] (hyper config path) and sys.argv[2] (model + config path) to build the echo-run command. When pytest is invoked without + an explicit file argument, sys.argv has only index 0, causing an IndexError. + """ + monkeypatch.setattr(sys, "argv", ["echo-opt", "hyperparameter.yml", "model.yml"]) + + +# ── PBS launch script tests ─────────────────────────────────────────────────── + +def _base_pbs_config(extra_select=None): + """Return a minimal hyper_config dict with a PBS section.""" + select = "select=1:ncpus=8:ngpus=1:mem=64GB:gpu_type=a100_80gb" + if extra_select: + select = extra_select + return { + "save_path": "/tmp/echo_test", + "pbs": { + "jobs": 1, + "batch": { + "l": [select, "walltime=04:00:00"], + "A": "TEST0001", + "q": "casper", + "N": "test_job", + "o": "/tmp/echo_test/out", + "e": "/tmp/echo_test/err", + }, + "bash": ["source ~/.bashrc"], + "kernel": "conda activate echo", + }, + "optuna": { + "study_name": "test", + "storage": "test.db", + "storage_type": "sqlite", + "objective": __file__, # just needs to be a real file + "direction": "maximize", + "metric": "val_accuracy", + "n_trials": 10, + "gpu": True, + }, + } + + +def test_pbs_launch_script_has_shebang(cli_args): + hyper_config = _base_pbs_config() + script = prepare_pbs_launch_script(hyper_config, {}) + assert script[0] == "#!/bin/bash -l", f"Expected shebang, got: {script[0]}" + + +def test_pbs_launch_script_walltime_present(cli_args): + hyper_config = _base_pbs_config() + script = prepare_pbs_launch_script(hyper_config, {}) + assert any("walltime=04:00:00" in line for line in script), ( + "walltime not found in PBS script" + ) + + +def test_pbs_launch_script_gpu_type_in_select(cli_args): + """The gpu_type parameter must appear in the #PBS -l select line.""" + hyper_config = _base_pbs_config() + script = prepare_pbs_launch_script(hyper_config, {}) + select_lines = [l for l in script if "#PBS -l" in l and "select=" in l] + assert len(select_lines) >= 1, "No #PBS -l select line found" + assert any("gpu_type=a100_80gb" in l for l in select_lines), ( + f"gpu_type=a100_80gb not found in select lines: {select_lines}" + ) + + +def test_pbs_launch_script_various_gpu_types(cli_args): + """All valid Casper gpu_type values should pass through unchanged.""" + valid_gpu_types = [ + "v100_32gb", "a100_40gb", "a100_80gb", "h100_80gb", + "mi300a", "gp100_16gb", "l40_45gb", + ] + for gpu_type in valid_gpu_types: + select = f"select=1:ncpus=8:ngpus=1:mem=64GB:gpu_type={gpu_type}" + hyper_config = _base_pbs_config(extra_select=select) + script = prepare_pbs_launch_script(hyper_config, {}) + select_lines = [l for l in script if "#PBS -l" in l and "select=" in l] + assert any(gpu_type in l for l in select_lines), ( + f"gpu_type={gpu_type} not found in PBS script lines: {select_lines}" + ) + + +def test_pbs_launch_script_echo_run_command(cli_args): + """The script must contain an echo-run invocation.""" + hyper_config = _base_pbs_config() + script = prepare_pbs_launch_script(hyper_config, {}) + 
assert any("echo-run" in line for line in script), ( + "echo-run command not found in PBS script" + ) + + +def test_pbs_launch_script_bash_lines(cli_args): + hyper_config = _base_pbs_config() + script = prepare_pbs_launch_script(hyper_config, {}) + assert any("source ~/.bashrc" in line for line in script) + assert any("conda activate echo" in line for line in script) + + +# ── SLURM launch script tests ───────────────────────────────────────────────── + +def _base_slurm_config(): + return { + "save_path": "/tmp/echo_test", + "slurm": { + "jobs": 1, + "batch": { + "J": "test_job", + "t": "04:00:00", + "N": "1", + "n": "1", + "mem": "64G", + "partition": "gpu", + }, + "bash": ["source ~/.bashrc"], + "kernel": "conda activate echo", + }, + "optuna": { + "study_name": "test", + "storage": "test.db", + "storage_type": "sqlite", + "objective": __file__, + "direction": "maximize", + "metric": "val_accuracy", + "n_trials": 10, + "gpu": True, + }, + } + + +def test_slurm_launch_script_has_shebang(cli_args): + hyper_config = _base_slurm_config() + script = prepare_slurm_launch_script(hyper_config, {}) + assert script[0] == "#!/bin/bash -l", f"Expected shebang, got: {script[0]}" + + +def test_slurm_launch_script_sbatch_options(cli_args): + hyper_config = _base_slurm_config() + script = prepare_slurm_launch_script(hyper_config, {}) + sbatch_lines = [l for l in script if l.startswith("#SBATCH")] + assert len(sbatch_lines) > 0, "No #SBATCH lines found" + + +def test_slurm_launch_script_echo_run_command(cli_args): + hyper_config = _base_slurm_config() + script = prepare_slurm_launch_script(hyper_config, {}) + assert any("echo-run" in line for line in script), ( + "echo-run command not found in SLURM script" + ) + + +def test_slurm_launch_script_bash_lines(cli_args): + hyper_config = _base_slurm_config() + script = prepare_slurm_launch_script(hyper_config, {}) + assert any("source ~/.bashrc" in line for line in script) + assert any("conda activate echo" in line for line in script) + + +# ── Helpers shared across new tests ────────────────────────────────────────── + +def _minimal_hyper_config(direction="minimize"): + """Return a minimal valid hyper_config for unit tests.""" + return { + "save_path": "/tmp/echo_test", + "optuna": { + "study_name": "test", + "storage": "test.db", + "storage_type": "sqlite", + "objective": __file__, # any real file works + "direction": direction, + "metric": "val_loss", + "n_trials": 10, + "gpu": False, + }, + } + + +def _make_trial(direction="minimize"): + """Return a fresh in-memory Optuna Trial for testing suggest methods.""" + study = optuna.create_study(direction=direction) + return study.ask() + + +# ── config_check validation tests ──────────────────────────────────────────── + +def test_config_check_missing_save_path(): + """config_check must raise when save_path is absent.""" + with pytest.raises(AssertionError, match="save_path"): + config_check({"optuna": {}}, {}) + + +def test_config_check_invalid_direction(): + """config_check must raise for an unsupported direction string.""" + cfg = _minimal_hyper_config() + cfg["optuna"]["direction"] = "sideways" + with pytest.raises(AssertionError): + config_check(cfg, {}) + + +def test_config_check_invalid_storage_type(): + """config_check must raise for an unsupported storage_type.""" + cfg = _minimal_hyper_config() + cfg["optuna"]["storage_type"] = "postgres" + with pytest.raises(AssertionError): + config_check(cfg, {}) + + +def test_config_check_multi_objective_valid(): + """config_check should pass for a valid 
multi-objective config.""" + cfg = _minimal_hyper_config() + cfg["optuna"]["direction"] = ["minimize", "maximize"] + assert config_check(cfg, {}) is True + + +def test_config_check_multi_objective_invalid_direction(): + """config_check must raise when any direction in the list is invalid.""" + cfg = _minimal_hyper_config() + cfg["optuna"]["direction"] = ["minimize", "diagonal"] + with pytest.raises(AssertionError): + config_check(cfg, {}) + + +def test_config_check_pbs_missing_walltime(): + """config_check must raise when the PBS batch config has no walltime.""" + cfg = _minimal_hyper_config() + cfg["pbs"] = {"batch": {"l": ["select=1:ncpus=8"]}} + with pytest.raises(AssertionError): + config_check(cfg, {}) + + +def test_config_check_slurm_missing_walltime(): + """config_check must raise when the SLURM batch config has no 't' key.""" + cfg = _minimal_hyper_config() + cfg["slurm"] = {"batch": {}} + with pytest.raises(AssertionError): + config_check(cfg, {}) + + +# ── configure_storage tests ─────────────────────────────────────────────────── + +def test_configure_storage_sqlite(): + """configure_storage with sqlite should return a valid sqlite:/// URL.""" + cfg = _minimal_hyper_config() + storage = configure_storage(cfg) + assert isinstance(storage, str) + assert storage.startswith("sqlite:///") + assert "test.db" in storage + + +def test_configure_storage_maria(): + """configure_storage with maria should return the raw storage string.""" + cfg = _minimal_hyper_config() + cfg["optuna"]["storage_type"] = "maria" + cfg["optuna"]["storage"] = "mysql://user:pw@server/optuna" + storage = configure_storage(cfg) + assert storage == "mysql://user:pw@server/optuna" + + +def test_configure_storage_nfs(tmp_path): + """configure_storage with nfs should return a JournalStorage object.""" + cfg = _minimal_hyper_config() + cfg["save_path"] = str(tmp_path) + cfg["optuna"]["storage_type"] = "nfs" + storage = configure_storage(cfg) + # JournalStorage is not a plain string + assert not isinstance(storage, str) + + +# ── configure_sampler tests ─────────────────────────────────────────────────── + +def test_configure_sampler_default_single_objective(): + """No sampler key → TPESampler for single-objective.""" + cfg = _minimal_hyper_config() + sampler = configure_sampler(cfg) + assert isinstance(sampler, optuna.samplers.TPESampler) + + +def test_configure_sampler_default_multi_objective(): + """No sampler key → TPESampler for multi-objective.""" + cfg = _minimal_hyper_config() + cfg["optuna"]["direction"] = ["minimize", "maximize"] + sampler = configure_sampler(cfg) + assert isinstance(sampler, optuna.samplers.TPESampler) + + +def test_configure_sampler_random(): + """Explicit RandomSampler is instantiated correctly.""" + cfg = _minimal_hyper_config() + cfg["optuna"]["sampler"] = {"type": "RandomSampler"} + sampler = configure_sampler(cfg) + assert isinstance(sampler, optuna.samplers.RandomSampler) + + +def test_configure_sampler_grid_requires_search_space(): + """GridSampler without search_space should raise OSError.""" + cfg = _minimal_hyper_config() + cfg["optuna"]["sampler"] = {"type": "GridSampler"} + with pytest.raises(OSError, match="search_space"): + configure_sampler(cfg) + + +# ── configure_pruner tests ──────────────────────────────────────────────────── + +def test_configure_pruner_default(): + """No pruner key → NopPruner.""" + cfg = _minimal_hyper_config() + pruner = configure_pruner(cfg) + assert isinstance(pruner, optuna.pruners.NopPruner) + + +def test_configure_pruner_median(): + """Explicit 
MedianPruner is instantiated correctly.""" + cfg = _minimal_hyper_config() + cfg["optuna"]["pruner"] = {"type": "MedianPruner"} + pruner_obj = configure_pruner(cfg) + assert isinstance(pruner_obj, optuna.pruners.MedianPruner) + + +# ── trial_suggest tests ─────────────────────────────────────────────────────── + +def test_trial_suggest_float(): + trial = _make_trial() + result = trial_suggest_loader( + trial, {"type": "float", "settings": {"name": "lr", "low": 0.0, "high": 1.0}} + ) + assert isinstance(result, float) + assert 0.0 <= result <= 1.0 + + +def test_trial_suggest_int(): + trial = _make_trial() + result = trial_suggest_loader( + trial, {"type": "int", "settings": {"name": "layers", "low": 1, "high": 10}} + ) + assert isinstance(result, int) + assert 1 <= result <= 10 + + +def test_trial_suggest_categorical(): + trial = _make_trial() + result = trial_suggest_loader( + trial, + {"type": "categorical", "settings": {"name": "act", "choices": ["relu", "elu"]}}, + ) + assert result in ["relu", "elu"] + + +def test_trial_suggest_loguniform(): + """Legacy 'loguniform' type maps to suggest_float(log=True).""" + trial = _make_trial() + result = trial_suggest_loader( + trial, + {"type": "loguniform", "settings": {"name": "lr", "low": 1e-5, "high": 1e-1}}, + ) + assert isinstance(result, float) + assert 1e-5 <= result <= 1e-1 + + +def test_trial_suggest_uniform(): + """Legacy 'uniform' type maps to suggest_float.""" + trial = _make_trial() + result = trial_suggest_loader( + trial, + {"type": "uniform", "settings": {"name": "dr", "low": 0.0, "high": 0.5}}, + ) + assert isinstance(result, float) + assert 0.0 <= result <= 0.5 + + +def test_trial_suggest_discrete_uniform(): + """Legacy 'discrete_uniform' with q maps to suggest_float(step=q).""" + trial = _make_trial() + result = trial_suggest_loader( + trial, + { + "type": "discrete_uniform", + "settings": {"name": "ddr", "low": 0.0, "high": 1.0, "q": 0.1}, + }, + ) + assert isinstance(result, float) + assert 0.0 <= result <= 1.0 + + +def test_trial_suggest_invalid_type(): + """Unknown type should raise AssertionError.""" + trial = _make_trial() + with pytest.raises(AssertionError, match="not valid"): + trial_suggest_loader(trial, {"type": "unknown_type", "settings": {}}) + + +# ── recursive_config_reader / recursive_update tests ───────────────────────── + +def test_recursive_config_reader_flat(): + """Flat dict yields single-element key paths.""" + d = {"a": 1, "b": 2} + result = list(recursive_config_reader(d)) + assert (["a"], 1) in result + assert (["b"], 2) in result + + +def test_recursive_config_reader_nested(): + """Nested dict yields full key paths to leaves.""" + d = {"optimizer": {"learning_rate": 0.001}} + result = list(recursive_config_reader(d)) + assert (["optimizer", "learning_rate"], 0.001) in result + + +def test_recursive_update(): + """recursive_update modifies the correct nested field in-place.""" + d = {"optimizer": {"learning_rate": 0.001, "weight_decay": 0.0}} + recursive_update(["optimizer", "learning_rate"], d, 0.01) + assert d["optimizer"]["learning_rate"] == 0.01 + assert d["optimizer"]["weight_decay"] == 0.0 # unchanged + + +# ── reporting utility tests ─────────────────────────────────────────────────── + +def test_get_sec_hours(): + assert get_sec("01:00:00") == 3600 + + +def test_get_sec_minutes(): + assert get_sec("00:30:00") == 1800 + + +def test_get_sec_mixed(): + assert get_sec("12:30:45") == 12 * 3600 + 30 * 60 + 45 + + +def test_successful_trials_all_complete(): + """successful_trials counts all COMPLETE 
trials.""" + study = optuna.create_study() + study.optimize(lambda t: t.suggest_float("x", 0, 1), n_trials=3) + assert successful_trials(study) == 3 + + +def test_successful_trials_includes_pruned(): + """successful_trials counts PRUNED trials as successful.""" + study = optuna.create_study() + + def objective(trial): + trial.report(999.0, step=0) + raise optuna.TrialPruned() + + study.optimize(objective, n_trials=2) + assert successful_trials(study) == 2 + + +# ── BaseObjective tests ─────────────────────────────────────────────────────── + +def _make_obj_config(save_path, metric="val_loss", parameters=None): + if parameters is None: + parameters = {} + return { + "optuna": { + "save_path": str(save_path), + "metric": metric, + "parameters": parameters, + } + } + + +def test_base_objective_update_config_flat_key(tmp_path): + """update_config substitutes a flat (non-nested) parameter.""" + config = _make_obj_config(tmp_path, parameters={ + "learning_rate": { + "type": "float", + "settings": {"name": "lr", "low": 1e-4, "high": 1e-2}, + } + }) + config["learning_rate"] = 0.001 # pre-existing value in config + + study = optuna.create_study() + trial = study.ask() + obj = BaseObjective(config, metric="val_loss") + obj.set_properties(node_id=None, device="cpu") + updated = obj.update_config(trial) + assert 1e-4 <= updated["learning_rate"] <= 1e-2 + + +def test_base_objective_update_config_nested_key(tmp_path): + """update_config handles colon-separated nested keys.""" + config = _make_obj_config(tmp_path, parameters={ + "optimizer:learning_rate": { + "type": "float", + "settings": {"name": "lr", "low": 1e-4, "high": 1e-2}, + } + }) + config["optimizer"] = {"learning_rate": 0.001} + + study = optuna.create_study() + trial = study.ask() + obj = BaseObjective(config, metric="val_loss") + obj.set_properties(node_id=None, device="cpu") + updated = obj.update_config(trial) + assert 1e-4 <= updated["optimizer"]["learning_rate"] <= 1e-2 + + +def test_base_objective_save_single_objective(tmp_path): + """save() writes a CSV and returns the scalar metric value.""" + config = _make_obj_config(tmp_path) + study = optuna.create_study(direction="minimize") + trial = study.ask() + + obj = BaseObjective(config, metric="val_loss") + obj.set_properties(node_id=None, device="cpu") + result = obj.save(trial, {"val_loss": 0.42, "loss": 0.5}) + + assert result == 0.42 + csv_path = tmp_path / "trial_results.csv" + assert csv_path.exists() + import pandas as pd + df = pd.read_csv(csv_path) + assert "val_loss" in df.columns + assert "loss" in df.columns + + +def test_base_objective_save_multi_objective(tmp_path): + """save() with a list metric returns a list of values.""" + config = _make_obj_config(tmp_path, metric=["val_loss", "val_accuracy"]) + config["optuna"]["metric"] = ["val_loss", "val_accuracy"] + study = optuna.create_study(directions=["minimize", "maximize"]) + trial = study.ask() + + obj = BaseObjective(config, metric=["val_loss", "val_accuracy"]) + obj.set_properties(node_id=None, device="cpu") + result = obj.save(trial, {"val_loss": 0.3, "val_accuracy": 0.9}) + + assert result == [0.3, 0.9] + + +def test_base_objective_save_missing_metric_raises(tmp_path): + """save() raises AssertionError when the metric key is missing from results.""" + config = _make_obj_config(tmp_path) + study = optuna.create_study(direction="minimize") + trial = study.ask() + + obj = BaseObjective(config, metric="val_loss") + obj.set_properties(node_id=None, device="cpu") + with pytest.raises(AssertionError, match="val_loss"): + 
obj.save(trial, {"loss": 0.5}) + + +# ── KerasPruningCallback tests ──────────────────────────────────────────────── + +def test_keras_pruning_callback_triggers_prune(): + """on_epoch_end raises TrialPruned when the pruner decides to stop the trial.""" + study = optuna.create_study(pruner=optuna.pruners.ThresholdPruner(upper=1.0)) + trial = study.ask() + callback = KerasPruningCallback(trial, monitor="val_loss") + with pytest.raises(optuna.TrialPruned): + # 999.0 exceeds upper=1.0 → should prune + callback.on_epoch_end(epoch=0, logs={"val_loss": 999.0}) + + +def test_keras_pruning_callback_no_prune(): + """on_epoch_end does not raise when the NopPruner is in use.""" + study = optuna.create_study() # NopPruner by default + trial = study.ask() + callback = KerasPruningCallback(trial, monitor="val_loss") + # Should complete without raising + callback.on_epoch_end(epoch=0, logs={"val_loss": 0.5}) + + +def test_keras_pruning_callback_missing_monitor_key(): + """on_epoch_end is a no-op when the monitored metric is absent from logs.""" + study = optuna.create_study() + trial = study.ask() + callback = KerasPruningCallback(trial, monitor="val_loss") + # 'val_loss' not in logs — should return without error or pruning + callback.on_epoch_end(epoch=0, logs={"loss": 0.5}) + + +# ── generate_batch_commands tests ───────────────────────────────────────────── + +def _batch_cfg(batch_type, gpus_per_node=None, tasks_per_worker=None): + """Minimal config dict for generate_batch_commands tests.""" + inner = {} + if gpus_per_node is not None: + inner["gpus_per_node"] = gpus_per_node + if tasks_per_worker is not None: + inner["tasks_per_worker"] = tasks_per_worker + return {batch_type: inner} + + +def test_generate_batch_commands_single_worker(): + """No GPUs, no tasks_per_worker → single command, no & or wait.""" + cfg = _batch_cfg("pbs") + cmds = generate_batch_commands(cfg, "pbs", "echo-run h.yml m.yml", "$PBS_JOBID") + assert len(cmds) == 1 + assert "&" not in cmds[0] + assert "wait" not in cmds[0] + assert "echo-run" in cmds[0] + + +def test_generate_batch_commands_gpu_pinning(): + """With gpus_per_node=4, each GPU gets its own process with correct CUDA_VISIBLE_DEVICES.""" + cfg = _batch_cfg("pbs", gpus_per_node=4) + cmds = generate_batch_commands(cfg, "pbs", "echo-run h.yml m.yml", "$PBS_JOBID") + # Expect 4 worker lines + 1 wait + worker_lines = [l for l in cmds if "echo-run" in l] + assert len(worker_lines) == 4 + for i, line in enumerate(worker_lines): + assert f"CUDA_VISIBLE_DEVICES={i}" in line + assert line.endswith("&") + assert cmds[-1] == "wait" + + +def test_generate_batch_commands_gpu_pinning_no_duplicate_devices(): + """Each GPU device ID appears exactly once across all worker lines.""" + cfg = _batch_cfg("pbs", gpus_per_node=4) + cmds = generate_batch_commands(cfg, "pbs", "echo-run h.yml m.yml", "$PBS_JOBID") + worker_lines = [l for l in cmds if "echo-run" in l] + seen_devices = [int(l.split("CUDA_VISIBLE_DEVICES=")[1].split(" ")[0]) for l in worker_lines] + assert sorted(seen_devices) == list(range(4)) + + +def test_generate_batch_commands_tasks_per_worker_no_gpu(): + """tasks_per_worker without GPUs → N background processes, no CUDA prefix.""" + cfg = _batch_cfg("slurm", tasks_per_worker=3) + cmds = generate_batch_commands(cfg, "slurm", "echo-run h.yml m.yml", "$SLURM_JOB_ID") + worker_lines = [l for l in cmds if "echo-run" in l] + assert len(worker_lines) == 3 + for line in worker_lines: + assert "CUDA_VISIBLE_DEVICES" not in line + assert line.endswith("&") + assert cmds[-1] == "wait" + 
+ +def test_generate_batch_commands_no_sleep_lines(): + """No sleep lines should appear in any generated script.""" + for cfg in [ + _batch_cfg("pbs", gpus_per_node=2), + _batch_cfg("pbs", tasks_per_worker=2), + _batch_cfg("pbs"), + ]: + cmds = generate_batch_commands(cfg, "pbs", "echo-run h.yml m.yml", "$PBS_JOBID") + assert not any("sleep" in l for l in cmds), f"sleep found in: {cmds}" + + +def test_generate_batch_commands_independent_calls(): + """Each call to generate_batch_commands starts with a fresh list (no mutable default bleed).""" + cfg = _batch_cfg("pbs") + cmds1 = generate_batch_commands(cfg, "pbs", "echo-run h.yml m.yml", "$PBS_JOBID") + cmds2 = generate_batch_commands(cfg, "pbs", "echo-run h.yml m.yml", "$PBS_JOBID") + assert len(cmds1) == len(cmds2) == 1 diff --git a/requirements.txt b/requirements.txt index 0781894..19da1d0 100644 --- a/requirements.txt +++ b/requirements.txt @@ -2,7 +2,7 @@ numpy scipy matplotlib pandas -optuna<4.0 +optuna>=3.0 setuptools pyyaml scikit-learn diff --git a/setup.cfg b/setup.cfg index 3b80aa6..c9c087d 100644 --- a/setup.cfg +++ b/setup.cfg @@ -32,7 +32,7 @@ install_requires = numpy scipy matplotlib - optuna<4.0 + optuna>=3.0 setuptools pandas scikit-learn
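Finally, to illustrate the new launcher behavior implemented in `echo/optimize.py` and exercised by the tests above, this is the command list `generate_batch_commands` is expected to return for a two-GPU PBS config (the config and command strings here are illustrative):

```python
from echo.optimize import generate_batch_commands

cfg = {"pbs": {"gpus_per_node": 2}}
cmds = generate_batch_commands(cfg, "pbs", "echo-run hyper.yml model.yml", "$PBS_JOBID")

# One worker per GPU, each pinned via CUDA_VISIBLE_DEVICES, then a single wait
assert cmds == [
    "CUDA_VISIBLE_DEVICES=0 echo-run hyper.yml model.yml -n $PBS_JOBID &",
    "CUDA_VISIBLE_DEVICES=1 echo-run hyper.yml model.yml -n $PBS_JOBID &",
    "wait",
]
```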