Commit

Use queue options directly from ert

oyvindeide committed Dec 19, 2024
1 parent e5c7953 commit d275a36
Showing 20 changed files with 181 additions and 676 deletions.
185 changes: 4 additions & 181 deletions docs/everest/config_generated.rst
@@ -870,25 +870,6 @@ Type: *Optional[SimulatorConfig]*

Simulation settings

**name (optional)**
Type: *Optional[str]*

Specifies which queue to use


**cores (optional)**
Type: *Optional[PositiveInt]*

Defines the number of simultaneously running forward models.

When using queue system lsf, this corresponds to number of nodes used at one
time, whereas when using the local queue system, cores refers to the number of
cores you want to use on your system.

This number is specified in Ert as MAX_RUNNING.



**cores_per_node (optional)**
Type: *Optional[PositiveInt]*

@@ -906,20 +887,6 @@ Simulation settings
Whether the batch folder for a successful simulation needs to be deleted.


**exclude_host (optional)**
Type: *Optional[str]*

Comma separated list of nodes that should be
excluded from the slurm run.


**include_host (optional)**
Type: *Optional[str]*

Comma separated list of nodes that
should be included in the slurm run


**max_runtime (optional)**
Type: *Optional[NonNegativeInt]*

@@ -929,18 +896,10 @@ Simulation settings



**options (optional)**
Type: *Optional[str]*

Used to specify options to LSF.
Examples to set memory requirement is:
* rusage[mem=1000]


**queue_system (optional)**
-Type: *Optional[Literal['lsf', 'local', 'slurm', 'torque']]*
+Type: *Optional[LocalQueueOptions, LsfQueueOptions, SlurmQueueOptions, TorqueQueueOptions]*

-Defines which queue system the everest server runs on.
+Defines which queue system Everest submits jobs to.
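Since ``queue_system`` is now a union of option objects discriminated by a ``name`` field, the queue settings that used to sit as sibling keys move under it. A hedged sketch of the two shapes (the dicts are illustrative, with field names drawn from this diff, not verbatim config excerpts):

```python
# New shape: queue_system is an options object selected by "name".
new_style = {
    "queue_system": {
        "name": "slurm",      # discriminator: picks SlurmQueueOptions
        "sbatch": "sbatch",   # slurm-specific settings now nest here
    }
}

# Old shape: queue_system was a bare string literal, with queue options
# scattered as sibling keys in the simulator section.
old_style = {
    "queue_system": "slurm",
    "sbatch": "sbatch",
}

assert new_style["queue_system"]["name"] == old_style["queue_system"]
```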


**resubmit_limit (optional)**
@@ -956,54 +915,6 @@ Simulation settings
If not specified, a default value of 1 will be used.


**sbatch (optional)**
Type: *Optional[str]*

sbatch executable to be used by the slurm queue interface.


**scancel (optional)**
Type: *Optional[str]*

scancel executable to be used by the slurm queue interface.


**scontrol (optional)**
Type: *Optional[str]*

scontrol executable to be used by the slurm queue interface.


**sacct (optional)**
Type: *Optional[str]*

sacct executable to be used by the slurm queue interface.


**squeue (optional)**
Type: *Optional[str]*

squeue executable to be used by the slurm queue interface.


**server (optional)**
Type: *Optional[str]*

Name of LSF server to use. This option is deprecated and no longer required


**slurm_timeout (optional)**
Type: *Optional[int]*

Timeout for cached status used by the slurm queue interface


**squeue_timeout (optional)**
Type: *Optional[int]*

Timeout for cached status used by the slurm queue interface.


**enable_cache (optional)**
Type: *bool*

@@ -1019,72 +930,6 @@ Simulation settings
optimizer.


**qsub_cmd (optional)**
Type: *Optional[str]*

The submit command


**qstat_cmd (optional)**
Type: *Optional[str]*

The query command


**qdel_cmd (optional)**
Type: *Optional[str]*

The kill command


**qstat_options (optional)**
Type: *Optional[str]*

Options to be supplied to the qstat command. This defaults to -x, which tells the qstat command to include exited processes.


**cluster_label (optional)**
Type: *Optional[str]*

The name of the cluster you are running simulations in.


**memory_per_job (optional)**
Type: *Optional[str]*

You can specify the amount of memory you will need for running your job. This will ensure that not too many jobs will run on a single shared memory node at once, possibly crashing the compute node if it runs out of memory.
You can get an indication of the memory requirement by watching the course of a local run using the htop utility. Whether you should set the peak memory usage as your requirement or a lower figure depends on how simultaneously each job will run.
The option to be supplied will be used as a string in the qsub argument. You must specify the unit, either gb or mb.



**keep_qsub_output (optional)**
Type: *Optional[int]*

Set to 1 to keep error messages from qsub. Usually only to be used if something is seriously wrong with the queue environment/setup.


**submit_sleep (optional)**
Type: *Optional[float]*

To avoid stressing the TORQUE/PBS system you can instruct the driver to sleep for every submit request. The argument to the SUBMIT_SLEEP is the number of seconds to sleep for every submit, which can be a fraction like 0.5


**queue_query_timeout (optional)**
Type: *Optional[int]*


The driver allows the backend TORQUE/PBS system to be flaky, i.e. it may intermittently not respond and give error messages when submitting jobs or asking for job statuses. The timeout (in seconds) determines how long ERT will wait before it will give up. Applies to job submission (qsub) and job status queries (qstat). Default is 126 seconds.
ERT will do exponential sleeps, starting at 2 seconds, and the provided timeout is a maximum. Let the timeout be sums of series like 2+4+8+16+32+64 in order to be explicit about the number of retries. Set to zero to disallow flakiness; setting it to 2 will allow for one re-attempt, and 6 will give two re-attempts. Example allowing six retries:



**project_code (optional)**
Type: *Optional[str]*

String identifier used to map hardware resource usage to a project or account. The project or account does not have to exist.



install_jobs (optional)
-----------------------
@@ -1246,32 +1091,10 @@ requirements of the forward models.



**exclude_host (optional)**
Type: *Optional[str]*

Comma separated list of nodes that should be
excluded from the slurm run


**include_host (optional)**
Type: *Optional[str]*

Comma separated list of nodes that
should be included in the slurm run


**options (optional)**
Type: *Optional[str]*

Used to specify options to LSF.
Examples to set memory requirement is:
* rusage[mem=1000]


**queue_system (optional)**
-Type: *Optional[Literal['lsf', 'local', 'slurm']]*
+Type: *Optional[LocalQueueOptions, LsfQueueOptions, SlurmQueueOptions, TorqueQueueOptions]*

-Defines which queue system the everest server runs on.
+Defines which queue system Everest submits jobs to.



8 changes: 4 additions & 4 deletions src/ert/config/queue_config.py
@@ -89,7 +89,7 @@ def driver_options(self) -> dict[str, Any]:

@pydantic.dataclasses.dataclass
class LocalQueueOptions(QueueOptions):
-    name: Literal[QueueSystem.LOCAL] = QueueSystem.LOCAL
+    name: Literal[QueueSystem.LOCAL, "local", "LOCAL"] = "local"

    @property
    def driver_options(self) -> dict[str, Any]:
@@ -98,7 +98,7 @@ def driver_options(self) -> dict[str, Any]:

@pydantic.dataclasses.dataclass
class LsfQueueOptions(QueueOptions):
-    name: Literal[QueueSystem.LSF] = QueueSystem.LSF
+    name: Literal[QueueSystem.LSF, "lsf", "LSF"] = "lsf"
    bhist_cmd: NonEmptyString | None = None
    bjobs_cmd: NonEmptyString | None = None
    bkill_cmd: NonEmptyString | None = None
@@ -121,7 +121,7 @@ def driver_options(self) -> dict[str, Any]:

@pydantic.dataclasses.dataclass
class TorqueQueueOptions(QueueOptions):
-    name: Literal[QueueSystem.TORQUE] = QueueSystem.TORQUE
+    name: Literal[QueueSystem.TORQUE, "torque", "TORQUE"] = "torque"
    qsub_cmd: NonEmptyString | None = None
    qstat_cmd: NonEmptyString | None = None
    qdel_cmd: NonEmptyString | None = None
@@ -157,7 +157,7 @@ def check_memory_per_job(cls, value: str | None) -> str | None:

@pydantic.dataclasses.dataclass
class SlurmQueueOptions(QueueOptions):
-    name: Literal[QueueSystem.SLURM] = QueueSystem.SLURM
+    name: Literal[QueueSystem.SLURM, "SLURM", "slurm"] = "slurm"
    sbatch: NonEmptyString = "sbatch"
    scancel: NonEmptyString = "scancel"
    scontrol: NonEmptyString = "scontrol"
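The widened ``Literal`` pattern above lets each ``name`` field accept the enum member as well as case-variant strings, while defaulting to the plain lowercase string. A stdlib-only sketch of the same idea (the real classes use pydantic dataclasses, which enforce the ``Literal`` automatically):

```python
from dataclasses import dataclass
from enum import Enum


class QueueSystem(str, Enum):
    SLURM = "slurm"


@dataclass
class SlurmQueueOptions:
    # Mirrors `name: Literal[QueueSystem.SLURM, "SLURM", "slurm"] = "slurm"`:
    # the enum member and both case variants validate; the default is a string.
    name: "QueueSystem | str" = "slurm"

    def __post_init__(self) -> None:
        # A str-backed enum member compares (and hashes) equal to its value,
        # so QueueSystem.SLURM is already covered by "slurm" here.
        if self.name not in {"slurm", "SLURM"}:
            raise ValueError(f"invalid queue system name: {self.name!r}")


assert SlurmQueueOptions().name == "slurm"
assert SlurmQueueOptions(name="SLURM").name == "SLURM"
assert SlurmQueueOptions(name=QueueSystem.SLURM).name == QueueSystem.SLURM
```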
24 changes: 22 additions & 2 deletions src/everest/config/everest_config.py
@@ -2,6 +2,7 @@
import os
import shutil
from argparse import ArgumentParser
+from copy import copy
from io import StringIO
from pathlib import Path
from typing import (
@@ -183,7 +184,7 @@ class EverestConfig(BaseModelWithPropertySupport):  # type: ignore
        """,
    )
    server: ServerConfig | None = Field(
-        default=None,
+        default_factory=ServerConfig,
description="""Defines Everest server settings, i.e., which queue system,
queue name and queue options are used for the everest server.
The main reason for changing this section is situations where everest
@@ -216,6 +217,25 @@ class EverestConfig(BaseModelWithPropertySupport):  # type: ignore
    config_path: Path = Field()
    model_config = ConfigDict(extra="forbid")

+    @model_validator(mode="after")
+    def validate_queue_system(self) -> Self:  # pylint: disable=E0213
+        if self.server is None:
+            self.server = ServerConfig(queue_system=copy(self.simulator.queue_system))
+        elif self.server.queue_system is None:
+            self.server.queue_system = copy(self.simulator.queue_system)
+        if (
+            str(self.simulator.queue_system.name).lower() == "local"
+            and str(self.server.queue_system.name).lower()
+            != str(self.simulator.queue_system.name).lower()
+        ):
+            raise ValueError(
+                f"The simulator is using local as queue system "
+                f"while the everest server is using {self.server.queue_system.name}. "
+                f"If the simulator is using local, so must the everest server."
+            )
+        self.server.queue_system.max_running = 1
+        return self
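The validator encodes three rules: an unset server queue inherits a copy of the simulator's, a local simulator forces a local server, and the server's queue is capped at one running job. A stdlib sketch of the same logic with simplified stand-in types (``QueueOpts`` and ``Config`` are hypothetical, not the real pydantic models):

```python
from copy import copy
from dataclasses import dataclass, field


@dataclass
class QueueOpts:
    name: str = "local"
    max_running: int = 0


@dataclass
class Config:
    simulator_queue: QueueOpts = field(default_factory=QueueOpts)
    server_queue: "QueueOpts | None" = None

    def __post_init__(self) -> None:
        # Rule 1: the server inherits a *copy*, so mutating one queue's
        # options later does not leak into the other.
        if self.server_queue is None:
            self.server_queue = copy(self.simulator_queue)
        # Rule 2: a local simulator requires a local server.
        if (
            self.simulator_queue.name.lower() == "local"
            and self.server_queue.name.lower() != "local"
        ):
            raise ValueError(
                "If the simulator is using local, so must the everest server."
            )
        # Rule 3: the server only ever needs one running job.
        self.server_queue.max_running = 1


cfg = Config(simulator_queue=QueueOpts(name="slurm"))
assert cfg.server_queue.name == "slurm"
assert cfg.server_queue.max_running == 1
```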

    @model_validator(mode="after")
    def validate_install_job_sources(self) -> Self:  # pylint: disable=E0213
        model = self.model
@@ -735,7 +755,7 @@ def with_defaults(cls, **kwargs):
            "config_path": ".",
        }

-        return EverestConfig.model_validate({**defaults, **kwargs})
+        return cls.model_validate({**defaults, **kwargs})

    @staticmethod
    def lint_config_dict(config: dict) -> list["ErrorDetails"]:
13 changes: 0 additions & 13 deletions src/everest/config/has_ert_queue_options.py

This file was deleted.

