Remove deprecated torque options
This commit removes the deprecated torque/openpbs queue options:
* QUEUE_QUERY_TIMEOUT
* NUM_NODES
* NUM_CPUS_PER_NODE
* QSTAT_OPTIONS
* MEMORY_PER_JOB
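
For anyone migrating an existing ERT configuration, a hypothetical before/after sketch based on the deprecation messages this commit deletes (option values are illustrative, not defaults):

    -- Before: deprecated TORQUE queue options
    QUEUE_SYSTEM TORQUE
    QUEUE_OPTION TORQUE NUM_NODES 1
    QUEUE_OPTION TORQUE NUM_CPUS_PER_NODE 4
    QUEUE_OPTION TORQUE MEMORY_PER_JOB 16gb
    QUEUE_OPTION TORQUE QSTAT_OPTIONS -x
    QUEUE_OPTION TORQUE QUEUE_QUERY_TIMEOUT 126

    -- After: NUM_CPU replaces NUM_NODES and NUM_CPUS_PER_NODE,
    -- REALIZATION_MEMORY replaces MEMORY_PER_JOB, and the two
    -- query-related options are simply removed
    QUEUE_SYSTEM TORQUE
    NUM_CPU 4
    REALIZATION_MEMORY 16gb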
jonathan-eq committed Dec 20, 2024
1 parent 2e9c71a commit 177a3f3
Showing 15 changed files with 19 additions and 472 deletions.
5 changes: 2 additions & 3 deletions docs/ert/reference/configuration/keywords.rst
@@ -1825,9 +1825,8 @@ in :ref:`queue-system-chapter`. In brief, the queue systems have the following options:
``BHIST_CMD``, ``SUBMIT_SLEEP``, ``PROJECT_CODE``, ``EXCLUDE_HOST``,
``MAX_RUNNING``
* :ref:`TORQUE <pbs-systems>` — ``QSUB_CMD``, ``QSTAT_CMD``, ``QDEL_CMD``,
``QSTAT_OPTIONS``, ``QUEUE``, ``CLUSTER_LABEL``, ``MAX_RUNNING``, ``NUM_NODES``,
``NUM_CPUS_PER_NODE``, ``MEMORY_PER_JOB``, ``KEEP_QSUB_OUTPUT``, ``SUBMIT_SLEEP``,
``QUEUE_QUERY_TIMEOUT``
``QUEUE``, ``CLUSTER_LABEL``, ``MAX_RUNNING``, ``KEEP_QSUB_OUTPUT``,
``SUBMIT_SLEEP``
* :ref:`SLURM <slurm-systems>` — ``SBATCH``, ``SCANCEL``, ``SCONTROL``, ``SACCT``,
``SQUEUE``, ``PARTITION``, ``SQUEUE_TIMEOUT``, ``MAX_RUNTIME``, ``INCLUDE_HOST``,
``EXCLUDE_HOST``, ``MAX_RUNNING``
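With the hunk above, the documented TORQUE option list shrinks to the non-deprecated set. A minimal queue section using only surviving options might look like this (queue name and limits are made up):

    QUEUE_SYSTEM TORQUE
    QUEUE_OPTION TORQUE QUEUE permanent_q
    QUEUE_OPTION TORQUE MAX_RUNNING 30
    QUEUE_OPTION TORQUE KEEP_QSUB_OUTPUT 1
    QUEUE_OPTION TORQUE SUBMIT_SLEEP 0.5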
54 changes: 0 additions & 54 deletions docs/ert/reference/configuration/queue.rst
@@ -251,12 +251,6 @@ The following is a list of all queue-specific configuration options:
QUEUE_OPTION TORQUE QSTAT_CMD /path/to/my/qstat
QUEUE_OPTION TORQUE QDEL_CMD /path/to/my/qdel

.. _torque_qstat_options:
.. topic:: QSTAT_OPTIONS

Options to be supplied to the ``qstat`` command. This defaults to :code:`-x`,
which tells the ``qstat`` command to include exited processes.

.. _torque_queue:
.. topic:: QUEUE

@@ -283,37 +277,6 @@ The following is a list of all queue-specific configuration options:

If ``n`` is zero (the default), then it is set to the number of realizations.

.. _torque_nodes_cpus:
.. topic:: NUM_NODES, NUM_CPUS_PER_NODE

The support for running a job over multiple nodes is deprecated in Ert,
but was previously accomplished by setting NUM_NODES to a number larger
than 1.

NUM_CPUS_PER_NODE is deprecated, instead please use NUM_CPU to specify the
number of CPU cores to reserve on a single compute node.

.. _torque_memory_per_job:
.. topic:: MEMORY_PER_JOB

You can specify the amount of memory you will need for running your
job. This will ensure that not too many jobs will run on a single
shared memory node at once, possibly crashing the compute node if it
runs out of memory.

You can get an indication of the memory requirement by watching the
course of a local run using the ``htop`` utility. Whether you should set
the peak memory usage as your requirement or a lower figure depends on
how simultaneously each job will run.

The option to be supplied will be used as a string in the ``qsub``
argument. You must specify the unit, either ``gb`` or ``mb`` as in
the example::

QUEUE_OPTION TORQUE MEMORY_PER_JOB 16gb

By default, this value is not set.

.. _torque_keep_qsub_output:
.. topic:: KEEP_QSUB_OUTPUT

@@ -332,23 +295,6 @@ The following is a list of all queue-specific configuration options:

QUEUE_OPTION TORQUE SUBMIT_SLEEP 0.5

.. _torque_queue_query_timeout:
.. topic:: QUEUE_QUERY_TIMEOUT

The driver allows the backend TORQUE/PBS system to be flaky, i.e. it may
intermittently not respond and give error messages when submitting jobs
or asking for job statuses. The timeout (in seconds) determines how long
ERT will wait before it will give up. Applies to job submission (``qsub``)
and job status queries (``qstat``). Default is 126 seconds.

ERT will do exponential sleeps, starting at 2 seconds, and the provided
timeout is a maximum. Let the timeout be sums of series like 2+4+8+16+32+64
in order to be explicit about the number of retries. Set to zero to disallow
flakyness, setting it to 2 will allow for one re-attempt, and 6 will give two
re-attempts. Example allowing six retries::

QUEUE_OPTION TORQUE QUEUE_QUERY_TIMEOUT 254

.. _torque_project_code:
.. topic:: PROJECT_CODE

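The removed QUEUE_QUERY_TIMEOUT text above describes a doubling sleep series whose running sum is capped by the timeout. A small Python sketch of that arithmetic (one reading of the prose, not the driver's actual retry code):

    def backoff_sleeps(timeout: int) -> list[int]:
        # Sleeps start at 2 s and double; stop once the next sleep
        # would push the total past the configured timeout.
        sleeps: list[int] = []
        nxt = 2
        while sum(sleeps) + nxt <= timeout:
            sleeps.append(nxt)
            nxt *= 2
        return sleeps

    print(backoff_sleeps(126))  # [2, 4, 8, 16, 32, 64] -> the 126 s default
    print(backoff_sleeps(254))  # [2, 4, 8, 16, 32, 64, 128], one extra retry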
24 changes: 0 additions & 24 deletions docs/everest/config_generated.rst
@@ -1037,27 +1037,12 @@ Simulation settings
The kill command


**qstat_options (optional)**
Type: *Optional[str]*

Options to be supplied to the qstat command. This defaults to -x, which tells the qstat command to include exited processes.


**cluster_label (optional)**
Type: *Optional[str]*

The name of the cluster you are running simulations in.


**memory_per_job (optional)**
Type: *Optional[str]*

You can specify the amount of memory you will need for running your job. This will ensure that not too many jobs will run on a single shared memory node at once, possibly crashing the compute node if it runs out of memory.
You can get an indication of the memory requirement by watching the course of a local run using the htop utility. Whether you should set the peak memory usage as your requirement or a lower figure depends on how simultaneously each job will run.
The option to be supplied will be used as a string in the qsub argument. You must specify the unit, either gb or mb.



**keep_qsub_output (optional)**
Type: *Optional[int]*

@@ -1070,15 +1055,6 @@ Simulation settings
To avoid stressing the TORQUE/PBS system you can instruct the driver to sleep for every submit request. The argument to the SUBMIT_SLEEP is the number of seconds to sleep for every submit, which can be a fraction like 0.5


**queue_query_timeout (optional)**
Type: *Optional[int]*


The driver allows the backend TORQUE/PBS system to be flaky, i.e. it may intermittently not respond and give error messages when submitting jobs or asking for job statuses. The timeout (in seconds) determines how long ERT will wait before it will give up. Applies to job submission (qsub) and job status queries (qstat). Default is 126 seconds.
ERT will do exponential sleeps, starting at 2 seconds, and the provided timeout is a maximum. Let the timeout be sums of series like 2+4+8+16+32+64 in order to be explicit about the number of retries. Set to zero to disallow flakyness, setting it to 2 will allow for one re-attempt, and 6 will give two re-attempts. Example allowing six retries:



**project_code (optional)**
Type: *Optional[str]*

30 changes: 0 additions & 30 deletions src/ert/config/parsing/config_schema_deprecations.py
@@ -181,36 +181,6 @@
"for the Ensemble Smoother update algorithm. "
"Please use ENKF_ALPHA and STD_CUTOFF keywords instead.",
),
DeprecationInfo(
keyword="QUEUE_OPTION",
message="QUEUE_QUERY_TIMEOUT as QUEUE_OPTION is ignored. "
"Please remove the line.",
check=lambda line: "QUEUE_QUERY_TIMEOUT" in line,
),
DeprecationInfo(
keyword="QUEUE_OPTION",
message="QSTAT_OPTIONS as QUEUE_OPTION to the TORQUE is ignored. "
"Please remove the line.",
check=lambda line: "QSTAT_OPTIONS" in line,
),
DeprecationInfo(
keyword="QUEUE_OPTION",
message="NUM_CPUS_PER_NODE as QUEUE_OPTION to Torque is deprecated and will removed in "
"the future. Replace by NUM_CPU.",
check=lambda line: "NUM_CPUS_PER_NODE" in line,
),
DeprecationInfo(
keyword="QUEUE_OPTION",
message="NUM_NODES as QUEUE_OPTION to Torque is deprecated and will removed in "
"the future. Replace by NUM_CPU on a single compute node.",
check=lambda line: "NUM_NODES" in line,
),
DeprecationInfo(
keyword="QUEUE_OPTION",
message="MEMORY_PER_JOB as QUEUE_OPTION to TORQUE is deprecated and will be removed in "
"the future. Replace by REALIZATION_MEMORY.",
check=lambda line: "MEMORY_PER_JOB" in line,
),
DeprecationInfo(
keyword="QUEUE_OPTION",
message="Memory requirements in LSF should now be set using REALIZATION_MEMORY and not"
48 changes: 0 additions & 48 deletions src/ert/config/queue_config.py
@@ -126,34 +126,19 @@ class TorqueQueueOptions(QueueOptions):
qstat_cmd: NonEmptyString | None = None
qdel_cmd: NonEmptyString | None = None
queue: NonEmptyString | None = None
memory_per_job: NonEmptyString | None = None
num_cpus_per_node: pydantic.PositiveInt = 1
num_nodes: pydantic.PositiveInt = 1
cluster_label: NonEmptyString | None = None
job_prefix: NonEmptyString | None = None
keep_qsub_output: bool = False

qstat_options: str | None = pydantic.Field(default=None, deprecated=True)
queue_query_timeout: str | None = pydantic.Field(default=None, deprecated=True)

@property
def driver_options(self) -> dict[str, Any]:
driver_dict = asdict(self)
driver_dict.pop("name")
driver_dict["queue_name"] = driver_dict.pop("queue")
driver_dict.pop("max_running")
driver_dict.pop("submit_sleep")
driver_dict.pop("qstat_options")
driver_dict.pop("queue_query_timeout")
return driver_dict

@pydantic.field_validator("memory_per_job")
@classmethod
def check_memory_per_job(cls, value: str | None) -> str | None:
if not torque_memory_usage_format.validate(value):
raise ValueError("wrong memory format")
return value


@pydantic.dataclasses.dataclass
class SlurmQueueOptions(QueueOptions):
Expand Down Expand Up @@ -322,23 +307,6 @@ def from_dict(cls, config_dict: ConfigDict) -> QueueConfig:
if tags:
queue_options.project_code = "+".join(tags)

if selected_queue_system == QueueSystem.TORQUE:
_check_num_cpu_requirement(
config_dict.get("NUM_CPU", 1), queue_options, raw_queue_options
)

for _queue_vals in all_validated_queue_options.values():
if (
isinstance(_queue_vals, TorqueQueueOptions)
and _queue_vals.memory_per_job
and realization_memory
):
_throw_error_or_warning(
"Do not specify both REALIZATION_MEMORY and TORQUE option MEMORY_PER_JOB",
"MEMORY_PER_JOB",
selected_queue_system == QueueSystem.TORQUE,
)

return QueueConfig(
job_script,
realization_memory,
@@ -369,22 +337,6 @@ def submit_sleep(self) -> float:
return self.queue_options.submit_sleep


def _check_num_cpu_requirement(
num_cpu: int, torque_options: TorqueQueueOptions, raw_queue_options: list[list[str]]
) -> None:
flattened_raw_options = [item for line in raw_queue_options for item in line]
if (
"NUM_NODES" not in flattened_raw_options
and "NUM_CPUS_PER_NODE" not in flattened_raw_options
):
return
if num_cpu != torque_options.num_nodes * torque_options.num_cpus_per_node:
raise ConfigValidationError(
f"When NUM_CPU is {num_cpu}, then the product of NUM_NODES ({torque_options.num_nodes}) "
f"and NUM_CPUS_PER_NODE ({torque_options.num_cpus_per_node}) must be equal."
)


def _parse_realization_memory_str(realization_memory_str: str) -> int:
if "-" in realization_memory_str:
raise ConfigValidationError(
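Only the head of the surviving _parse_realization_memory_str is visible in this hunk. As a rough sketch of the unit handling a REALIZATION_MEMORY string such as "16gb" implies (a hypothetical standalone helper, not ERT's implementation):

    UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

    def parse_memory(spec: str) -> int:
        # Check longest suffixes first so "gb" wins over the bare "b".
        spec = spec.strip().lower()
        for suffix, factor in sorted(UNITS.items(), key=lambda kv: -len(kv[0])):
            if spec.endswith(suffix):
                return int(spec[: -len(suffix)]) * factor
        return int(spec)  # a bare number is taken as bytes

    assert parse_memory("16gb") == 16 * 1024**3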
38 changes: 1 addition & 37 deletions src/ert/scheduler/openpbs_driver.py
@@ -124,9 +124,6 @@ def __init__(
queue_name: str | None = None,
project_code: str | None = None,
keep_qsub_output: bool | None = None,
memory_per_job: str | None = None,
num_nodes: int | None = None,
num_cpus_per_node: int | None = None,
cluster_label: str | None = None,
job_prefix: str | None = None,
qsub_cmd: str | None = None,
@@ -139,9 +136,6 @@ def __init__(
self._queue_name = queue_name
self._project_code = project_code
self._keep_qsub_output = keep_qsub_output
self._memory_per_job = memory_per_job
self._num_nodes: int | None = num_nodes
self._num_cpus_per_node: int | None = num_cpus_per_node
self._cluster_label: str | None = cluster_label
self._job_prefix = job_prefix
self._max_pbs_cmd_attempts = 10
@@ -158,45 +152,15 @@ def __init__(
self._finished_job_ids: set[str] = set()
self._finished_iens: set[int] = set()

if self._num_nodes is not None and self._num_nodes > 1:
logger.warning(
"OpenPBSDriver initialized with num_nodes > 1, "
"this behaviour is deprecated and will be removed"
)

if self._num_cpus_per_node is not None and self._num_cpus_per_node > 1:
logger.warning(
"OpenPBSDriver initialized with num_cpus_per_node, "
"this behaviour is deprecated and will be removed. "
"Use NUM_CPU in the config instead."
)

def _build_resource_string(
self, num_cpu: int = 1, realization_memory: int = 0
) -> list[str]:
resource_specifiers: list[str] = []

cpu_resources: list[str] = []
if self._num_nodes is not None:
cpu_resources += [f"select={self._num_nodes}"]
if self._num_cpus_per_node is not None:
num_nodes = self._num_nodes or 1
if num_cpu != self._num_cpus_per_node * num_nodes:
raise ValueError(
f"NUM_CPUS_PER_NODE ({self._num_cpus_per_node}) must be equal "
f"to NUM_CPU ({num_cpu}). "
"Please remove NUM_CPUS_PER_NODE from the configuration"
)
if num_cpu > 1:
cpu_resources += [f"ncpus={num_cpu}"]
if self._memory_per_job is not None and realization_memory > 0:
raise ValueError(
"Overspecified memory pr job. "
"Do not specify both memory_per_job and realization_memory"
)
if self._memory_per_job is not None:
cpu_resources += [f"mem={self._memory_per_job}"]
elif realization_memory > 0:
if realization_memory > 0:
cpu_resources += [f"mem={realization_memory // 1024**2 }mb"]
if cpu_resources:
resource_specifiers.append(":".join(cpu_resources))
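Mirroring the trimmed _build_resource_string above, a standalone sketch of the resource request the driver now emits (illustrative re-implementation; the real method continues past the visible hunk):

    def build_resource_string(num_cpu: int = 1, realization_memory: int = 0) -> list[str]:
        # Post-change behaviour: no select=/NUM_NODES handling, just
        # ncpus from NUM_CPU and mem from REALIZATION_MEMORY (bytes -> mb).
        cpu_resources: list[str] = []
        if num_cpu > 1:
            cpu_resources.append(f"ncpus={num_cpu}")
        if realization_memory > 0:
            cpu_resources.append(f"mem={realization_memory // 1024**2}mb")
        return [":".join(cpu_resources)] if cpu_resources else []

    # e.g. 4 CPUs and 8 GiB per realization:
    print(build_resource_string(4, 8 * 1024**3))  # ['ncpus=4:mem=8192mb']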
18 changes: 0 additions & 18 deletions src/everest/config/simulator_config.py
@@ -119,21 +119,10 @@ class SimulatorConfig(BaseModel, HasErtQueueOptions, extra="forbid"): # type: ignore
qsub_cmd: str | None = Field(default="qsub", description="The submit command")
qstat_cmd: str | None = Field(default="qstat", description="The query command")
qdel_cmd: str | None = Field(default="qdel", description="The kill command")
qstat_options: str | None = Field(
default="-x",
description="Options to be supplied to the qstat command. This defaults to -x, which tells the qstat command to include exited processes.",
)
cluster_label: str | None = Field(
default=None,
description="The name of the cluster you are running simulations in.",
)
memory_per_job: str | None = Field(
default=None,
description="""You can specify the amount of memory you will need for running your job. This will ensure that not too many jobs will run on a single shared memory node at once, possibly crashing the compute node if it runs out of memory.
You can get an indication of the memory requirement by watching the course of a local run using the htop utility. Whether you should set the peak memory usage as your requirement or a lower figure depends on how simultaneously each job will run.
The option to be supplied will be used as a string in the qsub argument. You must specify the unit, either gb or mb.
""",
)
keep_qsub_output: int | None = Field(
default=0,
description="Set to 1 to keep error messages from qsub. Usually only to be used if somethign is seriously wrong with the queue environment/setup.",
@@ -142,13 +131,6 @@ class SimulatorConfig(BaseModel, HasErtQueueOptions, extra="forbid"): # type: ignore
default=0.5,
description="To avoid stressing the TORQUE/PBS system you can instruct the driver to sleep for every submit request. The argument to the SUBMIT_SLEEP is the number of seconds to sleep for every submit, which can be a fraction like 0.5",
)
queue_query_timeout: int | None = Field(
default=126,
description="""
The driver allows the backend TORQUE/PBS system to be flaky, i.e. it may intermittently not respond and give error messages when submitting jobs or asking for job statuses. The timeout (in seconds) determines how long ERT will wait before it will give up. Applies to job submission (qsub) and job status queries (qstat). Default is 126 seconds.
ERT will do exponential sleeps, starting at 2 seconds, and the provided timeout is a maximum. Let the timeout be sums of series like 2+4+8+16+32+64 in order to be explicit about the number of retries. Set to zero to disallow flakyness, setting it to 2 will allow for one re-attempt, and 6 will give two re-attempts. Example allowing six retries:
""",
)
project_code: str | None = Field(
default=None,
description="String identifier used to map hardware resource usage to a project or account. The project or account does not have to exist.",
1 change: 0 additions & 1 deletion src/everest/config_keys.py
@@ -123,7 +123,6 @@ class ConfigKeys:
TORQUE_QDEL_CMD = "qdel_cmd"
TORQUE_QUEUE_NAME = "name"
TORQUE_CLUSTER_LABEL = "cluster_label"
TORQUE_MEMORY_PER_JOB = "memory_per_job"
TORQUE_KEEP_QSUB_OUTPUT = "keep_qsub_output"
TORQUE_SUBMIT_SLEEP = "submit_sleep"
TORQUE_PROJECT_CODE = "project_code"
2 changes: 0 additions & 2 deletions src/everest/queue_driver/queue_driver.py
@@ -32,8 +32,6 @@
(ConfigKeys.TORQUE_QDEL_CMD, "QDEL_CMD"),
(ConfigKeys.TORQUE_QUEUE_NAME, "QUEUE"),
(ConfigKeys.TORQUE_CLUSTER_LABEL, "CLUSTER_LABEL"),
(ConfigKeys.CORES_PER_NODE, "NUM_CPUS_PER_NODE"),
(ConfigKeys.TORQUE_MEMORY_PER_JOB, "MEMORY_PER_JOB"),
(ConfigKeys.TORQUE_KEEP_QSUB_OUTPUT, "KEEP_QSUB_OUTPUT"),
(ConfigKeys.TORQUE_SUBMIT_SLEEP, "SUBMIT_SLEEP"),
(ConfigKeys.TORQUE_PROJECT_CODE, "PROJECT_CODE"),
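The tuples above pair Everest simulator config keys with ERT queue option names; a hypothetical sketch of how such a mapping could be applied (the helper and its dict input are assumptions for illustration):

    TORQUE_MAPPING = [
        ("qsub_cmd", "QSUB_CMD"),
        ("qdel_cmd", "QDEL_CMD"),
        ("cluster_label", "CLUSTER_LABEL"),
        ("keep_qsub_output", "KEEP_QSUB_OUTPUT"),
        ("submit_sleep", "SUBMIT_SLEEP"),
    ]

    def to_queue_options(simulator: dict) -> list[list[str]]:
        # One ["TORQUE", OPTION, value] entry per key the user actually set.
        return [
            ["TORQUE", ert_key, str(simulator[ev_key])]
            for ev_key, ert_key in TORQUE_MAPPING
            if simulator.get(ev_key) is not None
        ]

    print(to_queue_options({"qsub_cmd": "/usr/bin/qsub", "submit_sleep": 0.5}))
    # [['TORQUE', 'QSUB_CMD', '/usr/bin/qsub'], ['TORQUE', 'SUBMIT_SLEEP', '0.5']]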
2 changes: 0 additions & 2 deletions tests/ert/unit_tests/config/config_dict_generator.py
@@ -167,8 +167,6 @@ def valid_queue_values(option_name, queue_system):
elif option_name in queue_options_by_type["posfloat"][queue_system]:
return small_floats.map(str)
elif option_name in queue_options_by_type["posint"][queue_system]:
if option_name in {"NUM_NODES", "NUM_CPUS_PER_NODE"}:
return st.just("1")
return positives.map(str)
elif option_name in queue_options_by_type["bool"][queue_system]:
return booleans.map(str)