Merge pull request #2194 from ekouts/feat/refactor_policies
[feat] Execute build phase asynchronously
Vasileios Karakasis authored Jan 21, 2022
2 parents 535d694 + f1e3bd4 commit f02136b
Showing 17 changed files with 693 additions and 537 deletions.
2 changes: 1 addition & 1 deletion docs/_static/img/async-exec-policy.svg
3 changes: 3 additions & 0 deletions docs/_static/img/regression-task-state-machine.svg
2 changes: 1 addition & 1 deletion docs/_static/img/serial-exec-policy.svg
36 changes: 36 additions & 0 deletions docs/config_reference.rst
@@ -101,6 +101,18 @@ System Configuration
A list of hostname regular expression patterns in Python `syntax <https://docs.python.org/3.8/library/re.html>`__, which will be used by the framework in order to automatically select a system configuration.
For the auto-selection process, see `here <configure.html#picking-a-system-configuration>`__.

.. js:attribute:: .systems[].max_local_jobs

The maximum number of forced local build or run jobs allowed.

Forced local jobs run within the execution context of ReFrame.

:required: No
:default: ``8``

.. versionadded:: 3.10.0
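
For illustration, a minimal sketch of a system entry setting this limit could look as follows (the system name, hostname pattern and partition shown are placeholders, not part of this change):

.. code-block:: python

   # Hypothetical excerpt from a ReFrame settings file; names are placeholders.
   site_configuration = {
       'systems': [
           {
               'name': 'mycluster',
               'hostnames': [r'mycluster-login\d+'],
               # Allow at most 4 forced local build or run jobs at any time
               'max_local_jobs': 4,
               'partitions': [
                   {
                       'name': 'login',
                       'scheduler': 'local',
                       'launcher': 'local',
                       'environs': ['builtin']
                   }
               ]
           }
       ]
   }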


.. js:attribute:: .systems[].modules_system

:required: No
@@ -1289,6 +1301,30 @@ General Configuration
Timeout value in seconds used when checking if a git repository exists.


.. js:attribute:: .general[].dump_pipeline_progress

Dump pipeline progress for the asynchronous execution policy in ``pipeline-progress.json``.
This option is meant for debugging purposes only.

:required: No
:default: ``False``

.. versionadded:: 3.10.0


.. js:attribute:: .general[].pipeline_timeout

Timeout in seconds for advancing the pipeline in the asynchronous execution policy.

ReFrame's asynchronous execution policy will try to advance as many tests as possible through their pipeline, but some tests may take too long to proceed (e.g., due to copying large files), blocking the advancement of previously started tests.
If this timeout value is exceeded and at least one test has progressed, ReFrame will stop processing new tests and will instead try to further advance the tests that have already started.

:required: No
:default: ``10``

.. versionadded:: 3.10.0
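
As a sketch (the values shown are purely illustrative), both :js:attr:`dump_pipeline_progress` and :js:attr:`pipeline_timeout` can be set in the ``general`` section of the configuration file:

.. code-block:: python

   # Hypothetical excerpt from a ReFrame settings file; values are illustrative.
   site_configuration = {
       'general': [
           {
               # Write the pipeline progress to pipeline-progress.json (debugging only)
               'dump_pipeline_progress': True,
               # Allow up to 30s for advancing the pipeline before pausing
               # the intake of new tests
               'pipeline_timeout': 30
           }
       ]
   }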


.. js:attribute:: .general[].remote_detect

:required: No
6 changes: 3 additions & 3 deletions docs/manpage.rst
@@ -344,13 +344,13 @@ Options controlling ReFrame execution
- ``async``: Tests will be executed asynchronously.
This is the default policy.

The ``async`` execution policy executes the run phase of tests asynchronously by submitting their associated jobs in a non-blocking way.
ReFrame's runtime monitors the progress of each test and will resume the pipeline execution of an asynchronously spawned test as soon as its run phase has finished.
The ``async`` execution policy executes the build and run phases of tests asynchronously by submitting their associated jobs in a non-blocking way.
ReFrame's runtime monitors the progress of each test and will resume the pipeline execution of an asynchronously spawned test as soon as its build or run phase has finished.
Note that the rest of the pipeline stages are still executed sequentially in this policy.

Concurrency can be controlled by setting the :js:attr:`max_jobs` system partition configuration parameter.
As soon as the concurrency limit is reached, ReFrame will first poll the status of all its pending tests to check if any execution slots have been freed up.
If there are tests that have finished their run phase, ReFrame will keep pushing tests for execution until the concurrency limit is reached again.
If there are tests that have finished their build or run phase, ReFrame will keep pushing tests for execution until the concurrency limit is reached again.
If no execution slots are available, ReFrame will throttle job submission.
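
For example, a sketch of a partition entry that caps the number of concurrently spawned test jobs could look like this (the partition name, scheduler and launcher are placeholders):

.. code-block:: python

   # Hypothetical partition entry; 'max_jobs' is the relevant knob here.
   {
       'name': 'compute',
       'scheduler': 'slurm',
       'launcher': 'srun',
       'environs': ['builtin'],
       'max_jobs': 16   # at most 16 test jobs in flight on this partition
   }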

.. option:: --force-local
70 changes: 66 additions & 4 deletions docs/pipeline.rst
@@ -52,7 +52,8 @@ A `job descriptor <regression_test_api.html#reframe.core.pipeline.RegressionTest
The Build Phase
---------------

During this phase the source code associated with the test is compiled using the current programming environment.
During this phase, a job script for compiling the test is created and submitted for execution.
The source code associated with the test is compiled using the current programming environment.
If the test is `"run-only," <regression_test_api.html#reframe.core.pipeline.RunOnlyRegressionTest>`__ this phase is a no-op.

Before building the test, all the `resources <regression_test_api.html#reframe.core.pipeline.RegressionTest.sourcesdir>`__ associated with it are copied to the test case's stage directory.
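
As a rough sketch (the source file name is hypothetical and only meant to illustrate which attributes this phase consumes), a test typically drives the build through its source settings:

.. code-block:: python

   # Hypothetical test sketch; 'hello.c' is illustrative only.
   import reframe as rfm
   import reframe.utility.sanity as sn


   @rfm.simple_test
   class BuildPhaseExample(rfm.RegressionTest):
       valid_systems = ['*']
       valid_prog_environs = ['*']
       # Copied to the stage directory and compiled with the current
       # programming environment during the build phase
       sourcepath = 'hello.c'

       @sanity_function
       def validate(self):
           return sn.assert_found(r'Hello', self.stdout)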
@@ -100,10 +101,10 @@ Execution Policies

All regression tests in ReFrame will execute the pipeline stages described above.
However, how exactly this pipeline will be executed is the responsibility of the test execution policy.
There are two execution policies in ReFrame: the serial and the asynchronous one.
There are two execution policies in ReFrame: the serial and the asynchronous execution policy.

In the serial execution policy, a new test gets into the pipeline after the previous one has exited.
As the figure below shows, this can lead to long idling times in the run phase, since the execution blocks until the associated test job finishes.
As the figure below shows, this can lead to long idling times in the build and run phases, since the execution blocks until the associated test job finishes.


.. figure:: _static/img/serial-exec-policy.svg
@@ -114,7 +115,7 @@ As the figure below shows, this can lead to long idling times in the run phase,


In the asynchronous execution policy, multiple tests can be in flight simultaneously.
When a test enters the run phase, ReFrame does not block, but continues by picking the next test case to run.
When a test enters the build or run phase, ReFrame does not block, but continues by picking the next test case to run.
This continues until no more test cases are left for execution or until a maximum concurrency limit is reached.
At the end, ReFrame enters a busy-wait loop monitoring the spawned test cases.
As soon as a test case finishes, ReFrame resumes its pipeline and runs it to completion.
@@ -133,6 +134,67 @@ When the `concurrency limit <config_reference.html#.systems[].partitions[].max_j

ReFrame uses polling to check the status of the spawned jobs, but it does so in a dynamic way, in order to remain responsive while avoiding overloading the system job scheduler with excessive polling.


ReFrame's runtime internally encapsulates each test in a task, which is scheduled for execution.
This task can be in different states and is responsible for executing the test's pipeline.
The following state diagram shows how test tasks are scheduled, as well as when the various test pipeline stages are executed.

.. figure:: _static/img/regression-task-state-machine.svg
:align: center
:alt: State diagram of the execution of test tasks.

:sub:`State diagram of the execution of test tasks with annotations for the execution of the actual pipeline stages.`

There are a number of things to notice in this diagram:

- If a test encounters an exception, it is marked as a failure.
Even normal failures, such as dependency failures and sanity or performance failures, are exceptions raised explicitly by the framework during a pipeline stage.
- The pipeline stages that are executed asynchronously, namely the ``compile`` and ``run`` stages, are split into sub-stages for submitting the corresponding job and for checking or waiting for its completion.
This is why in ReFrame error messages you may see ``compile_complete`` or ``run_complete`` being reported as the failing stage.
- The execution of a test may be stalled if there are not enough execution slots available for submitting compile or run jobs on the target partition.
- Although a test is officially marked as "completed" only when its cleanup phase is executed, it is reported as success or failure as soon as it is "retired," i.e., as soon as its performance stage has passed successfully.
- For successful tests, the ``cleanup`` stage is executed *after* the test is reported as a "success," since a test may not clean up its resources until all of its immediate dependencies have also finished successfully.
If the ``cleanup`` phase fails, the test is not marked as a failure, but this condition is marked as an error.


.. versionchanged:: 3.10.0
The ``compile`` stage is now also executed asynchronously.


--------------------------------------
Where is each pipeline stage executed?
--------------------------------------

There are two execution contexts where a pipeline stage can be executed: the ReFrame execution context and the partition execution context.
The *ReFrame execution context* is where ReFrame executes.
This is always the local host.
The *partition execution context* can either be local or remote depending on how the partition is configured.
The following table shows in which context each pipeline stage executes:

.. table::
:align: center

============== =================
Pipeline Stage Execution Context
============== =================
*Setup* ReFrame
*Compile* ReFrame if :attr:`~reframe.core.pipeline.RegressionTest.build_locally` or :attr:`~reframe.core.pipeline.RegressionTest.local` is :obj:`True` or if :option:`--force-local` is passed, partition otherwise.
*Run* ReFrame if :attr:`~reframe.core.pipeline.RegressionTest.local` is :obj:`True` or if :option:`--force-local` is passed, partition otherwise.
*Sanity* ReFrame
*Performance* ReFrame
*Cleanup* ReFrame
============== =================

It should be noted that even if the partition execution context is local, it is treated differently from the ReFrame execution context.
For example, a test executing in the ReFrame context will not respect the :js:attr:`max_jobs` partition configuration option, even if the partition is local.
To control the concurrency of the ReFrame execution context, users should set the :js:attr:`.systems[].max_local_jobs` option instead.
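
As a sketch of how a test can opt out of local building (the test body is hypothetical; :attr:`~reframe.core.pipeline.RegressionTest.build_locally` is the only relevant setting here), the compile stage can be pushed into the partition execution context as follows:

.. code-block:: python

   # Hypothetical sketch: submit the compile job through the partition's
   # scheduler instead of running it in the ReFrame execution context.
   import reframe as rfm
   import reframe.utility.sanity as sn


   @rfm.simple_test
   class RemoteBuildExample(rfm.RegressionTest):
       valid_systems = ['*']
       valid_prog_environs = ['*']
       sourcepath = 'hello.c'
       build_locally = False   # compile in the partition execution context

       @sanity_function
       def validate(self):
           return sn.assert_found(r'Hello', self.stdout)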


.. versionchanged:: 3.10.0

Execution contexts were formalized.


Timing the Test Pipeline
------------------------

55 changes: 8 additions & 47 deletions docs/tutorial_basics.rst
@@ -113,11 +113,8 @@ Now it's time to run our first test:
[==========] Running 1 check(s)
[==========] Started on Mon Oct 12 18:23:30 2020
[----------] started processing HelloTest (HelloTest)
[----------] start processing checks
[ RUN ] HelloTest on generic:default using builtin
[----------] finished processing HelloTest (HelloTest)
[----------] waiting for spawned checks to finish
[ OK ] (1/1) HelloTest on generic:default using builtin [compile: 0.389s run: 0.406s total: 0.811s]
[----------] all spawned checks have finished
@@ -283,17 +280,11 @@ Let's run the test now:
[==========] Running 2 check(s)
[==========] Started on Tue Mar 9 23:25:22 2021
[----------] started processing HelloMultiLangTest_c (HelloMultiLangTest_c)
[----------] start processing checks
[ RUN ] HelloMultiLangTest_c on generic:default using builtin
[----------] finished processing HelloMultiLangTest_c (HelloMultiLangTest_c)
[----------] started processing HelloMultiLangTest_cpp (HelloMultiLangTest_cpp)
[ RUN ] HelloMultiLangTest_cpp on generic:default using builtin
[ FAIL ] (1/2) HelloMultiLangTest_cpp on generic:default using builtin [compile: 0.006s run: n/a total: 0.023s]
==> test failed during 'compile': test staged in '/Users/user/Repositories/reframe/stage/generic/default/builtin/HelloMultiLangTest_cpp'
[----------] finished processing HelloMultiLangTest_cpp (HelloMultiLangTest_cpp)
[----------] waiting for spawned checks to finish
[ OK ] (2/2) HelloMultiLangTest_c on generic:default using builtin [compile: 0.981s run: 0.468s total: 1.475s]
[----------] all spawned checks have finished
@@ -397,17 +388,11 @@ Let's now rerun our "Hello, World!" tests:
[==========] Running 2 check(s)
[==========] Started on Tue Mar 9 23:28:00 2021
[----------] started processing HelloMultiLangTest_c (HelloMultiLangTest_c)
[----------] start processing checks
[ RUN ] HelloMultiLangTest_c on catalina:default using gnu
[ RUN ] HelloMultiLangTest_c on catalina:default using clang
[----------] finished processing HelloMultiLangTest_c (HelloMultiLangTest_c)
[----------] started processing HelloMultiLangTest_cpp (HelloMultiLangTest_cpp)
[ RUN ] HelloMultiLangTest_cpp on catalina:default using gnu
[ RUN ] HelloMultiLangTest_cpp on catalina:default using clang
[----------] finished processing HelloMultiLangTest_cpp (HelloMultiLangTest_cpp)
[----------] waiting for spawned checks to finish
[ OK ] (1/4) HelloMultiLangTest_cpp on catalina:default using gnu [compile: 0.768s run: 1.115s total: 1.909s]
[ OK ] (2/4) HelloMultiLangTest_c on catalina:default using gnu [compile: 0.600s run: 2.230s total: 2.857s]
[ OK ] (3/4) HelloMultiLangTest_c on catalina:default using clang [compile: 0.238s run: 2.129s total: 2.393s]
@@ -499,12 +484,9 @@ Let's run the test now:
[==========] Running 1 check(s)
[==========] Started on Mon Oct 12 20:02:37 2020
[----------] started processing HelloThreadedTest (HelloThreadedTest)
[----------] start processing checks
[ RUN ] HelloThreadedTest on catalina:default using gnu
[ RUN ] HelloThreadedTest on catalina:default using clang
[----------] finished processing HelloThreadedTest (HelloThreadedTest)
[----------] waiting for spawned checks to finish
[ OK ] (1/2) HelloThreadedTest on catalina:default using gnu [compile: 1.591s run: 1.205s total: 2.816s]
[ OK ] (2/2) HelloThreadedTest on catalina:default using clang [compile: 1.141s run: 0.309s total: 1.465s]
[----------] all spawned checks have finished
@@ -592,12 +574,9 @@ Let's run this version of the test now and see if it fails:
[==========] Running 1 check(s)
[==========] Started on Mon Oct 12 20:04:59 2020
[----------] started processing HelloThreadedExtendedTest (HelloThreadedExtendedTest)
[----------] start processing checks
[ RUN ] HelloThreadedExtendedTest on catalina:default using gnu
[ RUN ] HelloThreadedExtendedTest on catalina:default using clang
[----------] finished processing HelloThreadedExtendedTest (HelloThreadedExtendedTest)
[----------] waiting for spawned checks to finish
[ FAIL ] (1/2) HelloThreadedExtendedTest on catalina:default using gnu [compile: 1.222s run: 0.891s total: 2.130s]
[ FAIL ] (2/2) HelloThreadedExtendedTest on catalina:default using clang [compile: 0.835s run: 0.167s total: 1.018s]
[----------] all spawned checks have finished
@@ -718,11 +697,8 @@ The :option:`--performance-report` will generate a short report at the end for e
[==========] Running 1 check(s)
[==========] Started on Mon Oct 12 20:06:09 2020
[----------] started processing StreamTest (StreamTest)
[----------] start processing checks
[ RUN ] StreamTest on catalina:default using gnu
[----------] finished processing StreamTest (StreamTest)
[----------] waiting for spawned checks to finish
[ OK ] (1/1) StreamTest on catalina:default using gnu [compile: 1.386s run: 2.377s total: 3.780s]
[----------] all spawned checks have finished
@@ -967,7 +943,7 @@ We will only do so with the final versions of the tests from the previous sectio
[==========] Running 4 check(s)
[==========] Started on Mon Jan 25 00:34:32 2021
[----------] started processing HelloMultiLangTest_c (HelloMultiLangTest_c)
[----------] start processing checks
[ RUN ] HelloMultiLangTest_c on daint:login using builtin
[ RUN ] HelloMultiLangTest_c on daint:login using gnu
[ RUN ] HelloMultiLangTest_c on daint:login using intel
@@ -981,9 +957,6 @@ We will only do so with the final versions of the tests from the previous sectio
[ RUN ] HelloMultiLangTest_c on daint:mc using intel
[ RUN ] HelloMultiLangTest_c on daint:mc using pgi
[ RUN ] HelloMultiLangTest_c on daint:mc using cray
[----------] finished processing HelloMultiLangTest_c (HelloMultiLangTest_c)
[----------] started processing HelloMultiLangTest_cpp (HelloMultiLangTest_cpp)
[ RUN ] HelloMultiLangTest_cpp on daint:login using builtin
[ RUN ] HelloMultiLangTest_cpp on daint:login using gnu
[ RUN ] HelloMultiLangTest_cpp on daint:login using intel
@@ -997,9 +970,6 @@ We will only do so with the final versions of the tests from the previous sectio
[ RUN ] HelloMultiLangTest_cpp on daint:mc using intel
[ RUN ] HelloMultiLangTest_cpp on daint:mc using pgi
[ RUN ] HelloMultiLangTest_cpp on daint:mc using cray
[----------] finished processing HelloMultiLangTest_cpp (HelloMultiLangTest_cpp)
[----------] started processing HelloThreadedExtended2Test (HelloThreadedExtended2Test)
[ RUN ] HelloThreadedExtended2Test on daint:login using builtin
[ RUN ] HelloThreadedExtended2Test on daint:login using gnu
[ RUN ] HelloThreadedExtended2Test on daint:login using intel
@@ -1013,15 +983,9 @@ We will only do so with the final versions of the tests from the previous sectio
[ RUN ] HelloThreadedExtended2Test on daint:mc using intel
[ RUN ] HelloThreadedExtended2Test on daint:mc using pgi
[ RUN ] HelloThreadedExtended2Test on daint:mc using cray
[----------] finished processing HelloThreadedExtended2Test (HelloThreadedExtended2Test)
[----------] started processing StreamWithRefTest (StreamWithRefTest)
[ RUN ] StreamWithRefTest on daint:login using gnu
[ RUN ] StreamWithRefTest on daint:gpu using gnu
[ RUN ] StreamWithRefTest on daint:mc using gnu
[----------] finished processing StreamWithRefTest (StreamWithRefTest)
[----------] waiting for spawned checks to finish
[ OK ] ( 1/42) HelloThreadedExtended2Test on daint:login using cray [compile: 0.959s run: 56.203s total: 57.189s]
[ OK ] ( 2/42) HelloThreadedExtended2Test on daint:login using intel [compile: 2.096s run: 61.438s total: 64.062s]
[ OK ] ( 3/42) HelloMultiLangTest_cpp on daint:login using cray [compile: 0.479s run: 98.909s total: 99.406s]
@@ -1205,7 +1169,7 @@ Let's run our adapted test now:
[==========] Running 1 check(s)
[==========] Started on Mon Oct 12 20:16:03 2020
[----------] started processing StreamMultiSysTest (StreamMultiSysTest)
[----------] start processing checks
[ RUN ] StreamMultiSysTest on daint:login using gnu
[ RUN ] StreamMultiSysTest on daint:login using intel
[ RUN ] StreamMultiSysTest on daint:login using pgi
@@ -1218,9 +1182,6 @@ Let's run our adapted test now:
[ RUN ] StreamMultiSysTest on daint:mc using intel
[ RUN ] StreamMultiSysTest on daint:mc using pgi
[ RUN ] StreamMultiSysTest on daint:mc using cray
[----------] finished processing StreamMultiSysTest (StreamMultiSysTest)
[----------] waiting for spawned checks to finish
[ OK ] ( 1/12) StreamMultiSysTest on daint:gpu using pgi [compile: 2.092s run: 11.201s total: 13.307s]
[ OK ] ( 2/12) StreamMultiSysTest on daint:gpu using gnu [compile: 2.349s run: 17.140s total: 19.509s]
[ OK ] ( 3/12) StreamMultiSysTest on daint:login using pgi [compile: 2.230s run: 20.946s total: 23.189s]
