Update state retry state machine for CPU alloc support #1543

revans2 · 2023-11-07T22:27:38Z

This updates the state machine to reduce the number of states, migrates the shuffle states to be more generic around thread pools instead of just shuffle, and adds support for reusing the state machine/retry framework for CPU memory allocations too.

This is a breaking change and NVIDIA/spark-rapids#9656 is needed in the plugin to avoid breaking the plugin.

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 · 2023-11-07T22:35:33Z

build

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 · 2023-11-16T22:50:28Z

build

revans2 · 2023-11-17T14:22:34Z

build

parthosa · 2023-11-17T18:07:19Z

nit: Fix the typo in the PR title: "state maching".

abellina

I am halfway through SparkResourceAdaptorJni. It looks good so far, and I have mostly minor comments. I'll do another pass.

abellina · 2023-11-20T14:36:32Z

src/main/java/com/nvidia/spark/rapids/jni/RmmSpark.java

    synchronized (Rmm.class) {
      if (sra != null && sra.isOpen()) {
-        sra.associateThreadWithShuffle(threadId);
+        ThreadStateRegistry.addThread(threadId, thread);
+        sra.poolThreadWorkingOnTasks(true, threadId, taskIds);


Suggested change

-        sra.poolThreadWorkingOnTasks(true, threadId, taskIds);
+        sra.poolThreadWorkingOnTasks(/*isForShuffle*/ true, threadId, taskIds);

abellina · 2023-11-20T14:36:50Z

src/main/java/com/nvidia/spark/rapids/jni/RmmSpark.java

+    synchronized (Rmm.class) {
+      if (sra != null && sra.isOpen()) {
+        ThreadStateRegistry.addThread(threadId, thread);
+        sra.poolThreadWorkingOnTasks(false, threadId, taskIds);


Suggested change

-        sra.poolThreadWorkingOnTasks(false, threadId, taskIds);
+        sra.poolThreadWorkingOnTasks(/*isForShuffle*/ false, threadId, taskIds);

Java does not support named parameters, and I really dislike /* comments */ embedded in the code. If this is a deal breaker for you, then I will change it to an enum. Just let me know.

abellina · 2023-11-20T14:47:59Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

+    other.time_lost_nanos = 0;
+  }
+
+  void add(task_metrics& other) {


Suggested change

-  void add(task_metrics& other) {
+  void add(task_metrics const& other) {
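
To illustrate the suggestion, here is a minimal sketch, assuming a simplified `task_metrics` with only two of the counters that appear elsewhere in this diff. The real type has more fields, and both the body of `add` and the `combine` helper are hypothetical. Taking the argument by `const&` documents that `add` only reads from `other` and lets callers pass const objects.

```cpp
// Simplified stand-in for the task_metrics type in the PR; the field set is an assumption.
struct task_metrics {
  int num_times_retry_throw = 0;
  long time_blocked_nanos   = 0;

  // Reads from `other` without modifying it, so a const reference is enough.
  void add(task_metrics const& other)
  {
    num_times_retry_throw += other.num_times_retry_throw;
    time_blocked_nanos += other.time_blocked_nanos;
  }
};

// Hypothetical helper: sums two snapshots without touching either input.
inline task_metrics combine(task_metrics const& a, task_metrics const& b)
{
  task_metrics total;
  total.add(a);
  total.add(b);
  return total;
}
```

With the non-const signature, the calls inside `combine` would not compile, because `a` and `b` are const references.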

abellina · 2023-11-20T14:53:06Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

@@ -306,27 +387,27 @@ class spark_resource_adaptor final : public rmm::mr::device_memory_resource {
  bool supports_streams() const noexcept override { return resource->supports_streams(); }

  /**
-   * Update the internal state so that a specific thread is associated with a task.
+   * Update the internal state so that a specific thread is dediocated to a task.


Suggested change

-   * Update the internal state so that a specific thread is dediocated to a task.
+   * Update the internal state so that a specific thread is dedicated to a task.

abellina · 2023-11-20T14:54:53Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

+      if (is_for_shuffle) {
+        throw std::invalid_argument("the thread is marked as a non-shuffle thread, and we cannot change it while there are active tasks");
+      } else {
+        throw std::invalid_argument("the thread is marked as a shuffle thread,a nd we cannot change it while there are active tasks");


Suggested change

-        throw std::invalid_argument("the thread is marked as a shuffle thread,a nd we cannot change it while there are active tasks");
+        throw std::invalid_argument("the thread is marked as a shuffle thread, and we cannot change it while there are active tasks");

abellina · 2023-11-20T15:38:51Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

        }
      }
    }
+
+    auto metrics_at = task_to_metrics.find(task_id);


This is more of a knowledge question. We have two levels of metrics, task and thread. How do these work at a high level? Do thread metrics become task metrics at some point?

I've been a bit confused tracking these in the code as well. Might be helpful to separate out the metrics updating code into separate functions, or something.

abellina · 2023-11-20T15:45:21Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

+  }
+
+  void cpu_dealloc(void* addr, size_t amount) {
+    // addr is not used yet, but is here in case we want it in the future.


Should we file a follow-on?

Probably not. It is there so we can do memory leak detection/tracking. But we don't have a need for it now. Probably later. Also it matches the RMM APIs so I wanted to keep it the same.

abellina · 2023-11-20T15:45:51Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

+
+  void cpu_postalloc_success(void* addr, size_t amount, bool blocking, bool was_recursive) {
+    // addr is not used yet, but is here in case we want it in the future.
+    // amount is not used yet, but is here in case we want it for debugginig/metrics.


Follow-on to file (perhaps the same as the one below?)

abellina · 2023-11-20T15:51:47Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

-                                                                                jclass,
-                                                                                jlong ptr,
-                                                                                jlong thread_id)
+JNIEXPORT void JNICALL Java_com_nvidia_spark_rapids_jni_SparkResourceAdaptor_submittingToPool(JNIEnv* env,


Nit: indentation is a little off here and in a couple of the following methods.

abellina · 2023-11-20T15:58:34Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp


+            return true;


Maybe add a comment here saying that true == is_recursive? I know we have a comment on line 1198, but it's not clear (to me) how this boolean is used from the interface, so a comment on the return value would be nice.

jbrennan333

Still reading through the code. Only minor comments/questions so far.

jbrennan333 · 2023-11-20T15:55:09Z

src/main/java/com/nvidia/spark/rapids/jni/RmmSpark.java

+  }
+
+  /**
+   * A dedicated task thread is about to wait on work done work on a pool that could transitively


typo: work done work on

jbrennan333 · 2023-11-20T15:58:57Z

src/main/java/com/nvidia/spark/rapids/jni/RmmSpark.java

   * @param threadId the ID of the thread to throw the exception (not java thread id).
-   * @param numOOMs the number of times the RetryOOM should be thrown
+   * @param numOOMs the number of times the GpuRetryOOM should be thrown
   */
  public static void forceRetryOOM(long threadId, int numOOMs) {


(nit) Should we rename these to forceGpuRetryOOM, etc.?

jbrennan333 · 2023-11-20T16:02:50Z

src/main/java/com/nvidia/spark/rapids/jni/RmmSpark.java

+
+  /**
+   * The allocation failed, and spilling didn't save it.
+   * @param wasOom wat the failure caused by an OOM or something else.


jbrennan333 · 2023-11-20T16:32:25Z

src/test/java/com/nvidia/spark/rapids/jni/RmmSparkTest.java

+        // Ignored
+      }
+      // Force an exception
+      RmmSpark.forceRetryOOM(threadId);


Doesn't this force a GpuRetryOOM?

jbrennan333 · 2023-11-20T16:33:32Z

src/test/java/com/nvidia/spark/rapids/jni/RmmSparkTest.java

+      assertEquals(RmmSparkThreadState.THREAD_RUNNING, RmmSpark.getStateOf(threadId));
+
+      // Force another exception
+      RmmSpark.forceSplitAndRetryOOM(threadId);


Same as above - isn't this a GpuSplitAndRetryOOM?

gerashegalov · 2023-11-17T21:51:55Z

src/main/java/com/nvidia/spark/rapids/jni/SparkResourceAdaptor.java

+          Thread.sleep(100);
+        }
+      } catch (InterruptedException e) {
+        // Ignored we are going to exit.


The contract for ignoring an InterruptedException is that if we do not rethrow it, we should restore the interrupt status.

Suggested change

-        // Ignored we are going to exit.
+        // we are going to exit.
+        Thread.currentThread().interrupt();

gerashegalov · 2023-11-20T18:58:13Z

src/main/java/com/nvidia/spark/rapids/jni/SparkResourceAdaptor.java

+      try {
+        while (handle > 0) {
+          checkAndBreakDeadlocks();
+          Thread.sleep(100);


I would make it a System property with a default

jbrennan333

Overall looks good to me. Only minor comments.

jbrennan333 · 2023-11-20T17:40:32Z

src/main/java/com/nvidia/spark/rapids/jni/SparkResourceAdaptor.java

+
+  /**
+   * The allocation failed, and spilling didn't save it.
+   * @param wasOom wat the failure caused by an OOM or something else.


jbrennan333 · 2023-11-20T18:16:28Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

+  int num_times_retry_throw       = 0;
+  int num_times_split_retry_throw = 0;
+  long time_blocked_nanos         = 0;
+  // The amount of time that this thread has lost due to retries (not inclduing blocked time)


typo: inclduing

jbrennan333 · 2023-11-20T18:21:29Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

  // When did the retry time for this thread start, or when did the block time end.
  std::chrono::time_point<std::chrono::steady_clock> retry_start_or_block_end;
  // Is this thread currently in a marked retry block. This is only used for metrics.
  bool is_in_retry = false;
-
+  // The amount of time that this thread has spent in the current retry block (not inclucing block


typo: inclucing

jbrennan333 · 2023-11-20T21:41:50Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

+        if (thread->second.pool_task_ids.erase(task_id) != 0) {
+          std::stringstream ss;
+          ss << "CURRENT IDs ";
+          for (const auto& task_id: thread->second.pool_task_ids) {


Rename task_id to avoid conflict with the task_id argument.
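
A small sketch of the rename, assuming `pool_task_ids` is a `std::set<long>` (the container type is not shown in this diff) and using a hypothetical free function in place of the member code. Giving the loop variable its own name keeps the `task_id` argument visible inside the loop.

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: removes task_id and, if it was present, lists the IDs that remain.
inline std::string erase_and_describe(std::set<long>& pool_task_ids, long task_id)
{
  std::stringstream ss;
  if (pool_task_ids.erase(task_id) != 0) {
    ss << "CURRENT IDs";
    // `remaining_id` does not shadow the `task_id` argument.
    for (auto const& remaining_id : pool_task_ids) {
      ss << " " << remaining_id;
    }
  }
  return ss.str();
}
```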

jbrennan333 · 2023-11-20T22:23:30Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

        }
      }
    }
+
+    auto metrics_at = task_to_metrics.find(task_id);


I've been a bit confused tracking these in the code as well. Might be helpful to separate out the metrics updating code into separate functions, or something.

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 · 2023-11-21T19:16:55Z

@jbrennan333 @gerashegalov @abellina I think I have addressed all of the review comments.

revans2 · 2023-11-21T19:17:00Z

build

revans2 · 2023-11-21T19:20:19Z

build

gerashegalov

LGTM, my comment does not appear addressed but it does not rise to the level of blocking this PR

revans2 · 2023-11-21T19:57:09Z

> LGTM, my comment does not appear addressed but it does not rise to the level of blocking this PR

Sorry, it should be fixed now.

revans2 · 2023-11-21T19:57:14Z

build

ttnghia · 2023-11-21T23:13:10Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

+class task_metrics {
+  public:


If there are no private members, just use a struct:

Suggested change

-class task_metrics {
-  public:
+struct task_metrics {

revans2 · 2023-11-27T19:30:27Z

build

revans2 · 2023-11-27T20:09:23Z

build

revans2 · 2023-11-27T20:51:28Z

@ttnghia and @gerashegalov I think all of the review comments have been addressed. Please take another look.

ttnghia · 2023-11-27T22:39:22Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

@@ -14,11 +14,13 @@
 * limitations under the License.
 */

+#include <algorithm>


This is known to be extremely heavy and we should avoid it if possible. What APIs do we need from it here?

ttnghia · 2023-11-27T22:51:48Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

+          log_status("FIXUP", thread_id, found->second.task_id,
+                  found->second.state, ss.str().c_str());


The pattern ss.str().c_str() is used a lot here and can cause undefined behavior or a crash when logging is enabled. That is because .str().c_str() points into a temporary string produced by .str(), which is destroyed at the end of the full expression, so the pointer dangles if it is used after that.

Suggested change

-          log_status("FIXUP", thread_id, found->second.task_id,
-                  found->second.state, ss.str().c_str());
+          auto const log_str = ss.str();
+          log_status("FIXUP", thread_id, found->second.task_id,
+                  found->second.state, log_str.c_str());

Please also take care of the remaining instances of this.

Another option, for a short message like this, is to use std::to_string directly:

auto const log_str = std::string("desired task_id ") + std::to_string(task_id);
log_status("FIXUP", thread_id, found->second.task_id, found->second.state, log_str.c_str());

We only need stringstream if we have to do a lot of string concatenations, or we have a for loop.
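
A self-contained sketch of the lifetime rule behind this suggestion. The `log_status` below is a hypothetical stand-in with a simplified signature, not the function in this file, and `log_fixup` is likewise made up for illustration. A temporary returned by `.str()` lives until the end of the full expression that created it, so a pointer obtained from it must not be used after that statement. Binding the string to a named variable keeps the buffer alive for the whole scope.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the logger: copies the C string it is handed.
inline std::string log_status(char const* op, long thread_id, char const* notes)
{
  return std::string(op) + " " + std::to_string(thread_id) + " " + notes;
}

inline std::string log_fixup(long thread_id, long task_id)
{
  std::stringstream ss;
  ss << "desired task_id " << task_id;

  // Dangling: the temporary from ss.str() is destroyed at the end of this statement.
  // char const* bad = ss.str().c_str();

  // Safe: log_str owns the characters for as long as it is in scope.
  auto const log_str = ss.str();
  return log_status("FIXUP", thread_id, log_str.c_str());
}
```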

ttnghia · 2023-11-27T23:22:35Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

+            auto this_id = static_cast<long>(pthread_self());
+            auto thread  = threads.find(thread_id_to_wake);


Nit: Try to use auto const. All variables should be const if possible.

ttnghia · 2023-11-27T23:24:50Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

@@ -45,25 +90,16 @@ constexpr char const* SPLIT_AND_RETRY_OOM_CLASS = "com/nvidia/spark/rapids/jni/S
 enum thread_state {


Please use enum class. In C++, a plain (unscoped) enum can be implicitly converted to int. That is error-prone.

The downside of using enum class is that you have to add the prefix thread_state:: to all enum values in the code.

Of course we still can explicitly convert from enum class to int, but that is always "explicit", i.e., intentional.
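
A minimal sketch of the difference, using a made-up two-value state list rather than the real `thread_state` values; the helpers `as_int` and `is_blocked` are hypothetical. With `enum class`, conversions to `int` must be spelled out with `static_cast`.

```cpp
#include <type_traits>

// Hypothetical, shortened state list for illustration only.
enum class thread_state { THREAD_RUNNING = 0, THREAD_BLOCKED = 1 };

// A scoped enum does not convert to int implicitly.
static_assert(!std::is_convertible<thread_state, int>::value,
              "enum class requires an explicit cast");

// Explicit, intentional conversion for places where an integer value is needed.
constexpr int as_int(thread_state state) { return static_cast<int>(state); }

// Comparisons need the thread_state:: prefix on every enumerator.
inline bool is_blocked(thread_state state) { return state == thread_state::THREAD_BLOCKED; }
```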

ttnghia · 2023-11-27T23:43:14Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

@@ -890,15 +1080,11 @@ class spark_resource_adaptor final : public rmm::mr::device_memory_resource {
    bool are_any_tasks_just_blocked = false;
    for (auto thread = threads.begin(); thread != threads.end(); thread++) {


This pattern also repeats a lot. We can do better with this:

Suggested change

-    for (auto thread = threads.begin(); thread != threads.end(); thread++) {
+    for (auto const& [thread_id, thread_state]: threads) {

So we won't use some vague names like thread->first, thread->second.
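
For illustration, a sketch of the same loop in both styles, assuming `threads` is a map from thread ID to a per-thread state object. The `thread_info` type and both counting functions below are hypothetical stand-ins, not code from this file. Structured bindings require C++17.

```cpp
#include <map>

// Hypothetical per-thread record; the real type in this file has many more fields.
struct thread_info {
  long task_id    = -1;
  bool is_blocked = false;
};

// Iterator style: readers must remember what ->first and ->second mean.
inline int count_blocked_iter(std::map<long, thread_info> const& threads)
{
  int blocked = 0;
  for (auto thread = threads.begin(); thread != threads.end(); thread++) {
    if (thread->second.is_blocked) { blocked++; }
  }
  return blocked;
}

// Structured bindings: the key and the value get descriptive names.
inline int count_blocked(std::map<long, thread_info> const& threads)
{
  int blocked = 0;
  for (auto const& [thread_id, info] : threads) {
    (void)thread_id;  // unused here; named only to show the binding
    if (info.is_blocked) { blocked++; }
  }
  return blocked;
}
```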

ttnghia · 2023-11-27T23:46:20Z

src/main/cpp/src/SparkResourceAdaptorJni.cpp

@@ -926,35 +1112,57 @@ class spark_resource_adaptor final : public rmm::mr::device_memory_resource {
   * returns true if the thread that ended was a normally running task thread.
   * This should be used to decide if wake_up_threads_after_task_finishes is called or not.
   */
-  bool remove_thread_association(long thread_id, const std::unique_lock<std::mutex>& lock)
+  bool remove_thread_association(long thread_id, long remove_task_id, const std::unique_lock<std::mutex>& lock)


Nit: Please put const after type (east const) to be consistent with the current repository style.

ttnghia

There seems to be a serious problem with a pointer to a temporary object (#1543 (comment)), so I am blocking this until it is fixed.

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 · 2023-11-28T19:40:24Z

@ttnghia please take another look. I think I have addressed all of your comments.

revans2 · 2023-11-28T19:40:29Z

build

ttnghia · 2023-11-28T19:54:29Z

This is good now. You need to run pre-commit run --all-files (https://github.com/NVIDIA/spark-rapids-jni/blob/branch-24.02/CONTRIBUTING.md#c).

revans2 · 2023-11-28T20:02:12Z

> This is good now. You need to run pre-commit run --all-files (https://github.com/NVIDIA/spark-rapids-jni/blob/branch-24.02/CONTRIBUTING.md#c).

The pre-commit scripts do not work out of the box with the docker environment so trying to do formatting locally is way too difficult. I will continue to let CI do my pre-commit until it is fixed.

revans2 · 2023-11-28T20:02:19Z

build

Update state retry state maching for CPU alloc support

a87aa22

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 mentioned this pull request Nov 7, 2023

Update for new retry state machine JNI APIs NVIDIA/spark-rapids#9656

Merged

mattahrens added the reliability label Nov 9, 2023

revans2 added 3 commits November 15, 2023 09:40

Cleanup

fdf529f

Signed-off-by: Robert (Bobby) Evans <[email protected]>

Merge branch 'branch-23.12' into simplify_retry

1ebe10d

Make the thread registry more forgiving

03ef938

Signed-off-by: Robert (Bobby) Evans <[email protected]>

abellina self-requested a review November 16, 2023 22:53

revans2 changed the base branch from branch-23.12 to branch-24.02 November 17, 2023 14:15

Merge branch 'branch-24.02' into simplify_retry

c05a4fa

revans2 changed the title ~~Update state retry state maching for CPU alloc support~~ Update state retry state machine for CPU alloc support Nov 17, 2023

abellina reviewed Nov 20, 2023

jbrennan333 reviewed Nov 20, 2023

gerashegalov reviewed Nov 20, 2023

jbrennan333 reviewed Nov 20, 2023

revans2 added 3 commits November 21, 2023 08:29

Make race with task assignment safe

9920f12

Signed-off-by: Robert (Bobby) Evans <[email protected]>

Some review comments

0bb1c19

More review comments

b1ea2c0

Signed-off-by: Robert (Bobby) Evans <[email protected]>

gerashegalov previously approved these changes Nov 21, 2023

oops

f533b4b

revans2 dismissed gerashegalov’s stale review via f533b4b November 21, 2023 19:56

gerashegalov previously approved these changes Nov 21, 2023

ttnghia reviewed Nov 21, 2023

revans2 added 2 commits November 27, 2023 08:47

Merge branch 'branch-24.02' into simplify_retry

390486d

Addressed review comments

39ccf62

revans2 dismissed gerashegalov’s stale review via 39ccf62 November 27, 2023 15:30

jbrennan333 previously approved these changes Nov 27, 2023

gerashegalov previously approved these changes Nov 27, 2023

ttnghia reviewed Nov 27, 2023

ttnghia requested changes Nov 27, 2023

Addressed review comments

12d8011

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 dismissed stale reviews from gerashegalov and jbrennan333 via 12d8011 November 28, 2023 19:40

Formatting

245963d

ttnghia approved these changes Nov 28, 2023

revans2 merged commit e5cf1af into NVIDIA:branch-24.02 Nov 29, 2023
2 checks passed

revans2 deleted the simplify_retry branch November 29, 2023 14:13

abellina mentioned this pull request Dec 4, 2023

[FEA] better oom injection #1609

Open
