Skip to content

Commit

Permalink
Fix handling of vanishing threads that were stepping/stopping
Browse files Browse the repository at this point in the history
The new testcase added by this commit exposes a couple issues with the
amd-dbgapi target's handling of exited threads.

The test spawns a kernel with a number of waves.  The waves do nothing
but exit.  There is breakpoint on the s_endpgm instruction.  Once that
breakpoint is hit, the test issues a "continue" command.  We should
see one breakpoint hit per wave, and then the whole program exiting.
We do see that, however we also see this:

 [New AMDGPU Wave ?:?:?:1 (?,?,?)/?]
 [AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]
 *repeat for other waves*
 ...
 [Thread 0x7ffff626f640 (LWP 3048491) exited]
 [Thread 0x7fffeb7ff640 (LWP 3048488) exited]
 [Inferior 1 (process 3048475) exited normally]

That "New AMDGPU Wave" output comes from infrun.c itself adding the
thread to the GDB thread list, because it got an event for a thread
not on the thread list yet.  The output shows "?"s instead of proper
coordinates, because the event was a TARGET_WAITKIND_THREAD_EXITED,
i.e., the wave was already gone when infrun.c added the thread to the
thread list.

That shouldn't ever happen for the amd-dbgapi target, threads should
only ever be added by the backend.

Note "New AMDGPU Wave ?:?:?:1" is for wave 1.  What happened was that
wave 1 terminated previously, and a previous call to
amd_dbgapi_target::update_thread_list() noticed the wave had vanished
and removed it from the GDB thread list.  However, because the wave
was stepping when it terminated (due to the displaced step over the
s_endpgm) instruction, it is guaranteed that the amd-dbgapi library
queues a WAVE_COMMAND_TERMINATED event for the exit.

When we process that WAVE_COMMAND_TERMINATED event, in
amd-dbgapi-target.c:process_one_event, we return it to the core as a
TARGET_WAITKIND_THREAD_EXITED event:

 static void
 process_one_event (amd_dbgapi_event_id_t event_id,
		    amd_dbgapi_event_kind_t event_kind)
 {
 ...
	 if (status == AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID
	     && event_kind == AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED)
	   ws.set_thread_exited (0);
 ...
 }

Recall the wave is already gone from the GDB thread list.  So when GDB
sees that TARGET_WAITKIND_THREAD_EXITED event for a thread it doesn't
know about, it adds the thread to the thread list, resulting in that:

 [New AMDGPU Wave ?:?:?:1 (?,?,?)/?]

and then, because it was a TARGET_WAITKIND_THREAD_EXITED event, GDB
marks the thread exited right afterwards:

 [AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]

The fix is to make amd_dbgapi_target::update_thread_list() _not_
delete vanishing waves iff they were stepping or in progress of being
stopped.  These two cases are the ones dbgapi guarantees will result
in a WAVE_COMMAND_TERMINATED event if the wave terminates:

  /**
   * A command for a wave was not able to complete because the wave has
   * terminated.
   *
   * Commands that can result in this event are ::amd_dbgapi_wave_stop and
   * ::amd_dbgapi_wave_resume in single step mode.  Since the wave terminated
   * before stopping, this event will be reported instead of
   * ::AMD_DBGAPI_EVENT_KIND_WAVE_STOP.
   *
   * The wave that terminated is available by the ::AMD_DBGAPI_EVENT_INFO_WAVE
   * query.  However, the wave will be invalid since it has already terminated.
   * It is the client's responsibility to know what command was being performed
   * and was unable to complete due to the wave terminating.
   */
  AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED = 2,

As the comment says, it's GDB's responsability to know whether the
wave was stepping or being stopped.  Since we now have a wave_info map
with one entry for each wave, that seems like the place to store that
information.  However, I still decided to put all the coordinate
information in its own structure.  I.e., basically renamed the
existing wave_info to wave_coordinates, and then added a new wave_info
structure that holds the new state, plus a wave_coordinates object.
This seemer cleaner as there are places where we only need to
instantiate a wave_coordinates object.

In addition, I think we could merge stepping_id_map into wave_info_map
into a single map, making the amd_dbgapi_displaced_stepping_id_t state
a field of the new wave_info structure.  This isn't done by this
patch, however.

There's an extra twist.  The testcase also exercises stopping at a new
kernel right after the first kernel fully exits.  In that scenario, we
were hitting this assertion after the first kernel fully exits and the
hit of the breakpoint at the second kernel is handled:

 [amd-dbgapi] process_event_queue: Pulled event from dbgapi: event_id.handle = 26, event_kind = WAVE_STOP
 [amd-dbgapi-lib] suspending queue_3, queue_2, queue_1 (refresh wave list)
 ../../src/gdb/amd-dbgapi-target.c:1625: internal-error: amd_dbgapi_thread_deleted: Assertion `it != info->wave_info_map.end ()' failed.
 A problem internal to GDB has been detected,
 further debugging may prove unreliable.

This is the exact same problem as above, just a different
manifestation.  In this scenario, we end up in update_thread_list
successfully deleting the exited thread (because it was no longer the
current thread) that was incorrectly added by infrun.c.  Because it
was added by infrun.c and not by amd-dbgapi-target.c:add_gpu_thread,
it doesn't have an entry in the wave_info map, so
amd_dbgapi_thread_deleted trips on this assertion:

      gdb_assert (it != info->wave_info_map.end ());

here:

  ...
  -> stop_all_threads
   -> update_thread_list
    -> target_update_thread_list
     -> amd_dbgapi_target::update_thread_list
      -> thread_db_target::update_thread_list
       -> linux_nat_target::update_thread_list
	-> delete_exited_threads
	 -> delete_thread
	  -> delete_thread_1
	   -> gdb::observers::observable<thread_info*>::notify
	    -> amd_dbgapi_thread_deleted
	     -> internal_error_loc

The testcase thus tries both running to exit after the first kernel
exits, and running to a breakpoint in a second kernel after the first
kernel exits.

Change-Id: I43a66f060c35aad1fe0d9ff022ce2afd0537f028
(cherry picked from commit acf2578)
  • Loading branch information
palves authored and lancesix committed Jan 9, 2024
1 parent b5cd5e9 commit 11388cb
Show file tree
Hide file tree
Showing 3 changed files with 380 additions and 48 deletions.
197 changes: 149 additions & 48 deletions gdb/amd-dbgapi-target.c
Original file line number Diff line number Diff line change
Expand Up @@ -130,10 +130,9 @@ get_amd_dbgapi_target_inferior_created_observer_token ()
return amd_dbgapi_target_inferior_created_observer_token;
}

/* A type holding coordinate, etc. info for a given wave. We cache
this because we need this information after a wave exits. */
/* A type holding coordinates, etc. info for a given wave. */

struct wave_info
struct wave_coordinates
{
/* The wave. Set by the ctor. */
amd_dbgapi_wave_id_t wave_id;
Expand All @@ -146,11 +145,44 @@ struct wave_info
uint32_t group_ids[3] {UINT32_MAX, UINT32_MAX, UINT32_MAX};
uint32_t wave_in_group = UINT32_MAX;

explicit wave_info (amd_dbgapi_wave_id_t wave_id);
explicit wave_coordinates (amd_dbgapi_wave_id_t wave_id)
: wave_id (wave_id)
{}

/* Return the target ID string for the wave this wave_info is
/* Return the target ID string for the wave this wave_coordinates is
for. */
std::string to_string () const;

/* Pull out coordinates info from the amd-dbgapi library. */
void fetch ();
};

/* A type holding info about a given wave. */

struct wave_info
{
/* We cache the coordinates info because we need it after a wave
exits. The wave's ID is here. */
wave_coordinates coords;

/* The last resume_mode passed to amd_dbgapi_wave_resume for this
wave. We track this because we are guaranteed to see a
WAVE_COMMAND_TERMINATED event if a stepping wave terminates, and
we need to know to not delete such a wave until we process that
event. */
amd_dbgapi_resume_mode_t last_resume_mode = AMD_DBGAPI_RESUME_MODE_NORMAL;

/* Whether we've called amd_dbgapi_wave_stop for this wave and are
waiting for its stop event. Similarly, we track this because
we're guaranteed to get a WAVE_COMMAND_TERMINATED event if the
wave terminates while being stopped. */
bool stopping = false;

explicit wave_info (amd_dbgapi_wave_id_t wave_id)
: coords (wave_id)
{
coords.fetch ();
}
};

/* Big enough to hold the size of the largest register in bytes. */
Expand Down Expand Up @@ -331,6 +363,19 @@ static struct amd_dbgapi_target the_amd_dbgapi_target;
static const registry<inferior>::key<amd_dbgapi_inferior_info>
amd_dbgapi_inferior_data;

/* Fetch the amd_dbgapi_inferior_info data for the given inferior. */

static struct amd_dbgapi_inferior_info *
get_amd_dbgapi_inferior_info (struct inferior *inferior)
{
amd_dbgapi_inferior_info *info = amd_dbgapi_inferior_data.get (inferior);

if (info == nullptr)
info = amd_dbgapi_inferior_data.emplace (inferior, inferior);

return info;
}

/* The async event handler registered with the event loop, indicating that we
might have events to report to the core and that we'd like our wait method
to be called.
Expand All @@ -345,7 +390,7 @@ static const registry<inferior>::key<amd_dbgapi_inferior_info>
static async_event_handler *amd_dbgapi_async_event_handler = nullptr;

std::string
wave_info::to_string () const
wave_coordinates::to_string () const
{
std::string str = "AMDGPU Wave";

Expand Down Expand Up @@ -375,37 +420,41 @@ wave_info::to_string () const
return str;
}

wave_info::wave_info (amd_dbgapi_wave_id_t wave_id)
: wave_id (wave_id)
{
}

/* Read in wave_info for WAVE_ID. */

static wave_info
get_wave_info (amd_dbgapi_wave_id_t wave_id)
void
wave_coordinates::fetch ()
{
wave_info res (wave_id);

/* Any field that fails to be read is left with its in-class
initialized value, which is printed as "?". */

amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_AGENT,
sizeof (res.agent_id), &res.agent_id);
sizeof (agent_id), &agent_id);
amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_QUEUE,
sizeof (res.queue_id), &res.queue_id);
sizeof (queue_id), &queue_id);
amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_DISPATCH,
sizeof (res.dispatch_id), &res.dispatch_id);
sizeof (dispatch_id), &dispatch_id);

amd_dbgapi_wave_get_info (wave_id,
AMD_DBGAPI_WAVE_INFO_WORKGROUP_COORD,
sizeof (res.group_ids), &res.group_ids);
sizeof (group_ids), &group_ids);

amd_dbgapi_wave_get_info (wave_id,
AMD_DBGAPI_WAVE_INFO_WAVE_NUMBER_IN_WORKGROUP,
sizeof (res.wave_in_group), &res.wave_in_group);
sizeof (wave_in_group), &wave_in_group);
}

/* Get the wave_info object for TP, from the wave_info map. It is
assumed that the wave is in the map. */

static wave_info &
get_thread_wave_info (thread_info *tp)
{
amd_dbgapi_inferior_info *info = get_amd_dbgapi_inferior_info (tp->inf);
amd_dbgapi_wave_id_t wave_id = get_amd_dbgapi_wave_id (tp->ptid);

auto it = info->wave_info_map.find (wave_id.handle);
gdb_assert (it != info->wave_info_map.end ());

return res;
return it->second;
}

/* Return the target id string for a given dispatch. */
Expand Down Expand Up @@ -704,19 +753,6 @@ async_event_handler_mark ()
mark_async_event_handler (amd_dbgapi_async_event_handler);
}

/* Fetch the amd_dbgapi_inferior_info data for the given inferior. */

static struct amd_dbgapi_inferior_info *
get_amd_dbgapi_inferior_info (struct inferior *inferior)
{
amd_dbgapi_inferior_info *info = amd_dbgapi_inferior_data.get (inferior);

if (info == nullptr)
info = amd_dbgapi_inferior_data.emplace (inferior, inferior);

return info;
}

/* Set forward progress requirement to REQUIRE for all processes of PROC_TARGET
matching PTID. */

Expand Down Expand Up @@ -931,12 +967,12 @@ amd_dbgapi_target::pid_to_str (ptid_t ptid)

auto it = info->wave_info_map.find (wave_id.handle);
if (it != info->wave_info_map.end ())
return it->second.to_string ();
return it->second.coords.to_string ();

/* A wave we don't know about. Shouldn't usually happen, but
asserting and bringing down the session is a bit too harsh. Just
print all unknown info as "?"s. */
return wave_info (wave_id).to_string ();
return wave_coordinates (wave_id).to_string ();
}

std::string
Expand Down Expand Up @@ -1292,23 +1328,33 @@ amd_dbgapi_target::resume (ptid_t scope_ptid, int step, enum gdb_signal signo)

amd_dbgapi_wave_id_t wave_id = get_amd_dbgapi_wave_id (thread->ptid);
amd_dbgapi_status_t status;

wave_info &wi = get_thread_wave_info (thread);
amd_dbgapi_resume_mode_t &resume_mode = wi.last_resume_mode;
amd_dbgapi_exceptions_t wave_exception;
if (thread->ptid == inferior_ptid)
status = amd_dbgapi_wave_resume (wave_id,
(step
? AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
: AMD_DBGAPI_RESUME_MODE_NORMAL),
exception);
{
resume_mode = (step
? AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
: AMD_DBGAPI_RESUME_MODE_NORMAL);
wave_exception = exception;
}
else
status = amd_dbgapi_wave_resume (wave_id, AMD_DBGAPI_RESUME_MODE_NORMAL,
AMD_DBGAPI_EXCEPTION_NONE);
{
resume_mode = AMD_DBGAPI_RESUME_MODE_NORMAL;
wave_exception = AMD_DBGAPI_EXCEPTION_NONE;
}

status = amd_dbgapi_wave_resume (wave_id, resume_mode, wave_exception);
if (status != AMD_DBGAPI_STATUS_SUCCESS
/* Ignore the error that wave is no longer valid as that could
indicate that the process has exited. GDB treats resuming a
thread that no longer exists as being successful. */
&& status != AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID)
error (_("wave_resume for wave_%ld failed (%s)"), wave_id.handle,
get_status_string (status));

wi.stopping = false;
}
}

Expand All @@ -1323,6 +1369,15 @@ amd_dbgapi_target::commit_resumed ()
require_forward_progress (minus_one_ptid, proc_target, true);
}

/* Return a string version of RESUME_MODE, for debug log purposes. */
static const char *
resume_mode_to_string (amd_dbgapi_resume_mode_t resume_mode)
{
return (resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
? "step"
: "normal");
}

void
amd_dbgapi_target::stop (ptid_t ptid)
{
Expand Down Expand Up @@ -1356,7 +1411,11 @@ amd_dbgapi_target::stop (ptid_t ptid)

status = amd_dbgapi_wave_stop (wave_id);
if (status == AMD_DBGAPI_STATUS_SUCCESS)
return;
{
wave_info &wi = get_thread_wave_info (thread);
wi.stopping = true;
return;
}

if (status != AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID)
error (_("wave_stop for wave_%ld failed (%s)"), wave_id.handle,
Expand All @@ -1370,6 +1429,23 @@ amd_dbgapi_target::stop (ptid_t ptid)
could have terminated since the last time the wave list was
refreshed. */

wave_info &wi = get_thread_wave_info (thread);
wi.stopping = true;

amd_dbgapi_debug_printf ("got AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID "
"for wave_%ld, last_resume_mode=%s, "
"report_thread_events=%d",
wave_id.handle,
resume_mode_to_string (wi.last_resume_mode),
m_report_thread_events);

/* If the wave was stepping when it terminated, then it is
guaranteed that we will see a WAVE_COMMAND_TERMINATED event
for it. Don't report a thread exit event or delete the
thread yet, until we see such event. */
if (wi.last_resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP)
return;

if (m_report_thread_events)
{
get_amd_dbgapi_inferior_info (thread->inf)->wave_events.emplace_back
Expand Down Expand Up @@ -1639,7 +1715,7 @@ add_gpu_thread (inferior *inf, ptid_t wave_ptid)
auto wave_id = get_amd_dbgapi_wave_id (wave_ptid);

if (!info->wave_info_map.try_emplace (wave_id.handle,
get_wave_info (wave_id)).second)
wave_info (wave_id)).second)
internal_error ("wave ID %ld already in map", wave_id.handle);

/* Create new GPU threads silently to avoid spamming the terminal
Expand Down Expand Up @@ -2413,7 +2489,32 @@ amd_dbgapi_target::update_thread_list ()
auto it = threads.find (tp->ptid.tid ());

if (it == threads.end ())
delete_thread_silent (tp);
{
auto wave_id = get_amd_dbgapi_wave_id (tp->ptid);
wave_info &wi = get_thread_wave_info (tp);

/* Waves that were stepping or in progress of being
stopped are guaranteed to report a
WAVE_COMMAND_TERMINATED event if they terminate.
Don't delete such threads until we see the
event. */
if (wi.last_resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
|| wi.stopping)
{
amd_dbgapi_debug_printf
("wave_%ld disappeared, keeping it"
" (last_resume_mode=%s, stopping=%d)",
wave_id.handle,
resume_mode_to_string (wi.last_resume_mode),
wi.stopping);
}
else
{
amd_dbgapi_debug_printf ("wave_%ld disappeared, deleting it",
wave_id.handle);
delete_thread_silent (tp);
}
}
else
threads.erase (it);
}
Expand Down
66 changes: 66 additions & 0 deletions gdb/testsuite/gdb.rocm/continue-over-kernel-exit.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
/* Copyright (C) 2023 Free Software Foundation, Inc.
Copyright (C) 2023 Advanced Micro Devices, Inc. All rights reserved.
This file is part of GDB.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
*/

#include <hip/hip_runtime.h>
#include <stdio.h>

#define CHECK(cmd) \
do \
{ \
hipError_t error = cmd; \
if (error != hipSuccess) \
{ \
fprintf (stderr, "error: '%s'(%d) at %s:%d\n", \
hipGetErrorString (error), error, \
__FILE__, __LINE__); \
exit (EXIT_FAILURE); \
} \
} while (0)

__global__ void
kern ()
{
asm ("s_endpgm_insn: s_endpgm");
}

__global__ void
second_kernel ()
{
}

int
main ()
{
/* Use 1-thread blocks to easily control number of waves. */
size_t blocksize = 1;
size_t gridsize = 10;

kern<<<gridsize, blocksize>>> ();

/* Stopping at this second kernel after the first kernel completely
finishes makes GDB refresh its thread list while the
amd-dbgapi-target is still active, which triggers different code
paths in GDB that lead to deleting exited threads. We test both
stopping here, and not stopping here. */
second_kernel<<<1, 1>>> ();

CHECK (hipDeviceSynchronize ());

return 0;
}
Loading

0 comments on commit 11388cb

Please sign in to comment.