-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Debug stream thread info pr #9398
Debug stream thread info pr #9398
Conversation
LOG_MODULE_REGISTER(debug_strem_slot); | ||
|
||
struct cpu_mutex { | ||
struct k_mutex m; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I assume mutex is per core and not not lock for all core ? - if so we should inline comment this description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact, if this is nothing else but a mutex, why embed it into another structure? Just use an array of struct k_mutex
objects. And if you do insist of having a separate structure for it, maybe use the debug_
namespace with it too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are mutexes meant for different cores and AFAIK they should be on different cache-lines to avoid trouble. That is why they are inside a struct that I can force to be cache-aligned.
uint32_t record_size = rec->size; | ||
uint32_t record_start, buf_remain; | ||
|
||
LOG_DBG("Sending record %u id %u len %u\n", rec->serial, rec->id, rec->size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this go after the buffer check below ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A matter or taste. I wanted to know for sure the function is at least called, when I could not see anything happening in the circular buffer.
* Section descriptor defines core ID, offset and size of the circular | ||
* buffer in the debug window slot. | ||
*/ | ||
struct debug_stream_section_descriptor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again - is user API, needs to go in another directory and needs packed/aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Packed hardly does anything when everything is 32-bit words, and the descriptor is already aligned to cache-line size. The header itself must be written to beginning of slot so its automatically 4k page size aligned, which should be enough, and also if its not the idea is still to have it at the very beginning of the slot. I can put the packed aligned stuff to the header too, its not needed, and I feel it could even be misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I was amazed when I realized that the C standard has nothing like __packed. C is really not a low-level language ;-)
It seems unlikely but I think 32 bits fields could be 64-bits aligned on some 64 bits architectures, depending on the compiler flags, the phase of the moon etc.
https://www.catb.org/esr/structure-packing/ (sorry for the invalid certificate)
https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html#index-packed-type-attribute
https://linux.die.net/man/1/pahole
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
best to be explicit here even though parent is packed and this is all 32bit words, i.e. we could have direct users that may reference this structure without the parent context (and hence not optimize for 32 but aligned/packed access).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsarha FWIW agree with @lgirdwood - please add packed
. It's good for all ABI types to have it even if they "obviously" effectively would be packed by default by the compiler anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can of course create all kinds of constructs to invent the __aligned() again with variable sized paddings, and use __packed, but now I do not have time for it so, I leave this as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsarha Wouldn't just removing __aligned
and using __packed
already solve it? And just look through those types to verify that you arrange elements from larger to smaller - first all 64-bit ones, then 32-bit ones etc. Only if you embed structures, well, then yes, just pad the structure that you're embedding with some uint8_t reserved[N]
array to align to the next 64 bits, that's it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is several ways to do this, and probably it would work even if the core specific circular-buffer descriptors were not 64-byte aligned, since after the initialization the contents is read only and the memory is accessed through uncached address-space. Still @lgirdwood suggested that also the static descriptor part should be 64-bit aligned, just to be safe. I am really fine with anything that works, but I would like to do it only once more, and the time is running little short if we want this to go to 2.11.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood @lyakh , I am more and more starting to think the __aligned() should be dropped from "struct debug_stream_section_descriptor", and for that matter all the padding hacks too. We can not have the ABI depend on cache-line size, or at least we should then write the offsets explicitly out in the debug window, but that is pretty much what the "struct debug_stream_section_descriptor" is for: It is all static data that is anyway initialized by core 0 before the threads running on other cores have started and its in uncached memory, and it contains the offsets of the core specific circular buffers, and their sizes. So, there should not and must not be any cash-line size dependent padding in the header that describes where the core-specific circular buffers are. I've tested this works fine with sof-mtl-nocodec-ssp0-ssp2.tplg on MTL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsarha I agree. The padding that I was proposing wasn't for cache-line size alignment, it was for "natural data type alignment." I.e. if you have an 8-bit member followed by a 16-bit value and you cannot swap them, add a padding like
struct abi_data {
uint8_t type;
uint8_t reserved0[3];
uint32_t size;
....
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No showstoppers, aside the assert usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug-stream.py
has 352 pylint
warnings: that's almost as many as the number of lines in the file! I'm NOT requesting "pylint perfection" but the number of warnings should be small enough to keep pylint
usable. It's not usable right now.
The good news is: 256 out of these 352 are the same W0311 and you can get rid of ALL of them with a single black
call:
apt install black
black tools/debugstream/debug-stream.py
reformatted tools/debugstream/debug-stream.py
Warning: black
reformatting happens in place.
Another pro tip: sorted_pylint () { pylint "$@" | sort -t: -k2 -n ; }
tools/debugstream/debug-stream.py:370:0: W0311: Bad indentation. Found 5 spaces, expected 20 (bad-indentation)
tools/debugstream/debug-stream.py:371:0: W0311: Bad indentation. Found 5 spaces, expected 20 (bad-indentation)
tools/debugstream/debug-stream.py:372:0: W0311: Bad indentation. Found 6 spaces, expected 24 (bad-indentation)
tools/debugstream/debug-stream.py:374:0: W0311: Bad indentation. Found 2 spaces, expected 8 (bad-indentation)
tools/debugstream/debug-stream.py:375:0: W0311: Bad indentation. Found 3 spaces, expected 12 (bad-indentation)
tools/debugstream/debug-stream.py:376:0: W0311: Bad indentation. Found 3 spaces, expected 12 (bad-indentation)
tools/debugstream/debug-stream.py:377:0: W0311: Bad indentation. Found 2 spaces, expected 8 (bad-indentation)
tools/debugstream/debug-stream.py:378:0: W0311: Bad indentation. Found 3 spaces, expected 12 (bad-indentation)
tools/debugstream/debug-stream.py:379:0: W0311: Bad indentation. Found 4 spaces, expected 16 (bad-indentation)
tools/debugstream/debug-stream.py:380:0: W0311: Bad indentation. Found 4 spaces, expected 16 (bad-indentation)
tools/debugstream/debug-stream.py:381:0: W0311: Bad indentation. Found 2 spaces, expected 8 (bad-indentation)
tools/debugstream/debug-stream.py:382:0: W0311: Bad indentation. Found 2 spaces, expected 8 (bad-indentation)
Forgot this sorry: most warnings are easy to fix but in some rare cases pylint is wrong or you don't have the time to figure out how (feel free to ask for help privately). Then you can easily silence any warning with a Or you can just leave some for someone else to fix later. |
c0345cd
to
b391af1
Compare
The new version should address all the comment according to reply, except for "serial" > "seqnum" and getting rid off __aligned() and using __packed and padding members instead. Oh, and I am not working from python ctypes to construct change, not yet anyway. |
b391af1
to
0a67a2b
Compare
Now also "serial" -> "seqnum" change is done. Aome idle debug prints were also removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as none of my questions actually uncover bugs and since this is new code, they can be addressed after merge, nothing seems to be critical
} __aligned(CONFIG_DCACHE_LINE_SIZE); | ||
|
||
/* CPU specific mutexes for each circular buffer */ | ||
static struct cpu_mutex cpu_mutex[CONFIG_MP_MAX_NUM_CPUS]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this puts cpu_mutex[]
into .bss
which is uncached in present builds. That means, that you don't need to cache-line align them and to flush caches. In general, even if it were cached, you would always access it as such, mixing cached and uncached access to a mutex would very likely break its functionality. So you wouldn't need to flush caches anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, then that is a problem. Is all static (and global) data uncached? If that is the case, then I guess I should reserve the table from heap, but are all heap allocations automatically cache-line aligned?
The mutex only to control access within one core so AFAIU the cache should not be a problem. This is for the situation where higher priority thread interrupts record sending and that higher priority thread tries to send a record of its own.
I'd still say its safer to keep the mutexes on different cache-lines, even if the memory is uncached, so I suggest not to do anything right now, but move the mutexes to cached memory later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsarha I'd keep them uncached and not bother with cache lines. Should be both simpler and safer. Though a bit slower, but don't think this slow down would be significant for you. But if you do decide to go via cache, you can get cached aliases to that memory, but then yes, you'd need to reserve a whole cache line-sized buffer for each.
buf->next_seqno = 0; | ||
buf->w_ptr = 0; | ||
k_mutex_init(&cpu_mutex[i].m); | ||
sys_cache_data_flush_range(&cpu_mutex[i], sizeof(cpu_mutex[i])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove this, for now, but eventually I would like have these mutexes in cached memory.
LOG_MODULE_REGISTER(debug_strem_slot); | ||
|
||
struct cpu_mutex { | ||
struct k_mutex m; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact, if this is nothing else but a mutex, why embed it into another structure? Just use an array of struct k_mutex
objects. And if you do insist of having a separate structure for it, maybe use the debug_
namespace with it too
if (!buf) | ||
return -ENODEV; | ||
|
||
if (rec->size >= desc.buf_words) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is .size
in words? Should be clearly documented then... Or maybe use a different name or make it in bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its documented here:
struct debug_stream_record {
uint32_t id; /* Record id of abstract data record */
uint32_t seqno; /* Increments after each record */
uint32_t size; /* Size of the whole record in words */
uint32_t data[];
} __packed;
, but I change the data-member name.
int i; | ||
|
||
/* Mark the thread as active */ | ||
ud->active_threads[ud->thread_count] = tid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thread_count
sounds like it's the number of threads, i.e. it begins with 1? Can this access outside of the array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsarha but does it begin with 0 or with 1? I.e. does the element ud->active_threads[0]
ever get used here?
* returns it. | ||
*/ | ||
|
||
static uint32_t thread_info_get_cycles(void *tid, k_thread_runtime_stats_t *thread_stats, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use struct k_thread *
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but I am trying to hint that the pointer is used as an id only, its not accessed at all. I'll add a comment.
|
||
/* If there is more than THREAD_INFO_MAX_THREADS threads in this core */ | ||
if (i == ARRAY_SIZE(ud->previous->threads)) | ||
LOG_INF("No place found for %s %p", name, tid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning?
/* If no cached value was found, look for an empty slot to store the recent value */ | ||
for (i = 0; i < ARRAY_SIZE(ud->previous->threads); i++) { | ||
if (ud->previous->threads[i].tid == NULL) { | ||
ud->previous->threads[i].tid = tid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be a follow-up, but maybe we can have a better name than previous
here. You don't store more than one snapshot, so isn't it just "current"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the cycles number from the previous round. After the calculations it will be "current", but when its used next time its again from "previous" round.
#define THREAD_STACK_SIZE (2048) | ||
static K_THREAD_STACK_ARRAY_DEFINE(info_thread_stacks, CONFIG_MP_MAX_NUM_CPUS, | ||
THREAD_STACK_SIZE); | ||
static struct k_thread info_thread[CONFIG_MP_MAX_NUM_CPUS]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CONFIG_MP_NUM_CPUS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CONFIG_MP_NUM_CPUS is obsolete, MP_MAX_NUM_CPUS should be used:
config MP_NUM_CPUS
int "Number of CPUs/cores [DEPRECATED]"
default MP_MAX_NUM_CPUS
range 1 12
help
This is deprecated, please use MP_MAX_NUM_CPUS instead.
config MP_MAX_NUM_CPUS
int "Maximum number of CPUs/cores"
default 1
range 1 12
help
Maximum number of multiprocessing-capable cores available to the
multicpu API and SMP features.
…tion This commit adds definition for abstract debug stream record and stream header. The header file contains the necessary documentation. Signed-off-by: Jyri Sarha <[email protected]>
Implementation for debug stream transportation over debug window slot. The protocol details are documented in debug_stream_slot.h header. Signed-off-by: Jyri Sarha <[email protected]>
Implementation for collecting Thread stack and CPU usage statistic from Zephyr core and send the info and debug_stream records. The records encoding is defined and documented in debug_stream_thread_info.c source file. Signed-off-by: Jyri Sarha <[email protected]>
Python implementation for receiving and decoding debug-stream records from debug window slot transportation. Opens SOF debugfs file "debug_stream" reads and decodes the records from the circular buffer documented in soc_debug_window_slot.h. This initial version only knows of DEBUG_STREAM_RECORD_ID_THREAD_INFO records. Signed-off-by: Jyri Sarha <[email protected]>
0a67a2b
to
5fe5744
Compare
Following changes were made:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for addressing comments
buf->next_seqno = 0; | ||
buf->w_ptr = 0; | ||
k_mutex_init(&cpu_mutex[i].m); | ||
/* The core specific mutexes are now .dss which is uncached so the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.bss
int i; | ||
|
||
/* Mark the thread as active */ | ||
ud->active_threads[ud->thread_count] = tid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsarha but does it begin with 0 or with 1? I.e. does the element ud->active_threads[0]
ever get used here?
@marcinszkudlinski ok for you to proceed? default for this is "n". |
@wszypelt Can you check? This seems to have failed to "Copy build to DUT" step. |
This PR contains following things:
This PR requires following PR to compile (if the feature is enabled in Kconfig):
zephyrproject-rtos/zephyr#77467
and following PR to work (for python script to access the debug window slot):
thesofproject/linux#5154
and this fix to be able to disable tlemetry performance measurements, that tries to use the same debug-window slot (a configuration that allows both is also possible, but I wanted to get this PR out):
#9392
If one wants to try the PR in practice on MTL, one only needs the linux PR, and this branch of mine that contains all the other necessary changes:
https://github.com/jsarha/sof/tree/debug_stream_thread_info_testing