-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
telemetry: systick info #8737
telemetry: systick info #8737
Conversation
345cca4
to
a10f59d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-review with some questions/things to look at.
struct sof_tl { | ||
uint32_t type; | ||
uint32_t max_length; | ||
} __packed __aligned(4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attribute((packed, aligned(4)));
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whenever available, __packed
and similar macros are recommended by checkpatch over explicit __attribute__(())
versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure this checkpatch recommendation came straight from the Linux kernel and I'm not sure it makes sense in SOF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I just have some opens mainly on the ABI.
@@ -462,9 +462,14 @@ static int basefw_get_large_config(struct comp_dev *dev, | |||
uint32_t *data_offset, | |||
char *data) | |||
{ | |||
/* We can use extended param id for both extended and standard param id */ | |||
union ipc4_extended_param_id extended_param_id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tobonex I assume the vendor config get means that it is flexible and can be used to get/set any custom data for any vendor ? If so, is this the right structure, i.e. the telemetry data will be ABI standard data and part of a generic telemetry ABI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is just a special case of large_config_get that reference FW used. Scheduler_info_get IPC uses this format. Normal param_id uses one byte out of 4 and extended one uses the remaining bytes as space for additional argument.
|
||
window->descs[slot_num].type = ADSP_DW_SLOT_TELEMETRY; | ||
window->descs[slot_num].resource_id = 0; | ||
wnd_data->separator_1 = 0x0000C0DE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some magic number ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separators were defined in reference FW. Those are only used to identify telemetry structs when looking at the memory directly (e.g. using RWE), so those can be anything really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so to be scalable (where FW can add and modify data payloads in a way that wont break host based tooling) we would need to put a small header as the separator to convey extra information instead of just a magic number e.g.
struct telem_hdr {
uint16_t magic; // 0xc0de
uint16_t type; // type of C struct, i.e. what am I.
uint16_t words; // size of data elem in words
uint16_t elems; // > 1 if payload is array
} __packed;
// payload follows here
This could convey all the info tooling needs and would allow FW to grow data and not break tooling along teh way. i.e. if tooling does not know about a type
then it can go to the next header (and so on).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The structure of telemetry data in memory window 2 here is the same as in reference firmware. If we want to be compatible it would be best to leave it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do need to maintain compatability but we also need to scale and grow as new features upstream. We can wrap any legacy/reference structure as-is, but they would need to be pre-fixed with above header so that new data could be added (without impacting any legacy data).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not sure if we can just change the separators. Pre-fixing with the header would also move the fields from their designated places in the MW slot, and things would just stop working.
About scaling: memory window 2 is pretty scalable by itself. We can still extend the window to get more slots where we can add new structures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have to live with the fixed location (slot + slot offset + separator) to keep compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a debug mechanism that has to grow and scale as FW features, use cases and capabilities grow and scale (and they will), its way simpler and cheaper to fix the debug tooling (i.e. look for a header to find start, all other data is the same) than to continue with an ABI that cant grow with the FW.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tobonex the only way to scale/grow over time is to say that
window->descs[slot_num].type = ADSP_DW_SLOT_TELEMETRY;
window->descs[slot_num].resource_id = 0;
is the compatibility mode, and then changing
window->descs[slot_num].type = ADSP_DW_SLOT_TELEMETRY_V2;
window->descs[slot_num].resource_id = SOME_NEW_RESOURCE_ID;
to support changes in the ABI payloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it would look something like this. More details:
We have a few slots available, what data is in what slot is known by reading the descriptors. Here we set descriptor to ADSP_DW_SLOT_TELEMETRY to reserve the slot for telemetry. Data in this slot is telemetry_wnd_data structure. Only systick data is here for now, but there are more structures not ported yet, currently commented out. I didn't count the actual space needed, but this slot will be probably filled whole after the rest is ported. So this slot would only contain the legacy telemetry data.
We can mark another slot as ADSP_DW_SLOT_TELEMETRY_2 for example and do basically anything in this another slot.
src/debug/telemetry/telemetry.c
Outdated
systick_info[prid].max_sys_tick_count); | ||
systick_info[prid].last_ccount = current_ccount; | ||
|
||
#ifdef PERFORMANCE_MEASUREMENTS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood , do you think PERFORMANCE_MEASUREMENTS is too simple? or can reuse CONFIG_PERFORMANCE_COUNTERS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PERFORMANCE_MEASUREMENTS I just ported from reference FW, for now there is no reason for it it seems. Will probably just delete this one. I added TELEMETRY_CONFIG instead to disable the whole thing.
struct system_tick_info *systick_info = | ||
(struct system_tick_info *)wnd_data->system_tick_info; | ||
|
||
switch (state) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to print the state change for telemetry in the logs as it may have impact on performance or on issue reproducibility. i.e. a developer will know telemetry state when debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This made me think a bit. This IPC doesn't do much as of yet. It does some additional things in reference firmware, but nothing directly related to systick measurement (tests still use this one so we need it). It seems that systick measurement in reference FW was on all the time, same here. This does not seem like it would take a lot of time but still...
I could modify this IPC and add a bool so that systick measurement is stopped unless it's triggered by this IPC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this would just be informational to the developer. They would at least know if telemetry was on/off when an issue occurred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've had a LOT of "heisenbugs" in the past...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got some more info, systick info measurement works all the time by design. So there's no on/off switch. Still, what I can do is adding the log at the init of telemetry indicating that systick info measurement is active.
src/debug/telemetry/telemetry.c
Outdated
systick_info[prid].max_sys_tick_count); | ||
systick_info[prid].last_ccount = current_ccount; | ||
|
||
#ifdef PERFORMANCE_MEASUREMENTS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood , do you think PERFORMANCE_MEASUREMENTS is too simple? or can reuse CONFIG_PERFORMANCE_COUNTERS?
* Implementing simplified queue just for avg calculation. | ||
* Queue is circular, oldest element replaced by latest | ||
*/ | ||
struct perf_queue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
telemetry_
prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this struct is actually only used internally. Should I move it to .c so that it's not public?
a10f59d
to
c7b7540
Compare
Update:
|
94a5b7b
to
6e264f8
Compare
dcee9de
to
0129077
Compare
Update:
|
issues addressed, only a couple of questions to clarify remain
4dcae31
to
82d22c2
Compare
Add Vendor Config Get, a special case of Large Config Get. Large Config Get handling now checks for this case and extracts extended param_id from ipc payload as param_id and handles the rest of the payload as usual. Base_fw now uses extended param_id. Signed-off-by: Tobiasz Dryjanski <[email protected]>
b821d75
to
4df3d4f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a complicated PR to review. I think overall this starts to be ready to merge. There are few opens raised in comments that have not been closed in the discussion and I spotted one item in code review I'd like to be checked.
For the general problem of abstracting our hw specific code from base_fw.c config, I don't think I can require that from this PR, although this is adding to the tech debt pile.
enum ipc4_perf_measurements_state_set state = *data; | ||
|
||
struct telemetry_wnd_data *wnd_data = | ||
(struct telemetry_wnd_data *)ADSP_DW->slots[SOF_DW_TELEMETRY_SLOT]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to a bit wrong direction in base_fw.c w.r.t. keeping in platform neutral (see #7549 and #8391 ). OTOH, the file is not really vendor neutral now and we don't have framework how to handle this, so I can't really point to alternative solution here. But this does add tech debt we need to address later. I think the telemetry backend should be abstracted better. E.g. Intel platforms would use a specific memory window, but other platforms may want to use another transport even if the telemetry data and objects are the same and same IPC commands are used to control telemetry operation.
src/debug/telemetry/telemetry.c
Outdated
(struct system_tick_info *)wnd_data->system_tick_info; | ||
|
||
systick_info[prid].count = telemetry_systick_counter[prid]; | ||
systick_info[prid].last_sys_tick_count = current_ccount - begin_ccount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think thing is still open. We already have overload of "tick" as this is widely used term with a very specific meaning in Zephyr, and on SOF side, IPC4 has added definition of tick in base_fw.h that is not the same as Zephyr. Now in this code, a third type of "tick" is added.
Given this is a very hardware specific interface to begin with, I'd actually just use "ccount" directly to make this explicit. CCOUNT is not shared by cores, and with DSP clock rate changes, clock/power gating, the ccount will not be a usable timestamp even for a single core. That's why we use sof_cycle_get_64() (which is not ccount) for timestamps in SOF. But if the telemetry interface expects ccount, then that's what you need to fill.
But what are the systick fields here? I've seen some uses of "DSP ticks" (= CCOUNT), so I guess that is possible. I'd then add comments to the code as this will be very confusing as this is the third type of "tick" in the same codebase.
If you compile Zephyr with CONFIG_TIMING_FUNCTIONS, then you can get CCOUNT in a more portable way with timing_counter_get() on xtensa. However, we disable CONFIG_TIMING_FUNCTIONS by default in the build and only use it performance overlay builds. so this will require some additional investigation to added (e.g. @abonislawski disabled the DSP clock gating for perf overlay -> this is not ok for telemetry if it needs to be enabled by default).
Use of word "tick" is very confusing as we have established meaning of ticks in Zephyr and before this PRGiven this is a very hardware specific interface
*/ | ||
struct telemetry_wnd_data { | ||
uint32_t separator_1; | ||
struct system_tick_info system_tick_info[CONFIG_MAX_CORE_COUNT]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tobonex Can you confirm telemetry will not overwrite the coredump? This is minimum requirement to merge this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some open issues still unaddressed, and one more from me. Otherwise looks Ok to me. But in the future I think we should write some generic debug slot reservation code to Zephyr size so that all the slots would be allocated in the same place.
Yes, with @marcinszkudlinski we were planning to clean up the slots in the near future. We were thinking about either defining all the slot owners in one file or writing this reservation process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tobonex . So based on my count, there are two opens left:
- the tick/ccount naming needs to be clarified (at a minimum with a comment explain why we are mixing units)
- fix bisect issue noted by Jyri -> you should not use header/definitions in the series that are only added later in the series. at every git commit, the tree should build
src/debug/telemetry/telemetry.c
Outdated
(struct system_tick_info *)wnd_data->system_tick_info; | ||
|
||
systick_info[prid].count = telemetry_systick_counter[prid]; | ||
systick_info[prid].last_sys_tick_count = current_ccount - begin_ccount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tobonex At a minimum, we need a comment to explain the tick semantics and why we are putting non-ccount values into variable called ccount. I can accept to do this on purpose, but you need to have a comment that explains why this is done.
*/ | ||
struct telemetry_wnd_data { | ||
uint32_t separator_1; | ||
struct system_tick_info system_tick_info[CONFIG_MAX_CORE_COUNT]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack @tobonex . I just wanted to confirm it is tested that after coredump overwrites the telemetery, we won't have telemetry again overwrite the coredump. This would make coredump unsable in the build.
4df3d4f
to
888ea4b
Compare
Add ipc to retrieve info about schedulers. Signed-off-by: Tobiasz Dryjanski <[email protected]>
Increase Debug Window (Memory window 2) size by one slot (4096 bytes). The reason for this change is making space for telemetry functionality. Signed-off-by: Tobiasz Dryjanski <[email protected]>
Introduces telemetry structure into debug memory window. Adds systick_info which counts execution time of LL tasks. DP tasks are not supported yet. Signed-off-by: Tobiasz Dryjanski <[email protected]>
Add config for enabling telemetry and enable it Signed-off-by: Tobiasz Dryjanski <[email protected]>
Adds performance measurement state ipc to change the state of performance measurement. Signed-off-by: Tobiasz Dryjanski <[email protected]>
888ea4b
to
0d7cecc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only checked that my earlier comment was properly addressed, but for my part this looks good now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tobonex , looks good now. Let's see if the CI passes. There's one issue (see inline), but we can merge with this if there are no other issues. If you need to update for some reason, please fix this as well.
@@ -21,3 +21,5 @@ rsource "ipc/Kconfig" | |||
rsource "math/Kconfig" | |||
|
|||
rsource "library_manager/Kconfig" | |||
|
|||
rsource "debug/telemetry/Kconfig" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in the previous patch "telemetry: Add systick_info part of telemetry". No functional impact as this commit is the first user, but just noting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However you must make sure every commit can compile and run basic (not telemetry) tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every commit here does compile
SOFCI TEST |
This patch introduces telemetry by adding systick info measurements. Also adds basic implementations of ipc's used for by our tests for systick info measurements.
Currently this implementation only handles measurement of LL execution time, as measuring DP tasks may be a bit tricky.
This patch also requires increasing the size of memory window 2, which is on the Zephyr side. It may pass tests without it, but the results will be incorrect as it may write on memory occupied by Zephyr logs. This memory window size change is being taken care of also. Will link any related Zephyr PR's once they are created.
I'm a bit uncertain about some parts, so I'll self review with some questions.