Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

telemetry: systick info #8737

Merged
merged 6 commits into from
Apr 12, 2024
Merged

Conversation

tobonex
Copy link
Contributor

@tobonex tobonex commented Jan 12, 2024

This patch introduces telemetry by adding systick info measurements. Also adds basic implementations of ipc's used for by our tests for systick info measurements.
Currently this implementation only handles measurement of LL execution time, as measuring DP tasks may be a bit tricky.

This patch also requires increasing the size of memory window 2, which is on the Zephyr side. It may pass tests without it, but the results will be incorrect as it may write on memory occupied by Zephyr logs. This memory window size change is being taken care of also. Will link any related Zephyr PR's once they are created.

I'm a bit uncertain about some parts, so I'll self review with some questions.

@tobonex tobonex force-pushed the performance_data_zephyr_pr branch 3 times, most recently from 345cca4 to a10f59d Compare January 12, 2024 18:34
Copy link
Contributor Author

@tobonex tobonex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review with some questions/things to look at.

src/ipc/ipc4/handler.c Outdated Show resolved Hide resolved
src/debug/telemetry/telemetry.c Show resolved Hide resolved
src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
src/audio/base_fw.c Outdated Show resolved Hide resolved
src/audio/base_fw.c Show resolved Hide resolved
src/ipc/ipc4/handler.c Show resolved Hide resolved
src/schedule/schedule.c Show resolved Hide resolved
src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
src/schedule/zephyr_ll.c Outdated Show resolved Hide resolved
struct sof_tl {
uint32_t type;
uint32_t max_length;
} __packed __aligned(4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attribute((packed, aligned(4)));
?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whenever available, __packed and similar macros are recommended by checkpatch over explicit __attribute__(()) versions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this checkpatch recommendation came straight from the Linux kernel and I'm not sure it makes sense in SOF.

src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
src/schedule/zephyr_ll.c Show resolved Hide resolved
src/audio/base_fw.c Show resolved Hide resolved
Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I just have some opens mainly on the ABI.

@@ -462,9 +462,14 @@ static int basefw_get_large_config(struct comp_dev *dev,
uint32_t *data_offset,
char *data)
{
/* We can use extended param id for both extended and standard param id */
union ipc4_extended_param_id extended_param_id;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tobonex I assume the vendor config get means that it is flexible and can be used to get/set any custom data for any vendor ? If so, is this the right structure, i.e. the telemetry data will be ABI standard data and part of a generic telemetry ABI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is just a special case of large_config_get that reference FW used. Scheduler_info_get IPC uses this format. Normal param_id uses one byte out of 4 and extended one uses the remaining bytes as space for additional argument.

src/audio/base_fw.c Outdated Show resolved Hide resolved
src/include/ipc4/base_fw.h Outdated Show resolved Hide resolved

window->descs[slot_num].type = ADSP_DW_SLOT_TELEMETRY;
window->descs[slot_num].resource_id = 0;
wnd_data->separator_1 = 0x0000C0DE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some magic number ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separators were defined in reference FW. Those are only used to identify telemetry structs when looking at the memory directly (e.g. using RWE), so those can be anything really.

Copy link
Member

@lgirdwood lgirdwood Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so to be scalable (where FW can add and modify data payloads in a way that wont break host based tooling) we would need to put a small header as the separator to convey extra information instead of just a magic number e.g.

struct telem_hdr {
    uint16_t magic; // 0xc0de
    uint16_t type; // type of C struct, i.e. what am I.
    uint16_t words;  // size of data elem in words
    uint16_t elems; // > 1 if payload is array
} __packed;
// payload follows here

This could convey all the info tooling needs and would allow FW to grow data and not break tooling along teh way. i.e. if tooling does not know about a type then it can go to the next header (and so on).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure of telemetry data in memory window 2 here is the same as in reference firmware. If we want to be compatible it would be best to leave it as it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need to maintain compatability but we also need to scale and grow as new features upstream. We can wrap any legacy/reference structure as-is, but they would need to be pre-fixed with above header so that new data could be added (without impacting any legacy data).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not sure if we can just change the separators. Pre-fixing with the header would also move the fields from their designated places in the MW slot, and things would just stop working.

About scaling: memory window 2 is pretty scalable by itself. We can still extend the window to get more slots where we can add new structures.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to live with the fixed location (slot + slot offset + separator) to keep compatibility.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a debug mechanism that has to grow and scale as FW features, use cases and capabilities grow and scale (and they will), its way simpler and cheaper to fix the debug tooling (i.e. look for a header to find start, all other data is the same) than to continue with an ABI that cant grow with the FW.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tobonex the only way to scale/grow over time is to say that

window->descs[slot_num].type = ADSP_DW_SLOT_TELEMETRY;
window->descs[slot_num].resource_id = 0;

is the compatibility mode, and then changing

window->descs[slot_num].type = ADSP_DW_SLOT_TELEMETRY_V2;
window->descs[slot_num].resource_id = SOME_NEW_RESOURCE_ID;

to support changes in the ABI payloads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it would look something like this. More details:

We have a few slots available, what data is in what slot is known by reading the descriptors. Here we set descriptor to ADSP_DW_SLOT_TELEMETRY to reserve the slot for telemetry. Data in this slot is telemetry_wnd_data structure. Only systick data is here for now, but there are more structures not ported yet, currently commented out. I didn't count the actual space needed, but this slot will be probably filled whole after the rest is ported. So this slot would only contain the legacy telemetry data.

We can mark another slot as ADSP_DW_SLOT_TELEMETRY_2 for example and do basically anything in this another slot.

systick_info[prid].max_sys_tick_count);
systick_info[prid].last_ccount = current_ccount;

#ifdef PERFORMANCE_MEASUREMENTS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgirdwood , do you think PERFORMANCE_MEASUREMENTS is too simple? or can reuse CONFIG_PERFORMANCE_COUNTERS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PERFORMANCE_MEASUREMENTS I just ported from reference FW, for now there is no reason for it it seems. Will probably just delete this one. I added TELEMETRY_CONFIG instead to disable the whole thing.

src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
src/include/sof/debug/telemetry/telemetry.h Show resolved Hide resolved
src/schedule/zephyr_ll.c Outdated Show resolved Hide resolved
struct system_tick_info *systick_info =
(struct system_tick_info *)wnd_data->system_tick_info;

switch (state) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to print the state change for telemetry in the logs as it may have impact on performance or on issue reproducibility. i.e. a developer will know telemetry state when debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This made me think a bit. This IPC doesn't do much as of yet. It does some additional things in reference firmware, but nothing directly related to systick measurement (tests still use this one so we need it). It seems that systick measurement in reference FW was on all the time, same here. This does not seem like it would take a lot of time but still...
I could modify this IPC and add a bool so that systick measurement is stopped unless it's triggered by this IPC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this would just be informational to the developer. They would at least know if telemetry was on/off when an issue occurred.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've had a LOT of "heisenbugs" in the past...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got some more info, systick info measurement works all the time by design. So there's no on/off switch. Still, what I can do is adding the log at the init of telemetry indicating that systick info measurement is active.

src/debug/telemetry/telemetry.c Show resolved Hide resolved
systick_info[prid].max_sys_tick_count);
systick_info[prid].last_ccount = current_ccount;

#ifdef PERFORMANCE_MEASUREMENTS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgirdwood , do you think PERFORMANCE_MEASUREMENTS is too simple? or can reuse CONFIG_PERFORMANCE_COUNTERS?

src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
src/schedule/zephyr_ll.c Show resolved Hide resolved
src/ipc/ipc4/handler.c Outdated Show resolved Hide resolved
src/audio/base_fw.c Show resolved Hide resolved
src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
* Implementing simplified queue just for avg calculation.
* Queue is circular, oldest element replaced by latest
*/
struct perf_queue {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

telemetry_ prefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this struct is actually only used internally. Should I move it to .c so that it's not public?

src/schedule/zephyr_ll.c Show resolved Hide resolved
@tobonex tobonex force-pushed the performance_data_zephyr_pr branch from a10f59d to c7b7540 Compare January 19, 2024 16:16
@tobonex
Copy link
Contributor Author

tobonex commented Jan 19, 2024

Update:

  • Added CONFIG_TELEMETRY as an option to disable telemetry
  • Disabling telemetry for tgl, as there are some duplicated defines that describe memory windows which messes up the build.
  • Disabling scheduler info dp for tgl, as it does not use zephyr_dp.

@tobonex tobonex force-pushed the performance_data_zephyr_pr branch 2 times, most recently from 94a5b7b to 6e264f8 Compare January 23, 2024 16:23
lyakh
lyakh previously requested changes Jan 24, 2024
src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
@tobonex tobonex force-pushed the performance_data_zephyr_pr branch 2 times, most recently from dcee9de to 0129077 Compare January 24, 2024 17:21
@tobonex
Copy link
Contributor Author

tobonex commented Jan 24, 2024

Update:

  • changed ccount function to sof_cycle_get_64
  • telemetry.c variables made static and added a set per core
  • other minor fixes

src/debug/telemetry/telemetry.c Outdated Show resolved Hide resolved
src/debug/telemetry/telemetry.c Show resolved Hide resolved
@lyakh lyakh dismissed their stale review January 29, 2024 08:43

issues addressed, only a couple of questions to clarify remain

@tobonex tobonex marked this pull request as ready for review January 29, 2024 08:55
@tmleman tmleman requested a review from btian1 March 21, 2024 11:52
src/ipc/ipc4/handler.c Outdated Show resolved Hide resolved
src/schedule/schedule.c Show resolved Hide resolved
@tobonex tobonex force-pushed the performance_data_zephyr_pr branch from 4dcae31 to 82d22c2 Compare March 25, 2024 17:10
@tobonex tobonex requested a review from lyakh March 25, 2024 17:36
Add Vendor Config Get, a special case of Large Config Get. Large Config Get
handling now checks for this case and extracts extended param_id from ipc
payload as param_id and handles the rest of the payload as usual. Base_fw
now uses extended param_id.

Signed-off-by: Tobiasz Dryjanski <[email protected]>
@tobonex tobonex force-pushed the performance_data_zephyr_pr branch from b821d75 to 4df3d4f Compare April 2, 2024 14:46
@tobonex tobonex requested review from marc-hb and kv2019i April 2, 2024 17:54
src/audio/base_fw.c Outdated Show resolved Hide resolved
Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a complicated PR to review. I think overall this starts to be ready to merge. There are few opens raised in comments that have not been closed in the discussion and I spotted one item in code review I'd like to be checked.

For the general problem of abstracting our hw specific code from base_fw.c config, I don't think I can require that from this PR, although this is adding to the tech debt pile.

src/audio/base_fw.c Show resolved Hide resolved
enum ipc4_perf_measurements_state_set state = *data;

struct telemetry_wnd_data *wnd_data =
(struct telemetry_wnd_data *)ADSP_DW->slots[SOF_DW_TELEMETRY_SLOT];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to a bit wrong direction in base_fw.c w.r.t. keeping in platform neutral (see #7549 and #8391 ). OTOH, the file is not really vendor neutral now and we don't have framework how to handle this, so I can't really point to alternative solution here. But this does add tech debt we need to address later. I think the telemetry backend should be abstracted better. E.g. Intel platforms would use a specific memory window, but other platforms may want to use another transport even if the telemetry data and objects are the same and same IPC commands are used to control telemetry operation.

(struct system_tick_info *)wnd_data->system_tick_info;

systick_info[prid].count = telemetry_systick_counter[prid];
systick_info[prid].last_sys_tick_count = current_ccount - begin_ccount;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think thing is still open. We already have overload of "tick" as this is widely used term with a very specific meaning in Zephyr, and on SOF side, IPC4 has added definition of tick in base_fw.h that is not the same as Zephyr. Now in this code, a third type of "tick" is added.

Given this is a very hardware specific interface to begin with, I'd actually just use "ccount" directly to make this explicit. CCOUNT is not shared by cores, and with DSP clock rate changes, clock/power gating, the ccount will not be a usable timestamp even for a single core. That's why we use sof_cycle_get_64() (which is not ccount) for timestamps in SOF. But if the telemetry interface expects ccount, then that's what you need to fill.

But what are the systick fields here? I've seen some uses of "DSP ticks" (= CCOUNT), so I guess that is possible. I'd then add comments to the code as this will be very confusing as this is the third type of "tick" in the same codebase.

If you compile Zephyr with CONFIG_TIMING_FUNCTIONS, then you can get CCOUNT in a more portable way with timing_counter_get() on xtensa. However, we disable CONFIG_TIMING_FUNCTIONS by default in the build and only use it performance overlay builds. so this will require some additional investigation to added (e.g. @abonislawski disabled the DSP clock gating for perf overlay -> this is not ok for telemetry if it needs to be enabled by default).

Use of word "tick" is very confusing as we have established meaning of ticks in Zephyr and before this PRGiven this is a very hardware specific interface

*/
struct telemetry_wnd_data {
uint32_t separator_1;
struct system_tick_info system_tick_info[CONFIG_MAX_CORE_COUNT];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tobonex Can you confirm telemetry will not overwrite the coredump? This is minimum requirement to merge this PR.

Copy link
Contributor

@jsarha jsarha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some open issues still unaddressed, and one more from me. Otherwise looks Ok to me. But in the future I think we should write some generic debug slot reservation code to Zephyr size so that all the slots would be allocated in the same place.

src/audio/base_fw.c Outdated Show resolved Hide resolved
@tobonex
Copy link
Contributor Author

tobonex commented Apr 5, 2024

Some open issues still unaddressed, and one more from me. Otherwise looks Ok to me. But in the future I think we should write some generic debug slot reservation code to Zephyr size so that all the slots would be allocated in the same place.

Yes, with @marcinszkudlinski we were planning to clean up the slots in the near future. We were thinking about either defining all the slot owners in one file or writing this reservation process.

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tobonex . So based on my count, there are two opens left:

  • the tick/ccount naming needs to be clarified (at a minimum with a comment explain why we are mixing units)
  • fix bisect issue noted by Jyri -> you should not use header/definitions in the series that are only added later in the series. at every git commit, the tree should build

(struct system_tick_info *)wnd_data->system_tick_info;

systick_info[prid].count = telemetry_systick_counter[prid];
systick_info[prid].last_sys_tick_count = current_ccount - begin_ccount;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tobonex At a minimum, we need a comment to explain the tick semantics and why we are putting non-ccount values into variable called ccount. I can accept to do this on purpose, but you need to have a comment that explains why this is done.

*/
struct telemetry_wnd_data {
uint32_t separator_1;
struct system_tick_info system_tick_info[CONFIG_MAX_CORE_COUNT];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack @tobonex . I just wanted to confirm it is tested that after coredump overwrites the telemetery, we won't have telemetry again overwrite the coredump. This would make coredump unsable in the build.

@tobonex tobonex force-pushed the performance_data_zephyr_pr branch from 4df3d4f to 888ea4b Compare April 10, 2024 11:56
tobonex added 5 commits April 10, 2024 14:02
Add ipc to retrieve info about schedulers.

Signed-off-by: Tobiasz Dryjanski <[email protected]>
Increase Debug Window (Memory window 2) size by one slot (4096 bytes).
The reason for this change is making space for telemetry functionality.

Signed-off-by: Tobiasz Dryjanski <[email protected]>
Introduces telemetry structure into debug memory window. Adds
systick_info which counts execution time of LL tasks. DP tasks are not
supported yet.

Signed-off-by: Tobiasz Dryjanski <[email protected]>
Add config for enabling telemetry and enable it

Signed-off-by: Tobiasz Dryjanski <[email protected]>
Adds performance measurement state ipc to change the state of
performance measurement.

Signed-off-by: Tobiasz Dryjanski <[email protected]>
@tobonex tobonex force-pushed the performance_data_zephyr_pr branch from 888ea4b to 0d7cecc Compare April 10, 2024 12:03
@tobonex tobonex requested review from jsarha and kv2019i April 10, 2024 13:33
Copy link
Contributor

@jsarha jsarha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only checked that my earlier comment was properly addressed, but for my part this looks good now.

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tobonex , looks good now. Let's see if the CI passes. There's one issue (see inline), but we can merge with this if there are no other issues. If you need to update for some reason, please fix this as well.

@@ -21,3 +21,5 @@ rsource "ipc/Kconfig"
rsource "math/Kconfig"

rsource "library_manager/Kconfig"

rsource "debug/telemetry/Kconfig"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in the previous patch "telemetry: Add systick_info part of telemetry". No functional impact as this commit is the first user, but just noting.

Copy link
Collaborator

@marc-hb marc-hb Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However you must make sure every commit can compile and run basic (not telemetry) tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every commit here does compile

@wszypelt
Copy link

SOFCI TEST

@kv2019i kv2019i merged commit 2810090 into thesofproject:main Apr 12, 2024
42 of 45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.