proposal: runtime/pprof: add data-type profiling #69699

Open
florianl opened this issue Sep 28, 2024 · 5 comments
@florianl
Contributor

Proposal Details

Static analysis can already help improve the memory layout of Go structs by reordering fields and reducing padding. This can make access to struct fields more efficient, because the fields are better aligned and packed. Combined with dead code analysis, static analysis can also identify unused fields and so help reduce the size of structs.

This proposal aims to bring the ideas from Data-type profiling for perf to Go's pprof ecosystem and provide a Go-native approach. Today it is already possible to do data-type profiling with perf on Linux systems, reorder structs accordingly, and benefit from the performance improvements.

Introduce a new runtime/pprof Profile that tracks the number of read/write accesses to the fields of a Go struct.

The report of this new runtime/pprof Profile should let users identify frequently accessed fields within a struct, so that they can reorder struct fields to improve the memory efficiency of their application.

Example report for a Go struct, generated by the approach described in Data-type profiling for perf:

Annotate type: 'struct runtime.mspan' (654 samples)
Percent     Offset       Size  Field
 100.00          0        160  struct runtime.mspan {
   0.00          0          0      internal/runtime/sys.NotInHeap   _ {
   0.00          0          0          internal/runtime/sys.nih     _;
                                   };
   1.05          0          8      runtime.mspan*   next;
   0.00          8          8      runtime.mspan*   prev;
   0.23         16          8      runtime.mSpanList*       list;
  41.18         24          8      uintptr  startAddr;
   2.30         32          8      uintptr  npages;
   0.19         40          8      runtime.gclinkptr        manualFreeList;
   1.74         48          2      uint16   freeindex;
   1.57         50          2      uint16   nelems;
   0.23         52          2      uint16   freeIndexForScan;
   1.82         56          8      uint64   allocCache;
   1.56         64          8      runtime.gcBits*  allocBits;
   5.51         72          8      runtime.gcBits*  gcmarkBits;
   0.42         80          8      runtime.gcBits*  pinnerBits;
   1.54         88          4      uint32   sweepgen;
   4.58         92          4      uint32   divMul;
   2.70         96          2      uint16   allocCount;
  12.49         98          1      runtime.spanClass        spanclass;
   0.00         99          1      runtime.mSpanStateBox    state {
   0.00         99          1          internal/runtime/atomic.Uint8        s {
   0.00         99          0              internal/runtime/atomic.noCopy   noCopy;
   0.00         99          1              uint8    value;
                                       };
                                   };
   1.69        100          1      uint8    needzero;
   0.11        101          1      bool     isUserArenaChunk;
   0.23        102          2      uint16   allocCountBeforeCache;
  18.64        104          8      uintptr  elemsize;
   0.00        112          8      uintptr  limit;
   0.00        120          8      runtime.mutex    speciallock {
   0.00        120          0          runtime.lockRankStruct       lockRankStruct;
   0.00        120          8          uintptr      key;
                                   };
   0.22        128          8      runtime.special* specials;
   0.00        136         16      runtime.addrRange        userArenaChunkFree {
   0.00        136          8          runtime.offAddr      base {
   0.00        136          8              uintptr  a;
                                       };
   0.00        144          8          runtime.offAddr      limit {
   0.00        144          8              uintptr  a;
                                       };
                                   };
   0.00        152          8      internal/abi.Type*       largeType;
                               };

The example above reports the field accesses of the Go runtime-internal struct mspan while running the benchmarks in net/http with go version devel go1.24-eb6f2c24cd Sat Sep 28 01:07:09 2024 +0000 linux/amd64.

Alternative

Instead of introducing a new runtime/pprof Profile, an approach similar to go build -cover could be used. At build time, accesses to fields in Go structs could be instrumented, and a report would be generated when the resulting Go binary is executed. go tool cover could then use this report to show the number of times each field in a struct was accessed.
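To make the instrumentation idea more concrete, here is a hand-written sketch of what instrumented code might look like. The counter array, its index constants, and the placement of the increment are all assumptions made for illustration; nothing like this is emitted by go build -cover today.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Point is an example user struct.
type Point struct {
	X, Y int
}

// Hypothetical per-field access counters for Point, standing in for
// what an instrumenting build might generate.
const (
	pointX = iota
	pointY
)

var pointAccess [2]atomic.Uint64

// sumX reads only field X. The counter increment marks the spot where
// an instrumenting build could record the field access.
func sumX(ps []Point) int {
	total := 0
	for i := range ps {
		pointAccess[pointX].Add(1) // hypothetical inserted instrumentation
		total += ps[i].X
	}
	return total
}

func main() {
	ps := []Point{{1, 2}, {3, 4}, {5, 6}}
	fmt.Println("sum:", sumX(ps))
	fmt.Println("X accesses:", pointAccess[pointX].Load())
	fmt.Println("Y accesses:", pointAccess[pointY].Load())
}
```

Exact counters like these add a cost to every field access, which is one reason the sampling approach used by perf may be the cheaper option.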

Question

I lack the knowledge of Go runtime internals needed to provide a proof of concept with this proposal.

  • Should runtime internal Go structs be exposed as well with data type profiling?
  • Should the profiling of Go structs differentiate between publicly exposed fields and non-public internal fields?
  • Is it possible and safe to turn on/off data-type profiling during runtime?
  • Should the profile collect samples of field accesses, similar to the perf approach, or count and report exact numbers?
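The trade-off in the last question can be illustrated with a small sketch. The sampler type below is hypothetical and only models the idea: it records one out of every rate accesses and scales the sample count back up, which loses some precision compared with exact counting.

```go
package main

import "fmt"

// sampler is a hypothetical illustration, not a proposed runtime API.
// It records every rate-th access, loosely modelling a hardware
// counter that raises an event after a fixed number of accesses.
type sampler struct {
	rate    uint64 // record one out of every rate accesses
	seen    uint64 // exact number of accesses
	samples uint64 // number of recorded samples
}

// access registers one field access and records a sample if due.
func (s *sampler) access() {
	s.seen++
	if s.seen%s.rate == 0 {
		s.samples++
	}
}

// estimate scales the recorded samples back to an access count.
func (s *sampler) estimate() uint64 {
	return s.samples * s.rate
}

func main() {
	s := &sampler{rate: 100}
	for i := 0; i < 12345; i++ {
		s.access()
	}
	fmt.Println("exact:", s.seen, "estimated:", s.estimate())
	// prints "exact: 12345 estimated: 12300"
}
```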
@gopherbot added this to the Proposal milestone Sep 28, 2024
@prattmic
Member

This is a very intriguing type of profile I’ve never heard of before.

Do you intend that this profile would work the same way as the linked perf profile type? That is, using a precise “memory access” (or memory load) hardware PMU metric.

Aside: I found the patch message to be the most straightforward and concise summary of how that profile works: https://lwn.net/Articles/954938/

Along those lines, do you know if the existing perf profile works on Go programs? I don’t see fundamental reasons it shouldn’t, but we may be missing some DWARF. So even if we don’t add a profile to runtime/pprof, fixing up problems with perf profiles may be doable.

@prattmic
Member

cc @golang/runtime

@ianlancetaylor moved this to Incoming in Proposals Sep 29, 2024
@florianl
Contributor Author

> Do you intend that this profile would work the same way as the linked perf profile type? That is, using a precise “memory access” (or memory load) hardware PMU metric.

I think implementing this new profile based on PMU metrics would benefit accuracy. I lack the knowledge of Go runtime internals to tell whether there is an option to implement it without PMU metrics.

> Along those lines, do you know if the existing perf profile works on Go programs?

I use perf whenever it is available, and so far I have not run into issues or missed any information when profiling Go executables. The example of struct runtime.mspan in the initial post of this proposal was generated by perf.
To my knowledge, perf is not available on every OS; for example, I'm not aware of perf on Windows. perf is also often not deployed to production systems. Therefore, the Go ecosystem would benefit from the insights of this new profile if it were integrated natively.

@prattmic
Member

It's great to hear that the perf tool seems to work well.

#36821 and #53286 cover providing PMU-based profiles in Go, though those are targeted at the more typical profiles (cycles, instructions, etc.). Cross-platform support is discussed there as well. I believe the summary is that Linux of course has the perf events API; Windows has an API, though none of us are familiar with it; and macOS does not seem to have a (public) API at all.
