-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Elastic Fabric Adapter (EFA) metric collection to awscontainerinsightreceiver. #186
Conversation
receiver/awscontainerinsightreceiver/internal/cadvisor/testutils/helpers.go
Show resolved
Hide resolved
|
||
func (s *Scraper) fillMetric(metric stores.CIMetric, metricType string, namespace string, podName string, | ||
containerName string, deviceName string, timestamp time.Time, counters *efaCounters) { | ||
s.assignRateValueToField(ci.MetricName(metricType, ci.EfaRdmaReadBytes), counters.rdmaReadBytes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably makes sense to have a map, and you can loop over to assignRate value to each field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I did what you're asking in the next commit, let me know if you were imagining something different.
d.logger.Warn("Can't get containerInstanceId") | ||
} else { | ||
tags[ci.ContainerInstanceIDKey] = d.ecsInfo.GetContainerInstanceID() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why cant we move lines 110-120 to addECSResources func as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done. I had moved this from cadvisor_linux and didn't make any modifications.
} | ||
} | ||
|
||
func addECSResources(tags map[string]string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a silly question, but is there any reason not to reuse this - https://github.com/amazon-contributing/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/internal/stores/utils.go#L98
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, moved the Instance* metric type logic to there.
For latest revision, I squashed all my commits together and rebased on aws-cwa-dev. This got rid of the history... if you want to see the previous state of my PR, you can view the tree at: https://github.com/straussb/opentelemetry-collector-contrib/tree/db5503cb45f8e78b169da51c95bc4489f55db3ff And compare to current state at: https://github.com/straussb/opentelemetry-collector-contrib/compare/db5503cb45f8e78b169da51c95bc4489f55db3ff..efa The files that had rebase conflicts were:
So review those a bit more carefully. Since last review, I also added a bit more logic in efaSysfs.go and podresourcesclient.go to check permissions/ownership of |
TypeContainerGPU = "ContainerGPU" | ||
TypePodGPU = "PodGPU" | ||
TypeNodeGPU = "NodeGPU" | ||
TypeClusterGPU = "ClusterGPU" | ||
TypeNeuronContainer = "ContainerNeuron" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need to be updated in this commit, but it would be nice to update this to TypeContainerNeuron
as well for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done.
|
||
stat, ok := info.Sys().(*syscall.Stat_t) | ||
if !ok { | ||
return true, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it ok to return this condition as true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would essentially mean that we are not running on Linux. But you're right, we should skip in that case so we remember to implement this logic for Windows if we ever add support for that. Changed in next commit.
// This was already patched and submitted, see | ||
// https://www.spinics.net/lists/linux-rdma/msg68596.html | ||
// Remove this as soon as the fix lands in the enterprise distros. | ||
if strings.Contains(stringValue, "N/A (no PMA)") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this break if the fix gets promoted? how do we track the release of this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I'm going to follow up with a commit to improve partial error handling - still post node-level metrics if we couldn't get pod resources data, as we discussed in the meeting; and also include data for some devices/counters even if reading some other data fails. As part of that commit, I will remove this logic and just skip if parsing to int failed.
(The comment is wrong anyway, I copied from somewhere else, so thanks for pointing this out. There's no upstream fix that will change anything, since we've implemented our own counter-reading code here.)
if err != nil { | ||
acir.settings.Logger.Debug("Unable to start neuron scraper", zap.Error(err)) | ||
} | ||
|
||
acir.initEfaSysfsScraper(localnodeDecorator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should catch a error here as well, and log if the EFA scraper is not able to start
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously there was no reason that the scraper couldn't start. Added one case now.
ReadCounter(deviceName efaDeviceName, port string, counter string) (uint64, error) | ||
} | ||
|
||
type podResourcesStore interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why create a new one, we can reuse the one defined in the neuronScraper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to duplicate in this case - I think the typical Go style would say that each class can independently mock out its dependencies as it chooses to. I thought about moving the interface to podresourcesstore.go but it doesn't make sense to take that flexibility away from the dependent classes.
…ightreceiver. The new component scrapes hardware counters from /sys/class/infiniband on disk. The layout of that directory is: /sys/class/infiniband/<device name> └── ports └── 1 └── hw_counters ├── rdma_read_bytes ├── rdma_write_bytes ├── rdma_write_recv_bytes ├── rx_bytes ├── rx_drops └── tx_bytes These are cumulative counters and so they are converted to deltas before sending down the pipeline. We sum up data from all ports. The device data is joined with data from the kubelet podresources API which tells us which container a given device is assigned to. The metrics are reported at container, pod, and node levels. This commit also refactors some metric decoration code from cadvisor to a common localnode decorator, intended to be used by any awscontainerinsightreceiver component that gathers metrics from the local node (as oppoosed to e.g. the k8s control plane scraper). This is because we want to share the logic of populating PodName, Kubernetes labels, etc. I also renamed RawContainerInsightsMetric to CIMetricImpl for brevity.
eed8454
into
amazon-contributing:aws-cwa-dev
Add Elastic Fabric Adapter (EFA) metric collection to awscontainerinsightreceiver.
The new component scrapes hardware counters from /sys/class/infiniband on disk. The layout of that directory is:
These are cumulative counters and so they are converted to deltas before sending down the pipeline.
We sum up data from all ports.
The device data is joined with data from the kubelet podresources API which tells us which container a given device is
assigned to.
The metrics are reported at container, pod, and node levels.
This commit also refactors some metric decoration code from cadvisor to a common localnode decorator, intended to be
used by any awscontainerinsightreceiver component that gathers metrics from the local node (as oppoosed to e.g. the k8s
control plane scraper). This is because we want to share the logic of populating PodName, Kubernetes labels, etc.
I also renamed RawContainerInsightsMetric to CIMetricImpl for brevity.
See also accompanying cloudwatch-agent PR: aws/amazon-cloudwatch-agent#1098