Add normalized power metrics #227
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
@tjablin for clarification - this is a number that appears only in the power tables, and is calculated as measured performance / measured energy?
Power WG (Sachin) would like to discuss this item with Tom Jablin.
This would be great and would make the tables more comprehensible for anyone trying to understand energy efficiency.
@DilipSequeira Yes. |
lgtm
Adding Sachin for review from the Power WG.
We discussed this in the Power WG. The concern is that a normalized metric such as Perf/W is context dependent and does not hold constant across a range of performance and power. The metric is only relevant to the measured performance metric and the measured power metric; if that is the case, then it can easily be derived from the given primary metrics. What is the proposal for communicating to any reviewer referencing it that the published Perf/W metric is only relevant for the associated measured primary metrics for performance and power, and that it cannot be applied to any other performance value, measured or otherwise estimated?
@s-idgunji I don't understand your comment at all. Could you please provide a concrete example?
@tjablin - System power is elastic. You can constrain your system to run at a lower power, say Po1, and for a benchmark get performance Pe1; the Perf/W is Pe1/Po1. You can increase the system power to Po2 and get performance Pe2; the Perf/W for this scenario is Pe2/Po2, and Pe2/Po2 != Pe1/Po1. But if the submission is just for (Pe1, Po1), one cannot use the normalized metric Pe1/Po1 for any other performance point like Pe2. That's what I meant. And we have to be clear that such aspects are understood through explicit communication, and not assume that reviewers understand them.
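To make the context dependence concrete, here is a minimal Python sketch; the two operating points are made-up numbers, not from any submission:

```python
# Two hypothetical operating points of the same system (illustrative only).
pe1, po1 = 1000.0, 50.0  # samples/s at a constrained power cap (W)
pe2, po2 = 1600.0, 90.0  # samples/s with the power cap raised (W)

perf_per_watt_1 = pe1 / po1  # 20.0 samples/s per W
perf_per_watt_2 = pe2 / po2  # ~17.8 samples/s per W

# The normalized metric differs between the two points, so Pe1/Po1
# says nothing about efficiency at Pe2, and vice versa.
assert perf_per_watt_1 != perf_per_watt_2
```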
The Power WG will propose the appropriate text for the table and for the normalized metric to be added (if this normalized metric is agreed as an addition by the Inference WG).
I believe that the Results Guidelines are here to help. In particular, requiring the MLPerf result ID to be included automatically identifies the system under test, the software, etc. Including the normalized Pe1/Po1 metric in the result's row will prevent its misuse with another Pe2 result from another row (with another result ID). The only problem I see is that someone could take the normalized metric for one workload and use it with another workload in the same row. However, given that samples/Joule may be very different for different workloads (e.g. 200 for ResNet, 10 for BERT), this is only a minor concern.
Notes from the Oct 20th Power WG meeting pertaining to this PR: for PR #227, if the Inference WG votes for adding the normalized metric, the Power WG will add a footnote to the table describing that the normalized metric is to be used only in the context of the submitted primary metrics. Exact wording is to be worked out in a future meeting, since it's not needed prior to the submission date and is not blocking any work towards the v2.0 submission.
The figure below has appeared on NVIDIA's blog with the following caption:
For ResNet50, the claimed Performance improvement of ~3x could have come from either non-MaxQ (6,138.84 / 2,039.11 = 3.01) or MaxQ (4,750.26 / 1,506.53 = 3.15).
But for BERT-99, the claimed Performance improvement of ~5x could have only come from non-MaxQ (476.34 / 96.73 = 4.92), not from MaxQ (394.33 / 61.35 = 6.43).
Therefore, it appears that the normalized metric is used out of context here, contrary to the last comment. I think it would be cleaner to have two separate figures:
I'm not saying the NVIDIA figure violates any rule today, but we should be more explicit about how to make such comparisons in the future.
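For reference, the arithmetic behind the claims above can be checked with a few lines of Python (the performance numbers are the ones quoted in this comment):

```python
# Performance ratios (v2.0 over v1.1) from the numbers quoted above.
ratios = {
    "ResNet50 non-MaxQ": 6138.84 / 2039.11,  # ~3.01
    "ResNet50 MaxQ":     4750.26 / 1506.53,  # ~3.15
    "BERT-99 non-MaxQ":  476.34 / 96.73,     # ~4.92
    "BERT-99 MaxQ":      394.33 / 61.35,     # ~6.43
}
for name, r in ratios.items():
    print(f"{name}: {r:.2f}x")
# For BERT-99, only the non-MaxQ ratio (~4.92) matches the claimed ~5x,
# so that claim cannot have come from the MaxQ numbers (~6.43).
```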
Hi @psyhtest - We discussed this in the Power WG, and from the discussions with @anirban-ghosh in the meeting, the table is fine. After discussing with @DilipSequeira and/or @georgelyuan, if there are any questions, please bring them up in the Power WG next week. Thanks.
I believe Tejus had the same reservation as me. George agrees that this could have been more explicit. Let's revisit in the next meeting.
I don't think that was the case. When it was clarified that the comparison is between consistent {Perf, Power} tuples, there was no issue. From the Power WG's side, we'd like to be clear on what exactly the issue is that you are raising.
Let Pid be the Performance (samples per second) and Wid be the Power of submission number id, and let Rid = Pid / Wid be their ratio, i.e. the normalized metric. For example, for ResNet50:
The figure effectively shows (P2.0-140 / P1.1-110 = 3.01) as the "Perf" bars, and (R2.0-141 / R1.1-111 = 1.88) as the "Perf/Watt" bars next to each other. But the latter is actually (P2.0-141 / W2.0-141) / (P1.1-111 / W1.1-111) = (P2.0-141 / P1.1-111) * (W1.1-111 / W2.0-141) = 3.15 * 0.59. In summary, the adjacent "Perf" and "Perf/Watt" bars are based on the ratio of different Performance values.
We know the ratio (W1.1-111 / W2.0-141 = 0.59) because W1.1-111 and W2.0-141 were measured and reported by NVIDIA. @s-idgunji As you have said many times yourself, the Perf/Watt ratio cannot be assumed constant across different modes such as MaxN and MaxQ. Unfortunately, we do not know the MaxN ratio (W1.1-110 / W2.0-140). But, hypothetically, if (W1.1-110 = 50) and (W2.0-140 = 70), this ratio is 0.71. Conversely, if (W1.1-110 = 50) and the ratio is assumed to be the same as the MaxQ ratio (0.59), then (W2.0-140 ≈ 85). Without taking any Orin measurements myself, I find the former hard to believe.
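A short Python sketch of the decomposition above; only the performance numbers and the rounded 0.59 wattage ratio come from this thread, and the wattages themselves are not public here:

```python
# MaxQ performance (samples/s) for the two ResNet50 submissions quoted above.
p_20_141 = 4750.26  # v2.0, submission 2.0-141
p_11_111 = 1506.53  # v1.1, submission 1.1-111

w_ratio = 0.59  # measured (W1.1-111 / W2.0-141), rounded

perf_ratio = p_20_141 / p_11_111   # ~3.15
norm_ratio = perf_ratio * w_ratio  # ~1.86 with the rounded inputs

print(f"Perf ratio (MaxQ):      {perf_ratio:.2f}")
print(f"Perf/Watt ratio (MaxQ): {norm_ratio:.2f}")
# The "Perf" bar next to it (~3.01) uses the non-MaxQ performance values,
# so the two adjacent bars are built from different Performance measurements.
```

(The figure shows 1.88; the small difference comes from 0.59 being a rounded ratio.)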
And nothing in the table breaks this ask. As far as I can tell, there is no Watt with the Perf-only submission, and neither is the max Perf being used in the Perf/W comparison. The rule is that any Perf/W can use a Perf only with its associated Watt and no other, and one can only refer to the measured Perf/W in the comparison. If the normalized data in the NVIDIA table needs to be presented separately and footnoted clearly to avoid confusion, and you have suggestions, please reach out to NVIDIA directly. If, on the other hand, you think that the ratios used for Perf/W are incorrect, i.e. an incorrect Perf value paired with an incorrect Power value, then please point that out in our next WG.

And do you want to propose guidelines on how Perf/W should be normalized and how results should be presented by submitters? That can be discussed separately as well. But we want to be careful: eventually Perf/W comparisons can also lead to Perf comparisons. For example, how can submitters compare the Perf data of a "Perf only" submission vs the Perf data of a "Perf, Power" submission? Is that a valid comparison? Some submitter could do that. So you can see that this can become increasingly complicated, as submissions can cover everything from energy-efficient points to max-performance points, and so on.

I also want to point out that this PR is specifically about adding a footnote to the externally visible MLCommons tables on our website, stating that any normalized Perf/W data is only valid for that particular entry and no other.
[Just catching up on this thread.] Suppose that MLPerf had additional metrics (say, DRAM usage and system noise in dB) with entries optimized for each using different SW configuration options. Would it be legitimate to present a comparison of two systems across the scenarios capturing all of these things in a single chart? I think it's useful to be able to do so, and we should think about what criteria we want to impose in such cases. I would say, at minimum:
AFAICT our blog post conforms to the first two of these, and IMO is consistent with the spirit of the rules as well as the letter. It doesn't conform to the third, and hence is somewhat imprecise about what's being presented. I think it would be reasonable to refine the requirements around footnoting so that it's clearer how such tables should be annotated in future rounds to prevent any possible confusion.
I think this PR is useful and should be merged. Joules/stream can be added as an additional column in the results table, as it makes life so much easier for any user reading the table. The following data is taken from the v3.0 results for the ResNet50 model. The perf and power metric columns are from the results table, and I'm not sure a normal user can do a simple comparison based on the values shown, as different scenarios are involved. But the last column, Joules per sample, gives a straightforward comparison and a useful metric to the user, and it also enables a power-efficiency comparison across the 4 different scenarios. PS: In the below table, Power is reported in Watts for the Offline and Server scenarios, and energy in millijoules for SingleStream and MultiStream, as in the official results table.
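As a sketch of the conversion described above, assuming the units of the official results table (Watts for Offline/Server, millijoules per query for SingleStream/MultiStream) and purely illustrative placeholder values:

```python
# Reduce each scenario's reported metrics to a common Joules-per-sample figure.
# Offline/Server report throughput (samples/s) plus average power (W);
# SingleStream/MultiStream report energy per query in millijoules.
# All numeric values below are illustrative placeholders, not real results.

def j_per_sample_from_power(throughput_sps: float, power_w: float) -> float:
    """Offline/Server: Watts divided by samples/s gives Joules per sample."""
    return power_w / throughput_sps

def j_per_sample_from_energy(energy_mj: float, samples_per_query: int = 1) -> float:
    """SingleStream/MultiStream: per-query millijoules to Joules per sample."""
    return energy_mj / 1000.0 / samples_per_query

print(j_per_sample_from_power(20000.0, 400.0))  # Offline:       0.02 J/sample
print(j_per_sample_from_power(15000.0, 380.0))  # Server:       ~0.025 J/sample
print(j_per_sample_from_energy(25.0))           # SingleStream:  0.025 J/sample
print(j_per_sample_from_energy(180.0, 8))       # MultiStream (8 samples/query)
```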
Adding the link to the full results with the unified power metric (inferences per Joule) here.
Needs more discussion; we need an opinion from the Power WG.
We can discuss this. We had already agreed to add energy metrics. What is this new update?
@mrmhodak - We had a discussion in the Power WG meeting.
The next step is to follow up in the next meeting (we had to cover another 3.1 submission topic). Sachin
Closing, no longer relevant with new spreadsheet coming soon. |