const weight packing support #146
It's beneficial for loading a high-precision weight, packing it into a lower precision, and caching the result, but it's not applicable to the INT4 weights in LLMs (for example, the W4A16 scenario).
The first datatype that we want to support is BF16. And I don't think weight packing is inapplicable to INT4: for the W4A16 scenario, we can still use a similar block format like NK8k16n2k for weight packing to improve cache locality when converting INT4 to BF16 before BRGEMM. Is there a special reason that makes it not applicable?
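To illustrate the blocked-packing idea, here is a minimal sketch that repacks a row-major [K x N] weight into [N/n0][K/k0][k0][n0] blocks so each micro-kernel tile is contiguous. The block sizes, dimension order, and element type are illustrative assumptions, not the actual NK8k16n2k definition:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Repack a row-major [K x N] weight into [N/n0][K/k0][k0][n0] blocks so that
// the k0 x n0 tile consumed by one micro-kernel call is contiguous in memory.
// T would be uint16_t for raw BF16 bits (packing only moves bytes around).
template <typename T>
std::vector<T> pack_weight_blocked(const std::vector<T>& w, int K, int N,
                                   int k0 = 16, int n0 = 8) {
    assert(K % k0 == 0 && N % n0 == 0);
    std::vector<T> packed(static_cast<size_t>(K) * N);
    size_t idx = 0;
    for (int nb = 0; nb < N / n0; ++nb)        // blocks along N
        for (int kb = 0; kb < K / k0; ++kb)    // blocks along K
            for (int k = 0; k < k0; ++k)       // rows inside a block
                for (int n = 0; n < n0; ++n)   // contiguous inner dimension
                    packed[idx++] =
                        w[static_cast<size_t>(kb * k0 + k) * N + (nb * n0 + n)];
    return packed;
}
```

For the W4A16 case, the same loop nest could fold the INT4-to-BF16 conversion into the write, so the decompressed weight lands directly in the cache-friendly layout.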
My bad, I was referring to it not being beneficial for the datatype-conversion step (from INT4 up to a higher precision such as FP16 or INT8) within the overall tensor transformation pipeline.
In OpenVINO, the IR of a model consists of its topology and constant values, like weights, in memory. For each graph, there is a … Take … In the compile stage, the operations (for example, type-casting ops) that follow the … When the … When the …
I think we can still use most of the design of the constant tensor cache for oneDNN Graph. The model will be split into two parts: …
A big difference between OpenVINO's design and this design is that compile and execution are operation-wise in OpenVINO, while with our GC as the backend they will be graph-wise. This may cause some difficulties for integration.
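For concreteness, here is a minimal sketch of what such a constant tensor cache could look like on the GC side. The class name, the key choice (identity of the frozen constant buffer), and the locking scheme are all assumptions for illustration, not the actual OpenVINO or oneDNN Graph API:

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical constant tensor cache: the folded result of const-only ops
// (e.g. cast + pack of a frozen weight) is computed once, keyed by the
// identity of the original constant buffer, and reused by later executions.
class ConstTensorCache {
public:
    using Buffer = std::shared_ptr<std::vector<uint8_t>>;

    // Returns the cached folded buffer, computing it with `fold` on first use.
    Buffer get_or_fold(const void* const_key,
                       const std::function<Buffer()>& fold) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = cache_.find(const_key);
        if (it != cache_.end())
            return it->second;          // hit: reuse across iterations
        Buffer folded = fold();         // miss: run the const subgraph once
        cache_.emplace(const_key, folded);
        return folded;
    }

private:
    std::mutex mu_;
    std::unordered_map<const void*, Buffer> cache_;
};
```

With a graph-wise backend, `fold` would correspond to executing the whole constant subgraph produced by the split, rather than a single op.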
Based on the current MLIR integration here: openvinotoolkit/openvino@4b524ca#diff-b40ca25e9ca41e663971ae2274f78b5a444a0f3ba5d014d2323e20f199b690b0, an MLIR subgraph will be represented as a single op in the OV graph, which sounds like a perfect match.
During model inference, the model weights are frozen and do not change between iterations. The CPU prefers a special weight layout to accelerate execution, so we need to pre-pack the model weights before execution. This issue covers the items below:
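For illustration, a minimal sketch (hypothetical names, float elements, and plain loops instead of real BRGEMM kernels) of an execution that consumes such a pre-packed weight, so each iteration reads contiguous tiles and no repacking happens inside the inference loop:

```cpp
#include <vector>

// Multiply a row-major [M x K] activation by a weight pre-packed into
// [N/n0][K/k0][k0][n0] blocks (see the packing sketch above). The packed
// buffer is produced once before inference; every iteration reuses it
// because the weight is frozen.
void matmul_prepacked(const std::vector<float>& a,        // [M x K], row-major
                      const std::vector<float>& packed_w, // blocked layout
                      std::vector<float>& c,              // [M x N], row-major
                      int M, int K, int N, int k0 = 16, int n0 = 8) {
    c.assign(static_cast<size_t>(M) * N, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int nb = 0; nb < N / n0; ++nb)
            for (int kb = 0; kb < K / k0; ++kb) {
                // One contiguous k0 x n0 tile of the packed weight.
                const float* tile =
                    &packed_w[(static_cast<size_t>(nb) * (K / k0) + kb) * k0 * n0];
                for (int k = 0; k < k0; ++k)
                    for (int n = 0; n < n0; ++n)
                        c[static_cast<size_t>(m) * N + nb * n0 + n] +=
                            a[static_cast<size_t>(m) * K + kb * k0 + k] *
                            tile[k * n0 + n];
            }
}
```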