const weight packing support #146

Open
ZhennanQin opened this issue Jul 3, 2024 · 5 comments · May be fixed by #74
Assignees
Labels
CPU enhancement New feature or request

Comments

@ZhennanQin
Contributor

During model inference, the model weights are frozen and do not change between iterations. The CPU prefers a special weight layout to accelerate execution, so we need to prepack the model weights before model execution. This issue covers the following items:

  • Analyze how weight pre-packing is done in OpenVINO.
  • Provide an RFC on how to do weight pre-packing in MLIR to meet the OpenVINO requirements.
  • Implement the weight pre-packing pass in the current CPU pipeline to support BF16 MLP inference.
@ZhennanQin ZhennanQin added this to the CPU milestone Jul 3, 2024
@ciyongch
Contributor

ciyongch commented Jul 3, 2024

It's beneficial for loading high-precision weights, packing them into low precision, and caching the result, but it's not applicable to INT4 weights in LLMs (for example, the W4A16 scenario).
Should we clarify the original weight's datatype?

@ZhennanQin
Contributor Author

> It's beneficial for loading high-precision weights, packing them into low precision, and caching the result, but it's not applicable to INT4 weights in LLMs (for example, the W4A16 scenario). Should we clarify the original weight's datatype?

The first datatype that we want to support is BF16. I also don't think weight packing is inapplicable to INT4. For the W4A16 scenario, we can still use a similar blocked format such as NK8k16n2k for weight packing to improve cache locality when converting INT4 to BF16 before BRGEMM. Is there any special reason that makes it not applicable?
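As a rough illustration of the blocked layout mentioned above, here is a minimal pure-Python sketch of packing a row-major K x N weight into an NK8k16n2k-style layout. It assumes the tag reads as outer N blocks, then outer K blocks, then an inner 8 x 16 x 2 tile (K pairs, N lanes, 2 adjacent K elements), so each block covers 16 x 16 elements. The function name `pack_nk8k16n2k` and its parameters are hypothetical and are not taken from any library or from this project.

```python
def pack_nk8k16n2k(weight, K, N, kb=16, nb=16, vnni=2):
    """Pack a row-major K x N matrix (list of rows) into a flat list.

    Assumed layout, outermost to innermost:
    [N / nb][K / kb][kb / vnni][nb][vnni]
    """
    assert K % kb == 0 and N % nb == 0 and kb % vnni == 0
    packed = [0] * (K * N)
    for k in range(K):
        for n in range(N):
            n_blk, n_in = divmod(n, nb)          # which N block, offset inside it
            k_blk, k_in = divmod(k, kb)          # which K block, offset inside it
            k_pair, k_lane = divmod(k_in, vnni)  # pair index and position in the pair
            idx = n_blk
            idx = idx * (K // kb) + k_blk
            idx = idx * (kb // vnni) + k_pair
            idx = idx * nb + n_in
            idx = idx * vnni + k_lane
            packed[idx] = weight[k][n]
    return packed
```

With this layout, the two K elements that a BF16 dot-product instruction would consume together sit next to each other, and a whole 16 x 16 tile is contiguous in memory, which is the cache-locality benefit discussed here. For INT4 the same index mapping could be applied before or after dequantization; that choice is left open in this sketch.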

@ciyongch
Contributor

ciyongch commented Jul 3, 2024

My bad, I meant that it's not beneficial for the datatype conversion step (from INT4 to a higher precision, either FP16 or INT8) within the overall tensor transformation pipeline.
For re-layout, it's always beneficial from a cache locality perspective.

@ZhennanQin ZhennanQin assigned niuxiaog and unassigned Devjiu Jul 5, 2024
@lmontigny lmontigny added the enhancement New feature or request label Jul 16, 2024
@niuxiaog
Contributor

In OpenVINO, the IR of a model consists of its topology and its constant values, such as weights, held in memory.

For each Graph, there is a GraphContext attr. A GraphContext holds a WeightsSharing, which is basically a std::unordered_map<std::string, MemoryInfo::Ptr> that stores the memory of cached tensors.

Take FullyConnectedOp (FC for short) with DNNL primitive as an example. Each FC has a DnnlFCExecutor, which has an attr of type ExecutorContext. The ExecutorContext holds an unordered_map<string, MemoryPtr> to store the memory of its private cached weights.

In the compile stage, the operations (for example, type casting ops) that follow a ConstantOp (weights, bias or others) are executed, and the results are cached in the unordered_map of the GraphContext.

When the FC has a dynamic-shape input, which is the case for llama2, nothing is done to the weights in the compile stage. In fact, there is no explicit ReorderOp in the graph after the ConstantOp that holds the weight of a FullyConnectedOp. In the first execution, all the input shapes are known and the DnnlFCExecutor is constructed. During construction, the weight is packed into the blocking format and stored in the unordered_map of the ExecutorContext. In later executions, the packed weight is used directly.
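The pack-on-first-execution behaviour described above can be sketched as follows. This is only an illustration in Python, not OpenVINO code: `PackedWeightCache` and `LazyFC` are hypothetical names, a plain dict stands in for the unordered_map, and a transpose stands in for the real packing into a blocking format.

```python
class PackedWeightCache:
    """String-keyed store for packed weights (stand-in for the unordered_map)."""

    def __init__(self):
        self._store = {}
        self.pack_calls = 0  # counts how often packing actually ran

    def get_or_pack(self, key, weight, pack_fn):
        if key not in self._store:
            self._store[key] = pack_fn(weight)
            self.pack_calls += 1
        return self._store[key]


def transpose(weight):
    """Placeholder 'packing': turn a K x N matrix into N x K."""
    return [list(col) for col in zip(*weight)]


class LazyFC:
    """Fully connected op that packs its weight on the first execution only."""

    def __init__(self, name, weight, cache):
        self.name = name
        self.weight = weight  # K x N, row-major list of rows
        self.cache = cache

    def execute(self, x):
        # x is M x K; the packed weight is fetched from the cache after the first call.
        packed = self.cache.get_or_pack(self.name, self.weight, transpose)
        return [[sum(a * b for a, b in zip(row, col)) for col in packed] for row in x]
```

Calling `execute` twice on the same `LazyFC` triggers packing once; the second call only performs a cache lookup. In the static-shape case the same `get_or_pack` call would simply be issued at compile time instead.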

When the FC has a static shape, the above packing and caching process is done in the compile stage. All executions directly use the cached weight.

I think we can still use most of the constant tensor cache design for oneDNN Graph. The model will be split into two parts, fold and compute. Since the actual values of constant tensors are available in OpenVINO, the fold part can be compiled and executed in the compile stage, and only the compute part is executed in the execution stage.
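A minimal sketch of the fold/compute split, under the assumption that the graph is a topologically ordered list of ops and that an op belongs to the fold part exactly when all of its inputs are constants or outputs of other fold ops. The helper names `split_fold_compute` and `run_ops` and the `(output, fn, inputs)` tuple format are hypothetical and only serve the illustration.

```python
def split_fold_compute(ops, const_names):
    """Partition ops into a fold part and a compute part.

    ops: list of (output_name, fn, input_names) in topological order.
    const_names: names of tensors whose values are known at compile time.
    """
    known_const = set(const_names)
    fold, compute = [], []
    for out, fn, ins in ops:
        if all(name in known_const for name in ins):
            fold.append((out, fn, ins))
            known_const.add(out)  # results of folded ops are constants too
        else:
            compute.append((out, fn, ins))
    return fold, compute


def run_ops(ops, env):
    """Execute ops in order, reading and writing tensors in a copy of env."""
    env = dict(env)
    for out, fn, ins in ops:
        env[out] = fn(*(env[name] for name in ins))
    return env
```

In this sketch the fold part would be run once at compile time with the constant weights, its outputs kept as the cached tensors, and only the compute part would be run per inference with the runtime inputs added to that cached environment.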

A big difference between OpenVINO's design and this design is that compilation and execution are operation-wise in OpenVINO, while with our GC as the backend they will be graph-wise. This may cause some difficulties for integration.

@ZhennanQin
Contributor Author

> A big difference between OpenVINO's design and this design is that compilation and execution are operation-wise in OpenVINO, while with our GC as the backend they will be graph-wise. This may cause some difficulties for integration.

Based on the current MLIR integration here: openvinotoolkit/openvino@4b524ca#diff-b40ca25e9ca41e663971ae2274f78b5a444a0f3ba5d014d2323e20f199b690b0, an MLIR subgraph is represented as a single op in the OV graph, so it sounds like a perfect match.
