Understanding Ansor
Gaurav Verma edited this page Dec 16, 2023 · 3 revisions
- Ansor explores a large optimization search space. It samples high-performing programs from a "hierarchical representation"[A] of the search space.
- It further fine-tunes the sampled programs using evolutionary search and a gradient-boosted cost model (e.g., XGBoost), resulting in the best-performing tensor programs.
- Provides comprehensive coverage of optimizations through its exploration of the search space
- Ansor dynamically prioritizes sub-graphs of the DNNs that are more likely to improve the overall performance.
Ansor is evaluated on three classes of hardware: Intel CPU, ARM CPU, and NVIDIA GPU.
A task is a process performed to generate high-performance programs for a subgraph.
Ansor uses the operator fusion algorithm from Relay to convert DNNs from input model formats to partitioned subgraphs.
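To make the partitioning step concrete, here is a toy sketch of fusing element-wise nodes into their compute-intensive producers. This is an illustration only, not Relay's actual FuseOps pass; the op categories and the `partition` helper are hypothetical.

```python
# Toy sketch: partition a topologically ordered op list into subgraphs
# by fusing element-wise ops into the preceding subgraph.
# Not Relay's real fusion algorithm; categories are illustrative.

COMPUTE_INTENSIVE = {"conv2d", "dense"}
ELEMENT_WISE = {"relu", "add", "bias_add"}

def partition(ops):
    """ops: topologically ordered list of op names.
    Returns a list of fused subgraphs (lists of op names)."""
    subgraphs = []
    for op in ops:
        if subgraphs and op in ELEMENT_WISE:
            subgraphs[-1].append(op)  # fuse into producer's subgraph
        else:
            subgraphs.append([op])    # start a new subgraph
    return subgraphs

print(partition(["conv2d", "bias_add", "relu", "conv2d", "relu", "dense"]))
# [['conv2d', 'bias_add', 'relu'], ['conv2d', 'relu'], ['dense']]
```

Each resulting subgraph then becomes one tuning task for Ansor.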
- Program Sampler:
- constructs large search space with two levels: sketch and annotation
- samples diverse programs from it
- expands the search space by recursively applying the derivation rules
- randomly samples complete programs
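The two sampler steps above (recursive derivation, then random completion) can be sketched as follows. The rule names, tile-size choices, and `annotate` fields are made-up stand-ins, not Ansor's real rule set.

```python
import random

# Toy sketch of Ansor's two-level sampling: derivation rules expand
# the sketch space, then random annotation produces complete programs.
# Rule names and annotation values are hypothetical.

def derive_sketches(sketch, rules, depth=2):
    """Recursively apply derivation rules to expand the sketch space."""
    if depth == 0:
        return [sketch]
    out = [sketch]  # a rule may also be skipped at this step
    for rule in rules:
        out += derive_sketches(sketch + [rule], rules, depth - 1)
    return out

def annotate(sketch, rng):
    """Sample a complete program: fill in tile size, unroll factor, etc."""
    return {"sketch": sketch,
            "tile_size": rng.choice([4, 8, 16, 32]),
            "unroll": rng.choice([0, 16, 64])}

sketches = derive_sketches([], ["multi_level_tile", "fuse_consumer"])
rng = random.Random(0)
programs = [annotate(s, rng) for s in random.Random(1).sample(sketches, 3)]
```

Random sampling keeps the initial population diverse; quality comes later from the Performance Tuner.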
- Performance Tuner:
- fine-tunes the performance of the programs sampled by the Program Sampler.
- It seeds the evolutionary search (ES) with re-sampled new programs and good programs from previous iterations
- ES fine-tunes the programs by mutation and crossover (Questionable to me!)
- Task Scheduler:
- allocates time and resources for optimizing multiple subgraphs in the DNNs.
- Uses a gradient-descent-based scheduling algorithm to allocate time to the most promising candidates (subgraphs).
- A task is picked based on the gradient-based algorithm maximizing the objective function in a given time budget.
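The scheduling idea can be sketched as picking, each round, the task whose estimated gradient promises the largest reduction in total DNN latency. The numbers and the `pick_task` helper below are made up for illustration; Ansor's actual objective functions are richer.

```python
# Toy sketch of gradient-based task scheduling: in each round, pick the
# subgraph task whose estimated gradient (change in end-to-end latency
# per extra tuning round) is most negative. Values are hypothetical.

def pick_task(latencies, grads):
    """latencies: current best latency per subgraph task.
    grads: estimated d(latency)/d(round) if one more round is spent.
    Returns the index of the task expected to help most."""
    return min(range(len(grads)), key=lambda i: grads[i])

# Three subgraphs; task 1 is predicted to improve fastest.
print(pick_task([5.0, 9.0, 3.0], [-0.1, -2.5, -0.3]))  # 1
```

Tasks with flat gradients (already well-tuned or unpromising subgraphs) thus stop consuming the time budget.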
- To generate a sketch for a DAG: topologically traverse all the nodes and construct the structure iteratively.
- The sketch includes basic tile and fusion structures for compute-intensive nodes with data-reuse opportunities (e.g., Conv2D).
- Element-wise nodes (e.g., ReLU) are inlined
- Multiple sketches are generated using a derivation-based recursive approach
- Derivation rules will differ as per the underlying hardware. Example: multi-level tiling:
- CPU: SSRSRS, S = one tile level of space loop; R = one tile level of reduction loop
- GPU: SSSRRSRS to match the architecture of the GPU
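A small worked example of the "SSRSRS" idea, applied to a tiny matmul: the space loops i, j are each split into tile levels (S) and the reduction loop k into its own levels (R), interleaved. The sizes here are tiny and hypothetical; in Ansor the actual tile sizes are filled in by annotation.

```python
# Toy illustration of multi-level tiling on C[i,j] += A[i,k] * B[k,j].
# Loop nest order below is S S R S R S; tile sizes are made up.

N, K, TI, TJ, TK = 4, 4, 2, 2, 2
A = [[1] * K for _ in range(N)]
B = [[1] * N for _ in range(K)]
C = [[0] * N for _ in range(N)]

for i0 in range(0, N, TI):              # S: outer tile of space loop i
    for j0 in range(0, N, TJ):          # S: outer tile of space loop j
        for k0 in range(0, K, TK):      # R: outer tile of reduction loop k
            for i in range(i0, i0 + TI):          # S: inner space loop i
                for k in range(k0, k0 + TK):      # R: inner reduction loop k
                    for j in range(j0, j0 + TJ):  # S: inner space loop j
                        C[i][j] += A[i][k] * B[k][j]

print(C[0][0])  # 4, i.e. K, since every input element is 1
```

Tiling the loops this way exposes the cache-reuse and vectorization opportunities that the later annotation step exploits.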
- Sketches are annotated to make them complete programs for fine-tuning and evaluation
- E.g., fill out tile size, unroll an inner loop, etc.
- Users can provide custom hints to adjust the annotation policy.
- Iterative process
- First, use ES to find high-performing programs using a cost model (probabilistic selection)
- Get actual measurements on hardware
- Retrain the cost model
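The three steps above form the tuning loop. A minimal sketch, assuming stand-in components throughout: the `true_latency` "hardware measurement", the candidate encoding (a single tile size), and the lookup-table "retraining" are all hypothetical simplifications of Ansor's real cost model.

```python
import random

# Toy sketch of the iterative tuning loop: select candidates with the
# cost model, measure the best ones on "hardware", retrain the model
# on the measurements. Every component here is a stand-in.

def true_latency(tile):                  # pretend hardware measurement
    return abs(tile - 16) + 1

def tune(rounds=3, rng=random.Random(0)):
    data = []                            # (program, measured latency)
    model = lambda t: 0.0                # untrained cost model: no signal
    for _ in range(rounds):
        candidates = [rng.choice([2, 4, 8, 16, 32, 64]) for _ in range(8)]
        best = sorted(candidates, key=model)[:2]      # model-guided pick
        data += [(t, true_latency(t)) for t in best]  # real measurements
        lut = dict(data)                 # "retrain": memorize measurements
        model = lambda t, lut=lut: lut.get(t, 0.0)
    return min(data, key=lambda x: x[1])

best_program, best_latency = tune()
```

Only a small fraction of sampled programs are ever measured on hardware; the cost model filters the rest cheaply.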
- ES is a meta-heuristic algorithm
- The following types of mutations are applied:
- tile size mutation
- parallel mutation
- pragma mutation
- computation location mutation
- node-based crossover
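The mutation and crossover operators can be illustrated with a toy ES over tile-size vectors. The encoding, the fake fitness (latency) function, and the population parameters below are invented for the example; they only mirror the shape of Ansor's search, not its operators' details.

```python
import random

# Toy evolutionary search over tile-size vectors, showing mutation and
# crossover. Fitness function and encoding are made up; parents are
# kept each generation (elitism), so the best never gets worse.

rng = random.Random(42)
CHOICES = [1, 2, 4, 8, 16, 32]

def fitness(prog):                       # lower is better (fake latency)
    return sum(abs(t - 8) for t in prog)

def mutate(prog):                        # tile-size mutation: one gene
    p = list(prog)
    p[rng.randrange(len(p))] = rng.choice(CHOICES)
    return p

def crossover(a, b):                     # crossover: mix two parents
    return [rng.choice(pair) for pair in zip(a, b)]

pop = [[rng.choice(CHOICES) for _ in range(4)] for _ in range(16)]
init_best = min(map(fitness, pop))

for _ in range(20):                      # evolve for a few generations
    pop.sort(key=fitness)
    parents = pop[:8]
    children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                for _ in range(8)]
    pop = parents + children

best = min(pop, key=fitness)
```

Crossover recombines good structures from two parents, while mutation keeps exploring tile sizes, loop annotations, and compute locations near them.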
[A] Hierarchical representation: generation of high-level structures plus low-level details, i.e., sketch and annotation