Understanding Ansor
Gaurav Verma edited this page Dec 16, 2023 · 3 revisions
- Ansor explores a large optimization search space. It samples high-performing programs from a "hierarchical representation"[A] of the search space.
- It further fine-tunes the sampled programs using evolutionary search and a gradient-boosted cost model (e.g., XGBoost), resulting in the best-performing tensor programs.
- Provides comprehensive coverage of optimizations through its exploration of the search space
- Ansor dynamically prioritizes sub-graphs of the DNNs that are more likely to improve the overall performance.
Ansor is evaluated on three classes of hardware: Intel CPU, ARM CPU, and NVIDIA GPU.
A task is a process performed to generate high-performance programs for a subgraph.
Ansor uses the operator fusion algorithm from Relay to convert DNNs from input model formats to partitioned subgraphs.
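To make the partitioning step concrete, here is a toy sketch of fusing element-wise nodes into their compute-intensive producers. This is an illustration only, not Relay's actual FuseOps pass; the op categories and the `partition` helper are hypothetical.

```python
# Toy sketch: partition a topologically ordered op list into subgraphs
# by fusing element-wise ops into the preceding subgraph.
# Not Relay's real fusion algorithm; categories are illustrative.

COMPUTE_INTENSIVE = {"conv2d", "dense"}
ELEMENT_WISE = {"relu", "add", "bias_add"}

def partition(ops):
    """ops: topologically ordered list of op names.
    Returns a list of fused subgraphs (lists of op names)."""
    subgraphs = []
    for op in ops:
        if subgraphs and op in ELEMENT_WISE:
            subgraphs[-1].append(op)  # fuse into producer's subgraph
        else:
            subgraphs.append([op])    # start a new subgraph
    return subgraphs

print(partition(["conv2d", "bias_add", "relu", "conv2d", "relu", "dense"]))
# [['conv2d', 'bias_add', 'relu'], ['conv2d', 'relu'], ['dense']]
```

Each resulting subgraph then becomes one tuning task for Ansor.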
- Program Sampler:
- constructs large search space with two levels: sketch and annotation
- samples diverse programs from it
- expands the search space by recursively applying the derivation rules
- randomly samples complete programs
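The two sampler steps above (recursive derivation, then random completion) can be sketched as follows. The rule names, tile-size choices, and `annotate` fields are made-up stand-ins, not Ansor's real rule set.

```python
import random

# Toy sketch of Ansor's two-level sampling: derivation rules expand
# the sketch space, then random annotation produces complete programs.
# Rule names and annotation values are hypothetical.

def derive_sketches(sketch, rules, depth=2):
    """Recursively apply derivation rules to expand the sketch space."""
    if depth == 0:
        return [sketch]
    out = [sketch]  # a rule may also be skipped at this step
    for rule in rules:
        out += derive_sketches(sketch + [rule], rules, depth - 1)
    return out

def annotate(sketch, rng):
    """Sample a complete program: fill in tile size, unroll factor, etc."""
    return {"sketch": sketch,
            "tile_size": rng.choice([4, 8, 16, 32]),
            "unroll": rng.choice([0, 16, 64])}

sketches = derive_sketches([], ["multi_level_tile", "fuse_consumer"])
rng = random.Random(0)
programs = [annotate(s, rng) for s in random.Random(1).sample(sketches, 3)]
```

Random sampling keeps the initial population diverse; quality comes later from the Performance Tuner.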
- Performance Tuner:
- fine-tunes the performance of the programs sampled by the Program Sampler.
- It seeds the evolutionary search (ES) with re-sampled new programs and good programs from previous iterations
- ES fine-tunes the programs by mutation and crossover (Questionable to me!)
- Task Scheduler:
- allocates time and resources for optimizing multiple subgraphs in the DNNs.
- Uses a gradient-descent-based scheduling algorithm to allocate time to the most promising candidates (subgraphs).
- A task is picked based on the gradient-based algorithm maximizing the objective function in a given time budget.
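The scheduling idea can be sketched as picking, each round, the task whose estimated gradient promises the largest reduction in total DNN latency. The numbers and the `pick_task` helper below are made up for illustration; Ansor's actual objective functions are richer.

```python
# Toy sketch of gradient-based task scheduling: in each round, pick the
# subgraph task whose estimated gradient (change in end-to-end latency
# per extra tuning round) is most negative. Values are hypothetical.

def pick_task(latencies, grads):
    """latencies: current best latency per subgraph task.
    grads: estimated d(latency)/d(round) if one more round is spent.
    Returns the index of the task expected to help most."""
    return min(range(len(grads)), key=lambda i: grads[i])

# Three subgraphs; task 1 is predicted to improve fastest.
print(pick_task([5.0, 9.0, 3.0], [-0.1, -2.5, -0.3]))  # 1
```

Tasks with flat gradients (already well-tuned or unpromising subgraphs) thus stop consuming the time budget.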
- To generate a sketch for a DAG: topologically traverse all the nodes and construct the structure iteratively.
- The sketch includes basic tile and fusion structures for compute-intensive nodes with data-reuse opportunities (e.g., Conv2D).
- Element-wise nodes (e.g., ReLU) are inlined
- Multiple sketches are generated using a derivation-based recursive approach
- Derivation rules will differ as per the underlying hardware. Example: multi-level tiling:
- CPU: SSRSRS, S = one tile level of space loop; R = one tile level of reduction loop
- GPU: SSSRRSRS to match the architecture of the GPU
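A small worked example of the "SSRSRS" idea, applied to a tiny matmul: the space loops i, j are each split into tile levels (S) and the reduction loop k into its own levels (R), interleaved. The sizes here are tiny and hypothetical; in Ansor the actual tile sizes are filled in by annotation.

```python
# Toy illustration of multi-level tiling on C[i,j] += A[i,k] * B[k,j].
# Loop nest order below is S S R S R S; tile sizes are made up.

N, K, TI, TJ, TK = 4, 4, 2, 2, 2
A = [[1] * K for _ in range(N)]
B = [[1] * N for _ in range(K)]
C = [[0] * N for _ in range(N)]

for i0 in range(0, N, TI):              # S: outer tile of space loop i
    for j0 in range(0, N, TJ):          # S: outer tile of space loop j
        for k0 in range(0, K, TK):      # R: outer tile of reduction loop k
            for i in range(i0, i0 + TI):          # S: inner space loop i
                for k in range(k0, k0 + TK):      # R: inner reduction loop k
                    for j in range(j0, j0 + TJ):  # S: inner space loop j
                        C[i][j] += A[i][k] * B[k][j]

print(C[0][0])  # 4, i.e. K, since every input element is 1
```

Tiling the loops this way exposes the cache-reuse and vectorization opportunities that the later annotation step exploits.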
- Sketches are annotated to make them complete programs for fine-tuning and evaluation
- E.g., fill out tile size, unroll an inner loop, etc.
- Users can provide custom hints to adjust the annotation policy.
- Iterative process
- First, use ES to find high-performing programs using a cost model (probabilistic selection)
- Get actual measurements on hardware
- Retrain the cost model
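The three steps above form the tuning loop. A minimal sketch, assuming stand-in components throughout: the `true_latency` "hardware measurement", the candidate encoding (a single tile size), and the lookup-table "retraining" are all hypothetical simplifications of Ansor's real cost model.

```python
import random

# Toy sketch of the iterative tuning loop: select candidates with the
# cost model, measure the best ones on "hardware", retrain the model
# on the measurements. Every component here is a stand-in.

def true_latency(tile):                  # pretend hardware measurement
    return abs(tile - 16) + 1

def tune(rounds=3, rng=random.Random(0)):
    data = []                            # (program, measured latency)
    model = lambda t: 0.0                # untrained cost model: no signal
    for _ in range(rounds):
        candidates = [rng.choice([2, 4, 8, 16, 32, 64]) for _ in range(8)]
        best = sorted(candidates, key=model)[:2]      # model-guided pick
        data += [(t, true_latency(t)) for t in best]  # real measurements
        lut = dict(data)                 # "retrain": memorize measurements
        model = lambda t, lut=lut: lut.get(t, 0.0)
    return min(data, key=lambda x: x[1])

best_program, best_latency = tune()
```

Only a small fraction of sampled programs are ever measured on hardware; the cost model filters the rest cheaply.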
- ES is a meta-heuristic algorithm
- The following types of mutations are applied:
- tile size mutation
- parallel mutation
- pragma mutation
- computation location mutation
- node-based crossover
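The mutation and crossover operators can be illustrated with a toy ES over tile-size vectors. The encoding, the fake fitness (latency) function, and the population parameters below are invented for the example; they only mirror the shape of Ansor's search, not its operators' details.

```python
import random

# Toy evolutionary search over tile-size vectors, showing mutation and
# crossover. Fitness function and encoding are made up; parents are
# kept each generation (elitism), so the best never gets worse.

rng = random.Random(42)
CHOICES = [1, 2, 4, 8, 16, 32]

def fitness(prog):                       # lower is better (fake latency)
    return sum(abs(t - 8) for t in prog)

def mutate(prog):                        # tile-size mutation: one gene
    p = list(prog)
    p[rng.randrange(len(p))] = rng.choice(CHOICES)
    return p

def crossover(a, b):                     # crossover: mix two parents
    return [rng.choice(pair) for pair in zip(a, b)]

pop = [[rng.choice(CHOICES) for _ in range(4)] for _ in range(16)]
init_best = min(map(fitness, pop))

for _ in range(20):                      # evolve for a few generations
    pop.sort(key=fitness)
    parents = pop[:8]
    children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                for _ in range(8)]
    pop = parents + children

best = min(pop, key=fitness)
```

Crossover recombines good structures from two parents, while mutation keeps exploring tile sizes, loop annotations, and compute locations near them.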
[A] Hierarchical representation: generation of high-level structures plus low-level details, i.e., sketch and annotation