
Book #133

Merged: 4 commits merged on Sep 19, 2024

Changes from 2 commits
2 changes: 1 addition & 1 deletion README.md
@@ -42,7 +42,7 @@ fn gelu_array<F: Float>(input: &Array<F>, output: &mut Array<F>) {

#[cube]
fn gelu_scalar<F: Float>(x: F) -> F {
-    x * (F::erf(x / F::sqrt(2.0.into())) + 1.0) / 2.0
+    x * (F::erf(x / F::sqrt(2.0f32.into())) + F::new(1.0)) / F::new(2.0)
}
```
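
A note on the fix: plain float literals such as `1.0` don't coerce to the generic `F`, so the patch lifts them with `F::new` (and `.into()` for the `sqrt` argument). The corrected scalar GeLU, as it reads after this change:

```rust
#[cube]
fn gelu_scalar<F: Float>(x: F) -> F {
    // gelu(x) = x * (1 + erf(x / sqrt(2))) / 2
    x * (F::erf(x / F::sqrt(2.0f32.into())) + F::new(1.0)) / F::new(2.0)
}
```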

4 changes: 2 additions & 2 deletions crates/cubecl-cuda/src/compiler/warp.rs
@@ -105,8 +105,8 @@ fn reduce_operator(
"
{out} = {input};
{{
-    for (int offset = warpSizeChecked / 2; offset > 0; offset /= 2) {{
-        {out} {op} __shfl_down_sync(0xFFFFFFFF, {out}, offset);
+    for (int offset = 1; offset < warpSizeChecked; offset *= 2) {{
+        {out} {op} __shfl_xor_sync(-1, {out}, offset);
}}
}}
"
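For context: the old shift-down loop is a tree reduction that leaves the final value only in lane 0, while the new xor loop is a butterfly reduction that leaves the full result in every lane of the warp. A minimal standalone CUDA sketch of the new pattern (a hypothetical kernel, not the actual generated code):

```cuda
// Launch with one warp (32 threads): after log2(32) = 5 steps,
// every lane holds the warp-wide sum.
__global__ void warp_sum(const float* input, float* output) {
    float value = input[threadIdx.x];
    for (int offset = 1; offset < warpSize; offset *= 2) {
        // Exchange with the lane whose id differs in exactly one bit.
        value += __shfl_xor_sync(0xFFFFFFFF, value, offset);
    }
    output[threadIdx.x] = value; // identical across all 32 lanes
}
```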
18 changes: 18 additions & 0 deletions cubecl-book/.gitignore
@@ -0,0 +1,18 @@
target

# MacOS temp file
.DS_Store

book-test
guide/book

.vscode
tests/cubecl-book/book/
book/

# Ignore Jetbrains specific files.
.idea/

# Ignore Vim temporary and swap files.
*.sw?
*~
4 changes: 4 additions & 0 deletions cubecl-book/.prettierrc.json
@@ -0,0 +1,4 @@
{
"printWidth": 100,
"proseWrap": "always"
}
11 changes: 11 additions & 0 deletions cubecl-book/book.toml
@@ -0,0 +1,11 @@
[book]
authors = [
"Nathaniel Simard",
]
language = "en"
multilingual = false
src = "src"
title = "The CubeCL Book 🧊"

[output.html]
mathjax-support = true
12 changes: 12 additions & 0 deletions cubecl-book/src/SUMMARY.md
@@ -0,0 +1,12 @@
- [Overview](./overview.md)
- [Why CubeCL](./motivation.md)
- [Getting Started](./getting-started/summary.md)
- [Installation](./getting-started/installation.md)
- [Design](./getting-started/design.md)
- [Examples](./getting-started/examples.md)
- [Core Features](./core-features/summary.md)
- [Comptime](./core-features/comptime.md)
- [Vectorization](./core-features/vectorization.md)
- [Autotune](./core-features/autotune.md)
- [Language Support](./language-support/summary.md)
- [Trait Support](./language-support/trait.md)
11 changes: 11 additions & 0 deletions cubecl-book/src/core-features/autotune.md
@@ -0,0 +1,11 @@
# Autotune

Autotuning drastically simplifies kernel selection by running small benchmarks at runtime to find the best kernel, with the best configuration, for the current hardware; this is an essential feature for portability. It combines gracefully with comptime, letting you test the effect of different comptime values on performance; sometimes the results can be surprising!

Even though the benchmarks add some overhead the first time the application runs, the tuning results are cached on the device and reused afterwards. It is usually a no-brainer trade-off for throughput-oriented programs such as deep learning models. You can even ship the autotune cache with your program, reducing cold start time when you have more control over the deployment target.
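
CubeCL ships its own autotune API, which is not shown here; the snippet below is only an illustrative sketch of the underlying idea (the `Autotuner` type and its interface are hypothetical):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical sketch: benchmark each candidate once per problem key,
/// then cache and reuse the winner on every later call.
struct Autotuner {
    cache: HashMap<String, usize>, // problem key -> index of fastest candidate
}

impl Autotuner {
    fn select(&mut self, key: &str, candidates: &[fn()]) -> usize {
        if let Some(&best) = self.cache.get(key) {
            return best; // decision is reused, so the overhead is paid once
        }
        let mut best = 0;
        let mut best_time = Duration::MAX;
        for (i, candidate) in candidates.iter().enumerate() {
            let start = Instant::now();
            candidate(); // run a small benchmark of this candidate
            let elapsed = start.elapsed();
            if elapsed < best_time {
                best_time = elapsed;
                best = i;
            }
        }
        self.cache.insert(key.to_string(), best);
        best
    }
}
```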
57 changes: 57 additions & 0 deletions cubecl-book/src/core-features/comptime.md
@@ -0,0 +1,57 @@
# Comptime

CubeCL isn't just a new compute language: though it feels like you are writing GPU kernels, you are,
in fact, writing compiler plugins that you can fully customize! Comptime is a way to modify the
compiler IR at runtime when compiling a kernel for the first time.

This enables a lot of optimizations and flexibility without having to write many separate variants
of the same kernels to ensure maximal performance.

## Loop Unrolling

You can easily unroll loops in CubeCL using the `unroll` attribute on top of a for loop.

```rust
#[cube(launch)]
fn sum<F: Float>(input: &Array<F>, output: &mut Array<F>, #[comptime] end: Option<u32>) {
let unroll = end.is_some();
let end = end.unwrap_or_else(|| input.len());
let mut sum = F::new(0.0);

#[unroll(unroll)]
for i in 0..end {
sum += input[i];
}

output[ABSOLUTE_POS] = sum;
}
```

Note that if you provide an `end` value that can't be determined at compile time, the kernel
will panic when you try to execute it.

## Feature Specialization

You could also achieve the sum using subcube operations. We will write a kernel that uses those
instructions when available, based on a comptime feature flag. When they aren't available, it will
fall back on the previous implementation, essentially making it portable.

```rust
#[cube(launch)]
fn sum_subgroup<F: Float>(
input: &Array<F>,
output: &mut Array<F>,
#[comptime] subgroup: bool,
#[comptime] end: Option<u32>,
) {
if subgroup {
output[UNIT_POS] = subcube_sum(input[UNIT_POS]);
} else {
sum_basic(input, output, end);
}
}
```
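
For reference, here is a sketch of the `sum_basic` fallback used above; it is simply the unrolling example from the previous section written as a plain `#[cube]` function:

```rust
#[cube]
fn sum_basic<F: Float>(input: &Array<F>, output: &mut Array<F>, #[comptime] end: Option<u32>) {
    let unroll = end.is_some();
    let end = end.unwrap_or_else(|| input.len());
    let mut sum = F::new(0.0);

    #[unroll(unroll)]
    for i in 0..end {
        sum += input[i];
    }

    output[ABSOLUTE_POS] = sum;
}
```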

Note that no branching will actually occur on the GPU, since three different kernels can be
generated from the last code snippet. You can also use the
[trait system](../language-support/trait.md) to achieve a similar behavior.
4 changes: 4 additions & 0 deletions cubecl-book/src/core-features/summary.md
@@ -0,0 +1,4 @@
# Core Features

In this section, we'll explore the core features of CubeCL and what sets it apart from other
high-performance computing languages like CUDA, OpenCL, and HIP.
9 changes: 9 additions & 0 deletions cubecl-book/src/core-features/vectorization.md
@@ -0,0 +1,9 @@
# Vectorization

High-performance kernels should rely on SIMD instructions whenever possible, but doing so can
quickly get pretty complicated! With CubeCL, you can specify the vectorization factor of each input
variable when launching a kernel. Inside the kernel code, you still use only one type, which is
dynamically vectorized and supports automatic broadcasting. The runtimes compile the kernels with
all the information necessary to use the best instructions! However, since the algorithmic behavior
may depend on the vectorization factor, CubeCL allows you to access it directly in the kernel when
needed, without any performance loss, using the comptime system!
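
As a minimal sketch of what this looks like in practice (a hypothetical kernel; the vectorization factor itself is chosen at launch, not in the kernel code):

```rust
#[cube(launch)]
fn scale<F: Float>(input: &Array<F>, output: &mut Array<F>, scalar: F) {
    // Each index may hold 1, 2, or 4 lanes depending on the vectorization
    // factor chosen at launch; the multiplication broadcasts automatically.
    output[ABSOLUTE_POS] = input[ABSOLUTE_POS] * scalar;
}
```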
4 changes: 4 additions & 0 deletions cubecl-book/src/getting-started/cubecl.drawio.svg
(Binary SVG diagram: the CubeCL topology illustration used in design.md; not rendered in the diff view.)
57 changes: 57 additions & 0 deletions cubecl-book/src/getting-started/design.md
@@ -0,0 +1,57 @@
# Design

CubeCL is designed around - you guessed it - Cubes! More specifically, it's based on cuboids,
because not all axes are necessarily the same size. Since all compute APIs need to map to the
hardware, which consists of tiles that can be accessed using a 3D representation, our topology can
easily be mapped to concepts from other APIs.

<div align="center">

### CubeCL - Topology

<img src="./cubecl.drawio.svg" width="100%"/>
<br />
</div>
<br />

_A cube is composed of units, so a 3x3x3 cube has 27 units that can be accessed by their positions
along the x, y, and z axes. Similarly, a hyper-cube is composed of cubes, just as a cube is composed
of units. Each cube in the hyper-cube can be accessed by its position relative to the hyper-cube
along the x, y, and z axes. Hence, a hyper-cube of 3x3x3 will have 27 cubes. In this example, the
total number of working units would be 27 x 27 = 729._

### Topology Equivalence

Since all topology variables are constant within the kernel entry point, we chose to use the Rust
constant syntax with capital letters. When creating kernels, we often don't care about the relative
position of a unit within a cube along each axis; we only care about its position in general.
Therefore, each kind of variable also has its own axis-independent variant, which is often not
present in other languages, except WebGPU with `local_invocation_index`.

<br />

| CubeCL | CUDA | WebGPU |
| -------------- | ----------- | ---------------------- |
| CUBE_COUNT | N/A | N/A |
| CUBE_COUNT_X | gridDim.x | num_workgroups.x |
| CUBE_COUNT_Y | gridDim.y | num_workgroups.y |
| CUBE_COUNT_Z | gridDim.z | num_workgroups.z |
| CUBE_POS | N/A | N/A |
| CUBE_POS_X | blockIdx.x | workgroup.x |
| CUBE_POS_Y | blockIdx.y | workgroup.y |
| CUBE_POS_Z | blockIdx.z | workgroup.z |
| CUBE_DIM | N/A | N/A |
| CUBE_DIM_X | blockDim.x | workgroup_size.x |
| CUBE_DIM_Y | blockDim.y | workgroup_size.y |
| CUBE_DIM_Z | blockDim.z | workgroup_size.z |
| UNIT_POS | N/A | local_invocation_index |
| UNIT_POS_X | threadIdx.x | local_invocation_id.x |
| UNIT_POS_Y | threadIdx.y | local_invocation_id.y |
| UNIT_POS_Z | threadIdx.z | local_invocation_id.z |
| SUBCUBE_DIM | warpSize | subgroup_size |
| ABSOLUTE_POS | N/A | N/A |
| ABSOLUTE_POS_X | N/A | global_id.x |
| ABSOLUTE_POS_Y | N/A | global_id.y |
| ABSOLUTE_POS_Z | N/A | global_id.z |

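As a worked example of the table, a unit's global position along `x` can be reconstructed from the
axis variables, mirroring CUDA's `blockIdx.x * blockDim.x + threadIdx.x`. A minimal sketch
(hypothetical kernel; it assumes the output array has one element per unit):

```rust
#[cube(launch)]
fn index_demo<F: Float>(output: &mut Array<F>) {
    // Equivalent to ABSOLUTE_POS_X: the cube's position scaled by the cube's
    // size, plus the unit's position within its own cube.
    let global_x = CUBE_POS_X * CUBE_DIM_X + UNIT_POS_X;
    output[global_x] = F::new(1.0);
}
```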
11 changes: 11 additions & 0 deletions cubecl-book/src/getting-started/examples.md
@@ -0,0 +1,11 @@
# Examples

For now, we only have a limited number of examples, listed in the table below. Note that you can
also look at how the [linalg](https://github.com/tracel-ai/cubecl/tree/main/crates/cubecl-linalg)
crate is implemented.

| Example | Description |
| :------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------- |
| [GeLU](https://github.com/tracel-ai/cubecl/tree/main/examples/gelu) | Implement the GeLU activation function using CubeCL. |
| [Sum Things](https://github.com/tracel-ai/cubecl/tree/main/examples/sum_things) | Sum some numbers using many different variations, leveraging the CubeCL core features and trait support. |
| [Normalization](https://github.com/tracel-ai/cubecl/tree/main/examples/normalization) | Show how to use normalization on vectorized elements. |
19 changes: 19 additions & 0 deletions cubecl-book/src/getting-started/installation.md
@@ -0,0 +1,19 @@
# Installation

Installing CubeCL is straightforward. It’s available as a Rust crate, and you can add it to your
project by updating your `Cargo.toml`:

```toml
[dependencies]
cubecl = { version = "{version}", features = ["cuda", "wgpu"] }
```

The more challenging aspect is ensuring that you have the necessary drivers to run the selected
runtime.

For `wgpu` on Linux and Windows, Vulkan drivers are required. These drivers are usually included
with the default OS installation. However, on certain setups, such as Windows Subsystem for Linux
(WSL), you may need to install them manually if they are missing.

For `cuda`, simply install the CUDA drivers on your device. You can follow the installation
instructions provided on the [NVIDIA website](https://developer.nvidia.com/cuda-downloads).
6 changes: 6 additions & 0 deletions cubecl-book/src/getting-started/summary.md
@@ -0,0 +1,6 @@
# Getting Started

In this section, we’ll walk through the installation process for CubeCL, explore the design
principles behind its language extension, and provide some examples to help you start experimenting
with the tool. By the end, you'll have a solid understanding of how CubeCL integrates into your
workflow and how to leverage its features for high-performance computing.
4 changes: 4 additions & 0 deletions cubecl-book/src/language-support/summary.md
@@ -0,0 +1,4 @@
# Language Support

In this section, we will highlight key language features of CubeCL and demonstrate how to use them
in your kernels to enhance performance, portability, and maintainability.
90 changes: 90 additions & 0 deletions cubecl-book/src/language-support/trait.md
@@ -0,0 +1,90 @@
# Trait Support

CubeCL partially supports traits, letting you modularize your kernel code without any overhead. For
now, most features are supported except stateful functions.

```rust
#[cube]
trait MyTrait {
/// Supported
fn my_function(x: &Array<f32>) -> f32;
/// Unsupported
fn my_function_2(&self, x: &Array<f32>) -> f32;
}
```

The trait system allows you to do specialization quite easily. Let's take the same example as in the
[comptime section](../core-features/comptime.md).

First you can define your trait. Note that if you use your trait from the launch function, you will
need to add `'static + Send + Sync`.

```rust
#[cube]
trait SumKind: 'static + Send + Sync {
fn sum<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F;
}
```

Then we can define some implementations:

```rust
struct SumBasic;
struct SumSubcube;

#[cube]
impl SumKind for SumBasic {
fn sum<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F {
let unroll = end.is_some();
let end = end.unwrap_or_else(|| input.len());

let mut sum = F::new(0.0);

#[unroll(unroll)]
for i in 0..end {
sum += input[i];
}

sum
}
}

#[cube]
impl SumKind for SumSubcube {
fn sum<F: Float>(input: &Slice<F>, #[comptime] _end: Option<u32>) -> F {
subcube_sum(input[UNIT_POS])
}
}
```
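
To actually run one of these implementations, a launch kernel can be made generic over the strategy, so each concrete type compiles into its own specialized kernel. A hedged sketch (the `to_slice()` conversion from `Array` to `Slice` is an assumption that may differ across CubeCL versions):

```rust
#[cube(launch)]
fn sum_with_kind<F: Float, K: SumKind>(
    input: &Array<F>,
    output: &mut Array<F>,
    #[comptime] end: Option<u32>,
) {
    // Each concrete `K` produces its own specialized kernel at compile time,
    // so choosing a strategy costs nothing at runtime.
    output[UNIT_POS] = K::sum(&input.to_slice(), end);
}
```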

Associated types are also supported. Let's say you want to create a series from a sum.

```rust
#[cube]
trait CreateSeries: 'static + Send + Sync {
type SumKind: SumKind;

fn execute<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F;
}
```

You can then define what kind of series you want to create in an implementation.

```rust
struct SumThenMul<K: SumKind> {
_p: PhantomData<K>,
}

#[cube]
impl<K: SumKind> CreateSeries for SumThenMul<K> {
type SumKind = K;

fn execute<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F {
let val = Self::SumKind::sum(input, end);
val * input[UNIT_POS]
}
}
```

It's admittedly not the best showcase for associated types, but it demonstrates that they are fully
supported in CubeCL.
33 changes: 33 additions & 0 deletions cubecl-book/src/motivation.md
@@ -0,0 +1,33 @@
# Why CubeCL is Important

In the current landscape of high-performance computing, developers face several significant
challenges that CubeCL aims to address:

### Complexity in Performance Optimization

Optimizing compute kernels for different hardware is inherently complex. Developers must understand
the intricacies of each platform’s architecture and API, leading to a steep learning curve and the
risk of suboptimal performance. The need for manual tuning and platform-specific adjustments often
results in code that is difficult to maintain and extend.

The simplest way to solve this problem is to provide high-level abstractions that can be composed in
a variety of ways. All of those variations can then be autotuned to select the best settings for the
current hardware and problem at hand.

### Lack of Portability

Portability remains a significant issue. Code written for one API or even for a single GPU
architecture often cannot be easily transferred to another, hindering the ability to develop
software that can run across diverse hardware environments.

The GPU computing ecosystem is fragmented, with different hardware platforms like CUDA, ROCm, Metal,
and Vulkan requiring their own specialized codebases. This fragmentation forces developers to write
and maintain separate implementations for each platform, increasing both development time and
complexity.

## Conclusion

In essence, these challenges underscore the need for a more unified and developer-friendly approach
to high-performance computing. CubeCL seeks to bridge these gaps by addressing the core issues
within the current ecosystem, offering a new direction for high-performance and portable computing
solutions.
6 changes: 6 additions & 0 deletions cubecl-book/src/overview.md
@@ -0,0 +1,6 @@
# Overview

Welcome to The CubeCL Book 👋

This book will help you get started with your high-performance computing project using CubeCL,
making sure you get the most out of any hardware.