
Book #133

Merged: 4 commits merged on Sep 19, 2024

Changes from 2 commits
2 changes: 1 addition & 1 deletion README.md
@@ -42,7 +42,7 @@ fn gelu_array<F: Float>(input: &Array<F>, output: &mut Array<F>) {

#[cube]
fn gelu_scalar<F: Float>(x: F) -> F {
-    x * (F::erf(x / F::sqrt(2.0.into())) + 1.0) / 2.0
+    x * (F::erf(x / F::sqrt(2.0f32.into())) + F::new(1.0)) / F::new(2.0)
}
```
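
A note on the fix: plain float literals such as `1.0` don't coerce to the generic `F`, so the patch lifts them with `F::new` (and `.into()` for the `sqrt` argument). The corrected scalar GeLU, as it reads after this change:

```rust
#[cube]
fn gelu_scalar<F: Float>(x: F) -> F {
    // gelu(x) = x * (1 + erf(x / sqrt(2))) / 2
    x * (F::erf(x / F::sqrt(2.0f32.into())) + F::new(1.0)) / F::new(2.0)
}
```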

4 changes: 2 additions & 2 deletions crates/cubecl-cuda/src/compiler/warp.rs
@@ -105,8 +105,8 @@ fn reduce_operator(
"
{out} = {input};
{{
-    for (int offset = warpSizeChecked / 2; offset > 0; offset /= 2) {{
-        {out} {op} __shfl_down_sync(0xFFFFFFFF, {out}, offset);
+    for (int offset = 1; offset < warpSizeChecked; offset *= 2) {{
+        {out} {op} __shfl_xor_sync(-1, {out}, offset);
}}
}}
"
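For context: the old shift-down loop is a tree reduction that leaves the final value only in lane 0, while the new xor loop is a butterfly reduction that leaves the full result in every lane of the warp. A minimal standalone CUDA sketch of the new pattern (a hypothetical kernel, not the actual generated code):

```cuda
// Launch with one warp (32 threads): after log2(32) = 5 steps,
// every lane holds the warp-wide sum.
__global__ void warp_sum(const float* input, float* output) {
    float value = input[threadIdx.x];
    for (int offset = 1; offset < warpSize; offset *= 2) {
        // Exchange with the lane whose id differs in exactly one bit.
        value += __shfl_xor_sync(0xFFFFFFFF, value, offset);
    }
    output[threadIdx.x] = value; // identical across all 32 lanes
}
```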
18 changes: 18 additions & 0 deletions cubecl-book/.gitignore
@@ -0,0 +1,18 @@
target

# MacOS temp file
.DS_Store

book-test
guide/book

.vscode
tests/cubecl-book/book/
book/

# Ignore Jetbrains specific files.
.idea/

# Ignore Vim temporary and swap files.
*.sw?
*~
4 changes: 4 additions & 0 deletions cubecl-book/.prettierrc.json
@@ -0,0 +1,4 @@
{
"printWidth": 100,
"proseWrap": "always"
}
11 changes: 11 additions & 0 deletions cubecl-book/book.toml
@@ -0,0 +1,11 @@
[book]
authors = [
"Nathaniel Simard",
]
language = "en"
multilingual = false
src = "src"
title = "The CubeCL Book 🧊"

[output.html]
mathjax-support = true
12 changes: 12 additions & 0 deletions cubecl-book/src/SUMMARY.md
@@ -0,0 +1,12 @@
- [Overview](./overview.md)
- [Why CubeCL](./motivation.md)
- [Getting Started](./getting-started/summary.md)
- [Installation](./getting-started/installation.md)
- [Design](./getting-started/design.md)
- [Examples](./getting-started/examples.md)
- [Core Features](./core-features/summary.md)
- [Comptime](./core-features/comptime.md)
- [Vectorization](./core-features/vectorization.md)
- [Autotune](./core-features/autotune.md)
- [Language Support](./language-support/summary.md)
- [Trait Support](./language-support/trait.md)
11 changes: 11 additions & 0 deletions cubecl-book/src/core-features/autotune.md
@@ -0,0 +1,11 @@
# Autotune

Autotuning drastically simplifies kernel selection by running small benchmarks at runtime to find the best kernel, with the best configuration, for the current hardware; this is an essential feature for portability. It combines gracefully with comptime, letting you test the effect of different comptime values on performance; sometimes the results can be surprising!

Even though the benchmarks add some overhead the first time the application runs, the tuning results are cached on the device and reused afterwards. It is usually a no-brainer trade-off for throughput-oriented programs such as deep learning models. You can even ship the autotune cache with your program, reducing cold start time when you have more control over the deployment target.
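
CubeCL ships its own autotune API, which is not shown here; the snippet below is only an illustrative sketch of the underlying idea (the `Autotuner` type and its interface are hypothetical):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical sketch: benchmark each candidate once per problem key,
/// then cache and reuse the winner on every later call.
struct Autotuner {
    cache: HashMap<String, usize>, // problem key -> index of fastest candidate
}

impl Autotuner {
    fn select(&mut self, key: &str, candidates: &[fn()]) -> usize {
        if let Some(&best) = self.cache.get(key) {
            return best; // decision is reused, so the overhead is paid once
        }
        let mut best = 0;
        let mut best_time = Duration::MAX;
        for (i, candidate) in candidates.iter().enumerate() {
            let start = Instant::now();
            candidate(); // run a small benchmark of this candidate
            let elapsed = start.elapsed();
            if elapsed < best_time {
                best_time = elapsed;
                best = i;
            }
        }
        self.cache.insert(key.to_string(), best);
        best
    }
}
```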
57 changes: 57 additions & 0 deletions cubecl-book/src/core-features/comptime.md
@@ -0,0 +1,57 @@
# Comptime

CubeCL isn't just a new compute language: though it feels like you are writing GPU kernels, you are,
in fact, writing compiler plugins that you can fully customize! Comptime is a way to modify the
compiler IR at runtime when compiling a kernel for the first time.

This enables a lot of optimizations and flexibility without having to write many separate variants
of the same kernels to ensure maximal performance.

## Loop Unrolling

You can easily unroll loops in CubeCL using the `unroll` attribute on top of a for loop.

```rust
#[cube(launch)]
fn sum<F: Float>(input: &Array<F>, output: &mut Array<F>, #[comptime] end: Option<u32>) {
let unroll = end.is_some();
let end = end.unwrap_or_else(|| input.len());
let mut sum = F::new(0.0);

#[unroll(unroll)]
for i in 0..end {
sum += input[i];
}

output[ABSOLUTE_POS] = sum;
}
```

Note that if you provide an `end` value that can't be determined at compile time, the kernel
will panic when you try to execute it.

## Feature Specialization

You could also achieve the sum using subcube operations. We will write a kernel that uses those
instructions when available, based on a comptime feature flag. When they aren't available, it will
fall back on the previous implementation, essentially making it portable.

```rust
#[cube(launch)]
fn sum_subgroup<F: Float>(
input: &Array<F>,
output: &mut Array<F>,
#[comptime] subgroup: bool,
#[comptime] end: Option<u32>,
) {
if subgroup {
output[UNIT_POS] = subcube_sum(input[UNIT_POS]);
} else {
sum_basic(input, output, end);
}
}
```
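
For reference, here is a sketch of the `sum_basic` fallback used above; it is simply the unrolling example from the previous section written as a plain `#[cube]` function:

```rust
#[cube]
fn sum_basic<F: Float>(input: &Array<F>, output: &mut Array<F>, #[comptime] end: Option<u32>) {
    let unroll = end.is_some();
    let end = end.unwrap_or_else(|| input.len());
    let mut sum = F::new(0.0);

    #[unroll(unroll)]
    for i in 0..end {
        sum += input[i];
    }

    output[ABSOLUTE_POS] = sum;
}
```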

Note that no branching will actually occur on the GPU, since three different kernels can be
generated from the last code snippet. You can also use the
[trait system](../language-support/trait.md) to achieve a similar behavior.
4 changes: 4 additions & 0 deletions cubecl-book/src/core-features/summary.md
@@ -0,0 +1,4 @@
# Core Features

In this section, we'll explore the core features of CubeCL and what sets it apart from other
high-performance computing languages like CUDA, OpenCL, and HIP.
9 changes: 9 additions & 0 deletions cubecl-book/src/core-features/vectorization.md
@@ -0,0 +1,9 @@
# Vectorization

High-performance kernels should rely on SIMD instructions whenever possible, but doing so can
quickly get pretty complicated! With CubeCL, you can specify the vectorization factor of each input
variable when launching a kernel. Inside the kernel code, you still use only one type, which is
dynamically vectorized and supports automatic broadcasting. The runtimes compile the kernels with
all the information necessary to use the best instructions! However, since the algorithmic behavior
may depend on the vectorization factor, CubeCL allows you to access it directly in the kernel when
needed, without any performance loss, using the comptime system!
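
As a minimal sketch of what this looks like in practice (a hypothetical kernel; the vectorization factor itself is chosen at launch, not in the kernel code):

```rust
#[cube(launch)]
fn scale<F: Float>(input: &Array<F>, output: &mut Array<F>, scalar: F) {
    // Each index may hold 1, 2, or 4 lanes depending on the vectorization
    // factor chosen at launch; the multiplication broadcasts automatically.
    output[ABSOLUTE_POS] = input[ABSOLUTE_POS] * scalar;
}
```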
4 changes: 4 additions & 0 deletions cubecl-book/src/getting-started/cubecl.drawio.svg
(Binary SVG diagram: the CubeCL topology illustration used in design.md; not rendered in the diff view.)
57 changes: 57 additions & 0 deletions cubecl-book/src/getting-started/design.md
@@ -0,0 +1,57 @@
# Design

CubeCL is designed around - you guessed it - Cubes! More specifically, it's based on cuboids,
because not all axes are necessarily the same size. Since all compute APIs need to map to the
hardware, which consists of tiles that can be accessed using a 3D representation, our topology can
easily be mapped to concepts from other APIs.

<div align="center">

### CubeCL - Topology

<img src="./cubecl.drawio.svg" width="100%"/>
<br />
</div>
<br />

_A cube is composed of units, so a 3x3x3 cube has 27 units that can be accessed by their positions
along the x, y, and z axes. Similarly, a hyper-cube is composed of cubes, just as a cube is composed
of units. Each cube in the hyper-cube can be accessed by its position relative to the hyper-cube
along the x, y, and z axes. Hence, a hyper-cube of 3x3x3 will have 27 cubes. In this example, the
total number of working units would be 27 x 27 = 729._

### Topology Equivalence

Since all topology variables are constant within the kernel entry point, we chose to use the Rust
constant syntax with capital letters. When creating kernels, we often don't care about the relative
position of a unit within a cube along each axis; we only care about its position in general.
Therefore, each kind of variable also has its own axis-independent variant, which is often not
present in other languages, except WebGPU with `local_invocation_index`.

<br />

| CubeCL | CUDA | WebGPU |
| -------------- | ----------- | ---------------------- |
| CUBE_COUNT | N/A | N/A |
| CUBE_COUNT_X | gridDim.x | num_workgroups.x |
| CUBE_COUNT_Y | gridDim.y | num_workgroups.y |
| CUBE_COUNT_Z | gridDim.z | num_workgroups.z |
| CUBE_POS | N/A | N/A |
| CUBE_POS_X | blockIdx.x | workgroup.x |
| CUBE_POS_Y | blockIdx.y | workgroup.y |
| CUBE_POS_Z | blockIdx.z | workgroup.z |
| CUBE_DIM | N/A | N/A |
| CUBE_DIM_X | blockDim.x | workgroup_size.x |
| CUBE_DIM_Y | blockDim.y | workgroup_size.y |
| CUBE_DIM_Z | blockDim.z | workgroup_size.z |
| UNIT_POS | N/A | local_invocation_index |
| UNIT_POS_X | threadIdx.x | local_invocation_id.x |
| UNIT_POS_Y | threadIdx.y | local_invocation_id.y |
| UNIT_POS_Z | threadIdx.z | local_invocation_id.z |
| SUBCUBE_DIM | warpSize | subgroup_size |
| ABSOLUTE_POS | N/A | N/A |
| ABSOLUTE_POS_X | N/A | global_id.x |
| ABSOLUTE_POS_Y | N/A | global_id.y |
| ABSOLUTE_POS_Z | N/A | global_id.z |

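As a worked example of the table, a unit's global position along `x` can be reconstructed from the
axis variables, mirroring CUDA's `blockIdx.x * blockDim.x + threadIdx.x`. A minimal sketch
(hypothetical kernel; it assumes the output array has one element per unit):

```rust
#[cube(launch)]
fn index_demo<F: Float>(output: &mut Array<F>) {
    // Equivalent to ABSOLUTE_POS_X: the cube's position scaled by the cube's
    // size, plus the unit's position within its own cube.
    let global_x = CUBE_POS_X * CUBE_DIM_X + UNIT_POS_X;
    output[global_x] = F::new(1.0);
}
```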
11 changes: 11 additions & 0 deletions cubecl-book/src/getting-started/examples.md
@@ -0,0 +1,11 @@
# Examples

For now, we only have a limited number of examples, listed in the table below. Note that you can
also look at how the [linalg](https://github.com/tracel-ai/cubecl/tree/main/crates/cubecl-linalg)
crate is implemented.

| Example | Description |
| :------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------- |
| [GeLU](https://github.com/tracel-ai/cubecl/tree/main/examples/gelu) | Implement the GeLU activation function using CubeCL. |
| [Sum Things](https://github.com/tracel-ai/cubecl/tree/main/examples/sum_things) | Sum some numbers using many different variations, leveraging the CubeCL core features and trait support. |
| [Normalization](https://github.com/tracel-ai/cubecl/tree/main/examples/normalization) | Show how to use normalization on vectorized elements. |
19 changes: 19 additions & 0 deletions cubecl-book/src/getting-started/installation.md
@@ -0,0 +1,19 @@
# Installation

Installing CubeCL is straightforward. It’s available as a Rust crate, and you can add it to your
project by updating your `Cargo.toml`:

```toml
[dependencies]
cubecl = { version = "{version}", features = ["cuda", "wgpu"] }
```

The more challenging aspect is ensuring that you have the necessary drivers to run the selected
runtime.

For `wgpu` on Linux and Windows, Vulkan drivers are required. These drivers are usually included
with the default OS installation. However, on certain setups, such as Windows Subsystem for Linux
(WSL), you may need to install them manually if they are missing.

For `cuda`, simply install the CUDA drivers on your device. You can follow the installation
instructions provided on the [NVIDIA website](https://developer.nvidia.com/cuda-downloads).
6 changes: 6 additions & 0 deletions cubecl-book/src/getting-started/summary.md
@@ -0,0 +1,6 @@
# Getting Started

In this section, we’ll walk through the installation process for CubeCL, explore the design
principles behind its language extension, and provide some examples to help you start experimenting
with the tool. By the end, you'll have a solid understanding of how CubeCL integrates into your
workflow and how to leverage its features for high-performance computing.
4 changes: 4 additions & 0 deletions cubecl-book/src/language-support/summary.md
@@ -0,0 +1,4 @@
# Language Support

In this section, we will highlight key language features of CubeCL and demonstrate how to use them
in your kernels to enhance performance, portability, and maintainability.
90 changes: 90 additions & 0 deletions cubecl-book/src/language-support/trait.md
@@ -0,0 +1,90 @@
# Trait Support

CubeCL partially supports traits, letting you modularize your kernel code without any overhead. For
now, most features are supported except stateful functions.

```rust
#[cube]
trait MyTrait {
/// Supported
fn my_function(x: &Array<f32>) -> f32;
/// Unsupported
fn my_function_2(&self, x: &Array<f32>) -> f32;
}
```

The trait system allows you to do specialization quite easily. Let's take the same example as in the
[comptime section](../core-features/comptime.md).

First you can define your trait. Note that if you use your trait from the launch function, you will
need to add `'static + Send + Sync`.

```rust
#[cube]
trait SumKind: 'static + Send + Sync {
fn sum<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F;
}
```

Then we can define some implementations:

```rust
struct SumBasic;
struct SumSubcube;

#[cube]
impl SumKind for SumBasic {
fn sum<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F {
let unroll = end.is_some();
let end = end.unwrap_or_else(|| input.len());

let mut sum = F::new(0.0);

#[unroll(unroll)]
for i in 0..end {
sum += input[i];
}

sum
}
}

#[cube]
impl SumKind for SumSubcube {
fn sum<F: Float>(input: &Slice<F>, #[comptime] _end: Option<u32>) -> F {
subcube_sum(input[UNIT_POS])
}
}
```
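
To actually run one of these implementations, a launch kernel can be made generic over the strategy, so each concrete type compiles into its own specialized kernel. A hedged sketch (the `to_slice()` conversion from `Array` to `Slice` is an assumption that may differ across CubeCL versions):

```rust
#[cube(launch)]
fn sum_with_kind<F: Float, K: SumKind>(
    input: &Array<F>,
    output: &mut Array<F>,
    #[comptime] end: Option<u32>,
) {
    // Each concrete `K` produces its own specialized kernel at compile time,
    // so choosing a strategy costs nothing at runtime.
    output[UNIT_POS] = K::sum(&input.to_slice(), end);
}
```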

Associated types are also supported. Let's say you want to create a series from a sum.

```rust
#[cube]
trait CreateSeries: 'static + Send + Sync {
type SumKind: SumKind;

fn execute<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F;
}
```

You can then define what kind of series you want to create in an implementation.

```rust
struct SumThenMul<K: SumKind> {
_p: PhantomData<K>,
}

#[cube]
impl<K: SumKind> CreateSeries for SumThenMul<K> {
type SumKind = K;

fn execute<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F {
let val = Self::SumKind::sum(input, end);
val * input[UNIT_POS]
}
}
```

It's admittedly not the best showcase for associated types, but it demonstrates that they are fully
supported in CubeCL.
33 changes: 33 additions & 0 deletions cubecl-book/src/motivation.md
@@ -0,0 +1,33 @@
# Why CubeCL is Important

In the current landscape of high-performance computing, developers face several significant
challenges that CubeCL aims to address:

### Complexity in Performance Optimization

Optimizing compute kernels for different hardware is inherently complex. Developers must understand
the intricacies of each platform’s architecture and API, leading to a steep learning curve and the
risk of suboptimal performance. The need for manual tuning and platform-specific adjustments often
results in code that is difficult to maintain and extend.

The simplest way to solve this problem is to provide high-level abstractions that can be composed in
a variety of ways. All of those variations can then be autotuned to select the best settings for the
current hardware and problem at hand.

### Lack of Portability

Portability remains a significant issue. Code written for one API or even for a single GPU
architecture often cannot be easily transferred to another, hindering the ability to develop
software that can run across diverse hardware environments.

The GPU computing ecosystem is fragmented, with different hardware platforms like CUDA, ROCm, Metal,
and Vulkan requiring their own specialized codebases. This fragmentation forces developers to write
and maintain separate implementations for each platform, increasing both development time and
complexity.

## Conclusion

In essence, these challenges underscore the need for a more unified and developer-friendly approach
to high-performance computing. CubeCL seeks to bridge these gaps by addressing the core issues
within the current ecosystem, offering a new direction for high-performance and portable computing
solutions.
6 changes: 6 additions & 0 deletions cubecl-book/src/overview.md
@@ -0,0 +1,6 @@
# Overview

Welcome to The CubeCL Book 👋

This book will help you get started with your high-performance computing project using CubeCL,
making sure you get the most out of any hardware.