diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index a2dc1e6220..557d979820 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -97,6 +97,18 @@ To run the tests in the provided docker containers: * `pip install -e .` * `pytest ` or `make ` to run the desired tests +### Checking documentation + +If your changes affect the documentation, please build the docs locally and view them to verify that the changes +are what you intended. + +```bash +cd docs +pip install -e '.[docs]' +make clean && make html +make host # open the output link in a browser. +``` + ## Code Style & Typing diff --git a/composer/algorithms/alibi/README.md b/composer/algorithms/alibi/README.md index 85190f2aa9..b26d15d692 100644 --- a/composer/algorithms/alibi/README.md +++ b/composer/algorithms/alibi/README.md @@ -6,7 +6,7 @@ ALiBi (Attention with Linear Biases) dispenses with position embeddings for tokens in transformer-based NLP models, instead encoding position information by biasing the query-key attention scores proportionally to each token pair’s distance. ALiBi yields excellent extrapolation to unseen sequence lengths compared to other position embedding schemes. We leverage this extrapolation capability by training with shorter sequence lengths, which reduces the memory and computation load. -| ![Alibi](https://storage.googleapis.com/docs.mosaicml.com/images/methods/alibi.png) | +| ![Alibi](../_images/alibi.png) | |:--: |*The matrix on the left depicts the attention score for each key-query token pair. The matrix on the right depicts the distance between each query-key token pair. m is a head-specific scalar that is fixed during training. 
Figure from [Press et al., 2021](https://openreview.net/forum?id=R8sQPpGCv0).*| diff --git a/composer/algorithms/blurpool/README.md b/composer/algorithms/blurpool/README.md index afdc7d7525..6e26c9e536 100644 --- a/composer/algorithms/blurpool/README.md +++ b/composer/algorithms/blurpool/README.md @@ -10,7 +10,7 @@ BlurPool increases the accuracy of convolutional neural networks for computer vi nearly the same speed, by applying a spatial low-pass filter before pooling operations and strided convolutions. Doing so reduces [aliasing](https://en.wikipedia.org/wiki/Aliasing) when performing these operations. -| ![BlurPool](https://storage.googleapis.com/docs.mosaicml.com/images/methods/blurpool-antialiasing.png) | +| ![BlurPool](../_images/blurpool-antialiasing.png) | |:--: |*A diagram of the BlurPool replacements (bottom row) for typical pooling and downsampling operations (top row) in convolutional neural networks. In each case, BlurPool applies a low-pass filter before the spatial downsampling to avoid aliasing. This image is Figure 2 in [Zhang (2019)](https://proceedings.mlr.press/v97/zhang19a.html).*| diff --git a/composer/algorithms/channels_last/README.md b/composer/algorithms/channels_last/README.md index 482a76ec68..03ee0c24ea 100644 --- a/composer/algorithms/channels_last/README.md +++ b/composer/algorithms/channels_last/README.md @@ -8,7 +8,7 @@ Channels Last improves the throughput of convolution operations in networks for NVIDIA GPUs natively perform convolution operations in NHWC format, so storing the tensors this way eliminates transpositions that would otherwise need to take place, increasing throughput. This is a systems-level method that does not change the math or outcome of training in any way. 
-| ![ChannelsLast](https://storage.googleapis.com/docs.mosaicml.com/images/methods/channels_last.png) | +| ![ChannelsLast](../_images/channels_last.png) | |:--: |*A diagram of a convolutional layer using the standard NCHW tensor memory layout (left) and the NHWC tensor memory layout (right). Fewer operations take place in NHWC format because the convolution operation is natively performed in NHWC format (right); in contrast, the NCHW tensor must be transposed to NHWC before the convolution and transposed back to NCHW after (left). This diagram is from [NVIDIA](https://developer.nvidia.com/blog/tensor-core-ai-performance-milestones/).*| diff --git a/composer/algorithms/colout/README.md b/composer/algorithms/colout/README.md index a4d17c1479..6100627df5 100644 --- a/composer/algorithms/colout/README.md +++ b/composer/algorithms/colout/README.md @@ -8,7 +8,7 @@ ColOut is a data augmentation technique that drops a fraction of the rows or col If the fraction of rows/columns dropped isn't too large, the image content is not significantly altered but the image size is reduced, speeding up training. This modification modestly reduces accuracy, but it is a worthwhile tradeoff for the increased speed. -| ![ColOut](https://storage.googleapis.com/docs.mosaicml.com/images/methods/col_out.png) | +| ![ColOut](../_images/col_out.png) | |:--: |*Several instances of an image of an apple from the CIFAR-100 dataset with ColOut applied. ColOut randomly removes different rows and columns each time it is applied.*| diff --git a/composer/algorithms/cutmix/README.md b/composer/algorithms/cutmix/README.md index 1943d2fbb8..3880bedf7e 100644 --- a/composer/algorithms/cutmix/README.md +++ b/composer/algorithms/cutmix/README.md @@ -8,7 +8,7 @@ CutMix is a data augmentation technique that modifies images by cutting out a sm It is a regularization technique that can improve the generalization accuracy of computer vision models. 
-| ![CutMix](https://storage.googleapis.com/docs.mosaicml.com/images/methods/cutmix.png) | +| ![CutMix](../_images/cutmix.png) | |:--: |*An image with CutMix applied. A picture of a cat has been placed over the top left corner of a picture of a dog. This image is taken from [Figure 1 from Yun et al. (2019)](https://arxiv.org/abs/1905.04899).*| diff --git a/composer/algorithms/cutout/README.md b/composer/algorithms/cutout/README.md index 3ef6c26fef..87fc4aa83a 100644 --- a/composer/algorithms/cutout/README.md +++ b/composer/algorithms/cutout/README.md @@ -7,7 +7,7 @@ Cutout is a data augmentation technique that masks one or more square regions of an input image, replacing them with gray boxes. It is a regularization technique that improves the accuracy of models for computer vision. -| ![CutOut](https://storage.googleapis.com/docs.mosaicml.com/images/methods/cutout.png) | +| ![CutOut](../_images/cutout.png) | |:--: |*Several images from the CIFAR-10 dataset with Cutout applied. Cutout adds a gray box that occludes a portion of each image. This is [Figure 1 from DeVries & Taylor (2017)](https://arxiv.org/abs/1708.04552).*| diff --git a/composer/algorithms/factorize/README.md b/composer/algorithms/factorize/README.md index 116b3d1cf6..5bca397413 100644 --- a/composer/algorithms/factorize/README.md +++ b/composer/algorithms/factorize/README.md @@ -8,7 +8,7 @@ Factorize splits a large linear or convolutional layer into two smaller ones that compute a similar function. This can be applied to models for both computer vision and natural language processing. -| ![Factorize](https://storage.googleapis.com/docs.mosaicml.com/images/methods/factorize-no-caption.png) | +| ![Factorize](../_images/factorize-no-caption.png) | |:--: |*Figure 1 of [Zhang et al. (2015)](https://ieeexplore.ieee.org/abstract/document/7332968). 
(a) The weights `W` of a 2D convolutional layer with `k x k` filters, `c` input channels, and `d` output channels are factorized into two smaller convolutions (b) with weights `W'` and `P` with `d'` intermediate channels. The first convolution uses the original filter size but produces only `d'` channels. The second convolution has `1 x 1` filters and produces the original `d` output channels but has only `d'` input channels. This changes the complexity per spatial position from $O(k^2cd)$ to $O(k^2cd') + O(d'd)$.*| diff --git a/composer/algorithms/ghost_batchnorm/README.md b/composer/algorithms/ghost_batchnorm/README.md index 7e7e58fcb1..bdf0dc8338 100644 --- a/composer/algorithms/ghost_batchnorm/README.md +++ b/composer/algorithms/ghost_batchnorm/README.md @@ -8,7 +8,7 @@ During training, BatchNorm normalizes each batch of inputs to have a mean of 0 a Ghost BatchNorm instead splits the batch into multiple "ghost" batches, each containing `ghost_batch_size` samples, and normalizes each one to have a mean of 0 and variance of 1. This causes training with a large batch size to behave similarly to training with a small batch size. -| ![Ghost BatchNorm](https://storage.googleapis.com/docs.mosaicml.com/images/methods/ghost-batch-normalization.png) | +| ![Ghost BatchNorm](../_images/ghost-batch-normalization.png) | |:--: |*A visualization of different normalization methods on an activation tensor in a neural network with multiple channels. M represents the batch dimension, C represents the channel dimension, and F represents the spatial dimensions (such as height and width). Ghost BatchNorm (upper right) is a modified version of BatchNorm that normalizes the mean and variance for disjoint sub-batches of the full batch. 
This image is Figure 1 in [Dimitriou & Arandjelovic, 2020](https://arxiv.org/abs/2007.08554).*| diff --git a/composer/algorithms/layer_freezing/README.md b/composer/algorithms/layer_freezing/README.md index c97427daaa..6034b45877 100644 --- a/composer/algorithms/layer_freezing/README.md +++ b/composer/algorithms/layer_freezing/README.md @@ -8,7 +8,7 @@ Layer Freezing gradually makes early modules untrainable ("freezing" them), saving the cost of backpropagating to and updating frozen modules. The hypothesis behind Layer Freezing is that early layers may learn their features sooner than later layers, meaning they do not need to be updated later in training. - diff --git a/composer/algorithms/mixup/README.md b/composer/algorithms/mixup/README.md index 1bfa268858..fb8f9d3064 100644 --- a/composer/algorithms/mixup/README.md +++ b/composer/algorithms/mixup/README.md @@ -10,7 +10,7 @@ For any pair of examples, it trains the network on a random convex combination o To create the corresponding targets, it uses the same random convex combination of the targets of the individual examples. Training in this fashion improves generalization. -| ![MixUp](https://storage.googleapis.com/docs.mosaicml.com/images/methods/mix_up.png) | +| ![MixUp](../_images/mix_up.png) | |:--: |*Two different training examples (a picture of a bird and a picture of a frog) that have been combined by MixUp into a single example. The corresponding targets are a convex combination of the targets for the bird class and the frog class.*| diff --git a/composer/algorithms/progressive_resizing/README.md b/composer/algorithms/progressive_resizing/README.md index 72e3ed8665..25801d5a03 100644 --- a/composer/algorithms/progressive_resizing/README.md +++ b/composer/algorithms/progressive_resizing/README.md @@ -7,7 +7,7 @@ Progressive Resizing works by initially training on images that have been downsampled to a smaller size. 
It slowly grows the images back to their full size by a set point in training and uses full-size images for the remainder of training. Progressive resizing reduces costs during the early phase of training when the network may learn coarse-grained features that do not require details lost by reducing image resolution. -| ![ProgressiveResizing](https://storage.googleapis.com/docs.mosaicml.com/images/methods/progressive_resizing_vision.png) | +| ![ProgressiveResizing](../_images/progressive_resizing_vision.png) | |:--| |*An example image as it would appear to the network at different stages of training with progressive resizing. At the beginning of training, each training example is at its smallest size. Throughout the pre-training phase, example size increases linearly. At the end of the pre-training phase, example size has reached its full value and remains at that value for the remainder of training (the fine-tuning phase).*| diff --git a/composer/algorithms/randaugment/README.md b/composer/algorithms/randaugment/README.md index 83b9570bb1..055f280792 100644 --- a/composer/algorithms/randaugment/README.md +++ b/composer/algorithms/randaugment/README.md @@ -8,7 +8,7 @@ For each data sample, RandAugment randomly samples `depth` image augmentations f Each augmentation is applied with a context-specific `severity` sampled uniformly from 0 to 10. Training in this fashion regularizes the network and can improve generalization performance. -| ![RandAugment](https://storage.googleapis.com/docs.mosaicml.com/images/methods/rand_augment.jpg) | +| ![RandAugment](../_images/rand_augment.jpg) | |:--:| |*An image of a dog that undergoes three different augmentation chains. 
Each of these chains is a possible augmentation that might be applied by RandAugment and gets combined with the original image.*| diff --git a/composer/algorithms/selective_backprop/README.md b/composer/algorithms/selective_backprop/README.md index e54d66c8bb..28830f90bf 100644 --- a/composer/algorithms/selective_backprop/README.md +++ b/composer/algorithms/selective_backprop/README.md @@ -7,7 +7,7 @@ Selective Backprop prioritizes examples with high loss at each iteration, skipping backpropagation on examples with low loss. This speeds up training with limited impact on generalization. -| ![SelectiveBackprop](https://storage.googleapis.com/docs.mosaicml.com/images/methods/selective-backprop.png) | +| ![SelectiveBackprop](../_images/selective-backprop.png) | |:--| |*Four examples are forward propagated through the network. Selective backprop only backpropagates the two examples that have the highest loss.*| diff --git a/composer/algorithms/seq_length_warmup/README.md b/composer/algorithms/seq_length_warmup/README.md index 84aee2c697..f964967e55 100644 --- a/composer/algorithms/seq_length_warmup/README.md +++ b/composer/algorithms/seq_length_warmup/README.md @@ -7,7 +7,7 @@ Sequence Length Warmup linearly increases the sequence length (number of tokens per sentence) used to train a language model from a `min_seq_length` to a `max_seq_length` over some duration at the beginning of training. The underlying motivation is that sequence length is a proxy for the difficulty of an example, and this method assumes a simple curriculum where the model is trained on easy examples (by this definition) first. Sequence Length Warmup is able to reduce the training time of GPT-style models by ~1.5x while still achieving the same loss as baselines. 
-| ![SequenceLengthWarmup](https://storage.googleapis.com/docs.mosaicml.com/images/methods/seq_len_warmup.svg)| +| ![SequenceLengthWarmup](../_images/seq_len_warmup.svg)| |:--| |*The sequence length used to train a model over the course of training. It increases linearly over the first 30% of training before reaching its full value for the remainder of training.*| diff --git a/composer/algorithms/squeeze_excite/README.md b/composer/algorithms/squeeze_excite/README.md index 36887d6a7a..cff398c58c 100644 --- a/composer/algorithms/squeeze_excite/README.md +++ b/composer/algorithms/squeeze_excite/README.md @@ -6,7 +6,7 @@ Adds a channel-wise attention operator in CNNs. Attention coefficients are produced by a small, trainable MLP that uses the channels' globally pooled activations as input. It requires more work on each forward pass, slowing down training and inference, but leads to higher quality models. -| ![Squeeze-Excite](https://storage.googleapis.com/docs.mosaicml.com/images/methods/squeeze-and-excitation.png) | +| ![Squeeze-Excite](../_images/squeeze-and-excitation.png) | |:--| | *After an activation tensor **X** is passed through Conv2d **F**tr to yield a new tensor **U**, a Squeeze-and-Excitation (SE) module scales the channels in a data-dependent manner. The scales are produced by a single-hidden-layer, fully-connected network whose input is the global-averaged-pooled **U**. This can be seen as a channel-wise attention mechanism.* | diff --git a/composer/algorithms/weight_standardization/README.md b/composer/algorithms/weight_standardization/README.md index c14a177446..05d18b1c0a 100644 --- a/composer/algorithms/weight_standardization/README.md +++ b/composer/algorithms/weight_standardization/README.md @@ -6,7 +6,7 @@ Weight Standardization is a reparametrization of convolutional weights such that the input channel and kernel dimensions have zero mean and unit variance. 
The authors suggested using this method when the per-device batch size is too small to work well with batch normalization models. Additionally, the authors suggest this method enables using other normalization layers instead of batch normalization while maintaining similar performance. We have been unable to verify either of these claims on Composer benchmarks. Instead, we have found weight standardization to improve performance with a small throughput degradation when training ResNet architectures on semantic segmentation tasks. There are a few papers that have found weight standardization useful as well. -| ![WeightStandardization](https://storage.googleapis.com/docs.mosaicml.com/images/methods/weight_standardization.png) | +| ![WeightStandardization](../_images/weight_standardization.png) | |:--| | *Comparing various normalization layers applied to activations (blue) and weight standardization applied to convolutional weights (orange). This figure is Figure 2 in [Qiao et al., 2019](https://arxiv.org/abs/1903.10520).* | diff --git a/docs/source/_images/alibi.png b/docs/source/_images/alibi.png new file mode 100644 index 0000000000..fbc82b29a4 Binary files /dev/null and b/docs/source/_images/alibi.png differ diff --git a/docs/source/_images/aug_mix.png b/docs/source/_images/aug_mix.png new file mode 100644 index 0000000000..f1e116ddf4 Binary files /dev/null and b/docs/source/_images/aug_mix.png differ diff --git a/docs/source/_images/block_wise_stochastic_depth.png b/docs/source/_images/block_wise_stochastic_depth.png new file mode 100644 index 0000000000..1207cb7618 Binary files /dev/null and b/docs/source/_images/block_wise_stochastic_depth.png differ diff --git a/docs/source/_images/blurpool-antialiasing.png b/docs/source/_images/blurpool-antialiasing.png new file mode 100644 index 0000000000..e320b684fb Binary files /dev/null and b/docs/source/_images/blurpool-antialiasing.png differ diff --git a/docs/source/_images/channels_last.png 
b/docs/source/_images/channels_last.png new file mode 100644 index 0000000000..5fcea9be4a Binary files /dev/null and b/docs/source/_images/channels_last.png differ diff --git a/docs/source/_images/col_out.png b/docs/source/_images/col_out.png new file mode 100644 index 0000000000..cbbaf29896 Binary files /dev/null and b/docs/source/_images/col_out.png differ diff --git a/docs/source/_images/cutmix.png b/docs/source/_images/cutmix.png new file mode 100644 index 0000000000..89bd5b2a36 Binary files /dev/null and b/docs/source/_images/cutmix.png differ diff --git a/docs/source/_images/cutout.png b/docs/source/_images/cutout.png new file mode 100644 index 0000000000..5d4ec29b1c Binary files /dev/null and b/docs/source/_images/cutout.png differ diff --git a/docs/source/_images/factorize-no-caption.png b/docs/source/_images/factorize-no-caption.png new file mode 100644 index 0000000000..09646aa25a Binary files /dev/null and b/docs/source/_images/factorize-no-caption.png differ diff --git a/docs/source/_images/ghost-batch-normalization.png b/docs/source/_images/ghost-batch-normalization.png new file mode 100644 index 0000000000..e5f4ee56a8 Binary files /dev/null and b/docs/source/_images/ghost-batch-normalization.png differ diff --git a/docs/source/_images/logo-dark-bg.png b/docs/source/_images/logo-dark-bg.png new file mode 100644 index 0000000000..abd08aed6e Binary files /dev/null and b/docs/source/_images/logo-dark-bg.png differ diff --git a/docs/source/_images/mix_up.png b/docs/source/_images/mix_up.png new file mode 100644 index 0000000000..bff62ec167 Binary files /dev/null and b/docs/source/_images/mix_up.png differ diff --git a/docs/source/_images/profiler_trace_example.png b/docs/source/_images/profiler_trace_example.png new file mode 100644 index 0000000000..7ed9fcdee0 Binary files /dev/null and b/docs/source/_images/profiler_trace_example.png differ diff --git a/docs/source/_images/progressive_resizing_vision.png 
b/docs/source/_images/progressive_resizing_vision.png new file mode 100644 index 0000000000..d415722e3e Binary files /dev/null and b/docs/source/_images/progressive_resizing_vision.png differ diff --git a/docs/source/_images/r50_aws_explorer.png b/docs/source/_images/r50_aws_explorer.png new file mode 100644 index 0000000000..db6aef9a80 Binary files /dev/null and b/docs/source/_images/r50_aws_explorer.png differ diff --git a/docs/source/_images/r50_aws_explorer_recipe.png b/docs/source/_images/r50_aws_explorer_recipe.png new file mode 100644 index 0000000000..4e250a4005 Binary files /dev/null and b/docs/source/_images/r50_aws_explorer_recipe.png differ diff --git a/docs/source/_images/rand_augment.jpg b/docs/source/_images/rand_augment.jpg new file mode 100644 index 0000000000..05904234e6 Binary files /dev/null and b/docs/source/_images/rand_augment.jpg differ diff --git a/docs/source/_images/scale_schedule.png b/docs/source/_images/scale_schedule.png new file mode 100644 index 0000000000..95a6be025d Binary files /dev/null and b/docs/source/_images/scale_schedule.png differ diff --git a/docs/source/_images/selective-backprop.png b/docs/source/_images/selective-backprop.png new file mode 100644 index 0000000000..ed8840649a Binary files /dev/null and b/docs/source/_images/selective-backprop.png differ diff --git a/docs/source/_images/seq_len_warmup.svg b/docs/source/_images/seq_len_warmup.svg new file mode 100644 index 0000000000..06b733d619 --- /dev/null +++ b/docs/source/_images/seq_len_warmup.svg @@ -0,0 +1 @@ +[SVG plot, markup and embedded stylesheet not shown: title "Sequence Length Warmup"; x-axis "Step" (ticks 2k, 4k, 6k, 8k); y-axis "Sequence Length" (ticks 0 to 1000 in steps of 200)]
diff --git a/docs/source/_images/squeeze-and-excitation.png b/docs/source/_images/squeeze-and-excitation.png new file mode 100644 index 0000000000..2c029a4684 Binary files /dev/null and b/docs/source/_images/squeeze-and-excitation.png differ diff --git a/docs/source/_images/weight_standardization.png b/docs/source/_images/weight_standardization.png new file mode 100644 index 0000000000..25d92b30ef Binary files /dev/null and b/docs/source/_images/weight_standardization.png differ diff --git a/docs/source/method_cards/scale_schedule.md b/docs/source/method_cards/scale_schedule.md index c9942f20b2..6ed65f5b67 100644 --- a/docs/source/method_cards/scale_schedule.md +++ b/docs/source/method_cards/scale_schedule.md @@ -6,7 +6,7 @@ Scale Schedule changes the number of training steps by a dilation factor and dil accordingly. Doing so varies the training budget, making it possible to explore tradeoffs between cost (measured in time or money) and the quality of the final model. -| ![scale_schedule.png](https://storage.googleapis.com/docs.mosaicml.com/images/methods/scale_schedule.png) | +| ![scale_schedule.png](../_images/scale_schedule.png) | |:--| |*Scale schedule scales the learning rate decay schedule.*| diff --git a/docs/source/method_cards/stochastic_depth.md b/docs/source/method_cards/stochastic_depth.md index c753a1d2f5..146770f65c 100644 --- a/docs/source/method_cards/stochastic_depth.md +++ b/docs/source/method_cards/stochastic_depth.md @@ -4,7 +4,7 @@ Block-wise stochastic depth assigns every residual block a probability of dropping the transformation function, leaving only the skip connection. This regularizes and reduces the amount of computation. -![block_wise_stochastic_depth.png](https://storage.googleapis.com/docs.mosaicml.com/images/methods/block_wise_stochastic_depth.png) +![block_wise_stochastic_depth.png](../_images/block_wise_stochastic_depth.png) ## How to Use