Gh pages branch 21.10 #3902

Merged
merged 9 commits on Oct 29, 2021
68 changes: 54 additions & 14 deletions docs/FAQ.md
@@ -1,7 +1,7 @@
---
layout: page
title: Frequently Asked Questions
nav_order: 11
nav_order: 12
---
# Frequently Asked Questions

@@ -10,18 +10,19 @@ nav_order: 11

### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?

The RAPIDS Accelerator for Apache Spark requires version 3.0.1, 3.0.2, 3.1.1 or 3.1.2 of Apache
Spark. Because the plugin replaces parts of the physical plan that Apache Spark considers to be
internal the code for those plans can change even between bug fix releases. As a part of our
The RAPIDS Accelerator for Apache Spark requires version 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2 or 3.2.0 of
Apache Spark. Because the plugin replaces parts of the physical plan that Apache Spark considers to
be internal, the code for those plans can change even between bug fix releases. As a part of our
process, we try to stay on top of these changes and release updates as quickly as possible.

### Which distributions are supported?

The RAPIDS Accelerator for Apache Spark officially supports:
- [Apache Spark](get-started/getting-started-on-prem.md)
- [AWS EMR 6.2.0, 6.3.0](get-started/getting-started-aws-emr.md)
- [AWS EMR 6.2+](get-started/getting-started-aws-emr.md)
- [Databricks Runtime 7.3, 8.2](get-started/getting-started-databricks.md)
- [Google Cloud Dataproc 2.0](get-started/getting-started-gcp.md)
- Cloudera CDP 7.1.6+

Most distributions based on a supported Apache Spark version should work, but because the plugin
replaces parts of the physical plan that Apache Spark considers to be internal the code for those
@@ -30,15 +31,15 @@ to set up testing and validation on their distributions.

### What CUDA versions are supported?

CUDA 11.0 and 11.2 are currently supported. Please look [here](download.md) for download links for
the latest release.
CUDA 11.x is currently supported. Please look [here](download.md) for download links for the latest
release.

### What hardware is supported?

The plugin is tested and supported on V100, T4, A10, A30 and A100 datacenter GPUs. It is possible
to run the plugin on GeForce desktop hardware with Volta or better architectures. GeForce hardware
does not support [CUDA enhanced
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#enhanced-compat-minor-releases),
does not support [CUDA forward
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title),
and will need CUDA 11.2 installed. If not, the following error will be displayed:

```
@@ -47,6 +48,9 @@ ai.rapids.cudf.CudaException: forward compatibility was attempted on non support
at com.nvidia.spark.rapids.GpuDeviceManager$.findGpuAndAcquire(GpuDeviceManager.scala:78)
```

More information about cards that support forward compatibility can be found
[here](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#faq).

### How can I check if the RAPIDS Accelerator is installed and which version is running?

On startup the RAPIDS Accelerator will log a warning message on the Spark driver showing the
@@ -120,6 +124,17 @@ starts. If you are only going to run a single query that only takes a few second
be problematic. In general if you are going to do 30 seconds or more of processing within a single
session the overhead can be amortized.

### How long does it take to translate a query to run on the GPU?

The time it takes to translate the Apache Spark physical plan to one that can run on the GPU
is proportional to the size of the plan, but it also depends on the CPU you are
running on and whether the JVM has optimized that code path yet. The first queries run in a client will
be slower than later queries. Small queries can typically be translated in a millisecond or two, while
larger queries can take tens of milliseconds. In all cases tested the translation time is orders of
magnitude smaller than the total runtime of the query.

See the entry on [explain](#explain) for details on how to measure this for your queries.

### How can I tell what will run on the GPU and what will not run on it?
<a name="explain"></a>

@@ -166,10 +181,27 @@ In this `indicator` is one of the following
* will not run on the GPU with an explanation why
* will be removed from the plan with a reason why

Generally if an operator is not compatible with Spark for some reason and is off the explanation
Generally, if an operator is not compatible with Spark for some reason and is turned off, the explanation
will include information about how it is incompatible and what configs to set to enable the
operator if you can accept the incompatibility.

These messages are logged at the WARN level, so you should see them even in `spark-shell`, which by
default only logs at WARN or above.
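
If you want to surface these messages for a particular query from an interactive session, a minimal
sketch (assuming the plugin jars are on the classpath and the `spark.rapids.sql.explain` setting is
available in your version) is:

```
// Log why operators will not run on the GPU, then run a small query to trigger the messages.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")
spark.range(1000).selectExpr("id", "id % 3 AS g").groupBy("g").count().collect()
```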

This translation takes place in two steps. The first step looks at the plan, figures out what
can be translated to the GPU, and then does the translation. The second step optimizes the
transitions between the CPU and the GPU.
Explain will also log how long these translations took at the INFO level with lines like:

```
INFO GpuOverrides: Plan conversion to the GPU took 3.13 ms
INFO GpuOverrides: GPU plan transition optimization took 1.66 ms
```

Because these lines are logged at the INFO level, the default logging level for `spark-shell` will not
display them. If you want to monitor these timings for your queries you might need to adjust your
logging configuration.
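
One way to do that for a quick check, as a sketch rather than a required setup, is to raise the log
level from within the session before asking for the plan:

```
// In spark-shell: show INFO messages so the timing lines above are printed,
// then trigger plan translation with an explain.
sc.setLogLevel("INFO")
spark.sql("SELECT g, COUNT(*) FROM (SELECT id % 3 AS g FROM range(100)) t GROUP BY g").explain()
```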

### Why does the plan for the GPU query look different from the CPU query?

Typically, there is a one to one mapping between CPU stages in a plan and GPU stages. There are a
@@ -228,12 +260,18 @@ efficient to stay on the CPU instead of going back and forth.

Yes, DPP still works. It might not be as efficient as it could be, and we are working to improve it.

DPP is not supported on Databricks with the plugin.
Queries on Databricks will not fail, but they cannot benefit from DPP.

### Is Adaptive Query Execution (AQE) Supported?

In the 0.2 release, AQE is supported but all exchanges will default to the CPU. As of the 0.3
release, when running on Spark 3.0.1 and higher, any operation that is supported on the GPU will stay on
the GPU when AQE is enabled.

AQE is not supported on Databricks with the plugin.
If AQE is enabled on Databricks, queries may fail with a `StackOverflowError`.
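
If you hit this, a minimal workaround sketch (assuming you can accept running the job without AQE)
is to turn it off explicitly, for example from a notebook cell:

```
// Disable AQE for this session to avoid the StackOverflowError on Databricks.
spark.conf.set("spark.sql.adaptive.enabled", "false")
```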

#### Why does my query show as not on the GPU when Adaptive Query Execution is enabled?

When running an `explain()` on a query where AQE is on, it is possible that AQE has not finalized
@@ -250,13 +288,15 @@ AdaptiveSparkPlan isFinalPlan=false

### Are cache and persist supported?

Yes cache and persist are supported, but they are not GPU accelerated yet. We are working with
the Spark community on changes that would allow us to accelerate compression when caching data.
Yes, cache and persist are supported. The cache is GPU accelerated,
but the cached data is still stored in host memory.
Please refer to [RAPIDS Cache Serializer](./additional-functionality/cache-serializer.md)
for more details.
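
From the user's point of view caching itself does not change. As a sketch, assuming the session was
started with the cache serializer conf described in that page:

```
// Cache as usual; with the RAPIDS cache serializer the compression runs on the GPU
// while the cached data itself stays in host memory.
val df = spark.range(0, 1000000).selectExpr("id", "id % 10 AS bucket")
df.cache()
df.count()   // materializes the cached data
```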

### Can I cache data into GPU memory?

No, that is not currently supported. It would require much larger changes to Apache Spark to be able
to support this.
No, that is not currently supported.
It would require much larger changes to Apache Spark to be able to support this.

### Is PySpark supported?

2 changes: 1 addition & 1 deletion docs/additional-functionality/README.md
@@ -1,7 +1,7 @@
---
layout: page
title: Additional Functionality
nav_order: 9
nav_order: 10
has_children: true
permalink: /additional-functionality/
---
12 changes: 4 additions & 8 deletions docs/additional-functionality/cache-serializer.md
@@ -29,21 +29,17 @@ nav_order: 2
`spark.sql.inMemoryColumnarStorage.enableVectorizedReader` will not be honored as the GPU
data is always read in as columnar. If `spark.rapids.sql.enabled` is set to false
the cached objects will still be compressed on the CPU as a part of the caching process.

Please note that ParquetCachedBatchSerializer doesn't support negative decimal scale, so if
`spark.sql.legacy.allowNegativeScaleOfDecimal` is set to true ParquetCachedBatchSerializer
should not be used. Using the serializer with negative decimal scales will generate
an error at runtime.

To use this serializer, please run Spark with the following conf:
```
spark-shell --conf spark.sql.cache.serializer=com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer"
spark-shell --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer
```


## Supported Types

All types are supported on the CPU, on the GPU, ArrayType, MapType and BinaryType are not
supported. If an unsupported type is encountered the Rapids Accelerator for Apache Spark will fall
All types are supported on the CPU.
On the GPU, MapType and BinaryType are not supported.
If an unsupported type is encountered, the RAPIDS Accelerator for Apache Spark will fall
back to using the CPU for caching.

51 changes: 30 additions & 21 deletions docs/additional-functionality/rapids-shuffle.md
@@ -28,6 +28,13 @@ in these scenarios:
- GPU-to-GPU: Shuffle blocks that were able to fit in GPU memory.
- Host-to-GPU and Disk-to-GPU: Shuffle blocks that spilled to host (or disk) but will be manifested
in the GPU in the downstream Spark task.

The RAPIDS Shuffle Manager uses the `spark.shuffle.manager` plugin interface in Spark and it relies
on fast connections between executors, where shuffle data is kept in a cache backed by GPU, host, or disk.
As such, it doesn't implement functionality to interact with the External Shuffle Service (ESS).
To enable the RAPIDS Shuffle Manager, users need to disable ESS using `spark.shuffle.service.enabled=false`.
Note that Spark's Dynamic Allocation feature requires ESS to be configured, and must also be
disabled with `spark.dynamicAllocation.enabled=false`.
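
As a sketch, a `spark-shell` invocation that satisfies these requirements (using the Spark 3.1.2 shim
class from the table later in this document as an example) could look like:

```
spark-shell \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark312.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false
```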

### System Setup

@@ -36,7 +43,7 @@ be installed on the host and inside Docker containers (if not baremetal). A host
requirements, like the MLNX_OFED driver and `nv_peer_mem` kernel module.

The minimum UCX requirement for the RAPIDS Shuffle Manager is
[UCX 1.11.0](https://github.com/openucx/ucx/releases/tag/v1.11.0).
[UCX 1.11.2](https://github.com/openucx/ucx/releases/tag/v1.11.2).

#### Baremetal

@@ -66,9 +73,9 @@ The minimum UCX requirement for the RAPIDS Shuffle Manager is
further.

2. Fetch and install the UCX package for your OS from:
[UCX 1.11.0](https://github.com/openucx/ucx/releases/tag/v1.11.0).
[UCX 1.11.2](https://github.com/openucx/ucx/releases/tag/v1.11.2).

NOTE: Please install the artifact with the newest CUDA 11.x version (for UCX 1.11.0 please
NOTE: Please install the artifact with the newest CUDA 11.x version (for UCX 1.11.2 please
pick CUDA 11.2) as CUDA 11 introduced [CUDA Enhanced Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#enhanced-compat-minor-releases).
Starting with UCX 1.12, UCX will stop publishing individual artifacts for each minor version of CUDA.

@@ -78,35 +85,35 @@ The minimum UCX requirement for the RAPIDS Shuffle Manager is
RDMA packages have extra requirements that should be satisfied by MLNX_OFED.

##### CentOS UCX RPM
The UCX packages for CentOS 7 and 8 are divided into different RPMs. For example, UCX 1.11.0
The UCX packages for CentOS 7 and 8 are divided into different RPMs. For example, UCX 1.11.2
available at
https://github.com/openucx/ucx/releases/download/v1.11.0/ucx-v1.11.0-centos7-mofed5.x-cuda11.2.tar.bz2
https://github.com/openucx/ucx/releases/download/v1.11.2/ucx-v1.11.2-centos7-mofed5.x-cuda11.2.tar.bz2
contains:

```
ucx-devel-1.11.0-1.el7.x86_64.rpm
ucx-debuginfo-1.11.0-1.el7.x86_64.rpm
ucx-1.11.0-1.el7.x86_64.rpm
ucx-cuda-1.11.0-1.el7.x86_64.rpm
ucx-rdmacm-1.11.0-1.el7.x86_64.rpm
ucx-cma-1.11.0-1.el7.x86_64.rpm
ucx-ib-1.11.0-1.el7.x86_64.rpm
ucx-devel-1.11.2-1.el7.x86_64.rpm
ucx-debuginfo-1.11.2-1.el7.x86_64.rpm
ucx-1.11.2-1.el7.x86_64.rpm
ucx-cuda-1.11.2-1.el7.x86_64.rpm
ucx-rdmacm-1.11.2-1.el7.x86_64.rpm
ucx-cma-1.11.2-1.el7.x86_64.rpm
ucx-ib-1.11.2-1.el7.x86_64.rpm
```

For a setup without RoCE or Infiniband networking, the only packages required are:

```
ucx-1.11.0-1.el7.x86_64.rpm
ucx-cuda-1.11.0-1.el7.x86_64.rpm
ucx-1.11.2-1.el7.x86_64.rpm
ucx-cuda-1.11.2-1.el7.x86_64.rpm
```

If accelerated networking is available, the package list is:

```
ucx-1.11.0-1.el7.x86_64.rpm
ucx-cuda-1.11.0-1.el7.x86_64.rpm
ucx-rdmacm-1.11.0-1.el7.x86_64.rpm
ucx-ib-1.11.0-1.el7.x86_64.rpm
ucx-1.11.2-1.el7.x86_64.rpm
ucx-cuda-1.11.2-1.el7.x86_64.rpm
ucx-rdmacm-1.11.2-1.el7.x86_64.rpm
ucx-ib-1.11.2-1.el7.x86_64.rpm
```

---
@@ -145,7 +152,7 @@ system if you have RDMA capable hardware.
Within the Docker container we need to install UCX and its requirements. These are Dockerfile
examples for Ubuntu 18.04:

The following are examples of Docker containers with UCX 1.11.0 and cuda-11.2 support.
The following are examples of Docker containers with UCX 1.11.2 and cuda-11.2 support.

| OS Type | RDMA | Dockerfile |
| ------- | ---- | ---------- |
@@ -281,7 +288,6 @@ In this section, we are using a docker container built using the sample dockerfi
| Spark Shim | spark.shuffle.manager value |
| -----------| -------------------------------------------------------- |
| 3.0.1 | com.nvidia.spark.rapids.spark301.RapidsShuffleManager |
| 3.0.1 EMR | com.nvidia.spark.rapids.spark301emr.RapidsShuffleManager |
| 3.0.2 | com.nvidia.spark.rapids.spark302.RapidsShuffleManager |
| 3.0.3 | com.nvidia.spark.rapids.spark303.RapidsShuffleManager |
| 3.0.4 | com.nvidia.spark.rapids.spark304.RapidsShuffleManager |
@@ -290,7 +296,7 @@ In this section, we are using a docker container built using the sample dockerfi
| 3.1.2 | com.nvidia.spark.rapids.spark312.RapidsShuffleManager |
| 3.1.3 | com.nvidia.spark.rapids.spark313.RapidsShuffleManager |

2. Settings for UCX 1.11.0+:
2. Settings for UCX 1.11.2+:

Minimum configuration:

@@ -326,6 +332,9 @@ Apache Spark 3.1.3 is: `com.nvidia.spark.rapids.spark313.RapidsShuffleManager`.
Please note that `LD_LIBRARY_PATH` may need to be set if the UCX library is installed in a
non-standard location.

With the RAPIDS Shuffle Manager configured, the setting `spark.rapids.shuffle.enabled` (default on)
can be used to enable or disable use of the RAPIDS Shuffle Manager within your application.
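
For example, a sketch of toggling it from within a session, assuming your deployment allows changing
this setting at runtime:

```
// Fall back to the regular Spark shuffle path for part of the job...
spark.conf.set("spark.rapids.shuffle.enabled", "false")
// ...and re-enable the RAPIDS Shuffle Manager transport afterwards.
spark.conf.set("spark.rapids.shuffle.enabled", "true")
```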

#### UCX Environment Variables
- `UCX_TLS`:
- `cuda_copy`, and `cuda_ipc`: enables handling of CUDA memory in UCX, both for copy-based transport
20 changes: 10 additions & 10 deletions docs/additional-functionality/rapids-udfs.md
@@ -139,38 +139,38 @@ in the [udf-examples](../../udf-examples) project.

### Spark Scala UDF Examples

- [URLDecode](https://github.com/NVIDIA/spark-rapids/tree/main/udf-examples/src/main/scala/com/nvidia/spark/rapids/udf/scala/URLDecode.scala)
- [URLDecode](../../udf-examples/src/main/scala/com/nvidia/spark/rapids/udf/scala/URLDecode.scala)
decodes URL-encoded strings using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
- [URLEncode](https://github.com/NVIDIA/spark-rapids/tree/main/udf-examples/src/main/scala/com/nvidia/spark/rapids/udf/scala/URLEncode.scala)
- [URLEncode](../../udf-examples/src/main/scala/com/nvidia/spark/rapids/udf/scala/URLEncode.scala)
URL-encodes strings using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)

### Spark Java UDF Examples

- [URLDecode](https://github.com/NVIDIA/spark-rapids/tree/main/udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/URLDecode.java)
- [URLDecode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/URLDecode.java)
decodes URL-encoded strings using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
- [URLEncode](https://github.com/NVIDIA/spark-rapids/tree/main/udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/URLEncode.java)
- [URLEncode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/URLEncode.java)
URL-encodes strings using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
- [CosineSimilarity](https://github.com/NVIDIA/spark-rapids/tree/main/udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java)
- [CosineSimilarity](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java)
computes the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
between two float vectors using [native code](https://github.com/NVIDIA/spark-rapids/tree/main/udf-examples/src/main/cpp/src)
between two float vectors using [native code](../../udf-examples/src/main/cpp/src)

### Hive UDF Examples

- [URLDecode](https://github.com/NVIDIA/spark-rapids/tree/main/udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/URLDecode.java)
- [URLDecode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/URLDecode.java)
implements a Hive simple UDF using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
to decode URL-encoded strings
- [URLEncode](https://github.com/NVIDIA/spark-rapids/tree/main/udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/URLEncode.java)
- [URLEncode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/URLEncode.java)
implements a Hive generic UDF using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
to URL-encode strings
- [StringWordCount](https://github.com/NVIDIA/spark-rapids/tree/main/udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java)
- [StringWordCount](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java)
implements a Hive simple UDF using
[native code](https://github.com/NVIDIA/spark-rapids/tree/main/udf-examples/src/main/cpp/src) to count words in strings
[native code](../../udf-examples/src/main/cpp/src) to count words in strings


## GPU Support for Pandas UDF
@@ -22,16 +22,16 @@
# See: https://github.com/openucx/ucx/releases/

ARG CUDA_VER=11.2.2
ARG UCX_VER=v1.11.0
ARG UCX_VER=1.11.2
ARG UCX_CUDA_VER=11.2

FROM nvidia/cuda:${CUDA_VER}-runtime-centos7
ARG UCX_VER
ARG UCX_CUDA_VER

RUN yum update -y && yum install -y wget bzip2
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/$UCX_VER/ucx-$UCX_VER-centos7-mofed5.x-cuda$UCX_CUDA_VER.tar.bz2
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-v$UCX_VER-centos7-mofed5.x-cuda$UCX_CUDA_VER.tar.bz2
RUN cd /tmp && tar -xvf *.bz2 && \
yum install -y ucx-1.11.0-1.el7.x86_64.rpm && \
yum install -y ucx-cuda-1.11.0-1.el7.x86_64.rpm && \
yum install -y ucx-$UCX_VER-1.el7.x86_64.rpm && \
yum install -y ucx-cuda-$UCX_VER-1.el7.x86_64.rpm && \
rm -rf /tmp/*.rpm