A simple template for TensorFlow's highly efficient CudnnLSTM module
- TensorFlow v1.8+
- CUDA v9.0+
- cuDNN v7.0+
- scikit-learn
- tqdm
How to check my CUDA and cuDNN versions
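If you are unsure what is installed, a quick sketch like the following may help; the paths below are assumptions for a typical Linux setup and may differ on your machine.

```python
# Sketch only: prints the CUDA and cuDNN versions on a typical Linux install.
# Assumes nvcc is on the PATH and cudnn.h lives under /usr/include
# (it may instead be under /usr/local/cuda/include).
import subprocess

# CUDA toolkit version, as reported by the compiler.
print(subprocess.check_output(["nvcc", "--version"]).decode())

# cuDNN version, read from the header that ships with the library.
with open("/usr/include/cudnn.h") as header:
    for line in header:
        if line.startswith(("#define CUDNN_MAJOR",
                            "#define CUDNN_MINOR",
                            "#define CUDNN_PATCHLEVEL")):
            print(line.strip())
```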
TensorFlow's performance guide includes a section on RNN performance, which states:

> On NVIDIA GPUs, the use of `tf.contrib.cudnn_rnn` should always be preferred unless you want layer normalization, which it doesn't support.
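For reference, here is a minimal sketch of how the `CudnnLSTM` layer is typically constructed and called; the shapes and hyperparameters below are illustrative and not taken from this template.

```python
# Minimal sketch of tf.contrib.cudnn_rnn.CudnnLSTM (TF v1.x); sizes are
# illustrative only, not the template's actual hyperparameters.
import tensorflow as tf

time_steps, batch_size, input_size, num_units, num_layers = 35, 20, 650, 650, 2

# CudnnLSTM expects time-major inputs: [time_steps, batch_size, input_size].
inputs = tf.placeholder(tf.float32, [time_steps, batch_size, input_size])

lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=num_layers,
                                      num_units=num_units,
                                      direction="unidirectional",
                                      dropout=0.0)

# outputs: [time_steps, batch_size, num_units];
# h, c: [num_layers, batch_size, num_units].
outputs, (h, c) = lstm(inputs, training=True)
```

Note the time-major input layout, which differs from the batch-major default of `tf.nn.dynamic_rnn`.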
According to this benchmark result by RETURNN, `CudnnLSTM` achieves significant speedups compared to TensorFlow's other LSTM implementations (~2x faster than `LSTMBlockFused` and ~5x faster than `BasicLSTM`).
We also took the tutorial code for PTB language modeling and tried running the three versions of LSTM implemented there: `BasicLSTMCell`, `LSTMBlockCell`, and `CudnnLSTM`. We found that the `CudnnLSTM` example does not run in TF v1.8 due to API changes, but after fixing minor issues we were able to run it on a single GPU.
The benchmark results we got running the "large" model are as follows:
| Module | Average wps* | Speedup w.r.t. `BasicLSTMCell` |
|---|---|---|
| `BasicLSTMCell` | 15k | 1x |
| `LSTMBlockCell` | 17k | 1.1x |
| `CudnnLSTM` | 32k | 2.1x |
*wps refers to the number of processed words per second.
In all three cases, we used a single NVIDIA Tesla P40 GPU, which showed 80-85% utilization (and 100% memory usage) during training.
The tutorial code only supports multi-GPU training with `BasicLSTMCell`; using 2 P40 GPUs we got approximately 25k wps (a 1.7x speedup w.r.t. single-GPU `BasicLSTMCell`, but still 22% slower than a single-GPU `CudnnLSTM`).
We did not test the handling of variable-length sequences per batch for `CudnnLSTM`, but there seem to be some issues (e.g., see #6633). Bucketing could be a useful (but not perfect) workaround for this problem.
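As an illustration only, a bucketing pipeline could look roughly like the sketch below, assuming `tf.contrib.data.bucket_by_sequence_length` is available in your TF version; the boundaries, batch sizes, and toy data are made up.

```python
# Illustrative sketch only: bucketing variable-length sequences with tf.data.
# Assumes tf.contrib.data.bucket_by_sequence_length is available; boundaries,
# batch sizes, and the toy data below are made up.
import tensorflow as tf

sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]
dataset = tf.data.Dataset.from_generator(
    lambda: sequences, tf.int32, output_shapes=tf.TensorShape([None]))

dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length(
    element_length_func=lambda seq: tf.shape(seq)[0],
    bucket_boundaries=[3, 5],        # buckets: length < 3, 3-4, >= 5
    bucket_batch_sizes=[2, 2, 2]))   # one batch size per bucket (len(boundaries) + 1)
```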
`CudnnLSTM` does not support layer normalization, because cuDNN itself does not support it.
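If layer normalization is a requirement, one alternative (which we did not benchmark) is TensorFlow's `tf.contrib.rnn.LayerNormBasicLSTMCell`; a rough sketch:

```python
# Rough sketch, not benchmarked here: a layer-normalized LSTM via tf.contrib.rnn.
# This path does not use cuDNN, so expect it to be slower than CudnnLSTM.
# Sizes are illustrative only.
import tensorflow as tf

batch_size, time_steps, input_size, num_units = 20, 35, 650, 650
inputs = tf.placeholder(tf.float32, [batch_size, time_steps, input_size])

cell = tf.contrib.rnn.LayerNormBasicLSTMCell(num_units, layer_norm=True)
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
```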
PyTorch's built-in `nn.LSTM` module already supports cuDNN integration (!), as shown here and here.
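For comparison, here is a minimal sketch of `nn.LSTM` (sizes are illustrative); when the module and its inputs live on the GPU, PyTorch dispatches to cuDNN automatically.

```python
# Minimal sketch of PyTorch's nn.LSTM; requires a CUDA-capable GPU for the
# cuDNN path. Sizes are illustrative only.
import torch
import torch.nn as nn

time_steps, batch_size, input_size, hidden_size, num_layers = 35, 20, 650, 650, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers).cuda()
inputs = torch.randn(time_steps, batch_size, input_size).cuda()

# output: [time_steps, batch_size, hidden_size];
# h_n, c_n: [num_layers, batch_size, hidden_size]
output, (h_n, c_n) = lstm(inputs)
```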
For one thing, PyTorch's `nn.LSTM` is not a sparsely documented `contrib` module.
While we leave a rigorous comparison between PyTorch's `nn.LSTM` and TensorFlow's `cudnn_rnn.CudnnLSTM` as future work, PyTorch's version appears to be just as efficient as, and more stable than, TensorFlow's counterpart.
When we tried running PyTorch's own LSTM language modeling example with nearly the same set of parameters (2 layers, a hidden size of 1.5k, a vocabulary of 35k, a batch size of 20, and 35 timesteps), we got around 100 milliseconds per batch on a single P40 GPU (95+% utilization).
For the aforementioned tutorial code from TensorFlow,
we got around 120 milliseconds per batch on the same machine (95+% utilization).
So, if you're already a PyTorch user and your system is built on PyTorch, there's little reason to switch to TF's `CudnnLSTM` for performance, at least for now.
Keras also has a similar-looking module that was introduced last year. We did not test it, but it appears to have nice documentation.
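For completeness, here is an untested sketch of what using Keras's `CuDNNLSTM` layer might look like; the layer stack and sizes are our assumptions based on the Keras documentation, not something we benchmarked.

```python
# Untested sketch of Keras's CuDNNLSTM layer (requires a GPU and the
# TensorFlow backend); the architecture and sizes are illustrative only.
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

model = Sequential()
model.add(CuDNNLSTM(650, return_sequences=True, input_shape=(35, 650)))
model.add(CuDNNLSTM(650))
model.add(Dense(10000, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy")
```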