-
Happy to take this piece on, as I already have some past Python code that I can work into a form that does this task reasonably efficiently. And then I'll embark on learning how to do it via Rayon. 😅
-
I simplified this task to make it much more accessible to a general audience in #63. We can consider this closed 😇.
-
@sanders41
To showcase that Rust can enable parallelism more efficiently than Python can, I have a real-world, top-down example of a problem that is quite common in NLP tasks like topic modelling. Although Python can be sped up via multiprocessing, it's inherently slower due to its interpreted nature, and there's always extra overhead from intermediate operations, garbage collection, etc. compared to a compiled language like Rust. My hypothesis is that using a library like Rayon in Rust will be significantly faster than the Python equivalent, and this could be an added trick up the sleeve of Python engineers who are preparing data for NLP tasks.
Data
A dataset that can be used to demonstrate this is the New York Times news dataset from Kaggle.
Task
Topic modelling via LDA typically requires that the raw text is first cleaned (via regex pattern matching) to remove extraneous symbols and punctuation, and then that the words are converted to their root form via a process called lemmatization. Python of course has a built-in regex module (re), and it has also long had an excellent library for lemmatization, spaCy, which is written in Cython and is really fast. But even when using spaCy, Python users are limited by the speed of for loops in Python.
Rust has a regex crate and a lemmatizer crate that can be used for the same purpose.
The NYT dataset has ~8.8K news articles of several hundred words each, and the two steps (cleaning via regex in Python, then lemmatization via Python + spaCy) are executed sequentially, in a for loop, to preprocess the data. This is expected to become increasingly slow in Python as the number of articles grows.
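For concreteness, here's a minimal sketch of the sequential version. The regex pattern, stop-word filtering, and function names are illustrative rather than my exact old code, and it assumes spaCy's en_core_web_sm model is installed.

```python
import re

import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Keep only letters, digits and whitespace (illustrative pattern)
CLEAN_PATTERN = re.compile(r"[^a-zA-Z0-9\s]")


def clean(text: str) -> str:
    """Strip extraneous symbols and punctuation from a raw article."""
    text = CLEAN_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def lemmatize(text: str) -> str:
    """Convert each word to its lemma (root form) using spaCy."""
    doc = nlp(text)
    return " ".join(tok.lemma_ for tok in doc if not tok.is_stop)


def preprocess(articles: list[str]) -> list[str]:
    # The sequential for loop: this is the part that gets slower and
    # slower as the number of articles grows.
    return [lemmatize(clean(article)) for article in articles]
```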
Parallel execution
The above task is CPU-bound and also embarrassingly parallel: each news article's final output is completely independent of the others, so the task can be sped up by breaking the dataset into chunks and processing them in parallel. In Python, this would be done by firing up a process pool via concurrent.futures or a library like joblib, which does achieve a (sublinear) speedup. But there's overhead in initiating and managing the process pool in the first place.
Some code snippets are shown below, adapted from an old workflow of mine that used pandas (though we can avoid any form of DataFrames for this piece). Joblib was the parallelization framework used, though it could just as easily be done with concurrent.futures.ProcessPoolExecutor.
In Rust, this can be sped up by using Rayon, which would perform the same role that joblib does in Python, but because Rust is compiled, we can hypothesize that there is far less overhead in the parallel execution run time.
It shouldn't be too hard to learn enough about Rayon to translate the regex (cleaning) and lemmatization logic into a parallelized execution.
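Roughly, the joblib-parallelized version (the snippets mentioned above) could look like this sketch. The chunking scheme, n_jobs value, and function names are illustrative placeholders, not the exact code from my old pandas workflow.

```python
import re

from joblib import Parallel, delayed


def preprocess_chunk(chunk: list[str]) -> list[str]:
    # Each worker process loads its own spaCy model and cleans/lemmatizes
    # its chunk of articles independently (embarrassingly parallel).
    import spacy

    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    out = []
    for article in chunk:
        text = re.sub(r"[^a-zA-Z0-9\s]", " ", article)
        text = re.sub(r"\s+", " ", text).strip().lower()
        doc = nlp(text)
        out.append(" ".join(tok.lemma_ for tok in doc if not tok.is_stop))
    return out


def preprocess_parallel(articles: list[str], n_jobs: int = 8) -> list[str]:
    # Break the dataset into one chunk per worker and run the chunks in a
    # process pool managed by joblib.
    chunk_size = max(1, len(articles) // n_jobs)
    chunks = [articles[i : i + chunk_size] for i in range(0, len(articles), chunk_size)]
    results = Parallel(n_jobs=n_jobs)(delayed(preprocess_chunk)(c) for c in chunks)
    # Flatten the per-chunk results back into a single list
    return [lemmas for chunk in results for lemmas in chunk]
```

On the Rust side, the same chunk-and-map structure is what Rayon's parallel iterators provide, so the translation should be fairly direct.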
Output
We'd measure the total preprocessing wall-clock time in Python (using whatever multiprocessing speedup techniques we can muster) to generate a lemmatized version of each news article, writing them to a new CSV file.
We'd then measure the timing for the same workflow powered by Rust + Rayon. The hypothesis is that Rust will be quite a bit faster (assuming the code logic is right) 😅
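A minimal sketch of how the Python-side timing and CSV output could be wired up (the output file name is a placeholder, and preprocess_parallel refers to the joblib sketch above):

```python
import csv
import time


def run_and_time(articles: list[str], out_path: str = "nyt_lemmatized.csv") -> float:
    # Measure total wall-clock time for preprocessing, then write one
    # lemmatized article per row to a new CSV file.
    start = time.perf_counter()
    lemmatized = preprocess_parallel(articles)  # from the joblib sketch above
    elapsed = time.perf_counter() - start

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["lemmatized_text"])
        writer.writerows([text] for text in lemmatized)

    print(f"Preprocessed {len(articles)} articles in {elapsed:.2f} s")
    return elapsed
```

The Rust + Rayon version would be timed the same way (total wall clock from reading the raw articles through to writing the lemmatized CSV), so the two numbers are directly comparable.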