-
Happy to take this piece on, as I already have some past Python code that I can work into a form that does this task reasonably efficiently. And then I'll embark on learning how to do it via Rayon. 😅
-
I simplified this task to make it much more accessible to a general audience in #63. We can consider this closed 😇.
-
@sanders41
To showcase that Rust can enable parallelism more efficiently than Python can, I have a real-world, top-down example of a problem that is quite common in NLP tasks like topic modelling. Although Python can be sped up via multiprocessing, it's inherently slower due to its interpreted nature, and there's always extra overhead from intermediate operations, garbage collection, etc. compared to a compiled language like Rust. My hypothesis is that using a library like Rayon in Rust will be significantly faster than the Python equivalent, and this could be an added trick up the sleeve of Python engineers who are preparing data for NLP tasks.
Data
A dataset that can be used to demonstrate this is the New York Times news dataset from Kaggle.
Task
Topic modelling via LDA typically requires that the raw text is first cleaned (via regex pattern matching) to remove extraneous symbols and punctuation, and then that the words are converted to their root form via a process called lemmatization. Python of course has a built-in regex module (re), and it has also long had an excellent library for lemmatization, spaCy, which is written in Cython and is really fast. But even when using spaCy, Python users are limited by the speed of for loops in Python.
Rust has a regex crate and a lemmatizer crate that can be used for the same purpose.
The NYT dataset has ~8.8K news articles of several hundred words each, and the two steps (cleaning via regex in Python, then lemmatization via Python + spaCy) are executed sequentially, in a for loop, to preprocess the data. This is expected to become increasingly slow in Python as the number of articles grows.
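For concreteness, here's a minimal sketch of the sequential version. The regex pattern, stop-word filtering, and function names are illustrative rather than my exact old code, and it assumes spaCy's en_core_web_sm model is installed.

```python
import re

import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Keep only letters, digits and whitespace (illustrative pattern)
CLEAN_PATTERN = re.compile(r"[^a-zA-Z0-9\s]")


def clean(text: str) -> str:
    """Strip extraneous symbols and punctuation from a raw article."""
    text = CLEAN_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def lemmatize(text: str) -> str:
    """Convert each word to its lemma (root form) using spaCy."""
    doc = nlp(text)
    return " ".join(tok.lemma_ for tok in doc if not tok.is_stop)


def preprocess(articles: list[str]) -> list[str]:
    # The sequential for loop: this is the part that gets slower and
    # slower as the number of articles grows.
    return [lemmatize(clean(article)) for article in articles]
```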
Parallel execution
The above task is CPU-bound and also embarrassingly parallel: each news article's final output is completely independent of the others, so the task can be sped up by breaking the dataset into chunks and processing them in parallel. In Python, this would be done by firing up a process pool via concurrent.futures or a library like joblib, which does achieve a (sublinear) speedup. But there's overhead in initiating and managing the process pool in the first place.
Some code snippets are shown below, adapted from an old workflow of mine that used pandas (though we can avoid any form of DataFrames for this piece). Joblib was the parallelization framework used, though it could just as easily be done with concurrent.futures.ProcessPoolExecutor.
In Rust, this can be sped up by using Rayon, which would perform the same role that joblib does in Python, but because Rust is compiled, we can hypothesize that there is far less overhead in the parallel execution run time.
It shouldn't be too hard to learn enough about Rayon to translate the regex (cleaning) and lemmatization logic into a parallelized execution.
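Roughly, the joblib-parallelized version (the snippets mentioned above) could look like this sketch. The chunking scheme, n_jobs value, and function names are illustrative placeholders, not the exact code from my old pandas workflow.

```python
import re

from joblib import Parallel, delayed


def preprocess_chunk(chunk: list[str]) -> list[str]:
    # Each worker process loads its own spaCy model and cleans/lemmatizes
    # its chunk of articles independently (embarrassingly parallel).
    import spacy

    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    out = []
    for article in chunk:
        text = re.sub(r"[^a-zA-Z0-9\s]", " ", article)
        text = re.sub(r"\s+", " ", text).strip().lower()
        doc = nlp(text)
        out.append(" ".join(tok.lemma_ for tok in doc if not tok.is_stop))
    return out


def preprocess_parallel(articles: list[str], n_jobs: int = 8) -> list[str]:
    # Break the dataset into one chunk per worker and run the chunks in a
    # process pool managed by joblib.
    chunk_size = max(1, len(articles) // n_jobs)
    chunks = [articles[i : i + chunk_size] for i in range(0, len(articles), chunk_size)]
    results = Parallel(n_jobs=n_jobs)(delayed(preprocess_chunk)(c) for c in chunks)
    # Flatten the per-chunk results back into a single list
    return [lemmas for chunk in results for lemmas in chunk]
```

On the Rust side, the same chunk-and-map structure is what Rayon's parallel iterators provide, so the translation should be fairly direct.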
Output
We'd measure the total preprocessing wall-clock time in Python (using whatever multiprocessing speedup techniques we can muster) to generate a lemmatized version of each news article, writing them to a new CSV file.
We'd then measure the timing for the same workflow powered by Rust + Rayon. The hypothesis is that Rust will be quite a bit faster (assuming the code logic is right) 😅
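A minimal sketch of how the Python-side timing and CSV output could be wired up (the output file name is a placeholder, and preprocess_parallel refers to the joblib sketch above):

```python
import csv
import time


def run_and_time(articles: list[str], out_path: str = "nyt_lemmatized.csv") -> float:
    # Measure total wall-clock time for preprocessing, then write one
    # lemmatized article per row to a new CSV file.
    start = time.perf_counter()
    lemmatized = preprocess_parallel(articles)  # from the joblib sketch above
    elapsed = time.perf_counter() - start

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["lemmatized_text"])
        writer.writerows([text] for text in lemmatized)

    print(f"Preprocessed {len(articles)} articles in {elapsed:.2f} s")
    return elapsed
```

The Rust + Rayon version would be timed the same way (total wall clock from reading the raw articles through to writing the lemmatized CSV), so the two numbers are directly comparable.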