Initial commit (efa8df0)

boxysean committed Nov 11, 2020

Showing 24 changed files with 1,407 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
site/
build/
dist/
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,14 @@
# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [1.0.0] - 2020-11-11
### Added
- Initial release of pylateral.

[Unreleased]: https://github.com/boxysean/pylateral/compare/v1.0.0...HEAD
[1.0.0]: https://github.com/boxysean/pylateral/releases/tag/v1.0.0

21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Sean McIntyre

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
3 changes: 3 additions & 0 deletions Makefile
@@ -0,0 +1,3 @@
.PHONY: test
test: ## Run the tests
	pytest tests/
28 changes: 28 additions & 0 deletions README.md
@@ -0,0 +1,28 @@
pylateral
=========

Intuitive multi-threaded task processing in python.

## Example

```python
import urllib.request

import pylateral

@pylateral.task
def request_and_print(url):
    response = urllib.request.urlopen(url)
    print(response.read())

URLS = [
    "https://www.nytimes.com/",
    "https://www.cnn.com/",
    "https://europe.wsj.com/",
    "https://www.bbc.co.uk/",
    "https://some-made-up-domain.com/",
]

with pylateral.task_pool():
    for url in URLS:
        request_and_print(url)

print("Complete!")
```

201 changes: 201 additions & 0 deletions docs/comparison.md
@@ -0,0 +1,201 @@
Comparison with other python libraries
======================================

There are lots of ways to skin the threading cat!

### When to use *pylateral*

- Your workload is network-bound and/or IO-bound (e.g., API calls, database queries, read/write to FTP, read/write to files).

- Your workload can be run [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel).

- You are writing a script or prototype that isn't very large or complex.

### When not to use *pylateral*

- Your workload is CPU-bound and blocked by the [Global Interpreter Lock](https://en.wikipedia.org/wiki/CPython#Design). *python* threading will not speed up such workloads; consider using [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) or [concurrent.futures.ProcessPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor) instead.

- The complexity of your program would benefit from thinking about it in terms of [futures and promises](https://en.wikipedia.org/wiki/Futures_and_promises). Consider using [asyncio](https://docs.python.org/3/library/asyncio.html) or [concurrent.futures.ThreadPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html) instead.

- You want tighter control over the lifecycle of your threads. Consider using [threading](https://docs.python.org/3/library/threading.html) instead.

- For larger workloads, consider using [dask.distributed](https://distributed.dask.org/en/latest/#), [Airflow](https://airflow.apache.org/), [Dagster](https://github.com/dagster-io/dagster/) or [Prefect](https://www.prefect.io/) to perform work across many nodes.

- You would benefit from a web UI for viewing and interacting with your tasks. For that, consider using [Airflow](https://airflow.apache.org/) or [Prefect](https://www.prefect.io/).
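To make the CPU-bound point above concrete, here is a minimal sketch (an assumed example, not part of *pylateral*) of handing CPU-heavy work to a process pool, which side-steps the GIL because each worker is a separate interpreter process:

```python
# Assumed example, not part of pylateral: CPU-bound work dispatched to
# worker processes with ProcessPoolExecutor. Threads would not help here
# because the GIL serializes pure-python computation.
import concurrent.futures

def sum_of_squares(n):
    # Deliberately CPU-heavy pure-python loop.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # executor.map preserves input order, so results line up with inputs.
        results = list(executor.map(sum_of_squares, [10, 100, 1000]))
    print(results)
```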

Feature comparison
------------------

| Feature | pylateral | [asyncio](https://docs.python.org/3/library/asyncio.html) | [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html) | [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) | [threading](https://docs.python.org/3/library/threading.html) |
| ---------------------------------- | --------- | ------- | ------------------------ | --------------- | --------- |
| Easy to adapt single-threaded code | ✅ | ❌ | ❌ | ❌ | ❌ |
| [Simple nested tasks](usage.md#working-with-nested-tasks) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Concurrent IO-bound workloads | ✅ | ✅ | ✅ | ✅ | ✅ |
| Concurrent CPU-bound workloads | ❌ | ❌ | ✅ (Process Pool) | ✅ | ❌ |
| Flexibility in using return values | ❌ | ✅ | ✅ | ✅ | ❌ |

Code comparison
----------

[PEP-3148 -- futures - execute computations asynchronously](https://www.python.org/dev/peps/pep-3148/#id13) introduces `concurrent.futures` and illustrates it by example. Here I show that example in *pylateral*, stacked up against the main threading libraries offered in python.

### `asyncio`

```python
import aiohttp
import asyncio
import sqlite3

URLS = [
    'http://www.foxnews.com/',
    'http://www.cnn.com/',
    'http://europe.wsj.com/',
    'http://www.bbc.co.uk/',
    'http://some-made-up-domain.com/',
]

async def extract_and_load(url, timeout=30):
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, timeout=timeout) as response:
                web_result = await response.text()
        print(f"{url} is {len(web_result)} bytes")

        with sqlite3.connect('example.db') as conn:
            conn.execute('CREATE TABLE IF NOT EXISTS web_results (url text, length int);')
            conn.execute('INSERT INTO web_results VALUES (?, ?)', (url, len(web_result)))
    except Exception as e:
        print(f"{url} generated an exception: {e}")
        return False
    else:
        return True

async def main():
    succeeded = await asyncio.gather(*[
        extract_and_load(url)
        for url in URLS
    ])

    print(f"Successfully completed {sum(1 for result in succeeded if result)}")

asyncio.run(main())
```

### `concurrent.futures.ThreadPoolExecutor`

```python
import concurrent.futures
import requests
import sqlite3

URLS = [
    'http://www.foxnews.com/',
    'http://www.cnn.com/',
    'http://europe.wsj.com/',
    'http://www.bbc.co.uk/',
    'http://some-made-up-domain.com/',
]

def extract_and_load(url, timeout=30):
    try:
        web_result = requests.get(url, timeout=timeout).text
        print(f"{url} is {len(web_result)} bytes")

        with sqlite3.connect('example.db') as conn:
            conn.execute('CREATE TABLE IF NOT EXISTS web_results (url text, length int);')
            conn.execute('INSERT INTO web_results VALUES (?, ?)', (url, len(web_result)))
    except Exception as e:
        print(f"{url} generated an exception: {e}")
        return False
    else:
        return True

succeeded = []

with concurrent.futures.ThreadPoolExecutor() as executor:
    future_to_url = dict(
        (executor.submit(extract_and_load, url), url)
        for url in URLS
    )

    for future in concurrent.futures.as_completed(future_to_url):
        succeeded.append(future.result())

print(f"Successfully completed {sum(1 for result in succeeded if result)}")
```

### `pylateral`

```python
import requests
import sqlite3

import pylateral

URLS = [
    'http://www.foxnews.com/',
    'http://www.cnn.com/',
    'http://europe.wsj.com/',
    'http://www.bbc.co.uk/',
    'http://some-made-up-domain.com/',
]

@pylateral.task(has_return_value=True)
def extract_and_load(url, timeout=30):
    try:
        web_result = requests.get(url, timeout=timeout).text
        print(f"{url} is {len(web_result)} bytes")

        with sqlite3.connect('example.db') as conn:
            conn.execute('CREATE TABLE IF NOT EXISTS web_results (url text, length int);')
            conn.execute('INSERT INTO web_results VALUES (?, ?)', (url, len(web_result)))
    except Exception as e:
        print(f"{url} generated an exception: {e}")
        return False
    else:
        return True

with pylateral.task_pool() as pool:
    for url in URLS:
        extract_and_load(url)

succeeded = pool.results

print(f"Successfully completed {sum(1 for result in succeeded if result)}")
```

### Unthreaded

```python
import requests
import sqlite3

URLS = [
    'http://www.foxnews.com/',
    'http://www.cnn.com/',
    'http://europe.wsj.com/',
    'http://www.bbc.co.uk/',
    'http://some-made-up-domain.com/',
]

def extract_and_load(url, timeout=30):
    try:
        web_result = requests.get(url, timeout=timeout).text
        print(f"{url} is {len(web_result)} bytes")

        with sqlite3.connect('example.db') as conn:
            conn.execute('CREATE TABLE IF NOT EXISTS web_results (url text, length int);')
            conn.execute('INSERT INTO web_results VALUES (?, ?)', (url, len(web_result)))
    except Exception as e:
        print(f"{url} generated an exception: {e}")
        return False
    else:
        return True

succeeded = [
    extract_and_load(url)
    for url in URLS
]

print(f"Successfully completed {sum(1 for result in succeeded if result)}")
```
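For completeness, since the feature table above also compares against the raw [threading](https://docs.python.org/3/library/threading.html) module, here is a sketch of the thread bookkeeping that approach requires. The `run_parallel` helper is an illustrative name invented for this example, not a real API; with the `extract_and_load` function above, you would call `run_parallel(extract_and_load, URLS)`.

```python
# Sketch of the same fan-out using the raw `threading` module. The
# run_parallel helper is invented for illustration: one thread per
# argument, results collected positionally so ordering matches the input.
import threading

def run_parallel(func, args):
    results = [None] * len(args)

    def worker(index, arg):
        results[index] = func(arg)  # each slot is written by exactly one thread

    threads = [
        threading.Thread(target=worker, args=(i, arg))
        for i, arg in enumerate(args)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until every thread finishes
    return results

# With the extract_and_load function above:
#     succeeded = run_parallel(extract_and_load, URLS)
```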
69 changes: 69 additions & 0 deletions docs/index.md
@@ -0,0 +1,69 @@
pylateral
=========

**Simple multi-threaded task processing in python**

Example
-------

```python
import urllib.request

import pylateral

@pylateral.task
def request_and_print(url):
    response = urllib.request.urlopen(url)
    print(response.read())

URLS = [
    "https://www.nytimes.com/",
    "https://www.cnn.com/",
    "https://europe.wsj.com/",
    "https://www.bbc.co.uk/",
    "https://some-made-up-domain.com/",
]

with pylateral.task_pool():
    for url in URLS:
        request_and_print(url)

print("Complete!")
```

### What's going on here

- `def request_and_print(url)` is a *pylateral* task that, when called, is run on a task pool thread rather than on the main thread.

- `with pylateral.task_pool()` allocates threads and a task pool. The context manager exits only once all of its tasks have completed.

- Each call to `request_and_print(url)` adds that task to the task pool. Meanwhile, the main thread continues execution.

- The `Complete!` statement is printed once all of the `request_and_print()` task invocations have been completed by the pool threads.

To learn more about the features of *pylateral*, check out the [usage](usage.md) section.

Background
----------

A couple of years ago, I inherited my company's codebase for getting data into our data warehouse using an ELT approach (extract-and-loads done in python, transforms done in [dbt](https://www.getdbt.com/)/SQL). The codebase has dozens of python scripts that integrate first-party and third-party data from databases, FTPs, and APIs, and that run on a scheduler (typically daily or hourly). The scripts I inherited were single-threaded procedural scripts that looked like glue code and spent most of their time in network I/O. This got my company pretty far!

As my team and I added more and more integrations with more and more data, we wanted to have faster and faster scripts to reduce our dev cycles and reduce our multi-hour nightly jobs to minutes. Because our scripts were network-bound, multi-threading was a good way to accomplish this, and so I looked into `concurrent.futures` and `asyncio`, but I decided against these options because:

1. It wasn't immediately apparent how to adapt my codebase to use these libraries without fundamental changes to our execution platform, reworking our scripts from the ground up, or adding significant multi-threading code to each script.

2. I believe the procedural-style glue code we have is easy to comprehend, which helps us support a wide variety of programs at scale.

And so, I designed *pylateral*, a simple interface to `concurrent.futures.ThreadPoolExecutor` for extract-and-load workloads. The design considerations of this interface include:

- The usage is minimally-invasive to the original un-threaded approach of my company's codebase. (And so, teaching the library has been fairly straightforward despite the multi-threaded paradigm shift.)

- The `@pylateral.task` decorator should be used to encapsulate a homogeneous method accepting different parameters. The contents of the method should be primarily I/O to achieve the concurrency gains of python multi-threading.

- If no `pylateral.task_pool` context manager has been entered, or if it has been disabled by an environment variable, the `@pylateral.task` decorator does nothing (and the code runs serially).

- While it's possible to return a value from a `@pylateral.task` method, I encourage my team to use the decorator to start-and-complete work; think of writing "embarrassingly parallel" methods that can be "mapped".
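To make these design considerations concrete, here is a rough sketch of how such an interface *could* be built on top of `concurrent.futures.ThreadPoolExecutor`. This is an illustration of the idea only, not *pylateral*'s actual source; the names `_active_pool` and `_Pool` are invented for the example.

```python
# Rough sketch of the idea, not pylateral's actual implementation.
# A decorator submits calls to a shared ThreadPoolExecutor while a pool is
# active, and simply runs the function inline (serially) otherwise.
import concurrent.futures
import contextlib
import functools

_active_pool = None  # the currently-open pool, if any (invented name)

class _Pool:
    def __init__(self, max_workers=None):
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers)
        self._futures = []

    def submit(self, fn, *args, **kwargs):
        self._futures.append(self._executor.submit(fn, *args, **kwargs))

    @property
    def results(self):
        # Blocks until every submitted task has finished.
        return [future.result() for future in self._futures]

@contextlib.contextmanager
def task_pool(max_workers=None):
    global _active_pool
    _active_pool = _Pool(max_workers)
    try:
        yield _active_pool
        _active_pool._executor.shutdown(wait=True)  # wait for remaining tasks
    finally:
        _active_pool = None

def task(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if _active_pool is None:
            return fn(*args, **kwargs)  # no pool: degrade to serial execution
        _active_pool.submit(fn, *args, **kwargs)
    return wrapper
```

Used like the examples above: decorate a function with `@task`, call it inside `with task_pool() as pool:`, and read `pool.results` afterwards.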

### Why not other libraries?

I think *pylateral* fills an unmet need in python's concurrency ecosystem: a simple way to gain the benefits of multi-threading without radically transforming either mindset or codebase.

That said, I don't think *pylateral* is a [silver bullet](https://en.wikipedia.org/wiki/No_Silver_Bullet). See my [comparison](comparison.md) of *pylateral* against other concurrency offerings.
6 changes: 6 additions & 0 deletions docs/requirements.txt
@@ -0,0 +1,6 @@
mkdocs==1.0.4
markdown==3.2
mkdocs-exclude==1.0.2
mkdocs-material==4.6.2
markdown-include==0.5.1
pygments==2.5.2