
Commit

Updated data plug README: add not-thread-safe explanation
vlianCMU committed Mar 29, 2024
1 parent 239bb30 commit a792685
Showing 1 changed file with 5 additions and 3 deletions.
8 changes: 5 additions & 3 deletions sdt_dask/dataplugs/README.md
@@ -11,8 +11,8 @@ To create your own DataPlug, you must provide two key files: a Python module (`y
2. **Define Initialization Parameters**: Customize the `__init__` method to accept parameters specific to your data source. These parameters can vary widely depending on the data source: file paths, API keys, database credentials, or any other configuration needed for data access.
3. **Implement `get_data` Method**: This method is the core of your DataPlug, tasked with retrieving and cleaning the data before returning a pandas DataFrame. The method should accept a `keys` argument as a tuple containing the identifiers or parameters needed to fetch the specific dataset. This flexible approach accommodates a wide range of data sources and retrieval scenarios.

4. **Important - Non-Serializable and Not-Thread-Safe Objects**:
When distributing tasks across Dask workers, avoid pre-initialized instances of objects that maintain state, hold open connections, or own resources that cannot be serialized (e.g., `botocore.client.S3` instances). Such objects cannot be safely serialized and transferred across processes, so create and use them within the scope of each task instead. That way every task manages its own resources, which improves process safety and stability; the same principle applies to any non-serializable object used in distributed computing tasks. Likewise, to guarantee thread safety, create a dedicated instance of any object that is not safe to share across threads or processes: generate a new resource or client for each thread or operational context. This prevents data-sharing and synchronization issues in concurrent environments, since each thread then operates on its own isolated instance (see the sketch after this list).

5. **(Optional) Additional Methods**: Beyond `get_data`, you may implement any number of private or public methods to aid in data retrieval, transformation, or cleaning. Examples include methods for parsing file names, performing complex queries on databases, or applying specific data cleaning operations tailored to your data source.
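
To make steps 2-4 concrete, here is a minimal sketch of a DataPlug. The class name, the `db_path` parameter, and the SQLite-backed source are illustrative assumptions, not taken from this repository; the point is that `__init__` stores only plain, picklable configuration while `get_data` creates the non-serializable resource inside the task scope.

```python
from contextlib import closing
import sqlite3

import pandas as pd


class SQLiteDataPlug:
    """Illustrative sketch: plain config in __init__, resources created per task."""

    def __init__(self, db_path: str):
        # Store only picklable configuration; do NOT open a connection here,
        # because the plug instance may be serialized and shipped to Dask workers.
        self.db_path = db_path

    def get_data(self, keys: tuple) -> pd.DataFrame:
        # `keys` carries the identifiers for one dataset, e.g. ("site_table",).
        (table_name,) = keys
        # Create the non-serializable resource inside the task scope and
        # close it before returning.
        with closing(sqlite3.connect(self.db_path)) as conn:
            df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
        # ... apply any source-specific cleaning here before returning ...
        return df
```

A runner would then call `plug.get_data(("some_table",))` once per key tuple, and each call manages its own connection.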

@@ -77,4 +77,6 @@ And also assumes the AWS configuration has been set up in local environment
```python
data_plug.get_data(("s3-file-key",))
```
- **Important Notice 1**: Avoid using pre-initialized S3 client objects within Dask tasks that are distributed to Dask workers. Instances of `botocore.client.S3` hold open connections and internal state, so they cannot be correctly serialized or safely transferred across processes. Instead, create and use S3 client instances within each task, confining each client to the execution scope of an individual task. This notice applies to any object that maintains state, open connections, or other non-serializable resources.

- **Important Notice 2**: To ensure thread safety, create a new `boto3.session.Session()` and, from it, a new client instance for each operation. The boto3 client is not thread-safe and should not be shared across threads. Creating per-thread instances prevents data-sharing and synchronization issues when accessing AWS resources concurrently across threads or processes. Apply this pattern to any object that is not safe to share across threads, so that each thread works with its own isolated instance; a sketch of the pattern follows.
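
A minimal sketch combining Notices 1 and 2, assuming the plug reads CSV objects from a bucket; the class name and the `bucket` parameter are illustrative, not the repository's actual S3 plug.

```python
import io

import boto3
import pandas as pd


class S3CSVPlugSketch:
    """Illustrative sketch: a fresh boto3 Session and client per get_data call."""

    def __init__(self, bucket: str):
        # Plain, picklable configuration only -- no client is created here.
        self.bucket = bucket

    def get_data(self, keys: tuple) -> pd.DataFrame:
        (file_key,) = keys
        # A new Session and client per call, so no stateful object is shared
        # across threads, processes, or Dask workers.
        session = boto3.session.Session()
        s3 = session.client("s3")
        obj = s3.get_object(Bucket=self.bucket, Key=file_key)
        return pd.read_csv(io.BytesIO(obj["Body"].read()))
```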
