Feature Request / Improvement

In large-scale feature datasets, adding new columns (features) is inefficient when the entire dataset must be rewritten. A more efficient approach would associate only the new feature data with existing records, without rewriting the whole dataset. This could significantly reduce storage cost and speed up iteration, especially in machine learning and ETL pipelines.

Key Considerations:

Storage Efficiency: When new columns are added, only the newly added columns should be written, rather than rewriting the entire dataset. This avoids duplicating existing data and minimizes storage consumption.

Efficient Query and Write Performance: Queries and writes should link new feature data to existing records without reprocessing or modifying the entire dataset. This maintains query performance while reducing unnecessary data movement.
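The idea above can be sketched in a few lines. This is a minimal illustration, not an implementation: the primary-key column `record_id` and the feature names are hypothetical, and pandas stands in for whatever storage/query layer would actually do the linkage.

```python
# Sketch of the "write only the new columns" idea: the base table is
# never rewritten; a new feature is persisted as a separate, much
# smaller artifact keyed by the same primary key, and linked back to
# existing records at read time.
import pandas as pd

# Existing (large) feature table -- already on disk, never rewritten.
base = pd.DataFrame({
    "record_id": [1, 2, 3],
    "feature_a": [0.1, 0.2, 0.3],
})

# Newly added feature column, written on its own.
new_feature = pd.DataFrame({
    "record_id": [1, 2, 3],
    "feature_b": [10, 20, 30],
})

# At query time, the engine joins on the primary key instead of
# materializing a rewritten copy of the whole dataset.
merged = base.merge(new_feature, on="record_id", how="left")
print(merged.columns.tolist())
```

The storage win is that only `new_feature` is written; the cost moves to the read path, which is why the linkage strategy (next section) matters.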
Current Approaches:
ByteDance's scheme: it is an open question whether sorting records by primary key, as described in the article below, is the best approach for linking new columns to existing rows.
https://developer.volcengine.com/articles/7260058755952279606
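To make the trade-off concrete, here is a minimal sketch of the sort-merge linkage that primary-key ordering enables: if both the base rows and the new column file are sorted by the same key, they can be zipped together in one streaming pass, with no shuffle and no random lookups. All names here are illustrative, not from the linked article.

```python
# Streaming merge of a base table and a separately written feature
# column, both sorted ascending by primary key.
def merge_sorted_columns(base_rows, new_col_rows):
    """base_rows: iterable of (pk, row_dict); new_col_rows: iterable of
    (pk, value). Both must be sorted ascending by pk."""
    new_iter = iter(new_col_rows)
    cur = next(new_iter, None)
    for pk, row in base_rows:
        # Advance the new-column cursor until it reaches (or passes) pk.
        while cur is not None and cur[0] < pk:
            cur = next(new_iter, None)
        out = dict(row)
        # Missing keys in the new column file yield None (a left join).
        out["new_feature"] = cur[1] if cur is not None and cur[0] == pk else None
        yield pk, out

base = [(1, {"a": 0.1}), (2, {"a": 0.2}), (4, {"a": 0.4})]
extra = [(1, 10), (4, 40)]
result = list(merge_sorted_columns(base, extra))
print(result)
```

This runs in a single pass over both inputs, which is the main argument for the sort order; the open question raised above is whether keeping everything sorted is worth the write-side cost compared to alternatives such as hash-based row IDs.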
Query engine
Spark
Willingness to contribute