-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GSProcessing] BERT Tokenizer #700
Conversation
ff21c2d
to
178da9f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the implementation looks solid. Let's work on some changes before merging.
graphstorm-processing/graphstorm_processing/config/hf_configs.py
Outdated
Show resolved
Hide resolved
graphstorm-processing/graphstorm_processing/config/hf_configs.py
Outdated
Show resolved
Hide resolved
...ng/graphstorm_processing/data_transformations/dist_transformations/dist_hf_transformation.py
Show resolved
Hide resolved
...ng/graphstorm_processing/data_transformations/dist_transformations/dist_hf_transformation.py
Outdated
Show resolved
Hide resolved
...ng/graphstorm_processing/data_transformations/dist_transformations/dist_hf_transformation.py
Outdated
Show resolved
Hide resolved
graphstorm-processing/graphstorm_processing/graph_loaders/dist_heterogeneous_loader.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Theodore Vasiloudis <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small changes requested.
graphstorm-processing/docker/0.2.1/emr-serverless/Dockerfile.cpu
Outdated
Show resolved
Hide resolved
docs/source/gs-processing/usage/distributed-processing-setup.rst
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few changes left.
Let's ensure the following:
- run
black
on the codebase to ensure formatting is correct. - Increment the version in pyproject.toml to
0.2.2
. - Ensure both images build and can run a text transformation using the new build argument on both EMR-S and SM.
graphstorm-processing/graphstorm_processing/graph_loaders/dist_heterogeneous_loader.py
Outdated
Show resolved
Hide resolved
...ng/graphstorm_processing/data_transformations/dist_transformations/dist_hf_transformation.py
Outdated
Show resolved
Hide resolved
docs/source/gs-processing/usage/distributed-processing-setup.rst
Outdated
Show resolved
Hide resolved
Co-authored-by: Theodore Vasiloudis <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, we should run some benchmarks to see if we have any bottlenecks with the implementation.
Update our result on the mag with the tokenized feature. |
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Beside implementing the BERT Tokenizer feature
Result: