Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSProcessing] BERT Tokenizer #700

Merged
merged 40 commits into from
Feb 1, 2024
Merged

Conversation

jalencato
Copy link
Collaborator

@jalencato jalencato commented Jan 10, 2024

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


Beside implementing the BERT Tokenizer feature

  • Add etype_featsize dict in graph loader to fix a bug
  • Add dependency for setuptools that is necessary for emr-serverless run
  • Reformat the _process_node_feature and _process_edge feature part to process tokenize feature and also allow enough backward compatibilities.

Result:

@jalencato jalencato added the ready able to trigger the CI label Jan 26, 2024
@jalencato jalencato marked this pull request as ready for review January 26, 2024 21:55
@thvasilo thvasilo added gsprocessing For issues and PRs related the the GSProcessing library 0.2.2 labels Jan 29, 2024
@thvasilo thvasilo added this to the 0.2.2 Release milestone Jan 29, 2024
@thvasilo thvasilo self-requested a review January 29, 2024 18:48
Copy link
Contributor

@thvasilo thvasilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the implementation looks solid. Let's work on some changes before merging.

Copy link
Contributor

@thvasilo thvasilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small changes requested.

Copy link
Contributor

@thvasilo thvasilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few changes left.

Let's ensure the following:

  • run black on the codebase to ensure formatting is correct.
  • Increment the version in pyproject.toml to 0.2.2.
  • Ensure both images build and can run a text transformation using the new build argument on both EMR-S and SM.

Copy link
Contributor

@thvasilo thvasilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, we should run some benchmarks to see if we have any bottlenecks with the implementation.

@jalencato
Copy link
Collaborator Author

LGTM, we should run some benchmarks to see if we have any bottlenecks with the implementation.

Update our result on the mag with the tokenized feature.

@jalencato jalencato merged commit c29bf54 into awslabs:main Feb 1, 2024
3 checks passed
@jalencato jalencato deleted the bert_tokenzier branch September 4, 2024 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.2.2 gsprocessing For issues and PRs related the the GSProcessing library ready able to trigger the CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants