Remove pyarrow as a direct dependency #2228

Merged: 4 commits into flyteorg:master on Jul 29, 2024

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Mar 1, 2024

Tracking issue

Continues flyteorg/flyte#4418

Why are the changes needed?

According to flyteorg/flyte#4418 (comment), pyarrow is the largest dependency. This PR removes it as a direct dependency and lazy loads it instead.

What changes were proposed in this pull request?

With this PR, pyarrow is now lazy loaded. The lazy loading mechanism is the same as the one used for pandas.

How was this patch tested?

In two of the test environments, pyarrow is removed to make sure flytekit works without pyarrow installed.
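
The PR itself removes pyarrow from the test environments. When changing the environment is inconvenient, a missing package can also be simulated inside a single test. The sketch below is an alternative technique, not what this PR does, and the helper name `fails_when_hidden` is hypothetical. It relies on the documented behavior that a `None` entry in `sys.modules` makes the import system raise `ImportError` for that name.

```python
import importlib
import sys
from unittest import mock


def fails_when_hidden(name: str) -> bool:
    """Return True if importing `name` raises ImportError while it is hidden."""
    # patch.dict restores sys.modules to its previous state on exit.
    with mock.patch.dict(sys.modules, {name: None}):
        try:
            importlib.import_module(name)
        except ImportError:
            return True
    return False


# Usage: a stdlib module stands in for pyarrow here.
print(fails_when_hidden("json"))
```

Note that this only hides the name from new imports; modules that already hold a reference to the real package keep working, so a dedicated environment without the package remains the stronger check.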

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 1, 2024

codecov bot commented Mar 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.32%. Comparing base (4be4e33) to head (99f7922).

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2228       +/-   ##
===========================================
+ Coverage   76.16%   91.32%   +15.15%     
===========================================
  Files         243      144       -99     
  Lines       21282     6651    -14631     
  Branches     3915        0     -3915     
===========================================
- Hits        16210     6074    -10136     
+ Misses       4427      577     -3850     
+ Partials      645        0      -645     


@thomasjpfan
Member Author

I am marking this as a draft. Removing pyarrow also removes the indirect dependency on numpy (pyarrow was the only dependency that depended on numpy).

I need to make sure flytekit works without numpy installed as well.

@thomasjpfan thomasjpfan marked this pull request as draft March 1, 2024 17:53
@thomasjpfan thomasjpfan marked this pull request as ready for review July 6, 2024 19:25
@thomasjpfan
Member Author

thomasjpfan commented Jul 9, 2024

@pingsutw This PR removes one of the biggest dependencies. On Linux amd64, uncompressed pyarrow is 140 MB.

The downside is that pandas_df.to_parquet, which is used by StructuredDataset, will not work out of the box.
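
To illustrate the downside: with its default `engine="auto"`, pandas' `to_parquet` tries pyarrow first, then fastparquet, and raises `ImportError` if neither is installed. A rough sketch of checking for an engine up front is shown below. The helper names are hypothetical and this check is not part of the PR.

```python
import importlib.util
from typing import Optional


def find_parquet_engine() -> Optional[str]:
    """Return the first importable Parquet engine name, or None."""
    # Same order pandas uses for engine="auto".
    for name in ("pyarrow", "fastparquet"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def require_parquet_engine() -> str:
    """Raise a clear error before attempting to write Parquet."""
    engine = find_parquet_engine()
    if engine is None:
        raise ImportError(
            "Writing Parquet needs 'pyarrow' or 'fastparquet'; "
            "install one, for example with `pip install pyarrow`."
        )
    return engine


# Usage: prints the engine name, or None when no engine is installed.
print(find_parquet_engine())
```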

@pingsutw
Member

> The downside is that pandas_df.to_parquet, which is used by StructuredDataset, will not work out of the box.

I just tested it and the error message is clear, so I think it's fine.

```python
import pandas as pd

from flytekit import ImageSpec, StructuredDataset, task, workflow

df = pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [1, 22]})

# flytekit is installed from the PR branch; pyarrow is not listed in `packages`.
new_flytekit = "git+https://github.com/thomasjpfan/flytekit.git@603d3996bc4ed6907ba6c0296ae395e80f8e1dfc"
image_spec = ImageSpec(
    base_image="python:3.10-slim-bookworm",
    registry="pingsutw",
    packages=[new_flytekit, "pandas"],
    apt_packages=["git"],
)


# t1 runs in the image built without pyarrow.
@task(container_image=image_spec)
def t1(sd: StructuredDataset) -> StructuredDataset:
    print(sd.open(pd.DataFrame).all())
    return sd


# t2 runs in the same image with pyarrow added.
@task(container_image=image_spec.with_packages("pyarrow"))
def t2(sd: StructuredDataset) -> StructuredDataset:
    print(sd.open(pd.DataFrame).all())
    return sd


@workflow
def wf():
    t1(sd=StructuredDataset(df))
    t2(sd=StructuredDataset(df))
```

[Image: Screenshot 2024-07-12 at 3 55 36 PM]

@pingsutw
Member

@eapolinario @wild-endeavor wdyt

@pingsutw pingsutw merged commit 11faf39 into flyteorg:master Jul 29, 2024
46 checks passed
mao3267 pushed a commit to mao3267/flytekit that referenced this pull request Aug 1, 2024
mao3267 pushed a commit to mao3267/flytekit that referenced this pull request Aug 2, 2024
Future-Outlier added a commit that referenced this pull request Aug 26, 2024
…class] (#2603)
