Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: switch over to dask-based processing idioms, improve dataset handling #882

Merged
merged 119 commits into from
Dec 12, 2023
Merged
Show file tree
Hide file tree
Changes from 43 commits
Commits
Show all changes
119 commits
Select commit Hold shift + click to select a range
0e24cb7
some first movement on dask-task-graph style runner/executor
lgray Jun 12, 2023
dcb6f74
changes for preprocessing prototype
lgray Jun 27, 2023
a03beb9
new dask-based dataset pre-processor
lgray Aug 22, 2023
95c5fc8
Added the rucio utils functions from pocketcoffea
valsdav Aug 23, 2023
1f870c2
Added dataset querying function
valsdav Aug 23, 2023
c5d2d57
Working on interface for datasets query
valsdav Aug 28, 2023
d4d371c
Querying and listing implemented: selected of results
valsdav Aug 28, 2023
e1f11cf
Printing sites availability for replicas
valsdav Aug 28, 2023
1e191bf
Added replica site selection
valsdav Aug 28, 2023
02cbbbe
Added saving
valsdav Aug 28, 2023
daf5e52
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 28, 2023
c9101bb
Formatting and flake8
valsdav Aug 28, 2023
0c11976
Merge branch 'local_executors_to_dask' of github.com:valsdav/coffea i…
valsdav Aug 28, 2023
b78f640
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 28, 2023
21cd240
Fixed comments spelling
valsdav Aug 28, 2023
dbed7d4
py 3.9 for cirrus
lgray Aug 28, 2023
f8653eb
Merge pull request #883 from valsdav/local_executors_to_dask
lgray Aug 28, 2023
153c0ae
Switched to rucio-clients
valsdav Aug 30, 2023
9fb3026
Merge branch 'local_executors_to_dask' of github.com:valsdav/coffea i…
valsdav Aug 30, 2023
5a45e4a
Added some docs to the cli
valsdav Aug 30, 2023
acb9372
Merge pull request #884 from valsdav/local_executors_to_dask
lgray Aug 30, 2023
00fb57c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 30, 2023
6d5d800
roll back test to py3.8
lgray Aug 30, 2023
e0f1726
math.ceil instead of integer division
lgray Aug 30, 2023
885bbf0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 30, 2023
3283256
Forgot to remove the slash.
lgray Aug 30, 2023
2e3a16b
Merge branch 'master' into local_executors_to_dask
lgray Sep 6, 2023
8983524
Merge branch 'master' into local_executors_to_dask
lgray Oct 17, 2023
4c8d917
make rucio-clients an extra
lgray Oct 17, 2023
cf55bf8
Merge branch 'master' into local_executors_to_dask
lgray Oct 18, 2023
35656c6
Merge branch 'master' into local_executors_to_dask
lgray Oct 20, 2023
5793cb8
Merge branch 'master' into local_executors_to_dask
lgray Nov 6, 2023
84135da
Merge branch 'master' into local_executors_to_dask
lgray Nov 9, 2023
9c598bb
two wrappers to apply processor wrapped code to datasets
lgray Nov 10, 2023
2069786
let preprocess deal with missing files in a configurable way, add in …
lgray Nov 10, 2023
e3959f4
remove old tests and switch to new tools in tests
lgray Nov 21, 2023
0443724
Merge branch 'master' into local_executors_to_dask
lgray Nov 21, 2023
707195e
disable workqueue tests (to be replaced with taskvine)
lgray Nov 21, 2023
2c65b36
patch up ci
lgray Nov 21, 2023
cc22be8
more testwq removal
lgray Nov 21, 2023
dfed4b9
adjustments to removing executor / lazydataframe
lgray Nov 21, 2023
a661987
accumulator tests
lgray Nov 22, 2023
080f2af
also allow Callables in apply_to_fileset
lgray Nov 22, 2023
d91e6b6
Merge branch 'master' into local_executors_to_dask
lgray Nov 27, 2023
4a60053
add dataset tools test
lgray Nov 27, 2023
98a719c
Make scope of dataset_query less cms-only. Edits language
valsdav Nov 28, 2023
8f669a0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 28, 2023
5ec1068
Merge branch 'master' into local_executors_to_dask
lgray Nov 28, 2023
5e127c2
remove accumulator concept
lgray Nov 28, 2023
1de78e6
typing for apply_to_dataset/fileset, use setdefault
lgray Nov 28, 2023
5ee0b20
typing for preprocess
lgray Nov 28, 2023
009fdd0
flake8
lgray Nov 28, 2023
e00e691
fix up typing
lgray Nov 28, 2023
7644fe7
more typing for apply_processor
lgray Nov 28, 2023
9fee7d3
being pedantic about types
lgray Nov 29, 2023
9ba1442
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 29, 2023
1055e8b
Merge branch 'master' into local_executors_to_dask
lgray Nov 29, 2023
8625484
Merge branch 'master' into local_executors_to_dask
lgray Nov 30, 2023
54b27b6
Merge branch 'master' into local_executors_to_dask
lgray Nov 30, 2023
734507c
Merge branch 'master' into local_executors_to_dask
lgray Dec 2, 2023
1dc29db
Merge branch 'master' into local_executors_to_dask
lgray Dec 2, 2023
3ca9e8f
Merge branch 'master' into local_executors_to_dask
lgray Dec 4, 2023
059061a
taskvine test was using old location of NanoAODSchema
lgray Dec 4, 2023
3385bdb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 4, 2023
0e2baf1
lint: no longer need to import processor
lgray Dec 4, 2023
0812318
Merge branch 'master' into local_executors_to_dask
lgray Dec 6, 2023
21f72d9
Merge branch 'master' into local_executors_to_dask
lgray Dec 6, 2023
d0b5957
Merge remote-tracking branch 'origin/local_executors_to_dask' into lo…
valsdav Dec 6, 2023
c6baa74
Merge branch 'local_executors_to_dask' of github.com:valsdav/coffea i…
valsdav Dec 6, 2023
59ef352
Getting rucio client from config in environmental variable
valsdav Dec 6, 2023
fcefe19
Merge branch 'master' into local_executors_to_dask
lgray Dec 6, 2023
25df8b1
Added preprocess command to cli
valsdav Dec 6, 2023
c8a87b3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 6, 2023
31e74eb
save json as gzipped, add some options
lgray Dec 6, 2023
0487671
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 6, 2023
9c970e4
flake: drop getpass since it is not used
lgray Dec 6, 2023
ef91531
Merge branch 'master' into local_executors_to_dask
lgray Dec 7, 2023
6ff199a
Merge branch 'local_executors_to_dask' of github.com:CoffeaTeam/coffe…
valsdav Dec 7, 2023
533efea
Merge remote-tracking branch 'fork/local_executors_to_dask' into loca…
valsdav Dec 7, 2023
681782d
add failed-tail processing for uproot reports
lgray Dec 7, 2023
3ca2d96
add failed-tail stuff to __all__ of dataset_tools
lgray Dec 7, 2023
5d1ed9d
Merge branch 'master' into local_executors_to_dask
lgray Dec 7, 2023
3bd760a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 7, 2023
a29e910
lint
lgray Dec 7, 2023
13921ab
typo
lgray Dec 7, 2023
8e4424d
typo, and fileset entrypoint needs dict of reports.
lgray Dec 8, 2023
fe7984f
adapt apply_processor to possibility of reports
lgray Dec 8, 2023
87332b8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 8, 2023
f846c70
fix bugs
lgray Dec 8, 2023
40c0649
Merge branch 'local_executors_to_dask' of github.com:CoffeaTeam/coffe…
valsdav Dec 8, 2023
6e5cadd
Added processing of multiple datasets to get replicas in CLI
valsdav Dec 8, 2023
f04b60f
Added Select all options to cli
valsdav Dec 8, 2023
42af84b
lint
lgray Dec 8, 2023
4ba8c0a
Moved from cmd2 to pure rich interface for the CLI
valsdav Dec 11, 2023
d116351
Merge branch 'local_executors_to_dask' of github.com:valsdav/coffea i…
valsdav Dec 11, 2023
dd21cbc
Updated help message
valsdav Dec 11, 2023
c079634
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 11, 2023
f09b64f
linting
valsdav Dec 11, 2023
126d42d
Merge branch 'local_executors_to_dask' of github.com:valsdav/coffea i…
valsdav Dec 11, 2023
2049913
typo
lgray Dec 11, 2023
5240110
Adding non-cli interaction from datacard
valsdav Dec 11, 2023
e64e3c0
Merge branch 'local_executors_to_dask' of github.com:valsdav/coffea i…
valsdav Dec 11, 2023
d826030
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 11, 2023
dd31a2a
better defaults and typos
valsdav Dec 11, 2023
66ffcd7
Merge branch 'master' into local_executors_to_dask
lgray Dec 11, 2023
c551b93
Merge remote-tracking branch 'origin/local_executors_to_dask' into lo…
valsdav Dec 11, 2023
6986a10
Adding docs and dataset_discovery notebook
valsdav Dec 11, 2023
95b9949
typo
lgray Dec 11, 2023
a166a81
more docs in the notebook
valsdav Dec 11, 2023
66f9120
Merge branch 'local_executors_to_dask' of github.com:valsdav/coffea i…
valsdav Dec 11, 2023
a10a6e7
Merge branch 'master' into local_executors_to_dask
lgray Dec 12, 2023
0bcda6c
Merge branch 'master' into local_executors_to_dask
lgray Dec 12, 2023
0aa91ed
Merge branch 'local_executors_to_dask' of github.com:CoffeaTeam/coffe…
lgray Dec 12, 2023
2457e08
Merge pull request #940 from valsdav/local_executors_to_dask
lgray Dec 12, 2023
02001b3
actually pass uproot_options down
lgray Dec 12, 2023
ba88fe4
fix get_failed_steps_for_fileset/dataset
lgray Dec 12, 2023
6e22866
properly check each input file for steps
lgray Dec 12, 2023
6d0722c
adjust uproot pins, add test
lgray Dec 12, 2023
994ed4a
typing and docs
lgray Dec 12, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
58 changes: 29 additions & 29 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -126,33 +126,33 @@ jobs:
env:
GH_PAT: ${{ secrets.GITHUB_OAUTH }}

testwq:
runs-on: ubuntu-latest
needs: pre-commit
strategy:
matrix:
python-version: ["3.11"]
name: test coffea-workqueue

steps:
- uses: actions/checkout@v4
- name: Set up Conda
uses: conda-incubator/setup-miniconda@v2
env:
ACTIONS_ALLOW_UNSECURE_COMMANDS: true
with:
auto-update-conda: true
python-version: ${{ matrix.python-version }}
channels: conda-forge
- name: Test work_queue
shell: bash -l {0}
run: |
conda create --yes --name coffea-env -c conda-forge python=${{ matrix.python-version }} ndcctools dill conda-pack conda
conda activate coffea-env
python -m pip install --ignore-installed .
cd tests
conda-pack --output coffea-env.tar.gz
python wq.py coffea-env.tar.gz
# testwq:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much reason to test workqueue if taskvine tests that it can eat dask graphs.
Perhapse we might want to keep some sort of smoke test for various dask backends just to ensure they play well?

Copy link
Collaborator Author

@lgray lgray Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been clear in the past that it's nice to have coffea do its own full-stack tests for things like this.

I don't think we've achieved the level of parity where tests on dask distributed completely confirm they will work on taskvine or the other way around.

@btovar do you think that's achievable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely. I can make a pr. Further, since tasvine is orthogonal to coffea+dask, should this more in something like an examples directory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@btovar a test that also serves as a worked-out example would be great! I don't think it's so orthogonal since, for instance, dask and distributed have deeper ties than dask and taskvine (so far as I know right now), but I still test that distributed works in coffea tests for the essential ingredients.

# runs-on: ubuntu-latest
# needs: pre-commit
# strategy:
# matrix:
# python-version: ["3.11"]
# name: test coffea-workqueue
#
# steps:
# - uses: actions/checkout@v4
# - name: Set up Conda
# uses: conda-incubator/setup-miniconda@v2
# env:
# ACTIONS_ALLOW_UNSECURE_COMMANDS: true
# with:
# auto-update-conda: true
# python-version: ${{ matrix.python-version }}
# channels: conda-forge
# - name: Test work_queue
# shell: bash -l {0}
# run: |
# conda create --yes --name coffea-env -c conda-forge python=${{ matrix.python-version }} ndcctools dill conda-pack conda
# conda activate coffea-env
# python -m pip install --ignore-installed .
# cd tests
# conda-pack --output coffea-env.tar.gz
# python wq.py coffea-env.tar.gz

# testskyhookjob:
# runs-on: ubuntu-latest
Expand All @@ -178,7 +178,7 @@ jobs:
release:
if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
runs-on: ubuntu-latest
needs: [test, testwq]
needs: [test]
strategy:
matrix:
python-version: ["3.11"]
Expand All @@ -201,7 +201,7 @@ jobs:
password: ${{ secrets.PYPI_TOKEN }}

pass:
needs: [test, testwq]
needs: [test]
runs-on: ubuntu-latest
steps:
- run: echo "All jobs passed"
5 changes: 5 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,11 @@ servicex = [
"servicex>=2.5.3",
"func-adl_servicex",
]
rucio = [
"rucio-clients>=32;python_version>'3.8'",
"rucio-clients<32;python_version<'3.9'",
"cmd2",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check, is some of the interactive console stuff available via rich (already a dependency) or do we need cmd2?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switched to full rich cli, removed cmd2 dependency

]
dev = [
"pre-commit",
"flake8",
Expand Down
11 changes: 11 additions & 0 deletions src/coffea/dataset_tools/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
from coffea.dataset_tools.apply_processor import apply_to_fileset, apply_to_one_dataset
from coffea.dataset_tools.manipulations import max_chunks, slice_chunks
from coffea.dataset_tools.preprocess import preprocess

__all__ = [
"preprocess",
"apply_to_one_dataset",
lgray marked this conversation as resolved.
Show resolved Hide resolved
"apply_to_fileset",
"max_chunks",
"slice_chunks",
]
38 changes: 38 additions & 0 deletions src/coffea/dataset_tools/apply_processor.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
import copy
from typing import Callable, Union

from coffea.nanoevents import NanoAODSchema, NanoEventsFactory
from coffea.processor import ProcessorABC


def apply_to_one_dataset(
data_manipulation: Union[ProcessorABC, Callable],
lgray marked this conversation as resolved.
Show resolved Hide resolved
dataset,
schemaclass=NanoAODSchema,
metadata={},
):
files = dataset["files"]
events = NanoEventsFactory.from_root(
files,
metadata=metadata,
schemaclass=schemaclass,
).events()
if isinstance(data_manipulation, ProcessorABC):
return data_manipulation.process(events)
elif isinstance(data_manipulation, Callable):
return data_manipulation(events)
else:
raise ValueError("data_manipulation must either be a ProcessorABC or Callable")


def apply_to_fileset(
data_manipulation: Union[ProcessorABC, Callable], fileset, schemaclass=NanoAODSchema
):
out = {}
for name, dataset in fileset.items():
metadata = copy.deepcopy(dataset.get("metadata", {}))
metadata["dataset"] = name
lgray marked this conversation as resolved.
Show resolved Hide resolved
out[name] = apply_to_one_dataset(
data_manipulation, dataset, schemaclass, metadata
)
return out
Loading
Loading