Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add options to prep for how vak.io.audio.to_spect calls dask.bag #611

Merged
merged 7 commits into from
Jan 24, 2023

Conversation

NickleDave
Copy link
Collaborator

Fixes #580

@NickleDave
Copy link
Collaborator Author

@kalleknast this is still a work in progress, just tagging you to let you know I at least made it to the stage of draft pull request.

Need to fix test failures but overall this is what the functionality will look like: adding a new option to the [PREP] section of the .toml file, audio_dask_bag_kwargs that lets you specify keyword arguments, like audio_dask_bag_kwargs = { npartitions = 20}.

I really appreciate the work you did testing whether we could figure this out for a user with multiprocessing etc. but I'd prefer to solve this way for now, esp because I expect a lot of this code to be factored out as in #558 so that someone could prep a dataset "by hand" if the defaults don't work for them

Will tag you again when I'm ready for you to test whether this solves your issue on your machine, thank you!

@NickleDave NickleDave force-pushed the add-dask-bag-options-prep branch from 17de60b to 0e3ab1f Compare January 14, 2023 22:04
@NickleDave NickleDave force-pushed the add-dask-bag-options-prep branch from 0e3ab1f to aea6fab Compare January 14, 2023 22:14
@NickleDave
Copy link
Collaborator Author

Hi @kalleknast,

I tried running this locally on the sample data you provided, but I am getting an error that I'll paste below.

I'll also attach the .toml config file I'm using.

I tried with { npartitions = 20 } and { npartitions = 15} but in both cases I get the same error you reported before (first comment in #580).

Could you please test and let me know if the fix works for you? I think I have basically implemented it as you suggested originally, so I'm wondering if the error is machine specific. I can also test on a machine with more CPUs / RAM in the next couple of days.

config:
barbtwin1-train.zip

Error is:

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

as in #580

Full traceback:

Traceback (most recent call last):
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/bin/vak", line 8, in <module>
    sys.exit(main())
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/vak/__main__.py", line 48, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/vak/cli/cli.py", line 49, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/vak/cli/cli.py", line 23, in prep
    prep(toml_path=toml_path)
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/vak/cli/prep.py", line 133, in prep
    vak_df, csv_path = core.prep(
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/vak/core/prep.py", line 227, in prep
    vak_df = dataframe.from_files(
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/vak/io/dataframe.py", line 143, in from_files
    spect_files = audio.to_spect(
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/vak/io/audio.py", line 239, in to_spect
    spect_files = list(bag.map(_spect_file))
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/dask/bag/core.py", line 1480, in __iter__
    return iter(self.compute())
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/dask/base.py", line 314, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/dask/base.py", line 599, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/dask/multiprocessing.py", line 233, in get
    result = get_async(
  File "/home/pimienta/Documents/data/vocal/Hjalmar-Turesson/.venv/lib/python3.9/site-packages/dask/local.py", line 500, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "/home/pimienta/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/pimienta/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

@codecov-commenter
Copy link

codecov-commenter commented Jan 14, 2023

Codecov Report

Base: 93.21% // Head: 93.21% // No change to project coverage 👍

Coverage data is based on head (1980ca0) compared to base (1980ca0).
Patch has no changes to coverable lines.

❗ Current head 1980ca0 differs from pull request most recent head aea6fab. Consider uploading reports for the commit aea6fab to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #611   +/-   ##
=======================================
  Coverage   93.21%   93.21%           
=======================================
  Files          66       66           
  Lines        2490     2490           
=======================================
  Hits         2321     2321           
  Misses        169      169           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@kalleknast
Copy link

It worked fine (although when I try to re-run it, I keep getting a toml.decoder.TomlDecodeError).
I got 64 GB of memory and 24 CPUs.

@NickleDave NickleDave marked this pull request as ready for review January 24, 2023 03:36
@NickleDave
Copy link
Collaborator Author

Great, thank you @kalleknast for testing.
I will go ahead and merge to get this in main.

I have another feature branch in progress but once I merge that I can do a release so it's easier to install.

Appreciate your input and help testing.

@NickleDave NickleDave merged commit 25a0c8a into main Jan 24, 2023
@NickleDave
Copy link
Collaborator Author

@all-contributors please add @kalleknast for bugs ideas

@allcontributors
Copy link
Contributor

@NickleDave

I've put up a pull request to add @kalleknast! 🎉

@NickleDave NickleDave deleted the add-dask-bag-options-prep branch February 12, 2023 02:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Optimize dask.bag for file sizes + add dask.bag options to config
3 participants