Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix IDAKLU solver crash when running in multiple threads + when telemetry is prompted for #4583

Open
wants to merge 21 commits into
base: develop
Choose a base branch
from

Conversation

agriyakhetarpal
Copy link
Member

@agriyakhetarpal agriyakhetarpal commented Nov 13, 2024

Description

This came up as an unrelated issue in #4582 when testing the wheels and apparently stems from #4441 where a thread was being used to check for the input. This caused the Linux wheel tests to fail: https://github.com/agriyakhetarpal/PyBaMM/actions/runs/11805727556/job/32890103930. This PR redoes that with a multi-platform approach and avoids the use of threads altogether by checking for different environments: Windows via msvcrt and Linux/macOS via termios.

A possible corner case for Jupyter Notebooks (running in VS Code versus Jupyter Notebook/JupyterLab) has also been fixed. I've resorted to a more rudimentary fix using the previous method for now, since this fix was erroneous and did not work intermittently.

Type of change

Please add a line in the relevant section of CHANGELOG.md to document the change (include PR #) - note reverse order of PR #s. If necessary, also add to the list of breaking changes.

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist:

  • No style issues: $ pre-commit run (or $ nox -s pre-commit) (see CONTRIBUTING.md for how to set this up to run automatically when committing locally, in just two lines of code)
  • All tests pass: $ python run-tests.py --all (or $ nox -s tests)
  • The documentation builds: $ python run-tests.py --doctest (or $ nox -s doctests)

You can run integration tests, unit tests, and doctests together at once, using $ python run-tests.py --quick (or $ nox -s quick).

Further checks:

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

noxfile.py Outdated Show resolved Hide resolved
@agriyakhetarpal
Copy link
Member Author

agriyakhetarpal commented Nov 13, 2024

This approach might seem overkill but this is what came to mind first and seems like the most robust way to me, so I'd like to test this a bit more – I've converted this to a draft for now. I triggered a wheel build here, let's see if it passes: https://github.com/agriyakhetarpal/PyBaMM/actions/runs/11808256997

I've confirmed the fix locally with a fresh PyBaMM installation (i.e., with the pybamm/config.yml file not present in my Application Support folder) every time before importing PyBaMM:

  • inside a terminal
  • in an IPython shell
  • in a notebook running in JupyterLab
  • in a notebook running in VS Code

across the following scenarios:

  • telemetry disabled
  • telemetry enabled (which should be the same thing)
  • a tim out (no input was given)

Copy link

codecov bot commented Nov 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.26%. Comparing base (4223be8) to head (5495814).
Report is 1 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4583      +/-   ##
===========================================
- Coverage    99.26%   99.26%   -0.01%     
===========================================
  Files          302      302              
  Lines        22889    22866      -23     
===========================================
- Hits         22721    22698      -23     
  Misses         168      168              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@agriyakhetarpal
Copy link
Member Author

The buttons aren't working in the notebooks, actually – I'm quite sure they were working when I was testing earlier, so I guess I made some additional changes which made them break. I'll debug later in the day, and I'll consider dropping the buttons in favour of a simple input in case I don't get them to work.

@agriyakhetarpal
Copy link
Member Author

I switched to a more basic implementation for Jupyter notebooks, and I've seen bug reports about how inputs don't work properly with JupyterLab <4, but I don't think (too) many people are using version 3 now, so we should be good.

@agriyakhetarpal agriyakhetarpal changed the title Fix IDAKLU solver crash when running in multiple threads + telemetry is prompted for Fix IDAKLU solver crash when running in multiple threads + when telemetry is prompted for Nov 13, 2024
@agriyakhetarpal
Copy link
Member Author

agriyakhetarpal commented Nov 13, 2024

Additionally, I tested on Google Colab by installing PyBaMM in editable mode from my branch – we should be ready to go with this. Suggestions on cleaning up the code are welcome. The only place I haven't been able to test is Windows, because I don't have a Windows machine.

@agriyakhetarpal agriyakhetarpal marked this pull request as ready for review November 13, 2024 14:49
@agriyakhetarpal
Copy link
Member Author

@kratman
Copy link
Contributor

kratman commented Nov 13, 2024

Ok I will take a close look at this in an hour or so

src/pybamm/config.py Outdated Show resolved Hide resolved
Comment on lines 81 to 91
if is_notebook(): # pragma: no cover
try:
from IPython.display import clear_output

user_input = input("Do you want to enable telemetry? (Y/n): ")
clear_output()

return user_input

except Exception: # pragma: no cover
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a major issue, but this has no timeout so people notebooks will just hang if they don't realize this popped up

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is intentional. I opine that notebooks are meant to be used interactively by default – I expect this to be a one-time issue, since importing PyBaMM and choosing "yes" or "no" once would mean that the config gets saved and this never comes up again (until the config file gets deleted, of course).

Comment on lines +134 to +135
while time.time() - start_time < timeout:
rlist, _, _ = select.select([sys.stdin], [], [], 0.1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are already using select, do we need the loop? The timeout isn't very long

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I checked and this loop ended up being required – since while normal input() in Python blocks indefinitely, I found that there's no (built-in) way to timeout if we are in raw terminal mode (set above in tty.setraw()). So we need to manually handle everything, which is to read the "y" or "n" or empty characters and flush them to the screen.

select lets us say "wake me up when, either":

  • there's input to read, or
  • one-tenth of a second has passed

which the loop apparently allows us to do. But if you have something else in mind, I'm happy to try that.

import select

# Save terminal settings for later
old_settings = termios.tcgetattr(sys.stdin)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems excessive that we need to change the terminal settings for this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this allows you to do is that you don't get new indentation when you write to the terminal again. I wrapped this in a try-finally block based on the Python docs: https://docs.python.org/3/library/termios.html#example.

So, by saving the attributes, I'm trying to disable the fact that sys.stdout.write("\n") will not return the cursor to the leftmost position (because we are in raw mode and we need to do that manually instead – for which I want to avoid).

We end up with this, otherwise:

❯ python -c "import pybamm; print(pybamm.IDAKLUSolver())"
PyBaMM can collect usage data and send it to the PyBaMM team to help us improve the software.
We do not collect any sensitive information such as models, parameters, or simulation results - only information on which parts of the code are being used and how frequently.
This is entirely optional and does not impact the functionality of PyBaMM.
For more information, see https://docs.pybamm.org/en/latest/source/user_guide/index.html#telemetry
Do you want to enable telemetry? (Y/n): n

                                         Telemetry disabled.
                                                            <pybamm.solvers.idaklu_solver.IDAKLUSolver object at 0x102de3ad0>
                                                                                                                             %

and this would break one's terminal until one restarts it – a more robust solution could be to use TCSAFLUSH instead of TCSADRAIN, I'll switch to that.

src/pybamm/config.py Outdated Show resolved Hide resolved
Comment on lines 196 to 201
else:
print("Invalid input. Please enter 'Y/y' for yes or 'n/N' for no:")
user_input = get_input_or_timeout(timeout)
if user_input is None:
print("\nTimeout reached. Defaulting to not enabling telemetry.")
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I brought it up before, but if we just check for the yes options, we can skip having an infinite loop

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a person presses "Enter" in haste, that enables telemetry for them (because of your suggestion that I applied above), which might not be what they want – I'm trying to reduce the chance of that happening

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, should I keep it as "yes" if Enter is pressed or "no"? I feel the latter is the better option

src/pybamm/config.py Outdated Show resolved Hide resolved
src/pybamm/config.py Outdated Show resolved Hide resolved
src/pybamm/config.py Outdated Show resolved Hide resolved
Co-authored-by: Eric G. Kratz <[email protected]>
src/pybamm/config.py Outdated Show resolved Hide resolved
@agriyakhetarpal
Copy link
Member Author

Continuing from my question above: though I pushed e4859dc which fixes the failing test, it also changes the behaviour on pressing "Enter" to enable telemetry – which, to me, does not seem like good UX because it can cause an inadvertent "yes" from a user. It might be better to be explicit rather than implicit: either allow only a "yes" or a "no" in the response and ask them to enter their response again if it's empty, or let "Enter" disable telemetry rather than enabling it). I understand that it would be good for us to get a reasonable amount of telemetry data from as many users as we can, but we should also be conscientious with our approach. @valentinsulzer, do you have any thoughts? I might have missed this when reviewing #4441. That said, if prompts in other applications do regard pressing Enter/an empty input as a "yes", then this approach is completely fine with me.

@agriyakhetarpal agriyakhetarpal linked an issue Nov 15, 2024 that may be closed by this pull request
@kratman
Copy link
Contributor

kratman commented Nov 19, 2024

Crashes should be resolved. I could not reproduce any of the notebook issues

@agriyakhetarpal
Copy link
Member Author

I'd say that this PR could still be useful because it avoids using threading altogether for telemetry. Could you please re-review it? I was under the assumption that #4591 would go in after this PR went in.

@kratman
Copy link
Contributor

kratman commented Nov 20, 2024

@agriyakhetarpal I built the wheel with develop and did some testing on Windows.

A few things I noticed:

  1. The tests in CI and wheel build all pass: Build wheel 24.11.0 kratman/PyBaMM#5
  2. Some tests fail if I try to run the tests locally with the wheel, but the same tests fail when I repeat the process for the v24.9.0 release. The failing tests all appear to be from Jax
  3. Forcing posthog to be active during tests does show posthog in some of the tracebacks, but they appear to be some of the same tests that I saw failing in (2).
  4. When I forced posthog to run, I also disabled the pybamm.config.generate() function, so any remaining problems are likely in the internals of posthog, not from the use of threading while generating the config.
  5. The timeout for creating the config is likely too short. With the slow import of a fresh PyBaMM install, it is not hard to miss the window for a 10 second opt-in.

The fact that tests are failing locally on windows bothers me, but it looks like some of this was already present and not caught by all of our tests. It is also possible that the ~45 failing Jax tests are due to issues with my Windows setup since the tests pass in CI. I won't know for sure until I do additional testing for the release.

I will continue investigating the failures on windows, but I don't think removing the threading or adding extra logic to text output is needed at the moment. The changes to the installation guide, noxfile.py, and the pyproject.toml look good.

Next steps:

  • Try to trigger crashes in notebooks and scripts instead of tests. This will rule out any threading issues related to pytest
  • Compare my Windows dev environment to the CI dev environment to see if the failing Jax tests are from a solvable issue.

@agriyakhetarpal
Copy link
Member Author

Just a quick comment for now; I'll get back to the rest in a while: @kratman, if you can't reproduce the Windows crashes in CI, could you also check how old the machine that you are using locally is? Some JAX operations might crash or simply not work on very old machines because of the lack of AVX512 instructions, so maybe this is indeed just something that is for your setup and should be fine for slightly newer Windows computers.

@kratman
Copy link
Contributor

kratman commented Nov 20, 2024

Some JAX operations might crash or simply not work on very old machines because of the lack of AVX512 instructions, so maybe this is indeed just something that is for your setup and should be fine for slightly newer Windows computers.

This is not a particularly old machine. It is an i5-1235U (2022), which is the Alder Lake group of chips that do not have AVX-512, it does have AVX and AVX2 though

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fortnightly build for wheels failed
2 participants