
Reenable building kineto, add CUPTI dep #305

Merged: 19 commits into conda-forge:main on Jan 14, 2025
Conversation

@mgorny (Contributor) commented Dec 25, 2024

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Fixes #76

Let's try enabling new dependencies separately, to see which one caused CI problems.

@conda-forge-admin (Contributor) commented Dec 25, 2024

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.
  • ℹ️ The recipe is not parsable by parser conda-recipe-manager. The recipe can only be automatically migrated to the new v1 format if it is parseable by conda-recipe-manager.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/12689415146. Examine the logs at this URL for more detail.

@mgorny (Contributor, Author) commented Dec 25, 2024

Hmm, shouldn't my pull request start CI jobs now, or did I misunderstand the purpose of requesting open-gpu-server access?

@hmaarrfk (Contributor)

It definitely gets overloaded...

@mgorny (Contributor, Author) commented Dec 26, 2024

Could you launch the CI for me?

@Tobias-Fischer (Contributor)

@mgorny - just launched CI. We should skip most builds to save resources while we're still testing, but I'm assuming the builds are likely to pass?

@mgorny (Contributor, Author) commented Dec 26, 2024

Kinda: one of the three packages has been causing issues in #298, so I'm trying to find out which one it was. Feel free to cancel the osx and aarch64 builds; it happened on linux-64.

@h-vetinari (Member)

In fact, it only happened for generic blas + CUDA, which is IMO the only job that we'd need to be running here.

@h-vetinari (Member)

Hmm, shouldn't my pull request start CI jobs now, or did I misunderstand the purpose of requesting open-gpu-server access?

There's a second layer of access control in https://github.com/conda-forge/.cirun/blob/master/.access.yml, but you should be pulled in through the reference to the conda-forge-users.json from the open-gpu-server. Perhaps that .cirun thing needs to be refreshed though? 🤔

@mgorny (Contributor, Author) commented Dec 27, 2024

In fact, it only happened for generic blas + CUDA, which is IMO the only job that we'd need to be running here.

Except that this time mkl + cpu failed/crashed :-/. Though I'm not sure if it's really kineto-related or a fluke.

@h-vetinari (Member) commented Dec 27, 2024

Check out this run (for 97cb097) - all the others passed.

@mgorny (Contributor, Author) commented Dec 27, 2024

Well, okay, then it's a pre-existing fluke. Unfortunate, but irrelevant. Let's try the next package.

@h-vetinari (Member)

Huh? We haven't seen yet whether CI passes with cupti. The single test failure on a non-CUDA job was irrelevant.

@mgorny (Contributor, Author) commented Dec 27, 2024

Ah, sorry. I've gotten confused by the jobs being cancelled.

@h-vetinari (Member)

Ah, sorry. I've gotten confused by the jobs being cancelled.

That was me manually pruning the list so that we only run the single job that's relevant.

@h-vetinari (Member)

@mgorny: Except that this time mkl + cpu failed/crashed :-/. Though I'm not sure if it's really kineto-related or a fluke.

@h-vetinari: The single test failure on a non-CUDA job was irrelevant.

OK, perhaps I spoke too soon - I hadn't considered the possibility that

=================================== FAILURES ===================================
____________________________ test/test_autograd.py _____________________________
[gw0] linux -- Python 3.11.11 $PREFIX/bin/python
worker 'gw0' crashed while running 'test/test_autograd.py::TestAutograd::test_profiler_seq_nr'
=============================== warnings summary ===============================

might have something to do with kineto being enabled. If the CUDA job ends up passing, we can rerun the MKL+CPU job to see whether it was a fluke or whether it reproduces.

@mgorny (Contributor, Author) commented Dec 27, 2024

Oh my, it is kineto after all! I'm going to handle the other two changes, the triton dep (#166) and the ccache instructions, in a separate pull request.

@h-vetinari (Member)

ccache instructions

Part of that is already in the windows PR. Could you review there?

@mgorny (Contributor, Author) commented Dec 27, 2024

ccache instructions

Part of that is already in the windows PR. Could you review there?

Oh, sorry, didn't check mail in time. Will do.

@mgorny (Contributor, Author) commented Jan 3, 2025

Ok, so I've been able to reproduce the test failures on a GPU-enabled host, and they were failing due to CUDA running out of GPU memory. Besides that, I'm seeing a lot of:

OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.

Which leads me to the following:

  1. Should we have a run_constrained entry enforcing the openmp* build of openblas? (A sketch follows below this list.)
  2. We may try forcing -n 1 for CUDA test runs, but I suppose this will make them significantly slower.
  3. We could also enable kineto support without CUPTI, but I'm not sure how helpful that would actually be.
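
For item 1, a minimal sketch of what such a constraint could look like in recipe/meta.yaml; the package name and build-string glob are assumptions and would need to match what conda-forge actually publishes for the OpenMP OpenBLAS variant:

    requirements:
      run_constrained:
        # Hypothetical pin: only allow the OpenMP build variant of OpenBLAS to
        # be co-installed, to address the "Detect OpenMP Loop" warning above.
        - libopenblas * *openmp*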

@h-vetinari (Member)

Great that you managed to debug this!

Should we have a run_constrained enforcing openmp* version of openblas?

Sounds reasonable to me

We may try forcing -n 1 for CUDA test runs, but I suppose this will make them significantly slower.

Perhaps -n 2 or even -n 4 already suffices to avoid the OOM? The gpu_2xlarge runners we're using have 8 vCPUs.
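
For reference, a hedged sketch of a reduced-parallelism run; -n comes from pytest-xdist, and the test file list here is only illustrative:

    # Fewer xdist workers means fewer processes competing for GPU memory at once.
    python -m pytest -v -n 2 test/test_autograd.py test/test_nn.py test/test_torch.py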

@mgorny (Contributor, Author) commented Jan 5, 2025

Hmm, after some more testing: it seems the OOM could perhaps be circumvented by either skipping the large_cuda tests entirely, or by splitting them into a separate non-parallel test run. I'm not 100% sure it'll solve the issue, but it definitely improved things here.

We could also consider using pytest-rerunfailures to declare tests "flaky" in general, and have pytest retry 5 or 10 times, in case that could save us from discarding a whole CI run.
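
A hypothetical sketch of that split-plus-retry idea; the -k "large" keyword expression is only a placeholder for whatever actually selects the memory-hungry tests, and --reruns comes from the pytest-rerunfailures plugin:

    # Run most of the suite in parallel, then the large tests serially,
    # retrying failures a few times before giving up on the whole run.
    python -m pytest -n 4 -k "not large" --reruns 3 test/
    python -m pytest -k "large" --reruns 3 test/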

@mgorny (Contributor, Author) commented Jan 5, 2025

It seems that I was overly optimistic after all, and even non-large tests eventually start running out of memory. Trying with a lower -n still; let's see what value works. There's also some risk of hanging at the end, though.

@mgorny (Contributor, Author) commented Jan 8, 2025

The plot thickens: the tests were apparently passing only because of the high -n. If I run them with a lower -n (or serially), these three fail:

FAILED [0.0012s] test/test_autograd_fallback.py::TestAutogradFallback::test_base_does_not_require_grad_mode_nothing
FAILED [0.0009s] test/test_autograd_fallback.py::TestAutogradFallback::test_base_does_not_require_grad_mode_warn
FAILED [0.0013s] test/test_autograd_fallback.py::TestAutogradFallback::test_composite_registered_to_cpu_mode_nothing

Should I skip them as flaky?

@h-vetinari (Member) commented Jan 8, 2025

Should I skip them as flaky?

Yes, please skip them with a comment. Doesn't sound flaky to me though, more likely a test that has a bug in the sense that it implicitly depends on high parallelism.
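
A minimal sketch of how that skip could be expressed in the feedstock's test invocation; only the three test IDs come from the log above, the rest is assumed:

    # Deselected because they fail when run serially / at low parallelism;
    # see the discussion in this PR (#305).
    python -m pytest test/test_autograd_fallback.py \
        --deselect "test/test_autograd_fallback.py::TestAutogradFallback::test_base_does_not_require_grad_mode_nothing" \
        --deselect "test/test_autograd_fallback.py::TestAutogradFallback::test_base_does_not_require_grad_mode_warn" \
        --deselect "test/test_autograd_fallback.py::TestAutogradFallback::test_composite_registered_to_cpu_mode_nothing"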

@mgorny (Contributor, Author) commented Jan 8, 2025

Okay, this time I've tested all 4 linux-64 builds locally before pushing, and hopefully this will save us from further false starts.

due to the extreme runtime in emulation, and the almost non-existent variance between python versions, this is a better trade-off than testing nothing, or being stuck in emulation for hours.
@h-vetinari (Member)

Another failure (possibly flaky):

=================================== FAILURES ===================================
____________________________ test/test_autograd.py _____________________________
[gw0] linux -- Python 3.12.8 $PREFIX/bin/python
worker 'gw0' crashed while running 'test/test_autograd.py::TestAutograd::test_profiler_seq_nr'
=============================== warnings summary ===============================
=========================== short test summary info ============================
FAILED [0.0000s] test/test_autograd.py::TestAutograd::test_profiler_seq_nr
= 1 failed, 7564 passed, 1383 skipped, 31 xfailed, 75786 warnings in 627.53s (0:10:27) =
WARNING: Tests failed for pytorch-2.5.1-cpu_generic_py312_h1f840dd_9.conda - moving package to /home/conda/feedstock_root/build_artifacts/broken

Tobias-Fischer added a commit to baszalmstra/pytorch-cpu-feedstock that referenced this pull request Jan 9, 2025
@mgorny (Contributor, Author) commented Jan 9, 2025

Sigh, now we hit what looks like a random crash.

@h-vetinari (Member)

There's one failure on the re-enabled aarch64 + CUDA tests:

=================================== FAILURES ===================================
_ TestLinalgCPU.test_eigh_svd_illcondition_matrix_input_should_not_crash_cpu_float32 _
[gw1] linux -- Python 3.12.8 $PREFIX/bin/python

self = <test_linalg.TestLinalgCPU testMethod=test_eigh_svd_illcondition_matrix_input_should_not_crash_cpu_float32>
device = 'cpu', dtype = torch.float32

    @skipCPUIfNoLapack
    @dtypes(torch.float, torch.double)
    @unittest.skipIf(_get_torch_cuda_version() < (12, 1), "Test is fixed on cuda 12.1 update 1.")
    def test_eigh_svd_illcondition_matrix_input_should_not_crash(self, device, dtype):
        # See https://github.com/pytorch/pytorch/issues/94772, https://github.com/pytorch/pytorch/issues/105359
        # This test crashes with `cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED` on cuda 11.8,
        # but passes on cuda 12.1 update 1 or later.
        a = torch.ones(512, 512, dtype=dtype, device=device)
        a[0, 0] = 1.0e-5
        a[-1, -1] = 1.0e5
    
        eigh_out = torch.linalg.eigh(a)
        svd_out = torch.linalg.svd(a)
    
        # Matrix input a is too ill-conditioned.
        # We'll just compare the first two singular values/eigenvalues. They are 1.0e5 and 511.0
        # The precision override with tolerance of 1.0 makes sense since ill-conditioned inputs are hard to converge
        # to exact values.
>       self.assertEqual(eigh_out.eigenvalues.sort(descending=True).values[:2], [1.0e5, 511.0], atol=1.0, rtol=1.0e-2)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 1 / 2 (50.0%)
E       Greatest absolute difference: 29490.681640625 at index (1,) (up to 1.0 allowed)
E       Greatest relative difference: 57.71170425415039 at index (1,) (up to 0.01 allowed)
E       
E       To execute this test, run the following from the base repo dir:
E           python test/test_linalg.py TestLinalgCPU.test_eigh_svd_illcondition_matrix_input_should_not_crash_cpu_float32
E       
E       This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

test/test_linalg.py:1044: AssertionError
=============================== warnings summary ===============================

Given that the test is named test_eigh_svd_illcondition_matrix_input_should_not_crash_cpu_float32, and that it did in fact not crash, I'm going to skip this one.

The profiling test that crashed also has some specific assumptions that seem very tight, so I'll skip that one too for now.

@h-vetinari (Member)

Dammit, some more pointless failures:

=========================== short test summary info ============================
FAILED [0.0127s] test/test_nn.py::TestNN::test_BCELoss_weights_no_reduce_cuda - AssertionError: Tensor-likes are not close!

Mismatched elements: 2 / 150 (1.3%)
Greatest absolute difference: 0.003756726657229592 at index (11, 9) (up to 0.0003 allowed)
Greatest relative difference: 1.4612581437751952e-06 at index (11, 9) (up to 0 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_nn.py TestNN.test_BCELoss_weights_no_reduce_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.1935s] test/test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 327680 (0.0%)
Greatest absolute difference: 0.03125 at index (1, 10, 222) (up to 0.01 allowed)
Greatest relative difference: 0.01495361328125 at index (1, 10, 222) (up to 0.01 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_torch.py TestTorch.test_index_add_correctness

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
= 2 failed, 13190 passed, 2595 skipped, 91 xfailed, 143279 warnings in 2923.36s (0:48:43) =

@h-vetinari (Member)

Wow, the CUDA MKL build looks even worse:

=========================== short test summary info ============================
FAILED [0.0000s] test/test_autograd.py::TestAutograd::test_profiler_propagation
FAILED [0.0069s] test/test_nn.py::TestNN::test_BCELoss_weights_no_reduce_cuda - AssertionError: Tensor-likes are not close!

Mismatched elements: 2 / 150 (1.3%)
Greatest absolute difference: 0.003756726657229592 at index (11, 9) (up to 0.0003 allowed)
Greatest relative difference: 1.4612581437751952e-06 at index (11, 9) (up to 0 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_nn.py TestNN.test_BCELoss_weights_no_reduce_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0274s] test/test_nn.py::TestNNDeviceTypeCUDA::test_ctc_loss_cudnn_tensor_cuda - AssertionError: Tensor-likes are not close!

Mismatched elements: 12117 / 48480 (25.0%)
Greatest absolute difference: inf at index (15, 0, 0) (up to 0.0001 allowed)
Greatest relative difference: inf at index (15, 0, 0) (up to 0 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_nn.py TestNNDeviceTypeCUDA.test_ctc_loss_cudnn_tensor_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.3278s] test/test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 327680 (0.0%)
Greatest absolute difference: 0.03125 at index (3, 209, 207) (up to 0.01 allowed)
Greatest relative difference: 0.01495361328125 at index (3, 209, 207) (up to 0.01 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_torch.py TestTorch.test_index_add_correctness

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [1.1158s] test/test_linalg.py::TestLinalgCPU::test_inverse_errors_large_cpu_complex128 - RuntimeError: Pivots given to lu_solve must all be greater or equal to 1. Did you properly pass the result of lu_factor?

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_inverse_errors_large_cpu_complex128

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.5932s] test/test_linalg.py::TestLinalgCPU::test_inverse_errors_large_cpu_complex64 - RuntimeError: Pivots given to lu_solve must all be greater or equal to 1. Did you properly pass the result of lu_factor?

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_inverse_errors_large_cpu_complex64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.2797s] test/test_linalg.py::TestLinalgCPU::test_inverse_errors_large_cpu_float32 - RuntimeError: Pivots given to lu_solve must all be greater or equal to 1. Did you properly pass the result of lu_factor?

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_inverse_errors_large_cpu_float32

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.4246s] test/test_linalg.py::TestLinalgCPU::test_inverse_errors_large_cpu_float64 - RuntimeError: Pivots given to lu_solve must all be greater or equal to 1. Did you properly pass the result of lu_factor?

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_inverse_errors_large_cpu_float64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
= 8 failed, 13162 passed, 2581 skipped, 91 xfailed, 143299 warnings in 3429.55s (0:57:09) =

@h-vetinari (Member)

And it's not just accuracy problems:

Intel oneMKL ERROR: Parameter 6 was incorrect on entry to ZLASWP.

@h-vetinari (Member)

I'm skipping this in #316, but let's try to investigate & fix this in the next round @mgorny?

@mgorny (Contributor, Author) commented Jan 10, 2025

This is weird, because the tests passed for me. I could imagine precision problems in the CUDA tests, since I don't have an NVIDIA GPU on my machine, but the non-CUDA failures are really weird.

@h-vetinari (Member)

This is weird, because the tests passed for me. I could imagine precision problems in the CUDA tests, since I don't have an NVIDIA GPU on my machine, but the non-CUDA failures are really weird.

New failures were CUDA-only, so that part at least is in line with your testing

@mgorny (Contributor, Author) commented Jan 10, 2025

Well, that mkl failure looks like it happens on CPU. Lemme try again locally on a GPU-enabled host.

@h-vetinari (Member)

Well, that mkl failure looks like it happens on CPU. Lemme try again locally on a GPU-enabled host.

Are you referring to the test class / name (i.e. TestLinalgCPU)? Because at least on the level of the CI runs,

linux_64_blas_implmklc_compiler_version13channel_targetsconda-forge_maincuda_compilerNonecuda_compiler_versionNonecxx_compiler_version13is_rcFalse

is green, whereas

linux_64_blas_implmklc_compiler_version13channel_targetsconda-forge_maincuda_compilercuda-nvcccuda_compiler_version12.6cxx_compiler_version13is_rcFalse

is red.

@mgorny (Contributor, Author) commented Jan 10, 2025

This bit:

2025-01-10T09:31:08.3775140Z ____________ TestLinalgCPU.test_inverse_errors_large_cpu_complex128 ____________
2025-01-10T09:31:08.3779626Z [gw2] linux -- Python 3.13.1 $PREFIX/bin/python
2025-01-10T09:31:08.3782609Z 
2025-01-10T09:31:08.3787083Z self = <test_linalg.TestLinalgCPU testMethod=test_inverse_errors_large_cpu_complex128>
2025-01-10T09:31:08.3789969Z device = 'cpu', dtype = torch.complex128

It explicitly says it's running on CPU.

@mgorny (Contributor, Author) commented Jan 10, 2025

And yeah, can't reproduce on the GPU-enabled host either.

@hmaarrfk (Contributor)

And yeah, can't reproduce on the GPU-enabled host either.

In my years of building and using pytorch/tensorflow and related scientific software, I've found that if the host or the runner is under any kind of memory pressure, the tests can start to fail.

Memory allocation failures are real, and memory not being initialized correctly can manifest itself as real bugs.

I would:

  • Avoid adding anything other than "spot tests" or "smoke tests" on the CIs (a sketch follows below this list).
  • Test on a powerful machine.
  • Report tolerance issues upstream when one can.
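
As an illustration of the "smoke test" idea in the first bullet, a minimal check might look like the following; this is a hypothetical sketch, not what the feedstock actually runs:

    # Quick import-and-compute check instead of the full upstream test suite.
    python -c "import torch; x = torch.ones(2, 2); assert x.sum().item() == 4.0; print(torch.__version__)"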

@hmaarrfk merged commit 3b2386e into conda-forge:main on Jan 14, 2025 (21 of 25 checks passed).
Successfully merging this pull request may close these issues: Re-enable kineto submodule.

6 participants