
Reenable building kineto, add CUPTI dep #305

Merged: 19 commits into conda-forge:main on Jan 14, 2025
Conversation

@mgorny (Contributor) commented Dec 25, 2024

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Fixes #76

Let's try enabling new dependencies separately, to see which one caused CI problems.

@conda-forge-admin (Contributor) commented Dec 25, 2024

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.
  • ℹ️ The recipe is not parsable by parser conda-recipe-manager. The recipe can only be automatically migrated to the new v1 format if it is parseable by conda-recipe-manager.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/12689415146. Examine the logs at this URL for more detail.

@mgorny (Contributor, Author) commented Dec 25, 2024

Hmm, shouldn't my pull request start CI jobs now, or did I misunderstand the purpose of requesting open-gpu-server access?

@hmaarrfk (Contributor)

It definitely gets overloaded...

@mgorny (Contributor, Author) commented Dec 26, 2024

Could you launch the CI for me?

@Tobias-Fischer (Contributor)

@mgorny - just launched CI. We should skip most builds to save resources while we're still testing, but I'm assuming the builds are likely to pass?

@mgorny (Contributor, Author) commented Dec 26, 2024

Kinda: one of the three packages has been causing issues in #298, so I'm trying to find out which one it was. Feel free to cancel the osx and aarch64 builds; it happened on linux-64.

@h-vetinari (Member)

In fact, it only happened for generic blas + CUDA, which is IMO the only job that we'd need to be running here.

@h-vetinari (Member)

Hmm, shouldn't my pull request start CI jobs now, or did I misunderstand the purpose of requesting open-gpu-server access?

There's a second layer of access control in https://github.com/conda-forge/.cirun/blob/master/.access.yml, but you should be pulled in through the reference to the conda-forge-users.json from the open-gpu-server. Perhaps that .cirun thing needs to be refreshed though? 🤔

@mgorny (Contributor, Author) commented Dec 27, 2024

In fact, it only happened for generic blas + CUDA, which is IMO the only job that we'd need to be running here.

Except that this time mkl + cpu failed/crashed :-/. Though I'm not sure if it's really kineto-related or a fluke.

@h-vetinari (Member) commented Dec 27, 2024

Check out this run (for 97cb097) - all the others passed.

@mgorny (Contributor, Author) commented Dec 27, 2024

Well, okay, then it's a pre-existing fluke. Unfortunate, but irrelevant. Let's try the next package.

@h-vetinari (Member)

Huh? We haven't seen yet whether CI passes with cupti. The single test failure on a non-CUDA job was irrelevant.

@mgorny (Contributor, Author) commented Dec 27, 2024

Ah, sorry. I've gotten confused by the jobs being cancelled.

@h-vetinari (Member)

Ah, sorry. I've gotten confused by the jobs being cancelled.

That was me manually pruning the list so that we only run the single job that's relevant.

@h-vetinari (Member)

@mgorny: Except that this time mkl + cpu failed/crashed :-/. Though I'm not sure if it's really kineto-related or a fluke.

@h-vetinari: The single test failure on a non-CUDA job was irrelevant.

OK, perhaps I spoke too soon - I hadn't considered the possibility that

=================================== FAILURES ===================================
____________________________ test/test_autograd.py _____________________________
[gw0] linux -- Python 3.11.11 $PREFIX/bin/python
worker 'gw0' crashed while running 'test/test_autograd.py::TestAutograd::test_profiler_seq_nr'
=============================== warnings summary ===============================

might have something to do with kineto being enabled. If the CUDA job ends up passing, we can rerun the MKL+CPU job to see whether it was a fluke or whether it reproduces.

@mgorny (Contributor, Author) commented Dec 27, 2024

Oh my, it is kineto after all! I'm going to handle the other two changes, the triton dep (#166) and the ccache instructions, in a separate pull request.

@h-vetinari (Member)

ccache instructions

Part of that is already in the windows PR. Could you review there?

@mgorny (Contributor, Author) commented Dec 27, 2024

ccache instructions

Part of that is already in the windows PR. Could you review there?

Oh, sorry, didn't check mail in time. Will do.

@mgorny (Contributor, Author) commented Jan 3, 2025

Ok, so I've been able to reproduce the test failures on a GPU-enabled host, and they were failing due to CUDA running out of GPU memory. Besides that, I'm seeing a lot of:

OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.

Which leads me to the following:

  1. Should we have a run_constrained entry enforcing the openmp* build of openblas? (A sketch follows below this list.)
  2. We may try forcing -n 1 for CUDA test runs, but I suppose this will make them significantly slower.
  3. We could also enable kineto support without CUPTI, but I'm not sure how helpful that would actually be.
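
For item 1, a minimal sketch of what such a constraint could look like in recipe/meta.yaml; the package name and build-string glob are assumptions and would need to match what conda-forge actually publishes for the OpenMP OpenBLAS variant:

    requirements:
      run_constrained:
        # Hypothetical pin: only allow the OpenMP build variant of OpenBLAS to
        # be co-installed, to address the "Detect OpenMP Loop" warning above.
        - libopenblas * *openmp*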

@h-vetinari (Member)

Great that you managed to debug this!

Should we have a run_constrained enforcing openmp* version of openblas?

Sounds reasonable to me

We may try forcing -n 1 for CUDA test runs, but I suppose this will make them significantly slower.

Perhaps -n 2 or even -n 4 already suffices to avoid the OOM? The gpu_2xlarge runners we're using have 8 vCPUs.
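
For reference, a hedged sketch of a reduced-parallelism run; -n comes from pytest-xdist, and the test file list here is only illustrative:

    # Fewer xdist workers means fewer processes competing for GPU memory at once.
    python -m pytest -v -n 2 test/test_autograd.py test/test_nn.py test/test_torch.py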

@mgorny (Contributor, Author) commented Jan 5, 2025

Hmm, after some more testing: it seems the OOM could perhaps be circumvented by either skipping the large_cuda tests entirely, or by splitting them into a separate non-parallel test run. I'm not 100% sure it'll solve the issue, but it definitely improved things here.

We could also consider using pytest-rerunfailures to declare tests "flaky" in general, and have pytest retry 5 or 10 times, in case that could save us from discarding a whole CI run.
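
A hypothetical sketch of that split-plus-retry idea; the -k "large" keyword expression is only a placeholder for whatever actually selects the memory-hungry tests, and --reruns comes from the pytest-rerunfailures plugin:

    # Run most of the suite in parallel, then the large tests serially,
    # retrying failures a few times before giving up on the whole run.
    python -m pytest -n 4 -k "not large" --reruns 3 test/
    python -m pytest -k "large" --reruns 3 test/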

@mgorny (Contributor, Author) commented Jan 5, 2025

It seems that I was overly optimistic after all, and even non-large tests eventually start running out of memory. Trying with a lower -n still; let's see what value works. There's also some risk of hanging at the end, though.

@mgorny (Contributor, Author) commented Jan 8, 2025

The plot thickens: the tests were apparently passing only because of the high -n. If I run them with a lower -n (or serially), these three fail:

FAILED [0.0012s] test/test_autograd_fallback.py::TestAutogradFallback::test_base_does_not_require_grad_mode_nothing
FAILED [0.0009s] test/test_autograd_fallback.py::TestAutogradFallback::test_base_does_not_require_grad_mode_warn
FAILED [0.0013s] test/test_autograd_fallback.py::TestAutogradFallback::test_composite_registered_to_cpu_mode_nothing

Should I skip them as flaky?

@h-vetinari (Member) commented Jan 8, 2025

Should I skip them as flaky?

Yes, please skip them with a comment. Doesn't sound flaky to me though, more likely a test that has a bug in the sense that it implicitly depends on high parallelism.
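
A minimal sketch of how that skip could be expressed in the feedstock's test invocation; only the three test IDs come from the log above, the rest is assumed:

    # Deselected because they fail when run serially / at low parallelism;
    # see the discussion in this PR (#305).
    python -m pytest test/test_autograd_fallback.py \
        --deselect "test/test_autograd_fallback.py::TestAutogradFallback::test_base_does_not_require_grad_mode_nothing" \
        --deselect "test/test_autograd_fallback.py::TestAutogradFallback::test_base_does_not_require_grad_mode_warn" \
        --deselect "test/test_autograd_fallback.py::TestAutogradFallback::test_composite_registered_to_cpu_mode_nothing"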

@mgorny (Contributor, Author) commented Jan 8, 2025

Okay, this time I've tested all 4 linux-64 builds locally before pushing, and hopefully this will save us from further false starts.

due to the extreme runtime in emulation, and the almost non-existent variance between python versions, this is a better trade-off than testing nothing, or being stuck in emulation for hours.
@h-vetinari (Member)

Another failure (possibly flaky):

=================================== FAILURES ===================================
____________________________ test/test_autograd.py _____________________________
[gw0] linux -- Python 3.12.8 $PREFIX/bin/python
worker 'gw0' crashed while running 'test/test_autograd.py::TestAutograd::test_profiler_seq_nr'
=============================== warnings summary ===============================
=========================== short test summary info ============================
FAILED [0.0000s] test/test_autograd.py::TestAutograd::test_profiler_seq_nr
= 1 failed, 7564 passed, 1383 skipped, 31 xfailed, 75786 warnings in 627.53s (0:10:27) =
WARNING: Tests failed for pytorch-2.5.1-cpu_generic_py312_h1f840dd_9.conda - moving package to /home/conda/feedstock_root/build_artifacts/broken

Tobias-Fischer added a commit to baszalmstra/pytorch-cpu-feedstock that referenced this pull request Jan 9, 2025
@mgorny (Contributor, Author) commented Jan 9, 2025

Sigh, now we hit what looks like a random crash.

@h-vetinari (Member)

There's one failure on the re-enabled aarch64 + CUDA tests:

=================================== FAILURES ===================================
_ TestLinalgCPU.test_eigh_svd_illcondition_matrix_input_should_not_crash_cpu_float32 _
[gw1] linux -- Python 3.12.8 $PREFIX/bin/python

self = <test_linalg.TestLinalgCPU testMethod=test_eigh_svd_illcondition_matrix_input_should_not_crash_cpu_float32>
device = 'cpu', dtype = torch.float32

    @skipCPUIfNoLapack
    @dtypes(torch.float, torch.double)
    @unittest.skipIf(_get_torch_cuda_version() < (12, 1), "Test is fixed on cuda 12.1 update 1.")
    def test_eigh_svd_illcondition_matrix_input_should_not_crash(self, device, dtype):
        # See https://github.com/pytorch/pytorch/issues/94772, https://github.com/pytorch/pytorch/issues/105359
        # This test crashes with `cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED` on cuda 11.8,
        # but passes on cuda 12.1 update 1 or later.
        a = torch.ones(512, 512, dtype=dtype, device=device)
        a[0, 0] = 1.0e-5
        a[-1, -1] = 1.0e5
    
        eigh_out = torch.linalg.eigh(a)
        svd_out = torch.linalg.svd(a)
    
        # Matrix input a is too ill-conditioned.
        # We'll just compare the first two singular values/eigenvalues. They are 1.0e5 and 511.0
        # The precision override with tolerance of 1.0 makes sense since ill-conditioned inputs are hard to converge
        # to exact values.
>       self.assertEqual(eigh_out.eigenvalues.sort(descending=True).values[:2], [1.0e5, 511.0], atol=1.0, rtol=1.0e-2)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 1 / 2 (50.0%)
E       Greatest absolute difference: 29490.681640625 at index (1,) (up to 1.0 allowed)
E       Greatest relative difference: 57.71170425415039 at index (1,) (up to 0.01 allowed)
E       
E       To execute this test, run the following from the base repo dir:
E           python test/test_linalg.py TestLinalgCPU.test_eigh_svd_illcondition_matrix_input_should_not_crash_cpu_float32
E       
E       This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

test/test_linalg.py:1044: AssertionError
=============================== warnings summary ===============================

Given that the test is named test_eigh_svd_illcondition_matrix_input_should_not_crash_cpu_float32, and that it did in fact not crash, I'm going to skip this one.

The profiling test that crashed also has some specific assumptions that seem very tight, so I'll skip that one too for now.

@h-vetinari (Member)

Dammit, some more pointless failures:

=========================== short test summary info ============================
FAILED [0.0127s] test/test_nn.py::TestNN::test_BCELoss_weights_no_reduce_cuda - AssertionError: Tensor-likes are not close!

Mismatched elements: 2 / 150 (1.3%)
Greatest absolute difference: 0.003756726657229592 at index (11, 9) (up to 0.0003 allowed)
Greatest relative difference: 1.4612581437751952e-06 at index (11, 9) (up to 0 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_nn.py TestNN.test_BCELoss_weights_no_reduce_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.1935s] test/test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 327680 (0.0%)
Greatest absolute difference: 0.03125 at index (1, 10, 222) (up to 0.01 allowed)
Greatest relative difference: 0.01495361328125 at index (1, 10, 222) (up to 0.01 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_torch.py TestTorch.test_index_add_correctness

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
= 2 failed, 13190 passed, 2595 skipped, 91 xfailed, 143279 warnings in 2923.36s (0:48:43) =

@h-vetinari (Member)

Wow, the CUDA MKL build looks even worse:

=========================== short test summary info ============================
FAILED [0.0000s] test/test_autograd.py::TestAutograd::test_profiler_propagation
FAILED [0.0069s] test/test_nn.py::TestNN::test_BCELoss_weights_no_reduce_cuda - AssertionError: Tensor-likes are not close!

Mismatched elements: 2 / 150 (1.3%)
Greatest absolute difference: 0.003756726657229592 at index (11, 9) (up to 0.0003 allowed)
Greatest relative difference: 1.4612581437751952e-06 at index (11, 9) (up to 0 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_nn.py TestNN.test_BCELoss_weights_no_reduce_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0274s] test/test_nn.py::TestNNDeviceTypeCUDA::test_ctc_loss_cudnn_tensor_cuda - AssertionError: Tensor-likes are not close!

Mismatched elements: 12117 / 48480 (25.0%)
Greatest absolute difference: inf at index (15, 0, 0) (up to 0.0001 allowed)
Greatest relative difference: inf at index (15, 0, 0) (up to 0 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_nn.py TestNNDeviceTypeCUDA.test_ctc_loss_cudnn_tensor_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.3278s] test/test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 327680 (0.0%)
Greatest absolute difference: 0.03125 at index (3, 209, 207) (up to 0.01 allowed)
Greatest relative difference: 0.01495361328125 at index (3, 209, 207) (up to 0.01 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_torch.py TestTorch.test_index_add_correctness

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [1.1158s] test/test_linalg.py::TestLinalgCPU::test_inverse_errors_large_cpu_complex128 - RuntimeError: Pivots given to lu_solve must all be greater or equal to 1. Did you properly pass the result of lu_factor?

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_inverse_errors_large_cpu_complex128

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.5932s] test/test_linalg.py::TestLinalgCPU::test_inverse_errors_large_cpu_complex64 - RuntimeError: Pivots given to lu_solve must all be greater or equal to 1. Did you properly pass the result of lu_factor?

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_inverse_errors_large_cpu_complex64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.2797s] test/test_linalg.py::TestLinalgCPU::test_inverse_errors_large_cpu_float32 - RuntimeError: Pivots given to lu_solve must all be greater or equal to 1. Did you properly pass the result of lu_factor?

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_inverse_errors_large_cpu_float32

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.4246s] test/test_linalg.py::TestLinalgCPU::test_inverse_errors_large_cpu_float64 - RuntimeError: Pivots given to lu_solve must all be greater or equal to 1. Did you properly pass the result of lu_factor?

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_inverse_errors_large_cpu_float64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
= 8 failed, 13162 passed, 2581 skipped, 91 xfailed, 143299 warnings in 3429.55s (0:57:09) =

@h-vetinari (Member)

And it's not just accuracy problems:

Intel oneMKL ERROR: Parameter 6 was incorrect on entry to ZLASWP.

@h-vetinari (Member)

I'm skipping this in #316, but let's try to investigate & fix this in the next round @mgorny?

@mgorny (Contributor, Author) commented Jan 10, 2025

This is weird, because the tests passed for me. I could imagine precision problems in the CUDA tests, since I don't have an NVIDIA GPU on my machine, but the non-CUDA failures are really weird.

@h-vetinari (Member)

This is weird, because the tests passed for me. I could imagine precision problems in the CUDA tests, since I don't have an NVIDIA GPU on my machine, but the non-CUDA failures are really weird.

New failures were CUDA-only, so that part at least is in line with your testing

@mgorny (Contributor, Author) commented Jan 10, 2025

Well, that mkl failure looks like it happens on CPU. Lemme try again locally on a GPU-enabled host.

@h-vetinari (Member)

Well, that mkl failure looks like it happens on CPU. Lemme try again locally on a GPU-enabled host.

Are you referring to the test class / name (i.e. TestLinalgCPU)? Because at least on the level of the CI runs,

linux_64_blas_implmklc_compiler_version13channel_targetsconda-forge_maincuda_compilerNonecuda_compiler_versionNonecxx_compiler_version13is_rcFalse

is green, whereas

linux_64_blas_implmklc_compiler_version13channel_targetsconda-forge_maincuda_compilercuda-nvcccuda_compiler_version12.6cxx_compiler_version13is_rcFalse

is red.

@mgorny (Contributor, Author) commented Jan 10, 2025

This bit:

2025-01-10T09:31:08.3775140Z ____________ TestLinalgCPU.test_inverse_errors_large_cpu_complex128 ____________
2025-01-10T09:31:08.3779626Z [gw2] linux -- Python 3.13.1 $PREFIX/bin/python
2025-01-10T09:31:08.3782609Z 
2025-01-10T09:31:08.3787083Z self = <test_linalg.TestLinalgCPU testMethod=test_inverse_errors_large_cpu_complex128>
2025-01-10T09:31:08.3789969Z device = 'cpu', dtype = torch.complex128

It explicitly says it's running on CPU.

@mgorny (Contributor, Author) commented Jan 10, 2025

And yeah, can't reproduce on the GPU-enabled host either.

@hmaarrfk (Contributor)

And yeah, can't reproduce on the GPU-enabled host either.

In my years of building and using pytorch/tensorflow and related scientific software, I've found that if the host or the runner is under any kind of memory pressure, the tests can start to fail.

Memory allocation failures are real, and memory not being initialized correctly can manifest itself as real bugs.

I would:

  • Avoid adding anything other than "spot tests" or "smoke tests" on the CIs (a sketch follows below this list).
  • Test on a powerful machine.
  • Report tolerance issues upstream when one can.
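
As an illustration of the "smoke test" idea in the first bullet, a minimal check might look like the following; this is a hypothetical sketch, not what the feedstock actually runs:

    # Quick import-and-compute check instead of the full upstream test suite.
    python -c "import torch; x = torch.ones(2, 2); assert x.sum().item() == 4.0; print(torch.__version__)"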

@hmaarrfk merged commit 3b2386e into conda-forge:main on Jan 14, 2025 (21 of 25 checks passed).
Successfully merging this pull request may close these issues: Re-enable kineto submodule.

6 participants