Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Run CI on macOS #63

Open
wants to merge 16 commits into
base: master
Choose a base branch
from
Open

[WIP] Run CI on macOS #63

wants to merge 16 commits into from

Conversation

peastman
Copy link
Member

Fixes #62

@mikemhenry
Copy link
Collaborator

@peastman thanks for working on this, ping me if you run into any issues with missing packages

@peastman
Copy link
Member Author

The Linux CUDA builds are failing with an error that I don't think is related to any of my changes? It looks like cuaev is compiled with an incompatible ABI.

11: E   ImportError: /usr/share/miniconda3/envs/nnpops/lib/python3.7/site-packages/torchani/cuaev.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c108ListType3getENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4Type24SingletonOrSharedTypePtrIS7_EE

The Mac Intel build is failing at the setup stage:

ResolvePackageNotFound: 
  - pytorch-cpu==1.11.0=1.11.0
  - python=3.10[build=3.10.*]

It's not clear to me what's causing that. Checking the file list at https://anaconda.org/conda-forge/pytorch-cpu/files I see the file osx-64/pytorch-cpu-1.11.0-cpu_py310hde81af8_2.tar.bz2 which looks like the correct one?

To get builds running on ARM Mac, we need to get the M1 runner working again.

@raimis
Copy link
Contributor

raimis commented Aug 1, 2022

I have feeling the failures on Linux related to conda-forge/openmm-torch-feedstock#20. So, I'll try Mamba for the CI.

@raimis
Copy link
Contributor

raimis commented Aug 1, 2022

The CI issue is fixed: #64

@mikemhenry
Copy link
Collaborator

Nice! I don't have write access to this repo, but I'm guessing that once we merge the changes in from main things will pass.

@peastman
Copy link
Member Author

peastman commented Aug 1, 2022

Thanks! That fixed it. So now we just have the problem with it not finding the pytorch package. Any idea what's up with that?

@mikemhenry
Copy link
Collaborator

Weird, it looks like it exists: https://anaconda.org/conda-forge/pytorch-cpu/1.11.0/download/osx-64/pytorch-cpu-1.11.0-cpu_py310hde81af8_2.tar.bz2
(looks like osx intel 1.11.0 with python 3.10)
@peastman could you try bumping the python version from 3.10 to 3.9? My guess is that there is something missing for the python 3.10 build.

Looking for: ["cmake[version='>=3.20']", 'make', 'mdtraj', 'torchani==2.2.2', 'pytest', 'python=3.10[build=3.10.*]', 'pytorch-cpu==1.11.0=1.11.0', 'sysroot_linux-64==2.17']


Encountered problems while solving:
  - nothing provides requested python 3.10* 3.10.*
  - nothing provides requested pytorch-cpu ==1.11.0 1.11.0

Unless the issue is with this syntax, which looks suspect to me pytorch-cpu==1.11.0=1.11.0 since it looks like it is looking for version 1.11.0 and build version 1.11.0 which I know doesn't exist.

@peastman
Copy link
Member Author

peastman commented Aug 1, 2022

On an Intel Mac, I created an environment with Python 3.10. Then I typed the following command:

mamba install -c conda-forge cmake make mdtraj torchani=2.2.2 pytest pytorch-cpu=1.11.0 python=3.10

and it worked without problem. So why does it fail on CI???

@mikemhenry
Copy link
Collaborator

Is there something going on with sed shenanigans? pytorch-cpu==1.11.0=1.11.0 is still sus to me.
Perhaps echo/print the modified environment.yaml and see if you can use that to make the env?

@mikemhenry
Copy link
Collaborator

I feel like something is off since it also can't find python 3.10
nothing provides requested python 3.10* 3.10.*

@peastman
Copy link
Member Author

peastman commented Aug 1, 2022

You're right. I had to rewrite the sed command because it works differently on Mac than on Linux. It's leading to duplicated version numbers.

@peastman
Copy link
Member Author

peastman commented Aug 1, 2022

Progress! Here's the error it fails with now.

[ 69%] Linking CXX shared library libNNPOpsPyTorch.dylib
ld: library not found for -lmkl_intel_ilp64

It also reports what look like some genuine errors in the code:

/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:101:29: warning: equality comparison result unused [-Wunused-comparison]
                activation_ == ::CFConv::ShiftedSoftplus;
                ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:101:29: note: use '=' to turn this equality comparison into an assignment
                activation_ == ::CFConv::ShiftedSoftplus;
                            ^~
                            =
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:103:29: warning: equality comparison result unused [-Wunused-comparison]
                activation_ == ::CFConv::Tanh;
                ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:103:29: note: use '=' to turn this equality comparison into an assignment
                activation_ == ::CFConv::Tanh;
                            ^~
                            =

@raimis
Copy link
Contributor

raimis commented Aug 2, 2022

Yes, the warnings have to be fixed.

@raimis raimis mentioned this pull request Aug 2, 2022
@mikemhenry
Copy link
Collaborator

I'm not a cmake expert, but I'm guessing we need to modify cmake to add the location of mkl_intel_ilp64 to LIBRARY_PATH?
I don't have an x86 mac to do this locally. I did naively try adding mkl to the dependencies (in case it wasn't getting pulled in on osx) on my fork that but that didn't fix the issue.

@peastman
Copy link
Member Author

peastman commented Aug 2, 2022

We don't directly use MKL anywhere. It's coming from PyTorch. The build scripts for PyTorch are designed to locate lots of different BLAS implementations and use whichever one is available. The conda packages for pytorch list a BLAS implementation as a dependency. Which one depends on the platform, MKL on some and OpenBLAS on others.

@mikemhenry
Copy link
Collaborator

What is interesting is I get a different linking error when building on conda-forge CI
ld: warning: dylib libmkl_intel_ilp64.dylib was built for newer macOS version (10.14) than being linked (10.9)
Which means on conda-forge at least I will need to bump the macOS SKD that is being used (at least I think)

@peastman
Copy link
Member Author

peastman commented Aug 2, 2022

That would make sense. 10.9 is a really old version, from 2013. It sounds like PyTorch requires a newer one.

@peastman
Copy link
Member Author

peastman commented Aug 5, 2022

The log shows the build is run on macOS 11.6.8. We don't specify a value for CMAKE_OSX_DEPLOYMENT_TARGET, so I believe it should be building for that version.

@peastman
Copy link
Member Author

peastman commented Aug 7, 2022

CMAKE_OSX_DEPLOYMENT_TARGET is set to 11.6 as we'd expect. CMake is successfully locating the library in /Users/runner/miniconda3/envs/nnpops/lib/libmkl_intel_ilp64.dylib. It's in the environment's lib directory, which should automatically be included in the link path. Strange.

@peastman
Copy link
Member Author

peastman commented Aug 8, 2022

I tried adding export DYLD_LIBRARY_PATH=/Users/runner/miniconda3/envs/nnpops/lib to ensure it would find the library. I now get a different error.

dyld: Symbol not found: _catlas_dset
  Referenced from: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLinearAlgebra.dylib
  Expected in: /Users/runner/miniconda3/envs/nnpops/lib/libBLAS.dylib
 in /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLinearAlgebra.dylib

The symbol _catlas_dset comes from the libBLAS.dylib provided by Apple as part of the Accelerate framework built into the OS. We seem to somehow be getting conflicts between two different BLAS implementations, one from Apple and one from Intel.

@mikemhenry
Copy link
Collaborator

mikemhenry commented Aug 12, 2022

We seem to somehow be getting conflicts between two different BLAS implementations, one from Apple and one from Intel

It looks like there are some things we can try here: https://conda-forge.org/docs/maintainer/knowledge_base.html#blas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add OSX to CI testing matrix
3 participants