
[RFC] MXNet 2.0 JVM Language development #17783

Open · lanking520 opened this issue Mar 6, 2020 · 55 comments
Labels: Clojure, Java, RFC, Roadmap, Scala

@lanking520 (Member) commented Mar 6, 2020

Since MXNet 2.0 development has started, I would like to initiate a discussion about the future development of the JVM languages.

Proposal

  1. Start cleaning up the existing APIs to adapt them to 2.0
  2. Rewrite the whole Scala/Java API from the ground up
  3. Start using DJL (djl.ai) as a frontend for MXNet JVM development
  4. Use the DJL MXNet JNA layer as the low-level API
  5. Use MXNet JavaCPP as the low-level API
  6. (Feel free to add more...)

Statistics

Scala package: [scala-mxnet download statistics chart]

Clojure package: [clojure-mxnet-downloads statistics chart]

@gigasquid @terrytangyuan @zachgk @frankfliu @aaronmarkham

@lanking520 added the Scala, Clojure, and Java labels on Mar 6, 2020
@lanking520 (Member, Author) commented Mar 6, 2020

I would propose Options 3 and 4.

DJL is a new Java framework that builds on top of any engine. It gives Java developers a close-to-NumPy experience in Java and introduces an interface for training and running inference on different ML/DL models from Java.

In the engine layer, we implemented an MXNet-specific engine that gives users access to most of the up-to-date functionality:

MXNet specific

  • deep-numpy ops (MXNet): NumPy operators introduced in MXNet 1.6
  • autograd (MXNet): imperative training is supported through automatic gradient collection
  • Block concept (MXNet Gluon): Java blocks for training and inference
  • CachedOp (MXNet): the new symbolic inference engine

DJL

  • Full training support: we support imperative and symbolic training in Java.
  • Memory collection: the NDManager was introduced to actively track and collect memory (see the sketch after this list); it has sustained 100-hour stable runs in production.
  • MKLDNN: all ops and NDArrays are built on top of MKLDNN acceleration.
  • ModelZoo: the model zoo is a central hub for model storage; it contains model files plus pre-processing and post-processing logic.
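
A minimal Java sketch of the NDManager scoping idea mentioned above, based on DJL's public ai.djl.ndarray API (this example is illustrative only and is not part of the original proposal):

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;

public class NDManagerSketch {
    public static void main(String[] args) {
        // Every NDArray created under this manager is freed when the manager
        // closes, instead of waiting for the garbage collector.
        try (NDManager manager = NDManager.newBaseManager()) {
            NDArray x = manager.ones(new Shape(2, 3));
            NDArray y = x.mul(10).add(1); // imperative, NumPy-like operations
            System.out.println(y);
        } // native memory backing x and y is released here
    }
}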

Maintenance

  • Better loading experience: all native MXNet libraries are uploaded to Maven, and users can easily switch between engine versions.
  • JNA simplification: the JNA/JNI layer is generated, which saves maintenance time and moves more logic into the Java layer.

With the benefits listed above, I would recommend Option 3, the DJL path, since it already covers the most up-to-date MXNet features and supports both symbolic and imperative training and inference.

For Option 4: I am also thinking of bringing our JNA layer back to MXNet so the community can build their own Java/Scala frontends if they don't like DJL.

@lanking520 added the RFC and Roadmap labels on Mar 6, 2020
@terrytangyuan (Member)

I propose options 1 and 2, since it took us a lot of effort to bring MXNet to Scala originally, and there are already adopters of the Scala API in industry (some may not have been disclosed yet). But I am open to other options. I am not familiar with DJL, though I assume @frankfliu and @lanking520 are the masters behind it.

@gigasquid (Member) commented Mar 9, 2020

@lanking520 thanks for the clarification above. A further question: how do you envision current Scala MXNet users migrating their code? Will it be mostly reusable, or will it be a complete rewrite for them?

@zachgk (Contributor) commented Mar 9, 2020

It is going to be closer to a complete rewrite. On the other hand, a new Scala API would be imperative instead of symbolic, and I think there are going to be a lot of operator changes to better match NumPy in 2.0. I don't think the migration cost for a Scala 2.0 would be that much less anyway.

For users who don't want a full rewrite, they can continue using an old release or whatever new releases we make on the v1.x branch.

@lanking520 reopened this on Mar 9, 2020
@gigasquid (Member) commented Mar 9, 2020

For the Clojure package, it is a lot easier to interop with Java than with Scala - so if the base that everything is using is Java, it will be better for Clojure.

@szha (Member) commented Mar 16, 2020

+1 for options 1 and 2. Also +1 for 4, as long as it doesn't add a dependency.

My concern with 3 and 4 is that DJL is a separate project with its own release cycle. Having it support MXNet's inference will cause delays while DJL upgrades to the latest version. This will also complicate testing and validation.

Overall, I think a minimal set of APIs, at least for inference, is needed for MXNet JVM ecosystem users.

@leezu (Contributor) commented Apr 13, 2020

Another data point is that all of our Scala tests fail randomly with src/c_api/c_api_profile.cc:141: Check failed: !thread_profiling_data.calls_.empty():, so there seem to be some underlying issues.

#17067

@leezu (Contributor) commented Apr 24, 2020

Another data point is that we currently only support OpenJDK 8, but the JVM languages are broken with OpenJDK 11, which is used on Ubuntu 18.04 for example. See #18153

@lanking520 (Member, Author)

@szha For option 4, I would recommend consuming the JNA layer as a submodule from DJL. I am not sure whether this recommendation counts as "adding a dependency in MXNet".

There are two key reasons supporting that:

  1. DJL moves really fast, so we can quickly change the JNA layer whenever needed, compared with the merging speed in MXNet.

  2. Consuming it as a submodule means the MXNet community doesn't have to take on much of the maintenance. The DJL team will regularly provide a JAR for MXNet users to consume.

We can also contribute the code back to the MXNet repo, since it is open source. But we may still keep a copy in our repo for fast iteration, which may cause the JNA layer versions to diverge.

Overall, my recommendation for option 4 leads toward consuming the DJL JNA layer as a submodule.

@szha (Member) commented Apr 28, 2020

@lanking520 would it create a circular dependency? And how stable is the JNA layer, and what changes are expected? It would be great if you could share a pointer to the JNA code to help clarify these concerns.

@lanking520 (Member, Author)

There is no hand-written code for the JNA layer; everything is generated. This enforces a general standard and keeps the C layer minimal, avoiding errors and mistakes.

For more information about the JNA generation, see jnarator: we built an entire project for the JNA generation pipeline. All we need from MXNet is a header file to build everything. The dependencies required by the Gradle build are minimal, as you can find here.

To address the concern about stability, we tested DJL MXNet with a 100-hour inference run on a server and it remained stable. The training experience is also smooth; a 48-hour multi-GPU run is stable as well. Performance is very close to Python for large models, and there may be a large boost for models at or below the "SqueezeNet level".

@frankfliu can bring more information about the JNA layer.
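
As a concrete illustration of the generated-binding idea described above: a JNA binding is essentially a Java interface that mirrors the C header and is loaded against libmxnet at runtime. The sketch below is hypothetical (the MxnetLibrary interface name and loading code are made up for illustration; only MXGetVersion is a real MXNet C API function):

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.ptr.IntByReference;

public class JnaSketch {
    // One Java method per C function declared in mxnet's c_api.h.
    interface MxnetLibrary extends Library {
        int MXGetVersion(IntByReference out); // C signature: int MXGetVersion(int *out)
    }

    public static void main(String[] args) {
        // Requires libmxnet to be on the JNA library path.
        MxnetLibrary lib = Native.load("mxnet", MxnetLibrary.class);
        IntByReference version = new IntByReference();
        lib.MXGetVersion(version);
        System.out.println("MXNet version: " + version.getValue());
    }
}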

@szha (Member) commented Apr 28, 2020

My understanding is that DJL depends on MXNet, so if you want to bring the JNA layer from DJL into MXNet, it will create a circular dependency as a 3rdparty module. In terms of stability, I was referring to the development of the code base rather than the performance.

@saudet commented Jul 23, 2020

Hi, instead of JNA, I would be happy to provide bindings for the C API and maintain packages based on the JavaCPP Presets here:
https://github.com/bytedeco/javacpp-presets/tree/master/mxnet
JavaCPP adds no overhead, unlike JNA, and is often faster than manually written JNI. Plus, JavaCPP provides more tools than JNA to automate the process of parsing header files as well as packaging native libraries in JAR files. I have been maintaining modules for TensorFlow based on JavaCPP, and we actually got a boost in performance compared to the original JNI code:
tensorflow/java#18 (comment)
I would be able to do the same for MXNet and maintain the result in a repository of your choice. Let me know if this sounds interesting! BTW, the developers of DJL also seem open to switching from JNA to JavaCPP, even though it is not a huge priority. Still, standardizing how native bindings are created and loaded with other libraries for which JavaCPP is pretty much already the standard (such as OpenCV, TensorFlow, CUDA, FFmpeg, LLVM, Tesseract) could go a long way in alleviating concerns about stability.
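
For a sense of what calling the C API through such a binding looks like from Java, here is a minimal sketch against the existing org.bytedeco.mxnet presets (MXGetVersion is the real C API function; the exact generated overloads and the surrounding class are assumptions made for illustration):

import org.bytedeco.javacpp.Loader;
import static org.bytedeco.mxnet.global.mxnet.MXGetVersion;

public class JavaCppSketch {
    public static void main(String[] args) {
        // Extracts and loads libmxnet and its bundled dependencies from the JARs.
        Loader.load(org.bytedeco.mxnet.global.mxnet.class);
        int[] version = new int[1];
        MXGetVersion(version); // C signature: int MXGetVersion(int *out)
        System.out.println("MXNet version: " + version[0]);
    }
}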

@szha (Member) commented Jul 23, 2020

@saudet this looks awesome! An 18% improvement in throughput is quite significant for just switching how a frontend binding integrates. I think we should definitely start with this offering. @lanking520 @gigasquid what do you think?

@gigasquid (Member)

@saudet @szha - I think it would be a good path forward (from the Clojure perspective).

@lanking520 (Member, Author) commented Jul 23, 2020

@saudet Thanks for your proposal. I have a few questions I would like to ask you:

  1. If we adopt the JavaCPP package, how will it be consumed? Under bytedeco or Apache MXNet? Essentially, from our previous discussion, we really don't want another 3rdparty check-in.

  2. Can you also benchmark the MXNet API's performance and possibly share reproducible code? We tested JavaCPP vs. JNA vs. JNI and didn't see much difference in performance (under 10%).

  • MXImperativeInvokeEx
  • CachedOpForward

These two methods are the most frequently used for a minimal inference request, so please try them and see how the performance goes.

  3. We do have some additional technical issues with JavaCPP; is there any plan to fix them? (I will put them into a separate comment since it is really big.)

  4. How do you ensure the performance is the same if the build flags are different? For example, MXNet has to be built from source (with the necessary modifications to the source code) in order to work with JavaCPP.

  5. Regarding the dependency issue, can we go without the additional OpenCV and OpenBLAS in the package?

@lanking520 (Member, Author)

What's inside of javacpp-presets-mxnet

  • Native shared libraries:
    • libmxnet.so
    • libjnimxnet.so
    • libmkldnn.0.so
  • MXNet scala and java classes
  • javacpp-presets-mxnet Java API implementations
  • javacpp generated native bindings
    • mxnet C_API
    • mxnet-predict C_API

What's missing

javacpp-presets-mxnet doesn't expose the APIs from nnvm/c_api.h (some of the current Python/Gluon API depends on APIs in nnvm/c_api.h)

What are the dependencies

org.bytedeco.mxnet:ImageClassificationPredict:jar:1.5-SNAPSHOT
+- org.bytedeco:mxnet-platform:jar:1.4.0-1.5-SNAPSHOT:compile
|  +- org.bytedeco:opencv-platform:jar:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:android-arm:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:android-arm64:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:android-x86:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:android-x86_64:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:ios-arm64:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:ios-x86_64:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:linux-x86:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:linux-x86_64:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:linux-armhf:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:linux-ppc64le:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:macosx-x86_64:4.0.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:opencv:jar:windows-x86:4.0.1-1.5-SNAPSHOT:compile
|  |  \- org.bytedeco:opencv:jar:windows-x86_64:4.0.1-1.5-SNAPSHOT:compile
|  +- org.bytedeco:openblas-platform:jar:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:android-arm:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:android-arm64:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:android-x86:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:android-x86_64:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:ios-arm64:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:ios-x86_64:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:linux-x86:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:linux-x86_64:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:linux-armhf:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:linux-ppc64le:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:macosx-x86_64:0.3.5-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:openblas:jar:windows-x86:0.3.5-1.5-SNAPSHOT:compile
|  |  \- org.bytedeco:openblas:jar:windows-x86_64:0.3.5-1.5-SNAPSHOT:compile
|  +- org.bytedeco:mkl-dnn-platform:jar:0.18.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:mkl-dnn:jar:linux-x86_64:0.18.1-1.5-SNAPSHOT:compile
|  |  +- org.bytedeco:mkl-dnn:jar:macosx-x86_64:0.18.1-1.5-SNAPSHOT:compile
|  |  \- org.bytedeco:mkl-dnn:jar:windows-x86_64:0.18.1-1.5-SNAPSHOT:compile
|  \- org.bytedeco:mxnet:jar:1.4.0-1.5-SNAPSHOT:compile
\- org.bytedeco:mxnet:jar:macosx-x86_64:1.4.0-1.5-SNAPSHOT:compile
   +- org.bytedeco:opencv:jar:4.0.1-1.5-SNAPSHOT:compile
   +- org.bytedeco:openblas:jar:0.3.5-1.5-SNAPSHOT:compile
   +- org.bytedeco:mkl-dnn:jar:0.18.1-1.5-SNAPSHOT:compile
   +- org.bytedeco:javacpp:jar:1.5-SNAPSHOT:compile
   +- org.slf4j:slf4j-simple:jar:1.7.25:compile
   |  \- org.slf4j:slf4j-api:jar:1.7.25:compile
   \- org.scala-lang:scala-library:jar:2.11.12:compile

Building the project from source

I spent 40 minutes building the project on my Mac, and I had to apply some hacks to build it.

  • It downloads the MXNet source code and applies some hacks to it
  • It uses its own set of compiler flags to build libmxnet.so
  • It also builds the MXNet Scala project.

Classes

See javadoc: http://bytedeco.org/javacpp-presets/mxnet/apidocs/

  1. The Java class name is "mxnet", which does not follow Java naming conventions
  2. Each pointer has a corresponding Java class, which is arguable. Exposing them as strongly typed classes is only necessary if they are meant to be used directly by end developers, but they really should only be internal implementation details of the API. It's overkill to expose them as a type instead of just a pointer.
  3. All the classes (except mxnet.java) are hand-written.
  4. The API mappings are hand-coded as well.

Performance

The JavaCPP native library load takes a long time: it takes 2.6 seconds on average to initialize libmxnet.so with JavaCPP.

Loader.load(org.bytedeco.mxnet.global.mxnet.class);
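
(For reference, a load-time check like the one above can be reproduced with something along these lines; this is a sketch assuming the org.bytedeco.mxnet presets artifact is on the classpath, not the exact benchmark used here.)

long start = System.nanoTime();
Loader.load(org.bytedeco.mxnet.global.mxnet.class); // triggers extraction and loading of the bundled native libraries
System.out.printf("Loader.load took %.0f ms%n", (System.nanoTime() - start) / 1e6);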

Issues

The open source code on GitHub doesn't match the binary release on Maven Central:

  • The Maven group and the Java package name are different.
  • The C predict API is not included in the Maven version.
  • The example code doesn't work with the Maven artifacts; it can only be built locally against a snapshot version.

@saudet commented Jul 25, 2020

@saudet Thanks for your proposal. I have a few questions I would like to ask you:

  1. If we adopt the JavaCPP package, how will it be consumed? Under bytedeco or Apache MXNet? Essentially, from our previous discussion, we really don't want another 3rdparty check-in.

We can go either way, but I have found that contemporary projects like Deeplearning4j, MXNet, PyTorch, or TensorFlow, which need to develop high-level APIs on top of something like JavaCPP, prefer to have control over everything in their own repositories, and use JavaCPP pretty much like we would use Cython or pybind11 with setuptools for Python.

I started the JavaCPP Presets because, for traditional projects such as OpenCV, FFmpeg, LLVM, etc., high-level APIs for languages other than C/C++ are not being developed as part of those projects. I also realized the Java community needed something like Anaconda...

  2. Can you also benchmark the MXNet API's performance and possibly share reproducible code? We tested JavaCPP vs. JNA vs. JNI and didn't see much difference in performance (under 10%).

    • MXImperativeInvokeEx

    • CachedOpForward

These two methods are the most frequently used for a minimal inference request, so please try them and see how the performance goes.

If you're doing only batch operations, as would be the case for Python bindings, you're not going to see much difference, no. What you need to look at are things like the Indexer package, which allows us to implement fast custom operations in Java like this: http://bytedeco.org/news/2014/12/23/third-release/
You're not going to be able to do that with JNA or JNI without essentially rewriting that sort of thing.
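
To make the Indexer point concrete, here is a minimal sketch of element-wise access over a native buffer through org.bytedeco.javacpp.indexer (an illustration of the mechanism only, not code taken from the linked post):

import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.indexer.FloatIndexer;

public class IndexerSketch {
    public static void main(String[] args) {
        long rows = 2, cols = 3;
        // A native float buffer, e.g. memory that could be shared with a C/C++ library.
        FloatPointer data = new FloatPointer(rows * cols);
        // View the raw pointer as a 2-D array with row-major strides.
        FloatIndexer idx = FloatIndexer.create(data, new long[] {rows, cols}, new long[] {cols, 1});
        for (long i = 0; i < rows; i++) {
            for (long j = 0; j < cols; j++) {
                idx.put(i, j, (float) (i * 10 + j)); // element-wise access without per-element JNI overhead
            }
        }
        System.out.println(idx.get(1, 2)); // prints 12.0
        data.deallocate();
    }
}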

  3. We do have some additional technical issues with JavaCPP; is there any plan to fix them? (I will put them into a separate comment since it is really big.)

  4. How do you ensure the performance is the same if the build flags are different? For example, MXNet has to be built from source (with the necessary modifications to the source code) in order to work with JavaCPP.

  5. Regarding the dependency issue, can we go without the additional OpenCV and OpenBLAS in the package?

Yes, those are the kinds of issues that would best be dealt with by using JavaCPP only as a low-level tool, instead of the presets, which are basically a high-level distribution like Anaconda.

@saudet commented Jul 26, 2020

What's missing

javacpp-presets-mxnet doesn't expose the APIs from nnvm/c_api.h (some of the current Python/Gluon API depends on APIs in nnvm/c_api.h)

I've added that the other day, thanks to @frankfliu for pointing this out: bytedeco/javacpp-presets@976e6f7

See javadoc: http://bytedeco.org/javacpp-presets/mxnet/apidocs/

  1. The Java class name is "mxnet", which does not follow Java naming conventions

That's not hardcoded. We can use whatever name we want for that class.

  2. Each pointer has a corresponding Java class, which is arguable. Exposing them as strongly typed classes is only necessary if they are meant to be used directly by end developers, but they really should only be internal implementation details of the API. It's overkill to expose them as a type instead of just a pointer.

We can map everything to Pointer, that's not a problem either.

  3. All the classes (except mxnet.java) are hand-written.

No, they are not. Everything in the src/gen directory here is generated at build time:
https://github.com/bytedeco/javacpp-presets/tree/master/mxnet/src/gen/java/org/bytedeco/mxnet

  4. The API mappings are hand-coded as well.

If you're talking about this file, yes, that's the only thing that is written manually:
https://github.com/bytedeco/javacpp-presets/blob/master/mxnet/src/main/java/org/bytedeco/mxnet/presets/mxnet.java
(The formatting is a bit crappy, I haven't touched it in a while, but we can make it look prettier like this:
https://github.com/bytedeco/javacpp-presets/blob/master/onnxruntime/src/main/java/org/bytedeco/onnxruntime/presets/onnxruntime.java )

Performance

The JavaCPP native library load takes a long time: it takes 2.6 seconds on average to initialize libmxnet.so with JavaCPP.

Loader.load(org.bytedeco.mxnet.global.mxnet.class);

Something's wrong; that takes less than 500 ms on my laptop, and that includes loading OpenBLAS, OpenCV, and a lookup for CUDA and MKL, which can obviously be optimized... In any case, we can debug that later to see what is going wrong on your end.

Issues

The open source code on GitHub doesn't match the binary release on Maven Central:

  • The Maven group and the Java package name are different.

Both the group ID and the package names are org.bytedeco, but in any case, if that gets maintained somewhere here, I imagine it would be changed to something like org.apache.mxnet.xyz.internal.etc

  • The C predict API is not included in the Maven version.

Yes it is: http://bytedeco.org/javacpp-presets/mxnet/apidocs/org/bytedeco/mxnet/global/mxnet.html

  • The example code doesn't work with the Maven artifacts; it can only be built locally against a snapshot version.

https://github.com/bytedeco/javacpp-presets/tree/master/mxnet/samples works fine for me on Linux:

$ mvn -U clean compile exec:java -Djavacpp.platform.custom -Djavacpp.platform.host -Dexec.args=apple.jpg
...
[Maven output elided: the build resolves the mxnet-platform, mxnet, opencv, openblas, javacpp, and related snapshot artifacts from oss.sonatype.org]
...
Best Result: Granny Smith (id=948, accuracy=0.96502399)
run successfully

What is the error that you're getting? I've also tested on Mac just now and still no problems.

@lanking520 (Member, Author)

@saudet Thanks for your reply. Still, I am concerned about the first question:

you mentioned:

We can go either way, but I have found that contemporary projects like Deeplearning4j, MXNet, PyTorch, or TensorFlow, which need to develop high-level APIs on top of something like JavaCPP, prefer to have control over everything in their own repositories, and use JavaCPP pretty much like we would use Cython or pybind11 with setuptools for Python.

We are looking for a robust solution for MXNet Java developers to use, especially one owned and maintained by the Apache MXNet community. I would be more than happy if you would like to contribute the source code that generates the MXNet JavaCPP package to this repo, so that we can own the maintenance and be responsible to end users for keeping the package reliable.

At the beginning, we discussed several ways to preserve a low-level Java API for MXNet that anyone who uses Java can start with. Most of the problems lay in the ownership and maintenance part. I have added the JavaCPP approach as option 5 so we can see which one works best in the end.

@terrytangyuan (Member)

This is a great discussion. Thanks @lanking520 for initiating it. Perhaps we can define some key metrics here so we can compare the solutions later?

@hmf commented Oct 1, 2020

@lanking520 In regards to the Scala API, access via Java is just fine. I am sure someone with the itch may end up providing a Scala wrapper 8-)

@lanking520 (Member, Author)

@saudet If it is a Maven package, consuming it should be fine as long as the license doesn't fall into a category the ASF doesn't approve of (no license, GPL, LGPL, etc.).
I would +1 the JavaCPP solution you mentioned. One last question is the maintenance cost: since JavaCPP does the generation work, how much maintenance does it require from the community to keep it here?

@hmf Sure, please go ahead and create one if you feel it is necessary once we have the Java API.

So I would like to summarize the topic here:

  1. Go for the JavaCPP solution for its better performance. The source code will also become part of Apache MXNet. In 2.0, we expect a CI/CD pipeline for the MXNet low-level Java API.

  2. Go for the JNA build pipeline: it can be used out of the box now without issue. Similarly, the maintenance is very low and fewer dependencies are required. The source code can also be donated to Apache MXNet.

Both solutions target an MXNet low-level Java API.

@gigasquid @leezu @szha @zachgk @terrytangyuan @yzhliu Any thoughts?

@saudet commented Oct 6, 2020

@saudet If it is a Maven package, consuming it should be fine as long as the license doesn't fall into a category the ASF doesn't approve of (no license, GPL, LGPL, etc.).

Great! Thanks for the clarification. It's Apache v2, so the license is alright.

I would +1 the JavaCPP solution you mentioned. One last question is the maintenance cost: since JavaCPP does the generation work, how much maintenance does it require from the community to keep it here?

I've created a branch with a fully functional build that bundles MXNet with wrappers for the C API, on my fork here:
https://github.com/saudet/incubator-mxnet/tree/add-javacpp
It uses the defaults for CMake, but without CUDA or OpenCV, and I'm guessing it works on Mac and Windows too, but I've only tested on Linux (Fedora), which outputs the following, mapping all declarations of typedef void* to Pointer like you asked:

$ git clone https://github.com/saudet/incubator-mxnet
$ cd incubator-mxnet
$ git checkout add-javacpp
$ cd java
$ gradle clean build --info
...
org.apache.mxnet.internal.c_api.UnitTest > test STANDARD_OUT
    20000
...
BUILD SUCCESSFUL in 1m 3s
10 actionable tasks: 10 executed
...
$ ls -lh build/libs/
total 38M
-rw-rw-r--. 1 saudet saudet 49K Oct  6 20:54 mxnet-2.0-SNAPSHOT.jar
-rw-rw-r--. 1 saudet saudet 38M Oct  6 20:54 mxnet-2.0-SNAPSHOT-linux-x86_64.jar

The number of lines directly related to JavaCPP is less than 100, so even if I die, anyone can maintain it. I'm sure that's going to grow a bit, but a C API is very easy to maintain. For example, the presets for the C API of TensorFlow 2.x had to be updated only 10 times over the course of the past year: https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/c_api/presets/tensorflow.java

@saudet commented Oct 13, 2020

I've pushed changes that show how to use JavaCPP with maven-publish to my fork here:
https://github.com/saudet/incubator-mxnet/tree/add-javacpp/java
Running gradle publish or something equivalent also deploys an mxnet-platform artifact that can be used this way:
https://github.com/bytedeco/javacpp-presets/wiki/Reducing-the-Number-of-Dependencies

For example, with this pom.xml file:

<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.apache</groupId>
    <artifactId>mxnet-sample</artifactId>
    <version>2.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache</groupId>
            <artifactId>mxnet-platform</artifactId>
            <version>2.0-SNAPSHOT</version>
        </dependency>
    </dependencies>
</project>

We can transitively filter out all artifacts that are not for Linux x86_64 this way:

$ mvn dependency:tree -Djavacpp.platform=linux-x86_64
[INFO] Scanning for projects...
[INFO] 
[INFO] ----------------------< org.apache:mxnet-sample >-----------------------
[INFO] Building mxnet-sample 2.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mxnet-sample ---
[INFO] org.apache:mxnet-sample:jar:2.0-SNAPSHOT
[INFO] \- org.apache:mxnet-platform:jar:2.0-SNAPSHOT:compile
[INFO]    +- org.bytedeco:javacpp-platform:jar:1.5.5-SNAPSHOT:compile
[INFO]    |  +- org.bytedeco:javacpp:jar:1.5.5-SNAPSHOT:compile
[INFO]    |  \- org.bytedeco:javacpp:jar:linux-x86_64:1.5.5-SNAPSHOT:compile
[INFO]    +- org.apache:mxnet:jar:2.0-SNAPSHOT:compile
[INFO]    \- org.apache:mxnet:jar:linux-x86_64:2.0-SNAPSHOT:compile
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.360 s
[INFO] Finished at: 2020-10-13T21:19:51+09:00
[INFO] ------------------------------------------------------------------------

And we can do the same with the platform plugin of Gradle JavaCPP:
https://github.com/bytedeco/gradle-javacpp#the-platform-plugin

@gigasquid (Member)

As far as my feedback on the two options:

  1. Go for the JavaCPP solution for its better performance. The source code will also become part of Apache MXNet. In 2.0, we expect a CI/CD pipeline for the MXNet low-level Java API.
  2. Go for the JNA build pipeline: it can be used out of the box now without issue. Similarly, the maintenance is very low and fewer dependencies are required. The source code can also be donated to Apache MXNet.

They both sound reasonable and like improvements to the system. Thank you both @lanking520 and @saudet for your time and effort. The one aspect that I haven't heard discussed is the implementation of the base Java API - in particular, is anyone planning on tackling this? If so, the person(s) doing that development work might have a preference that would weight it one way or the other.

@saudet commented Nov 11, 2020

Here's another potential benefit of going with a tool like JavaCPP. I've started publishing packages for TVM that bundle its Python API and also wrap its C/C++ API:

Currently, the builds have CUDA/cuDNN, LLVM, MKL, and MKL-DNN/DNNL/oneDNN enabled on Linux, Mac, and Windows, but users do not need to install anything at all--not even CPython! All dependencies get downloaded automatically with Maven (although we can use manually installed ones too if we want). It also works out of the box with GraalVM Native Image and Quarkus this way:

For deployment, the TVM runtime gets built separately, so it's easy to filter everything and get JAR files that are less than 1 MB, without having to recompile anything at all! It's also easy enough to set up the build in a way that offers a user-friendly interface for generating just the right amount of JNI (in addition to enabling only the backends we are interested in) to get even smaller JAR files. The manually written JNI code currently in TVM's repository doesn't support that. Moreover, it is inefficiently written in a fashion similar to the original JNI code in TensorFlow (see #17783 (comment) above), so we can assume that using JavaCPP would provide a similar boost in performance there as well.

If TVM is eventually integrated into MXNet, as per #15465 for example, this might be worth thinking about right now. For most AI projects, Java is used mainly at deployment time, and manually written JNI or automatically generated JNA isn't going to help much in that case.

@szha (Member) commented Nov 22, 2020

Thanks all for the discussion. @saudet would you help bootstrap the adoption of JavaCPP in MXNet to get it off the ground? I'm happy to help facilitate any necessary testing infrastructure work.

@saudet commented Nov 28, 2020

@szha Thanks! Could you let me know what, if anything, would be missing to get this initial contribution into master? https://github.com/saudet/incubator-mxnet/tree/add-javacpp/java Probably a little README.md file would be nice, but other than that?

@szha (Member) commented Nov 28, 2020

In order for it to be adopted by developers and users, I expect a new language binding to have the following:

  • Tests that are enabled in the CI for PRs. Adding them to any of the existing pipelines should be OK.
  • A documentation website. The javadocs should be built and linked from the main website so that others know what APIs exist. Docs are in the docs folder, and here is how docs from different language bindings are packaged together for publishing.

@saudet commented Dec 7, 2020

Ok, I'm able to start looking into that.

Well, as far as a "language binding" goes, it would basically be just the C API for starters. I think that would be enough for DJL though.
@lanking520 @frankfliu Would there be anything specific from your team?

For Jenkins, I assume I'd need to get access to the server and everything to do something with that myself...
What about GitHub Actions? I see there is some work going on with those. Are there plans to switch to that?

For the docs, that would be something like the Jenkinsfile_website_java_docs in the v1.x branch?
I also see a couple of short Markdown files there for getting started and tutorials, so something like that... using the C API?

@szha (Member) commented Dec 7, 2020

@saudet for setting up the pipeline, we just need to add a step in existing Jenkinsfiles. I can help facilitate any need for access to the CI.

@lanking520 (Member, Author)

Well, as far as a "language binding" goes, it would basically be just the C API for starters. I think that would be enough for DJL though. @lanking520 @frankfliu Would there be anything specific from your team?

For Jenkins, I assume I'd need to get access to the server and everything to do something with that myself... What about GitHub Actions?

I would recommend providing a basic Java interface that all Java developers can build a frontend on. As Sheng mentioned, you can start with the Jenkins template and add a Java publish job to it.

@saudet commented Dec 8, 2020

I don't really want to deal with CI, especially Jenkins; it's a major time sink and completely unnecessary with services like GitHub Actions these days, but let's see if I can figure out what needs to be done. If I take the Jenkinsfile_centos_cpu script for Python, it ends up calling functions from here, which basically install environments, run builds, and execute stuff for Python:
https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh
Is my understanding correct that these scripts are going to need some refactoring to be able to reuse some of that for Java?

If I follow my instincts, I think it's probably going to be easier to look at what's been done for the other minor bindings, such as Julia, but I'm not seeing anything in the Jenkins files for that one:
https://github.com/apache/incubator-mxnet/search?q=julia
How does that one work?

BTW, there's one thing we've neglected to cover. I was under the impression that MXNet was using Cython to access the C API for its Python binding, but it looks like it's using ctypes. TensorFlow started with SWIG and now uses pybind11, and the closest Java equivalent to those is JavaCPP; that is, they support C++ by generating additional code for the bindings at build time, so it makes sense to use JavaCPP in the case of TensorFlow to follow what the core developers are doing for Python.

On the other hand, if MXNet uses ctypes for Python and has no intention of changing, the closest equivalent in Java land would be JNA. They are both "slow" (partly because of libffi) and support only C APIs, but they can link dynamically at runtime without having to build anything, and I'm assuming that's why there is no CI for Julia, for example. So, is the plan for Python to stick with ctypes? Browsing through #17097, I guess that's still not settled? In my opinion, it would make sense to harmonize the binding strategy for Java with the one for Python.

@leezu (Contributor) commented Dec 8, 2020

BTW, there's one thing we've neglected to cover. I was under the impression that MXNet was using Cython to access the C API for its Python binding, but it looks like it's using ctypes.

MXNet supports both Cython and ctypes (as a fallback) for the Python interface. It depends on your build configuration.
https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L91
We may want to change the default for MXNet 2

I don't really want to deal with CI, especially Jenkins; it's a major time sink and completely unnecessary with services like GitHub Actions these days

It's also fine to use GitHub Actions if that's easier for you. The main reason for using Jenkins is that the MXNet test suite is too large for a free service such as GitHub Actions and that there are also GPU tests involved. Java tests can initially run on GitHub Actions and be migrated to Jenkins later based on need.

@saudet commented Dec 23, 2020

I've updated my fork with a workflow for Java build on GitHub Actions: saudet@2be0540
Please let me know what you think of this!

It's currently building and testing for Linux (CentOS 7), Mac, and Windows on x86_64 with and without CUDA:
https://github.com/saudet/incubator-mxnet/actions/runs/437312011
(It looks like the build doesn't work for CUDA 11.1 with Visual Studio 2019 yet, but that's unrelated to Java.)

Since my account at Sonatype doesn't have deploy access to org.apache, the artifacts are getting deployed here for now:
https://oss.sonatype.org/content/repositories/snapshots/org/bytedeco/mxnet/2.0-SNAPSHOT/

But this can be changed by updating only a single line here:
https://github.com/saudet/incubator-mxnet/blob/add-javacpp/java/build.gradle#L8

In any case, the javadoc secondary artifact also gets deployed as part of the build there.
Where does the publishing to the main site happen? Somewhere in here by the looks of it:
https://github.com/apache/incubator-mxnet/tree/master/ci/publish/website
We can fetch the latest javadoc archive this way, so I assume we could add that to the scripts?

$ mvn dependency:get -Dartifact=org.bytedeco:mxnet:2.0-SNAPSHOT:javadoc
$ unzip ~/.m2/repository/org/bytedeco/mxnet/2.0-SNAPSHOT/mxnet-2.0-SNAPSHOT-javadoc.jar -d ...

It's also fine to use GitHub Actions if that's easier for you. The main reason for using Jenkins is that the MXNet test suite is too large for a free service such as GitHub Actions and that there are also GPU tests involved. Java tests can initially run on GitHub Actions and be migrated to Jenkins later based on need.

For that, GitHub Actions now supports self-hosted runners: we just need to provision some machines in the cloud somewhere and install the equivalent of a Jenkins agent on them, and that's it. Much easier than maintaining Jenkins.

@leezu (Contributor) commented Dec 23, 2020

Thank you @saudet. You can take a look at https://infra.apache.org/publishing-maven-artifacts.html for more information on the Apache Software Foundation (ASF) maven artifact publishing process. Summary: Release candidate artifacts are pushed to a staging area and can be promoted after the release vote passed.

One thing to note is that ASF policies do not allow publishing unreleased (nightly) artifacts to the general public. Those should be placed in a special location and only used by interested community members. You can take a look at http://www.apache.org/legal/release-policy.html#publication and this FAQ entry: http://www.apache.org/legal/release-policy.html#host-rc
Do you have any suggestions on how best to handle this with your GitHub Actions script / Maven?

For that, GitHub Actions now supports self-hosted runners: we just need to provision some machines in the cloud somewhere and install the equivalent of a Jenkins agent on them, and that's it. Much easier than maintaining Jenkins.

GitHub Actions isn't very mature yet. You can see in the docs that "Self-hosted runners on GitHub do not have guarantees around running in ephemeral clean virtual machines, and can be persistently compromised by untrusted code in a workflow." I don't think that's acceptable for projects accepting contributions from the general public.

@leezu (Contributor) commented Dec 23, 2020

I downloaded mxnet-2.0-20201222.141246-19-linux-x86_64.jar and found that

% ldd org/apache/mxnet/internal/c_api/linux-x86_64/libmxnet.so
        linux-vdso.so.1 (0x00007fff65fdc000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f46015a3000)
        libgfortran.so.3 => not found
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f4601598000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4601575000)
        libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f4601533000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f4601352000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4601201000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f46011e6000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4600ff4000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f460b892000)

As libgfortran has changed its ABI a few times over the years, you will need to include libgfortran.so in the JAR (which we can distribute under the AL2 license thanks to the GCC Runtime Library Exception). However, you must not include libquadmath.so (a dependency of libgfortran.so), as it is GPL licensed.

For the GPU version mxnet-2.0-20201222.141246-19-linux-x86_64-gpu.jar, would it make sense to use cu110 instead of gpu if it is built with CUDA 11.0, etc.?

@marcoabreu (Contributor)

Regarding security: I think the quoted paragraph describes the same (in)securities as our Jenkins setup, doesn't it?

@leezu (Contributor) commented Dec 23, 2020

I don't think so. Microsoft specifically says "We recommend that you do not use self-hosted runners with public repositories." That indicates to me that they have very little confidence in their security model. https://docs.github.com/en/free-pro-team@latest/actions/hosting-your-own-runners/about-self-hosted-runners#self-hosted-runner-security-with-public-repositories

@marcoabreu (Contributor)

Yes, I think they are describing the same security problem we have with our Jenkins slaves. Any user could run arbitrary code and install a rootkit. Hence the separation towards restricted slaves.

So from that point of view, I don't consider the GitHub Actions self-hosted runners any less secure than our Jenkins slaves. But of course they are still insecure.

@leezu (Contributor) commented Dec 23, 2020

The problem with runners I had in mind is that there used to be no API to start new instances for each job; instead, the instances had to be up and running all the time and would be re-used for all jobs. Thus any compromise would be truly persistent. We don't do that in our Jenkins setup, where instances are terminated from time to time.

But I just checked the GitHub documentation, and the Microsoft team has resolved this issue and now provides an API that can provision new runners on demand. So if there are volunteers, it should be fine to migrate to GitHub Actions. For example, https://040code.github.io/2020/05/25/scaling-selfhosted-action-runners

@saudet commented Dec 25, 2020

Thank you @saudet. You can take a look at https://infra.apache.org/publishing-maven-artifacts.html for more information on the Apache Software Foundation (ASF) maven artifact publishing process. Summary: Release candidate artifacts are pushed to a staging area and can be promoted after the release vote passed.

Thanks for the links! I've been publishing to the Maven Central Repository, so I know how that works.

One thing to note is that ASF policies do not allow publishing unreleased (nightly) artifacts to the general public. Those should be placed in a special location and only used by interested community members. You can take a look at http://www.apache.org/legal/release-policy.html#publication and this FAQ entry: http://www.apache.org/legal/release-policy.html#host-rc
Do you have any suggestions on how best to handle this with your GitHub Actions script / Maven?

It doesn't sound to me like they forbid publishing snapshots, just that it shouldn't be documented, which is weird, but whatever. It should be alright to deploy snapshots and keep it a "secret", no? They say we "should" do this and that, but if none of their services offers support for Maven artifacts, I suppose this means we can use something else, right?

As libgfortran has changed its ABI a few times over the years, you will need to include libgfortran.so in the JAR (which we can distribute under the AL2 license thanks to the GCC Runtime Library Exception). However, you must not include libquadmath.so (a dependency of libgfortran.so), as it is GPL licensed.

Yes, that's not a problem. However, if we don't have libquadmath.so, libgfortran.so isn't going to load, so is it still useful?

For the GPU version mxnet-2.0-20201222.141246-19-linux-x86_64-gpu.jar, would it make sense to use cu110 instead of gpu if it is built with CUDA 11.0, etc.?

I guess? :) In any case, that's not a problem either. However, it's becoming increasingly irrelevant to try to support multiple versions of CUDA given their accelerating release cycle.

@saudet commented Dec 25, 2020

FWIW, it looks to me like libquadmath is LGPL, not GPL: https://github.com/gcc-mirror/gcc/blob/master/libquadmath/COPYING.LIB

@leezu (Contributor) commented Dec 25, 2020

Yes, that's not a problem. However, if we don't have libquadmath.so, libgfortran.so isn't going to load, so is it still useful?

There haven't been any ABI breaks in libquadmath.so. Thus we can simply ask users to install libquadmath.so by themselves and everything will work. Our users will not be able to find an incompatible libquadmath.so

FWIW, it looks to me like libquadmath is LGPL, not GPL: https://github.com/gcc-mirror/gcc/blob/master/libquadmath/COPYING.LIB

You're right, but the consequence is the same.

It doesn't sound to me like they forbid publishing snapshots, just that it shouldn't be documented, which is weird, but whatever. It should be alright to deploy snapshots and keep it a "secret", no?

Yes. Reading the https://infra.apache.org/publishing-maven-artifacts.html again, there is also https://repository.apache.org/snapshots which may be the best location for snapshots?

They say we "should" do this and that, but if none of their services offers support for Maven artifacts, I suppose this means we can use something else, right?

I'm not sure what you mean.

@saudet commented Jan 5, 2021

Yes, that's not a problem. However, if we don't have libquadmath.so, libgfortran.so isn't going to load, so is it still useful?

There haven't been any ABI breaks in libquadmath.so. Thus we can simply ask users to install libquadmath.so by themselves and everything will work. Our users will not be able to find an incompatible libquadmath.so

FWIW, it looks to me like libquadmath is LGPL, not GPL: https://github.com/gcc-mirror/gcc/blob/master/libquadmath/COPYING.LIB

You're right, but the consequence is the same.

I suppose it's more business friendly, but it's not a requirement for releasing binaries under Apache, correct? That is, this is a policy specific to the MXNet project?

It doesn't sound to me like they forbid publishing snapshots, just that it shouldn't be documented, which is weird, but whatever. It should be alright to deploy snapshots and keep it a "secret", no?

Yes. Reading the https://infra.apache.org/publishing-maven-artifacts.html again, there is also https://repository.apache.org/snapshots which may be the best location for snapshots?

Yes, I saw that too. There's already a few snapshots from MXNet there, so I assume we can use it freely:
http://repository.apache.org/content/groups/snapshots/org/apache/mxnet/
(Oh, and look at that: the artifacts are using a generic "-gpu" string as part of their names. I would like to consult the authors of those artifacts to see what they think of adding versions here before committing to tracking specific versions of CUDA. It's not something I would personally relish doing until someone actually cares about it, at least. The only person who ever complained about this for OpenCV is @ericxsun: bytedeco/javacpp-presets#918 )

Now, where can I get an account for that server... Anyway, someone will just need to put their credentials as secrets for GitHub Actions and then we just need to change the URL here for the snapshots:
https://github.com/saudet/incubator-mxnet/blob/add-javacpp/java/build.gradle#L143

And that's it. They will appear exactly as they currently are on Sonatype:
https://oss.sonatype.org/content/repositories/snapshots/org/bytedeco/mxnet/2.0-SNAPSHOT/

They say we "should" do this and that, but if none of their services offers support for Maven artifacts, I suppose this means we can use something else, right?

I'm not sure what you mean.

I was referring to the snapshot repository, which they do offer, so we're good for that, but if we need something else, it would be good to know what the official stance is concerning the use of external services. I suppose anything from GitHub is OK, but other than that, I wonder.

@leezu (Contributor) commented Jan 5, 2021

I suppose it's more business friendly, but it's not a requirement for releasing binaries under Apache, correct? That is, this is a policy specific to the MXNet project?

It is a requirement to release the binaries under AL2. You can refer to https://www.apache.org/legal/resolved.html for a list of compatible and incompatible licenses. LGPL is Category-X (not allowed) as it places restrictions on the larger work.

Oh, and look at that: the artifacts are using a generic "-gpu" string as part of their names. I would like to consult the authors of those artifacts to see what they think of adding versions here before committing to tracking specific versions of CUDA. It's not something I would personally relish doing until someone actually cares about it, at least.

It's fine with me to re-use the existing pattern if others don't mind.

Now, where can I get an account for that server... Anyway, someone will just need to put their credentials as secrets for GitHub Actions and then we just need to change the URL here for the snapshots:

We can open a ticket with Apache Infra. Would you like to open a PR first?

it would be good to know what the official stance is concerning the use of external services. I suppose anything from GitHub is OK, but other than that, I wonder.

It's fine to use external services as long as the project maintainers (PPMC) control the usage and the published artifacts are compliant with the ASF policies (for example, they don't contain LGPL components).

@saudet commented Jan 26, 2021

Ok, I've finally updated my fork accordingly along with a few additional changes: saudet@0966818
If that looks good enough for a pull request, I'll do that right away! @leezu

@leezu (Contributor) commented Jan 26, 2021

Thank you. It's fine with me. Once you open the PR, @lanking520 and @gigasquid may be able to review it too.
