[RFC] MXNet 2.0 JVM Language development #17783
I would propose Options 3 and 4. DJL is a new Java framework that builds on top of any engine. It makes it easy for Java developers to have a close-to-NumPy experience in Java. It introduces an interface that defines how to train and run inference on different ML/DL models from Java. In the engine layer, we implemented an MXNet-specific engine that lets users access most of the up-to-date functionality: MXNet specific
DJL
Maintenance
With the benefits listed above, I would recommend Option 3, the DJL path, since DJL already covers most up-to-date MXNet features and supports all the different symbolic/imperative training/inference combinations. For Option 4: I am also thinking of bringing our JNA layer back to MXNet so the community can build their own Java/Scala frontend if they don't like DJL. |
I propose options 1 and 2, since it took us a lot of effort to bring MXNet to Scala originally, and there are already adopters of the Scala API in industry (some may not have been disclosed yet). But I am open to other options. I am not familiar with DJL, though I assume @frankfliu and @lanking520 are the masters behind it. |
@lanking520 thanks for the clarification above. A further question - how do you envision a current Scala MXNet user migrating their code? Is it going to be mostly reusable, or is it going to be a complete rewrite for them? |
It is going to be closer to a complete rewrite. On the other hand, a new Scala API would be imperative instead of symbolic, and I think there are going to be a lot of operator changes to better match numpy in 2.0. I don't think the migration cost for a Scala 2.0 would be that much less anyway. Users who don't want a full rewrite can continue using an old release or whatever new releases we make on the v1.x branch. |
For the Clojure package, it is a lot easier to interop with Java than with Scala - so if the base that everything uses is Java, that will be better for Clojure. |
+1 for options 1 and 2. Also +1 for 4, as long as it doesn't add a dependency. My concern with 3 and 4 is that DJL is a separate project with its own release cycle. Relying on it for MXNet's inference will cause delays while DJL upgrades to the latest version, and it will also complicate testing and validation. Overall, I think a minimal set of APIs, at least for inference, is needed for MXNet JVM ecosystem users. |
Another data point is that all of our Scala tests fail randomly with |
Another data point is that we currently only support OpenJDK 8, but the JVM languages are broken with OpenJDK 11, which is used on Ubuntu 18.04 for example. See #18153
@szha For option 4, I would recommend consuming the JNA layer as a submodule from DJL. I am not sure whether this recommendation counts as "adding a dependency in MXNet". There are two key reasons that support this:
We can also contribute code back to the MXNet repo, since it is open source. But we may still keep a copy in our repo for fast iteration, which may cause diverged versions of the JNA layer. Overall, my recommendation for option 4 leads toward consuming the DJL JNA as a submodule. |
@lanking520 would it create a circular dependency? And how stable is the JNA layer, and what changes are expected? It would be great if you could share a pointer to the JNA code to help clarify these concerns. |
There is no hand-written code for JNA; everything is generated. This ensures a consistent standard and a minimal layer over C to avoid errors and mistakes. You can find more information about JNA here: jnarator. We built an entire project for the JNA generation pipeline. All we need is a header file from MXNet to build everything. The dependencies required by the Gradle build are minimal, as you can find here. To address the stability concern: we tested DJL MXNet with a 100-hour inference run on a server and it remained stable. The training experience is also smooth; a 48-hour multi-GPU run is stable as well. The performance is very close to Python with large models, and may bring a huge boost if the model is at or below "squeezenet level". @frankfliu can provide more information about the JNA layer. |
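As a rough illustration of what such a generated layer looks like (this is not code from DJL; it assumes the JNA library is on the classpath and libmxnet is installed, and the interface name is hypothetical), a JNA mapping for one C API function might be:

```java
import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.ptr.IntByReference;

// Hypothetical sketch of a generated JNA mapping for the MXNet C API.
public interface MxnetLibrary extends Library {
    // Loads libmxnet.so/.dylib/.dll from the library path at runtime;
    // no build step is needed, which is what makes JNA comparable to ctypes.
    MxnetLibrary INSTANCE = Native.load("mxnet", MxnetLibrary.class);

    // Mirrors: int MXGetVersion(int *out); from include/mxnet/c_api.h
    int MXGetVersion(IntByReference out);
}
```

A caller would then invoke `MxnetLibrary.INSTANCE.MXGetVersion(out)` and read `out.getValue()`; the generator emits one such declaration per function in the header, which is why only the header file is needed.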
My understanding is that DJL depends on MXNet, so if you want to bring JNA from DJL into MXNet, it will create a circular dependency as a 3rdparty module. In terms of stability, I was referring to the development of the code base rather than the performance. |
Hi, instead of JNA, I would be happy to provide bindings for the C API and maintain packages based on the JavaCPP Presets here: |
@saudet this looks awesome! An 18% improvement in throughput is quite significant for switching the way of integration for a frontend binding. I think we should definitely start with this offering. @lanking520 @gigasquid what do you think? |
@saudet Thanks for your proposal. I have four questions I would like to ask you:
The above two methods are the most frequently used for a minimal inference request; please try these two to see how the performance goes.
|
What's inside of javacpp-presets-mxnet
What's missing
javacpp-presets-mxnet doesn't expose APIs from nnvm/c_api.h (some of the current python/gluon API depends on APIs in nnvm/c_api.h)
What are the dependencies
Building the project from source
I spent 40 minutes building the project on my Mac, and had to make some hacks to get it to build.
Classes
See javadoc: http://bytedeco.org/javacpp-presets/mxnet/apidocs/
Performance
JavaCPP native library loading takes a long time; it takes 2.6 seconds on average to initialize libmxnet.so with JavaCPP. Loader.load(org.bytedeco.mxnet.global.mxnet.class);
Issues
The open source code on GitHub doesn't match the binary release on Maven Central:
|
We can go either way, but I found that contemporary projects like Deeplearning4j, MXNet, PyTorch, or TensorFlow that need to develop high-level APIs on top of something like JavaCPP prefer to have control over everything in their own repositories, and use JavaCPP pretty much like we would use cython or pybind11 with setuptools for Python. I started the JavaCPP Presets because for traditional projects such as OpenCV, FFmpeg, LLVM, etc., high-level APIs for languages other than C/C++ are not being developed as part of those projects. I also realized the Java community needed something like Anaconda...
If you're doing only batch operations, as would be the case for Python bindings, you're not going to see much difference, no. What you need to look at are things like the Indexer package, which allows us to implement fast custom operations in Java like this: http://bytedeco.org/news/2014/12/23/third-release/
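The idea behind Indexer-style access can be sketched in plain Java (using only java.nio, not JavaCPP itself, so the class and method names below are purely illustrative): element-wise work runs against a flat native-backed buffer with computed offsets, instead of crossing JNI once per element.

```java
import java.nio.FloatBuffer;

public class IndexerSketch {
    // View a flat float buffer as a rows x cols row-major matrix and sum one
    // row, computing offsets directly instead of making a JNI call per element.
    static float rowSum(FloatBuffer data, int cols, int row) {
        float sum = 0f;
        for (int j = 0; j < cols; j++) {
            sum += data.get(row * cols + j);
        }
        return sum;
    }

    public static void main(String[] args) {
        // 2 x 3 matrix stored row-major: [[1, 2, 3], [4, 5, 6]]
        FloatBuffer data = FloatBuffer.wrap(new float[] {1, 2, 3, 4, 5, 6});
        System.out.println(rowSum(data, 3, 1)); // prints 15.0 (sum of second row)
    }
}
```

JavaCPP's Indexer classes wrap exactly this kind of offset arithmetic over buffers pointing at native memory, which is where the speedup over per-element JNI or JNA calls comes from.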
Yes, that's the kind of issues that would be best dealt with by using only JavaCPP as a low-level tool, instead of the presets, which is basically a high-level distribution like Anaconda. |
I've added that the other day, thanks to @frankfliu for pointing this out: bytedeco/javacpp-presets@976e6f7
That's not hardcoded. We can use whatever name we want for that class.
We can map everything to
No, they are not. Everything in the
If you're talking about this file, yes, that's the only thing that is written manually:
Something's wrong, that takes less than 500 ms on my laptop, and that includes loading OpenBLAS, OpenCV, and a lookup for CUDA and MKL, which can obviously be optimized... In any case, we can debug that later to see what is going wrong on your end.
Both the group ID and the package names are
Yes it is: http://bytedeco.org/javacpp-presets/mxnet/apidocs/org/bytedeco/mxnet/global/mxnet.html
https://github.com/bytedeco/javacpp-presets/tree/master/mxnet/samples works fine for me on Linux:
What is the error that you're getting? I've also tested on Mac just now and still no problems. |
@saudet Thanks for your reply. Still, I am concerned about the first question: you mentioned:
We are looking for a robust solution for MXNet Java developers, especially one owned and maintained by the Apache MXNet community. I will be more than happy if you would like to contribute the source code that generates the MXNet JavaCPP package to this repo, so we can own the maintenance and be responsible to end users for the package's reliability. At the beginning, we were discussing several ways to preserve a low-level Java API for MXNet that anyone who uses Java can start with. Most of the problems lay in the ownership and maintenance part. I have added JavaCPP as option 5 so we can see which one works best in the end. |
This is great discussion. Thanks @lanking520 for initiating this. Perhaps we can define some key metrics here so we can compare the solutions later? |
@lanking520 In regards to the Scala API, access via Java is just fine. I am sure someone with the itch may end up providing a Scala wrapper 8-) |
@saudet if it is a Maven package, consumption should be fine as long as the license doesn't fall under (no license, GPL, LGPL, or some other license that the ASF doesn't approve). @hmf Sure, please go ahead and create one if you feel it necessary once we have the Java API. So I would like to summarize the topic here:
Both solutions are targeted at a low-level MXNet Java API. @gigasquid @leezu @szha @zachgk @terrytangyuan @yzhliu Any thoughts? |
Great! Thanks for the clarification. It's Apache v2, so the license is alright.
I've created a branch with a fully functional build that bundles MXNet with wrappers for the C API, on my fork here:
$ git clone https://github.com/saudet/incubator-mxnet
$ cd incubator-mxnet
$ git checkout add-javacpp
$ cd java
$ gradle clean build --info
...
org.apache.mxnet.internal.c_api.UnitTest > test STANDARD_OUT
20000
...
BUILD SUCCESSFUL in 1m 3s
10 actionable tasks: 10 executed
...
$ ls -lh build/libs/
total 38M
-rw-rw-r--. 1 saudet saudet 49K Oct 6 20:54 mxnet-2.0-SNAPSHOT.jar
-rw-rw-r--. 1 saudet saudet 38M Oct 6 20:54 mxnet-2.0-SNAPSHOT-linux-x86_64.jar
The number of lines that are directly related to JavaCPP is less than 100, so even if I die anyone can maintain it. I'm sure that's going to grow a bit, but a C API is very easy to maintain. For example, the presets for the C API of TensorFlow 2.x had to be updated only 10 times over the course of the past year: https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/c_api/presets/tensorflow.java |
I've pushed changes that show how to use JavaCPP with maven-publish to my fork here: For example, with this pom.xml file:
<project>
<modelVersion>4.0.0</modelVersion>
<groupId>org.apache</groupId>
<artifactId>mxnet-sample</artifactId>
<version>2.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache</groupId>
<artifactId>mxnet-platform</artifactId>
<version>2.0-SNAPSHOT</version>
</dependency>
</dependencies>
</project>
We can transitively filter out all artifacts that are not for Linux x86_64 this way:
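A sketch of what such filtering could look like in Maven (the coordinates follow the pom.xml above, but the exact artifact names and mechanism in the real build may differ):

```xml
<!-- Hypothetical: exclude everything the -platform artifact pulls in
     transitively, then add back only the one native classifier we want. -->
<dependencies>
  <dependency>
    <groupId>org.apache</groupId>
    <artifactId>mxnet-platform</artifactId>
    <version>2.0-SNAPSHOT</version>
    <exclusions>
      <exclusion>
        <groupId>*</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.apache</groupId>
    <artifactId>mxnet</artifactId>
    <version>2.0-SNAPSHOT</version>
    <classifier>linux-x86_64</classifier>
  </dependency>
</dependencies>
```

With this, only the `linux-x86_64` native JAR ends up on the classpath at deployment time, instead of the binaries for every supported platform.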
And we can do the same with the platform plugin of Gradle JavaCPP: |
As far as my feedback for the two options:
They both sound reasonable and would improve the system. Thank you both @lanking520 and @saudet for your time and efforts. The one aspect that I haven't heard discussed is the implementation of the base Java API - in particular, is anyone planning on tackling this? If so, the person(s) doing the dev work might have a preference that would weight it one way or the other. |
Here's another potential benefit of going with a tool like JavaCPP. I've started publishing packages for TVM that bundle its Python API and also wrap its C/C++ API: Currently, the builds have CUDA/cuDNN, LLVM, MKL, and MKL-DNN/DNNL/oneDNN enabled on Linux, Mac, and Windows, but users do not need to install anything at all--not even CPython! All dependencies get downloaded automatically with Maven (although we can use manually installed ones too if we want). It also works out of the box with GraalVM Native Image and Quarkus this way:

For deployment, the TVM Runtime gets built separately, so it's easy to filter everything and get JAR files that are less than 1 MB, without having to recompile anything at all! It's also easy enough to set up the build so that it offers a user-friendly interface to generate just the right amount of JNI (in addition to enabling only the backends we are interested in) to get even smaller JAR files. The manually written JNI code currently in TVM's repository doesn't support that. Moreover, it is inefficiently written in a similar fashion to the original JNI code in TensorFlow, see above #17783 (comment), so we can assume that using JavaCPP is going to provide a similar boost in performance there as well. If TVM is eventually integrated into MXNet as per, for example, #15465, this might be worth thinking about right now. For most AI projects, Java is used mainly at deployment time, and manually written JNI or automatically generated JNA isn't going to help much in that case. |
Thanks all for the discussion. @saudet would you help to bootstrap the adoption of javacpp in mxnet to get it off the ground? I'm happy to help facilitate any testing infrastructure work necessary. |
@szha Thanks! Could you let me know what would be missing if anything to get this initial contribution into master? https://github.com/saudet/incubator-mxnet/tree/add-javacpp/java Probably a little README.md file would be nice, but other than that? |
In order for it to be adopted by developers and users, I expect that a new language binding should have the following:
|
Ok, I'm able to start looking into that. As for the "language binding", it would basically be just the C API for starters. I think that would be enough for DJL though. For Jenkins, I assume I'd need to get access to the server and everything to do something with that myself... For the docs, that would be something like the Jenkinsfile_website_java_docs in the v1.x branch? |
@saudet for setting up the pipeline, we just need to add a step in existing Jenkinsfiles. I can help facilitate any need for access to the CI. |
I would recommend providing a basic Java interface that allows all Java developers to build frontends on top of it. As Sheng mentioned, you can start with the Jenkins template and add a Java publish job to it. |
I don't really want to deal with CI, especially Jenkins; it's a major time sink and completely unnecessary with services like GitHub Actions these days, but let's see if I can figure out what needs to be done. If I take the Jenkinsfile_centos_cpu script for Python, it ends up calling functions from here, which basically install environments, run builds, and execute things for Python: If I follow my instincts, I think it's probably going to be easier to look at what's been done for the other minor bindings, such as Julia, but I'm not seeing anything in the Jenkins files for that one:

BTW, there's one thing we've neglected to cover. I was under the impression that MXNet was using Cython to access the C API for its Python binding, but it looks like it's using ctypes. TensorFlow started with SWIG and now uses pybind11, and the closest Java equivalent for those is JavaCPP; that is, they support C++ by generating additional code for bindings at build time, so it makes sense to use JavaCPP in the case of TensorFlow to be able to follow what the core developers are doing for Python. On the other hand, if MXNet uses ctypes for Python and has no intention of changing, the closest equivalent in Java land would be JNA. They are both "slow" (partly because of libffi) and support only C APIs, but they can dynamically link at runtime without having to build anything, and I'm assuming that's why there is no CI for Julia, for example.

So, is the plan for Python to stick with ctypes? Browsing through #17097, I guess that's still not settled? In my opinion, it would make sense to harmonize the strategy of the Java binding with the one for Python. |
MXNet supports both cython and ctypes (fallback) for the Python interface. It depends on your build configuration.
It's also fine to use Github Actions if that's easier for you. The main reason for using Jenkins is that the MXNet test suite is too large for a free service such as Github Actions and that there are also GPU tests involved. Java tests can initially run on Github Actions and be migrated later to Jenkins based on need. |
I've updated my fork with a workflow for Java build on GitHub Actions: saudet@2be0540 It's currently building and testing for Linux (CentOS 7), Mac, and Windows on x86_64 with and without CUDA: Since my account at Sonatype doesn't have deploy access to But this can be changed by updating only a single line here: In any case, the
For that, GitHub Actions now supports self-hosted runners, where we just need to provision some machines in the cloud somewhere and install the equivalent of a Jenkins agent on them, and that's it. Much easier than maintaining Jenkins. |
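For illustration, a minimal workflow targeting such a self-hosted runner might look like the following sketch (the workflow name, runner labels, and build command are assumptions, not taken from the actual MXNet repository):

```yaml
# Hypothetical GitHub Actions workflow: build and test the Java binding
# on a self-hosted runner instead of a Jenkins agent.
name: java-build
on: [push, pull_request]
jobs:
  build:
    runs-on: [self-hosted, linux, x64]
    steps:
      - uses: actions/checkout@v2
      - name: Build and test the Java binding
        run: |
          cd java
          gradle clean build --info
```

The `runs-on` labels select which registered self-hosted machine picks up the job, so GPU machines can be targeted with an extra label.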
Thank you @saudet. You can take a look at https://infra.apache.org/publishing-maven-artifacts.html for more information on the Apache Software Foundation (ASF) Maven artifact publishing process. Summary: release candidate artifacts are pushed to a staging area and can be promoted after the release vote has passed. One thing to note is that ASF policies do not allow publishing unreleased (nightly) artifacts to the general public. Those should be placed at a special location and only used by interested community members. You can take a look at http://www.apache.org/legal/release-policy.html#publication and this FAQ entry http://www.apache.org/legal/release-policy.html#host-rc
Github Actions isn't very mature yet. You can see in the doc that "Self-hosted runners on GitHub do not have guarantees around running in ephemeral clean virtual machines, and can be persistently compromised by untrusted code in a workflow." I don't think that's acceptable for projects accepting contributions from the general public. |
I downloaded the
As for the GPU version |
Regarding security: I think that the quoted paragraph has the same (in)securities as our jenkins setup, doesn't it? |
I don't think so. Microsoft specifically says "We recommend that you do not use self-hosted runners with public repositories." It indicates to me that they have very little confidence in their security model. https://docs.github.com/en/free-pro-team@latest/actions/hosting-your-own-runners/about-self-hosted-runners#self-hosted-runner-security-with-public-repositories |
Yes I think they are mentioning the same security problem we are having with our jenkins slaves. Any user could run arbitrary code and install a rootkit. Hence the separation towards restricted slaves. So from that point of view, I don't consider the github actions self runner any less secure than our jenkins slaves. But of course still insecure. |
The problem with runners I had in mind is that there used to be no API to start new instances for each job, but rather that the instances had to be up and running all the time and would be re-used for all jobs. Thus any compromise would be truly persistent. We don't do that in our Jenkins setup, where instances are terminated time-to-time. But I just checked the Github documentation and Microsoft team has resolved this issue and now provides an API that can provision new runners upon demand. So if there are volunteers, it should be fine to migrate to Github Actions. For example, https://040code.github.io/2020/05/25/scaling-selfhosted-action-runners |
Thanks for the links! I've been publishing to the Maven Central Repository, I know how that works.
It doesn't sound to me like they forbid publishing snapshots, just that it shouldn't be documented, which is weird, but whatever. It should be alright to deploy snapshots and keep it a "secret", no? They say we "should" do this and that, but if none of their services offers support for Maven artifacts, I suppose this means we can use something else, right?
Yes, that's not a problem. However, if we don't have
I guess? :) In any case, that's not a problem either. However, it's becoming increasingly irrelevant to try to support multiple versions of CUDA given their accelerating release cycle. |
FWIW, it looks to me like libquadmath is LGPL, not GPL: https://github.com/gcc-mirror/gcc/blob/master/libquadmath/COPYING.LIB |
There haven't been any ABI breaks in libquadmath.so. Thus we can simply ask users to install libquadmath.so by themselves and everything will work. Our users will not be able to find an incompatible libquadmath.so
You're right, but the consequence is the same.
Yes. Reading the https://infra.apache.org/publishing-maven-artifacts.html again, there is also https://repository.apache.org/snapshots which may be the best location for snapshots?
I'm not sure what you mean. |
I suppose it's more business friendly, but it's not a requirement for releasing binaries under Apache, correct? That is, this is a policy specific to the MXNet project?
Yes, I saw that too. There's already a few snapshots from MXNet there, so I assume we can use it freely: Now, where can I get an account for that server... Anyway, someone will just need to put their credentials as secrets for GitHub Actions and then we just need to change the URL here for the snapshots: And that's it. They will appear exactly as they currently are on Sonatype:
I was referring to the snapshot repository, which they do offer, so we're good for that, but if we need something else, it would be good to know what the official stance is concerning the use of external services. I suppose anything from GitHub is OK, but other than that, I wonder. |
It is a requirement to release the binaries under AL2. You can refer to https://www.apache.org/legal/resolved.html for a list of compatible and incompatible licenses. LGPL is Category-X (not allowed) as it places restrictions on the larger work.
It's fine with me to re-use the existing pattern if others don't mind.
We can open a ticket with Apache Infra. Would you like to open a PR first?
It's fine to use external services as long as the project maintainers (PPMC) control the usage and the published artifacts are compliant with ASF policies (for example, they don't contain LGPL components). |
Ok, I've finally updated my fork accordingly along with a few additional changes: saudet@0966818 |
Thank you. It's fine with me. Once you open the PR, @lanking520 and @gigasquid may be able to review too |
As MXNet 2.0 development is starting, I would like to initiate a discussion about the future development of the JVM languages.
Proposal
Statistics
Scala package
Clojure package
@gigasquid @terrytangyuan @zachgk @frankfliu @aaronmarkham