
Building OpenJ9 for e500v2 core equipped SoC #2585

Open
lmajewski opened this issue Aug 12, 2018 · 123 comments

@lmajewski

Dear All,

I'm trying to build OpenJ9 on a PPC SoC equipped with an e500v2 core. This core doesn't have the AltiVec IP block (instead, it uses the SPE extension for floating-point calculation).

The problem seems to be with the OpenJ9 assumption that all supported cores support AltiVec instructions. One of the assembly-tuned files:
./openj9/runtime/compiler/p/runtime/J9PPCCRC32.spp

This is __crc32_vpmsum [1], an optimized implementation of CRC32 calculation for 16-byte data blocks.

Is there any C implementation of this function available? Or maybe one for SPE assembler?

Please correct me if I'm wrong, but it seems to me that one would need to:

  • Rewrite [1] in C and then let gcc optimize it for the e500v2 core

or

  • Rewrite [1] from scratch to use SPE assembler instructions instead of AltiVec

Personally, I would prefer the first option with C, but I'm not sure what the performance impact on OpenJ9 would be.
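For illustration, here is a minimal sketch of what such a C fallback could look like - this is just the classic byte-at-a-time, table-driven CRC32 (reflected polynomial 0xEDB88320), not OpenJ9's routine, and it will be far slower than __crc32_vpmsum:

#include <cstddef>
#include <cstdint>

// Classic table-driven CRC32; fully portable, so gcc can compile it
// for the e500v2 (or any other core) without AltiVec.
static uint32_t crc32_table[256];

static void crc32_init()
{
    for (uint32_t i = 0; i < 256; ++i) {
        uint32_t c = i;
        for (int k = 0; k < 8; ++k)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : (c >> 1);
        crc32_table[i] = c;
    }
}

static uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
    crc = ~crc;                                             // standard pre-inversion
    while (len--)
        crc = crc32_table[(crc ^ *buf++) & 0xFFu] ^ (crc >> 8);
    return ~crc;                                            // standard post-inversion
}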

Has anybody tried to run OpenJ9 on the e500v2?

Thanks in advance,
Łukasz

@fjeremic
Contributor

@gita-omr @ymanton this seems like your area of expertise. Could you help answer OP's questions?

@ymanton
Member

ymanton commented Aug 13, 2018

The problem seems to be with the OpenJ9 assumption that all supported cores support AltiVec instructions.

We only use AltiVec if we detect the processor at runtime and know that it supports AltiVec. The same applies to VSX and various other hardware features. The __crc32_vpmsum routine, for example, will only be called if we detect that the processor is an IBM POWER8 or later; otherwise we will not use it.

We don't detect the e500, so we will assume we are running on a basic PPC chip that has no support for AltiVec, VSX, crypto instructions, transactional memory, etc. If those sorts of instructions get executed on your chip, that's a bug in the JIT that can be fixed.
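(For illustration, the kind of runtime check involved looks roughly like the following on Linux/PowerPC - a sketch using the auxiliary vector, not our actual port-library detection code:)

#include <sys/auxv.h>      // getauxval, AT_HWCAP
#include <asm/cputable.h>  // PPC_FEATURE_HAS_ALTIVEC

// Returns true only if the kernel reports AltiVec; false on an e500v2.
static bool cpuHasAltiVec()
{
    unsigned long hwcap = getauxval(AT_HWCAP);
    return (hwcap & PPC_FEATURE_HAS_ALTIVEC) != 0;
}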

@lmajewski
Author

Does that mean OpenJ9 will be compiled for a very basic PPC ISA if no supported architecture is detected?

Why do I ask?
The guess-platform.sh script checks the system on which we run. On Linux it seems that
x86_64, ppc64 and ppc64le are supported.
Plain ppc (32-bit, as on the e500v2) is not supported out of the box.

@fjeremic
Contributor

The guess-platform.sh script checks the system on which we run.

This script just attempts to guess the platform you're compiling OpenJ9 on. The compiler options (gcc or xlC) used when compiling OpenJ9 will target the minimum supported architecture level. I'm not sure what that is on Power, but presumably it is a very old processor.

What @ymanton is talking about is what happens at runtime. At runtime OpenJ9 will detect what processor you are running under and the JIT compiler will generate calls to __crc32_vpmsum for example if we detected you are running on IBM POWER8 or later.

@ymanton
Member

ymanton commented Aug 13, 2018

As @fjeremic said, guess-platform.sh is checking at build time, not run time. Since we don't compile OpenJ9 in 32-bit environments, there is currently no support for it in the code, but feel free to add it.

If you want to port OpenJ9 to the e500, then most of your work will be in making the build system work in a 32-bit ppc environment. Once you have a successful build you shouldn't have much trouble running OpenJ9, except for one issue related to using 64-bit instructions -- we assume that ldarx and stdcx are available, which is not true on 32-bit systems, so that will need to be fixed.
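(The 32-bit replacement is mechanical; here is a sketch of the lwarx/stwcx. pattern that stands in for the 64-bit ldarx/stdcx. pair - illustrative GCC inline assembly, not the actual OpenJ9 code:)

#include <cstdint>

// Atomically compare-and-swap a 32-bit word using load-reserve /
// store-conditional; returns true if the swap succeeded.
static inline bool compareAndSwapU32(volatile uint32_t *addr,
                                     uint32_t oldVal, uint32_t newVal)
{
    uint32_t observed;
    __asm__ __volatile__(
        "1: lwarx   %0,0,%2  \n"   // load word and reserve
        "   cmpw    %0,%3    \n"   // does it match the expected value?
        "   bne-    2f       \n"
        "   stwcx.  %4,0,%2  \n"   // store new value if reservation held
        "   bne-    1b       \n"   // reservation lost: retry
        "2:                  \n"
        : "=&r"(observed), "+m"(*addr)
        : "r"(addr), "r"(oldVal), "r"(newVal)
        : "cr0", "memory");
    return observed == oldVal;
}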

If you have not already seen issue #2399 please take a look at it, it discusses problems that are very similar to yours.

@lmajewski
Author

I'm not sure what that is on Power, but presumably it is a very old processor.

No, it is not. This is a quite powerful embedded system: 2 cores, 1.5 GHz, 1 GiB RAM.
It just doesn't support AltiVec and has SPE instead.

I will look into the referenced thread. Thanks for the reply.

@lmajewski
Author

I've started the porting.

Why: OpenJ9 claims to be much faster than other JVMs.
Goal: to have OpenJ9 build on PPC (e500v2 core).

For the sake of simplicity I've decided to use the zero variant (to avoid AltiVec issues) and build it in a native environment.

I've followed: https://www.eclipse.org/openj9/oj9_build.html
Side question: why is gcc 4.8 used (recommended)? I'm using gcc 6.4.0.
With the source code (and all prerequisites) in place, configure passes:
./configure --with-freemarker-jar=/lib/freemarker.jar --with-jobs=2 --with-debug-level=fastdebug --without-freetype --without-x --without-cups --without-alsa --disable-headful --with-jvm-variants=zero

====================================================
A new configuration has been successfully created in
/root/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-fastdebug
using configure arguments '--with-freemarker-jar=/lib/freemarker.jar --with-jobs=2 --with-debug-level=fastdebug --without-freetype --without-x --without-cups --without-alsa --disable-headful --with-jvm-variants=zero'.

Configuration summary:

  • Debug level: fastdebug
  • JDK variant: normal
  • JVM variants: zero
  • OpenJDK target: OS: linux, CPU architecture: ppc, address length: 32

Tools summary:

  • Boot JDK: openjdk version "1.8.0_102-internal" OpenJDK Runtime Environment (build 1.8.0_102-internal-b14) OpenJDK Zero VM (build 25.102-b14, interpreted mode) (at /usr/lib/jvm/openjdk-8)
  • C Compiler: powerpc-poky-linux-gnuspe-gcc (GCC) version powerpc-poky-linux-gnuspe-gcc (GCC) 6.4.0 (at /usr/bin/powerpc-poky-linux-gnuspe-gcc)
  • C++ Compiler: powerpc-poky-linux-gnuspe-g++ (GCC) version powerpc-poky-linux-gnuspe-g++ (GCC) 6.4.0 (at /usr/bin/powerpc-poky-linux-gnuspe-g++)

Build performance summary:

  • Cores to use: 2
  • Memory limit: 1008 MB
  • ccache status: installed, but disabled (version older than 3.1.4)

Then I've decided to build it with:
make CONF=linux-ppc-normal-zero-fastdebug LOG=trace JOBS=2 images

The build errors popped up in:
javac: file not found: /root/openj9-openjdk-jdk8/jdk/src/solaris/classes/sun/awt/org/xml/generator/WrapperGenerator.java [1]

This file has been appended to the end of:
/root/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-fastdebug/jdk/btclasses/_the.BUILD_TOOLS_batch
as part of BUILD_TOOLS generation:

SetupJavaCompilation(BUILD_TOOLS)
[2] SETUP := GENERATE_OLDBYTECODE
[3] SRC := /root/openj9-openjdk-jdk8/jdk/make/src/classes /root/openj9-openjdk-jdk8/jdk/src/solaris/classes/sun/awt/X11/generator
[4] BIN := /root/openj9-openjdk-jdk8/build/linux-ppc-normal-zero-fastdebug/jdk/btclasses
Tools.gmk:38: Running shell command

  • /usr/bin/find /root/openj9-openjdk-jdk8/jdk/make/src/classes /root/openj9-openjdk-jdk8/jdk/src/solaris/classes/sun/awt/X11/generator -type f -o -type l
    gensrc/GensrcProperties.gmk:

When I replace /org/xml with /X11, the file (WrapperGenerator.java) is present.
Another strange thing: why is AWT built/needed at all? I've asked ./configure to build a headless VM without X.

Any idea why it is like that? Maybe some explanation that could shed some light?

Regarding the debug infrastructure of the OpenJ9 build:

  • makefile's LOG=trace and the -d option

Are there any others available?

@fjeremic
Contributor

Side question: why is gcc 4.8 used (recommended)?

There was work needed to get higher levels working. The JIT specifically made use of a slightly modified CRTP which worked on gcc 4.8 but not on 5+ due to spec conformance. We should now be able to build with gcc 7.3, though, and will be moving to that compiler level soon. See #1684.
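(For context, CRTP is the "curiously recurring template pattern", where a base class is parameterized by the class deriving from it to get compile-time dispatch; a generic illustration with made-up names, not the actual JIT hierarchy:)

// The base class is parameterized by its derived class, so the call
// dispatches statically instead of through a virtual table.
template <typename Derived>
class CodeGeneratorBase
{
public:
    void generate() { static_cast<Derived *>(this)->generateImpl(); }
};

class PPCCodeGenerator : public CodeGeneratorBase<PPCCodeGenerator>
{
public:
    void generateImpl() { /* emit PPC instructions */ }
};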

@ymanton
Member

ymanton commented Aug 14, 2018

For the sake of simplicity I've decided to use the zero variant (to avoid AltiVec issues) and build it in a native environment.

I don't know how the zero parts of OpenJDK are built for OpenJ9, but OpenJ9 itself doesn't have a "zero" VM, so unfortunately it will be the same as building a non-zero VM: the various assembly files and the JIT will have to be built.

When I replace /org/xml with /X11, the file (WrapperGenerator.java) is present.
Another strange thing: why is AWT built/needed at all? I've asked ./configure to build a headless VM without X.

Any idea why it is like that? Maybe some explanation that could shed some light?

I don't know if it is a bug in the OpenJDK build system or just in the OpenJ9 parts, but the --without-x flag is not respected. I just install all the needed libs and headers and build with the default config. I don't even know why a Solaris Java class is being built on other platforms; this might also be another bug in the build system.

@lmajewski
Author

--without-x flag is not respected

OK, so this is a dead option.

I just install all the needed libs and headers and build with the default config

I assume that you use PPC64? Have you ever tried to cross-compile OpenJ9?

Is there any way to improve the debug output? I have a hard time finding the places where files (like _the.BUILD_TOOLS_batch) are generated.

Also, please correct me if I'm wrong, but it seems to me that ./configure is already checked in to the repository (and downloaded). Maybe I need to regenerate it?

@ymanton
Member

ymanton commented Aug 14, 2018

I assume that you use PPC64? Have you ever tried to cross-compile OpenJ9?

No, OpenJ9 only builds on ppc64le, not ppc64 or ppc (the IBM JDK builds on ppc64 in both 32- and 64-bit modes). I have not tried to cross-compile OpenJ9 myself; I think we may support that for ARM targets, but I'm not sure.

Is there any way to improve the debug output? I have a hard time finding the places where files (like _the.BUILD_TOOLS_batch) are generated.

Unfortunately, not that I know of. OpenJ9 had to make changes to the OpenJDK build system in order to integrate, and some things are still less than perfect. The only thing I can suggest, if you're building jdk8, is to set VERBOSE="" in your env for make, which should echo commands so you can better see what's being invoked.

Also please correct me if I'm wrong, but it seems to me like the ./configure is already created in the repository (and downloaded). Maybe I do need to regenerate it?

The version that's checked in should be in sync with configure.ac, but it doesn't hurt to regenerate it. The file you care about is actually common/autoconf/configure; the top-level one just calls it.

@lmajewski
Author

I have not tried to cross-compile OpenJ9 myself; I think we may support that for ARM targets, but I'm not sure.

Do you maybe have the build-system adjustments to cross-compile OpenJ9 for ARM? I mean, ARM is also not supported (at all), so I could reuse some of its code in the ppc port.

@ymanton
Member

ymanton commented Aug 16, 2018

Unfortunately I don't; I haven't spent any time on ARM. @JamesKingdon might have some info on how to get OpenJ9 to cross-compile and/or some patches for that on ARM.

@lmajewski
Author

If I may ask about OMR's tools, namely tracemerge, hookgen, etc.:

What is their purpose? In my native build, for example, tracemerge is used during the build:
./tracemerge -majorversion 5 -minorversion 1 -root .

Why do we need to merge trace information during the build?
Moreover, this means that it has to be compiled for the HOST (x86_64 | PPC64).
Why does OpenJ9 need it?

I've also noticed OMR_CROSS_CONFIG="yes", which gives the tools the possibility of being cross-compiled.
This might be quite useful, as omr/tools/tracegen/makefile calls:
include $(top_srcdir)/tools/toolconfigure.mk

However, it seems to be tuned to PPC64 (-m64).

@DanHeidinga
Member

OMR and OpenJ9 use a trace engine to record diagnostic info about how the code is executing into a circular buffer on the thread. The descriptions of these trace points need to be converted into binary form and then merged into a single data file that can be used by the runtime. That's roughly tracemerge.

hookgen is used to generate the appropriate macros for the low overhead pub/sub system used in OMR / OpenJ9 to communicate events across the system.
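(Conceptually, the per-thread buffer works like the following sketch - hypothetical types and record layout, not OMR's actual trace engine:)

#include <cstddef>
#include <cstdint>
#include <cstring>

// A fixed-size circular buffer: new records overwrite the oldest ones,
// so the buffer always holds the most recent trace history.
struct TraceBuffer
{
    static const size_t kSize = 4096;
    uint8_t data[kSize];
    size_t cursor = 0;

    void record(uint16_t tracepointId, uint32_t payload)
    {
        uint8_t rec[6];
        std::memcpy(rec, &tracepointId, sizeof tracepointId);
        std::memcpy(rec + 2, &payload, sizeof payload);
        for (size_t i = 0; i < sizeof rec; ++i) {
            data[cursor] = rec[i];
            cursor = (cursor + 1) % kSize;   // wrap around
        }
    }
};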

@lmajewski
Author

OK, so those are components which will be used by the running JVM instance and hence shall be either cross-compiled or built natively.

@DanHeidinga
Member

They're only needed as part of the build and not at runtime.

@lmajewski
Author

I think I've misunderstood you in some way.

Are they only used when OpenJ9 is compiled (so they could be compiled as x86_64)?
Or do they need to be available on the target (and cross-compiled as PPC)?

@DanHeidinga
Member

Sorry I wasn't clear. Most of the tools - like hookgen & tracemerge - are only used when OpenJ9 is compiled and can be compiled as x86_64.

There is one that depends on the right architecture: constgen.

If you support DDR (used for debugging JVM crashes), it will also need to run on the right architecture.

@lmajewski
Author

With the current version of the OpenJ9 build system (scripts), a successful configure gives the following output:

  • ccache status: installed, but disabled (version older than 3.1.4)

Build performance tip: ccache gives a tremendous speedup for C++ recompilations.
You have ccache installed, but it is a version prior to 3.1.4. Try upgrading.

The problem is that on my system:
/openj9-openjdk-jdk8# ccache -V
ccache version 3.2.5+dirty

Is there any workaround to fix this? Or is the ./configure script logic just wrong, determining the version in the wrong way?

@DanHeidinga
Member

@dnakamura Any thoughts on the ccache question?

@dnakamura
Contributor

I believe the OpenJDK code assumes that the version is < 3.1.4 if it fails to parse the version string. It's been a while since I looked at the relevant code, but I think they fail to parse when they see anything other than digits or a decimal point. Will look into it.

@dnakamura
Contributor

OK, no, my bad. It will handle alphabetic characters in the version string. However, to check the version number they just match against the regex 3.1.[456789], which means anything > 3.1.9 will fail.

@lmajewski
Author

If I may ask again about gcc 4.8 (which is recommended for a native build of this VM):

I've backported gcc 4.8.2 to my setup. Unfortunately, during the ./configure execution it wants to check that gcc is working:

configure:22215: /usr/bin/powerpc-poky-linux-gnuspe-gcc -O2 -pipe -g -feliminate-unused-debug-types -Wno-error=deprecated-declarations -fno-lifetime-dse -fno-delete-null-pointer-checks -m32 -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double --sysroot=/ -Wl,-O1 -Wl,--hash-style=gnu -Wl,--as-needed -fPIC conftest.c >&5
powerpc-poky-linux-gnuspe-gcc: error: unrecognized command line option '-fno-lifetime-dse'

The problem is that this particular optimization option is NOT supported in 4.8.[12345].
It first shows up in 4.9 -> e.g.
https://gcc.gnu.org/onlinedocs/gcc-4.9.3/gcc/Optimize-Options.html

Why is it like that? Is '-fno-lifetime-dse' only needed on PPC (as it is possible to compile J9 on x86_64)?

From the other reply -> the problem with compiling proper code only shows up on gcc 5+, so I guess 4.9.x can be used?

@ymanton
Member

ymanton commented Aug 27, 2018

Looks like that issue comes from OpenJDK code, not OpenJ9. If you look here

https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/2b004fdb6829f287eaa464a57a8680377886ca75/common/autoconf/toolchain.m4#L1425-L1440

you'll see that they're trying to disable that opt under GCC 6, so it should not be used when you build using GCC 4.8. Is your default host compiler GCC 6 or later? Perhaps the configure scripts are invoking that in some places instead of your powerpc-poky-linux-gnuspe-gcc cross compiler and getting confused. You can look in the various config.log files that are generated to see what's going on.

@dnakamura
Contributor

You should also note that there is a runtime check you need to disable to work on 32-bit (see #2399).
Note: in that issue they also discuss 32-bit Power missing certain instructions; however, I don't think that's an issue for the e500 cores. You may still run into other issues where bits of our code assume we are running on a 64-bit chip.

@shingarov
Contributor

Do you maybe have the build-system adjustments to cross-compile OpenJ9 for ARM? I mean, ARM is also not supported (at all), so I could reuse some of its code in the ppc port.

I recently followed James' instructions and successfully cross-compiled from Ubuntu/AMD64 to the RPi and the resulting VM works fine. Caveat: you may want to read the recent conversation on Slack about back-contributing directly to the master repo, not via James' fork.

I am also actively trying to cross-compile to the e500. I am approaching it differently, though: I am trying to start from (pieces of) the OMR testcompiler, which looks more within reach. What I understood, however, is that its build system is quite disconnected from the other two, i.e. from both TR's and J9's. And I have a feeling that it's less actively looked at: while the other parts cross-compile just fine, I had to dance around things to get the tc/tril/etc. to cross-compile to ARM. I'll keep you posted on the progress with tc/tril on e500.

@lmajewski
Author

Thanks Boris for your input.

I recently followed James' instructions and successfully cross-compiled from Ubuntu/AMD64 to the RPi and the resulting VM works fine.

I've looked at your GitHub repositories and I couldn't find the ARM port for J9. Would it be possible to upload it somewhere?

Slack about back-contributing directly to the master repo, not via James' fork.

Do you have any reference/logs to those conversations?

I had to dance around things to get the tc/tril/etc. to cross-compile to ARM.

Could you share the steps (or repository) which were needed on ARM to get it working?

I'll keep you posted on the progress with tc/tril on e500.

Thanks.

@lmajewski
Author

The scimark.fft.large benchmark took 30 minutes to "Warmup()", and a similar time to run the test.

../openj9-openjdk-jdk8/build/linux-ppc-normal-zero-release/images/j2re-image/bin/java -Xjit:disableFPCodeGen -Xshareclasses -jar SPECjvm2008.jar -wt 5s -it 5s -bt 2 scimark.fft.large  -ikv -ctf false -chf false

Unfortunately, this is too long.

I would expect a penalty from the lack of FP support, but on the other hand we emulate FP instructions on this SoC anyway.

@PTamis

PTamis commented Oct 18, 2018

During the compilation I caught some strange warnings.
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9codert_vm.a(cnathelp.o) uses soft float
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9util.a(fltconv.o) uses soft float
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9util.a(bcdump.o) uses soft float
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9util.a(fltmath.o) uses soft float
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9util.a(fltrem.o) uses soft float
Warning: vm/compiler/../libj9jit29.so uses hard float, vm/compiler/../lib/libj9utilcore.a(j9argscan.o) uses soft float

I dug into this a bit and saw that the JIT files were not compiled with the correct flags:
-mcpu=powerpc was used instead of -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double

I saw that this variable is set by common.mk, so I changed it to the options above.

The compile broke at only one point after that: the compilation of PPCHWProfiler.cpp, because it includes PPCHWProfilerPrivate.hpp, which has 64-bit assembly in it, and of course this is not compatible with our HW. So the compiler stopped.

The problem is at lines 250 to 270.
I commented those lines out and redefined MTSPR64 and MFSPR64 exactly like MTSPR/MFSPR inside the __GNUC__ block.
After that, compilation went all the way through and no hard/soft float warnings were emitted again.

I guessed that this code should not be called; otherwise we would have seen a SIGILL during some of the tests.
Any comments for a better fix would be appreciated.
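Roughly, the workaround looks like this (a sketch; the macro parameter lists here are assumed, and the real definitions in PPCHWProfilerPrivate.hpp are more involved):

// On 32-bit targets the 64-bit SPR access sequences cannot be assembled,
// so fall back to the 32-bit forms (assuming MTSPR/MFSPR take the same
// arguments as their 64-bit counterparts).
#if defined(__GNUC__) && !defined(__powerpc64__)
#undef  MTSPR64
#undef  MFSPR64
#define MTSPR64(spr, gpr) MTSPR(spr, gpr)
#define MFSPR64(gpr, spr) MFSPR(gpr, spr)
#endif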

@ymanton
Member

ymanton commented Oct 18, 2018

Yes, you are right: the code in the PPCHWProfiler files will only run on certain POWER processors. Ideally the inline assembly should be better guarded, or the files should not be built at all on platforms like yours, but it's a minor issue. I'll keep it in mind.

Most of the code in libj9jit is dedicated to the compiler itself, which doesn't do a lot of floating-point math, but some of it is runtime code that will be executed by the program. Some of those runtime routines are in object files built via VM makefiles, and they get linked together with the JIT library, which explains those warnings. The files built with hardfloat will never interact with the ones built with softfloat, so you were not in danger, but it's good to fix that anyway.

Is floating point performance important to you? If it is then you really need the JIT to support SPE. With -Xjit:disableFPCodeGen I see 30x slower performance on FP-intensive programs on large servers as well, so I think that is an unavoidable reality.

@shingarov
Contributor

-mcpu=powerpc was used instead of -mcpu=8548 -mabi=spe

Out of curiosity, how is your gcc configured? I explicitly invoke the OMR configure script with CC=powerpc-linux-gnuspe-gcc and that gcc just comes in a Debian package and -v says it was compiled with --with-cpu=8548 --enable-e500_double --target=powerpc-linux-gnuspe.

@PTamis

PTamis commented Oct 18, 2018

@ymanton FP cannot be excluded.
I tested Tomcat to see the load time, and it took considerable time to load along with all its apps.
I believe that FP is the bottleneck and that is why it took so much time to load. FP tests alone are indeed 30x slower.

@shingarov CC=powerpc-linux-gnuspe-gcc -m32 -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double --sysroot=/ as an environment variable.

The configuration comes from ELDK 5.6, which is based on Yocto daisy 1.6. I took the configuration from there in order to make the native build. Later I will also try the cross-compile with a Yocto recipe.

But if you explicitly pass gcc -mcpu=powerpc, I guess the wrong one will be used, and that's why the warnings are being produced. This is hard-coded in the common.mk file.

Either way, even with -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double I did not notice any performance difference.

@lmajewski
Author

@ymanton

Is floating point performance important to you? If it is then you really need the JIT to support SPE.
With -Xjit:disableFPCodeGen I see 30x slower performance on FP-intensive programs on large servers as well, so I think that is an unavoidable reality.

Yes, floating-point support is necessary. The observed performance regression is not acceptable.
I'm now looking into the kernel to see what exactly fails with JIT-generated code.

From my understanding, as we emulate FPU instructions in-kernel, the J9 JIT which uses them should work with fully emulated code. Performance should be better than with -Xjit:disableFPCodeGen, but worse than HW FP.

@lmajewski
Author

To start off:

  1. One "strange" thing:
    run '-Xjit:disableAsyncCompilation,limit={java/lang/StrictMath.log(D)D}(count=0,breakAfterCompile)' LogTest

we do specify count=0, but at least on my J9:
The "pure SW" implementation of log is executed from:

__j__ieee754_log @ /mnt/openj9-openjdk-jdk8/jdk/src/share/native/java/lang/fdlibm/src/e_log.c:109

It provides the correct result.

The above behaviour puzzles me a bit.
Corresponding ASM code snippet:

116                 if (hx<0) return (x-x)/zero;        /* log(-#) = NaN */
   0x0f923ef0 <+172>:   lwz     r9,12(r31)
   0x0f923ef4 <+176>:   cmpwi   cr7,r9,0
   0x0f923ef8 <+180>:   bge-    cr7,0xf923f18 <__j__ieee754_log+212>
   0x0f923efc <+184>:   evldd   r10,104(r31)
   0x0f923f00 <+188>:   evldd   r9,104(r31)
   0x0f923f04 <+192>:   efdsub  r10,r10,r9
   0x0f923f08 <+196>:   lwz     r9,-32764(r30)
   0x0f923f0c <+200>:   evldd   r9,0(r9)

117                 k -= 54; x *= two54; /* subnormal number, scale up x */
   0x0f923f18 <+212>:   lwz     r9,8(r31)
   0x0f923f1c <+216>:   addi    r9,r9,-54
   0x0f923f20 <+220>:   stw     r9,8(r31)
   0x0f923f24 <+224>:   evldd   r10,104(r31)
   0x0f923f28 <+228>:   lwz     r9,-32768(r30)
   0x0f923f2c <+232>:   evldd   r9,0(r9)
   0x0f923f30 <+236>:   efdmul  r9,r10,r9
   0x0f923f34 <+240>:   evstdd  r9,104(r31)

It uses "ev*" ASM SPE instructions (like evdmul -> the same performance as FPU but on GPRs), so this is the fastest possible code on this SoC.
This code is always compiled - even when we set 'count=0' in
'-Xjit:disableAsyncCompilation,limit={java/lang/StrictMath.log(D)D}(count=0,breakAfterCompile)' LogTest

Even better, the JIT code uses a trampoline to jump to the function which provides log (@ymanton is there a way to check where dcall java/lang/StrictMath.log(D)D[#618 final native static Method] is provided? The traceFull doesn't show where this function's ASM representation can be found).

When we drop -Xjit:disableFPCodeGen we shall use the above code, which is the fastest possible.

Maybe the problem with the performance regression lies somewhere else? Maybe locking (as the JVM uses several threads), since we use sync instead of lwsync or msync?

@ymanton
Member

ymanton commented Oct 22, 2018

The above behaviour puzzles me a bit.

That is an implementation detail. java/lang/StrictMath.log(D)D is special in that it is a JNI method, not a Java method. This method is declared in a class as native and its implementation is in C, not Java. The JIT behaves a little differently for these kinds of methods, even if you use count=0.

@ymanton is there a way to check where dcall java/lang/StrictMath.log(D)D[#618 final native static Method] is provided? The traceFull doesn't show where this function's ASM representation can be found.

Notice that it is a native method; this means that its implementation will be in C. You can still find the assembly for it in gdb or by looking at the .o file, but the source code will be in https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/openj9/jdk/src/share/native/java/lang/StrictMath.c. The implementation eventually reaches the fdlibm version of log() in https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/openj9/jdk/src/share/native/java/lang/fdlibm/src/e_log.c.
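The native side has the standard JNI shape, roughly like this (a simplified sketch, not the verbatim StrictMath.c; the jlog declaration normally comes from fdlibm's header):

#include <jni.h>

extern "C" double jlog(double);  // fdlibm's renamed log()

extern "C" JNIEXPORT jdouble JNICALL
Java_java_lang_StrictMath_log(JNIEnv *env, jclass unused, jdouble d)
{
    return (jdouble) jlog((double) d);  // ends up in fdlibm's e_log.c
}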

When we drop -Xjit:disableFPCodeGen we shall use the above code, which is the fastest possible.

Maybe the problem with the performance regression lies somewhere else? Maybe locking (as the JVM uses several threads), since we use sync instead of lwsync or msync?

You have to consider that, yes, you will execute a "fast" version of log() that uses the SPE hardware; however, by using -Xjit:disableFPCodeGen more methods will now run in the interpreter instead of being compiled. Every method that has even a single float or double bytecode will never be JIT-compiled, so if, for example, the main loop of the benchmark cannot be compiled and must execute in the interpreter, your overall performance will be much worse, even if log() is fast. As I said earlier, I can reproduce a 30x slowdown on some floating-point benchmarks when I use -Xjit:disableFPCodeGen on a ppc64 machine, so it can be a big penalty even on server machines.

@lmajewski
Author

Do you maybe have the build-system adjustments to cross-compile OpenJ9 for ARM? I mean, ARM is also not supported (at all), so I could reuse some of its code in the ppc port.

I recently followed James' instructions and successfully cross-compiled from Ubuntu/AMD64 to the RPi and the resulting VM works fine. Caveat: you may want to read the recent conversation on Slack about back-contributing directly to the master repo, not via James' fork.

I am also actively trying to cross-compile to the e500. I am approaching it differently, though: I am trying to start from (pieces of) the OMR testcompiler, which looks more within reach. What I understood, however, is that its build system is quite disconnected from the other two, i.e. from both TR's and J9's. And I have a feeling that it's less actively looked at: while the other parts cross-compile just fine, I had to dance around things to get the tc/tril/etc. to cross-compile to ARM. I'll keep you posted on the progress with tc/tril on e500.

@shingarov - Have you managed to make any progress there (with e500 or RISC-V)?
If yes, could you share your code (even at the development stage) on GitHub?

@PTamis

PTamis commented Oct 24, 2018

@ymanton during my tests I figured out something a bit strange.
In order to make java -version work on my system (kernel v4) I had to enable CONFIG_MATH_EMULATION_FULL. I also enabled the traces in the kernel, and I can see all the FPU functions being emulated by the kernel.

Now I did the following test:
./java -Xint -version disables both JIT and AOT.

The strange thing I saw in dmesg was that the JVM was emitting lots of lfd and stfd instructions.
And I was wondering how this is possible, since I compile all files with -mcpu=8548 -mabi=spe -mspe -mfloat-gprs=double, so all JVM code should be SPE-specific and not contain -mcpu=powerpc instructions.

Who is emitting those instructions if not JIT or AOT?

@ymanton
Member

ymanton commented Oct 24, 2018

Who is emitting those instructions if not JIT or AOT?

There are probably a few low-level routines written in assembly that are still being called. For example https://github.com/eclipse/openj9/blob/master/runtime/util/xp32/volatile.s

@lmajewski
Author

@ymanton
If I may ask - have you tried/used J9 on any of your systems with --with-zlib=system set during configuration?

We do experience some slowdowns when decompressing files (.war/.jar). After enabling the above switch the speedup was not present (though it was expected, as this is a native lib).
The ldd output:

root@lala:/usr/lib/jvm# ldd `which java`
	...
	libz.so.1 => /lib/libz.so.1 (0x0f9e0000)
	...

Is there any other way to speed up decompression on J9?

@ymanton
Member

ymanton commented Oct 29, 2018

No, I've never used that option. It looks like it allows you to use your system's zlib rather than the one in https://github.com/ibmruntimes/openj9-openjdk-jdk8/tree/openj9/jdk/src/share/native/java/util/zip/zlib.

Even if you don't use that option you will be getting a native zlib implementation; the only difference is which one. It sounds like there is no performance to be gained by using your system's zlib over the OpenJDK one.

Not sure how you can speed up decompression, other than reducing the number of JAR files you access or maybe changing the build so that JAR files are built with no compression (i.e. jar -0 ...). Are you using the -Xshareclasses option that I suggested earlier? I don't know if it actually allows us to skip decompressing jar files or not, but it generally helps startup.

@lmajewski
Author

@ymanton

Not sure how you can speed up decompression, other than reducing the number of JAR files you access or maybe changing the build so that JAR files are built with no compression (i.e. jar -0 ...)

I need to check whether those files can be converted.

Are you using the -Xshareclasses option that I suggested earlier? I don't know if it actually allows us to skip decompressing jar files or not, but it generally helps startup.

I've tried it and the results are promising - when the cache is put into the /tmp dir I can see a startup speedup of around 15%.

@lmajewski
Author

I did some detailed tests on the target production application (with -Xjit:disableFPCodeGen).

The interesting parts of the JIT log:

root@lala:~# grep -E "^! " /tmp/JIT_log.20181025.143321.6552 | grep -i Hash
! java/util/WeakHashMap.<init>(IF)V time=3741us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=815 MB
! java/util/HashMap.resize()[Ljava/util/HashMap$Node; time=3498us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=806 MB
! java/util/Hashtable.rehash()V time=2038us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=774 MB
! java/util/WeakHashMap.<init>()V time=1249us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=746 MB
! java/util/HashMap.<init>()V time=1279us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=746 MB
! java/util/HashMap.<init>(IF)V time=2877us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=735 MB
! java/util/HashMap.putMapEntries(Ljava/util/Map;Z)V time=2132us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=729 MB
! java/util/HashMap.<init>(I)V time=1588us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=728 MB
! java/util/Hashtable.<init>(IF)V time=4472us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=723 MB
! java/util/Hashtable.<init>()V time=1270us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=713 MB
! java/util/LinkedHashMap.<init>(IF)V time=1085us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=695 MB
! java/util/HashSet.<init>(IFZ)V time=1327us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=695 MB
! java/util/LinkedHashSet.<init>()V time=1038us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=695 MB

And maybe the most important:

grep -E "^! " /tmp/JIT_log.20181025.143321.6552 | grep -i Zip
! java/util/zip/ZipCoder.getBytes(Ljava/lang/String;)[B time=3326us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=815 MB
! java/util/zip/ZipCoder.toString([BI)Ljava/lang/String; time=3622us compilationRestrictedMethod memLimit=262144 KB freePhysicalMemory=746 MB

This is the code responsible for unzipping.
I guess that compilationRestrictedMethod is caused by adding -Xjit:disableFPCodeGen.

It seems like this is the main slowdown factor (the 'prod' application is 3x slower and the CPU usage is very high). Hence my question above about whether unzipping can be replaced with an on-system library.

@lmajewski
Author

It seems like the zip decompression is the bottleneck - at least from the FPU point of view...

I took the webapp (*.war) and re-archived it with jar:
/usr/bin/fastjar -c0f web.war web/ -> the size increased a bit (23 MiB -> 27 MiB),

but the execution time (for this part) was reduced from 60 seconds to 2.5 seconds !!!

@ymanton
Member

ymanton commented Oct 30, 2018

but the execution time (for this part) was reduced from 60 seconds to 2.5 seconds !!!

That's surprising. I don't see how disableFPCodeGen fits into this since I don't think the default zlib compression algorithm uses floating point, but maybe I'm wrong.

Is the perf tool or OProfile available to you? If so, can you collect some profiles?

With perf you should use the -Xjit:perfTool option; with OProfile use -agentlib:jvmti_oprofile.

@lmajewski
Author

Output from the perf report (for the part performing the zlib decompression):

# Overhead  Command   Shared Object        Symbol
   20.24%   jsvc.ppc  libj9vm29.so         [.] bytecodeLoop
    4.45%   jsvc.ppc  [kernel.kallsyms]    [k] program_check_exception
    3.10%   jsvc.ppc  libj9thr29.so        [.] omrthread_spinlock_acquire
    2.14%   jsvc.ppc  [kernel.kallsyms]    [k] do_mathemu
    1.45%   jsvc.ppc  [kernel.kallsyms]    [k] do_resched
    1.05%   jsvc.ppc  [kernel.kallsyms]    [k] __do_softirq
    0.89%   jsvc.ppc  [kernel.kallsyms]    [k] finish_task_switch

...
    0.19%   jsvc.ppc  [kernel.kallsyms]    [k] lfd
    0.19%   jsvc.ppc  [kernel.kallsyms]    [k] stfd
    0.15%   jsvc.ppc  [kernel.kallsyms]    [k] fsub
    (and also fdiv, fmuls, etc.)

It is apparent that some FPU instructions have slipped into the zip decompression code.

However, neither images/j2sdk-image/jre/lib/ppc/libzip.so nor ./jdk/lib/ppc/default/libj9zlib29.so contains FPU ASM instructions.

I've also grep'ed the libj9*.so libs (in the build directory), and lfd, stfd, etc. appear there very often.

@ymanton
Member

ymanton commented Oct 30, 2018

OK, thanks. bytecodeLoop is the Java interpreter doing a lot of work because of disableFPCodeGen; program_check_exception, do_mathemu, lfd, stfd, etc. are FP emulation in the kernel. do_resched, __do_softirq, etc. are possibly also emulation-related. So it looks like 20% of your time is in the interpreter and 10% or more is in FP emulation.

I'll look into some of this when I have some time in the next couple of days and get back to you.

@ymanton
Member

ymanton commented Nov 9, 2018

I took a quick look at why ZipCoder.getBytes() and ZipCoder.toString() were not being compiled with -Xjit:disableFPCodeGen, and it is indeed because they use floating-point operations; for example float CharsetEncoder.maxBytesPerChar() and CharsetDecoder.maxCharsPerByte(), which is unfortunate because the calculation isn't all that interesting. There are probably lots of other places where some minor FP usage is causing methods to fail compilation.

If you want, you can try the following change to your JCL to see how much performance you can get back for unzipping:

diff --git a/jdk/src/share/classes/java/util/zip/ZipCoder.java b/jdk/src/share/classes/java/util/zip/ZipCoder.java
index b920b82..cc449e6 100644
--- a/jdk/src/share/classes/java/util/zip/ZipCoder.java
+++ b/jdk/src/share/classes/java/util/zip/ZipCoder.java
@@ -45,7 +45,7 @@ final class ZipCoder {
 
     String toString(byte[] ba, int length) {
         CharsetDecoder cd = decoder().reset();
-        int len = (int)(length * cd.maxCharsPerByte());
+        int len = (int)(length * maxCharsPerByte);
         char[] ca = new char[len];
         if (len == 0)
             return new String(ca);
@@ -76,7 +76,7 @@ final class ZipCoder {
     byte[] getBytes(String s) {
         CharsetEncoder ce = encoder().reset();
         char[] ca = s.toCharArray();
-        int len = (int)(ca.length * ce.maxBytesPerChar());
+        int len = (int)(ca.length * maxBytesPerChar);
         byte[] ba = new byte[len];
         if (len == 0)
             return ba;
@@ -127,6 +127,8 @@ final class ZipCoder {
     private Charset cs;
     private CharsetDecoder dec;
     private CharsetEncoder enc;
+    private int maxCharsPerByte;
+    private int maxBytesPerChar;
     private boolean isUTF8;
     private ZipCoder utf8;
 
@@ -139,11 +141,15 @@ final class ZipCoder {
         return new ZipCoder(charset);
     }
 
+    private int maxCharsPerByteRU() { return (int)(dec.maxCharsPerByte() + 0.5f); }
+    private int maxBytesPerCharRU() { return (int)(enc.maxBytesPerChar() + 0.5f); }
+
     private CharsetDecoder decoder() {
         if (dec == null) {
             dec = cs.newDecoder()
               .onMalformedInput(CodingErrorAction.REPORT)
               .onUnmappableCharacter(CodingErrorAction.REPORT);
+            maxCharsPerByte = maxCharsPerByteRU();
         }
         return dec;
     }
@@ -153,6 +159,7 @@ final class ZipCoder {
             enc = cs.newEncoder()
               .onMalformedInput(CodingErrorAction.REPORT)
               .onUnmappableCharacter(CodingErrorAction.REPORT);
+            maxBytesPerChar = maxBytesPerCharRU();
         }
         return enc;
     }

@lmajewski
Author

Thanks @ymanton for your investigation.

As one can see above, code which at first glance doesn't seem to require FP support needs it.
I think that the only feasible solution would be to:

  1. Enable full FPU emulation in the kernel (some FPU instructions, like sqrt(), are also implemented in libc)
  2. Recompile the whole SW stack with -mcpu=powerpc (do not use SPE at all)
  3. Only then use OpenJ9 with FPU enabled.

Taking the above into consideration, we can get away without massive changes in the OpenJ9 code and just add 32-bit PPC support to its repository.

@shingarov
Contributor

Have you managed to make any progress there (with e500 or RISC-V)? If yes, could you share your code (even at the development stage) on GitHub?

@lmajewski Our immediate goals at this stage are much more modest, being currently confined to just OMR. On RISC-V, we successfully JIT some simple methods such as Fibonacci. We hope to share that initial code during the coming RISC-V Summit.

On e500, I would like to understand how you were able to run so much of OpenJ9 so successfully. In my experiments so far, I have confined myself to the much simpler TestCompiler, and even for those trivial tests the generated code is sometimes incorrect. For example, I am trying to debug problems in the area of PPCSystemLinkage::calculateActualParameterOffset() and around it; sometimes the offsets are wrong, with catastrophic results: the lwz will trash the saved LR in the link area, causing the blr to segfault. I would like to understand how OpenJ9 doesn't crash in the same place. Investigating...

@ymanton
Member

ymanton commented Nov 23, 2018

@shingarov PPCSystemLinkage implements the ppc64le ABI only (because OMR is only supported on ppc64le); it does not handle the AIX/ppc64be ABI or the ppc32 ABI. We don't use the native ABIs for Java; we use our own, and you can find the implementations for that stuff in https://github.com/eclipse/openj9/blob/master/runtime/compiler/p/codegen/PPCPrivateLinkage.cpp and https://github.com/eclipse/openj9/blob/master/runtime/compiler/p/codegen/PPCJNILinkage.cpp

@ymanton
Member

ymanton commented Feb 27, 2019

@lmajewski and @PTamis, just curious if you're still pursuing this and/or still using OpenJ9 on e500?

I'm going to spend some time figuring out what we can salvage from the various patches that have been discussed in this issue that can be contributed back to OMR and OpenJ9.

@lmajewski
Author

Dear @ymanton, please find a small update on this project:

  1. As you might have noticed, PPC SPE support has been removed in GCC 9 [1], hence there is no point in providing OpenJ9 support for this particular architecture (especially as it is time-consuming).

  2. Considering the above, the idea was to recompile the whole rootfs and userspace binaries to target the "generic" PPC32 architecture and see what we can achieve with OpenJ9 + SW math emulation.
    Some initial investigation has been carried out, but the decision to switch hasn't been made.

[1] - https://www.phoronix.com/scan.php?page=news_item&px=GCC-Removes-PowerPCSPE

@wyatt8740

wyatt8740 commented Mar 2, 2019

Further notes upon researching:

I'm not sure if there's enough interest in non-e500-using communities, though, since most of the (non-ancient) obtainable hardware for an average user is 64-bit POWER with AltiVec:

  • IBM machines
  • Other enterprise/special purpose high performance computers
  • Raptor Talos

With that said, my PowerBook G4 would certainly be easier to justify keeping if I could run Java software at a reasonable clip (given the system's inherent performance). OpenJDK/Zero is absolutely miserable, and IBM's old J9 JVM only runs on it up to version 1.6 or so (where it performs quite well on Debian).

@PTamis

PTamis commented Mar 2, 2019

@ymanton sorry for my late reply.
For the time being we have stopped any further development on this issue. It is a pity, since we made so much progress, but the time frame we had for completion was too narrow and the risk was high.
As @lmajewski said, we believe it might be easier to use pure PPC32 and drop any SPE instructions. I made some tests in this direction and the results were very positive.
I want to believe that we will start working on this issue again at the start of next year, when we will have to reconsider our options for the Java being used in our PPC targets.
