Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Architecture updater (auto-sync) - Updating ARM #1949

Merged
merged 51 commits into from
Jul 19, 2023

Conversation

Rot127
Copy link
Collaborator

@Rot127 Rot127 commented Jan 11, 2023

Architecture Updater

This is the very early version of the new architecture updater for Capstone.
It is meant to replace (#1831 and #1803).
Because this updater makes a few refactor tasks on Capstone necessary, it is opened as draft PR.

Feedback is very much welcome!

Please note that x86 will not be supported here. Due to the vast difference in the x86 LLVM TableGen backends it would be too much work.
This PR will only update ARM. Once ARM is done, PPC and afterwards other archs will follow in separated PRs.
ARM and PPC come first because they have the most corner cases.

Updater purpose

Updating an architecture module requires a lot of manual work. This is because (most) Capstone modules use the LLVM Disassembler and AsmPrinter which are written in C++.
Since Capstone is written in C the updating process requires a lot of manual work to translate the files.

The consequence is that instructions, introduced by new LLVM releases, are hard to add.

The purpose of this updater is to automate as much work as possible.
The target is that a developer only has to run the updater and fix a few build errors afterwards (which were too complex to resolve automatically).

Additionally, this will refactor a few things in Capstone.
Because of some changes in the LLVM TableGen backends (see below) we can generate easily more information than before.
A simple example for this are the instruction enums in include/capstone/<ARCH>.h.
But there is more, like information about input and output operands, operand sizes etc.

Also, because all information will be generated homogeneously we can generalize some Capstone logic between architecture modules.

How it works

Please refer to suite/auto-sync/README.md for an introduction. Everything should be explained there.

If you still have questions afterwards, please let me know here. This way I can enhance the documentation.

Please note that this PR currently does not include the new versions of the generated files. You can generate them by following the instructions in the README.

Generated files are added now.

The LLVM TableGen backends - The most important part

The new updater is only possible in this way because of some larger refactor tasks in some LLVM TableGen backends.
The newly refactored backends allow to emit the same code but in other languages then C++ (in our case we want them to emit C code).

Please see this LLVM review for details.

Testing

The updater is tested via Rizin. Checkout rizinorg/rizin#3399 and the CI test results of it.

TODO overview

Changes to core

  • Done

Translation

  • Done

Add patches for:

  • Done

Fix:

  • Done

Generated info

  • Done

Update script

  • Add functionality that previous hand made fixes can be applied to the newly translated file in an easy manner (something like "diff between tree version and translated version"?).
  • Hint which files should be checked for changes (DisassemblerExtensions, BaseInfo.h) by hand.

Other

  • Add tests (and fix if in scope) all issues regarding ARM disassembly problems.
  • Add test cases of test_arm to cstest issue file.
  • Document LLVM disassembler logic (based on single instructions) and CS logic (based on instrucitons with same Mnemonic - Matchables in LLVM).
  • Duplicate enum definitions in arm.h and translated header files (like ARMBaseInfo.h).
  • Rename files translated from LLVM (add LLVM to name).
  • Fix links in READMEs
  • The header comment of all generated and translated files must contain the LLVM commit and release the update is based on.
  • Resolve Capstone LLVM backend design problems (after more feedback, for now leave it like that)
  • Add missing CAPSTONE_DIET guards
  • Segfaults: ./cstool -d -s arm "80 b4 00 af 0a 4b d3 ed 00 7a b7 ee e7 6a 9f ed 06 5b 85 ee 06 7b f7 ee c7 7b b0 ee 67 0a bd 46 5d f8 04 7b 70 47"
  • Segfaults: ./cstool -d armbe 0xCD000B00
  • Groups in detail seem off.
  • Allow to set multiple writeback indices
  • All reglists are read. td files do not define write reglists.
  • Capitalize all arch names
  • Add step to patch files, to fix stuff like: [broken, don't merge] correct register access attributes of Arm's LDM instruction #1987 (comment)
  • Add issue template. So the bug is compared to llvm-objdump first.
  • Fix ARM_reg_access()
  • System operands new groups

TODO after merge

Open issue about

  • Removing MCInst.writeback and introducing Constraints instead (With AArch64 update). The constraints can be checked if they match a writeback condition (see: ARM writeback on post-index  #1507).
  • Check Predicate before setting UpdateFlags = True: update_flags not working as expected for some ARM instructions #1568
  • [broken, don't merge] correct register access attributes of Arm's LDM instruction #1987 (Needs fix. Is decoded to pseudo instr. tLDMIA not to t2LDMIA_UDP)
  • Adding a call to mapping for getRegName() which returns a register alias if one is present (fp etc.). LLVM only returns the raw reg name.
  • Remove generation of set_mem_access in ARMs inc files. The neon lanes can be checked without it.
  • cstool -d shows the wrong id name (but the correct numerical value).
  • Add sys register name as string to detail (see: ARM get sysreg name #1713). Can be done in ARM_set_detail_op_sysreg.
  • Generate test files from output from llvm-objdump. Extend suite/MC/update.py with https://github.com/imbillow/capstone/blob/tricore/suite/gencstest.py
  • Handle ARMs CDE (Custom-Data-Extension) Coprocessors (0xb4,0xec,0x04,0x85 = ldc p5, c8, [r4], #0x10 decodes differently if CDE coprocs are present.) and DFB
  • Add detail of instructions operands which are never decoded (e.g. the #0 in 0x40,0x1b,0xf5,0xee == vcmp.f64 d17, #0 (arm). As well as MSR instructions (only print the serach for MRS in AsmPrinter.inc).
  • ARM branches read registers #1087 - Conditioinal branch instruction have no CPSR access set. Fix in tablegen.
  • Open issue about Capstone ARCHITECTURE docs
  • Add a call to mapping which adds Capstone only groups to specific instructions (e.g. INT, RET).
  • Allow to test Patches of the translator. Add a method to each patch to get test cases and use the patcch on it.
  • Update-Arch.sh should be a Python script. Compatibility with Windows is a nightmare otherwise.
  • Move the CppTranslator python venv to capstone repo root
  • Open revesion for d64f749
  • Upstream capstone-engine/llvm-capstone@dd0b3b8 to LLVM

Issues fixed

Fixed at the current state

closes #1985
closes #1946
closes #1897
closes #1784
closes #1855
closes #1935
closes #1951
closes #1983
closes #1783
closes #1587
closes #1197
closes #1020
closes #994
closes #1072 (Hand made groups are no longer supported. Only the ones defined by LLVM.)
closes #1601
closes #1228
closes #1196
closes #1195
closes #1724
closes #1568

Possibly closed

Need to be checked after this PR is ready.

#993 (needs tblgen fix)
#1713 (leave issue open)
#984 (LLVM issue.)
#1201 (Leave issue open)
#1013 (Fix after merge)

Possibly breaking changes

  • Enum changes:
    • ARM_CC_ -> ARMCC
    • System registers are renamed to match C++ namespaces. Also group Banked and system registers into different groups.
    • Some instr. enum entries no longer exist (e.g. VPUSH, VPOP).
  • Some instruction groups which are not part of LLVM were removed (e.g. GROUP_INT)
    • Groups like RET, INT should be added via Mapper separately.
  • Feature groups like ARM_GRP_CRC are renamed to match LLVM nameing: ARM_FEATURE_HasCRC
  • Features are now checked more strictly (V8, MCLASS, ARM, THUMB) because instruction aliases are supported now. So it matters.
  • The memory offset register or immediate are now always part of the memory operand. Offsets or index operands are no longer separated. Before, only offset ops which were within the [] brackets were added.
  • writeback is part of detail and no longer of detail.arm.
  • Register alias not defined in LLVM (r15 = pc etc.) are no longer printed as default. Must be enabled via CS_OPT_SYNTAX_CS_REG_ALIAS or -a for the cstool.
  • The immediate value of operands is no of type uint32_t, no longer int32_t.
  • Memory disponents are always positive. The subtracted flag indicates if they should be subtracted or not. Before disp sign and subtracted fag were inconsistent.
  • The ITState is no longer reset after each cs_disas call. Only after cs_close.

@Rot127
Copy link
Collaborator Author

Rot127 commented Jan 12, 2023

If you get build errors for Blake3

FAILED: lib/Support/BLAKE3/CMakeFiles/LLVMSupportBlake3.dir/blake3_avx512_x86-64_unix.S.o 
/usr/bin/ccache -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/user/repos/llvm-capstone/build/lib/Support/BLAKE3 -I/home/user/repos/llvm-capstone/llvm/lib/Support/BLAKE3 -I/home/user/repos/llvm-capstone/build/include -I/home/user/repos/llvm-capstone/llvm/include -fPIC -mavx512vl -o lib/Support/BLAKE3/CMakeFiles/LLVMSupportBlake3.dir/blake3_avx512_x86-64_unix.S.o -c /home/user/repos/llvm-capstone/llvm/lib/Support/BLAKE3/blake3_avx512_x86-64_unix.S
/usr/bin/ccache: invalid option -- 'D'

then apply this work around:

diff --git a/llvm/lib/Support/BLAKE3/CMakeLists.txt b/llvm/lib/Support/BLAKE3/CMakeLists.txt
index e5d227b0c486..f795bbabe702 100644
--- a/llvm/lib/Support/BLAKE3/CMakeLists.txt
+++ b/llvm/lib/Support/BLAKE3/CMakeLists.txt
@@ -10,6 +10,7 @@ if (LLVM_DISABLE_ASSEMBLY_FILES)
 else()
   set(CAN_USE_ASSEMBLER TRUE)
 endif()
+set(CAN_USE_ASSEMBLER FALSE)
 
 macro(disable_blake3_x86_simd)
   add_definitions(-DBLAKE3_NO_AVX512 -DBLAKE3_NO_AVX2 -DBLAKE3_NO_SSE41 -DBLAKE3_NO_SSE2)

@Rot127
Copy link
Collaborator Author

Rot127 commented Jan 22, 2023

From now on you will find an overview, of the tasks left, in the PR description.

@Rot127
Copy link
Collaborator Author

Rot127 commented Feb 22, 2023

Currently the generated .inc files are not pushed. @XVilka @kabeor Should I push them just to let the build be successful? Or do want to have a separated PR for them because they'll add an enormous line count?

Never mind. I will push them once ARMSystemRegister.inc can be generated. They can also be removed just before the merge.

@Rot127 Rot127 force-pushed the auto-sync branch 2 times, most recently from 6e082c6 to 7fa0c46 Compare February 27, 2023 23:52
@XVilka
Copy link
Contributor

XVilka commented Mar 15, 2023

Additional architectures that are supported by LLVM already but not yet by Capstone, which could be added once this auto-sync feature is finished, are:

  • LoongArch
  • CSKY
  • Lanai
  • Xtensa

@imbillow
Copy link
Contributor

image

Some instructions mnemonic contain '.' , but in the generated file <Arch>GenCSMappingInsn.inc retains these '.'. It may be safe to replace them with '_'

@imbillow
Copy link
Contributor

image

Then in the <Arch>GenDisassemblerTables.inc file, the C++ symbol :: seems to be replaced with _ everywhere else, but not here for some reason.

@Rot127
Copy link
Collaborator Author

Rot127 commented Mar 23, 2023

Some instructions mnemonic contain '.' , but in the generated file <Arch>GenCSMappingInsn.inc retains these '.'. It may be safe to replace them with '_'

Thanks for pointing it out :) Will fix this.

Then in the <Arch>GenDisassemblerTables.inc file, the C++ symbol :: seems to be replaced with _ everywhere else, but not here for some reason.

Have you rebased the tblgen_capstone_backends branch onto TriDis repo? Or did you rebase TriDis branch onto the tblgen_capstone_backends branch?

@imbillow
Copy link
Contributor

Some instructions mnemonic contain '.' , but in the generated file <Arch>GenCSMappingInsn.inc retains these '.'. It may be safe to replace them with '_'

Thanks for pointing it out :) Will fix this.

Then in the <Arch>GenDisassemblerTables.inc file, the C++ symbol :: seems to be replaced with _ everywhere else, but not here for some reason.

Have you rebased the tblgen_capstone_backends branch onto TriDis repo? Or did you rebase TriDis branch onto the tblgen_capstone_backends branch?

Neither. In fact, I just used https://github.com/Rot127/llvm-capstone/tree/tblgen_capstone_backends
Then I modified and used suite/auto-sync/Update-Arch.sh to generate some .inc files from the TableGen files.

@Rot127
Copy link
Collaborator Author

Rot127 commented Mar 23, 2023

I see. So you copied the TriCore .td files into a new TriCore directory in llvm/lib/Target/. Correct?
Or did you used the ones in the Capstone commit (in capstone/arch/TriCore)?
If so, please push your version to a separated branch so I can debug it.

@imbillow
Copy link
Contributor

https://github.com/imbillow/capstone/tree/14a3f56c2eb829b6a9d93ccfc2693e456e1d3fb1

Use this arch/TriCore/TriCore.td

I see. So you copied the TriCore .td files into a new TriCore directory in llvm/lib/Target/. Correct? Or did you used the ones in the Capstone commit (in capstone/arch/TriCore)? If so, please push your version to a separated branch so I can debug it.

@Rot127
Copy link
Collaborator Author

Rot127 commented Apr 5, 2023

@kabeor I assume it is fine to add the MC regression tests to the CI (running cstest -d MC/ARM/)?

@kabeor
Copy link
Member

kabeor commented Apr 6, 2023

@kabeor I assume it is fine to add the MC regression tests to the CI (running cstest -d MC/ARM/)?

Of course yes!

@XVilka
Copy link
Contributor

XVilka commented Apr 6, 2023

GitHub removed Ubuntu 18 images, if you want to continue supporting it, these should be converted to use Docker instead: actions/runner-images#6002

Moreover, it makes sense to add support for Ubuntu 22.x series in CI, better in a separate PR, then rebase once it is merged.

EDITED: Nevermind, it was already done, just rebase is required: #1986

@XVilka

This comment was marked as resolved.

@Rot127
Copy link
Collaborator Author

Rot127 commented Apr 14, 2023

@kabeor What would be your release strategy for the auto-sync PRs? Will you release v5 before this is merged? And then v6 after all 9 archs have been updated? Or wait to update all 9 archs and then release v5?
Better to be discussed here: #1980

@XVilka
Copy link
Contributor

XVilka commented Jul 7, 2023

Yes. Waitting for @aquynh to take a look.

A gentle ping.

@XVilka
Copy link
Contributor

XVilka commented Jul 8, 2023

@kabeor @aquynh ping

@Rot127 Rot127 changed the title Architecture updater (auto-sync) - Part 1/9 - Updating ARM Architecture updater (auto-sync) - Updating ARM Jul 8, 2023
@XVilka
Copy link
Contributor

XVilka commented Jul 9, 2023

@kabeor @aquynh ping

@XVilka
Copy link
Contributor

XVilka commented Jul 11, 2023

@kabeor @aquynh ping

@kabeor
Copy link
Member

kabeor commented Jul 11, 2023

@aquynh is still reviewing. He is really busy these days and only has time to do it at night. So sorry for the delay.😅

@XVilka
Copy link
Contributor

XVilka commented Jul 11, 2023

@aquynh is still reviewing. He is really busy these days and only has time to do it at night. So sorry for the delay.😅

Ok, thats fine and understandable. Complete silence is what worried me.

@XVilka
Copy link
Contributor

XVilka commented Jul 14, 2023

@aquynh is still reviewing. He is really busy these days and only has time to do it at night. So sorry for the delay.😅

So, any feedback yet?

@XVilka
Copy link
Contributor

XVilka commented Jul 16, 2023

PING!

@Rot127
Copy link
Collaborator Author

Rot127 commented Jul 17, 2023 via email

@kabeor
Copy link
Member

kabeor commented Jul 19, 2023

Finally we are ready to merge now.

@Rot127 Just a quick check, "Rename files translated from LLVM (add LLVM to name)." item isn't selected. Any commit needs?

@Rot127
Copy link
Collaborator Author

Rot127 commented Jul 19, 2023 via email

@kabeor
Copy link
Member

kabeor commented Jul 19, 2023

Great, so thanks again for this excellent work. Merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment