Skip to content

Profiles and new platform support

Chris Hathhorn edited this page Aug 1, 2019 · 12 revisions

Match/C-semantics profiles

If a program is strictly-conforming, then the basic C semantics with no extensions should be enough to interpret it and detect undefined behavior. But the reality of how C is actually used means it is much more common for programs to depend on implementation-specific details of particular hardware or software platforms. Profiles are meant to allow the interpreter generated from the C semantics to analyze programs in the context of such platform-specific behavior.

Platform-specific extensions

Platform-specific extensions and behavior should optimally be confined to a profile. These behaviors can broadly be understood as behaviors originating from the compiler, the operating system, or the underlying hardware. We have broken them down into more specific categories:

  • Command-line interface/flags -- the specifics of the interface a particular compiler uses need to be supported for RV-Match to be able to replace that compiler in a build system.
  • Preprocessor macros and extensions -- the specific preprocessor used by an implementation and the value of feature-test macros.
  • Implementation-specific choices -- the C standard leaves implementations with several choices, such as the size of int, that need to be made in order to interpret programs.
  • Language extensions: pragmas, attributes -- language extensions that generally don't require extra support by the parser.
  • Language extensions: built-in functions -- language extensions that generally behave like function calls, so don't require extra support by the parser.
  • Language extensions: new syntax -- language extensions involving novel syntax that requires extra support by the parser (e.g., gcc's statement expressions: https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html#Statement-Exprs).
  • Operating system interface and libraries -- e.g., the Linux system call interface or POSIX library functions.
  • Native execution/FFI support -- the components necessary for executing functions on the native platform when they can't be analyzed by Match (e.g., when they contain inline assembly).

Current profiles

The list of installed profiles can be queried with kcc -version. We currently support a few different profiles, all based on gcc and x86_64:

  • x86_64-linux-gcc-glibc -- the default profile intended to act as a drop-in replacement for gcc with -std=c11 set.
  • x86-gcc-limited-libc -- like the x86_64-linux-gcc-glibc profile, except that it has limited support for the C standard library because it does not use the standard library implementation installed on the system. This is similar to the profile the open source semantics uses: https://github.com/kframework/c-semantics/tree/master/profiles/x86-gcc-limited-libc
  • x86_64-linux-gcc-glibc-reverse-eval-order -- like the above x86_64-linux-gcc-glibc profile, except rules are evaluated in the reverse order, which can sometimes find errors due to non-determinately sequenced expressions.
  • x86_64-linux-gcc-glibc-gnuc and x86_64-linux-gcc-glibc-gnuc-reverse-eval-order -- these profiles are both analogous to the similarly-named ones above, except they are meant to behave like gcc with -std=gnu11, which is the default behavior of gcc and might be required in certain cases. Functionally, these profiles provide the same GNUC language extensions as the ones above, the only difference is the value of feature-test macros.

In the past, we have also experimented with an 78k0r profile.

All of the above profiles are based on the behavior of gcc. In the following, we outline the extensions to the basic C semantics these profiles support in each of the categories mentioned above.

Command-line interface/flags

Implemented as: extensions to the kcc script, not in the profile.

The command line interface to Match is provided by the kcc script, which is intended to be compatible with gcc. It supports many of gcc's most commonly-used flags in addition to flags specific to Match. Supporting other another compiler's interface would generally require writing a similar wrapper script.

The current state of the command line interface supported by kcc is described by kcc --help. This piece of platform-specific behavior is not kept in profiles and is the same for all of them (though some flags are not supported by the open source profile). The actual command line parsing done by kcc happens here:

https://github.com/kframework/c-semantics/blob/master/scripts/RV_Kcc/Opts.pm

Extending kcc to support more flags would require, to start with, updating this file with the new flags.

Preprocessor macros and extensions

Implemented as: preprocessor script in the profile.

Currently, kcc uses gcc as a C preprocessor. The default profile also unsets the __GNUC__ feature-test macro in order to discourage the use of GNU language extensions.

The preprocessor used by Match is set by the profile. E.g., this is the preprocessor used by the open source profile:

https://github.com/kframework/c-semantics/blob/master/profiles/x86-gcc-limited-libc/pp

Implementation-specific choices

Implemented as: semantic extensions in the profile.

Implementation-specific behavior described in the standard are set in various *-settings.k files in the profile, e.g.: https://github.com/kframework/c-semantics/blob/master/profiles/x86-gcc-limited-libc/semantics/c-settings.k

For example:

// ...
     rule cfg:sizeut => unsigned-long-int
     rule cfg:wintut => unsigned-int
     rule cfg:wcharut => int
     rule cfg:ptrsize => 4
     rule cfg:ptrdiffut => int

     rule cfg:alignofBool => 1
     rule cfg:alignofSignedChar => 1
     rule cfg:alignofShortInt => 2
     rule cfg:alignofInt => 4
     rule cfg:alignofLongInt => 4
     rule cfg:alignofLongLongInt => 8
     rule cfg:alignofOversizedInt => 16
     rule cfg:alignofFloat => 4
     rule cfg:alignofDouble => 8
     rule cfg:alignofLongDouble => 16
     rule cfg:alignofOversizedFloat => 16
     rule cfg:alignofPointer => 4
     // Max of the above.
     rule cfg:alignofMalloc => 16
// ...

Language extensions: pragmas, attributes

Implemented as: semantic extensions in the profile.

Some gcc attributes we support:

  • __attribute__((packed)), __attribute__((__packed__))

Some kcc attributes we support:

  • kcc_absolute
  • kcc_extends_tu_of

Language extensions: built-in functions

Implemented as: semantic extensions (and sometimes C source) in the profile.

List of gcc built-in functions for which we have some explicit semantic support:

__builtin_abort
__builtin_aligned_alloc
__builtin_calloc
__builtin_exit
__builtin_free
__builtin_malloc
__builtin_realloc
__builtin_asin
__builtin_atan2
__builtin_atan
__builtin_cos
__builtin_exp
__builtin_floor
__builtin_fmod
__builtin_log
__builtin_sin
__builtin_sqrt
__builtin_tan
__builtin_longjmp
__builtin_setjmp
__builtin_fprintf
__builtin_fputc
__builtin_fputs
__builtin_fscanf
__builtin_fwrite
__builtin_printf
__builtin_putc
__builtin_putchar
__builtin_puts
__builtin_scanf
__builtin_snprintf
__builtin_sprintf
__builtin_sscanf
__builtin_vfprintf
__builtin_vfscanf
__builtin_vprintf
__builtin_vscanf
__builtin_vsnprintf
__builtin_vsprintf
__builtin_vsscanf
__builtin_memset
__builtin_strcpy
__builtin_strlen
__builtin_alloca
__builtin_offsetof

Some kcc built-ins (also, generally any gcc built-in with the __builtin_ prefix replaced with __kcc_):

__kcc_breakpoint()

Additionally, we auto-generate wrappers to implement many of the gcc builtin functions we do not have explicit semantics for. An incomplete list of gcc builtin functions here:

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

Language extensions: new syntax

Implemented as: semantic extensions in the profile and parser extensions (outside profiles).

Some gcc language extensions we support:

  • Case ranges.
  • Statement expressions.
  • typeof
  • Pointer arithmetic on void and function pointers.
  • Zero length arrays.
  • __int128, __int128_t, __uint128_t
  • __float128
  • auto type.

Also, types, misc kcc stuff:

  • __kcc_int{8,16,32,64}_t, __kcc_uint{8,16,32,64}_t

List of gcc language extensions here:

https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html

Operating system API and libraries

Implemented as: semantic extensions in the profile.

For Match to detect errors or misuse of APIs, it needs semantics. In addition to part of the C standard library (the semantics of which is https://github.com/kframework/c-semantics/tree/master/semantics/c/library), the proprietary extensions to the C semantics include (at least partial) semantics for the POSIX Threads API.

Native execution/FFI support

Implemented as: part of the profile; a natively-compiled program running on the target platform.

The native server is a program intended to run on the native platform. It accepts compiled, executable program fragments from Match, executes the fragment, and returns the result to Match. We have an example implementation, with documentation, here:

https://github.com/kframework/c-semantics/tree/master/native-server

Mechanisms for native platform execution

We currently have two methods for dealing with natively-compiled code: "direct FFI" and "native server."

Direct FFI

This is currently our default mechanism.

Translation-time

  1. Compile all code with the native compiler for the current platform.
  2. Mangle and export symbols without external linkage to make them visible and accessible to libffi.
  3. Rename the main function so it doesn't clash with the interpreter's main function.
  4. Translate the same code with the C translation semantics. During translation, if we notice a function definition with assembly (or e.g. some other compiler builtin we don't have semantics for), we remove the definition from the generated AST and mark calls to the function to be handled via the native call mechanism.
  5. The executable produced includes the interpreter (extracted from the execution semantics) linked with the natively-compiled object along with the translated AST combined in one binary.

Execution-time

  1. When the interpreter generated from the C execution semantics encounters a call to a function to be executed natively, it invokes a special nativeCall function, with the symbol of the function to be called and the arguments, which is defined in the OCaml backend.
  2. The OCaml definition for nativeCall (which is part of our Match-specific K plugin) handles marshaling the arguments from the abstract representation used by the semantics to native data structures and then calling the native function.
  3. When the native function returns, the Match-specific K plugin handles marshalling native data back into the representation used by the semantics.

Native server

This is an experimental mechanism more complicated than direct FFI, but allows native execution to happen on a platform (e.g., an embedded device or emulator) different than the platform performing the analysis.

Translation-time

The steps taken during translation-time are the same as for the direct FFI approach, except that the natively-compiled object is linked together with the native server into a separate executable. The native server comprises three pieces: (1) a platform-independent piece, (2) a platform-dependent piece, and (3) a program-dependent set of trampoline definitions.

Native server: platform-independent piece

The platform-independent piece (at https://github.com/kframework/c-semantics/tree/master/native-server) defines a protocol for communicating between the interpreter and the native server (which might be executing on a platform different from the interpreter). This protocol has five commands:

  • _INVOKE: takes a pointer to a function, a list of arguments, a pointer to the return value's location in memory, and a set of native memory to be marshalled, and executes the specified function, then unmarshals the native memory back tto the server.

  • _GET_SYMBOL_TABLE: returns a list of all native symbols that may be used by the application, as well as their addresses.

  • _ALLOC: takes an integer size and allocates dynamically native memory of the specified size, returning the address.

  • _SYNC_READ: takes an address and a width in bytes and returns the content of the specified byte range.

  • _EXIT: shuts down the native code server upon termination of the application.

Native server: platform-dependent piece

This platform-independent protocol is defined in terms of platform-dependent communication and allocation primitives which are intended to be implemented on each platform. These primitives:

  • void __init_messages(const char *read_path, const char *write_path): Connects to the RV-Match client process, optionally accepting a pair of file system paths pointing to named pipes. Implementations that do not have a file system are free to pass whatever values they want to this function.

  • void __send_message(void *buf, size_t len): sends a message of length len contained in the buffer buf. Guaranteed to send exactly len characters.

  • void __recv_message(void *buf, size_t len): Receives a message and copies len characters into the buffer buf. Guaranteed to receive exactly len characters.

  • void __deinit_messages(void): Performs cleanup to terminate the native call server and free resources.

  • void __error(char *msg): Handles an error that arises during execution of platform-independent code.

  • void * __alloc(size_t size): Allocates a native buffer of a particular size.

  • void __flush_message(void): Flush the write buffer, if any.

An example implementation using Linux pipes is here:

https://github.com/kframework/c-semantics/blob/master/profiles/x86-gcc-limited-libc/native/platform.c

Native server: program-dependent trampolines

During translation, we generate a set of trampoline definitions based on function callsites. Callsites are extracted using a clang-based tool:

https://github.com/kframework/c-semantics/blob/master/clang-tools/CallSites.cc

Which are then turned into trampoline definitions via scripts:

https://github.com/kframework/c-semantics/blob/master/scripts/make-trampolines

Trampolines are functions with names that encode the number and size of arguments encountered at a particular callsite and which accept two arguments: the address of a function to invoke and the address where the return value should be stored. They effectively wrap a call to the function in the first argument with the plumbing necessary to receive the arguments over the communication channel, invoke the function, and then store the result in the argument pointer. The automatically-generated definition of a trampoline handles receiving the arguments and invoking the argument function in terms of the platform-dependent primitives listed above.

For example, a trampoline definition for a call to a non-variadic (f) function accepting a single signed integer argument (si) and returning a signed integer looks like this:

void __si_si_1_f (void (*fptr)(), void *retval) {
  int (*fun)(int) = (int(*)(int))fptr;
  int *typed_retval = retval;
  int _0;
  __recv_message(&_0, sizeof(_0));
  *typed_retval = fun(_0);
}

The definition for a trampoline generated from calls to a function accepting no arguments and returning nothing (such as hello() below) looks like this:

void __v_0_f (void (*fptr)(), void *retval) {
  void (*fun)(void) = (void(*)(void))fptr;
  fun();
}

Execution-time

The steps taken during execution time are the same as for the direct FFI approach, except that the definition of the nativeCall function in the OCaml Match-specific K plugin is in terms of the protocol above. The Match-specific K plugin acts as a client to the native-server, issuing the _INVOKE, _ALLOC, _GET_SYMBOL_TABLE, etc. commands.

Consider this program:

#include <string.h>
 
char* str = "Hello World!\n";
 
void hello(void) {
        char *s = str;
        long len = strlen(str);
        int ret = 0;
 
        __asm__("movq $1, %%rax \n\t"
                "movq $1, %%rdi \n\t"
                "movq %1, %%rsi \n\t"
                "movl %2, %%edx \n\t"
                "syscall"
                : "=g"(ret)
                : "g"(s), "g" (len));
 
}
 
int main() {
  hello();
}

When execution in the interpreter reaches the point of the hello() call, it invokes the nativeCall function. If we're using the direct FFI mechanism, this gets translated into an FFI call using libffi and the OCaml ctypes library (https://github.com/ocamllabs/ocaml-ctypes). If we're using the native-server mechanism, this instead gets translated into a series of messages exchanged with the native call server:

  1. Interpreter -> server: _GET_SYMBOL_TABLE
  2. Server -> interpreter: the symbol table
  3. Interpreter looks up __v_0_f in the symbol table (tramp_addr), which is the trampoline for handling the hello() call.
  4. Interpreter looks up hello in the symbol table (hello_addr).
  5. Interpreter -> server: _INVOKE(tramp_addr, addr) (Arguments would follow if hello had been called with any.)
Clone this wiki locally