pyasm2

© 2012, Jurriaan Bremer

Introduction

pyasm2 is an x86 assembler library. It allows an easy Intel-like assembly syntax, with support for sequences of instructions, as well as labels.

Simple Usage

Here are some examples to illustrate the simplicity of pyasm2. For each example the normal Intel-syntax is given, followed by the equivalent using pyasm2.

  • push eax → push(eax)
  • mov eax, ebx → mov(eax, ebx)
  • lea edx, [ebp+eax*4+32] → lea(edx, [ebp+eax*4+32])
  • movzx ebx, byte [esp-64] → movzx(ebx, byte [esp-64])
  • mov eax, dword fs:[0xc0] → mov(eax, dword [fs:0xc0])

Note that pyasm2 throws an exception if the instruction doesn't support the given operands (an operand is like a parameter to an instruction.)

A few simple examples from the interactive Python prompt:

>>> from pyasm2 import *
>>> mov(eax, dword[ebx+0x100])
mov eax, dword [ebx+0x100]
>>> push(dword[esp])
push dword [esp]
>>> mov(eax, eax, eax) # invalid encoding
... snip ...
Exception: Unknown or Invalid Encoding

Blocks

Besides normal instructions pyasm2 also supports sequences of instructions, referred to as blocks from now on.

Blocks are especially useful when chaining multiple instructions. Besides that, blocks automatically resolve relative jumps, labels, etc.

A simple example is a function that does only one thing: it zeroes the eax register (which holds the return value of a function on x86) and returns to the caller. It looks like the following.

Block(
    xor(eax, eax),
    retn()
)

Before we discuss blocks any further, we first need an introduction to pyasm2 labels.

Labels

pyasm2 supports two types of labels: anonymous labels and named labels.

Anonymous Labels

Anonymous labels get an index, and can be referred to by a relative index.

For example, the following block increments the eax register in an infinite loop. (The -1 in this example is a relative index, so -1 points to the last defined Label.)

Block(
    Label(),
    inc(eax),
    jmp(Label(-1))
)

It is, however, not possible to reference anonymous labels outside of the current block (i.e. an IndexError is thrown.)

There are three different possible values for relative indices.

  • Negative Index → Points to an anonymous label before the current instruction.
  • Zero Index → Points to a transparent label which points to the current instruction.
  • Positive Index → Points to an anonymous label after the current instruction.

(This does indeed mean that relative index 1 points to the first label after the current instruction.)
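The three cases above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the pyasm2 implementation: the Label class and the resolve function are hypothetical stand-ins, and a label reference is stored as an item of its own rather than as the operand of a jump.

```python
class Label(object):
    """Hypothetical stand-in: Label() defines a label, Label(n) references one."""

    def __init__(self, index=None):
        self.index = index


def resolve(items, position):
    """Return the position in items that the reference at items[position] points to."""
    index = items[position].index
    if index == 0:
        return position  # the transparent label at the current instruction
    # positions of all anonymous label definitions in this block
    definitions = [i for i, item in enumerate(items)
                   if isinstance(item, Label) and item.index is None]
    if index < 0:
        before = [i for i in definitions if i < position]
        return before[index]   # -1 is the nearest label before the reference
    after = [i for i in definitions if i > position]
    return after[index - 1]    # 1 is the first label after the reference


items = [Label(), 'inc eax', Label(1), 'nop', Label(), Label(-2)]
forward = resolve(items, 2)    # first label after position 2
backward = resolve(items, 5)   # second label before position 5
```

A reference that points past the first or last label of the block makes the list lookup fail with an IndexError, which matches the behaviour described above.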

Throughout the following sections we will keep referring to the loop snippet above, rewriting it a little bit every time.

Global Named Labels

A new named label can be created by creating a new Label instance with the name as first parameter. Referencing a named label is just like referencing an anonymous label, but instead of passing an index, you give a string as parameter.

Block(
    Label('loop'),
    inc(eax),
    jmp(Label('loop'))
)

Note that this type of named label is global, that is, other blocks can reference this particular label as well. This is useful, for example, when defining a function. (Note that two or more blocks cannot declare the same global named label!)

Local Named Labels

Whereas one could make a global named label using e.g. Label('name'), it is also possible to make a local named label; a label that's only defined for the current block. Because local labels are more commonly used than global labels, their syntax is easier as well. Local named labels are simply created and referenced by using a string as name.

Block(
    'loop',
    inc(eax),
    jmp('loop')
)

Label References

-Labels are referenced by e.g. Label('name'). When looking up label references, pyasm2 will first try to find the label in the current block, and only if there is no such label in the current block, it will look it up in the parent. In other words, local named labels are more important than global named labels.-

Local Named Labels and Global Named Labels cannot be mixed. E.g. the following snippet throws an error.

Block(
    Label('loop'),  # global named label
    inc(eax),
    jmp('loop')     # local named label
)

Further Label Tweaks

Now that we've seen the types of labels supported by pyasm2, it is time to get to some tweaks which will speed up development and clean up your code even further.

Label classobj instead of instance

It is possible to define an Anonymous Label by passing the Label class, instead of passing an instance.

Block(
    Label,
    inc(eax),
    jmp(Label(-1))
)

Global Named Labels as Variables

Because global named labels can be referenced from outside their defining scope (a block), it is also possible to store one in a variable and reference it through that variable (e.g. for a function.)

return_zero = Label('return_zero')
f = Block(
    return_zero,
    xor(eax, eax),
    retn()
)
f2 = Block(
    call(return_zero),
    # ... do something ...
)

Alias Label to L

For those of us who think that the class name Label is too long, it is possible to simply make an alias called L (i.e. L = Label.)

Block(
    L,
    inc(eax),
    jmp(L(-1))
)

Tweaked Anonymous Label References

Because jmp(L(-1)) looks pretty ugly (see the Alias Label to L section), we've tweaked anonymous label references even further to the point where you can add or subtract a relative index directly to/from the Label class.

Block(
    L,
    inc(eax),
    jmp(L-1)
)

Offset from a Label

Sometimes it might be necessary to add a value to, or subtract a value from, the address of a label. In those cases the following technique applies.

Block(
    L,
    nop,
    mov(eax, Label(-1)+1)
)

In this example the anonymous label will be referenced, but the value one is added to it. So Label(-1)+1 points to the mov instruction, because the nop instruction is only one byte in length.

Do note that Label(-1)+1 could be rewritten as L-1+1, but please don't do that, we don't want to torture python.
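To illustrate how the last two tweaks can work in Python, here is a minimal sketch. It is not the pyasm2 source: the LabelMeta metaclass and the offset attribute are assumptions made up for this example, and the snippet uses Python 3 metaclass syntax. Arithmetic on the class itself is handled by the metaclass and yields a relative index, whereas arithmetic on an instance adjusts an offset from the label.

```python
class LabelMeta(type):
    """Hypothetical metaclass: makes arithmetic on the Label class itself possible."""

    def __sub__(cls, index):
        return cls(-index)  # L-1 becomes Label(-1)

    def __add__(cls, index):
        return cls(index)   # L+1 becomes Label(1)


class Label(metaclass=LabelMeta):
    def __init__(self, index=None, offset=0):
        self.index = index    # relative index of the anonymous label
        self.offset = offset  # value added to the address of the label

    def __add__(self, value):
        return Label(self.index, self.offset + value)

    def __sub__(self, value):
        return Label(self.index, self.offset - value)


L = Label

a = L - 1          # handled by the metaclass: relative index -1
b = Label(-1) + 1  # handled by the instance: offset +1
c = L - 1 + 1      # both in a row, equivalent to b
```

Under these assumptions L-1+1 is first turned into an instance by the metaclass, after which the instance handles the +1 as an offset.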

Blocks part two

Now that we've seen how pyasm2 handles labels, it's time for some more in-depth information about blocks.

Instruction classobj instead of instance

Any instruction that does not take any operands (e.g. retn, stosb, sysenter, etc.) can be used directly in a block without actually making an instance. For example, pyasm2 treats the following two snippets as equivalent.

Block(
    mov(eax, 0),
    retn()
)

Block(
    mov(eax, 0),
    retn
)
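One way to accept both forms is to instantiate any instruction class found in a block. The following is a minimal sketch of that idea; the Instruction base class and the normalize helper are assumptions for illustration and are not taken from the pyasm2 source.

```python
import inspect


class Instruction(object):
    """Hypothetical base class for all instructions."""

    def __init__(self, *operands):
        self.operands = operands


class retn(Instruction):
    pass


def normalize(item):
    """Turn an instruction class into an instance, leave anything else alone."""
    if inspect.isclass(item) and issubclass(item, Instruction):
        return item()  # only sensible for instructions without operands
    return item


from_class = normalize(retn)       # retn   -> a new retn instance
from_instance = normalize(retn())  # retn() -> the very same instance
```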

Combining Blocks

One can combine multiple blocks by adding one to the other. Combining blocks is actually just merging them, i.e. one block is appended to the other block.

a = Block(
    mov(eax, ebx),
    mov(ebx, 42)
)
b = Block(
    mov(ecx, edx)
)
print repr(a + b)
# Block(mov(eax, ebx), mov(ebx, 42), mov(ecx, edx))

Temporary Blocks as Lists

Temporary blocks, those that you only use to add to other blocks, can be written as lists (or tuples, for that matter.)

a = Block(
    mov(eax, ebx),
    mov(ebx, 42)
)
print repr(a + [xor(ecx, ecx), retn])
# Block(mov(eax, ebx), mov(ebx, 42), xor(ecx, ecx), retn)

This does, however, not work if you want to call repr or str on the temporary block itself, because it is still a plain list. In that particular case, wrap the list in a Block first.

a = [xor(eax, eax), retn]
print repr(Block(a))
# Block(xor(eax, eax), retn)

Combining Instructions Directly

Instead of writing e.g. Block(mov(eax, ebx), mov(ebx, 42)), pyasm2 offers a shorthand.

a = mov(eax, ebx) + mov(ebx, 42)
print repr(a)
# Block(mov(eax, ebx), mov(ebx, 42))
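The combining rules from the last three sections map naturally onto Python's __add__ method. Below is a minimal sketch of how this could be done. The Block and Instruction classes here are simplified stand-ins written for this example (an instruction is just a piece of text), so this shows the technique rather than the pyasm2 internals.

```python
class Block(object):
    def __init__(self, *items):
        self.items = []
        for item in items:
            if isinstance(item, (list, tuple)):
                self.items.extend(item)  # a list acts as a temporary block
            else:
                self.items.append(item)

    def __add__(self, other):
        if isinstance(other, Block):
            other = other.items          # merge the other block
        elif not isinstance(other, (list, tuple)):
            other = [other]              # a single instruction
        return Block(*(self.items + list(other)))

    def __repr__(self):
        return 'Block(%s)' % ', '.join(repr(item) for item in self.items)


class Instruction(object):
    """Hypothetical stand-in that only remembers its textual form."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return Block(self) + other       # instruction + anything gives a block

    def __repr__(self):
        return self.text


a = Block(Instruction('mov(eax, ebx)'), Instruction('mov(ebx, 42)'))
merged = a + [Instruction('xor(ecx, ecx)'), Instruction('retn')]
pair = Instruction('mov(eax, ebx)') + Instruction('mov(ebx, 42)')
```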

Raw Data Sections

Like any assembler, pyasm2 also supports raw data. There are a few supported data types: signed/unsigned 8/16/32/64bit integers, strings and labels (which are 32bit pointers on x86.)

Some examples should suffice as explanation.

a = Block(
    String('abc'),
    Int8(0x64),
    Uint8(0x65),
    Uint16(0x6766),
    Int32(0x6b6a6968)
)
print str(a)
# abcdefghijk
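To see where these bytes come from, the same data can be packed by hand with the struct module from the Python standard library. This is an independent illustration and not how pyasm2 itself is implemented. x86 is little-endian, so the 16bit and 32bit values are stored least significant byte first, e.g. 0x6766 yields the bytes 0x66 ('f') and 0x67 ('g').

```python
import struct

data = b''.join([
    b'abc',                         # String('abc')
    struct.pack('<b', 0x64),        # Int8,   signed 8bit
    struct.pack('<B', 0x65),        # Uint8,  unsigned 8bit
    struct.pack('<H', 0x6766),      # Uint16, unsigned 16bit little-endian
    struct.pack('<i', 0x6b6a6968),  # Int32,  signed 32bit little-endian
])
```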

Raw Data Aliases

Some useful aliases include:

  • S = String
  • i8 = Int8
  • u8 = Uint8
  • etc.

Multiple Items with the same Type

It is perfectly possible to define multiple values of the same type in one simple statement.

a = Uint32(0x11223344, 0x44332211, 0x12345678, 0x87654321)

Blocks part three

Now that we have seen the declaration of raw data using pyasm2, it is time to link code and data sections. For example, in normal executable binaries, it is common to have different so-called sections for code and data. This way the code is separated from the data.

This gives us a problem. The text and data blocks are not necessarily combined when assembling, so the code block cannot work out the addresses of labels in the data block by itself. To get the correct addresses, we assign a base address to the data section, and then pass every label in it, together with its address, to the code section. This way the code section knows where the labels it references are located.

a = Block(
    mov(eax, L('hello')),
    # ... snip ...
)
b = Block(
    L('hello'),
    String('Hello World!\n\x00')
)
b.base_address(0x402000)
a.references(b)
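What happens conceptually can be sketched as follows. The label_addresses function and the tuple-based item format are assumptions made up for this example, they are not part of pyasm2. The addresses of the labels follow from the base address plus the size of all data in front of them, after which the code can embed an address as a 32bit immediate (mov eax, imm32 is encoded as the byte 0xb8 followed by the immediate.)

```python
import struct


def label_addresses(base_address, items):
    """Map each label name to its absolute address.

    items is a sequence of ('label', name) and ('data', raw_bytes) tuples.
    """
    addresses = {}
    offset = 0
    for kind, value in items:
        if kind == 'label':
            addresses[value] = base_address + offset
        else:
            offset += len(value)
    return addresses


data_items = [
    ('label', 'hello'),
    ('data', b'Hello World!\n\x00'),  # 14 bytes
    ('label', 'bye'),
    ('data', b'Bye!\n\x00'),
]

addresses = label_addresses(0x402000, data_items)

# mov eax, <address of hello>, with the address as little-endian immediate
mov_eax_hello = b'\xb8' + struct.pack('<I', addresses['hello'])
```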

pyasm2 Internals

Although most of pyasm2 is fairly straightforward (chaining instructions is not that hard), there is one tricky part: labels.

To start off, the x86 instruction set provides two types of relative jumps: those with an 8bit relative offset, and those with a 32bit relative offset.

Besides that, instructions can refer to other instructions or addresses within a data section, using labels. This means that pyasm2 has to keep track of these references, and magically fix them in the final step.

Relative Offset Size

So a relative jump can point to another instruction, by using a label. This raises the question: does the offset to this instruction fit in an 8bit relative offset, or does it need a 32bit one?

(8bit relative jumps are 2 bytes in length, 32bit ones are 5 bytes for unconditional jumps, and 6 bytes for conditional ones.)
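These lengths follow directly from the encodings. The helper functions below are written for this example only (they are not pyasm2 functions) and encode an unconditional jump and a jz by hand. The relative offset is counted from the end of the jump instruction.

```python
import struct


def encode_jmp(offset, short):
    """Encode an unconditional relative jump."""
    if short:
        return b'\xeb' + struct.pack('<b', offset)  # opcode + 8bit offset
    return b'\xe9' + struct.pack('<i', offset)      # opcode + 32bit offset


def encode_jz(offset, short):
    """Encode a conditional jump (jump if zero)."""
    if short:
        return b'\x74' + struct.pack('<b', offset)
    return b'\x0f\x84' + struct.pack('<i', offset)  # two opcode bytes


# The loop from the Labels section: inc eax is one byte (0x40), and the short
# jump back is two bytes, so the jump has to go back three bytes in total.
loop = b'\x40' + encode_jmp(-3, True)
```

Offsets outside of the range -128 to 127 do not fit in a signed byte, in which case only the 32bit form can be used.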

There are two solutions to this problem, as far as I can tell.

  • Each label keeps a list of instructions pointing to it. When assembling, each of the instructions is updated with the location of the label, so the instructions can assemble the address or relative offset accordingly. From here the instruction can determine if the offset has to be 8bit or 32bit.
  • At first each relative jump is created using a 32bit relative offset. Then, after assembling each instruction, the instructions are enumerated and a check is done whether the relative jumps would fit as jumps with an 8bit relative offset as well. If that is the case, the jump is updated, and all the other instructions are updated as well. This goes on until there are no relative jumps left to tweak, or an iteration limit has been exceeded.

Although the first implementation might be a little better performance-wise, pyasm2 uses the latter implementation, which is much easier to implement.
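The latter approach can be sketched as follows. This is a simplified model and not the pyasm2 source: the relax function and the tuple-based item format are assumptions for this example, in which a program is a list of ('label', name), ('code', size), ('jmp', target) and ('jcc', target) items. Shrinking a jump can only bring other jumps closer to their targets, so a jump that has been made short never has to grow again.

```python
SHORT, LONG_JMP, LONG_JCC = 2, 5, 6


def size_of(item, is_short):
    if item[0] == 'label':
        return 0
    if item[0] == 'code':
        return item[1]
    if is_short:
        return SHORT
    return LONG_JCC if item[0] == 'jcc' else LONG_JMP


def relax(items, limit=100):
    """Return (indices of the jumps that fit in 8bit, total size in bytes)."""
    short = set()  # every jump starts out with a 32bit relative offset
    for _ in range(limit):
        # lay out all items using the current encodings
        labels, starts, address = {}, {}, 0
        for i, item in enumerate(items):
            starts[i] = address
            if item[0] == 'label':
                labels[item[1]] = address
            address += size_of(item, i in short)
        changed = False
        for i, item in enumerate(items):
            if item[0] not in ('jmp', 'jcc') or i in short:
                continue
            target = labels[item[1]]
            if target > starts[i]:
                # forward: shrinking this jump moves its end and the target equally
                offset = target - (starts[i] + size_of(item, False))
            else:
                # backward: measured from the end of the short encoding
                offset = target - (starts[i] + SHORT)
            if -128 <= offset <= 127:
                short.add(i)
                changed = True
        if not changed:
            return short, address
    raise RuntimeError('relaxation limit exceeded')


loop = [('label', 'loop'), ('code', 1), ('jmp', 'loop')]
cascade = [('jmp', 'end'), ('code', 124), ('jcc', 'end'), ('label', 'end')]
```

In the cascade example the first jump does not fit in 8bit at first (its offset is 130), but it does after the conditional jump has been shortened in an earlier pass. This is the reason the check has to be repeated until nothing changes anymore.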