Introduction

This is a description of version 2.0 of the MDX and MDD file formats, used by the MDict dictionary software. The software is not open source, and the file format is not openly specified, so the following description is based on reverse engineering and is likely incomplete and inaccurate in its details.

Most of the information comes from https://bitbucket.org/xwang/mdict-analysis. While xwang mostly focuses on being able to read this unknown format, I have added details that are necessary to also write MDX files.

Concepts

MDX and MDD files are both designed to store an associative array of pairs (keyword, record).

For MDX files, the information stored is typically a dictionary. The keyword and record are both (Unicode) strings, with the keyword being the headword for the dictionary entry, and the record giving a description of that word. An example of an MDX entry could be:

  • keyword: "reverse engineering"
  • record: "noun: a process of analyzing and studying an object or device, in order to understand its inner workings"

MDD files are instead designed to store binary data. Typically, the keyword is a file path, and the record is the contents of that file. As an example, we may have:

  • keyword: "\image.png"
  • record: 0x89 0x50 0x4e 0x47 0x0d 0x0a 0x1a 0x0a...

Typically, an MDD file is associated with an MDX file of the same name (but with extension .mdx instead of .mdd), and contains resources to be included in the text of the MDX entries. For example, an entry of the MDX file might contain the HTML code <img src="/image.png" />, in which case the MDict software will look for the entry "\image.png" in the MDD file.

File structure

The basic file structure is as follows:

| MDX File | |
|---|---|
| header_sect | Header section. See "Header Section" below. |
| keyword_sect | Keyword section. See "Keyword Section" below. |
| record_sect | Record section. See "Record Section" below. |

Header Section

| header_sect | Length | |
|---|---|---|
| length | 4 bytes | Length of header_str, in bytes. Big-endian. |
| header_str | varying | An XML string, encoded in UTF-16LE. See below for details. |
| checksum | 4 bytes | ADLER32 checksum of header_str, stored little-endian. |
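
As a rough illustration of the layout above, the following sketch builds and parses a header section in memory. The function names are hypothetical and not part of any MDict tool; the byte orders follow the table (big-endian length, little-endian checksum).

```python
import struct
import zlib


def build_header_sect(header_xml: str) -> bytes:
    """Serialize header_xml as length + header_str + checksum."""
    header_str = header_xml.encode("utf-16-le")
    length = struct.pack(">I", len(header_str))  # big-endian length in bytes
    checksum = struct.pack("<I", zlib.adler32(header_str) & 0xFFFFFFFF)  # little-endian
    return length + header_str + checksum


def parse_header_sect(data: bytes):
    """Return (header_xml, bytes_consumed) for a header section at the start of data."""
    (length,) = struct.unpack_from(">I", data, 0)
    header_str = data[4:4 + length]
    (checksum,) = struct.unpack_from("<I", data, 4 + length)
    if zlib.adler32(header_str) & 0xFFFFFFFF != checksum:
        raise ValueError("header checksum mismatch")
    return header_str.decode("utf-16-le"), 4 + length + 4


sect = build_header_sect('<Dictionary Title="My dictionary"/>')
xml, consumed = parse_header_sect(sect + b"rest of file")
```

The returned byte count tells a reader where the keyword section would begin.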

The header_str consists of a single XML tag, Dictionary, with various attributes. For MDX files, it looks like this (newlines added for clarity):

<Dictionary 
GeneratedByEngineVersion="2.0" 
RequiredEngineVersion="2.0" 
Encrypted="2" 
Encoding="UTF-8"
Format="Html"
CreationDate="2015-01-01"
Compact="No"
Compat="No"
KeyCaseSensitive="No"
Description="This is a <i>test dictionary</i>."
Title="My dictionary"
DataSourceFormat="106"
StyleSheet=""
RegisterBy="EMail"
RegCode="0102030405060708090A0B0C0D0E0F10"/>

For MDD files, we have instead:

<Library_Data 
GeneratedByEngineVersion="2.0" 
RequiredEngineVersion="2.0" 
Encrypted="2" 
Format=""
CreationDate="2015-01-01"
Compact="No"
Compat="No"
KeyCaseSensitive="No"
Description="This is a <i>test dictionary</i>."
Title="My dictionary"
DataSourceFormat="106"
StyleSheet=""
RegisterBy="EMail"
RegCode="0102030405060708090A0B0C0D0E0F10"/>

The meanings of the attributes are explained below:

| Attribute | Description |
|---|---|
| GeneratedByEngineVersion | The version of the file format. This document describes version 2.0. Version 1.2 also exists. |
| RequiredEngineVersion | Presumably the lowest engine version able to read this file. |
| Encrypted | An integer between 0 and 3 (inclusive). If the lowest bit is set, the first part of the keyword section is encrypted, as described in the section Keyword header encryption. If the second-lowest bit is set, the keyword index is encrypted, using the scheme described in Keyword index encryption. |
| Encoding | Only used for MDX files. The encoding used for text in the document. Possible values are "UTF-8", "UTF-16" (uses little-endian encoding), "GBK", and "Big5". For MDD files, the encoding used for the keywords (file paths) is always UTF-16, and the records consist of binary data. |
| Format | The format of the dictionary entry texts. Possible values include "Html" and "Text". For MDD files, this must be empty. |
| CreationDate | The date the dictionary was created. |
| Compact | If this is "Yes", the dictionary entries are in an MDict-specific compact format, where certain strings are replaced according to the scheme specified in StyleSheet. See the documentation for the official MdxBuilder client for details. |
| Compat | Appears to be a typo for Compact, which certain versions of the official MDict client look for instead of Compact. |
| KeyCaseSensitive | Indicates to the dictionary reader whether keys should be treated as case-sensitive ("Yes") or case-insensitive ("No"). |
| Description | A description of the dictionary, which appears as the ":about" page in the official MDict client. |
| Title | The title of the dictionary. |
| DataSourceFormat | Unknown. |
| StyleSheet | Used in conjunction with the Compact option. See the documentation for the official MdxBuilder client for details. |
| RegisterBy | Either "EMail" or "DeviceID". Only used if the lowest bit of Encrypted is set. Indicates which piece of user-identifying data is used to encrypt the encryption key. See the section Keyword header encryption for details. |
| RegCode | When keyword header encryption is used (see Keyword header encryption), this is one way to deliver the encrypted key. In this case, it is a string consisting of 32 hexadecimal digits. |
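
To make the two Encrypted bits concrete, here is a small sketch that pulls the attributes out of header_str and decodes the flag. Both helper names are hypothetical, the attribute scan is a naive regular expression (it assumes double-quoted values that contain no quote character), and treating a missing Encrypted attribute as 0 is an assumption.

```python
import re


def parse_header_attributes(header_str: str) -> dict:
    """Naive scan for name="value" pairs; not a full XML parser."""
    return dict(re.findall(r'(\w+)="([^"]*)"', header_str))


def encryption_flags(attrs: dict) -> dict:
    """Decode the two bits of the Encrypted attribute."""
    value = int(attrs.get("Encrypted", "0"))  # assumption: absent means 0
    if not 0 <= value <= 3:
        raise ValueError("Encrypted must be between 0 and 3")
    return {
        "keyword_header_encrypted": bool(value & 1),  # lowest bit
        "keyword_index_encrypted": bool(value & 2),   # second-lowest bit
    }


header = ('<Dictionary GeneratedByEngineVersion="2.0" Encrypted="2" '
          'Description="This is a <i>test dictionary</i>." Title="My dictionary"/>')
attrs = parse_header_attributes(header)
flags = encryption_flags(attrs)
```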

Keyword Section

The keyword section contains all the keywords in the dictionary, divided into blocks, as well as information about the sizes of these blocks.

| keyword_sect | Length | |
|---|---|---|
| num_blocks | 8 bytes | Number of items in key_blocks. Big-endian. Possibly encrypted, see below. |
| num_entries | 8 bytes | Total number of keywords. Big-endian. Possibly encrypted, see below. |
| key_index_decomp_len | 8 bytes | Number of bytes in decompressed version of key_index. Big-endian. Possibly encrypted, see below. |
| key_index_comp_len | 8 bytes | Number of bytes in compressed version of key_index (including the comp_type and checksum parts). Big-endian. Possibly encrypted, see below. |
| key_blocks_len | 8 bytes | Total number of bytes taken up by key_blocks. Big-endian. Possibly encrypted, see below. |
| checksum | 4 bytes | ADLER32 checksum of the preceding 40 bytes. If those are encrypted, it is the checksum of the decrypted version. Big-endian. |
| key_index | varying | The keyword index, compressed and possibly encrypted. See below. |
| key_blocks[0] | varying | A compressed block containing keywords. See below. |
| ... | ... | ... |
| key_blocks[num_blocks-1] | varying | ... |
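
The fixed-size start of the keyword section can be read with a single struct call. The sketch below assumes the 40 bytes have already been decrypted if necessary; the KeywordHeader type and parse_keyword_header function are hypothetical names used only for illustration.

```python
import struct
import zlib
from typing import NamedTuple


class KeywordHeader(NamedTuple):
    num_blocks: int
    num_entries: int
    key_index_decomp_len: int
    key_index_comp_len: int
    key_blocks_len: int


def parse_keyword_header(block: bytes) -> KeywordHeader:
    """Parse the 40-byte (decrypted) field block followed by its 4-byte checksum."""
    fields = block[:40]
    (checksum,) = struct.unpack(">I", block[40:44])
    if zlib.adler32(fields) & 0xFFFFFFFF != checksum:
        raise ValueError("keyword header checksum mismatch")
    return KeywordHeader(*struct.unpack(">5Q", fields))  # five big-endian 64-bit integers


# Made-up values, only to exercise the parser.
raw_fields = struct.pack(">5Q", 1, 3, 50, 40, 120)
raw_block = raw_fields + struct.pack(">I", zlib.adler32(raw_fields) & 0xFFFFFFFF)
keyword_header = parse_keyword_header(raw_block)
```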

Keyword header encryption:

If the parameter Encrypted in the header has the lowest bit set (i.e. Encrypted & 1 is nonzero), then the 40-byte block starting at num_blocks (that is, the fields num_blocks through key_blocks_len) is encrypted. The encryption used is Salsa20/8 (Salsa20 with 8 rounds instead of 20). In pseudo-Python:

def encrypt(message, key):
    # Pseudo-code: salsa20_8_init and salsa20_8_encrypt stand in for
    # a Salsa20/8 implementation.
    salsa20_8_init(key_length=128,          # 128-bit key
                   iv_length=64,            # 64-bit IV
                   iv=b"\0\0\0\0\0\0\0\0")  # 64 bits of zeros
    return salsa20_8_encrypt(message, key)

encrypted_block = encrypt(unencrypted_block, key=ripemd128(encryption_key))

Here, encryption_key is the dictionary password specified on creation of the dictionary.

This encryption_key is not distributed directly. Instead it is further encrypted, using a piece of data, user_id, that is specific to the user or the client machine, according to the following scheme:

reg_code = encrypt(ripemd128(encryption_key), ripemd128(user_id))

The string user_id can be either an email address ("example@example.com") that the user enters into their MDict client, or a device ID ("12345678-90AB-CDEF-0123-4567890A") which the MDict client obtains in different ways depending on the platform. Which one is used depends on the attribute RegisterBy in the file header (see "Header Section"). In either case, user_id is an ASCII-encoded string. On certain platforms, the official MDict client seems to default to the DeviceID being the empty string.

The 128-bit reg_code is then distributed to the user. This can be done in two ways:

  • If the MDX file is called dictionary.mdx, the dictionary reader should look for a file called dictionary.key in the same directory, which contains reg_code as a 32-digit hexadecimal string.
  • Otherwise, reg_code can be included in the header of the MDX file, as the attribute RegCode.
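
The two delivery routes could be handled by a reader roughly as follows. This is only a sketch: find_reg_code is a hypothetical helper, header_attrs is assumed to be a dict of already-parsed header attributes, and reading the .key file as ASCII text with surrounding whitespace stripped is an assumption.

```python
from pathlib import Path


def find_reg_code(mdx_path: str, header_attrs: dict) -> bytes:
    """Return the 16-byte reg_code from a sibling .key file or the RegCode attribute."""
    key_file = Path(mdx_path).with_suffix(".key")  # dictionary.mdx -> dictionary.key
    if key_file.is_file():
        hex_str = key_file.read_text(encoding="ascii").strip()
    else:
        hex_str = header_attrs.get("RegCode", "")
    reg_code = bytes.fromhex(hex_str)
    if len(reg_code) != 16:  # 32 hexadecimal digits = 128 bits
        raise ValueError("reg_code must be 32 hexadecimal digits")
    return reg_code


# Falls back to the header attribute because no .key file exists at this path.
example_reg_code = find_reg_code(
    "/nonexistent-dir/dictionary.mdx",
    {"RegCode": "0102030405060708090A0B0C0D0E0F10"},
)
```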

Keyword index

The keyword index lists some basic data about the key blocks. It is compressed (see "Compression"), and possibly encrypted (see "Keyword index encryption"). After decompression and decryption, it looks like this:

| decompress(key_index) | Length | |
|---|---|---|
| num_entries[0] | 8 bytes | Number of keywords in the first keyword block. |
| first_size[0] | 2 bytes | Length of first_word[0], not including trailing null character. In number of "basic units" for the encoding, so e.g. bytes for UTF-8, and 2-byte units for UTF-16. |
| first_word[0] | varying | The first keyword (alphabetically) in the key_blocks[0] keyword block. Encoding given by Encoding attribute in the header. |
| last_size[0] | 2 bytes | Length of last_word[0], not including trailing null character. In number of "basic units" for the encoding, so e.g. bytes for UTF-8, and 2-byte units for UTF-16. |
| last_word[0] | varying | The last keyword (alphabetically) in the key_blocks[0] keyword block. Encoding given by Encoding attribute in the header. |
| comp_size[0] | 8 bytes | Compressed size of key_blocks[0]. |
| decomp_size[0] | 8 bytes | Decompressed size of key_blocks[0]. |
| num_entries[1] | 8 bytes | ... |
| ... | ... | ... |
| decomp_size[num_blocks-1] | 8 bytes | ... |
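
A decompressed (and decrypted) keyword index could be walked as sketched below. The function name is hypothetical. Two things are assumed rather than stated in the table: that the integers are big-endian like the other fields in this document, and that each word is followed by a null terminator occupying one basic unit.

```python
import struct


def parse_key_index(data: bytes, encoding: str) -> list:
    """Return (num_entries, first_word, last_word, comp_size, decomp_size) per key block."""
    is_utf16 = encoding.upper().startswith("UTF-16")
    unit = 2 if is_utf16 else 1                     # size of one "basic unit" in bytes
    codec = "utf-16-le" if is_utf16 else encoding
    entries = []
    pos = 0
    while pos < len(data):
        (num_entries,) = struct.unpack_from(">Q", data, pos)
        pos += 8
        words = []
        for _ in range(2):                          # first_word, then last_word
            (size,) = struct.unpack_from(">H", data, pos)
            pos += 2
            nbytes = size * unit
            words.append(data[pos:pos + nbytes].decode(codec))
            pos += nbytes + unit                    # assumption: skip the trailing null
        comp_size, decomp_size = struct.unpack_from(">QQ", data, pos)
        pos += 16
        entries.append((num_entries, words[0], words[1], comp_size, decomp_size))
    return entries


def _pack_word(word: str) -> bytes:
    raw = word.encode("utf-8")
    return struct.pack(">H", len(raw)) + raw + b"\0"


# Made-up index describing a single key block.
sample_index = (struct.pack(">Q", 2) + _pack_word("apple") + _pack_word("banana")
                + struct.pack(">QQ", 30, 45))
index_entries = parse_key_index(sample_index, "UTF-8")
```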

Keyword index encryption:

If the parameter Encrypted in the header has its second-lowest bit set (i.e. Encrypted & 2 is nonzero), then the keyword index is further encrypted. In this case, the comp_type and checksum fields are left unchanged (refer to the section Compression), and the following C function is used to encrypt the compressed_data part, after compression.

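#include <stddef.h>

/* Swap the high and low 4-bit halves of a byte value. */
#define SWAPNIBBLE(byte) ((((byte)>>4) | ((byte)<<4)) & 0xFF)

/* Encrypts buf in place; each encrypted byte feeds into the next one. */
void encrypt(unsigned char* buf, size_t buflen, unsigned char* key, size_t keylen) {
	unsigned char previous=0x36;
	for(size_t i=0; i < buflen; i++) {
		buf[i] = SWAPNIBBLE(buf[i] ^ ((unsigned char)i) ^ key[i%keylen] ^ previous);
		previous = buf[i];
	}
}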

The encryption key used is ripemd128(checksum + "\x95\x36\x00\x00"), where checksum is the 4-byte checksum field of the compressed key_index block and + denotes string concatenation.
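
A reader needs the inverse of this function. Below is a sketch of a Python port of the encryption routine together with a matching decryption routine derived from it; the names are hypothetical. Because RIPEMD-128 is not in the Python standard library, the round trip uses an arbitrary 16-byte key as a stand-in for the real ripemd128(...) value.

```python
def swap_nibble(byte: int) -> int:
    """Swap the high and low 4-bit halves of a byte."""
    return ((byte >> 4) | (byte << 4)) & 0xFF


def encrypt_key_index(data: bytes, key: bytes) -> bytes:
    out = bytearray(len(data))
    previous = 0x36
    for i, byte in enumerate(data):
        out[i] = swap_nibble(byte ^ (i & 0xFF) ^ key[i % len(key)] ^ previous)
        previous = out[i]       # chain on the encrypted byte
    return bytes(out)


def decrypt_key_index(data: bytes, key: bytes) -> bytes:
    out = bytearray(len(data))
    previous = 0x36
    for i, byte in enumerate(data):
        # Nibble swapping is its own inverse, so undo it first, then undo the XORs.
        out[i] = swap_nibble(byte) ^ (i & 0xFF) ^ key[i % len(key)] ^ previous
        previous = byte         # chain on the encrypted byte, as in encryption
    return bytes(out)


stand_in_key = bytes(range(16))             # placeholder, NOT a real ripemd128 digest
plaintext = bytes(range(256)) + b"keyword index payload"
ciphertext = encrypt_key_index(plaintext, stand_in_key)
recovered = decrypt_key_index(ciphertext, stand_in_key)
```

The `i & 0xFF` mirrors the `(unsigned char)i` cast in the C version, which matters once the buffer is longer than 256 bytes.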

Keyword blocks

Each keyword block is compressed (see "Compression"). After decompressing, it looks like this:

| decompress(key_blocks[0]) | Length | |
|---|---|---|
| offset[0] | 8 bytes | Offset where the record corresponding to key[0] can be found, see below. Big-endian. |
| key[0] | varying | The first keyword in the dictionary, null-terminated and encoded using Encoding. |
| offset[1] | 8 bytes | ... |
| key[1] | varying | ... |
| ... | ... | ... |

The offset should be interpreted as follows: Decompress all record blocks, and concatenate them together, and let records denote the resulting array of bytes. The record corresponding to key[i] then starts at records[offset[i]].
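
A decompressed keyword block can be split into (offset, key) pairs as in the sketch below. The function name is hypothetical, and the null terminator is assumed to be one basic unit wide (one byte for UTF-8, GBK and Big5, two bytes for UTF-16).

```python
import struct


def parse_key_block(block: bytes, encoding: str) -> list:
    """Return a list of (offset, key) pairs from one decompressed keyword block."""
    is_utf16 = encoding.upper().startswith("UTF-16")
    unit = 2 if is_utf16 else 1
    codec = "utf-16-le" if is_utf16 else encoding
    terminator = b"\0" * unit
    entries = []
    pos = 0
    while pos < len(block):
        (offset,) = struct.unpack_from(">Q", block, pos)  # big-endian record offset
        pos += 8
        end = pos
        # Step one basic unit at a time so UTF-16 terminators stay aligned.
        while block[end:end + unit] != terminator:
            if end >= len(block):
                raise ValueError("unterminated keyword")
            end += unit
        entries.append((offset, block[pos:end].decode(codec)))
        pos = end + unit
    return entries


sample_block = (struct.pack(">Q", 0) + b"apple\0"
                + struct.pack(">Q", 6) + b"banana\0")
key_entries = parse_key_block(sample_block, "UTF-8")
```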

Record Section

The record section looks like this:

| record_sect | Length | |
|---|---|---|
| num_blocks | 8 bytes | Number of rec_block items. Does not need to equal the number of keyword blocks. Big-endian. |
| num_entries | 8 bytes | Total number of records in dictionary. Should be equal to keyword_sect.num_entries. Big-endian. |
| index_len | 8 bytes | Total size of the comp_size[i] and decomp_size[i] variables, in bytes. In other words, should equal 16 times num_blocks. Big-endian. |
| blocks_len | 8 bytes | Total size of the rec_block[i] sections, in bytes. Big-endian. |
| comp_size[0] | 8 bytes | Length of rec_block[0], in bytes. Big-endian. |
| decomp_size[0] | 8 bytes | Decompressed size of rec_block[0], in bytes. Big-endian. |
| comp_size[1] | 8 bytes | Length of rec_block[1], in bytes. Big-endian. |
| ... | ... | ... |
| decomp_size[num_blocks-1] | 8 bytes | ... |
| rec_block[0] | varying | A compressed block containing records. See below. |
| ... | ... | ... |
| rec_block[num_blocks-1] | varying | ... |
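
The record section layout can be read in two passes: first the size index, then the blocks themselves. The sketch below uses a hypothetical function name and made-up block contents; it only splits the section and leaves decompression to a separate step.

```python
import struct


def parse_record_section(data: bytes):
    """Return (num_entries, [(comp_size, decomp_size), ...], [raw rec_block bytes, ...])."""
    num_blocks, num_entries, index_len, blocks_len = struct.unpack_from(">4Q", data, 0)
    if index_len != 16 * num_blocks:
        raise ValueError("index_len should be 16 * num_blocks")
    sizes = [struct.unpack_from(">QQ", data, 32 + 16 * i) for i in range(num_blocks)]
    if sum(comp for comp, _ in sizes) != blocks_len:
        raise ValueError("blocks_len does not match the sum of comp_size values")
    pos = 32 + index_len
    blocks = []
    for comp_size, _ in sizes:
        blocks.append(data[pos:pos + comp_size])  # still compressed at this point
        pos += comp_size
    return num_entries, sizes, blocks


# Two fake "compressed" blocks of 4 and 6 bytes.
sample_section = (struct.pack(">4Q", 2, 5, 32, 10)
                  + struct.pack(">QQ", 4, 10) + struct.pack(">QQ", 6, 20)
                  + b"AAAA" + b"BBBBBB")
rec_entries, rec_sizes, rec_blocks = parse_record_section(sample_section)
```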

Record block

Each record block is compressed (see "Compression"). After decompressing, it looks like this:

| decompress(rec_block[0]) | Length | |
|---|---|---|
| record[0] | varying | The first record. If in an MDX file, this is null-terminated and encoded using Encoding. |
| record[1] | varying | ... |
| ... | ... | ... |
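
Combining the keyword offsets with the decompressed record blocks, individual records could be extracted as sketched here. The function name is hypothetical. The format description only gives the start of each record, so this sketch assumes the offsets are in ascending order and that each record ends where the next one begins (or at the end of the data for the last record).

```python
def split_records(decompressed_blocks: list, offsets: list) -> list:
    """Slice the concatenated record data at the given keyword offsets."""
    records = b"".join(decompressed_blocks)   # records may span block boundaries
    bounds = list(offsets) + [len(records)]   # assumption: next offset marks the end
    return [records[bounds[i]:bounds[i + 1]] for i in range(len(offsets))]


# The second record straddles the boundary between the two blocks.
sample_records = split_records([b"first\0sec", b"ond\0"], [0, 6])
```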

Compression:

Various data blocks are compressed using the same scheme. They all look like this:

| compress(data) | Length | |
|---|---|---|
| comp_type | 4 bytes | Compression type. See below. |
| checksum | 4 bytes | ADLER32 checksum of the uncompressed data. Big-endian. |
| compressed_data | varying | Compressed version of data. |

The compression type is indicated by comp_type. There are three options:

  • If comp_type is '\x00\x00\x00\x00', then no compression is applied at all, and compressed_data is equal to data.
  • If comp_type is '\x01\x00\x00\x00', LZO compression is used.
  • If comp_type is '\x02\x00\x00\x00', zlib compression is used. It so happens that the zlib compression format appends an ADLER32 checksum, so in this case, checksum will be equal to the last four bytes of compressed_data.
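
The block layout and the three compression types can be sketched with the standard library alone, except for LZO, which has no standard-library implementation and is left unimplemented here. The function names are hypothetical.

```python
import struct
import zlib


def compress_block(data: bytes, comp_type: int = 2) -> bytes:
    """Wrap data as comp_type + checksum + compressed_data."""
    checksum = struct.pack(">I", zlib.adler32(data) & 0xFFFFFFFF)  # big-endian
    if comp_type == 0:
        body = data
    elif comp_type == 2:
        body = zlib.compress(data)
    else:
        raise NotImplementedError("LZO (type 1) needs a third-party library")
    # comp_type is written as '\x02\x00\x00\x00', i.e. little-endian.
    return struct.pack("<I", comp_type) + checksum + body


def decompress_block(block: bytes) -> bytes:
    """Inverse of compress_block, verifying the ADLER32 checksum."""
    (comp_type,) = struct.unpack("<I", block[:4])
    (checksum,) = struct.unpack(">I", block[4:8])
    body = block[8:]
    if comp_type == 0:
        data = body
    elif comp_type == 2:
        data = zlib.decompress(body)
    else:
        raise NotImplementedError("LZO (type 1) needs a third-party library")
    if zlib.adler32(data) & 0xFFFFFFFF != checksum:
        raise ValueError("block checksum mismatch")
    return data


payload = b"hello " * 20
zlib_block = compress_block(payload, 2)
stored_block = compress_block(payload, 0)
```

For a zlib block, `zlib_block[-4:]` equals `zlib_block[4:8]`, which matches the observation above about the trailing ADLER32 checksum.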