This is a description of version 2.0 of the MDX and MDD file format, used by the MDict dictionary software. The software is not open-source, nor is the file format openly specified, so the following description is based on reverse-engineering, and is likely incomplete and inaccurate in its details.
Most of the information comes from https://bitbucket.org/xwang/mdict-analysis. While xwang mostly focuses on being able to read this unknown format, I have added details that are necessary to also write MDX files.
MDX and MDD files are both designed to store an associative array of pairs (keyword, record).
For MDX files, the information stored is typically a dictionary. The keyword and record are both (Unicode) strings, with the keyword being the headword for the dictionary entry, and the record giving a description of that word. An example of an MDX entry could be:
- keyword: "reverse engineering"
- record: "noun: a process of analyzing and studying an object or device, in order to understand its inner workings"
MDD files are instead designed to store binary data. Typically, the keyword is a file path, and the record is the contents of that file. As an example, we may have:
- keyword: "\image.png"
- record: 0x89 0x50 0x4e 0x47 0x0d 0x0a 0x1a 0x0a...
Typically, MDD files are associated with an MDX file of the same name (but with extension .mdx instead of .mdd), and contain resources to be included in the text of the MDX file. For example, an entry of the MDX file might contain the HTML code `<img src="/images/image.png" />`, in which case the MDict software will look for the entry "\image.png" in the MDD file.
The basic file structure is as follows:
MDX File | |
---|---|
`header_sect` | Header section. See "Header Section" below. |
`keyword_sect` | Keyword section. See "Keyword Section" below. |
`record_sect` | Record section. See "Record Section" below. |
`header_sect` | Length | |
---|---|---|
`length` | 4 bytes | Length of `header_str`, in bytes. Big-endian. |
`header_str` | varying | An XML string, encoded in UTF-16LE. See below for details. |
`checksum` | 4 bytes | ADLER32 checksum of `header_str`, stored little-endian. |
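The layout in the table can be exercised with a small round trip. This is a minimal sketch, assuming only what the table states (big-endian length, UTF-16LE text, little-endian ADLER32); the function names `build_header_sect` and `parse_header_sect` are made up for illustration.

```python
import struct
import zlib


def build_header_sect(header_xml: str) -> bytes:
    """Serialize an XML header string into header_sect bytes."""
    header_str = header_xml.encode("utf-16-le")
    length = struct.pack(">I", len(header_str))  # big-endian length
    # The checksum is the one little-endian field in this section.
    checksum = struct.pack("<I", zlib.adler32(header_str) & 0xFFFFFFFF)
    return length + header_str + checksum


def parse_header_sect(data: bytes) -> tuple[str, int]:
    """Return (header XML text, total size of header_sect in bytes)."""
    (length,) = struct.unpack_from(">I", data, 0)
    header_str = data[4:4 + length]
    (checksum,) = struct.unpack_from("<I", data, 4 + length)
    if checksum != zlib.adler32(header_str) & 0xFFFFFFFF:
        raise ValueError("header checksum mismatch")
    return header_str.decode("utf-16-le"), 4 + length + 4
```

The second return value gives the file offset at which `keyword_sect` begins.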
The `header_str` consists of a single XML tag, `Dictionary`, with various attributes. For MDX files, it looks like this (newlines added for clarity):
<Dictionary
GeneratedByEngineVersion="2.0"
RequiredEngineVersion="2.0"
Encrypted="2"
Encoding="UTF8"
Format="Html"
CreationDate="2015-01-01"
Compact="No"
Compat="No"
KeyCaseSensitive="No"
Description="This is a <i>test dictionary</i>."
Title="My dictionary"
DataSourceFormat="106"
StyleSheet=""
RegisterBy="Email"
RegCode="0102030405060708090A0B0C0D0E0F"/>
For MDD files, we have instead:
<Library_Data
GeneratedByEngineVersion="2.0"
RequiredEngineVersion="2.0"
Encrypted="2"
Format=""
CreationDate="2015-01-01"
Compact="No"
Compat="No"
KeyCaseSensitive="No"
Description="This is a <i>test dictionary</i>."
Title="My dictionary"
DataSourceFormat="106"
StyleSheet=""
RegisterBy="Email"
RegCode="0102030405060708090A0B0C0D0E0F"/>
The meanings of the attributes are explained below:
Attribute | Description |
---|---|
`GeneratedByEngineVersion` | The version of the file format. This document describes version 2.0. Apart from this, version 1.2 is also possible. |
`RequiredEngineVersion` | Presumably the lowest format version compatible with this version. |
`Encrypted` | An integer between 0 and 3 (inclusive). If the lower bit is set, the first part of the keyword section is encrypted, as described in the section Keyword header encryption. If the upper bit is set, the keyword index is encrypted, using the scheme described in Keyword index encryption. |
`Encoding` | Only used for MDX files. The encoding used for text in the document. Possible values are "UTF-8", "UTF-16" (uses little-endian encoding), "GBK", and "Big5". For MDD files, the encoding used for the keywords (file paths) is always UTF-16, and the records consist of binary data. |
`Format` | The format of the dictionary entry texts. Possible values include "Html" and "Text". For MDD files, this must be empty. |
`CreationDate` | The date the dictionary was created. |
`Compact` | If this is "Yes", the dictionary entries are in an MDict-specific compact format, where certain strings are replaced according to the scheme specified in `StyleSheet`. See the documentation for the official MdxBuilder client for details. |
`Compat` | Appears to be a typo for `Compact`, which certain versions of the official MDict client look for instead of `Compact`. |
`KeyCaseSensitive` | Indicates to the dictionary reader whether or not keys should be treated in a case-sensitive manner. |
`Description` | A description of the dictionary, which appears as the ":about" page in the official MDict client. |
`Title` | The title of the dictionary. |
`DataSourceFormat` | Unknown. |
`StyleSheet` | Used in conjunction with the `Compact` option. See the documentation for the official MdxBuilder client for details. |
`RegisterBy` | Either "EMail" or "DeviceID". Only used if the lower bit of `Encrypted` is set. Indicates which piece of user-identifying data is used to encrypt the encryption key. See the section Keyword header encryption for details. |
`RegCode` | When keyword header encryption is used (see Keyword header encryption), this is one way to deliver the encrypted key. In this case, this is a string consisting of 32 hexadecimal digits. |
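As a rough illustration of how a reader might consume these attributes, the sketch below pulls the name/value pairs out of the header text and decodes the two `Encrypted` bits. It uses a regular expression rather than an XML parser because the sample `Description` contains a raw `<i>` inside an attribute value, which strict XML parsers reject. The function names are hypothetical, and the code assumes `Encrypted` is an integer as described in the table.

```python
import re


def parse_header_attributes(header_xml: str) -> dict[str, str]:
    """Collect name="value" pairs from the single header tag."""
    return dict(re.findall(r'(\w+)="(.*?)"', header_xml, re.DOTALL))


def encryption_flags(attributes: dict[str, str]) -> tuple[bool, bool]:
    """Return (keyword header encrypted, keyword index encrypted)."""
    value = int(attributes.get("Encrypted", "0"))
    # Lower bit: keyword header encryption. Upper bit: keyword index encryption.
    return bool(value & 1), bool(value & 2)
```

With `Encrypted="2"`, as in the sample headers, only the keyword index is encrypted.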
The keyword section contains all the keywords in the dictionary, divided into blocks, as well as information about the sizes of these blocks.
`keyword_sect` | Length | |
---|---|---|
`num_blocks` | 8 bytes | Number of items in `key_blocks`. Big-endian. Possibly encrypted, see below. |
`num_entries` | 8 bytes | Total number of keywords. Big-endian. Possibly encrypted, see below. |
`key_index_decomp_len` | 8 bytes | Number of bytes in the decompressed version of `key_index`. Big-endian. Possibly encrypted, see below. |
`key_index_comp_len` | 8 bytes | Number of bytes in the compressed version of `key_index` (including the `comp_type` and `checksum` parts). Big-endian. Possibly encrypted, see below. |
`key_blocks_len` | 8 bytes | Total number of bytes taken up by `key_blocks`. Big-endian. Possibly encrypted, see below. |
`checksum` | 4 bytes | ADLER32 checksum of the preceding 40 bytes. If those are encrypted, it is the checksum of the decrypted version. Big-endian. |
`key_index` | varying | The keyword index, compressed and possibly encrypted. See below. |
`key_blocks[0]` | varying | A compressed block containing keywords. See below. |
... | ... | ... |
`key_blocks[num_blocks-1]` | varying | ... |
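The fixed-size start of the keyword section can be read as five big-endian 64-bit integers followed by a checksum. The following is a minimal sketch under that reading; `KeywordHeader` and `parse_keyword_header` are illustrative names, and the input is assumed to be already decrypted if header encryption is in use.

```python
import struct
import zlib
from typing import NamedTuple


class KeywordHeader(NamedTuple):
    num_blocks: int
    num_entries: int
    key_index_decomp_len: int
    key_index_comp_len: int
    key_blocks_len: int


def parse_keyword_header(data: bytes) -> KeywordHeader:
    """Parse the first 44 bytes of keyword_sect (40 bytes of fields + checksum)."""
    fields = data[:40]
    (checksum,) = struct.unpack_from(">I", data, 40)
    # The checksum covers the decrypted form of the 40 field bytes.
    if checksum != zlib.adler32(fields) & 0xFFFFFFFF:
        raise ValueError("keyword header checksum mismatch")
    return KeywordHeader(*struct.unpack(">5Q", fields))
```

Given this header, `key_index` occupies the next `key_index_comp_len` bytes (starting at offset 44 within the section), and the key blocks follow for `key_blocks_len` bytes.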
If the parameter `Encrypted` in the header has the lowest bit set (i.e. `Encrypted & 1` is nonzero), then the 40-byte block from `num_blocks` through `key_blocks_len` is encrypted. The encryption used is Salsa20/8 (Salsa20 with 8 rounds instead of 20). In pseudo-Python:
# salsa20_8_init, salsa20_8_encrypt and ripemd128 stand in for library routines.
def encrypt(message, key):
    salsa20_8_init(key_length=128,                 # 128 bits
                   iv_length=64,                   # 64 bits
                   ivs=b"\0\0\0\0\0\0\0\0")        # 64 bits of zeros
    return salsa20_8_encrypt(message, key)

encrypted_block = encrypt(unencrypted_block, key=ripemd128(encryption_key))
Here, `encryption_key` is the dictionary password specified on creation of the dictionary.
This `encryption_key` is not distributed directly. Instead, its RIPEMD-128 hash is further encrypted, using a piece of data, `user_id`, that is specific to the user or the client machine, according to the following scheme:
reg_code = encrypt(ripemd128(encryption_key), ripemd128(user_id))
The string `user_id` can be either an email address ("user@example.com") that the user enters into their MDict client, or a device ID ("12345678-90AB-CDEF-0123-4567890A"), which the MDict client obtains in different ways depending on the platform. The choice of which one to use depends on the attribute `RegisterBy` in the file header (see Header section). In either case, `user_id` is an ASCII-encoded string. On certain platforms, the official MDict client seems to default to the DeviceID being the empty string.
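For readers who want something concrete behind the `salsa20_8_encrypt` placeholder, here is a minimal pure-Python sketch of Salsa20 reduced to 8 rounds, with a 128-bit key and an all-zero 64-bit nonce as described above. It follows the published Salsa20 structure but has not been checked against official test vectors or against files produced by MDict, so treat it as an illustration only. The RIPEMD-128 step is left out; the 16-byte `key` is assumed to have been computed elsewhere.

```python
import struct

MASK = 0xFFFFFFFF


def _rotl(value: int, shift: int) -> int:
    return ((value << shift) & MASK) | (value >> (32 - shift))


def _quarter_round(x: list, a: int, b: int, c: int, d: int) -> None:
    x[b] ^= _rotl((x[a] + x[d]) & MASK, 7)
    x[c] ^= _rotl((x[b] + x[a]) & MASK, 9)
    x[d] ^= _rotl((x[c] + x[b]) & MASK, 13)
    x[a] ^= _rotl((x[d] + x[c]) & MASK, 18)


def _salsa20_8_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    """Produce one 64-byte keystream block for a 16-byte key."""
    k = struct.unpack("<4I", key)
    c = struct.unpack("<4I", b"expand 16-byte k")  # constant for 128-bit keys
    n = struct.unpack("<2I", nonce)
    state = [c[0], k[0], k[1], k[2], k[3], c[1], n[0], n[1],
             counter & MASK, counter >> 32, c[2], k[0], k[1], k[2], k[3], c[3]]
    x = list(state)
    for _ in range(4):  # 4 double rounds = 8 rounds
        _quarter_round(x, 0, 4, 8, 12)    # column round
        _quarter_round(x, 5, 9, 13, 1)
        _quarter_round(x, 10, 14, 2, 6)
        _quarter_round(x, 15, 3, 7, 11)
        _quarter_round(x, 0, 1, 2, 3)     # row round
        _quarter_round(x, 5, 6, 7, 4)
        _quarter_round(x, 10, 11, 8, 9)
        _quarter_round(x, 15, 12, 13, 14)
    return struct.pack("<16I", *((a + b) & MASK for a, b in zip(x, state)))


def salsa20_8_encrypt(message: bytes, key: bytes, nonce: bytes = b"\0" * 8) -> bytes:
    """XOR message with the keystream; decryption is the same operation."""
    out = bytearray()
    for counter, start in enumerate(range(0, len(message), 64)):
        block = _salsa20_8_block(key, nonce, counter)
        out.extend(m ^ s for m, s in zip(message[start:start + 64], block))
    return bytes(out)
```

Because Salsa20 is a stream cipher, applying `salsa20_8_encrypt` twice with the same key returns the original bytes, so the same function serves for decrypting the 40-byte keyword header.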
The 128-bit `reg_code` is then distributed to the user. This can be done in two ways:
- If the MDX file is called `dictionary.mdx`, the dictionary reader should look for a file called `dictionary.key` in the same directory, which contains `reg_code` as a 32-digit hexadecimal string.
- Otherwise, `reg_code` can be included in the header of the MDX file, as the attribute `RegCode`.
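The two delivery options could be handled by a reader roughly as follows. This is a sketch, not the official client's logic: the helper name `find_reg_code` is invented, the header attributes are assumed to be available as a plain dict, and checking the `.key` file before the header is an assumption about precedence.

```python
from pathlib import Path
from typing import Optional


def find_reg_code(mdx_path: str, header_attributes: dict) -> Optional[bytes]:
    """Return the 16-byte reg_code, or None if neither source provides one."""
    key_file = Path(mdx_path).with_suffix(".key")  # dictionary.mdx -> dictionary.key
    if key_file.is_file():
        hex_digits = key_file.read_text(encoding="ascii").strip()
    else:
        hex_digits = header_attributes.get("RegCode", "")
    if len(hex_digits) != 32:
        return None
    return bytes.fromhex(hex_digits)
```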
The keyword index lists some basic data about the key blocks. It is compressed (see "Compression"), and possibly encrypted (see "Keyword index encryption"). After decompression and decryption, it looks like this:
`decompress(key_index)` | Length | |
---|---|---|
`num_entries[0]` | 8 bytes | Number of keywords in the first keyword block. |
`first_size[0]` | 2 bytes | Length of `first_word[0]`, not including the trailing null character. Measured in "basic units" of the encoding, e.g. bytes for UTF-8 and 2-byte units for UTF-16. |
`first_word[0]` | varying | The first keyword (alphabetically) in the `key_blocks[0]` keyword block. Encoding given by the `Encoding` attribute in the header. |
`last_size[0]` | 2 bytes | Length of `last_word[0]`, not including the trailing null character. Measured in "basic units" of the encoding, e.g. bytes for UTF-8 and 2-byte units for UTF-16. |
`last_word[0]` | varying | The last keyword (alphabetically) in the `key_blocks[0]` keyword block. Encoding given by the `Encoding` attribute in the header. |
`comp_size[0]` | 8 bytes | Compressed size of `key_blocks[0]`. |
`decomp_size[0]` | 8 bytes | Decompressed size of `key_blocks[0]`. |
`num_entries[1]` | 8 bytes | ... |
... | ... | ... |
`decomp_size[num_blocks-1]` | 8 bytes | ... |
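A possible way to walk this structure is sketched below. The table does not state the byte order of the integer fields; the code assumes big-endian, as in the rest of the format. `parse_key_index` is a hypothetical name, and `encoding` is expected to be a Python codec name such as "utf-8" or "utf-16-le".

```python
import struct


def parse_key_index(data: bytes, encoding: str = "utf-8") -> list[dict]:
    """Parse a decompressed, decrypted key_index into one dict per key block."""
    unit = 2 if encoding.lower().startswith("utf-16") else 1  # bytes per basic unit
    blocks, pos = [], 0

    def read_word(pos: int) -> tuple[str, int]:
        (size,) = struct.unpack_from(">H", data, pos)
        pos += 2
        word = data[pos:pos + size * unit].decode(encoding)
        return word, pos + (size + 1) * unit  # also skip the trailing null

    while pos < len(data):
        (num_entries,) = struct.unpack_from(">Q", data, pos)
        first_word, pos = read_word(pos + 8)
        last_word, pos = read_word(pos)
        comp_size, decomp_size = struct.unpack_from(">QQ", data, pos)
        pos += 16
        blocks.append({"num_entries": num_entries, "first_word": first_word,
                       "last_word": last_word, "comp_size": comp_size,
                       "decomp_size": decomp_size})
    return blocks
```

The `comp_size` values tell a reader where each key block starts within the `key_blocks` area, and `first_word`/`last_word` allow a lookup to skip blocks that cannot contain the requested keyword.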
If the parameter `Encrypted` in the header has its second-lowest bit set (i.e. `Encrypted & 2` is nonzero), then the keyword index is further encrypted. In this case, the `comp_type` and `checksum` fields are left unchanged (refer to the section Compression), and the following C function is used to encrypt the `compressed_data` part, after compression.
#include <stddef.h>

#define SWAPNIBBLE(byte) ((unsigned char)((((byte) >> 4) | ((byte) << 4)) & 0xFF))

/* Encrypts buf in place. Each output byte is chained to the previous output byte. */
void encrypt(unsigned char* buf, size_t buflen, unsigned char* key, size_t keylen) {
    unsigned char previous = 0x36;
    for (size_t i = 0; i < buflen; i++) {
        buf[i] = SWAPNIBBLE(buf[i] ^ ((unsigned char)i) ^ key[i % keylen] ^ previous);
        previous = buf[i];
    }
}
The encryption key used is `ripemd128(checksum + "\x95\x36\x00\x00")`, where + denotes string concatenation.
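A reader needs the inverse of the C routine above. The sketch below ports the encryption to Python and adds a matching decryption; the function names are illustrative, and `key` is assumed to be the 16-byte RIPEMD-128 digest computed elsewhere.

```python
def _swap_nibbles(byte: int) -> int:
    return ((byte >> 4) | (byte << 4)) & 0xFF


def encrypt_key_index(data: bytes, key: bytes) -> bytes:
    """Python port of the C encrypt() routine."""
    out = bytearray(data)
    previous = 0x36
    for i in range(len(out)):
        out[i] = _swap_nibbles(out[i] ^ (i & 0xFF) ^ key[i % len(key)] ^ previous)
        previous = out[i]  # chain on the ciphertext byte
    return bytes(out)


def decrypt_key_index(data: bytes, key: bytes) -> bytes:
    """Inverse of encrypt_key_index."""
    out = bytearray(data)
    previous = 0x36
    for i in range(len(out)):
        cipher_byte = out[i]
        out[i] = _swap_nibbles(cipher_byte) ^ (i & 0xFF) ^ key[i % len(key)] ^ previous
        previous = cipher_byte
    return bytes(out)
```

Both functions operate only on the `compressed_data` part, that is, on the bytes after the 8-byte `comp_type` and `checksum` prefix of `key_index`.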
Each keyword block is compressed (see "Compression"). After decompression, a block looks like this:
`decompress(key_blocks[0])` | Length | |
---|---|---|
`offset[0]` | 8 bytes | Offset where the record corresponding to `key[0]` can be found, see below. Big-endian. |
`key[0]` | varying | The first keyword in the dictionary, null-terminated and encoded using `Encoding`. |
`offset[1]` | 8 bytes | ... |
`key[1]` | varying | ... |
... | ... | ... |
The offset should be interpreted as follows: decompress all record blocks and concatenate them, and let `records` denote the resulting array of bytes. The record corresponding to `key[i]` then starts at `records[offset[i]]`.
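The key block layout and the offset rule can be sketched together. The names `parse_key_block` and `record_for` are invented for this example. The text only says where a record starts; the sketch assumes it ends where the next record starts (or at the end of `records` for the last one), and that `entries` holds the pairs from all key blocks in file order.

```python
import struct


def parse_key_block(data: bytes, encoding: str = "utf-8") -> list[tuple[int, str]]:
    """Split a decompressed key block into (offset, keyword) pairs."""
    unit = 2 if encoding.lower().startswith("utf-16") else 1
    terminator = b"\0" * unit
    entries, pos = [], 0
    while pos < len(data):
        (offset,) = struct.unpack_from(">Q", data, pos)
        pos += 8
        end = pos
        # Scan unit by unit so a zero byte inside a UTF-16 code unit
        # is not mistaken for the terminator.
        while end < len(data) and data[end:end + unit] != terminator:
            end += unit
        entries.append((offset, data[pos:end].decode(encoding)))
        pos = end + unit
    return entries


def record_for(index: int, entries: list, records: bytes) -> bytes:
    """Slice one record out of the concatenated, decompressed record blocks."""
    start = entries[index][0]
    end = entries[index + 1][0] if index + 1 < len(entries) else len(records)
    return records[start:end]
```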
The record section looks like this:
`record_sect` | Length | |
---|---|---|
`num_blocks` | 8 bytes | Number of items in `rec_block`. Does not need to equal the number of keyword blocks. Big-endian. |
`num_entries` | 8 bytes | Total number of records in the dictionary. Should be equal to `keyword_sect.num_entries`. Big-endian. |
`index_len` | 8 bytes | Total size of the `comp_size[i]` and `decomp_size[i]` variables, in bytes. In other words, should equal 16 times `num_blocks`. Big-endian. |
`blocks_len` | 8 bytes | Total size of the `rec_block[i]` sections, in bytes. Big-endian. |
`comp_size[0]` | 8 bytes | Length of `rec_block[0]`, in bytes. Big-endian. |
`decomp_size[0]` | 8 bytes | Decompressed size of `rec_block[0]`, in bytes. Big-endian. |
`comp_size[1]` | 8 bytes | Length of `rec_block[1]`, in bytes. Big-endian. |
... | ... | ... |
`decomp_size[num_blocks-1]` | 8 bytes | ... |
`rec_block[0]` | varying | A compressed block containing records. See below. |
... | ... | ... |
`rec_block[num_blocks-1]` | varying | ... |
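Reading the fixed part of the record section might look like the sketch below. `parse_record_sect_index` is a hypothetical helper; the two consistency checks follow directly from the field descriptions in the table.

```python
import struct


def parse_record_sect_index(data: bytes) -> tuple[int, list[tuple[int, int]], int]:
    """Return (num_entries, [(comp_size, decomp_size), ...], offset of rec_block[0])."""
    num_blocks, num_entries, index_len, blocks_len = struct.unpack_from(">4Q", data, 0)
    if index_len != 16 * num_blocks:
        raise ValueError("index_len does not match num_blocks")
    # The size pairs start right after the four 8-byte header fields.
    sizes = [struct.unpack_from(">QQ", data, 32 + 16 * i) for i in range(num_blocks)]
    if sum(comp for comp, _ in sizes) != blocks_len:
        raise ValueError("blocks_len does not match the sum of comp_size values")
    return num_entries, sizes, 32 + index_len
```

Summing the `comp_size` values gives the starting position of each record block, and summing the `decomp_size` values gives the range of record offsets that each block covers.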
Each record block is compressed (see "Compression"). After decompressing, they look like this:
`decompress(rec_block[0])` | Length | |
---|---|---|
`record[0]` | varying | The first record. If in an MDX file, this is null-terminated and encoded using `Encoding`. |
`record[1]` | varying | ... |
... | ... | ... |
Various data blocks are compressed using the same scheme. These all look like this:
`compress(data)` | Length | |
---|---|---|
`comp_type` | 4 bytes | Compression type. See below. |
`checksum` | 4 bytes | ADLER32 checksum of the uncompressed data. Big-endian. |
`compressed_data` | varying | Compressed version of `data`. |
The compression type is indicated by `comp_type`. There are three options:
- If `comp_type` is `'\x00\x00\x00\x00'`, then no compression is applied at all, and `compressed_data` is equal to `data`.
- If `comp_type` is `'\x01\x00\x00\x00'`, LZO compression is used.
- If `comp_type` is `'\x02\x00\x00\x00'`, zlib compression is used. It so happens that the zlib compression format appends an ADLER32 checksum, so in this case, `checksum` will be equal to the last four bytes of `compressed_data`.
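The block wrapper can be sketched with Python's standard `zlib` module. The names `compress_block` and `decompress_block` are made up for this example. LZO is not available in the standard library, so that branch only raises an error here.

```python
import struct
import zlib


def compress_block(data: bytes, use_zlib: bool = True) -> bytes:
    """Wrap data as comp_type + checksum + compressed_data."""
    checksum = struct.pack(">I", zlib.adler32(data) & 0xFFFFFFFF)
    if use_zlib:
        return b"\x02\x00\x00\x00" + checksum + zlib.compress(data)
    return b"\x00\x00\x00\x00" + checksum + data


def decompress_block(block: bytes) -> bytes:
    """Undo compress_block and verify the ADLER32 checksum."""
    comp_type, payload = block[:4], block[8:]
    (checksum,) = struct.unpack_from(">I", block, 4)
    if comp_type == b"\x00\x00\x00\x00":
        data = payload
    elif comp_type == b"\x02\x00\x00\x00":
        data = zlib.decompress(payload)
    elif comp_type == b"\x01\x00\x00\x00":
        raise NotImplementedError("LZO needs a third-party library")
    else:
        raise ValueError(f"unknown comp_type {comp_type!r}")
    if zlib.adler32(data) & 0xFFFFFFFF != checksum:
        raise ValueError("block checksum mismatch")
    return data
```

For a zlib block, bytes 4 to 8 of the output match its last four bytes, which is the coincidence noted in the last list item.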