
Review usage of MapDB in HashDictionary #35

Open
pstutz opened this issue Sep 16, 2015 · 3 comments

pstutz commented Sep 16, 2015

This is the class in question: https://github.com/uzh/triplerush/blob/master/src/main/scala/com/signalcollect/triplerush/dictionary/HashDictionary.scala

The purpose of this class is to offer a mapping from RDF IRIs/literals (think strings with many shared prefixes, usually between 20 and 80 characters long) to integers, and from those integers back to the same strings. For now we only add to this dictionary and never delete an existing mapping.

We use MapDB to store an in-memory Int->String mapping. For the String->Int direction we use a hash of the UTF-8 encoded byte array, and we only store an explicit String->Int entry (in a much smaller MapDB B-tree) when that hash collides.
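
For concreteness, a minimal sketch of the hashing step (MurmurHash3 is just a stand-in here; the actual hash function in HashDictionary may differ):

import java.nio.charset.StandardCharsets
import scala.util.hashing.MurmurHash3

// Sketch only: derive the candidate Int id from the UTF-8 bytes of an IRI/literal.
// Only when two different strings produce the same id does an explicit
// String->Int entry go into the much smaller collision B-tree.
def candidateId(s: String): Int =
  MurmurHash3.bytesHash(s.getBytes(StandardCharsets.UTF_8), MurmurHash3.arraySeed)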

The Int->String mapping is therefore the crucial part of our usage. We have configured its MapDB B-tree like this:

import java.util.concurrent.Executors
import org.mapdb.{BTreeKeySerializer, DBMaker, Serializer}

// Off-heap in-memory store with async writes and global compression.
val db = DBMaker
  .memoryUnsafeDB
  .closeOnJvmShutdown
  .transactionDisable
  .asyncWriteEnable
  .asyncWriteQueueSize(32768)
  .storeExecutorEnable(Executors.newScheduledThreadPool(math.min(16, Runtime.getRuntime.availableProcessors)))
  .compressionEnable
  .make

// Int keys, UTF-8 byte-array values, small B-tree nodes.
val int2String = db.treeMapCreate("int2String")
  .keySerializer(BTreeKeySerializer.INTEGER)
  .valueSerializer(Serializer.BYTE_ARRAY)
  .nodeSize(32)
  .makeOrGet[Int, Array[Byte]]()
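
For reference, a put/get round trip through this map looks like the following (a usage sketch; UTF-8 encoding as described above):

import java.nio.charset.StandardCharsets

// Illustrative usage of the int2String map defined above.
val iri = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
int2String.put(1, iri.getBytes(StandardCharsets.UTF_8))
val decoded = new String(int2String.get(1), StandardCharsets.UTF_8)
assert(decoded == iri)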

Compression is enabled so that the strings, which usually share long prefixes, do not take up as much space. The node size of 32 was chosen as a trade-off between read/write performance and memory usage; it also kept GC churn lower than larger values did.

We are considering the G1 GC and I wonder if it is a good fit for this kind of use case; it should result in less fragmentation than the other parallel collectors. I noticed that when adding entries in parallel at full speed, it can be hard for the GC to keep up.

Is there anything else we can do to reduce the GC pressure created by the intermediate objects?

I noticed that async writes with a pretty large queue size increased write performance by a lot. What is the trade-off, and what are the optimal values?

Do we need to compact the tree to keep the direct memory usage low over time?

Are there any other potential improvements?

Existing ideas for improvements:

  • Use Huffman coding for string compression
  • New string/byte[] comparator that works from the end of the array to the front, as many of our strings have long shared prefixes (see the sketch after this list)
  • We might be able to speed up loading with DataPump, but it would require significant preprocessing of our data
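
A rough sketch of the reversed comparator idea (a plain java.util.Comparator; wiring it into MapDB's key serializer is left out):

import java.util.Comparator

// Sketch: compare byte arrays from the last byte towards the first, so keys
// with long shared prefixes diverge after only a few comparisons.
object ReverseByteArrayComparator extends Comparator[Array[Byte]] {
  override def compare(a: Array[Byte], b: Array[Byte]): Int = {
    var i = a.length - 1
    var j = b.length - 1
    while (i >= 0 && j >= 0) {
      val cmp = (a(i) & 0xFF) - (b(j) & 0xFF)
      if (cmp != 0) return cmp
      i -= 1
      j -= 1
    }
    a.length - b.length // if one array is a suffix of the other, the shorter sorts first
  }
}
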
jankotek commented

  1. Enable values outside nodes (a combined sketch follows after this list).

  2. Disable compression globally and use it just on the int2String values:

val int2String = db.treeMapCreate("int2String")
  .keySerializer(BTreeKeySerializer.INTEGER)
  .valueSerializer(new Serializer.CompressionDeflateWrapper(Serializer.BYTE_ARRAY))
  .nodeSize(32)
  .makeOrGet[Int, Array[Byte]]()

  3. Enable a shared dictionary, building on the previous example:

new Serializer.CompressionDeflateWrapper(Serializer.BYTE_ARRAY, LEVEL, DICTIONARY)
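
Putting suggestions 1 and 2 together might look like this (a sketch; it assumes the B-tree maker's valuesOutsideNodesEnable, and the dictionary variant from 3 could be swapped in for the plain wrapper):

import org.mapdb.{BTreeKeySerializer, DBMaker, Serializer}

// Sketch: no global compressionEnable on the DBMaker; values live outside
// the B-tree nodes and are compressed individually by the wrapper serializer.
val db = DBMaker
  .memoryUnsafeDB
  .closeOnJvmShutdown
  .transactionDisable
  .make

val int2String = db.treeMapCreate("int2String")
  .keySerializer(BTreeKeySerializer.INTEGER)
  .valueSerializer(new Serializer.CompressionDeflateWrapper(Serializer.BYTE_ARRAY))
  .valuesOutsideNodesEnable
  .nodeSize(32)
  .makeOrGet[Int, Array[Byte]]()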

jankotek commented

Decrease the write queue size (currently .asyncWriteQueueSize(32768)).
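
For example (1024 is purely illustrative, not a value recommended in this thread):

import org.mapdb.DBMaker

// Sketch: the same DBMaker chain with a much smaller async write queue.
val db = DBMaker
  .memoryUnsafeDB
  .closeOnJvmShutdown
  .transactionDisable
  .asyncWriteEnable
  .asyncWriteQueueSize(1024) // example value; tune throughput vs. memory and GC pressure
  .make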

jankotek commented

Try a Koloboke LongLongMap for the mapping. The key is your id, the value is a recid from the MapDB Engine.

To insert a new value into MapDB and get its recid:

long recid = db.getEngine().put(value, Serializer.BYTE_ARRAY);

To get a value back from a recid:

byte[] val = db.getEngine().get(recid, Serializer.BYTE_ARRAY);
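
A sketch of this approach in Scala (the Koloboke factory and package name are assumptions; Koloboke's package moved between releases):

import net.openhft.koloboke.collect.map.hash.HashLongLongMaps
import org.mapdb.{DBMaker, Serializer}

// Sketch: a primitive long->long map holds id -> recid, and the MapDB Engine
// holds the raw byte arrays, keeping boxed entries off the lookup path.
val db = DBMaker.memoryUnsafeDB.transactionDisable.make
val id2Recid = HashLongLongMaps.newMutableMap()

def putEntry(id: Long, value: Array[Byte]): Unit = {
  val recid = db.getEngine.put(value, Serializer.BYTE_ARRAY)
  id2Recid.put(id, recid)
}

def getEntry(id: Long): Array[Byte] = {
  val recid = id2Recid.get(id) // absent ids return the map's default value
  db.getEngine.get(recid, Serializer.BYTE_ARRAY)
}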
