The purpose of this class is to offer a mapping of RDF IRIs/literals (think strings with lots of shared prefixes, usually between 20 and 80 characters long) to integers, and from integers back to those same strings. For now we only add to this dictionary and never delete an existing mapping.
We use MapDB to store an in-memory mapping from Int->String. For the String->Int direction we derive the ID from a hash of the UTF-8 encoded byte array, so that direction normally needs no storage of its own. We only explicitly store a String->Int entry when this hash collides (in a separate, much smaller MapDB B-tree).
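To make the scheme concrete, here is a minimal sketch of the insert path, with plain Scala maps standing in for the two MapDB B-trees (the identifiers, the MurmurHash3 choice, and the probing strategy are illustrative, not taken from the actual class):

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Illustrative sketch only; in the real class both maps are MapDB B-trees.
class SketchDictionary {
  private val id2String = mutable.Map.empty[Int, Array[Byte]] // the big mapping
  private val collisions = mutable.Map.empty[String, Int]     // much smaller: only hash collisions

  /** Returns the ID for `s`, inserting it if it is not yet in the dictionary. */
  def addEntry(s: String): Int = {
    val bytes = s.getBytes(UTF_8)
    val hash = MurmurHash3.bytesHash(bytes, 0) // the ID is normally just the hash
    id2String.get(hash) match {
      case None =>
        id2String(hash) = bytes // common case: no String->Int entry is stored at all
        hash
      case Some(existing) if java.util.Arrays.equals(existing, bytes) =>
        hash // already in the dictionary
      case Some(_) =>
        // rare case: hash collision, so this string needs an explicitly stored ID
        collisions.getOrElseUpdate(s, {
          var id = hash + 1 // naive probing for a free slot; the real strategy may differ
          while (id2String.contains(id)) id += 1
          id2String(id) = bytes
          id
        })
    }
  }

  def getString(id: Int): Option[String] =
    id2String.get(id).map(new String(_, UTF_8))
}
```

The point of the scheme is that the String->Int direction is almost free: the ID can be recomputed from the string itself, and only the rare collisions need explicit entries.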
For this reason the Int->String mapping is the crucial part of our usage, and its MapDB B-tree configuration is what we have tuned. Compression is enabled so that the strings, which usually share long prefixes, take up less space. A node size of 32 was chosen as a trade-off between read/write performance and memory usage; 32 also kept GC churn lower than larger values did.
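Roughly, the setup looks like the sketch below, assuming the MapDB 2.x builder API (the map name is made up, and exact class and method names vary between MapDB versions):

```scala
import org.mapdb.{BTreeKeySerializer, DBMaker, Serializer}

// Sketch, assuming the MapDB 2.x builder API.
val db = DBMaker
  .memoryDirectDB()     // values live in direct ByteBuffers rather than on the GC'd heap
  .transactionDisable()
  .make()

val int2String = db.treeMapCreate("int2String")
  .keySerializer(BTreeKeySerializer.INTEGER)
  // compression pays off because neighboring values share long prefixes
  .valueSerializer(new Serializer.CompressionWrapper(Serializer.BYTE_ARRAY))
  .nodeSize(32)         // the trade-off discussed above: throughput vs. memory and GC churn
  .makeOrGet[Integer, Array[Byte]]()
```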
We are considering using the G1 GC, and I wonder whether it is a good fit for this kind of use case. It should cause less fragmentation than the other parallel collectors. I noticed that when adding entries in parallel and at full speed, it can be hard for the GC to keep up.
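For reference, trying G1 is a single JVM flag; the pause-time target below is shown only as the main knob one would then tune, not as a recommendation:

```
java -XX:+UseG1GC -XX:MaxGCPauseMillis=200 ...
```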
Is there anything else we can do to reduce the GC pressure created by the intermediate objects?
I noticed that enabling async writes with a pretty large queue size increased write performance by a lot. What is the trade-off, and what are good values?
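The knobs meant here are presumably the async-write options on the 2.x DBMaker, roughly as in this sketch (the queue size is just an example figure):

```scala
import org.mapdb.DBMaker

// Sketch, assuming the MapDB 2.x DBMaker API: a background thread drains the
// write queue, so writers return before the B-tree is actually updated.
val asyncDb = DBMaker
  .memoryDirectDB()
  .transactionDisable()
  .asyncWriteEnable()
  .asyncWriteQueueSize(32768) // example value: a bigger queue absorbs write bursts,
                              // but holds more unflushed entries in memory
  .make()
```

Presumably the costs are the memory held by queued-but-unflushed entries, reads of very recent writes having to consult the queue, and more work at commit/close time.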
Do we need to compact the tree to keep the direct memory usage low over time?
Are there any other potential improvements?
Existing ideas for improvements:
Use Huffman coding for string compression
New string/byte[] comparator that works from the end of the array to the front, since many of our strings have long shared prefixes (see the sketch after this list)
We might be able to speed up loading with DataPump, but it would require significant preprocessing of our data
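A sketch of the reversed comparator idea from the list above (hypothetical code, not part of the class):

```scala
import java.util.Comparator

// Compare byte arrays from the last byte towards the first, so that keys with
// long shared prefixes are told apart after only a few byte comparisons.
object ReverseByteArrayComparator extends Comparator[Array[Byte]] {
  override def compare(a: Array[Byte], b: Array[Byte]): Int = {
    var i = a.length - 1
    var j = b.length - 1
    while (i >= 0 && j >= 0) {
      val cmp = (a(i) & 0xff) - (b(j) & 0xff) // unsigned byte comparison
      if (cmp != 0) return cmp
      i -= 1
      j -= 1
    }
    a.length - b.length // one array is a suffix of the other: the shorter one orders first
  }
}
```

Whether and where a custom key comparator can be plugged into the B-tree depends on the MapDB version.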
This is the class that's being described here: https://github.com/uzh/triplerush/blob/master/src/main/scala/com/signalcollect/triplerush/dictionary/HashDictionary.scala