Skip to content

Latest commit

 

History

History
492 lines (448 loc) · 7.78 KB

walkthrough.md

File metadata and controls

492 lines (448 loc) · 7.78 KB

XOR

In the beginning...

Truth Table

  x y
x 0 1
y 1 0

Missing Element "Trick"

XOR is communicative, associative, and not idempotent:

A ^ B == B ^ A
A ^ (B ^ C) == (A ^ B) ^ C
A ^ A == 0 != A 

This peculiar combination of properties can be used to find the missing element of two otherwise identical sets:

>>> import operator
>>> a = [5, 7, 9]
>>> b = [3, 5, 7, 9]
>>> A = reduce(operator.xor, a, 0)
>>> A
11
>>> B = reduce(operator.xor, b, 0)
>>> B
8
>>> A ^ B
3

This quality does not depend on the size of the sets (we could do the last step with very large sets in O(1) time).

Bloom Filter Example

>>> from pybloom import BloomFilter
>>> f = BloomFilter(capacity=10, error_rate=0.5)
>>> f.add(1)
False
>>> f.bitarray
bitarray('00000000000100000000000000000')
>>> f.add(2)
False
>>> f.bitarray
bitarray('00000000000100000000000010000')
+>>> 1 in f
True
+>>> 2 in f
True
+>>> 3 in f # False Positive
True

Counting Bloom Filter

Just like a bloom filter, but expand each bit into a cell that contains a counter.

[{"count": 0},{"count": 0},...]

The counts permit deleting elements from the filter (as long as you don't exceed max count size).

Invertible Bloom Filter

An invertible bloom filter combines the curious properties of XOR with a counting bloom filter. In addition to the count field we now also have an id. The id will contain the actual keys inserted into the filter.

[{"count": 0, "id": 0},{"count": 0, "id": 0},...]

When we add elements to the set:

  1. hash(value) -> N positions
  2. for each position, increment the count and xor id with value.

Finally there is a hash field that is used to reduce the error rate of miss identifying an id when the cell count is 1.

So the final cells look a bit like:

[{"count": 0, "id": 0, "hash": 0},{"count": 0, "id": 0, "hash": 0},...]

OK, let's do a example with the ibf tool. It produces the IBFs in JSON format so we can easily inspect what is happening.

$ ibf create demo.ibf 10
$ jq '.' demo.ibf

This first block is the serialized parameters to our hash functions that pick positions for entries.

{
  "positioners": [
    {
      "key": [
        8717895732742166000,
        2259404117704393200
      ]
    },
    {
      "key": [
        6050128673802996000,
        501233450539197800
      ]
    },
    {
      "key": [
        3390393562759376400,
        2669985732393126000
      ]
    }
  ],

The hasher section is the serialized parameters to our hash function that fills in the hash field of a cell.

  "hasher": {
    "key": [
      1774932891286980000,
      6044372234677422000
    ]
  },

The actual IBF cells, composed as discussed of an id, hash, and count.

  "size": 10,
  "cells": [
    {
      "key": {
        "data": "AAAAAAAAAAA="
      },
      "digest": 0,
      "count": 0
    },
    {
      "key": {
        "data": "AAAAAAAAAAA="
      },
      "digest": 0,
      "count": 0
    },
    {
      "key": {
        "data": "AAAAAAAAAAA="
      },
      "digest": 0,
      "count": 0
    },
    {
      "key": {
        "data": "AAAAAAAAAAA="
      },
      "digest": 0,
      "count": 0
    },
    {
      "key": {
        "data": "AAAAAAAAAAA="
      },
      "digest": 0,
      "count": 0
    },
    {
      "key": {
        "data": "AAAAAAAAAAA="
      },
      "digest": 0,
      "count": 0
    },
    {
      "key": {
        "data": "AAAAAAAAAAA="
      },
      "digest": 0,
      "count": 0
    },
    {
      "key": {
        "data": "AAAAAAAAAAA="
      },
      "digest": 0,
      "count": 0
    },
    {
      "key": {
        "data": "AAAAAAAAAAA="
      },
      "digest": 0,
      "count": 0
    },
    {
      "key": {
        "data": "AAAAAAAAAAA="
      },
      "digest": 0,
      "count": 0
    }
  ],

The cardinality is the total number of elements in the IBF.

  "cardinality": 0
}

Looking more closely at the cells when we insert an element:

$ ibf insert demo.ibf 'A Value'
$ jq '.cells' demo.ibf
[
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAdBIFZhbHVl"
    },
    "digest": 11333042293537241000,
    "count": 1
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAdBIFZhbHVl"
    },
    "digest": 11333042293537241000,
    "count": 1
  },
  {
    "key": {
      "data": "AAAAAAAAAAdBIFZhbHVl"
    },
    "digest": 11333042293537241000,
    "count": 1
  }
]

When we attempt to list entries we are looking for "pure" cells: Cells that have a count of 1 and the id hashes to the same value as hash.

$ ibf list demo.ibf
A Value

As long as we can pull out "pure" cells we can unravel the IBF. But if we run out of "pure" cells and there are non-empty ones left we know we weren't able to get them all.

$ ibf insert demo.ibf 'B Value'
$ ibf insert demo.ibf 'C Value'
$ ibf list demo.ibf
A Value
B Value
C Value
$ jq '.cells' demo.ibf 
[
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAdBIFZhbHVl"
    },
    "digest": 11333042293537241000,
    "count": 1
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAdCIFZhbHVl"
    },
    "digest": 10437079027557620000,
    "count": 1
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAdDIFZhbHVl"
    },
    "digest": 9477556251531188000,
    "count": 1
  },
  {
    "key": {
      "data": "AAAAAAAAAAABAAAAAAAA"
    },
    "digest": 1391892804557128400,
    "count": 2
  },
  {
    "key": {
      "data": "AAAAAAAAAAADAAAAAAAA"
    },
    "digest": 977555963097886000,
    "count": 2
  },
  {
    "key": {
      "data": "AAAAAAAAAAACAAAAAAAA"
    },
    "digest": 2215778548118941700,
    "count": 2
  }
]

The way ibf list works is to pop "pure" cells out of the set and then look again for more. We can simulate this by removing elements manually. Let's remove 'B Value':

$ ibf remove demo.ibf 'B Value'
$ jq '.cells' demo.ibf 
[
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAdBIFZhbHVl"
    },
    "digest": 11333042293537241000,
    "count": 1
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAAAAAAAAAAA"
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAA="
    },
    "digest": 0,
    "count": 0
  },
  {
    "key": {
      "data": "AAAAAAAAAAdDIFZhbHVl"
    },
    "digest": 9477556251531188000,
    "count": 1
  },
  {
    "key": {
      "data": "AAAAAAAAAAdDIFZhbHVl"
    },
    "digest": 9477556251531188000,
    "count": 1
  },
  {
    "key": {
      "data": "AAAAAAAAAAdBIFZhbHVl"
    },
    "digest": 11333042293537241000,
    "count": 1
  },
  {
    "key": {
      "data": "AAAAAAAAAAACAAAAAAAA"
    },
    "digest": 2215778548118941700,
    "count": 2
  }
]