Skip to content
This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

pinning index (as ipfs objects) and cdb discussion #4

Open
jbenet opened this issue May 1, 2015 · 22 comments
Open

pinning index (as ipfs objects) and cdb discussion #4

jbenet opened this issue May 1, 2015 · 22 comments

Comments

@jbenet
Copy link
Member

jbenet commented May 1, 2015

this is a transplant from: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/ipfs-users/BI-P0H41-2E/d4Gd0akbYPoJ

@jbenet
Copy link
Member Author

jbenet commented May 1, 2015

@tv42 said:

Hi. I'm looking at switching pinning state from datastore to IPFS objects. The datastore will just point to the root of "the pinning tree". Direct and recursive pins are essentially sets, and indirect pins are current a map {key: refcount} but I think a better model for those is going to be a multi-valued key-value store, with {key1: pinner1, key1: pinner2, key2: ...}, etc.

Pin data changes rarely, but needs reasonably fast lookups.

We need to just get something done fairly quickly, so simple is better than state of the art, right now. Here's my current draft. Feedback is welcome.

Key-value store using IPFS objects

IPFS objects have schema (Links, Data) where Links is a list of
(Name, Size, Hash). Name must be UTF-8 and human friendly [pending
https://github.com/ipfs/kubo/issues/1172 ], so we choose to leave
it empty. Link Hash cannot be empty.

Each object Data is a CDB file:

There are three kinds of objects:

Empty

The well-known hash value
QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n has no Links and no
Data. We will use it to signal empty.

Small

An object with no Links is small.

Its Data is a CDB file, containing keys and values.

When adding key K with value V, nodes are kept small if size + 24 + len(K) + len(V) + 4 <= maxBytes (for explanation of the 4, see
large objects).

If removing a key results in a small object becoming empty, its hash
in the parent can be changed to empty.

Large

An object with Links is large. Its Data is a CDB file, just as with
small objects, with an extra 4 random bytes at the end. The extra
bytes are ignored by CDB tools.

A large object is kept as large as long as it has even one Link
that is not Empty. If a large object is made small, it's Links are
removed, and the last 4 bytes of Data are trimmed.

When a small object is made large, it has fanout Links added,
each set to empty and 4 random bytes appended to Data. fanout may
have different values at different versions of the software. While an
object is large, the number of Links does not change (to preserve
partitioning).

If a key is not in Data, the Link to look it up in is the one at index
h(seed || key) % len(Links) where h is a 32-bit FNV-1a hash,
seed is the last 4 bytes of Data, and || is a string concatenation
operation.

Removing keys may make a large Data be smaller than maxSize. This
is valid. Adding keys adds them to Data if there is room.

If a key exists in Data, it will not be looked up in Links. This means
removing a key must recursively remove the key from the relevant child
in Links, to avoid shadowed old values from becoming visible.

@jbenet
Copy link
Member Author

jbenet commented May 1, 2015

@wking said

On Thu, Apr 30, 2015 at 07:42:20PM -0700, Tommi Virtanen wrote:

Direct and recursive pins are essentially sets, and indirect pins
are current a map {key: refcount} but I think a better model for
those is going to be a multi-valued key-value store, with {key1:
pinner1, key1: pinner2, key2: ...}, etc.

If a key is not in Data, the Link to look it up in is the one at
index h(seed || key) % len(Links) where h is a 32-bit FNV-1a
hash, seed is the last 4 bytes of Data, and || is a string
concatenation operation.

Why hash the keys with a seed, since the keys are already multihashes
(the pinned node's ID)? I asked this on IRC 1, but I'm still not
clear on what non-multihash keys you intend to store in this
structure. Hashing keys a second time isn't going to break anything,
but it does sound like it will make lookup slower, especially with a
different random seed for each object in the pin tree.

Other than that, this looks reasonable to me. But, as you may have
gathered from our IRC discussion, I'm not really a low-level guy, so
take my opinions here with a grain of salt ;).

Cheers,
Trevor

@jbenet
Copy link
Member Author

jbenet commented May 1, 2015

I'll think about this more and reply with something more solid tomorrow, but basic feedback:

  • cdb is great! would be great to have examples of databases "ported" to run on top of ipfs. and maybe even end up with a simple way to use "cdb-on-ipfs" for other things easily.
  • that said, i think this model is a bit off-- i think the object keys in the "data" blocks (the pinned keys) should be actual links, not just stored in the data segment. that way the usual ipfs rules apply. can easily reason about the cum size of the whole thing, move it around, etc. then "pinning" becomes just:
    • maintain the pinning index (yields a root object (and hash))
    • make sure all the objects reachable from the pinning root are kept in local storage.
  • also not convinced that need to rehash with FNV-1a. why not just key % len(Links) ? the key (minus the multihash prefix) -- being a cryptographic hash -- is already uniformly distributed.

@jbenet
Copy link
Member Author

jbenet commented May 1, 2015

also tagging (for keeping in the loop + surfacing thoughts if relevant): @whyrusleeping @cryptix @krl

@tv42
Copy link

tv42 commented May 1, 2015

Without something in the x % len(Links) changing, the tree is worthless beyond the first level: more levels wouldn't splay out the keys any more.

If I put the pins in the Links.Hash, things change:

  1. quick lookups no longer exist (have to scan the slice). I could, however, build an index in Data, and get us back to this point (except without being able to reuse cdb code).
  2. multi-level becomes harder; how will I know which Links point to pinned objects, which ones just point to more pins. I guess I could put a bitmap in Data, too. Or abuse Link.Name for that, but that feels like abuse.

@tv42
Copy link

tv42 commented May 1, 2015

Thinking out lout: mph/cdb-style hash table in Data, result is 0<=i<=n, answer is Links[i] if Links[i].Hash == key. Links > n are inner nodes.

@wking
Copy link

wking commented May 2, 2015

On Fri, May 01, 2015 at 07:47:05AM -0700, Tv wrote:

Without something in the x % len(Links) changing, the tree is
worthless beyond the first level: more levels wouldn't splay out the
keys any more.

Ah, right. So you'd want to trie the cypto hash and shard
QmSomePinnedHash into sh/Ha/QmSomePinned (or use the prefixes after
stripping off the non-crypto bits, or whatever, see
multiformats/multihash#1). That means that you'd have to know your depth to
figure out which part of the multihash to use, but tracking depth
while you traverse the tree sounds cheaper than re-hashing with a
random seed for each object.

If I put the pins in the Links.Hash, things change:

  1. quick lookups no longer exist (have to scan the slice).

I don't understand “scan the slice”, with sorted link lists
(ipfs/kubo#915) you can do a binary lookup to find your target key
quickly. If you were looking for a particular pin (e.g. for insertion
or removal) and using links (instead of cdb), I'd suggest logic like:

  1. Look for a key matching your full multihash in the link lists.
    a. If you find it, that's your entry. For inserts, there's nothing
    to do. For deletes, remove the entry, and possibly collapse the
    object into it's parent if the removal shrinks it below your
    size threshold.
    b. If you don't find it, get the depth-appropriate slice from your
    crypto hash, and lookup that key in the link list.
    i. If you find it, enter that node, increment the depth, and go
    back to (1).
    ii. If you don't find it, your hash isn't listed. For deletes,
    there's nothing to do. For insert, if the addition will
    keep you under your size threshold, add the multihash as a
    link; if the addition will bump you over your size
    threshold, create a new child node for the key, and push all
    the full-multihash link keys with a matching
    depth-appropriate slice down into the new child node.
  1. multi-level becomes harder; how will I know which Links point to
    pinned objects, which ones just point to more pins. I guess I could
    put a bitmap in Data, too. Or abuse Link.Name for that, but that
    feels like abuse.

Just use an invalid multihash prefix for the depth-appropriate slice
entries. The multihash spec reserves 0x00-0x0f for
application-specific functions, so the base-whatever encoded version
of 0x00 0x00 … should be a reasonable prefix. As a bonus, this sorts
your link name space into {all the sub-object links} and {all the full
multi-hash links} sets, and you can cache that split boundary from the
(1) search to make the (b) search faster.

As an aside, I'm not sure how, given a byte string the might be a
base-whatever-encoded multihash, you should determine what base the
encoding is in. Are multihashes in link values required to be in
base-58? Are multihashes in the link values just raw binary
multihashes? If we want to use keys in the link names we should
probably either pick a standard encoding, record the chosing encoding
information somewhere (type? ipfs/ipfs#36), or allow binary link names
(ipfs/kubo#1172). Of those choices, I like storing the
multihash-to-UTF-8 encoding in a type reference best.

Anyhow, I'd be surprised if keeping everything in links is going to
perform as well as the cdb approach, but it sounds more native and
like less work, so it may be easier to implement and fast enough. I
don't mind if we use the cdb approach or not, but I think both
approaches are workable. I do think depth-based slicing is better
than random-seed rehashes though.

@jbenet
Copy link
Member Author

jbenet commented May 4, 2015

Without something in the x % len(Links) changing, the tree is worthless beyond the first level: more levels wouldn't splay out the keys any more.

Can use different substrings of the hash. it is a cryptographic hash -- which makes it appear (to a bounded adversary, etc) uniformly distributed at any bit length. So grabbing different substrings helps. -- i think @wking suggests the same above.

multi-level becomes harder; how will I know which Links point to pinned objects, which ones just point to more pins. I guess I could put a bitmap in Data, too. Or abuse Link.Name for that, but that feels like abuse.

unixfs dir's extend each of links in the data section. we could do this, or extend the link protobufs themselves (as has been discussed elsewhere). Agreed about not abusing the name.

@tv42
Copy link

tv42 commented May 6, 2015

The problem with using the height as input to the hash function is that now debugging by staring at a single object is harder; I'd be inclined to put the level in the object, and at that point, I might as well put in something else. And without a randomized hash, the data structure isn't very generic; it can't be safely & robustly used to store non-cryptohash, user-chosen, keys.

@wking
Copy link

wking commented May 6, 2015

On Wed, May 06, 2015 at 02:11:08PM -0700, Tv wrote:

The problem with using the height as input to the hash function is
that now debugging by staring at a single object is harder…

How frequently will you need to do this? It doesn't seem all that
much harder (one extra parameter to log vs. calculating a new hash for
each object you're looking at during your debug pass).

And without a randomized hash, the data structure isn't very
generic; it can't be safely & robustly used to store non-cryptohash,
user-chosen, keys.

If we're looking at a generic fanout index structure, then this is a
reasonable argument. Rehashing the key your searching for at every
object is not that steep a cost. But pushing the link payloads into
a cdb payload structure seems like too much of a stretch for this
approach to be a generic solution. If we keep the links and their
target hashes in the Links section 1, then we get interoperability
with the usual IPFS linking, but then the cdb is less useful…

I dunno. I think it's hard to get efficient fanout and name-based
lookup with the dual constraints of the base IPFS object spec and
protobuf's variable-length fields (for more thoughts on indexing
vs. protobufs, see ipfs-inactive/archive-format#6).

@tv42
Copy link

tv42 commented May 8, 2015

Next draft. I did not change the hash function yet, but now that the keys are actually stored in Links.Hash they have to be IPFS Keys anyway, so assuming uniformness is more acceptable.

Set/Multiset of Keys as IPFS object

IPFS objects have schema (Links, Data) where Links is a list of
(Name, Size, Hash). Name must be UTF-8 and human friendly, so we
choose to leave it empty. Link Hash cannot be empty.

We implement a Set and Multiset of IPFS objects, using IPFS
objects for storage.

For our purposes, we divide Links into two groups:

  • subtrees, fanout items
  • items, n items, sorted in lexicographical order by Hash

fanout changes rarely if ever, while n grows and shrinks
dynamically.

The object always has at least fanout Links (see below for where
that value is stored). Those links may point to the well-known hash
value QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n that has no
Links and no Data. We will use it to signal empty.

For Set: Keys in items are distinct. As a write optimization, a
key in items may also have a copy in the correct subtree.

For Multiset: items may contain the same key multiple times, and
the correct subtree may contain even more entries for the key. The
numbers of times a Key has been added to the Multiset is the sum of
all these reference counts.

The object Data consists of two sections:

  • an uvarint length prefixed protobuf message
  • for Multiset only: array of n refcounts

The protobuf message is described by

message Set {
    // 0 for now, library will refuse to handle entries with version greater than recognized.
    required uint32 version = 1;
    // how many of the links are subtrees
    required uint32 fanout = 2;
    // hash seed for subtree selection, a random number
    required fixed32 seed = 3;
}

Refcounts

To be useful in an array, we need fixed size refcounts. However, there
is no limit to how many times an IPFS object may be referenced. To
support this, we allow Multisets to contain multiple instances of the
item, both in the same node and in the correct subtree.

Version 0 of the format uses uint8 refcounts.

fanout changes

(This can be left unimplemented for now.)

We can avoid wasting space for fanout empty Links in leaf nodes by
making new, small, nodes have fanout 0.

fanout changes would change the hash decisions of all items in
subtrees, but we can choose to only make such changes at times
when that cost is non-existent.

Leaf to node: Changing fanout from 0 requires inserting fanout
empty items at the beginning of Links. This can be done at the time
a leaf node becomes too big, and is split into an inner node with
multiple leaves.

Node to leaf: Changing fanout to 0 can happen at a time when all
fanout Links are empty. This is to preserve subtree partitioning.

The exact fanout value may have different values at different
versions of the software.

Lookups

If a key is not in the n directly stored keys, the Link to look it
up in is the one at index h(seed || key) % n where seed is the
bytes of the uint32 in little-endian order, h is a 32-bit FNV-1a
hash, and || is a string concatenation operation.

Removing keys may decrease n, all the way down to 0. Adding keys
adds them to Links as long as n is small enough (details are
implementation specific and not to be relied upon).

Optimization: decreasing write amplification

To allow for cheap updates, Add operation can just insert keys in the
root node, without first checking whether the key is present in the
subtree. This can lead to multiple instances of the key being
present in the tree.

Thus, removing a key must recursively remove the
key from the relevant child in Links, to avoid shadowed old instances
from becoming visible.

Multiset refcount decrement can decrement the first instance it finds,
stopping there. Increment can opportunistically increment an existing
entry, adding a new entry if one is not found. Refcounts can be merged
at any time when they do not overflow.

@wking
Copy link

wking commented May 8, 2015

On Fri, May 08, 2015 at 09:58:31AM -0700, Tv wrote:

Next draft.

I'm liking this more :). Hopefully it's performant enough that we
don't end up getting pushed back to in-data cdbs, because then I'd
feel pretty silly :p.

I did not change the hash function yet…

Yeah, this is a minor issue for me too. Especially before we have
something available to benchmark.

// 0 for now, library will refuse to handle entries with version greater than recognized.
required uint32 version = 1;

I'm fine with this, and it's the traditional approach. But for the
archive format I'm hoping to use a multihash version to get the
self-describing features discussed in ipfs/ipfs#36. In that case, I'd
just store the accepted version of a spec like this in a unixfs file,
and use its multihash as the version for pinning nodes following that
spec. Pros and cons:

  • Increased spec discoverability. It's easy to go from a multihash to
    the spec, just ‘ipfs cat /ipfs/’, or ‘ipfs ls /ipfs/’ if
    it turns out to be a directory spec, or ….
  • Increased parallel evolvability. Just like Git and other Merkle-y
    things, using hash-based identifiers makes it easy to evolve the
    spec in parallel, fork, merge, etc. without needing a central agent
    to hand out the version identifiers (in this case, version numbers).
  • Difficult interoperability calculations. It's harder to say things
    like “I'm compatible with pin formats 2 through 7” if you're using
    multihash versions. You'd have hard-code the list of supported
    multihashes into the implementation or rely on some calculable
    feature of the referenced multihash. For core stuff like pin
    handling, I think it's best to hardcode the list of compatible
    multihashes.

On the balance, I think the pros outweigh the cons, especially for
specs that we expect to be fairly stable. But again, I'm fine if this
all sounds like crazy-talk and you'd rather store the version
information in an incremented integer.

If a key is not in the n directly stored keys, the Link to look it
up in is the one at index h(seed || key) % n

‘% n’ should be ‘% fanout’.

To allow for cheap updates, Add operation can just insert keys in
the root node, without first checking whether the key is present in
the subtree. This can lead to multiple instances of the key
being present in the tree.

Thus, removing a key must recursively remove the key from the
relevant child in Links, to avoid shadowed old instances from
becoming visible.

Works for me. I don't mind who pays the subtree-drilling cost. One
issue to consider is atomicity, but in either case I expect we'd only
update the root-pin-object hash after completing the operation, so
we'd keep a consistent state regardless of power-outages, etc. The
more expensive operation just runs a higher risk of failure, and
catastrophic failures like that should be rare enough that it doesn't
matter who we pick.

Multiset refcount decrement can decrement the first instance it
finds, stopping there. Increment can opportunistically increment an
existing entry, adding a new entry if one is not found. Refcounts
can be merged at any time when they do not overflow.

This seems like a useful thing to do during the garbage-collection
sweeps, and should be fairly efficient if the traversal algorithm is
depth-first through the subtrees in parallel with a walk across the
items. That's going to give you one item-walker per level, but this
tree shouldn't be so deep that that's a prohibitive cost. And doing
something that periodically refactors this tree during low-use periods
would help make subsequent operations more efficient.

@tv42
Copy link

tv42 commented May 8, 2015

@wking

Especially before we have something available to benchmark.

FNV-1a of an IPFS key takes about half the time of a single RAM cache miss.

But for the archive format I'm hoping to use a multihash version

That pretty quickly leads to a world where objects know their own type, and at that point that shouldn't live in Data, but alongside Links and Data. I'm not quite convinced it's worth storing that in every object. In any case, this is easy to convert to that, once the infrastructure exists. For more inspirational reading, see Ethos ETypes.

‘% n’ should be ‘% fanout’.

Yes. Thank you.

One issue to consider is atomicity,

It's a merkle tree, all atomicity comes from flipping the root hash, nothing else is possible.

@wking
Copy link

wking commented May 8, 2015

On Fri, May 08, 2015 at 11:56:54AM -0700, Tv wrote:

Especially before we have something available to benchmark.

FNV-1a of an IPFS key takes about half the time of a single RAM
cache miss.

In that case we might as well use it, since the person writing the
code should get to pick their favorite shed color ;).

But for the archive format I'm hoping to use a multihash version

That pretty quickly leads to a world where objects know their own
type, and at that point that shouldn't live in Data, but alongside
Links and Data.

I think the “where should this type information live” is what's
happening in ipfs/ipfs#36.

In any case, this is easy to convert to that, once the
infrastructure exists.

Well, modulo some version-dependent migration, since it won't fit into
your version scheme here. But yeah.

@tv42
Copy link

tv42 commented May 9, 2015

@wking I can still change the "current version" to 1, new versions can leave that tag unused in the protobuf, it'll get the value 0.

@wking
Copy link

wking commented May 9, 2015

On Fri, May 08, 2015 at 05:02:35PM -0700, Tv wrote:

I can still change the "current version" to 1, new versions can
leave that tag unused in the protobuf, it'll get the value 0.

Oh, right :p.

I think that covers all my concerns with the v2 spec :).

@jbenet
Copy link
Member Author

jbenet commented May 9, 2015

I like v2 very much!! 👍 👍 👍 cool construction


I've some questions which may be obviously stupid:

1, a few times it is mentioned that the key may be stored multiple times. why not store the keys once, and use an array of numbers, like:

links: [ l1, l2, l2, l3, l3 ]
data: { ... }
--- # vs
links: [ l1, l2, l3 ]
data: { counts: [ 1, 2, 2 ], ... }

2, it is possible to also use the names to signal fanout vs n ?

links: [ { hash: k1, name: "fanout", ... }, ... ]
data: { ... }

3, we could bite the bullet and figure out how the hell to make links nicely extensible. (for example, maybe we can designate proto tag numbers above 50 to be for the user. so could make the link:

message SetLink {
  // standard merkledag part
  bytes Hash = 1;
  string Name = 2;
  uint64 Tsize = 3;

  // ours
  bool fanout = 51;
  uint64 count = 52;
}

It may be a bit tricky to parse out, but we may be able to make a nice set of interfaces (a bit capnp inspired) which expose things like:

// in merkledag
type Link struct {}

func (ln *Link) Data() []byte {
  return ln.rawData
}

// in SetLink
func CastLink(ln *dag.Link) *SetLink {
  return NewSetLink(ln.Data())
}

func NewSetLink(data []byte) *SetLink {
  // validates data, check it's a conforming protobuf 
  setLinkCheckProtobuf(data) 
  return &SetLink{data}
}

(sorry, dont mean to co-opt this thread. we can discuss this complicated question elsewhere. Oh and, maybe I'm stupid but so far I find the Any type really cumbersome to use, more than "casting")


I aree with @wking here:

I'm hoping to use a multihash version to get the self-describing features discussed in ipfs/ipfs#36. In that case, I'd just store the accepted version of a spec like this in a unixfs file, and use its multihash as the version for pinning nodes following that spec.
...
On the balance, I think the pros outweigh the cons, especially for
specs that we expect to be fairly stable.

I think we

at that point that shouldn't live in Data, but alongside Links and Data

As described in ipfs/ipfs#36, Links and Data give us enough-- meaning that we can use a link and name it @type as in JSON-LD. So we'd have:

links: [ { name: "@type", hash: <hash-to-spec>, ... }, l2, l3, ... ]
data: { ... }

or @context

links: [ { name: "@context", hash: <hash-to-spec>, ... }, l2, l3, ... ]
data: { ... }

The importance of doing this in the links is that:

  • we don't doom ourselves to buy into one (sure to seem short-sighted and problematic in 5 years) type-structure.
  • all the ipfs link semantics apply (reference + backup specs, can traverse them in paths: .../myobj/@type, etc.)

For more inspirational reading, see Ethos ETypes.

I'll read on Ethos ETypes, thanks for posting. may want to check out PADS too (links in ipfs/ipfs#36) -- that's a fantastic piece of work-- even though it's more Type-Theory specific, and takes a long grokking time before the benefits are apparent. And to my knowlede there's no developer-sensible approaches yet. (i.e. no "JSON-LD" to the RDF.)


FNV-1a of an IPFS key takes about half the time of a single RAM
cache miss.

I still don't understand why, if we have uniformly distributed data, we need to rehash it? what am i missing?

Like -- depending on how big the seed is, and whether it's properly random -- why can't we just:

combine(seed, key) % fanout

where combine could be:

xor(seed, key[:len(seed)])
(seed || key)
(key[:len(seed)] ** seed)

In that case we might as well use it, since the person writing the
code should get to pick their favorite shed color ;).

Yeah, sounds good to me. go for it :) -- just, i don't understand why it's needed yet.

@jbenet
Copy link
Member Author

jbenet commented May 9, 2015

anyway all this is minor, this SGTM

@tv42
Copy link

tv42 commented May 10, 2015

On Sat, May 09, 2015 at 03:52:47AM -0700, Juan Batiz-Benet wrote:

1, a few times it is mentioned that the key may be stored multiple times. why not store the keys once, and use an array of numbers, like:

links: [ l1, l2, l2, l3, l3 ]
data: { ... }
--- # vs
links: [ l1, l2, l3 ]
data: { counts: [ 1, 2, 2 ], ... }

With the current design, the Links indexes map directly to indexes in
Data. If the items in Data are variable length, that'll no longer
hold. That complicates a lot of things, including e.g. sorting the the
item Links by Hash.

2, it is possible to also use the names to signal fanout vs n ?

links: [ { hash: k1, name: "fanout", ... }, ... ]
data: { ... }

Yes, it is. But I didn't want to repeat the word "fanout" about 256
times in every object, and couldn't come up with a way to make that
cheaper.

Also, if the fanout number is not easily accessible, we need to walk
the links and look for where the name changes.

3, we could bite the bullet and figure out how the hell to make
links nicely extensible.

Plenty of things there that could be done.

It's worth noting that protobuf as a whole is switching away from
that sort of "extensions using numbers" design, into using the Any
type with URLy identifiers; the stated reason was that they had too
much confusion from ID collisions. But then again, that's wasteful.

A top-level (sibling of Links and Data) protobuf Any would let us
include these things in a more structured manner, yet pay the string
cost only once per object. The real difficulty is if you want to
support GC holding on to links that are only in these extensions. I
can explain how to make that work, but it's a bit more complex. It'd
essentially look at the types of all fields, named Links or not, and
use that as criteria whether that's a link or not.

Venti solves a similar problem by using two streams for one kind of
data, one with well-known semantics and the other one with
type-specific semantics. The Multiset datatype could just keep another
array or such under the Any variable, for the refcount use or its edge
cases.

Now, I'd rather get this thing out there in a simple form, first,
because the above will have avalanche effects all over the system.

Let's please make a separate ticket for "extensible objects"? If and
when that stuff gets actually programmed, we'll migrate this over.
Unless you really want me to put this work on hold until that's done.

I aree with @wking here:

I'm hoping to use a multihash version to get the self-describing features discussed in ipfs/ipfs#36. In that case, I'd just store the accepted version of a spec like this in a unixfs file, and use its multihash as the version for pinning nodes following that spec.
...
On the balance, I think the pros outweigh the cons, especially for
specs that we expect to be fairly stable.

I think we

at that point that shouldn't live in Data, but alongside Links and Data

As described in ipfs/ipfs#36, Links and Data give us enough-- meaning that we can use a link and name it @type as in JSON-LD. So we'd have:

links: [ { name: "@type", hash: <hash-to-spec>, ... }, l2, l3, ... ]
data: { ... }

or @context

links: [ { name: "@context", hash: <hash-to-spec>, ... }, l2, l3, ... ]
data: { ... }

The importance of doing this in the links is that:

  • we don't doom ourselves to buy into one (sure to seem short-sighted and problematic in 5 years) type-structure.
  • all the ipfs link semantics apply (reference + backup specs, can
    traverse them in paths: .../myobj/@type, etc.)

That part that's hurting this particular use there is that if Links
can has all kinds of extra entries, we can no longer safely do things
like just compute a link index and follow that as a subtree.

That also conflicts with the idea that Links are unixfs filenames. You
really don't want to make "mkdir @context" an invalid operation.

I feel like the idea has merit, but slapping everything into Links is
going to be overly problematic. Why not use the protobuf? If you want
a type field, why not add a type field.

This conversation doesn't really belong in this ticket. Only one thing
matters here: do we put this work on hold waiting for the
extensibility discussion to result in working code, or not.

FNV-1a of an IPFS key takes about half the time of a single RAM
cache miss.

I still don't understand why, if we have uniformly distributed data, we need to rehash it? what am i missing?

Like -- depending on how big the seed is, and whether it's properly random -- why can't we just:

Because it's hard to convince me the data is always uniformly
distributed.

If I publish an object, it contains links that I have carefully
crafted. If I can make you pin it, I get to control what goes in your
indirect pins. I can very easily generate junk objects until I get out
ones with keys that e.g. all have 0x00 in the same location. I can
even hide them in a subfolder you're unlikely to look in. And that's
assuming the links even have to really point to existing objects to
trigger the attack, which I'm not even sure is needed.

(While doing this work, I'm starting to have opinions about how the
garbage collection could avoid all this.. but I really don't want to
delay this even further.)

:(){ :|:&};:

@jbenet
Copy link
Member Author

jbenet commented May 11, 2015

Now, I'd rather get this thing out there in a simple form, first, because the above will have avalanche effects all over the system.

Yeah, that sounds good to me 👍

Because it's hard to convince me the data is always uniformly distributed.

Ah! that's a good point. yep, we do need to hash 👍

@tv42
Copy link

tv42 commented May 11, 2015

Final version, needs to be archived somewhere, probably in this repo. Note that it's not specific to pinning.

Set/Multiset of Keys as IPFS object

IPFS objects have schema (Links, Data) where Links is a list of
(Name, Size, Hash). Name must be UTF-8 and human friendly, so we
choose to leave it empty. Link Hash cannot be empty.

We implement a Set and Multiset of IPFS objects, using IPFS
objects for storage.

For our purposes, we divide Links into two groups:

  • subtrees, fanout items
  • items, n items, sorted in lexicographical order by Hash

fanout changes rarely if ever, while n grows and shrinks
dynamically.

The object always has at least fanout Links (see below for where
that value is stored). Those links may point to the well-known hash
value QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n that has no
Links and no Data. We will use it to signal empty.

For Set: Keys in items are distinct. As a write optimization, a
key in items may also have a copy in the correct subtree.

For Multiset: items may contain the same key multiple times, and
the correct subtree may contain even more entries for the key. The
numbers of times a Key has been added to the Multiset is the sum of
all these reference counts.

The object Data consists of two sections:

  • an uvarint length prefixed protobuf message
  • for Multiset only: array of n refcounts, little endian

The protobuf message is described by

message Set {
    // 1 for now, library will refuse to handle entries with version different from what's recognized.
    required uint32 version = 1;
    // how many of the links are subtrees
    required uint32 fanout = 2;
    // hash seed for subtree selection, a random number
    required fixed32 seed = 3;
}

Refcounts

To be useful in an array, we need fixed size refcounts. However, there
is no limit to how many times an IPFS object may be referenced. To
support this, we allow Multisets to contain multiple instances of the
item, both in the same node and in the correct subtree.

Version 1 of the format uses uint8 refcounts.

fanout changes

(This can be left unimplemented for now.)

We can avoid wasting space for fanout empty Links in leaf nodes by
making new, small, nodes have fanout 0.

fanout changes would change the hash decisions of all items in
subtrees, but we can choose to only make such changes at times
when that cost is non-existent.

Leaf to node: Changing fanout from 0 requires inserting fanout
empty items at the beginning of Links. This can be done at the time
a leaf node becomes too big, and is split into an inner node with
multiple leaves.

Node to leaf: Changing fanout to 0 can happen at a time when all
fanout Links are empty. This is to preserve subtree partitioning.

The exact fanout value may have different values at different
versions of the software.

Lookups

If a key is not in the n directly stored keys, the Link to look it
up in is the one at index h(seed || key) % fanout where seed is
the bytes of the uint32 in little-endian order, h is a 32-bit
FNV-1a hash, and || is a string concatenation operation.

Removing keys may decrease n, all the way down to 0. Adding keys
adds them to Links as long as n is small enough (details are
implementation specific and not to be relied upon).

Optimization: decreasing write amplification

To allow for cheap updates, Add operation can just insert keys in the
root node, without first checking whether the key is present in the
subtree. This can lead to multiple instances of the key being
present in the tree.

Thus, removing a key must recursively remove the
key from the relevant child in Links, to avoid shadowed old instances
from becoming visible.

Multiset refcount decrement can decrement the first instance it finds,
stopping there. Increment can opportunistically increment an existing
entry, adding a new entry if one is not found. Refcounts can be merged
at any time when they do not overflow.

@sivachandran
Copy link

Is there anything implemented related to this? Is there a way to store key-value pair and retrieve/lookup the value by key?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants