Skip to content

lachenmayer/hyperdb-authorization-guide

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Authorization in hyperdb: a deep dive

The goal of this guide is to uncover how hyperdb authorization works in detail. HyperDB creates a key-value store from several hypercore feeds (append-only logs), and allows multiple writers to write to a single database. Here, I'll only be focusing on the built-in authorization mechanism - if you want to find out more about how the data structure itself works, check out the hyperdb architecture document for details (it's very cool).

This guide might be useful to you if you want to know more about the inner workings of the forthcoming "multiwriter" features in the Dat ecosystem. If you don't know what hyperdb or Dat is, I recommend checking out the Dat website and exploring the various projects featured on dat.land. That said, this guide does not assume any knowledge of Dat or hyperdb, so you should be able to follow this if you have never used Dat or hyperdb before. This guide does go deep into the code, so if that's not your thing, this is probably not for you.

If anything is confusing or unclear, please open an issue! Also, if you notice any mistakes like typos or broken links, feel free to open a pull request or open an issue.

The hyperdb architecture document tells us this:

The set of hypercores are authorized in that the original author of the first hypercore in a hyperdb must explicitly denote in their append-only log that the public key of a new hypercore is permitted to edit the database. Any authorized member may authorize more members. There is no revocation or other author management elements currently.

In this guide, we'll dig a bit deeper than that, and find out how exactly authors denote these permissions, and some of the implications of this. We'll start by defining exactly what hyperdb does, play around with these concepts in an interactive example, and then thoroughly walk through the bits of code which make all this magic happen.

I will link to specific lines in the hyperdb GitHub repo throughout. These will link to the current release v3.5.0. I am not linking to master, because otherwise the line numbers will get messed up when the code gets updated. By the time you read this, a lot of this might very well be out of date - things move fast in Dat land. 🐝

hyperbasics - hypercore & hyperdb

HyperDB is made up of several hypercore feeds. Every one of these feeds is an append-only log containing the writes made by a single person/device. We'll get into what exactly goes into these feeds below.

Hypercore feeds have a very simple API (slightly simplified):

  • get(seq): Returns the data at sequence number seq.
  • append(data): Add new data to the end of the feed. Returns the seq number at which this data is stored in the feed.

All of the data that is added to the feed is cryptographically signed using a secret key. The person holding the secret key is referred to as the owner of the feed.

Feeds are identified by their corresponding public key (usually just called key), which allows the receiver to verify the data sent to them by untrusted peers.

The fact that anyone with access to the public key can verify that the data is legitimate (ie. comes from the person who owns the corresponding secret key) no matter where it came from is what makes hypercore a peer-to-peer data structure. Hypercore feeds are also very efficient to replicate across a network.

From these feeds, hyperdb builds a multi-writer key-value database (db) which allows for the following operations:

  • get(key): Returns the value stored at the given key if it exists.
  • put(key, value): Store the given value in the db at the given key.
  • delete(key): Mark the value at the given key as deleted.
  • authorize(key): Allow the writer with the given key to publish their writes to the db. (Note that key here specifically refers to a public key, unlike the key argument in the get/put operations.)

How exactly get/put/delete work is out of the scope of this guide, but if you're interested, read the hyperdb architecture document (I'll be referring to this document throughout).

Source & local feeds

There are 2 special feeds in every archive:

  • source: the feed belonging to the owner of the db. The source feed's public key is the db key, which identifies the db. (This feed is stored in /source inside the db folder.)
  • local: the writable feed which contains all of your own writes. (Stored in /local inside the db folder.)

If you own the source feed, ie. if you are the owner of the db, your local feed is just the source feed, and no separate local feed directory will be created. (source)

Feeds that belong to other writers (ie. everyone else) are read-only, since they represent other people's writes. (They are stored in /peers/<discovery key>.)

How to authorize a writer to a hyperdb

To authorize a second writer to your database, you use the db.authorize(key, [callback]) method, like so:

const hyperdb = require('hyperdb')

const db1 = hyperdb('db1-data')
db1.ready(() => {
  const db2 = hyperdb('db2-data', db1.key)
  // IMPORTANT: You need to authorize db2.local.key, NOT db2.key!
  db1.authorize(db2.local.key, err => {
    if (err) console.log(err)
    // db2 is authorized by db1 to write! :)
    // don't forget to replicate changes.
  })
})

Warning: This method unfortunately has some very subtle behavior - if you accidentally pass db2.key instead of db2.local.key to the authorize method, it will silently fail, and db2 will not be authorized. db2.local.key refers to the local feed's public key, ie. the key of the feed containing the second writer's changes.

Another very common reason why writes from a different peer might not show up is that you may have forgotten to replicate the writes between the two dbs. Check out the example below for working code that shows how to do this.

Example: authorizing a writer to a hyperdb

The example.js file in this repo contains code that demonstrates the authorization flow in hyperdb. This code:

  • creates a new hyperdb db1, and writes some data to it,
  • creates another hyperdb db2, writes some data to that,
  • authorizes db2 in db1,
  • replicates the changes from db1 to db2 and vice versa,
  • prints the contents of all feeds in the two dbs.

To try it out yourself, download this repo and run:

npm install # if you haven't already
node example.js

If you haven't got node at hand and/or are too lazy to run it yourself, just expand the Example output block below.

Example output

Note that the keys won't be the same if you run this, as they are generated every time you run the script.

You're also missing the fancy colors of the real output, just saying...

===== db 1 =====
feeds: 2

feed 0 8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28 (source) (local)

1
  key: example/first
  value: db1 was here
  deleted: false
  trie:  []
  clock: [ 2 ]
  inflate: 1
  feeds:
    - 0: 8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28

2
  key: example/second
  value: db1 was here
  deleted: false
  trie:  [ <32 empty items>,
  [ <1 empty item>,
    [ { feed:
         '8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28',
        seq: 1 } ] ] ]
  clock: [ 3 ]
  inflate: 1
  feeds:

3 'inflate entry'
  key: 
  value: null
  deleted: false
  trie:  [ [ <3 empty items>,
    [ { feed:
         '8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28',
        seq: 2 } ] ] ]
  clock: [ 4, 0 ]
  inflate: 3
  feeds:
    - 0: 8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28
    - 1: f7a57223ebf39f9dbc0afb5e505b70041d92b58ca8f85aa85fa7644677f88dff




feed 1 f7a57223ebf39f9dbc0afb5e505b70041d92b58ca8f85aa85fa7644677f88dff

1
  key: example/third
  value: db2 was here
  deleted: false
  trie:  []
  clock: [ 0, 2 ]
  inflate: 1
  feeds:
    - 0: 8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28
    - 1: f7a57223ebf39f9dbc0afb5e505b70041d92b58ca8f85aa85fa7644677f88dff

2
  key: example/second
  value: db2 was here
  deleted: false
  trie:  [ <32 empty items>,
  [ <2 empty items>,
    [ { feed:
         'f7a57223ebf39f9dbc0afb5e505b70041d92b58ca8f85aa85fa7644677f88dff',
        seq: 1 } ] ] ]
  clock: [ 0, 3 ]
  inflate: 1
  feeds:



===== db 2 =====
feeds: 2

feed 0 8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28 (source)

1
  key: example/first
  value: db1 was here
  deleted: false
  trie:  []
  clock: [ 2 ]
  inflate: 1
  feeds:
    - 0: 8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28

2
  key: example/second
  value: db1 was here
  deleted: false
  trie:  [ <32 empty items>,
  [ <1 empty item>,
    [ { feed:
         '8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28',
        seq: 1 } ] ] ]
  clock: [ 3 ]
  inflate: 1
  feeds:

3 'inflate entry'
  key: 
  value: null
  deleted: false
  trie:  [ [ <3 empty items>,
    [ { feed:
         '8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28',
        seq: 2 } ] ] ]
  clock: [ 4, 0 ]
  inflate: 3
  feeds:
    - 0: 8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28
    - 1: f7a57223ebf39f9dbc0afb5e505b70041d92b58ca8f85aa85fa7644677f88dff




feed 1 f7a57223ebf39f9dbc0afb5e505b70041d92b58ca8f85aa85fa7644677f88dff (local)

1
  key: example/third
  value: db2 was here
  deleted: false
  trie:  []
  clock: [ 0, 2 ]
  inflate: 1
  feeds:
    - 0: 8a9a8586d53c4836862b5c853a0cec7226af17c682a083ea4447d9342bce9f28
    - 1: f7a57223ebf39f9dbc0afb5e505b70041d92b58ca8f85aa85fa7644677f88dff

2
  key: example/second
  value: db2 was here
  deleted: false
  trie:  [ <32 empty items>,
  [ <2 empty items>,
    [ { feed:
         'f7a57223ebf39f9dbc0afb5e505b70041d92b58ca8f85aa85fa7644677f88dff',
        seq: 1 } ] ] ]
  clock: [ 0, 3 ]
  inflate: 1
  feeds:



The output might seem a bit overwhelming at first, but it should hopefully make sense with a bit of explanation.

For both dbs, db 1 and db 2, we iterate through their feeds. For every feed, we print the key, and whether it's a source or local feed (or neither).

For every entry in the feed, we print all of its contents.

From this data, hyperdb computes a key-value store that contains three keys:

key value(s)
example/first db1 was here
example/second db1 was here, db2 was here (a conflicting value!)
example/third db2 was here

If you ran the example app yourself, you can try out a little tool I wrote called hyperdb-explorer to convince yourself that your dbs actually contains these values. Just run:

npx hyperdb-explorer data/db1/

This tool will list all of the keys in the db, and let you explore the contents of the nodes further. (Try it with data/db2/ too!)

These are both just a nicer representation of the raw data that is written to the data/ directory when you run the script. You should have two directories, data/db1 and data/db2, containing several more directories, as described above. If you're comfortable with looking at binary files, check out what's in these directories.

Try playing around with the code in example.js. Some interesting things to look out for are:

  • What happens if you create a third db, and...
    • authorize with the second db / authorize with the first db / don't authorize at all
    • replicate with the second db / replicate with the first db / don't replicate at all
  • If you put more values into any of the dbs, what happens to the inflate value in the newly written nodes?
  • How do the clock & trie values change when you do any of these?
    • The architecture document has a nice description of this - this doesn't really have much to do with authorization, but this is probably the most interesting part of the data structure itself, so check it out!

Once you have a good intuition of what these values (might) mean and how they change when you play around with them, read on to get a deeper understanding of how this all works under the hood. (Or just read on anyway, nobody's gonna stop you...)

What actually happens when you authorize writers

This section goes into the nitty gritty details, down to the exact lines of code that make authorization happen. If you don't care about the details, skip to the TL;DR section below.

Authorizing a new writer - HyperDB#authorize

To authorize a feed with a given key, the user calls HyperDB#authorize, which does the following:

Writing the update to the db - put

The put method is quite complex (see lib/put.js), because it calculates the trie value which maps keys to positions in the feeds. See the architecture document if you're interested in the details.

For our purposes, all we care about is that a put causes some value to be appended to the local feed. This calls Writer#append, which deals with encoding the messages that will be written to the feed.

What gets written to the db - Writer, Entry & InflatedEntry

The Writer class contains the logic related to writing messages to a single feed. There are two different types of messages: Entry and InflatedEntry. These are defined in schema.proto, which is a Protocol Buffer schema file. Protocol Buffer (aka. "protobuf" or just "proto") is an efficient binary encoding protocol designed to be compact and easy to parse in any programming language.

The Entry message represents writes to the db (ie. put/delete operations), and contains all the required information, such as the key and value (see architecture doc for details).

Most interestingly for us, an Entry message also contains a field called inflate, which is a positive number (uint64). This value is a sequence number in the feed: it refers to the latest InflatedEntry that has been written to the feed. (You can verify that this is the case by running the example above.)

The InflatedEntry message contains all of the same fields as an Entry message, but additionally contains a repeated field called feeds. This is an array containing all the keys that are authorized to write to this db. It also contains a contentFeed field, which is not relevant for authorization. This refers to a separate feed containing the content for when you want to keep the content and the metadata separate. This could be useful when you are storing large values, for example when using hyperdb as a file system, eg. as a storage backend for Dat. (In the example app, the inflated entries are marked with 'inflated entry'.)

The InflatedEntry message is used to represent a change in the set of authorized writers. Any time an entry is to be appended to the local feed, the Writer#append method checks whether the entry needs to be inflated. An entry needs to be inflated when the number of known feeds in the database is different to the number of feeds in the latest inflated entry message (referred to as _feedsMessage). Once a new entry is inflated, it will contain the latest known list of feeds. (Again, you can verify this with the example app.)

This is what happens when you call HyperDB#authorize: when the empty value is put, the fact that a new feed was created will cause the entry to be inflated. The inflated entry will contain the list of feeds including the new writer.

How writers are discovered

Starting discovery - HyperDB#replicate & HyperDB#ready

When you want to read data from a remote db, you need to replicate all of the feeds in the db from a peer. This happens in HyperDB#replicate, which calls hypercore#replicate for every feed in the list of authorized writers. The stream that is returned from HyperDB#replicate implements the hypercore protocol, which you can pipe to another peer over any reliable stream, for example a TCP connection or a WebSocket.

When you first replicate a db, the only writer you know is the owner of the db. You know about this writer because the db key is the key of the owner's feed, the source feed.

In HyperDB#ready, the source and local writers are set up. For both of these writers (or just the source writer, if you are the owner of the db), HyperDB#_writer is called.

Initialising writers & finding more writers - HyperDB#_writer

Every writer is initialised in HyperDB#_writer. This method creates a hypercore feed and a Writer instance corresponding to that feed. A listener for each feed's 'sync' event is added, which calls the Writer#head method once the feed has been completely downloaded.

Writer#head retrieves the latest message from the feed using Writer#get. Once the head is retrieved, it is decoded in Writer#_decode. This calls Writer#_loadFeeds.

Writer#_loadFeeds checks the head's inflate value - remember that the inflate number refers to the sequence number in the feed which contains the latest InflatedEntry, which contains a list of feeds.

If the inflate value is the same as the head's sequence number, then the head contains the most recent list of feeds (since it's the most recent message). If not, Writer#_loadFeeds retrieves the entry in the feed at sequence number inflate.

Once the latest inflated entry is found, Writer#_addWriters is called, which calls HyperDB#_addWriter for every feed in the list of feeds.

HyperDB#_addWriter creates a new hypercore feed using HyperDB#_writer, stores it in peers/<discovery key>, and pushes this to the list of known writers. We already came across HyperDB#_writer above: the cycle is repeated for the newly found writers - for every feed, we find the latest inflated message, look at the list of feeds in it, and add the new feeds as writers.

Once all that is done, Writer#_updateFeeds is called, which just sets up a bunch of state.

TL;DR

  • A HyperDB db is made of several hypercore feeds, each containing the changes from a single writer.
  • These feeds contain entries, which represent writes to the database, and inflated entries, which represent changes to the list of writers (feeds) in the db.
  • Every entry in the feed contains a pointer to the latest inflated entry, so that all of the writers in the db can be efficiently discovered.
  • When initialising the db or replicating changes, additional writers are discovered by looking up the list of feeds in the latest inflated entry in every known feed.

Discussion

If you've made it this far, you now know exactly how authorization currently works in hyperdb. HyperDB is a fantastic data structure which will most likely form the backbone of the forthcoming collaborative features in Dat.

One thing you may have noticed is that there is currently no way to revoke authorization, or to enforce any kind of access control on the values in the database, for example marking certain keys as read-only. Various solutions for this are currently being discussed in the community, and I hope that this guide can help spread some understanding of what currently exists, and how things could be improved.

Efforts are currently also under way to port the Dat ecosystem to different languages, for example Rust. There is still a lot of foundational work to be done before a Rust port of hyperdb can be tackled, but understanding the current logic should hopefully make the porting task much simpler.

Writing this guide massively improved my understanding of how hyperdb works - I hope that reading it improves yours!

About

A deep dive into how authorization works in hyperdb.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published