Routing Table Maintenance #45

Closed
4 tasks done
Tracked by #54 ...
iand opened this issue Jun 30, 2023 · 16 comments

@iand
Contributor

iand commented Jun 30, 2023

ETA: 2023-08-31

Design last updated: 2023-08-11
Updates to the design will be announced in a comment listing the changes.

Routing Table Maintenance

Routing table maintenance is concerned with gaining and preserving a high occupancy of useful nodes in a routing table.
A useful node is one that can be used reliably and with good performance for the kinds of queries made by the local node.

Maintenance consists of four operations:

  • Bootstrap - populates an empty routing table
  • Include - processes candidates for inclusion in the routing table
  • Probe - assesses usefulness of nodes in the routing table
  • Explore - improves occupancy of routing table buckets

These are discussed in more detail below.

A node value table is maintained which records metrics to assess the value or usefulness of each node.
The value metrics are used to compute a score for each node in the case where a decision is required on whether to drop a node from the routing table.
The computed score may involve other factors such as feature set or network locality in the future, but this is out of scope of this design.

Initially the following information is recorded in each value table entry:

  • the time the node was included in the routing table
  • the last time the node responded to a message (such as a connectivity check or query)
  • the number of times the node has been involved in a query
  • the number of times the node has supplied closer nodes for a query

Calculation of the score is TBD.
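
As an illustrative sketch only (the score formula is a placeholder and the type and field names are assumptions, not the final API), a value table entry recording the metrics above could look like:

```go
package maintenance

import "time"

// NodeValue records the per-node metrics listed above.
// Field names are illustrative, not the final API.
type NodeValue struct {
	AddedAt       time.Time // when the node was included in the routing table
	LastResponse  time.Time // last time the node responded to any message
	QueriesServed int       // number of queries the node has been involved in
	CloserReturns int       // number of times the node supplied closer nodes
}

// Score is a placeholder metric: the fraction of queries in which the node
// returned closer nodes, decayed by how long the node has been silent.
func (v NodeValue) Score(now time.Time) float64 {
	if v.QueriesServed == 0 {
		return 0
	}
	usefulness := float64(v.CloserReturns) / float64(v.QueriesServed)
	hoursSilent := now.Sub(v.LastResponse).Hours()
	return usefulness / (1 + hoursSilent)
}
```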

A candidate queue is maintained which contains nodes that are candidates for inclusion in the routing table.

A number of external events are monitored and changes applied to the routing table, candidate queue and value table:

  • When a node fails to respond to a message or times out, it is removed from the routing table and added to the candidate queue.
    The value table entry for the node is preserved.
  • When a node responds successfully to a query, its value entry is updated
  • When a node responds successfully to a query and provides closer nodes, its value entry is updated
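
As a rough sketch of how these events could be applied (the event names and the routing table, queue and value table interfaces below are assumptions, not the coordinator's real types):

```go
// Hypothetical event types for illustration only.
type EventNodeTimeout struct{ ID NodeID }

type EventNodeResponse struct {
	ID          NodeID
	CloserNodes []NodeID
}

// applyEvent updates the routing table, candidate queue and value table
// according to the rules listed above. All interfaces are assumed.
func applyEvent(ev any, rt RoutingTable, cq *CandidateQueue, vt *ValueTable, now time.Time) {
	switch e := ev.(type) {
	case EventNodeTimeout:
		rt.Remove(e.ID)
		cq.Enqueue(e.ID) // the node may be re-included later
		// note: the value table entry is deliberately preserved
	case EventNodeResponse:
		vt.RecordResponse(e.ID, now)
		if len(e.CloserNodes) > 0 {
			vt.RecordCloserNodes(e.ID, len(e.CloserNodes))
		}
	}
}
```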

Bootstrap

Bootstrap is an operation that is performed on startup of a node if its routing table is empty.
A Bootstrap can be followed by an explore operation to gain more occupancy in the routing table buckets.

A query is initiated for the local node's Kademlia key in the DHT.
This enables the node to discover its closest neighbours and gives the neighbours an opportunity to add the node to their own routing tables.
Discovered nodes are added to the candidate queue for inclusion in the routing table.

No other queries may be started on the local node until the bootstrap operation has completed.
This avoids overwhelming neighbouring nodes who would be contacted too frequently when the routing table is sparsely populated.

The bootstrap operation must be initiated with one or more seed nodes that can form the starting point of the query.
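
A minimal sketch of this flow, assuming a hypothetical runQuery helper and the candidate queue described above:

```go
// bootstrap queries the DHT for the local node's own Kademlia key, seeded
// with one or more known nodes, and enqueues every node it discovers as a
// candidate for inclusion. Helper names are illustrative.
func bootstrap(ctx context.Context, self NodeID, seeds []NodeID, cq *CandidateQueue) error {
	if len(seeds) == 0 {
		return errors.New("bootstrap requires at least one seed node")
	}
	// Querying for our own key lets us discover our closest neighbours and
	// gives them the opportunity to add us to their routing tables.
	discovered, err := runQuery(ctx, self.Key(), seeds)
	if err != nil {
		return err
	}
	for _, n := range discovered {
		cq.Enqueue(n) // inclusion still goes through the include operation
	}
	return nil
}
```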

Include

The include operation is a continually running background job that processes a queue of candidate nodes to be included in the routing table.
Newly discovered nodes are added to the candidate queue by other operations and external events.

The include operation proceeds by taking the first node from the candidate queue.
If the node is already included in the routing table, it is skipped, otherwise a connectivity check is performed.
By default this check will ask the node to respond to a get closest nodes request for its own Kademlia key.
The candidate node is dropped from the queue if it does not respond, or if its response does not include at least one node.

An attempt is then made to add the node to the routing table.
If a node is included in the routing table an entry is made for it in the node value table if it doesn't already have one.

The routing table may reject the addition due to capacity constraints or other criteria.
In this case the node is dropped from the queue.

Any node dropped from the candidate queue also has its node value table entry removed.
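
A hedged sketch of a single pass of this loop (the routing table, queue and check helpers are assumptions, not the real interfaces):

```go
// includeNext processes one candidate from the queue following the steps
// described above.
func includeNext(ctx context.Context, cq *CandidateQueue, rt RoutingTable, vt *ValueTable) {
	id, ok := cq.Dequeue()
	if !ok {
		return // queue is empty
	}
	if rt.Contains(id) {
		return // already in the routing table, skip
	}
	// Connectivity check: ask the node for the nodes closest to its own key.
	closer, err := findNode(ctx, id, id.Key())
	if err != nil || len(closer) == 0 {
		vt.Remove(id) // dropped candidates lose their value table entry
		return
	}
	if !rt.TryAdd(id) {
		vt.Remove(id) // rejected, e.g. the bucket is full
		return
	}
	vt.EnsureEntry(id, time.Now()) // create an entry if one doesn't exist
}
```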

Probe

The probe operation is a continually running background job that periodically assesses the usefulness of every node in the routing table.

The probe processes nodes in descending order of the time since each last responded to a message.
If the time since the last response is lower than a configurable threshold then the node and all following nodes are skipped.
A connectivity check is performed for the node, using the same mechanism as for the include operation.
If the check fails or reaches a timeout the node is removed from the routing table and added to the candidate queue.

If the bucket occupied by the node does not have high occupancy (< 90%), then the probe operation moves to the next node in sequence.

Otherwise, a score is calculated for the node based on the value metrics and compared to the score for all nodes in the bucket.
If the score is below a threshold value the node is removed from the routing table and added to the candidate queue.
This is intended to free up space for more useful nodes to be included in the routing table.
By default the threshold will be the score at the 10th percentile of all scores when ordered ascending.
The node may be returned to the routing table at a later time by the include operation.
The value table entry for a node moved to the candidate queue is preserved.
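
To illustrate the ordering, threshold and eviction rule, here is a sketch of one probe pass; the helper methods are assumed, and the 90% and 10th-percentile values are shown as constants only for readability:

```go
// probeOnce walks routing table members ordered by time since last response
// (longest-silent first) and applies the rules above. Helpers are assumed.
func probeOnce(ctx context.Context, rt RoutingTable, cq *CandidateQueue, vt *ValueTable, threshold time.Duration) {
	now := time.Now()
	for _, id := range vt.NodesByTimeSinceResponse() { // descending order
		if now.Sub(vt.LastResponse(id)) < threshold {
			return // this node and all that follow responded recently
		}
		if _, err := findNode(ctx, id, id.Key()); err != nil {
			rt.Remove(id)
			cq.Enqueue(id) // second chance via the include operation
			continue
		}
		bucket := rt.BucketFor(id)
		if bucket.Occupancy() < 0.9 {
			continue // bucket has spare capacity, no need to evict
		}
		// Evict nodes scoring in the lowest 10% of their bucket.
		if vt.Score(id, now) < bucket.ScoreAtPercentile(vt, 0.10) {
			rt.Remove(id)
			cq.Enqueue(id) // value table entry is preserved
		}
	}
}
```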

Explore

The explore operation is used to discover new nodes at various distances from the local node in order to improve the occupancy of routing table buckets.
It is run on a regular schedule.

For each bucket, a random key that would occupy the bucket is generated and a query is initiated to find the nodes closest to it.
Discovered nodes are added to the candidate queue for inclusion in the routing table.
During the course of the query, discovered nodes may end up in buckets other than the one being processed.

The explore operation processes buckets in order of distance from the local node and waits for query completion before proceeding to the next bucket.
A bucket with >=90% occupancy is skipped.
The operation stops when a query results in the bucket having <=10% occupancy.

The frequency of running an explore varies by bucket distance, such that closer buckets are processed more frequently.

A multi-level queue is used to manage the explore schedule.
The n closest buckets are assigned to the queue's first tier, 2n buckets are assigned to the second tier, 4n to the third tier and the final tier contains the remaining buckets.

At a regular interval, t, the explore operation is initiated for the first bucket in each tier, starting with the first tier and proceeding to the second, third and fourth as the previous is completed.
When each operation is complete the bucket is placed at the back of the queue for the relevant tier.
Thus each bucket in tier 1 will be explored at an interval of n × t.
Each bucket in tier 2 will be explored at an interval of 2n × t and those in tier 3 at an interval of 4n × t.
Buckets in tier 4 will be explored at an interval that depends on the key length of the DHT, which governs the maximum number of more distant buckets.

The time interval t should be chosen such that four explore queries can complete within the interval.

For example, with n=2 and t=5m the explore operation would be run every:

  • 10m for buckets 0-1
  • 20m for buckets 2-5
  • 40m for buckets 6-13
  • up to 21.5h for all remaining buckets (if the table were fully populated)
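
To make the tier arithmetic concrete, here is a sketch of the interval a bucket would get under this scheme (bucket 0 being the closest); the function is illustrative, not the implementation:

```go
// exploreInterval returns how often a bucket is explored under the
// multi-level queue described above. Buckets are indexed by closeness
// (bucket 0 is the closest), n is the size of the first tier, t the base
// interval and numBuckets the total number of buckets in the table.
func exploreInterval(bucket, n, numBuckets int, t time.Duration) time.Duration {
	switch {
	case bucket < n: // tier 1: the n closest buckets
		return time.Duration(n) * t
	case bucket < 3*n: // tier 2: the next 2n buckets
		return time.Duration(2*n) * t
	case bucket < 7*n: // tier 3: the next 4n buckets
		return time.Duration(4*n) * t
	default: // tier 4: all remaining buckets share one queue tier
		return time.Duration(numBuckets-7*n) * t
	}
}

// With n=2 and t=5m this yields 10m for buckets 0-1, 20m for 2-5,
// 40m for 6-13 and (numBuckets-14)*5m for the rest.
```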

@iand iand mentioned this issue Jun 30, 2023
@iand iand added kind/enhancement New feature or request scope/required Feature required to match go-libp2p-kad-dht labels Jun 30, 2023
@iand
Contributor Author

iand commented Jun 30, 2023

For reference: libp2p/go-libp2p-kad-dht#283 (comment)

@iand
Contributor Author

iand commented Aug 2, 2023

DRAFT DESIGN - still work in progress

Routing Table Maintenance

Routing table maintenance is concerned with gaining and preserving a high occupancy of useful nodes in a routing table.
A useful node is one that can reliably and performantly be used for the kinds of queries made by the local node.

Maintenance consists of four operations:

  • Bootstrap - populates an empty routing table
  • Include - processes candidates for inclusion in the routing table
  • Refresh - assesses usefulness of nodes in the routing table
  • Explore - improves occupancy of routing table buckets

These are discussed in more detail below.

A node value table is maintained which records metrics to assess the value or usefulness of each node.
The value metrics are used to compute a score for each node in the case where a decision is required on whether to drop a node from the routing table.
The computed score may involve other factors such as feature set or network locality in the future, but this is out of scope of this design.

Initially the following information is recorded in each value table entry:

  • the time the node was included in the routing table
  • the last time the node responded to a message (such as a connectivity check or query)
  • the number of times the node has been involved in a query
  • the number of times the node has supplied closer nodes for a query

Calculation of the score TBD.

A candidate queue is maintained which contains nodes that are candidates for inclusion in the routing table.

A number of external events are monitored and changes applied to the routing table, candidate queue and value table:

  • When a node fails to respond to a message or is timed out it is removed from the routing table and added to the candidate queue.
    The value table entry for the node is preserved.
  • When a node responds successfully to a query, its value entry is updated
  • When a node responds successfully to a query and provides closer nodes, its value entry is updated

Bootstrap

Bootstrap is an operation that is performed on startup of a node if its routing table is empty.
A Bootstrap can be followed by an explore operation to gain more occupancy in the routing table buckets.

A query is initiated for the local node's Kademlia key in the DHT.
This enables the node to discover its closest neighbours and gives the neighbours an opportunity to add the node to their own routing tables.
Discovered nodes are added to the candidate queue for inclusion in the routing table.

No other queries may be started on the local node until the bootstrap operation has completed.
This avoids overwhelming neighbouring nodes who would be contacted too frequently when the routing table is sparsely populated.

The node must have at least one node already included in its routing table for a bootstrap to be initiated.
This node must be added by some previous step such as a manual add or by peering with a dedicated bootstrapper node.

Include

The include operation is a continually running background job that processes a queue of candidate nodes to be included in the routing table.
Newly discovered nodes are added to the candidate queue by other operations and external events.

The include operation proceeds by taking the first node from the candidate queue.
If the node is already included in the routing table, it is skipped, otherwise a connectivity check is performed.
By default this check will ask the node to respond to a get closest nodes request for its own Kademlia key.
The candidate node is dropped from the queue if it does not respond or responds without including at least itself in the list of nodes.

An attempt is then made to add the node to the routing table.
If a node is included in the routing table an entry is made for it in the node value table if it doesn't already have one.

The routing table may reject the addition due to capacity constraints or other criteria.
In this case the node is dropped from the queue.

Any node dropped from the candidate queue also has its node value table entry removed.

Refresh

The refresh operation is a continually running background job that periodically assesses the usefulness of every node in the routing table.

The refresh processes nodes in descending order of the time since each last responded to a message.
If the time since the last response is lower than a configurable threshold then the node and all following nodes are skipped.
A connectivity check is performed for the node, using the same mechanism as for the include operation.
If the check fails or reaches a timeout the node is removed from the routing table and added to the candidate queue.

If the bucket occupied by the node does not have high occupancy (< 90%), then the refresh operation moves to the next node in sequence.

Otherwise, a score is calculated for the node based on the value metrics and compared to the score for all nodes in the bucket.
If the score is below a threshold value the node is removed from the routing table and added to the candidate queue.
This is intended to free up space for more useful nodes to be included in the routing table.
By default the threshold will be the score at the 10th percentile of all scores when ordered ascending.
The node may be returned to the routing table at a later time by the include operation.
The value table entry for a node moved to the candidate queue is preserved.

Explore

The explore operation is used to discover new nodes at various distances from the local node in order to improve the occupancy of routing table buckets.
It is run on a regular schedule.

For each bucket the a random key is generated that would occupy the bucket and a query initiated to find the nodes close to it.
Discovered nodes are added to the candidate queue for inclusion in the routing table.
During the course of the query discovered nodes may result in being included in buckets other than the one being processed.

The explore operation processes buckets in order of distance from the local node and waits for query completion before proceeding to the next bucket.
A bucket with >=90% occupancy is skipped.
The operation stops when a query results in the bucket having <=10% occupancy.

The frequency of running an explore varies by bucket distance, such that closer buckets are processed more frequently.

A multi-level queue is used to manage the explore schedule.
The n closest buckets are assigned to the queue's first tier, 2n buckets are assigned to the second tier, 4n to the third tier and the final tier contains the remaining buckets.

At a regular interval, t, the explore operation is initiated for the first bucket in each tier, starting with the first tier and proceeding to the second, third and fourth as the previous is completed.
When each operation is complete the bucket is placed at the back of the queue for the relevant tier.
Thus each bucket in tier 1 will be explored at an interval of n x t.
Each bucket in tier 2 will be explored at an interval of 2n x t and those in tier 3 at an interval of 4n x t.
Buckets in tier 4 will be explored at an interval dependent on the key length of the dht, which governs the maximum number of more distant buckets.

The time interval t should be chosen such that four explore queries can complete within the interval.

For example, with n=2 and t=5m the explore operation would be run every:

  • 10m for buckets 0-1
  • 20m for buckets 2-5
  • 40m for buckets 6-13
  • up to 21.5h for all remaining buckets (if the table were fully populated)

@dennis-tra
Contributor

dennis-tra commented Aug 8, 2023

Great write-up! I especially like the clear structure of bootstrap, include, refresh, explore! A few comments:

  1. What about calling the phases: bootstrap, include, probe, refresh?
  2. Should the value table also incorporate information from how the routing table is currently populated? I'm thinking of a lower value score if a peer wouldn't add to a higher diversity in a specific bucket.
  3. Other external events could be peer connect/disconnect events. A peer could become a candidate if e.g. it was found through another subsystem, e.g., gossip-sub.
  4. A concrete proposal for the value metric would be interesting to discuss!

The frequency of running an explore varies by bucket distance, such that closer buckets are processed more frequently.

That makes sense but is probably hard to do in practice. How would you calculate the frequency? Hard code? This is dependent on network size, right?

  1. Also, in your derivation further below buckets 0-1 are explored way more frequently than e.g. 6-13. In this case, farther buckets are processed more frequently.
  2. What are we trying to achieve with this tiered exploration? Minimize networking overhead? I think there are good reasons for keeping high and low buckets up-to-date.

@iand
Contributor Author

iand commented Aug 8, 2023

  1. What about calling the phases: bootstrap, include, probe, refresh?

Do you mean rename refresh as probe? I think that would be clearer. The operations could be bootstrap, include, probe and explore.

  1. Should the value table also incorporate information from how the routing table is currently populated? I'm thinking of a lower value score if a peer wouldn't add to a higher diversity in a specific bucket.

Yes. Initially the scoring will be fairly simplistic based on numbers of queries. I included the sentence "The computed score may involve other factors such as feature set or network locality in the future, but this is out of scope of this design." to cover future expansion. Should I improve that?

  1. Other external events could be peer connect/disconnect events. A peer could become a candidate if e.g. it was found through another subsystem, e.g., gossip-sub.

Yes. libp2p notifiee is another mechanism. I can add these as explicit examples.

The frequency of running an explore varies by bucket distance, such that closer buckets are processed more frequently.

That makes sense but is probably hard to do in practice. How would you calculate the frequency? Hard code? This is dependent on network size, right?
6. Also, in your derivation further below buckets 0-1 are explored way more frequently than e.g. 6-13. In this case, farther buckets are processed more frequently.
7. What are we trying to achieve with this tiered exploration? Minimize networking overhead? I think there are good reasons for keeping high and low buckets up-to-date.

The last section was intended to describe this. I'll try and make it clearer. The multi-level queue is intended to determine the frequency and to bias toward exploring closer buckets more frequently. The explore query is likely to discover nodes further out before finding closer ones and these further nodes will be added to the candidate queue as normal, so they will likely end up expanding the further buckets. The algorithm was informed a bit by this issue: libp2p/go-libp2p-kad-dht#612

@dennis-tra
Contributor

Do you mean rename refresh as probe? I think that would be clearer. The operations could be bootstrap, include, probe and explore.

👍

Yes. Initially the scoring will be fairly simplistic based on numbers of queries. I included the sentence "The computed score may involve other factors such as feature set or network locality in the future, but this is out of scope of this design." to cover future expansion. Should I improve that?

Ah, of course, I read that sentence but forgot about it when I returned to write the comment. So all good 👍

The multi-level queue is intended to determine the frequency and to bias toward exploring closer buckets more frequently.

Maybe I'm reading it wrong but "the explore operation would be run every 10m for buckets 0-1 [...] 40m for buckets 6-13". This means that closer buckets (6-13) would be explored less frequently.

@iand
Contributor Author

iand commented Aug 8, 2023

Maybe I'm reading it wrong but "the explore operation would be run every 10m for buckets 0-1 [...] 40m for buckets 6-13". This means that closer buckets (6-13) would be explored less frequently.

This is just my numbering. The buckets are ordered by closeness. The n closest buckets are assigned to the queue's first tier so buckets 0-1 here mean the first 2 buckets in the order.

@iand iand changed the title Routing Table Refresh Routing Table Maintenance Aug 11, 2023
@iand
Contributor Author

iand commented Aug 11, 2023

Updated the design based on feedback and moved to description of issue.

Changes:

  • renamed Refresh to Probe

@guillaumemichel
Contributor

Great write-up @iand ! I especially enjoy the Probe and Explore sections!

I have a couple of questions.

A node value table is maintained which records metrics to assess the value or usefulness of each node.

What kind of data structure is the node value table? Is it distinct from the routing table, or additional information to be included in the routing table?

Bootstrap

No other queries may be started on the local node until the bootstrap operation has completed.

Bootstrap shouldn't have the priority over real operations.

Bootstrap is an operation that is performed on startup of a node if its routing table is empty.

The node must have at least one node already included in its routing table for a bootstrap to be initiated.

nit: routing table is empty, but one peer is required :p

Include

The include operation is a continually running background job that processes a queue of candidate nodes to be included in the routing table.
Newly discovered nodes are added to the candidate queue by other operations and external events.

What are the benefits of having the include operation run as a background job vs external events directly triggering routing table add?

The candidate node is dropped from the queue if it does not respond or responds without including at least itself in the list of nodes.

In this case, the remote peer should return only itself in the list of remote peers (at least in go-libp2p-kad-dht). I am not sure whether there is a spec specifying whether a node should include itself in a lookup response or not. In theory it is useless for a peer to return its own address to the requester, because the requester already knows it. Hence I would say that if the response of the peer contains at least one peer (not necessarily itself) it should be fine.

Probe

If the check fails or reaches a timeout the node is removed from the routing table and added to the candidate queue.

Why do we want to add the timed out node to the candidate queue? To give it a second chance slightly later?

If the score is below a threshold value the node is removed from the routing table and added to the candidate queue.

Sounds good as a future step, but I think that we are still very far from it.

Another way to perform the probe operation would be to schedule a probe for each node individually (e.g using the scheduler). When we receive a message from a node (initial message or refresh), we delete any planned probe for this node, and schedule a new probe after a specific duration. This would allow to smooth the Probe/refresh process and avoid having many peers probed at the same time.
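
For illustration, a minimal sketch of that per-node scheduling (type and method names are hypothetical):

```go
// probeSchedule keeps one planned probe per node; receiving any message
// from a node pushes its probe further into the future, which spreads the
// checks out over time.
type probeSchedule struct {
	gap  time.Duration        // how long after the last contact to probe
	next map[NodeID]time.Time // node -> time of the next planned probe
}

func (s *probeSchedule) MessageReceived(id NodeID, now time.Time) {
	s.next[id] = now.Add(s.gap) // replaces any previously planned probe
}

// Due returns the nodes whose planned probe time has passed.
func (s *probeSchedule) Due(now time.Time) []NodeID {
	var due []NodeID
	for id, at := range s.next {
		if !at.After(now) {
			due = append(due, id)
		}
	}
	return due
}
```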

Explore

For each bucket the a random key is generated that would occupy the bucket and a query initiated to find the nodes close to it.

typo? (the a)

The frequency of running an explore varies by bucket distance, such that closer buckets are processed more frequently.

YES!

Buckets in tier 4 will be explored at an interval dependent on the key length of the dht, which governs the maximum number of more distant buckets.

up to 21.5h for all remaining buckets (if the table were fully populated)

I don't understand the reasoning for the above.

Comments

  1. What are we trying to achieve with this tiered exploration? Minimize networking overhead? I think there are good reasons for keeping high and low buckets up-to-date.

I think a motivation to refresh less far buckets is that we learn many peers in the buckets that are far away when making lookups. It is critical for routing soundness to always keep track of the closest peers, it is less critical for peers that are further away (as long as we know some). We are less likely to propagate information about dead peers in the buckets that are far away, because statistically we receive more requests for close buckets. Moreover there are more candidates for far away buckets, hence the peers in these buckets will eventually be very stable, so we need to refresh less. And the main motivation is to reduce traffic I guess.

@iand
Contributor Author

iand commented Aug 11, 2023

Great write-up @iand ! I especially enjoy the Probe and Explore sections!

I have a couple of questions.

A node value table is maintained which records metrics to assess the value or usefulness of each node.

What kind of data structure is the node value table? Is it distinct from the routing table, or additional information to be included in the routing table?

It's a separate structure, independent of the routing table.

Bootstrap

No other queries may be started on the local node until the bootstrap operation has completed.

Bootstrap shouldn't have the priority over real operations.

This is in order to mitigate libp2p/go-libp2p-kad-dht#652 though that attack is only for inbound requests resulting in the remote peer being added to the table. I can relax that constraint and impose the restriction on ad-hoc additions of nodes. Related: #23

Bootstrap is an operation that is performed on startup of a node if its routing table is empty.

The node must have at least one node already included in its routing table for a bootstrap to be initiated.

nit: routing table is empty, but one peer is required :p

Very true lol. I will reword to use a set of seed peers. A bootstrap could be requested on a non-empty table too, but assuming the node has just started from fresh it will need seed peer(s).

Include

The include operation is a continually running background job that processes a queue of candidate nodes to be included in the routing table.
Newly discovered nodes are added to the candidate queue by other operations and external events.

What are the benefits of having the include operation run as a background job vs external events directly triggering routing table add?

This allows us to bound the work being performed. We're not in control of external events so we would have to throttle them if we want to ensure that we have capacity for other work. In particular we want to be able to bound resources such as memory and network connections.

It also makes the system more deterministic by preserving sequential execution.

The candidate node is dropped from the queue if it does not respond or responds without including at least itself in the list of nodes.

In this case, the remote peer should return only itself in the list of remote peers (at least in go-libp2p-kad-dht). I am not sure whether there is a spec specifying whether a node should include itself in a lookup response or not. In theory it is useless for a peer to return its own address to the requester, because the requester already knows it. Hence I would say that if the response of the peer contains at least one peer (not necessarily itself) it should be fine.

That's what I assumed would be the case and then I looked at the implementation in lookupCheck and got misled by the comments. I will update to require at least one peer.

Probe

If the check fails or reaches a timeout the node is removed from the routing table and added to the candidate queue.

Why do we want to add the timed out node to the candidate queue? To give it a second chance slightly later?

Yes. We can't tell the nature of the failure or timeout. It could be that we lost connectivity locally or we (or the remote) are temporarily out of resources and timed out the request. This gives nodes a second chance, but removes them from the routing table in the meantime on the assumption that they are permanent failures.

If the score is below a threshold value the node is removed from the routing table and added to the candidate queue.

Sounds good as a future step, but I think that we are still very far from it.

We can do a simple version of this based on numbers of useful responses given in queries. This scoring can be extended at a later time to include other factors.

Another way to perform the probe operation would be to schedule a probe for each node individually (e.g using the scheduler). When we receive a message from a node (initial message or refresh), we delete any planned probe for this node, and schedule a new probe after a specific duration. This would allow to smooth the Probe/refresh process and avoid having many peers probed at the same time.

I would expect to bound the number of probes in flight at any one time and allow that to be configurable. The query pool implementation bounds the number of in flight queries and requests in a very simple way and a similar approach can be used for probe requests.
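
For illustration, bounding in-flight connectivity checks can be as simple as a buffered-channel semaphore (a sketch, not the query pool's actual mechanism; NodeID and the check function are assumed):

```go
// checkAll runs a connectivity check for each node with at most maxInFlight
// checks running concurrently.
func checkAll(ctx context.Context, ids []NodeID, maxInFlight int,
	check func(context.Context, NodeID) error) {
	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	for _, id := range ids {
		sem <- struct{}{} // blocks once maxInFlight checks are running
		wg.Add(1)
		go func(id NodeID) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = check(ctx, id) // failure handling omitted in this sketch
		}(id)
	}
	wg.Wait()
}
```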

Explore

For each bucket the a random key is generated that would occupy the bucket and a query initiated to find the nodes close to it.

typo? (the a)

👍

The frequency of running an explore varies by bucket distance, such that closer buckets are processed more frequently.

YES!

Buckets in tier 4 will be explored at an interval dependent on the key length of the dht, which governs the maximum number of more distant buckets.

up to 21.5h for all remaining buckets (if the table were fully populated)

I don't understand the reasoning for the above.

The bottom tier will have all remaining buckets 14 and above. If we had a table with buckets for every CPL then that could be 241 buckets. 240x5m = 20hrs which would be the interval at which those buckets are explored. So in this example every bucket in the table is explored at least once a day and the closest ones are explored every 10 or 20 minutes.

Comments

  1. What are we trying to achieve with this tiered exploration? Minimize networking overhead? I think there are good reasons for keeping high and low buckets up-to-date.

I think a motivation to refresh less far buckets is that we learn many peers in the buckets that are far away when making lookups. It is critical for routing soundness to always keep track of the closest peers, it is less critical for peers that are further away (as long as we know some). We are less likely to propagate information about dead peers in the buckets that are far away, because statistically we receive more requests for close buckets. Moreover there are more candidates for far away buckets, hence the peers in these buckets will eventually be very stable, so we need to refresh less. And the main motivation is to reduce traffic I guess.

The tiering is intended to achieve this, so the closest buckets are explored more frequently. The expectation is that this exploration will fill further buckets in the course of the query.

The tiering is a simple mechanism to schedule explores on a continuous cycle. Rather than using a different schedule for each bucket we simply cycle through the tiers, exploring the first bucket in each tier. This ensures that each bucket in the smaller tiers get explored more frequently but buckets assigned to the bottom, largest tier, get explored less frequently.

However the explore process is predictable and bounded: always exploring 4 buckets every cycle.

@iand
Contributor Author

iand commented Aug 11, 2023

@guillaumemichel

No other queries may be started on the local node until the bootstrap operation has completed.

Bootstrap shouldn't have the priority over real operations.

Looking again, the reason I gave is that when the routing table is nearly empty then every outbound query will use the same few seed nodes. This could potentially overwhelm dedicated bootstrap nodes since they will overwhelmingly be in the selected set of initial peers at the start. I recall seeing an issue or a comment for this in one of the repos but I can't find it now.

@iand
Contributor Author

iand commented Aug 11, 2023

Design updated

  • fixed some typos
  • connectivity check requires at least one node to be returned, not necessarily itself
  • note that the bootstrap operation requires one or more seed nodes

@guillaumemichel
Contributor

Thanks for the detailed explanation, it makes a lot of sense!

If the score is below a threshold value the node is removed from the routing table and added to the candidate queue.

Sounds good as a future step, but I think that we are still very far from it.

We can do a simple version of this based on numbers of useful responses given in queries. This scoring can be extended at a later time to include other factors.

I would say that until we have a good reason to use a scoring function, we don't need to evict reachable peers from the routing table. It adds (small) additional load and complexity on the client, and artificial churn to the network. It is good to keep this option for later, I'd love to have proven benefits to balance the side effects.

Buckets in tier 4 will be explored at an interval dependent on the key length of the dht, which governs the maximum number of more distant buckets.

up to 21.5h for all remaining buckets (if the table were fully populated)

I don't understand the reasoning for the above.

The bottom tier will have all remaining buckets 14 and above. If we had a table with buckets for every CPL then that could be 241 buckets. 240x5m = 20hrs which would be the interval at which those buckets are explored. So in this example every bucket in the table is explored at least once a day and the closest ones are explored every 10 or 20 minutes.

Currently we have approx. 15 non-empty buckets (maybe even less in practice). When indexing them by CPL (furthest bucket is 0, and closest is 14), we would get the following tiered architecture:

  • Tier 1: buckets 13-14
  • Tier 2: buckets 9-12
  • Tier 3: buckets 1-8
  • Tier 4: bucket 0

So the refresh interval numbers are certainly different.

Comments

  1. What are we trying to achieve with this tiered exploration? Minimize networking overhead? I think there are good reasons for keeping high and low buckets up-to-date.

I think a motivation to refresh less far buckets is that we learn many peers in the buckets that are far away when making lookups. It is critical for routing soundness to always keep track of the closest peers, it is less critical for peers that are further away (as long as we know some). We are less likely to propagate information about dead peers in the buckets that are far away, because statistically we receive more requests for close buckets. Moreover there are more candidates for far away buckets, hence the peers in these buckets will eventually be very stable, so we need to refresh less. And the main motivation is to reduce traffic I guess.

The tiering is intended to achieve this, so the closest buckets are explored more frequently. The expectation is that this exploration will fill further buckets in the course of the query

The tiering is a simple mechanism to schedule explores on a continuous cycle. Rather than using a different schedule for each bucket we simply cycle through the tiers, exploring the first bucket in each tier. This ensures that each bucket in the smaller tiers get explored more frequently but buckets assigned to the bottom, largest tier, get explored less frequently.

However the explore process is predictable and bounded: always exploring 4 buckets every cycle.

If the goal is to explore 4 buckets every cycle, I would suggest a probabilistic approach rather than a discrete tiered approach. The goal is to have a refresh interval directly depending on the bucket ID, and not the Tier (which is more arbitrary, the size of each Tier is just another magic number). The probability of a bucket being refreshed at a cycle would depend on the bucket ID and the number of buckets explored per cycle.

A bucket with >=90% occupancy is skipped.

Buckets with low ID (far) will certainly always be skipped anyways, because there are always tons of candidates in the queue.

The operation stops when a query results in the bucket having <=10% occupancy.

I don't understand the meaning of this.

For each bucket a random key is generated that would occupy the bucket and a query initiated to find the nodes close to it.
Discovered nodes are added to the candidate queue for inclusion in the routing table.

The issue with this technique is that we would have dense spots within a large space. E.g if bucket 0 (far) is empty and it is selected for exploration, 1 (random) key will be looked up and the 20 closest peers to this key will be added to the bucket/queue. But these 20 peers will share a long CPL and will not be uniformly spread in the space of bucket 0. But maybe it isn't such an issue, because in order to reach the 20 closest peers other peers will be discovered along the way, and added first to the bucket.

e.g. when looking up a random key in bucket 0, say 0000'0000, we may hit at first 0100'0000, then 0001'0000, then 0000'0100 and finally 0000'0000. So we have 3 peers (0001'0000, 0000'0100 and 0000'0000) in just 1/8 of the bucket space, and nothing in the rest of the bucket. Limiting the number of peers added to the bucket/queue for each lookup could help, but I don't have any better solution for now.

In the current implementation, we observed that the buckets are very balanced. It is probably because the only selection criteria for the routing table is seniority. Hence, when looking up for a random key in a bucket missing X peers, not all discovered peers are taken into account but only the first X, that are expected to be more uniformly distributed. And in practice, a random key will be looked up in bucket that is full (far, 0-9) when 1 spot will be free to take so only 1 peer will be discovered (uniform selection). And in non-full buckets (close, 10-14) looking up the 20 closest peers will certainly return (almost) all peers belonging to this bucket, so they are expected to be uniformly distributed in the bucket space.

No other queries may be started on the local node until the bootstrap operation has completed.

Bootstrap shouldn't have the priority over real operations.

Looking again, the reason I gave is that when the routing table is nearly empty then every outbound query will use the same few seed nodes. This could potentially overwhelm dedicated bootstrap nodes since they will overwhelmingly be in the selected set of initial peers at the start. I recall seeing an issue or a comment for this in one of the repos but I can't find it now.

A bootstrap is triggered when the routing table doesn't contain enough peers. Concerning the initial bootstrap, I agree that the user should wait for the node to be well connected before sending queries. However, if the node has been running for some time, and the number of peers drops below a specific threshold, the goal of bootstrap is to repopulate the routing table. I would argue that if the system has the capacity to perform multiple concurrent lookups, then it should allow user lookups because they help populate the routing table, and it is useful traffic. As long as the routing table contains more than just the bootstrap nodes, it should be able to use the newly discovered peers to discover more peers, no need to wait for the lookup for self to be complete.

@dennis-tra
Contributor

dennis-tra commented Aug 14, 2023

Just came across another bootstrapping issue, so just leaving this here: libp2p/go-libp2p-kad-dht#387

Also more context on reasons why bootstrap peers might change over time: libp2p/go-libp2p-kad-dht#716

@iand
Contributor Author

iand commented Aug 14, 2023

If the goal is to explore 4 buckets every cycle, I would suggest a probabilistic approach rather than a discrete tiered approach. The goal is to have a refresh interval directly depending on the bucket ID, and not the Tier (which is more arbitrary, the size of each Tier is just another magic number). The probability of a bucket being refreshed at a cycle would depend on the bucket ID and the number of buckets explored per cycle.

4 per cycle is arbitrary. The goal is to bound the work so we limit use of concurrent connections and also to ensure that all buckets are explored within a timely interval (I used 24hrs in the example) with closer buckets being explored more frequently.

We could implement this with a Zipf probability distribution so there is a descending chance of being picked based on distance (low distance being given a higher probability) but that still requires some magic numbers to tune the distribution to visit the full set of buckets within a specific interval and to skew the distribution so low buckets are explored more frequently.

@guillaumemichel
Contributor

We could implement this with a Zipf probability distribution so there is a descending chance of being picked based on distance (low distance being given a higher probability) but that still requires some magic numbers to tune the distribution to visit the full set of buckets within a specific interval and to skew the distribution so low buckets are explored more frequently.

This sounds good! If we want to have a guarantee that all buckets are explored every so often, we could also determine a static order, following Zipf distribution in which the buckets are explored. This static order could be different for each host, and could evolve over time depending on the number of populated buckets. The buckets would always be explored in the same order, every so often.

An example of static ordering could be as follows. The closest bucket has id 14 and the furthest bucket has id 0. $interval$ is the interval between any 2 explorations, $period$ is the maximal time period in which all buckets must have been refreshed at least once.

Time                         Explored buckets
$t_0$                        14, 13, 12, 11
$t_0 + interval$             14, 13, 10, 9
$t_0 + 2 \times interval$    14, 12, 8, 7
$t_0 + 3 \times interval$    14, 13, 11, 6
...                          ...
$t_0 + period$               14, 13, 12, 11
$t_0 + period + interval$    14, 13, 10, 9
...                          ...

The static order would be computed by the host depending on $interval$, $period$, the number of populated buckets, the number of buckets explored every $interval$, some weight metric (e.g. bucket i must be explored x times as often as bucket i+1), and possibly a random seed.
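
One hedged way to derive such a static order is a weighted round-robin over bucket IDs, where closer buckets accumulate "credit" faster and are therefore picked more often; all names and weights below are illustrative:

```go
// buildSchedule returns, for each cycle, the k bucket IDs to explore.
// Bucket numBuckets-1 is the closest (as in the table above). Closer
// buckets gain credit faster and are picked more often, but every bucket
// is eventually picked. Uses only the standard sort package.
func buildSchedule(numBuckets, k, cycles int) [][]int {
	credit := make([]float64, numBuckets)
	schedule := make([][]int, 0, cycles)
	for c := 0; c < cycles; c++ {
		for b := 0; b < numBuckets; b++ {
			// Zipf-like weight: the closest bucket gains 1 per cycle,
			// the next 1/2, then 1/3, and so on.
			credit[b] += 1.0 / float64(numBuckets-b)
		}
		picks := topK(credit, k)
		for _, b := range picks {
			credit[b] = 0 // picked buckets wait their turn again
		}
		schedule = append(schedule, picks)
	}
	return schedule
}

// topK returns the indices of the k largest credit values.
func topK(credit []float64, k int) []int {
	idx := make([]int, len(credit))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(i, j int) bool { return credit[idx[i]] > credit[idx[j]] })
	if k > len(idx) {
		k = len(idx)
	}
	return append([]int(nil), idx[:k]...)
}
```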

@iand iand mentioned this issue Aug 16, 2023
iand added a commit that referenced this issue Aug 16, 2023
This adds a state machine for running the bootstrap query described in
#45. The state machine
is very simple (it runs a query that attempts to find the self node),
but it drives the design of the coordinator in a number of ways:

- The coordinator now manages two state machines: bootstrap and user
queries. It enforces the constraint that no user queries can be
progressed while the bootstrap is running. This establishes the pattern
for managing a set of state machines.
- Priority is simple: the coordinator first attempts to advance the
bootstrap state machine and only if it is idle, indicating the bootstrap
has no further work, will it proceed to advance the query pool state
machine.
- This changes the design of the state machine. Previously the state
machine reacted to an incoming event passed to the `Advance` method.
However this causes complications in the presence of multiple state
machines. What should happen if the bootstrap is waiting for a message
response but the caller attempts to start a user query? The coordinator
needs to remember the "add query" event until the bootstrap is complete,
so that event would need to remain on the coordinator's queue. But the
coordinator needs to read from the queue to detect if an incoming
message response is available for the bootstrap, without removing the
"add query" event. Thus we need separate queues for the two state
machines. Rather than manage those in the coordinator, we give each
state machine its own event queue. External callers enqueue events and
the state machine dequeues the next one each time it attempts to advance
state.
- ~The above change leads to a simple interface for state machines: an
Enqueue method for notifying a new event and an Advance method that
returns the next state.~
- **update 2023-08-15**: The above change leads to a simple interface
for state machines: an Advance method that accepts an event and returns
the next state.
- Coordinator methods like StartQuery and StopQuery now enqueue an event
for query pool
 - A new Bootstrap method enqueues an event for bootstrap state machine
- **update 2023-08-15**: the queues for the state machines are managed
by the coordinator, which allows state machines to be more cleanly
composed into hierarchies (for example, the state machine managing the
routing table include queue will use a query pool state machine and this
change eliminates the need to manage event queues of child state
machines)

There are still some ugly parts which I may be able to fix within the
scope of this PR:
- the coordinator implements a number of unused methods to conform to
the scheduler.Scheduler interface. All that is needed is the RunOne
method.
- ~the name of the bootstrap query needs to be factored into a constant
or remembered by the coordinator~ - coordinator now uses a separate
callback to deal with bootstrap query instead of checking query id
- ~events are action.Action interfaces so they can use the standard
queue interface. The Run method is unused. The queue could simply be a
channel or we could modify the queue interface to be parameterised by
type, allowing us to have a queue of BootstrapEvents~ (**removed
2023-08-15**)
- currently the bootstrap method expects a function that generates a
FindNode request for the given node. FindNode is such a fundamental DHT
operation that I think it should be provided as a method by the Endpoint


Fixes #47
iand added a commit that referenced this issue Aug 16, 2023
Based on sm-bootstrap branch (#90)

This adds a state machine for running the include process described in
#45. The state machine
manages a queue of candidate nodes and processes them by checking
whether they respond to a find node request. Candidates that respond
with one or more closer nodes are considered live and added to the
routing table. Nodes that do not respond or do not provide any suggested
closer nodes are dropped from the queue. The number of concurrent checks
in flight is configurable.

Not done yet:

- [ ] check timeouts
- [ ] removing nodes failing checks from routing table
- [ ] notifying of unroutable nodes
iand added a commit that referenced this issue Sep 1, 2023
This adds the initial state machine for running regular connectivity
checks against nodes in the routing table. Scoring of useful nodes is
not included and is to be added in a later change.

Part of #45
@iand iand closed this as completed Sep 25, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in DHT Optimization Sep 25, 2023