Data Availability Sampling (DAS) on a Discovery-v5 DHT overlay

Implementing Data Availability Sampling (DAS)

There's a lot of history to unpack here. Vitalik posted about the "Endgame": where ethereum could be headed, assuming MEV is fundamental. Scaling data-availability is one of the fundamental endgame requirements.

DAS is not new though, there is a lot of background.

A brief summary:

So "wen data-availability sampling?" Well, we need to solve the network layer still. This is a new proposal how it could work.

Prototypes

The DAS-in-practice doc made a start, and a call with eth2 implementers was hosted in October 2020 to talk about implementation. Then on Dec 1, 2020 the beacon chain launched, followed by the holidays: no implementer time for bleeding-edge work at the end of 2020.

There were some educated guesses for an initial design direction though:

  • Random sampling results in high subscription churn in GossipSub.
  • Sampling needs to be very fast, DHTs are slow, right?
  • GossipSub can handle attestation load, surely we can handle sampling if we avoid the churn?

And so the idea of a push-variant started: everyone rotates random subscriptions to sample-specific GossipSub topics slowly, to avoid churn. Then seed the subnets via the shard topic (horizontal), and propagate samples to subscribers (vertical).

I was too optimistic, and started putting together a libp2p simulation of this GossipSub approach: eth2-das.

The networking approach seemed ok-ish, but doubts started to arise around the altruism:

  • How do we serve historical shard-data?
  • How do we make sure sampling works beyond just the most recent data?
  • If you propagate on 1000+ different topics, and the network is random and peer fan-out is low, don't we eat a lot of undesired bandwidth?

With peer-scoring and banning, and tiny tasks frequent enough for the scores to make sense, it may have some chance of working. Still, if the networking is too heavy, altruism plus peer-scoring is not going to save it. The prototype was then abandoned.

KZG libs

Enough other problems to solve for implementers though, like using actual KZG proofs in prototypes.

So in January I ported over the Python KZG10/FK20 code to optimized Go code: Go-KZG. We formed a small working-group, and more ports, based on the Python and Go code, were created:

  • C-KZG, using BLST, by Ben Edgington (Teku team). FK20, based on the optimized Go code, rewritten in C and hooked up to the fastest BLS library. Amazing 👏
  • JC-KZG, java bindings to C-KZG, developed by Anton Nashatyrev (Teku team).
  • Constantine (research code), KZG FFTs and proofs ported to Nim by Mamy Ratsimbazafy (Nimbus team).
  • KZG in rust, Rust port by students of Saulius Grigaitis / SIFRAI-tech (Grandine team).

Revisiting the DAS network problem

The theory and the crypto libs are there, and the Merge is almost here (Catalyst, Rayonism, Amphora, Kintsugi, Kiln); now back to the networking problem.

The protocol has two independent identity layers, for privacy and redundancy (everyone can set up their validators however and wherever they like, and act however they like):

  • consensus layer (validator entities signing stuff)
  • network layer (nodes hosting validators)

To the consensus layer, DAS only needs to be accurate. The data that is available must not be ignored, or we get an unhealthy chain.

The network layer serves the above, requiring:

  • Fast sample responses (keep accuracy for best case)
  • Resilience (keep accuracy for worst case, e.g. DoS attacks and maintain randomness for security)
  • Low resources (keep accuracy for lazy case, no excuses not to run it)

Also note that data-availability only requires the data to be published unconditionally to anyone, not persisted forever (incentivized honest nodes can keep storing it, as long as they can keep getting it).

Some ideas to create these properties:

  • Reduce the number of hops, for fast sample queries and to save resources
  • No small meshes as requester, to preserve sample randomness
  • Distribute the samples as much as possible; uniform random sampling will load-balance the work
  • Distributing helps offload privacy/DoS risk from data publishers
  • Reduce the overhead per sample as much as possible, to save resources
  • Rotate out data slowly enough to serve queries, but fast enough to keep only the relevant data around

Does this sound familiar? Discv5 solves a very similar problem!

What discv5 offers:

  • The node tables can be large (16 × 16 = 256 records, plus a replacements queue), and the network is not that much larger (rough estimate: 10k validating nodes, roughly 40 times the table size). The number of hops will be low if any kind of DHT search works even a little bit.
  • Only UDP communication: resource-efficient enough to keep in touch with that many peers.
  • Information distribution is similar: every record has a tiny signature to authenticate the data. Samples have a tiny proof to verify.
  • Just like sending a record in a single packet, send a sample with a single packet.
  • Discovery so far is used to retrieve records at random locations, and we do not need to advertise locations: they can be derived deterministically from the sample hash
  • Built to scale horizontally, and scale through participation

Caveat: discv5 is bigger than just ethereum discovery; not all nodes may want to answer sample requests. We'll need some type of overlay of ethereum-only discv5 nodes.

Extending Discv5

But how does this fit into discv5?

Sampling

Discv5 offers TALKREQ/TALKRESP to extend the protocol with application-layer messaging. Not all discv5 implementations support this yet, but it is part of the spec and we can start with it in Go.

Although TALKREQ/TALKRESP is intended for initial handshakes / negotiation of extensions, the communication in this protocol is infrequent and simple enough that TALKREQ/TALKRESP should work fine (open to suggestions, however).

The primary roles in the DHT are:

  1. Seed the DHT with sample bundles as publisher
  2. Fan-out the bundles as relayer
  3. Query the DHT for samples as searcher
  4. Respond to queries as host

And the samples all get hashed to an ID in the DHT: H(fork_digest, randao_mix, data_id, sample_index) -> node_id:

  • fork_digest (f) - Bytes4, fork digest to separate networks
  • randao_mix (r) - Bytes32, is randomness picked based on the sample time: this prevents mining of node identities close to a specific sample far in the future
  • data_id (x) - Bytes48, is the KZG commitment of the data: samples are unique to the data they are part of
  • sample_index (i) - uint64 (little endian), identifies the sample

The fork_digest and randao_mix can be determined based on time (slot t), and are not communicated or stored in the sample protocol.

Samples are stored in a key-value store:

  • Key: (t, x, i) (f and r are determined from slot t)
  • Value: sample data points and proof
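As a sketch (in Go, assuming SHA-256 as H and 32-byte node IDs; the exact hash is not fixed here), the sample-to-node-ID mapping and the local store key could look like:

```go
package das

import (
	"crypto/sha256"
	"encoding/binary"
)

// SampleID computes H(fork_digest, randao_mix, data_id, sample_index) -> node_id.
// SHA-256 is an assumption for H; any hash producing node-ID-sized output works.
func SampleID(forkDigest [4]byte, randaoMix [32]byte, dataID [48]byte, sampleIndex uint64) [32]byte {
	h := sha256.New()
	h.Write(forkDigest[:])
	h.Write(randaoMix[:])
	h.Write(dataID[:])
	var idx [8]byte
	binary.LittleEndian.PutUint64(idx[:], sampleIndex) // sample_index is little-endian
	h.Write(idx[:])
	var id [32]byte
	copy(id[:], h.Sum(nil))
	return id
}

// StoreKey is the local key-value store key (t, x, i);
// f and r are derived from slot t and are not stored.
type StoreKey struct {
	Slot        uint64   // t
	Commitment  [48]byte // x: KZG commitment of the data
	SampleIndex uint64   // i
}

// StoreValue holds the sample's data points and the proof against the commitment.
type StoreValue struct {
	Points []byte
	Proof  [48]byte
}
```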

Seeding

Generally we are not as worried about the cost to the publisher: a block builder has more resources and an incentive to avoid orphaning their block through missed or late publishing.

Before publishing the data, nodes should have the expected KZG commitment to all of it (proposer signed KZG commitment from builder on global topic).

The builder then generates the samples 0,...,N and hashes them to their DHT locations H(f,r,x,0),...,H(f,r,x,N), then distributes them in bundles (see Seeding samples below).

Each node that receives a sample can verify the sample proof against the global KZG commitment, and limit spam by e.g. testing if the message is close enough to their identity.

Nodes that receive the samples can propagate them further to other nodes in the DHT radius of the sample: some redundancy balances load and speeds up search.

Sybil attacks to capture the first distributed samples may be a problem, but a builder can score their peers, select long-lived peers, and distribute to extra nodes.

Querying

Each node (validating or not) can verify if data is available by making k random sample requests (xs = [rand_range(0, N) for i in range(k)]).

To request, the sample identity is hashed to a node identity H(f,r,x,i). If the overlay has nearby nodes, then these are queried for the sample. If not, then use regular discv5 search to find a node closer to H(f,r,x,i).
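For example, picking the k random indices could look like this in Go (a sketch; k and N are protocol parameters, not fixed here):

```go
package das

import (
	"crypto/rand"
	"math/big"
)

// RandomSampleIndices picks k indices uniformly from [0, N), mirroring
// xs = [rand_range(0, N) for i in range(k)], using a cryptographic RNG
// so the selection cannot be predicted by a data publisher.
func RandomSampleIndices(k int, n uint64) ([]uint64, error) {
	xs := make([]uint64, 0, k)
	max := new(big.Int).SetUint64(n)
	for i := 0; i < k; i++ {
		v, err := rand.Int(rand.Reader, max)
		if err != nil {
			return nil, err
		}
		xs = append(xs, v.Uint64())
	}
	return xs, nil
}
```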

Peer score

Detected good/bad sample behaviors feed back to a scoring table of peers.

Think of:

  • very bad: Stops serving a sample unexpectedly early
  • very bad: Spams requests
  • very bad: Seeds samples to us with bad proof
  • bad: Requests when closer to sample than ourselves
  • bad: Not behaving nicely on discv5
  • bad: Many identities behind same IP
  • bad: Seeds outdated samples
  • bad: Seeds samples that are too far away from local node ID
  • bad: Duplicate sample query
  • good: Serves sample requests
  • good: Seeds samples to us (better if closer to our node ID, better if early)
  • good: Stays consistently within sample rate limits
  • very good: First to seed the sample to us
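As a sketch, these behaviors could be mapped to score deltas in a single table (the weights below are untuned placeholder assumptions):

```go
package das

// Behavior enumerates observed peer actions relevant to the DAS score.
type Behavior int

const (
	StoppedServingEarly Behavior = iota // stops serving a sample unexpectedly early
	SpamsRequests
	SeededBadProof
	RequestedWhileCloser // requested a sample it is closer to than we are
	BadDiscv5Behavior
	ManyIdentitiesSameIP
	SeededOutdatedSample
	SeededFarSample // seeded a sample too far from our node ID
	DuplicateQuery
	ServedRequest
	SeededSample // better if closer to our node ID, better if early
	WithinRateLimits
	FirstToSeed
)

// ScoreDelta maps behaviors to score adjustments. The magnitudes are
// placeholders; tuning them is an open question.
var ScoreDelta = map[Behavior]float64{
	StoppedServingEarly:  -10,
	SpamsRequests:        -10,
	SeededBadProof:       -10,
	RequestedWhileCloser: -3,
	BadDiscv5Behavior:    -3,
	ManyIdentitiesSameIP: -3,
	SeededOutdatedSample: -3,
	SeededFarSample:      -3,
	DuplicateQuery:       -3,
	ServedRequest:        1,
	SeededSample:         1,
	WithinRateLimits:     1,
	FirstToSeed:          5,
}
```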

Sample Pruning

Samples are dropped after some locally configured expiry time. Aggressive pruning may negatively affect the DAS peer score perceived by other nodes.
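A minimal pruning pass over a local store might look like this (the store layout and expiry are local assumptions):

```go
package das

import "time"

// storedSample pairs a sample's serialized bytes with the time it was received.
type storedSample struct {
	value      []byte
	receivedAt time.Time
}

// pruneExpired drops samples older than the locally configured expiry.
// A shorter expiry saves space but can hurt the DAS score other peers give us.
func pruneExpired(store map[string]storedSample, expiry time.Duration) {
	now := time.Now()
	for key, s := range store {
		if now.Sub(s.receivedAt) > expiry {
			delete(store, key)
		}
	}
}
```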

Sample Gating

Samples pushed to the node by other peers are ignored, penalized, queued or accepted based on validation:

  • Is the pushing peer banned? ignore.
  • Do we not have the commitment? queue with timeout.
  • Is the proof invalid? penalize.

Otherwise follow the sample distribution validation rules.
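In code, the gating order is a simple switch (a sketch; the names are illustrative, not an existing API):

```go
package das

// GateResult is the outcome of validating a sample pushed to us by a peer.
type GateResult int

const (
	GateIgnore   GateResult = iota // drop silently
	GateQueue                      // hold with a timeout until the commitment arrives
	GatePenalize                   // lower the sender's score
	GateAccept                     // continue with the distribution validation rules
)

// GateSample applies the checks in order: banned sender, missing commitment,
// invalid proof; otherwise the sample is accepted for further validation.
func GateSample(senderBanned, haveCommitment, proofValid bool) GateResult {
	switch {
	case senderBanned:
		return GateIgnore
	case !haveCommitment:
		return GateQueue
	case !proofValid:
		return GatePenalize
	default:
		return GateAccept
	}
}
```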

Overlay

Since discv5 also includes non-ethereum nodes (IoT, alt networks, testnets, pretty much everything), we need to filter down to just the nodes that participate: an overlay.

The discv5 overlay requires:

  • No changes to discv5
  • A das key-value pair in the node records. Used to filter between different DAS overlays. Absent for non-DAS nodes.
  • Maintain more node records than a regular discv5 table
  • Quick lookups of records close to any identity: DHT-search is the backup, not the default
  • Ensure records are balanced and scored: an empty or poor-behaving sub-set in the DHT should be repaired with new records
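A sketch of the overlay filter on the das key, using a simplified record type (real code would read the ENR entry through its discv5 library):

```go
package das

// Record is a simplified view of a node record: an identity plus its
// key-value pairs. Real implementations would use their ENR/discv5 types.
type Record struct {
	ID    [32]byte
	Pairs map[string][]byte
}

// FilterDAS keeps only the records that advertise the "das" key, i.e. nodes
// participating in this DAS overlay; non-DAS nodes omit the key.
func FilterDAS(records []Record) []Record {
	out := make([]Record, 0, len(records))
	for _, r := range records {
		if _, ok := r.Pairs["das"]; ok {
			out = append(out, r)
		}
	}
	return out
}
```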

Tree

Discv5 uses an XOR log distance: the more leading bits two identities have in common, the closer they are together.
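For reference, a Go sketch of that log distance over 32-byte IDs (256 minus the length of the shared bit prefix):

```go
package das

import "math/bits"

// LogDistance returns the XOR log distance between two node IDs:
// 256 minus the number of leading bits they have in common.
// Identical IDs have distance 0.
func LogDistance(a, b [32]byte) int {
	for i := 0; i < 32; i++ {
		x := a[i] ^ b[i]
		if x != 0 {
			return 256 - i*8 - bits.LeadingZeros8(x)
		}
	}
	return 0
}
```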

To find nodes for retrieval of samples, the overlay collection of records can be represented as binary tree:

  • Each ID bit directs to the left or right sub-tree.
  • Pair nodes can have empty left or right trees if there are no records
  • Leaf nodes can be represented with extension bits (avoid many half-empty pair nodes)
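One possible in-memory shape for that tree, reusing the Record type from the overlay filter sketch (an assumption, not a final data structure):

```go
package das

// TreeNode is either a pair node (two sub-trees, one of which may be nil) or
// a leaf holding a record. Extension stores prefix bits shared by everything
// below this node, so runs of half-empty pair nodes collapse into one.
type TreeNode struct {
	Left, Right *TreeNode // nil on leaves; possibly nil on pair nodes
	Extension   []bool    // compressed prefix bits
	Record      *Record   // non-nil only on leaf nodes
	Score       float64   // leaf score; pair scores are derived (see tree balancing)
}

// IsLeaf reports whether the node holds a record rather than sub-trees.
func (n *TreeNode) IsLeaf() bool {
	return n.Record != nil
}
```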

Discv5 is extended, not modified: when a sub-tree is too small for reliable results, or when the nodes are misbehaving, discv5 is used to find new nodes in the sub-tree that can serve samples. This functions as a fallback during sampling, or as tree balancing during idle time.

Tree balancing

The scores of sub-trees add up:

Score(Pair(a, b)) = Score(a) + Score(b)
Score(Pair(nil, x)) = Score(Pair(x, nil)) = Score(x)

  • Leaf nodes below a threshold score are pruned out.
  • The weakest sub-tree is prioritized to grow with new nodes.
  • Nodes leak score over time when not accessed.
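In Go, that recursion over the TreeNode sketch from above is simply:

```go
package das

// Score sums a sub-tree: a missing child contributes nothing, a leaf scores
// itself, and a pair node scores the sum of its children.
func Score(n *TreeNode) float64 {
	if n == nil {
		return 0
	}
	if n.IsLeaf() {
		return n.Score
	}
	return Score(n.Left) + Score(n.Right)
}
```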

Sampling

Seeding samples

Samples fan out to the DHT sub-trees with redundancy:

D = 8: fanout parameter, number of peers to distribute to (TODO: depends on network size)
P = 2: prefix bits per bundle step

After building a data-blob:

  1. Extend to 2x size for error-recovery (only need 50% to recover)
  2. Hash samples to sample identities
  3. Bundle samples by ID prefix of P bits, e.g. 00, 01, 10, 11
  4. Send bundles to D random nodes in corresponding subtrees (TALKREQ). Empty or single-sample bundles are not distributed further.
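A sketch of the prefix-bundling step (extracting P bits at a given depth and grouping sample IDs; the D-fold sending itself is left out):

```go
package das

// prefixAt reads p bits of id starting at bit offset depth (most significant
// bit first) and returns them as a small integer in [0, 2^p).
func prefixAt(id [32]byte, depth, p int) int {
	v := 0
	for i := 0; i < p; i++ {
		bit := depth + i
		b := (id[bit/8] >> (7 - uint(bit%8))) & 1
		v = v<<1 | int(b)
	}
	return v
}

// bundleByPrefix groups sample IDs into up to 2^p bundles by their p-bit
// prefix at the given depth. Each bundle then goes to D random nodes in the
// matching sub-tree; empty or single-sample bundles are not forwarded.
func bundleByPrefix(ids [][32]byte, depth, p int) map[int][][32]byte {
	bundles := make(map[int][][32]byte)
	for _, id := range ids {
		key := prefixAt(id, depth, p)
		bundles[key] = append(bundles[key], id)
	}
	return bundles
}
```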

When receiving a bundle with prefix XX (a TALKREQ):

  1. Validate the score of the sender is sufficient
  2. If the bundle (or a superset of it) was received before, ignore it (execute no more steps) (remember min and max ID for bound checks)
    • TODO: if we want to track the number of different repeat seeders, we still need to validate the conditions below (extra cost)
  3. Reject if the sample is too old
  4. Reject if the bundle prefix does not match the local ID
  5. Reject if the bundle contents do not all match the prefix
  6. Verify each sample proof, reject if invalid
  7. Split the bundle into smaller bundles with next P bits: X00, X01, X10, X11
  8. Send smaller bundles to D random nodes in corresponding subtrees. Empty or single-sample bundles are not distributed further.
  9. Increase score of bundle sender (TODO: )
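A sketch of the prefix checks a relayer runs before splitting, reusing prefixAt and bundleByPrefix from the bundling sketch (score and proof checks omitted):

```go
package das

import "errors"

// validateBundlePrefix rejects a bundle whose advertised prefix does not
// match the local node ID, or whose samples fall outside that prefix.
func validateBundlePrefix(localID [32]byte, ids [][32]byte, depth, p, prefix int) error {
	if prefixAt(localID, depth, p) != prefix {
		return errors.New("bundle prefix does not match local node ID")
	}
	for _, id := range ids {
		if prefixAt(id, depth, p) != prefix {
			return errors.New("bundle contains a sample outside its prefix")
		}
	}
	return nil
}

// splitBundle re-bundles the samples by the next P bits (depth advances by p),
// ready to send to D random nodes in each deeper sub-tree.
func splitBundle(ids [][32]byte, depth, p int) map[int][][32]byte {
	return bundleByPrefix(ids, depth+p, p)
}
```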

Serving samples

On an incoming request for sample with ID inputs t,x,i (in a TALKREQ):

  1. Validate and update rate-limit of the requesting peer, reject if exceeded
  2. Reject if the requester has made the same request recently
  3. Slightly decrease requester score if the sample is unknown, and do not respond
  4. Increase the request count of the sample
  5. Serve the request: send the sample (through a TALKRESP)
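A sketch of that serving flow (the per-peer rate limiter and stores are simplified placeholders):

```go
package das

import (
	"errors"
	"time"
)

// peerServeState is per-peer bookkeeping for incoming sample requests.
type peerServeState struct {
	windowStart time.Time       // start of the current rate-limit window
	requests    int             // requests counted in the current window
	recent      map[string]bool // request keys seen recently (duplicate check)
}

// serveSample applies the checks in order: rate limit, duplicate request,
// unknown sample. On success it returns the sample to send back in a TALKRESP.
func serveSample(peer *peerServeState, key string, store map[string][]byte,
	requestCount map[string]int, maxPerWindow int, window time.Duration) ([]byte, error) {
	now := time.Now()
	if now.Sub(peer.windowStart) > window {
		peer.windowStart, peer.requests = now, 0
	}
	peer.requests++
	if peer.requests > maxPerWindow {
		return nil, errors.New("rate limit exceeded")
	}
	if peer.recent == nil {
		peer.recent = make(map[string]bool)
	}
	if peer.recent[key] {
		return nil, errors.New("duplicate request")
	}
	peer.recent[key] = true
	sample, ok := store[key]
	if !ok {
		// Slightly decrease the requester's score elsewhere; do not respond.
		return nil, errors.New("unknown sample")
	}
	requestCount[key]++ // step 4: increase the request count of the sample
	return sample, nil
}
```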

Retrieving samples

  1. Find the closest record in the tree to the sample ID
  2. Request the sample by sending the ID inputs t,x,i in a TALKREQ
  3. On response:
    1. Verify proof of returned sample
    2. Increase score of node
  4. On timeout:
    1. Decrease score of node
    2. Continue to step 2 with the next closest node; give up after trying the R closest nodes
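A sketch of that retrieval loop; the lookup, request, verification and scoring callbacks are hypothetical placeholders:

```go
package das

import "errors"

// retrieveSample tries the R closest known nodes to the target ID in order,
// scoring each on success, bad proof, or timeout, until one returns a valid sample.
func retrieveSample(
	target [32]byte,
	closest func(target [32]byte, r int) [][32]byte, // R closest overlay records
	request func(node, target [32]byte) ([]byte, error), // TALKREQ with (t, x, i)
	verify func(sample []byte) bool, // proof check against the KZG commitment
	adjustScore func(node [32]byte, delta float64),
	r int,
) ([]byte, error) {
	for _, node := range closest(target, r) {
		sample, err := request(node, target)
		if err != nil { // timeout or transport failure
			adjustScore(node, -1)
			continue
		}
		if !verify(sample) {
			adjustScore(node, -5)
			continue
		}
		adjustScore(node, 1)
		return sample, nil
	}
	return nil, errors.New("no valid sample from the R closest nodes")
}
```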

Tree Repairing

During idle time, the tree is balanced through repairs:

  1. Select the weakest sub-tree for each depth
  2. For each sub-tree, from deepest to shallowest, run a DHT search for close nodes
  3. Filter the retrieved node records for das support
  4. Add nodes to the tree (do not overwrite existing leaf scores, do re-compute pair scores)
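A compact sketch of that idle-time repair loop, reusing the Record type from the overlay filter sketch; every helper here is a hypothetical placeholder for the corresponding step:

```go
package das

// repairTree grows the weakest sub-trees with fresh records found via discv5.
func repairTree(
	weakestPrefixes func() [][]bool, // step 1: weakest sub-tree prefix per depth
	searchSubtree func(prefix []bool) []Record, // step 2: discv5 search near the prefix
	supportsDAS func(Record) bool, // step 3: filter on the das record key
	addRecord func(Record), // step 4: insert without overwriting leaf scores
) {
	prefixes := weakestPrefixes()
	for i := len(prefixes) - 1; i >= 0; i-- { // from deepest to shallowest
		for _, rec := range searchSubtree(prefixes[i]) {
			if supportsDAS(rec) {
				addRecord(rec)
			}
		}
	}
}
```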