🌌 A libp2p DHT crawler that gathers information about running nodes in the network.

Nebula Crawler

A libp2p DHT crawler that gathers information about running nodes in the network. The crawler runs every 30 minutes by connecting to the standard DHT bootstrap nodes and then recursively following all entries in the k-buckets until all peers have been visited.

[Screenshot from a Grafana dashboard]

Project Status

The crawler is in a working state: it successfully visits and follows all nodes in the network. However, the project is very young and still has sharp edges in both the codebase and the documentation. Most importantly, the gathered numbers are in line with existing data such as that of the wiberlin/ipfs-crawler. Their crawler also powers a dashboard which can be found here.

Usage

Nebula is a command line tool and provides the three sub-commands crawl, monitor and daemon. See the command line help page below for configuration options:

NAME:
   nebula - A libp2p DHT crawler and monitor that exposes timely information about DHT networks.

USAGE:
   nebula [global options] command [command options] [arguments...]

VERSION:
   vdev+5f3759df

AUTHOR:
   Dennis Trautwein <[email protected]>

COMMANDS:
   crawl    Crawls the entire network based on a set of bootstrap nodes.
   monitor  Monitors the network by periodically dialing and pinging previously crawled peers.
   daemon   Start a long running process that crawls and monitors the DHT network
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --debug                  Set this flag to enable debug logging (default: false) [$NEBULA_DEBUG]
   --log-level value        Set this flag to a value from 0 to 6. Overrides the --debug flag (default: 4) [$NEBULA_LOG_LEVEL]
   --config FILE            Load configuration from FILE [$NEBULA_CONFIG_FILE]
   --dial-timeout value     How long should be waited before a dial is considered unsuccessful (default: 30s) [$NEBULA_DIAL_TIMEOUT]
   --prom-port value        On which port should prometheus serve the metrics endpoint (default: 6666) [$NEBULA_PROMETHEUS_PORT]
   --prom-host value        Where should prometheus serve the metrics endpoint (default: localhost) [$NEBULA_PROMETHEUS_HOST]
   --db-host value          On which host address can nebula reach the database (default: localhost) [$NEBULA_DATABASE_HOST]
   --db-port value          On which port can nebula reach the database (default: 5432) [$NEBULA_DATABASE_PORT]
   --db-name value          The name of the database to use (default: nebula) [$NEBULA_DATABASE_NAME]
   --db-password value      The password for the database to use (default: password) [$NEBULA_DATABASE_PASSWORD]
   --db-user value          The user with which to access the database to use (default: nebula) [$NEBULA_DATABASE_USER]
   --protocols value        Comma separated list of protocols that this crawler should look for (default: "/ipfs/kad/1.0.0", "/ipfs/kad/2.0.0") [$NEBULA_PROTOCOLS]
   --bootstrap-peers value  Comma separated list of multi addresses of bootstrap peers [$NEBULA_BOOTSTRAP_PEERS]
   --help, -h               show help (default: false)
   --version, -v            print the version (default: false)

How does it work?

crawl

The crawl sub-command starts by connecting to a set of bootstrap nodes and constructing the routing tables (Kademlia k-buckets) of the remote peers based on their PeerIds. To do so, nebula builds random PeerIds, each with a different common prefix length (CPL) relative to the remote peer's ID, and asks the remote peer whether it knows any peers that are closer to the ones nebula just constructed. This effectively yields a list of all PeerIds that a peer has in its routing table. The process repeats for all found peers until nebula does not find any new PeerIds.
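The role of the common prefix length can be illustrated with a small sketch. It works directly on 256-bit Kademlia keys instead of real PeerIds (a PeerId is hashed before it is placed in the key space, so producing a PeerId with a given CPL needs extra machinery that is left out here). The names key, commonPrefixLen and randomKeyWithCPL are made up for this example and are not taken from the nebula codebase.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// key is a 256-bit Kademlia key. Hypothetical type for this sketch.
type key [32]byte

// commonPrefixLen returns how many leading bits a and b have in common.
func commonPrefixLen(a, b key) int {
	for i := range a {
		if x := a[i] ^ b[i]; x != 0 {
			return i*8 + bits.LeadingZeros8(x)
		}
	}
	return len(a) * 8
}

// randomKeyWithCPL returns a random key that shares exactly cpl leading
// bits with target: the first cpl bits are copied, the next bit is
// inverted, and the remaining bits stay random.
func randomKeyWithCPL(target key, cpl int) (key, error) {
	var k key
	if cpl < 0 || cpl >= len(target)*8 {
		return k, fmt.Errorf("cpl %d out of range [0, %d)", cpl, len(target)*8)
	}
	if _, err := rand.Read(k[:]); err != nil {
		return k, err
	}
	idx, rem := cpl/8, cpl%8
	copy(k[:idx], target[:idx]) // whole bytes of the shared prefix

	mask := byte(0xFF) << (8 - rem) // top rem bits of the boundary byte
	k[idx] = (target[idx] & mask) | (k[idx] &^ mask)

	flip := byte(0x80) >> rem // the first bit that must differ
	k[idx] = (k[idx] &^ flip) | (^target[idx] & flip)
	return k, nil
}

func main() {
	target := key(sha256.Sum256([]byte("example-peer-id")))
	// 16 prefix lengths is an arbitrary choice for this demonstration.
	for cpl := 0; cpl < 16; cpl++ {
		k, err := randomKeyWithCPL(target, cpl)
		if err != nil {
			panic(err)
		}
		fmt.Printf("requested CPL %2d, actual CPL %2d\n", cpl, commonPrefixLen(target, k))
	}
}
```

Asking a remote peer for the peers closest to one such key per prefix length touches a different k-bucket each time, which is why the combined answers approximate the full routing table.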

This process is heavily inspired by the basic-crawler in libp2p/go-libp2p-kad-dht from @aschmahmann.

Every peer that was found is persisted together with its multi-addresses. If the peer was dialable, nebula also creates a session instance that contains the following information:

type Session struct {
  // A unique id that identifies a particular session
  ID int
  // The peer ID in the form of Qm... or 12D3...
  PeerID string
  // When was the peer successfully dialed the first time
  FirstSuccessfulDial time.Time
  // When was the most recent successful dial to the peer above
  LastSuccessfulDial time.Time
  // When should we try to dial the peer again
  NextDialAttempt null.Time
  // When did we notice that this peer is not reachable.
  // This cannot be null because otherwise the unique constraint
  // uq_peer_id_first_failed_dial would not work (nulls are distinct).
  // An unset value corresponds to the timestamp 1970-01-01
  FirstFailedDial time.Time
  // The duration that this peer was online due to multiple subsequent successful dials
  MinDuration null.String
  // The duration from the first successful dial to the point where it was unreachable
  MaxDuration null.String
  // Indicates whether this session is finished or not. Equivalent to checking for
  // 1970-01-01 in the first_failed_dial field.
  Finished bool
  // How many subsequent successful dials could we track
  SuccessfulDials int
  // When was this session instance updated the last time
  UpdatedAt time.Time
  // When was this session instance created
  CreatedAt time.Time
}
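How the derived fields relate to the dial timestamps can be sketched as follows. This is one reading of the field comments above, not the project's actual code: the trimmed-down session type, the unsetDial variable and the three methods are hypothetical names, and the durations are returned as time.Duration here rather than as the nullable strings used in the struct.

```go
package main

import (
	"fmt"
	"time"
)

// unsetDial is the placeholder for "no failed dial yet" (1970-01-01).
var unsetDial = time.Unix(0, 0).UTC()

// session holds only the Session fields needed for this sketch.
type session struct {
	FirstSuccessfulDial time.Time
	LastSuccessfulDial  time.Time
	FirstFailedDial     time.Time
}

// finished reports whether a failed dial has been recorded.
func (s session) finished() bool {
	return !s.FirstFailedDial.Equal(unsetDial)
}

// minDuration is the uptime confirmed by successful dials.
func (s session) minDuration() time.Duration {
	return s.LastSuccessfulDial.Sub(s.FirstSuccessfulDial)
}

// maxDuration is the upper bound of the uptime. It is only known once
// the session is finished, hence the boolean.
func (s session) maxDuration() (time.Duration, bool) {
	if !s.finished() {
		return 0, false
	}
	return s.FirstFailedDial.Sub(s.FirstSuccessfulDial), true
}

func main() {
	start := time.Date(2021, 7, 1, 12, 0, 0, 0, time.UTC)
	s := session{
		FirstSuccessfulDial: start,
		LastSuccessfulDial:  start.Add(2 * time.Hour),
		FirstFailedDial:     unsetDial,
	}
	fmt.Println(s.finished(), s.minDuration())

	s.FirstFailedDial = start.Add(3 * time.Hour)
	max, _ := s.maxDuration()
	fmt.Println(s.finished(), max)
}
```

The true uptime of a peer lies somewhere between the two values, because the peer may have gone offline at any point between the last successful and the first failed dial.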

At the end of each crawl nebula persists general statistics about the crawl, such as the duration, the number of dialable peers, encountered errors, agent versions, etc.

Info: You can use the crawl sub-command with the --dry-run option that skips any database operations.
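The overall crawl described in this section is essentially a breadth-first traversal that stops once no unseen peers remain. A minimal sketch, assuming an in-memory map in place of real network requests (the topology type, the crawlStats struct and the crawl function are illustrative names, not nebula's API):

```go
package main

import "fmt"

// topology maps a peer to the peers in its routing table. A peer that is
// missing from the map is treated as not dialable. Hypothetical stand-in
// for real network requests.
type topology map[string][]string

// crawlStats is a tiny stand-in for the statistics persisted per crawl.
type crawlStats struct {
	Visited    int
	Dialable   int
	Undialable int
}

// crawl visits every peer reachable from the bootstrap peers exactly once.
func crawl(t topology, bootstrap []string) crawlStats {
	seen := make(map[string]bool)
	var queue []string
	enqueue := func(p string) {
		if !seen[p] {
			seen[p] = true
			queue = append(queue, p)
		}
	}
	for _, b := range bootstrap {
		enqueue(b)
	}

	var stats crawlStats
	for len(queue) > 0 {
		peer := queue[0]
		queue = queue[1:]
		stats.Visited++

		neighbors, ok := t[peer]
		if !ok {
			stats.Undialable++
			continue
		}
		stats.Dialable++
		for _, n := range neighbors {
			enqueue(n)
		}
	}
	return stats
}

func main() {
	t := topology{
		"bootstrap": {"a", "b"},
		"a":         {"b", "c"},
		"b":         {"a"},
		// "c" is known to "a" but cannot be dialed.
	}
	fmt.Printf("%+v\n", crawl(t, []string{"bootstrap"}))
}
```

A real crawler would dial many peers concurrently; the sequential queue only keeps the termination condition easy to see.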

monitor

Every 10 seconds, the monitor sub-command polls the database for all sessions (see above) that are due to be dialed within the next 10 seconds (based on the NextDialAttempt timestamp). It attempts to dial these peers using their previously saved multi-addresses and updates their session instances depending on whether they were dialable.
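The selection step of that polling loop can be sketched in memory. In nebula this is a database query; the session type and the dueSessions function below are hypothetical and only show the time comparison.

```go
package main

import (
	"fmt"
	"time"
)

// session holds only the fields needed for this sketch.
type session struct {
	PeerID          string
	NextDialAttempt time.Time
}

// dueSessions returns the sessions whose NextDialAttempt is at or before
// now+window, i.e. overdue ones and those due within the window.
func dueSessions(all []session, now time.Time, window time.Duration) []session {
	deadline := now.Add(window)
	var due []session
	for _, s := range all {
		if !s.NextDialAttempt.After(deadline) {
			due = append(due, s)
		}
	}
	return due
}

func main() {
	now := time.Now()
	sessions := []session{
		{PeerID: "peer-overdue", NextDialAttempt: now.Add(-time.Second)},
		{PeerID: "peer-soon", NextDialAttempt: now.Add(5 * time.Second)},
		{PeerID: "peer-later", NextDialAttempt: now.Add(time.Minute)},
	}
	for _, s := range dueSessions(sessions, now, 10*time.Second) {
		fmt.Println("dial", s.PeerID)
	}
}
```

A long-running monitor could call such a function from a time.Ticker that fires every 10 seconds.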

The NextDialAttempt timestamp is calculated based on the uptime that nebula has observed for that given peer. If the peer has been up for a long time, nebula assumes that it stays up and thus decreases the dial frequency, i.e., sets the NextDialAttempt timestamp further into the future.
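One simple way to get this behaviour is to make the wait time proportional to the observed uptime and clamp it to a sensible range. The sketch below does exactly that. The factor and the two bounds are assumptions chosen for illustration and are not taken from the nebula source, and dialInterval and nextDialAttempt are hypothetical function names.

```go
package main

import (
	"fmt"
	"time"
)

// All three values are assumptions for this sketch.
const (
	uptimeFactor = 0.5              // wait for half of the observed uptime
	minInterval  = time.Minute      // never dial more often than this
	maxInterval  = 15 * time.Minute // never wait longer than this
)

// dialInterval maps the observed uptime to the wait time before the next dial.
func dialInterval(uptime time.Duration) time.Duration {
	interval := time.Duration(float64(uptime) * uptimeFactor)
	if interval < minInterval {
		return minInterval
	}
	if interval > maxInterval {
		return maxInterval
	}
	return interval
}

// nextDialAttempt returns the timestamp of the next dial.
func nextDialAttempt(now time.Time, uptime time.Duration) time.Time {
	return now.Add(dialInterval(uptime))
}

func main() {
	now := time.Now()
	for _, uptime := range []time.Duration{30 * time.Second, 10 * time.Minute, 24 * time.Hour} {
		next := nextDialAttempt(now, uptime)
		fmt.Printf("uptime %v -> next dial in %v\n", uptime, next.Sub(now))
	}
}
```

The upper bound limits how long a departure can go unnoticed, and the lower bound keeps freshly discovered peers from being dialed in a tight loop.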

daemon

Work in progress: The daemon sub-command combines the crawl and monitor tasks in a single process. It uses application level scheduling of the crawls rather than e.g. using OS-level cron configurations.

Install

Release download

There is no release yet.

From source

To compile it yourself run:

go install github.com/dennis-tra/nebula/cmd/nebula@latest # Go 1.16 or higher is required (may work with a lower version too)

Make sure the $GOPATH/bin is in your PATH variable to access the installed nebula executable.

Development

To develop this project you need Go > 1.16 and the following tools:

To install the necessary tools you can run make tools. This uses the go install command to download and install the tools into your $GOPATH/bin directory, so make sure that directory is in your $PATH environment variable.

Database

You need a running postgres instance to persist and/or read the crawl results. Use the following command to start a local instance of postgres:

docker run -p 5432:5432 -e POSTGRES_PASSWORD=password -e POSTGRES_USER=nebula -e POSTGRES_DB=nebula postgres:13

Info: You can use the crawl sub-command with the --dry-run option that skips any database operations.

The default database settings are:

Name     = "nebula",
Password = "password",
User     = "nebula",
Host     = "localhost",
Port     = 5432,
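For reference, these defaults map onto a Postgres connection URL of the form used by the migrate commands in this section. The sketch below builds that URL with the Go standard library; the dbConfig type and the connectionURL method are illustrative names and not part of nebula.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
)

// dbConfig mirrors the default database settings listed above.
type dbConfig struct {
	Name     string
	Password string
	User     string
	Host     string
	Port     int
}

var defaultDB = dbConfig{
	Name:     "nebula",
	Password: "password",
	User:     "nebula",
	Host:     "localhost",
	Port:     5432,
}

// connectionURL assembles a postgres:// URL. The sslMode value ends up in
// the sslmode query parameter.
func (c dbConfig) connectionURL(sslMode string) string {
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(c.User, c.Password),
		Host:     net.JoinHostPort(c.Host, strconv.Itoa(c.Port)),
		Path:     "/" + c.Name,
		RawQuery: url.Values{"sslmode": {sslMode}}.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(defaultDB.connectionURL("disable"))
}
```

Building the URL through net/url rather than string concatenation keeps passwords with special characters correctly escaped.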

To run the migrations, use:

# Up migrations
migrate -database 'postgres://nebula:password@localhost:5432/nebula?sslmode=disable' -path migrations up
# OR
make migrate-up

# Down migrations
migrate -database 'postgres://nebula:password@localhost:5432/nebula?sslmode=disable' -path migrations down
# OR
make migrate-down

# Create new migration
migrate create -ext sql -dir migrations -seq some_migration_name

To generate the ORM with SQLBoiler run:

sqlboiler psql

Related Efforts

Maintainers

@dennis-tra.

Contributing

Feel free to dive in! Open an issue or submit PRs.

Support

It would really make my day if you supported this project through Buy Me A Coffee.

Other Projects

You may be interested in one of my other projects:

  • pcp - Command line peer-to-peer data transfer tool based on libp2p.
  • image-stego - A novel approach to image manipulation detection. Steganography-based image integrity - Merkle tree nodes embedded into image chunks so that each chunk's integrity can be verified on its own.

License

Apache License Version 2.0 © Dennis Trautwein

Comments
  • postgres ssl

    I was working on a helm chart to deploy this. If you want to have a gander, that can be seen here https://github.com/filecoin-project/helm-charts/pull/121

    Our postgres database is configured to accept connections only with secure sslmodes, so I added another flag to parameterize it and changed the default.

  • Is it possible to continue monitoring nodes after one unsuccessful dial?

    Hello, it seems like after one unsuccessful dial attempt, the monitor marks a node as unreachable and does not attempt any further dials. Is it possible to continue monitoring the disconnected nodes for a period of time before considering them permanently unreachable? This way, we could gather some data on the disconnected nodes to analyze the offline patterns of some server nodes (for example, some nodes may have regular offline times for various reasons).

  • Would it be possible to crawl LBRY SDK DHT nodes?

    I have a question: is it possible to crawl other DHT-based networks, for example LBRY SDK nodes (https://github.com/lbryio/lbry-sdk) on port 4444 UDP, or the torrent DHT?

  • Use proper user agent

    libp2p allows configuring a user agent via libp2p.UserAgent(...).

    I'm thinking of using nebula-crawler/{version}, where {version} should come from a central source.

  • Nebula v2

    Changelog

    • network command line parameter to tell Nebula which network to crawl.
    • Resolution of multiaddresses involves
      • DNS lookup in case of dns
      • Mapping to IPv4/IPv6 address
      • Geolocation Mapping based on Maxmind GeoIP2
      • Datacenter detection based on UdgerDB
      • Relay flag
    • New Database schema incorporating the learnings of one year in operation
      • Partitioning of time series tables: visits, sessions, peer_logs, neighbors
      • New indexes based on access patterns during crawling, monitoring, and data analysis
      • Simplification of mapping tables (e.g. visits now contain an array of multi addresses)
      • New constraints to not even let bad data enter the database
  • Add latency measurement option to crawl command

    PR

    This pull request adds the option to measure latencies to dialable peers during crawling.

    Follow up ideas

    • [ ] create a measure subcommand
    • [ ] add option to measure latencies during monitoring
    • [ ] query geoip db to enrich IP information latencies table
    • [ ] use different db for this kind of measurements
  • Check existing data when starting a crawl to not mix data from different networks

    Imagine you're running the crawler for the IPFS network for some time. Then you want to start crawling the FILECOIN network as well and experiment around. This could easily lead to FILECOIN data ending up in the same database as the IPFS data. This could be avoided if prior to each crawl we check which network was actually crawled before.

  • Add CSV output option

    If the --neighbors flag and, e.g., a --csv flag are given, a neighbors adjacency list should be generated after the crawl. Format:

    peer_id_1,neighbor_1
    peer_id_1,neighbor_2
    peer_id_1,neighbor_3
    peer_id_1,neighbor_4
    peer_id_1,neighbor_5
    ...
    peer_id_2,neighbor_x
    peer_id_2,neighbor_y
    peer_id_2,neighbor_z
    ...
    
  • migrate sql queries

    When I installed the latest version, graphs don't work when they query against an old relation. Looks like there was a migration in https://github.com/dennis-tra/nebula-crawler/commit/5476677 and perhaps the dashboards weren't updated since then.

    The filecoin dashboard needs to be updated also, not in this PR.
