
Apollo 💎

A Unix-style personal search engine and web crawler for your digital footprint


Demo

apollodemo.mp4

Contents

Background
Thesis
Design
Architecture
Data Schema
Workflows
Document Storage
Shut up, how can I use it?
Notes
Future
Inspirations

Background

Apollo is a different type of search engine. Traditional search engines (like Google) are great for discovery, when you're trying to find the answer to a question but don't know exactly what you're looking for.

However, they're very poor at recall and synthesis, when you've seen something before on the internet somewhere but can't remember where. Trying to find it again becomes a nightmare: how can you synthesize the great material on the internet when you forgot where it even was? I've wasted many an hour combing through Google and my search history to look up a good article, blog post, or just something I've seen before.

Even with built-in systems to store some of my favorite articles, podcasts, and other stuff, I forget things all the time.

Thesis

Screw finding a needle in the haystack. Let's create a new type of search for choosing which gem you're looking for.

Apollo is a search engine and web crawler to digest your digital footprint. What this means is that you choose what to put in it. When you come across something that looks interesting, be it an article, blog post, website, whatever, you manually add it (with built-in systems to make doing so easy). If you always want to pull in data from a certain data source, like your notes or something else, you can do that too. This tackles one of the biggest problems of recall in search engines, which return a lot of irrelevant information, because with Apollo the signal-to-noise ratio is very high: you've chosen exactly what to put in it.

Apollo is not necessarily built for raw discovery (although it certainly supports rediscovery); it's built for knowledge compression and transformation, that is, looking up things that you've previously deemed to be cool.

Design

The first thing you might notice is that the design is reminiscent of the old digital computer age, back in the Unix days. This is intentional for many reasons. In addition to paying homage to the greats of the past, this design makes me feel like I'm searching through something that is authentically my own. When I search for stuff, I genuinely feel like I'm travelling through the past.

Architecture

[architecture diagram]

Apollo's client side is written in Poseidon. The client side interacts with the backend via a REST-like API, which provides endpoints for searching data and adding a new entry.

The backend is written in Go and is composed of a few important components:

  1. The web server which serves the endpoints
  2. A tokenizer and stemmer used during search queries and when building the inverted index on the data
  3. A simple web crawler for scraping links to articles/blog posts/YouTube videos
  4. The actual search engine, which takes a query, tokenizes and stems it, finds the relevant results from the inverted index using those stemmed tokens, then ranks the results with TF-IDF (a sketch of this flow follows the list)
  5. A package which pulls in data from a couple of different sources - if you want to pull data from a custom data source, this is where you should add it.
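
To make the search step concrete, here is a minimal sketch of how a query might flow through those pieces. It is an illustration only: the tokenizer, the toy stem function (the real project uses a ported snowball stemmer), and the index layout are assumptions, not the actual code in pkg/apollo/backend.

package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// index maps a stemmed token to the documents containing it and the token's
// frequency in each document (the same shape Record.TokenFrequency feeds).
type index map[string]map[string]int

// tokenize lowercases the text and splits it on non-letter characters.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return r < 'a' || r > 'z'
	})
}

// stem is a toy stand-in for the snowball stemmer the project actually uses.
func stem(tok string) string {
	for _, suffix := range []string{"ing", "ed", "s"} {
		if strings.HasSuffix(tok, suffix) && len(tok) > len(suffix)+2 {
			return strings.TrimSuffix(tok, suffix)
		}
	}
	return tok
}

// search scores every document that matches the query with TF-IDF and
// returns the document IDs, best match first.
func search(idx index, query string, totalDocs int) []string {
	scores := make(map[string]float64)
	for _, tok := range tokenize(query) {
		postings, ok := idx[stem(tok)]
		if !ok {
			continue
		}
		idf := math.Log(float64(totalDocs) / float64(len(postings)))
		for docID, tf := range postings {
			scores[docID] += float64(tf) * idf
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return scores[ids[i]] > scores[ids[j]] })
	return ids
}

func main() {
	idx := index{
		"search": {"doc1": 3, "doc2": 1},
		"engine": {"doc1": 2},
	}
	fmt.Println(search(idx, "searching engines", 2)) // [doc1 doc2]
}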

Data Schema

We use two schemas. The first parses the data into a standardized intermediate format. This does not get stored; it's purely an intermediate before we transform it into a record for our inverted index. Why is this important?

  • Because any data gets parsed into this standardized format, you can link any data source you want. If you build your own tool, or you store a lot of data in some existing one, you don't have to manually add everything: you can pull in data from any data source provided you give the API data in this format (the conversion into a record is sketched after the structs below).
type Data struct {
    title string //a title of the record, self-explanatory
    link string //links to the source of a record, e.g. a blog post, website, podcast etc.
    content string //actual content of the record, must be text data
    tags []string //list of potential high-level document tags you want to add that will be
                  //indexed in addition to the raw data contained 
}
//smallest unit of data that we store in the database
//this will store each "item" in our search engine with all of the necessary information
//for the inverted index
type Record struct {
	//unique identifier
	ID string `json:"id"`
	//title
	Title string `json:"title"`
	//potential link to the source if applicable
	Link string `json:"link"`
	//text content to display on results page
	Content string `json:"content"`
	//map of tokens to their frequency
	TokenFrequency map[string]int `json:"tokenFrequency"`
}
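
The conversion from the intermediate Data format into a stored Record then mostly amounts to assigning an ID and counting stemmed token frequencies. Here is a rough sketch, reusing the tokenize and stem helpers from the search sketch above; the toRecord helper name is hypothetical, not the repo's actual function.

// toRecord sketches the Data -> Record step: tokenize the title, content and
// tags, stem each token, and count frequencies so the inverted index can
// score this record against queries later.
func toRecord(id string, d Data) Record {
	freq := make(map[string]int)
	text := d.title + " " + d.content + " " + strings.Join(d.tags, " ")
	for _, tok := range tokenize(text) {
		freq[stem(tok)]++
	}
	return Record{
		ID:             id,
		Title:          d.title,
		Link:           d.link,
		Content:        d.content,
		TokenFrequency: freq,
	}
}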

Workflows

Data comes in many forms, and the more varied those forms are, the harder it is to write reliable software to deal with it. If everything I wanted to index was just stuff I wrote, life would be easy. All of my notes would probably live in one place, so I would just have to grab the data from that data source and chill.

The problem is I don't take a lot of notes and not everything I want to index is something I'd take notes of.

So what to do?

Apollo can't handle all types of data; it's not designed to. However, in building a search engine to index stuff, there are a couple of things I focused on:

  1. Any data that comes from a specific platform can be integrated. If you want to index all your Twitter data, for example, this is possible since all of the data can be absorbed in a consistent format, converted into the compatible Apollo format, and sent off. So data sources can be easily integrated; this is by design, in case I want to pull in data from personal tools.
  2. The harder thing is what about just, what I will call, "writing on the internet." I read a lot of stuff on the internet, much of which I'd like to be able to index, without necessarily having to take notes on everything I read, because I'm lazy. The dream would be to just drop a link and have Apollo intelligently try to fetch the content, so I can index it without having to go to the post and copy the content, which would be painful and too slow. This was a large motivation for the web crawler component of the project.
  • If it's writing on the internet, you should be able to post a link and have the rest filled in automatically
  • If it's a podcast episode or any YouTube video, download the text transcription, e.g. this
  • If you want to pull data from a custom data source, add it as a file in the pkg/apollo/sources folder, following the same rules as some of the examples, and make sure to add it in the GetData() method of the source.go file in this package (a sketch of such a source follows this list)
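
As a rough illustration of what a custom source file could look like, here is a hypothetical pkg/apollo/sources/mynotes.go. The file name, the notes JSON layout, and the getMyNotes function are all made up for the example; the only real requirement is returning entries in the schema.Data format keyed by a unique string, and then calling your function from GetData() in source.go. The sketch also assumes schema.Data exposes exported versions of the fields shown earlier.

package sources

import (
	"encoding/json"
	"os"

	"github.com/amirgamil/apollo/pkg/apollo/schema"
)

// note mirrors a hypothetical notes.json export from a personal tool.
type note struct {
	Title string   `json:"title"`
	URL   string   `json:"url"`
	Body  string   `json:"body"`
	Tags  []string `json:"tags"`
}

// getMyNotes reads notes.json and converts each note into the intermediate
// schema.Data format, keyed by a unique string.
func getMyNotes() map[string]schema.Data {
	results := make(map[string]schema.Data)
	raw, err := os.ReadFile("notes.json")
	if err != nil {
		return results // no notes file yet, nothing to sync
	}
	var notes []note
	if err := json.Unmarshal(raw, &notes); err != nil {
		return results
	}
	for _, n := range notes {
		results["mynotes-"+n.Title] = schema.Data{
			Title:   n.Title,
			Link:    n.URL,
			Content: n.Body,
			Tags:    n.Tags,
		}
	}
	return results
}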

Document storage

Local records and data from data sources are stored in separate JSON files. This is for convenience.

I also personally store my Kindle highlights as a JSON file - I use read.amazon.com and a readwise extension to download the exported highlights for a book. I put any new book JSON files in a kindle folder in the outer directory, and every time the inverted index is recomputed, the kindle integration takes any new book highlights, merges them into the main kindle.json file stored in the data folder, then deletes the old files.
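
A minimal sketch of that merge step, assuming one JSON object per book mapping the title to its highlights (the file layout and map shape are assumptions, not the repo's exact format):

package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// mergeKindleHighlights folds any new per-book JSON files from ./kindle into
// ./data/kindle.json and deletes the originals once they have been merged.
func mergeKindleHighlights() error {
	master := make(map[string][]string) // book title -> highlights
	if raw, err := os.ReadFile("data/kindle.json"); err == nil {
		json.Unmarshal(raw, &master)
	}
	entries, err := os.ReadDir("kindle")
	if err != nil {
		return err
	}
	for _, e := range entries {
		path := filepath.Join("kindle", e.Name())
		raw, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		book := make(map[string][]string)
		if err := json.Unmarshal(raw, &book); err != nil {
			continue
		}
		for title, highlights := range book {
			master[title] = append(master[title], highlights...)
		}
		os.Remove(path) // the old per-book file is deleted after merging
	}
	out, err := json.MarshalIndent(master, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile("data/kindle.json", out, 0644)
}

func main() {
	if err := mergeKindleHighlights(); err != nil {
		panic(err)
	}
}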

Shut up, how can I use it?

Although I built Apollo first and foremost for myself, I also wanted other people to be able to use it if they found it valuable. To use Apollo locally:

  1. Clone the repo: git clone ....
  2. Make sure you have Go installed, as well as youtube-dl, which is how we download the subtitles of a video. You can use this to install it.
  3. Navigate to the root directory of the project: cd apollo. Note that since Apollo syncs from some personal data sources, you'll want to remove them, add your own, or build stuff on top of them. Otherwise the terminal will complain if you attempt to run it, so:
  4. Navigate to pkg/apollo/sources in your preferred editor and replace the body of the GetData function with return make(map[string]schema.Data) (see the sketch after this list)
  5. Create a folder named data in the outer directory
  6. Create a .env file in the outermost directory (i.e. in the same directory as the README.md) and add PASSWORD=<password>, where <password> is whatever password you want. This is necessary for adding or scraping data: you'll want to "prove you're Amir", i.e. authenticate yourself, and then you won't need to do this in the future. If this is not making sense, try adding some data on apollo.amirbolous.com/add and see what happens.
  7. Go back to the outer directory (meaning you should see the files the way GitHub is displaying them right now) and run go run cmd/apollo.go in the terminal.
  8. Navigate to 127.0.0.1:8993 on your browser
  9. It should be working! You can add data and index data from the database. If you run into problems, open an issue or DM me on Twitter
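
For step 4, the replaced function should end up looking roughly like this. The exact signature in source.go may differ; the point is that the return type is map[string]schema.Data, so the body must return an empty map rather than a slice:

//in pkg/apollo/sources/source.go
func GetData() map[string]schema.Data {
	return make(map[string]schema.Data)
}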

A little more information on the Add Data section

  • In order to add data, you'll first need to authenticate yourself: enter your password once in the "Please prove you're Amir" box, and if you see a Hooray! popup, that means you were authenticated successfully. You only need to do this once, since we use localStorage to remember whether you've been authenticated.
  • In order to scrape a website, paste a link in the link textbox, then click on the scrape button. Note this does not add the website/content - you still need to click the add button if you want to save it. The web crawler works reliably most of the time if you're dealing with written content on a web page or a YouTube video. We use a Go-ported version of readability to scrape the main content from a page if it's written content, and youtube-dl to get the transcript of a video (a rough sketch follows this list). In the future, I'd like to make this web crawler more robust, but it works well enough most of the time for now.
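
A rough sketch of what the scrape step boils down to. The go-shiori/go-readability import is one Go port of readability and may not be the exact one this repo uses, and the youtube-dl flags shown are the standard subtitle options; treat both as assumptions rather than a description of the actual implementation.

package main

import (
	"fmt"
	"os/exec"
	"time"

	readability "github.com/go-shiori/go-readability"
)

// scrapeArticle pulls the title and main text content out of a written page.
func scrapeArticle(url string) (title, content string, err error) {
	article, err := readability.FromURL(url, 30*time.Second)
	if err != nil {
		return "", "", err
	}
	return article.Title, article.TextContent, nil
}

// fetchTranscript shells out to youtube-dl for a video's auto-generated
// English subtitles without downloading the video itself.
func fetchTranscript(url string) error {
	cmd := exec.Command("youtube-dl",
		"--write-auto-sub", "--sub-lang", "en",
		"--skip-download", "-o", "transcript", url)
	return cmd.Run()
}

func main() {
	if title, content, err := scrapeArticle("https://example.com/post"); err == nil {
		fmt.Println(title, len(content))
	}
}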

As a side note, although I want others to be able to use Apollo, this is not a "commercial product", so feel free to open a feature request if you'd like one, but it's unlikely I will get to it unless it becomes something I personally want to use.

Notes

  • The inverted index is regenerated once every n days (currently n = 3)
  • Since this is not a commercial product, I will not be running your version of this (if you find it useful) on my server. However, although I designed this first and foremost for myself, I want other people to be able to use it if they find it useful; refer to "Shut up, how can I use it?" above
  • I had the choice between using Go's gob package and JSON for the database/inverted index. The gob package is definitely faster, however it's only native to Go, so I decided to go with JSON to keep the data available for potentially any non-Go integrations in the future and to be able to switch the infrastructure completely if I want to (a sketch of the JSON persistence follows these notes)
  • I use a ported version of the snowball algorithm in Go for my stemmer. Although I would have liked to build my own stemmer, implementing a robust one (which is what I wanted) was not the focus of the project. Since the algorithm for a stemmer does not need to be maintained like other types of software, I decided to use one out of the box. If I write my own in the future, I'll swap it out.
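
As a sketch of what that JSON choice buys, persisting the index is just encoding/json over the token maps, and any non-Go tool can read the file back. The path and the index shape here are assumptions for illustration:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// invertedIndex maps a stemmed token to the IDs of records containing it and
// the token's frequency in each record.
type invertedIndex map[string]map[string]int

// saveIndex writes the index to disk as plain JSON, readable from any language.
func saveIndex(idx invertedIndex, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(idx)
}

// loadIndex reads the index back from disk.
func loadIndex(path string) (invertedIndex, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	idx := make(invertedIndex)
	err = json.NewDecoder(f).Decode(&idx)
	return idx, err
}

func main() {
	idx := invertedIndex{"apollo": {"doc1": 2}}
	if err := saveIndex(idx, "index.json"); err == nil {
		loaded, _ := loadIndex("index.json")
		fmt.Println(loaded)
	}
}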

Future

  • Improve the search algorithm, perhaps something more like Elasticsearch if the data grows a lot?
  • Improve the web crawler - make it more robust like mercury parser, maybe write my own
  • Speed up search

Inspirations

Owner
Amir Bolous
Hacker, maker, and professional noob
Comments
  • error in the instructions or not clear

    I don't know Go but followed the instructions and replaced the content of the GetData function with return []schema.Data{} after deleting all other content. The function has just that one line, and when I run it I get:

    cant use []schema.Data{} (type []schema.Data) as type map[string]schema.Data in return argument

    Clearly I shouldn't have deleted everything, but I'm not sure what to do, not knowing Go.

  • Error when running server

    When I run the server as described in the readme, I get the following errors:

    $ go run cmd/apollo.go
    
    # github.com/amirgamil/apollo/pkg/apollo/sources
    pkg/apollo/sources/kindle.go:50:2: undefined: ensureFileExists
    pkg/apollo/sources/kindle.go:66:3: undefined: deleteFiles
    pkg/apollo/sources/kindle.go:121:11: undefined: getFilesInFolder
    
  • [suggestion/enhancement] Use pandoc

    You use youtube-dl to get subtitles for videos. For local files and things like PDFs and other formats that might get downloaded, pandoc will convert just about anything to txt. Perhaps if you check in the crawler for file:// and then use pandoc to get the file as text which you can index, a lot more options open up. I suspect a lot of people have local files/content they want to add to apollo and this would give huge flexibility.

  • Apollo from command line & Chrome extension

    First of all, amazing project! Really clever way of sifting through search results like that. I was wondering if you have thought about a command-line client for apollo? I spend quite a lot of time in my terminal and have to context switch to a browser if I need something. Tmuxing into apollo would be really useful in my case, especially for code documentation or papers I've read.

    Might be material for another feature request / idea, but having a Chrome / Firefox extension to quickly pull webpages into the personal store for apollo as you read them would be very cool. I often read something, leave it in the 200 Chrome tabs I have open, and need to find it 2 days later, digging through my history. So far I've tried things like Workona to categorise tabs into projects, but it's tedious, and I can see Apollo being much less attention hungry than that.

  • Error when running server: could not create database for path

    I got the following error when running the server:

    $ go run cmd/apollo.go
    
    2021/07/29 16:14:58 Error, could not create database for path: ./data/sources.json with: open ./data/sources.json: no such file or directory
    exit status 1
    

    This is my file structure:

    .
    ├── LICENSE
    ├── README.md
    ├── cmd
    │   └── apollo.go
    ├── docs
    │   ├── Screen\ Shot\ 2021-07-25\ at\ 4.36.15\ PM.png
    │   ├── apollo.png
    │   └── architecture.png
    ├── go.mod
    ├── go.sum
    ├── pkg
    │   └── apollo
    │       ├── backend
    │       │   ├── api.go
    │       │   ├── searcher.go
    │       │   └── tokenizer.go
    │       ├── data
    │       ├── schema
    │       │   ├── crawler.go
    │       │   └── schema.go
    │       ├── server.go
    │       └── sources
    │           ├── athena.go
    │           ├── kindle.go
    │           ├── source.go
    │           ├── utils.go
    │           └── zeus.go
    ├── static
    │   ├── css
    │   │   └── stylesheet.css
    │   ├── img
    │   │   ├── about.png
    │   │   ├── add.png
    │   │   └── home.png
    │   ├── index.html
    │   └── js
    │       ├── main.js
    │       └── poseidon.min.js
    └── tests
        └── main_test.go
    
    13 directories, 27 files
    
  • Missing license

    Since this repo contains no LICENSE file and no license or copyright notice in any file, we must assume that it is copyright (c) 2021 Amir Bolous, All Rights Reserved. I would not recommend that anyone download, install, distribute, or contribute here until the licensing is resolved.

    This issue appears to affect all other repos owned by Bolous that I have checked.

  • "panic: runtime error: invalid memory address or nil pointer dereference" on macOS

    Following the instructions, I got to the point where all of the folders were created, the command was ready to run, etc... and this is what returned.

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x2 addr=0x8 pc=0x10005ade0]
    
    goroutine 1 [running]:
    reflect.mapiternext(0x10041eb60?)
    	/opt/homebrew/Cellar/go/1.18/libexec/src/runtime/map.go:1378 +0x20
    github.com/modern-go/reflect2.(*UnsafeMapIterator).UnsafeNext(0x0?)
    	/Users/kenneth/go/pkg/mod/github.com/modern-go/[email protected]/unsafe_map.go:136 +0x3c
    github.com/json-iterator/go.(*mapEncoder).Encode(0x140002692f0, 0x1400012e968, 0x14000130780)
    	/Users/kenneth/go/pkg/mod/github.com/json-iterator/[email protected]/reflect_map.go:262 +0x3bc
    github.com/json-iterator/go.(*onePtrEncoder).Encode(0x1400026b270, 0x14000269110, 0x14000278e80?)
    	/Users/kenneth/go/pkg/mod/github.com/json-iterator/[email protected]/reflect.go:219 +0x8c
    github.com/json-iterator/go.(*Stream).WriteVal(0x14000130780, {0x10041fb80, 0x14000269110})
    	/Users/kenneth/go/pkg/mod/github.com/json-iterator/[email protected]/reflect.go:98 +0x174
    github.com/json-iterator/go.(*Encoder).Encode(0x1400012e918, {0x10041fb80?, 0x14000269110?})
    	/Users/kenneth/go/pkg/mod/github.com/json-iterator/[email protected]/adapter.go:127 +0x34
    github.com/amirgamil/apollo/pkg/apollo/backend.writeIndexToDisk()
    	/Users/kenneth/git/apollo/pkg/apollo/backend/api.go:117 +0xec
    github.com/amirgamil/apollo/pkg/apollo/backend.RefreshInvertedIndex()
    	/Users/kenneth/git/apollo/pkg/apollo/backend/api.go:153 +0x1dc
    main.main()
    	/Users/kenneth/git/apollo/cmd/apollo.go:16 +0x24
    exit status 2
    

    Running on Golang version 1.18 built for ARM64. Thoughts?

  • feat(docker): made Apollo docker ready

    • Fixed little bugs regarding the password read and used viper instead
    • createFile will automatically create parent folders too now
    • The server won't kill itself when it fails to scrape something
    • Exposed the right port, set a default working dir, and moved in all the necessary files in order to make a minimal working image
    • The Docker container will be built without the kindle and podcast integrations

    You can try to build & play with the image with the following commands: docker build -t test/test . && docker run --rm -p8993:8993 -ePASSWORD=supersecurepassword test/test. I've used a multi-stage docker build so that the final image is still slim!

    I used viper instead of your library since this way you can either set the password via an environment variable or via the .env file. viper will transparently perform the variable override, so it gives more flexibility to the server while keeping the codebase simple!

    I hope this helps the project somehow :muscle:

  • Trying to get in touch regarding a security issue

    Hey there!

    I'd like to report a security issue but cannot find contact instructions on your repository.

    If not a hassle, might you kindly add a SECURITY.md file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.

    Thank you for your consideration, and I look forward to hearing from you!

    (cc @huntr-helper)

  • Add Dockerfile

    I tried to create a quick Dockerfile that includes youtube-dl. I also had to ignore errors from calling godotenv.Load() to be able to set the env variables from the docker-compose.yml

  • Added ApolloNIA - Chrome extension for Apollo

    PR: Chrome extension for Apollo - more in the submodule and at: ApolloNIA

    I've modified the server.go slightly to ensure CORS behaves with the extension (it should not affect normal operation of Apollo)

  • [enhancement request] take input from an RSS/Atom feed

    Can you support taking input from an RSS/Atom feed? Lots of sites have feeds, and some have data you want to remember. Plus, with rss-bridge, you can get an RSS feed from lots of sites and apps that don't have easy API access, and in a uniform way. For example, I listen to NPR Fresh Air. The page has an RSS feed, so I can add that to apollo and automatically have an index of the podcasts, at least from the summary/description provided.
