
Apollo 💎

A Unix-style personal search engine and web crawler for your digital footprint


Demo

apollodemo.mp4

Contents

Background
Thesis
Design
Architecture
Data Schema
Workflows
Document Storage
Shut up, how can I use it?
Notes
Future
Inspirations

Background

Apollo is a different type of search engine. Traditional search engines (like Google) are great for discovery, when you're trying to find the answer to a question but don't know exactly what you're looking for.

However, they're very poor at recall and synthesis, when you've seen something before on the internet somewhere but can't remember where. Trying to find it again becomes a nightmare: how can you synthesize the great material on the internet when you forgot where it even was? I've wasted many an hour combing through Google and my search history to look up a good article, blog post, or just something I've seen before.

Even with built-in systems to store some of my favorite articles, podcasts, and other stuff, I forget things all the time.

Thesis

Screw finding a needle in the haystack. Let's create a new type of search for choosing which gem you're looking for.

Apollo is a search engine and web crawler to digest your digital footprint. What this means is that you choose what to put in it. When you come across something that looks interesting, be it an article, blog post, website, whatever, you manually add it (with built-in systems to make doing so easy). If you always want to pull in data from a certain data source, like your notes or something else, you can do that too. This tackles one of the biggest problems of recall in search engines, which return a lot of irrelevant information, because with Apollo the signal-to-noise ratio is very high: you've chosen exactly what to put in it.

Apollo is not necessarily built for raw discovery (although it certainly supports rediscovery); it's built for knowledge compression and transformation, that is, looking up things that you've previously deemed to be cool.

Design

The first thing you might notice is that the design is reminiscent of the old digital computer age, back in the Unix days. This is intentional for many reasons. In addition to paying homage to the greats of the past, this design makes me feel like I'm searching through something that is authentically my own. When I search for stuff, I genuinely feel like I'm travelling through the past.

Architecture

[architecture diagram]

Apollo's client side is written in Poseidon. The client side interacts with the backend via a REST-like API, which provides endpoints for searching data and adding a new entry.

The backend is written in Go and is composed of a few important components:

  1. The web server which serves the endpoints
  2. A tokenizer and stemmer used during search queries and when building the inverted index on the data
  3. A simple web crawler for scraping links to articles/blog posts/YouTube videos
  4. The actual search engine, which takes a query, tokenizes and stems it, finds the relevant results from the inverted index using those stemmed tokens, then ranks the results with TF-IDF (a sketch of this flow follows the list)
  5. A package which pulls in data from a couple of different sources - if you want to pull data from a custom data source, this is where you should add it.
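
To make the search step concrete, here is a minimal sketch of how a query might flow through those pieces. It is an illustration only: the tokenizer, the toy stem function (the real project uses a ported snowball stemmer), and the index layout are assumptions, not the actual code in pkg/apollo/backend.

package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// index maps a stemmed token to the documents containing it and the token's
// frequency in each document (the same shape Record.TokenFrequency feeds).
type index map[string]map[string]int

// tokenize lowercases the text and splits it on non-letter characters.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return r < 'a' || r > 'z'
	})
}

// stem is a toy stand-in for the snowball stemmer the project actually uses.
func stem(tok string) string {
	for _, suffix := range []string{"ing", "ed", "s"} {
		if strings.HasSuffix(tok, suffix) && len(tok) > len(suffix)+2 {
			return strings.TrimSuffix(tok, suffix)
		}
	}
	return tok
}

// search scores every document that matches the query with TF-IDF and
// returns the document IDs, best match first.
func search(idx index, query string, totalDocs int) []string {
	scores := make(map[string]float64)
	for _, tok := range tokenize(query) {
		postings, ok := idx[stem(tok)]
		if !ok {
			continue
		}
		idf := math.Log(float64(totalDocs) / float64(len(postings)))
		for docID, tf := range postings {
			scores[docID] += float64(tf) * idf
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return scores[ids[i]] > scores[ids[j]] })
	return ids
}

func main() {
	idx := index{
		"search": {"doc1": 3, "doc2": 1},
		"engine": {"doc1": 2},
	}
	fmt.Println(search(idx, "searching engines", 2)) // [doc1 doc2]
}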

Data Schema

We use two schemas. The first parses the data into a standardized intermediate format. This does not get stored; it's purely an intermediate before we transform it into a record for our inverted index. Why is this important?

  • Because any data gets parsed into this standardized format, you can link any data source you want. If you build your own tool, or you store a lot of data in some existing one, you don't have to manually add everything: you can pull in data from any data source provided you give the API data in this format (the conversion into a record is sketched after the structs below).
type Data struct {
    title string //a title of the record, self-explanatory
    link string //links to the source of a record, e.g. a blog post, website, podcast etc.
    content string //actual content of the record, must be text data
    tags []string //list of potential high-level document tags you want to add that will be
                  //indexed in addition to the raw data contained 
}
//smallest unit of data that we store in the database
//this will store each "item" in our search engine with all of the necessary information
//for the inverted index
type Record struct {
	//unique identifier
	ID string `json:"id"`
	//title
	Title string `json:"title"`
	//potential link to the source if applicable
	Link string `json:"link"`
	//text content to display on results page
	Content string `json:"content"`
	//map of tokens to their frequency
	TokenFrequency map[string]int `json:"tokenFrequency"`
}
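
The conversion from the intermediate Data format into a stored Record then mostly amounts to assigning an ID and counting stemmed token frequencies. Here is a rough sketch, reusing the tokenize and stem helpers from the search sketch above; the toRecord helper name is hypothetical, not the repo's actual function.

// toRecord sketches the Data -> Record step: tokenize the title, content and
// tags, stem each token, and count frequencies so the inverted index can
// score this record against queries later.
func toRecord(id string, d Data) Record {
	freq := make(map[string]int)
	text := d.title + " " + d.content + " " + strings.Join(d.tags, " ")
	for _, tok := range tokenize(text) {
		freq[stem(tok)]++
	}
	return Record{
		ID:             id,
		Title:          d.title,
		Link:           d.link,
		Content:        d.content,
		TokenFrequency: freq,
	}
}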

Workflows

Data comes in many forms, and the more varied those forms are, the harder it is to write reliable software to deal with it. If everything I wanted to index was just stuff I wrote, life would be easy. All of my notes would probably live in one place, so I would just have to grab the data from that data source and chill.

The problem is I don't take a lot of notes and not everything I want to index is something I'd take notes of.

So what to do?

Apollo can't handle all types of data; it's not designed to. However, in building a search engine to index stuff, there are a couple of things I focused on:

  1. Any data that comes from a specific platform can be integrated. If you want to index all your Twitter data, for example, this is possible since all of the data can be absorbed in a consistent format, converted into the compatible Apollo format, and sent off. So data sources can be easily integrated; this is by design, in case I want to pull in data from personal tools.
  2. The harder thing is what about just, what I will call, "writing on the internet." I read a lot of stuff on the internet, much of which I'd like to be able to index, without necessarily having to take notes on everything I read, because I'm lazy. The dream would be to just drop a link and have Apollo intelligently try to fetch the content, so I can index it without having to go to the post and copy the content, which would be painful and too slow. This was a large motivation for the web crawler component of the project.
  • If it's writing on the internet, you should be able to post a link and have the rest filled in automatically
  • If it's a podcast episode or any YouTube video, download the text transcription, e.g. this
  • If you want to pull data from a custom data source, add it as a file in the pkg/apollo/sources folder, following the same rules as some of the examples, and make sure to add it in the GetData() method of the source.go file in this package (a sketch of such a source follows this list)
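
As a rough illustration of what a custom source file could look like, here is a hypothetical pkg/apollo/sources/mynotes.go. The file name, the notes JSON layout, and the getMyNotes function are all made up for the example; the only real requirement is returning entries in the schema.Data format keyed by a unique string, and then calling your function from GetData() in source.go. The sketch also assumes schema.Data exposes exported versions of the fields shown earlier.

package sources

import (
	"encoding/json"
	"os"

	"github.com/amirgamil/apollo/pkg/apollo/schema"
)

// note mirrors a hypothetical notes.json export from a personal tool.
type note struct {
	Title string   `json:"title"`
	URL   string   `json:"url"`
	Body  string   `json:"body"`
	Tags  []string `json:"tags"`
}

// getMyNotes reads notes.json and converts each note into the intermediate
// schema.Data format, keyed by a unique string.
func getMyNotes() map[string]schema.Data {
	results := make(map[string]schema.Data)
	raw, err := os.ReadFile("notes.json")
	if err != nil {
		return results // no notes file yet, nothing to sync
	}
	var notes []note
	if err := json.Unmarshal(raw, &notes); err != nil {
		return results
	}
	for _, n := range notes {
		results["mynotes-"+n.Title] = schema.Data{
			Title:   n.Title,
			Link:    n.URL,
			Content: n.Body,
			Tags:    n.Tags,
		}
	}
	return results
}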

Document storage

Local records and data from data sources are stored in separate JSON files. This is for convenience.

I also personally store my Kindle highlights as a JSON file - I use read.amazon.com and a readwise extension to download the exported highlights for a book. I put any new book JSON files in a kindle folder in the outer directory, and every time the inverted index is recomputed, the kindle integration takes any new book highlights, merges them into the main kindle.json file stored in the data folder, then deletes the old files.
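
A minimal sketch of that merge step, assuming one JSON object per book mapping the title to its highlights (the file layout and map shape are assumptions, not the repo's exact format):

package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// mergeKindleHighlights folds any new per-book JSON files from ./kindle into
// ./data/kindle.json and deletes the originals once they have been merged.
func mergeKindleHighlights() error {
	master := make(map[string][]string) // book title -> highlights
	if raw, err := os.ReadFile("data/kindle.json"); err == nil {
		json.Unmarshal(raw, &master)
	}
	entries, err := os.ReadDir("kindle")
	if err != nil {
		return err
	}
	for _, e := range entries {
		path := filepath.Join("kindle", e.Name())
		raw, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		book := make(map[string][]string)
		if err := json.Unmarshal(raw, &book); err != nil {
			continue
		}
		for title, highlights := range book {
			master[title] = append(master[title], highlights...)
		}
		os.Remove(path) // the old per-book file is deleted after merging
	}
	out, err := json.MarshalIndent(master, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile("data/kindle.json", out, 0644)
}

func main() {
	if err := mergeKindleHighlights(); err != nil {
		panic(err)
	}
}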

Shut up, how can I use it?

Although I built Apollo first and foremost for myself, I also wanted other people to be able to use it if they found it valuable. To use Apollo locally:

  1. Clone the repo: git clone ....
  2. Make sure you have Go installed, as well as youtube-dl, which is how we download the subtitles of a video. You can use this to install it.
  3. Navigate to the root directory of the project: cd apollo. Note that since Apollo syncs from some personal data sources, you'll want to remove them, add your own, or build stuff on top of them. Otherwise the terminal will complain if you attempt to run it, so:
  4. Navigate to pkg/apollo/sources in your preferred editor and replace the body of the GetData function with return make(map[string]schema.Data) (see the sketch after this list)
  5. Create a folder named data in the outer directory
  6. Create a .env file in the outermost directory (i.e. in the same directory as the README.md) and add PASSWORD=<password>, where <password> is whatever password you want. This is necessary for adding or scraping data: you'll want to "prove you're Amir", i.e. authenticate yourself, and then you won't need to do this in the future. If this is not making sense, try adding some data on apollo.amirbolous.com/add and see what happens.
  7. Go back to the outer directory (meaning you should see the files the way GitHub is displaying them right now) and run go run cmd/apollo.go in the terminal.
  8. Navigate to 127.0.0.1:8993 on your browser
  9. It should be working! You can add data and index data from the database. If you run into problems, open an issue or DM me on Twitter
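
For step 4, the replaced function should end up looking roughly like this. The exact signature in source.go may differ; the point is that the return type is map[string]schema.Data, so the body must return an empty map rather than a slice:

//in pkg/apollo/sources/source.go
func GetData() map[string]schema.Data {
	return make(map[string]schema.Data)
}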

A little more information on the Add Data section

  • In order to add data, you'll first need to authenticate yourself: enter your password once in the "Please prove you're Amir" box, and if you see a Hooray! popup, that means you were authenticated successfully. You only need to do this once, since we use localStorage to remember whether you've been authenticated.
  • In order to scrape a website, paste a link in the link textbox, then click on the scrape button. Note this does not add the website/content - you still need to click the add button if you want to save it. The web crawler works reliably most of the time if you're dealing with written content on a web page or a YouTube video. We use a Go-ported version of readability to scrape the main content from a page if it's written content, and youtube-dl to get the transcript of a video (a rough sketch follows this list). In the future, I'd like to make this web crawler more robust, but it works well enough most of the time for now.
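
A rough sketch of what the scrape step boils down to. The go-shiori/go-readability import is one Go port of readability and may not be the exact one this repo uses, and the youtube-dl flags shown are the standard subtitle options; treat both as assumptions rather than a description of the actual implementation.

package main

import (
	"fmt"
	"os/exec"
	"time"

	readability "github.com/go-shiori/go-readability"
)

// scrapeArticle pulls the title and main text content out of a written page.
func scrapeArticle(url string) (title, content string, err error) {
	article, err := readability.FromURL(url, 30*time.Second)
	if err != nil {
		return "", "", err
	}
	return article.Title, article.TextContent, nil
}

// fetchTranscript shells out to youtube-dl for a video's auto-generated
// English subtitles without downloading the video itself.
func fetchTranscript(url string) error {
	cmd := exec.Command("youtube-dl",
		"--write-auto-sub", "--sub-lang", "en",
		"--skip-download", "-o", "transcript", url)
	return cmd.Run()
}

func main() {
	if title, content, err := scrapeArticle("https://example.com/post"); err == nil {
		fmt.Println(title, len(content))
	}
}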

As a side note, although I want others to be able to use Apollo, this is not a "commercial product", so feel free to open a feature request if you'd like one, but it's unlikely I will get to it unless it becomes something I personally want to use.

Notes

  • The inverted index is regenerated once every n days (currently n = 3)
  • Since this is not a commercial product, I will not be running your version of this (if you find it useful) on my server. However, although I designed this first and foremost for myself, I want other people to be able to use it if they find it useful; refer to "Shut up, how can I use it?" above
  • I had the choice between using Go's gob package and JSON for the database/inverted index. The gob package is definitely faster, however it's only native to Go, so I decided to go with JSON to keep the data available for potentially any non-Go integrations in the future and to be able to switch the infrastructure completely if I want to (a sketch of the JSON persistence follows these notes)
  • I use a ported version of the snowball algorithm in Go for my stemmer. Although I would have liked to build my own stemmer, implementing a robust one (which is what I wanted) was not the focus of the project. Since the algorithm for a stemmer does not need to be maintained like other types of software, I decided to use one out of the box. If I write my own in the future, I'll swap it out.
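
As a sketch of what that JSON choice buys, persisting the index is just encoding/json over the token maps, and any non-Go tool can read the file back. The path and the index shape here are assumptions for illustration:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// invertedIndex maps a stemmed token to the IDs of records containing it and
// the token's frequency in each record.
type invertedIndex map[string]map[string]int

// saveIndex writes the index to disk as plain JSON, readable from any language.
func saveIndex(idx invertedIndex, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(idx)
}

// loadIndex reads the index back from disk.
func loadIndex(path string) (invertedIndex, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	idx := make(invertedIndex)
	err = json.NewDecoder(f).Decode(&idx)
	return idx, err
}

func main() {
	idx := invertedIndex{"apollo": {"doc1": 2}}
	if err := saveIndex(idx, "index.json"); err == nil {
		loaded, _ := loadIndex("index.json")
		fmt.Println(loaded)
	}
}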

Future

  • Improve the search algorithm, perhaps something more like Elasticsearch if the data grows a lot?
  • Improve the web crawler - make it more robust like mercury parser, maybe write my own
  • Speed up search

Inspirations

Owner
Amir Bolous
Hacker, maker, and professional noob
Comments
  • error in the instructions or not clear

    I don't know Go but followed the instructions and replaced the content of the GetData function with return []schema.Data{} after deleting all other content. The function has just that one line, and when I run it I get:

    cant use []schema.Data{} (type []schema.Data) as type map[string]schema.Data in return argument

    Clearly I shouldn't have deleted everything, but I'm not sure what to do, not knowing Go.

  • Error when running server

    When I run the server as described in the readme, I get the following errors:

    $ go run cmd/apollo.go
    
    # github.com/amirgamil/apollo/pkg/apollo/sources
    pkg/apollo/sources/kindle.go:50:2: undefined: ensureFileExists
    pkg/apollo/sources/kindle.go:66:3: undefined: deleteFiles
    pkg/apollo/sources/kindle.go:121:11: undefined: getFilesInFolder
    
  • [suggestion/enhancement] Use pandoc

    You use youtube-dl to get subtitles for videos. For local files and things like PDFs and other formats that might get downloaded, pandoc will convert just about anything to txt. Perhaps if you check in the crawler for file:// and then use pandoc to get the file as text which you can index, a lot more options open up. I suspect a lot of people have local files/content they want to add to apollo and this would give huge flexibility.

  • Apollo from command line & Chrome extension

    First of all, amazing project! Really clever way of sifting through search results like that. I was wondering if you have thought about a command-line client for apollo? I spend quite a lot of time in my terminal and have to context switch to a browser if I need something. Tmuxing into apollo would be really useful in my case, especially for code documentation or papers I've read.

    Might be material for another feature request / idea, but having a Chrome / Firefox extension to quickly pull webpages into the personal store for apollo as you read them would be very cool. I often read something, leave it in the 200 Chrome tabs I have open, and need to find it 2 days later, digging through my history. So far I've tried things like Workona to categorise tabs into projects, but it's tedious, and I can see Apollo being much less attention hungry than that.

  • Error when running server: could not create database for path

    I got the following error when running the server:

    $ go run cmd/apollo.go
    
    2021/07/29 16:14:58 Error, could not create database for path: ./data/sources.json with: open ./data/sources.json: no such file or directory
    exit status 1
    

    This is my file structure:

    .
    ├── LICENSE
    ├── README.md
    ├── cmd
    │   └── apollo.go
    ├── docs
    │   ├── Screen\ Shot\ 2021-07-25\ at\ 4.36.15\ PM.png
    │   ├── apollo.png
    │   └── architecture.png
    ├── go.mod
    ├── go.sum
    ├── pkg
    │   └── apollo
    │       ├── backend
    │       │   ├── api.go
    │       │   ├── searcher.go
    │       │   └── tokenizer.go
    │       ├── data
    │       ├── schema
    │       │   ├── crawler.go
    │       │   └── schema.go
    │       ├── server.go
    │       └── sources
    │           ├── athena.go
    │           ├── kindle.go
    │           ├── source.go
    │           ├── utils.go
    │           └── zeus.go
    ├── static
    │   ├── css
    │   │   └── stylesheet.css
    │   ├── img
    │   │   ├── about.png
    │   │   ├── add.png
    │   │   └── home.png
    │   ├── index.html
    │   └── js
    │       ├── main.js
    │       └── poseidon.min.js
    └── tests
        └── main_test.go
    
    13 directories, 27 files
    
  • Missing license

    Since this repo contains no LICENSE file and no license or copyright notice in any file, we must assume that it is copyright (c) 2021 Amir Bolous, All Rights Reserved. I would not recommend that anyone download, install, distribute, or contribute here until the licensing is resolved.

    This issue appears to affect all other repos owned by Bolous that I have checked.

  • "panic: runtime error: invalid memory address or nil pointer dereference" on macOS

    Following the instructions, I got to the point where all of the folders were created, the command was ready to run, etc... and this is what returned.

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x2 addr=0x8 pc=0x10005ade0]
    
    goroutine 1 [running]:
    reflect.mapiternext(0x10041eb60?)
    	/opt/homebrew/Cellar/go/1.18/libexec/src/runtime/map.go:1378 +0x20
    github.com/modern-go/reflect2.(*UnsafeMapIterator).UnsafeNext(0x0?)
    	/Users/kenneth/go/pkg/mod/github.com/modern-go/[email protected]/unsafe_map.go:136 +0x3c
    github.com/json-iterator/go.(*mapEncoder).Encode(0x140002692f0, 0x1400012e968, 0x14000130780)
    	/Users/kenneth/go/pkg/mod/github.com/json-iterator/[email protected]/reflect_map.go:262 +0x3bc
    github.com/json-iterator/go.(*onePtrEncoder).Encode(0x1400026b270, 0x14000269110, 0x14000278e80?)
    	/Users/kenneth/go/pkg/mod/github.com/json-iterator/[email protected]/reflect.go:219 +0x8c
    github.com/json-iterator/go.(*Stream).WriteVal(0x14000130780, {0x10041fb80, 0x14000269110})
    	/Users/kenneth/go/pkg/mod/github.com/json-iterator/[email protected]/reflect.go:98 +0x174
    github.com/json-iterator/go.(*Encoder).Encode(0x1400012e918, {0x10041fb80?, 0x14000269110?})
    	/Users/kenneth/go/pkg/mod/github.com/json-iterator/[email protected]/adapter.go:127 +0x34
    github.com/amirgamil/apollo/pkg/apollo/backend.writeIndexToDisk()
    	/Users/kenneth/git/apollo/pkg/apollo/backend/api.go:117 +0xec
    github.com/amirgamil/apollo/pkg/apollo/backend.RefreshInvertedIndex()
    	/Users/kenneth/git/apollo/pkg/apollo/backend/api.go:153 +0x1dc
    main.main()
    	/Users/kenneth/git/apollo/cmd/apollo.go:16 +0x24
    exit status 2
    

    Running on Golang version 1.18 built for ARM64. Thoughts?

  • feat(docker): made Apollo docker ready

    • Fixed little bugs regarding the password read and used viper instead
    • createFile will automatically create parent folders too now
    • The server won't kill itself when it fails to scrape something
    • Exposed the right port, set a default working dir, and moved in all the necessary files in order to make a minimal working image
    • The Docker container will be built without the kindle and podcast integrations

    You can try to build & play with the image with the following commands: docker build -t test/test . && docker run --rm -p8993:8993 -ePASSWORD=supersecurepassword test/test. I've used a multi-stage docker build so that the final image is still slim!

    I used viper instead of your library since this way you can either set the password via an environment variable or via the .env file. viper will transparently perform the variable override, so it gives more flexibility to the server while keeping the codebase simple!

    I hope this helps the project somehow :muscle:

  • Trying to get in touch regarding a security issue

    Hey there!

    I'd like to report a security issue but cannot find contact instructions on your repository.

    If not a hassle, might you kindly add a SECURITY.md file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.

    Thank you for your consideration, and I look forward to hearing from you!

    (cc @huntr-helper)

  • Add Dockerfile

    I tried to create a quick Dockerfile that includes youtube-dl. I also had to ignore errors from calling godotenv.Load() to be able to set the env variables from the docker-compose.yml

  • Added ApolloNIA - Chrome extension for Apollo

    PR: Chrome extension for Apollo - more in the submodule and at: ApolloNIA

    I've modified the server.go slightly to ensure CORS behaves with the extension (it should not affect normal operation of Apollo)

  • [enhancement request] take input from an RSS/Atom feed

    Can you support taking input from an RSS/Atom feed? Lots of sites have feeds, and some have data you want to remember. Plus, with rss-bridge, you can get an RSS feed from lots of sites and apps that don't have easy API access, and in a uniform way. For example, I listen to NPR Fresh Air. The page has an RSS feed, so I can add that to apollo and automatically have an index of the podcasts, at least from the summary/description provided.
