Lieu

an alternative search engine

Created in response to the environs of apathy concerning the use of hypertext search and discovery. In Lieu, the internet is not what is made searchable, but instead one's own neighbourhood. Put differently, Lieu is a neighbourhood search engine, a way for personal webrings to increase serendipitous connexions.

Goals

  • Enable serendipitous discovery
  • Support personal communities
  • Be reusable, easily

Usage

$ lieu help
Lieu: neighbourhood search engine

Commands
- precrawl  (scrapes config's general.url for a list of links: <li> elements containing an anchor tag)
- crawl     (start crawler, crawls all urls in config's crawler.webring file)
- ingest    (ingest crawled data, generates database)
- search    (interactive cli for searching the database)
- host      (hosts search engine over http)

Example:
    lieu precrawl > data/webring.txt
    lieu ingest
    lieu host

Lieu's crawl & precrawl commands output to standard output, for easy inspection of the data. You typically want to redirect their output to the files Lieu reads from, as defined in the config file. See below for a typical workflow.

    Workflow

    • Edit the config
    • Add domains to crawl in config.crawler.webring
      • If you have a webpage with links you want to crawl:
        • Set the config's url field to that page
        • Populate the list of domains to crawl with precrawl: lieu precrawl > data/webring.txt
    • Crawl: lieu crawl > data/crawled.txt
    • Create database: lieu ingest
    • Host engine: lieu host

    After ingesting the data with lieu ingest, you can also use lieu to search the corpus in the terminal with lieu search.
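
    In full, a typical first run looks like this (a sketch of one session, using the file names from the config below):

        lieu precrawl > data/webring.txt
        lieu crawl > data/crawled.txt
        lieu ingest
        lieu host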

    Config

    The config file is written in TOML.

    [general]
    name = "Merveilles Webring"
    # used by the precrawl command and linked to in /about route
    url = "https://webring.xxiivv.com"
    port = 10001
    
    [data]
    # the source file should contain the crawl command's output 
    source = "data/crawled.txt"
    # location & name of the sqlite database
    database = "data/searchengine.db"
    # contains words and phrases disqualifying scraped paragraphs from being presented in search results
    heuristics = "data/heuristics.txt"
    # aka stopwords, in the search engine biz: https://en.wikipedia.org/wiki/Stop_word
    wordlist = "data/wordlist.txt"
    
    [crawler]
    # manually curated list of domains, or the output of the precrawl command
    webring = "data/webring.txt"
    # domains that are banned from being crawled but might originally be part of the webring
    bannedDomains = "data/banned-domains.txt"
    # file suffixes that are banned from being crawled
    bannedSuffixes = "data/banned-suffixes.txt"
    # phrases and words which won't be scraped (e.g. if contained in a link)
    boringWords = "data/boring-words.txt"
    # domains that won't be output as outgoing links
    boringDomains = "data/boring-domains.txt"

    For your own use, the following config fields should be customized:

    • name
    • url
    • port
    • source
    • webring
    • bannedDomains
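
    As a sketch, a customized config might override just those fields; every value below is a placeholder, not a default:

    [general]
    name = "My Webring"
    url = "https://example.com"
    port = 8080

    [data]
    source = "data/crawled.txt"

    [crawler]
    webring = "data/webring.txt"
    bannedDomains = "data/banned-domains.txt"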

    The following config-defined files can stay as-is unless you have specific requirements:

    • database
    • heuristics
    • wordlist
    • bannedSuffixes

    For a full rundown of the files and their various jobs, see the file descriptions.

    License

    The source code is licensed under AGPL-3.0-or-later. Inter is available under the SIL Open Font License Version 1.1, and Noto Serif under the Apache License, Version 2.0.

    Comments
    • HTML + CSS + overhaul performance

      TLDR:

      • Design is the same except where accessibility mattered most
      • HTML was heavily tweaked in some places for accessibility and semantic reasons, but the design stays the same
      • CSS was remade from scratch, but the old files are in the "css_old" folder
      • Loading performance was improved by using woff2 instead of ttf

      Notable changes:

      • A reset was added to limit the number of basic CSS fixes or add-ons needed. It's placed inside the <head>.
      • For accessibility reasons, an input MUST have a label, so a label was added to the search input. Not very aesthetic, I know, but well...
      • For accessibility reasons, each page MUST start with at least an h1. One was added on some pages; in other places, h2 elements were converted to plain text.
      • The CSS loosely follows the CUBE CSS methodology and could maybe use some cleaning.
      • Entries are now lists for accessibility reasons: screen readers announce the number of elements inside a list when they enter it, and shortcuts help jump from one entry to the next.

      The commit history can broadly tell my thought process when I coded this. If you need tweaks or corrections, please ask.

    • Allows the configuration of a proxy for the HTTP client and the Colly HTTP client; allows extracting precrawl links without matching the pattern

      This PR adds a configuration option which allows a user to configure an http:// or socks:// proxy in lieu.toml, and an additional option for extracting precrawl links from websites that don't present them in the form of <li><a></a></li>. My intent with this is to build a plugin for the I2P network which enhances the user's ability to search for sites by sharing the task of crawling sites and making it easy to run a small search engine. Simply setting the http_proxy environment variable does not appear to be sufficient due to DNS leakage; the proxy must be configured by replacing the default transport.
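
      A minimal sketch of the transport-replacement approach described here (not the PR's actual code; the proxy address is a placeholder and golang.org/x/net/proxy is assumed):

      package main

      import (
          "fmt"
          "net/http"

          "golang.org/x/net/proxy" // SOCKS5 dialer
      )

      func main() {
          // Placeholder proxy address. Hostnames are resolved by the
          // proxy itself, which is what avoids the DNS leakage above.
          dialer, err := proxy.SOCKS5("tcp", "127.0.0.1:9050", nil, proxy.Direct)
          if err != nil {
              panic(err)
          }
          client := &http.Client{
              Transport: &http.Transport{
                  // Replace the default transport's dialer so every
                  // connection goes through the proxy.
                  Dial: dialer.Dial,
              },
          }
          resp, err := client.Get("https://example.com")
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()
          fmt.Println(resp.Status)
      }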

    • Light theme wanted

      Hello, a light theme is needed. I tried to fork and modify the project but had some problems (on my side, probably) related to sqlite. I don't want to deal with them currently, so I'm leaving the task to you :-)

      You probably need to add this to base.css:

      @media screen and (prefers-color-scheme: light) {
          :root {
              --primary: #000;
              --secondary: #fefefe;
          }
      }
      

      It will basically swap colors everywhere, except for the search button. You have hard-coded colors in logo.svg; there are two ways to swap them:

      1. Provide two versions of the file and serve them depending on the current theme.
      2. Change the SVG colors with CSS. Look here or somewhere else maybe.

      Thanks

    • Documented the theming part a bit better

      Added some comments that explain a bit better what is happening and added checks for the other fields to prevent the foreground without background issue.

    • Optimized favicons a bit

      • Added an SVG favicon, slightly modified to make it easier to recognize at small scales
      • Fixed the problem that some browsers don't like non-square favicons (by making the favicon square)
      • Regenerated optimized versions in the ICO and PNG formats

      This version of the favicon is already used by my test instance of lieu over at: https://fediring-lieu.slatecave.net/

    • Feature: advanced search with rank selector

      Adds support for some selectors needed for more advanced searches:

      • -site: to exclude sites
      • lang: to only search for pages that have a given language tag prefix
      • rank:count and rank:score to choose a ranking algorithm (also exposed as url parameter)

      Note: #13 is the same PR without the rank selector
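
      A rough sketch of how such selector tokens could be split out of a query (the helper and its behaviour are assumptions for illustration, not the PR's code):

      package main

      import (
          "fmt"
          "strings"
      )

      // parseQuery splits a raw query into plain search terms and the
      // selector values described above (hypothetical helper).
      func parseQuery(q string) (terms, excludedSites, langs []string) {
          for _, tok := range strings.Fields(q) {
              switch {
              case strings.HasPrefix(tok, "-site:"):
                  excludedSites = append(excludedSites, strings.TrimPrefix(tok, "-site:"))
              case strings.HasPrefix(tok, "lang:"):
                  langs = append(langs, strings.TrimPrefix(tok, "lang:"))
              default:
                  terms = append(terms, tok)
              }
          }
          return
      }

      func main() {
          terms, sites, langs := parseQuery("pizza -site:example.com lang:en")
          fmt.Println(terms, sites, langs) // [pizza] [example.com] [en]
      }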

    • Feature: advanced search

      Adds support for some selectors needed for more advanced searches:

      • -site: to exclude sites
      • lang: to only search for pages that have a given language tag prefix

      Note: #14 is a version of this PR that includes a rank: tag that allows one to choose the ranking algorithm

    • Megamerge: Improving previews and finding of paragraphs

      • Made the paragraph scraping a lot more flexible
      • Scraper now collects opengraph descriptions
      • Ingest now considers opengraph descriptions as preview text
      • Fixed: ingest not being able to handle lots of words per batch
      • Fixed: URIs are case sensitive, don't lowercase them
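
      A minimal sketch of collecting an opengraph description with Colly, mirroring what this PR describes (the import path and selector are assumptions, not Lieu's actual code):

      package main

      import (
          "fmt"

          "github.com/gocolly/colly/v2"
      )

      func main() {
          c := colly.NewCollector()
          // Collect the opengraph description so ingest can prefer it
          // as the page's preview text.
          c.OnHTML(`meta[property="og:description"]`, func(e *colly.HTMLElement) {
              fmt.Println(e.Attr("content"))
          })
          c.Visit("https://example.com") // placeholder URL
      }
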
    • Crawler indexes all pages on a particular domain rather than pages under a path

      When running Lieu over all the sites in the fediring, we've found that it's only bound by domain rather than domain+path. This causes quirks with static site hosts like cronut.cafe; the only cronut.cafe user who's also a member of the ring is ~sfr, but multiple other users who aren't members have been indexed as well: https://search.fediring.net/?q=cronut

      I think a good solution might be keeping track of not only the domain that's being crawled but also the original URL and ignoring links to parent directories.
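
      A sketch of that suggested scoping rule (the inScope helper is hypothetical, not Lieu's code):

      package main

      import (
          "fmt"
          "net/url"
          "strings"
      )

      // inScope reports whether candidate stays on the seed's host and
      // under the seed's path, so links to parent directories are skipped.
      func inScope(seed, candidate *url.URL) bool {
          if candidate.Host != seed.Host {
              return false
          }
          return strings.HasPrefix(candidate.Path, seed.Path)
      }

      func main() {
          seed, _ := url.Parse("https://cronut.cafe/~sfr/")
          ok, _ := url.Parse("https://cronut.cafe/~sfr/posts/")
          bad, _ := url.Parse("https://cronut.cafe/~other/")
          fmt.Println(inScope(seed, ok), inScope(seed, bad)) // true false
      }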

    • Add opensearch metadata for browser integration

      This allows Lieu to be discovered as a search engine by browsers, which then can be set as the default search engine for example.

      Looks like this in Firefox: (screenshot)

    • Path-enhanced crawling

      Webring sites passed to Lieu which end with a path, e.g. https://example.com/site/lupin, will now only have their child pages crawled (as opposed to allowing all pages of example.com to be crawled).

      This falls more in line with expectations for webring sites which might exist on shared hosting, or just sites which have separate areas that should not be crawled.

      Thanks @amolith for the issue!

    • Improve web UI

      Includes the following patches:

      • .View variable in templates (to make some deduplication possible)
      • Deduplicated the navigation and search-form
      • The tagline is now a paragraph instead of an h2

      NOT included are:

      • a 404 page; that patch inserts code in places changed by #13 and #14, so I'll submit it once one of those has been merged or rejected
      • the patch introducing error messages for the case when no results are available, which also depends on either #13 or #14
    • Missing 404 page

      Currently Lieu treats every page that is not a special route like its home page. This is not ideal, since all those other pages should return a proper 404 status code and have their own template (which doesn't stop them from having a search form, though …)
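
      The usual net/http pattern for this, as a sketch (Lieu's actual routing may differ; the port is taken from the example config above):

      package main

      import "net/http"

      func main() {
          http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
              // The "/" pattern is a catch-all, so unknown paths land
              // here too; answer them with a real 404 instead of the
              // home page.
              if r.URL.Path != "/" {
                  http.NotFound(w, r)
                  return
              }
              // ... render the search home page
          })
          http.ListenAndServe(":10001", nil)
      }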

    • crawl-delay

      Lieu crawls fast; it seems to crawl at a rate of 5 req/sec by default.

      Normally, this wouldn't be too worrying; however, the types of sites Lieu crawls are small hobbyist sites, and some people's sites may be self-hosted on limited hardware.

      Respecting a crawl-delay robots.txt directive could help avoid overwhelming smaller sites.
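
      Colly, which Lieu's crawler uses, can already throttle per domain; a sketch with a fixed delay standing in for a parsed Crawl-delay value (import path assumed):

      package main

      import (
          "log"
          "time"

          "github.com/gocolly/colly/v2"
      )

      func main() {
          c := colly.NewCollector()
          // A fixed two-second delay as a stand-in; honouring robots.txt
          // would mean parsing each site's Crawl-delay directive and
          // setting Delay per domain instead.
          if err := c.Limit(&colly.LimitRule{
              DomainGlob: "*",             // apply to every crawled domain
              Delay:      2 * time.Second, // pause between requests per domain
          }); err != nil {
              log.Fatal(err)
          }
          c.Visit("https://example.com") // placeholder URL
      }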

    • Handle query argument type of link style like "?post=xxx"

      Hi there administrators! My blog in the webring uses a link format of index.php?post=20220612001238, like this; however, it looks to me that the crawler doesn't like query arguments in the url: https://github.com/cblgh/lieu/blob/b0ad7dce102d35123bb0092527b7ceea6df8ad86/crawler/crawler.go#L51

      Can there be a way for sites to hint that they may want to use ? or # separated URLs? From what I know, MDWiki is quite popular and it uses #! to specify page links, so that way we could index more pages for these sites as well.

      I can see where # could pose some problems with title links... I'd suggest allowing a <meta> or some sort of tag in the page head to hint to the crawler that some link formats are allowed; if the href regex matches the "allowed link format", the link will be preserved?

      Thanks!
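
      A sketch of the proposed check (allowedLinkPattern and keepLink are hypothetical names; Lieu has no such option today):

      package main

      import (
          "fmt"
          "regexp"
          "strings"
      )

      // A pattern the site could advertise, e.g. via the proposed
      // <meta> hint (hypothetical; hard-coded here for illustration).
      var allowedLinkPattern = regexp.MustCompile(`\?post=[0-9]+$`)

      // keepLink keeps plain paths, and keeps query/fragment links only
      // when they match the advertised pattern.
      func keepLink(href string) bool {
          if !strings.ContainsAny(href, "?#") {
              return true
          }
          return allowedLinkPattern.MatchString(href)
      }

      func main() {
          fmt.Println(keepLink("/about.html"))                   // true
          fmt.Println(keepLink("index.php?post=20220612001238")) // true
          fmt.Println(keepLink("index.php?page=2"))              // false
      }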

    • Potential sadness if you find your domain in "boringDomains" config

      :wave: r a d project :partying_face:

      As an admin, it feels a bit uncomfortable putting domains into a config with such a name. The functionality is great & relevant, but in a pubnix / shared-server environment, other users might get the wrong idea from the naming, when I'm really just trying to focus my search space on relevant links, not saying I think their stuff is "boring".

      Proposal: skipDomains.

    • pizza finds .pizza domain

      I noticed that a search for pizza just shows stuff that is on a site with a .pizza domain. I don't know if this is preventing actual pizza content from turning up or if there isn't any, but either way, the engine could be more useful if it was aware that "pizza" is just part of the domain.
