Lieu

an alternative search engine

Created in response to the environs of apathy concerning the use of hypertext search and discovery. In Lieu, the internet is not what is made searchable, but instead one's own neighbourhood. Put differently, Lieu is a neighbourhood search engine, a way for personal webrings to increase serendipitous connexions.

Goals

  • Enable serendipitous discovery
  • Support personal communities
  • Be reusable, easily

Usage

$ lieu help
Lieu: neighbourhood search engine

Commands
- precrawl  (scrapes config's general.url for a list of links: <li> elements containing an anchor tag)
- crawl     (start crawler, crawls all urls in config's crawler.webring file)
- ingest    (ingest crawled data, generates database)
- search    (interactive cli for searching the database)
- host      (hosts search engine over http)

Example:
    lieu precrawl > data/webring.txt
    lieu ingest
    lieu host

Lieu's crawl & precrawl commands output to standard output, for easy inspection of the data. You typically want to redirect their output to the files Lieu reads from, as defined in the config file. See below for a typical workflow.

    Workflow

    • Edit the config
    • Add domains to crawl in config.crawler.webring
      • If you have a webpage with links you want to crawl:
        • Set the config's url field to that page
        • Populate the list of domains to crawl with precrawl: lieu precrawl > data/webring.txt
    • Crawl: lieu crawl > data/crawled.txt
    • Create database: lieu ingest
    • Host engine: lieu host

    After ingesting the data with lieu ingest, you can also use lieu to search the corpus in the terminal with lieu search.
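
    In full, a typical first run looks like this (a sketch of one session, using the file names from the config below):

        lieu precrawl > data/webring.txt
        lieu crawl > data/crawled.txt
        lieu ingest
        lieu host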

    Config

    The config file is written in TOML.

    [general]
    name = "Merveilles Webring"
    # used by the precrawl command and linked to in /about route
    url = "https://webring.xxiivv.com"
    port = 10001
    
    [data]
    # the source file should contain the crawl command's output 
    source = "data/crawled.txt"
    # location & name of the sqlite database
    database = "data/searchengine.db"
    # contains words and phrases disqualifying scraped paragraphs from being presented in search results
    heuristics = "data/heuristics.txt"
    # aka stopwords, in the search engine biz: https://en.wikipedia.org/wiki/Stop_word
    wordlist = "data/wordlist.txt"
    
    [crawler]
    # manually curated list of domains, or the output of the precrawl command
    webring = "data/webring.txt"
    # domains that are banned from being crawled but might originally be part of the webring
    bannedDomains = "data/banned-domains.txt"
    # file suffixes that are banned from being crawled
    bannedSuffixes = "data/banned-suffixes.txt"
    # phrases and words which won't be scraped (e.g. if contained in a link)
    boringWords = "data/boring-words.txt"
    # domains that won't be output as outgoing links
    boringDomains = "data/boring-domains.txt"

    For your own use, the following config fields should be customized:

    • name
    • url
    • port
    • source
    • webring
    • bannedDomains
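
    As a sketch, a customized config might override just those fields; every value below is a placeholder, not a default:

    [general]
    name = "My Webring"
    url = "https://example.com"
    port = 8080

    [data]
    source = "data/crawled.txt"

    [crawler]
    webring = "data/webring.txt"
    bannedDomains = "data/banned-domains.txt"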

    The following config-defined files can stay as-is unless you have specific requirements:

    • database
    • heuristics
    • wordlist
    • bannedSuffixes

    For a full rundown of the files and their various jobs, see the file descriptions.

    License

    The source code is licensed under AGPL-3.0-or-later. Inter is available under the SIL Open Font License Version 1.1, and Noto Serif under the Apache License, Version 2.0.

    Comments
    • HTML + CSS + overhaul performance

      TLDR:

      • Design is the same except where accessibility mattered most
      • HTML was heavily tweaked in some places for accessibility and semantic reasons, but the design stays the same
      • CSS was remade from scratch, but the old files are in the "css_old" folder
      • Loading performance was improved by using woff2 instead of ttf

      Notable changes:

      • A reset was added to limit the number of basic CSS fixes or add-ons needed. It's placed inside the <head>.
      • For accessibility reasons, an input MUST have a label, so a label was added to the search input. Not very aesthetic, I know, but well...
      • For accessibility reasons, each page MUST start with at least an h1. One was added on some pages; in other places, h2 elements were converted to plain text.
      • The CSS loosely follows the CUBE CSS methodology and could maybe use some cleaning.
      • Entries are now lists for accessibility reasons: screen readers announce the number of elements inside a list when they enter it, and shortcuts help jump from one entry to the next.

      The commit history can broadly tell my thought process when I coded this. If you need tweaks or corrections, please ask.

    • Allows the configuration of a proxy for the HTTP client and the Colly HTTP client; allows extracting precrawl links without matching the pattern

      This PR adds a configuration option which allows a user to configure an http:// or socks:// proxy in lieu.toml, and an additional option for extracting precrawl links from websites that don't present them in the form of <li><a></a></li>. My intent with this is to build a plugin for the I2P network which enhances the user's ability to search for sites by sharing the task of crawling sites and making it easy to run a small search engine. Simply setting the http_proxy environment variable does not appear to be sufficient due to DNS leakage; the proxy must be configured by replacing the default transport.
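
      A minimal sketch of the transport-replacement approach described here (not the PR's actual code; the proxy address is a placeholder and golang.org/x/net/proxy is assumed):

      package main

      import (
          "fmt"
          "net/http"

          "golang.org/x/net/proxy" // SOCKS5 dialer
      )

      func main() {
          // Placeholder proxy address. Hostnames are resolved by the
          // proxy itself, which is what avoids the DNS leakage above.
          dialer, err := proxy.SOCKS5("tcp", "127.0.0.1:9050", nil, proxy.Direct)
          if err != nil {
              panic(err)
          }
          client := &http.Client{
              Transport: &http.Transport{
                  // Replace the default transport's dialer so every
                  // connection goes through the proxy.
                  Dial: dialer.Dial,
              },
          }
          resp, err := client.Get("https://example.com")
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()
          fmt.Println(resp.Status)
      }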

    • Light theme wanted

      Hello, a light theme is needed. I tried to fork and modify the project but had some problems (on my side, probably) related to sqlite. I don't want to deal with them currently, so I'm leaving the task to you :-)

      You probably need to add this to base.css:

      @media screen and (prefers-color-scheme: light) {
          :root {
              --primary: #000;
              --secondary: #fefefe;
          }
      }
      

      It will basically swap colors everywhere, except for the search button. You have hard-coded colors in logo.svg; there are two ways to swap them:

      1. Provide two versions of the file and serve them depending on the current theme.
      2. Change the SVG colors with CSS. Look here or somewhere else maybe.

      Thanks

    • Documented the theming part a bit better

      Added some comments that explain a bit better what is happening and added checks for the other fields to prevent the foreground without background issue.

    • Optimized favicons a bit

      • Added an SVG favicon, slightly modified to make it easier to recognize at small scales
      • Fixed the problem that some browsers don't like non-square favicons (by making the favicon square)
      • Regenerated optimized versions in the ICO and PNG formats

      This version of the favicon is already used by my test instance of lieu over at: https://fediring-lieu.slatecave.net/

    • Feature: advanced search with rank selector

      Adds support for some selectors needed for more advanced searches:

      • -site: to exclude sites
      • lang: to only search for pages that have a given language tag prefix
      • rank:count and rank:score to choose a ranking algorithm (also exposed as url parameter)

      Note: #13 is the same PR without the rank selector
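
      A rough sketch of how such selector tokens could be split out of a query (the helper and its behaviour are assumptions for illustration, not the PR's code):

      package main

      import (
          "fmt"
          "strings"
      )

      // parseQuery splits a raw query into plain search terms and the
      // selector values described above (hypothetical helper).
      func parseQuery(q string) (terms, excludedSites, langs []string) {
          for _, tok := range strings.Fields(q) {
              switch {
              case strings.HasPrefix(tok, "-site:"):
                  excludedSites = append(excludedSites, strings.TrimPrefix(tok, "-site:"))
              case strings.HasPrefix(tok, "lang:"):
                  langs = append(langs, strings.TrimPrefix(tok, "lang:"))
              default:
                  terms = append(terms, tok)
              }
          }
          return
      }

      func main() {
          terms, sites, langs := parseQuery("pizza -site:example.com lang:en")
          fmt.Println(terms, sites, langs) // [pizza] [example.com] [en]
      }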

    • Feature: advanced search

      Adds support for some selectors needed for more advanced searches:

      • -site: to exclude sites
      • lang: to only search for pages that have a given language tag prefix

      Note: #14 is a version of this PR that includes a rank: tag that allows one to choose the ranking algorithm

    • Megamerge: Improving previews and finding of paragraphs

      • Made the paragraph scraping a lot more flexible
      • Scraper now collects opengraph descriptions
      • Ingest now considers opengraph descriptions as preview text
      • Fixed: ingest not being able to handle lots of words per batch
      • Fixed: URIs are case sensitive, don't lowercase them
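
      A minimal sketch of collecting an opengraph description with Colly, mirroring what this PR describes (the import path and selector are assumptions, not Lieu's actual code):

      package main

      import (
          "fmt"

          "github.com/gocolly/colly/v2"
      )

      func main() {
          c := colly.NewCollector()
          // Collect the opengraph description so ingest can prefer it
          // as the page's preview text.
          c.OnHTML(`meta[property="og:description"]`, func(e *colly.HTMLElement) {
              fmt.Println(e.Attr("content"))
          })
          c.Visit("https://example.com") // placeholder URL
      }
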
    • Crawler indexes all pages on a particular domain rather than pages under a path

      When running Lieu over all the sites in the fediring, we've found that it's only bound by domain rather than domain+path. This causes quirks with static site hosts like cronut.cafe; the only cronut.cafe user who's also a member of the ring is ~sfr, but multiple other users who aren't members have been indexed as well: https://search.fediring.net/?q=cronut

      I think a good solution might be keeping track of not only the domain that's being crawled but also the original URL and ignoring links to parent directories.
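
      A sketch of that suggested scoping rule (the inScope helper is hypothetical, not Lieu's code):

      package main

      import (
          "fmt"
          "net/url"
          "strings"
      )

      // inScope reports whether candidate stays on the seed's host and
      // under the seed's path, so links to parent directories are skipped.
      func inScope(seed, candidate *url.URL) bool {
          if candidate.Host != seed.Host {
              return false
          }
          return strings.HasPrefix(candidate.Path, seed.Path)
      }

      func main() {
          seed, _ := url.Parse("https://cronut.cafe/~sfr/")
          ok, _ := url.Parse("https://cronut.cafe/~sfr/posts/")
          bad, _ := url.Parse("https://cronut.cafe/~other/")
          fmt.Println(inScope(seed, ok), inScope(seed, bad)) // true false
      }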

    • Add opensearch metadata for browser integration

      This allows Lieu to be discovered as a search engine by browsers, which then can be set as the default search engine for example.

      Looks like this in Firefox: (screenshot)

    • Path-enhanced crawling

      Webring sites passed to Lieu which end with a path, e.g. https://example.com/site/lupin, will now only have their child pages crawled (as opposed to allowing all pages of example.com to be crawled).

      This falls more in line with expectations for webring sites which might exist on shared hosting, or just sites which have separate areas that should not be crawled.

      Thanks @amolith for the issue!

    • Improve web UI

      Includes the following patches:

      • .View variable in templates (to make some deduplication possible)
      • Deduplicated the navigation and search-form
      • The tagline is now a paragraph instead of an h2

      NOT included are:

      • a 404 page; that patch inserts code in places changed by #13 and #14, so I'll submit it once one of those has been merged or rejected
      • the patch introducing error messages for the case when no results are available, which also depends on either #13 or #14
    • Missing 404 page

      Currently Lieu treats every page that is not a special route like its home page. This is not ideal, since all those other pages should return a proper 404 status code and have their own template (which doesn't stop them from having a search form, though …)
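
      The usual net/http pattern for this, as a sketch (Lieu's actual routing may differ; the port is taken from the example config above):

      package main

      import "net/http"

      func main() {
          http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
              // The "/" pattern is a catch-all, so unknown paths land
              // here too; answer them with a real 404 instead of the
              // home page.
              if r.URL.Path != "/" {
                  http.NotFound(w, r)
                  return
              }
              // ... render the search home page
          })
          http.ListenAndServe(":10001", nil)
      }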

    • crawl-delay

      Lieu crawls fast; it seems to crawl at a rate of 5 req/sec by default.

      Normally, this wouldn't be too worrying; however, the types of sites Lieu crawls are small hobbyist sites, and some people's sites may be self-hosted on limited hardware.

      Respecting a crawl-delay robots.txt directive could help avoid overwhelming smaller sites.
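
      Colly, which Lieu's crawler uses, can already throttle per domain; a sketch with a fixed delay standing in for a parsed Crawl-delay value (import path assumed):

      package main

      import (
          "log"
          "time"

          "github.com/gocolly/colly/v2"
      )

      func main() {
          c := colly.NewCollector()
          // A fixed two-second delay as a stand-in; honouring robots.txt
          // would mean parsing each site's Crawl-delay directive and
          // setting Delay per domain instead.
          if err := c.Limit(&colly.LimitRule{
              DomainGlob: "*",             // apply to every crawled domain
              Delay:      2 * time.Second, // pause between requests per domain
          }); err != nil {
              log.Fatal(err)
          }
          c.Visit("https://example.com") // placeholder URL
      }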

    • Handle query argument type of link style like "?post=xxx"

      Hi there administrators! My blog in the webring uses a link format of index.php?post=20220612001238, like this; however, it looks to me that the crawler doesn't like query arguments in the url: https://github.com/cblgh/lieu/blob/b0ad7dce102d35123bb0092527b7ceea6df8ad86/crawler/crawler.go#L51

      Can there be a way for sites to hint that they may want to use ? or # separated URLs? From what I know, MDWiki is quite popular and it uses #! to specify page links, so that way we could index more pages for these sites as well.

      I can see where # could pose some problems with title links... I'd suggest allowing a <meta> or some sort of tag in the page head to hint to the crawler that some link formats are allowed; if the href regex matches the "allowed link format", the link will be preserved?

      Thanks!
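
      A sketch of the proposed check (allowedLinkPattern and keepLink are hypothetical names; Lieu has no such option today):

      package main

      import (
          "fmt"
          "regexp"
          "strings"
      )

      // A pattern the site could advertise, e.g. via the proposed
      // <meta> hint (hypothetical; hard-coded here for illustration).
      var allowedLinkPattern = regexp.MustCompile(`\?post=[0-9]+$`)

      // keepLink keeps plain paths, and keeps query/fragment links only
      // when they match the advertised pattern.
      func keepLink(href string) bool {
          if !strings.ContainsAny(href, "?#") {
              return true
          }
          return allowedLinkPattern.MatchString(href)
      }

      func main() {
          fmt.Println(keepLink("/about.html"))                   // true
          fmt.Println(keepLink("index.php?post=20220612001238")) // true
          fmt.Println(keepLink("index.php?page=2"))              // false
      }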

    • Potential sadness if you find your domain in "boringDomains" config

      :wave: r a d project :partying_face:

      As an admin, it feels a bit uncomfortable putting domains into a config with such a name. The functionality is great & relevant, but in a pubnix / shared-server environment, other users might get the wrong idea from the naming, when I'm really just trying to focus my search space on relevant links, not saying I think their stuff is "boring".

      Proposal: skipDomains.

    • pizza finds .pizza domain

      I noticed that a search for pizza just shows stuff that is on a site with a .pizza domain. I don't know if this is preventing actual pizza content from turning up or if there isn't any, but either way, the engine could be more useful if it was aware that "pizza" is just part of the domain.
