GoVarnam is a cross-platform transliteration library.

Varnam

Varnam is an Indian language transliteration library. GoVarnam is a Go port of libvarnam with some core architectural changes. Not every part of libvarnam is ported.

It is stable to use daily as an input method. See it in action here: https://varnam.subinsb.com/

An Input Method Engine for Linux operating systems via IBus is available here: https://github.com/varnamproject/govarnam-ibus

Installation

You need to install the GoVarnam library on your system for any app to use Varnam.

  • Download a recent GoVarnam version.
  • Extract the zip file
  • Open a terminal and go to the extracted folder by using this command :
cd Downloads/govarnam
  • Now run this command to install GoVarnam :
sudo ./install.sh install

It will ask for your password, enter it.

  • Installation is finished

To check if installation is successful, try this command :

varnamcli -s ml enthaanu

It should give Malayalam output if the installation is successful.

  • To make Varnam give better suggestions, you will need to import some words. Download a .vlf (Varnam Learnings File) file from here [TODO LINK].
  • Import it:
varnamcli -s ml -import file.vlf

Now, you may install the IBus engine to use Varnam system wide: https://github.com/varnamproject/govarnam-ibus

Usage

Test it out:

varnamcli -s ml namaskaaram

Learn a word:

varnamcli -s ml -learn കുന്നംകുളം

Train a word with a particular pattern:

varnamcli -s ml -train college കോളേജ്

Learning Words From A File

You can import all language words from any text file. Varnam will separate English words from non-English words and learn them accordingly.

varnamcli -s ml -learn-from-file file.html

You can download news articles or Wikipedia pages in HTML format to learn words from them.

Development

Build

This repository has 3 things :

  1. GoVarnam library
  2. GoVarnam Command Line Utility (CLI)
  3. Go bindings for GoVarnam

GoVarnam is written in Go, but to be a standard library that can be used from any other programming language, we compile it to a C library. This is done by :

go build -buildmode "c-shared" -o libgovarnam.so

(A shortcut for the above is make library)

The output libgovarnam.so is a shared library that can be dynamically linked from any other programming language. Some examples :

  • Go bindings for GoVarnam: See govarnamgo folder in this repo
  • Java bindings for GoVarnam: IN PROGRESS

Wait, does that mean we need to write another Go file to interface with the GoVarnam library? Yes, because we're interfacing with the shared library and not the Go library.

Files & Folders

  • govarnam - The library files
  • main.go, c-shared* - Files that help build govarnam as a C shared library
  • govarnamgo - Go bindings for the library. For use with other Go projects
  • cli - A CLI tool for varnam. Uses govarnamgo to interface with the library.
  • symbol-frequency-calculator - For populating the weight column in VST files

CLI (Command Line Utility)

The command line utility (CLI) is written in Go and uses govarnamgo to interface with the library.

You need to separately build the CLI:

cd cli

# Show the path to libgovarnam.so
export LD_LIBRARY_PATH=$(realpath ../):$LD_LIBRARY_PATH

go build -o varnamcli .

Hacking

This section is about getting hands-on right away. An explanation of how GoVarnam works is at the bottom.

  • Clone of course
  • Do go get
  • You will need a .vst file. Get it from schemes folder in a release. Paste it in schemes folder
  • Do make library to compile

When you make changes to the govarnam source code, you will need to run make library to rebuild with the changes and then test with the CLI.

You can run tests (to make sure nothing broke) with :

make test

GoVarnam BTS

Read GoVarnam Spec: https://docs.google.com/document/d/1l5cZAkly_-kl7UkfeGmObSam-niWCJo4wq-OvAEaDvQ/edit?usp=sharing

Changes from libvarnam

  • ml.vst has been changed to add a new weight column in the symbols table. Get the new ml.vst here. The symbol with the least weight has more significance. The weight is calculated from popularity in a corpus. You can populate an ml.vst with weight values using a Python script; see that in the subfolder. The previous Ruby script is still used for making the VST and is unchanged. ml.vst from libvarnam is incompatible with govarnam.

  • patterns_content is renamed to patterns in GoVarnam

  • The patterns table in the learnings DB won't store Malayalam patterns. Instead, for each input, all possible Malayalam words are calculated (from symbols, VARNAM_MATCH_ALL) and searched for in words. These are returned as suggestions. Previously, patterns would store every pattern-to-word mapping (English => Malayalam).

  • patterns in govarnam is used solely for English words, e.g. Computer => കമ്പ്യൂട്ടർ. These English words won't work with our VST tokenizer because they are not really transliterable in our language. Without the pattern, one would have to type kambyoottar instead of Computer.

Miscellaneous

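To build without bundling SQLite (the libsqlite3 build tag is assumed here to link against the system's SQLite library instead) :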

go build -tags libsqlite3 -buildmode=c-shared -o libgovarnam.so

Release Process

  • git tag
  • make build release

Pack ibus engine:

  • make build-ubuntu18 release
Comments
  • Give priority to greedy tokenized results if there are no exact matches

    For this, dictionary results will have to be separated :

    • ExactMatches
    • PatternDictionaryMatches
    • DictionaryMatches
    • ...

    The order should be :

    • ExactMatches
    • PatternDictionaryMatches: show first result of PatternDictionaryMatches if ExactMatches is NULL
    • DictionaryMatches: show first result of DictionaryMatches if ExactMatches is NULL
    • GreedyTokenized: if ExactMatches is NULL
    • PatternDictionaryMatches: show rest of PatternDictionaryMatches if ExactMatches is NULL
    • DictionaryMatches: show rest of DictionaryMatches if ExactMatches is NULL
    • PatternDictionaryMoreMatches
    • DictionaryMoreMatches
    • ...

    It's also good to separate DictionaryMatches into DictionaryMatches & DictionaryMoreMatches. Similarly separate PatternDictionaryMatches.

    Usecase: Dictionary will have the word പാവയ്ക്ക. If I type "pavanaayi", the first suggestion will be "പാവനായി" which is wrong. The importance should be to the scheme pattern eh ? or at least show them at the beginning itself and not wayyy down.

  • - character in input string is causing an FTS5 error

    Error:

    2022/08/19 06:08:10 fts5: syntax error near "*"
    2022/08/19 06:08:10 fts5: syntax error near "*"
    2022/08/19 06:08:11 fts5: syntax error near "*"
    

    From varnamd logs :

    [2022-08-19T06:08:10.771881527Z] status: 200, latency_human: 2.561737ms, error: <nil>, user_agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36, uri: /tl/ml/-4i
    [2022-08-19T06:08:10.828157178Z] status: 200, latency_human: 1.249368ms, error: <nil>, user_agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36, uri: /tl/ml/-4it
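
One common way to avoid FTS5 syntax errors from user input is to pass the input as a quoted FTS5 string, doubling any embedded double quotes, and append * for a prefix query. The sketch below shows that quoting only. The helper name fts5PrefixQuery is hypothetical, and whether it matches how GoVarnam builds its query is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// fts5PrefixQuery wraps user input in double quotes so FTS5 treats it as a
// string instead of parsing characters like '-' as query syntax. Embedded
// double quotes are escaped by doubling them, and the trailing '*' turns the
// quoted string into a prefix query.
func fts5PrefixQuery(input string) string {
	escaped := strings.ReplaceAll(input, `"`, `""`)
	return `"` + escaped + `"*`
}

func main() {
	// The resulting string would be bound as the MATCH parameter.
	fmt.Println(fts5PrefixQuery("-4i")) // prints "-4i"*
}
```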
    
  • Improve search symbol table

    • Adds varnam_new_search_symbol :

    Symbol table search had a fatal flaw :

    var searchCriteria Symbol
    searchCriteria.Pattern = "a"
    searchCriteria.AcceptCondition = 0
    

    Varnam decided whether to omit a search criterion by checking if the field still held Go's zero value. For numeric fields the zero value is 0, which means one couldn't apply a search criterion with the value 0. The solution applied in this PR is to set the default value to -1 by calling NewSymbol(), which returns a Symbol. So now:

    searchCriteria := NewSymbol()
    searchCriteria.Pattern = "a"
    searchCriteria.AcceptCondition = 0
    
    • Added all varnam configuration vars to varnam_config(). Previously used functions like varnam_set_dictionary_suggestions_limit have been DEPRECATED.
  • LearnFromFile didn't process all words from file

    There's a bug in learning from a frequency report file. If the word position on a line contains hidden characters like <0xa0>, GoVarnam mistakenly takes the number next to it as the word. Because of this, the rest of the words fail to be learned. Here's a sample :

    പേരിൽ 254
    ചെയ്യുന്നു 254
    നിരവധി 254
    പുതിയ 254
      254
    വിവിധ 254
    കേരളത്തിലെ 254
    കേരള 254
    ചെറിയ 254
    

    Words from വിവിധ wouldn't get learned because of a hidden <0xa0> character in the previous line.

  • One Click Install Script

    Currently the install process is a 3-step path :

    • Install GoVarnam (this repo)
    • Install language support needed by user: https://github.com/varnamproject/schemes
    • Install IBus Engine for GoVarnam: https://github.com/varnamproject/govarnam-ibus

    Make this 3-step process into a single one click installer script. End result :

    curl http://raw.githubuser....../install.sh | bash
    
    Welcome to Varnam Installer. This installation is a 3-step process.
    Step 1: Install GoVarnam
    Start step 1. ? (yes/NO): yes
    
    Downloading GoVarnam version 1.5.0...
    <Download progress>
    Installing GoVarnam...
    Installed
    
    Step 2: Install your language support
    Languages:
    ----|-------------|
    | as | Assamese |
    ....
    | ml | Malayalam |
    
    Which language would you like to install ? (Separate by comma): ml,as
    
    Downloading...
    <Download progress>
    Installing...
    Installed ml
    
    Same but installed as.
    
    Looks like `ml` has words to import. Import words for "ml" ? (yes/no): yes
    <Importing progress>
    
    Step 3: Install Varnam IBus Engine
    Proceed ? (yes/NO): yes
    Downloading...
    <download progress>
    Installing...
    
    Varnam installation finished.
    
    Telegram Group: 
    Matrix Group: 
    Website: 
    
  • Dictionary DB goes corrupt after unlearning a word

    Steps to reproduce:

    1. Import many words
    2. Unlearn a word
    3. Try transliterating a word varnamcli -s ml ennum
    4. "Database image is malformed" error is in output

    On investigation, this bug is caused by a syncing problem between the table and its FTS table. The DELETE trigger has a problem.

    Bug discovered thanks to a meme (image not preserved).

  • Exact words

    Fixes #21

    • Adds a new higher priority ExactWords result along with ExactMatches. How it differs:
      • ExactWords - Exactly found words in dictionary if there is any.
      • ExactMatches - Exactly starting word matches in dictionary if there is any. Not applicable for patterns dictionary.
    • Avoid the item variable in range loops wherever possible to save memory copying: change for i, item := range to for i := range.
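
The range loop change in the last point looks like this in practice. The Suggestion struct here is a hypothetical example type. How much copying this actually saves depends on the element size and on what the compiler optimizes away.

```go
package main

import "fmt"

// Suggestion is a hypothetical element type used only for this illustration.
type Suggestion struct {
	Word   string
	Weight int
}

// totalWeightByValue ranges with a value variable: on every iteration the
// element is copied into item.
func totalWeightByValue(items []Suggestion) int {
	total := 0
	for _, item := range items {
		total += item.Weight
	}
	return total
}

// totalWeightByIndex ranges over indexes only and reads the element in place,
// so no per-element copy is made.
func totalWeightByIndex(items []Suggestion) int {
	total := 0
	for i := range items {
		total += items[i].Weight
	}
	return total
}

func main() {
	items := []Suggestion{{"a", 3}, {"b", 4}}
	fmt.Println(totalWeightByValue(items), totalWeightByIndex(items)) // prints 7 7
}
```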
  • Weird exact matches result for a non-existing word

    Bug obtained from mwordle. Type "param". The API gives {exact_matches: [word: "പരമ്", weight: 253]} but the DB doesn't have the single word പരമ്. There are words like പരമ്പര in the DB, which is basically പരമ് + പര. Is Varnam getting confused by it ?

    The ideal output is പരം. It's in the exact_matches list but second with weight 252.

  • Add varnam_get_suggestions()

    Add a new function varnam_get_suggestions(string word) to get all suggestions from the dictionary starting with a particular word. This word won't be English but a word in the language itself.

    Example:

    varnam_get_suggestions("മല")
    // Gives മലയാളം മലയാളചലച്ചിത്രം മലപ്പുറം മലയാളത്തിൽ
    

    How is it useful ?

    One useful case is with Inscript engines. An Inscript engine works letter by letter and there's no Manglish, so once you have typed part of a word, this will help in giving suggestions. Currently govarnam-ibus gives the Inscript keys' English output first, and the user then has to pick a suggestion to complete the word. This is bad practice, as the English key output of Inscript is hard to understand.

  • Marathi reverse transliterated sequence not giving correct output

    Namaskar! 😃

    The reverse transliteration of "प्रयत्न" is shown as p~ryt~n but if the same sequence is typed into the web editor then "प्रयत्न" doesn't get shown in any of the options. In fact all 3 options are the same.

    Why is this happening and how can this be fixed?


  • Allow symbol removal from VST in VST Maker

    VST Maker should allow removing a symbol from the VST using a matching condition. In the Malayalam scheme:

    anusvara [["m"]] => ["ം","ം","മ"]
    anusvara "m_" =>  ["ം","ം","മ"]
    anusvara({:accept_if => :ends_with}, "m" => ["ം","ം","മ"])
    anusvara({:accept_if => :in_between}, "m" => ["ം","ം","മ"])
    
    consonants ["ma"] => "മ"
    
    generate_cv
    

    The CV generation makes m => മ്, but മ് is of no use at the end of a string since anusvara will be used there instead. So we need to remove the generated m => മ് and then add it back with custom conditions:

    anusvara({:accept_if => :starts_with}, "m" => ["മ്"])
    anusvara({:accept_if => :in_between}, "m" => ["മ്"])
    
  • Porting from libvarnam Status

    • [x] Transliteration

    • [x] Reverse Transliteration

    • [x] Learning, Training from CLI

    • [x] Learning, Training from file

    • [x] VST Creation

    • [ ] Stem rules

      This may not be needed cause there's better stemming tools from SMC. Besides, stemrules are only set for Malayalam in varnam.

    • [ ] Using flags column in symbol table (This may not be needed cause govarnam works just fine without using flags)

  • Tamil Letter Suggestion

    The word is சொல்லாமல் (sollamal).

    But while typing it in Varnam, it gives the other l (ள).

    If we type il, it gives இல், which is correct. But if we type ill, it gives இள், whereas it should give இல்.

    Typing ILL in caps does the same.
