Varnam
Varnam is an Indian language transliteration library. GoVarnam is a Go port of libvarnam with some core architectural changes. Not every part of libvarnam is ported.
It is stable enough for daily use as an input method. See it in action here: https://varnam.subinsb.com/
An Input Method Engine for Linux operating systems via IBus is available here: https://github.com/varnamproject/govarnam-ibus
Installation
You need to install the GoVarnam library on your system before any app can use Varnam.
- Download a recent GoVarnam version.
- Extract the zip file
- Open a terminal and go to the extracted folder by using this command :
cd Downloads/govarnam
- Now run this command to install GoVarnam :
sudo ./install.sh install
It will ask for your password; enter it.
- Installation is finished
To check if installation is successful, try this command :
varnamcli -s ml enthaanu
It should print Malayalam output if the installation was successful.
- To make Varnam give better suggestions, you will need to import some words. Download a .vlf (Varnam Learnings File) file from here [TODO LINK].
- Import it:
varnamcli -s ml -import file.vlf
Now, you may install the IBus engine to use Varnam system wide: https://github.com/varnamproject/govarnam-ibus
Usage
Test it out:
varnamcli -s ml namaskaaram
Learn a word:
varnamcli -s ml -learn കുന്നംകുളം
Train a word with a particular pattern:
varnamcli -s ml -train college കോളേജ്
Learning Words From A File
You can import all of a language's words from any text file. Varnam will separate English words from non-English words and learn each accordingly.
varnamcli -s ml -learn-from-file file.html
You can download news articles or Wikipedia pages in HTML format to learn words from them.
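To give a feel for what separating English and non-English words can look like, here is a rough sketch that classifies words by Unicode script. This is not GoVarnam's actual implementation; the function names (splitWords, isAll) are made up for illustration, and it assumes Malayalam as the target language. It does not strip HTML tags, and words that mix scripts are simply dropped.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isAll reports whether every rune of w belongs to the given Unicode script table.
func isAll(w string, script *unicode.RangeTable) bool {
	for _, r := range w {
		if !unicode.Is(script, r) {
			return false
		}
	}
	return true
}

// splitWords (hypothetical) splits text on anything that is not a letter or a
// combining mark, then sorts each word into a Latin or a Malayalam bucket.
func splitWords(text string) (english, malayalam []string) {
	words := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsMark(r)
	})
	for _, w := range words {
		switch {
		case isAll(w, unicode.Latin):
			english = append(english, w)
		case isAll(w, unicode.Malayalam):
			malayalam = append(malayalam, w)
		}
	}
	return english, malayalam
}

func main() {
	en, ml := splitWords("Computer എന്ന വാക്ക്, college: കോളേജ്!")
	fmt.Println("English:", en)
	fmt.Println("Malayalam:", ml)
}
```

Keeping combining marks inside words matters for Malayalam, since vowel signs and the virama are marks rather than letters. One known gap in this sketch: a zero-width joiner inside a word is treated as a separator.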
Development
Build
This repository has three things:
- GoVarnam library
- GoVarnam Command Line Utility (CLI)
- Go bindings for GoVarnam
GoVarnam is written in Go, but so that it can serve as a standard library usable from any other programming language, we compile it to a C library. This is done by:
go build -buildmode "c-shared" -o libgovarnam.so
(A shortcut for the above is make library)
The output libgovarnam.so is a shared library that can be dynamically linked from any other programming language. Some examples:
- Go bindings for GoVarnam: See govarnamgo folder in this repo
- Java bindings for GoVarnam: IN PROGRESS
Note that this means even Go programs need a separate Go file (the bindings) to interface with the GoVarnam library. This is because they are interfacing with the shared library and not with the Go library directly.
Files & Folders
- govarnam - The library files
- main.go, c-shared* - Files that help in making govarnam a C shared library
- govarnamgo - Go bindings for the library. For use with other Go projects
- cli - A CLI tool for varnam. Uses govarnamgo to interface with the library.
- symbol-frequency-calculator - For populating the weight column in VST files
CLI (Command Line Utility)
The command line utility (CLI) is written in Go and uses govarnamgo to interface with the library.
You need to separately build the CLI:
cd cli
# Show the path to libgovarnam.so
export LD_LIBRARY_PATH=$(realpath ../):$LD_LIBRARY_PATH
go build -o varnamcli .
Hacking
This section gets you hands-on right away. An explanation of how GoVarnam works is at the bottom.
- Clone the repository
- Do go get
- You will need a .vst file. Get it from the schemes folder in a release. Paste it in the schemes folder
- Do make library to compile
When you make changes to the govarnam source code, you will need to run make library to rebuild the library, and then test with the CLI.
You can run tests (to make sure nothing broke) with :
make test
GoVarnam BTS
Read GoVarnam Spec: https://docs.google.com/document/d/1l5cZAkly_-kl7UkfeGmObSam-niWCJo4wq-OvAEaDvQ/edit?usp=sharing
Changes from libvarnam
- ml.vst has been changed to add a new weight column in the symbols table. Get the new ml.vst here. The symbol with the least weight has the most significance. The weight is calculated according to popularity in a corpus. You can populate an ml.vst with weight values using a Python script; see the symbol-frequency-calculator subfolder. The previous Ruby script is still used for making the VST; that is unchanged. ml.vst from libvarnam is incompatible with govarnam.
- patterns_content is renamed to patterns in GoVarnam
- The patterns table in the learnings DB won't store Malayalam patterns. Instead, for each input, all possible Malayalam words are calculated (from symbols, VARNAM_MATCH_ALL) and searched in words. These are returned as suggestions. Previously, pattern would store every pattern to a word: english => malayalam.
- patterns in govarnam is used solely for English words, e.g. Computer => കമ്പ്യൂട്ടർ. These English words won't work with our VST tokenizer because they are not really transliterable in our language. Through the tokenizer it would be kambyoottar => Computer
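The suggestion lookup described above (compute every possible word from the symbols, then keep the ones found in words) can be illustrated with a toy example. This is only a sketch under stated assumptions, not GoVarnam's code: the symbol table is a tiny hand-made slice rather than a real VST, the type and function names (symbol, candidates, suggest) are hypothetical, and ordering suggestions by the sum of symbol weights is my own simplification of "least weight has the most significance".

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// symbol is one hypothetical row of a symbols table: a Latin pattern, the
// Malayalam text it can produce, and a weight (lower means preferred).
type symbol struct {
	pattern, value string
	weight         int
}

// candidate is one possible word plus the summed weight of its symbols.
type candidate struct {
	word   string
	weight int
}

// demoTable is made-up data, not taken from a real ml.vst.
var demoTable = []symbol{
	{"ma", "മ", 0}, {"ma", "മാ", 1},
	{"la", "ല", 0}, {"la", "ള", 1}, {"la", "ലാ", 2},
}

// candidates enumerates every way to cover input with patterns from table.
func candidates(input string, table []symbol) []candidate {
	if input == "" {
		return []candidate{{}}
	}
	var out []candidate
	for _, s := range table {
		if s.pattern == "" || !strings.HasPrefix(input, s.pattern) {
			continue
		}
		for _, rest := range candidates(input[len(s.pattern):], table) {
			out = append(out, candidate{s.value + rest.word, s.weight + rest.weight})
		}
	}
	return out
}

// suggest keeps only candidates present in the learned words set,
// ordered by ascending total weight.
func suggest(input string, table []symbol, words map[string]bool) []string {
	var known []candidate
	for _, c := range candidates(input, table) {
		if words[c.word] {
			known = append(known, c)
		}
	}
	sort.SliceStable(known, func(i, j int) bool { return known[i].weight < known[j].weight })
	out := make([]string, 0, len(known))
	for _, c := range known {
		out = append(out, c.word)
	}
	return out
}

func main() {
	learned := map[string]bool{"മല": true, "മാല": true}
	fmt.Println(suggest("mala", demoTable, learned))
}
```

With this table, "mala" expands to six candidate spellings, of which only the two learned words survive. The number of candidates grows quickly with input length, so a real implementation would need to bound or prune the enumeration.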
Miscellaneous
To build without bundling SQLite, linking against the system's libsqlite3 instead:
go build -tags libsqlite3 -buildmode=c-shared -o libgovarnam.so
Release Process
- git tag
- make build release
Pack ibus engine:
- make build-ubuntu18 release