misspelled word linter for Go comments, string literals and embedded files

gospel

The gospel program lints Go source files for misspellings in comments, strings and embedded files.

It uses hunspell to identify misspellings and makes use of source code information to reduce the rate of false positive spelling errors where words refer to labels within the source code.

Installation

Beyond the standard Go installation process, you must also have libhunspell and its header files on your system. On a Debian-based system these can be installed with sudo apt install libhunspell-dev.

Work Flow

gospel makes use of module information, and so must be run within a module and will only consider source files that are referred to by the module.

gospel can be used without configuration in many cases, particularly for small source trees where the number of potential errors is likely to be small. However, its behaviour can be tuned with configuration files. The initial setup workflow makes use of the .words file to build a dictionary of candidate words to ignore.

$ gospel -misspellings=.words ./...

This will collate all the misspellings found in the source tree and write them to .words. If gospel is run again, it will accept all the words in the source tree as correctly spelled, as they now appear in the project's dictionary. Since hunspell dictionaries are just text files, you can go through the list of words to remove properly identified misspellings and leave in words that are domain-specific and incorrectly marked.

For example, the .words file at the root of the gospel repo was autogenerated using the command above and includes words that would otherwise be flagged with the default "en_US" spelling dictionary.

8
behaviour
colour
coloured
emph
gendoc
hosters
initialisms
pluralisation

This file can also be edited to make use of more advanced hunspell features. See Hunspell Dictionaries below.

Command Line Options

gospel provides a number of command line options. The full list can be obtained by running gospel -h, but the majority correspond directly to .gospel.conf options.

The remaining options are not intended to be persistently stored:

  • -config — whether to use config file (default true, intended for debugging use).
  • -dict-paths — a colon-separated directory list containing hunspell dictionaries (defaults to a system-specific value).
  • -entropy-filter — filter strings and embedded files by entropy.
  • -misspellings — a file path to write a dictionary of misspellings to (see Work Flow above).
  • -since — a git ref specifying that only changes since then should be considered for misspelling (requires git).
  • -update-dict — whether the -misspellings flag is being used to update a dictionary that already exists.
  • -write-config — emit a config file based on flags and existing config to stdout and exit.

Configuration Files

gospel uses two configuration file types, .words files at the module roots of packages that are being checked, and a .gospel.conf file at the module root of the module in which gospel was invoked.

If gospel is being used in CI, these files should be committed to the repository.

.words

The .words file is a hunspell .dic formatted file. At its simplest, this is a plain text file with a list of words, one word per line and an initial word count on the first line.
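The layout described above can be read with a few lines of Go. This is a minimal sketch for illustration, not gospel's own loader: the `entry` type and `parseDic` function are hypothetical names, and the sketch ignores hunspell features such as escaped slashes and morphological fields.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// entry is one line of a .dic-format word list: a root word and any
// affix flags that follow a slash. (Hypothetical type for this sketch.)
type entry struct {
	Word  string
	Flags string
}

// parseDic reads a .dic-format word list: a count hint on the first
// line, then one word per line with an optional "/FLAGS" part.
func parseDic(text string) (hint int, entries []entry, err error) {
	sc := bufio.NewScanner(strings.NewReader(text))
	if !sc.Scan() {
		return 0, nil, fmt.Errorf("empty word list")
	}
	hint, err = strconv.Atoi(strings.TrimSpace(sc.Text()))
	if err != nil {
		return 0, nil, fmt.Errorf("invalid count hint: %w", err)
	}
	if hint <= 0 {
		return 0, nil, fmt.Errorf("count hint must be greater than zero, got %d", hint)
	}
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // skip blank lines
		}
		word, flags, _ := strings.Cut(line, "/")
		entries = append(entries, entry{Word: word, Flags: flags})
	}
	return hint, entries, sc.Err()
}

func main() {
	hint, entries, err := parseDic("3\nbehaviour\ncolour\nlint/G\n")
	if err != nil {
		panic(err)
	}
	fmt.Println("hint:", hint)
	for _, e := range entries {
		fmt.Printf("word=%s flags=%q\n", e.Word, e.Flags)
	}
}
```

The count on the first line is kept separate from the number of entries actually read, since the two do not have to agree.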

gospel will read all .words files found at module roots for the packages that are being checked and build a dictionary from them to give to hunspell for the spelling checks.

Hunspell dictionaries are able to express more than just word matches though, and are able to indicate some grammatically related sets of words based on rules. This is covered lightly below.

.gospel.conf

Runtime behaviour of gospel can be modified in a persistent way through the TOML format .gospel.conf file. A number of options are provided:

  • ignore_idents — whether to include syntax information from the source code in the dictionary of acceptable words.
  • lang — the language tag to specify language locale.
  • show — whether to show context for identified misspellings.
  • check_strings — whether to check string literals.
  • check_embedded — whether to check spelling in files embedded using //go:embed.
  • ignore_upper — whether to ignore words that are all uppercase or their plurals.
  • ignore_single — whether to ignore single rune words.
  • ignore_numbers — whether to ignore number literals.
  • read_licenses — whether to ignore words found in license files.
  • read_git_log — whether to ignore author names and emails found in the output of git log (requires git to be installed, and gospel to be invoked from within a git repository to have any effect).
  • mask_flags — whether words that could be command-line flags should be removed prior to checking.
  • mask_urls — whether URLs should be removed prior to checking.
  • check_urls — whether the HTTP/HTTPS reachability of URLs should be checked.
  • camel — whether to split camelCase words into the components if the complete word is not accepted, otherwise split only on underscore.
  • max_word_len — the maximum length of words that should be checked.
  • min_naked_hex — minimum length for exclusion of words that are composed of only hex digits 0-9 and a-f (case insensitive).
  • suggest — when suggestions should be presented for misspellings: "never", "once", once for "each" comment block, or "always".
  • diff_context — how many lines around a change should be checked when the -since flag is used.
  • entropy_filter — controls the entropy filter used to exclude non-natural language from checking.
    • min_len_filtered — the minimum length of text chunks to be considered by the entropy filter; the string literal length for strings, the file length for embedded files and the line or block length for comments.
    • entropy_filter.accept — the range of complexity to accept as natural language for checking; this roughly corresponds to the effective alphabet size for the language.
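
One way to picture the entropy filter is to treat the "effective alphabet size" as 2^H, where H is the Shannon entropy of the text in bits per character, and to keep only text whose value falls in the accept range. The sketch below is written under that assumption; it is not taken from gospel's source, and `effectiveAlphabet` and `shouldCheck` are hypothetical names. It also assumes lengths are measured in bytes.

```go
package main

import (
	"fmt"
	"math"
)

// effectiveAlphabet returns 2^H, where H is the Shannon entropy of s in
// bits per rune. It returns 0 for an empty string.
func effectiveAlphabet(s string) float64 {
	counts := make(map[rune]int)
	n := 0
	for _, r := range s {
		counts[r]++
		n++
	}
	if n == 0 {
		return 0
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(n)
		h -= p * math.Log2(p)
	}
	return math.Exp2(h)
}

// shouldCheck reports whether text would be passed on for spell checking.
// Text shorter than minLen bytes is never filtered; longer text is kept
// only when its effective alphabet size lies within [low, high].
func shouldCheck(text string, minLen int, low, high float64) bool {
	if len(text) < minLen {
		return true
	}
	a := effectiveAlphabet(text)
	return low <= a && a <= high
}

func main() {
	samples := []string{
		"the quick brown fox jumps over the lazy dog",
		"deadbeefdeadbeefdeadbeef",
	}
	for _, s := range samples {
		// 16, 14 and 20 mirror the default min_len_filtered, low and high values.
		fmt.Printf("%q: size=%.2f check=%t\n", s, effectiveAlphabet(s), shouldCheck(s, 16, 14, 20))
	}
}
```

Under this reading, repetitive data such as hex dumps has a small effective alphabet and falls below the low bound, while text drawing evenly on a very large symbol set would exceed the high bound.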

The .gospel.conf file is intended to set base behaviour that can be modified with the command line flags during tuning.
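
As an illustration of the camel option listed above, the following sketch splits a word on underscores and, optionally, on camelCase boundaries. It is an assumption about what such splitting could look like, not gospel's actual implementation; `splitCamel` and `splitWord` are hypothetical names, and the sketch does not model the "only if the complete word is not accepted" check.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitCamel splits s at lower-to-upper transitions and before the last
// capital of an initialism that is followed by a lowercase rune, so that
// "HTTPServer" becomes "HTTP" and "Server".
func splitCamel(s string) []string {
	runes := []rune(s)
	var parts []string
	start := 0
	for i := 1; i < len(runes); i++ {
		prev, cur := runes[i-1], runes[i]
		boundary := unicode.IsLower(prev) && unicode.IsUpper(cur)
		if !boundary && i+1 < len(runes) {
			boundary = unicode.IsUpper(prev) && unicode.IsUpper(cur) && unicode.IsLower(runes[i+1])
		}
		if boundary {
			parts = append(parts, string(runes[start:i]))
			start = i
		}
	}
	return append(parts, string(runes[start:]))
}

// splitWord always splits on underscores and additionally splits each
// piece on camelCase boundaries when camel is true.
func splitWord(word string, camel bool) []string {
	var parts []string
	for _, p := range strings.Split(word, "_") {
		if p == "" {
			continue
		}
		if camel {
			parts = append(parts, splitCamel(p)...)
		} else {
			parts = append(parts, p)
		}
	}
	return parts
}

func main() {
	fmt.Println(splitWord("ignoreUpper", true))
	fmt.Println(splitWord("min_naked_hex", false))
	fmt.Println(splitWord("HTTPServer", true))
}
```

Each resulting component could then be checked on its own, which is what lets an identifier built from correctly spelled parts pass.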

The current default .gospel.conf file looks like this:

ignore_idents = true
lang = "en_US"
show = true
check_strings = false
check_embedded = false
ignore_upper = true
ignore_single = true
ignore_numbers = true
read_licenses = true
read_git_log = true
mask_flags = false
mask_urls = true
check_urls = false
camel = true
max_word_len = 40
min_naked_hex = 8
suggest = "never"
diff_context = 0

[entropy_filter]
  filter = false
  min_len_filtered = 16
  [entropy_filter.accept]
    low = 14
    high = 20

Hunspell Dictionaries

Hunspell dictionaries are composed of two parts, a word list and an affix definition file. These are briefly described below. More information can be found in man 5 hunspell.

.dic Files

The .dic file comprises what you would normally think of as a dictionary. It is a list of words, one word per line, with an initial hint to hunspell indicating how many words it should expect to work with. The value of the hint is not particularly important except for performance, but must be greater than zero.

In addition to the words, hunspell allows the dictionary to encode "affix rules". These describe how word roots can be extended to allow related words to be matched as correct, for example "thing" and "things", or "lint" and "linting". The affix rules are indicated by a set of characters following the word and separated by a slash. The two examples here would be represented (in the "en_US" case) by

lint/G
thing/S

where the /S indicates that "thing" can be pluralised and the /G that "lint" can be extended to its gerund. It is possible to specify more than one affix rule, and affixes can be prefix or suffix modifiers. Prefix and suffix rules can interact as the cross product (see below). So from the "en_US" dictionary,

advise/LDRSZGB
advised/UY

will match "advise" (root), "advisement" (L), "advised" (D), "adviser" (R), "advises" (S), "advisers" (Z), "advising" (G) and "advisable" (B) from the first rule, and "advised" (root), "unadvised" (U), "advisedly" (Y) and "unadvisedly" (UY) from the second.

.aff Files

Different natural languages use different inflection constructions for encoding grammatical information and this is specified for each language in hunspell's .aff files.

Again using the en_US language locale, an example of an affix rule definition (the Z from "advise" above) is

SFX Z Y 4
SFX Z   0     rs         e
SFX Z   y     iers       [^aeiou]y
SFX Z   0     ers        [aeiou]y
SFX Z   0     ers        [^ey]

The first line indicates that the rule is the application of a suffix (SFX), the rule name is "Z", that the rule can be combined with prefixes via cross product (Y) and that there are 4 ways to apply the rule.

The remaining lines indicate how each application of the rule should be applied, and when. The last field specifies when each rule can be applied by matching against the end of the target word. The second last indicates the suffix to add, and the third last indicates any characters to remove before adding the suffix ("0" indicates no removal). So with "advise", the first rule matches, resulting in "advise"+"rs" being accepted as a correctly spelled word. The words "buy" and "fly" illustrate how the second and third rules would be applied; "buy" will match the third rule and so would allow "buy"+"ers", but "fly" would match the second and allow "fl"+"iers".
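
The mechanics of the "Z" rule can be sketched in Go as follows. This is a simplified illustration, not hunspell's implementation: the `suffixRule` type and the `newSuffixRule` and `apply` functions are hypothetical, conditions are treated as regular expressions anchored at the end of the word, and only the first matching application is used.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// suffixRule is one application line of an SFX rule.
type suffixRule struct {
	strip string         // characters to remove from the end ("" when the field is "0")
	add   string         // suffix to append
	cond  *regexp.Regexp // condition that the end of the word must match
}

// newSuffixRule builds a rule from the last three fields of an SFX line.
func newSuffixRule(strip, add, cond string) suffixRule {
	if strip == "0" {
		strip = ""
	}
	return suffixRule{strip: strip, add: add, cond: regexp.MustCompile(cond + "$")}
}

// ruleZ holds the four applications of the en_US "Z" rule quoted above.
var ruleZ = []suffixRule{
	newSuffixRule("0", "rs", "e"),
	newSuffixRule("y", "iers", "[^aeiou]y"),
	newSuffixRule("0", "ers", "[aeiou]y"),
	newSuffixRule("0", "ers", "[^ey]"),
}

// apply returns the word produced by the first application whose
// condition matches, and false if none applies.
func apply(rules []suffixRule, word string) (string, bool) {
	for _, r := range rules {
		if r.cond.MatchString(word) && strings.HasSuffix(word, r.strip) {
			return strings.TrimSuffix(word, r.strip) + r.add, true
		}
	}
	return "", false
}

func main() {
	for _, w := range []string{"advise", "buy", "fly", "lint"} {
		derived, ok := apply(ruleZ, w)
		fmt.Println(w, derived, ok)
	}
}
```

The four conditions of this rule are mutually exclusive, so stopping at the first match loses nothing here; a general implementation would need to consider every matching application.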

If you are adding specific rules to your .words file, the .aff files for your system can be found in the hunspell dictionary path for reference. Invoking hunspell -D will print the search path and show which dictionaries are available. On many Linux systems the files are found in /usr/share/hunspell/ and on macOS they are usually expected to be in ~/Library/Spelling/ or /Library/Spelling/.

gospel will not add affix rules to words that have been identified as misspelled, but will retain rules that have been added during dictionary updates.

.aff files include other information as well, including common misspellings and how to handle things like ordinal numbers.

Owner
Dan Kortschak
Actually Doing Science
Comments
  • Rules for words added from symbols

    With the new .words support, I have just been working on the nats-io/nats.go client library and quite a few typos and stale comments are being fixed in the PR I'm working on, so thank you. But, this is exposing some nice-to-haves:

    • [x] symbols which are types, not of an array, should be added with an /S affix rule, so comments can talk about their plurals
    • [x] if the .words file were loaded before the symbol tables, arbitrary sane rules could be written, without being masked by the symbols being added without rules
    • [x] Comments can talk about functions from an imported module which aren't being used, explaining why, so it might be useful to add the exported symbols of imported libraries, if that can be done sanely with performance. Eg, explaining why strconv.AppendInt is not being used.
    • [x] If a struct field's tag starts [a-z]+:" then up until the next comma or double-quotes is probably a variant spelling for wire transfer formats, and it makes sense for comments to use that term. Eg, a NoWait field can be json:"no_wait,omitempty"
    • [x] omitempty should probably be in the built-in dictionary. :)

    The other head-scratcher from this work is hostnames, or other fields which look like hostnames. In this case, NATS subject examples, such as time.us.east.atlanta and time.eu.east leading to complaints about EU and Atlanta being wrong. I'm not sure what could sanely be done here.

  • improve docs for hunspell

    • [x] Document gospel -misspellings .words ./... as a way to kickstart a project dictionary, removing the words which are actual misspellings
    • [x] Document where to learn more about hunspell, e.g. man 5 hunspell and where to find defined affixes for en_US, such as /usr/share/hunspell/en_US.aff
    • [x] Give some examples of lines one can add to a .words file, such as builtin/S for a plural, an example for a verb, an adjective, etc
    • [x] Document that the starting number in a dict file is a hint, and may be replaced with a dummy value like 1 (you said this somewhere and I forget where)
  • teach the tool about plurals

    For example, see https://github.com/mvdan/sh/blob/e0682cacf774fe92a2b87c3d0d98cd0e9ee8db14/syntax/parser.go#L1936:

    I use IfClauses to reference the plural version of type IfClause, in https://github.com/mvdan/sh/blob/e0682cacf774fe92a2b87c3d0d98cd0e9ee8db14/syntax/nodes.go#L352.

    Perhaps there's a better way to reference plurals of identifiers?

  • add more common Go-isms to the known words set

    Adding a few here as I've come across them. Some are rather generic to programming, others are somewhat specific to Go.

    • builtins
    • errored
    • recurse
    • config
    • filesystem
    • unexport
    • codepoints
    • tokenize
    • lexing
    • lexed
    • backquote
    • ascii
    • unescaped
    • unsetting
    • html
    • mutex
    • vendored
    • env (as in go env)
    • funcs (as in plural of the func keyword)
    • hacky
    • uncomment
    • v4 (as in major version, could be any vN)
    • charset
    • lossy
    • hasher
    • KiB (and MiB, etc?)

    I realise I can add these to my local dictionary per-project, but I also assume you want to capture the common ones in the tool.

    It's very likely that I'm not spelling some of these right, for example by omitting a dash like un-setting. Happy to be told to spell better :) But in those cases, it would be nice to be told what the right spelling is, because I don't.

  • use hunspell's add dictionary api to load known words

    @mvdan Please try this out, and suggest any words not in the list at #14 that you think would be good to have. The words from there that I have not included are misspellings.

    Closes #14.

  • another bag of domain-specific words to teach gospel about

    • metacharacter
    • automata
    • IPv6
    • subnet
    • backtick
    • utf-8
    • testdata (from the standard directory name supported by Go)
    • ldflags (from Go's flag; same with gcflags and others probably)
  • make gospel usable in ci

    Currently gospel is likely to generate too much noise to be usable in a CI context but this should be fixable with relatively few additions.

    • [x] provide a config file to run from
    • [x] have a non-zero exit status when misspellings are found ~(optionally?)~
    • [x] allow common apparent misspelling classes to be ignored — this is already done with all upper-case words. for example:
      • [x] floating point expressions
      • [x] hexadecimal numbers (with or without 0x prefix)
      • [x] non-decimal number prefixed numbers
      • [x] x, u and U prefixed rune literals
      • [x] check whether underscore_separated_words that have been identified as misspellings are constructed from correctly spelled words
      • [x] regular-expression based ignoring of words, for example for /^rfc[0-9]+$/i
      • [x] very long words to help ignore base64 and other troublesome encodings
      • [x] entropy-based filtering of strings
    • [x] ignore words found in valid, vet-passing struct tags
    • [x] ~~a linter directive to ignore files in the form //lint:file-ignore spelling reason paralleling the staticcheck approach. A finer-grained approach could also be used like staticcheck's line-based approach, but based on blocks rather than lines.~~
  • feature: support per-repo or per-module (or per-workspace) dictionaries

    Rather than each developer working on a code-base having to teach their local install about project-specific terminology, it would be great if gospel could support project-specific dictionaries without needing to set DICTIONARIES env-vars dynamically.

    This could be per-repository, walking up until it finds the VCS root; I've found golang.org/x/tools/go/vcs to be useful and it has FromDir() which should help.

    This could be per-module.

    This could support "sibling to workspace".

    Given a search path, all of the above would be neat.

  • feature: support per-user default hunspell dictionary

    This is for ~/.hunspell_en_US (documented in the manual page as ~/.hunspell_default). This isn't quite documented right, but the goal for gospel would be to just share the dictionary in practice.

    Inside hunspell's git repo, src/tools/hunspell.cxx inside main() has, starting at line 2120 as of hunspell/hunspell@31e6d6323 :

      if (HOME) {
        buf.assign(HOME);
    #ifndef WIN32
        buf.append("/");
    #endif
        buf.append(DICBASENAME);
        buf.append(basename(dicname, DIRSEPCH));
        load_privdic(buf.c_str(), pMS[0]);
    

    where DICBASENAME on not-WIN32 is ".hunspell_" and dicname is derived in preceding main() logic to be the locale path, so ends en_US for American English speakers.

  • add a "version" flag

    I was trying to report a bug and realised that fetching the gospel version I'm running is not particularly easy. These days, with VCS build stamping, it should be easy to report the version.

    If you want, you can copy this code of mine from another project (minus the test mocking env vars):

    https://github.com/burrowers/garble/blob/master/main.go#L327-L368

    For a development build, it ends up like:

    $ garble version
    mvdan.cc/garble (devel) 
    
    Build settings:
           -compiler gc
         CGO_ENABLED 1
              GOARCH amd64
                GOOS linux
             GOAMD64 v3
                 vcs git
        vcs.revision 6a39ad2d8157bfa0a93d121cbf6bdf437815ca43
            vcs.time 2022-03-26T09:34:51Z
        vcs.modified false
    

    A release build would look pretty similar - the build settings are always printed - but the version would be a useful tag rather than "devel".
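
A report along these lines can be produced with the standard library's build information. The sketch below is a minimal illustration using runtime/debug.ReadBuildInfo; it is neither the garble code linked above nor gospel's implementation, and `versionString` and `currentVersion` are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"strings"
)

// versionString formats the main module path and version, followed by
// the recorded build settings such as vcs.revision and vcs.time.
func versionString(info *debug.BuildInfo) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s %s\n\nBuild settings:\n", info.Main.Path, info.Main.Version)
	for _, s := range info.Settings {
		fmt.Fprintf(&b, "%16s %s\n", s.Key, s.Value)
	}
	return b.String()
}

// currentVersion reports build information for the running binary, or a
// short notice when the binary carries none.
func currentVersion() string {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return "no build information available\n"
	}
	return versionString(info)
}

func main() {
	fmt.Print(currentVersion())
}
```

The VCS settings are only present when the binary is built from within a version-controlled checkout with build stamping enabled.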

  • add common file formats and extensions to the default dictionary

    gospel just flagged "jpeg" in a comment like:

    /*
    resp, err := http.Post("http://example.com/upload", "image/jpeg", &buf)
    

    This is in Go's net/http/doc.go:5:240 as of tip today, FYI.

    I think gospel should include some of these words by default.

    For example, images:

    • jpeg
    • gif
    • png
    • svg

    Documents:

    • html
    • css
    • txt
    • pdf

    I'm sure there are dozens more if we wanted to cover many of the common extensions or short format names that are reasonable to use in Go code. For example, see https://www.computerhope.com/issues/ch001789.htm.

    To avoid false positives, these could only be allowed when uppercase, such as // We expect a JPEG file., or if followed by a special character, like // Must be "image/jpeg". or // Read from foobar.jpg..

  • skip misspellings on single words surrounded by quotes

    For example:

    tools/flow/tasks.go:179:54: "cuerun" is misspelled in comment
    	// finalized. However, cue cmd uses some legacy instance stitching code
    	// where some of the backlink Environments are not properly initialized.
    	// Finalizing should patch those up at the expense of doing some duplicate
    	// work. The plan is to replace `cue cmd` with a much more clean
    	// implementation (probably a separate tool called `cuerun`) where this
    	// issue is fixed. For now we leave this patch.
    