A parser library for Go

A dead simple parser package for Go

V2

This is version 2 of Participle. See the Change Log for details.

Note: semantic versioning API guarantees do not apply to the experimental packages - the API may break between minor point releases.

It can be installed with:

$ go get github.com/alecthomas/participle/v2@latest

The latest version from v0 can be installed via:

$ go get github.com/alecthomas/participle@latest

Introduction

The goal of this package is to provide a simple, idiomatic and elegant way of defining parsers in Go.

Participle's method of defining grammars should be familiar to any Go programmer who has used the encoding/json package: struct field tags define what and how input is mapped to those same fields. This is not unusual for Go encoders, but is unusual for a parser.

Tutorial

A tutorial is available, walking through the creation of an .ini parser.

Tag syntax

Participle supports two forms of struct tag grammar syntax.

The easiest form to read is when the grammar uses the entire struct tag content, eg.

Field string `@Ident @("," Ident)*`

However, this does not coexist well with other tags such as JSON, etc. and may cause issues with linters. If this is an issue then you can use the parser:"" tag format. In this case single quotes can be used to quote literals making the tags somewhat easier to write, eg.

Field string `parser:"@Ident @(',' Ident)*" json:"field"`

Overview

A grammar is an annotated Go structure used to both define the parser grammar, and be the AST output by the parser. As an example, the following is the final INI parser from the tutorial.

type INI struct {
  Properties []*Property `@@*`
  Sections   []*Section  `@@*`
}

type Section struct {
  Identifier string      `"[" @Ident "]"`
  Properties []*Property `@@*`
}

type Property struct {
  Key   string `@Ident "="`
  Value *Value `@@`
}

type Value struct {
  String *string  `  @String`
  Number *float64 `| @Float`
}

Note: Participle also supports named struct tags (eg. Hello string `parser:"@Ident"`).

A parser is constructed from a grammar and a lexer:

parser, err := participle.Build(&INI{})

Once constructed, the parser is applied to input to produce an AST:

ast := &INI{}
err := parser.ParseString("", "size = 10", ast)
// ast == &INI{
//   Properties: []*Property{
//     {Key: "size", Value: &Value{Number: &10}},
//   },
// }

Grammar syntax

Participle grammars are defined as tagged Go structures. Participle will first look for tags in the form parser:"...". It will then fall back to using the entire tag body.

The grammar format is:

  • @<expr> Capture expression into the field.
  • @@ Recursively capture using the field's own type.
  • <identifier> Match named lexer token.
  • ( ... ) Group.
  • "..." or '...' Match the literal (note that the lexer must emit tokens matching this literal exactly).
  • "...":<identifier> Match the literal, specifying the exact lexer token type to match.
  • <expr> <expr> ... Match expressions.
  • <expr> | <expr> | ... Match one of the alternatives. Each alternative is tried in order, with backtracking.
  • !<expr> Match any token that is not the start of the expression (eg: @!";" matches anything but the ; character into the field).

The following modifiers can be used after any expression:

  • * Expression can match zero or more times.
  • + Expression must match one or more times.
  • ? Expression can match zero or once.
  • ! Require a non-empty match (this is useful with a sequence of optional matches eg. ("a"? "b"? "c"?)!).

Notes:

  • Each struct is a single production, with each field applied in sequence.
  • @<expr> is the mechanism for capturing matches into the field.
  • If a struct field is not keyed with "parser", the entire struct tag will be used as the grammar fragment. This allows the grammar syntax to remain clear and simple to maintain.
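
For example, a minimal sketch combining optional matches with the ! group modifier described above (the struct and keywords are illustrative, not taken from any included example):

type Qualifiers struct {
  // Each qualifier is optional on its own, but the trailing ! requires the
  // group as a whole to match at least one token.
  Public bool `( @"public"?`
  Static bool `  @"static"?`
  Final  bool `  @"final"? )!`
}

This accepts "public", "static final", "public static final", and so on, but rejects empty input.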

Capturing

Prefixing any expression in the grammar with @ will capture matching values for that expression into the corresponding field.

For example:

// The grammar definition.
type Grammar struct {
  Hello string `@Ident`
}

// The source text to parse.
source := "world"

// After parsing, the resulting AST.
result == &Grammar{
  Hello: "world",
}

For slice and string fields, each instance of @ will accumulate into the field (including repeated patterns). Accumulation into other types is not supported.

For integer and floating point types, a successful capture will be parsed with strconv.ParseInt() and strconv.ParseFloat() respectively.

A successful capture match into a bool field will set the field to true.

Tokens can also be captured directly into fields of type lexer.Token and []lexer.Token.

Custom control of how values are captured into fields can be achieved by a field type implementing the Capture interface (Capture(values []string) error).

Additionally, any field implementing the encoding.TextUnmarshaler interface will be capturable too. One caveat is that UnmarshalText() will be called once for each captured token, so for eg. @(Ident Ident Ident) it will be called three times.
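
For example, a minimal sketch of capturing via encoding.TextUnmarshaler (the Identifier and Decl types are hypothetical, and the strings import is elided):

// Identifier normalises captured identifiers to lower case.
type Identifier string

func (i *Identifier) UnmarshalText(text []byte) error {
	*i = Identifier(strings.ToLower(string(text)))
	return nil
}

type Decl struct {
	Name Identifier `@Ident`
}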

Capturing boolean value

By default a boolean field is used to indicate that a match occurred, which turns out to be much more useful and common in Participle than parsing true or false literals. For example, parsing a variable declaration with a trailing optional syntax:

type Var struct {
  Name string `"var" @Ident`
  Type string `":" @Ident`
  Optional bool `@"?"?`
}

In practice this gives more useful ASTs. If bool were parsed literally then you'd need some alternate type for Optional, such as string or a custom type.

To capture literal boolean values such as true or false, implement the Capture interface like so:

type Boolean bool

func (b *Boolean) Capture(values []string) error {
	*b = values[0] == "true"
	return nil
}

type Value struct {
	Float  *float64 `  @Float`
	Int    *int     `| @Int`
	String *string  `| @String`
	Bool   *Boolean `| @("true" | "false")`
}

Streaming

Participle supports streaming parsing. Simply pass a channel of your grammar into Parse*(). The grammar will be repeatedly parsed and sent to the channel. Note that the Parse*() call will not return until parsing completes, so it should generally be started in a goroutine.

type token struct {
  Str string `  @Ident`
  Num int    `| @Int`
}

parser, err := participle.Build(&token{})

tokens := make(chan *token, 128)
err := parser.ParseString("", `hello 10 11 12 world`, tokens)
for token := range tokens {
  fmt.Printf("%#v\n", token)
}

Lexing

Participle relies on distinct lexing and parsing phases. The lexer takes raw bytes and produces tokens which the parser consumes. The parser transforms these tokens into Go values.

The default lexer, if one is not explicitly configured, is based on the Go text/scanner package and thus produces tokens for C/Go-like source code. This is surprisingly useful, but if you do require more control over lexing, the builtin participle/lexer/stateful lexer should cover most other cases. If that in turn is not flexible enough, you can implement your own lexer.

Configure your parser with a lexer using the participle.Lexer() option.

To use your own Lexer you will need to implement two interfaces: Definition (and optionally StringsDefinition and BytesDefinition) and Lexer.
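
For example, a minimal sketch of plugging a lexer definition into a parser (Grammar and def are placeholders for your own grammar struct and lexer definition):

parser, err := participle.Build(&Grammar{},
	participle.Lexer(def),
)
if err != nil {
	// handle the error
}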

Stateful lexer

Participle's included stateful/modal lexer provides powerful yet convenient construction of most lexers (notably, indentation-based lexers cannot be expressed).

It is sometimes the case that a simple lexer cannot fully express the tokens required by a parser. The canonical example of this is interpolated strings within a larger language. eg.

let a = "hello ${name + ", ${last + "!"}"}"

This is impossible to tokenise with a normal lexer due to the arbitrarily deep nesting of expressions.

To support this case Participle's lexer is now stateful by default.

The lexer is a state machine defined by a map of rules keyed by the state name. Each rule within the state includes the name of the produced token, the regex to match, and an optional operation to apply when the rule matches.

As a convenience, any Rule starting with a lowercase letter will be elided from output.

Lexing starts in the Root group. Each rule is matched in order, with the first successful match producing a lexeme. If the matching rule has an associated Action it will be executed. The name of each non-root rule is prefixed with the name of its group to yield the token identifier used during matching.

A state change can be introduced with the Action Push(state). Pop() will return to the previous state.

To reuse rules from another state, use Include(state).

A special named rule Return() can also be used as the final rule in a state to always return to the previous state.

As a special case, regexes containing backrefs in the form \N (where N is a digit) will match the corresponding capture group from the immediate parent group. This can be used to parse, among other things, heredocs. See the tests for an example of this, among others.
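
As a rough sketch only (the rules used in the tests differ in detail), a heredoc lexer could capture the terminator in the parent rule and match it again with \1:

var heredocLexer = stateful.Must(stateful.Rules{
	"Root": {
		{"Heredoc", `<<(\w+)`, stateful.Push("Heredoc")},
		{"Whitespace", `\s+`, nil},
		{"Ident", `\w+`, nil},
	},
	"Heredoc": {
		// \1 matches the terminator captured by the parent Heredoc rule.
		{"End", `\b\1\b`, stateful.Pop()},
		{"EOL", `\n`, nil},
		{"Body", `[^\n]+`, nil},
	},
})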

Example stateful lexer

Here's a cut down example of the string interpolation described above. Refer to the stateful example for the corresponding parser.

var def = stateful.Must(stateful.Rules{
	"Root": {
		{"String", `"`, stateful.Push("String")},
	},
	"String": {
		{"Escaped", `\\.`, nil},
		{"StringEnd", `"`, stateful.Pop()},
		{"Expr", `\${`, stateful.Push("Expr")},
		{"Char", `[^$"\\]+`, nil},
	},
	"Expr": {
		stateful.Include("Root"),
		{"whitespace", `\s+`, nil},
		{"Oper", `[-+/*%]`, nil},
		{"Ident", `\w+`, nil},
		{"ExprEnd", `}`, stateful.Pop()},
	},
})

Example simple/non-stateful lexer

The stateful lexer is now the only custom lexer supported by Participle, but most parsers won't need this level of flexibility. To support this common case, and to replace the old Regex and EBNF lexers, you can use stateful.MustSimple() and stateful.NewSimple().

eg. The lexer for a form of BASIC:

var basicLexer = stateful.MustSimple([]stateful.Rule{
    {"Comment", `(?i)rem[^\n]*`, nil},
    {"String", `"(\\"|[^"])*"`, nil},
    {"Number", `[-+]?(\d*\.)?\d+`, nil},
    {"Ident", `[a-zA-Z_]\w*`, nil},
    {"Punct", `[-[!@#$%^&*()+_={}\|:;"'<,>.?/]|]`, nil},
    {"EOL", `[\n\r]+`, nil},
    {"whitespace", `[ \t]+`, nil},
})

Experimental - code generation

Participle v2 now has experimental support for generating code to perform lexing. Use participle/experimental/codegen.GenerateLexer() to compile a stateful lexer to Go code.

This will generally provide around a 10x improvement in lexing performance while producing O(1) garbage.

Options

The Parser's behaviour can be configured via Options.

Examples

There are several examples included:

Example Description
BASIC A lexer, parser and interpreter for a rudimentary dialect of BASIC.
EBNF Parser for the form of EBNF used by Go.
Expr A basic mathematical expression parser and evaluator.
GraphQL Lexer+parser for GraphQL schemas
HCL A parser for the HashiCorp Configuration Language.
INI An INI file parser.
Protobuf A full Protobuf version 2 and 3 parser.
SQL A very rudimentary SQL SELECT parser.
Stateful A basic example of a stateful lexer and corresponding parser.
Thrift A full Thrift parser.
TOML A TOML parser.

Included below is a full GraphQL lexer and parser:

package main

import (
	"fmt"
	"os"

	"github.com/alecthomas/kong"
	"github.com/alecthomas/repr"

	"github.com/alecthomas/participle/v2"
	"github.com/alecthomas/participle/v2/lexer"
	"github.com/alecthomas/participle/v2/lexer/stateful"
)

type File struct {
	Entries []*Entry `@@*`
}

type Entry struct {
	Type   *Type   `  @@`
	Schema *Schema `| @@`
	Enum   *Enum   `| @@`
	Scalar string  `| "scalar" @Ident`
}

type Enum struct {
	Name  string   `"enum" @Ident`
	Cases []string `"{" @Ident* "}"`
}

type Schema struct {
	Fields []*Field `"schema" "{" @@* "}"`
}

type Type struct {
	Name       string   `"type" @Ident`
	Implements string   `( "implements" @Ident )?`
	Fields     []*Field `"{" @@* "}"`
}

type Field struct {
	Name       string      `@Ident`
	Arguments  []*Argument `( "(" ( @@ ( "," @@ )* )? ")" )?`
	Type       *TypeRef    `":" @@`
	Annotation string      `( "@" @Ident )?`
}

type Argument struct {
	Name    string   `@Ident`
	Type    *TypeRef `":" @@`
	Default *Value   `( "=" @@ )`
}

type TypeRef struct {
	Array       *TypeRef `(   "[" @@ "]"`
	Type        string   `  | @Ident )`
	NonNullable bool     `( @"!" )?`
}

type Value struct {
	Symbol string `@Ident`
}

var (
	graphQLLexer = stateful.MustSimple([]stateful.Rule{
		{"Comment", `(?:#|//)[^\n]*\n?`, nil},
		{"Ident", `[a-zA-Z]\w*`, nil},
		{"Number", `(?:\d*\.)?\d+`, nil},
		{"Punct", `[-[!@#$%^&*()+_={}\|:;"'<,>.?/]|]`, nil},
		{"Whitespace", `[ \t\n\r]+`, nil},
	})
	parser = participle.MustBuild(&File{},
		participle.Lexer(graphQLLexer),
		participle.Elide("Comment", "Whitespace"),
		participle.UseLookahead(2),
	)
)

var cli struct {
	EBNF  bool     `help:"Dump EBNF."`
	Files []string `arg:"" optional:"" type:"existingfile" help:"GraphQL schema files to parse."`
}

func main() {
	ctx := kong.Parse(&cli)
	if cli.EBNF {
		fmt.Println(parser.String())
		ctx.Exit(0)
	}
	for _, file := range cli.Files {
		ast := &File{}
		r, err := os.Open(file)
		ctx.FatalIfErrorf(err)
		err = parser.Parse(file, r, ast)
		r.Close()
		repr.Println(ast)
		ctx.FatalIfErrorf(err)
	}
}

Performance

One of the included examples is a complete Thrift parser (shell-style comments are not supported). This gives a convenient baseline for comparing to the PEG-based pigeon, which is the parser used by go-thrift. Additionally, the pigeon parser is generated ahead of time, while the participle parser is built at run time.

You can run the benchmarks yourself, but here's the output on my machine:

BenchmarkParticipleThrift-12    	   5941	   201242 ns/op	 178088 B/op	   2390 allocs/op
BenchmarkGoThriftParser-12      	   3196	   379226 ns/op	 157560 B/op	   2644 allocs/op

On a real-life codebase of 47K lines of Thrift, Participle takes 200ms and go-thrift takes 630ms, which aligns quite closely with the benchmarks.

Concurrency

A compiled Parser instance can be used concurrently. A LexerDefinition can be used concurrently. A Lexer instance cannot be used concurrently.

Error reporting

There are a few areas where Participle can provide useful feedback to users of your parser.

  1. Errors returned by Parser.Parse*() will be of type Error. This will contain positional information where available.
  2. Participle will make a best effort to return as much of the AST up to the error location as possible.
  3. Any node in the AST containing a field Pos lexer.Position will be automatically populated from the nearest matching token.
  4. Any node in the AST containing a field EndPos lexer.Position will be automatically populated from the token at the end of the node.
  5. Any node in the AST containing a field Tokens []lexer.Token will be automatically populated with all tokens captured by the node, including elided tokens.

These related pieces of information can be combined to provide fairly comprehensive error reporting.
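
For example, a sketch of a node that opts in to all three fields (based on the Property struct from the overview; the field names Pos, EndPos and Tokens are the ones Participle looks for):

type Property struct {
	Pos    lexer.Position // populated from the first token matched by this node
	EndPos lexer.Position // populated from the token at the end of this node

	Key   string `@Ident "="`
	Value *Value `@@`

	Tokens []lexer.Token // every token captured by this node, including elided ones
}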

Limitations

Internally, Participle is a recursive descent parser with backtracking (see UseLookahead(K)).

Among other things, this means that it does not support left recursion. Left recursion must be eliminated by restructuring your grammar.
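
For example, a left-recursive rule such as Expr = Expr "+" Term | Term can instead be written with repetition (a sketch, not one of the included examples):

type Expr struct {
	Left  *Term     `@@`
	Right []*OpTerm `@@*`
}

type OpTerm struct {
	Op   string `@("+" | "-")`
	Term *Term  `@@`
}

type Term struct {
	Value int `@Int`
}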

EBNF

Participle supports outputting an EBNF grammar from a Participle parser. Once the parser is constructed, simply call String().

Participle also includes a parser for this form of EBNF (naturally).

eg. The GraphQL example results in the following EBNF:

File = Entry* .
Entry = Type | Schema | Enum | "scalar" ident .
Type = "type" ident ("implements" ident)? "{" Field* "}" .
Field = ident ("(" (Argument ("," Argument)*)? ")")? ":" TypeRef ("@" ident)? .
Argument = ident ":" TypeRef ("=" Value)? .
TypeRef = "[" TypeRef "]" | ident "!"? .
Value = ident .
Schema = "schema" "{" Field* "}" .
Enum = "enum" ident "{" ident* "}" .

Syntax/Railroad Diagrams

Participle includes a command-line utility to take an EBNF representation of a Participle grammar (as returned by Parser.String()) and produce a Railroad Diagram using tabatkins/railroad-diagrams.

Here's what the GraphQL grammar looks like:

EBNF Railroad Diagram

Comments
  • Calculate follow-set automatically from multiple production alternatives

    First of all, thanks a lot for sharing participle with the world! I felt very happy to discover what feels like a very novel approach to parser generation, parser libraries, etc.

    I took a stab at trying to rewrite a grammar for LLVM IR to use participle, but reached the following stumbling block and thought I'd reach out and ask if it is intended by design (to keep the parser simple), or if I've done something wrong, or otherwise, if we could seek to resolve so that the follow set of a token is calculated from each present production alternatives. In this case, the follow set of "target" should be {"datalayout", "triple"}.

    I wish to write the grammar as example 2, but have only gotten example 1 to work so far. Any ideas?

    Cheers :) /u

    Input

    LLVM IR input source:

    source_filename = "foo.c"
    target datalayout = "bar"
    target triple = "baz"
    

    Example 1

    Grammar:

    type Module struct {
    	Decls []*Decl `{ @@ }`
    }
    
    type Decl struct {
    	SourceFilename string      `  "source_filename" "=" @String`
    	TargetSpec     *TargetSpec `| "target" @@`
    }
    
    type TargetSpec struct {
    	DataLayout   string `  "datalayout" "=" @String`
    	TargetTriple string `| "triple" "=" @String`
    }
    

    Example run:

    u@x220 ~/D/g/s/g/m/low> low a.ll 
    &main.Module{
        Decls: {
            &main.Decl{
                SourceFilename: "foo.c",
                TargetSpec:     (*main.TargetSpec)(nil),
            },
            &main.Decl{
                SourceFilename: "",
                TargetSpec:     &main.TargetSpec{DataLayout:"bar", TargetTriple:""},
            },
            &main.Decl{
                SourceFilename: "",
                TargetSpec:     &main.TargetSpec{DataLayout:"", TargetTriple:"baz"},
            },
        },
    }
    

    Example 2

    Grammar:

    type Module struct {
    	Decls []*Decl `{ @@ }`
    }
    
    type Decl struct {
    	SourceFilename string `  "source_filename" "=" @String`
    	DataLayout     string `| "target" "datalayout" "=" @String`
    	TargetTriple   string `| "target" "triple" "=" @String`
    }
    

    Example run:

    u@x220 ~/D/g/s/g/m/low> low a.ll
    2017/08/27 21:15:38 a.ll:3:7: expected ( "datalayout" ) not "triple"
    
  • Proposal: interface type productions

    In the past, when I've written recursive descent parsers by hand, I have used interfaces to represent "classes" of productions, such as Stmt, Expr, etc - this simplifies the consuming code considerably.

    I would like to be able to do something similar using participle - for example, I would like to define an Expr interface, and then parse that manually while allowing the rest of the grammar to be managed by participle.

    Perhaps it could look something like this:

    // Expr is the interface implemented by expression productions
    type Expr interface { expr() }
    
    // These types all implement Expr, and are parsed manually
    type Atom struct { /* ... */ }
    type Unary struct { /* ... */ }
    type Binary struct { /* ... */ }
    
    // This is the parser definition, with a new `UseInterface` option to define a custom parsing func for the Expr interface
    var theParser = participle.MustBuild(&Grammar{}, participle.UseInterface(
        func (lex *lexer.PeekingLexer) (Expr, error) {
            /* Provide an implementation of the expression parser here */
        },
    ))
    
    // And then I can use `Expr` like this:
    type Stmt struct {
        Expr   Expr    `@@` // <- participle has registered `Expr` as an interface that can be parsed
        Assign *Assign `@@`
    }
    

    As an additional nice-to-have, it would be cool if there was a way to give a *lexer.PeekingLexer and a production to the parser, and have it fill in the data for me. Then I could compose participle-managed productions within the manual parsing code for Expr. This would allow me to break out of participle in a limited way, just for the things that I need (such as managing precedence climbing)

  • How to write all optional but at least one necessary

    I'm implementing the CSS Selector grammar (https://www.w3.org/TR/selectors-4/#grammar) with participle, but I'm stuck trying to implement compound-selector.

    <compound-selector> = [ <type-selector>? <subclass-selector>*
                            [ <pseudo-element-selector> <pseudo-class-selector>* ]* ]!
    

    compound-selector requires at least one of its optional parts to be present, which is what the trailing ! indicates. If this is not implemented, the parser will go into an infinite loop. Is it possible to implement this with participle?

  • Peeking lexer optimizations

    Hi @alecthomas,

    when rebasing #213 to master, I decided it would be best to commit my stuff in smaller chunks. The first chunk is some optimizations I did to PeekingLexer as it was becoming the bottleneck to the generated parser implementation.

    1. Optimizing PeekingLexer.Clone - Checkpoint: It does an allocation that could be avoided. One way is to just have parseContext and my generated code store it by value (keeping pointer receiver for its methods though). This already helps quite a lot, but I went a bit further in my generated code - wrapping the internal cursors in a public lexer.Checkpoint type. When a branch needs a backup of the PeekingLexer state, it just needs to copy this small struct and restore it only if the branch was not accepted. It's even a little bit faster than copying & replacing the entire PeekingLexer struct.

    2. Optimizing PeekingLexer.Clone - removing it: Using the Checkpoint pattern in the reflection parser is tough as it needs to explicitly apply an accepted branch instead of reverting to a backup of a failed branch (the way that was nicer to use in the generated code). So my proposal would be to keep the Checkpoint pattern only for generated code. But I added a commit (independent of my other changes) that I can include if you like, which stores PeekingLexer in parseContext by value and avoids the Clone allocation altogether.

    3. Optimizing PeekingLexer.Peek: Currently Peek (which is used quite often by the reflective parser and even more often by my generated parser) needs to skip elided tokens every time it's called. It also checks for the length of PeekingLexer.tokens and returns p.eof in a lot of places - I think it's more elegant to just have the EOF token as the last token and prevent Next from advancing beyond it. This optimization does this work once in Upgrade and Next and makes Peek super simple; it also removes the eof field. The performance improvement for the Thrift benchmark is barely noticeable, but a separate benchmark I added shows a >3x improvement for Peek calls; it was also more noticeable in my generated parser. I mentioned this whole thing in point 2 in https://github.com/alecthomas/participle/issues/213#issuecomment-1055395492 as the initial implementation changed the behavior of RawCursor and thus also Range. This new implementation adds a separate cursor for knowing when the next non-elided token is, making it a bit slower, but prevents a change of behavior.

    4. Working with lexer.Token as pointer: Token is not a small struct and in most cases it's more efficient to work with it as a pointer instead of copying it - assuming of course no allocations are needed as the tokens are already stored in PeekingLexer.tokens. Changing Token methods to a pointer receiver alone improved CPU perf in the Thrift benchmark by ~2.5%; trying to make Next, Peek and other methods return it as a pointer made the improvement ~4%. Considering this would be a breaking change to a public (somewhat internal, but still public) API, I decided it wasn't worth it. But if you think it's OK, I still added a new method Current, which is a version of Peek that returns a pointer. BenchmarkPeekingLexer_Current still shows that it is almost 3x faster than Peek. If you don't like it, I don't need to include it. I also used pointers at least inside the PeekingLexer methods and got a tiny CPU improvement in the Thrift benchmark out of it.

    5. BatchLexer interface: This is actually unrelated to the code generation, but when playing with a manually written lexer for Jinja, I found it both more convenient and more efficient (thanks to a much lower number of function calls and more efficient allocations) to have the ability to yield multiple (in my case up to 50) tokens at a time. The interface and the related implementation are just a soft suggestion; if you don't like it, I can remove it. But it's optional - if a Lexer chooses to implement it, it can gain extra performance; if it doesn't, no problem.

    After all but the last optimization (BatchLexer isn't used by any built-in lexer - yet), the final comparison looks like:

    BenchmarkParticipleThriftGenerated-Before-8   	   12105	    292277 ns/op	        27.46 MiB/s	  178816 B/op	    2167 allocs/op
    BenchmarkParticipleThriftGenerated-After-8   	   12597	    269844 ns/op	        28.58 MiB/s	  161920 B/op	    1903 allocs/op
    

    Perhaps even more important is that according to a profiler, now it spends almost 7x less time in the PeekingLexer methods which I intended to optimize.

  • Support Matching EOF

    The parser actually supports matching EOF but then panics: panic: m:2:41: branch <eof> was accepted but did not progress the lexer at m:2:41 ("") [recovered]

    I think it would be beneficial to special-case this check to allow matching the end of a file.

  • Starting work on negation, wip

    Here is a naive interpretation on how the negation could function.

    If you want to do it entirely yourself, go ahead, but I'm willing to see this through as I really would like to see it come to fruition and this helps me hone my go funk.

    I'm looking for some feedback on that preliminary implementation.

    Am I missing something glaring ? Is the clone stuff potentially a performance killer ? Should Parse() be a little more involved ?

  • Support for sub-lexers

    To support more complex languages, it should be possible to elegantly define stateful lexers.

    Ideally this would support:

    • "here docs", eg. cat << EOF\nEOF (where EOF is a user-defined marker).
    • Runtime selected sub-lexers, ala markdown's ```<language> blocks - this would defer to some external function in order to set the lexer based on <language>.
    • Recursive lexers for eg. string interpolation, "${"${var}"}" - in this situation a new lexer is pushed onto the state when ${ is encountered, and popped when } is encountered.

    My hunch is that some kind of "stateful EBNF" could work, but it would need to be extensible programmatically, and it's not clear exactly how this would be expressed.

  • Support for negative lookahead groups and recursive capturing.

    I have a situation where my grammar needs to differentiate between capturing an individual value and capturing values as the same type within a mathematical equation. My struct looks like this:

    type value struct {
    	Invocation     *invocation     `parser:"( @@"`
    	Path           *Path           `parser:"| @@"`
    	Float          *float64        `parser:"| @Float"`
    	Int            *int64          `parser:"| @Int"`
    	MathExpression *mathExpression `parser:"| @@"`
    	IsNil          *isNil          `parser:"| @'nil'"`
    	Bytes          *byteSlice      `parser:"| @Bytes"`
    	String         *string         `parser:"| @String"`
    	Bool           *boolean        `parser:"| @Boolean"`
    	Enum           *EnumSymbol     `parser:"| @Uppercase"`
    	List           *list           `parser:"| @@ )"`
    }
    

    Deep down inside the terms and factors of MathExpression it can capture an Invocation, Path, Float, and Int. At the moment if I try to parse something like set(x, 1 + 1), the grammar sees the 1, matches it with @Int, and then doesn't know how to handle the unexpected symbol +.

    I can solve this problem for Float and Int by using negative lookahead:

    type value struct {
    	Invocation     *invocation     `parser:"( @@"`
    	Path           *Path           `parser:"| @@"`
    	Float          *float64        `parser:"| @Float (?! @OpAddSub | @OpMultDiv)"`
    	Int            *int64          `parser:"| @Int (?! @OpAddSub | @OpMultDiv)"`
    	MathExpression *mathExpression `parser:"| @@"`
    	IsNil          *isNil          `parser:"| @'nil'"`
    	Bytes          *byteSlice      `parser:"| @Bytes"`
    	String         *string         `parser:"| @String"`
    	Bool           *boolean        `parser:"| @Boolean"`
    	Enum           *EnumSymbol     `parser:"| @Uppercase"`
    	List           *list           `parser:"| @@ )"`
    }
    

    Now set(x, 1 + 1) gets parsed correctly. But the solution breaks down when I try the same strategy with Path and Invocation.

    If I try to parse set(x, y + increment()) without negative lookahead then the Path, y, is matched to Path and the parser doesn't know how to handle the +. But if I add negative lookahead like this:

    type value struct {
    	Invocation     *invocation     `parser:"( @@ (?! @OpAddSub | @OpMultDiv)"`
    	Path           *Path           `parser:"| @@ (?! @OpAddSub | @OpMultDiv)"`
    	Float          *float64        `parser:"| @Float (?! @OpAddSub | @OpMultDiv)"`
    	Int            *int64          `parser:"| @Int (?! @OpAddSub | @OpMultDiv)"`
    	MathExpression *mathExpression `parser:"| @@"`
    	IsNil          *isNil          `parser:"| @'nil'"`
    	Bytes          *byteSlice      `parser:"| @Bytes"`
    	String         *string         `parser:"| @String"`
    	Bool           *boolean        `parser:"| @Boolean"`
    	Enum           *EnumSymbol     `parser:"| @Uppercase"`
    	List           *list           `parser:"| @@ )"`
    }
    

    Then I get the following error when building the lexer: Invocation: Arguments: Invocation: structs can only be parsed with @@ or by implementing the Capture or encoding.TextUnmarshaler interfaces

    Is there a way to support recursively matching a struct but only when the negative lookahead group is satisfied?

    Full grammar can be found here: https://github.com/TylerHelmuth/opentelemetry-collector-contrib/blob/9f074d641c6c6b17b1d4774194ccbfa271f8a0da/pkg/ottl/grammar.go

  • Optional comments capturing while normally eliding them

    For https://github.com/tawasprache/kompilierer, I'd like to be able to obtain the text of a comment before an AST node, for documentation generation purposes, while otherwise ignoring them in the parser.

    I envision the API looking similar to the Pos/EndPos API:

    type Item struct {
        Name string `"function" @Ident`
        
        PrecedingComment lexer.Comment
    }
    

    can parse

    // hello!
    function hi
    

    with the comment and

    function hi
    

    without one

  • antlr2participle

    This adds the following:

    • A parser for .g4 ANTLR files.
    • A generator that uses an ANTLR AST to create a Participle lexer & parser.

    This is an initial draft. Notes:

    • Documentation is on the way.
    • Lexer modes are not yet implemented.
    • Recursive lexing is not yet implemented.
    • The skip lexer command is supported. The channel lexer command acts like skip. No other lexer commands are supported yet.
    • Actions and predicates are not supported.
    • Rule element labels are partially supported.
    • Alternative labels are parsed but not supported in the generator.
    • Rule arguments are not supported.

    Feedback is appreciated.

  • Matching Tokens

    In line with the « Anything But » that I suggested in #104, I am looking for a way to get the tokens around a match. The idea would be that we could match a token without necessarily consuming it.

    type Example struct {
      StartToken *lexer.Token `@?`
      SomeRuleThatMatches *[]string ` (@!";")+ ";" `
      EndToken *lexer.Token `@?`
    }
    

    Note that the @? is not a proposition, as I wonder what would make sense here.

    Now as to why I would need it ; usually, the lexers ignore whitespace. This makes of course for a simpler and cleaner grammar, as there is no need to mention optional whitespace everywhere.

    However, I sometimes need to access the "exact string" that was matched by a rule, discarded text included. This is because I'm writing parsers for incomplete languages where I do not care about the meaning of some constructs - I just want them "as is".

    Is there already a way to do such things ? I tried to match @Token into a *lexer.Token, but that didn't work. Also, I think that it would be more useful for token extraction that they don't advance the parser.

  • Pooled version of lexer.Upgrade

    As observed in https://github.com/alecthomas/participle/issues/213#issuecomment-1344945666, can result in 10% - 30% overall lexing/parsing speedup in cases where needing to lex/parse thousands of documents of a similar size all in a row.

    Added a microbenchmark as well showing >4x speedup of Upgrade operation itself if used repeatedly for docs of a similar size.

  • Ideas for gen lexer supporting "give back" for quest

    After #275 my lexer test suite was still failing. It looks like it's because the Quest regex operator (and related) doesn't support "giving back". For example, imagine you have the pattern for matching a 3 or 4 capital letter time zone: [A-Z][A-Z][A-Z]?T -- the generated lexer currently will not match EST, because the current quest operator implementation is fully greedy (never gives back). The generated code will match EST for the first three operations, but that advances the position past the ending T and thus the last literal T does not match. It doesn't backtrack to try the case where the [A-Z]? term was never matched, which would have allowed it to match.

    The current generated code looks something like:

    // [A-Z][A-Z][A-Z]?T (Concat)
    l3 := func(s string, p int) int {
    if p = l0(s, p); p == -1 { return -1 }
    if p = l0(s, p); p == -1 { return -1 }
    if p = l1(s, p); p == -1 { return -1 }
    if p = l2(s, p); p == -1 { return -1 }
    return p
    }
    

    It seems the implementation to support "give back" could be pretty complicated, so I wanted to discuss about possible good directions.

    Ideas:

    1. The least efficient but simplest would be to have it rewrite the regexp to remove the quest operator and instead replace the entire expression it was concatenated within with the two (or more variants). e.g., [A-Z][A-Z][A-Z]?T would be rewritten internally to [A-Z][A-Z][A-Z]T|[A-Z][A-Z]T before going through codegen. This could quickly lead to exponential explosion of the number of alternates for any reasonably complex regex, but the code itself would never have to "backtrack". It's simple to implement, but could be horribly slow, doesn't seem like a good option.
    2. Perhaps change the generated functions representing expression (e.g., l3 func above) to return a list of positions, rather than a single position (or an empty list instead of -1). Then the caller would explore all of the possibilities starting from that point (or if there was no match because it got an empty list, would stop and return an empty list right there). It feels like this direction might work, but would require some memory allocation with the returned lists (maybe there is a way to optimize by passing a shared list or using an allocation pool). A sketch of this might look like:
    // [A-Z][A-Z][A-Z]?T (Concat)
    l3 := func(s string, p int) []int {
    	matchedPositions := make([]int, 0)
    	for _, nextP := range l0(s, p) {
    		for _, nextP := range l0(s, nextP) {
    			for _, nextP := range l1(s, nextP) {
    				for _, nextP := range l2(s, nextP) {
    					matchedPositions = append(matchedPositions, nextP)
    				}
    			}
    		}
    	}
    	return matchedPositions
    }
    

    Probably this method could be optimized where it could recognize if a given function can only ever return a single position, and if so avoid the overhead of allocation. e.g., it could look more like:

    l3 := func(s string, p int) []int {
            matchedPositions := make([]int, 0)
    	if p = l0(s, p); p != -1 {
    		if p = l0(s, p); p != -1 {
    			for _, p := range l1(s, p) {
    				if p = l2(s, p); p != -1 {
    					matchedPositions = append(matchedPositions, p)
    				}
    			}
    		}
    	}
    	return matchedPositions
    }
    

    This variant is nice because it can still use p regardless of whether it's in a loop (previous function had multiple possibilities) or just an if statement (previous function had just one possibility). With an allocation pool for the matchedPositions array, maybe it wouldn't actually be that bad (memory complexity would be related to the number of possible open "branches" at any given time)

    3. Perhaps some technique where the caller recognizes the function it's calling represents a quest, and generates the different "paths" in the code (duplicating code for the entire "branch" for the remainder of that concatenation) -- this seems both hard to implement, and code size could explode for complex expressions.

    What do you think? Any ideas on better directions?

  • Include token type in unexpected token error message

    Sometimes I get an unexpected parsing error like

    unexpected token \"group\"
    

    and I get confused because group should be a valid next token in my grammar. However, it turns out that group is a Keyword token in my lexer, not an Identifier token like I wanted. Upon realizing that the token was of the wrong type, the error makes sense now. However, realizing that the token type was the problem can take a while, especially if the lexer is complicated or contains overlapping rules, because the error message doesn't give any indication of the types of the tokens involved (found OR expected). I think that including the token types in the error message would help people facing this issue solve it much more quickly.

  • Streamline long term maintenance of lexer/parser code generators

    I have some concerns with the long term maintenance of both the lexer and parser code generators which we should solve (or at least, not make worse) before getting these changes into master.

    Problem 1: divergence

    My first concern is that as the runtime code evolves, it diverges from the code generators. This has already occurred with the lexer, where the generated code does not handle lookahead. There will need to be some tests to detect this somehow. For the parser, enumerating the nodes and ensuring they're all handled by the generator (somehow) might work.

    • [ ] Conformance tests for lexer
    • [ ] Conformance tests for parser

    Problem 2: ergonomics

    There are also some ergonomic issues with generating the code. Specifically, having to have an implementation of the runtime code lying around as the "source of truth".

    Proposal: serialisable form for both lexers and parse trees

    • [x] Serialisable lexer
    • [x] Create participle gen lexer
    • [ ] Serialisable parser
    • [ ] Create participle gen parser

    One solution to this is for the code generators to be decoupled completely from the code. The lexer and parser would be extended to be JSON marshallable and the code generators would become standalone binaries that could ingest this serialised form and output code. This might be non-trivial for the parser because it is tightly coupled to reflection - TBD.

    Another option would be to make standalone binaries that parse the Go code directly, making the Go AST and compile-time type system the intermediate form. This would be much more complicated though.

  • Improve error message for custom capture

    If a custom capture generates an error, the user will receive an error message with the full struct name.

     diff --git a/_examples/ini/main.go b/_examples/ini/main.go
     index be6ec9e..e483923 100644
     --- a/_examples/ini/main.go
     +++ b/_examples/ini/main.go
     @@ -1,6 +1,7 @@
      package main
      
      import (
     +       "errors"
             "os"
      
             "github.com/alecthomas/repr"
     @@ -51,11 +52,17 @@ type String struct {
      func (String) value() {}
      
      type Number struct {
     -       Number float64 `@Float`
     +       Number N `@Float`
      }
      
      func (Number) value() {}
      
     +type N struct{}
     +
     +func (*N) Capture(values []string) error {
     +       return errors.New("problem")
     +}
     +
    
    

    panic: Number.Number: problem

    An error message with the type name or token name would be much better: <float>: problem

  • possible bug in tag syntax

    Hello,

    it seems there is a difference between the "raw" and "parse" tag syntax? The following repro shows what I stumbled upon.

    package tagsyntax
    
    import (
    	"testing"
    
    	"github.com/alecthomas/assert/v2"
    	"github.com/alecthomas/participle/v2"
    )
    
    type GoodAST struct {
    	Key        string   `parser:"@Ident '='"`
    	BlankLines []string `"\n"`
    }
    
    type BadAST struct {
    	Key string `parser:"@Ident '='"`
    	// Same as field in GoodAST, as explained in https://github.com/alecthomas/participle#tag-syntax
    	BlankLines []string `parser:"'\n'"`
    }
    
    func TestLiteralNotTerminatedGood(t *testing.T) {
    	_, err := participle.Build(&GoodAST{})
    
    	assert.NoError(t, err)
    }
    
    func TestLiteralNotTerminatedBad(t *testing.T) {
    	_, err := participle.Build(&BadAST{})
    
    	// The error is:
    	//
    	//     Key: <input>:1:2: literal not terminated
    	//
    	// which is confusing because it refers to the previous field in the struct (Key)
    	// and unclear?
    	assert.NoError(t, err)
    }
    