
PEG, an Implementation of a Packrat Parsing Expression Grammar in Go


A Parsing Expression Grammar (hence peg) is a way to create grammars similar in principle to regular expressions, but which allow better code integration. Specifically, peg is an implementation of the Packrat parser generator originally implemented as peg/leg by Ian Piumarta in C. A Packrat parser is a recursive descent parser capable of backtracking and negative look-ahead assertions, both of which are problematic for regular expression engines.

Installing

go get -u github.com/pointlander/peg

Building

Using Pre-Generated Files

go install

Generating Files Yourself

You should only need to do this if you are contributing to the library, or if something gets messed up.

go run build.go or go generate

With tests:

go run build.go test

Usage

peg [<option>]... <file>

Usage of peg:
  -inline
      parse rule inlining
  -noast
      disable AST
  -output string
      specify name of output file
  -print
      directly dump the syntax tree
  -strict
      treat compiler warnings as errors
  -switch
      replace if-else if-else like blocks with switch blocks
  -syntax
      print out the syntax tree
  -version
      print the version and exit
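
A typical invocation, combining the optimization flags listed above, might look like this (the file names are only an example):

```shell
peg -inline -switch -output grammar.go grammar.peg
```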
	  

Sample Makefile

This sample Makefile will convert any file ending with .peg into a .go file with the same name. Adjust as needed.

.SUFFIXES: .peg .go

.peg.go:
	peg -noast -switch -inline -strict -output $@ $<

all: grammar.go

Use caution when picking your names to avoid overwriting existing .go files. Since only one PEG grammar is currently allowed per Go package, the name grammar.peg is suggested as a convention:

grammar.peg
grammar.go
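
Since peg supports go generate, the same conversion can instead be driven from a directive in a Go source file; this sketch reuses the flags from the sample Makefile above:

```go
//go:generate peg -noast -switch -inline -strict -output grammar.go grammar.peg
```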

PEG File Syntax

First declare the package name and any import(s) required:

package <package name>

import <import name>

Then declare the parser:

type <parser name> Peg {
	<parser state variables>
}
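
For example, a hypothetical calculator parser that keeps a running result in its state might be declared like this (the names here are illustrative, not part of the library):

```
package calculator

type Calculator Peg {
	result int
}
```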

Next declare the rules. The main rules are described below; they are based on the peg/leg rules, which provide additional documentation.

The first rule is the entry point into the parser:

<rule name> <- <rule body>

The first rule should probably end with !. to indicate that no more input follows:

first <- . !.

This is often set to END to make PEG rules more readable:

END <- !.

The expression . matches any single character. For zero or more character matches, use:

repetition <- .*

For one or more character matches, use:

oneOrMore <- .+

For an optional character match, use:

optional <- .?

If specific characters are to be matched, use single quotes:

specific <- 'a'* 'bc'+ 'de'?

This will match the string "aaabcbcde".

For choosing between different inputs, use alternates:

prioritized <- 'a' 'a'* / 'bc'+ / 'de'?

This will match "aaaa" or "bcbc" or "de" or "". The matches are attempted in order.

If the characters are case insensitive, use double quotes:

insensitive <- "abc"

This will match "abc" or "Abc" or "ABc" and so on.

For matching a set of characters, use a character class:

class <- [a-z]

This will match "a" or "b" or all the way to "z".

For an inverse character class, start with a caret:

inverse <- [^a-z]

This will match anything but "a" or "b" or all the way to "z".

If the character class is case insensitive, use double brackets:

insensitive <- [[A-Z]]

(Note that this is not available in regular expression syntax.)

Use parentheses for grouping:

grouping <- (rule1 / rule2) rule3

For a positive look-ahead match (a predicate), use:

lookAhead <- &rule1 rule2

For a negative (inverse) look-ahead, use:

inverse <- !rule1 rule2

Use curly braces for Go code:

gocode <- { fmt.Println("hello world") }

For string captures, use angle brackets (less than and greater than):

capture <- <'capture'> { fmt.Println(text) }

This will print "capture". The captured string is stored in buffer[begin:end].
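
The constructs above can be combined into a small complete grammar. The following sketch counts and prints the words in the input; the package, parser, and rule names are hypothetical, not taken from the peg distribution:

```
package words

import "fmt"

type WordCounter Peg {
	count int
}

first <- (word / skip)* END
word  <- <[[a-z]]+> { p.count++; fmt.Println(text) }
skip  <- (![[a-z]] .)+
END   <- !.
```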

Testing Complex Grammars

Testing a grammar usually requires more than average unit testing, with multiple inputs and outputs. Grammars are also usually not tied to just one language implementation. Consider maintaining a list of inputs with expected outputs in a structured file format such as JSON or YAML and parsing it for testing, or use one of the available Go options such as Rob Muhlestein's tinout package.

Files

  • bootstrap/main.go - bootstrap syntax tree of peg
  • tree/peg.go - syntax tree and code generator
  • peg.peg - peg in its own language

Author

Andrew Snodgrass

Issues

Here are some reported issues, which provide further examples of PEG grammars and common pitfalls:
  • C grammar fails with trivial but legal C snippet

    int main() { (a)||1; }

    In general, the two-place operator symbols (||, &&, ->, >>, <<) fail to parse in this configuration, as well as the two-place postfix operators ("(a)--", "(a)++"). However, I did notice that ">" and "<" also fail.

    The key here is the LPAR and RPAR wrapping the expression. This seems to trigger it.

    I'm rather losing my mind over the bug. Any help very much appreciated!

  • Parser very slow

    The parser allocates an enormous amount of memory

    var tree tokenTree = &tokens32{tree: make([]token32, math.MaxInt16)}
    

    And then uses its own vector-doubling scheme:

    func (t *tokens32) Expand(index int) tokenTree {
            tree := t.tree
            if index >= len(tree) {
                    expanded := make([]token32, 2*len(tree))
                    copy(expanded, tree) 
                    t.tree = expanded
            }
            return nil
    }
    

    Both of these cause the parser to be very slow, because they generate a large amount of garbage. This should probably be optimized.

  • Parser completely breaks without warning if you have more than 65536 tokens

    I am parsing a medium sized file (60 kB) and the parser breaks if more than 16 bits worth of tokens are parsed. The AST will be completely wrong and missing tokens, because token indices cannot exceed 16 bits.

    I made it work correctly by manually editing the generated file and changing int16 to int32. However, it also looks like slices are preallocated like make([]int32, 1, math.MaxInt16), which cannot simply be changed to 0, and cannot be changed to math.MaxInt32 because no one has that much memory. So I changed it to 18 bits, but this obviously will not work for files with more tokens.

    I don't feel comfortable submitting a patch for this myself because it looks like this is used in quite a few places and will probably require a significant change to remove these static limits.

  • Bug: an error occurs when importing a repository that contains numeric character

    The following peg file causes an error:

    parse error near PegText (line 3 symbol 9 - line 3 symbol 25): "github.com/hachi"

    package grammar
    
    import "github.com/hachi8833/sample/token"
    
    type Parser Peg {
      token.Program
    }
    
    Program <-
        expression EOF
        / expression <.+> {p.Err(begin, buffer)} EOF
        / <.+> {p.Err(begin, buffer)} EOF
    
    expression <-
        additive
    
    additive <-
        multitive (
            '+' multitive {p.PushOpe("+")}
          / '-' multitive {p.PushOpe("-")}
        )*
    
    multitive <-
        value (
            '*' value {p.PushOpe("*")}
          / '/' value {p.PushOpe("/")}
        )*
    
    value <-
        <[0-9]+> {p.PushDigit(text)}
        / '(' expression ')'
    
    EOF                 <- !.
    

    Just changing the repository name to like import "github.com/hachi/sample/token" works fine. (My GitHub user name contains some numeric characters😅)

  • using peg with io.RuneReader

    Hi, could you explain how to use peg-generated code against text coming from a stream? If this is not supported, could you outline how you would want it implemented?

  • Parser hangs forever

    grammar.peg

    package main
    
    import "os/exec"
    
    type Prog Peg {
         Cmd *exec.Cmd
         In io.Writer
    }
    
    Command <- <(!nl)*> nl eof  
    
    eof <- !. {logln("end of file");p.In.Write([]byte{'\x03'})}
    
    nl <- "\n"
    

    main.go:

    // patmatch is a tool for pattern matching in bash scripts
    package main
    
    import (
    	"os"
    	"os/exec"
    	"log"
    )
    
    const Shell = "/bin/bash"
    
    //go:generate peg grammar.peg
    
    func main() {
    	src := `echo test
    `
    	p := Prog{Cmd:&exec.Cmd{
    		Path:Shell,
    		Stdout:os.Stdout,
    	},Buffer:src}
    	var err error
    	p.In, err = p.Cmd.StdinPipe()
    	fatal(err)
    	logln("starting shell")
    	err = p.Cmd.Start()	
    	fatal(err)
    	logln("initializing parser")
    	p.Init()
    	fatal(err)
    	logln("parsing")
    	err = p.Parse()
    	fatal(err)
    	logln("executing")
    	p.Execute()
    	logln("done")
    }
    
    func fatal(err error) {
    	if err != nil {panic(err)}
    } 
    
    func logln(args ...interface{}) {
    	log.Println(args...)
    }
    

    peg -version: version: unknown-5cdb3adc061370cdd20392ffe2740cc8db104126

  • bootstrap with smaller grammar

    The language needed to define peg in peg is a subset of the language defined by peg.peg. To minimize the hardcoded grammar in the bootstrap phase, bootstrap with a smaller grammar, then build the full peg language.

  • Parse and Reset as struct methods instead of struct fields

    Parse and Reset are currently defined as fields in the generated parser struct, not methods on the struct. One consequence of this choice is that we cannot define an interface over the generated struct, because interfaces cannot contain fields. This comes up when one wants to abstract over multiple parsers.

    My current workaround is to define wrappers for each generated struct, say:

    func (p *Listing) ParseIntf() error {
    	return p.Parse()
    }
    

    and define the interface as (simplified example):

    type Listing interface {
    	Init()
    	ParseIntf() error
    	Execute()
    }
    

    I wish you would consider lifting Parse and Reset to methods. Thanks!

  • Parse Tree

    Hi, we are trying to write a parser for a grammar that needs more lookahead than the LR(1) parsing that http://code.google.com/p/gocc/ provides. I am pretty sure we will be able to PEGify the grammar, but I was wondering whether we can access the parse tree in some way, or whether it is easy to build an AST bottom-up in this PEG implementation?

    We have really easy to use SDT rules in gocc to build up an AST in a bottom up way.

    I have not used PEG before, but I have read an article and I am really amped :)

    Please help, Thank you Walter Schulze

  • [bug] undefined: RulePegText

    ./m2.peg.go:630: undefined: RulePegText. What is RulePegText, and how do I define it?

    ./m2.peg

    package main
    
    type JsonParser Peg{
      Json
    }
    json <- may_space (json_object / json_array / json_string / json_number / json_true / json_false / json_null) may_space
    json_object <- '{' may_space '}' / '{' (json_object_pair ',')* json_object_pair  '}'
    json_object_pair <- may_space json_string may_space ':' json
    json_array <- '[' may_space ']' / '[' (json ',')* json ']'
    json_true <- 'true' { p.addJson(buffer[begin:end]) }
    json_false <- 'false'
    json_null <- 'null'
    json_string <- '"' json_double_char* '"'
    json_double_char <- [^"\\] / '\\' ["\\/bfnrt] / '\\u' json_hex_char json_hex_char json_hex_char json_hex_char
    json_hex_char <- [0-9a-fA-F]
    json_number <- '-'? ('0' / [1-9][0-9]*) ('.' [0-9]+)? ([eE][+-]?[0-9]+)? may_space
    
    space_char <- [ \n\r\t]
    #space <- space_char+
    may_space <- space_char*
    
    

    ./main.go

    package main
    
    import (
      "fmt"
      "io/ioutil"
      "launchpad.net/goyaml"
    )
    type Json struct{
    }
    func (j *Json) addJson(json string){
      fmt.Println(json)
    }
    type JsonTest map[string]string
    func main(){
      test_yaml_string,err:= ioutil.ReadFile("json_test.yml")
      if err!=nil{
        fmt.Println(err)
        return
      }
      json_test_data:=make(JsonTest)
      err = goyaml.Unmarshal(test_yaml_string,json_test_data)
      if err!=nil{
        fmt.Println(err)
        return
      }
      for test_name,test_data:=range json_test_data{
        parser:= &JsonParser{Buffer:test_data}
        parser.Init()
        err := parser.Parse()
        if err!=nil{
          fmt.Println("FAIL "+test_name+" ",err)
        }else{
          fmt.Println("PASS "+test_name)
        }
      }
      fmt.Println("success")
    }
    
  • case insensitive grammars

    Hi,

    I'm quite interested in using this, but I've found a fairly major sticking point. There doesn't appear to be any easy way to parse case insensitive grammars, at least that I can see. Given how prevalent case insensitive language grammars are, it'd be nice if peg supported an easier way to parse them.

    I've done some searching, and it appears that this is a common problem with things based on peg. I see some discussion on the pegjs project about using a "characters"i syntax to denote case-insensitive character chunks:

    https://github.com/dmajda/pegjs/issues/34

    I'm not sure if you like that syntax or not, but something similar to ease case insensitive grammars would be super useful.

  • Unpredictable "Code generated by" comment when invoking peg with "go run"

    Since go run has been made module aware, it is convenient to use go run with //go:generate directives, so that your project is able to trivially use a fixed version of its external dependencies.

    I want to write my go:generate directive like this:

    $ git grep go:generate
    peg.go://go:generate go run github.com/pointlander/peg -inline -switch query.peg
    

    But go run builds the target binary in a temporary directory, and main.go passes the entirety of os.Args to the template, such that os.Args[0] contains the full path to the built peg binary in a random temporary directory: https://github.com/pointlander/peg/blob/e7588a89197f28bc2191a42a0562d77b257e20fe/main.go#L87

    This results in a diff every time go generate has been run:

    $ go generate && git grep 'Code generated by'
    query.peg.go:// Code generated by /var/folders/_m/h25_32y958gbgk67m97141400000gq/T/go-build1253021897/b001/exe/peg -inline -switch query.peg DO NOT EDIT.
    
    $ go generate && git grep 'Code generated by'
    query.peg.go:// Code generated by /var/folders/_m/h25_32y958gbgk67m97141400000gq/T/go-build3041684327/b001/exe/peg -inline -switch query.peg DO NOT EDIT.
    

    (The go-build portion of the directory is different on each invocation above.)

    It would be nice if os.Args[0] was just set to peg by default, but if it is important to maintain backwards compatibility, you could add a new flag to peg. I would lean towards something like -fixedname to mean "just set os.Args[0] to peg regardless of its actual value". Another option would be something like -arg0name=peg, but I doubt anyone would need it customized to anything than some arbitrarily fixed name, hence my preference to a simple boolean flag.

    In the meantime, I can work around this by changing my //go:generate to just build the binary into a fixed directory:

    $  git grep go:generate
    peg.go://go:generate go build -o ./.bin/peg github.com/pointlander/peg
    peg.go://go:generate ./.bin/peg -inline -switch query.peg
    

    Then, the comment does not change on subsequent go generate calls.

    $ go generate && git grep 'Code generated by'
    query.peg.go:// Code generated by ./.bin/peg -inline -switch query.peg DO NOT EDIT.
    
    $ go generate && git grep 'Code generated by'
    query.peg.go:// Code generated by ./.bin/peg -inline -switch query.peg DO NOT EDIT.
    
  • tree: avoid using strings.Builder

    strings.Builder was introduced in Go 1.10. Since all other code generated by peg is compatible with Go versions older than that, it would be nice not to require Go 1.10 just for writing the AST to a string.

  • Rule redeclaration causes segmentation fault

    Expected behavior

    In case of rule redeclaration the generator should return an error:

    package main
    
    type parser Peg {
    }
    
    main <- (a)+
    a <- 'a'
    a <- 'a'
    

    Actual behavior

    The invalid grammar crashes the generator due to a segmentation fault:

    romanscharkov@RomMac pegbug % peg grammar.peg
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1164e55]
    
    goroutine 1 [running]:
    github.com/pointlander/peg/tree.(*Tree).Compile(0xc000074000, 0xc00001a0e0, 0xe, 0xc00000c060, 0x2, 0x2, 0x1200e60, 0xc00000e030, 0x0, 0x0)
            /Users/romanscharkov/go/src/github.com/pointlander/peg/tree/peg.go:1506 +0x1475
    main.main()
            /Users/romanscharkov/go/src/github.com/pointlander/peg/main.go:87 +0x575
    
  • README.md needs updating

    • [ ] update usage to include new cli flags
    • [ ] acknowledge the fact that converted files are now .peg.go instead of .go
    • [ ] acknowledge go generate support, although people should be able to just run go install