A full-featured regex engine in pure Go based on the .NET engine

Last update: Jan 9, 2023

Comments: 17

regexp2 - full featured regular expressions for Go

Regexp2 is a feature-rich RegExp engine for Go. Unlike the built-in regexp package, it doesn't guarantee linear-time matching, but it allows backtracking and is compatible with Perl5 and .NET. You'll likely be better off with the RE2 engine from the regexp package and should only use this if you need to write very complex patterns or require compatibility with .NET.

Basis of the engine

The engine is ported from the .NET framework's System.Text.RegularExpressions.Regex engine. That engine was open sourced in 2015 under the MIT license. There are some fundamental differences between .NET strings and Go strings that required a bit of borrowing from the Go framework regex engine as well. I cleaned up a couple of the dirtier bits during the port (regexcharclass.cs was terrible), but the parse tree, code emitted, and therefore patterns matched should be identical.

Installing

This is a go-gettable library, so install is easy:

go get github.com/dlclark/regexp2/...

Usage

Usage is similar to the Go regexp package. Just like in regexp, you start by converting a regex into a state machine via the Compile or MustCompile functions. They ultimately do the same thing, but MustCompile will panic if the regex is invalid. You can then use the provided Regexp struct to find matches repeatedly. A Regexp struct is safe to use across goroutines.

re := regexp2.MustCompile(`Your pattern`, 0)
if isMatch, _ := re.MatchString(`Something to match`); isMatch {
    //do something
}

The only error that the *Match* methods should return is a Timeout if you set the re.MatchTimeout field. Any other error is a bug in the regexp2 package. If you need more details about capture groups in a match then use the FindStringMatch method, like so:

if m, _ := re.FindStringMatch(`Something to match`); m != nil {
    // the whole match is always group 0
    fmt.Printf("Group 0: %v\n", m.String())

    // you can get all the groups too
    gps := m.Groups()

    // a group can be captured multiple times, so each cap is separately addressable
    fmt.Printf("Group 1, first capture: %v\n", gps[1].Captures[0].String())
    fmt.Printf("Group 1, second capture: %v\n", gps[1].Captures[1].String())
}

Group 0 is embedded in the Match. Group 0 is an automatically-assigned group that encompasses the whole pattern. This means that m.String() is the same as m.Group.String() and m.Groups()[0].String().

The last capture is embedded in each group, so g.String() will return the same thing as g.Capture.String() and g.Captures[len(g.Captures)-1].String().
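
The embedding relationships described above can be pictured with a small standalone sketch. The Capture, Group and Match types below are simplified stand-ins written for illustration (they are assumptions, not regexp2's real definitions); they only show how Go struct embedding makes the three spellings return the same text.

```go
package main

import "fmt"

// Capture, Group and Match are simplified stand-ins for illustration only;
// they are NOT regexp2's real definitions.
type Capture struct{ Text string }

func (c Capture) String() string { return c.Text }

// Group embeds its most recent Capture, so String() is promoted to Group.
type Group struct {
	Capture
	Captures []Capture
}

// Match embeds group 0 and keeps the remaining groups separately.
type Match struct {
	Group
	others []Group
}

// Groups returns group 0 followed by the other groups.
func (m *Match) Groups() []Group {
	return append([]Group{m.Group}, m.others...)
}

// newGroup builds a Group whose embedded Capture is the last capture given.
func newGroup(caps ...string) Group {
	g := Group{}
	for _, c := range caps {
		g.Captures = append(g.Captures, Capture{Text: c})
	}
	if n := len(g.Captures); n > 0 {
		g.Capture = g.Captures[n-1]
	}
	return g
}

func main() {
	m := &Match{Group: newGroup("abab"), others: []Group{newGroup("ab", "ab")}}

	// Group 0 is embedded in the Match.
	fmt.Println(m.String() == m.Group.String())
	fmt.Println(m.String() == m.Groups()[0].String())

	// The last capture is embedded in each group.
	g := m.Groups()[1]
	fmt.Println(g.String() == g.Captures[len(g.Captures)-1].String())
}
```

All three comparisons print true in this sketch, because each outer type simply promotes the String() method of the value it embeds.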

If you want to find multiple matches from a single input string you should use the FindNextMatch method. For example, to implement a function similar to regexp.FindAllString:

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	m, _ := re.FindStringMatch(s)
	for m != nil {
		matches = append(matches, m.String())
		m, _ = re.FindNextMatch(m)
	}
	return matches
}

FindNextMatch is optimized so that it re-uses the underlying string/rune slice.

The internals of regexp2 always operate on []rune so Index and Length data in a Match always reference a position in runes rather than bytes (even if the input was given as a string). This is a dramatic difference between regexp and regexp2. It's advisable to use the provided String() methods to avoid having to work with indices.

Compare `regexp` and `regexp2`

Category	regexp	regexp2
Catastrophic backtracking possible	no, linear-time execution guarantees	yes, if your pattern is at risk you can use the `re.MatchTimeout` field
Python-style capture groups `(?P<name>re)`	yes	no (yes in RE2 compat mode)
.NET-style capture groups `(?<name>re)` or `(?'name're)`	no	yes
comments `(?#comment)`	no	yes
branch numbering reset `(?|a|b)`	no	no
possessive match `(?>re)`	no	yes
positive lookahead `(?=re)`	no	yes
negative lookahead `(?!re)`	no	yes
positive lookbehind `(?<=re)`	no	yes
negative lookbehind `(?<!re)`	no	yes
back reference `\1`	no	yes
named back reference `\k'name'`	no	yes
named ascii character class `[[:foo:]]`	yes	no (yes in RE2 compat mode)
conditionals `(?(expr)yes|no)`	no	yes

RE2 compatibility mode

The default behavior of regexp2 is to match the .NET regexp engine; however, the RE2 option is provided to change the parsing to increase compatibility with RE2. Using the RE2 option when compiling a regexp will not take away any features, but will change the following behaviors:

add support for named ascii character classes (e.g. [[:foo:]])
add support for python-style capture groups (e.g. (?P<name>re))
change singleline behavior for $ to only match end of string (like RE2) (see #24)

re := regexp2.MustCompile(`Your RE2-compatible pattern`, regexp2.RE2)
if isMatch, _ := re.MatchString(`Something to match`); isMatch {
    //do something
}

This feature is a work in progress and I'm open to ideas for more things to put here (maybe more relaxed character escaping rules?).

Library features that I'm still working on

Regex split

Potential bugs

I've run a battery of tests against regexp2 from various sources and found the debug output matches the .NET engine, but .NET and Go handle strings very differently. I've attempted to handle these differences, but most of my testing deals with basic ASCII with a little bit of multi-byte Unicode. There's a chance that there are bugs in the string handling related to character sets with supplementary Unicode chars. Right-to-Left support is coded, but not well tested either.

Find a bug?

I'm open to new issues and pull requests with tests if you find something odd!

Owner

Doug Clark

https://github.com/dlclark/regexp2

Comments

Performance issue matching against beginning of very large string

I am tokenizing some text by matching a set of regexes against the beginning of a string holding the contents of a file. I noticed that regexp2 was extremely slow for this use-case, and after running the profiler found that the time was dominated by getRunes().

This is occurring because, before every match, regexp2 converts the entire 22kb string to a slice of runes. I've worked around the issue by pre-converting the string to a slice of runes myself, then using FindRunesMatch(), but it was quite surprising and non-obvious.

A solution would be to convert runes on the fly (as most matches are under 10 characters, converting the whole string each time is redundant). Looking at the code, it doesn't seem like it would be super painful to achieve. The runner would need to be modified to use DecodeRuneInString to advance the index into the string, rather than a direct index into a slice of runes.
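
To make the suggestion concrete, here is a minimal standalone sketch of decoding runes on demand with utf8.DecodeRuneInString. The runeCursor type and hasPrefixRunes function are hypothetical names invented for this illustration; they are not regexp2's runner and say nothing about how the library is actually structured.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCursor is a hypothetical type (not regexp2's runner). It walks a
// string one rune at a time, decoding only what is actually consumed.
type runeCursor struct {
	s   string
	pos int // byte offset of the next rune to decode
}

// next decodes the rune at the cursor and advances past it.
// The boolean is false once the end of the input is reached.
func (c *runeCursor) next() (rune, bool) {
	if c.pos >= len(c.s) {
		return 0, false
	}
	r, size := utf8.DecodeRuneInString(c.s[c.pos:])
	c.pos += size
	return r, true
}

// hasPrefixRunes reports whether s starts with the runes in want. It only
// touches the leading bytes of s instead of converting all of s to []rune.
func hasPrefixRunes(s string, want []rune) bool {
	c := runeCursor{s: s}
	for _, w := range want {
		r, ok := c.next()
		if !ok || r != w {
			return false
		}
	}
	return true
}

func main() {
	input := "héllo, wörld"
	fmt.Println(hasPrefixRunes(input, []rune("hél"))) // true
	fmt.Println(hasPrefixRunes(input, []rune("hel"))) // false
}
```

The cost of a check like this depends on the number of runes examined rather than on the size of the whole input, which is the property the comment above is asking for.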

Seems to fail a positive lookahead

Hello, I was checking it out and it seems to fail a regular expression. For a given text like this one, the expression ((Art\.\s\d+)[\S\s]*?(?=Art\.\s\d+)) fails to match every Art. block in the text. I've tested the expression on this website and there it gives me the correct count of 12 matches.

Am I missing something? Maybe a multiline flag?

Bulk replace

Hello,

I'd just like to ask whether you have any plans to add bulk replace functions to regexp2, like those in the Go standard regexp package? https://golang.org/pkg/regexp/#Regexp.ReplaceAll

func (re *Regexp) ReplaceAll(src, repl []byte) []byte

func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte

func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte

func (re *Regexp) ReplaceAllLiteralString(src, repl string) string

func (re *Regexp) ReplaceAllString(src, repl string) string

func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

Thank you,

Regex Multiline

I have the regex ^(ac|bb)$\n and I am not using the Multiline option. I expected MustCompile to fail with an error, but it does not, and the pattern matches the string "ac\n". What can I do so that it throws an error?

Improve ECMAScript compatibility.

Hi,

This PR includes a couple of fixes to improve ECMAScript compatibility. The added test cases illustrate the issues fixed. Please consider merging.

Error while trying to match a string with a specific unicode against a RegExp that contains a space and a group

When trying to match (phrase.MatchString(X)) messages like gg 󠀀 󠀀 (notice that these are not the regular spaces) against a phrase like regexp2.MustCompile("\\bcool (house)\\b", 0), the following error will be thrown:

panic: runtime error: index out of range [917504] with length 128

goroutine 1 [running]:
github.com/dlclark/regexp2/syntax.(*BmPrefix).Scan(0xc000180540, {0xc000b70948, 0x6, 0x0?}, 0x0?, 0x0, 0x6)
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/syntax/prefix.go:716 +0x3bb
github.com/dlclark/regexp2.(*runner).findFirstChar(0xc000623a00)
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:1305 +0x366
github.com/dlclark/regexp2.(*runner).scan(0xc000623a00, {0xc000b70948?, 0x6, 0xc000b70948?}, 0x6?, 0x1, 0xc00008f8e8?)
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:130 +0x1e5
github.com/dlclark/regexp2.(*Regexp).run(0xc0000f6200, 0xf4?, 0xffffffffffffffff, {0xc000b70948, 0x6, 0x6})
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:91 +0xfa
github.com/dlclark/regexp2.(*Regexp).MatchString(0x10f9c40?, {0x108f0f4?, 0xc00008fb48?})
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/regexp.go:213 +0x45
main.main()
        C:/Users/X/Desktop/GoRegExTests/test.go:127 +0xbdc

The error is only thrown when:

a. The message contains those unicode characters
b. The RegExp contains a space and a group like (house)

The RegExp above is just a very basic example to demonstrate this problem.

Improved the handling of named group references in ECMAScript mode.

I have made a few changes to support named group references according to the modern ECMAScript specification. The changes only affect ECMAScript mode except one: the invalid references now cause errors whereas previously they were ignored. I've checked and the new behavior seems to match perl and .NET online regex tester (http://regexstorm.net/tester).

Please consider merging.

Licensing and specific ATTRIB details

As part of an effort that includes packaging your library for Debian, I'm wondering if it would be possible to have more details or information about which particular files are covered by each original license?

In particular, could you provide some more details regarding these comments on ATTRIB:

Some of this code is ported from dotnet/corefx, which was released under this license: ...

Small pieces of code are copied from the Go framework under this license: ...

I am aware it might be a bit difficult to retrieve that history, but any insight would be much appreciated in the hopes of making sure licenses and copyright are attributed as faithfully as possible. Thanks in advance!

Problems with Negative Lookahead

re := regexp2.MustCompile(`(?m)^.*(?!/bin/bash)$`, 0)
match, _ := re.FindStringMatch(string(passwd))

I'm trying to take all the strings except the ones containing /bin/bash, but the result is just the first line of /etc/passwd that contains /bin/bash.

Continuous 4byte emoji would crash when ReplaceFunc()

Hello, it's been a long time.

Today I found an issue regarding some special "4byte" emojis on ReplaceFunc().

sample 4byte emojis: 📍😏️📣🍣🍺
sample 3byte emoji: ✔️⚾️

You can inspect the above with http://r12a.github.io/apps/conversion/.

Sample1: causes panic

Please take a look at the following: You can reproduce the issue by uncommenting the str assignment lines one by one.

As far as I checked, ReplaceFunc() panics under the following conditions:

the target contains some continuous 4byte emojis, and
the regex contains 3byte UTF-8 characters and NO 4byte emojis

package main

import (
	"github.com/dlclark/regexp2"
	"github.com/k0kubun/pp"
)

func main() {
	str := "高" // panic: Japanese Kanji
	// str := "は" // panic: Japanese Hiragana
	// str := "パ" // panic: Japanese Katakana
	// str := "[a-zA-Z0-9]{,2}" // works fine: ASCII character class
	// str := "峰起|烽起" // works fine: longer Japanese Kanji (I wonder why)
	// str := "フトレス" // panic: longer Japanese Katakana
	// str := "ALLWAYS|Allways|allways|AllWays" // works fine: Alphabet
	// str := "📍" // works fine: 4byte emoji
	// str := "📍📍" // works fine: continuous 4byte emoji
	// str := "✔️" // panic: 3byte emoji
	// str := "✔️✔️" // panic: continuous 3byte emoji
	// str := "📍️✔️" // works fine: 4 and 3byte emoji
	// str := "️✔📍️" // works fine: 3 and 4byte emoji
	// str := "📍️は️" // works fine: 4byte emoji and Hiragana
	// str := "️は📍️" // works fine: Hiragana and 4byte emoji

	re := regexp2.MustCompile(str, 0)
	result, _ := re.ReplaceFunc("📍✔️😏⚾️📣🍣🍺🍺 <- continuous 4byte emoji 寿司ビール文字あり", func(m regexp2.Match) string {
		return "࿗" + "࿘" + string(m.Capture.Runes()) + "࿌"
	}, -1, -1)

	pp.Println(result)
}

Sample2: all works fine

The following is a kind of control group that works fine. The key is that the target contains no "continuous 4byte emojis".

package main

import (
	"github.com/dlclark/regexp2"
	"github.com/k0kubun/pp"
)

func main() {
	// All of the following patterns work fine, perhaps because "✔✔⚾⚾️ <- 3byte emoji 寿司ビール文字なし" contains no continuous 4byte emojis. You can check them by uncommenting them one by one.
	str := "高"
	// str := "は"
	// str := "パ"
	// str := "[a-zA-Z0-9]{,2}"
	// str := "峰起|烽起"
	// str := "フトレス"
	// str := "ALLWAYS|Allways|allways|AllWays"
	// str := "📍" 
	// str := "📍📍" 
	// str := "✔️" 
	// str := "✔️✔️" 
	// str := "📍️✔️" 
	// str := "️✔📍️" 
	// str := "📍️は️" 
	// str := "️は📍️" 

	re := regexp2.MustCompile(str, 0)
	// The following target works fine: there are no continuous 4byte emojis
	result, _ := re.ReplaceFunc("✔✔⚾⚾️ <- 3byte emoji 寿司ビール文字なし", func(m regexp2.Match) string {
		return "࿗" + "࿘" + string(m.Capture.Runes()) + "࿌"
	}, -1, -1)

	pp.Println(result)
}

FYI

The issue looks a little bit similar to "sushi-beer" issue: https://gist.github.com/kamipo/37576ce436c564d8cc28

I hope you'd check and fix it.

Best regards, 🙇

bugs in scenarios of Chinese characters or incorrect using of match.Index

The following code fails:

package main

import (
	"fmt"
	"github.com/dlclark/regexp2"
)

func main()  {
	regex := regexp2.MustCompile("<style", regexp2.IgnoreCase|regexp2.Singleline)
	match, err := regex.FindStringMatch(sample)
	if err != nil {
		panic(err)
	}
	if match != nil {
		t, err := regex.Replace(sample, "xxx", match.Index, -1)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s", t)
	}
}

var sample = "<title>错<style"

If I search for some words or a regex successfully and then replace starting from match.Index instead of -1, the code fails.

However, if I remove the Chinese character 错, the code succeeds.

So, in this scenario, what should the starting index be if I want to replace all matches but don't want to start from -1 (the beginning)?

error parsing regexp: unrecognized grouping construct: (?-1

package parse

import (
	"fmt"
	"github.com/dlclark/regexp2"
	"testing"
)

func TestJsonRe2(t *testing.T) {
	text := `{
  "code" : "0",
  "message" : "success",
  "responseTime" : 2,
  "traceId" : "a469b12c7d7aaca5",
  "returnCode" : null,
  "result" : {
    "total" : 0,
    "list" : [ ]
}
}`
	reg := `/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/`
	r, err := regexp2.Compile(reg, regexp2.RE2|regexp2.Multiline|regexp2.ECMAScript)
	if err != nil {
		fmt.Println(err)
		return
	}

	matchedStrings, err := r.FindStringMatch(text)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(matchedStrings)
}

output:

error parsing regexp: unrecognized grouping construct: (?-1 in `/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/`

However, the same pattern works on https://regex101.com/.

fix: ecma ranges with set terminator

Fix ECMAScript un-escaped literal '-' when followed by predefined character sets.

Also:

Fixed missing error check on parseProperty() call.

Use addChar(ch) helper instead of addRange(ch, ch).

Fixes #54

ecmascript: cannot include class \s in character range

When compiling using regexp2.ECMAScript, the regexp [a-\s] fails with the following error, but it should pass:

error parsing regexp: cannot include class \115 in character range in `[a-\s]`

regex101 shows how it should be interpreted.

Is it possible to get the name of the currently matched group?

Say I have a regex to tokenize some language:

# in python.
regex = re.compile(
    "(?P<comment>#.*?$)|"
    "(?P<newline>\n)|"  # has to go ahead of the whitespace
    "(?P<comma>,)|"
    "(?P<double_quote_string>\".*?\")|"
    "(?P<single_quote_string>'.*?')|"
    "(?P<whitespace>[ \t\r\f\v]+)|"
    ... etc

Here you expect to get multiple matches for each group name when tokenizing a file and you want to keep the ordering of the tokens.

If I use the same approach using regexp2 can I go from match to group name? E.g. how do I get the last matched group name for a match? Is that possible?

Add support for Perl(PCRE) named and unnamed group capturing order

In other words maintain the order of capture groups. With the MaintainCaptureOrder regexp option.

I also added inline option o. It's useful if you only have access to the pattern, but not the regex. But it only is useful if used at the start of the pattern, I couldn't find a way to prevent it from being used elsewhere.

I've never liked nor have been good with bitwise operations. So I don't know if I should've picked another number.

Can regexp2 provide the same APIs adapt to std.regexp?

I wonder if regexp2 could provide the same API as the standard regexp package, so that I can switch my dependency between regexp2 and the standard regexp easily, changing only the expression text.
