Machine-readable regular expressions for identifying accession numbers for cultural heritage organizations in text.

accession-numbers

Important

This is work in progress. Things may still change.

Motivation

The goal of this package is to have a collection of machine-readable regular expression patterns that can be used by applications to isolate accession numbers in arbitrary bodies of text. For example, these data might be used by the sfomuseum/ios-label-whisperer application.

Data

Data for individual organizations are defined in data/{organization}.json files. These files lack a well-defined schema at this time.

The simplest version of a data file consists of name and url properties identifying an organization and a patterns property containing one or more dictionaries, each of which holds a regular expression pattern that can be used to isolate accession numbers in a body of text. For example:

{
    "name": "National Museum of African American History and Culture",
    "url": "https://nmaahc.si.edu/",
    "patterns": [
        {
            "name": "common",
            "pattern": "((?:\\d{4})\\.(?:\\d+)(?:\\.\\d+){0,2})",
            "tests": {
                "2013.68.19": 1,
                "2012.110": 1,
                "2016.5.2.11": 1,
                "2014.270.2": 1
            }
        }
    ]
}
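
As a rough illustration of how an application might consume a file like this, the sketch below decodes a trimmed copy of the example and collects every match in a piece of text. The Definition and Pattern types and the findAccessionNumbers helper are assumptions made for this example, not types provided by this package.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Pattern mirrors one entry in the "patterns" list of a data file.
type Pattern struct {
	Name    string         `json:"name"`
	Pattern string         `json:"pattern"`
	Tests   map[string]int `json:"tests"`
}

// Definition mirrors the top-level properties of a data file.
type Definition struct {
	Name     string    `json:"name"`
	URL      string    `json:"url"`
	Patterns []Pattern `json:"patterns"`
}

// findAccessionNumbers returns every substring of text matched by any
// pattern in the definition, in the order the patterns are listed.
func findAccessionNumbers(def Definition, text string) ([]string, error) {
	var found []string
	for _, p := range def.Patterns {
		re, err := regexp.Compile(p.Pattern)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", p.Name, err)
		}
		found = append(found, re.FindAllString(text, -1)...)
	}
	return found, nil
}

// raw is a trimmed copy of the NMAAHC example (tests omitted).
const raw = `{
  "name": "National Museum of African American History and Culture",
  "url": "https://nmaahc.si.edu/",
  "patterns": [
    {"name": "common", "pattern": "((?:\\d{4})\\.(?:\\d+)(?:\\.\\d+){0,2})"}
  ]
}`

func main() {
	var def Definition
	if err := json.Unmarshal([]byte(raw), &def); err != nil {
		panic(err)
	}
	matches, err := findAccessionNumbers(def, "Gift of a donor, 2013.68.19. See also 2012.110")
	if err != nil {
		panic(err)
	}
	fmt.Println(matches)
}
```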

Patterns

Regular expression patterns should match the entire accession number, and any interior matches should be non-greedy.
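
The NMAAHC example pattern wraps the whole accession number in a single capturing group and writes its interior groups as non-capturing (?:...) groups. As a rough illustration of what that grouping means for a tool that reads submatches, the sketch below compares it with a variant whose interior groups capture. The captures helper and the second pattern are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// captures returns the submatches for the first match of pattern in text:
// index 0 is the whole match, followed by one entry per capturing group.
func captures(pattern, text string) []string {
	return regexp.MustCompile(pattern).FindStringSubmatch(text)
}

func main() {
	// Interior groups written as (?:...) do not capture, so the only
	// group reported is the outer one wrapping the whole accession number.
	nonCapturing := `((?:\d{4})\.(?:\d+)(?:\.\d+){0,2})`
	fmt.Printf("%q\n", captures(nonCapturing, "No. 2014.270.2"))

	// Hypothetical variant: capturing interior groups also report the
	// fragments of the accession number as separate submatches.
	capturing := `((\d{4})\.(\d+)(\.\d+){0,2})`
	fmt.Printf("%q\n", captures(capturing, "No. 2014.270.2"))
}
```

With the first pattern a caller only ever sees the complete accession number; with the second it would have to know which submatch index to trust.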

Tests

Tests for any given pattern are defined as a dictionary whose keys are strings to match against (the current pattern) and whose values are the number of expected matches for the corresponding string (key).

Tests are run using the cmd/test-runner tool which is written in Go and uses the regexp.FindStringSubmatch method to find matches.
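
A minimal sketch of the idea, assuming a test passes when the number of matches found in a key equals its expected count. It counts matches with regexp.FindAllString, which is a simplification and not necessarily how cmd/test-runner performs its lookup; runTests is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// runTests checks each test string against pattern and returns a description
// of every string whose match count differs from the expected count.
func runTests(pattern string, tests map[string]int) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var failures []string
	for text, expected := range tests {
		got := len(re.FindAllString(text, -1))
		if got != expected {
			failures = append(failures, fmt.Sprintf("%q: expected %d matches but got %d", text, expected, got))
		}
	}
	// Go randomizes map iteration order, so sort for stable output.
	sort.Strings(failures)
	return failures, nil
}

func main() {
	tests := map[string]int{
		"2013.68.19":  1,
		"2012.110":    1,
		"2016.5.2.11": 1,
		"2014.270.2":  1,
	}
	failures, err := runTests(`((?:\d{4})\.(?:\d+)(?:\.\d+){0,2})`, tests)
	if err != nil {
		panic(err)
	}
	if len(failures) == 0 {
		fmt.Println("All tests pass")
		return
	}
	for _, f := range failures {
		fmt.Println(f)
	}
}
```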

Tests

This package comes with a command-line tool for running tests against some or all of the files in the data directory. The tool is called test-runner and its source code can be found in the cmd/test-runner folder. It has also been pre-compiled to run on Windows, Linux and Mac OS computers. These binary versions are kept in the bin/(YOUR-OS-HERE) folders. Possible values for (YOUR-OS-HERE) are:

Operating System    Value
Linux               linux
Mac OS              darwin
Windows             windows

To run the test-runner tool type the following from a terminal window:

$> bin/(YOUR-OS-HERE)/test-runner data/*.json

And you should see something like this:

$> ./bin/darwin/test-runner data/*.json
2021/11/14 15:38:21 All tests pass for Cooper Hewitt Smithsonian National Design Museum
2021/11/14 15:38:21 All tests pass for SFO Museum

If you have the make application installed on your computer you can also simply run the tests Makefile target. For example:

$> make tests
bin/darwin/test-runner data/*.json
2021/11/14 16:19:59 All tests pass for Art Institute of Chicago
2021/11/14 16:19:59 All tests pass for Cooper Hewitt Smithsonian National Design Museum
2021/11/14 16:19:59 All tests pass for Denver Museum of Nature & Science
2021/11/14 16:19:59 All tests pass for Getty Center
2021/11/14 16:19:59 All tests pass for Metropolitan Museum of Art
2021/11/14 16:19:59 All tests pass for Museum of Modern Art
2021/11/14 16:19:59 All tests pass for National Air and Space Museum
2021/11/14 16:19:59 All tests pass for National Gallery of Art
2021/11/14 16:19:59 All tests pass for National Museum of Anthropology
2021/11/14 16:19:59 All tests pass for National Museum of African American History and Culture
2021/11/14 16:19:59 All tests pass for National Museum of American History
2021/11/14 16:19:59 All tests pass for Smithsonian National Museum of Natural History
2021/11/14 16:19:59 All tests pass for SFO Museum

Help wanted

Contributions for missing organizations and corrections for existing patterns are welcome.

Owner
San Francisco International Airport Museum
Comments
  • Update AIC pattern

    Tests were recently updated to use a "tail-buffer"-based lookup to account for text with multiple accession numbers and arbitrary text (aka a "wall label"):

    • https://github.com/sfomuseum/accession-numbers/blob/main/cmd/test-runner/main.go#L98-L170

    For example:

    "This is an object\\nGift of Important Donor\\n302.2021.x1-x2\\n\\nThis is another object\\nAnonymouts Gift\\nPG731.2019 146.2020": 3
    

    This breaks the AIC test for Obj: 96681 (which has been given an expected count of -1 so as not to stop other tests):

    > make tests
    env GOOS=darwin GOARCH=amd64 go build -o bin/darwin/test-runner cmd/test-runner/main.go
    env GOOS=linux GOARCH=amd64 go build -o bin/linux/test-runner cmd/test-runner/main.go
    env GOOS=windows GOARCH=amd64 go build -o bin/windows/test-runner cmd/test-runner/main.go
    bin/darwin/test-runner data/*.json
    
    2021/11/16 11:15:54 Failed to run tests for Art Institute of Chicago, Failed to find matches for 'Obj: 96681' using '((?:[RXTNObjf0-9: ]+)\/?(?:[0-9\.\-,]*)|(?:\d{4})\.(?:\d+[a-z\-]*))' (Art Institute of Chicago), expected 1 matches but got 2 ([Obj: 96681])
    make: *** [tests] Error 1
    

    @nikhiltri Any thoughts on the best approach here?

  • Added V&A accession numbers patterns

    We'd like to contribute the V&A accession number patterns to the project. We've executed the test-runner on the json and it has passed all the tests.

  • Add americanart.si.edu

    Hi, I'm trying to add americanart.si.edu, but I'm having some issues getting the regex to pass the Linux test-runner. Confusingly, the test-runner says it fails on a different accession number each time it's run against the same pattern.

  • Add Museum of New Zealand Te Papa Tongarewa

    Museum of New Zealand Te Papa Tongarewa
    https://www.tepapa.govt.nz
    (([A-Z]+[0-9]+[/A-Z])|((?:\d+)-(?:\d+)-(?:\d++)(/[A-z\ -]+)))
    ME001533 2014-0031-1 GH004839 2008-0006-1/A-L to L-L OL000491 SP082549/A

  • Add Musée National d'Art Moderne (Centre Pompidou)

    Musée National d'Art Moderne (Centre Pompidou)
    https://www.mam.paris.fr
    (([A-Z]+)(\ |-|-)(?:\d+)((-[0-9]+)|(\ ([0-9]+))|([A-Z]+))*)
    AMPH 204 AMS 403 AMVP-2015-277 AMVP-2015-975 AME 1104 (2) AMPH 871TER

  • Add "index" file for data/*.json

    Write a line-separated text file containing names (URLs) of *.json files with patterns in the data folder.

    The idea is to provide a simple list of files to be downloaded for applications that want a local cache of data definitions.

  • Add National Gallery Singapore

    National Gallery Singapore
    https://www.nationalgallery.sg
    (((?:\d+)|([A-Z]+))-([A-Z0-9]+)(-[A-Z0-9.]+)*)
    2003-03691 ASB-0036 2001-03415 RC-S2-CCS2.6 2010-04133

  • Add Musée d'Orsay

    Musée d'Orsay
    https://www.musee-orsay.fr
    (([A-Z]+\ )(?:\d+)+((\ [0-9]+)|(,\ [A-z0-9\ ,]*)|(-[0-9]+)))
    RF 1949 17 RF 676, LUX 91 RF 1980-195 RF 37305, Recto, RF 37305

  • Add British Museum

    British Museum
    https://www.britishmuseum.org
    (([A-Z]+)([0-9]+)|([A-z]+.)(?:\d+)|(([A-z]?)(?:\d+),(?:\d+(.[0-9])([-0-9])*)))
    EA55614 1939,1010.1 2006,0503.1.1-20 EA24 Am.7184 Am1940,11.2

  • Add a tool to hash the contents of data/*.json

    Create a tool that hashes the contents of all *.json files in the data directory and writes the value to a file.

    The idea is to have a signal that applications can use to determine whether there are any changes to a local cache of data files that need to be downloaded.

    Maybe install as a Git commit hook?
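
One possible shape for such a tool, sketched under assumptions: SHA-256 as the hash, file names and contents both fed into a single digest, and the result printed rather than written to a file. The hashFiles helper is hypothetical and nothing like it exists in the repository yet.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// hashFiles returns a single SHA-256 digest over the names and contents
// of files, visiting names in sorted order so the result is deterministic.
func hashFiles(files map[string][]byte) string {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0}) // separator between a name and its contents
		h.Write(files[name])
		h.Write([]byte{0}) // separator between one file and the next
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	paths, err := filepath.Glob("data/*.json")
	if err != nil {
		panic(err)
	}
	files := make(map[string][]byte)
	for _, path := range paths {
		body, err := os.ReadFile(path)
		if err != nil {
			panic(err)
		}
		files[filepath.Base(path)] = body
	}
	// A real tool would write this value to a file instead of printing it.
	fmt.Println(hashFiles(files))
}
```

An application holding a cached copy of the data could compare its stored digest with the published one and re-download only when they differ.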

  • Account for language in "object_url" property

    For example, the definition file for the Rijksmuseum assumes that the object_url property will point to the English-language version of their website:

        "object_url": "https://www.rijksmuseum.nl/en/collection/{accession_number}",    
    

    Those kinds of assumptions shouldn't be... well, assumed. This might mean redefining the object_url property to be a dictionary along the lines of:

    "object_url": {
       "en": "https://www.rijksmuseum.nl/en/collection/{accession_number}",
       "fr": "https://www.rijksmuseum.nl/fr/collection/{accession_number}"
    }
    

    Or maybe just changing the software packages that expand URI templates to take a dictionary of values (to expand) rather than a single (accession_number) value.

    The latter approach feels more-better to me, as I write this. TBD...
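
A minimal sketch of that second approach, assuming templates keep the {name} placeholder style used above. The expand helper and the {language} variable are hypothetical, and this is plain string substitution rather than a full URI Template implementation (values are not escaped).

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches {name} variables made of lowercase letters and underscores.
var placeholder = regexp.MustCompile(`\{([a-z_]+)\}`)

// expand replaces each {name} placeholder in template with values[name].
// Placeholders with no corresponding value are left untouched.
func expand(template string, values map[string]string) string {
	return placeholder.ReplaceAllStringFunc(template, func(m string) string {
		if v, ok := values[m[1:len(m)-1]]; ok {
			return v
		}
		return m
	})
}

func main() {
	// The {language} variable is hypothetical; the definition quoted
	// above only uses {accession_number}.
	template := "https://www.rijksmuseum.nl/{language}/collection/{accession_number}"
	values := map[string]string{
		"language":         "en",
		"accession_number": "SK-C-5", // sample value for illustration
	}
	fmt.Println(expand(template, values))
}
```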

  • Placeholder: Fake accession number APIs

    This is a placeholder issue for definitions and tools to create "fake accession number" API endpoints for cultural heritage organizations that do not allow individual object pages to be accessed (directly) using an accession number. It's not clear whether that work should be part of this repo.

    For example:

    • https://www.nga.gov/collection/art-object-page.20483.html has accession number 1943.8.8285

    And the data/objects.csv file in the NGA opendata release has the following header:

    objectid,accessioned,accessionnum...
    

    Which means a "fake accession number" API could map the accession number to the object ID and then either issue a redirect or be used to explicitly assign a value to an object_uri URI template. For example:

    • https://github.com/sfomuseum/accession-numbers/blob/main/schema/definition.schema.json#L28-L30
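
The lookup half of that idea could be sketched as follows, assuming a CSV whose header includes the objectid and accessionnum columns. The buildIndex helper is hypothetical, the sample uses only the three column names shown above (the real header is longer), and the "accessioned" value in the sample row is a made-up placeholder.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// buildIndex reads CSV text whose header includes "objectid" and
// "accessionnum" columns and returns a map of accession number to object ID.
func buildIndex(csvText string) (map[string]string, error) {
	rows, err := csv.NewReader(strings.NewReader(csvText)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, fmt.Errorf("empty CSV")
	}
	// Locate the two columns by name rather than by position.
	idCol, numCol := -1, -1
	for i, name := range rows[0] {
		switch name {
		case "objectid":
			idCol = i
		case "accessionnum":
			numCol = i
		}
	}
	if idCol < 0 || numCol < 0 {
		return nil, fmt.Errorf("missing objectid or accessionnum column")
	}
	index := make(map[string]string)
	for _, row := range rows[1:] {
		index[row[numCol]] = row[idCol]
	}
	return index, nil
}

func main() {
	// The object ID and accession number come from the NGA example above;
	// the "accessioned" value is a placeholder.
	sample := "objectid,accessioned,accessionnum\n20483,1,1943.8.8285\n"
	index, err := buildIndex(sample)
	if err != nil {
		panic(err)
	}
	id := index["1943.8.8285"]
	fmt.Printf("https://www.nga.gov/collection/art-object-page.%s.html\n", id)
}
```

A redirecting endpoint would then only need to look up the requested accession number in the index and send the caller to the resulting object page.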
  • Add/determine Who's On First IDs

    > git grep whosonfirst_id | grep -e '-1'
    data/airandspace.si.edu.json:    "whosonfirst_id": -1,
    data/americanhistory.si.edu.json:    "whosonfirst_id": -1,
    data/artsmia.org.json:    "whosonfirst_id": -1,
    data/chnmuseum.cn.json:    "whosonfirst_id": -1,
    data/chrysler.org.json:    "whosonfirst_id": -1,
    data/kanazawa21.jp.json:    "whosonfirst_id": -1,
    data/mam.paris.fr.json:    "whosonfirst_id": -1,
    data/musee-orsay.fr.json:    "whosonfirst_id": -1,
    data/museivaticani.va.json:    "whosonfirst_id": -1,
    data/museodelprado.es.json:    "whosonfirst_id": -1,
    data/museoreinasofia.es.json:    "whosonfirst_id": -1,
    data/museu.ms.json:    "whosonfirst_id": -1,
    data/museum.go.kr.json:    "whosonfirst_id": -1,
    data/nationalgallery.sg.json:    "whosonfirst_id": -1,
    data/naturalhistory.si.edu.json:    "whosonfirst_id": -1,
    data/nga.gov.json:    "whosonfirst_id": -1,
    data/nmaahc.si.edu.json:    "whosonfirst_id": -1,
    data/okeeffemuseum.org.json:    "whosonfirst_id": -1,
    data/rusmuseum.ru.json:    "whosonfirst_id": -1,
    

    Related: https://github.com/whosonfirst-data/whosonfirst-data/labels/glam
