Machine-readable regular expressions for identifying accession numbers for cultural heritage organizations in text.

San Francisco International Airport Museum

Last update: Jun 14, 2022

Comments: 13

accession-numbers

Important

This is work in progress. Things may still change.

Motivation

The goal of this package is to provide a collection of machine-readable regular expression patterns that applications can use to isolate accession numbers in arbitrary bodies of text. For example, these data might be used by the sfomuseum/ios-label-whisperer application.

Data

Data for individual organizations are defined in data/{organization}.json files. These files lack a well-defined schema at this time.

The simplest version of a data file consists of name and url properties identifying an organization, and a patterns property containing one or more dictionaries. Each dictionary holds a regular expression pattern that can be used to isolate accession numbers in a body of text. For example:

{
    "name": "National Museum of African American History and Culture",
    "url": "https://nmaahc.si.edu/",
    "patterns": [
	{
	    "name": "common",
	    "pattern": "((?:\\d{4})\\.(?:\\d+)(?:\\.\\d+){0,2})",
	    "tests": {
		"2013.68.19": 1,
		"2012.110": 1,
		"2016.5.2.11": 1,
		"2014.270.2": 1
	    }
	}
    ]
}

Patterns

Regular expression patterns should match the entire accession number, and any interior groups should be non-capturing, written as (?:...) as in the example above.
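In the NMAAHC example, the outer parentheses capture the whole accession number while the inner groups use the (?:...) form. The sketch below shows what that difference means to Go's regexp package. SubmatchCount is a hypothetical helper, and the second pattern is a variant written only for comparison.

```go
package main

import (
	"fmt"
	"regexp"
)

// SubmatchCount (hypothetical helper) reports how many elements
// FindStringSubmatch returns for s: one for the whole match plus one per
// capturing group, or 0 when the pattern does not match at all.
func SubmatchCount(pattern, s string) int {
	re := regexp.MustCompile(pattern)
	return len(re.FindStringSubmatch(s))
}

func main() {
	// Pattern from the NMAAHC example: one outer capturing group, with
	// interior groups written as (?:...) so that they do not capture.
	nonCapturing := `((?:\d{4})\.(?:\d+)(?:\.\d+){0,2})`
	// The same expression with capturing interior groups, for comparison.
	capturing := `((\d{4})\.(\d+)(\.\d+){0,2})`

	fmt.Println(SubmatchCount(nonCapturing, "2013.68.19")) // 2
	fmt.Println(SubmatchCount(capturing, "2013.68.19"))    // 5
}
```

With the non-capturing form, the only submatch besides the full match is the accession number itself, which keeps the result easy for calling code to interpret.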

Tests

Tests for any given pattern are defined as a dictionary whose keys are strings to match against the current pattern and whose values are the number of expected matches for the corresponding key.

Tests are run using the cmd/test-runner tool which is written in Go and uses the regexp.FindStringSubmatch method to find matches.
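To make the shape of the tests dictionary concrete, here is a minimal Go sketch of one way to check it. This is not the test-runner's code: CheckTests is a hypothetical helper, and it counts non-overlapping matches with regexp.FindAllString rather than using FindStringSubmatch, which is an assumption made for the sake of a simple example.

```go
package main

import (
	"fmt"
	"log"
	"regexp"
	"sort"
)

// CheckTests (hypothetical helper) compiles pattern and compares the number
// of non-overlapping matches in each test string with the expected count.
// It returns one message per failing test. The messages are sorted because
// Go randomizes map iteration order, so unsorted output would vary per run.
func CheckTests(pattern string, tests map[string]int) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var failures []string
	for input, expected := range tests {
		got := len(re.FindAllString(input, -1))
		if got != expected {
			failures = append(failures,
				fmt.Sprintf("%q: expected %d matches but got %d", input, expected, got))
		}
	}
	sort.Strings(failures)
	return failures, nil
}

func main() {
	// Pattern and tests taken from the NMAAHC example above.
	pattern := `((?:\d{4})\.(?:\d+)(?:\.\d+){0,2})`
	tests := map[string]int{
		"2013.68.19":  1,
		"2012.110":    1,
		"2016.5.2.11": 1,
		"2014.270.2":  1,
	}
	failures, err := CheckTests(pattern, tests)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(failures), "failing tests") // 0 failing tests
}
```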

Tests

This package comes with a command-line tool for running tests against some or all of the files in the data directory. The tool is called test-runner and its source code can be found in the cmd/test-runner folder. It has also been pre-compiled to run on Windows, Linux and Mac OS computers. These binary versions are kept in the bin/(YOUR-OS-HERE) folders. Possible values for (YOUR-OS-HERE) are:

Operating System	Value
Linux	linux
Mac OS	darwin
Windows	windows

To run the test-runner tool type the following from a terminal window:

$> bin/(YOUR-OS-HERE)/test-runner data/*.json

And you should see something like this:

$> ./bin/darwin/test-runner data/*.json
2021/11/14 15:38:21 All tests pass for Cooper Hewitt Smithsonian National Design Museum
2021/11/14 15:38:21 All tests pass for SFO Museum

If you have the make application installed on your computer you can also simply run the tests Makefile target. For example:

$> make tests
bin/darwin/test-runner data/*.json
2021/11/14 16:19:59 All tests pass for Art Institute of Chicago
2021/11/14 16:19:59 All tests pass for Cooper Hewitt Smithsonian National Design Museum
2021/11/14 16:19:59 All tests pass for Denver Museum of Nature & Science
2021/11/14 16:19:59 All tests pass for Getty Center
2021/11/14 16:19:59 All tests pass for Metropolitan Museum of Art
2021/11/14 16:19:59 All tests pass for Museum of Modern Art
2021/11/14 16:19:59 All tests pass for National Air and Space Museum
2021/11/14 16:19:59 All tests pass for National Gallery of Art
2021/11/14 16:19:59 All tests pass for National Museum of Anthropology
2021/11/14 16:19:59 All tests pass for National Museum of African American History and Culture
2021/11/14 16:19:59 All tests pass for National Museum of American History
2021/11/14 16:19:59 All tests pass for Smithsonian National Museum of Natural History
2021/11/14 16:19:59 All tests pass for SFO Museum

Help wanted

Contributions for missing organizations and corrections for existing patterns are welcome.

Contributors

Bruce Wyman

Comments

Update AIC pattern

> make tests
env GOOS=darwin GOARCH=amd64 go build -o bin/darwin/test-runner cmd/test-runner/main.go
env GOOS=linux GOARCH=amd64 go build -o bin/linux/test-runner cmd/test-runner/main.go
env GOOS=windows GOARCH=amd64 go build -o bin/windows/test-runner cmd/test-runner/main.go
bin/darwin/test-runner data/*.json

2021/11/16 11:15:54 Failed to run tests for Art Institute of Chicago, Failed to find matches for 'Obj: 96681' using '((?:[RXTNObjf0-9: ]+)\/?(?:[0-9\.\-,]*)|(?:\d{4})\.(?:\d+[a-z\-]*))' (Art Institute of Chicago), expected 1 matches but got 2 ([Obj: 96681])
make: *** [tests] Error 1

@nikhiltri Any thoughts on the best approach here?

Added V&A accession numbers patterns

We'd like to contribute the V&A accession number patterns to the project. We've executed the test-runner on the json and it has passed all the tests.

Add americanart.si.edu

Hi, I'm trying to add americanart.si.edu, but I'm having some issues getting the regex to pass the Linux test-runner. Confusingly, the test-runner says it fails on a different accession number each time it's run against the same pattern.

Add Museum of New Zealand Te Papa Tongarewa

Name: Museum of New Zealand Te Papa Tongarewa
URL: https://www.tepapa.govt.nz
Pattern: (([A-Z]+[0-9]+[/A-Z])|((?:\d+)-(?:\d+)-(?:\d++)(/[A-z\ -]+)))
Examples: ME001533 2014-0031-1 GH004839 2008-0006-1/A-L to L-L OL000491 SP082549/A

Add Musée National d'Art Moderne (Centre Pompidou)

Name: Musée National d'Art Moderne (Centre Pompidou)
URL: https://www.mam.paris.fr
Pattern: (([A-Z]+)(\ |-|-)(?:\d+)((-[0-9]+)|(\ ([0-9]+))|([A-Z]+))*)
Examples: AMPH 204 AMS 403 AMVP-2015-277 AMVP-2015-975 AME 1104 (2) AMPH 871TER

Add "index" file for data/*.json

Write a line-separated text file containing names (URLs) of *.json files with patterns in the data folder.

The idea is to provide a simple list of files to be downloaded for applications that want a local cache of data definitions.

Add National Gallery Singapore

Name: National Gallery Singapore
URL: https://www.nationalgallery.sg
Pattern: (((?:\d+)|([A-Z]+))-([A-Z0-9]+)(-[A-Z0-9.]+)*)
Examples: 2003-03691 ASB-0036 2001-03415 RC-S2-CCS2.6 2010-04133

Add Musée d'Orsay

Name: Musée d'Orsay
URL: https://www.musee-orsay.fr
Pattern: (([A-Z]+\ )(?:\d+)+((\ [0-9]+)|(,\ [A-z0-9\ ,]*)|(-[0-9]+)))
Examples: RF 1949 17 RF 676, LUX 91 RF 1980-195 RF 37305, Recto, RF 37305

Add British Museum

Name: British Museum
URL: https://www.britishmuseum.org
Pattern: (([A-Z]+)([0-9]+)|([A-z]+.)(?:\d+)|(([A-z]?)(?:\d+),(?:\d+(.[0-9])([-0-9])*)))
Examples: EA55614 1939,1010.1 2006,0503.1.1-20 EA24 Am.7184 Am1940,11.2

Add a tool to hash the contents of data/*.json

Create a tool that hashes the contents of all *.json files in the data directory and writes the value to a file.

The idea is to have a signal that applications can use to determine whether there are any changes to a local cache of data files that need to be downloaded.

Maybe install as a Git commit hook?

Account for language in "object_url" property

For example, the definition file for the Rijksmuseum assumes that the object_url property will point to the English-language version of their website:

"object_url": "https://www.rijksmuseum.nl/en/collection/{accession_number}",

Those kinds of assumptions shouldn't be... well, assumed. This might mean redefining the object_url property to be a dictionary along the lines of:

"object_url": { "en": "https://www.rijksmuseum.nl/en/collection/{accession_number}", "fr": "https://www.rijksmuseum.nl/fr/collection/{accession_number}" }

Or maybe just changing the software packages that expand URI templates to take a dictionary of values (to expand) rather than a single (accession_number) value.
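The second option, expanding a template from a dictionary of values, might look something like the following Go sketch. ExpandTemplate is a hypothetical helper, the {lang} placeholder is an assumption that does not appear in the current definition files, and the code performs plain string substitution rather than full URI Template expansion.

```go
package main

import (
	"fmt"
	"strings"
)

// ExpandTemplate (hypothetical helper) replaces each {key} placeholder in
// template with the matching entry in values. Placeholders without a
// matching key are left untouched. Values are inserted as-is, without any
// URL escaping.
func ExpandTemplate(template string, values map[string]string) string {
	pairs := make([]string, 0, len(values)*2)
	for k, v := range values {
		pairs = append(pairs, "{"+k+"}", v)
	}
	return strings.NewReplacer(pairs...).Replace(template)
}

func main() {
	// The {lang} placeholder is an assumed extension of the template.
	tmpl := "https://www.rijksmuseum.nl/{lang}/collection/{accession_number}"
	url := ExpandTemplate(tmpl, map[string]string{
		"lang":             "en",
		"accession_number": "SK-C-5", // sample value for illustration
	})
	fmt.Println(url) // https://www.rijksmuseum.nl/en/collection/SK-C-5
}
```

With this shape, the data files keep a single object_url string and the language choice moves to the calling application.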

The latter approach feels more-better to me, as I write this. TBD...

Placeholder: Fake accession number APIs

This is a placeholder issue for definitions and tools to create "fake accession number" API endpoints for cultural heritage organizations that do not allow individual object pages to be accessed (directly) using an accession number. It's not clear whether that work should be part of this repo.

For example:

https://www.nga.gov/collection/art-object-page.20483.html has accession number 1943.8.8285

And the data/objects.csv file in the NGA opendata release has the following header:

objectid,accessioned,accessionnum...

Which means a "fake accession number" API could map the accession number to the object ID and then either issue a redirect or be used to explicitly assign a value to an object_uri URI template. For example:

https://github.com/sfomuseum/accession-numbers/blob/main/schema/definition.schema.json#L28-L30

Add/determine Who's On First IDs

> git grep whosonfirst_id | grep -e '-1'
data/airandspace.si.edu.json:    "whosonfirst_id": -1,
data/americanhistory.si.edu.json:    "whosonfirst_id": -1,
data/artsmia.org.json:    "whosonfirst_id": -1,
data/chnmuseum.cn.json:    "whosonfirst_id": -1,
data/chrysler.org.json:    "whosonfirst_id": -1,
data/kanazawa21.jp.json:    "whosonfirst_id": -1,
data/mam.paris.fr.json:    "whosonfirst_id": -1,
data/musee-orsay.fr.json:    "whosonfirst_id": -1,
data/museivaticani.va.json:    "whosonfirst_id": -1,
data/museodelprado.es.json:    "whosonfirst_id": -1,
data/museoreinasofia.es.json:    "whosonfirst_id": -1,
data/museu.ms.json:    "whosonfirst_id": -1,
data/museum.go.kr.json:    "whosonfirst_id": -1,
data/nationalgallery.sg.json:    "whosonfirst_id": -1,
data/naturalhistory.si.edu.json:    "whosonfirst_id": -1,
data/nga.gov.json:    "whosonfirst_id": -1,
data/nmaahc.si.edu.json:    "whosonfirst_id": -1,
data/okeeffemuseum.org.json:    "whosonfirst_id": -1,
data/rusmuseum.ru.json:    "whosonfirst_id": -1,

Related: https://github.com/whosonfirst-data/whosonfirst-data/labels/glam

