Trigrams

Generates random text based on trigrams generated from input text

Building
Running
Using
Implementation notes
Testing

Building

go build

Running

./trigrams

Using

Train the server with some text by doing:

curl -X POST -d "this is some text" http://localhost:8080/learn

Text files can be sent as follows (make sure to use --data-binary to account for new lines):

curl -X POST --data-binary @pride-prejudice.txt http://localhost:8080/learn

Generate a random string of text by running:

curl -X GET http://localhost:8080/generate

Output:

claimed towards Mr. Darcy had never seen a collection of people in this manner; and as a rector, made him altogether a mixture of pride and impertinence; she had as good a chance of happiness as if the second, I can admire you much better finish his letter. When that business was over, he applied to Miss Grantley’s.” “Will you give me leave to apologise for it, as well as her mother should be in danger of hating each other for the other. The master of the impertinent. She mentioned this to her notice. Mrs. Phillips was quite disconcerted. She

Implementation notes

The application is controlled by a series of consts, though these could easily be replaced with command line flags.

NGram size

Although the challenge focuses on trigrams, there are actually a number of different ngram sizes that can be processed; unigrams. bigrams, trigrams etc

The application can be developed with a more general use case; specifically, making the gram size variable. Hence, we'll develop an ngram parsing application, configured to parse trigrams.

Maximum word count

The possibility of generating unigrams highlights the need for a maximum word count; we don't want to fall into an infinite loop when building random strings. Additionally, it cannot be ruled out that we won't fall into an infinite loop when parsing larger grams, so maximum word count will be controlled by a variable.

Punctuation stripping

Input text will typically be composed of alphanumeric characters, and any number of punctuation characters. I'm unsure if this task wants to include or omit such punctuation from the ngrams; hence, stripping punctuation from strings will be controlled by a variable also.

Weighted random selection

Because we want the random word selection to be reflective of the frequency with which that word occurs in the source text, we need an efficient way of weighting the random selection. A naive approach would be to create a new array of strings, duplicating each entry a number of times equal to that word's frequency, but this would be horribly inefficient. Instead, if we calculate the total frequency count (F) of all words in the source text, then generate a random number (R) between 1 and (F), and subtract each word's frequency from (R) at random, then test for the current value of (R) being less than or equal to 0, we'll have selected a random word based on that word's frequency. For example, given:

"dog" : 5
"cat" : 1
"bird" : 3
"snake": 1
"horse": 2

The total frequency is 5 + 1 + 3 + 1 + 2 = 12. Generating a random number between 1 and 12 might give us R = 7. Randomly deleting each frequency from (R) until (R) is <= 0 might give us:

7 - (cat : 1) = 6 <= 0 : NO 6 - (bird : 3) = 3 <= 0 : NO 3 - (dog : 5) = -2 <= 0 : YES dog is selected

Or, another example where R = 5:

5 - (dog : 5) = 0 <= 0 : YES dog is selected

Endpoint considerations

A naive approach with endpoints is to call a time-consuming function asynchronously and respond immediately with an OK status code.

The problem with this approach is that flooding this endpoint with a large number of calls will cause the application to crash.

Instead of unbounded async go function calls which will quickly exhaust resources, we need to use channels to limit access to resources

We have two queues:

learn queue
generation queue

Both queues accept tasks; the learn queue accepts learn tasks, and the generation queue accepts generation tasks

Next, we launch dispatchers

The learn dispatcher creates a pool of learn workers, and starts them. Each worker runs, and registers itself with the dispatch pool, then waits to receive a learning task.

The dispatcher then listens to the learn queue, waiting for new learn tasks. When a new task is received, a worker is retrieved from the worker pool, and the task is passed to that worker to be processed.

Once a task is received by the worker, the task is processed and the input text is parsed into tokens, then n-grams are gathered and added to the gram collection.

The same happens for the generation queue for generate tasks.

ioutil.ReadAll() vs streaming requests

Initially, the /learn endpoint read the entire request body into memory for processing. While this worked well for small requests, copying very large requests into memory at once might cause the application to crash. An alternative solution is to stream the request body, rather than copy the entire request body into memory at once. To test the new implementation, five concurrent requests were made to the /learn endpoint, with enwiki8 (approx. 95MB) as the request body. In the case of the ioutil.ReadAll() implementation, memory usage of the application almost immediately spiked to over 3GB. In the streaming implementation, under the same use case, memory usage climbed only to ~30MB in seconds, and ~60MB in minutes:

ioutil.ReadAll()

Streaming

Testing

Running unit tests

Run unit tests with:

go test ./...

Race detection

Build the program with the -race flag:

go build -race

Running the application:

./trigrams

And then passing some data to the /learn endpoint:

curl -X POST --data-binary @pride-prejudice.txt http://localhost:8080/learn

Followed by a few hundred /generate calls:

hey -n 200 -c 10 http://localhost:8080/generate

Does not report any race conditions.

Buffer read size vs learn request time

In order to identify the optimum size (in bytes) of a buffer for reading the request body, pride-prejudice.txt was submitted to the /learn endpoint, each time with a different ReadSize specified as the length of the buffer for reading the request body. Over the following series of tests, a buffer size of 64 bytes was identified as achieving the best request read performance:

read size (bytes)	request time
1024	1m2.536s
1	1m16.315s
2048	1m7.439s
512	0m57.830s
256	0m55.934s
128	0m54.770s
64	0m54.367s
32	0m56.060s
48	0m55.345s
56	0m54.730s
60	0m54.509s
62	0m54.741s
63	0m54.724s

Generate some random data

fake-data-generator-api generate some random data installing and using

Dec 2, 2022

GoApiRandom - Api to get random numbers

Jan 18, 2022

Print random bytes from a secure source to stdout.

Feb 11, 2022

rxscan provides functionality to scan text to variables using regular expression capture group.

rxscan rxscan provides functionality to scan text to variables using regular expression capture group. This library is still experimental, use at your

Dec 21, 2020

Go library that provides fuzzy string matching optimized for filenames and code symbols in the style of Sublime Text, VSCode, IntelliJ IDEA et al.

Go library that provides fuzzy string matching optimized for filenames and code symbols in the style of Sublime Text, VSCode, IntelliJ IDEA et al. This library is external dependency-free. It only depends on the Go standard library.

Dec 27, 2022

Generates random text based on trigrams generated from input text

Trigrams

Contents

Building

Running

Using

Implementation notes

NGram size

Maximum word count

Punctuation stripping

Weighted random selection

Endpoint considerations

ioutil.ReadAll() vs streaming requests

ioutil.ReadAll()

Streaming

Testing

Running unit tests

Race detection

Buffer read size vs learn request time

Owner

Similar Resources

Related tags

Generates a random alphanumeric string of a given length.

Fast, scalable pseudo random number generator based on xxh3

Daypaper sets your GNOME wallpaper based on the time of day from a random and relevant Unsplash image.

Dynamically generated Last.fm stats for your profile readme

Code generator that generates boilerplate code for a go http server

A protoc plugin that generates fieldmask paths as static type properties for proto messages

randstr is a module that contains functions for generating random strings.

✔️ Get random data for your app from a third-party source.

Just some random matchers

generate random data like name, email, uuid, address, images and etc.