Trigrams
Generates random text based on trigrams generated from input text
Contents
Building
go build
Running
./trigrams
Using
Train the server with some text by doing:
curl -X POST -d "this is some text" http://localhost:8080/learn
Text files can be sent as follows (make sure to use --data-binary to account for new lines):
curl -X POST --data-binary @pride-prejudice.txt http://localhost:8080/learn
Generate a random string of text by running:
curl -X GET http://localhost:8080/generate
Output:
claimed towards Mr. Darcy had never seen a collection of people in this manner; and as a rector, made him altogether a mixture of pride and impertinence; she had as good a chance of happiness as if the second, I can admire you much better finish his letter. When that business was over, he applied to Miss Grantley’s.” “Will you give me leave to apologise for it, as well as her mother should be in danger of hating each other for the other. The master of the impertinent. She mentioned this to her notice. Mrs. Phillips was quite disconcerted. She
Implementation notes
The application is controlled by a series of consts, though these could easily be replaced with command line flags.
NGram size
Although the challenge focuses on trigrams, there are actually a number of different ngram sizes that can be processed; unigrams. bigrams, trigrams etc
The application can be developed with a more general use case; specifically, making the gram size variable. Hence, we'll develop an ngram parsing application, configured to parse trigrams.
Maximum word count
The possibility of generating unigrams highlights the need for a maximum word count; we don't want to fall into an infinite loop when building random strings. Additionally, it cannot be ruled out that we won't fall into an infinite loop when parsing larger grams, so maximum word count will be controlled by a variable.
Punctuation stripping
Input text will typically be composed of alphanumeric characters, and any number of punctuation characters. I'm unsure if this task wants to include or omit such punctuation from the ngrams; hence, stripping punctuation from strings will be controlled by a variable also.
Weighted random selection
Because we want the random word selection to be reflective of the frequency with which that word occurs in the source text, we need an efficient way of weighting the random selection. A naive approach would be to create a new array of strings, duplicating each entry a number of times equal to that word's frequency, but this would be horribly inefficient. Instead, if we calculate the total frequency count (F) of all words in the source text, then generate a random number (R) between 1 and (F), and subtract each word's frequency from (R) at random, then test for the current value of (R) being less than or equal to 0, we'll have selected a random word based on that word's frequency. For example, given:
- "dog" : 5
- "cat" : 1
- "bird" : 3
- "snake": 1
- "horse": 2
The total frequency is 5 + 1 + 3 + 1 + 2 = 12. Generating a random number between 1 and 12 might give us R = 7. Randomly deleting each frequency from (R) until (R) is <= 0 might give us:
7 - (cat : 1) = 6 <= 0 : NO 6 - (bird : 3) = 3 <= 0 : NO 3 - (dog : 5) = -2 <= 0 : YES dog is selected
Or, another example where R = 5:
5 - (dog : 5) = 0 <= 0 : YES dog is selected
Endpoint considerations
A naive approach with endpoints is to call a time-consuming function asynchronously and respond immediately with an OK status code.
The problem with this approach is that flooding this endpoint with a large number of calls will cause the application to crash.
Instead of unbounded async go function calls which will quickly exhaust resources, we need to use channels to limit access to resources
We have two queues:
- learn queue
- generation queue
Both queues accept tasks; the learn queue accepts learn tasks, and the generation queue accepts generation tasks
Next, we launch dispatchers
The learn dispatcher creates a pool of learn workers, and starts them. Each worker runs, and registers itself with the dispatch pool, then waits to receive a learning task.
The dispatcher then listens to the learn queue, waiting for new learn tasks. When a new task is received, a worker is retrieved from the worker pool, and the task is passed to that worker to be processed.
Once a task is received by the worker, the task is processed and the input text is parsed into tokens, then n-grams are gathered and added to the gram collection.
The same happens for the generation queue for generate tasks.
ioutil.ReadAll() vs streaming requests
Initially, the /learn
endpoint read the entire request body into memory for processing. While this worked well for small requests, copying very large requests into memory at once might cause the application to crash. An alternative solution is to stream the request body, rather than copy the entire request body into memory at once. To test the new implementation, five concurrent requests were made to the /learn
endpoint, with enwiki8 (approx. 95MB) as the request body. In the case of the ioutil.ReadAll()
implementation, memory usage of the application almost immediately spiked to over 3GB. In the streaming implementation, under the same use case, memory usage climbed only to ~30MB in seconds, and ~60MB in minutes:
ioutil.ReadAll()
Streaming
Testing
Running unit tests
Run unit tests with:
go test ./...
Race detection
Build the program with the -race
flag:
go build -race
Running the application:
./trigrams
And then passing some data to the /learn
endpoint:
curl -X POST --data-binary @pride-prejudice.txt http://localhost:8080/learn
Followed by a few hundred /generate
calls:
hey -n 200 -c 10 http://localhost:8080/generate
Does not report any race conditions.
Buffer read size vs learn request time
In order to identify the optimum size (in bytes) of a buffer for reading the request body, pride-prejudice.txt
was submitted to the /learn
endpoint, each time with a different ReadSize
specified as the length of the buffer for reading the request body. Over the following series of tests, a buffer size of 64 bytes was identified as achieving the best request read performance:
read size (bytes) | request time |
---|---|
1024 | 1m2.536s |
1 | 1m16.315s |
2048 | 1m7.439s |
512 | 0m57.830s |
256 | 0m55.934s |
128 | 0m54.770s |
64 | 0m54.367s |
32 | 0m56.060s |
48 | 0m55.345s |
56 | 0m54.730s |
60 | 0m54.509s |
62 | 0m54.741s |
63 | 0m54.724s |