We saw a case in a performance run where FireFly Core skipped a batch of 3 events from EVMConnect.
EVMConnect believed they had been received and acknowledged, but FF Core never actually processed them.
The logs showed that FireFly Core had ended up sending acknowledgements that FFTM/EVMConnect interpreted as being for the batch after the one that had just been processed.
This was combined with a disconnect/reconnect of the websocket, during which a batch was dropped. After the reconnect, EVMConnect didn't re-send the missed batch, because it believed that batch had already been acknowledged.
The problem is the "acknowledge last thing you sent" protocol, combined with the fact that the WS code doesn't close the Go channels across reconnects (deliberately, to support multiple WS connections/reconnects). This leaves a window where an ack sent before a reconnect can be misinterpreted after it. The code attempts to make this window very small by clearing out the Go channel that passes ack payloads to the batch dispatcher before delivering the batch. However, that does not eliminate the window completely.
So this PR makes a breaking change to the protocol, so that each ack is specific about the batch being ack'd.
FF Core and the Tokens connectors will need to be updated:
- To handle either the old EthConnect flat array payloads, or the new style FFTM/EVMConnect payloads with an object containing `batchNumber` and `events`
- Include the `batchNumber` in the corresponding ack
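The receiving side of these two changes could be sketched as follows. Only `batchNumber` and `events` come from the description above. The `type: "ack"` field, the struct names and the helper functions are assumptions made for illustration, so treat this as a sketch, not the actual FF Core or Tokens connector code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// errNotABatch marks an object payload without batchNumber,
// such as a reply that needs no acknowledgement.
var errNotABatch = errors.New("payload is not an event batch")

// eventBatch is a hypothetical in-memory form of one delivery.
type eventBatch struct {
	BatchNumber    int64
	HasBatchNumber bool // false for old-style flat array payloads
	Events         []json.RawMessage
}

// ack is a hypothetical ack message; the "type" field is an assumption.
type ack struct {
	Type        string `json:"type"`
	BatchNumber *int64 `json:"batchNumber,omitempty"`
}

// parseBatch accepts either a flat JSON array of events (old style) or an
// object containing batchNumber and events (new style).
func parseBatch(payload []byte) (*eventBatch, error) {
	trimmed := bytes.TrimSpace(payload)
	if len(trimmed) == 0 {
		return nil, errors.New("empty payload")
	}
	switch trimmed[0] {
	case '[':
		var events []json.RawMessage
		if err := json.Unmarshal(trimmed, &events); err != nil {
			return nil, err
		}
		return &eventBatch{Events: events}, nil
	case '{':
		var w struct {
			BatchNumber *int64            `json:"batchNumber"`
			Events      []json.RawMessage `json:"events"`
		}
		if err := json.Unmarshal(trimmed, &w); err != nil {
			return nil, err
		}
		if w.BatchNumber == nil {
			return nil, errNotABatch
		}
		return &eventBatch{BatchNumber: *w.BatchNumber, HasBatchNumber: true, Events: w.Events}, nil
	default:
		return nil, fmt.Errorf("unexpected payload start %q", trimmed[0])
	}
}

// buildAck echoes the batch number back when the batch carried one.
func buildAck(b *eventBatch) ([]byte, error) {
	a := ack{Type: "ack"}
	if b.HasBatchNumber {
		n := b.BatchNumber
		a.BatchNumber = &n
	}
	return json.Marshal(a)
}
```

Using a pointer for `batchNumber` on the wire struct distinguishes "field absent" from "batch number zero", which keeps object payloads that are not event batches from being acknowledged by mistake.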
Note that reply payloads, which do not require acknowledgement, come over the same pipe. This is an area that could be made more consistent overall in the future.