In Mimir, we're occasionally seeing "empty ring" errors right after a process starts up (e.g. the querier). The issue started after the migration to memberlist.
Possible root cause
I think the issue is caused by the ring client implementation not guaranteeing that it has received the initial ring state before switching to the Running state. Below I share some thoughts about the code.
The ring client service is expected to switch to the Running state only after it has initialized its internal state with the ring data structure. This is why it calls r.KVClient.Get() in Ring.starting():
https://github.com/grafana/dskit/blob/e441b77be7780e03f2c37659839bfe90dfde7dd3/ring/ring.go#L252-L256
When using Consul or etcd as the backend, r.KVClient.Get() is guaranteed to return the state of the ring, but I think this guarantee has been lost in the memberlist implementation, which could return a zero-value data structure.
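To illustrate why a zero value from Get() matters, here is a minimal sketch. It is not the dskit code: ringDesc, initRing and pickInstance are hypothetical names, and the sketch assumes that a missing value is treated as a ring with no instances, so that any later lookup fails with "empty ring".

```go
package main

import (
	"errors"
	"fmt"
)

// ringDesc is a hypothetical, simplified ring descriptor (instance ID -> address).
type ringDesc struct {
	Instances map[string]string
}

var errEmptyRing = errors.New("empty ring")

// initRing models the startup step: if the KV store returned nothing,
// the client is left with a ring that has no instances (assumption).
func initRing(value interface{}) *ringDesc {
	if d, ok := value.(*ringDesc); ok && d != nil {
		return d
	}
	return &ringDesc{Instances: map[string]string{}}
}

// pickInstance returns any instance from the ring, or errEmptyRing if there is none.
func pickInstance(d *ringDesc) (string, error) {
	for id := range d.Instances {
		return id, nil
	}
	return "", errEmptyRing
}

func main() {
	// The KV store returned a zero value: lookups fail until the ring is updated.
	_, err := pickInstance(initRing(nil))
	fmt.Println("lookup on zero-value ring:", err)

	// The KV store returned a populated ring: lookups succeed.
	populated := &ringDesc{Instances: map[string]string{"ingester-1": "10.0.0.1:9095"}}
	id, err := pickInstance(initRing(populated))
	fmt.Println("lookup on populated ring:", id, err)
}
```

With Consul or etcd the first case should not occur at startup if the ring already exists, which would be consistent with the error only appearing after the migration.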
The memberlist client Get() is implemented here:
https://github.com/grafana/dskit/blob/e441b77be7780e03f2c37659839bfe90dfde7dd3/kv/memberlist/memberlist_client.go#L63-L70
It waits until the backend KV client is running. But does waiting for it to be running guarantee that the ring data structure has been populated by then? I don't think so.
The memberlist KV.starting() only initialises memberlist; it doesn't join the cluster:
https://github.com/grafana/dskit/blob/e441b77be7780e03f2c37659839bfe90dfde7dd3/kv/memberlist/memberlist_client.go#L426-L453
The memberlist cluster is joined only in KV.running(), but that's too late, because at that point our code assumes the ring data structure is already populated:
https://github.com/grafana/dskit/blob/e441b77be7780e03f2c37659839bfe90dfde7dd3/kv/memberlist/memberlist_client.go#L457-L472
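The suspected ordering problem can be modelled with a small, self-contained sketch. This is a toy model under stated assumptions, not the dskit implementation: toyKV, markRunning, join and simulate are hypothetical names, and the join is applied synchronously so the example stays deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// toyKV is a hypothetical stand-in for a memberlist-backed KV store.
type toyKV struct {
	mu      sync.Mutex
	running chan struct{} // closed when the service reaches Running
	store   map[string][]string
}

func newToyKV() *toyKV {
	return &toyKV{running: make(chan struct{}), store: map[string][]string{}}
}

// starting models initialisation only: no cluster join, no remote state.
func (kv *toyKV) starting() {}

// markRunning models the switch to Running, which happens before the join.
func (kv *toyKV) markRunning() { close(kv.running) }

// join models joining the cluster and receiving state from peers.
func (kv *toyKV) join(peerState map[string][]string) {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	for k, v := range peerState {
		kv.store[k] = v
	}
}

// Get blocks until Running, then returns whatever is stored locally.
// Being Running says nothing about whether the join has completed.
func (kv *toyKV) Get(key string) []string {
	<-kv.running
	kv.mu.Lock()
	defer kv.mu.Unlock()
	return kv.store[key]
}

// simulate returns the number of ring instances seen before and after the join.
func simulate() (before, after int) {
	kv := newToyKV()
	kv.starting()
	kv.markRunning()
	before = len(kv.Get("ring")) // what the ring client would see at startup
	kv.join(map[string][]string{"ring": {"ingester-1", "ingester-2"}})
	after = len(kv.Get("ring"))
	return before, after
}

func main() {
	before, after := simulate()
	fmt.Printf("instances seen before join: %d, after join: %d\n", before, after)
}
```

In this model Get() unblocks as soon as the store is Running and returns nothing, because the state only arrives with the join. If the real client behaves similarly, a ring client that starts quickly enough would initialise itself from an empty value.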