**Is this a bug report or feature request?**

Bug Report
**Deviation from expected behavior:**

After a host reboot, the configmap `rook-ceph-csi-config` disappeared and all of the ceph-csi pods were stuck in the `ContainerCreating` state.
**Expected behavior:**

The configmap `rook-ceph-csi-config` should not be deleted in this situation, or the operator should re-create it if it is not found.
**How to reproduce it (minimal and precise):**

Run `reboot` on one of the hosts in the cluster. The issue does not surface on every reboot, but it occurred several times in our testing. Below is the failure condition.
```
# knc get pods | egrep -v "Run|Com"
NAME                                            READY   STATUS              RESTARTS   AGE
csi-cephfsplugin-6cb75                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-9whpq                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-bpn88                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-gd6kk                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-hbjkj                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-jt48j                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-mlj6w                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-provisioner-67cdf965c6-764bx   0/5     ContainerCreating   0          15h
csi-cephfsplugin-provisioner-67cdf965c6-pq4wm   0/5     ContainerCreating   0          15h
csi-cephfsplugin-rx599                          0/3     ContainerCreating   0          15h
csi-rbdplugin-9v8kb                             0/3     ContainerCreating   0          15h
csi-rbdplugin-bccpt                             0/3     ContainerCreating   0          15h
csi-rbdplugin-bqlpc                             0/3     ContainerCreating   0          15h
csi-rbdplugin-f2fb9                             0/3     ContainerCreating   0          15h
csi-rbdplugin-h8hbc                             0/3     ContainerCreating   0          15h
csi-rbdplugin-l2wbz                             0/3     ContainerCreating   0          15h
csi-rbdplugin-njtt7                             0/3     ContainerCreating   0          15h
csi-rbdplugin-provisioner-78d6f54775-dq47m      0/6     ContainerCreating   0          15h
csi-rbdplugin-provisioner-78d6f54775-hc9nb      0/6     ContainerCreating   0          15h
csi-rbdplugin-tfn52                             0/3     ContainerCreating   0          15h
rook-ceph-detect-version-kpc2p                  0/1     Init:0/1            0          1s
```
```
# knc describe pod csi-rbdplugin-provisioner-78d6f54775-hc9nb
...
Events:
  Type     Reason       Age                  From                                         Message
  ----     ------       ----                 ----                                         -------
  Warning  FailedMount  40m (x57 over 14h)   kubelet, tesla-cb0434-csd1-csd1-control-03   Unable to attach or mount volumes: unmounted volumes=[ceph-csi-config], unattached volumes=[rook-csi-rbd-provisioner-sa-token-q4kbn host-dev host-sys lib-modules ceph-csi-config keys-tmp-dir socket-dir]: timed out waiting for the condition
  Warning  FailedMount  36m (x58 over 14h)   kubelet, tesla-cb0434-csd1-csd1-control-03   Unable to attach or mount volumes: unmounted volumes=[ceph-csi-config], unattached volumes=[socket-dir rook-csi-rbd-provisioner-sa-token-q4kbn host-dev host-sys lib-modules ceph-csi-config keys-tmp-dir]: timed out waiting for the condition
  Warning  FailedMount  15m (x62 over 15h)   kubelet, tesla-cb0434-csd1-csd1-control-03   Unable to attach or mount volumes: unmounted volumes=[ceph-csi-config], unattached volumes=[keys-tmp-dir socket-dir rook-csi-rbd-provisioner-sa-token-q4kbn host-dev host-sys lib-modules ceph-csi-config]: timed out waiting for the condition
  Warning  FailedMount  67s (x453 over 15h)  kubelet, tesla-cb0434-csd1-csd1-control-03   MountVolume.SetUp failed for volume "ceph-csi-config" : configmap "rook-ceph-csi-config" not found
```
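The events point at the missing configmap rather than at the CSI pods themselves. Querying it directly confirms it is gone; a minimal check, assuming `knc` above is an alias for `kubectl -n rook-ceph`:

```
# Query the configmap the kubelet is failing to mount; on the broken
# cluster this returns NotFound instead of the configmap
kubectl -n rook-ceph get configmap rook-ceph-csi-config
# Error from server (NotFound): configmaps "rook-ceph-csi-config" not found
```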
```
# cephstatus
  cluster:
    id:     79580ff1-adf9-4d6a-a4c6-9dc44fe784c5
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,c,d (age 15h)
    mgr: a(active, since 15h)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 15h), 6 in (since 15h)
    rgw: 1 daemon active (rook.ceph.store.a)

  task status:
    scrub status:
        mds.myfs-a: idle
        mds.myfs-b: idle

  data:
    pools:   10 pools, 208 pgs
    objects: 295 objects, 22 MiB
    usage:   6.9 GiB used, 53 GiB / 60 GiB avail
    pgs:     208 active+clean

  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
```
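Note that Ceph itself is healthy; only the CSI configmap is missing. Listing the configmaps in the Rook namespace shows which ones survived the reboot (a sketch; which other configmaps appear, e.g. `rook-ceph-mon-endpoints`, depends on the deployment):

```
# List the configmaps in the Rook namespace; rook-ceph-csi-config is
# absent while others such as rook-ceph-mon-endpoints are still present
kubectl -n rook-ceph get configmaps
```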
**File(s) to submit:**

- Cluster CR (custom resource), typically called `cluster.yaml`, if necessary
- Operator's logs, if necessary:
```
2020-08-25 04:05:57.736755 I | rookcmd: starting Rook v1.3.9 with arguments '/usr/local/bin/rook ceph operator'
2020-08-25 04:05:57.737076 I | rookcmd: flag values: --add_dir_header=false, --alsologtostderr=false, --csi-cephfs-plugin-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin.yaml, --csi-cephfs-provisioner-dep-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin-provisioner-dep.yaml, --csi-cephfs-provisioner-sts-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin-provisioner-sts.yaml, --csi-rbd-plugin-template-path=/etc/ceph-csi/rbd/csi-rbdplugin.yaml, --csi-rbd-provisioner-dep-template-path=/etc/ceph-csi/rbd/csi-rbdplugin-provisioner-dep.yaml, --csi-rbd-provisioner-sts-template-path=/etc/ceph-csi/rbd/csi-rbdplugin-provisioner-sts.yaml, --enable-discovery-daemon=true, --enable-flex-driver=false, --enable-machine-disruption-budget=false, --help=false, --kubeconfig=, --log-flush-frequency=5s, --log-level=INFO, --log_backtrace_at=:0, --log_dir=, --log_file=, --log_file_max_size=1800, --logtostderr=true, --master=, --mon-healthcheck-interval=45s, --mon-out-timeout=5m0s, --operator-image=, --service-account=, --skip_headers=false, --skip_log_headers=false, --stderrthreshold=2, --v=0, --vmodule=
2020-08-25 04:05:57.737087 I | cephcmd: starting operator
2020-08-25 04:05:57.801061 I | op-discover: rook-discover daemonset already exists, updating ...
2020-08-25 04:05:57.828608 I | operator: rook-provisioner ceph.rook.io/block started using ceph.rook.io flex vendor dir
I0825 04:05:57.828776      10 leaderelection.go:242] attempting to acquire leader lease rook-ceph/ceph.rook.io-block...
2020-08-25 04:05:57.828838 I | operator: rook-provisioner rook.io/block started using rook.io flex vendor dir
...
2020-08-25 04:05:59.546300 I | op-k8sutil: ROOK_CSI_KUBELET_DIR_PATH="/var/lib/kubelet" (env var)
2020-08-25 04:05:59.571085 E | ceph-block-pool-controller: failed to reconcile invalid pool CR "csireplpool" spec: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:05:59.945701 I | op-mon: parsing mon endpoints: d=10.254.140.99:6789,b=10.254.27.205:6789,c=10.254.144.51:6789
2020-08-25 04:06:00.229308 W | cephclient: failed to get ceph daemons versions, this likely means there is no cluster yet. failed to run 'ceph versions: exit status 1
2020-08-25 04:06:00.345543 I | op-mon: parsing mon endpoints: d=10.254.140.99:6789,b=10.254.27.205:6789,c=10.254.144.51:6789
2020-08-25 04:06:00.537510 E | ceph-file-controller: failed to reconcile invalid object filesystem "myfs" arguments: invalid metadata pool: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:06:00.631651 I | ceph-csi: successfully created csi config map "rook-ceph-csi-config"
2020-08-25 04:06:00.632192 I | ceph-csi: detecting the ceph csi image version for image "bcmt-registry:5000/csi/cephcsi:v2.1.2"
2020-08-25 04:06:00.821972 I | op-k8sutil: CSI_PROVISIONER_TOLERATIONS="- effect: NoExecute\n  key: is_control\n  operator: Equal\n  value: \"true\"\n- effect: NoExecute\n  key: is_edge\n  operator: Equal\n  value: \"true\"\n- effect: NoExecute\n  key: is_storage\n  operator: Equal\n  value: \"true\"\n- effect: NoSchedule\n  key: node.cloudprovider.kubernetes.io/uninitialized\n  operator: Equal\n  value: \"true\"\n" (env var)
2020-08-25 04:06:01.028714 W | cephclient: failed to get ceph daemons versions, this likely means there is no cluster yet. failed to run 'ceph versions: exit status 1
2020-08-25 04:06:01.127681 E | ceph-block-pool-controller: failed to reconcile invalid pool CR "csireplpool" spec: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:06:01.326682 E | ceph-object-controller: failed to reconcile invalid object store "rook-ceph-store" arguments: invalid metadata pool spec: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:06:01.545112 I | op-mon: parsing mon endpoints: d=10.254.140.99:6789,b=10.254.27.205:6789,c=10.254.144.51:6789
2020-08-25 04:06:01.545202 I | op-cluster: cluster info loaded for monitoring: &{FSID:79580ff1-adf9-4d6a-a4c6-9dc44fe784c5 MonitorSecret:AQDEckRfaObYIhAAfu5txBedGHfueBAZddUAzg== AdminSecret:AQDEckRfnro2LhAABI76dGcXtkM1BBtXpfHCDA== ExternalCred:{Username: Secret:} Name:rook-ceph Monitors:map[b:0xc00000c580 c:0xc00000c780 d:0xc00000c2c0] CephVersion:{Major:0 Minor:0 Extra:0 Build:0}}
2020-08-25 04:06:01.545210 I | op-cluster: enabling cluster monitoring goroutines
2020-08-25 04:06:01.545216 I | op-client: start watching client resources in namespace "rook-ceph"
2020-08-25 04:06:02.146208 I | op-k8sutil: ROOK_OBC_WATCH_OPERATOR_NAMESPACE="true" (env var)
2020-08-25 04:06:02.146238 I | op-bucket-prov: ceph bucket provisioner launched watching for provisioner "rook-ceph.ceph.rook.io/bucket"
2020-08-25 04:06:02.147645 I | op-cluster: ceph status check interval is 60s
I0825 04:06:02.147724      10 manager.go:118] objectbucket.io/provisioner-manager "msg"="starting provisioner" "name"="rook-ceph.ceph.rook.io/bucket"
2020-08-25 04:06:02.356831 I | op-mon: parsing mon endpoints: d=10.254.140.99:6789,b=10.254.27.205:6789,c=10.254.144.51:6789
2020-08-25 04:06:02.737032 E | ceph-block-pool-controller: failed to reconcile invalid pool CR "csireplpool" spec: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:06:02.821955 E | op-cluster: failed to get ceph status. failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1
2020-08-25 04:06:02.852749 I | op-config: CephCluster "rook-ceph" status: "Failure". "Failed to configure ceph cluster"
2020-08-25 04:06:02.939369 W | cephclient: failed to get ceph daemons versions, this likely means there is no cluster yet. failed to run 'ceph versions: exit status 1
2020-08-25 04:06:02.945847 I | op-mon: parsing mon endpoints: d=10.254.140.99:6789,b=10.254.27.205:6789,c=10.254.144.51:6789
2020-08-25 04:06:03.463716 W | cephclient: failed to get ceph daemons versions, this likely means there is no cluster yet. failed to run 'ceph versions: exit status 1
2020-08-25 04:06:03.523955 E | ceph-file-controller: failed to reconcile invalid object filesystem "myfs" arguments: invalid metadata pool: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:06:03.737853 I | ceph-spec: ceph-block-pool-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2020-08-25T04:06:02Z" "2020-08-25T04:06:02Z" "HEALTH_OK"}
2020-08-25 04:06:03.755237 E | ceph-object-controller: failed to reconcile invalid object store "rook-ceph-store" arguments: invalid metadata pool spec: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
```
- Crashing pod(s) logs, if necessary
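One detail from the operator log above: at 04:06:00.631651 it reports `successfully created csi config map "rook-ceph-csi-config"` after its restart, so bouncing the operator pod appears to re-create the configmap and may serve as a workaround until the root cause is fixed. A sketch, assuming the stock `app=rook-ceph-operator` label from the Rook manifests:

```
# Restart the operator; on startup it re-creates rook-ceph-csi-config,
# after which the stuck CSI pods should be able to mount it again
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
```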
**Environment:**

- OS (e.g. from /etc/os-release): RHEL 7.8
- Kernel (e.g. `uname -a`): Linux tesla-cb0434-csd1-csd1-control-01 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Wed Feb 26 03:08:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: OpenStack
- Rook version (use `rook version` inside of a Rook Pod): v1.3.9
- Storage backend version (e.g. for ceph do `ceph -v`): v14.2.10
- Kubernetes version (use `kubectl version`): v1.18.8
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Tectonic
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox): HEALTH_OK