**Is this a bug report or feature request?**

Bug Report
**Deviation from expected behavior:**

After a host reboot, the configmap `rook-ceph-csi-config` disappeared and all of the ceph-csi pods were stuck in the `ContainerCreating` state.
**Expected behavior:**

The configmap `rook-ceph-csi-config` should not be deleted in this situation, or the operator should re-create it if it is not found.
**How to reproduce it (minimal and precise):**

Run `reboot` on one of the hosts in the cluster. The issue does not surface on every reboot, but it occurred several times in our testing. Below is the failure condition.
```
# knc get pods | egrep -v "Run|Com"
NAME                                            READY   STATUS              RESTARTS   AGE
csi-cephfsplugin-6cb75                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-9whpq                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-bpn88                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-gd6kk                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-hbjkj                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-jt48j                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-mlj6w                          0/3     ContainerCreating   0          15h
csi-cephfsplugin-provisioner-67cdf965c6-764bx   0/5     ContainerCreating   0          15h
csi-cephfsplugin-provisioner-67cdf965c6-pq4wm   0/5     ContainerCreating   0          15h
csi-cephfsplugin-rx599                          0/3     ContainerCreating   0          15h
csi-rbdplugin-9v8kb                             0/3     ContainerCreating   0          15h
csi-rbdplugin-bccpt                             0/3     ContainerCreating   0          15h
csi-rbdplugin-bqlpc                             0/3     ContainerCreating   0          15h
csi-rbdplugin-f2fb9                             0/3     ContainerCreating   0          15h
csi-rbdplugin-h8hbc                             0/3     ContainerCreating   0          15h
csi-rbdplugin-l2wbz                             0/3     ContainerCreating   0          15h
csi-rbdplugin-njtt7                             0/3     ContainerCreating   0          15h
csi-rbdplugin-provisioner-78d6f54775-dq47m      0/6     ContainerCreating   0          15h
csi-rbdplugin-provisioner-78d6f54775-hc9nb      0/6     ContainerCreating   0          15h
csi-rbdplugin-tfn52                             0/3     ContainerCreating   0          15h
rook-ceph-detect-version-kpc2p                  0/1     Init:0/1            0          1s
```
```
# knc describe pod csi-rbdplugin-provisioner-78d6f54775-hc9nb
...
Events:
  Type     Reason       Age                  From                                         Message
  ----     ------       ----                 ----                                         -------
  Warning  FailedMount  40m (x57 over 14h)   kubelet, tesla-cb0434-csd1-csd1-control-03   Unable to attach or mount volumes: unmounted volumes=[ceph-csi-config], unattached volumes=[rook-csi-rbd-provisioner-sa-token-q4kbn host-dev host-sys lib-modules ceph-csi-config keys-tmp-dir socket-dir]: timed out waiting for the condition
  Warning  FailedMount  36m (x58 over 14h)   kubelet, tesla-cb0434-csd1-csd1-control-03   Unable to attach or mount volumes: unmounted volumes=[ceph-csi-config], unattached volumes=[socket-dir rook-csi-rbd-provisioner-sa-token-q4kbn host-dev host-sys lib-modules ceph-csi-config keys-tmp-dir]: timed out waiting for the condition
  Warning  FailedMount  15m (x62 over 15h)   kubelet, tesla-cb0434-csd1-csd1-control-03   Unable to attach or mount volumes: unmounted volumes=[ceph-csi-config], unattached volumes=[keys-tmp-dir socket-dir rook-csi-rbd-provisioner-sa-token-q4kbn host-dev host-sys lib-modules ceph-csi-config]: timed out waiting for the condition
  Warning  FailedMount  67s (x453 over 15h)  kubelet, tesla-cb0434-csd1-csd1-control-03   MountVolume.SetUp failed for volume "ceph-csi-config" : configmap "rook-ceph-csi-config" not found
```
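The events point at the missing configmap rather than at the CSI pods themselves. Querying it directly confirms it is gone; a minimal check, assuming `knc` above is an alias for `kubectl -n rook-ceph`:

```
# Query the configmap the kubelet is failing to mount; on the broken
# cluster this returns NotFound instead of the configmap
kubectl -n rook-ceph get configmap rook-ceph-csi-config
# Error from server (NotFound): configmaps "rook-ceph-csi-config" not found
```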
```
# cephstatus
  cluster:
    id:     79580ff1-adf9-4d6a-a4c6-9dc44fe784c5
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,c,d (age 15h)
    mgr: a(active, since 15h)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 15h), 6 in (since 15h)
    rgw: 1 daemon active (rook.ceph.store.a)

  task status:
    scrub status:
        mds.myfs-a: idle
        mds.myfs-b: idle

  data:
    pools:   10 pools, 208 pgs
    objects: 295 objects, 22 MiB
    usage:   6.9 GiB used, 53 GiB / 60 GiB avail
    pgs:     208 active+clean

  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
```
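Note that Ceph itself is healthy; only the CSI configmap is missing. Listing the configmaps in the Rook namespace shows which ones survived the reboot (a sketch; which other configmaps appear, e.g. `rook-ceph-mon-endpoints`, depends on the deployment):

```
# List the configmaps in the Rook namespace; rook-ceph-csi-config is
# absent while others such as rook-ceph-mon-endpoints are still present
kubectl -n rook-ceph get configmaps
```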
**File(s) to submit:**

- Cluster CR (custom resource), typically called `cluster.yaml`, if necessary
- Operator's logs, if necessary:
```
2020-08-25 04:05:57.736755 I | rookcmd: starting Rook v1.3.9 with arguments '/usr/local/bin/rook ceph operator'
2020-08-25 04:05:57.737076 I | rookcmd: flag values: --add_dir_header=false, --alsologtostderr=false, --csi-cephfs-plugin-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin.yaml, --csi-cephfs-provisioner-dep-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin-provisioner-dep.yaml, --csi-cephfs-provisioner-sts-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin-provisioner-sts.yaml, --csi-rbd-plugin-template-path=/etc/ceph-csi/rbd/csi-rbdplugin.yaml, --csi-rbd-provisioner-dep-template-path=/etc/ceph-csi/rbd/csi-rbdplugin-provisioner-dep.yaml, --csi-rbd-provisioner-sts-template-path=/etc/ceph-csi/rbd/csi-rbdplugin-provisioner-sts.yaml, --enable-discovery-daemon=true, --enable-flex-driver=false, --enable-machine-disruption-budget=false, --help=false, --kubeconfig=, --log-flush-frequency=5s, --log-level=INFO, --log_backtrace_at=:0, --log_dir=, --log_file=, --log_file_max_size=1800, --logtostderr=true, --master=, --mon-healthcheck-interval=45s, --mon-out-timeout=5m0s, --operator-image=, --service-account=, --skip_headers=false, --skip_log_headers=false, --stderrthreshold=2, --v=0, --vmodule=
2020-08-25 04:05:57.737087 I | cephcmd: starting operator
2020-08-25 04:05:57.801061 I | op-discover: rook-discover daemonset already exists, updating ...
2020-08-25 04:05:57.828608 I | operator: rook-provisioner ceph.rook.io/block started using ceph.rook.io flex vendor dir
I0825 04:05:57.828776      10 leaderelection.go:242] attempting to acquire leader lease rook-ceph/ceph.rook.io-block...
2020-08-25 04:05:57.828838 I | operator: rook-provisioner rook.io/block started using rook.io flex vendor dir
...
2020-08-25 04:05:59.546300 I | op-k8sutil: ROOK_CSI_KUBELET_DIR_PATH="/var/lib/kubelet" (env var)
2020-08-25 04:05:59.571085 E | ceph-block-pool-controller: failed to reconcile invalid pool CR "csireplpool" spec: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:05:59.945701 I | op-mon: parsing mon endpoints: d=10.254.140.99:6789,b=10.254.27.205:6789,c=10.254.144.51:6789
2020-08-25 04:06:00.229308 W | cephclient: failed to get ceph daemons versions, this likely means there is no cluster yet. failed to run 'ceph versions: exit status 1
2020-08-25 04:06:00.345543 I | op-mon: parsing mon endpoints: d=10.254.140.99:6789,b=10.254.27.205:6789,c=10.254.144.51:6789
2020-08-25 04:06:00.537510 E | ceph-file-controller: failed to reconcile invalid object filesystem "myfs" arguments: invalid metadata pool: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:06:00.631651 I | ceph-csi: successfully created csi config map "rook-ceph-csi-config"
2020-08-25 04:06:00.632192 I | ceph-csi: detecting the ceph csi image version for image "bcmt-registry:5000/csi/cephcsi:v2.1.2"
2020-08-25 04:06:00.821972 I | op-k8sutil: CSI_PROVISIONER_TOLERATIONS="- effect: NoExecute\n  key: is_control\n  operator: Equal\n  value: \"true\"\n- effect: NoExecute\n  key: is_edge\n  operator: Equal\n  value: \"true\"\n- effect: NoExecute\n  key: is_storage\n  operator: Equal\n  value: \"true\"\n- effect: NoSchedule\n  key: node.cloudprovider.kubernetes.io/uninitialized\n  operator: Equal\n  value: \"true\"\n" (env var)
2020-08-25 04:06:01.028714 W | cephclient: failed to get ceph daemons versions, this likely means there is no cluster yet. failed to run 'ceph versions: exit status 1
2020-08-25 04:06:01.127681 E | ceph-block-pool-controller: failed to reconcile invalid pool CR "csireplpool" spec: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:06:01.326682 E | ceph-object-controller: failed to reconcile invalid object store "rook-ceph-store" arguments: invalid metadata pool spec: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:06:01.545112 I | op-mon: parsing mon endpoints: d=10.254.140.99:6789,b=10.254.27.205:6789,c=10.254.144.51:6789
2020-08-25 04:06:01.545202 I | op-cluster: cluster info loaded for monitoring: &{FSID:79580ff1-adf9-4d6a-a4c6-9dc44fe784c5 MonitorSecret:AQDEckRfaObYIhAAfu5txBedGHfueBAZddUAzg== AdminSecret:AQDEckRfnro2LhAABI76dGcXtkM1BBtXpfHCDA== ExternalCred:{Username: Secret:} Name:rook-ceph Monitors:map[b:0xc00000c580 c:0xc00000c780 d:0xc00000c2c0] CephVersion:{Major:0 Minor:0 Extra:0 Build:0}}
2020-08-25 04:06:01.545210 I | op-cluster: enabling cluster monitoring goroutines
2020-08-25 04:06:01.545216 I | op-client: start watching client resources in namespace "rook-ceph"
2020-08-25 04:06:02.146208 I | op-k8sutil: ROOK_OBC_WATCH_OPERATOR_NAMESPACE="true" (env var)
2020-08-25 04:06:02.146238 I | op-bucket-prov: ceph bucket provisioner launched watching for provisioner "rook-ceph.ceph.rook.io/bucket"
2020-08-25 04:06:02.147645 I | op-cluster: ceph status check interval is 60s
I0825 04:06:02.147724      10 manager.go:118] objectbucket.io/provisioner-manager "msg"="starting provisioner" "name"="rook-ceph.ceph.rook.io/bucket"
2020-08-25 04:06:02.356831 I | op-mon: parsing mon endpoints: d=10.254.140.99:6789,b=10.254.27.205:6789,c=10.254.144.51:6789
2020-08-25 04:06:02.737032 E | ceph-block-pool-controller: failed to reconcile invalid pool CR "csireplpool" spec: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:06:02.821955 E | op-cluster: failed to get ceph status. failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1
2020-08-25 04:06:02.852749 I | op-config: CephCluster "rook-ceph" status: "Failure". "Failed to configure ceph cluster"
2020-08-25 04:06:02.939369 W | cephclient: failed to get ceph daemons versions, this likely means there is no cluster yet. failed to run 'ceph versions: exit status 1
2020-08-25 04:06:02.945847 I | op-mon: parsing mon endpoints: d=10.254.140.99:6789,b=10.254.27.205:6789,c=10.254.144.51:6789
2020-08-25 04:06:03.463716 W | cephclient: failed to get ceph daemons versions, this likely means there is no cluster yet. failed to run 'ceph versions: exit status 1
2020-08-25 04:06:03.523955 E | ceph-file-controller: failed to reconcile invalid object filesystem "myfs" arguments: invalid metadata pool: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
: exit status 1
2020-08-25 04:06:03.737853 I | ceph-spec: ceph-block-pool-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2020-08-25T04:06:02Z" "2020-08-25T04:06:02Z" "HEALTH_OK"}
2020-08-25 04:06:03.755237 E | ceph-object-controller: failed to reconcile invalid object store "rook-ceph-store" arguments: invalid metadata pool spec: failed to get crush map: failed to get crush map. Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
```
- Crashing pod(s) logs, if necessary
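One detail from the operator log above: at 04:06:00.631651 it reports `successfully created csi config map "rook-ceph-csi-config"` after its restart, so bouncing the operator pod appears to re-create the configmap and may serve as a workaround until the root cause is fixed. A sketch, assuming the stock `app=rook-ceph-operator` label from the Rook manifests:

```
# Restart the operator; on startup it re-creates rook-ceph-csi-config,
# after which the stuck CSI pods should be able to mount it again
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
```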
**Environment:**

- OS (e.g. from /etc/os-release): RHEL 7.8
- Kernel (e.g. `uname -a`): Linux tesla-cb0434-csd1-csd1-control-01 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Wed Feb 26 03:08:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: OpenStack
- Rook version (use `rook version` inside of a Rook Pod): v1.3.9
- Storage backend version (e.g. for ceph do `ceph -v`): v14.2.10
- Kubernetes version (use `kubectl version`): v1.18.8
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Tectonic
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox): HEALTH_OK