Ceph Operations Guide
William Jing
Installation (Helm)
Add the Rook Helm repository:
helm repo add rook-release https://charts.rook.io/release
Install the operator:
helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph -f values.yaml
Install the cluster:
helm install --create-namespace --namespace rook-ceph rook-ceph-cluster --set operatorNamespace=rook-ceph rook-release/rook-ceph-cluster -f values.yaml
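The two installs above each take a values.yaml. A minimal operator-chart values file, assembled from the settings covered later in this guide (the specific values shown are illustrative, not required defaults), might look like:

```yaml
# Illustrative values.yaml for the rook-ceph operator chart.
logLevel: INFO
resources:
  requests:
    cpu: 20m
monitoring:
  enabled: true
```

The rook-ceph-cluster chart takes a separate values file (toolbox, cephClusterSpec, and so on), described under Cluster Configuration below.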
Operator Configuration
Default Values for rook-ceph-operator
Pod Resource Requests & Limits:
resources:
  requests:
    cpu: 20m
Global Log Level:
logLevel: INFO
CSI Configuration:
RBD Provisioner Resources:
csiRBDProvisionerResource: |
  - name: csi-provisioner
    resource:
      requests:
        cpu: 10m
  ...
RBD Plugin Resources:
csiRBDPluginResource: |
  - name: driver-registrar
    resource:
      requests:
        memory: 128Mi
        cpu: 5m
      limits:
        memory: 256Mi
  ...
CephFS Provisioner and Plugin Resources (similar format).
NFS Provisioner and Plugin Resources (similar format).
Monitoring:
monitoring:
  enabled: true
Cluster Configuration
Toolbox:
toolbox:
  enabled: true
  resources:
    requests:
      cpu: '10m'
Ceph Cluster Specifications:
cephClusterSpec:
  dashboard:
    port: 7000
  labels:
    monitoring:
      release: prometheus-stack
  resources:
    mgr:
      requests:
        cpu: "50m"
    mon:
      requests:
        cpu: "100m"
  ...
  removeOSDsIfOutAndSafeToRemove: true
Removing OSDs
Stop the Rook Operator:
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
Mark OSD as out:
ceph osd out osd.<ID>
Confirm OSD is down:
kubectl -n rook-ceph scale deployment rook-ceph-osd-<ID> --replicas=0
ceph osd down osd.<ID>
Wait for backfilling to complete (all PGs active+clean). Remove the OSD:
ceph osd purge <ID> --yes-i-really-mean-it
ceph auth del osd.<ID>
ceph osd crush remove <nodeName>
Verify:
ceph osd tree
Restart the Rook Operator:
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
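Before purging, every placement group must be back to active+clean. A small Python sketch of that check; the JSON shape (pg_summary / num_pg_by_state / num_pgs) is an assumption based on what `ceph pg stat --format json` returns on common Ceph releases, so verify against your version.

```python
# Sketch: decide whether it is safe to purge an OSD by checking that
# every placement group is active+clean. Field names are assumptions
# modeled on `ceph pg stat --format json` output.
import json
import subprocess


def all_pgs_active_clean(pg_stat: dict) -> bool:
    """True when every PG reported in pg_stat is active+clean."""
    summary = pg_stat["pg_summary"]
    total = summary["num_pgs"]
    clean = sum(
        state["num"]
        for state in summary["num_pg_by_state"]
        if state["name"] == "active+clean"
    )
    return total > 0 and clean == total


def fetch_pg_stat() -> dict:
    # Hypothetical invocation; run inside the rook-ceph toolbox pod.
    out = subprocess.check_output(["ceph", "pg", "stat", "--format", "json"])
    return json.loads(out)


if __name__ == "__main__":
    sample = {
        "pg_summary": {
            "num_pgs": 3,
            "num_pg_by_state": [{"name": "active+clean", "num": 3}],
        }
    }
    print(all_pgs_active_clean(sample))
```

Polling this in a loop (with a sleep) until it returns True is one way to gate the purge step in automation.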
Disk Partitioning
List available disks:
sudo fdisk -l
Partition a disk:
sudo fdisk /dev/sda # Use `n` to create a new partition and `w` to write the changes.
Clearing Devices
Clear partitions:
sgdisk --zap-all $DISK
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
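Since these commands are destructive, it can help to generate them per disk and review before running. A Python sketch of that (the device paths are illustrative assumptions):

```python
# Sketch: build the sgdisk/dd command lines from the "Clearing Devices"
# section for a list of disks, so they can be reviewed before execution.
from typing import List


def wipe_commands(disk: str) -> List[str]:
    """Commands to zap the partition table and zero the first 100 MiB."""
    return [
        f"sgdisk --zap-all {disk}",
        f"dd if=/dev/zero of={disk} bs=1M count=100 oflag=direct,dsync",
    ]


if __name__ == "__main__":
    # Illustrative device list; double-check against `fdisk -l` output.
    for disk in ["/dev/sdb", "/dev/sdc"]:
        for cmd in wipe_commands(disk):
            print(cmd)
```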
Exposing Monitoring GUI
Certificate Definition (Cert Manager):
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ceph-sololude-certificate
  namespace: istio-ingress
spec:
  secretName: ceph-ingress-cert
  commonName: ceph.sololude.com
  dnsNames:
    - ceph.sololude.com
  issuerRef:
    name: sololude-issuer
Gateway Definition (Istio):
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: rook-ceph-dashboard-gw
  namespace: rook-ceph
spec:
  selector:
    app: istio-ingressgateway
  servers:
    - port:
        number: 443
        name: https-ceph
        protocol: HTTPS
      hosts:
        - ceph.sololude.com
      tls:
        mode: SIMPLE
        credentialName: ceph-ingress-cert
    - port:
        number: 80
        name: http-ceph
        protocol: HTTP
      hosts:
        - ceph.sololude.com
Virtual Service Definition (Istio):
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ceph-gateway-vs
  namespace: rook-ceph
spec:
  hosts:
    - ceph.sololude.com
  gateways:
    - rook-ceph-dashboard-gw
  http:
    - route:
        - destination:
            host: rook-ceph-mgr-dashboard
Issues and Troubleshooting
Service Port Change:
Set cephClusterSpec.dashboard.port=7000 in Helm values.
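As a values-file fragment, the same setting looks like:

```yaml
# Helm values fragment: serve the dashboard on port 7000.
cephClusterSpec:
  dashboard:
    port: 7000
```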
OSD Keyring Mismatch:
Retrieve the keyring Ceph holds for the OSD (ceph auth get osd.<ID>) and compare it with the keyring stored for the OSD pod; update the mismatched copy so the two agree.
Entity Exists with Key Mismatch:
Delete older auth:
ceph auth del osd.x
Monitoring
Enable monitoring in Helm values:
monitoring:
  enabled: true
Add labels for monitoring:
cephClusterSpec.labels.monitoring={release: prometheus-stack}
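Expressed as a values-file fragment rather than --set syntax, the two monitoring settings above combine to:

```yaml
# Helm values fragment: enable monitoring and label the monitoring
# resources so the prometheus-stack release discovers them.
monitoring:
  enabled: true
cephClusterSpec:
  labels:
    monitoring:
      release: prometheus-stack
```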
Upgrade
Upgrade Helm:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Upgrade using Helm:
helm upgrade -n rook-ceph rook-ceph rook-release/rook-ceph -f values.yaml