Ceph Operations Guide

William Jing

Last modified: December 26, 2024

Installation (Helm)

  1. Add the Rook Helm repository:

    helm repo add rook-release https://charts.rook.io/release
  2. Install the operator:

    helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph -f values.yaml
  3. Install the cluster:

    helm install --create-namespace --namespace rook-ceph rook-ceph-cluster --set operatorNamespace=rook-ceph rook-release/rook-ceph-cluster -f values.yaml
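
The three steps above can be collected into one script. The following is a minimal sketch rather than the author's tooling: the function names run and install_rook and the DRY_RUN switch are hypothetical, and the option to pass a separate values file per chart is an assumption (both default to values.yaml, as in the commands above). A helm repo update call is added so the index of the newly added repository is fetched before installing.

```shell
#!/usr/bin/env bash
# Sketch only: run, install_rook, and DRY_RUN are hypothetical names.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# install_rook [operator-values-file] [cluster-values-file]
# Both arguments default to values.yaml, matching the original commands.
install_rook() {
  local operator_values="${1:-values.yaml}"
  local cluster_values="${2:-values.yaml}"
  local ns="rook-ceph"

  run helm repo add rook-release https://charts.rook.io/release &&
  run helm repo update &&
  run helm install --create-namespace --namespace "$ns" \
    rook-ceph rook-release/rook-ceph -f "$operator_values" &&
  run helm install --create-namespace --namespace "$ns" \
    rook-ceph-cluster --set operatorNamespace="$ns" \
    rook-release/rook-ceph-cluster -f "$cluster_values"
}

# Preview the commands without touching the cluster:
#   DRY_RUN=1 install_rook operator-values.yaml cluster-values.yaml
```

Chaining with && stops at the first failing step, so the cluster chart is never installed on top of a failed operator install.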

Operator Configuration

Default Values for rook-ceph-operator

  • Pod Resource Requests & Limits:

    resources:
      requests:
        cpu: 20m
  • Global Log Level:

    logLevel: INFO
  • CSI Configuration:

    • RBD Provisioner Resources:

      csiRBDProvisionerResource: |
        - name: csi-provisioner
          resource:
            requests:
              cpu: 10m
        ...
    • RBD Plugin Resources:

      csiRBDPluginResource: |
        - name: driver-registrar
          resource:
            requests:
              memory: 128Mi
              cpu: 5m
            limits:
              memory: 256Mi
        ...
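    • CephFS Provisioner and Plugin Resources (csiCephFSProvisionerResource and csiCephFSPluginResource, same format as above).

    • NFS Provisioner and Plugin Resources (csiNFSProvisionerResource and csiNFSPluginResource, same format as above).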

  • Monitoring:

    monitoring:
      enabled: true

Cluster Configuration

  • Toolbox:

    toolbox:
      enabled: true
      resources:
        requests:
          cpu: '10m'
  • Ceph Cluster Specifications:

    cephClusterSpec:
      dashboard:
        port: 7000
      labels:
        monitoring:
          release: prometheus-stack
      resources:
        mgr:
          requests:
            cpu: "50m"
        mon:
          requests:
            cpu: "100m"
        ...
      removeOSDsIfOutAndSafeToRemove: true

Removing OSDs

  1. Stop the Rook Operator:

    kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
  2. Mark OSD as out:

    ceph osd out osd.<ID>
  3. Stop the OSD and mark it down:

    kubectl -n rook-ceph scale deployment rook-ceph-osd-<ID> --replicas=0
    ceph osd down osd.<ID>
  4. Wait for backfilling to complete (all PGs report active+clean in ceph status).

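  5. Remove the OSD:

    # purge removes the OSD from the CRUSH map, deletes its auth key, and removes it from the OSD map
    ceph osd purge <ID> --yes-i-really-mean-it
    # only needed if a stale auth entry is still present after the purge
    ceph auth del osd.<ID>
    # only when the node has no OSDs left: removes the now-empty host bucket
    ceph osd crush remove <nodeName>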
  6. Verify:

    ceph osd tree
  7. Restart the Rook Operator by scaling its deployment back to one replica (--replicas=1).
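
The wait for backfilling in the procedure above can be automated. This is a minimal sketch, assuming the ceph CLI is available (for example inside the toolbox pod) and that ceph pg stat prints a summary of the form "<N> pgs: <M> active+clean, ...; ...". The function names pgs_all_clean and wait_for_clean are hypothetical. The check is deliberately strict: PGs in a combined state such as active+clean+scrubbing are not counted as clean, so the loop may wait slightly longer than necessary.

```shell
#!/usr/bin/env bash
# Sketch only: pgs_all_clean and wait_for_clean are hypothetical helpers.

# pgs_all_clean "<output of ceph pg stat>"
# Returns 0 only when the number of PGs in exactly active+clean equals the total.
pgs_all_clean() {
  local stat="$1" total clean
  # The leading "<N> pgs:" gives the total PG count.
  total="$(printf '%s\n' "$stat" | sed -n 's/^\([0-9][0-9]*\) pgs:.*/\1/p')"
  # "<M> active+clean" followed directly by ',' or ';' gives the clean count.
  clean="$(printf '%s\n' "$stat" | sed -n 's/.*[ :]\([0-9][0-9]*\) active+clean[,;].*/\1/p')"
  [ -n "$total" ] && [ -n "$clean" ] && [ "$total" -eq "$clean" ]
}

# wait_for_clean [poll-interval-seconds]
# Polls ceph pg stat until every PG is active+clean.
wait_for_clean() {
  local interval="${1:-30}"
  until pgs_all_clean "$(ceph pg stat)"; do
    sleep "$interval"
  done
}

# Usage, before purging the OSD:
#   wait_for_clean 30
```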

Disk Partitioning

  1. List available disks:

    sudo fdisk -l
  2. Partition a disk (replace /dev/sda with the target disk):

    sudo fdisk /dev/sda
    # At the fdisk prompt: `n` creates a new partition, `w` writes the table and exits.

Clearing Devices

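  1. Wipe the partition table and zero the first 100 MiB of the disk (destructive):

    DISK="/dev/sdX"   # placeholder: set to the device to wipe
    sgdisk --zap-all "$DISK"
    dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync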
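
Because these commands are destructive and depend on DISK being set correctly, a guard in front of them can help. The following is a minimal sketch; confirm_wipe_target is a hypothetical helper name, and /dev/sdX in the usage comment is a placeholder, not a real device.

```shell
#!/usr/bin/env bash
# Sketch only: confirm_wipe_target is a hypothetical helper.

# confirm_wipe_target DISK
# Returns 0 only if DISK is a non-empty path to an existing block device.
confirm_wipe_target() {
  local disk="${1:-}"
  if [ -z "$disk" ]; then
    echo "DISK is empty; refusing to continue" >&2
    return 1
  fi
  if [ ! -b "$disk" ]; then
    echo "$disk is not a block device; refusing to continue" >&2
    return 1
  fi
  return 0
}

# Usage (destructive; /dev/sdX is a placeholder):
#   DISK="/dev/sdX"
#   confirm_wipe_target "$DISK" &&
#     sgdisk --zap-all "$DISK" &&
#     dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
```

The guard only checks that the path is a block device. It cannot tell whether it is the intended one, so confirming the device with fdisk -l first is still worthwhile.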

Exposing Monitoring GUI

  1. Certificate Definition (Cert Manager):

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: ceph-sololude-certificate
      namespace: istio-ingress
    spec:
      secretName: ceph-ingress-cert
      commonName: ceph.sololude.com
      dnsNames:
        - ceph.sololude.com
      issuerRef:
        name: sololude-issuer
  2. Gateway Definition (Istio):

    apiVersion: networking.istio.io/v1beta1
    kind: Gateway
    metadata:
      name: rook-ceph-dashboard-gw
      namespace: rook-ceph
    spec:
      selector:
        app: istio-ingressgateway
      servers:
        - port:
            number: 443
            name: https-ceph
            protocol: HTTPS
          hosts:
            - ceph.sololude.com
          tls:
            mode: SIMPLE
            credentialName: ceph-ingress-cert
        - port:
            number: 80
            name: http-ceph
            protocol: HTTP
          hosts:
            - ceph.sololude.com
  3. Virtual Service Definition (Istio):

    apiVersion: networking.istio.io/v1
    kind: VirtualService
    metadata:
      name: ceph-gateway-vs
      namespace: rook-ceph
    spec:
      hosts:
        - ceph.sololude.com
      gateways:
        - rook-ceph-dashboard-gw
      http:
        - route:
            - destination:
                host: rook-ceph-mgr-dashboard

Issues and Troubleshooting

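  1. Dashboard Service Port:

    • To change the port the dashboard is served on, set cephClusterSpec.dashboard.port=7000 in the Helm values.

  2. OSD Keyring Mismatch:

    • Compare the key Ceph holds for the OSD (ceph auth get osd.<ID>) with the keyring the OSD daemon is using, then make the two agree.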

  3. Entity Exists with Key Mismatch:

    • Delete the stale auth entry so the key can be created again:

      ceph auth del osd.<ID>
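
For the keyring mismatch case, the comparison can be scripted. This is a minimal sketch, assuming both keyrings use the usual Ceph keyring layout with a "key = ..." line, which is the format ceph auth get prints. The helper names keyring_key and keys_match are hypothetical, and the on-disk keyring path in the usage comment is an assumption that may differ in a Rook deployment.

```shell
#!/usr/bin/env bash
# Sketch only: keyring_key and keys_match are hypothetical helpers.

# keyring_key: read keyring text on stdin and print the value of its first "key = ..." line.
keyring_key() {
  sed -n 's/^[[:space:]]*key[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*$/\1/p' | head -n 1
}

# keys_match CLUSTER_KEYRING_TEXT LOCAL_KEYRING_TEXT
# Returns 0 when both texts contain the same non-empty key.
keys_match() {
  local a b
  a="$(printf '%s\n' "$1" | keyring_key)"
  b="$(printf '%s\n' "$2" | keyring_key)"
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (the keyring path is an assumption; adjust it to where the OSD keeps its keyring):
#   keys_match "$(ceph auth get osd.<ID>)" "$(cat /var/lib/ceph/osd/ceph-<ID>/keyring)" \
#     && echo "keys match" || echo "keys differ"
```

The sketch only reports whether the keys differ. Which side to correct is left to the operator.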

Monitoring

  • Enable monitoring in Helm values:

    monitoring:
      enabled: true
  • Add labels for monitoring (Helm --set notation, equivalent to the cephClusterSpec.labels block under Cluster Configuration):

    cephClusterSpec.labels.monitoring.release=prometheus-stack

Upgrade

  1. Upgrade the Helm client:

    curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
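  2. Upgrade the releases using Helm (operator chart first, then the cluster chart):

    helm upgrade -n rook-ceph rook-ceph rook-release/rook-ceph -f values.yaml
    helm upgrade -n rook-ceph rook-ceph-cluster rook-release/rook-ceph-cluster -f values.yaml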
