Operations

Day-2 procedures for running the Klustre CSI Plugin.

Use these task guides when you need to change cluster settings, roll out new plugin versions, or troubleshoot node issues. Each page focuses on one repeatable operation so you can jump straight to the steps you need.

1 - Nodes and Volumes

How Klustre CSI prepares nodes and exposes Lustre volumes.

Learn about node prerequisites, kubelet integration, and how static Lustre volumes are represented in Kubernetes.

1.1 - Nodes

Prepare and operate Kubernetes nodes that run the Klustre CSI daemonset.

Klustre CSI only schedules on nodes that can mount Lustre exports. Use the topics below to prepare those nodes, understand what the daemonset mounts from the host, and keep kubelet integration healthy.

1.1.1 - Node preparation

Install the Lustre client, label nodes, and grant the privileges required by Klustre CSI.

Install the Lustre client stack

Every node that runs Lustre-backed pods must have:

  • mount.lustre and umount.lustre binaries (via lustre-client RPM/DEB).
  • Kernel modules compatible with your Lustre servers.
  • Network reachability to the Lustre MGS/MDS/OSS endpoints.

Verify installation:

mount.lustre --version
lsmod | grep lustre

Label nodes

The default storage class and daemonset use the label lustre.csi.klustrefs.io/lustre-client=true.

kubectl label nodes <node-name> lustre.csi.klustrefs.io/lustre-client=true

Remove the label when a node no longer has Lustre access:

kubectl label nodes <node-name> lustre.csi.klustrefs.io/lustre-client-

Allow privileged workloads

Klustre CSI pods require:

  • privileged: true, allowPrivilegeEscalation: true
  • hostPID: true, hostNetwork: true
  • HostPath mounts for /var/lib/kubelet, /dev, /sbin, /usr/sbin, /lib, and /lib64
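
These settings usually surface in the daemonset pod spec roughly as follows (a sketch rather than the shipped manifest; the volume list is abbreviated, and Bidirectional mount propagation is an assumption common to CSI node plugins):

spec:
  template:
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: klustre-csi
          securityContext:
            privileged: true
            allowPrivilegeEscalation: true
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
              # Assumption: CSI node plugins typically need bidirectional
              # propagation so volume mounts become visible to kubelet.
              mountPropagation: Bidirectional
      volumes:
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet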

Label the namespace with Pod Security Admission overrides:

kubectl create namespace klustre-system
kubectl label namespace klustre-system \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/audit=privileged \
  pod-security.kubernetes.io/warn=privileged

Maintain consistency

  • Keep AMIs or OS images in sync so every node has the same Lustre client version.
  • If you use autoscaling groups, bake the client packages into your node image or run a bootstrap script before kubelet starts (see the sketch after this list).
  • Automate label management with infrastructure-as-code (e.g., Cluster API, Ansible) so the right nodes receive the lustre-client=true label on join/leave events.
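
If you go the bootstrap-script route, a minimal pre-kubelet sketch might look like this (assuming an RPM-based node image and an already-configured Lustre package repository; adjust the package manager and package names for your distribution):

#!/usr/bin/env bash
# Sketch: install and verify the Lustre client before kubelet starts.
set -euo pipefail

# Install the client packages from your configured Lustre repository.
yum install -y lustre-client

# Load the client module and confirm the userspace tools are present.
modprobe lustre
mount.lustre --version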

1.1.2 - Node integration flow

Understand how Klustre CSI interacts with kubelet and the host filesystem.

Daemonset host mounts

DaemonSet/klustre-csi-node mounts the following host paths:

  • /var/lib/kubelet/plugins and /var/lib/kubelet/pods – required for CSI socket registration and mount propagation.
  • /dev – ensures device files (if any) are accessible when mounting Lustre.
  • /sbin, /usr/sbin, /lib, /lib64 – expose the host’s Lustre client binaries and libraries to the container.

If your kubelet uses custom directories, update pluginDir and registrationDir in the settings ConfigMap.
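
For example, if kubelet runs with a non-default root directory such as /data/kubelet (an illustrative path), the overrides might look roughly like this (only the key names come from the settings ConfigMap; the values are assumptions for this example):

apiVersion: v1
kind: ConfigMap
metadata:
  name: klustre-csi-settings
  namespace: klustre-system
data:
  # Illustrative values for a kubelet rooted at /data/kubelet.
  pluginDir: /data/kubelet/plugins/lustre.csi.klustrefs.io
  registrationDir: /data/kubelet/plugins_registry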

CSI socket lifecycle

  1. The node plugin listens on csiEndpoint (defaults to /var/lib/kubelet/plugins/lustre.csi.klustrefs.io/csi.sock).
  2. The node-driver-registrar sidecar registers that socket with kubelet via registrationDir.
  3. Kubelet uses the UNIX socket to call NodePublishVolume and NodeUnpublishVolume when pods mount or unmount PVCs.

If the daemonset does not come up or kubelet cannot reach the socket, run:

kubectl describe daemonset klustre-csi-node -n klustre-system
kubectl logs -n klustre-system daemonset/klustre-csi-node -c klustre-csi

PATH and library overrides

The containers inherit PATH and LD_LIBRARY_PATH values that point at the host bind mounts. If your Lustre client lives elsewhere, override:

  • nodePlugin.pathEnv
  • nodePlugin.ldLibraryPath

via Helm values or by editing the daemonset manifest.
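
For example, if the client binaries live under /opt/lustre (an illustrative layout), the Helm values might look like this sketch (only the two keys above come from the chart; the paths are assumptions):

nodePlugin:
  pathEnv: /opt/lustre/sbin:/usr/sbin:/sbin:/usr/bin:/bin
  ldLibraryPath: /opt/lustre/lib:/lib64:/lib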

Health signals

  • Kubernetes events referencing lustre.csi.klustrefs.io indicate mount/unmount activity.
  • kubectl get pods -n klustre-system -o wide should show one pod per labeled node.
  • A missing pod usually means the node label is absent or taints/tolerations are mismatched.
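
To check the first two signals quickly with standard kubectl:

# Recent mount/unmount events attributed to the driver.
kubectl get events -A | grep lustre.csi.klustrefs.io

# Compare running node pods against the labeled node pool.
kubectl get pods -n klustre-system -o wide
kubectl get nodes -L lustre.csi.klustrefs.io/lustre-client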

1.2 - Volumes

Model Lustre exports as PersistentVolumes and understand RWX behavior.

Klustre CSI focuses on static provisioning: you point a PV at an existing Lustre export, bind it to a PVC, and mount it into pods. Explore the topics below for the manifest workflow and mount attribute details.

1.2.1 - Static PV workflow

Define PersistentVolumes and PersistentVolumeClaims that reference Lustre exports.

1. Create the PersistentVolume

apiVersion: v1
kind: PersistentVolume
metadata:
  name: lustre-static-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: klustre-csi-static
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: lustre.csi.klustrefs.io
    volumeHandle: lustre-static-pv
    volumeAttributes:
      source: 10.0.0.1@tcp0:/lustre-fs
      mountOptions: flock,user_xattr

  • volumeHandle just needs to be unique within the cluster; it is not used by the Lustre backend.
  • volumeAttributes.source carries the Lustre management target and filesystem path.

2. Bind with a PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lustre-static-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: klustre-csi-static
  volumeName: lustre-static-pv
  resources:
    requests:
      storage: 10Gi

Even though Lustre capacity is managed outside Kubernetes, the storage field should match the PV so the binder succeeds.

3. Mount from workloads

volumes:
  - name: lustre
    persistentVolumeClaim:
      claimName: lustre-static-pvc
containers:
  - name: app
    image: busybox
    volumeMounts:
      - name: lustre
        mountPath: /mnt/lustre

Multiple pods can reference the same PVC because Lustre supports ReadWriteMany. Pods must schedule on labeled nodes (lustre.csi.klustrefs.io/lustre-client=true).
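
If you want to make that scheduling constraint explicit rather than rely on storage class topology, a plain nodeSelector on the pod spec works (a sketch using the label from this guide):

spec:
  # Keep the pod on nodes that can mount Lustre.
  nodeSelector:
    lustre.csi.klustrefs.io/lustre-client: "true"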

4. Cleanup

When you delete the PVC (after its pods are gone), the PV is released but not removed because the reclaim policy is Retain. Manually delete the PV when you no longer need it.

1.2.2 - Volume attributes and mount options

Map Kubernetes fields to Lustre mount flags and behaviors.

volumeAttributes

  • source (e.g., 10.0.0.1@tcp0:/lustre-fs) – Host(s) and filesystem path given to mount.lustre.
  • mountOptions (e.g., flock,user_xattr) – Comma-separated Lustre mount flags.

Additional keys (e.g., subdir) can be added in the future; the driver simply passes the map to the Lustre helper script.

Storage class tuning

See the storage class reference for details on:

  • allowedTopologies – keep workloads on nodes with the Lustre label.
  • reclaimPolicy – typically Retain for static PVs.
  • mountOptions – defaults to flock and user_xattr; add options such as noatime when workloads need them.
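
As a point of reference, a static storage class exercising these fields might look roughly like this (a sketch, not the shipped manifest; the authoritative definition lives in the storage class reference):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: klustre-csi-static
provisioner: lustre.csi.klustrefs.io
reclaimPolicy: Retain
mountOptions:
  - flock
  - user_xattr
allowedTopologies:
  - matchLabelExpressions:
      - key: lustre.csi.klustrefs.io/lustre-client
        values:
          - "true"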

Override mount options per volume by setting volumeAttributes.mountOptions. This is useful when a subset of workloads needs different locking semantics.

Access modes

  • Use ReadWriteMany for shared Lustre volumes.
  • ReadOnlyMany is supported when you only need read access.
  • ReadWriteOnce offers no benefit with Lustre; prefer RWX.

Lifecycle reminders

  • Klustre CSI does not provision or delete Lustre exports. Ensure the server-side directory exists and has the correct permissions.
  • Kubernetes capacity values are advisory. Quotas should be enforced on the Lustre server (see the quota check after this list).
  • PersistentVolumeReclaimPolicy=Retain keeps PVs around after PVC deletion; clean them up manually to avoid dangling objects.
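
If a node already has the filesystem mounted, you can inspect server-side quotas from the client with the standard lfs tool (user name and mount path are placeholders):

# Show block and inode quota usage for a user on the mounted filesystem.
lfs quota -u <user> /mnt/lustre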

2 - Maintenance and Upgrade

Keep Klustre CSI healthy during node drains and Kubernetes upgrades.

Patch nodes, rotate images, or upgrade the CSI plugin without interrupting workloads.

Select the maintenance guide you need from the navigation—node checklist, upgrade plan, and future topics all live underneath this page.

2.1 - Node maintenance checklist

Drain nodes safely and ensure Klustre CSI pods return to service.

1. Cordon and drain

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

Because klustre-csi-node is a DaemonSet, --ignore-daemonsets leaves its pod in place; the drain still moves your Lustre-backed workloads off the node before the reboot.

2. Verify daemonset status

kubectl get pods -n klustre-system -o wide | grep <node>

Expect the daemonset pod to stay on the node through the drain (daemonset pods are not evicted), go down during the reboot, and come back automatically once the node returns.

3. Patch or reboot the node

  • Apply OS updates, reboot, or swap hardware as needed.
  • Ensure the Lustre client packages remain installed (validate with mount.lustre --version).

4. Uncordon and relabel if necessary

kubectl uncordon <node>

If the node lost the lustre.csi.klustrefs.io/lustre-client=true label, reapply it after verifying Lustre connectivity.

5. Watch for daemonset rollout

kubectl rollout status daemonset/klustre-csi-node -n klustre-system

6. Confirm workloads recover

Run kubectl get pods in the namespaces that rely on Lustre PVCs to confirm the pods are running and their mounts succeeded.

Tips

  • For large clusters, drain one Lustre node at a time to keep mounts available.
  • If kubectl drain hangs due to pods using Lustre PVCs, identify them with kubectl get pods --all-namespaces -o wide | grep <node> and evict manually.
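
For the manual eviction, deleting the stuck pods is usually enough once their controllers can reschedule them elsewhere (pod and namespace names are placeholders):

# Evict a pod that is still holding a Lustre mount on the drained node.
kubectl delete pod <pod-name> -n <namespace> --grace-period=60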

2.2 - Upgrade guide

Plan Klustre CSI version upgrades alongside Kubernetes changes.

1. Review release notes

Check the klustre-csi-plugin GitHub releases for breaking changes, minimum Kubernetes versions, and image tags.

2. Update the image reference

  • Helm users: bump image.tag and nodePlugin.registrar.image.tag in your values file, then run helm upgrade.
  • Manifest users: edit manifests/configmap-klustre-csi-settings.yaml (nodeImage, registrarImage) and reapply the manifests.
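
For the Helm path, the upgrade typically looks like this sketch (the release name, chart reference, values file, and registrar tag are assumptions; substitute your own):

helm upgrade klustre-csi ./charts/klustre-csi \
  -n klustre-system \
  -f my-values.yaml \
  --set image.tag=0.1.2 \
  --set nodePlugin.registrar.image.tag=<registrar-tag>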

See Update the Klustre CSI image for detailed steps.

3. Roll out sequentially

kubectl rollout restart daemonset/klustre-csi-node -n klustre-system
kubectl rollout status daemonset/klustre-csi-node -n klustre-system

The daemonset restarts one node at a time, keeping existing mounts available.

4. Coordinate with Kubernetes upgrades

When upgrading kubelet:

  1. Follow the node maintenance checklist for each node.
  2. Upgrade the node OS/kubelet.
  3. Verify the daemonset pod recreates successfully before moving to the next node.

5. Validate workloads

  • Spot-check pods that rely on Lustre PVCs (kubectl exec into them and run df -h /mnt/lustre).
  • Ensure no stale FailedMount events exist.
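
Concretely (pod name and namespace are placeholders):

# Verify the Lustre mount from inside a consuming pod.
kubectl exec -n <namespace> <pod-name> -- df -h /mnt/lustre

# Look for lingering mount failures across the cluster.
kubectl get events -A --field-selector reason=FailedMount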

Rollback

If the new version misbehaves:

  1. Revert nodeImage and related settings to the previous tag.
  2. Run kubectl rollout restart daemonset/klustre-csi-node -n klustre-system.
  3. Inspect logs to confirm the old version is running.

3 - Label Lustre-capable nodes

Apply and verify the topology label used by the Klustre storage class.

The default klustre-csi-static storage class restricts scheduling to nodes labeled lustre.csi.klustrefs.io/lustre-client=true. Use this runbook whenever you add or remove nodes from the Lustre client pool.

Requirements

  • Cluster-admin access with kubectl.
  • Nodes already have the Lustre client packages installed and can reach your Lustre servers.

Steps

  1. Identify nodes that can mount Lustre

    kubectl get nodes -o wide
    

    Cross-reference with your infrastructure inventory or automation outputs to find the node names that have Lustre connectivity.

  2. Apply the label

    kubectl label nodes <node-name> lustre.csi.klustrefs.io/lustre-client=true
    

    Repeat for each eligible node. Use --overwrite if the label already exists but the value should change.

  3. Verify

    kubectl get nodes -L lustre.csi.klustrefs.io/lustre-client
    

    Ensure only the nodes with Lustre access show true. Remove the label from nodes that lose access:

    kubectl label nodes <node-name> lustre.csi.klustrefs.io/lustre-client-
    
  4. Confirm DaemonSet placement

    kubectl get pods -n klustre-system -o wide \
      -l app.kubernetes.io/name=klustre-csi
    

    Pods from the klustre-csi-node daemonset should exist only on labeled nodes. If you see pods on unlabeled nodes, check the nodeSelector and tolerations in the daemonset spec.
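
    To inspect those fields directly:

    kubectl get daemonset klustre-csi-node -n klustre-system \
      -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}{.spec.template.spec.tolerations}{"\n"}'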

4 - Update the Klustre CSI image

Roll out a new plugin container image across the daemonset.

Use this guide to bump the Klustre CSI image version (for example, when adopting a new release).

Requirements

  • Cluster-admin access.
  • The new image is pushed to a registry reachable by your cluster (GHCR or a mirror).
  • The ghcr-secret or equivalent image pull secret already contains credentials for the registry.

Steps

  1. Edit the settings ConfigMap

    The manifests and Helm chart both reference ConfigMap/klustre-csi-settings. Update the nodeImage key with the new tag:

    kubectl -n klustre-system edit configmap klustre-csi-settings
    

    Example snippet:

    data:
      nodeImage: ghcr.io/klustrefs/klustre-csi-plugin:0.1.2
    

    Save and exit.

  2. Restart the daemonset pods

    kubectl rollout restart daemonset/klustre-csi-node -n klustre-system
    
  3. Watch the rollout

    kubectl rollout status daemonset/klustre-csi-node -n klustre-system
    kubectl get pods -n klustre-system -o wide
    
  4. Verify the running image

    kubectl get pods -n klustre-system -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].image}{"\n"}{end}'
    

    Confirm all pods now report the new tag.

  5. Optional: clean up old images

    If you mirror images, remove unused tags from your registry or automation as needed.

5 - Collect diagnostics

Gather logs and cluster state for troubleshooting or support requests.

When reporting an issue, provide the following artifacts so maintainers can reproduce the problem.

1. Capture pod logs

kubectl logs -n klustre-system daemonset/klustre-csi-node -c klustre-csi --tail=200 > klustre-csi.log
kubectl logs -n klustre-system daemonset/klustre-csi-node -c node-driver-registrar --tail=200 > node-driver-registrar.log

If a specific pod is failing, target it directly:

kubectl logs -n klustre-system <pod-name> -c klustre-csi --previous

2. Describe pods and daemonset

kubectl describe daemonset klustre-csi-node -n klustre-system > klustre-csi-daemonset.txt
kubectl describe pods -n klustre-system -l app.kubernetes.io/name=klustre-csi > klustre-csi-pods.txt

3. Export relevant resources

kubectl get csidriver lustre.csi.klustrefs.io -o yaml > csidriver.yaml
kubectl get storageclass klustre-csi-static -o yaml > storageclass.yaml
kubectl get configmap klustre-csi-settings -n klustre-system -o yaml > configmap.yaml

Remove sensitive data (e.g., registry credentials) before sharing.

4. Include node information

  • Output of uname -a, lsmod | grep lustre, and the Lustre client version on affected nodes.
  • Whether the node can reach your Lustre servers (share ping or mount.lustre command output if available).
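
A small helper you can run on each affected node to capture that information in one file (the Lustre server address is a placeholder):

#!/usr/bin/env bash
# Collect node-level diagnostics for a Klustre CSI support request.
{
  uname -a
  lsmod | grep lustre
  mount.lustre --version
  ping -c 3 <lustre-mgs-address>
} > node-info.txt 2>&1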

5. Bundle and share

Package the files into an archive and attach it to your GitHub issue or support request:

tar czf klustre-diagnostics.tgz klustre-csi.log node-driver-registrar.log \
  klustre-csi-daemonset.txt klustre-csi-pods.txt csidriver.yaml storageclass.yaml configmap.yaml