Klustre Documentation

Learn how to install, operate, and extend the Klustre CSI Plugin for Lustre-backed Kubernetes workloads.

Klustre CSI Plugin is an open-source Container Storage Interface (CSI) node driver that lets Kubernetes workloads mount existing Lustre file systems. It targets high-throughput, shared ReadWriteMany use cases such as HPC, AI/ML training, and media workloads.

1 - Overview

Klustre CSI Plugin brings Lustre’s high-throughput storage into Kubernetes clusters.

Klustre CSI Plugin is an open-source Container Storage Interface (CSI) node driver that lets Kubernetes workloads mount existing Lustre file systems. The project focuses on high-performance computing (HPC), AI/ML training, and media workloads that need shared ReadWriteMany semantics with the bandwidth of Lustre.

What does it provide?

  • Kubernetes-native storage – Exposes Lustre exports via CSI objects such as CSIDriver, PersistentVolume, and PersistentVolumeClaim.
  • Node daemonset – Runs a privileged pod on every Lustre-capable worker node to perform mounts and unmounts using the host’s Lustre client.
  • Static provisioning – Administrators define PersistentVolumes that point at existing Lustre paths (for example 10.0.0.1@tcp0:/lustre-fs) and bind them to workloads.
  • Helm and raw manifests – Install using the published manifests in manifests/ or the OCI Helm chart oci://ghcr.io/klustrefs/charts/klustre-csi-plugin.
  • Cluster policy alignment – Default RBAC, topology labels, and resource requests are tuned so scheduling is constrained to nodes that actually have the Lustre client installed.

Why would I use it?

Use the Klustre CSI plugin when you:

  • Operate Lustre today (on-prem or cloud) and want to reuse those file systems inside Kubernetes without rearchitecting your storage layer.
  • Need shared volume access (ReadWriteMany) with low latency and high throughput, such as MPI workloads, model training jobs, or render farms.
  • Prefer Kubernetes-native declarative workflows for provisioning, auditing, and cleaning up Lustre mounts.
  • Want a lightweight component that focuses on node-side responsibilities without managing Lustre servers themselves.

Current scope and limitations

The project intentionally ships a minimal surface area:

  • ✅ Mounts and unmounts existing Lustre exports on demand.
  • ✅ Supports ReadWriteMany access modes and Lustre mount options (flock, user_xattr, etc.).
  • ⚠️ Does not implement controller-side operations such as CreateVolume, snapshots, expansion, or metrics, so dynamic provisioning and quotas remain outside the plugin.
  • ⚠️ Assumes the Lustre client stack is pre-installed on each worker node—images do not bundle client packages.
  • ⚠️ Requires privileged pods with SYS_ADMIN, hostPID, and hostNetwork. Clusters that forbid privileged workloads cannot run the driver.

These boundaries keep the plugin predictable while the community converges on the right APIs. Contributions that expand support (e.g., NodeGetVolumeStats) are welcomed through GitHub issues and pull requests.

Architecture at a glance

  1. DaemonSet: klustre-csi-node schedules one pod per eligible node. It mounts /sbin, /usr/sbin, /lib, /lib64, /dev, and kubelet directories from the host so the Lustre kernel modules and socket paths remain consistent.
  2. Node Driver Registrar: A sidecar container handles kubelet registration and lifecycle hooks by connecting to the CSI UNIX socket.
  3. ConfigMap-driven settings: Runtime configuration such as log level, image tags, and kubelet plugin directory live in ConfigMap klustre-csi-settings, making updates declarative.
  4. StorageClass controls placement: The default klustre-csi-static storage class enforces the lustre.csi.klustrefs.io/lustre-client=true node label to avoid scheduling onto nodes without Lustre support.
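
To see these pieces on a running cluster, you can inspect the topology label and the settings ConfigMap directly (both commands reappear in the operations and reference sections):

kubectl get nodes -L lustre.csi.klustrefs.io/lustre-client
kubectl get configmap klustre-csi-settings -n klustre-system -o yaml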

2 - Requirements

Detailed requirements for running the Klustre CSI Plugin.

Use this page when you need the full checklist (versions, node prep, and registry access) before installing Klustre CSI. The Quickstart and Introduction pages summarize this information, but this page is the canonical reference.

Kubernetes cluster

  • Kubernetes v1.20 or newer with CSI v1.5 enabled.
  • Control plane and kubelets must allow privileged pods (hostPID, hostNetwork, SYS_ADMIN capability).
  • Cluster-admin kubectl access.

Lustre-capable worker nodes

  • Install the Lustre client packages (mount.lustre, kernel modules, user-space tools) on every node that will host Lustre-backed workloads.

  • Ensure network connectivity (TCP, RDMA, etc.) from those nodes to your Lustre servers.

  • Label the nodes:

    kubectl label nodes <node-name> lustre.csi.klustrefs.io/lustre-client=true
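
  • Verify the client stack on each node (the same checks appear in the node preparation guide):

    mount.lustre --version
    lsmod | grep lustre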
    

Namespace, security, and registry access

  • Create the namespace and optional GHCR image pull secret:

    kubectl create namespace klustre-system
    kubectl create secret docker-registry ghcr-secret \
      --namespace klustre-system \
      --docker-server=ghcr.io \
      --docker-username=<github-user> \
      --docker-password=<github-token>
    
  • Label the namespace for Pod Security Admission if your cluster enforces it:

    kubectl label namespace klustre-system \
      pod-security.kubernetes.io/enforce=privileged \
      pod-security.kubernetes.io/audit=privileged \
      pod-security.kubernetes.io/warn=privileged
    

Tooling

  • kubectl, git, and optionally helm.
  • Hugo/npm/Go are only needed if you plan to contribute to this documentation site.

3 - Quickstart

Opinionated steps to install Klustre CSI and mount a Lustre share in under 10 minutes.

Looking for the fastest path from zero to a mounted Lustre volume? Follow this TL;DR workflow, then explore the detailed installation pages if you need customization.

Requirements

Before you sprint through the commands below, complete the requirements checklist. You’ll need:

  • kubectl, git, and optionally helm.
  • Worker nodes with the Lustre client installed and reachable MGS/MDS/OSS endpoints.
  • Nodes labeled lustre.csi.klustrefs.io/lustre-client=true.
  • The klustre-system namespace plus optional GHCR image pull secret (see the requirements page for the canonical commands).

Step 1 — Install Klustre CSI

export KLUSTREFS_VERSION=main
kubectl apply -k "github.com/klustrefs/klustre-csi-plugin//manifests?ref=$KLUSTREFS_VERSION"

Step 2 — Verify the daemonset

Wait for the DaemonSet rollout to complete:

kubectl rollout status daemonset/klustre-csi-node -n klustre-system --timeout=120s

Then wait for all node pods to report Ready:

kubectl wait --for=condition=Ready pod -l app=klustre-csi-node -n klustre-system --timeout=120s

Pods should schedule only on nodes labeled lustre.csi.klustrefs.io/lustre-client=true.
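
To double-check placement, list the node pods along with the nodes they landed on:

kubectl get pods -n klustre-system -o wide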

Step 3 — Mount a Lustre share

Save the following manifest as lustre-demo.yaml, updating volumeAttributes.source to your Lustre target:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: lustre-demo-pv
spec:
  storageClassName: klustre-csi-static
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  csi:
    driver: lustre.csi.klustrefs.io
    volumeHandle: lustre-demo
    volumeAttributes:
      source: 10.0.0.1@tcp0:/lustre-fs # TODO: replace with your Lustre target
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lustre-demo-pvc
spec:
  storageClassName: klustre-csi-static
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
  volumeName: lustre-demo-pv
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lustre-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lustre-demo
  template:
    metadata:
      labels:
        app: lustre-demo
    spec:
      containers:
        - name: demo
          image: busybox:1.36
          command: ["sh", "-c", "sleep 3600"]
          volumeMounts:
            - mountPath: /mnt/lustre
              name: lustre
      volumes:
        - name: lustre
          persistentVolumeClaim:
            claimName: lustre-demo-pvc

Apply the demo manifest:

kubectl apply -f lustre-demo.yaml

Confirm the Lustre mount is visible in the pod:

kubectl exec deploy/lustre-demo -- df -h /mnt/lustre

Write a test file into the mounted Lustre share:

kubectl exec deploy/lustre-demo -- sh -c 'date > /mnt/lustre/hello.txt'

Step 4 — Clean up (optional)

Remove the demo PV, PVC, and Deployment:

kubectl delete -f lustre-demo.yaml

If you only installed Klustre CSI for this quickstart and want to remove it as well, uninstall the driver:

kubectl delete -k "github.com/klustrefs/klustre-csi-plugin//manifests?ref=$KLUSTREFS_VERSION"

4 - Advanced Installation

Deep dive into Klustre CSI prerequisites and install methods.

Ready to customize your deployment? Use the pages in this section for the full checklist, installation methods, and platform-specific notes. Requirements always come first; after that, pick either manifests or Helm, then dive into the environment-specific guidelines.

4.1 - Kind Quickstart

Stand up a local Kind cluster, simulate a Lustre client, and exercise Klustre CSI Plugin without touching production clusters.

This walkthrough targets Linux hosts with Docker/Podman because Kind worker nodes run as containers. macOS and Windows hosts cannot load kernel modules required by Lustre, but you can still observe the driver boot sequence. The shim below fakes mount.lustre with tmpfs so you can run the end-to-end demo locally.

Requirements

  • Docker 20.10+ (or a compatible container runtime supported by Kind).
  • Kind v0.20+.
  • kubectl v1.27+ pointed at your Kind context.
  • A GitHub personal access token with read:packages if you plan to pull images from GitHub Container Registry via an image pull secret (optional but recommended).

1. Create a Kind cluster

Save the following Kind configuration and create the cluster:

cat <<'EOF' > kind-klustre.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.29.2
- role: worker
  image: kindest/node:v1.29.2
EOF

kind create cluster --name klustre-kind --config kind-klustre.yaml
kubectl cluster-info --context kind-klustre-kind

2. Install a Lustre shim inside the nodes

The CSI plugin shells out to mount.lustre and umount.lustre. Kind nodes do not ship with the Lustre client, so we create lightweight shims that mount a tmpfs and behave like a Lustre mount. This allows the volume lifecycle to complete even though no real Lustre server exists.

cat <<'EOF' > lustre-shim.sh
#!/bin/bash
# Shim standing in for mount.lustre: accepts the same positional arguments
# but mounts a tmpfs at the target instead of contacting a Lustre server.
set -euo pipefail
SOURCE="${1:-tmpfs}"       # Lustre source (e.g. 10.0.0.1@tcp0:/lustre-fs); ignored by the shim
TARGET="${2:-/mnt/lustre}"
shift 2 || true            # drop positional args; any extra mount options are ignored
mkdir -p "$TARGET"
if mountpoint -q "$TARGET"; then
  exit 0                   # already mounted; behave idempotently
fi
mount -t tmpfs -o size=512m tmpfs "$TARGET"
EOF

cat <<'EOF' > lustre-unmount.sh
#!/bin/bash
set -euo pipefail
TARGET="${1:?target path required}"
umount "$TARGET"
EOF
chmod +x lustre-shim.sh lustre-unmount.sh

for node in $(kind get nodes --name klustre-kind); do
  docker cp lustre-shim.sh "$node":/usr/sbin/mount.lustre
  docker cp lustre-unmount.sh "$node":/usr/sbin/umount.lustre
  docker exec "$node" chmod +x /usr/sbin/mount.lustre /usr/sbin/umount.lustre
done

3. Prepare node labels

Label the Kind worker node so it is eligible to run Lustre workloads:

kubectl label node klustre-kind-worker lustre.csi.klustrefs.io/lustre-client=true

The default klustre-csi-static storage class uses the label above inside allowedTopologies. Label any node that will run workloads needing Lustre.
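
You can confirm which nodes carry the label with:

kubectl get nodes -L lustre.csi.klustrefs.io/lustre-client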

4. Deploy Klustre CSI Plugin

Install the driver into the Kind cluster using the published Kustomize manifests:

export KLUSTREFS_VERSION=main
kubectl apply -k "github.com/klustrefs/klustre-csi-plugin//manifests?ref=$KLUSTREFS_VERSION"

Then watch the pods come up:

kubectl get pods -n klustre-system -o wide

Then wait for the daemonset rollout to complete:

kubectl rollout status daemonset/klustre-csi-node -n klustre-system --timeout=120s

Wait until the klustre-csi-node daemonset shows READY pods on every labeled node (in this lab, only the worker is labeled).

5. Mount the simulated Lustre share

Create a demo manifest that provisions a static PersistentVolume and a BusyBox deployment. Because the mount.lustre shim mounts tmpfs, data is confined to the worker node memory and disappears when the pod restarts. Replace the source string with the Lustre target you plan to use later—here it is only metadata.

Create the demo manifest with a heredoc:

cat <<'EOF' > lustre-demo.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: lustre-demo-pv
spec:
  storageClassName: klustre-csi-static
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  csi:
    driver: lustre.csi.klustrefs.io
    volumeHandle: lustre-demo
    volumeAttributes:
      # This is only metadata in the Kind lab; replace with a real target for production clusters.
      source: 10.0.0.1@tcp0:/lustre-fs
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lustre-demo-pvc
spec:
  storageClassName: klustre-csi-static
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
  volumeName: lustre-demo-pv
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lustre-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lustre-demo
  template:
    metadata:
      labels:
        app: lustre-demo
    spec:
      containers:
        - name: demo
          image: busybox:1.36
          command: ["sh", "-c", "sleep 3600"]
          volumeMounts:
            - name: lustre
              mountPath: /mnt/lustre
      volumes:
        - name: lustre
          persistentVolumeClaim:
            claimName: lustre-demo-pvc
EOF

Apply the demo manifest:

kubectl apply -f lustre-demo.yaml

Wait for the demo deployment to become available:

kubectl wait --for=condition=available deployment/lustre-demo

Confirm the Lustre (tmpfs) mount is visible in the pod:

kubectl exec deploy/lustre-demo -- df -h /mnt/lustre

Write and read back a test file:

kubectl exec deploy/lustre-demo -- sh -c 'echo "hello from $(hostname)" > /mnt/lustre/hello.txt'
kubectl exec deploy/lustre-demo -- cat /mnt/lustre/hello.txt

You should see the tmpfs mount reported by df and be able to write temporary files.

6. Clean up (optional)

Remove the demo PV, PVC, and Deployment:

kubectl delete -f lustre-demo.yaml

If you want to tear down the Kind environment as well:

kubectl delete namespace klustre-system
kind delete cluster --name klustre-kind
rm kind-klustre.yaml lustre-shim.sh lustre-unmount.sh lustre-demo.yaml

Troubleshooting

  • If the daemonset pods are stuck in ImagePullBackOff, use kubectl describe daemonset/klustre-csi-node -n klustre-system and kubectl logs daemonset/klustre-csi-node -n klustre-system -c klustre-csi to inspect the error. The image is public on ghcr.io, so no image pull secret is required; ensure your nodes can reach ghcr.io (or your proxy) from inside the cluster.
  • If the demo pod fails to mount /mnt/lustre, make sure the shim scripts were copied to every Kind node and are executable. You can rerun the docker cp ... mount.lustre / umount.lustre loop from step 2 after adding or recreating nodes.
  • Remember that tmpfs lives in RAM. Large writes in the demo workload consume memory inside the Kind worker container and disappear after pod restarts. Move to a real Lustre environment for persistent data testing.

Use this local experience to get familiar with the manifests and volume lifecycle, then follow the main Introduction guide when you are ready to operate against real Lustre backends.

4.2 - Amazon EKS Notes

Outline for deploying Klustre CSI Plugin on managed Amazon EKS clusters backed by Lustre (FSx or self-managed).

The AWS-oriented quickstart is under construction. It will cover:

  • Preparing EKS worker nodes with the Lustre client (either Amazon Linux extras or the FSx-provided packages).
  • Handling IAM roles for service accounts (IRSA) and pulling container images from GitHub Container Registry.
  • Connecting to FSx for Lustre file systems (imported or linked to S3 buckets) and exposing them via static PersistentVolumes.

Until the full write-up lands, adapt the Introduction flow by:

  1. Installing the Lustre client on your managed node groups (e.g., with yum install lustre-client in your AMI or through user data).
  2. Labeling the nodes that have Lustre access with lustre.csi.klustrefs.io/lustre-client=true.
  3. Applying the Klustre CSI manifests or Helm chart in the klustre-system namespace.
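
In condensed form, steps 2 and 3 reuse the same commands as the generic guides:

kubectl label nodes <node-name> lustre.csi.klustrefs.io/lustre-client=true
export KLUSTREFS_VERSION=main
kubectl apply -k "github.com/klustrefs/klustre-csi-plugin//manifests?ref=$KLUSTREFS_VERSION"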

Feedback on which AWS-specific topics matter most (FSx throughput tiers, PrivateLink, IAM policies, etc.) is welcome in the community discussions.

4.3 - Bare Metal Notes

Notes for operators preparing self-managed clusters before following the main introduction flow.

This guide will describe how to prepare on-prem or colocation clusters where you manage the operating systems directly (kernel modules, Lustre packages, kubelet paths, etc.). While the detailed walkthrough is in progress, you can already follow the general Introduction page and keep the following considerations in mind:

  • Ensure every node that should host Lustre-backed pods has the Lustre client packages installed via your distribution’s package manager (for example, lustre-client RPM/DEB).
  • Label those nodes with lustre.csi.klustrefs.io/lustre-client=true.
  • Grant the klustre-system namespace Pod Security admission exemptions (e.g., pod-security.kubernetes.io/enforce=privileged) because the daemonset requires hostPID, hostNetwork, and SYS_ADMIN.
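
The namespace exemption mentioned in the last bullet uses the same labels shown on the requirements page:

kubectl label namespace klustre-system \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/audit=privileged \
  pod-security.kubernetes.io/warn=privileged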

If you are interested in helping us document more advanced configurations (multiple interfaces, bonded networks, RDMA, etc.), please open an issue or discussion in the GitHub repository.

4.4 - Install with Helm

Deploy the Klustre CSI plugin using the OCI-distributed Helm chart.

The Helm chart is published under oci://ghcr.io/klustrefs/charts/klustre-csi-plugin.

1. Authenticate (optional)

If you use a GitHub personal access token for GHCR:

helm registry login ghcr.io -u <github-user>

Skip this step if anonymous pulls are permitted in your environment.

2. Install or upgrade

helm upgrade --install klustre-csi \
  oci://ghcr.io/klustrefs/charts/klustre-csi-plugin \
  --version 0.1.1 \
  --namespace klustre-system \
  --create-namespace \
  --set imagePullSecrets[0].name=ghcr-secret

Adjust the release name, namespace, and imagePullSecrets as needed. You can omit the secret if GHCR is reachable without credentials.

3. Override values

Common overrides:

  • nodePlugin.logLevel – adjust verbosity (debug, info, etc.).
  • nodePlugin.pluginDir, nodePlugin.kubeletRegistrationPath – change if /var/lib/kubelet differs on your hosts.
  • storageClass.mountOptions – add Lustre mount flags such as flock or user_xattr.
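
For small experiments you can pass these overrides inline with --set instead of a values file; for example:

helm upgrade --install klustre-csi \
  oci://ghcr.io/klustrefs/charts/klustre-csi-plugin \
  --version 0.1.1 \
  --namespace klustre-system \
  --create-namespace \
  --set nodePlugin.logLevel=debug \
  --set 'storageClass.mountOptions={flock,user_xattr,noatime}'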

View the full schema:

helm show values oci://ghcr.io/klustrefs/charts/klustre-csi-plugin --version 0.1.1

4. Check status

kubectl get pods -n klustre-system
helm status klustre-csi -n klustre-system

When pods are ready, continue with the validation instructions or deploy a workload that uses the Lustre-backed storage class.

4.5 - Install with kubectl/manifests

Apply the published Klustre CSI manifests with kubectl.

1. Install directly with Kustomize (no clone)

If you just want a default install, you don’t need to clone the repository. You can apply the published manifests directly from GitHub:

export KLUSTREFS_VERSION=main
kubectl apply -k "github.com/klustrefs/klustre-csi-plugin//manifests?ref=$KLUSTREFS_VERSION"

The manifests/ directory includes the namespace, RBAC, CSIDriver, daemonset, node service account, default StorageClass (klustre-csi-static), and settings config map.
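
To preview exactly what will be applied before touching the cluster, render the same Kustomize target locally:

kubectl kustomize "github.com/klustrefs/klustre-csi-plugin//manifests?ref=$KLUSTREFS_VERSION"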

2. Install from a local checkout

If you plan to inspect or customize the manifests, clone the repo and work from a local checkout:

git clone https://github.com/klustrefs/klustre-csi-plugin.git
cd klustre-csi-plugin

You can perform the same default install from the local checkout:

kubectl apply -k manifests

3. Customize with a Kustomize overlay (optional)

To change defaults such as logLevel, nodeImage, or the CSI endpoint path without editing the base files, create a small overlay that patches the settings config map.

Create overlays/my-cluster/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../manifests

patchesStrategicMerge:
  - configmap-klustre-csi-settings-patch.yaml

Create overlays/my-cluster/configmap-klustre-csi-settings-patch.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: klustre-csi-settings
  namespace: klustre-system
data:
  logLevel: debug
  nodeImage: ghcr.io/klustrefs/klustre-csi-plugin:0.1.1

Then apply your overlay instead of the base:

kubectl apply -k overlays/my-cluster

You can add additional patches in the overlay (for example, to tweak the daemonset or StorageClass) as your cluster needs grow.

4. Verify rollout

kubectl get pods -n klustre-system -o wide
kubectl describe daemonset klustre-csi-node -n klustre-system
kubectl logs daemonset/klustre-csi-node -n klustre-system -c klustre-csi

After the daemonset is healthy on all Lustre-capable nodes, continue with the validation steps or jump to the sample workload.

5 - Operations

Day-2 procedures for running the Klustre CSI Plugin.

Use these task guides when you need to change cluster settings, roll out new plugin versions, or troubleshoot node issues. Each page focuses on one repeatable operation so you can jump straight to the steps you need.

5.1 - Nodes and Volumes

How Klustre CSI prepares nodes and exposes Lustre volumes.

Learn about node prerequisites, kubelet integration, and how static Lustre volumes are represented in Kubernetes.

5.1.1 - Nodes

Prepare and operate Kubernetes nodes that run the Klustre CSI daemonset.

Klustre CSI only schedules on nodes that can mount Lustre exports. Use the topics below to prepare those nodes, understand what the daemonset mounts from the host, and keep kubelet integration healthy.

5.1.1.1 - Node preparation

Install the Lustre client, label nodes, and grant the privileges required by Klustre CSI.

Install the Lustre client stack

Every node that runs Lustre-backed pods must have:

  • mount.lustre and umount.lustre binaries (via lustre-client RPM/DEB).
  • Kernel modules compatible with your Lustre servers.
  • Network reachability to the Lustre MGS/MDS/OSS endpoints.

Verify installation:

mount.lustre --version
lsmod | grep lustre

Label nodes

The default storage class and daemonset use the label lustre.csi.klustrefs.io/lustre-client=true.

kubectl label nodes <node-name> lustre.csi.klustrefs.io/lustre-client=true

Remove the label when a node no longer has Lustre access:

kubectl label nodes <node-name> lustre.csi.klustrefs.io/lustre-client-

Allow privileged workloads

Klustre CSI pods require:

  • privileged: true, allowPrivilegeEscalation: true
  • hostPID: true, hostNetwork: true
  • HostPath mounts for /var/lib/kubelet, /dev, /sbin, /usr/sbin, /lib, and /lib64

Label the namespace with Pod Security Admission overrides:

kubectl create namespace klustre-system
kubectl label namespace klustre-system \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/audit=privileged \
  pod-security.kubernetes.io/warn=privileged

Maintain consistency

  • Keep AMIs or OS images in sync so every node has the same Lustre client version.
  • If you use autoscaling groups, bake the client packages into your node image or run a bootstrap script before kubelet starts.
  • Automate label management with infrastructure-as-code (e.g., Cluster API, Ansible) so the right nodes receive the lustre-client=true label on join/leave events.

5.1.1.2 - Node integration flow

Understand how Klustre CSI interacts with kubelet and the host filesystem.

Daemonset host mounts

DaemonSet/klustre-csi-node mounts the following host paths:

  • /var/lib/kubelet/plugins and /var/lib/kubelet/pods – required for CSI socket registration and mount propagation.
  • /dev – ensures device files (if any) are accessible when mounting Lustre.
  • /sbin, /usr/sbin, /lib, /lib64 – expose the host’s Lustre client binaries and libraries to the container.

If your kubelet uses custom directories, update pluginDir and registrationDir in the settings ConfigMap.
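
For example, if kubelet runs from /data/kubelet (an illustrative path, not a default), the relevant keys in klustre-csi-settings might look like this sketch; keep the daemonset hostPath volumes in sync with whatever paths you choose:

data:
  # /data/kubelet is a hypothetical kubelet root; substitute your cluster's actual path
  pluginDir: /data/kubelet/plugins/lustre.csi.klustrefs.io
  registrationDir: /data/kubelet/plugins_registry
  csiEndpoint: unix:///data/kubelet/plugins/lustre.csi.klustrefs.io/csi.sock
  driverRegistrationArg: --kubelet-registration-path=/data/kubelet/plugins/lustre.csi.klustrefs.io/csi.sock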

CSI socket lifecycle

  1. The node plugin listens on csiEndpoint (defaults to /var/lib/kubelet/plugins/lustre.csi.klustrefs.io/csi.sock).
  2. The node-driver-registrar sidecar registers that socket with kubelet via registrationDir.
  3. Kubelet uses the UNIX socket to call NodePublishVolume and NodeUnpublishVolume when pods mount or unmount PVCs.

If the daemonset does not come up or kubelet cannot reach the socket, run:

kubectl describe daemonset klustre-csi-node -n klustre-system
kubectl logs -n klustre-system daemonset/klustre-csi-node -c klustre-csi

PATH and library overrides

The containers inherit PATH and LD_LIBRARY_PATH values that point at the host bind mounts. If your Lustre client lives elsewhere, override:

  • nodePlugin.pathEnv
  • nodePlugin.ldLibraryPath

via Helm values or by editing the daemonset manifest.
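
As a sketch, assuming the Lustre client was installed under /opt/lustre on the host (a hypothetical location exposed through extraVolumes/extraVolumeMounts), the Helm overrides might look like:

nodePlugin:
  # /host/opt/lustre/... assumes the host's /opt/lustre is bind-mounted into the pod; adjust to your layout
  pathEnv: /host/opt/lustre/sbin:/host/usr/sbin:/host/sbin:/usr/sbin:/usr/bin:/sbin:/bin
  ldLibraryPath: /host/opt/lustre/lib64:/host/lib64:/host/usr/lib64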

Health signals

  • Kubernetes events referencing lustre.csi.klustrefs.io indicate mount/unmount activity.
  • kubectl get pods -n klustre-system -o wide should show one pod per labeled node.
  • A missing pod usually means the node label is absent or taints/tolerations are mismatched.

5.1.2 - Volumes

Model Lustre exports as PersistentVolumes and understand RWX behavior.

Klustre CSI focuses on static provisioning: you point a PV at an existing Lustre export, bind it to a PVC, and mount it into pods. Explore the topics below for the manifest workflow and mount attribute details.

5.1.2.1 - Static PV workflow

Define PersistentVolumes and PersistentVolumeClaims that reference Lustre exports.

1. Create the PersistentVolume

apiVersion: v1
kind: PersistentVolume
metadata:
  name: lustre-static-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: klustre-csi-static
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: lustre.csi.klustrefs.io
    volumeHandle: lustre-static-pv
    volumeAttributes:
      source: 10.0.0.1@tcp0:/lustre-fs
      mountOptions: flock,user_xattr

  • volumeHandle just needs to be unique within the cluster; it is not used by the Lustre backend.
  • volumeAttributes.source carries the Lustre management target and filesystem path.

2. Bind with a PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lustre-static-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: klustre-csi-static
  volumeName: lustre-static-pv
  resources:
    requests:
      storage: 10Gi

Even though Lustre capacity is managed outside Kubernetes, the storage field should match the PV so the binder succeeds.

3. Mount from workloads

volumes:
  - name: lustre
    persistentVolumeClaim:
      claimName: lustre-static-pvc
containers:
  - name: app
    image: busybox
    volumeMounts:
      - name: lustre
        mountPath: /mnt/lustre

Multiple pods can reference the same PVC because Lustre supports ReadWriteMany. Pods must schedule on labeled nodes (lustre.csi.klustrefs.io/lustre-client=true).

4. Cleanup

Deleting the PVC detaches pods but the PV remains because the reclaim policy is Retain. Manually delete the PV when you no longer need it.

5.1.2.2 - Volume attributes and mount options

Map Kubernetes fields to Lustre mount flags and behaviors.

volumeAttributes

  • source – Host(s) and filesystem path given to mount.lustre. Example: 10.0.0.1@tcp0:/lustre-fs
  • mountOptions – Comma-separated Lustre mount flags. Example: flock,user_xattr

Additional keys (e.g., subdir) can be added in the future; the driver simply passes the map to the Lustre helper script.

Storage class tuning

See the storage class reference for details on:

  • allowedTopologies – keep workloads on nodes with the Lustre label.
  • reclaimPolicy – typically Retain for static PVs.
  • mountOptions – defaults to flock and user_xattr, but you can add other options such as noatime or nolock.

Override mount options per volume by setting volumeAttributes.mountOptions. This is useful when a subset of workloads needs different locking semantics.
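
For example, a PV that drops flock and adds noatime differs from the static PV example above only in its csi stanza (the volume handle below is illustrative):

  csi:
    driver: lustre.csi.klustrefs.io
    volumeHandle: lustre-noatime-pv   # illustrative handle
    volumeAttributes:
      source: 10.0.0.1@tcp0:/lustre-fs
      mountOptions: user_xattr,noatime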

Access modes

  • Use ReadWriteMany for shared Lustre volumes.
  • ReadOnlyMany is supported when you only need read access.
  • ReadWriteOnce offers no benefit with Lustre; prefer RWX.

Lifecycle reminders

  • Klustre CSI does not provision or delete Lustre exports. Ensure the server-side directory exists and has the correct permissions.
  • Kubernetes capacity values are advisory. Quotas should be enforced on the Lustre server.
  • PersistentVolumeReclaimPolicy=Retain keeps PVs around after PVC deletion; clean them up manually to avoid dangling objects.

5.2 - Maintenance and Upgrade

Keep Klustre CSI healthy during node drains and Kubernetes upgrades.

Patch nodes, rotate images, or upgrade the CSI plugin without interrupting workloads.

Select the maintenance guide you need from the navigation—node checklist, upgrade plan, and future topics all live underneath this page.

5.2.1 - Node maintenance checklist

Drain nodes safely and ensure Klustre CSI pods return to service.

1. Cordon and drain

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

Because klustre-csi-node is a DaemonSet, kubectl drain does not evict its pod (hence --ignore-daemonsets); draining still ensures your application workloads move off the node before the reboot.

2. Verify daemonset status

kubectl get pods -n klustre-system -o wide | grep <node>

Expect the daemonset pod to terminate when the node drains and recreate once the node returns.

3. Patch or reboot the node

  • Apply OS updates, reboot, or swap hardware as needed.
  • Ensure the Lustre client packages remain installed (validate with mount.lustre --version).

4. Uncordon and relabel if necessary

kubectl uncordon <node>

If the node lost the lustre.csi.klustrefs.io/lustre-client=true label, reapply it after verifying Lustre connectivity.

5. Watch for daemonset rollout

kubectl rollout status daemonset/klustre-csi-node -n klustre-system

6. Confirm workloads recover

Use kubectl get pods for namespaces that rely on Lustre PVCs to ensure pods are running and mounts succeeded.

Tips

  • For large clusters, drain one Lustre node at a time to keep mounts available.
  • If kubectl drain hangs due to pods using Lustre PVCs, identify them with kubectl get pods --all-namespaces -o wide | grep <node> and evict manually.

5.2.2 - Upgrade guide

Plan Klustre CSI version upgrades alongside Kubernetes changes.

1. Review release notes

Check the klustre-csi-plugin GitHub releases for breaking changes, minimum Kubernetes versions, and image tags.

2. Update the image reference

  • Helm users: bump image.tag and nodePlugin.registrar.image.tag in your values file, then run helm upgrade.
  • Manifest users: edit manifests/configmap-klustre-csi-settings.yaml (nodeImage, registrarImage) and reapply the manifests.

See Update the Klustre CSI image for detailed steps.

3. Roll out sequentially

kubectl rollout restart daemonset/klustre-csi-node -n klustre-system
kubectl rollout status daemonset/klustre-csi-node -n klustre-system

The daemonset restarts one node at a time, keeping existing mounts available.

4. Coordinate with Kubernetes upgrades

When upgrading kubelet:

  1. Follow the node maintenance checklist for each node.
  2. Upgrade the node OS/kubelet.
  3. Verify the daemonset pod recreates successfully before moving to the next node.

5. Validate workloads

  • Spot-check pods that rely on Lustre PVCs (kubectl exec into them and run df -h /mnt/lustre).
  • Ensure no stale FailedMount events exist.

Rollback

If the new version misbehaves:

  1. Revert nodeImage and related settings to the previous tag.
  2. Run kubectl rollout restart daemonset/klustre-csi-node -n klustre-system.
  3. Inspect logs to confirm the old version is running.

5.3 - Label Lustre-capable nodes

Apply and verify the topology label used by the Klustre storage class.

The default klustre-csi-static storage class restricts scheduling to nodes labeled lustre.csi.klustrefs.io/lustre-client=true. Use this runbook whenever you add or remove nodes from the Lustre client pool.

Requirements

  • Cluster-admin access with kubectl.
  • Nodes already have the Lustre client packages installed and can reach your Lustre servers.

Steps

  1. Identify nodes that can mount Lustre

    kubectl get nodes -o wide
    

    Cross-reference with your infrastructure inventory or automation outputs to find the node names that have Lustre connectivity.

  2. Apply the label

    kubectl label nodes <node-name> lustre.csi.klustrefs.io/lustre-client=true
    

    Repeat for each eligible node. Use --overwrite if the label already exists but the value should change.

  3. Verify

    kubectl get nodes -L lustre.csi.klustrefs.io/lustre-client
    

    Ensure only the nodes with Lustre access show true. Remove the label from nodes that lose access:

    kubectl label nodes <node-name> lustre.csi.klustrefs.io/lustre-client-
    
  4. Confirm DaemonSet placement

    kubectl get pods -n klustre-system -o wide \
      -l app.kubernetes.io/name=klustre-csi
    

    Pods from the klustre-csi-node daemonset should exist only on labeled nodes. If you see pods on unlabeled nodes, check the nodeSelector and tolerations in the daemonset spec.

5.4 - Update the Klustre CSI image

Roll out a new plugin container image across the daemonset.

Use this guide to bump the Klustre CSI image version (for example, when adopting a new release).

Requirements

  • Cluster-admin access.
  • The new image is pushed to a registry reachable by your cluster (GHCR or a mirror).
  • The ghcr-secret or equivalent image pull secret already contains credentials for the registry.

Steps

  1. Edit the settings ConfigMap

    The manifests and Helm chart both reference ConfigMap/klustre-csi-settings. Update the nodeImage key with the new tag:

    kubectl -n klustre-system edit configmap klustre-csi-settings
    

    Example snippet:

    data:
      nodeImage: ghcr.io/klustrefs/klustre-csi-plugin:0.1.2
    

    Save and exit.

  2. Restart the daemonset pods

    kubectl rollout restart daemonset/klustre-csi-node -n klustre-system
    
  3. Watch the rollout

    kubectl rollout status daemonset/klustre-csi-node -n klustre-system
    kubectl get pods -n klustre-system -o wide
    
  4. Verify the running image

    kubectl get pods -n klustre-system -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].image}{"\n"}{end}'
    

    Confirm all pods now report the new tag.

  5. Optional: clean up old images

    If you mirror images, remove unused tags from your registry or automation as needed.

5.5 - Collect diagnostics

Gather logs and cluster state for troubleshooting or support requests.

When reporting an issue, provide the following artifacts so maintainers can reproduce the problem.

1. Capture pod logs

kubectl logs -n klustre-system daemonset/klustre-csi-node -c klustre-csi --tail=200 > klustre-csi.log
kubectl logs -n klustre-system daemonset/klustre-csi-node -c node-driver-registrar --tail=200 > node-driver-registrar.log

If a specific pod is failing, target it directly:

kubectl logs -n klustre-system <pod-name> -c klustre-csi --previous

2. Describe pods and daemonset

kubectl describe daemonset klustre-csi-node -n klustre-system > klustre-csi-daemonset.txt
kubectl describe pods -n klustre-system -l app.kubernetes.io/name=klustre-csi > klustre-csi-pods.txt

3. Export relevant resources

kubectl get csidriver lustre.csi.klustrefs.io -o yaml > csidriver.yaml
kubectl get storageclass klustre-csi-static -o yaml > storageclass.yaml
kubectl get configmap klustre-csi-settings -n klustre-system -o yaml > configmap.yaml

Remove sensitive data (e.g., registry credentials) before sharing.

4. Include node information

  • Output of uname -a, lsmod | grep lustre, and the Lustre client version on affected nodes.
  • Whether the node can reach your Lustre servers (share ping or mount.lustre command output if available).

5. Bundle and share

Package the files into an archive and attach it to your GitHub issue or support request:

tar czf klustre-diagnostics.tgz klustre-csi.log node-driver-registrar.log \
  klustre-csi-daemonset.txt klustre-csi-pods.txt csidriver.yaml storageclass.yaml configmap.yaml

6 - Monitoring

Upcoming documentation on collecting metrics and logs for Klustre CSI.

This section will capture how to observe Klustre CSI Plugin (log scraping, Prometheus exporters, Grafana dashboards).

Tracking issue: #TODO.

7 - Reference

Configuration and manifest details for the Klustre CSI Plugin.

Look up manifests, configuration flags, storage classes, and Helm values. This section complements the how-to guides by surfacing the exact fields you can tune.

7.1 - Settings ConfigMap

Values consumed by the Klustre CSI node daemonset via ConfigMap.

ConfigMap/klustre-csi-settings provides runtime configuration to the node daemonset. Each key maps to either an environment variable or a command-line argument.

  • csiEndpoint – UNIX socket path used by the node plugin. Must align with kubelet’s plugin directory. Default: unix:///var/lib/kubelet/plugins/lustre.csi.klustrefs.io/csi.sock
  • driverRegistrationArg – Argument passed to the node-driver-registrar sidecar. Default: --kubelet-registration-path=/var/lib/kubelet/plugins/lustre.csi.klustrefs.io/csi.sock
  • logLevel – Verbosity for the Klustre CSI binary (info, debug, etc.). Default: info
  • nodeImage – Container image for the Klustre CSI node plugin. Default: ghcr.io/klustrefs/klustre-csi-plugin:0.0.1
  • pluginDir – HostPath where CSI sockets live. Default: /var/lib/kubelet/plugins/lustre.csi.klustrefs.io
  • priorityClassName – Priority class applied to the daemonset pods. Default: system-node-critical
  • registrarImage – Container image for the node-driver-registrar sidecar. Default: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.1
  • registrationDir – HostPath where kubelet expects CSI driver registration files. Default: /var/lib/kubelet/plugins_registry

To update any field:

kubectl -n klustre-system edit configmap klustre-csi-settings
kubectl rollout restart daemonset/klustre-csi-node -n klustre-system

Ensure any customized paths (e.g., pluginDir) match the volumes mounted in the daemonset spec.

7.2 - Storage class parameters

Details of the default static storage class shipped with Klustre CSI.

The manifests bundle a StorageClass named klustre-csi-static. It targets pre-provisioned Lustre exports and enforces node placement.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: klustre-csi-static
allowedTopologies:
- matchLabelExpressions:
  - key: lustre.csi.klustrefs.io/lustre-client
    values:
    - "true"
mountOptions:
- flock
- user_xattr
provisioner: lustre.csi.klustrefs.io
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

Field summary

  • provisioner – Must stay lustre.csi.klustrefs.io so PVCs bind to the Klustre driver.
  • allowedTopologies – Uses the lustre.csi.klustrefs.io/lustre-client=true label to ensure only Lustre-capable nodes run workloads. Update the label key/value if you customize node labels.
  • mountOptions – Defaults to flock and user_xattr. Add or remove Lustre options as needed (e.g., nolock, noatime).
  • reclaimPolicy – Retain keeps the PV around when a PVC is deleted, which is typical for statically provisioned Lustre shares.
  • volumeBindingMode – WaitForFirstConsumer defers binding until a pod is scheduled, ensuring topology constraints match the consuming workload.

Customization tips

  • Create additional storage classes for different mount flag sets or topology labels. Ensure each class references the same provisioner.
  • If you disable topology constraints, remove allowedTopologies, but be aware that pods might schedule onto nodes without Lustre access.
  • For multi-cluster environments, consider namespacing storage class names (e.g., klustre-csi-static-prod).
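
Building on the first tip above, a second class that only changes the mount flags might look like this sketch (the class name is illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: klustre-csi-static-noatime # illustrative name
allowedTopologies:
- matchLabelExpressions:
  - key: lustre.csi.klustrefs.io/lustre-client
    values:
    - "true"
mountOptions:
- flock
- user_xattr
- noatime
provisioner: lustre.csi.klustrefs.io
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer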

7.3 - Helm values

Commonly overridden values for the Klustre CSI Helm chart.

The chart is published as oci://ghcr.io/klustrefs/charts/klustre-csi-plugin. Run helm show values oci://ghcr.io/klustrefs/charts/klustre-csi-plugin --version 0.1.1 for the full schema. This page summarizes frequently tuned fields.

  • image.repository – Node plugin image repository. Default: ghcr.io/klustrefs/klustre-csi-plugin
  • image.tag – Node plugin tag. Default: 0.1.1
  • imagePullSecrets – Global image pull secrets applied to all pods. Default: []
  • nodePlugin.pluginDir – Host path for CSI sockets. Default: /var/lib/kubelet/plugins/lustre.csi.klustrefs.io
  • nodePlugin.kubeletRegistrationPath – Path passed to kubelet registrar. Default: /var/lib/kubelet/plugins/lustre.csi.klustrefs.io/csi.sock
  • nodePlugin.logLevel – Verbosity for the node binary. Default: info
  • nodePlugin.resources – Container resource settings. Default: requests 50m/50Mi, limits 200m/200Mi
  • nodePlugin.registrar.image.repository – Sidecar repository. Default: registry.k8s.io/sig-storage/csi-node-driver-registrar
  • nodePlugin.registrar.image.tag – Sidecar tag. Default: v2.10.1
  • nodePlugin.extraVolumes / extraVolumeMounts – Inject custom host paths (e.g., additional libraries). Default: []
  • storageClass.create – Toggle creation of klustre-csi-static. Default: true
  • storageClass.allowedTopologies[0].matchLabelExpressions[0].key – Node label key for placement. Default: lustre.csi.klustrefs.io/lustre-client
  • storageClass.mountOptions – Default Lustre mount flags. Default: ["flock","user_xattr"]
  • settingsConfigMap.create – Controls whether the chart provisions klustre-csi-settings. Default: true
  • serviceAccount.create – Create the node service account automatically. Default: true
  • rbac.create – Provision RBAC resources (ClusterRole/Binding). Default: true

Example override file

image:
  tag: 0.1.2
nodePlugin:
  logLevel: debug
  extraVolumeMounts:
    - name: host-etc
      mountPath: /host/etc
  extraVolumes:
    - name: host-etc
      hostPath:
        path: /etc
storageClass:
  mountOptions:
    - flock
    - user_xattr
    - noatime

Install with:

helm upgrade --install klustre-csi \
  oci://ghcr.io/klustrefs/charts/klustre-csi-plugin \
  --version 0.1.1 \
  --namespace klustre-system \
  --create-namespace \
  -f overrides.yaml

7.4 - Parameter reference

CLI flags and environment variables for the Klustre CSI node daemonset.

The node daemonset containers accept a small set of flags and environment variables. Most values are sourced from ConfigMap/klustre-csi-settings. Use this table as a quick lookup when you need to override behavior.

  • klustre-csi --node-id (env: KUBE_NODE_NAME) – Unique identifier sent to the CSI sidecars and kubelet; normally the Kubernetes node name. Default source: Downward API (spec.nodeName).
  • klustre-csi --endpoint (env: CSI_ENDPOINT) – Path to the CSI UNIX socket served by the node plugin; must match the kubelet registration path. Default source: csiEndpoint in the settings ConfigMap.
  • klustre-csi --log-level (env: LOG_LEVEL) – Driver verbosity (error, warn, info, debug, trace). Default source: logLevel in the settings ConfigMap.
  • PATH (env: PATH) – Ensures mount.lustre, umount.lustre, and related tools are found inside the container. Default: /host/usr/sbin:/host/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin.
  • LD_LIBRARY_PATH (env: LD_LIBRARY_PATH) – Points to host library directories required by the Lustre client binaries. Default: /host/lib:/host/lib64:/host/usr/lib:/host/usr/lib64.
  • node-driver-registrar --csi-address – Location of the CSI socket inside the pod. Default: /csi/csi.sock.
  • node-driver-registrar --kubelet-registration-path – Host path where kubelet looks for CSI drivers. Default source: driverRegistrationArg from the settings ConfigMap.

How to override

  1. Edit the settings ConfigMap:

    kubectl -n klustre-system edit configmap klustre-csi-settings
    

    Change csiEndpoint, driverRegistrationArg, or logLevel as needed.

  2. If you must customize PATH or LD_LIBRARY_PATH, edit the daemonset directly (Helm users can override nodePlugin.pathEnv or nodePlugin.ldLibraryPath values).

  3. Restart the daemonset pods:

    kubectl rollout restart daemonset/klustre-csi-node -n klustre-system
    

Notes

  • --node-id should stay aligned with the Kubernetes node name unless you have a strong reason to deviate (CSI treats it as the authoritative identifier).
  • Changing CSI_ENDPOINT or driverRegistrationArg requires matching host path mounts in the daemonset (pluginDir, registrationDir).
  • Increasing LOG_LEVEL to debug or trace is useful for troubleshooting but may emit sensitive information—reset it after collecting logs.

8 - Tutorials

End-to-end scenarios that combine Klustre CSI with real workloads.

Follow these guides when you want more than a single command—each tutorial walks through a complete workflow that exercises Klustre CSI Plugin alongside common Kubernetes patterns.

Open an issue in github.com/klustrefs/klustre-csi-plugin if you’d like to see another workflow documented.

8.1 - Static Lustre volume demo

Provision a Lustre-backed scratch space, populate it, and consume it from a training deployment.

Use these snippets as starting points for demos, CI smoke tests, or reproduction cases when you report issues.

Static Lustre volume demo

Save the following manifest as lustre-demo.yaml. It creates a static PV/PVC pointing at 10.0.0.1@tcp0:/lustre-fs and mounts it in a BusyBox deployment.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: lustre-static-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: klustre-csi-static
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: lustre.csi.klustrefs.io
    volumeHandle: lustre-static-pv
    volumeAttributes:
      source: 10.0.0.1@tcp0:/lustre-fs
      mountOptions: flock,user_xattr
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lustre-static-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: klustre-csi-static
  resources:
    requests:
      storage: 10Gi
  volumeName: lustre-static-pv
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lustre-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lustre-demo
  template:
    metadata:
      labels:
        app: lustre-demo
    spec:
      containers:
        - name: app
          image: busybox
          command: ["sleep", "infinity"]
          volumeMounts:
            - name: lustre-share
              mountPath: /mnt/lustre
      volumes:
        - name: lustre-share
          persistentVolumeClaim:
            claimName: lustre-static-pvc

Validate

kubectl apply -f lustre-demo.yaml
kubectl exec deploy/lustre-demo -- df -h /mnt/lustre
kubectl exec deploy/lustre-demo -- sh -c 'echo "hello $(date)" > /mnt/lustre/hello.txt'

Delete when finished:

kubectl delete -f lustre-demo.yaml

Pod-level health probe

Use a simple read/write loop to verify Lustre connectivity inside a pod:

apiVersion: v1
kind: Pod
metadata:
  name: lustre-probe
spec:
  containers:
  - name: probe
    image: busybox
    command: ["sh", "-c", "while true; do date >> /mnt/lustre/probe.log && tail -n1 /mnt/lustre/probe.log; sleep 30; done"]
    volumeMounts:
    - name: lustre-share
      mountPath: /mnt/lustre
  volumes:
  - name: lustre-share
    persistentVolumeClaim:
      claimName: lustre-static-pvc

After applying the pod manifest, run kubectl logs pod/lustre-probe -f to inspect the periodic writes.

Where to find more

  • manifests/ directory in the GitHub repo for installation YAML.
  • Kind quickstart for a self-contained lab, including the shim scripts used to emulate Lustre mounts.

8.2 - Share a dataset between prep and training jobs

Provision a Lustre-backed scratch space, populate it, and consume it from a training deployment.

This tutorial wires a simple data pipeline together:

  1. Create a static Lustre PersistentVolume and bind it to a PersistentVolumeClaim.
  2. Run a data-prep job that writes artifacts into the Lustre mount.
  3. Start a training deployment that reads the prepared data.
  4. Validate shared access and clean up.

Requirements

  • Klustre CSI Plugin installed and verified (see the Introduction).
  • An existing Lustre export, e.g., 10.0.0.1@tcp0:/lustre-fs.
  • kubectl access with cluster-admin privileges.

1. Define the storage objects

Save the following manifest as lustre-pipeline.yaml. Update volumeAttributes.source to match your Lustre target and tweak mountOptions if required.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: lustre-scratch-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  storageClassName: klustre-csi-static
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: lustre.csi.klustrefs.io
    volumeHandle: lustre-scratch
    volumeAttributes:
      source: 10.0.0.1@tcp0:/lustre-fs
      mountOptions: flock,user_xattr
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lustre-scratch-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: klustre-csi-static
  resources:
    requests:
      storage: 100Gi
  volumeName: lustre-scratch-pv

Apply it:

kubectl apply -f lustre-pipeline.yaml

Confirm the PVC is bound:

kubectl get pvc lustre-scratch-pvc

2. Run the data-prep job

Append the job definition to lustre-pipeline.yaml or save it separately as dataset-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: dataset-prep
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: writer
          image: busybox
          command:
            - sh
            - -c
            - |
              echo "Generating synthetic dataset..."
              RUNDIR=/mnt/lustre/datasets/run-$(date +%s)
              mkdir -p "$RUNDIR"
              dd if=/dev/urandom of=$RUNDIR/dataset.bin bs=1M count=5
              echo "ready" > $RUNDIR/status.txt
              ln -sfn "$RUNDIR" /mnt/lustre/datasets/current
          volumeMounts:
            - name: lustre
              mountPath: /mnt/lustre
      volumes:
        - name: lustre
          persistentVolumeClaim:
            claimName: lustre-scratch-pvc

Apply and monitor (substitute the file name you used above):

kubectl apply -f dataset-job.yaml  # or lustre-pipeline.yaml
kubectl logs job/dataset-prep

Ensure the job completes successfully before moving on.
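
If you prefer to block until the job finishes instead of polling logs:

kubectl wait --for=condition=complete job/dataset-prep --timeout=300s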

3. Launch the training deployment

Save the following manifest as trainer-deployment.yaml. The deployment tails the generated status file and lists artifacts to demonstrate read access.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: trainer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trainer
  template:
    metadata:
      labels:
        app: trainer
    spec:
      containers:
        - name: trainer
          image: busybox
          command:
            - sh
            - -c
            - |
              ls -lh /mnt/lustre/datasets/current
              tail -f /mnt/lustre/datasets/current/status.txt
          volumeMounts:
            - name: lustre
              mountPath: /mnt/lustre
      volumes:
        - name: lustre
          persistentVolumeClaim:
            claimName: lustre-scratch-pvc

Apply and inspect logs:

kubectl apply -f trainer-deployment.yaml
kubectl logs deploy/trainer

You should see the dataset files created by the job alongside the status text.

4. Cleanup

When finished, remove all resources. Because the PV uses Retain, data remains on the Lustre share; delete or archive it manually if desired.

kubectl delete deployment trainer
kubectl delete job dataset-prep
kubectl delete pvc lustre-scratch-pvc
kubectl delete pv lustre-scratch-pv

Next steps

  • Adapt the job and deployment containers to your actual preprocessing/training images.
  • Add a CronJob to refresh datasets on a schedule.
  • Use the Kind quickstart if you need a disposable lab cluster to iterate on this flow.

9 - Contribution Guidelines

How to propose doc changes for the Klustre website.

These instructions target the klustrefs/website repository. Follow them whenever you update content, navigation, or styling.

Toolchain

  • Static site generator: Hugo extended edition, v0.146.0 or newer.
  • Theme: Docsy.
  • Package manager: npm (used for Docsy assets).
  • Hosting: Netlify deploy previews triggered from pull requests.

Contribution workflow

  1. Fork https://github.com/klustrefs/website.
  2. Create a feature branch (docs/my-topic).
  3. Make your edits (Markdown lives under content/en/...).
  4. Run npm install once, then npm run dev or hugo server to preview locally.
  5. Commit with clear messages (docs: add getting started guide).
  6. Open a pull request against main. If the work is in progress, prefix the title with WIP.
  7. Ensure the Netlify preview (deploy/netlify — Deploy preview ready!) renders correctly before requesting review.

All PRs need review by a maintainer before merging. We follow the standard GitHub review process.

Editing tips

  • Prefer short paragraphs and actionable steps. Use ordered lists for sequences and fenced code blocks with language hints (```bash).
  • Keep front matter tidy: title, description, and weight control sidebar order.
  • When adding a new page, update any index pages or navigation lists that should reference it (for example, the Introduction landing page).
  • Avoid using draft: true in front matter; draft pages don’t deploy to previews. Instead keep work-in-progress content on a branch until it’s review-ready.

Updating a page from the browser

Use the Edit this page link:

  1. Click the link in the top right corner of any doc page.
  2. GitHub opens the corresponding file in your fork (you may need to create/update your fork first).
  3. Make the edit, describe the change, and open a pull request. Review the Netlify preview before merging.

Local preview

git clone git@github.com:<your-username>/website.git
cd website
npm install
hugo server --bind 0.0.0.0 --buildDrafts=false

Visit http://localhost:1313 to preview. Hugo watches for changes and automatically reloads the browser.

Filing issues

If you notice an error but can’t fix it immediately, open an issue in klustrefs/website. Include:

  • URL of the affected page.
  • Description of the problem (typo, outdated instructions, missing section, etc.).
  • Optional suggestion or screenshot.

Style reminders

  • Use present tense (“Run kubectl get pods”) and active voice.
  • Prefer metric units and standard Kubernetes terminology.
  • Provide prerequisites and cleanup steps for any workflow.
  • When referencing code or file paths, wrap them in backticks (kubectl, /etc/hosts).

Need help? Ask in the GitHub discussions or on the Klustre Slack workspace.

10 - Concepts

Understand the Klustre CSI architecture, components, and data flow.

This page explains how the Klustre CSI Plugin is structured so you can reason about what happens when volumes are created, mounted, or removed.

High-level architecture

Klustre CSI Plugin implements the CSI node service for Lustre file systems. It focuses on node-side operations only:

Kubernetes API           Kubelet on each node
--------------           --------------------
PersistentVolume   ->    NodePublishVolume -> mount.lustre -> Pod mount
PersistentVolumeClaim    NodeUnpublishVolume -> umount.lustre

There is no controller deployment or dynamic provisioning. Instead, administrators define static PersistentVolume objects that reference existing Lustre exports. The plugin mounts those exports inside pods when a PersistentVolumeClaim is bound to a workload.

Components

  • DaemonSet (klustre-csi-node) – Runs on every node that has the Lustre client installed. It includes:
    • klustre-csi container: the Rust-based CSI node driver.
    • node-driver-registrar sidecar: registers the driver socket with kubelet and handles node-level CSIDriver metadata.
  • Settings ConfigMap – Injects runtime configuration (socket paths, image tags, log level).
  • StorageClass (klustre-csi-static) – Encapsulates mount options, retain policy, and topology constraints.
  • CSIDriver object – Advertises the driver name lustre.csi.klustrefs.io, declares that controller operations aren’t required, and enables podInfoOnMount.

Data flow

  1. Cluster admin creates a PV that sets csi.driver=lustre.csi.klustrefs.io and volumeAttributes.source=<HOST>@tcp:/fs.
  2. PVC binds to the PV, usually via the klustre-csi-static storage class.
  3. A pod is scheduled onto a labeled node (lustre.csi.klustrefs.io/lustre-client=true).
  4. kubelet invokes NodePublishVolume:
    • CSI driver reads the source attribute.
    • It spawns mount.lustre inside the container with the host’s /sbin and /usr/sbin bind-mounted.
    • The Lustre client mounts into the pod’s volumeMount.
  5. Pod uses the mounted path (read/write simultaneously across multiple pods).
  6. When the pod terminates and the PVC is detached, kubelet calls NodeUnpublishVolume, which triggers umount.lustre.

Labels and topology

The driver relies on a node label to ensure only valid hosts run Lustre workloads:

lustre.csi.klustrefs.io/lustre-client=true

allowedTopologies in the storage class references this label so the scheduler only considers labeled nodes. Make sure to label/clear nodes as you add or remove Lustre support (see Label Lustre-capable nodes).

Security and privileges

  • Pods run privileged with SYS_ADMIN capability to execute mount.lustre.
  • Host namespaces: hostNetwork: true and hostPID: true so the driver can interact with kubelet paths and kernel modules.
  • HostPath mounts expose /var/lib/kubelet, /dev, /sbin, /usr/sbin, /lib, and /lib64. Verify your cluster policy (PSA/PSP) allows this in the klustre-system namespace.

Limitations (by design)

  • ❌ No dynamic provisioning (no CreateVolume/DeleteVolume): you must create static PVs.
  • ❌ No CSI snapshots or expansion.
  • ❌ No node metrics endpoints.
  • ✅ Designed for ReadWriteMany workloads with preexisting Lustre infrastructure.

Understanding these guardrails helps you decide whether Klustre CSI fits your cluster and what automation you may need to layer on top (for example, scripts that generate PVs from a Lustre inventory). Now that you know the architecture, head back to the Introduction or browse the Reference for tunable parameters.