How to Use S3 Storage in Kubernetes with CSI

When a cluster runs out of disk space, object storage comes into play. S3 is scalable, typically cheaper than block disks, and well-suited for logs, archives, and artifacts. However, an S3 bucket cannot be attached to a pod directly: Kubernetes storage expects a volume and doesn’t know how to access S3 via the HTTP API, so a CSI driver for S3 is required.

In this article, we’ll explore how to use S3 object storage in Kubernetes using CSI. We’ll explain how the CSI S3 driver works, what the Kubernetes S3 Storage Class is, how to prepare a bucket and cluster, configure a PVC, and then attach an S3 bucket as a volume to an application.

What is S3 and how to use it in Kubernetes

S3 is an object storage system, not a classic file system. Data is stored in a bucket as objects with keys and metadata. Access is via HTTP, and operations are API requests. It’s not POSIX or a network drive.

Kubernetes cannot use object storage in this form. A pod operates on a filesystem mounted by the kubelet. Kubernetes storage uses the following resources to manage mounted filesystems:

  • PersistentVolume, PersistentVolumeClaim,
  • StorageClass,
  • VolumeAttachment,
  • CSI Driver.

To connect these worlds, the Container Storage Interface (CSI) comes into play. CSI defines a standard interface between Kubernetes and a specific storage driver. A third-party CSI driver implements this interface and turns S3 into a volume that can be mounted in a pod.

What tasks is S3 storage suitable for in Kubernetes?

S3 doesn’t replace node disks. However, it’s well-suited for scenarios where capacity and reliability are more important than minimal latency.

The most common uses of S3 storage in Kubernetes are:

  • Application and infrastructure logs. Pods write logs to a mounted directory, and via K8S CSI S3, they are stored in a bucket where they can remain for years.
  • Archives and backups. Database backups, dumps, and configuration archives can be conveniently stored in object storage, where you can configure retention periods and storage classes.
  • Data for analytics and machine learning. Datasets, calculation results, model artifacts. Jobs within the cluster read and write objects directly, and S3 becomes a common layer for different services.
  • Static content. Images, CSS, and JS, as well as user-uploaded files. The pod can serve these itself or use S3 as a source for further distribution.

S3 is poorly suited for transactional workloads, relational databases, and scenarios with thousands of small writes per second. These scenarios still require block disks and traditional Kubernetes storage.

What is CSI, and how does the CSI S3 driver work?

The Kubernetes CSI is a specification. It describes what operations a driver must support for Kubernetes to be able to:

  • create and delete volumes,
  • mount and unmount volumes on nodes,
  • query storage parameters.

The drivers themselves are implemented by various vendors. For S3, popular variants are csi-s3 (s3-csi), as well as their cloud-specific forks. Documentation often refers to the driver as “Kubernetes S3 CSI driver” or simply “CSI S3.”

A typical CSI S3 driver for Kubernetes is designed like this:

  • Each node runs a CSI Node component. It is responsible for actually mounting the volume and runs FUSE or another file system layer on top of S3.
  • The cluster has a CSI Controller. It creates and deletes volumes, processes storage allocation requests, and monitors the volume lifecycle.
  • The CSI Identity service reports information about the driver, such as its name and capabilities.
  • A CSI volume is the resulting volume that can be mounted into pods.
  • Kubernetes communicates with the driver via the standard CSI API. To Kubernetes, it’s just another type of storage.

Internally, the S3-compatible driver uses a FUSE mounter, such as GeeseFS. The container sees a regular directory, and all read and write operations are converted into requests to object storage.
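To make this mapping concrete, here is a minimal sketch of how a path inside the mounted directory could correspond to an object key. The helper name object_key and the sample mount path and prefix are illustrative assumptions. Real mounters such as GeeseFS do this internally and may differ in detail.

```shell
# Hypothetical helper: derive the S3 object key for a file under a mount.
# usage: object_key <mount_path> <bucket_prefix> <file_path>
object_key() {
  mount_path=${1%/}   # drop a trailing slash, if any
  key_prefix=$2       # e.g. "kubernetes/" (may be empty)
  file_path=$3
  case "$file_path" in
    "$mount_path"/*)
      # Strip the mount path; what remains is the key relative to the prefix.
      rel=${file_path#"$mount_path"/}
      printf '%s%s\n' "$key_prefix" "$rel"
      ;;
    *)
      echo "path is outside the mount: $file_path" >&2
      return 1
      ;;
  esac
}

object_key /usr/share/nginx/html kubernetes/ /usr/share/nginx/html/img/logo.png
```

Under these assumptions, every read or write of that file becomes a request for the key kubernetes/img/logo.png in the bucket.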

Limitations and features of S3 operation via CSI

This approach has some important features to keep in mind before launching into production:

  • This is not a full-fledged POSIX file system. Changing permissions and ownership, partial overwriting, hard links, and other complex file operations do not always work correctly.
  • Latency depends on the network and the S3 service. Each file access is essentially an object request. Therefore, a lot depends on the storage class. For example, cold storage is fine for logs and static data, but not for real-time chat or OLTP. For some busier workloads, choosing a nearby S3 region and a faster storage class can help.
  • Consistency between pods is usually eventual. Depending on the S3 provider and the mounter’s caching, if one pod writes a file, another may see the old directory state with a slight delay.

Therefore, Kubernetes S3 CSI is well suited to logs, archives, static content, and artifacts of any size, but not to databases and critical transactions.

Preparing storage: bucket, keys, permissions

To connect S3 to a cluster, first configure the storage itself:

  • Create a bucket in the desired region. Consider a naming scheme for keys and prefixes, such as logs/, backup/, and ml/.
  • Set up a user and access keys. Generate an access key and secret key, which will then be added to the Secret for CSI S3.
  • Restrict permissions. Use a bucket policy to restrict actions in this bucket, for example, to prevent CSI S3 from accidentally deleting someone else’s data.
  • Enable encryption and retention policy. Configure object encryption, lifecycle management for hot and old data, and, if necessary, versioning.

These steps are more conveniently described in Terraform. One module creates the bucket, the second creates the user and key, and the third creates the access policies. Then, Kubernetes object storage will be subject to the same IaC rules as the rest of the infrastructure.
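As an illustration of the "restrict permissions" step, the sketch below prints a bucket policy that limits one principal to a single prefix. It assumes AWS-style policy syntax. S3-compatible providers may support only a subset of it, and the bucket name, prefix, and principal ARN are placeholders.

```shell
# Print an AWS-style bucket policy that limits one principal to one prefix.
# usage: bucket_policy <bucket> <prefix> <principal_arn>
bucket_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListOnlyThePrefix",
      "Effect": "Allow",
      "Principal": { "AWS": "$3" },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::$1",
      "Condition": { "StringLike": { "s3:prefix": "$2*" } }
    },
    {
      "Sid": "ReadWriteUnderThePrefix",
      "Effect": "Allow",
      "Principal": { "AWS": "$3" },
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::$1/$2*"
    }
  ]
}
EOF
}

# Placeholder values: the account ID and user name are examples only.
bucket_policy my-app-bucket kubernetes/ arn:aws:iam::123456789012:user/csi-s3
```

Depending on the mounter, extra actions such as those for multipart uploads may also be needed. The same document can be passed to a Terraform policy resource instead of being applied by hand.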

Preparing a Kubernetes cluster

Next, we prepare the cluster so that Kubernetes S3 CSI works without any surprises:

  • Ensure CSI support. You need a Kubernetes version that supports the CSI v1 API. CSI became generally available in Kubernetes 1.13, so any currently supported release qualifies.
  • Configure node network access to the S3 service. The nodes must be able to reach the object storage via DNS and port 443, and security rules must not block this traffic.
  • Prepare permissions within the cluster itself. CSI S3 will create and update PersistentVolumes (hereinafter PVs), work with PersistentVolumeClaims (hereinafter PVCs), and StorageClasses, so the appropriate roles and bindings are required.

Managed services come with some of these settings in place, but a security policy may impose restrictions. If a driver requires privileged containers or hostPath access, the policy must explicitly allow them.
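The in-cluster permissions mentioned above can be checked before the first PVC is created. The sketch below uses kubectl auth can-i with impersonation. The service account name is a hypothetical example, so substitute the one your driver’s chart creates. The KUBECTL variable is only a convenience for dry runs.

```shell
# KUBECTL can be overridden, e.g. KUBECTL=echo to print the commands only.
KUBECTL="${KUBECTL:-kubectl}"
# Hypothetical service account used by the CSI S3 controller.
CSI_SA="${CSI_SA:-system:serviceaccount:kube-system:csi-s3}"

check_csi_rbac() {
  for check in "create persistentvolumes" "delete persistentvolumes" \
               "list persistentvolumeclaims" "get storageclasses"; do
    printf '%s: ' "$check"
    # $check is intentionally unquoted so it splits into verb and resource.
    "$KUBECTL" auth can-i $check --as="$CSI_SA"
  done
}
```

Running check_csi_rbac prints yes or no for each verb and resource. Any "no" points to a missing rule in the driver’s ClusterRole or binding.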

Installing and Configuring CSI for S3

Selecting and preparing the CSI S3 driver

For S3, one of the csi-s3 forks or a ready-made module from the cloud provider is typically used. When choosing the Kubernetes S3 CSI driver, check a few things:

  • Compatibility with your cluster version. An older driver may not support newer versions of Kubernetes.
  • StorageClass and PVC instructions are available. Comprehensive documentation saves hours of debugging.
  • Supported mounter. GeeseFS or another FUSE layer directly impacts performance. For example, s3fs provides a wide range of POSIX functions, but goofys is optimized for performance, losing some POSIX features.

Installation via Helm or manifest

The easiest way to install the driver is through Helm. The chart is often called csi-s3 or s3-csi. An installation example might look like this:

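# The repository URL and the secret.* value names are placeholders; check your chart's values.
helm repo add s3-csi-driver https://example.com/helm/s3-csi
helm repo update
# Placeholders are quoted so the shell does not treat < and > as redirections.
helm install csi-s3 s3-csi-driver/csi-s3 \
  --namespace kube-system \
  --create-namespace \
  --set secret.accessKey='<ACCESS_KEY>' \
  --set secret.secretKey='<SECRET_KEY>' \
  --set secret.region='<REGION>' \
  --set secret.endpoint=https://s3.example.com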

Helm takes care of deploying all the necessary components. If Helm isn’t an option, the driver can be installed from standard Kubernetes manifest files with Deployment, DaemonSet, and CRD definitions.

Secrets and credentials

Access to S3 on the cluster side is provided via Secret and ServiceAccount:

apiVersion: v1
kind: Secret
metadata:
  name: csi-s3-secret
  namespace: kube-system
type: Opaque
data:
  accessKey: <base64-access-key>
  secretKey: <base64-secret-key>
  endpoint: <base64-endpoint-url>
  region: <base64-region>

The CSI S3 driver will use this Secret. It’s important to restrict access to it and never store secret keys in a ConfigMap.
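The values under data in the Secret must be base64-encoded. A minimal sketch of producing them is shown below. The helper name b64 is an assumption made for illustration.

```shell
# Encode a value for the data section of a Secret.
# printf avoids the trailing newline that echo would add to the encoded value,
# and tr removes the line wrapping some base64 implementations apply.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Example: encode the endpoint URL used in this article.
b64 'https://s3.example.com'
echo
```

A stray newline inside an encoded key is a common cause of authentication failures. Alternatively, Kubernetes accepts plain-text values in the stringData field of a Secret and encodes them itself.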

S3 StorageClass Definition

Now we need to describe exactly how the S3 volume will look from the cluster’s perspective. To do this, we create a Kubernetes StorageClass for S3, sometimes simply called an S3 storage class.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-s3
provisioner: csi.s3.example.com
parameters:
  bucket: my-app-bucket
  prefix: kubernetes/
  mounter: geesefs
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

Here we’ve described the Kubernetes S3 storage class. It knows which bucket to work with, which prefix to use for reading and writing objects, and how to handle the volume after deleting the PVC.

Creating a PVC and checking the S3 bucket mounting

Next, we create a PVC and check that the bucket is actually mounted as a volume:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-s3-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: csi-s3
  resources:
    requests:
      storage: 50Gi

This PVC requests a volume in the csi-s3 class. For Kubernetes, this is a standard storage request. For the driver, it’s a command to prepare an area in S3.

To make sure everything works, we deploy a test pod and mount the PVC in it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s3-app
  template:
    metadata:
      labels:
        app: s3-app
    spec:
      containers:
        - name: s3-app-container
          image: nginx
          volumeMounts:
            - name: s3-storage
              mountPath: /usr/share/nginx/html
      volumes:
        - name: s3-storage
          persistentVolumeClaim:
            claimName: csi-s3-pvc

In this form, the s3-app container sees the /usr/share/nginx/html directory as a file system. In fact, behind it is an S3 bucket connected via S3 CSI.
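One way to confirm the mount is to wait for the Deployment, look at the claim, and write a file through the volume. The sketch below uses the resource names from the manifests above. The file name csi-check.txt and the KUBECTL override are assumptions for illustration.

```shell
# KUBECTL can be overridden, e.g. KUBECTL=echo to print the commands only.
KUBECTL="${KUBECTL:-kubectl}"

check_s3_volume() {
  # With WaitForFirstConsumer, the PVC binds only once the pod is scheduled,
  # so wait for the Deployment first.
  "$KUBECTL" rollout status deployment/s3-app --timeout=120s || return 1
  # The claim should now be reported as Bound.
  "$KUBECTL" get pvc csi-s3-pvc || return 1
  # Write a file through the mount and list the directory.
  "$KUBECTL" exec deploy/s3-app -- sh -c \
    'echo ok > /usr/share/nginx/html/csi-check.txt && ls -l /usr/share/nginx/html'
}
```

After check_s3_volume succeeds, the object should also be visible in the bucket under the configured prefix, which confirms that writes really reach S3.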

Using S3 in a Kubernetes Application

From a developer’s perspective, everything looks simple. A volume appears in the Deployment manifest, and a directory appears in the container that can be accessed.

There are several things to consider:

  • Access modes. ReadWriteMany is typically used to allow multiple pods to read and write to the same volume.
  • Writing patterns. It’s better to write data in large blocks rather than thousands of small files. For logs, appending to a large file stored as an object is convenient, but keep in mind that objects cannot be modified in place, so for large files, flushes to the volume should be infrequent.
  • Sharing considerations. Concurrent writes from multiple pods can lead to race conditions. If your application isn’t prepared for this, consider serializing access or separating pods by prefix.

With these points in mind, Kubernetes S3 CSI can be used without rewriting application code. The application simply writes to the file system, and the CSI S3 driver handles the rest.
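The writing and sharing advice above can be combined: buffer records on local disk, then copy them to the mounted volume as one file under a per-pod directory. The following is a sketch under those assumptions. The function name, file naming scheme, and paths are hypothetical.

```shell
# Copy a locally buffered file to the S3-backed mount as a single batch.
# usage: flush_batch <local_buffer_file> <mounted_dir> <pod_name> <sequence_number>
flush_batch() {
  buffer=$1
  dest_dir=$2/$3      # one directory (key prefix) per pod avoids write races
  seq_no=$4
  [ -s "$buffer" ] || return 0        # nothing buffered, nothing to upload
  mkdir -p "$dest_dir" || return 1
  # One copy becomes one larger object instead of many small writes.
  cp "$buffer" "$dest_dir/batch-$seq_no.log" || return 1
  : > "$buffer"                        # truncate the buffer after a successful copy
}
```

Inside a pod this could be called periodically, for example as flush_batch /tmp/app.log /mnt/s3/logs "$HOSTNAME" "$n", since the pod’s hostname defaults to the pod name.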

Automation and operation

To keep the setup from turning into a collection of manual steps, everything should be described as code:

  • Manage storage and clusters with Terraform. Buckets, keys, policies, clusters, and Helm releases with CSI S3 can be defined in a single set of modules.
  • Monitor the driver and object storage. Track the status of the CSI pods, mount errors, timeouts, and quota limits.
  • Plan scaling. If your load increases, it’s important to understand how quickly S3 and the FUSE layer can handle the additional requests.
  • Consider an admission webhook for PV injection. If manifests need to be changed on the fly as they are applied, a Kubernetes mutating admission webhook can do this.

With this approach, S3-based Kubernetes storage becomes a managed part of the infrastructure, rather than a set of manual settings.

Working with data: reading, writing, compatibility, and latency

When reading through the Kubernetes S3 CSI Driver, every operation in a mounted directory is converted into an object storage request. Sequential reading of large files and caching of frequently accessed objects provide predictable response times, while random access to thousands of small files increases latency and network load.

For data where read latency is critical, it is better to keep the hot layer on block storage and use Kubernetes object storage via CSI as a slower but more capacious layer.

Writing also requires careful design. It’s better to focus on append-only and batch-based writing rather than partial file overwriting: this makes it easier for the driver to map file operations to objects in the bucket. It’s also worth keeping in mind that the mounter may send large files using multipart uploads, so repeated rewrites of large objects should be kept to a minimum.

POSIX compliance is limited: locks, fsync, and complex sharing scenarios may not work as expected. This should be explicitly noted in the Deployment and PVC descriptions when a volume from the S3 StorageClass is shared between multiple pods.

The latency of accessing data via S3 CSI depends on the distance to the S3 region and the cache settings, storage class, and quotas. For operational purposes, it’s important to measure not only the average read and write times, but also the p95–p99 latency, the number of retries, and errors.
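Tail latency can be computed from raw samples with standard tools. The sketch below assumes one latency value per line, for example in milliseconds, and uses the nearest-rank method with an integer percentile. The function name percentile and the sample values are illustrative.

```shell
# Nearest-rank percentile of numeric samples read from stdin (one per line).
# usage: percentile <integer_p> < samples.txt
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      r = int((p * NR + 99) / 100)   # ceil(p / 100 * NR) for integer p
      if (r < 1) r = 1
      print v[r]
    }'
}

# Ten made-up write latencies: the median hides the two slow outliers.
printf '%s\n' 12 15 11 240 13 14 16 12 900 13 | percentile 50
printf '%s\n' 12 15 11 240 13 14 16 12 900 13 | percentile 95
```

With these sample values the median is 13 while p95 is 900, which is why averages alone say little about how an S3-backed volume behaves under load.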

If applications are sensitive to these parameters, it is better to move them to a separate PVC and StorageClass, and use K8s CSI S3 primarily for less latency-sensitive tasks.

Use cases: logs, archives, machine learning data

For application and technical logs, S3 is best suited as a low-cost and virtually unlimited storage layer. Applications sequentially write records to files, and the Kubernetes S3 CSI Driver transparently stores them in a bucket. You can define a separate StorageClass for logs with its own prefixes per service, pair it with an aggressive lifecycle policy on the bucket, and define uniform logging PVCs in Helm charts and Deployments.

Archives and backups are the second natural use case for Kubernetes object storage. This is where rarely accessed but valuable data goes: SQL dumps, configuration archives, and export sets. For such PVCs, you can configure a less expensive storage class and a longer object lifetime, and run the archiving tasks as Jobs that use the same K8s CSI S3, but write to their own bucket prefixes.

Machine learning data is conveniently stored in separate buckets and PVCs to separate training datasets, raw downloads, and model artifacts. Using Terraform, you can define a module that creates a bucket, access policy, and PVC for a specific ML project, and in a Helm Chart, mount this volume in a Job and Deployment for training and inference. This makes the Kubernetes S3 CSI Driver a standard way to distribute datasets within a cluster without being tied to local disks.

Troubleshooting and common errors

Most CSI S3 failures fall into one of several scenarios:

  • Invalid keys or access rights. The driver can’t authenticate to S3, the PVC gets stuck in the Pending state, and the logs show AccessDenied.
  • Unreachable endpoint. Connection errors and mount timeouts. Check DNS, routing rules, and the service address.
  • RBAC errors. The CSI driver cannot create PVs or read StorageClasses. The Kubernetes S3 CSI driver crashes with API access errors.
  • Version incompatibility. An outdated CSI-S3 module doesn’t support new Kubernetes versions and crashes on startup.
  • Insufficient bucket permissions. The account being used may be read-only, or the bucket being written to may have special prefix rules.

The diagnostics are standard. We use kubectl get pods, kubectl describe, and kubectl logs for the CSI pods and for the applications using the PVC. Often, a glance at the logs is enough to understand the problem.
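A first pass over driver logs can be automated by matching them against the failure scenarios listed above. The sketch below is only a heuristic: the patterns are common S3, network, and Kubernetes API error strings, but the exact log text depends on the driver, and the function name is an assumption.

```shell
# Read log text from stdin and print the likely failure scenario(s).
classify_csi_log() {
  log=$(cat)
  found=0
  if printf '%s\n' "$log" | grep -qiE 'AccessDenied|InvalidAccessKeyId|SignatureDoesNotMatch'; then
    echo "credentials or bucket permissions: check the Secret and the bucket policy"
    found=1
  fi
  if printf '%s\n' "$log" | grep -qiE 'no such host|connection refused|i/o timeout'; then
    echo "endpoint unreachable: check DNS, routing rules, and the service address"
    found=1
  fi
  if printf '%s\n' "$log" | grep -qiE 'is forbidden|cannot (get|list|create|update|delete) resource'; then
    echo "RBAC: check the roles and bindings of the CSI driver"
    found=1
  fi
  [ "$found" -eq 1 ] || echo "no known pattern: read the full log"
}
```

It can be fed from kubectl, for example kubectl logs -n kube-system <csi-pod-name> | classify_csi_log, where the pod name is whatever kubectl get pods reports for the driver.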

Conclusions

S3 in Kubernetes via CSI is a convenient way to connect object storage as a volume. Kubernetes storage continues to work with volumes and PVCs, and the CSI S3 driver handles the S3 integration.

The correct scheme looks like this: we configure the bucket and access rights, prepare the cluster, install S3-CSI or another module via Helm, create a Kubernetes StorageClass for S3 and a PVC, and then mount the volume in the application.

For developers, it looks like a regular directory, and for administrators, like another managed storage type. The software doesn’t even realize it’s working with S3.

The key is to be mindful of your workload profile and the limitations of your S3 provider, avoid trying to migrate everything to it, and keep your configuration under control through Terraform and monitoring. Then, Kubernetes S3 CSI won’t be a source of failure, but a reliable and understandable tool.

