This guide covers production MongoDB on Kubernetes end-to-end — from replica set topology and sharded cluster architecture to operator selection, PITR backups, and compute sizing.
Why MongoDB on Kubernetes. Key benefits, production maturity, and what changes compared to VM-based deployments.
Replica set with driver-side direct connection, sharded clusters with mongos routing, Config Server RS, and multi-AZ HA.
WiredTiger cache sizing, K8s resource limits, oplog sizing for PITR coverage, NVMe storage classes, and journal layout.
Prometheus with mongodb_exporter, oplog window alerting, WiredTiger cache hit ratio, slow operations, and PMM integration.
Percona Backup for MongoDB (PBM), physical snapshots, continuous oplog archiving to S3, and point-in-time restore.
Percona Operator, MongoDB Community Operator, KubeDB, and Atlas Kubernetes Operator — side-by-side feature matrix.
MongoDB is the world's leading document database, and Kubernetes has become the standard infrastructure platform. Running MongoDB on Kubernetes means your clusters are declaratively defined, GitOps-managed, and co-located with the applications they serve — on your infrastructure, under your control.
Unlike relational workloads, MongoDB's horizontal scaling model (sharding) was designed for distributed systems from the start. Kubernetes amplifies this: shards become StatefulSets, mongos routers become Deployments, and failover automation is baked into both the operator and MongoDB's Raft-based election protocol. The result is a database platform that scales out as naturally as the applications it powers.
Run on any Kubernetes — GKE, EKS, AKS, bare metal, or on-prem. Open-source operators under Apache 2.0 with no per-node license fees.
Rolling upgrades, replica set scaling, shard addition, and TLS certificate rotation — all via Kubernetes CRDs. No manual runbooks.
Bin-pack MongoDB alongside other workloads. Scale dev clusters to a single-node. Assign secondary members to spot/preemptible node pools.
MongoDB on Kubernetes supports two primary topologies: a single Replica Set where the application driver connects directly to all members, and a Sharded Cluster with mongos routers for horizontal data partitioning. Choosing the right topology depends on your dataset size and scaling requirements.
A MongoDB Replica Set consists of three or more nodes: one Primary (handles all writes) and one or more Secondaries (replicate via oplog tailing). Unlike PostgreSQL, where the application typically speaks to a single endpoint, the MongoDB driver is topology-aware — it discovers all replica set members automatically from the initial seed list and maintains a connection to each.
This means no proxy is needed for a single replica set. The application connection string lists all member hosts (or the service DNS entries). The driver monitors member health continuously and routes reads and writes intelligently based on your configured read preference.
mongodb://mongo-0.mongo-svc:27017,mongo-1.mongo-svc:27017,mongo-2.mongo-svc:27017/?replicaSet=rs0&readPreference=secondaryPreferred

The driver performs topology discovery from any of the seed hosts. Adding or removing members is transparent to the application after the replica set membership is updated.
The driver routes read operations based on your readPreference setting — no extra infrastructure needed: primary (the default), primaryPreferred, secondary, secondaryPreferred, and nearest (lowest-latency member).
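As an illustration of how topology-aware routing works, the core of driver-side server selection can be sketched in a few lines. This is a simplified model (the select_member helper is hypothetical; real drivers also apply latency windows and staleness checks):

```python
import random

def select_member(members, read_preference):
    """Pick a replica set member for a read.

    members: list of dicts like {"host": ..., "state": "PRIMARY" | "SECONDARY"}
    Simplified sketch of driver-side server selection, not a real driver API.
    """
    primaries = [m for m in members if m["state"] == "PRIMARY"]
    secondaries = [m for m in members if m["state"] == "SECONDARY"]

    if read_preference == "primary":
        return primaries[0] if primaries else None
    if read_preference == "secondary":
        return random.choice(secondaries) if secondaries else None
    if read_preference == "primaryPreferred":
        # Fall back to a secondary only when no primary is reachable
        if primaries:
            return primaries[0]
        return random.choice(secondaries) if secondaries else None
    if read_preference == "secondaryPreferred":
        # Offload reads to secondaries; use the primary as a fallback
        if secondaries:
            return random.choice(secondaries)
        return primaries[0] if primaries else None
    raise ValueError(f"unsupported readPreference: {read_preference}")
```

Because this logic lives in the driver, a member failure changes routing without any proxy or load balancer in the path.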
MongoDB uses a Raft-based election protocol. When the primary becomes unavailable, the replica set elects a new primary automatically: heartbeats detect the failure, eligible secondaries call an election after the election timeout (10 s by default), the candidate that wins a majority of votes steps up, and topology-aware drivers reconnect to the new primary (retryable writes cover the brief gap).
On Kubernetes: set member priority to prefer nodes in the primary AZ. Use priority: 0, votes: 0 for analytics or backup members that should never vote or become primary; an arbiter (a data-less voting member) can supply a tie-breaking vote where a third data-bearing node is too costly.
When your dataset exceeds what a single replica set can serve — in throughput or storage — a sharded cluster distributes data across multiple shards. Each shard is itself a full replica set. Applications never talk to shards directly: they connect to mongos routers, which are stateless proxy pods that route operations to the correct shard using a shard key.
A sharded cluster has three component groups: mongos routers (stateless, scale horizontally as a Deployment), a Config Server Replica Set (3-node RS storing chunk ranges and metadata), and N data shards (each a 3-node RS). The mongos reads chunk metadata from the Config Server to route every query or write to the right shard.
The Config Server RS is a dedicated 3-node replica set that stores the cluster's sharding metadata: which chunks of data live on which shard. Every mongos reads from this RS before routing operations.
kind: PerconaServerMongoDB
spec:
  sharding:
    enabled: true
    configsvrReplSet:
      size: 3
      affinity:
        podAntiAffinityTopologyKey: topology.kubernetes.io/zone

The shard key determines how data is distributed. Poor shard key selection leads to hotspots where all writes go to one shard, negating the benefit of sharding.
{ userId: "hashed" }       even distribution ✓
{ tenantId: 1, _id: 1 }    tenant partitioning ✓
{ createdAt: 1 }           monotonic hotspot ✗
{ status: 1 }              low cardinality ✗

Use hashed shard keys for maximum write distribution. Use compound range shard keys when zone sharding (geographic partitioning) is required.
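The hotspot effect is easy to demonstrate. The sketch below simulates chunk assignment for a monotonic key under range sharding versus a hashed key (assumption: MD5 stands in for MongoDB's actual hashed-index function, which is sufficient to show the distribution behavior):

```python
import hashlib

NUM_SHARDS = 3

def shard_for_hashed(key: str) -> int:
    # Any uniform hash illustrates the point; MongoDB uses its own
    # hashed-index function internally (MD5 here is an assumption).
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def shard_for_range(key, boundaries):
    # Range sharding: route to the first chunk whose upper bound
    # exceeds the key; keys past all boundaries land on the last shard.
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)

# Monotonically increasing keys (e.g. createdAt) under range sharding:
# every new key exceeds all existing boundaries, so one shard takes
# every insert — the classic write hotspot.
monotonic = [shard_for_range(t, [100, 200]) for t in range(1000, 1010)]

# The same keys under hashed sharding spread across all shards.
hashed = {shard_for_hashed(str(t)) for t in range(1000, 1100)}
```

Running this shows `monotonic` is a single repeated shard ID, while `hashed` touches every shard.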
MongoDB's performance depends almost entirely on memory: the WiredTiger cache must hold the working set, and the OS page cache provides a second layer. Getting resource limits and storage layout wrong on Kubernetes will cause constant cache evictions and poor read performance.
WiredTiger is MongoDB's default storage engine. It maintains an in-memory cache of working data. By default, the cache is sized to 50% of (RAM − 1 GB), with a 256 MB floor. On Kubernetes you must set this explicitly — depending on the MongoDB version and cgroup configuration, the container memory limit may not be visible to the process, so the default calculation can use the node's total RAM, leading to OOMKills.
resources:
  limits:
    memory: "8Gi"
  requests:
    memory: "8Gi"
# wiredTigerCacheSizeGB not set →
# MongoDB may read host RAM (e.g. 128 GB)
# and set cache to 63 GB → OOMKill

resources:
  limits:
    memory: "16Gi"
  requests:
    memory: "16Gi"
# Rule: cache ≤ (containerMemory × 0.5) - 1 GB
# For 16 Gi container → cache = 7 GB
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 7

MongoDB uses two memory layers: the WiredTiger cache (for modified and frequently-accessed data) and the OS page cache (for reads not in the WiredTiger cache). Set the container memory limit to at least 2× the WiredTiger cache size to leave room for the OS page cache, connections, and aggregation pipeline buffers.
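The sizing rule can be captured in a small helper. This is a minimal sketch mirroring WiredTiger's default formula (50% of memory minus 1 GB, 256 MB floor); rounding down the result leaves extra headroom for the page cache and connections:

```python
def wiredtiger_cache_gb(container_mem_gb: float) -> float:
    """Recommended WiredTiger cacheSizeGB for a container memory limit.

    Mirrors WiredTiger's default sizing: 50% of (memory - 1 GB),
    with a 0.25 GB (256 MB) floor.
    """
    return max(0.25, 0.5 * (container_mem_gb - 1))

# 16 Gi container -> 7.5 GB by the formula; set cacheSizeGB: 7
# to keep headroom for the OS page cache and aggregation buffers.
```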
The oplog (operations log) is a capped collection on each replica set member that records all write operations. Its size determines your PITR window — how far back in time you can restore, and how much replication lag is tolerable before a secondary falls too far behind to resync.
~990 MB (default minimum): may roll over in under an hour under write pressure
10–50 GB: provides a multi-hour PITR window; size to cover your maintenance duration
50–100 GB: lets secondaries lag through long maintenance without falling off the oplog

storage:
  oplogSizeMB: 51200  # 50 GB — value is set in MiB, not GB
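A quick way to size the oplog is to estimate how long it takes to roll over at your sustained write rate. A rough sketch (assumption: oplog volume roughly tracks write volume, which varies with document size and update patterns):

```python
def oplog_window_hours(oplog_gb: float, write_mb_per_sec: float) -> float:
    """Approximate PITR window before the capped oplog rolls over.

    Rough estimate only: real oplog entry sizes depend on workload shape.
    """
    oplog_mb = oplog_gb * 1024
    seconds_until_rollover = oplog_mb / write_mb_per_sec
    return seconds_until_rollover / 3600

# A 50 GB oplog at a sustained 5 MB/s of writes covers ~2.8 hours —
# enough for short maintenance, tight for a long secondary resync.
```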
MongoDB's WiredTiger performs many small random I/Os for cache evictions and checkpoints. NVMe or local SSD-backed storage classes deliver the sub-millisecond latency these operations require. Use node-local PVCs via local-path-provisioner or cloud NVMe storage classes.
Secondary members replicate via oplog tailing — a largely sequential read/write pattern that is tolerant of higher latency. Cloud block storage (gp3, Premium SSD, Persistent Disk) is acceptable for secondaries, reducing cost without impacting write throughput on the primary.
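On EKS, for example, a dedicated class for secondaries on gp3 block storage might look like the following sketch (the class name, IOPS, and throughput values are illustrative, not recommendations):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mongo-secondary-gp3   # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"        # illustrative
  throughput: "250"   # MB/s, illustrative
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Pairing a local-NVMe class for the primary-eligible members with a class like this for lower-priority secondaries captures the cost savings described above.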
MongoDB produces rich observability data, but the default monitoring focus for Kubernetes operators differs from standalone deployments. Oplog window and replication lag are the two most operationally critical metrics — both invisible without dedicated exporters.
The Percona mongodb_exporter scrapes MongoDB's serverStatus, replSetGetStatus, and dbStats endpoints and exposes them as Prometheus metrics. The Percona Operator for MongoDB deploys it as a sidecar and automatically creates a ServiceMonitor for Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mongo-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: percona-server-mongodb
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

These five metrics should be the foundation of any MongoDB on Kubernetes alert configuration:
mongodb_mongod_replset_oplog_tail_timestamp - mongodb_mongod_replset_oplog_head_timestamp Time span covered by the oplog. Alert if < 2× your longest maintenance window. Shrinking oplog window means secondaries may fall off and require full resync.
mongodb_mongod_replset_member_replication_lag How far behind each secondary is from the primary's optime. Alert at > 30 s; investigate at > 10 s under normal load.
mongodb_mongod_wiredtiger_cache_bytes_read_into_cache / mongodb_mongod_wiredtiger_cache_bytes_total Bytes read into the cache are misses, so the hit ratio is 1 minus this fraction. The hit ratio should stay above 95%. Dropping below 90% signals the working set no longer fits in cache — time to scale memory or shard.
mongodb_mongod_connections{state="current"} MongoDB spawns one thread per connection. Alert when approaching net.maxIncomingConnections. A persistently high connection count is a sign that connection pooling (at the driver or mongos layer) needs attention.
mongodb_extra_info_page_faults_total Page faults mean data was not in the WiredTiger cache or OS page cache — direct disk reads. Sustained page faults indicate insufficient memory.
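The two most critical alerts above, oplog window and replication lag, can be expressed as a PrometheusRule. A sketch, assuming the metric names emitted by mongodb_exporter as listed and using the same oplog window expression; thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mongo-alerts   # illustrative name
spec:
  groups:
  - name: mongodb
    rules:
    - alert: MongoOplogWindowShrinking
      # Window below 2 hours: secondaries risk falling off the oplog
      expr: (mongodb_mongod_replset_oplog_tail_timestamp - mongodb_mongod_replset_oplog_head_timestamp) < 7200
      for: 10m
      labels:
        severity: warning
    - alert: MongoReplicationLag
      # Sustained lag above the 30 s alert threshold
      expr: mongodb_mongod_replset_member_replication_lag > 30
      for: 5m
      labels:
        severity: critical
```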
Monitor mongodb_mongod_replset_member_state — alert on any member in state RECOVERING (3), UNKNOWN (6), or DOWN (8).
Alert when kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.80 — WiredTiger does not pre-allocate data files, but checkpoints can spike disk usage.
For sharded clusters, alert immediately if the Config Server RS loses quorum. All mongos routing stops until a majority is restored.
Track mongodb_mongos_connections for sharded clusters. mongos pools connections to each shard; size the pool to avoid shard connection exhaustion.
Enable profiling: 1 with slowOpThresholdMs: 100. PMM's Query Analytics surfaces the top slow operations with explain plan data.
Percona provides pre-built Grafana dashboards for MongoDB Replica Set Overview, WiredTiger, and ReplSet Summary — importable via PMM or standalone JSON.
MongoDB point-in-time recovery is implemented differently from relational databases. Instead of WAL files, MongoDB uses the oplog — a capped collection that records every write operation. Percona Backup for MongoDB (PBM) continuously archives this oplog to object storage, enabling recovery to any second within the oplog window.
PBM is the recommended backup solution for self-managed MongoDB on Kubernetes. It supports both physical backups (direct copy of WiredTiger data files — fast for large datasets) and logical backups (BSON document dump). For PITR, PBM archives oplog chunks to S3-compatible storage continuously.
# Trigger a physical backup
pbm backup --type=physical

# Restore to a point in time
pbm restore --time="2026-04-01T14:32:00"

# Check PITR coverage window
pbm status | grep -A5 "PITR"
Point-in-time recovery combines a base backup snapshot with a continuous stream of oplog operations. Restoring to any point requires replaying the oplog from the nearest base backup up to the target timestamp.
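PBM's storage target and PITR behavior are set via pbm config. A sketch of the configuration file, where the bucket, region, prefix, and chunk span are illustrative values:

```yaml
# Applied with: pbm config --file pbm-config.yaml
storage:
  type: s3
  s3:
    region: us-east-1       # illustrative
    bucket: mongo-backups   # illustrative
    prefix: prod-rs0        # illustrative
pitr:
  enabled: true
  oplogSpanMin: 10   # upload oplog chunks every 10 minutes
```

With pitr.enabled, the pbm-agent on each replica set streams oplog chunks to the bucket continuously, which is what makes second-level restore targets possible between base backups.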
PBM physical backup: Production workloads on MongoDB 5+. Fast backup, fast restore, second-level PITR. Runs as a sidecar agent in each pod — no external backup host needed.
PBM logical backup: Development clusters or datasets under 50 GB. Portable across MongoDB patch versions. Slower than physical for backup and restore.
mongodump/mongorestore: No PITR capability. Locks collections during dump on older versions. Slow for large datasets. Insufficient alone for production SLAs.
Volume snapshots: No oplog-aware consistency for sharded clusters. Cannot restore to arbitrary points in time. Use only as an additional layer alongside PBM.
The Kubernetes ecosystem offers several operators for MongoDB, ranging from full-featured open-source solutions to proprietary enterprise platforms. Here's how the main options compare on the features that matter most in production.
The most feature-complete open-source MongoDB operator. Supports replica sets and sharded clusters, PBM integration for PITR, PMM for monitoring, TLS, custom roles, and multi-cluster federation. Active development with regular releases.
The official open-source operator from MongoDB Inc. Handles replica set provisioning and basic lifecycle management. No sharding support, no built-in backup, no monitoring integration. Suitable for development clusters and use cases where Atlas Operator is not viable.
Multi-database operator suite (MongoDB, PostgreSQL, MySQL, Redis, and more) with a single unified management plane. Enterprise features including automated backups and monitoring require a commercial license. Useful for teams that need a single operator for many database types under an enterprise agreement.
Manages MongoDB Atlas clusters via Kubernetes CRDs. The operator code is open source but the database runs in Atlas (MongoDB's cloud service) — not on your own Kubernetes nodes. Offers full feature parity with Atlas including Global Clusters and Atlas Search. Requires an Atlas account and subscription costs.
OpenEverest is a CNCF Sandbox project that provides a unified control plane for multiple database engines on Kubernetes. For MongoDB workloads, it currently uses Percona Operator for MongoDB as its engine — delivering full replica set and sharded cluster support, PBM-based PITR, and PMM monitoring through a single unified API.
OpenEverest is built on a modular operator architecture: the underlying engine is pluggable, and additional MongoDB operator backends are on the roadmap. The same modular approach already spans MySQL (via Percona Operator for MySQL) and MongoDB, with more engines planned. This means teams invest in one API, one RBAC model, and one monitoring stack — regardless of which database or operator runs underneath.
Replica sets with driver-native failover, sharded clusters with mongos, PBM PITR to S3 — all managed through a single open-source control plane. No vendor lock-in. No surprises.