
Running MongoDB
on Kubernetes

Replica Sets, Sharded Clusters & Operator Comparison

1. Why MongoDB on Kubernetes?

MongoDB is the world's leading document database, and Kubernetes has become the standard infrastructure platform. Running MongoDB on Kubernetes means your clusters are declaratively defined, GitOps-managed, and co-located with the applications they serve — on your infrastructure, under your control.

Unlike relational workloads, MongoDB's horizontal scaling model (sharding) was designed for distributed systems from the start. Kubernetes amplifies this: shards become StatefulSets, mongos routers become Deployments, and failover automation is baked into both the operator and MongoDB's Raft-based election protocol. The result is a database platform that scales out as naturally as the applications it powers.

No Vendor Lock-In

Run on any Kubernetes — GKE, EKS, AKS, bare metal, or on-prem. Open-source operators under Apache 2.0 with no per-node license fees.

Automated Operations

Rolling upgrades, replica set scaling, shard addition, and TLS certificate rotation — all via Kubernetes CRDs. No manual runbooks.


Cost Efficiency

Bin-pack MongoDB alongside other workloads. Scale dev clusters down to a single node. Assign secondary members to spot/preemptible node pools.

WHY_KUBERNETES

  • Declarative topology via CRDs
  • GitOps-driven provisioning
  • Automatic Raft-based failover
  • Self-service for dev teams
  • Centralised secrets (Vault, K8s Secrets)
  • Unified Prometheus monitoring
  • PITR via oplog from day one
PRODUCTION MATURITY

Large e-commerce platforms, financial services firms, and SaaS providers run multi-TB sharded MongoDB clusters on Kubernetes in production. The Percona Operator for MongoDB and the MongoDB Community Operator both have extensive production history.

2. Architecture

MongoDB on Kubernetes supports two primary topologies: a single Replica Set where the application driver connects directly to all members, and a Sharded Cluster with mongos routers for horizontal data partitioning. Choosing the right topology depends on your dataset size and scaling requirements.

TOPOLOGY 1 — Single Replica Set

Direct Connection: No Proxy Required

A MongoDB Replica Set consists of three or more nodes: one Primary (handles all writes) and one or more Secondaries (replicate via oplog tailing). Unlike PostgreSQL, where the application typically speaks to a single endpoint, the MongoDB driver is topology-aware — it discovers all replica set members automatically from the initial seed list and maintains a connection to each.

This means no proxy is needed for a single replica set. The application connection string lists all member hosts (or the service DNS entries). The driver monitors member health continuously and routes reads and writes intelligently based on your configured read preference.

Connection String — all members listed
mongodb://mongo-0.mongo-svc:27017,mongo-1.mongo-svc:27017,mongo-2.mongo-svc:27017/?replicaSet=rs0&readPreference=secondaryPreferred
The driver performs topology discovery from any of the seed hosts. Adding or removing members is transparent to the application after the replica set membership is updated.
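The discovery step can be sketched in a few lines of Python (illustrative only; `discover_topology` and `hello_responses` are hypothetical names — real drivers use the `hello` command and refresh the topology continuously):

```python
def discover_topology(seed_hosts, hello_responses):
    """Driver-style discovery sketch: ask any reachable seed for the full
    member list, then monitor every member from then on.
    hello_responses: host -> {'hosts': [...]} (simulated `hello` replies)."""
    for seed in seed_hosts:
        if seed in hello_responses:           # seed is reachable
            return set(hello_responses[seed]["hosts"])
    return set(seed_hosts)                    # fall back to the raw seed list
```

Because discovery returns the full member list, the seed list only needs one reachable host for the driver to learn the whole replica set.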
WHEN TO USE REPLICA SET
Working set fits on one node
Dataset under ~500 GB–1 TB
Simpler operational model
PITR from a single oplog stream
Driver-native failover (~10 s)
Write throughput limited to one primary

Read Preference Options

The driver routes read operations based on your readPreference setting — no extra infrastructure needed:

primary All reads go to the primary. Strongest consistency. Default.
primaryPreferred Reads go to primary; falls back to secondary if unavailable.
secondary All reads go to a secondary. Good for analytics and reporting.
secondaryPreferred Prefers secondaries; falls back to primary if none available.
nearest Routes to the member with lowest network latency.
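The table above maps directly onto the driver's server-selection logic. A minimal Python sketch (illustrative only; real drivers also apply latency windows and max-staleness checks, and `select_member` is a hypothetical name):

```python
def select_member(members, read_preference):
    """Pick a read target per MongoDB-style read preference.
    members: dicts with 'state' ('PRIMARY'/'SECONDARY') and 'latency_ms'."""
    primaries = [m for m in members if m["state"] == "PRIMARY"]
    secondaries = [m for m in members if m["state"] == "SECONDARY"]
    if read_preference == "primary":
        return primaries[0] if primaries else None
    if read_preference == "primaryPreferred":
        return primaries[0] if primaries else (secondaries[0] if secondaries else None)
    if read_preference == "secondary":
        return secondaries[0] if secondaries else None
    if read_preference == "secondaryPreferred":
        return secondaries[0] if secondaries else (primaries[0] if primaries else None)
    if read_preference == "nearest":
        eligible = primaries + secondaries
        return min(eligible, key=lambda m: m["latency_ms"]) if eligible else None
    raise ValueError(f"unknown read preference: {read_preference}")
```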

Automatic Failover

MongoDB uses a Raft-based election protocol. When the primary becomes unavailable, the replica set elects a new primary automatically:

1 Primary becomes unreachable and heartbeats stop arriving
2 Secondaries detect the outage after electionTimeoutMillis (10 s by default) and a candidate calls an election
3 Majority vote — the most up-to-date secondary wins
4 Driver detects the topology change and reroutes writes to the new primary in ~10–15 s

On Kubernetes: set member priority to prefer nodes in the primary AZ. To save resources, add an arbiter (votes in elections but stores no data), or mark extra data-bearing members as non-voting with votes: 0 and priority: 0.
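The four steps reduce to a quorum check plus an optime comparison. A toy Python sketch (illustrative; `elect_primary` is a hypothetical name, and real elections use Raft terms and member priorities):

```python
def elect_primary(members):
    """Toy replica-set election: if a strict majority of voting members
    is reachable, the reachable member with the newest oplog optime wins."""
    voters = [m for m in members if m.get("votes", 1) > 0]
    reachable = [m for m in voters if m["up"]]
    if len(reachable) <= len(voters) // 2:
        return None  # no quorum: the set has no primary and rejects writes
    return max(reachable, key=lambda m: m["optime"])
```

This is why a 3-member set survives one node loss but not two: with a single survivor there is no majority, and the set stays primaryless until quorum is restored.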

TOPOLOGY 2 — Sharded Cluster

Horizontal Scale-Out with mongos

When your dataset exceeds what a single replica set can serve — in throughput or storage — a sharded cluster distributes data across multiple shards. Each shard is itself a full replica set. Applications never talk to shards directly: they connect to mongos routers, which are stateless proxy pods that route operations to the correct shard using a shard key.

A sharded cluster has three component groups: mongos routers (stateless, scale horizontally as a Deployment), a Config Server Replica Set (3-node RS storing chunk ranges and metadata), and N data shards (each a 3-node RS). The mongos reads chunk metadata from the Config Server to route every query or write to the right shard.
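The routing step mongos performs can be sketched as a range lookup over chunk metadata (Python, illustrative; `route` is a hypothetical helper and the chunk values are toy data):

```python
import bisect

def route(chunks, shard_key_value):
    """mongos-style routing sketch: find the chunk whose range contains
    the shard key value and return the owning shard.
    chunks: list of (lower_bound, shard_name), sorted by lower_bound."""
    bounds = [lower for lower, _ in chunks]
    i = bisect.bisect_right(bounds, shard_key_value) - 1
    return chunks[i][1]

# Chunk metadata as the Config Server RS might hold it (toy values)
chunks = [(float("-inf"), "shard1"), (1000, "shard2"), (5000, "shard3")]
```

A query for shard key 42 lands on shard1, 1000 on shard2, 9999 on shard3 — the application never needs to know which shard holds what.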

Use sharding when: dataset exceeds 1 TB, write throughput needs horizontal scaling, or data must be partitioned by geography or tenant.
Avoid sharding when: your working set fits on a single RS — sharding adds operational complexity (Config Server RS, shard key design, chunk balancing).
MongoDB Sharded Cluster — Kubernetes Topology
[Diagram: application pods with topology-aware drivers connect to stateless mongos routers (a K8s Deployment). Each mongos routes by shard key, reading chunk metadata from the 3-node Config Server RS. Each data shard is a full replica set — a primary plus two secondaries tailing the oplog, one PVC per member — with PBM streaming oplog chunks to S3. The Config Server RS is deployed separately and is not a data shard.]

Config Server Replica Set

The Config Server RS is a dedicated 3-node replica set that stores the cluster's sharding metadata: which chunks of data live on which shard. Every mongos reads from this RS before routing operations.

  • Always exactly 3 nodes — never scale down
  • Outage of Config Server stops all mongos routing
  • Must be placed on Guaranteed QoS pods
  • Low storage requirements (~1–5 GB) but high availability requirement
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
spec:
  sharding:
    enabled: true
    configsvrReplSet:
      size: 3
      affinity:
        podAntiAffinityTopologyKey: topology.kubernetes.io/zone

Shard Key Selection

The shard key determines how data is distributed. Poor shard key selection leads to hotspots where all writes go to one shard, negating the benefit of sharding.

Pattern → Result
{ userId: "hashed" } → Even distribution ✓
{ tenantId: 1, _id: 1 } → Tenant partitioning ✓
{ createdAt: 1 } → Monotonic hotspot ✗
{ status: 1 } → Low cardinality ✗

Use hashed shard keys for maximum write distribution. Use compound range shard keys when zone sharding (geographic partitioning) is required.
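The hotspot effect is easy to demonstrate: hashing a monotonically increasing key spreads writes evenly, while range-sharding it sends every new insert to the last chunk. A Python sketch (illustrative; MongoDB uses its own 64-bit hash function, not MD5):

```python
import hashlib

N_SHARDS = 3

def shard_for(key, n_shards=N_SHARDS):
    """Hashed shard key sketch: hash the value, then mod by shard count."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_shards

# Hashed key: 10,000 sequential userIds spread across the shards
counts = [0] * N_SHARDS
for user_id in range(10_000):
    counts[shard_for(user_id)] += 1
# counts end up roughly equal: each shard takes about a third of the writes

# By contrast, a monotonic range key like { createdAt: 1 } sorts every
# new timestamp after all existing chunk bounds, so one shard absorbs
# every insert: the hotspot the table above warns about.
```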

3. Compute & Storage

MongoDB's performance depends almost entirely on memory: the WiredTiger cache must hold the working set, and the OS page cache provides a second layer. Getting resource limits and storage layout wrong on Kubernetes will cause constant cache evictions and poor read performance.

WiredTiger Cache Sizing

WiredTiger is MongoDB's default storage engine. It maintains an in-memory cache of working data. By default, the cache size is 50% of (RAM − 1 GB), with a 256 MB floor. On Kubernetes you must set this explicitly — the container memory limit is not automatically visible to MongoDB, and the default calculation may use the node's total RAM, leading to OOMKills.

BAD — implicit sizing
resources:
  limits:
    memory: "8Gi"
  requests:
    memory: "8Gi"
# wiredTigerCacheSizeGB not set →
# MongoDB may read host RAM (e.g. 128 GB)
# and set cache to 63 GB → OOMKill
GOOD — explicit cache sizing
resources:
  limits:
    memory: "16Gi"
  requests:
    memory: "16Gi"
# Rule: cache ≤ 0.5 × (containerMemory − 1 GB)
# For a 16 Gi container → 7.5 GB; 7 GB is a safe choice
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 7
MEMORY LAYOUT

MongoDB uses two memory layers: the WiredTiger cache (for modified and frequently-accessed data) and the OS page cache (for reads not in the WiredTiger cache). Set the container memory limit to at least 2× the WiredTiger cache size to leave room for the OS page cache, connections, and aggregation pipeline buffers.
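The sizing rules above reduce to simple arithmetic. A Python sketch (hypothetical helper names, assuming the 0.5 × (memory − 1 GB) rule described above):

```python
def wt_cache_size_gb(container_mem_gb):
    """WiredTiger-style sizing: 50% of (memory - 1 GB), floor of 0.25 GB.
    Set the result explicitly as storage.wiredTiger.engineConfig.cacheSizeGB."""
    return max(0.25, 0.5 * (container_mem_gb - 1))

def min_container_limit_gb(cache_gb):
    """Doc rule of thumb: container memory limit >= 2x the WT cache,
    leaving headroom for the OS page cache, connections, and pipelines."""
    return 2 * cache_gb
```

For a 16 Gi container this gives a 7.5 GB cache; rounding down to 7 GB (as in the GOOD example above) leaves extra headroom.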

Oplog Sizing

The oplog (operations log) is a capped collection on each replica set member that records all write operations. Its size determines your PITR window — how far back in time you can restore, and how much replication lag is tolerable before a secondary falls too far behind to resync.

Default (avoid in production) 5% of free disk (min 990 MB, max 50 GB) May roll over in <1 hour under write pressure
Recommended baseline 10–50 GB Provides a multi-hour PITR window; size to cover your longest maintenance duration
Heavy-write workloads 50–100 GB Ensures secondaries can lag and catch up without falling off the oplog
storage:
  oplogSizeMB: 51200  # 50 GB — the value is in MB, not GB
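Working backwards from your measured oplog write rate gives a concrete number. A Python sketch (illustrative; `oplog_size_mb` is a hypothetical helper):

```python
def oplog_size_mb(oplog_write_mb_per_hour, window_hours, safety_factor=2.0):
    """Size the oplog to cover a desired PITR/maintenance window at a
    measured oplog write rate, with headroom for write bursts."""
    return int(oplog_write_mb_per_hour * window_hours * safety_factor)

# e.g. ~1 GB/h of oplog traffic, 24 h window, 2x headroom -> 49152 MB (~48 GB)
```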

BEST_PRACTICE

  • Always set cacheSizeGB explicitly
  • Container limit ≥ 2× cacheSizeGB
  • Use Guaranteed QoS for primary members
  • Separate data and journal volumes
  • Oplog ≥ 10 GB for any production RS
  • Use NVMe-backed StorageClass for primary
  • Network-attached OK for secondaries

QOS_CLASSES

Guaranteed Primary + Config Server RS. requests == limits for both CPU and memory.
Burstable Secondaries and Analytics. Memory request set; CPU can burst.
Best Effort Never for MongoDB. Evicted first under memory pressure.

NVMe / Local SSD — Primary Members

RECOMMENDED

MongoDB's WiredTiger performs many small random I/Os for cache evictions and checkpoints. NVMe or local SSD-backed storage classes deliver the sub-millisecond latency these operations require. Use node-local PVCs via local-path-provisioner or cloud NVMe storage classes.

  • Sub-ms latency for checkpoint writes
  • High IOPS for cache eviction under pressure
  • Journal (MongoDB's write-ahead log) writes are sequential — benefit from NVMe burst

Network-Attached Storage — Secondaries

ACCEPTABLE

Secondary members replicate via oplog tailing — a largely sequential read/write pattern that is tolerant of higher latency. Cloud block storage (gp3, Premium SSD, Persistent Disk) is acceptable for secondaries, reducing cost without impacting write throughput on the primary.

  • Lower cost than NVMe for secondary nodes
  • Sequential oplog writes tolerate higher latency
  • Separate journal PVC prevents I/O contention

4. Monitoring

MongoDB produces rich observability data, but the default monitoring focus for Kubernetes operators differs from standalone deployments. Oplog window and replication lag are the two most operationally critical metrics — both invisible without dedicated exporters.

Prometheus + mongodb_exporter

The Percona mongodb_exporter scrapes MongoDB's serverStatus, replSetGetStatus, and dbStats endpoints and exposes them as Prometheus metrics. The Percona Operator for MongoDB deploys it as a sidecar and automatically creates a ServiceMonitor for Prometheus Operator.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mongo-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: percona-server-mongodb
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
PMM (Percona Monitoring & Management) provides a pre-built Grafana stack with MongoDB dashboards, query analytics, and replication topology views — available as a K8s Deployment alongside the operator.

Critical MongoDB Metrics

These five metrics should be the foundation of any MongoDB on Kubernetes alert configuration:

Oplog Window (hours)
mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp

Time span covered by the oplog. Alert if < 2× your longest maintenance window. Shrinking oplog window means secondaries may fall off and require full resync.
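Computing the window and the alert condition from the newest and oldest oplog timestamps (Python sketch, hypothetical names):

```python
def oplog_window_hours(head_ts, tail_ts):
    """Window covered by the oplog: newest entry minus oldest, in hours."""
    return (head_ts - tail_ts) / 3600.0

def oplog_window_alert(window_hours, longest_maintenance_hours):
    """Alert when the window drops below 2x the longest maintenance window."""
    return window_hours < 2 * longest_maintenance_hours
```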

Replication Lag (seconds)
mongodb_mongod_replset_member_replication_lag

How far behind each secondary is from the primary's optime. Alert at > 30 s; investigate at > 10 s under normal load.

WiredTiger Cache Hit Ratio
1 - (WiredTiger "pages read into cache" / "pages requested from the cache"), from serverStatus().wiredTiger.cache

Should stay above 95%. Dropping below 90% signals the working set no longer fits in cache — time to scale memory or shard.
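The same ratio as a function (Python sketch, hypothetical name; a page read into cache means the request missed and had to hit disk):

```python
def cache_hit_ratio(pages_requested, pages_read_into_cache):
    """Approximate WiredTiger cache hit ratio: share of cache requests
    that did not require reading a page from disk into the cache."""
    if pages_requested == 0:
        return 1.0  # no traffic yet: treat as a perfect hit rate
    return 1.0 - pages_read_into_cache / pages_requested
```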

Current Connections
mongodb_mongod_connections{state="current"}

MongoDB spawns one thread per connection. Alert when approaching net.maxIncomingConnections. A persistently high connection count means connection pooling is needed, either at the driver level (maxPoolSize) or via mongos in sharded clusters.

Page Faults / Disk I/O
mongodb_extra_info_page_faults_total

Page faults mean data was not in the WiredTiger cache or OS page cache — direct disk reads. Sustained page faults indicate insufficient memory.

Replica Set member state

Monitor mongodb_mongod_replset_member_state — alert on any member in state RECOVERING (3), UNKNOWN (6), or DOWN (8).

PVC capacity alerts

Set kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.80 — MongoDB does not pre-allocate but checkpoint files can spike disk usage.

Config Server RS health

For sharded clusters, alert immediately if the Config Server RS loses quorum. All mongos routing stops until a majority is restored.

mongos active connections

Track mongodb_mongos_connections for sharded clusters. mongos pools connections to each shard; size the pool to avoid shard connection exhaustion.

Slow query tracking

Enable profiling: 1 with slowOpThresholdMs: 100. PMM's Query Analytics surfaces the top slow operations with explain plan data.

Grafana dashboard

Percona provides pre-built Grafana dashboards for MongoDB Replica Set Overview, WiredTiger, and ReplSet Summary — importable via PMM or standalone JSON.

5. Backups & PITR

MongoDB point-in-time recovery is implemented differently from relational databases. Instead of WAL files, MongoDB uses the oplog — a capped collection that records every write operation. Percona Backup for MongoDB (PBM) continuously archives this oplog to object storage, enabling recovery to any second within the oplog window.

Percona Backup for MongoDB (PBM)

PBM is the recommended backup solution for self-managed MongoDB on Kubernetes. It supports both physical backups (direct copy of WiredTiger data files — fast for large datasets) and logical backups (BSON document dump). For PITR, PBM archives oplog chunks to S3-compatible storage continuously.

Physical Backup Copies WiredTiger data files directly. Backup time proportional to data size, not document count. Recommended for production (>50 GB).
Logical Backup BSON dump via mongodump. Slower but portable across MongoDB versions. Useful for development and smaller datasets.
Oplog Archiving (PITR) Continuously uploads oplog chunks to S3 on a configurable interval (default 10 min). Enables second-level PITR within the upload window.
Sharded Cluster Aware PBM coordinates backups across all shards and the Config Server RS simultaneously, ensuring a consistent cluster-wide snapshot.
# Trigger a physical backup
pbm backup --type=physical

# Restore to a point in time
pbm restore --time="2026-04-01T14:32:00"

# Check PITR coverage window
pbm status | grep -A5 "PITR"

PITR via Oplog Archiving

Point-in-time recovery combines a base backup snapshot with a continuous stream of oplog operations. Restoring to any point requires replaying the oplog from the nearest base backup up to the target timestamp.

[Timeline: base physical backup at T0; oplog chunks uploaded to S3 every 10 min; data-loss incident at T+N; restore target set to 5 min before the incident. Recovery = restore the base backup, then replay archived oplog up to the target timestamp.]
PITR granularity: seconds (limited by oplog upload interval)
PITR window = age of oldest base backup in S3
Oplog must not roll over between backup and restore target
Works identically for replica sets and sharded clusters
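The restore planner's job reduces to: pick the newest base backup at or before the target, then replay every oplog chunk covering the gap. A Python sketch (illustrative; PBM's real logic also validates chunk continuity and cluster-wide consistency):

```python
def pick_restore_plan(base_backups, oplog_chunks, target_ts):
    """Choose the newest base backup at or before target_ts, plus the
    archived oplog chunks needed to replay from that backup to the target.
    Timestamps are plain epoch seconds; chunks are (start, end) tuples."""
    candidates = [b for b in base_backups if b <= target_ts]
    if not candidates:
        raise ValueError("no base backup precedes the restore target")
    base = max(candidates)
    # keep chunks that end after the backup and start at or before the target
    replay = [(s, e) for (s, e) in oplog_chunks if e > base and s <= target_ts]
    return base, replay
```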
ACCEPTABLE
PBM Logical + PITR

Development clusters or datasets under 50 GB. Portable across MongoDB patch versions. Slower than physical for backup and restore.

6. MongoDB Operators Compared

The Kubernetes ecosystem offers several operators for MongoDB, ranging from full-featured open-source solutions to proprietary enterprise platforms. Here's how the main options compare on the features that matter most in production.

Percona Operator for MongoDB

Percona
License Apache 2.0
Sharding Full support
Backup PBM (physical + PITR)

The most feature-complete open-source MongoDB operator. Supports replica sets and sharded clusters, PBM integration for PITR, PMM for monitoring, TLS, custom roles, and multi-cluster federation. Active development with regular releases.

Apache 2.0 Sharding PBM PITR PMM
↗ percona.com

MongoDB Community Operator

MongoDB Inc.
License Apache 2.0
Sharding Not supported
Backup No native integration

The official open-source operator from MongoDB Inc. Handles replica set provisioning and basic lifecycle management. No sharding support, no built-in backup, no monitoring integration. Suitable for development clusters and use cases where Atlas Operator is not viable.

Apache 2.0 Replica Set only Basic lifecycle
↗ GitHub

KubeDB

AppsCode
License BSL / Proprietary
Sharding Supported
Backup Stash (paid)

Multi-database operator suite (MongoDB, PostgreSQL, MySQL, Redis, and more) with a single unified management plane. Enterprise features including automated backups and monitoring require a commercial license. Useful for teams that need a single operator for many database types under an enterprise agreement.

Multi-DB Enterprise Sharding
↗ kubedb.com

Atlas Kubernetes Operator

MongoDB Inc.
License Apache 2.0 (code)
Sharding Via Atlas
Self-hosted No — Atlas cloud only

Manages MongoDB Atlas clusters via Kubernetes CRDs. The operator code is open source but the database runs in Atlas (MongoDB's cloud service) — not on your own Kubernetes nodes. Offers full feature parity with Atlas including Global Clusters and Atlas Search. Requires an Atlas account and subscription costs.

Atlas Cloud Full-featured Subscription
↗ mongodb.com

OpenEverest: The Unified Approach

OpenEverest is a CNCF Sandbox project that provides a unified control plane for multiple database engines on Kubernetes. For MongoDB workloads, it currently uses Percona Operator for MongoDB as its engine — delivering full replica set and sharded cluster support, PBM-based PITR, and PMM monitoring through a single unified API.

OpenEverest is built on a modular operator architecture: the underlying engine is pluggable, and additional MongoDB operator backends are on the roadmap. The same modular approach already spans MySQL (Percona Operator for MySQL), MongoDB, and more engines planned. This means teams invest in one API, one RBAC model, and one monitoring stack — regardless of which database or operator runs underneath.

CNCF Sandbox Open Source Modular Architecture Percona Operator Now Sharding Support Multi-Database

Run MongoDB on Kubernetes — The Right Way

Replica sets with driver-native failover, sharded clusters with mongos, PBM PITR to S3 — all managed through a single open-source control plane. No vendor lock-in. No surprises.

100% Open Source
3+ DB Engines
40–60% Cost Savings