MongoDB

Document database built on BSON. Flexible schemas, a rich query language, and Atlas Vector Search for RAG.

Category: Databases
Difficulty: Beginner
When to use: Your data is document-shaped, your schema evolves, or you want vector search and operational storage in one place.
When not to use: Your workload is heavily relational with multi-table transactions, or you need strict SQL analytics.
Alternatives: PostgreSQL, DynamoDB, Elasticsearch, Firestore

At a glance

Category: Operational Database
Difficulty: Beginner → Intermediate (schema design and sharding get tricky)
When to use: Document-shaped data, evolving schema, RAG storage with Atlas Vector Search
When not to use: Heavy multi-entity transactions, SQL-first analytics teams
Alternatives: PostgreSQL (+ pgvector), DynamoDB, Elasticsearch, Firestore

The document model

MongoDB stores documents (JSON-like BSON) inside collections inside databases. A document is a self-contained aggregate — everything an API call needs, in one read. No joins if you don’t want them.

  • Max document size: 16 MB. This is a hard limit; design around it.
  • Every document has an _id, indexed by default.
  • Fields can be anything: scalars, arrays, nested docs, arrays of nested docs. You design for reads, not for normalization.

The mental shift from Postgres: think in aggregates, not tables. If your API always loads a user together with their last 10 orders, put them in the same document.
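A sketch of what that aggregate might look like, with illustrative collection and field names; the bounded-array update uses $push with $each and $slice:

```javascript
// Hypothetical user aggregate: profile plus the last 10 orders, denormalized
// so a single read serves the API. Names are illustrative.
db.users.insertOne({
  _id: 'u_123',
  email: 'ada@example.com',
  plan: 'pro',
  lastOrders: [
    { orderId: 'o_991', amount: 49.0, createdAt: ISODate('2026-01-15') },
    { orderId: 'o_990', amount: 19.0, createdAt: ISODate('2026-01-02') },
  ],
});

// Keep the embedded list bounded on each new order:
// $slice: -10 retains only the 10 most recent entries after the push.
db.users.updateOne(
  { _id: 'u_123' },
  { $push: { lastOrders: { $each: [newOrder], $slice: -10 } } }
);
```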

When to pick MongoDB over Postgres

Pick Mongo when:

  • The schema genuinely varies across records (user profiles with different integrations, event logs with arbitrary payloads).
  • You want the storage shape to match the API response shape.
  • You need horizontal scale past a single Postgres primary and want built-in sharding instead of Citus or bolt-ons.
  • You want Atlas Vector Search colocated with your operational data.

Pick Postgres when:

  • You have a truly relational workload with lots of joins and constraints.
  • You need strict ACID across multiple entities.
  • Your team is SQL-first and pgvector covers your vector needs.

There is no universally correct answer. Most systems end up with Postgres for the core transactional model and Mongo (or Redis, or S3) around it.

Aggregation pipeline

The aggregation framework is Mongo’s answer to SQL GROUP BY and window functions. A pipeline is a list of stages, each transforming the stream.

db.orders.aggregate([
  { $match: { status: 'paid', createdAt: { $gte: ISODate('2026-01-01') } } },
  {
    $group: {
      _id: '$userId',
      total: { $sum: '$amount' },
      count: { $sum: 1 },
    },
  },
  { $sort: { total: -1 } },
  { $limit: 10 },
  {
    $lookup: {
      from: 'users',
      localField: '_id',
      foreignField: '_id',
      as: 'user',
    },
  },
]);

Common stages to learn first: $match, $project, $group, $sort, $limit, $unwind, $lookup, $facet.
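$unwind is the one that trips up newcomers most: it emits one output document per array element. A sketch, assuming orders carry a tags array (field names are illustrative):

```javascript
// Count how often each tag appears across paid orders.
db.orders.aggregate([
  { $match: { status: 'paid' } },
  { $unwind: '$tags' },                        // one doc per array element
  { $group: { _id: '$tags', n: { $sum: 1 } } }, // then group like any scalar
  { $sort: { n: -1 } },
]);
```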

Atlas Vector Search for RAG

MongoDB Atlas supports native vector indexes. For a RAG system, this means you can store the chunk text, metadata, and embedding vector in a single document, and query with a single aggregation:

db.chunks.aggregate([
  {
    $vectorSearch: {
      index: 'chunks_embedding_idx',
      path: 'embedding',
      queryVector: queryEmbedding,
      numCandidates: 200,
      limit: 10,
      // Pre-filter inside the stage; 'tenant' must be indexed as a
      // filter field. A trailing $match would only post-filter the
      // 10 results already returned, silently hurting recall.
      filter: { tenant: 'ephizen' },
    },
  },
  { $project: { text: 1, source: 1, score: { $meta: 'vectorSearchScore' } } },
]);

The nice part: you get metadata filtering, hybrid search with Atlas Search (BM25), and your app database in one system. The not-nice part: you pay Atlas for a search-tier node.
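For reference, the vector index for the collection above can be created from mongosh with createSearchIndex. A sketch — numDimensions must match your embedding model, and cosine similarity is an assumption here:

```javascript
// Hypothetical vector index definition for db.chunks.
db.chunks.createSearchIndex({
  name: 'chunks_embedding_idx',
  type: 'vectorSearch',
  definition: {
    fields: [
      { type: 'vector', path: 'embedding', numDimensions: 1536, similarity: 'cosine' },
      { type: 'filter', path: 'tenant' }, // enables pre-filtering in $vectorSearch
    ],
  },
});
```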

Indexes: the non-optional ones

  • _id — automatic, unique.
  • Compound indexes on the exact fields your hot queries filter and sort on. Order matters: equality fields first, then sort field, then range fields (the “ESR” rule).
  • Unique indexes for any business uniqueness constraint (email, slug).
  • TTL indexes for ephemeral data — sessions, password resets, rate-limit counters.
  • Never rely on full collection scans. Check explain() output for COLLSCAN plans; every slow query in production traces back to a missing index.
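The bullets above as createIndex calls, with illustrative collection and field names. The compound index follows ESR for a query like find({ status: 'paid', createdAt: { $gte: … } }).sort({ total: -1 }):

```javascript
// ESR: equality (status) → sort (total) → range (createdAt).
db.orders.createIndex({ status: 1, total: -1, createdAt: 1 });

// Business uniqueness constraint, enforced at the storage layer.
db.users.createIndex({ email: 1 }, { unique: true });

// TTL: documents expire 3600 seconds after their createdAt timestamp.
db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 });
```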

Schema design pitfalls

  • Unbounded arrays. Pushing an entry for every user login into a user.logins array will eventually blow past the 16 MB limit. Put append-heavy data in its own collection and reference back.
  • Over-normalization. Copying SQL tables 1:1 into Mongo and doing $lookup everywhere defeats the point. Duplicate safely when the data is read-heavy and updates are rare.
  • No schema validation. “Schemaless” is a marketing word. Use JSON Schema validators on critical collections — at minimum for required fields and types.
  • $where and JavaScript expressions. These are full collection scans. Avoid them in hot paths.
  • One giant collection for everything. Separate by concern. The query planner and your indexes will thank you.
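A minimal validator for an existing collection, covering required fields and types only; collection and field names are illustrative:

```javascript
// Attach a $jsonSchema validator to an existing 'users' collection.
db.runCommand({
  collMod: 'users',
  validator: {
    $jsonSchema: {
      bsonType: 'object',
      required: ['email', 'createdAt'],
      properties: {
        email: { bsonType: 'string' },
        createdAt: { bsonType: 'date' },
        plan: { enum: ['free', 'pro', 'enterprise'] },
      },
    },
  },
  // 'moderate' skips validation on updates to pre-existing invalid docs,
  // so you can adopt validation without a backfill migration.
  validationLevel: 'moderate',
});
```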