Indexing & Queries

AeorDB indexes are opt-in and configured per-directory. Nothing is indexed by default – you control exactly which fields are indexed, with which strategies, and for which file types. This keeps the engine lean and predictable.

Index Configuration

Create a .config/indexes.json file in any directory to define indexes for files in that directory:

curl -X PUT http://localhost:3000/engine/users/.config/indexes.json \
  -H "Content-Type: application/json" \
  -d '{
    "indexes": [
      {"name": "name", "type": ["string", "trigram"]},
      {"name": "age", "type": "u64"},
      {"name": "city", "type": "string"},
      {"name": "email", "type": "trigram"},
      {"name": "created_at", "type": "timestamp"}
    ]
  }'

When this file is created or updated, the engine automatically triggers a background reindex of all existing files in the directory.

Index Types

| Type | Order-Preserving | Description |
|------|------------------|-------------|
| u64 | Yes | Unsigned 64-bit integer. Range-tracking with observed min/max. |
| i64 | Yes | Signed 64-bit integer. Shifted to [0.0, 1.0] for NVT storage. |
| f64 | Yes | 64-bit floating point. Clamping for NaN/Inf handling. |
| string | Partially | Exact string matching. Multi-stage scalar: first byte weighted + length. |
| timestamp | Yes | UTC millisecond timestamps. Range-tracking. |
| trigram | No | Trigram-based fuzzy text matching. Tolerates typos, supports substring search. |
| phonetic | No | General phonetic matching (Soundex algorithm). |
| soundex | No | Soundex encoding for English names. |
| dmetaphone | No | Double Metaphone for multi-cultural phonetic matching. |

Multi-Strategy Indexes

A single field can be indexed with multiple strategies by passing type as an array:

{"name": "title", "type": ["string", "trigram", "phonetic"]}

This creates three separate index files for the same field:

  • title.string.idx – exact match queries
  • title.trigram.idx – fuzzy/substring queries
  • title.phonetic.idx – phonetic queries

Use the appropriate query operator to target the desired index.
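As a rough illustration of the naming pattern above, the following Python sketch expands an index definition into one file name per strategy. The helper name `index_file_names` is hypothetical, and applying the same `<field>.<type>.idx` pattern to single-strategy definitions is an assumption extrapolated from the three examples listed.

```python
def index_file_names(definition):
    """Return one index file name per strategy, using the <field>.<type>.idx pattern."""
    types = definition["type"]
    if isinstance(types, str):
        # A single strategy may be written as a bare string instead of an array.
        types = [types]
    return [f"{definition['name']}.{strategy}.idx" for strategy in types]


title_files = index_file_names({"name": "title", "type": ["string", "trigram", "phonetic"]})
age_files = index_file_names({"name": "age", "type": "u64"})  # assumed to follow the same pattern
```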

How Indexes Work

AeorDB uses a Normalized Vector Table (NVT) for index lookups. Each indexed field gets its own NVT.

The NVT Approach

  1. A ScalarConverter maps each field value to a scalar in [0.0, 1.0]
  2. The scalar maps to a bucket in the NVT
  3. The bucket points to the matching entries

For numeric types (u64, i64, f64, timestamp), the converter tracks the observed min/max and distributes values uniformly across the [0.0, 1.0] range. This means range queries (gt, lt, between) are efficient – they resolve to a contiguous range of buckets.

For a query like WHERE age > 30:

  1. converter.to_scalar(30) computes where 30 falls in the bucket range
  2. All buckets after that point are candidates
  3. Only those buckets are scanned

Locating the starting bucket is O(1); the remaining work is a linear scan over the entries in the candidate buckets.
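The lookup described above can be sketched in Python. This is a simplified model, not AeorDB's implementation: the class `RangeScalarConverter`, the helpers `bucket_for` and `query_greater_than`, and the choice of 8 buckets are all hypothetical, and the sketch observes every value before bucketing so that min/max never shift afterwards.

```python
class RangeScalarConverter:
    """Hypothetical min/max-tracking converter for numeric values."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def observe(self, value):
        # Track the smallest and largest value seen so far.
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def to_scalar(self, value):
        if self.lo is None or self.hi == self.lo:
            return 0.0
        scalar = (value - self.lo) / (self.hi - self.lo)
        return min(1.0, max(0.0, scalar))  # clamp values outside the observed range


def bucket_for(scalar, bucket_count):
    # A scalar of exactly 1.0 belongs to the last bucket.
    return min(int(scalar * bucket_count), bucket_count - 1)


def query_greater_than(buckets, converter, threshold):
    start = bucket_for(converter.to_scalar(threshold), len(buckets))
    hits = []
    for bucket in buckets[start:]:  # only candidate buckets are scanned
        hits.extend(value for value in bucket if value > threshold)
    return hits


ages = [18, 25, 30, 31, 42, 67, 90]
converter = RangeScalarConverter()
for age in ages:
    converter.observe(age)

buckets = [[] for _ in range(8)]
for age in ages:
    buckets[bucket_for(converter.to_scalar(age), len(buckets))].append(age)

older_than_30 = query_greater_than(buckets, converter, 30)
```

Because the mapping preserves order, everything in buckets before the starting bucket is guaranteed to be at or below the threshold and is never touched.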

Two-Tier Execution

Simple queries (single field, direct comparison) use direct scalar lookups – no bitmaps, no compositing. Most queries fall into this tier.

Complex queries (OR, NOT, multi-field boolean logic) build NVT bitmaps and composite them:

  • Each field condition produces a bitmask over the NVT buckets
  • AND = bitwise AND of masks
  • OR = bitwise OR of masks
  • NOT = bitwise NOT of a mask
  • The final mask identifies which buckets contain results

Memory usage is bounded: a mask needs one bit per bucket, so a bitmask for 1M (2^20) buckets is only 128 KiB, regardless of how many entries exist.
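The compositing rules above map directly onto integer bit operations. The following Python sketch uses plain integers as bitmasks; the 16-bucket table, the bucket numbers, and the helper names are invented for illustration, and it assumes all masks are defined over the same bucket space.

```python
BUCKET_COUNT = 16
FULL_MASK = (1 << BUCKET_COUNT) - 1


def mask_from_buckets(bucket_ids):
    """Set one bit per bucket that may contain matches for a condition."""
    mask = 0
    for bucket_id in bucket_ids:
        mask |= 1 << bucket_id
    return mask


def mask_not(mask):
    # Invert, then keep only the low BUCKET_COUNT bits.
    return ~mask & FULL_MASK


def buckets_in(mask):
    return [b for b in range(BUCKET_COUNT) if (mask >> b) & 1]


# Hypothetical per-condition masks.
age_gt_30 = mask_from_buckets(range(5, 16))
city_portland = mask_from_buckets([3, 7, 9])
role_banned = mask_from_buckets([7])

# age > 30 AND city = Portland AND NOT role = banned
combined = age_gt_30 & city_portland & mask_not(role_banned)
result_buckets = buckets_in(combined)

# One bit per bucket: 2^20 buckets need 2^20 / 8 bytes.
bytes_for_one_million_buckets = (1 << 20) // 8
```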

Source Resolution

By default, the name field in an index definition is used as the JSON key to extract the value:

{"name": "age", "type": "u64"}

This extracts the age field from {"name": "Alice", "age": 30}.

Nested Fields

For nested JSON or parser output, use the source array to specify the path:

{"name": "author", "source": ["metadata", "author"], "type": "string"}

This extracts metadata.author from a JSON structure like:

{"metadata": {"author": "Jane Smith", "title": "Report"}}

The source array supports:

  • String segments for object key lookup: ["metadata", "author"]
  • Integer segments for array index access: ["items", 0, "name"]
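A minimal Python sketch of how such a path could be walked is shown below. The function name `resolve_source` is hypothetical, and returning `None` for a missing key or out-of-range index is an assumption; the engine's behavior for missing fields is not described here.

```python
def resolve_source(document, path):
    """Walk `path` through nested dicts and lists; return None if a segment is missing."""
    current = document
    for segment in path:
        if isinstance(segment, str) and isinstance(current, dict) and segment in current:
            current = current[segment]  # string segment: object key lookup
        elif isinstance(segment, int) and isinstance(current, list) and 0 <= segment < len(current):
            current = current[segment]  # integer segment: array index access
        else:
            return None  # assumed behavior for an unresolvable path
    return current


document = {
    "metadata": {"author": "Jane Smith", "title": "Report"},
    "items": [{"name": "first"}, {"name": "second"}],
}
author = resolve_source(document, ["metadata", "author"])
first_item_name = resolve_source(document, ["items", 0, "name"])
```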

Plugin Mapper

For complex extraction logic, delegate to a WASM plugin:

{
  "name": "summary",
  "source": {"plugin": "my-mapper", "args": {"mode": "summary", "max_length": 500}},
  "type": "trigram"
}

The plugin receives the parsed JSON and the args object, and returns the extracted field value.

WASM Parser Integration

For non-JSON files (PDFs, images, XML, etc.), a parser plugin converts raw bytes into a JSON object that the indexing pipeline can work with.

Configuration

{
  "parser": "pdf-extractor",
  "parser_memory_limit": "256mb",
  "indexes": [
    {"name": "title", "source": ["metadata", "title"], "type": ["string", "trigram"]},
    {"name": "author", "source": ["metadata", "author"], "type": "phonetic"},
    {"name": "content", "source": ["text"], "type": "trigram"},
    {"name": "page_count", "source": ["metadata", "page_count"], "type": "u64"}
  ]
}

The parser receives a JSON envelope with the file data (base64-encoded) and metadata:

{
  "data": "<base64-encoded file bytes>",
  "meta": {
    "filename": "report.pdf",
    "path": "/docs/reports/report.pdf",
    "content_type": "application/pdf",
    "size": 1048576
  }
}

The parser returns a JSON object (like {"text": "...", "metadata": {"title": "...", ...}}), and the source paths in each index definition walk this JSON to extract field values.
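To make the envelope format concrete, here is a Python sketch that builds one and passes it to a stand-in parser. Real parsers are WASM plugins; `build_parser_envelope` and `fake_parser` are hypothetical names used only to show the data flow, and the stand-in simply treats the bytes as UTF-8 text.

```python
import base64


def build_parser_envelope(raw_bytes, filename, path, content_type):
    """Wrap raw file bytes in the envelope shape shown above."""
    return {
        "data": base64.b64encode(raw_bytes).decode("ascii"),
        "meta": {
            "filename": filename,
            "path": path,
            "content_type": content_type,
            "size": len(raw_bytes),
        },
    }


def fake_parser(envelope):
    """Stand-in for a parser plugin: decode the bytes and return a JSON-like dict."""
    raw = base64.b64decode(envelope["data"])
    return {
        "text": raw.decode("utf-8", errors="replace"),
        "metadata": {"title": envelope["meta"]["filename"]},
    }


envelope = build_parser_envelope(
    b"Quarterly report", "report.txt", "/docs/reports/report.txt", "text/plain"
)
parsed = fake_parser(envelope)
```

An index definition with `"source": ["metadata", "title"]` would then read `parsed["metadata"]["title"]` from this output.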

Global Parser Registry

You can also register parsers globally by content type at /.config/parsers.json:

{
  "application/pdf": "pdf-extractor",
  "image/jpeg": "image-metadata",
  "image/png": "image-metadata"
}

When a file is stored and no parser is configured in the directory’s index config, the engine checks this registry using the file’s content type.
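The lookup order can be summarized with a short Python sketch. The function `resolve_parser` is hypothetical, and returning `None` when neither source names a parser is an assumption.

```python
def resolve_parser(directory_config, global_registry, content_type):
    """Prefer the directory's own parser; otherwise fall back to the global registry."""
    parser = (directory_config or {}).get("parser")
    if parser:
        return parser
    return global_registry.get(content_type)  # None if the content type is unregistered


registry = {
    "application/pdf": "pdf-extractor",
    "image/jpeg": "image-metadata",
    "image/png": "image-metadata",
}

from_directory = resolve_parser({"parser": "custom-pdf"}, registry, "application/pdf")
from_registry = resolve_parser({"indexes": []}, registry, "image/png")
unresolved = resolve_parser(None, registry, "text/csv")
```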

Failure Handling

Parser and indexing failures never prevent file storage. The file is always stored regardless of parse/index errors. If logging is enabled in the index config ("logging": true), errors are written to .logs/ under the directory.
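The store-first guarantee can be pictured with the following Python sketch. It models storage as a dict and the log as a list; `store_and_index` and `failing_indexer` are hypothetical names, and the log line format is invented.

```python
def store_and_index(storage, logs, path, data, index_file, logging_enabled=True):
    """Store first, then index; an indexing error is logged but never propagates."""
    storage[path] = data  # storage happens unconditionally
    try:
        index_file(path, data)
    except Exception as error:  # parser or indexer failure
        if logging_enabled:
            logs.append(f"{path}: {error}")
        return False
    return True


def failing_indexer(path, data):
    raise ValueError("parser crashed")


storage, logs = {}, []
indexed = store_and_index(storage, logs, "/docs/report.pdf", b"%PDF", failing_indexer)
```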

Automatic Reindexing

When you store or update a .config/indexes.json file, the engine automatically enqueues a background reindex task for that directory. The task:

  1. Reads the current index config
  2. Lists all files in the directory
  3. Re-runs the indexing pipeline for each file (in batches of 50, yielding between batches)
  4. Reports progress via GET /admin/tasks

During reindexing, queries still work but may return incomplete results. The query response includes a meta.reindexing field with the current progress:

{
  "results": [...],
  "meta": {
    "reindexing": 0.67,
    "reindexing_eta": 1775968398803,
    "reindexing_indexed": 6700,
    "reindexing_total": 10000
  }
}
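The batching and progress reporting can be sketched in Python as a generator. This is an illustrative model only: `reindex_in_batches` is a hypothetical name, a generator `yield` stands in for the engine yielding between batches, and computing `reindexing` as `reindexing_indexed / reindexing_total` is an assumption that happens to match the sample response (6700 / 10000 = 0.67).

```python
def reindex_in_batches(files, index_one, batch_size=50):
    """Index files batch by batch, yielding a progress snapshot after each batch."""
    total = len(files)
    for start in range(0, total, batch_size):
        batch = files[start:start + batch_size]
        for file_path in batch:
            index_one(file_path)
        indexed = start + len(batch)
        yield {
            "reindexing": indexed / total,
            "reindexing_indexed": indexed,
            "reindexing_total": total,
        }


indexed_files = []
file_list = [f"file-{i}.json" for i in range(120)]
progress = list(reindex_in_batches(file_list, indexed_files.append))
```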

Query API

Queries are submitted as POST /query with a JSON body:

{
  "path": "/users/",
  "where": {
    "and": [
      {"field": "age", "op": "gt", "value": 30},
      {"field": "city", "op": "eq", "value": "Portland"},
      {"not": {"field": "role", "op": "eq", "value": "banned"}}
    ]
  },
  "sort": {"field": "age", "order": "desc"},
  "limit": 50,
  "offset": 0
}
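For clients that are not using curl, a request like the one above could be assembled with Python's standard library. The sketch below only builds the request and does not send it. The helper `build_query_request` is hypothetical, the base URL is taken from the curl example earlier on this page, and the `Content-Type` header and the treatment of `sort` as optional are assumptions.

```python
import json
import urllib.request


def build_query_request(base_url, path, where, sort=None, limit=50, offset=0):
    """Build (but do not send) a POST /query request."""
    body = {"path": path, "where": where, "limit": limit, "offset": offset}
    if sort is not None:
        body["sort"] = sort
    return urllib.request.Request(
        f"{base_url}/query",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed, since the body is JSON
    )


request = build_query_request(
    "http://localhost:3000",
    "/users/",
    {"and": [
        {"field": "age", "op": "gt", "value": 30},
        {"field": "city", "op": "eq", "value": "Portland"},
    ]},
    sort={"field": "age", "order": "desc"},
)
# urllib.request.urlopen(request) would submit it to a running server.
```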

Boolean Logic

The where clause supports full boolean logic:

{
  "where": {
    "or": [
      {"field": "city", "op": "eq", "value": "Portland"},
      {
        "and": [
          {"field": "age", "op": "gt", "value": 25},
          {"field": "city", "op": "eq", "value": "Seattle"}
        ]
      }
    ]
  }
}

For backward compatibility, a flat array in where is treated as an implicit and:

{
  "where": [
    {"field": "age", "op": "gt", "value": 30},
    {"field": "city", "op": "eq", "value": "Portland"}
  ]
}
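The semantics of the where clause, including the flat-array shorthand, can be modeled with a small in-memory evaluator in Python. This is not how the engine executes queries (it works on index buckets, not records); `normalize_where` and `matches` are hypothetical helpers, only the comparison operators are covered, and treating a missing field as a non-match is an assumption.

```python
import operator

COMPARATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "between": lambda value, bounds: bounds[0] <= value <= bounds[1],  # inclusive
}


def normalize_where(where):
    """Treat a flat array as an implicit `and`."""
    return {"and": where} if isinstance(where, list) else where


def matches(record, node):
    node = normalize_where(node)
    if "and" in node:
        return all(matches(record, child) for child in node["and"])
    if "or" in node:
        return any(matches(record, child) for child in node["or"])
    if "not" in node:
        return not matches(record, node["not"])
    if node["field"] not in record:
        return False  # assumed: a missing field never satisfies a condition
    return COMPARATORS[node["op"]](record[node["field"]], node["value"])


where = {
    "or": [
        {"field": "city", "op": "eq", "value": "Portland"},
        {"and": [
            {"field": "age", "op": "gt", "value": 25},
            {"field": "city", "op": "eq", "value": "Seattle"},
        ]},
    ]
}
portland_match = matches({"age": 20, "city": "Portland"}, where)
seattle_too_young = matches({"age": 22, "city": "Seattle"}, where)
```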

Query Operators

| Operator | Description | Value Type |
|----------|-------------|------------|
| eq | Equals | any |
| gt | Greater than | numeric, timestamp |
| gte | Greater than or equal | numeric, timestamp |
| lt | Less than | numeric, timestamp |
| lte | Less than or equal | numeric, timestamp |
| between | Inclusive range | [min, max] |
| fuzzy | Trigram fuzzy match | string (requires trigram index) |
| phonetic | Phonetic match | string (requires phonetic/soundex/dmetaphone index) |
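To give an intuition for why a trigram index tolerates typos, the Python sketch below scores two strings by the overlap of their three-character substrings. It shows one common formulation (Jaccard similarity over padded, lowercased trigrams) and is not necessarily the scoring AeorDB uses; the padding scheme and function names are assumptions.

```python
def trigrams(text):
    """Return the set of 3-character substrings of the padded, lowercased text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, in [0.0, 1.0]."""
    first, second = trigrams(a), trigrams(b)
    return len(first & second) / len(first | second)


exact = trigram_similarity("Portland", "portland")
typo = trigram_similarity("Portland", "Portlnad")
unrelated = trigram_similarity("Portland", "Seattle")
```

A transposed pair of letters changes only the trigrams that overlap the typo, so the misspelled value still shares a large part of its trigram set with the intended one, while an unrelated string shares few or none.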

Next Steps