Search Engines

Index & Reverse Index

yash101

Published 2/7/2025

Updated 2/7/2025

Jump to Page 📚

🔄 Where we left off

In the previous page, we implemented data acquisition and data cleanup. We are now able to:

Acquire some text data (enron emails)
Clean up the text
Tokenize the text into lists of tokens

💡 Indices

In the first part of this series, Introduction to the Search Problem, we did a thought experiment into using the table of contents and the index of a textbook to find more about a topic. Can we use the same methods to search our corpus?

Corpus: In search, a corpus is simply the collection of text documents that a search engine indexes and retrieves results

🤔 Think to yourself, how do you find a topic in a textbook?

🗃️ Building an Index

🤔 Think to yourself, How are table of contents and indices created in books?

Textbooks are written manually by authors and edited by a team of editors. The amount of text and the number of concepts and terms in a textbook are finite. Thus, editors have the ability to pick and choose which topics and terms will be the most useful for their readers.

However, if you are building a replacement to Google, it is impossible to manually extract the important topics to build an index from each page of each website.

🤔 Think to yourself, what information is required to build an index? And how can we build an index programmatically?

🧩 The Morphology of an Index

The index in a textbook maps each entry to the pages which contain it. Example index entry:

A  
Antibiotics, 65-68  
Archaea, 15, 22-24  
Autotrophic bacteria, 30  

D  
DNA replication in prokaryotes, 32-34

H  
Horizontal gene transfer, 38-42  

M  
Microbiome, 60-63  
Mitochondria (Endosymbiotic Theory), 70

P  
Pathogenic bacteria, 55-58  
Plasmids, 36  
Prokaryotic ribosomes, 21  

R  
Ribosomes, 21  
Respiration in prokaryotes, 44  

V  
Viruses vs. Bacteria, 75-78

Notice when looking for prokaryote, how our index has the following entries:

DNA replication in prokaryotes
Prokaryotic ribosimes
Respiration in Prokaryotes

Each entry in the index contains a list of pages where to look to read more.

🤨 Terminology: Index, Forward Index, Reverse Index, Inverted Index

There are two ways we can represent our documents, each of which has its own purpose: the “forward index” and “reverse index” (also known as an inverted index, or index).

Let’s illustrate a simple example:

Documents
---------
1️⃣ The cat sits on the mat.
2️⃣ The dog sleeps on the mat.
3️⃣ The cat and dog play together.
4️⃣ The quick brown fox jumps over the lazy dog

A forward index contains the list of tokens in each document:

Forward Index
-------------
1️⃣ { The, cat, sits, on, the, mat }
2️⃣ { The, dog, sleeps, on, the, mat }
3️⃣ { The, cat, and, dog, play, together }
4️⃣ { The, quick, brown, fox, jumps, over, the, lazy, dog }

The reverse index contains the list of documents which contain each token.

Reverse index
-------------
the: 1️⃣2️⃣3️⃣4️⃣
cat: 1️⃣3️⃣
sits: 1️⃣
on: 1️⃣2️⃣
mat: 1️⃣2️⃣
dog: 2️⃣3️⃣4️⃣
sleeps: 2️⃣
and: 3️⃣
play: 3️⃣
together: 3️⃣
quick: 4️⃣
brown: 4️⃣
fox: 4️⃣
over: 4️⃣
lazy: 4️⃣

If we want to search “dog cat play”, we can look up in the index:

dog: 2️⃣3️⃣4️⃣
cat: 1️⃣3️⃣
play: 3️⃣

And let’s simply rank the documents on which was found most:

3️⃣: 3 hits
2️⃣: 1 hit
1️⃣: 1 hit
4️⃣: 1 hit

Thus, the best result for our query is document 3️⃣.

Forward Index: a data structure which maps each document to tokens it contains.
Reverse index, Inverted index, Index: a mapping from a token to its appearances; in search engines, we can use a reverse index to map a token to the documents it appears in

🧑‍💻 Coding a Reverse Index

Let’s attempt to code this!

☕️ The code!

class Indexer {
  constructor() {
    this.idx = new Map();
  }

  index(text, uri) {
    // Build a forward index from the document
    const tokens = new Set();
    tokenize(preprocess(text)).forEach(wordSet.add);

    // Update the reverse index with all of our tokens
    for (const token of tokens) {
      if (!this.idx.has(token)) {
        this.idx.set(token, []);
      }

      this.idx.get(token).push(uri);
    }
  }
}

⚙️ How does it work?

Since a reverse index maps a term to its occurrences, we will use a JavaScript Map object to hold the reverse index
- this.idx = new Map() creates the Map which we will use
We tokenize the text and generate a list of tokens
- tokenize(preprocess(text)) $\to$ string[]
We add each token to a JavaScript Set to deduplicate
- const tokens = new Set(); creates a set to add tokens to
- (...).forEach(wordSet.add) adds each token to that set
We add each token to the index: for (const token of tokens) { ... }
For each token, we check if it exists in the reverse index. If it doesn’t, then we add an empty array to the index

if (!this.idx.has(token)) {
  this.idx.set(token, []);
}

We add the document identifier for the array in the reverse index for each token
- this.idx.get(token).push(uri);

Indexing our Corpus (Enron emails dataset) 📚

Now that we know how to index our corpus and have built our index(document) function, we just need to feed in documents to build an index.

const indexer = new Indexer();

// ...

for await (const [ filepath, contents ] of readFilesRecursively(dirPath)) {
  try {
    indexer.index(contents, filepath);
  } catch (e) {
    console.log(filepath, contents);
    console.error(e);
  }
}

I ran this function and indexed the entire Enron emails dataset.

Searching Using a Reverse Index 🕵️‍♀️

Building the reverse index is the hard part. Turns out that searching with a reverse index is easy and very efficient.

The time complexity of the search operation, where $t$ is the numebr of tokens in the search string:

O (t)

And the space complexity of the search operation, where $t$ is the number of tokens, and $d$ is the average number of documents mapped to a token:

O (t \times \bar{d})

Pseudocode Steps to search using a reverse index:

Normalize the search query to the reverse index. This is fancy speak for use the exact same methods to process the search query as the contents of the reverse index.
Create a hash map to map each document found to the number times it was found.
Add a document to the hash map if it is a hit and increment its hit count (score)
Rank the documents based on these scores

Implemented,

query(searchQuery) {
  if (!searchQuery)
    return [];

  // List of tokens in the search query
  const tokens = new Set(tokenize(preprocess(searchQuery)));

  // Forward index we will build by reversing the reverse index
  const unrankedResults = new Map();
  for (const token of tokens) {
    // Get all documents associated with a token
    const hits = idx.get(term) || [];

    // Add each document to the unrankedResults map and/or increment their value (+1 for each time a document is found)
    for (const hit of hits) {
      const current = unrankedResults.get(hit) || 0;
      unrankedResults.set(hit, current + 1);
    }
  }

  return Array
    .from(unrankedResults.entries())
    .sort((a, b) => b[1] - a[1])
}

Try the Demo Below: it implements searching with a prebuilt reverse index of 5 documents.

Further Learning:

Try playing with the search functionality above. Try adding or modifying doSearch(query) to try more queries.
Try understanding how the reverse index works (format the JSON for the index and understand how it works)

🐘 The Elephant in the Room

The index in the playground is humongous. And to make matters worse, it was built with just 5 documents! Our naïve reverse index does not scale well!

Attempting to index the entire Enron email dataset resulted in a disastrously big index. The memory usage of Node.js was 2 GB of RAM, quite a costly VPS or pod!

Node.js is using 2 GB of RAM

Next, I tried dumping the full index, only to get an error that the JSON blob was too big to write. Oh, and yes, that is $1, 080, 813$ entries in the reverse index!

debug
1080813
file:///Users/yash/code/microsearch/runner.js:84
      fs.writeFile('debug.json', JSON.stringify(Object.fromEntries(index.idx), null, 2));
                                      ^

RangeError: Invalid string length
    at JSON.stringify (<anonymous>)
    at file:///Users/yash/code/microsearch/runner.js:84:39
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)

Node.js v22.13.0

Our goal was to be able to do fullly client-side search on a large dataset. Unfortunately, that is currently sounding like a pipe dream!

🎒 Summary

In this article, we learned what a forward index and reverse index are in a search engine. We also used our intuition on how to find a word in a textbook to develop the data structures and algorithms required to create the forward and reverse index.

⏭️ Next Section

The next section is a workshop where you will build a search engine using a reverse index and train it on Æsop’s Fables.

Data acquisition, Cleanup and Tokenization

Workshop | Build a Reverse Index