Search Engines
Term-frequency Optimization & Zipf's Law
yash101
Published 2/9/2025
Updated 2/9/2025
Previously, we built a simple search engine with an inverted index. However, our index turned out to be ridiculously huge! In this page, we will identify ways to optimize our index so it only contains what we need to deliver good search results.
🗄️ Not Everything Needs to be Indexed
In a textbook, indexes are small, maybe a few pages, or 50 pages at the most. If the editors attempted to index every single term in a book, and every single location that term can be found, the index would likely be larger than the textbook itself. Obviously, this isn’t practical.
Instead, authors and editors use their discretion to decide which terms are important enough to index, and which pages are important enough to include under each term.
When building a search engine, how can we choose the terms that matter and ignore those that don’t? A search engine does not normally have access to a page’s author or editors, so it has to figure this out on its own.
⚖️ Not all Words are Equal
Not every word carries the same weight in a sentence or a phrase. Some words provide rich context, while others add little or no value. If we examine the inverted index from our naïve approach, we notice a lot of junk: stopwords and other common words that don’t help in finding relevant information.
Take a look at this sentence:
'Moreover, in most languages such as English, we use lots of words which add little meaning.'
Now, think about the actual value of the following words:
🚫 'Moreover', 'in', 'most', 'such', 'as', 'we', 'use', 'lots', 'of', 'which'
✅ 'languages', 'English', 'words', 'add', 'little', 'meaning'
These words don’t significantly change the meaning. They act as glue but don’t carry much weight when searching for key concepts.
🤔 Word Frequency and Stopwords
Common words, often called stopwords, appear frequently but don’t significantly change the meaning of a sentence. In contrast, less common words carry more weight because they help differentiate between topics.
🧩 Example: If you search for ‘energy crisis’, the words ‘energy’ and ‘crisis’ are far more useful than ‘the’, ‘in’, or ‘of’, which appear in almost every document.
⁉️ So, how do we quantify which words are important and which ones can be ignored?
📊 Term-Frequency Map (TF)
A dirt-simple approach is to count the occurrences of each word throughout our corpus. By building a histogram of terms, we can see which words appear frequently and which appear rarely.
Once we have these counts, we can:
- ✅ Ignore extremely common words (like ‘the’, ‘in’, ‘of’).
- ✅ Prioritize rare but meaningful words (like ‘mitochondria’, ‘photosynthesis’).
Let’s build! To build a term-frequency map, we simply count every token in the corpus:
```javascript
const termFrequencyMap = new Map();

for (const article of corpus) {
  const { text } = article;

  // Normalize and split the article text into tokens
  const tokens = tokenize(normalize(text));

  // Add or increment the count of each token in the histogram map
  for (const token of tokens) {
    termFrequencyMap.set(token, (termFrequencyMap.get(token) || 0) + 1);
  }
}
```
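The `normalize` and `tokenize` helpers come from the earlier pages of this series and aren’t repeated here. As a rough sketch of what they might look like (an assumption, not necessarily the exact implementation used previously):

```javascript
// Hypothetical stand-ins for the helpers used above; the real versions from
// the earlier pages of this series may differ.
function normalize(text) {
  return text
    .toLowerCase()                  // case-insensitive matching
    .replace(/[^a-z0-9\s]+/g, ' '); // strip punctuation and symbols
}

function tokenize(text) {
  return text.split(/\s+/).filter(Boolean); // split on whitespace, drop empties
}
```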
🧐 Zipf’s Law
- Zipf’s law: ‘an empirical law stating that when a list of measured values is sorted in decreasing order, the value of the n-th entry is often approximately inversely proportional to n.’ (Wikipedia: Zipf’s law)
Zipf’s law states that the frequency of a word is inversely proportional to its rank $r$, mathematically expressed as:

$$f(r) = \frac{C}{r^s}$$

where:

- $s$ is a characteristic exponent, normally close to $1$ for natural language
- $r$ is the rank of a term: given the terms of a corpus sorted by their frequency of occurrence (highest first), the rank is the position of a term in that list
- $C$ is a constant used to normalize the probabilities so they sum to $1$; in this article we work with raw term counts, so $C$ acts simply as a fitted scaling constant

Since $\log f(r) = \log C - s \log r$, frequency plotted against rank on log-log axes should fall on an approximately straight line with slope $-s$, which is exactly what we will check with a linear regression below.
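To make the formula concrete, here is a tiny sketch that evaluates the Zipf prediction for a few ranks, plugging in values close to the fit reported later in this page ($s \approx 1.1$, $C \approx 3238$); the helper name is mine and the numbers are purely illustrative.

```javascript
// Predicted frequency of the term at rank r under Zipf's law: f(r) = C / r^s
const zipfFrequency = (rank, C = 3238, s = 1.1) => C / Math.pow(rank, s);

console.log(Math.round(zipfFrequency(1)));   // rank 1   -> ~3238
console.log(Math.round(zipfFrequency(10)));  // rank 10  -> ~257
console.log(Math.round(zipfFrequency(100))); // rank 100 -> ~20
```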
🕵️ Vocabulary Analysis in Æsop’s Fables
For the analysis and visualizations below, I used Python along with a few libraries:
- NumPy for numerical computations
- SciPy for statistical analysis
- Matplotlib for visualization.
Note that the code used to generate these plots is hidden using a magic command of this website. The code is available on the last page (errata) and is also linked in the sources section.
Analysis steps:

- Preprocess & tokenize the text. I used Node.js with the same preprocessing code as used elsewhere in this article
- Graph data (a rough sketch of this step follows the list):
  - Terms were sorted by their frequency, high to low
  - A random sample of terms was drawn from the corpus for this visualization (k = 2000 for the plot below)
  - SciPy was used to perform a linear regression on the log-log rank-frequency data
  - Results were graphed using Matplotlib
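The actual analysis was done in Python with NumPy, SciPy, and Matplotlib, and that code lives on the errata page. Purely to illustrate the fitting step in the same language as the rest of this article, here is a rough JavaScript sketch of an ordinary least-squares fit of log(frequency) against log(rank); the function name and sample data are mine, not from the original notebook.

```javascript
// Fit log(f) = log(C) - s * log(r) by ordinary least squares.
// `frequencies` is an array of term counts sorted high to low,
// so the rank of frequencies[i] is i + 1.
function fitZipf(frequencies) {
  const points = frequencies.map((f, i) => [Math.log(i + 1), Math.log(f)]);
  const n = points.length;

  const meanX = points.reduce((sum, [x]) => sum + x, 0) / n;
  const meanY = points.reduce((sum, [, y]) => sum + y, 0) / n;

  let sxy = 0;
  let sxx = 0;
  for (const [x, y] of points) {
    sxy += (x - meanX) * (y - meanY);
    sxx += (x - meanX) * (x - meanX);
  }

  const slope = sxy / sxx;                 // ≈ -s for Zipfian data
  const intercept = meanY - slope * meanX; // ≈ log(C)

  return { s: -slope, C: Math.exp(intercept) };
}

// Toy usage with made-up counts that roughly follow a 1/r curve
console.log(fitZipf([1000, 480, 330, 250, 200, 170, 145, 128, 115, 104]));
```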
📈 Visualization
Fitted Zipf parameters: s = 1.100, C = 3.238e+03
R-squared: 0.9778, p-value: 0, std-err: 0.003708
minimum frequency filter: n=1; sample count: k=2000
🕵️ Is the Vocabulary of Æsop’s Fables Zipfian?

Looking at the results from the regression above:

- There is a high goodness of fit: $R^2 = 0.9778$ means that approximately 98% of the variability in $\log(\text{frequency})$ is explained by $\log(\text{rank})$. This high value indicates that the relationship is almost perfectly linear in log-log space, which is exactly what we expect from Zipf’s law (i.e., a power-law distribution).
- There is a statistically significant relationship ($p \approx 0$). The vanishingly small p-value indicates that the probability of observing this linear relationship by random chance is negligible. It shows the relationship is statistically significant, confirming that the Zipfian model (power-law distribution) is appropriate for the data.
- Additionally, note how our regression results show $s \approx 1.1$? Turns out this matches most natural language datasets!
- ‼️ Note: the stairstepping at the end of the data is due to term frequency being discrete.
- ‼️ Note: the numbers may differ slightly if this regression is re-run, due to the random sampling.
🔩 Can we use this to Improve our Search?
Zipf’s law tells us that a small number of words appear very frequently, while most words are rare. But how can we use this insight to optimize search?
- **Applying Zipf’s Law to Index Reduction**

  If a few words dominate the corpus, we can shrink our index by ignoring:

  - High-frequency stopwords (e.g., “the”, “is”, “and”).
  - Ultra-rare words that appear only once and contribute little to retrieval.

- **Improving Search Efficiency**

  A naive search engine treats all words equally, but Zipf’s law suggests this is inefficient. Instead of indexing everything, we can prioritize rare but informative terms (e.g., ‘quantum’ vs. ‘the’) and compress our index by removing overly common words.
Let’s apply Zipf’s law by building a heuristic that significantly reduces the size of our index while minimizing the impact on search quality (a rough sketch follows the list below):
🤔 Heuristic

- We can count the number of occurrences of each term to get a histogram of term frequencies (`termFrequencyMap`: a hashmap of term → frequency)
- We can denoise our data by removing terms which are seen fewer than `MIN_FREQUENCY` times
- We can remove the `ALPHA` highest-frequency terms (stop words)
- We can remove the `BETA` lowest-frequency terms (noise and garbage)
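Here is a minimal sketch of that heuristic in JavaScript. The constants `ALPHA`, `BETA`, and `MIN_FREQUENCY` mirror the knobs in the interactive experiment below, but treating `ALPHA` and `BETA` as fractions of the vocabulary is my assumption; the real experiment may define them differently.

```javascript
// Assumed knobs, mirroring the constants in the experiment below.
const MIN_FREQUENCY = 2;  // drop terms seen fewer than this many times
const ALPHA = 0.01;       // assumed: fraction of highest-frequency terms to drop
const BETA = 0.05;        // assumed: fraction of lowest-frequency terms to drop

// Given a Map of term -> count, return the Set of terms worth indexing.
function selectIndexTerms(termFrequencyMap) {
  // Denoise: drop ultra-rare terms, then sort by frequency, high to low
  const entries = [...termFrequencyMap.entries()]
    .filter(([, count]) => count >= MIN_FREQUENCY)
    .sort((a, b) => b[1] - a[1]);

  // Cut the top ALPHA fraction (stopword-like terms) and the
  // bottom BETA fraction (noise and garbage)
  const start = Math.floor(entries.length * ALPHA);
  const end = entries.length - Math.floor(entries.length * BETA);

  return new Set(entries.slice(start, end).map(([term]) => term));
}

// Usage: only add a token to the inverted index if it survives the cut
// const keep = selectIndexTerms(termFrequencyMap);
// if (keep.has(token)) { /* index it */ }
```

With these cuts, the index keeps the informative middle of the Zipf curve, which is where most discriminative query terms live.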
🤨 Let’s try it!
🥽 Experimentation
- Try to modify `ALPHA`, `BETA`, and `MIN_FREQUENCY` and see how it affects the indexing
- Try to add a `MAX_FREQUENCY` constant and see how it affects indexing
- Try to figure out why there are no search results for ‘fox’ when `ALPHA` is set too high
📓 Sources
- Wikipedia: Zipf’s law
- Experiments:
- ChatGPT
  - Used for some help with proofreading and general article improvements
⏭️ Next Page
The next page is under construction.