Search Engines
Term-frequency Optimization & Zipf's Law
yash101
Published 2/9/2025
Updated 2/9/2025
Previously, we built a simple search engine with an inverted index. However, our index turned out to be ridiculously huge! In this page, we will identify ways to optimize our index so it only contains what we need to deliver good search results.
🗄️ Not Everything Needs to be Indexed
In a textbook, indexes are small, maybe a few pages, or 50 pages at the most. If the editors attempted to index every single term in a book, and every single location that term can be found, the index would likely be larger than the textbook itself. Obviously, this isn’t practical.
Instead, authors and editors use their discretion to decide which terms are important enough to index, and which pages are important enough to include under each term.
When building a search engine, how can we choose the terms that matter and ignore those that don’t? A search engine does not normally have access to a page’s author or editors, so it has to figure this out on its own.
⚖️ Not all Words are Equal
Not every word carries the same weight in a sentence or a phrase. Some words provide rich context, while others add little or no value. If we examine the inverted index from our naïve approach, we notice a lot of junk: stopwords and other common words that don’t help in finding relevant information.
Take a look at this sentence:
'Moreover, in most languages such as English, we use lots of words which add little meaning.'
Now, think about the actual value of the following words:
🚫 'Moreover', 'in', 'most', 'such', 'as', 'we', 'use', 'lots', 'of', 'which'
✅ 'languages', 'English', 'words', 'add', 'little', 'meaning'
These words don’t significantly change the meaning. They act as glue but don’t carry much weight when searching for key concepts.
🤔 Word Frequency and Stopwords
Common words, often called stopwords, appear frequently but don’t significantly change the meaning of a sentence. In contrast, less common words carry more weight because they help differentiate between topics.
🧩 Example: If you search for ‘energy crisis’, the words ‘energy’ and ‘crisis’ are far more useful than ‘the’, ‘in’, or ‘of’, which appear in almost every document.
⁉️ So, how do we quantify which words are important and which ones can be ignored?
📊 Term-Frequency Map (TF)
A dirt-simple approach is to count the occurrences of each word throughout our corpus. By building a histogram of terms, we can see which words appear frequently and which appear rarely.
Once we have these counts, we can:
- ✅ Ignore extremely common words (like ‘the’, ‘in’, ‘of’).
- ✅ Prioritize rare but meaningful words (like ‘mitochondria’, ‘photosynthesis’).
Let’s build! To build a term-frequency map, we simply count every token in the corpus:
```javascript
const termFrequencyMap = new Map();

for (const article of corpus) {
  const { text } = article;

  // Normalize and split the article text into tokens
  const tokens = tokenize(normalize(text));

  // Add or increment the count of each token in the histogram map
  for (const token of tokens) {
    termFrequencyMap.set(token, (termFrequencyMap.get(token) || 0) + 1);
  }
}
```
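The `normalize` and `tokenize` helpers come from the earlier pages of this series and aren’t repeated here. As a rough sketch of what they might look like (an assumption, not necessarily the exact implementation used previously):

```javascript
// Hypothetical stand-ins for the helpers used above; the real versions from
// the earlier pages of this series may differ.
function normalize(text) {
  return text
    .toLowerCase()                  // case-insensitive matching
    .replace(/[^a-z0-9\s]+/g, ' '); // strip punctuation and symbols
}

function tokenize(text) {
  return text.split(/\s+/).filter(Boolean); // split on whitespace, drop empties
}
```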
🧐 Zipf’s Law
- Zipf’s law: ‘an empirical law stating that when a list of measured values is sorted in decreasing order, the value of the n-th entry is often approximately inversely proportional to n.’ (Wikipedia: Zipf’s law)
Zipf’s law states that the frequency of a word is inversely proportional to its rank $r$, mathematically expressed as:

$$f(r) = \frac{C}{r^s}$$

where:

- $s$ is a characteristic exponent, normally close to $1$ for natural language
- $r$ is the rank of a term: given the terms of a corpus sorted by their frequency of occurrence (highest first), the rank is the position of a term in that list
- $C$ is a constant used to normalize the probabilities so they sum to $1$; in this article we work with raw term counts, so $C$ acts simply as a fitted scaling constant

Since $\log f(r) = \log C - s \log r$, frequency plotted against rank on log-log axes should fall on an approximately straight line with slope $-s$, which is exactly what we will check with a linear regression below.
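To make the formula concrete, here is a tiny sketch that evaluates the Zipf prediction for a few ranks, plugging in values close to the fit reported later in this page ($s \approx 1.1$, $C \approx 3238$); the helper name is mine and the numbers are purely illustrative.

```javascript
// Predicted frequency of the term at rank r under Zipf's law: f(r) = C / r^s
const zipfFrequency = (rank, C = 3238, s = 1.1) => C / Math.pow(rank, s);

console.log(Math.round(zipfFrequency(1)));   // rank 1   -> ~3238
console.log(Math.round(zipfFrequency(10)));  // rank 10  -> ~257
console.log(Math.round(zipfFrequency(100))); // rank 100 -> ~20
```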
🕵️ Vocabulary Analysis in Æsop’s Fables
For the analysis and visualizations below, I used Python along with a few libraries:
- NumPy for numerical computations
- SciPy for statistical analysis
- Matplotlib for visualization.
Note that the code used to generate these plots is hidden using a magic command of this website. The code is available on the last page (errata) and is also linked in the sources section.
Analysis steps:

- Preprocess & tokenize the text. I used Node.js with the same preprocessing code as used elsewhere in this article
- Graph data (a rough sketch of this step follows the list):
  - Terms were sorted by their frequency, high to low
  - A random sample of terms was drawn from the corpus for this visualization (k = 2000 for the plot below)
  - SciPy was used to perform a linear regression on the log-log rank-frequency data
  - Results were graphed using Matplotlib
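The actual analysis was done in Python with NumPy, SciPy, and Matplotlib, and that code lives on the errata page. Purely to illustrate the fitting step in the same language as the rest of this article, here is a rough JavaScript sketch of an ordinary least-squares fit of log(frequency) against log(rank); the function name and sample data are mine, not from the original notebook.

```javascript
// Fit log(f) = log(C) - s * log(r) by ordinary least squares.
// `frequencies` is an array of term counts sorted high to low,
// so the rank of frequencies[i] is i + 1.
function fitZipf(frequencies) {
  const points = frequencies.map((f, i) => [Math.log(i + 1), Math.log(f)]);
  const n = points.length;

  const meanX = points.reduce((sum, [x]) => sum + x, 0) / n;
  const meanY = points.reduce((sum, [, y]) => sum + y, 0) / n;

  let sxy = 0;
  let sxx = 0;
  for (const [x, y] of points) {
    sxy += (x - meanX) * (y - meanY);
    sxx += (x - meanX) * (x - meanX);
  }

  const slope = sxy / sxx;                 // ≈ -s for Zipfian data
  const intercept = meanY - slope * meanX; // ≈ log(C)

  return { s: -slope, C: Math.exp(intercept) };
}

// Toy usage with made-up counts that roughly follow a 1/r curve
console.log(fitZipf([1000, 480, 330, 250, 200, 170, 145, 128, 115, 104]));
```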
📈 Visualization
Fitted Zipf parameters: s = 1.100, C = 3.238e+03
R-squared: 0.9778, p-value: 0, std-err: 0.003708
minimum frequency filter: n=1; sample count: k=2000
🕵️ Is the Vocabulary of Æsop’s Fables Zipfian?

Looking at the results from the regression above:

- There is a high goodness of fit: $R^2 = 0.9778$ means that approximately 98% of the variability in $\log(\text{frequency})$ is explained by $\log(\text{rank})$. This high value indicates that the relationship is almost perfectly linear in log-log space, which is exactly what we expect from Zipf’s law (i.e., a power-law distribution).
- There is a statistically significant relationship ($p \approx 0$). The vanishingly small p-value indicates that the probability of observing this linear relationship by random chance is negligible. It shows the relationship is statistically significant, confirming that the Zipfian model (power-law distribution) is appropriate for the data.
- Additionally, note how our regression results show $s \approx 1.1$? Turns out this matches most natural language datasets!
- ‼️ Note: the stairstepping at the end of the data is due to term frequency being discrete.
- ‼️ Note: the numbers may differ slightly if this regression is re-run, due to the random sampling.
🔩 Can we use this to Improve our Search?
Zipf’s law tells us that a small number of words appear very frequently, while most words are rare. But how can we use this insight to optimize search?
- **Applying Zipf’s Law to Index Reduction**

  If a few words dominate the corpus, we can shrink our index by ignoring:

  - High-frequency stopwords (e.g., “the”, “is”, “and”).
  - Ultra-rare words that appear only once and contribute little to retrieval.

- **Improving Search Efficiency**

  A naive search engine treats all words equally, but Zipf’s law suggests this is inefficient. Instead of indexing everything, we can prioritize rare but informative terms (e.g., ‘quantum’ vs. ‘the’) and compress our index by removing overly common words.
Let’s apply Zipf’s law by building a heuristic that significantly reduces the size of our index while minimizing the impact on search quality (a rough sketch follows the list below):
🤔 Heuristic

- We can count the number of occurrences of each term to get a histogram of term frequencies (`termFrequencyMap`: a hashmap of term → frequency)
- We can denoise our data by removing terms which are seen fewer than `MIN_FREQUENCY` times
- We can remove the `ALPHA` highest-frequency terms (stop words)
- We can remove the `BETA` lowest-frequency terms (noise and garbage)
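Here is a minimal sketch of that heuristic in JavaScript. The constants `ALPHA`, `BETA`, and `MIN_FREQUENCY` mirror the knobs in the interactive experiment below, but treating `ALPHA` and `BETA` as fractions of the vocabulary is my assumption; the real experiment may define them differently.

```javascript
// Assumed knobs, mirroring the constants in the experiment below.
const MIN_FREQUENCY = 2;  // drop terms seen fewer than this many times
const ALPHA = 0.01;       // assumed: fraction of highest-frequency terms to drop
const BETA = 0.05;        // assumed: fraction of lowest-frequency terms to drop

// Given a Map of term -> count, return the Set of terms worth indexing.
function selectIndexTerms(termFrequencyMap) {
  // Denoise: drop ultra-rare terms, then sort by frequency, high to low
  const entries = [...termFrequencyMap.entries()]
    .filter(([, count]) => count >= MIN_FREQUENCY)
    .sort((a, b) => b[1] - a[1]);

  // Cut the top ALPHA fraction (stopword-like terms) and the
  // bottom BETA fraction (noise and garbage)
  const start = Math.floor(entries.length * ALPHA);
  const end = entries.length - Math.floor(entries.length * BETA);

  return new Set(entries.slice(start, end).map(([term]) => term));
}

// Usage: only add a token to the inverted index if it survives the cut
// const keep = selectIndexTerms(termFrequencyMap);
// if (keep.has(token)) { /* index it */ }
```

With these cuts, the index keeps the informative middle of the Zipf curve, which is where most discriminative query terms live.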
🤨 Let’s try it!
🥽 Experimentation
- Try to modify `ALPHA`, `BETA`, and `MIN_FREQUENCY` and see how it affects the indexing
- Try to add a `MAX_FREQUENCY` constant and see how it affects indexing
- Try to figure out why there are no search results for ‘fox’ when `ALPHA` is set too high
📓 Sources
- Wikipedia: Zipf’s law
- Experiments:
- ChatGPT
  - Used for some help with proofreading and general article improvements
⏭️ Next Page
The next page is under construction.