Search Engines
Introduction & The Search Problem
yash101
Published 2/9/2025
Updated 2/9/2025
🌅 Background
Search engines are everywhere! You probably used a search engine to find this article. Google, Bing, Yahoo, how do these work? And how can we build a simple but effective search engine?
In this article, we will dive deep into understanding the science of search, and use our understanding to attempt to build a client-side search engine for websites.
This article takes a first-principles approach. Instead of jumping straight to established methods, we will start with something familiar: looking up a topic in a textbook. From there, we will work on improving and optimizing our search methods to make our search engine lightweight yet functional.
💪 Motivation
💡 I have a few project ideas in mind that would require search algorithms:
- Implement search on this website
- Implement a better search engine for wikis such as Confluence or Wikipedia
- Implement a financial data engine
In this article, we will focus on the first - building a search engine for this website.
However, adding search to this website introduces a special challenge. This site is static-site generated and has NO backend. Zero, nada, zip-zilch. Traditionally, search engines have been implemented as a separate service (or set of services). But we will try to build a solution which can be efficiently implemented fully on the client.
🧠 Learning
My goal with writing this article is to learn how search works through trial-and-error, building a search engine, and describing how it works along with the topics I learn along the way.
Instead of researching existing search engine architectures, I’m taking a first-principles approach rather than jumping straight to established methods or using something off the shelf (Solr, Lucene, Elasticsearch). Thus, this article serves as a log of my journey.
⛳️ Goals
Some goals for this project:
- Learn how search engines work
- Learn some NLP (natural language processing) techniques
- Achieve a decent accuracy
- Search should have no backend and run 100% in the browser
- Search should support Unicode. Languages other than English exist, and so do emojis.
Tools Used 🛠️
- Node.js (v22.13.0)
- Enron Emails Dataset - used for large scale test data
- Æsop’s Fables - used for test data in our demos
- Web Browser with JavaScript support
Notes:
- For Æsop’s Fables, the text version of the Project Gutenberg archive was transformed into JSON for ease of use. The processing of the Project Gutenberg text is outside the scope of this project and may contain some errors.
- The majority of code will be written in JavaScript, due to the goal of being able to run search 100% on the browser.
Search, The Problem 👀
Once upon a time, the internet was in its infancy, Google was still under development, and CLRS, the legendary Introduction to Algorithms textbook, had just been published. But none of that mattered: you had a biology assignment due that required you to understand prokaryotic organisms. While the internet offered little help, your best resource sat in front of you - a biology textbook!
Now, take a moment and think… How do you find a topic in a textbook?
You could read the entire textbook, every single page, to find out more about prokaryotic organisms, but is that your best approach? Do you have the time to read through every single page for your assignment? Probably not. Thus, we need to do better and use the resources available to us to find a better process.
What resources does the textbook provide you to efficiently find what you are looking for?
Search, The Textbook Method 🗂️
Textbooks normally provide two tools, at your disposal, to help you find information quickly:
- Table of Contents (TOC) at the beginning
- Index at the end
Each of these tools works differently, but together they make searching for topics much faster than flipping through pages.
The Table of Contents is like a roadmap, listing chapters and sections in order, allowing you to navigate through a book top-down.
For example, if you need to learn about bacteria, the TOC might tell you:
📖 Chapter 3: Bacteria and Prokaryotes (Page 45)

This tells you where to start reading about bacteria. But if you need something very specific—like “Gram-positive bacteria”—you might not find that in the TOC.
The Index maps topics and terms to their location. Instead of listing topics by chapters, it lists every single important word or term in alphabetical order, along with the exact page numbers.
For example, if you need Gram-positive Bacteria, you can check the index:
🔍 Gram-positive bacteria – Pages 26, 28, 47
Now, you can go straight to the exact pages without reading everything before it.
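The book-index idea above maps directly to code. Below is a minimal sketch of such an index in JavaScript: each term points to the “pages” where it appears. The page texts and function names are illustrative assumptions, not the article’s final implementation; proper tokenization comes later.

```javascript
// Build a book-style index: term -> set of page numbers.
function buildIndex(pages) {
  const index = new Map();
  for (const [pageNumber, text] of Object.entries(pages)) {
    // Lowercase and split on non-letter characters; this is a
    // deliberately naive stand-in for real tokenization.
    for (const term of text.toLowerCase().split(/[^\p{L}]+/u)) {
      if (!term) continue;
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(Number(pageNumber));
    }
  }
  return index;
}

// Look up a term and return its page numbers in order.
function lookup(index, term) {
  return [...(index.get(term.toLowerCase()) ?? [])].sort((a, b) => a - b);
}

const pages = {
  26: 'Gram-positive bacteria retain the crystal violet stain.',
  28: 'Gram-positive bacteria have a thick peptidoglycan layer.',
  45: 'Bacteria and prokaryotes are single-celled organisms.',
};
const index = buildIndex(pages);
console.log(lookup(index, 'bacteria')); // [26, 28, 45]
```

Just like the printed index, a lookup jumps straight to the relevant pages without scanning everything before them.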
Can we model a simple search engine on the same process used to find a topic in a book?
Attempting to Create a High-Level Overview 🍳
Before diving into implementation, let’s break down the core components of a search system by asking two key questions:
- What components are necessary for search?
- What role does each component play?
Data Acquisition 🏗️
Before we can search anything, we need to gather and prepare the data. This involves two key steps:
- 🧺 Gather the data
- On the internet: render pages, find and follow links, and download the rendered content
- Local: recursively find files in a directory, read them
- 🧼 Clean up the data - prepare the data so it can be searched
- Clean up unicode code points
- Identify meaningful sections and metadata
- Standardize formatting for consistent searchability
Once we have acquired and cleaned the data, we face a design decision: do we preprocess heavily upfront to enable efficient search later, or do we store raw data and brute-force the search at query time?
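The two ends of that trade-off can be sketched side by side. The documents and names below are illustrative, not part of the article’s implementation; the point is only where the work happens, not how it should finally be done.

```javascript
// Option 1: no preprocessing — scan every document at query time.
// Cheap to set up, but each query costs O(total text size).
function bruteForceSearch(docs, term) {
  return docs.filter((doc) => doc.toLowerCase().includes(term.toLowerCase()));
}

// Option 2: pay an upfront cost to build a term -> doc-id index,
// then answer each query with a single map lookup.
function buildTermIndex(docs) {
  const index = new Map();
  docs.forEach((doc, id) => {
    for (const term of new Set(doc.toLowerCase().split(/\s+/))) {
      if (!index.has(term)) index.set(term, []);
      index.get(term).push(id);
    }
  });
  return index;
}

const docs = ['the fox and the grapes', 'the tortoise and the hare'];
console.log(bruteForceSearch(docs, 'fox')); // ['the fox and the grapes']
console.log(buildTermIndex(docs).get('fox')); // [0]
```

For a static site with no backend, the index itself can be built once at site-generation time and shipped to the browser, which is why the preprocessing side of the trade-off is attractive here.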
Next Section
In the next section, we will acquire data, normalize it and tokenize it, while having fun with unicode and regular expressions.