Search Engines
Introduction & The Search Problem
yash101
Published 2/7/2025
Updated 2/7/2025
Note: this article is currently a published preview
It’s unfinished, so expect inconsistencies and errors. Raise issues here with recommendations or feedback.
Hi, I’m Yash!
I’m excited about the search problem. The goal of this article was to learn how search engines work, build my own simple search engine, and use it as an opportunity to write high-quality teaching content about search. This is my first time writing a principles-first article and I’d love your feedback!
Currently, this website does not support commenting (I’m still trying to figure out whether there’s a backend-free way to implement it). In the meantime, feel free to share your feedback by raising an issue on GitHub!
Thanks,
yash
🏋️‍♂️ Who is this for?
This article was designed to cater to most people. If you’re new to coding, especially in JavaScript or Python, you may have difficulty understanding some of the algorithms. However, I try to dive deep into the different topics along the way so there should be something for everyone to learn!
💪 Motivation
💡 I have a few project ideas in mind that would require search algorithms:
- Implement search on this website
- Implement a better search engine for wikis such as Confluence or Wikipedia
- Implement a financial data engine
🧠 Learning
My goal with writing this article is to learn how search works through trial-and-error, building a search engine, and describing how it works along with the topics I learn along the way.
Rather than researching existing search engine architectures, jumping straight to established methods, or using something that already exists (Solr, Lucene, Elasticsearch), I’m taking a first-principles approach. Thus, this article serves as a log of my journey.
⛳️ Goals
Some goals for this project:
- Learn how search engines work
- Learn some NLP (natural language processing) techniques
- Achieve decent accuracy
- Search should have no backend and should run 100% in the browser
- Search should support Unicode. Languages other than English exist, and so do emojis.
🎯 End Result
By the end of this article, we should have a working lightweight search engine with an offline indexing process and fully client-side search implemented.
🛠️ Tools Used
- Node.js (v22.13.0)
- Enron Emails Dataset - used for large-scale test data
- Æsop’s Fables - used for test data in our demos
- Downloaded from Project Gutenberg and transformed into JSON for ease of use in this project. The processing is outside the scope of this article and the JSON may contain errors.
- Web Browser with JavaScript support
- The majority of code will be written in JavaScript since it runs natively in browsers
- Python
- Used for visualization, data preprocessing and validation of my JavaScript models and code
🔎 Search, The Problem
Once upon a time, the internet was in its infancy, Google was still under development, and CLRS, the legendary Introduction to Algorithms textbook, had just been published. But none of that mattered: you had a biology assignment due that required you to understand prokaryotic organisms. While the internet offered little help, your best resource sat in front of you: a biology textbook!
Now, take a moment and think… How do you find a topic in a textbook?
You could read the entire textbook, every single page, to find out more about prokaryotic organisms, but is that your best approach? Do you have the time to read through every single page for your assignment? Probably not. Thus, we need to do better and use the resources available to us to find a faster process.
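To make the "read every page" approach concrete, here is a minimal sketch of it as a linear scan in JavaScript. The `pages` array and the `findPagesByScanning` function are made up for illustration; they only assume that a textbook can be treated as a list of page texts.

```javascript
// Hypothetical textbook: one string per page (illustrative data only).
const pages = [
  "Cells are the basic unit of life.",
  "Prokaryotic organisms lack a nucleus.",
  "Eukaryotic cells contain organelles.",
  "Bacteria are prokaryotic organisms.",
];

// Read every page and collect the 1-based page numbers that mention the term.
// The work grows with the size of the whole book, however rare the term is.
function findPagesByScanning(pages, term) {
  const needle = term.toLowerCase();
  const hits = [];
  pages.forEach((text, i) => {
    if (text.toLowerCase().includes(needle)) hits.push(i + 1);
  });
  return hits;
}

console.log(findPagesByScanning(pages, "prokaryotic")); // [ 2, 4 ]
```

This works, but every query has to touch every page, which is exactly the cost we would like to avoid.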
What resources does the textbook provide you to efficiently find what you are looking for?
📔 Search, The Textbook Method
Textbooks normally put two tools at your disposal to help you find information quickly:
- Table of Contents (TOC) at the beginning
- Index at the end
These tools work differently from each other, but together they make searching for topics much faster than flipping through pages.
- Table of Contents
- Works like a roadmap, listing chapters and sections in order so that you can navigate a book top-down.
If you need to learn about bacteria, the table of contents may contain:
📖 Chapter 3: Bacteria and Prokaryotes (Page 45)
In this example, the table of contents shows where to read about bacteria as a whole topic. But if you need something very specific, like “Gram-positive bacteria”, you may be unable to find it in the table of contents.
- Index
- Maps topics and terms to their locations. Instead of listing topics by chapter, it lists every important word or term in alphabetical order, along with the exact page numbers.
For example, if you need Gram-positive bacteria, you can check the index:
🔍 Gram-positive bacteria – Pages 26, 28, 47
Using the index, you are able to navigate straight to the correct page efficiently.
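The two textbook tools above can be sketched as two small data structures. This is only an illustration under my own assumptions, not the engine built later in the article: the names `tableOfContents`, `index`, `findChapter`, and `lookupIndex` are hypothetical, and the entries for Chapters 1 and 2 and for "prokaryote" are invented filler around the examples from the text.

```javascript
// Hypothetical table of contents: chapters in reading order.
// Only the Chapter 3 entry comes from the example above; the rest is filler.
const tableOfContents = [
  { title: "Chapter 1: Introduction to Biology", page: 1 },
  { title: "Chapter 2: Cell Structure", page: 20 },
  { title: "Chapter 3: Bacteria and Prokaryotes", page: 45 },
];

// Hypothetical index: term -> pages where the term appears.
const index = new Map([
  ["gram-positive bacteria", [26, 28, 47]],
  ["prokaryote", [45, 46, 52]],
]);

// Top-down navigation: find the first chapter whose title mentions the topic.
function findChapter(toc, topic) {
  const needle = topic.toLowerCase();
  return toc.find((entry) => entry.title.toLowerCase().includes(needle)) ?? null;
}

// Direct lookup: jump straight to the pages listed for a term.
function lookupIndex(index, term) {
  return index.get(term.toLowerCase()) ?? [];
}

console.log(findChapter(tableOfContents, "bacteria"));
console.log(findChapter(tableOfContents, "Gram-positive bacteria")); // null
console.log(lookupIndex(index, "Gram-positive bacteria")); // [ 26, 28, 47 ]
```

The table of contents only answers broad questions (which chapter covers bacteria?), while the index answers specific ones with a single lookup, without reading any page text.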
Can we model a simple search engine off the same process used to find a topic in a book?
⏭️ Next Section
In the next section, we will acquire data and clean it up.