Search Engines
Data acquisition, normalization, cleaning and tokenization
yash101
Published 2/7/2025
Updated 2/7/2025
Search engines are ubiquitous. Google, Bing, Yahoo, you name it: all of these are internet-wide search engines. Even your favorite social media platforms have search functionality built in. How do they work? How can we build our own? And how can we build a search engine that is cheap and easy to run?
In this part of the multi-part series, we will implement data acquisition and cleanup.
Data Acquisition
Data acquisition is arguably the hardest part of building a search engine. In typical search engines, such as Google or Bing, data acquisition is performed by a spider, which crawls the web, ingesting content that can be indexed. For this project, we will focus on keeping data acquisition simple and use a prepared dataset instead.
In this project, we will use the Enron Email Dataset. To read all our text and build our corpus, we simply need to recursively iterate through the dataset's directory tree and read every file in it. The files are formatted as emails (with headers), but we will ignore that for now unless it becomes an issue (keeping it simple first).
import fs from 'node:fs/promises';
import path from 'node:path';
import pLimit from 'p-limit';

// Cap concurrent file reads so we don't exhaust system resource limits (open file descriptors)
const limit = pLimit(100);

async function limitedReadFile(filePath) {
  return limit(() => fs.readFile(filePath, 'utf8'));
}

// Walk a directory tree, yielding [path, contents] pairs for every file found
async function* readFilesRecursively(dir) {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      yield* readFilesRecursively(fullPath); // Recurse into subdirectory
    } else if (entry.isFile()) {
      const contents = await limitedReadFile(fullPath);
      yield [fullPath, contents];
    }
  }
}

(async () => {
  // dirPath should point at the root directory of the extracted dataset
  for await (const [filepath, contents] of readFilesRecursively(dirPath)) {
    // Do something with the filepath and the contents of the file
  }
})();
Data Cleanup and Preprocessing
Data is rarely clean. Common issues in search include stray characters, inconsistent spacing, inconsistent casing, plenty of Unicode shenanigans (a story for another day), differing language formats, and much more. In this article, we will use Unicode normalization and a few regular expressions to significantly clean up our data.
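As a rough sketch of what that cleanup can look like in practice (the NFKC normalization form and the exact regular expressions here are my assumptions; tune them to your corpus):

// Cleanup sketch: NFKC normalization, case folding, and regex-based stripping of punctuation
function cleanText(text) {
  return text
    .normalize('NFKC')                    // fold Unicode compatibility characters into a canonical form
    .toLowerCase()                        // case folding, so "Email" and "email" become the same token
    .replace(/[^\p{L}\p{N}\s]+/gu, ' ')   // replace anything that is not a letter, digit, or whitespace
    .replace(/\s+/g, ' ')                 // collapse runs of whitespace into a single space
    .trim();
}

console.log(cleanText('  Re:  Café   MEETING!!  ')); // "re café meeting"

Stripping punctuation this aggressively is a deliberate simplification; it loses things like email addresses and dollar amounts, which we can revisit if they turn out to matter later.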
Next, Let's Tokenize our Cleaned Data
What is a token? Tokens are chunks of text that we treat as a single unit, such as a word. If a document is a Lego castle, each brick used to build that castle is a token.
In this article, a token will be a run of characters separated by whitespace. In layman's terms, a word.
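A whitespace tokenizer that matches this definition can be a one-liner; the sketch below (the function name is mine) splits the cleaned text on runs of whitespace and drops any empty strings:

// Whitespace tokenizer: split on runs of whitespace, drop empty strings
function tokenize(text) {
  return text.split(/\s+/).filter(Boolean);
}

console.log(tokenize('the quick brown fox')); // [ 'the', 'quick', 'brown', 'fox' ]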
Putting it all Together
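Combining the pieces above, a rough end-to-end pass over the corpus might look like the sketch below. The corpus shape, a simple Map from file path to token array, is just one possible choice and an assumption on my part:

// Hypothetical pipeline: read every file, clean it, tokenize it, and keep the tokens per document
(async () => {
  const corpus = new Map(); // filepath -> array of tokens

  for await (const [filepath, contents] of readFilesRecursively(dirPath)) {
    const tokens = tokenize(cleanText(contents));
    corpus.set(filepath, tokens);
  }

  console.log(`Processed ${corpus.size} documents`);
})();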
Next
In the next section, we will learn how we can intuitively attempt to solve the search problem through the use of a reverse index (also known as an inverted index).