Nearest Neighbors with FAISS
How can we search for nearest vectors at scale?
yash101
Published 4/7/2025
Updated 4/7/2025
What is FAISS?
FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings that implements data structures and algorithms for storing and searching vectors. In this article, we will use FAISS to implement nearest-neighbor search on a dataset of vectors.
Getting Started
To get started, first ensure you have FAISS installed (faiss-cpu if you do not have a GPU, faiss-gpu otherwise).
# Installing Python deps
!pip install faiss-cpu matplotlib numpy
Next, we can import Python libraries
import json
import numpy as np
import faiss
Next, we can load some vectors
Source for aesop-vectors.json:
aesop_vectors = None
with open('./aesop-vectors.json', 'r') as f:
    aesop_vectors = json.load(f)
N_DIMENSIONS = len(next(iter(aesop_vectors.values())))
print(f'Number of vectors: {len(aesop_vectors)}; Vector size: {N_DIMENSIONS}')
Number of vectors: 310; Vector size: 32
Next, we can create an index
index = faiss.IndexFlatL2(N_DIMENSIONS)
But wait a moment, what is IndexFlatL2?
- L2: the L2 distance is the Euclidean distance between two vectors, i.e. the length of a straight line from one vector to the other.
- Flat: a flat index is a brute-force index. It works by iterating through every single vector in the database and keeping the ones with the lowest distances.
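To make the brute-force idea concrete, here is a minimal NumPy-only sketch of what a flat L2 index does conceptually (this is an illustration, not FAISS's actual implementation; note that, like this sketch, FAISS's IndexFlatL2 reports *squared* L2 distances):

```python
import numpy as np

def brute_force_search(query, database, k):
    """Linear scan: squared L2 distance from the query to every vector."""
    dists = np.sum((database - query) ** 2, axis=1)
    order = np.argsort(dists)[:k]  # indices of the k smallest distances
    return dists[order], order

database = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [3.0, 4.0]], dtype=np.float32)
query = np.array([0.0, 0.0], dtype=np.float32)

dists, ids = brute_force_search(query, database, k=2)
print(ids, dists)  # [0 1] [0. 1.]
```

Every vector is touched on every query, which is exactly why this approach stops scaling as the database grows.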
How can we improve this?
- We can normalize our vectors and then use IndexFlatIP (inner product) for cosine similarity
- We can use a better, more efficient indexing strategy. We will look into this further later in this article.
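The first idea rests on a simple identity: once vectors are scaled to unit length, a plain inner product equals cosine similarity. A small NumPy demonstration (IndexFlatIP applies the same inner product, just across a whole index):

```python
import numpy as np

a = np.array([3.0, 4.0], dtype=np.float32)
b = np.array([4.0, 3.0], dtype=np.float32)

# Cosine similarity, computed directly from the definition
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize first, then a plain inner product gives the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = a_unit @ b_unit

print(cosine, inner)  # both approximately 0.96
```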
Next, we can add data to the index
vectors = np.array(list(aesop_vectors.values()), dtype='float32')
print('Vectors: ', vectors)
index.add(vectors)
ids_start = index.ntotal - len(vectors)
ids_end = index.ntotal
id_mapping = {}
iter_label = iter(aesop_vectors.keys())
for i in range(ids_start, ids_end):
    id_mapping[i] = next(iter_label)
print('First 10 vectors: ', list(id_mapping.items())[:10], '...')
Vectors: [[1. 0. 1. ... 1. 0. 0.]
[1. 0. 1. ... 0. 0. 0.]
[1. 1. 1. ... 0. 0. 0.]
...
[1. 0. 0. ... 0. 0. 0.]
[1. 0. 0. ... 1. 0. 0.]
[1. 0. 0. ... 0. 0. 0.]]
First 10 vectors: [(0, 'The Wolf And The Lamb'), (1, 'The Bat And The Weasels'), (2, 'The Ass And The Grasshopper'), (3, 'The Lion And The Mouse'), (4, 'The Charcoal-Burner And The Fuller'), (5, 'The Father And His Sons'), (6, 'The Boy Hunting Locusts'), (7, 'The Cock and the Jewel'), (8, 'The Kingdom of the Lion'), (9, 'The Wolf and the Crane')] ...
Next, we can query the index to find similar content
seed_story = 'The Fox and the Bramble'
seed_embedding = aesop_vectors[seed_story]
print(f'{seed_story} = {seed_embedding}')
k = 3
print(f'Finding k={k} nearest embeddings')
# Query from the FAISS index
# FAISS expects float32 input, so cast the integer embedding explicitly
distances, indices = index.search(np.array([seed_embedding], dtype='float32'), k)
The Fox and the Bramble = [1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
Finding k=3 nearest embeddings
print(f'Results for {seed_story}:')
# Note: use a name other than `index` for the loop variable, to avoid shadowing the FAISS index
for (i, (distance, idx)) in enumerate(zip(distances[0], indices[0])):
    print(f'  [{i + 1}] Found {id_mapping[idx]} with id {idx} with a distance of {distance}')
Results for The Fox and the Bramble:
[1] Found The Fox and the Bramble with id 283 with a distance of 0.0
[2] Found The Fox and the Monkey with id 67 with a distance of 1.0
[3] Found The Playful Ass with id 111 with a distance of 1.0
Better Indexing Strategies
Our current approach holds every single vector in memory and linearly scans through all of them to find the best matches. This does not scale, for two reasons:
- We can fit only so many vectors in RAM
- Linear search is slow, and gets linearly slower as more data is added. No bueno!
So what are better indexing methods we can use?
See the full list of supported indexing methods at the FAISS docs, linked here