What is it?

A vector database allows you to store many forms of data (text, images, audio, etc.) and lets you query through the data based on similarity. For example, let’s say we have a vector database storing a bunch of news article titles. If you take a title, for example:
"NASA's Perseverance Rover Discovers New Rock Formations on Mars"
and query the vector database for the 3 most similar titles, it will return titles that are semantically similar to the one you provided. An example output could be:
1. "New Discoveries by NASA's Perseverance Rover on the Martian Surface"
2. "Mars Rover Uncovers Ancient Rock Formations"
3. "NASA's Perseverance Rover Finds Evidence of Water on Mars"

What is it used for?

Vector databases have many uses, such as:
  • Retrieval-Augmented Generation (RAG)
  • AI Agents
  • Content-based recommendation engines
  • Semantic search
A great example of vector databases in active use is Spotify’s recommendation engine (check it out, it’s really cool and open-source~ ❤️). Spotify is well-known for its amazing music recommendations, which are powered by search operations over a vector database. Songs in Spotify are stored as vectors, which allows Spotify to easily compute the k most similar songs to a target song. There is a great episode of the NerdOut@Spotify podcast that talks about their recommendation engine and vector databases as a whole. Highly recommend it.

How does it actually work? Why vectors?

Vector databases allow you to store multiple forms of data, but always as vectors. This is because vectors are an efficient, universal way of storing the features or characteristics of a piece of data. Data is converted into a vector by passing it through a machine-learning model (referred to as an embedding model), which extracts features from your data and generates a vector (also referred to as an embedding) containing those features. When comparing the similarity between two pieces of data, the vector database compares the vectors representing each piece of data. The closer the vectors are to each other, the more similar the data is. For example, let’s represent three fruits as 3-dimensional vectors. We’ll define each fruit by three simple features:
  • Sweetness (scale of 0 to 10)
  • Sourness (scale of 0 to 10)
  • Size (scale of 0 to 10)
Now, let’s create our vectors:
  • Orange: Sweet(6), Sour(5), Size(5) —> [6, 5, 5]
  • Lemon: Sweet(2), Sour(9), Size(4) —> [2, 9, 4]
  • Lime: Sweet(3), Sour(8), Size(3) —> [3, 8, 3]
We can visualize these vectors in 3-dimensional space. Notice how the vectors representing the lemon and lime are closer to each other than to the orange. This indicates that the lemon and lime are more similar to each other than to the orange, which makes sense given their taste profiles.
It is important to note that in real-world applications, vectors (embeddings) usually have much higher dimensions (e.g., 3072 dimensions) and are generated using embedding models. These high-dimensional vectors can capture intricate features and nuances of the data, allowing for more accurate similarity comparisons. The same logic applies regardless of the number of dimensions: the closer the vectors are to each other, the more similar the data is. I used 3 dimensions here for simplicity and visualization purposes. For example, OpenAI’s text-embedding-3-large model generates 3072-dimensional vectors for text data. For more visualizations of vector similarity, check out this great visual tool by TensorFlow! I personally love the MNIST visual.
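
You can check this intuition yourself with a few lines of plain NumPy, computing the pairwise Euclidean distances between the three fruit vectors (no vector database needed for a toy example this small):

```python
import numpy as np

# The 3-dimensional fruit vectors from above: [sweetness, sourness, size]
fruits = {
    "orange": np.array([6, 5, 5]),
    "lemon":  np.array([2, 9, 4]),
    "lime":   np.array([3, 8, 3]),
}

# Pairwise Euclidean distances: smaller distance = more similar
names = list(fruits)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        dist = np.linalg.norm(fruits[a] - fruits[b])
        print(f"{a} vs {b}: {dist:.2f}")

# Output:
# orange vs lemon: 5.74
# orange vs lime: 4.69
# lemon vs lime: 1.73   <- lemon and lime are indeed the closest pair
```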

How is similarity between vectors calculated?

Vector databases usually support various similarity metrics, which are mathematical methods of measuring the distance or similarity between two vectors.
  • Euclidean
$$d(\vec{u}, \vec{v}) = \sqrt{\sum_{i=1}^{n} (v_i - u_i)^2}$$
  • Cosine
$$\cos(\theta) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \|\vec{v}\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \sqrt{\sum_{i=1}^{n} v_i^2}}$$
  • Inner Product
$$\vec{u} \cdot \vec{v} = \sum_{i=1}^{n} u_i v_i$$
Different embedding models may work better with different similarity metrics, so it’s important to consider this when choosing an embedding model. For more in-depth information on how similarity search is performed specifically in EigenDB, take a look at this.
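
As a rough sketch, here’s what these three metrics look like in NumPy, applied to the fruit vectors from earlier. Keep the direction in mind: for Euclidean distance, smaller means more similar; for cosine similarity and inner product, larger means more similar:

```python
import numpy as np

def euclidean(u: np.ndarray, v: np.ndarray) -> float:
    """Euclidean distance: square root of the sum of squared differences."""
    return float(np.sqrt(np.sum((v - u) ** 2)))

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: inner product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def inner_product(u: np.ndarray, v: np.ndarray) -> float:
    """Inner (dot) product: sum of element-wise products."""
    return float(np.dot(u, v))

lemon = np.array([2, 9, 4])
lime = np.array([3, 8, 3])

print(euclidean(lemon, lime))      # 1.73  (lower  = more similar)
print(cosine(lemon, lime))         # ~0.99 (higher = more similar)
print(inner_product(lemon, lime))  # 90    (higher = more similar)
```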

Further reading