- What are vector databases
- Why are vector databases important to AI
- Core concepts of a vector database
- Factors to consider when choosing a vector database
- Popular Vector Databases for your consideration
- Step by Step guide to implementing a Vector database
- Step 1: Installing Milvus
- Step 2: Creating a Milvus Client
- Step 3: Create a collection
- Step 4: Inserting data into the Collection
- Step 5: Create an Index
- Step 6: Sample searching for similar vectors
- Bonus: How to prepare your data for Vector database
- Conclusion
What are vector databases
Like all databases, vector databases are used for data storage and retrieval, but they are designed to handle high-dimensional vector data, which are mathematical representations of features or attributes.
Vector databases perform similarity search: finding the vectors in the database that are most similar to a given search query.
Similarity search is achieved through algorithms and index structures that reduce space and time complexity compared to scanning every record in a traditional database such as SQL.
Why Vector Databases are important in AI
- Vector databases are very important in AI because workloads such as large-scale multimedia search, natural language processing and neural networks all produce high-dimensional vector data that must be stored and queried.
- Vector databases enable resource-efficient and time-efficient storage and retrieval of high-dimensional vector data.
- High-dimensional vector data includes feature vectors and embeddings, which capture complex patterns and relationships in the data.
- AI applications rely on operations such as nearest neighbour search, clustering and classification. These operations are resource intensive, hence the need for specialized databases.
- Vector databases provide fast and accurate similarity search, improving the performance and scalability of AI applications.
Core Concepts of a Vector Database
As we have already seen, a vector database handles high-dimensional vector data. Its core concepts are:
- Indexing: Vector databases use k-d trees, ball trees and other such techniques to perform high-dimensional vector searches such as nearest neighbour or clustering queries
- Scalability: These databases are designed to handle huge amounts of data and can be scaled across multiple machines running in parallel.
- Distance metrics: Vector databases compute similarity between vectors using metrics such as cosine similarity, Euclidean distance and Manhattan distance to determine how close vectors are to one another and cluster similar ones together (see the sketch after this list)
- Performance optimization: Query latency and memory usage are critical for AI applications, and vector databases are designed to optimize both
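To make the distance metrics concrete, here is a small plain-JavaScript sketch (the vectors are made up for illustration) that computes the three metrics mentioned above and uses one of them for a brute-force nearest neighbour lookup, which is exactly the operation a vector database index speeds up:

// cosine similarity: 1 means identical direction, 0 means orthogonal
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const normB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (normA * normB);
}

// Euclidean (straight-line) distance
function euclidean(a, b) {
  return Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0));
}

// Manhattan (city-block) distance
function manhattan(a, b) {
  return a.reduce((sum, v, i) => sum + Math.abs(v - b[i]), 0);
}

// brute-force nearest neighbour: compare the query against every stored vector
function nearestNeighbour(query, vectors) {
  return vectors.reduce((best, v) =>
    euclidean(query, v) < euclidean(query, best) ? v : best
  );
}

const query = [0.9, 0.1, 0.3];
const stored = [[0.8, 0.2, 0.4], [0.1, 0.9, 0.5], [0.4, 0.4, 0.4]];
console.log(nearestNeighbour(query, stored)); // [0.8, 0.2, 0.4]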
Factors to consider when choosing a Vector Database
When considering a vector database, first consider the requirements of your project.
There are many vector databases available in the market today, ranging from lightweight libraries to high-performance, scalable databases.
There are paid versions available as well; some are self-hosted while others can be purchased as a SaaS product.
Here are the 4 factors that you need to consider when choosing the right vector database for your project
- Scalability
- Performance
- Community Support
- Compatibility
Let us look at these individually
Scalability
What is the size of the dataset you are dealing with? If your data size is small, you can choose a lightweight database.
If you have a large dataset, determine whether the database can be scaled across multiple machines.
For large-scale projects, also consider whether the database can be scaled across multiple data centers.
Performance
Performance can be measured through the following metrics:
- Query latency
- Memory usage
- Indexing time
Hardware acceleration is also quite important: most vector databases can now run on GPUs instead of CPUs, which gives a significant boost to performance.
In many cases you can choose the database that suits your purpose for a given amount of performance per dollar, for example a database optimized for speed versus one optimized for accuracy.
Compatibility
Different databases work well with different programming languages, and this is especially true when working with vector databases.
Check whether the database you are considering works with the programming language used in your project.
What distance metrics and indexing techniques are you using in your project? Is the database compatible with them?
Does the database offer APIs, libraries and connectors that integrate with your project?
Community Support
What is the level of community support around a particular database? This is important because a strong community means plenty of help is available to the developer, such as answers on Stack Overflow and articles on how to set up the database for a particular purpose.
Community support also means access to tutorials, detailed docs and articles on how to implement things.
Databases with large communities are also well maintained and receive regular bug fixes, new features and security updates.
Popular Vector Databases for your consideration
Here are 4 of the most popular vector databases available in the market today
FAISS: A library for large-scale vector similarity search developed by Facebook (Meta). It is popular for its performance and flexibility in AI applications, supports GPU acceleration, which is a great add-on, and is primarily used from Python.
Milvus: Advertised as the most popular vector database for enterprise users, Milvus is open source and can be used in applications such as computer vision, machine learning and natural language processing. It is compatible with most AI tooling, supports multiple indexing techniques, and offers GPU hardware acceleration and distributed deployment.
Annoy: An open-source, lightweight C++ library developed by Spotify that searches for points in space that are close to a given query point.
Weaviate: Weaviate is an open-source vector database that provides HNSW (Hierarchical Navigable Small World) indexing, a graph-based technique often used in vector databases. It offers a balance between accuracy and speed, and you can specify which you prefer. HNSW may require more RAM than other techniques.
Step by Step guide to implementing a Vector database
For this guide, we will be using one of the most popular vector databases out there: Milvus.
Step 1: Installing Milvus
You can install Milvus in a Docker container. Milvus has minimal hardware requirements, which you can check on the Milvus website.
To install, download the Milvus docker-compose YAML file
$ wget https://github.com/milvus-io/milvus/releases/download/v2.2.13/milvus-standalone-docker-compose.yml -O docker-compose.yml
After downloading the YAML file, start Milvus with the command below
sudo docker-compose up -d
Creating milvus-etcd ... done
Creating milvus-minio ... done
Creating milvus-standalone ... done
Then you can check whether Milvus is up and running with
$ sudo docker-compose ps
You will get output similar to
Name Command State Ports
--------------------------------------------------------------------------------------------------------------------
milvus-etcd etcd -advertise-client-url ... Up 2379/tcp, 2380/tcp
milvus-minio /usr/bin/docker-entrypoint ... Up (healthy) 9000/tcp
milvus-standalone /tini -- milvus run standalone Up 0.0.0.0:19530->19530/tcp, 0.0.0.0:9091->9091/tcp
Connect to Milvus
Check the local port that Milvus is running on; replace the container name with your own if you have changed it
docker port milvus-standalone 19530/tcp
This command returns a local IP address and port number that you can connect to
Stop Milvus
You can stop Milvus using the following command
sudo docker-compose down
Creating the Node.js Project
Let us create a new directory and cd into it
mkdir milvus-nodejs
cd milvus-nodejs
Next, let us initialize the project
npm init -y
Then we will install the official Milvus Node.js SDK
npm install @zilliz/milvus2-sdk-node --save
Step 2: Creating a Milvus Client
a. Create a new file named index.js
and import the Milvus SDK like so
const { MilvusClient, DataType } = require("@zilliz/milvus2-sdk-node");
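The client itself has to be created before any of the collection calls below will work. A minimal sketch, assuming Milvus is running on the default standalone address localhost:19530 (replace this with the address and port returned by the docker port command above):

// connect to the Milvus standalone instance started with docker-compose
const milvusClient = new MilvusClient("localhost:19530");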
Step 3: Create a collection
Now, let us define a collection schema and include the data fields and data types
const collectionSchema = {
  collection_name: "test_collection",
  fields: [
    {
      name: "vector",
      description: "embedding vector",
      data_type: DataType.FloatVector,
      type_params: {
        dim: "128",
      },
    },
    {
      name: "id",
      description: "primary key",
      data_type: DataType.Int64,
      is_primary_key: true,
      // ids are supplied explicitly when inserting data in Step 4
      autoID: false,
    },
  ],
};
Now, let us create a collection using the Milvus client
async function createCollection() {
const response = await milvusClient.createCollection(collectionSchema);
console.log("Collection has been created:", response);
}
createCollection();
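As an optional sanity check (not part of the original flow), you can ask Milvus whether the collection now exists using the SDK's hasCollection call:

async function checkCollection() {
  // the response includes a boolean value that is true if the collection exists
  const response = await milvusClient.hasCollection({
    collection_name: "test_collection",
  });
  console.log("Collection exists:", response);
}
checkCollection();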
Step 4: Inserting Data into the Collection
a. Preparing the data to be inserted
const vectors = [
{
id: 1,
vector: Array.from({ length: 128 }, () => Math.random()),
},
{
id: 2,
vector: Array.from({ length: 128 }, () => Math.random()),
},
];
b. Inserting the data into the collection
async function insertData() {
const response = await milvusClient.insert({
collection_name: "test_collection",
fields_data: vectors,
});
console.log("Data has been added to the collection:", response);
}
insertData();
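Milvus buffers newly inserted rows in memory, so it is a good idea to flush the collection before building an index on it. A minimal sketch using the SDK's flushSync call:

async function flushData() {
  // waits until the inserted rows are sealed and persisted
  const response = await milvusClient.flushSync({
    collection_names: ["test_collection"],
  });
  console.log("Flush completed:", response);
}
flushData();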
Step 5: Create an Index
Let us define the parameters such as the index and metric types
const indexParams = {
  collection_name: "test_collection",
  field_name: "vector",
  extra_params: {
    index_type: "IVF_FLAT",
    metric_type: "L2",
    // index-specific parameters are passed as a JSON string
    params: JSON.stringify({ nlist: 1024 }),
  },
};
Now, using the Milvus client, we will create an index
async function createIndex() {
const response = await milvusClient.createIndex(indexParams);
console.log("A new index has been created:", response);
}
createIndex();
Step 6: Sample searching for similar vectors
const searchParams = {
  collection_name: "test_collection",
  // an empty expression means no scalar filtering
  expr: "",
  // one random 128-dimensional query vector
  vectors: [Array.from({ length: 128 }, () => Math.random())],
  vector_type: DataType.FloatVector,
  search_params: {
    anns_field: "vector",
    topk: 5,
    metric_type: "L2",
    params: JSON.stringify({ nprobe: 10 }),
    round_decimal: 4,
  },
};
Now let us run a sample search query against our database. For this we need to define some search parameters, such as the number of top-k results and the metric type, which we have done above.
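One more step before querying: Milvus only searches collections that have been loaded into memory, so load the collection first. A minimal sketch using the SDK's loadCollectionSync call:

async function loadCollection() {
  // blocks until the collection is fully loaded into memory
  const response = await milvusClient.loadCollectionSync({
    collection_name: "test_collection",
  });
  console.log("Collection loaded:", response);
}
loadCollection();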
async function search() {
const response = await milvusClient.search(searchParams);
console.log("Results:", response);
}
search();
Thus we have implemented the Milvus client in our Node.js project.
Bonus: How to prepare your data for the vector database
Data Pre-Processing and Feature extraction in Vector databases
In vector databases, there are three methods we commonly use for data preprocessing:
- Normalization
- Dimensionality reduction
- Feature selection
Let us consider all these in detail
Normalization
Normalization means adjusting the dataset values so that they are on a common scale. We do this because we do not want a single feature to dominate the model due to differences in the magnitude of values.
Steps involved
The features that need to be normalized are numerical features that have different scales and units of measurement.
There are a number of methods used for this:
Min-max scaling: scaling all the values into the range [0, 1].
Z-score standardization: in this method we scale the values using statistics: the mean (average) and the standard deviation, which measures how far the values spread from the mean.
Apply whichever method you prefer, but always use the same scaling parameters for the training and test sets to avoid data leakage. A small sketch of both methods follows below.
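Here is a minimal sketch of both techniques in plain JavaScript; the sample values are made up purely for illustration:

// min-max scaling: map values into the [0, 1] range
function minMaxScale(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  return values.map((v) => (max === min ? 0 : (v - min) / (max - min)));
}

// z-score standardization: center on the mean and divide by the standard deviation
function zScore(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  const std = Math.sqrt(variance);
  return values.map((v) => (std === 0 ? 0 : (v - mean) / std));
}

// example: a feature whose values sit on very different magnitudes
console.log(minMaxScale([2, 10, 4, 8])); // [0, 1, 0.25, 0.75]
console.log(zScore([2, 10, 4, 8]));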
Dimensionality Reduction
This involves reducing the number of features in our data set but retaining the important features.
With reduced dimensions, the model runs faster, computational complexity decreases and the data is simplified.
Steps involved
Decide on the number of dimensions and the amount of variance that you want to retain in the reduced dataset, then choose a reduction technique accordingly.
Principal Component Analysis (PCA): this technique maximizes variance and uses a linear method to project data into a lower-dimensional space (sketched below).
t-Distributed Stochastic Neighbor Embedding (t-SNE): preserves local structure in the data and is a nonlinear method of reducing the dimensionality of a dataset.
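As a rough illustration of the idea behind PCA (a toy sketch, not a production implementation), the code below projects points onto their first principal component using power iteration on the covariance matrix:

// project points onto their first principal component
function pcaFirstComponent(points, iterations = 100) {
  const n = points.length;
  const dims = points[0].length;

  // 1. center the data around the mean of each dimension
  const mean = Array.from({ length: dims }, (_, d) =>
    points.reduce((sum, p) => sum + p[d], 0) / n
  );
  const centered = points.map((p) => p.map((v, d) => v - mean[d]));

  // 2. covariance matrix of the centered data
  const cov = Array.from({ length: dims }, (_, i) =>
    Array.from({ length: dims }, (_, j) =>
      centered.reduce((sum, p) => sum + p[i] * p[j], 0) / n
    )
  );

  // 3. power iteration to approximate the top eigenvector (the principal direction)
  let v = Array.from({ length: dims }, () => Math.random());
  for (let k = 0; k < iterations; k++) {
    const next = cov.map((row) => row.reduce((sum, c, j) => sum + c * v[j], 0));
    const norm = Math.sqrt(next.reduce((sum, x) => sum + x * x, 0));
    v = next.map((x) => x / norm);
  }

  // 4. project each centered point onto that direction: 2-d points become 1-d values
  return centered.map((p) => p.reduce((sum, x, d) => sum + x * v[d], 0));
}

console.log(pcaFirstComponent([[2, 1], [4, 2], [6, 3], [8, 4]]));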
Feature Selection
In Feature selection, we select the most relevant features and discard the ones that are not relevant to our use case. This helps to reduce noise, makes the model more interpretable and decreases the training time
Steps involved
Filter methods: ranking the features based on a chosen criterion, such as correlation or mutual information, and selecting the top-k features (sketched after this list).
Wrapper methods: using a specific machine learning model to evaluate features and iteratively removing the ones that are not relevant to our use case.
Embedded methods: methods such as LASSO and Ridge regression apply regularization, which reduces the impact of unimportant features; feature selection happens alongside model training.
Apply whichever method you deem best for your dataset, choose the top-k features and discard the rest.
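A minimal sketch of the filter method, ranking features by the absolute value of their Pearson correlation with a target and keeping the top k; the feature names and values are made up for illustration:

// Pearson correlation between two equal-length arrays
function pearson(x, y) {
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    vx += (x[i] - mx) ** 2;
    vy += (y[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}

// rank features by |correlation| with the target and keep the top k
function selectTopK(features, target, k) {
  return Object.entries(features)
    .map(([name, values]) => ({ name, score: Math.abs(pearson(values, target)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// made-up example: two informative features and one noisy one
const featureTable = {
  height: [1.2, 2.3, 3.1, 4.0],
  weight: [10, 21, 29, 42],
  noise: [5, 1, 9, 2],
};
console.log(selectTopK(featureTable, [100, 200, 300, 400], 2));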
Need Chat API for your website or app
DeadSimpleChat is a Chat API provider
- Add Scalable Chat to your app in minutes
- 10 Million Online Concurrent users
- 99.999% Uptime
- Moderation features
- 1-1 Chat
- Group Chat
- Fully Customizable
- Chat API and SDK
- Pre-Built Chat
Conclusion
In this article we discussed vector databases and how you can implement one in a Node.js project.
I hope you liked the article. Thank you for reading