The Apify Vector Database Integrations facilitate the transfer of data from Apify Actors to a vector database. This process includes data processing, optional splitting into chunks, embedding computation, and data storage.

These integrations support incremental updates, ensuring that only changed data is updated. This reduces unnecessary embedding computation and storage operations, making them well suited for search and retrieval-augmented generation (RAG) use cases.
This repository contains Actors for different vector databases. Each Actor performs the following steps:
- Retrieve a dataset as output from an Actor.
- [Optional] Split text data into chunks using LangChain.
- [Optional] Update only changed data.
- Compute embeddings, e.g. using OpenAI or Cohere.
- Save data into the database.
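
For illustration, the splitting and embedding steps could look roughly like the sketch below (not the Actors' actual code; it assumes the `langchain-text-splitters` and `langchain-openai` packages and an `OPENAI_API_KEY` environment variable):

```python
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Dataset items retrieved from an Actor run (illustrative data)
documents = [
    Document(page_content="Long text scraped by an Actor...", metadata={"url": "https://example.com"}),
]

# Optionally split the texts into smaller chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Compute embeddings for each chunk (OpenAI shown as one possible provider)
embeddings = OpenAIEmbeddings()
vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])

# The chunks and their vectors are then upserted into the target vector database.
```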
To add support for a new vector database, follow the steps below (`pgvector`/PostgreSQL is used as the example throughout):

- Add the database to `docker-compose.yml` for local testing (if the database is available as a Docker image):

  ```yaml
  version: '3.8'

  services:
    pgvector-container:
      image: pgvector/pgvector:pg16
      environment:
        - POSTGRES_PASSWORD=password
        - POSTGRES_DB=apify
      ports:
        - "5432:5432"
  ```
- Add the PostgreSQL dependency (`langchain_postgres`) to `pyproject.toml`:

  ```bash
  poetry add --group=pgvector "langchain_postgres"
  ```

  and mark the `pgvector` group as optional (in `pyproject.toml`):

  ```toml
  [tool.poetry.group.pgvector]
  optional = true
  ```
- Create a new actor in the `actors` directory, e.g. `actors/pgvector`, and add the following files:
  - `README.md` - the actor documentation
  - `.actor/actor.json` - the actor definition
  - `.actor/input_schema.json` - the actor input schema
- Create a Pydantic model for the actor input schema. Edit the `Makefile` to generate the model from the input schema:

  ```bash
  datamodel-codegen --input $(DIRS_WITH_ACTORS)/pgvector/.actor/input_schema.json --output $(DIRS_WITH_CODE)/src/models/pgvector_input_model.py --input-file-type jsonschema --field-constraints
  ```

  and then run:

  ```bash
  make pydantic-model
  ```
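
  The generated model will roughly resemble the sketch below (illustrative only; the actual fields are derived from your `input_schema.json` and are inferred here from the fields used in the test fixture further down):

  ```python
  # src/models/pgvector_input_model.py -- illustrative shape of the generated model
  from __future__ import annotations

  from pydantic import BaseModel


  class PgvectorIntegration(BaseModel):
      postgresSqlConnectionStr: str
      postgresCollectionName: str
      embeddingsProvider: str
      embeddingsApiKey: str
      datasetFields: list[str]
  ```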
- Import the created model in `src/models/__init__.py`:

  ```python
  from .pgvector_input_model import PgvectorIntegration
  ```
- Create a new module (`pgvector.py`) in the `vector_stores` directory, e.g. `vector_stores/pgvector.py`, and implement the `PGVectorDatabase` class and all required methods.
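
  A rough skeleton of such a class is sketched below (illustrative only; the required methods are dictated by the existing integrations in `vector_stores`, and `langchain_postgres.PGVector` is assumed as the underlying store):

  ```python
  # vector_stores/pgvector.py -- illustrative skeleton, not the full implementation
  from langchain_core.documents import Document
  from langchain_core.embeddings import Embeddings
  from langchain_postgres import PGVector

  # Relative import assumes vector_stores sits next to the models package
  from ..models import PgvectorIntegration


  class PGVectorDatabase:
      def __init__(self, actor_input: PgvectorIntegration, embeddings: Embeddings) -> None:
          self.index = PGVector(
              embeddings=embeddings,
              collection_name=actor_input.postgresCollectionName,
              connection=actor_input.postgresSqlConnectionStr,
              use_jsonb=True,
          )

      def add_documents(self, documents: list[Document], ids: list[str]) -> None:
          # Insert the chunks into the collection under the given ids
          self.index.add_documents(documents, ids=ids)

      def delete_all(self) -> None:
          # Drop every document in the collection (used by the tests)
          self.index.delete_collection()
  ```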
- Add `pgvector` into the `SupportedVectorStores` enum in `constants.py`:

  ```python
  class SupportedVectorStores(str, enum.Enum):
      pgvector = "pgvector"
  ```
- Add `PGVectorDatabase` into `entrypoint.py`:

  ```python
  if actor_type == SupportedVectorStores.pgvector.value:
      await run_actor(PgvectorIntegration(**actor_input), actor_input)
  ```
- Add `PGVectorDatabase` and `PgvectorIntegration` into `_types.py`:

  ```python
  ActorInputsDb: TypeAlias = ChromaIntegration | PgvectorIntegration | PineconeIntegration | QdrantIntegration
  VectorDb: TypeAlias = ChromaDatabase | PGVectorDatabase | PineconeDatabase | QdrantDatabase
  ```
- Add `PGVectorDatabase` into `vector_stores/vcs.py`:

  ```python
  if isinstance(actor_input, PgvectorIntegration):
      from .vector_stores.pgvector import PGVectorDatabase

      return PGVectorDatabase(actor_input, embeddings)
  ```
- Add a `PGVectorDatabase` fixture into `tests/conftest.py`:

  ```python
  @pytest.fixture()
  def db_pgvector(crawl_1: list[Document]) -> PGVectorDatabase:
      db = PGVectorDatabase(
          actor_input=PgvectorIntegration(
              postgresSqlConnectionStr=os.getenv("POSTGRESQL_CONNECTION_STR"),
              postgresCollectionName=INDEX_NAME,
              embeddingsProvider=EmbeddingsProvider.OpenAI.value,
              embeddingsApiKey=os.getenv("OPENAI_API_KEY"),
              datasetFields=["text"],
          ),
          embeddings=embeddings,
      )
      db.unit_test_wait_for_index = 0

      db.delete_all()
      # Insert initially crawled objects
      db.add_documents(documents=crawl_1, ids=[d.metadata["id"] for d in crawl_1])

      yield db

      db.delete_all()
  ```
- Add the `db_pgvector` fixture into `tests/test_vector_stores.py`:

  ```python
  DATABASE_FIXTURES = ["db_pinecone", "db_chroma", "db_qdrant", "db_pgvector"]
  ```
- Update `README.md` in the `actors/pgvector` directory.
- Add `pgvector` to the `README.md` in the root directory.
- Run the tests:

  ```bash
  make test
  ```
- Run the actor locally:

  ```bash
  export ACTOR_PATH_IN_DOCKER_CONTEXT=actors/pgvector
  apify run -p
  ```
- Set up the Actor on the Apify platform at https://console.apify.com with the following build configuration:
  - Git URL: `https://github.com/apify/store-vector-db`
  - Branch: `master`
  - Folder: `actors/pgvector`
- Test the actor on the Apify platform.