3.3.3 Definition of Vectorization

Vectorization is defined by calling the SQL function ai.create_vectorizer, which creates a set of vectorization parameters called a vectorizer.

Example) Definition of vectorization

rag_database=> SELECT ai.create_vectorizer(
     'sample_table'::regclass,
     destination => 'sample_embeddings',
     embedding => ai.embedding_ollama('all-minilm', 384),
     chunking => ai.chunking_recursive_character_text_splitter('contents'),
     processing => ai.processing_default(batch_size => 200, concurrency => 1),
     scheduling => pgx_vectorizer.schedule_vectorizer(interval '1 hour'),
     indexing => ai.indexing_hnsw(min_rows => 50000, opclass => 'vector_cosine_ops')
);
create_vectorizer
-------------------
                 1 - The ID of the created vectorizer
(1 row)

In the vectorization definition, you specify the table that contains the text data to be vectorized, the embedding model and vector length (which directly determine the vector representation), the preprocessing to be performed before vectorization, and other settings. You can also specify the timing of vectorization as a schedule. To perform automatic background vectorization within Fujitsu Enterprise Postgres, specify pgx_vectorizer.schedule_vectorizer for the scheduling argument.

Point

Do not change the name, primary key, column names, or other attributes of the table that contains the text to be vectorized. Doing so prevents the vectorization process from working properly.
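
As a precaution before altering a table, it may help to check whether a vectorizer is defined on it. The following is a minimal sketch, not a prescribed procedure. It assumes the source table is public.sample_table, as in the example above, and it uses only columns of the ai.vectorizer table, which stores the vectorization definitions.

```sql
-- Sketch: list the vectorizers whose source is public.sample_table.
-- The schema and table names are assumptions taken from the example above.
SELECT id, view_name, disabled
FROM ai.vectorizer
WHERE source_schema = 'public'
  AND source_table = 'sample_table';
```

If the query returns one or more rows, the table is the source of a vectorizer and the restriction above applies to it.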

You can view the vectorization definition you created in the ai.vectorizer table.

SELECT * FROM ai.vectorizer WHERE view_name = 'sample_embeddings';
id            | 1
source_schema | public
source_table  | sample_table
source_pk     | [{"pknum": 1, "attnum": 1, "attname": "id", "typname": "int4"}]
target_schema | public
target_table  | sample_embeddings_store
view_schema   | public
view_name     | sample_embeddings
trigger_name  | _vectorizer_src_trg_1
queue_schema  | ai
queue_table   | _vectorizer_q_1
config        | {"version": "0.8.0", "chunking": {"chunk_size": 800, "separators": ["\n\n", "\n", ".", "?", "!", " ", ""], "config_type": "chunking", "chunk_column": "contents", "chunk_overlap": 400, "implementation": "recursive_character_text_splitter", "is_separator_regex": false}, "indexing": {"config_type": "indexing", "implementation": "none"}, "embedding": {"model": "all-minilm", "dimensions": 384, "config_type": "embedding", "implementation": "ollama"}, "formatting": {"template": "$chunk", "config_type": "formatting", "implementation": "python_template"}, "processing": {"batch_size": 2000, "concurrency": 1, "config_type": "processing", "implementation": "default"}, "scheduling": {"config_type": "scheduling", "implementation": "none", "schedule_interval": "01:00:00", "extra_implementation": "pgx_vectorizer"}}
disabled      | f
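
Because the config column holds all settings as a single JSON document, it can be easier to read if you extract only the values of interest. The following is a minimal sketch that assumes config is a json or jsonb column and that its keys match the sample output above. The column aliases are illustrative only.

```sql
-- Sketch: pull selected settings out of the config column of ai.vectorizer.
-- The JSON keys are those shown in the sample output above; the aliases are hypothetical.
SELECT id,
       config -> 'embedding'  ->> 'model'             AS embedding_model,
       (config -> 'embedding' ->> 'dimensions')::int  AS dimensions,
       config -> 'chunking'   ->> 'chunk_column'      AS chunk_column,
       config -> 'scheduling' ->> 'schedule_interval' AS schedule_interval,
       disabled
FROM ai.vectorizer
WHERE view_name = 'sample_embeddings';
```

The -> operator returns a JSON value and ->> returns text, so the dimensions value is cast to an integer explicitly.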