To define a vector representation, call the SQL function create_vectorizer. It creates a vectorizer, which is a set of parameters that controls vectorization.
Example) Definition of vectorization
rag_database=> SELECT ai.create_vectorizer(
    'sample_table'::regclass,
    destination => 'sample_embeddings',
    embedding   => ai.embedding_ollama('all-minilm', 384),
    chunking    => ai.chunking_recursive_character_text_splitter('contents'),
    processing  => ai.processing_default(batch_size => 200, concurrency => 1),
    scheduling  => pgx_vectorizer.schedule_vectorizer(interval '1 hour'),
    indexing    => ai.indexing_hnsw(min_rows => 50000, opclass => 'vector_cosine_ops')
);
 create_vectorizer
-------------------
                 1    - The ID of the created vectorizer
(1 row)
In the vectorization definition, you can specify the table that contains the text data to be vectorized, the embedding model and vector length that directly determine the vector representation, the preprocessing to perform before vectorization, and other information. You can also specify the timing of vectorization as a schedule. To perform automatic background vectorization within Fujitsu Enterprise Postgres, specify pgx_vectorizer.schedule_vectorizer for the scheduling argument.
Point
Do not change the name, primary key, column names, or other information of the table that contains the text to be vectorized. Doing so prevents the vectorization process from working properly.
You can view the vectorization definition you created in the ai.vectorizer table.
SELECT * FROM ai.vectorizer WHERE view_name = 'sample_embeddings';
id            | 1
source_schema | public
source_table  | sample_table
source_pk     | [{"pknum": 1, "attnum": 1, "attname": "id", "typname": "int4"}]
target_schema | public
target_table  | sample_embeddings_store
view_schema   | public
view_name     | sample_embeddings
trigger_name  | _vectorizer_src_trg_1
queue_schema  | ai
queue_table   | _vectorizer_q_1
config        | {"version": "0.8.0", "chunking": {"chunk_size": 800, "separators": ["\n\n", "\n", ".", "?", "!", " ", ""], "config_type": "chunking", "chunk_column": "contents", "chunk_overlap": 400, "implementation": "recursive_character_text_splitter", "is_separator_regex": false}, "indexing": {"config_type": "indexing", "implementation": "none"}, "embedding": {"model": "all-minilm", "dimensions": 384, "config_type": "embedding", "implementation": "ollama"}, "formatting": {"template": "$chunk", "config_type": "formatting", "implementation": "python_template"}, "processing": {"batch_size": 2000, "concurrency": 1, "config_type": "processing", "implementation": "default"}, "scheduling": {"config_type": "scheduling", "implementation": "none", "schedule_interval": "01:00:00", "extra_implementation": "pgx_vectorizer"}}
disabled      | f
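Because the config column holds the whole definition as a single JSON document, it can be easier to pull out only the settings of interest. The following is a minimal sketch, assuming that config is stored as json or jsonb (as the output above suggests) and that the key names match those shown in the example. The column aliases are chosen here for illustration only.

```sql
-- Sketch: extract selected settings from the vectorizer definition.
-- Assumes ai.vectorizer.config is json or jsonb; aliases are illustrative.
SELECT id,
       source_table,
       config -> 'embedding'  ->> 'model'             AS embedding_model,
       config -> 'embedding'  ->> 'dimensions'        AS dimensions,
       config -> 'chunking'   ->> 'chunk_column'      AS chunk_column,
       config -> 'processing' ->> 'batch_size'        AS batch_size,
       config -> 'scheduling' ->> 'schedule_interval' AS schedule_interval
FROM ai.vectorizer
WHERE view_name = 'sample_embeddings';
```

The ->> operator returns each value as text, so cast it (for example, to integer) if you need to compare it numerically.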