3.3.3 Definition of Vectorization

The definition of vectorization is done with the pgx_vectorizer.pgx_create_vectorizer function provided by pgx_vectorizer. In the vectorization definition, you can specify information about the table that contains the text data to be vectorized, the embedding model and dimensions directly related to vector representation, chunking processing, as well as specify the timing of vectorization as a schedule. If you want to immediately schedule automatic vectorization in the background with Fujitsu Enterprise Postgres, specify the pgx_vectorizer.schedule_vectorizer function. Additionally, if you want to specify the vectorization later or manually control the timing of vectorization, specify the ai.scheduling_none function.

When using hybrid search, you can simultaneously define a GIN or GiST index for full-text search used in hybrid search.

When using an embedded model managed in the database, specify pgx_vectorizer.pgx_embedding_onnx as the argument embedding for the pgx_vectorizer.pgx_create_vectorizer function. For more details, refer to "5.2.9 Definition of Vectorization".

When defining vectorization, an embedded view combining a table containing text data and a table containing vector data is defined. The embedded view is a view that adds the following columns to the original table, and for each piece of text data in the original table, it holds records divided into chunks.

Column	Type	Description
embedding_uuid	uuid	Text chunk ID
chunk_seq	integer	Sequential number of chunks in a text data
chunk	text	Text chunk
embedding	vector	Vector representation of the chunk

Information

In the following cases, refer to "Definition of full-text search index" to define the index.

When using an index other than GIN or GiST in hybrid search
When you have already performed only vectorization and want to add an index for full-text search later

Fujitsu Enterprise Postgres provides pg_trgm and pg_bigm as full-text search features using GIN and GiST indexes. When using these as full-text search features, set up pg_trgm and pg_bigm in advance. Refer to the following for setup instructions.

pg_trgm
"Additional Supplied Modules and Extensions" in the PostgreSQL Documentation
pg_bigm
"pg_bigm" in the Installation and Setup Guide for Server

Example) Definition of vectorization and full-text search index

Below is an example of simultaneously defining a vectorization and a GIN index for full-text search on the knowledge data table sample_table in text format.

rag_database=> SELECT pgx_vectorizer.pgx_create_vectorizer(
     'sample_table'::regclass,
     destination => 'sample_embeddings',
     embedding => ai.embedding_ollama('all-minilm', 384),
     chunking => ai.chunking_recursive_character_text_splitter('contents'),
     processing => ai.processing_default(batch_size => 200, concurrency => 1),
     scheduling => pgx_vectorizer.schedule_vectorizer(interval '1 hour'),
     indexing => ai.indexing_hnsw(min_rows =>50000, opclass => 'vector_cosine_ops'),
     fulltext_indexing => pgx_vectorizer.pgx_fulltext_indexing_gin(opclass => 'gin_trgm_ops')
     );
pgx_create_vectorizer
-------------------
                 1 - The ID of the created vectorizer
(1 row)

The definition of the full-text search index created by the pgx_create_vectorizer function can be checked in the pgx_vectorizer.pgx_fulltext_index table.

Note

Regarding the table containing text subject to vectorization, the following cannot be changed:

Changes to the schema containing the table
Changes to the table name
Changes to the primary key, addition or removal of columns constituting the primary key
Name and data type of columns included in the primary key
Name and data type of columns containing text data subject to vectorization

Changing these may cause automatic vectorization or semantic text search to fail. If you want to change them, or if you have already changed them, you need to redefine the vectorization. In this case, re-vectorization of all text in the target table is required.

Point

Users can implement their own full-text search functionality, other than pg_trgm or pg_bigm provided by Fujitsu Enterprise Postgres, by defining indexes supported by that full-text search functionality and specifying search conditions using its operators, allowing for hybrid search with custom full-text search capabilities.

Definition of full-text search index

If you use an index other than GIN or GiST index for full-text search, manually define the index for the embedding table created after defining the vectorization. The name of the embedding table is the name with "_store" added to the end of the embedding view name specified in the destination argument when defining the vectorization. The column name for text chunks in the embedding table is chunk.

In the following example, a full-text search index my_iam, other than GIN and GiST, is defined in the embedded table sample_embeddings2_store.

CREATE INDEX sample_fti ON sample_embeddings2_store (chunk) USING my_iam;

Deleting full-text search index

You can delete the full-text search index by removing the information of the full-text search index from the vectorization and then executing the DROP INDEX statement.

rag_database=> SELECT pgx_delete_fulltext_index_config(pgx_vectorizer.get_vectorizer_id(view_name => 'sample_embeddings');
rag_database=> SELECT indexname FROM pg_indexes where tablename = 'sample_embeddings_store';
indexname  
------------------------------------------
sample_embeddings_store_chunk_idx - It is created for the column called "chunk".

rag_database=> DROP INDEX sample_embeddings_store_chunk_idx;

Changing the full-text search index

When changing the full-text search index defined by the pgx_create_vectorizer function, first delete the full-text search index and then define a new index.

--Deleting full-text search index
rag_database=> SELECT pgx_delete_fulltext_index_config(pgx_vectorizer.get_vectorizer_id(view_name => 'sample_embeddings');
rag_database=> SELECT indexname FROM pg_indexes where tablename = 'sample_embeddings_store';
indexname  
------------------------------------------
sample_embeddings_store_chunk_idx - It is created for the column called "chunk".

rag_database=> DROP INDEX sample_embeddings_store_chunk_idx;

-- Adding full-text search index
rag_database=> CREATE INDEX ON sample_embeddings_store USING gin (chunk gin_bigm_ops);