In this feature, the inference server loads models into memory and uses CPU resources to perform inference. Because the database and the inference server share the resources of the same machine, consider the impact of their combined memory, CPU, and disk usage on the main database operations, and control resource usage appropriately. In addition, check the following points for each resource to ensure that this feature does not run short of resources.
Memory
  - Total size of the models to be loaded
CPU
  - Number of simultaneous vectorization processes
  - Degree of parallelism of a single vectorization process
Disk resources (the directory specified in the pgx_inference.triton_model_repository_path parameter)
  - Total size of the models to be loaded
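One practical way to check the memory and disk requirements above is to measure the total size of the files in the model repository directory. The sketch below assumes a POSIX shell; the path is illustrative, so substitute the actual value of the pgx_inference.triton_model_repository_path parameter in your environment.

```shell
# Illustrative path -- replace with the value of
# pgx_inference.triton_model_repository_path in your environment.
MODEL_REPO="${MODEL_REPO:-/var/lib/pgx_inference/models}"

if [ -d "$MODEL_REPO" ]; then
  # Total size of all models in the repository (human-readable).
  TOTAL=$(du -sh "$MODEL_REPO" | cut -f1)
  echo "total model size: $TOTAL"
  # Per-model breakdown, one line per model directory.
  du -sh "$MODEL_REPO"/*/ 2>/dev/null
else
  echo "model repository not found: $MODEL_REPO" >&2
fi
```

The total reported here is a lower bound for the memory the inference server needs to hold all models simultaneously, and it is the value to compare against any limit you configure.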
The memory used by this feature can be controlled with the following parameter.
pgx_inference.total_model_size_limit
Sets the upper limit on the total size of models that can be loaded into the inference server. In general, use the default value of -1 (unlimited). Set an explicit limit when the machine has little memory, or when multiple system administrators can load models and you want to prevent them from loading a large number of models at once.
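As a sketch, the parameter could be set in postgresql.conf as shown below. The parameter name comes from this document, but the unit and the example value are assumptions for illustration only; check the product manual for the exact syntax and unit before using it.

```
# postgresql.conf (illustrative fragment)
# -1 (the default) means unlimited. A positive value caps the total
# size of models the inference server may load. The unit used here
# (a plain size value) is an assumption -- verify it in the manual.
pgx_inference.total_model_size_limit = 4096
```

Choose a limit that leaves enough memory on the machine for the database's own operations, based on the total model size measured for the model repository.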
For details on the resources used by Triton Inference Server, refer to the official Triton Inference Server documentation.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/onnxruntime_backend/README.html