The Triton Inference Server used as the inference server for this feature must be version 2.20.0 or later and meet the following requirements:
The following APIs must be available. For details about each API, refer to the official documentation of Triton Inference Server.
Model loading: RepositoryModelLoad
Model unloading: RepositoryModelUnload
Inference: ModelInfer/ModelStreamInfer
Model status check: RepositoryIndex
Model interface check: ModelConfig
The model repository has the structure described in the official documentation of Triton Inference Server.
The automatic completion feature of the configuration file is available.
The model repository of the Triton Inference Server must be located where Fujitsu Enterprise Postgres can access it as a local file.
The server must be accessible via the gRPC API.
ONNX Runtime must be enabled as the backend.
This includes onnxruntime-extensions built without Python, which provides the custom operators used by the tokenizer.
The --model-control-mode option must be set to explicit.
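The expected model repository structure is described in the official Triton documentation: one directory per model, containing numbered version subdirectories and, optionally, a config.pbtxt. As a minimal sketch only (the model name "my_model", the path /tmp/model_repository, and the config values are illustrative assumptions, not product defaults):

```shell
# Create the minimal directory layout Triton expects:
#   <repository>/<model-name>/<version>/model.onnx
mkdir -p /tmp/model_repository/my_model/1
# The ONNX model file itself would be placed at:
#   /tmp/model_repository/my_model/1/model.onnx

# A minimal config.pbtxt for the ONNX Runtime backend. With the automatic
# completion feature mentioned above, parts of this file can be inferred
# by the server from the model itself.
cat > /tmp/model_repository/my_model/config.pbtxt <<'EOF'
name: "my_model"
backend: "onnxruntime"
max_batch_size: 0
EOF
```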
Before using this feature, start the Triton Inference Server on the same machine as the database.
Fujitsu Enterprise Postgres provides a sample Dockerfile that meets the above requirements. To create a container image using this sample file, execute the following commands. Set the appropriate label for the volume option (-v option) so that the directory on the host side is visible from the container. "<x>" indicates the product version.
$ cp /opt/fsepv<x>server64/share/triton_dockerfile.sample ./triton_dockerfile
$ podman build -f ./triton_dockerfile -t triton_image
$ mkdir -p /path/to/model/repository
$ podman run -d --name triton_container -p8001:8001 -p8002:8002 -v /path/to/model/repository:/models \
triton_image tritonserver --model-repository=/models --http-port=0 --grpc-port=8001 \
--metrics-port=8002 --backend-config=onnxruntime,device=cpu --model-control-mode=explicit \
--log-info=true --log-warning=true --log-error=true --log-verbose=0
$ podman container ls # Confirm that the container is running.

By using systemd, you can start the created container automatically. The following example starts the inference server as a service, using the sample file required for automatic startup of the container.
$ cp /opt/fsepv<x>server64/share/triton.container.sample \
~/.config/containers/systemd/triton.container
$ systemctl --user daemon-reload
$ systemctl --user start triton.service

To achieve model-level access control, configure mutual TLS authentication (mTLS authentication) on the inference server. For details, refer to "5.4.3 Security".
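The triton.container sample copied above is a Podman Quadlet unit: after daemon-reload, systemd generates a triton.service from it. As an illustrative sketch only (the actual sample shipped with the product may differ; the image name, ports, and volume path below are assumptions), such a unit generally has this shape:

```ini
[Unit]
Description=Triton Inference Server

[Container]
Image=localhost/triton_image:latest
Exec=tritonserver --model-repository=/models --http-port=0 --grpc-port=8001 --metrics-port=8002 --model-control-mode=explicit
PublishPort=8001:8001
PublishPort=8002:8002
Volume=/path/to/model/repository:/models

[Install]
WantedBy=default.target
```

Placing this file under ~/.config/containers/systemd/ and running systemctl --user daemon-reload makes the generated triton.service available to start, as shown in the commands above.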
Enable metrics on the inference server so that the cause of problems can be identified when they occur. Also, enable timestamp formatting for log output.
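Both settings are controlled by tritonserver launch options. A hedged fragment, assuming the upstream Triton CLI flag names (confirm with tritonserver --help for your version):

```
# Prometheus metrics are served on the port given by --metrics-port;
# --log-format=ISO8601 adds a timestamp to each log line.
tritonserver --model-repository=/models \
             --allow-metrics=true --metrics-port=8002 \
             --log-format=ISO8601
```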