Onnxruntime

Consumes too much memory

In some cases, I found that onnx models running on the Onnxruntime backend kept accumulating GPU memory; in other words, they consumed too much memory. See the related GitHub issue.

Hence, the fix is to set the following line in your Onnxruntime script. By default, this parameter is enabled in Onnxruntime.

session_option.enable_mem_pattern = False

For example:

import onnxruntime

# Disable memory-pattern optimization to stop GPU memory from accumulating.
session_option = onnxruntime.SessionOptions()
session_option.enable_mem_pattern = False
decoder = onnxruntime.InferenceSession(decoder_path, session_option, providers=['CUDAExecutionProvider'])

I also compared the inference time with this parameter enabled and disabled: the speeds are the same, so disabling it does not affect inference speed.
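
For reference, here is a minimal timing sketch for making that comparison yourself. The model path ("decoder.onnx"), the input name, and the input shape are placeholders; substitute the values for your own model.

import time

import numpy as np
import onnxruntime

def time_session(enable_mem_pattern, n_runs=50):
    # Build a session with the given memory-pattern setting.
    opt = onnxruntime.SessionOptions()
    opt.enable_mem_pattern = enable_mem_pattern
    sess = onnxruntime.InferenceSession(
        "decoder.onnx", opt, providers=["CUDAExecutionProvider"]
    )
    # Placeholder input; replace the name and shape with your model's.
    feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
    sess.run(None, feed)  # warm-up run
    start = time.perf_counter()
    for _ in range(n_runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / n_runs

print("enable_mem_pattern=True :", time_session(True))
print("enable_mem_pattern=False:", time_session(False))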

Solution

Older versions of Triton Inference Server do not seem to expose this parameter in the model configuration file.

My original experiment was on v21.03, where it did not work; after switching to the newer v22.07, it worked well.

The method is to add a few settings to your model configuration file (i.e., config.pbtxt). You can check the reference here.

For example:

name : "test_model" 
platform: "onnxruntime_onnx"
max_batch_size: 0

input [
{
name: "input_1"
data_type: TYPE_FP32
dims: [1, 3, -1, -1]
},
{
name: "input_2"
data_type: TYPE_FP32
dims: [-1, -1, -1]
}
]

output [
{
name: "output_1"
data_type: TYPE_FP32
dims: [-1, 1, -1, -1]
}
]

instance_group [
{
count: 2
kind: KIND_GPU
}
]


optimization {
graph : {
level : 1
}}

parameters { key: "enable_mem_pattern" value: { string_value: "0" } }
Note
  • enable_mem_pattern: Use 1 to enable the memory pattern and 0 to disable it. See this for more information.
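
As a quick check that the deployed model still serves requests after this change, here is a client-side sketch using tritonclient. The server URL and the concrete input shapes are assumptions chosen only for illustration (the config above allows dynamic dimensions); adjust them to your deployment.

import numpy as np
import tritonclient.http as httpclient

# Assumed server address; adjust to where Triton is running.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Example shapes for the two inputs declared in config.pbtxt.
input_1 = httpclient.InferInput("input_1", [1, 3, 224, 224], "FP32")
input_1.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
input_2 = httpclient.InferInput("input_2", [1, 224, 224], "FP32")
input_2.set_data_from_numpy(np.random.rand(1, 224, 224).astype(np.float32))

result = client.infer(
    "test_model",
    inputs=[input_1, input_2],
    outputs=[httpclient.InferRequestedOutput("output_1")],
)
print(result.as_numpy("output_1").shape)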