@kunal-vaishnavi kunal-vaishnavi commented Sep 26, 2025

Description

This PR allows users to customize run options per ONNX model that runs in ONNX Runtime GenAI. It also enables users to provide separate session options and provider options per ONNX model.

Usage

The run options can be added as key-value pairs in a separate, optional section within the GenAI config.

"session_options": {
    "log_id": "onnxruntime-genai",
    "use_device_allocator_for_initializers": true,
    "provider_options": [
        {
            "cuda": {
                "enable_cuda_graph": "0",
                "enable_skip_layer_norm_strict_mode": "1",
                "max_mem": "0",
                "arena_extend_strategy": "0",
                "initial_chunk_size_bytes": "5368709120",
                "max_dead_bytes_per_chunk": "0",
                "initial_growth_chunk_size_bytes": "1000000000"
            }
        }
    ]
},
"run_options": {
    "enable_memory_arena_shrinkage": "cpu:0;gpu:0"
},
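
As a minimal sketch of how such a section might be produced programmatically, the Python snippet below builds the `enable_memory_arena_shrinkage` value from (device, id) pairs and attaches a `run_options` block next to an existing `session_options` block. The helper names `shrinkage_value` and `with_run_options` are hypothetical and not part of ONNX Runtime GenAI; the snippet only manipulates a plain dictionary shaped like the fragment above.

```python
import json


def shrinkage_value(devices):
    """Join (device_type, device_id) pairs into a string such as "cpu:0;gpu:0"."""
    return ";".join(f"{kind}:{index}" for kind, index in devices)


def with_run_options(section, run_options):
    """Return a copy of a config section with run_options merged in.

    Values are stored as strings to match the key-value style shown above.
    """
    updated = dict(section)
    merged = dict(section.get("run_options", {}))
    merged.update({key: str(value) for key, value in run_options.items()})
    updated["run_options"] = merged
    return updated


# Hypothetical section mirroring the fragment above.
section = {
    "session_options": {"log_id": "onnxruntime-genai", "provider_options": []}
}
section = with_run_options(
    section,
    {"enable_memory_arena_shrinkage": shrinkage_value([("cpu", 0), ("gpu", 0)])},
)
print(json.dumps(section, indent=4))
```

Returning a copy rather than mutating the input keeps the original configuration available for comparison, which can be handy when measuring the effect of a single option.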

You can also set separate session options and run options for each ONNX model within the GenAI config.

"decoder": {
    "session_options": {
        "log_id": "onnxruntime-genai",
        "provider_options": []
    },
    "run_options": {
        "enable_memory_arena_shrinkage": "cpu:0;gpu:0"
    }
},
"vision": {
    "session_options": {
        "log_id": "onnxruntime-genai",
        "provider_options": []
    },
    "run_options": {
        "enable_memory_arena_shrinkage": "cpu:0;gpu:0"
    },
    "inputs": {
        "pixel_values": "pixel_values",
        "attention_mask": "image_attention_mask",
        "image_sizes": "image_sizes"
    },
    "outputs": {
        "image_features": "image_features"
    }
},
"speech": {
    "session_options": {
        "log_id": "onnxruntime-genai",
        "provider_options": []
    },
    "run_options": {
        "enable_memory_arena_shrinkage": "cpu:0;gpu:0"
    },
    "inputs": {
        "audio_embeds": "audio_embeds",
        "attention_mask": "audio_attention_mask",
        "audio_sizes": "audio_sizes",
        "audio_projection_mode": "audio_projection_mode"
    },
    "outputs": {
        "audio_features": "audio_features"
    }
},
"embedding": {
    "session_options": {
        "log_id": "onnxruntime-genai",
        "provider_options": []
    },
    "run_options": {
        "enable_memory_arena_shrinkage": "cpu:0;gpu:0"
    },
    "inputs": {
        "input_ids": "input_ids",
        "image_features": "image_features",
        "audio_features": "audio_features"
    },
    "outputs": {
        "inputs_embeds": "inputs_embeds"
    }
},
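
When several model sections need the same run option, editing each one by hand is repetitive. The following Python sketch applies one run option to every model section that is present, with optional per-section overrides. It assumes the caller passes the dictionary that directly contains the `decoder`, `vision`, `speech`, and `embedding` entries; where those entries sit inside a full GenAI config is not shown here. The function name `set_run_option_per_model` is hypothetical.

```python
import copy

# Section names taken from the example above.
MODEL_SECTIONS = ("decoder", "vision", "speech", "embedding")


def set_run_option_per_model(config, key, value, overrides=None):
    """Set a run option on every model section present in config.

    overrides maps a section name to a section-specific value.
    Sections that are absent from config are skipped, not created.
    """
    overrides = overrides or {}
    result = copy.deepcopy(config)
    for name in MODEL_SECTIONS:
        if name not in result:
            continue
        run_options = result[name].setdefault("run_options", {})
        run_options[key] = overrides.get(name, value)
    return result


# Hypothetical input with only two of the four sections.
models = {
    "decoder": {"session_options": {"log_id": "onnxruntime-genai"}},
    "vision": {"session_options": {"log_id": "onnxruntime-genai"}},
}
updated = set_run_option_per_model(
    models,
    "enable_memory_arena_shrinkage",
    "cpu:0;gpu:0",
    overrides={"vision": "gpu:0"},
)
print(updated["decoder"]["run_options"])
print(updated["vision"]["run_options"])
```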

Documentation

Session Options

For a full list, please see the list of keys available here.

Provider Options

For a full list, please see your target execution provider's page inside the ONNX Runtime docs.

Run Options

For a full list, please see the list of keys available here.

Motivation and Context

This PR lets users set run options such as memory.enable_memory_arena_shrinkage to reduce memory usage in memory-constrained environments.

Here is a quick reference of the memory benefits for Phi-4 multi-modal with two example images.

| Run Option | Provider Option | After First Run | After Second Run |
| --- | --- | --- | --- |
| None | None | 13290 MiB | 27626 MiB |
| "enable_memory_arena_shrinkage": "cpu:0;gpu:0" | None | 11240 MiB | 24552 MiB |
| None | "arena_extend_strategy": "0" | 10978 MiB | 25870 MiB |
| "enable_memory_arena_shrinkage": "cpu:0;gpu:0" | "arena_extend_strategy": "0" | 9506 MiB | 23224 MiB |
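
To make the comparison easier to read, the short Python sketch below computes the savings of each configuration relative to the row with no options set. The numbers are copied from the measurements above; the dictionary layout and the `savings` helper are only illustrative.

```python
# Memory usage in MiB, copied from the measurements above:
# (run option set, provider option set) -> (after first run, after second run)
measurements = {
    (False, False): (13290, 27626),
    (True, False): (11240, 24552),
    (False, True): (10978, 25870),
    (True, True): (9506, 23224),
}

baseline = measurements[(False, False)]


def savings(row):
    """Return (MiB saved, percent saved) per run relative to the baseline."""
    return [
        (base - value, round(100 * (base - value) / base, 1))
        for base, value in zip(baseline, row)
    ]


for (run_opt, provider_opt), row in measurements.items():
    print(run_opt, provider_opt, savings(row))
```

With both options set, the difference from the baseline is 3784 MiB after the first run and 4402 MiB after the second.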

It also resolves this issue.

Successfully merging this pull request may close these issues.

Support dedicated session and provider options for each model in VLM