Conversation

ggerganov
Member

I deployed 2x runners with AMD V710 GPUs to run CI workflows. However, they are extremely slow. Here are some benches for gemma 3 270M:

./bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gemma-3-270m-GGUF_gemma-3-270m-Q8_0.gguf -n 32

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Pro V710 MxGPU (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model            | size       | params   | backend | ngl | test  | t/s           |
| ---------------- | ---------- | -------- | ------- | --- | ----- | ------------- |
| gemma3 270M Q8_0 | 271.81 MiB | 268.10 M | Vulkan  |  99 | pp512 | 457.51 ± 0.23 |
| gemma3 270M Q8_0 | 271.81 MiB | 268.10 M | Vulkan  |  99 | tg32  |   5.27 ± 0.02 |

build: aa3ee0e (6582)


ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Pro V710 MxGPU, gfx1101 (0x1101), VMM: no, Wave Size: 32

| model            | size       | params   | backend | ngl | test  | t/s           |
| ---------------- | ---------- | -------- | ------- | --- | ----- | ------------- |
| gemma3 270M Q8_0 | 271.81 MiB | 268.10 M | ROCm    |  99 | pp512 | 166.89 ± 5.25 |
| gemma3 270M Q8_0 | 271.81 MiB | 268.10 M | ROCm    |  99 | tg32  |   6.48 ± 0.01 |

build: aa3ee0e (6582)


Does anyone know if this is expected? I installed ROCm driver per the following instructions:

https://learn.microsoft.com/en-us/azure/virtual-machines/linux/azure-n-series-amd-gpu-driver-linux-installation-guide

Is there some extra configuration needed to make AMD run faster? Currently, the computation using GPU (either with ROCm/HIP or Vulkan) is multiple times slower compared to CPU-only which does not seem normal. So I guess I have misconfigured something, but not sure what.
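For reference, a rough CPU-only baseline on the same model can be produced with something like the following (just a sketch; -ngl 0 keeps all layers on the CPU, although a build without any GPU backend is the cleaner comparison):

./bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gemma-3-270m-GGUF_gemma-3-270m-Q8_0.gguf -n 32 -ngl 0   # -ngl 0 disables layer offload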

cc @IMbackK @netrunnereve

@ggerganov ggerganov requested a review from CISC as a code owner September 25, 2025 12:12
@IMbackK
Collaborator

IMbackK commented Sep 25, 2025

That's certainly much, much slower than this GPU should be.
Unfortunately I am not aware of any mechanism that could cause this, nor have I ever run a virtualized amdgpu setup.

@github-actions github-actions bot added the devops (improvements to build systems and github actions) label Sep 25, 2025
@ggerganov
Member Author

ggerganov commented Sep 25, 2025

Yes, I think the GPU virtualization that these VMs use is massively degrading the performance. Either that, or I misconfigured something.

Open to suggestions/opinions on whether having these runners would be useful. On one hand, I guess it's better than nothing. On the other hand, 50 minutes per workflow will likely result in an infinite queue of jobs.

In any case, this is the best I can do using Azure cloud. If people have ideas how to provision AMD hardware in an alternative way - open to suggestions.

@IMbackK
Collaborator

IMbackK commented Sep 25, 2025

AMD previously offered us time on MI300 machines on DigitalOcean (https://www.amd.com/en/developer/resources/cloud-access/amd-developer-cloud.html) in our collaboration meeting; maybe they can spare the container hours for CI.

I can attest that these containers are fast.

@netrunnereve
Collaborator

Yeah this isn't right. I skimmed through the install guide and it looks like it tells you to install the proprietary AMD driver using sudo amdgpu-install --usecase=workstation,rocm,amf --opencl=rocr --vulkan=pro --no-32 --accept-eula, and those aren't going to run as well as the open source ones. However I'm also seeing that you're running with RADV, so that's strange. What's your current Mesa version and amdgpu version?

I'd expect the virtualization to have some effect, but this is ridiculously slow. Another thing you can do is check the GPU utilization to see if the card is actually being used properly.
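Something along these lines should show both the versions and whether the GPU is actually busy (a sketch, assuming an Ubuntu box with the amdgpu/ROCm tooling installed):

dpkg -l | grep mesa              # Mesa userspace packages
modinfo amdgpu | grep version    # kernel amdgpu module version
vulkaninfo --summary             # which Vulkan driver is actually loaded
watch -n 1 amd-smi monitor       # utilization/VRAM while a bench is running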

@netrunnereve
Collaborator

If people have ideas how to provision AMD hardware in an alternative way - open to suggestions.

This might be a possibility if you have a colo or office space for physical machines (does ggml even have an office?). Lemonade uses llama.cpp as their backend and they might be willing to provide us with support.

@ggerganov
Member Author

@netrunnereve While running perplexity with the ROCm build and the 270M Gemma, amd-smi does not seem to report any activity, but the GPU is using memory:

ggml@ggml-7-x86-amd-v710:~$ amd-smi monitor
GPU  POWER   GPU_T   MEM_T   GFX_CLK   GFX%   MEM%   ENC%   DEC%      VRAM_USAGE
  0    N/A     N/A     N/A       N/A    N/A    N/A    N/A    N/A    1.9/  4.3 GB

Here are some dumps:

ggml@ggml-7-x86-amd-v710:~/work/llama.cpp/build-rocm$ dpkg -l | grep mesa
ii  amdgpu-multimedia                        1:6.4.60403-2194681.24.04               amd64        Meta package to install mesa multimedia components.
ii  libegl-mesa0:amd64                       25.0.7-0ubuntu0.24.04.2                 amd64        free implementation of the EGL API -- Mesa vendor library
ii  libegl1-amdgpu-mesa:amd64                1:25.0.0.60403-2194681.24.04            amd64        free implementation of the EGL API -- Mesa vendor library
ii  libegl1-amdgpu-mesa-drivers:amd64        1:25.0.0.60403-2194681.24.04            amd64        free implementation of the EGL API -- hardware drivers
ii  libgl1-amdgpu-mesa-dri:amd64             1:25.0.0.60403-2194681.24.04            amd64        free implementation of the OpenGL API -- DRI modules
ii  libgl1-amdgpu-mesa-glx:amd64             1:25.0.0.60403-2194681.24.04            amd64        free implementation of the OpenGL API -- GLX runtime
ii  libgl1-mesa-dri:amd64                    25.0.7-0ubuntu0.24.04.2                 amd64        free implementation of the OpenGL API -- DRI modules
ii  libglx-mesa0:amd64                       25.0.7-0ubuntu0.24.04.2                 amd64        free implementation of the OpenGL API -- GLX vendor library
ii  mesa-amdgpu-libgallium:amd64             1:25.0.0.60403-2194681.24.04            amd64        shared infrastructure for Mesa drivers
ii  mesa-amdgpu-va-drivers:amd64             1:25.0.0.60403-2194681.24.04            amd64        Mesa VA-API video acceleration drivers
ii  mesa-amdgpu-vdpau-drivers:amd64          1:25.0.0.60403-2194681.24.04            amd64        Mesa VDPAU video acceleration drivers
ii  mesa-common-dev:amd64                    25.0.7-0ubuntu0.24.04.2                 amd64        Developer documentation for Mesa
ii  mesa-libgallium:amd64                    25.0.7-0ubuntu0.24.04.2                 amd64        shared infrastructure for Mesa drivers
ii  mesa-va-drivers:amd64                    25.0.7-0ubuntu0.24.04.2                 amd64        Mesa VA-API video acceleration drivers
ii  mesa-vdpau-drivers:amd64                 25.0.7-0ubuntu0.24.04.2                 amd64        Mesa VDPAU video acceleration drivers
ii  mesa-vulkan-drivers:amd64                25.0.7-0ubuntu0.24.04.2                 amd64        Mesa Vulkan graphics drivers
ggml@ggml-7-x86-amd-v710:~/work/llama.cpp/build-rocm$ modinfo amdgpu | grep version
version:        6.12.12
srcversion:     AC5C22E22EEDC97831DD74B
vermagic:       6.11.0-1018-azure SMP mod_unload modversions 
parm:           hws_gws_support:Assume MEC2 FW supports GWS barriers (false = rely on FW version check (Default), true = force supported) (bool)

ggml@ggml-7-x86-amd-v710:~/work/llama.cpp/build-rocm$ vulkaninfo --summary
'DISPLAY' environment variable not set... skipping surface info
==========
VULKANINFO
==========

Vulkan Instance Version: 1.4.321


Instance Extensions: count = 24
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_EXT_headless_surface                : extension revision 1
VK_EXT_surface_maintenance1            : extension revision 1
VK_EXT_swapchain_colorspace            : extension revision 5
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers: count = 13
---------------------------
VK_LAYER_AMD_switchable_graphics_64 AMD switchable graphics layer                                                                                     1.4.308  version 1
VK_LAYER_INTEL_nullhw               INTEL NULL HW                                                                                                     1.1.73   version 1
VK_LAYER_KHRONOS_profiles           Khronos Profiles layer                                                                                            1.4.321  version 1
VK_LAYER_KHRONOS_shader_object      Khronos Shader object layer                                                                                       1.4.321  version 1
VK_LAYER_KHRONOS_synchronization2   Khronos Synchronization2 layer                                                                                    1.4.321  version 1
VK_LAYER_KHRONOS_validation         Khronos Validation Layer                                                                                          1.4.321  version 1
VK_LAYER_LUNARG_api_dump            LunarG API dump layer                                                                                             1.4.321  version 2
VK_LAYER_LUNARG_crash_diagnostic    Crash Diagnostic Layer is a crash/hang debugging tool that helps determines GPU progress in a Vulkan application. 1.4.321  version 1
VK_LAYER_LUNARG_gfxreconstruct      GFXReconstruct Capture Layer Version 1.0.5                                                                        1.4.321  version 4194309
VK_LAYER_LUNARG_monitor             Execution Monitoring Layer                                                                                        1.4.321  version 1
VK_LAYER_LUNARG_screenshot          LunarG image capture layer                                                                                        1.4.321  version 1
VK_LAYER_MESA_device_select         Linux device selection layer                                                                                      1.4.303  version 1
VK_LAYER_MESA_overlay               Mesa Overlay layer                                                                                                1.4.303  version 1

Devices:
========
GPU0:
	apiVersion         = 1.4.308
	driverVersion      = 2.0.342
	vendorID           = 0x1002
	deviceID           = 0x7461
	deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
	deviceName         = AMD Radeon Pro V710 MxGPU
	driverID           = DRIVER_ID_AMD_PROPRIETARY
	driverName         = AMD proprietary driver
	driverInfo         = (LLPC)
	conformanceVersion = 1.4.0.0
	deviceUUID         = 02000000-0000-0000-0000-000000000000
	driverUUID         = 414d442d-4c49-4e55-582d-445256000000

I'm not sure which commands to run, so if you have any specific ones in mind, let me know.

@IMbackK
Collaborator

IMbackK commented Sep 26, 2025

Is the GPU exclusive to this VM? The V710 supports 12-way partitioning; if it's configured like that, it may simply be loaded by the other VMs.

@ggerganov
Member Author

Yes, the 2 runners that I deployed are of type Standard_NV4ads_V710_v5 - each one uses 1/6th of a full V710 GPU. I would still expect them to perform better, but maybe this is what you get with partial GPUs - not sure.

@netrunnereve
Collaborator

netrunnereve commented Sep 26, 2025

Looking at some of the old CI runs, it looks like the V710 was doing fine then, taking less than 20 minutes per run, which is around the same time the V100 machine took. I wonder if the VM got messed up or if the host it's running on has some problems.

https://github.com/ggml-org/llama.cpp/actions/runs/17938107299/job/51008109600
https://github.com/ggml-org/llama.cpp/actions/runs/17938107299/job/51008109713

@netrunnereve While running perplexity with the ROCm build and the 270M Gemma, amd-smi does not seem to report any activity, but the GPU is using memory:

Maybe rocm-smi doesn't work on VMs, I don't know. You can also try with radeontop or amdgpu-top.
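E.g. something like this for radeontop (a sketch; radeontop is in the Ubuntu repos, amdgpu-top is packaged separately):

sudo apt install radeontop
sudo radeontop          # watch GPU pipe / VRAM / GTT usage while perplexity is running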

Here are some dumps:

The Mesa and amdgpu versions are fine, but the vulkaninfo output shows that you're on the proprietary driver, and I wonder if the backend is mistakenly showing RADV. This doesn't explain why ROCm is so slow, but let's deal with one thing at a time. Personally I would just get rid of the amdgpu-install stuff and first try the default Ubuntu driver packages, but if you want to use amdgpu-install then remove it with amdgpu-uninstall and then reinstall with amdgpu-install -y --usecase=graphics which should only install the open source drivers.
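i.e. roughly this sequence (a sketch; a reboot in between is probably a good idea so the right kernel module and userspace get picked up):

sudo amdgpu-uninstall                        # drop the proprietary stack
sudo amdgpu-install -y --usecase=graphics    # open source Mesa/RADV userspace only
sudo reboot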

@ggerganov
Member Author

radeontop shows this during perplexity:

[radeontop screenshot]

remove it with amdgpu-uninstall and then reinstall with amdgpu-install -y --usecase=graphics which should only install the open source drivers.

Tried this but not much change. I also tried a few more alternatives:

amdgpu-install -y --usecase=graphics
amdgpu-install -y --usecase=graphics,rocm
amdgpu-install -y --usecase=graphics --vulkan=pro --no-32 

But performance is the same.

I wonder if the vm got messed up or if the host it's running on has some problems.

You are right that, based on the logs, it was much faster before. I initially deployed the VM in Europe, but there was an issue where its availability was not guaranteed, so it worked for a few hours and then stopped.

I then requested availability in US and after a few days with Azure support, got these permanent VMs. But maybe there is indeed some underlying issue with the host.

I feel like the memory transfer between the RAM and the GPU is very slow. Not sure how to benchmark it though.

@netrunnereve
Collaborator

radeontop shows this during perplexity:

The GTT usage is strange as it shouldn't be using that much on such a small model.

I then requested availability in US and after a few days with Azure support, got these permanent VMs. But maybe there is indeed some underlying issue with the host.

I feel like the memory transfer between the RAM and the GPU is very slow. Not sure how to benchmark it though.

Try memtest_vulkan, it'll give you an idea of what your memory bandwidth is. It won't hit the theoretical memory bandwidth limit, but the write test gets within 75% of it on my card.
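Roughly (a sketch; grab a prebuilt Linux binary from the memtest_vulkan releases page, the exact asset name may differ):

./memtest_vulkan        # pick the V710 device when prompted; Ctrl+C stops the test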

@ggerganov
Member Author

Here are the results from memtest-vulkan:

ggml@ggml-9-x86-amd-v710:~/vulkan/memtest$ ./memtest_vulkan 
https://github.com/GpuZelenograd/memtest_vulkan v0.5.0 by GpuZelenograd
To finish testing use Ctrl+C
1: Bus=0x00:00 DevId=0x7461   5GB AMD Radeon Pro V710 MxGPU (RADV NAVI32)
2: Bus=0x00:00 DevId=0x0000   16GB llvmpipe (LLVM 20.1.2, 256 bits)
                                                   Override index to test:1
Standard 5-minute test of 1: Bus=0x00:00 DevId=0x7461   5GB AMD Radeon Pro V710 MxGPU (RADV NAVI32)
      1 iteration. Passed  0.7377 seconds  written:    1.8GB   6.9GB/sec        checked:    3.5GB   7.2GB/sec
      3 iteration. Passed  1.4752 seconds  written:    3.5GB   6.9GB/sec        checked:    7.0GB   7.2GB/sec
     10 iteration. Passed  5.1508 seconds  written:   12.2GB   6.9GB/sec        checked:   24.5GB   7.2GB/sec
     51 iteration. Passed 30.2803 seconds  written:   71.8GB   6.8GB/sec        checked:  143.5GB   7.3GB/sec
     92 iteration. Passed 30.1841 seconds  written:   71.8GB   6.9GB/sec        checked:  143.5GB   7.3GB/sec
    133 iteration. Passed 30.1517 seconds  written:   71.8GB   6.9GB/sec        checked:  143.5GB   7.3GB/sec
    174 iteration. Passed 30.1678 seconds  written:   71.8GB   6.9GB/sec        checked:  143.5GB   7.2GB/sec
    215 iteration. Passed 30.1654 seconds  written:   71.8GB   6.8GB/sec        checked:  143.5GB   7.3GB/sec
    256 iteration. Passed 30.1429 seconds  written:   71.8GB   6.9GB/sec        checked:  143.5GB   7.3GB/sec
    297 iteration. Passed 30.1583 seconds  written:   71.8GB   6.9GB/sec        checked:  143.5GB   7.3GB/sec
    338 iteration. Passed 30.2269 seconds  written:   71.8GB   6.9GB/sec        checked:  143.5GB   7.2GB/sec
    379 iteration. Passed 30.1843 seconds  written:   71.8GB   6.9GB/sec        checked:  143.5GB   7.3GB/sec
Standard 5-minute test PASSed! Just press Ctrl+C unless you plan long test run.
Extended endless test started; testing more than 2 hours is usually unneeded
use Ctrl+C to stop it when you decide it's enough
^C
memtest_vulkan: no any errors, testing PASSed.
  press any key to continue...

@netrunnereve
Collaborator

Wow, that's some atrocious memory bandwidth, which explains the slow runs. I'm pretty sure either the host or the GPU is broken.

@ggerganov
Member Author

Yeah, something is wrong. I tried redeploying the instances multiple times on different operating systems - always the same result.

amd-smi monitor not showing all the information is quite suspicious:

$ amd-smi monitor
GPU  POWER   GPU_T   MEM_T   GFX_CLK   GFX%   MEM%   ENC%   DEC%      VRAM_USAGE
  0    N/A     N/A     N/A       N/A    N/A    N/A    N/A    N/A    0.2/  4.3 GB

I'm out of ideas. If you think of something to try, let me know. Otherwise I will probably retry in a few months.

I can also open SSH access on a fresh VM if you or someone else wants to give this a try.

@netrunnereve
Collaborator

I'm out of ideas too, you've pretty much tried what I would've done myself.

@ggerganov ggerganov marked this pull request as draft September 29, 2025 06:00
@ggerganov ggerganov force-pushed the gg/ci-add-amd-workflows branch from f48d3f3 to c355b35 on September 29, 2025 06:02
@ggerganov ggerganov force-pushed the gg/ci-add-amd-workflows branch from c355b35 to 498888b on September 29, 2025 06:03
@ggerganov
Member Author

Apart from being massively slow, the workflows seem to work fine:

https://github.com/ggml-org/llama.cpp/actions/runs/18087388576

This PR enables the AMD runs only for commits on master.

@ggerganov ggerganov marked this pull request as ready for review September 29, 2025 09:51
@ggerganov ggerganov merged commit d72f5f7 into master Sep 29, 2025
4 of 54 checks passed
@ggerganov ggerganov deleted the gg/ci-add-amd-workflows branch September 29, 2025 14:51