@dakinggg dakinggg commented May 20, 2025

What does this PR do?

A previous PR that upgraded the EFA installer version changed which ARG determines whether EFA is installed, but that change was not propagated to the build args in the GitHub Actions workflow. As a result, all images were built with EFA, which causes issues for RDMA on non-AWS clusters.

Current image (broken): rdma-broken-1-pWWCyG
New image from this PR (fixed): rdma-fixed-1-ecrVG0

For the broken run, the logs contain a series of warnings:

libibverbs: Warning: couldn't load driver 'librxe-rdmav34.so': librxe-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libbnxt_re-rdmav34.so': libbnxt_re-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'liberdma-rdmav34.so': liberdma-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libmthca-rdmav34.so': libmthca-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libefa-rdmav34.so': libefa-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libipathverbs-rdmav34.so': libipathverbs-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libmana-rdmav34.so': libmana-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libvmw_pvrdma-rdmav34.so': libvmw_pvrdma-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libocrdma-rdmav34.so': libocrdma-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libhns-rdmav34.so': libhns-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libcxgb4-rdmav34.so': libcxgb4-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libsiw-rdmav34.so': libsiw-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libirdma-rdmav34.so': libirdma-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libqedr-rdmav34.so': libqedr-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libmlx4-rdmav34.so': libmlx4-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libhfi1verbs-rdmav34.so': libhfi1verbs-rdmav34.so: cannot open shared object file: No such file or directory

which are not present for the fixed run.
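To compare two runs without reading the logs by eye, a minimal sketch like the following counts the driver-load warnings in a saved log file. The function name `count_driver_warnings` and the idea of saving the run log to a local file are assumptions for illustration, not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: count libibverbs driver-load warnings in a log file.
# $1 is the path to a saved run log (an assumption; not part of this PR).
count_driver_warnings() {
    # -F matches the string literally, -c prints the number of matching lines.
    # grep exits non-zero when there are no matches, so `|| true` keeps the
    # function usable under `set -e`; it still prints 0 in that case.
    grep -c -F -- "libibverbs: Warning: couldn't load driver" "$1" || true
}
```

Running it against the broken and fixed logs should give a non-zero count for the former and 0 for the latter, assuming the logs match what is shown above.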

Previous action: https://github.com/mosaicml/composer/actions/runs/15005222089/job/42162037261
Action on this PR: https://github.com/mosaicml/composer/actions/runs/15127189449/job/42521298584?pr=3857

The logs show that the previous action installed EFA even though it should not have, since that action builds a non-AWS image:

#17 [pytorch_stage 11/20] RUN if [ -n "1.39.0" ] ; then         cd /tmp &&         curl -OsS https://efa-installer.amazonaws.com/aws-efa-installer-1.39.0.tar.gz &&         tar -xf /tmp/aws-efa-installer-1.39.0.tar.gz &&         cd aws-efa-installer &&         apt-get update &&         ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify &&         rm -rf /tmp/aws-efa-installer* ;     fi

whereas in the new action the EFA installation is skipped, because the version string is empty:

#17 [pytorch_stage 11/20] RUN if [ -n "" ] ; then         cd /tmp &&         curl -OsS https://efa-installer.amazonaws.com/aws-efa-installer-.tar.gz &&         tar -xf /tmp/aws-efa-installer-.tar.gz &&         cd aws-efa-installer &&         apt-get update &&         ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify &&         rm -rf /tmp/aws-efa-installer* ;     fi
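The difference between the two log lines comes down to the `[ -n "..." ]` test, which succeeds only for a non-empty string. The sketch below isolates that guard; the function name `efa_action` and its printed messages are hypothetical and stand in for the real install commands.

```shell
#!/bin/sh
# Hypothetical stand-in for the guard seen in the build logs.
# $1 plays the role of the EFA version build arg.
efa_action() {
    version="$1"
    if [ -n "$version" ]; then
        # Non-empty version: the real Dockerfile step would download and install EFA.
        echo "install aws-efa-installer-${version}.tar.gz"
    else
        # Empty version: the whole install body is skipped.
        echo "skip"
    fi
}

efa_action "1.39.0"   # prints: install aws-efa-installer-1.39.0.tar.gz
efa_action ""         # prints: skip
```

This is why passing an empty build arg for non-AWS images is enough to leave EFA out, even though the rendered command in the log still shows the (now malformed) download URL.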

All of the AWS Docker build actions on this PR were fully cached, since nothing changes for them.

@dakinggg dakinggg marked this pull request as ready for review May 20, 2025 02:19
@dakinggg dakinggg requested a review from a team as a code owner May 20, 2025 02:19
@dakinggg dakinggg requested a review from irenedea May 20, 2025 02:20
@dakinggg dakinggg merged commit 7ad045d into main May 20, 2025
30 checks passed
@dakinggg dakinggg deleted the fix-rdma branch May 20, 2025 02:23