rithwik-db
Contributor

What does this PR do?

Essentially, all usages of s3_bucket have been removed outside of test_object_store.py and test_s3_object_store.py, since we now have the ephemeral and read_only paths in UC Volumes.

@rithwik-db rithwik-db requested a review from a team as a code owner June 2, 2025 20:58
removed s3 bucket usage

formatted

moved to read_only

formatted

hopefully works

this should work

hopefully cpu passes

moved tests to gpu since they don't need cpu
@rithwik-db
Contributor Author

@dakinggg I keep getting this error on daily tests even though it seems to work perfectly on my remote instance. Just wanted to check if you've encountered this issue before...

@rithwik-db rithwik-db requested a review from dakinggg June 5, 2025 02:36
@dakinggg
Contributor

dakinggg commented Jun 5, 2025

Ah, I think it's because dist is initialized with the GPU backend (i.e. nccl) rather than the CPU backend (i.e. gloo) when running the GPU tests. It's probably not worth adjusting the test setup to support this; just let the test run on GPU instead of CPU.
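
For readers following along, here is a minimal sketch of the backend pairing described above. The helper name `pick_backend` and the `"gpu"`/`"cpu"` labels are hypothetical and not taken from the Composer test setup; only the backend names nccl and gloo come from the comment. NCCL process groups operate on CUDA tensors, so a test that runs collectives on CPU tensors would presumably need gloo.

```python
def pick_backend(device: str) -> str:
    """Return the distributed backend name usually paired with a device type.

    Hypothetical helper for illustration; the device labels are assumptions.
    """
    backends = {"gpu": "nccl", "cpu": "gloo"}
    if device not in backends:
        raise ValueError(f"unknown device type: {device!r}")
    return backends[device]


# The returned name is the kind of value one would pass as the `backend`
# argument of torch.distributed.init_process_group.
backend_for_gpu_suite = pick_backend("gpu")
backend_for_cpu_suite = pick_backend("cpu")
```

Since the GPU suite initializes dist once with nccl, moving the test to GPU avoids having to re-initialize the process group with a different backend for a single test.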

@dakinggg
Contributor

dakinggg commented Jun 6, 2025

It actually passed on 2.7 🤔

Screenshot 2025-06-05 at 7 22 21 PM

@dakinggg
Contributor

dakinggg commented Jun 6, 2025

I'm ok just adjusting the tolerance for 2.6 if that's sufficient for the test to pass. It's not super critical to keep exact numerical determinism across torch versions, and in fact on GPU I'd guess this might not be possible.
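
A rough sketch of what a version-dependent tolerance could look like, using only the standard library. The function names and both tolerance values are assumptions made for illustration; the thread does not state which tolerances the test uses.

```python
import math


def parse_major_minor(version: str) -> tuple[int, int]:
    """Turn a version string such as '2.6.0+cu124' into (2, 6)."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def rel_tolerance(torch_version: str) -> float:
    """Pick a looser relative tolerance for torch versions older than 2.7.

    Both values are hypothetical placeholders, not the ones used in the PR.
    """
    return 1e-3 if parse_major_minor(torch_version) < (2, 7) else 1e-6


def losses_match(expected: float, actual: float, torch_version: str) -> bool:
    """Compare two scalar losses under the tolerance chosen for this version."""
    return math.isclose(expected, actual, rel_tol=rel_tolerance(torch_version))
```

In a real test the version string would come from `torch.__version__`, and the comparison would more likely be done on tensors rather than Python floats.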

@rithwik-db
Contributor Author

@dakinggg the daily tests pass when I use a separate checkpoint for 2.6 vs 2.7
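
One way to picture the separate-checkpoint approach is a small helper that keys the checkpoint location on the torch major.minor version. Everything in this sketch is hypothetical: the function name, the root directory, the `torch-<major>.<minor>` folder layout, and the file name are placeholders, not the paths used in the PR.

```python
from pathlib import PurePosixPath


def checkpoint_path(root: str, torch_version: str) -> str:
    """Build a per-torch-version checkpoint path (hypothetical layout)."""
    # Drop any local build suffix such as '+cu124', then keep major.minor.
    major, minor = torch_version.split("+")[0].split(".")[:2]
    return str(PurePosixPath(root) / f"torch-{major}.{minor}" / "checkpoint.pt")


# Placeholder root; the real read_only volume path is not given in the thread.
path_26 = checkpoint_path("/hypothetical/read_only", "2.6.0+cu124")
path_27 = checkpoint_path("/hypothetical/read_only", "2.7.0")
```

Keeping one checkpoint per version means each test run resumes from a checkpoint written by the same torch version, which sidesteps cross-version numerical drift.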

@dakinggg
Contributor

dakinggg commented Jun 6, 2025

OK, good enough for me. Deterministically resuming a run from an older version of torch is a serious edge case.

@rithwik-db
Contributor Author

Reran the daily tests with the latest fixes to make sure: https://github.com/mosaicml/composer/actions/runs/15500553934

@rithwik-db rithwik-db merged commit 96db24c into main Jun 6, 2025
28 checks passed
@rithwik-db rithwik-db deleted the use-volumes branch June 6, 2025 22:43
2 participants