libstore: fix data race in getFileTransfer() singleton replacement #14094

Mic92 · 2025-09-26T20:27:13Z

Multiple threads could simultaneously observe that the singleton
FileTransfer instance has quit and attempt to replace it without
synchronization. This caused undefined behavior as destroying the
old object while other threads might still be accessing its mutex
is not allowed.

Fixed by implementing in-place restart of the FileTransfer object
instead of replacing it. This avoids destroying mutexes that other
threads might be waiting on or holding.
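
To illustrate the idea (this is a minimal sketch, not the code from this PR), the singleton below is created once and reset in place when it has quit, so its mutex is never destroyed. The names `Transfer`, `getTransfer`, `restartIfQuit` and `generation` are hypothetical stand-ins and do not exist in Nix.

```cpp
#include <memory>
#include <mutex>

// Hypothetical stand-in for the real FileTransfer class.
class Transfer {
    struct State {
        bool quit = false;
        int generation = 0; // counts restarts, for illustration only
    };
    std::mutex mutex_;
    State state_;

public:
    // Mark the instance as shut down.
    void shutdown() {
        std::lock_guard<std::mutex> lock(mutex_);
        state_.quit = true;
    }

    // Reset the instance in place if it has quit. The mutex itself is
    // never destroyed, so threads blocked on it remain valid.
    void restartIfQuit() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (state_.quit) {
            state_.quit = false;
            ++state_.generation;
        }
    }

    int generation() {
        std::lock_guard<std::mutex> lock(mutex_);
        return state_.generation;
    }
};

// The singleton is created once and never replaced.
std::shared_ptr<Transfer> getTransfer() {
    static std::shared_ptr<Transfer> instance = std::make_shared<Transfer>();
    instance->restartIfQuit();
    return instance;
}
```

Callers always receive the same object, so a thread that obtained the pointer before a restart keeps using a live mutex afterwards.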

Ericson2314 · 2025-09-27T15:49:26Z

I think it would be better to use the state mutex instead. Will make a suggestion.

Ericson2314 · 2025-09-27T15:57:19Z

https://en.cppreference.com/w/cpp/thread/mutex/~mutex.html and https://en.cppreference.com/w/c/thread/mtx_destroy.html imply to me that my suggestion above and this are both wrong, because we shouldn't destroy the mutex while other threads own it (my suggestion) or are waiting to lock it (current PR and status quo)

I think the right thing to do is then:

1. Take ownership of the state
2. Recreate the file transfer in place, rather than making a new one and assigning it.

Mic92 · 2025-09-29T07:02:04Z

> https://en.cppreference.com/w/cpp/thread/mutex/~mutex.html and https://en.cppreference.com/w/c/thread/mtx_destroy.html imply to me that my suggestion above and this are both wrong, because we shouldn't destroy the mutex while other threads own it (my suggestion) or are waiting to lock it (current PR and status quo)
>
> I think the right thing to do is then:
> 1. Take ownership of the state
> 2. Recreate the file transfer in place, rather than making a new one and assigning it.

What do you mean by re-creating a file transfer in-place? Having some sort of restart method that duplicates destructor/constructor?

Ericson2314 · 2025-09-29T16:36:16Z

Yes I do mean that, but it doesn't need to be duplicative. Since the method is non-virtual, you can safely call it from the constructor.
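
A minimal sketch of that shape, under the assumption of a simplified worker class (`Worker`, `start`, `stop` and `starts` are hypothetical names, not the ones in filetransfer.cc): the constructor and `restart()` share one non-virtual `start()` helper, so nothing is duplicated.

```cpp
#include <atomic>
#include <thread>

// Hypothetical worker owner; names are illustrative only.
class Worker {
    std::thread thread_;
    std::atomic<bool> stop_{false};
    std::atomic<int> starts_{0};

    // Non-virtual, so calling it from the constructor behaves exactly
    // like calling it later; no virtual dispatch is involved.
    void start() {
        stop_ = false;
        ++starts_;
        thread_ = std::thread([this] {
            while (!stop_)
                std::this_thread::yield();
        });
    }

    void stop() {
        stop_ = true;
        if (thread_.joinable())
            thread_.join();
    }

public:
    Worker() { start(); }
    ~Worker() { stop(); }

    // Reuses the same helpers as the constructor and destructor.
    void restart() {
        stop();
        start();
    }

    int starts() const { return starts_; }
};
```

The object, and any mutex it might own, stays alive across `restart()`; only the thread is torn down and recreated.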

Mic92 · 2025-09-29T16:42:12Z

> Yes I do mean that, but it doesn't need to be duplicative. Since the method is non-virtual, you can safely call it from the constructor.

It now actually does a bit less than the constructor. I abstracted the shared parts away reasonably. Have a look.

Ericson2314

Can we do it like this, with no second mutex?

src/libstore/filetransfer.cc

Mic92 · 2025-09-29T17:02:47Z

> Can we do it like this, with no second mutex?

This looks like a deadlock to me, though, because we wait for the worker thread to finish and the worker thread needs this lock. Quit can also be set from outside the worker.

Ericson2314 · 2025-09-29T17:07:53Z

I don't think so? Do you mean waiting for the old worker thread or something else? The comment says the old worker thread should be exiting, and thus releasing the lock, right?

Mic92 · 2025-09-29T17:29:30Z

> I don't think so? Do you mean waiting for the old worker thread or something else? The comment says the old worker thread should be exiting, and thus releasing the lock, right?

It has to access state to check the quit flag, unless it has set the quit flag itself. I updated the comment.

Ericson2314 · 2025-09-29T18:19:40Z

But restart will only be called if the quit flag was set, right? Once quit is set, I thought no one else needs to acquire the lock, except for the thread which calls restart.

Mic92 · 2025-09-29T19:02:33Z

The Deadlock Sequence:

Thread 1 (Main Thread):

1. Calls getFileTransfer()
2. Acquires state_.lock()
3. Sees quit == true
4. Calls restart(state) with lock held
5. Inside restart(), calls workerThread.join() while still holding the state lock
6. Waits for worker thread to finish...

Thread 2 (Worker Thread - in `workerThreadEntry()`):

void workerThreadEntry()
{
    try {
        workerThreadMain();  // Main loop has exited because quit == true
    } catch (nix::Interrupted & e) {
    } catch (std::exception & e) {
        printError("unexpected error in download thread: %s", e.what());
    }

    // Worker thread needs to do cleanup:
    {
        auto state(state_.lock());  // BLOCKED HERE - Thread 1 holds the lock!
        while (!state->incoming.empty())
            state->incoming.pop();
        state->quit = true;
    }
}

The Deadlock:

- Thread 1 holds the state_ lock and is waiting for Thread 2 to finish (workerThread.join()).
- Thread 2 has exited its main loop but needs to acquire the state_ lock for cleanup (lines 819-823).
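
One conventional way to break such a cycle is to release the lock before joining. The sketch below shows that pattern on a reduced model; it is an assumption for illustration, not the approach this PR settled on, and `MiniTransfer`, `startWorker` and the `int` queue are hypothetical.

```cpp
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical miniature of the worker/state pair discussed above.
struct MiniTransfer {
    struct State {
        bool quit = false;
        std::queue<int> incoming;
    };
    std::mutex mutex;
    State state;
    std::thread worker;

    void startWorker() {
        worker = std::thread([this] {
            // Main loop: poll the quit flag under the lock.
            for (;;) {
                {
                    std::lock_guard<std::mutex> lock(mutex);
                    if (state.quit)
                        break;
                }
                std::this_thread::yield();
            }
            // Cleanup needs the lock again, as in workerThreadEntry().
            std::lock_guard<std::mutex> lock(mutex);
            while (!state.incoming.empty())
                state.incoming.pop();
        });
    }

    void restart() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            state.quit = true;
        } // lock released here, before join()
        if (worker.joinable())
            worker.join(); // the worker can now take the lock and finish
        {
            std::lock_guard<std::mutex> lock(mutex);
            state.quit = false;
        }
        startWorker();
    }

    ~MiniTransfer() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            state.quit = true;
        }
        if (worker.joinable())
            worker.join();
    }
};
```

The trade-off is that other threads can observe the state between the unlock and the join, which is why the thread goes on to discuss alternatives.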

Ericson2314 · 2025-09-29T19:10:26Z

@Mic92 How about we just have restart do the state->incoming.pop() loop, rather than the worker thread?

Ericson2314 · 2025-09-29T19:11:16Z

There is also https://en.cppreference.com/w/cpp/thread/mutex/try_lock.html, which we could use to make workerThreadEntry attempt the cleanup without blocking on the lock. But I don't think that is necessary.
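
A rough sketch of the try-lock variant, assuming a plain `std::mutex` and an `int` queue in place of the real `Sync` wrapper and item type (`tryDrain` and `drainFromWorker` are hypothetical helpers): the cleanup runs only if the lock is free, otherwise it is skipped.

```cpp
#include <mutex>
#include <queue>
#include <thread>

// Drain the queue only if the lock can be taken without waiting.
// Returns true if the cleanup ran, false if the lock was not acquired.
bool tryDrain(std::mutex & m, std::queue<int> & incoming) {
    std::unique_lock<std::mutex> lock(m, std::try_to_lock);
    if (!lock.owns_lock())
        return false; // do not block; leave cleanup to the lock holder
    while (!incoming.empty())
        incoming.pop();
    return true;
}

// Runs tryDrain on a separate thread, as a worker's exit path would,
// and reports whether the cleanup happened.
bool drainFromWorker(std::mutex & m, std::queue<int> & incoming) {
    bool drained = false;
    std::thread worker([&] { drained = tryDrain(m, incoming); });
    worker.join();
    return drained;
}
```

Note that `try_lock` is permitted to fail spuriously, so a skipped cleanup has to be picked up by someone else, for example by the thread that performs the restart.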

Mic92 · 2025-09-29T19:16:56Z

> @Mic92 How about we just have restart do the state->incoming.pop() loop, rather than the worker thread?

The actual code looks more like this:

            std::vector<std::shared_ptr<TransferItem>> incoming;
            auto now = std::chrono::steady_clock::now();

            {
                auto state(state_.lock());
                while (!state->incoming.empty()) {
                    auto item = state->incoming.top();
                    if (item->embargo <= now) {
                        incoming.push_back(item);
                        state->incoming.pop();
                    } else {
                        if (nextWakeup == std::chrono::steady_clock::time_point() || item->embargo < nextWakeup)
                            nextWakeup = item->embargo;
                        break;
                    }
                }
                quit = state->quit;
            }

            for (auto & item : incoming) {
                debug("starting %s of %s", item->request.verb(), item->request.uri);
                item->init();
                curl_multi_add_handle(curlm, item->req);
                item->active = true;
                items[item->req] = item;
            }

So we really need to do this in the worker thread.
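
The embargo handling in the loop above can be isolated as follows. This is a simplified sketch: `Item`, `LaterEmbargo` and `takeReady` are hypothetical, and the real code stores `std::shared_ptr<TransferItem>` and runs under the state lock.

```cpp
#include <chrono>
#include <queue>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for TransferItem; only the embargo time matters here.
struct Item {
    int id;
    Clock::time_point embargo;
};

// Orders the queue so that the earliest embargo is on top.
struct LaterEmbargo {
    bool operator()(const Item & a, const Item & b) const {
        return a.embargo > b.embargo;
    }
};

using Queue = std::priority_queue<Item, std::vector<Item>, LaterEmbargo>;

// Moves every item whose embargo has passed into the result. If an item
// is still embargoed, nextWakeup is lowered to its embargo when that is
// earlier (or when nextWakeup is unset), and the scan stops.
std::vector<Item> takeReady(Queue & queue, Clock::time_point now, Clock::time_point & nextWakeup) {
    std::vector<Item> ready;
    while (!queue.empty()) {
        Item item = queue.top();
        if (item.embargo <= now) {
            ready.push_back(item);
            queue.pop();
        } else {
            if (nextWakeup == Clock::time_point() || item.embargo < nextWakeup)
                nextWakeup = item.embargo;
            break;
        }
    }
    return ready;
}
```

Items that are still embargoed stay in the queue, which is why this loop is not interchangeable with the unconditional pop loop that runs at shutdown.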

Ericson2314 · 2025-09-29T19:18:27Z

Maybe it would be good for quit to be an atomic and not part of state, so it can always be set during unwinding at any point without blocking. (Acquiring resources in order to release resources, as when exiting the thread, is bad luck.)
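
As a small illustration of that point, assuming a global flag and a scope guard (`quitFlag`, `QuitOnExit` and `workerBody` are hypothetical names), an atomic can be set from a destructor during stack unwinding without taking any lock:

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical flag living outside the mutex-protected state.
std::atomic<bool> quitFlag{false};

// Sets the flag when the scope exits, including during stack unwinding.
struct QuitOnExit {
    ~QuitOnExit() { quitFlag.store(true); }
};

// Stand-in for a worker body that may throw at any point.
void workerBody(bool fail) {
    QuitOnExit guard;
    if (fail)
        throw std::runtime_error("simulated failure");
}
```

Because the store cannot block, the exit path never has to wait on a thread that is itself waiting for the worker to exit.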

Ericson2314 · 2025-09-29T19:20:07Z

I think we can end up with

void workerThreadEntry()
{
    try {
        workerThreadMain();  // Main loop has exited because quit == true
    } catch (nix::Interrupted & e) {
        quit = true; // atomic
    } catch (std::exception & e) {
        quit = true; // atomic
        printError("unexpected error in download thread: %s", e.what());
    }
    // The only way to leave `workerThreadMain` besides exception unwinding is for `quit` to already be set.
    assert(quit);
}

and then pop loop is in restart.
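
A reduced sketch of that arrangement, under the assumption that the worker's exit path no longer touches the lock (`Restartable` and its members are hypothetical): `restart()` can then hold the lock across `join()` and run the pop loop itself.

```cpp
#include <atomic>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical reduction of the design sketched above.
struct Restartable {
    std::atomic<bool> quit{false};
    std::mutex mutex;
    std::queue<int> incoming; // guarded by mutex
    std::thread worker;

    void startWorker() {
        quit = false;
        worker = std::thread([this] {
            // The worker never needs the lock on its way out.
            while (!quit)
                std::this_thread::yield();
        });
    }

    void restart() {
        std::lock_guard<std::mutex> lock(mutex);
        quit = true; // in the real flow this would already be set
        if (worker.joinable())
            worker.join(); // safe under the lock: the exit path is lock-free
        while (!incoming.empty())
            incoming.pop(); // pop loop moved out of the worker
        startWorker();
    }

    ~Restartable() {
        quit = true;
        if (worker.joinable())
            worker.join();
    }
};
```

This only holds as long as nothing on the worker's exit path acquires `mutex`; any such acquisition would reintroduce the deadlock described earlier.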

Radvendii · 2025-09-29T19:23:30Z

I had the following idea for restart(), but I think it doesn't work either: one thread might call workerThread.join() after a different thread has already called startWorkerThread(), and then the first thread blocks indefinitely.

    void restart()
    {
        // The worker thread will exit if quit has been set
        workerThread.join();

        // Check if we need to restart
        {
            auto state(state_.lock());
            if (!state->quit) {
                return;
            }
            resetCurl();
            state->quit = false;
        }
        startWorkerThread();
    }

@Ericson2314 what are we trying to avoid / optimize by getting rid of restartLock? It seems like a natural way of expressing "this function should only be called from one thread"
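
For reference, the dedicated-lock pattern being asked about could look roughly like this. It is a sketch under assumptions: `Service`, `restartMutex` and `restartCount` are hypothetical, and the slow work (joining the thread, resetting curl) is only indicated by a comment.

```cpp
#include <mutex>

// Hypothetical class showing a dedicated restart mutex.
class Service {
    std::mutex stateMutex;   // guards quit
    std::mutex restartMutex; // serializes restarts, guards restarts
    bool quit = false;
    int restarts = 0;

public:
    void requestQuit() {
        std::lock_guard<std::mutex> lock(stateMutex);
        quit = true;
    }

    void restartIfNeeded() {
        // Only one thread at a time gets past this point.
        std::lock_guard<std::mutex> restartGuard(restartMutex);
        {
            std::lock_guard<std::mutex> lock(stateMutex);
            if (!quit)
                return; // another thread already restarted
        }
        // Slow work (e.g. joining a thread) would go here, holding
        // only restartMutex and not stateMutex.
        ++restarts;
        std::lock_guard<std::mutex> lock(stateMutex);
        quit = false;
    }

    int restartCount() {
        std::lock_guard<std::mutex> guard(restartMutex);
        return restarts;
    }
};
```

The second mutex costs an extra acquisition on the restart path, but it keeps the state lock free while the restarting thread waits, so the worker can still take it during its own cleanup.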

Mic92 · 2025-09-29T19:25:56Z

> I think we can end up with
>
> void workerThreadEntry()
> {
>     try {
>         workerThreadMain();  // Main loop has exited because quit == true
>     } catch (nix::Interrupted & e) {
>         quit = true; // atomic
>     } catch (std::exception & e) {
>         quit = true; // atomic
>         printError("unexpected error in download thread: %s", e.what());
>     }
>     // The only way to leave `workerThreadMain` besides exception unwinding is for `quit` to already be set.
>     assert(quit);
> }
>
> and then pop loop is in restart.

Feels like a bigger refactor that is not so easy to backport.

Mic92 requested a review from Ericson2314 as a code owner September 26, 2025 20:27

Mic92 force-pushed the concurrency-bugs branch from de2a941 to abad972 Compare September 26, 2025 20:29

Mic92 force-pushed the concurrency-bugs branch 3 times, most recently from 2dfb906 to 33b27cc Compare September 29, 2025 07:42

Mic92 requested a review from Copilot September 29, 2025 07:43

Mic92 force-pushed the concurrency-bugs branch from 33b27cc to 4d0bbff Compare September 29, 2025 08:10

Mic92 requested a review from Copilot September 29, 2025 08:20

Ericson2314 reviewed Sep 29, 2025

Mic92 force-pushed the concurrency-bugs branch from 4d0bbff to 77ba049 Compare September 29, 2025 17:31

Ericson2314 force-pushed the concurrency-bugs branch from 77ba049 to 90f7bed Compare September 29, 2025 21:55

Ericson2314 force-pushed the concurrency-bugs branch from 90f7bed to b0b9391 Compare September 29, 2025 22:08

Ericson2314 marked this pull request as draft September 29, 2025 22:09

Ericson2314 and others added 6 commits September 29, 2025 18:10

Encapsulate curlFileTransfer::State:quit

d5402b8

It is allowed to read it, and to set it to `false`, but not to set it to `true`.

curlFileTransfer::State:quit emptys the queue

1f65b08

Whoever first calls `quit` now empties the queue, instead of waiting for the worker thread to do it. (Note that in the unwinding case, the worker thread is still the first to call `quit`, though.)

curlFileTransfer::workerThreadEntry Only call quit if we need to.

86fb5b2

curlFileTransfer::stopWorkerThread Take lock guard

d7d644f

Will be useful in a moment

WIP

41f3226
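
The encapsulation described in the first two commit messages above might be sketched like this. The class is a hypothetical reduction: the real `curlFileTransfer::State` has different members, and the method names `isQuitting` and `unquit` are assumptions.

```cpp
#include <queue>

// Hypothetical reduction of curlFileTransfer::State.
class State {
    bool quit_ = false;

public:
    std::queue<int> incoming;

    // Reading the flag is allowed.
    bool isQuitting() const { return quit_; }

    // The only way to set the flag to true. It also empties the queue,
    // so whoever quits first does the cleanup.
    void quit() {
        quit_ = true;
        while (!incoming.empty())
            incoming.pop();
    }

    // Resetting to false is allowed, e.g. when restarting.
    void unquit() { quit_ = false; }
};
```

Making the flag private means no caller can set it to true without also emptying the queue.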

Ericson2314 mentioned this pull request Sep 29, 2025

Some Curl file transfer cleanups #14121

Merged

Ericson2314 force-pushed the concurrency-bugs branch from b0b9391 to 41f3226 Compare September 29, 2025 22:14
