Skip to content

Conversation

Mic92
Copy link
Member

@Mic92 Mic92 commented Sep 26, 2025

Multiple threads could simultaneously observe that the singleton
FileTransfer instance has quit and attempt to replace it without
synchronization. This caused undefined behavior as destroying the
old object while other threads might still be accessing its mutex
is not allowed.

Fixed by implementing in-place restart of the FileTransfer object
instead of replacing it. This avoids destroying mutexes that other
threads might be waiting on or holding.

Motivation

Context


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@Ericson2314
Copy link
Member

I think it would be better to use the state mutex instead. Will make a suggestion.

@Ericson2314
Copy link
Member

https://en.cppreference.com/w/cpp/thread/mutex/~mutex.html and https://en.cppreference.com/w/c/thread/mtx_destroy.html imply to me that my suggestion above and this are both wrong, because we shouldn't destroy the mutex while other threads own it (my suggestion) or are waiting to lock it (current PR and status quo)

I think the right thing to do is then:

  1. Take owneship of the state
  2. Recreate the file transfer in place, rather than making a new one and assigning it.

@Mic92
Copy link
Member Author

Mic92 commented Sep 29, 2025

https://en.cppreference.com/w/cpp/thread/mutex/~mutex.html and https://en.cppreference.com/w/c/thread/mtx_destroy.html imply to me that my suggestion above and this are both wrong, because we shouldn't destroy the mutex while other threads own it (my suggestion) or are waiting to lock it (current PR and status quo)

I think the right thing to do is then:

1. Take owneship of the state

2. Recreate the file transfer in place, rather than making a new one and assigning it.

What do you mean by re-creating a file transfer in-place? Having some sort of restart method that duplicates destructor/constructor?

@Mic92 Mic92 force-pushed the concurrency-bugs branch 3 times, most recently from 2dfb906 to 33b27cc Compare September 29, 2025 07:42
@Mic92 Mic92 requested a review from Copilot September 29, 2025 07:43
Copilot

This comment was marked as resolved.

@Mic92 Mic92 requested a review from Copilot September 29, 2025 08:20
Copilot

This comment was marked as resolved.

@Ericson2314
Copy link
Member

Yes I do mean that, but it doesn't need to be duplicative. Since the method is non-virtual, you can safely call it from the constructor.

@Mic92
Copy link
Member Author

Mic92 commented Sep 29, 2025

Yes I do mean that, but it doesn't need to be duplicative. Since the method is non-virtual, you can safely call it from the constructor.

It does now actually a bit less than the constructor. I abstracted reasonably away. Have a look.

Copy link
Member

@Ericson2314 Ericson2314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do it like this, with no second mutex?

@Mic92
Copy link
Member Author

Mic92 commented Sep 29, 2025

Can we do it like this, with no second mutex?

This looks like a deadlock to me though because we wait for the worker thread to finish and the worker thread needs this lock. Quit can be also set from outside the worker.

@Ericson2314
Copy link
Member

I don't think so? Do you mean waiting for the old worker thread or something else? The comment says the old worker thread should be existing, and thus releasing the lock, right?

@Mic92
Copy link
Member Author

Mic92 commented Sep 29, 2025

I don't think so? Do you mean waiting for the old worker thread or something else? The comment says the old worker thread should be existing, and thus releasing the lock, right?

It has to access state to check the quit flag, unless it has set the quit flag itself. I updated the comment.

@Ericson2314
Copy link
Member

But restart will only be called if the quit flag was set, right? Once quit is set, I thought no one else needs to acquire the lock, except for the thread which calls restart.

@Mic92
Copy link
Member Author

Mic92 commented Sep 29, 2025

The Deadlock Sequence:

Thread 1 (Main Thread):

  1. Calls getFileTransfer()
  2. Acquires state_.lock()
  3. Sees quit == true
  4. Calls restart(state) with lock held
  5. Inside restart(), calls workerThread.join() while still holding the state lock
  6. Waits for worker thread to finish...

Thread 2 (Worker Thread - in workerThreadEntry()):

void workerThreadEntry()
{
    try {
        workerThreadMain();  // Main loop has exited because quit == true
    } catch (nix::Interrupted & e) {
    } catch (std::exception & e) {
        printError("unexpected error in download thread: %s", e.what());
    }

    // Worker thread needs to do cleanup:
    {
        auto state(state_.lock());  // BLOCKED HERE - Thread 1 holds the lock!
        while (!state->incoming.empty())
            state->incoming.pop();
        state->quit = true;
    }
}

The Deadlock:

  • Thread 1 holds the state_ lock and is waiting for Thread 2 to finish (workerThread.join())
  • Thread 2 has exited its main loop but needs to acquire state_ lock for cleanup (lines 819-823)

@Ericson2314
Copy link
Member

@Mic92 How about lets just have restart do the state->incoming.pop() loop instead, rather than the worker thread?

@Ericson2314
Copy link
Member

Ericson2314 commented Sep 29, 2025

There is also https://en.cppreference.com/w/cpp/thread/mutex/try_lock.html we can use to make workerThreadEntry try to do that, but not block on holding the lock. But I don't think that is necessary.

@Mic92
Copy link
Member Author

Mic92 commented Sep 29, 2025

@Mic92 How about lets just have restart do the state->incoming.pop() loop instead, rather than the worker thread?

The actual code looks more like this:

            std::vector<std::shared_ptr<TransferItem>> incoming;
            auto now = std::chrono::steady_clock::now();

            {
                auto state(state_.lock());
                while (!state->incoming.empty()) {
                    auto item = state->incoming.top();
                    if (item->embargo <= now) {
                        incoming.push_back(item);
                        state->incoming.pop();
                    } else {
                        if (nextWakeup == std::chrono::steady_clock::time_point() || item->embargo < nextWakeup)
                            nextWakeup = item->embargo;
                        break;
                    }
                }
                quit = state->quit;
            }

            for (auto & item : incoming) {
                debug("starting %s of %s", item->request.verb(), item->request.uri);
                item->init();
                curl_multi_add_handle(curlm, item->req);
                item->active = true;
                items[item->req] = item;
            }

So we really need to do this in the worker thread.

@Ericson2314
Copy link
Member

maybe it would be good for quit to be an atomic and not part of state, so it can always be set during unwinding at any point without blocking. (Acquiring resources to release resources, like exiting the thread, is bad luck.)

@Ericson2314
Copy link
Member

I think we can end up with

void workerThreadEntry()
{
    try {
        workerThreadMain();  // Main loop has exited because quit == true
    } catch (nix::Interrupted & e) {
        quit = false; // atomic
    } catch (std::exception & e) {
        quit = false; // atomic
        printError("unexpected error in download thread: %s", e.what());
    }
    // there only way to leave `workerThreadMain` besides exception unwinding is for `quit` to already be set.
    assert(quit);
}

and then pop loop is in restart.

@Radvendii
Copy link
Contributor

Radvendii commented Sep 29, 2025

I had the following idea for restart(), but I think this doesn't work either, as you might end up with a thread calling workerThread.join() after startWrokerThread() was called by a different thread, and then the first thread blocks indefinitely.

    void restart()
    {
        // The worker thread will exit if quit has been set
        workerThread.join();
        
        // Check if we need to restart
        {
            auto state(state_.lock());
            if (!state->quit) {
                return;
            }
            resetCurl();
            state->quit = false;
        }
        startWorkerThread()
    }

@Ericson2314 what are we trying to avoid / optimize by getting rid of restartLock? It seems like a natural way of expressing "this function should only be called from one thread"

@Mic92
Copy link
Member Author

Mic92 commented Sep 29, 2025

I think we can end up with

void workerThreadEntry()
{
    try {
        workerThreadMain();  // Main loop has exited because quit == true
    } catch (nix::Interrupted & e) {
        quit = false; // atomic
    } catch (std::exception & e) {
        quit = false; // atomic
        printError("unexpected error in download thread: %s", e.what());
    }
    // there only way to leave `workerThreadMain` besides exception unwinding is for `quit` to already be set.
    assert(quit);
}

and then pop loop is in restart.

Feels like a bigger refactor that is not so easy to backport.

@Ericson2314 Ericson2314 marked this pull request as draft September 29, 2025 22:09
Ericson2314 and others added 6 commits September 29, 2025 18:10
It is allowed to read it, and to set it to `false`, but not to set it
to `true`.
Whoever first calls `quit` now empties the queue, instead of waiting for
the worker thread to do it.

(Note that in the unwinding case, the worker thread is still the first
to call `quit`, though.)
Multiple threads could simultaneously observe that the singleton
FileTransfer instance has quit and attempt to replace it without
synchronization. This caused undefined behavior as destroying the
old object while other threads might still be accessing its mutex
is not allowed.

Fixed by implementing in-place restart of the FileTransfer object
instead of replacing it. This avoids destroying mutexes that other
threads might be waiting on or holding.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants