Introduced timeout in allocate_connection to prevent deadlocks #18
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I wrote an application that writes a lot of data into a cassandra cluster. Sometimes my application hangs after calling the execute(...) function of the driver. I found out, that the driver is waiting forever for a new connection (see full Stacktrace below).
Every time this issue occurs, one of the tcp connections to the cassandra cluster is in the state "CLOSE_WAIT".
-+-
cql_demo 23740 kristof 6u IPv4 496342 0t0 TCP 192.168.1.26:57590->192.168.1.108:9042 (ESTABLISHED)
cql_demo 23740 kristof 7u IPv4 498916 0t0 TCP 192.168.1.26:50402->192.168.1.113:9042 (ESTABLISHED)
cql_demo 23740 kristof 8u IPv4 498917 0t0 TCP 192.168.1.26:51351->192.168.1.114:9042 (ESTABLISHED)
cql_demo 23740 kristof 9u IPv4 496346 0t0 TCP 192.168.1.26:39331->192.168.1.115:9042 (ESTABLISHED)
cql_demo 23740 kristof 10u IPv4 496348 0t0 TCP 192.168.1.26:57594->192.168.1.108:9042 (ESTABLISHED)
cql_demo 23740 kristof 11u IPv4 496350 0t0 TCP 192.168.1.26:41152->192.168.1.111:9042 (ESTABLISHED)
cql_demo 23740 kristof 12u IPv4 495444 0t0 TCP 192.168.1.26:41153->192.168.1.111:9042 (CLOSE_WAIT)
-+-
I'm using debian 7 (stable) with gcc 4.7.2 and boost 1.49. The cassandra nodes are running debian 7 with cassandra 2.0.5. I'm not sure, this is an issue in the boost library or a multi threading issue in the driver. The problem only occurs if my application doesn't wait for the future returned by execute() to finish. If my application waits for the future before executing the next statement, the problem does not occur.
The state CLOSE_WAIT of the tcp connection may indicate, that the driver did forget to call the shutdown(..) / close(..) functions on the socket. In function cql_session_impl_t::try_remove_connection(..) are some TODO comments that indicates the code might have some multi threading issues.
My patch does not solve the root cause of the problem. But it ensures that we do not wait forever for the connection.
(gdb) thread apply all backtrace
Thread 4 (Thread 0x7f8f01a84700 (LWP 23783)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f8f02c7fb11 in boost::asio::detail::posix_event::waitboost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex > (this=0x7f8f01a83c90, lock=...)
#2 0x00007f8f02c7795c in boost::asio::detail::task_io_service::do_run_one (this=0x1025df0, lock=..., this_thread=..., private_op_queue=..., ec=...)
#3 0x00007f8f02c77397 in boost::asio::detail::task_io_service::run (this=0x1025df0, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:146
#4 0x00007f8f02c77ce1 in boost::asio::io_service::run (this=0x1025ca0) at /usr/include/boost/asio/impl/io_service.ipp:59
#5 0x00007f8f02ce22b2 in cql::cql_cluster_impl_t::asio_thread_main (io_service=...) at /tmp/cpp-driver.git/include/cql/internal/cql_cluster_impl.hpp:89
#6 0x00007f8f02ce8d9c in boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > >::operator()<void (*)(boost::shared_ptrboost::asio::io_service), boost::_bi::list0> (this=0x10263b0, f=@0x10263a8: 0x7f8f02ce2292 cql::cql_cluster_impl_t::asio_thread_main(boost::shared_ptrboost::asio::io_service), a=...)
#7 0x00007f8f02ce8d31 in boost::_bi::bind_t<void, void ()(boost::shared_ptrboost::asio::io_service), boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > > >::operator() (this=0x10263a8) at /usr/include/boost/bind/bind_template.hpp:20
#8 0x00007f8f02ce8ad0 in boost::detail::thread_data<boost::_bi::bind_t<void, void ()(boost::shared_ptrboost::asio::io_service), boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > > > >::run (this=0x1026220) at /usr/include/boost/thread/detail/thread.hpp:62
#9 0x00007f8f03c61629 in ?? () from /usr/lib/libboost_thread.so.1.49.0
#10 0x00007f8f0387ab50 in start_thread (arg=) at pthread_create.c:304
#11 0x00007f8f01f6d0ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()
Thread 3 (Thread 0x7f8f01283700 (LWP 23784)):
#0 0x00007f8f01f6d743 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007f8f02c76753 in boost::asio::detail::epoll_reactor::run (this=0x1027010, block=true, ops=...) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:366
#2 0x00007f8f02c77835 in boost::asio::detail::task_io_service::do_run_one (this=0x1025df0, lock=..., this_thread=..., private_op_queue=..., ec=...)
#3 0x00007f8f02c77397 in boost::asio::detail::task_io_service::run (this=0x1025df0, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:146
#4 0x00007f8f02c77ce1 in boost::asio::io_service::run (this=0x1025ca0) at /usr/include/boost/asio/impl/io_service.ipp:59
#5 0x00007f8f02ce22b2 in cql::cql_cluster_impl_t::asio_thread_main (io_service=...) at /tmp/cpp-driver.git/include/cql/internal/cql_cluster_impl.hpp:89
#6 0x00007f8f02ce8d9c in boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > >::operator()<void (*)(boost::shared_ptrboost::asio::io_service), boost::_bi::list0> (this=0x1026710, f=@0x1026708: 0x7f8f02ce2292 cql::cql_cluster_impl_t::asio_thread_main(boost::shared_ptrboost::asio::io_service), a=...)
#7 0x00007f8f02ce8d31 in boost::_bi::bind_t<void, void ()(boost::shared_ptrboost::asio::io_service), boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > > >::operator() (this=0x1026708) at /usr/include/boost/bind/bind_template.hpp:20
#8 0x00007f8f02ce8ad0 in boost::detail::thread_data<boost::_bi::bind_t<void, void ()(boost::shared_ptrboost::asio::io_service), boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > > > >::run (this=0x1026580) at /usr/include/boost/thread/detail/thread.hpp:62
#9 0x00007f8f03c61629 in ?? () from /usr/lib/libboost_thread.so.1.49.0
#10 0x00007f8f0387ab50 in start_thread (arg=) at pthread_create.c:304
#11 0x00007f8f01f6d0ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()
Thread 2 (Thread 0x7f8f00a82700 (LWP 23785)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f8f02c7fb11 in boost::asio::detail::posix_event::waitboost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex > (this=0x7f8f00a81d30, lock=...)
#2 0x00007f8f02c7795c in boost::asio::detail::task_io_service::do_run_one (this=0x1028370, lock=..., this_thread=..., private_op_queue=..., ec=...)
#3 0x00007f8f02c77397 in boost::asio::detail::task_io_service::run (this=0x1028370, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:146
#4 0x00007f8f02c77ce1 in boost::asio::io_service::run (this=0x1028300) at /usr/include/boost/asio/impl/io_service.ipp:59
#5 0x00007f8f02c7acb1 in boost::asio::detail::resolver_service_base::work_io_service_runner::operator() (this=0x102b0f8)
#6 0x00007f8f02ca84a6 in boost::asio::detail::posix_thread::funcboost::asio::detail::resolver_service_base::work_io_service_runner::run (this=0x102b0f0)
#7 0x00007f8f02c7ac43 in boost::asio::detail::boost_asio_detail_posix_thread_function (arg=0x102b0f0) at /usr/include/boost/asio/detail/impl/posix_thread.ipp:64
#8 0x00007f8f0387ab50 in start_thread (arg=) at pthread_create.c:304
#9 0x00007f8f01f6d0ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7f8f03b07720 (LWP 23782)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x0000000000435611 in boost::condition_variable::wait (this=0x7f8efc039208, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:56
#2 0x0000000000435d49 in boost::detail::future_object_base::wait (this=0x7f8efc0391c0, rethrow=false) at /usr/include/boost/thread/future.hpp:166
#3 0x00007f8f02cfa866 in boost::shared_futurecql::cql_future_connection_t::wait (this=0x7fffb86ab570) at /usr/include/boost/thread/future.hpp:902
#4 0x00007f8f02cf3878 in cql::cql_session_impl_t::allocate_connection (this=0x102d420, host=...) at /tmp/cpp-driver.git/src/cql/internal/cql_session_impl.cpp:227
#5 0x00007f8f02cf4385 in cql::cql_session_impl_t::connect (this=0x102d420, query_plan=..., stream=0x7fffb86ab94f, tried_hosts=0x7fffb86ab820)
#6 0x00007f8f02cf5ce6 in cql::cql_session_impl_t::get_connection (this=0x102d420, query=..., stream=0x7fffb86ab94f)
---Type to continue, or q to quit---
at /tmp/cpp-driver.git/src/cql/internal/cql_session_impl.cpp:662
#7 0x00007f8f02cf5516 in cql::cql_session_impl_t::execute (this=0x102d420, message=...) at /tmp/cpp-driver.git/src/cql/internal/cql_session_impl.cpp:565
#8 0x0000000000431bb8 in demo (host=..., use_ssl=false) at /tmp/cpp-driver.git/demo/main.cpp:156
#9 0x00000000004323df in main (argc=3, vargs=0x7fffb86ac188) at /tmp/cpp-driver.git/demo/main.cpp:209