Skip to content

Conversation

jnidzwetzki
Copy link

I wrote an application that writes a lot of data into a cassandra cluster. Sometimes my application hangs after calling the execute(...) function of the driver. I found out, that the driver is waiting forever for a new connection (see full Stacktrace below).

Every time this issue occurs, one of the tcp connections to the cassandra cluster is in the state "CLOSE_WAIT".

-+-
cql_demo 23740 kristof 6u IPv4 496342 0t0 TCP 192.168.1.26:57590->192.168.1.108:9042 (ESTABLISHED)
cql_demo 23740 kristof 7u IPv4 498916 0t0 TCP 192.168.1.26:50402->192.168.1.113:9042 (ESTABLISHED)
cql_demo 23740 kristof 8u IPv4 498917 0t0 TCP 192.168.1.26:51351->192.168.1.114:9042 (ESTABLISHED)
cql_demo 23740 kristof 9u IPv4 496346 0t0 TCP 192.168.1.26:39331->192.168.1.115:9042 (ESTABLISHED)
cql_demo 23740 kristof 10u IPv4 496348 0t0 TCP 192.168.1.26:57594->192.168.1.108:9042 (ESTABLISHED)
cql_demo 23740 kristof 11u IPv4 496350 0t0 TCP 192.168.1.26:41152->192.168.1.111:9042 (ESTABLISHED)
cql_demo 23740 kristof 12u IPv4 495444 0t0 TCP 192.168.1.26:41153->192.168.1.111:9042 (CLOSE_WAIT)
-+-

I'm using debian 7 (stable) with gcc 4.7.2 and boost 1.49. The cassandra nodes are running debian 7 with cassandra 2.0.5. I'm not sure, this is an issue in the boost library or a multi threading issue in the driver. The problem only occurs if my application doesn't wait for the future returned by execute() to finish. If my application waits for the future before executing the next statement, the problem does not occur.

The state CLOSE_WAIT of the tcp connection may indicate, that the driver did forget to call the shutdown(..) / close(..) functions on the socket. In function cql_session_impl_t::try_remove_connection(..) are some TODO comments that indicates the code might have some multi threading issues.

My patch does not solve the root cause of the problem. But it ensures that we do not wait forever for the connection.


(gdb) thread apply all backtrace

Thread 4 (Thread 0x7f8f01a84700 (LWP 23783)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f8f02c7fb11 in boost::asio::detail::posix_event::waitboost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex > (this=0x7f8f01a83c90, lock=...)

at /usr/include/boost/asio/detail/posix_event.hpp:80

#2 0x00007f8f02c7795c in boost::asio::detail::task_io_service::do_run_one (this=0x1025df0, lock=..., this_thread=..., private_op_queue=..., ec=...)

at /usr/include/boost/asio/detail/impl/task_io_service.ipp:380

#3 0x00007f8f02c77397 in boost::asio::detail::task_io_service::run (this=0x1025df0, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:146
#4 0x00007f8f02c77ce1 in boost::asio::io_service::run (this=0x1025ca0) at /usr/include/boost/asio/impl/io_service.ipp:59
#5 0x00007f8f02ce22b2 in cql::cql_cluster_impl_t::asio_thread_main (io_service=...) at /tmp/cpp-driver.git/include/cql/internal/cql_cluster_impl.hpp:89
#6 0x00007f8f02ce8d9c in boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > >::operator()<void (*)(boost::shared_ptrboost::asio::io_service), boost::_bi::list0> (this=0x10263b0, f=@0x10263a8: 0x7f8f02ce2292 cql::cql_cluster_impl_t::asio_thread_main(boost::shared_ptrboost::asio::io_service), a=...)

at /usr/include/boost/bind/bind.hpp:253

#7 0x00007f8f02ce8d31 in boost::_bi::bind_t<void, void ()(boost::shared_ptrboost::asio::io_service), boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > > >::operator() (this=0x10263a8) at /usr/include/boost/bind/bind_template.hpp:20
#8 0x00007f8f02ce8ad0 in boost::detail::thread_data<boost::_bi::bind_t<void, void (
)(boost::shared_ptrboost::asio::io_service), boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > > > >::run (this=0x1026220) at /usr/include/boost/thread/detail/thread.hpp:62
#9 0x00007f8f03c61629 in ?? () from /usr/lib/libboost_thread.so.1.49.0
#10 0x00007f8f0387ab50 in start_thread (arg=) at pthread_create.c:304
#11 0x00007f8f01f6d0ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f8f01283700 (LWP 23784)):
#0 0x00007f8f01f6d743 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007f8f02c76753 in boost::asio::detail::epoll_reactor::run (this=0x1027010, block=true, ops=...) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:366
#2 0x00007f8f02c77835 in boost::asio::detail::task_io_service::do_run_one (this=0x1025df0, lock=..., this_thread=..., private_op_queue=..., ec=...)

at /usr/include/boost/asio/detail/impl/task_io_service.ipp:353

#3 0x00007f8f02c77397 in boost::asio::detail::task_io_service::run (this=0x1025df0, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:146
#4 0x00007f8f02c77ce1 in boost::asio::io_service::run (this=0x1025ca0) at /usr/include/boost/asio/impl/io_service.ipp:59
#5 0x00007f8f02ce22b2 in cql::cql_cluster_impl_t::asio_thread_main (io_service=...) at /tmp/cpp-driver.git/include/cql/internal/cql_cluster_impl.hpp:89
#6 0x00007f8f02ce8d9c in boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > >::operator()<void (*)(boost::shared_ptrboost::asio::io_service), boost::_bi::list0> (this=0x1026710, f=@0x1026708: 0x7f8f02ce2292 cql::cql_cluster_impl_t::asio_thread_main(boost::shared_ptrboost::asio::io_service), a=...)

at /usr/include/boost/bind/bind.hpp:253

#7 0x00007f8f02ce8d31 in boost::_bi::bind_t<void, void ()(boost::shared_ptrboost::asio::io_service), boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > > >::operator() (this=0x1026708) at /usr/include/boost/bind/bind_template.hpp:20
#8 0x00007f8f02ce8ad0 in boost::detail::thread_data<boost::_bi::bind_t<void, void (
)(boost::shared_ptrboost::asio::io_service), boost::_bi::list1boost::_bi::value<boost::shared_ptr<boost::asio::io_service > > > >::run (this=0x1026580) at /usr/include/boost/thread/detail/thread.hpp:62
#9 0x00007f8f03c61629 in ?? () from /usr/lib/libboost_thread.so.1.49.0
#10 0x00007f8f0387ab50 in start_thread (arg=) at pthread_create.c:304
#11 0x00007f8f01f6d0ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f8f00a82700 (LWP 23785)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f8f02c7fb11 in boost::asio::detail::posix_event::waitboost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex > (this=0x7f8f00a81d30, lock=...)

at /usr/include/boost/asio/detail/posix_event.hpp:80

#2 0x00007f8f02c7795c in boost::asio::detail::task_io_service::do_run_one (this=0x1028370, lock=..., this_thread=..., private_op_queue=..., ec=...)

at /usr/include/boost/asio/detail/impl/task_io_service.ipp:380

#3 0x00007f8f02c77397 in boost::asio::detail::task_io_service::run (this=0x1028370, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:146
#4 0x00007f8f02c77ce1 in boost::asio::io_service::run (this=0x1028300) at /usr/include/boost/asio/impl/io_service.ipp:59
#5 0x00007f8f02c7acb1 in boost::asio::detail::resolver_service_base::work_io_service_runner::operator() (this=0x102b0f8)

at /usr/include/boost/asio/detail/impl/resolver_service_base.ipp:32

#6 0x00007f8f02ca84a6 in boost::asio::detail::posix_thread::funcboost::asio::detail::resolver_service_base::work_io_service_runner::run (this=0x102b0f0)

at /usr/include/boost/asio/detail/posix_thread.hpp:82

#7 0x00007f8f02c7ac43 in boost::asio::detail::boost_asio_detail_posix_thread_function (arg=0x102b0f0) at /usr/include/boost/asio/detail/impl/posix_thread.ipp:64
#8 0x00007f8f0387ab50 in start_thread (arg=) at pthread_create.c:304
#9 0x00007f8f01f6d0ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f8f03b07720 (LWP 23782)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x0000000000435611 in boost::condition_variable::wait (this=0x7f8efc039208, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:56
#2 0x0000000000435d49 in boost::detail::future_object_base::wait (this=0x7f8efc0391c0, rethrow=false) at /usr/include/boost/thread/future.hpp:166
#3 0x00007f8f02cfa866 in boost::shared_futurecql::cql_future_connection_t::wait (this=0x7fffb86ab570) at /usr/include/boost/thread/future.hpp:902
#4 0x00007f8f02cf3878 in cql::cql_session_impl_t::allocate_connection (this=0x102d420, host=...) at /tmp/cpp-driver.git/src/cql/internal/cql_session_impl.cpp:227
#5 0x00007f8f02cf4385 in cql::cql_session_impl_t::connect (this=0x102d420, query_plan=..., stream=0x7fffb86ab94f, tried_hosts=0x7fffb86ab820)

at /tmp/cpp-driver.git/src/cql/internal/cql_session_impl.cpp:348

#6 0x00007f8f02cf5ce6 in cql::cql_session_impl_t::get_connection (this=0x102d420, query=..., stream=0x7fffb86ab94f)

---Type to continue, or q to quit---
at /tmp/cpp-driver.git/src/cql/internal/cql_session_impl.cpp:662
#7 0x00007f8f02cf5516 in cql::cql_session_impl_t::execute (this=0x102d420, message=...) at /tmp/cpp-driver.git/src/cql/internal/cql_session_impl.cpp:565
#8 0x0000000000431bb8 in demo (host=..., use_ssl=false) at /tmp/cpp-driver.git/demo/main.cpp:156
#9 0x00000000004323df in main (argc=3, vargs=0x7fffb86ac188) at /tmp/cpp-driver.git/demo/main.cpp:209

@jstasiewicz
Copy link
Contributor

Interesting - all remaining threads are stuck in some framework/library code. Recently we dealt with a similar bug, which was caused by mixing of sync/async paradigms.

Are you able to isolate a minimal working example (or perhaps nonminimal, but reproducible)? That would help me a lot. Anyway, thanks for this report.

@jnidzwetzki
Copy link
Author

Hello Juliusz,

you will find an example here: https://raw.github.com/jnidzwetzki/misc/master/cpp-driver/cql_deadlock.cpp

The program executes a prepared insert statement in a while(true) loop. Maybe, you need to start and abort the program 3-4 times until the deadlock occurs. In my environment, the deadlock occurs within the first 25 seconds or the program will run forever.

Best regards
Jan

@jstasiewicz
Copy link
Contributor

Thanks. I'll take a look into that after I finish my current tasks up.

@jnidzwetzki
Copy link
Author

Deadlock fixed with commit d9ebd68

mpenick pushed a commit that referenced this pull request Aug 3, 2016
Add cassandra_cluster_set_connect + request_timeout()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants