com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.255.235.17 (Timeout during read))
In a recent project, this exception occurred seemingly at random when executing a CQL 'select' statement from a Spring Boot service against Cassandra:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.255.235.17 (Timeout during read))
...
After a lot of research, it turned out several people had reported the same issue, but with no clear answer anywhere, except that certain Cassandra driver versions might be the cause: they mark (all) node(s) as down and don't notice when a node becomes available again.
But the strange thing is that we have over 10 (micro)services running, each with at least 2 instances, and only one of these services had this timeout problem. So it almost couldn't be the driver... It did seem to be related to not using the connection for a while, though: often our end-to-end tests just ran fine, time after time, but after a few hours the tests would suddenly fail. At that point we didn't see the pattern yet...
As a test, we decided to let nobody use the environment the end-to-end tests run against for a few hours, especially since some of the articles below mention setting the driver's heartbeat (keep-alive) as a solution.
And indeed, the end-to-end tests started failing again after those idle hours. Then we realized it: all our services implement a Spring Boot health check, which is called every X seconds. EXCEPT the service with the timeouts, because it only recently got connected to Cassandra!
After fixing that, the error disappeared! Of course, depending on the health check to keep a connection alive is not the ideal solution. A better solution is probably to set the heartbeat interval on the driver when creating the Cluster:
var poolingOptions = new PoolingOptions()
    .SetCoreConnectionsPerHost(HostDistance.Local, 1)
    .SetHeartBeatInterval(10000); // value in milliseconds

var cluster = Cluster
    .Builder()
    .AddContactPoints(hosts)
    .WithPoolingOptions(poolingOptions)
    .Build();
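Since our services talk to Cassandra through the DataStax Java driver, roughly the same configuration with the Java driver (2.1 or later, where the pooling heartbeat option is available) looks like the sketch below; the factory class and the host list are just for illustration:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

public class CassandraClusterFactory {

    // Hypothetical factory method; the contact points come from configuration.
    public static Cluster create(String... hosts) {
        PoolingOptions poolingOptions = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 1)
                // Send a keep-alive request on connections that have been idle
                // for 10 seconds, so a firewall doesn't silently drop them.
                .setHeartbeatIntervalSeconds(10);

        return Cluster.builder()
                .addContactPoints(hosts)
                .withPoolingOptions(poolingOptions)
                .build();
    }
}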
In the end it turned out to be the firewall, which resets all TCP connections every two hours!
References
Tips to analyse the problem:
- Log at lower levels: log4j.logger.com.datastax.driver.core=TRACE or DEBUG.
- Call getErrors() on the exception (a short sketch follows these tips): http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/NoHostAvailableException.html
- As reported by somebody else, that call gave us just an empty list.
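A minimal sketch of that getErrors() call with the Java driver; the wrapper class and method name are made up for illustration:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.NoHostAvailableException;

import java.net.InetSocketAddress;
import java.util.Map;

public class QueryDiagnostics {

    // Hypothetical helper: execute a CQL statement and log the per-host error
    // when the driver reports that no host was available.
    public static ResultSet executeWithDiagnostics(Session session, String cql) {
        try {
            return session.execute(cql);
        } catch (NoHostAvailableException e) {
            // getErrors() holds the last error seen for each host that was tried;
            // in our case this was simply empty.
            for (Map.Entry<InetSocketAddress, Throwable> error : e.getErrors().entrySet()) {
                System.err.println("Host " + error.getKey() + " failed with: " + error.getValue());
            }
            throw e;
        }
    }
}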
Tips to see more, like which host is up/down in the Cassandra cluster: https://groups.google.com/a/lists.datastax.com/forum/#!search/nohostavailableexception/java-driver-user/WoqUJrqTm98/KidIFkNdN1MJ
- You can potentially see when the driver lost the connection by looking for a log statement like:
2014-12-12 16:36:55,843{UTC} [Reconnection-0] ERROR c.d.driver.core.ControlConnection - [Control connection] Cannot connect to any host, scheduling retry in 600000 milliseconds
- Also make sure you are re-using the Session object; it is expensive to create, and you might be using up your connection pool by creating too many Session objects: http://stackoverflow.com/questions/25145980/datastax-cassandra-java-driver-crashes-with-nohostavailableexception-after-a-few
- Pooling options: http://stackoverflow.com/questions/24821966/nohostavailableexception-with-1000-concurrent-request-to-cassandra-with-datastax?rq=1
- Connection pool problems, connections stay ESTABLISHED: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/nohostavailableexception$20timeout/java-driver-user/fbLFX2_uI7w/O8sQSa6XXj4J
- Forcibly limiting the result size at the driver level when your query just takes too long: http://stackoverflow.com/questions/19528754/nohostavailableexception-with-cassandra-datastax-java-driver-if-large-resultse?rq=1
- Setting a very low value for SocketOptions.setReadTimeoutMillis could be a triggering factor for this bug. If you changed this value, make sure it is greater than the server-side timeouts in cassandra.yaml (see the SocketOptions sketch after this list).
- Try to connect to Cassandra with cqlsh from the host that is giving the timeout. If that connects fine, it might be a faulty driver that does not correctly detect that the node is up (again).
- As mentioned earlier, it might be a bad driver version; there are many reports of people seeing the error occur after upgrading the driver.
- Huge batches can cause the driver to mark a node as down: "It turned out that it was related to the query pattern in that particular client. We found that in certain cases that client would try to write a HUGE batch, which would not complete under driver side default timeout settings. That would cause driver to think that the node is down, and eventually it will end up marking all nodes as down. Setting a limit on the batch size (by chunking) and using UNLOGGED batches seems to have solved the issue." (A chunking sketch follows after this list.)
- Operation timed out: indicates a timeout for your request on that host. This error means the host did not complete the query within SocketOptions#getReadTimeoutMillis(). If you are not explicitly configuring this, it means the query is not completing within 12 seconds (the default).
- "A host can be made unavailable if a query times out on it, if we receive a 'DOWN' status event from Cassandra, or connection is lost to the Host. Once you understand the cause of your hosts being marked down, the next step is to see if you can mitigate it." Bug report: https://datastax-oss.atlassian.net/browse/JAVA-577
- This was pointing in the right direction for us: too long an idle state? Keep-alive? Firewall? https://datastax-oss.atlassian.net/browse/JAVA-204
- Rolling restart not handled correctly in older driver: https://datastax-oss.atlassian.net/browse/JAVA-250 and https://datastax-oss.atlassian.net/browse/JAVA-367
- "Connectiontimeout is by default 5 seconds: You can control the maximum time the driver will try connecting (to each node) through SocketOptions.setConnectTimeoutMillis() (the default is 5 seconds). The timeout above is per host. So if you pass a list of 100 contact points, you could in theory have to wait 500 seconds (by default) before getting the NoHostAvailableException. But there is no real point in providing that many contact points, and in practice, if Cassandra is not running on the node tried, the connection attempt will usually fail right away (you won't wait the timeout)."
- SELECT IN is bad, but we don't do that. Though indeed ORDER BY is sub-optimal too: https://groups.google.com/a/lists.datastax.com/forum/#!search/nohostavailableexception/java-driver-user/b076HRgEfoo/s-kuLbhSAzMJ
- "Are you doing cross DC writes? When i stopped the cross DC writes , the problem stopped occurring for me. I think the timeout is mainly due to driven lag."
Similar error reports
- https://datastax-oss.atlassian.net/browse/JAVA-425
- Really the same behavior it seems (after the connection was not used for a while): https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/nohostavailableexception$20timeout/java-driver-user/y4xnmrr7IdE/772sBJoo8aQJ
- https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/nohostavailableexception$20timeout/java-driver-user/ykf7NwPUeY0/vS6UUxQrKQAJ
- Latest version of driver shows first 3 errors: http://stackoverflow.com/questions/31705443/cassandra-nohostavailableexception-all-hosts-tried-for-query-failed-in-produc
- http://docs.datastax.com/en/drivers/csharp/2.5/html/T_Cassandra_NoHostAvailableException.htm#!
- http://stackoverflow.com/questions/20507904/cassandra-nohostavailableexception-while-there-is-still-alive-node