com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.255.235.17 (Timeout during read)
In a recent project, seemingly at random, this exception occurred when doing a CQL 'select' statement from a Spring Boot service to Cassandra:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.255.235.17 (Timeout during read))
...
After a lot of research, it turned out several people have reported the same issue, but with no clear answer anywhere; except that some Cassandra driver versions might be the cause: they mark (all) the node(s) as down and don't recognize when a node becomes available again.
But the strange thing is that we have over 10 (micro)services running, each with at least 2 instances, and only one of these services had this timeout problem. So it almost couldn't be the driver... It did seem to be related to not using the connection for a while, though: often our end-to-end tests just ran fine, time after time, but after a few hours the tests would suddenly fail. At that point we didn't see the pattern yet...
As a test, we decided to leave the environment the end-to-end tests run against untouched for a few hours, especially because some of the articles below mention setting the driver's heartbeat (keep-alive) as a solution.
And indeed, after that idle period the end-to-end tests started failing again. Then we realized it: all our services have a Spring Boot health check implemented, which is called every X seconds. EXCEPT the service with the timeouts; it had only recently been connected to Cassandra!
After fixing that, the error disappeared! Of course, relying on the health check to keep the connection alive is not the ideal solution. A better solution is probably setting the heartbeat interval on the driver when creating the Cluster:
// Send a keep-alive/heartbeat every 10 seconds on otherwise idle connections
var poolingOptions = new PoolingOptions()
    .SetCoreConnectionsPerHost(HostDistance.Local, 1)
    .SetHeartBeatInterval(10000);
var cluster = Cluster
    .Builder()
    .AddContactPoints(hosts)
    .WithPoolingOptions(poolingOptions)
    .Build();
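The snippet above is in the C# driver's style; since our services are Spring Boot (Java) applications, a rough equivalent sketch for the DataStax Java driver could look like the following (assuming a driver version recent enough to have PoolingOptions.setHeartbeatIntervalSeconds; the class name is made up and the contact point is the node from the exception above):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

public class CassandraClusterFactory {

    public static Cluster buildCluster() {
        // Keep at least one core connection per host and send a heartbeat on idle
        // connections, so intermediate firewalls don't silently drop them.
        PoolingOptions poolingOptions = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 1)
                .setHeartbeatIntervalSeconds(30);

        return Cluster.builder()
                .addContactPoints("10.255.235.17")
                .withPoolingOptions(poolingOptions)
                .build();
    }
}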
In the end it turned out to be the firewall, which resets all TCP connections every two hours!
References
Tips to analyse the problem:
- Log at lower levels: log4j.logger.com.datastax.driver.core=TRACE or DEBUG
- Call getErrors(): http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/NoHostAvailableException.html
- As reported by somebody else too, that call just gave us an empty list (a sketch of inspecting these per-host errors follows after this list).
Tips to see more, like which host is up/down in the Cassandra cluster: https://groups.google.com/a/lists.datastax.com/forum/#!search/nohostavailableexception/java-driver-user/WoqUJrqTm98/KidIFkNdN1MJ
- You can potentially see when the driver lost the connection by looking for a log statement like:
  2014-12-12 16:36:55,843{UTC} [Reconnection-0] ERROR c.d.driver.core.ControlConnection - [Control connection] Cannot connect to any host, scheduling retry in 600000 milliseconds
- Also make sure you are re-using the Session object; it's expensive to create, and you may be exhausting your connection pool by creating too many Session objects (a sketch follows after this list): http://stackoverflow.com/questions/25145980/datastax-cassandra-java-driver-crashes-with-nohostavailableexception-after-a-few
- Pooling options: http://stackoverflow.com/questions/24821966/nohostavailableexception-with-1000-concurrent-request-to-cassandra-with-datastax?rq=1
- Connection pool problems, connections stay ESTABLISHED: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/nohostavailableexception$20timeout/java-driver-user/fbLFX2_uI7w/O8sQSa6XXj4J
- Force limiting the result set size at the driver level when your query just takes too long (see the fetch-size sketch after this list): http://stackoverflow.com/questions/19528754/nohostavailableexception-with-cassandra-datastax-java-driver-if-large-resultse?rq=1
- Setting a very low value for SocketOptions.setReadTimeoutMillis could be a triggering factor for this bug. If you changed this value, make sure it is greater than the server-side timeouts in cassandra.yaml (see the timeout sketch after this list).
- Try to connect to Cassandra with cqlsh from the host that's giving the timeout. If that connects fine, it might be a faulty driver that doesn't correctly detect the node is up (again).
- As mentioned earlier, it might be a bad driver version; there are many reports of people seeing the error occur after upgrading the driver.
- Huge batches can cause the driver to mark a node as down (see the chunked-batch sketch after this list): "It turned out that it was related to the query pattern in that particular client. We found that in certain cases that client would try to write a HUGE batch, which would not complete under driver side default timeout settings. That would cause driver to think that the node is down, and eventually it will end up marking all nodes as down. Setting a limit on the batch size (by chunking) and using UNLOGGED batches seems to have solved the issue."
- Operation timed out: indicates a timeout for your request on that host. This error means the host did not complete the query within SocketOptions#getReadTimeoutMillis(). If you are not explicitly configuring this, it means a query is not completing within 12 seconds (the default).
- "A host can be made unavailable if a query times out on it, if we receive a 'DOWN' status event from Cassandra, or connection is lost to the Host. Once you understand the cause of your hosts being marked down, the next step is to see if you can mitigate it." Bug report: https://datastax-oss.atlassian.net/browse/JAVA-577
- This was pointing in the right direction for us: Too long idle state? Keep alive? Firewall? https://datastax-oss.atlassian.net/browse/JAVA-204
- Rolling restart not handled correctly in older driver: https://datastax-oss.atlassian.net/browse/JAVA-250 and https://datastax-oss.atlassian.net/browse/JAVA-367
- "Connection timeout is by default 5 seconds: You can control the maximum time the driver will try connecting (to each node) through SocketOptions.setConnectTimeoutMillis() (the default is 5 seconds). The timeout above is per host. So if you pass a list of 100 contact points, you could in theory have to wait 500 seconds (by default) before getting the NoHostAvailableException. But there is no real point in providing that many contact points, and in practice, if Cassandra is not running on the node tried, the connection attempt will usually fail right away (you won't wait the timeout)."
- SELECT IN is bad, but we don't do that. Though indeed ORDER BY is sub-optimal too: https://groups.google.com/a/lists.datastax.com/forum/#!search/nohostavailableexception/java-driver-user/b076HRgEfoo/s-kuLbhSAzMJ
- "Are you doing cross-DC writes? When I stopped the cross-DC writes, the problem stopped occurring for me. I think the timeout is mainly due to driven lag."
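A few sketches to go with the tips above. First, inspecting the per-host errors on a NoHostAvailableException (assuming Java driver 2.1, where getErrors() returns a Map<InetSocketAddress, Throwable>; the class, keyspace and table names are made up):

import java.net.InetSocketAddress;
import java.util.Map;

import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.NoHostAvailableException;

public class NoHostDiagnostics {

    static void queryWithDiagnostics(Session session) {
        try {
            session.execute("SELECT * FROM my_keyspace.my_table");
        } catch (NoHostAvailableException e) {
            // Print why each individual host was skipped or failed
            // (in our case this map was simply empty).
            for (Map.Entry<InetSocketAddress, Throwable> entry : e.getErrors().entrySet()) {
                System.err.println(entry.getKey() + " -> " + entry.getValue());
            }
            throw e;
        }
    }
}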
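Re-using the Session: a minimal sketch of the intended pattern, one Cluster and one Session shared for the whole application (class name and contact point are illustrative):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraClient {

    // One Cluster and one Session for the whole application lifetime; Session is
    // thread-safe and expensive to create, so don't open a new one per request.
    private static final Cluster CLUSTER = Cluster.builder()
            .addContactPoints("10.255.235.17")
            .build();

    private static final Session SESSION = CLUSTER.connect();

    public static Session session() {
        return SESSION;
    }
}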
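Limiting large result sets: a sketch of capping the page size per statement via setFetchSize (the value 500 and the table name are just illustrations):

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class PagedSelect {

    static ResultSet selectPaged(Session session) {
        // Fetch rows in pages of 500 instead of pulling the whole result set at once,
        // so a single query is less likely to hit the read timeout.
        SimpleStatement stmt = new SimpleStatement("SELECT * FROM my_keyspace.my_table");
        stmt.setFetchSize(500);
        return session.execute(stmt);
    }
}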
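Read/connect timeouts: a sketch of setting them explicitly via SocketOptions; the values are only examples, and the read timeout should stay above the server-side timeouts in cassandra.yaml:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

public class TimeoutSettings {

    static Cluster buildCluster() {
        SocketOptions socketOptions = new SocketOptions()
                .setConnectTimeoutMillis(5000)   // per-host connect timeout (driver default: 5 seconds)
                .setReadTimeoutMillis(20000);    // keep above server-side timeouts (driver default: 12 seconds)

        return Cluster.builder()
                .addContactPoints("10.255.235.17")
                .withSocketOptions(socketOptions)
                .build();
    }
}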
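Chunking huge batches: a sketch of splitting writes into smaller UNLOGGED batches instead of one giant batch (the chunk size of 100 and the prepared insert statement are arbitrary illustrations):

import java.util.List;

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class ChunkedBatchWriter {

    static void writeInChunks(Session session, PreparedStatement insert, List<Object[]> rows) {
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (Object[] row : rows) {
            batch.add(insert.bind(row));
            // Flush every 100 statements instead of sending one huge batch,
            // which could exceed the driver-side read timeout.
            if (batch.size() >= 100) {
                session.execute(batch);
                batch.clear();
            }
        }
        if (batch.size() > 0) {
            session.execute(batch); // flush the remainder
        }
    }
}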
Similar error reports
- https://datastax-oss.atlassian.net/browse/JAVA-425
- Really the same behavior it seems, after not being used for a while: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/nohostavailableexception$20timeout/java-driver-user/y4xnmrr7IdE/772sBJoo8aQJ
- https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/nohostavailableexception$20timeout/java-driver-user/ykf7NwPUeY0/vS6UUxQrKQAJ
- The latest version of the driver shows the first 3 errors: http://stackoverflow.com/questions/31705443/cassandra-nohostavailableexception-all-hosts-tried-for-query-failed-in-produc
- http://docs.datastax.com/en/drivers/csharp/2.5/html/T_Cassandra_NoHostAvailableException.htm#!
- http://stackoverflow.com/questions/20507904/cassandra-nohostavailableexception-while-there-is-still-alive-node