Tech Team Lead News: driver

LOGBack DBAppender IllegalStateException

Sometimes, when starting a Spring Boot application with a Logback DBAppender configured in logback-spring.xml for PostgreSQL or AWS Aurora, startup fails with this error:

java.lang.IllegalStateException: Logback configuration error detected: ERROR in ch.qos.logback.core.joran.spi.Interpreter@22:16 - RuntimeException in Action for tag [appender] java.lang.IllegalStateException: DBAppender cannot function if the JDBC driver does not support getGeneratedKeys method *and* without a specific SQL dialect

The error is confusing, because according to the documentation Logback should be able to detect the SQL dialect from the driver on its own.

But apparently it sometimes doesn't. After some investigation it turned out that this error is also reported when the driver cannot connect to the database at all. Without a connection, Logback cannot read the database metadata it uses to detect the dialect, so you get this error instead of the actual connection problem.
A confusing error message indeed.
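To make the failure mode concrete, here is a small, simplified model of the start-up check implied by the error message: the appender can only start if the driver supports getGeneratedKeys or a dialect is known, and the dialect is guessed from the database product name in the connection metadata. The class DbAppenderStartCheck, its methods and the dialect names are hypothetical; this is not Logback's source code. A null product name stands for "no connection, so no metadata".

```java
import java.util.Locale;

// Hypothetical, simplified model of the DBAppender start-up check.
// NOT Logback's actual code; dialect names are illustrative only.
class DbAppenderStartCheck {

    // Guess a dialect from DatabaseMetaData.getDatabaseProductName().
    // null models the case where no connection (and thus no metadata) was obtained.
    static String guessDialect(String productName) {
        if (productName == null) {
            return null;
        }
        String name = productName.toLowerCase(Locale.ROOT);
        if (name.contains("postgresql")) return "POSTGRES";
        if (name.contains("mysql")) return "MYSQL";
        if (name.contains("oracle")) return "ORACLE";
        return null; // unknown database product
    }

    // The appender can start if getGeneratedKeys is supported OR a dialect is known;
    // otherwise it fails with the IllegalStateException quoted above.
    static boolean canStart(String productName, boolean supportsGetGeneratedKeys) {
        return supportsGetGeneratedKeys || guessDialect(productName) != null;
    }

    public static void main(String[] args) {
        System.out.println(canStart("PostgreSQL", false)); // metadata was available
        System.out.println(canStart(null, false));         // connection failed: no metadata
    }
}
```

The second call shows why a plain connection failure ends up as a dialect error: both conditions are false simply because nothing could be read from the database.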

Some posts suggest specifying the <sqlDialect> element, but that is no longer needed in recent Logback versions. In fact, putting it in logback-spring.xml, either below <password> or below <connectionSource>, now produces one of these errors:

ERROR in ch.qos.logback.core.joran.spi.Interpreter@25:87 - no applicable action for [sqlDialect], current ElementPath is [[configuration][appender][connectionSource][dataSource][sqlDialect]]
or
ERROR in ch.qos.logback.core.joran.spi.Interpreter@27:79 - no applicable action for [sqlDialect], current ElementPath is [[configuration][appender][sqlDialect]]
To get a clearer error message, set up the Logback DBAppender in code instead of in logback-spring.xml. For examples, see here and here.
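A minimal sketch of why doing this in code helps, assuming plain JDBC: open a connection yourself before the appender is configured, so the real SQLException (wrong host, wrong credentials, missing driver) is reported instead of the dialect message. JdbcProbe and probe are hypothetical names and the URL and credentials are placeholders; the java.sql calls (DriverManager.getConnection, getMetaData, getDatabaseProductName, supportsGetGeneratedKeys) are standard JDK API.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical start-up probe: connect before configuring the DBAppender so the
// underlying connection error becomes visible.
class JdbcProbe {

    // Returns what the database reports, or why the connection failed.
    static String probe(String url, String user, String password) {
        try (Connection connection = DriverManager.getConnection(url, user, password)) {
            DatabaseMetaData meta = connection.getMetaData();
            return "OK: " + meta.getDatabaseProductName()
                    + ", supportsGetGeneratedKeys=" + meta.supportsGetGeneratedKeys();
        } catch (SQLException e) {
            return "FAILED (SQLState " + e.getSQLState() + "): " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Placeholder URL and credentials; replace with your own.
        System.out.println(probe("jdbc:postgresql://localhost:5432/logs", "user", "secret"));
    }
}
```

In a Spring Boot application you would more likely call getConnection() on the configured DataSource, but the idea is the same: fail with the real cause before Logback hides it.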

In a recent project, this exception occurred seemingly at random when executing a CQL SELECT statement from a Spring Boot service against Cassandra:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.255.235.17 (Timeout during read), /10.255.235.16 (Timeout during read))
...

After a lot of research I found that other people had reported the same issue, but there was no clear answer anywhere. The only hint was that some Cassandra driver versions might be the cause: they mark (all) the node(s) as down and don't notice when a node becomes available again.

The strange thing was that we have more than 10 (micro)services running, each with at least 2 instances, yet only one of these services had the timeout problem. So it could hardly be the driver. It did seem related to not using the connection for a while: often our end-to-end tests ran fine time after time, but after a few hours they would suddenly fail. At that point we didn't see the pattern yet.

As an experiment, we decided to let nobody use the environment the end-to-end tests run against for a few hours, especially because some of the articles below mention setting the driver's heartbeat (keep-alive) as a solution.

And indeed, the end-to-end tests started failing again after that idle period. Then we realized it: all our services implement a Spring Boot health check that is called every X seconds, EXCEPT the service with the timeouts; it had only recently been connected to Cassandra!

After fixing that, the error disappeared! Of course, relying on the health check to keep a connection alive is not ideal. A better solution is probably to set the driver's heartbeat interval when creating the Cluster:

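    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;

    // DataStax Java driver (2.x/3.x API)
    PoolingOptions poolingOptions = new PoolingOptions()
            .setCoreConnectionsPerHost(HostDistance.LOCAL, 1)
            // send a heartbeat on connections that have been idle for 10 seconds
            .setHeartbeatIntervalSeconds(10);
    Cluster cluster = Cluster.builder()
            .addContactPoints(hosts)   // hosts: String[] of contact point addresses
            .withPoolingOptions(poolingOptions)
            .build();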

In the end, the cause turned out to be a firewall that resets all TCP connections every two hours!
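The heartbeat setting helps because the driver sends a small request on connections that have been idle for the configured interval, so an idle connection keeps producing traffic (which matters for firewalls that drop idle connections) and a connection that was reset is noticed sooner instead of on the next real query. The sketch below models only the "is it time for a heartbeat?" decision; IdleHeartbeat and its method are hypothetical names, not driver API, and the 10-second interval simply mirrors the configuration above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of an idle-connection heartbeat decision (not driver API):
// send a heartbeat once nothing has been read for at least `interval`.
class IdleHeartbeat {

    private final Duration interval;

    IdleHeartbeat(Duration interval) {
        this.interval = interval;
    }

    boolean shouldSendHeartbeat(Instant lastRead, Instant now) {
        return Duration.between(lastRead, now).compareTo(interval) >= 0;
    }

    public static void main(String[] args) {
        IdleHeartbeat heartbeat = new IdleHeartbeat(Duration.ofSeconds(10));
        Instant lastRead = Instant.parse("2015-12-09T10:00:00Z");
        System.out.println(heartbeat.shouldSendHeartbeat(lastRead, lastRead.plusSeconds(5)));  // false
        System.out.println(heartbeat.shouldSendHeartbeat(lastRead, lastRead.plusSeconds(30))); // true
    }
}
```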

References

Tips to analyse the problem:

Log at a more detailed level: log4j.logger.com.datastax.driver.core=TRACE (or DEBUG)

Call getErrors() on the exception: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/NoHostAvailableException.html

As someone else also reported, that call just gave us an empty result.
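To at least log what getErrors() returns, including the empty case, a small helper can format the per-host errors. It assumes the 2.x Java driver signature, where getErrors() returns a Map<InetSocketAddress, Throwable>; HostErrors and describe are hypothetical names, and the sample address in main is taken from the exception above.

```java
import java.net.InetSocketAddress;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for logging the per-host errors of a NoHostAvailableException.
// Depends only on the Map<InetSocketAddress, Throwable> type returned by getErrors().
class HostErrors {

    static String describe(Map<InetSocketAddress, Throwable> errors) {
        if (errors.isEmpty()) {
            // The map can be empty; say so explicitly instead of logging nothing.
            return "no per-host errors reported";
        }
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<InetSocketAddress, Throwable> entry : errors.entrySet()) {
            joiner.add(entry.getKey() + " -> " + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<InetSocketAddress, Throwable> errors = new LinkedHashMap<>();
        errors.put(new InetSocketAddress("10.255.235.17", 9042),
                new RuntimeException("Timeout during read"));
        System.out.println(describe(errors));
    }
}
```

In a `catch (NoHostAvailableException e)` block you would pass `e.getErrors()` to `describe` and log the result together with the exception.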

Tips to see more, like which host is up/down in the Cassandra cluster: https://groups.google.com/a/lists.datastax.com/forum/#!search/nohostavailableexception/java-driver-user/WoqUJrqTm98/KidIFkNdN1MJ

You may be able to see when the driver lost the connection by looking for a log statement like:

2014-12-12 16:36:55,843{UTC} [Reconnection-0] ERROR c.d.driver.core.ControlConnection - [Control connection] Cannot connect to any host, scheduling retry in 600000 milliseconds

Also make sure you are reusing the Session object: it is expensive to create, and too many Session objects may exhaust your connection pool: http://stackoverflow.com/questions/25145980/datastax-cassandra-java-driver-crashes-with-nohostavailableexception-after-a-few
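One way to guarantee a single instance is a lazily initialized, thread-safe holder (double-checked locking on a volatile field). The sketch is generic and hypothetical: SharedInstance is not a driver class, and with the DataStax driver you might pass `cluster::connect` as the factory.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical holder that creates an expensive object (such as a driver Session)
// once and returns the same instance on every later call.
final class SharedInstance<T> {

    private final Supplier<T> factory;
    private volatile T instance;

    SharedInstance(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T result = instance;
        if (result == null) {
            synchronized (this) {
                result = instance;
                if (result == null) {
                    result = factory.get(); // created at most once
                    instance = result;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        SharedInstance<Object> shared = new SharedInstance<>(() -> {
            created.incrementAndGet();
            return new Object();
        });
        shared.get();
        shared.get();
        System.out.println("created " + created.get() + " time(s)"); // created 1 time(s)
    }
}
```

In a Spring Boot application, exposing the Session as a (singleton) @Bean achieves the same with less code.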

Pooling options: http://stackoverflow.com/questions/24821966/nohostavailableexception-with-1000-concurrent-request-to-cassandra-with-datastax?rq=1

Connection pool problems, connections stay ESTABLISHED: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/nohostavailableexception$20timeout/java-driver-user/fbLFX2_uI7w/O8sQSa6XXj4J

Limit the query (result) size at the driver level when your query simply takes too long: http://stackoverflow.com/questions/19528754/nohostavailableexception-with-cassandra-datastax-java-driver-if-large-resultse?rq=1

Setting a very low value with SocketOptions.setReadTimeoutMillis could trigger this problem. If you changed this value, make sure it is greater than the server-side timeouts in cassandra.yaml.
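That rule of thumb is easy to encode as a start-up sanity check. ReadTimeoutCheck is a hypothetical helper; the server-side values in main are example figures standing in for read_request_timeout_in_ms, range_request_timeout_in_ms and write_request_timeout_in_ms, so compare against your own cassandra.yaml. 12000 ms is the driver's default read timeout of 12 seconds.

```java
import java.util.Arrays;

// Hypothetical sanity check: the driver-side read timeout should exceed every
// relevant server-side timeout, otherwise the driver gives up first.
class ReadTimeoutCheck {

    static boolean isSafe(long driverReadTimeoutMillis, long... serverTimeoutsMillis) {
        long maxServer = Arrays.stream(serverTimeoutsMillis).max().orElse(0L);
        return driverReadTimeoutMillis > maxServer;
    }

    public static void main(String[] args) {
        // Example (assumed) read, range and write request timeouts from cassandra.yaml.
        long[] server = {5000, 10000, 2000};
        System.out.println(isSafe(12000, server)); // true: driver default of 12 s
        System.out.println(isSafe(3000, server));  // false: too low
    }
}
```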

Try connecting to Cassandra with cqlsh from the host that gets the timeouts. If that connects fine, it might be a faulty driver that doesn't correctly detect that the node is up (again).
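If cqlsh is not available on that host, a weaker first check is whether a plain TCP connection to the native transport port can be opened at all. This sketch uses only the JDK and assumes Cassandra's default native port 9042; PortCheck is a hypothetical name, and unlike cqlsh it does not perform a CQL handshake.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical reachability check: can a TCP connection to host:port be opened
// within the timeout? Says nothing about whether Cassandra answers CQL.
class PortCheck {

    static boolean reachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or host unknown
        }
    }

    public static void main(String[] args) {
        // 9042 is Cassandra's default native transport port; adjust if needed.
        System.out.println(reachable("10.255.235.17", 9042, 5000));
    }
}
```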

As mentioned earlier, it might be a bad driver version: I found many reports of people seeing the error after upgrading the driver.

Huge batches can cause the driver to mark a node as down: "It turned out that it was related to the query pattern in that particular client. We found that in certain cases that client would try to write a HUGE batch, which would not complete under driver side default timeout settings. That would cause driver to think that the node is down, and eventually it will end up marking all nodes as down. Setting a limit on the batch size (by chunking) and using UNLOGGED batches seems to have solved the issue."
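The chunking part of that fix can be sketched as a generic helper that splits a large list into pieces of bounded size. Chunker and the chunk size are hypothetical; each chunk would then be sent as its own batch (with the 2.x Java driver, for example, a `new BatchStatement(BatchStatement.Type.UNLOGGED)` per chunk).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split items into consecutive chunks of at most maxSize,
// so each chunk can be sent as a separate, smaller batch.
class Chunker {

    static <T> List<List<T>> chunk(List<T> items, int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += maxSize) {
            int to = Math.min(from + maxSize, items.size());
            chunks.add(new ArrayList<>(items.subList(from, to)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            rows.add(i);
        }
        System.out.println(chunk(rows, 3)); // [[0, 1, 2], [3, 4, 5], [6]]
    }
}
```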

Operation timed out: indicates a timeout for your request on that host. This error means the host did not complete the query within SocketOptions#getReadTimeoutMillis(). Since you are not explicitly configuring this, it means a query is not completing within 12 seconds (the default).

"A host can be made unavailable if a query times out on it, if we receive a 'DOWN' status event from Cassandra, or connection is lost to the Host. Once you understand the cause of your hosts being marked down, the next step is to see if you can mitigate it." Bug report: https://datastax-oss.atlassian.net/browse/JAVA-577

This was pointing in the right direction for us: Too long idle state? Keep alive? Firewall? https://datastax-oss.atlassian.net/browse/JAVA-204

Rolling restarts not handled correctly in older driver versions: https://datastax-oss.atlassian.net/browse/JAVA-250 and https://datastax-oss.atlassian.net/browse/JAVA-367

The connection timeout is 5 seconds by default: "You can control the maximum time the driver will try connecting (to each node) through SocketOptions.setConnectTimeoutMillis() (the default is 5 seconds). The timeout above is per host. So if you pass a list of 100 contact points, you could in theory have to wait 500 seconds (by default) before getting the NoHostAvailableException. But there is no real point in providing that many contact points, and in practice, if Cassandra is not running on the node tried, the connection attempt will usually fail right away (you won't wait the timeout)."

SELECT IN is bad, but we don't use it. ORDER BY is indeed sub-optimal too, though: https://groups.google.com/a/lists.datastax.com/forum/#!search/nohostavailableexception/java-driver-user/b076HRgEfoo/s-kuLbhSAzMJ

"Are you doing cross DC writes? When i stopped the cross DC writes , the problem stopped occurring for me. I think the timeout is mainly due to driven lag."

Tech Team Lead News

Thursday, December 28, 2017

Logback DBAppender sometimes gives error on AWS Aurora: IllegalStateException: DBAppender cannot function if the JDBC driver does not support getGeneratedKeys method and without a specific SQL dialect

LOGBack DBAppender IllegalStateException

Wednesday, December 9, 2015

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.255.235.17 (Timeout during read)
