Wednesday, August 16, 2017

Lessons learned - Jackson, API design, Kafka

Introduction

This blogpost describes several lessons learned during a recent project I worked on.
They are grouped together here because each is too small to "deserve" its own separate post :)
Items discussed: Jackson, JSON, API design, Kafka, Git.


Lessons learned

  • Pretty print (nicely format) JSON in a Linux shell prompt:

    cat file.json | jq .

    You might have to 'apt-get install jq' first. (Older jq versions require an explicit filter, hence the '.'.)

  • OpenOffice/LibreOffice formulas:

    To increase the year and month by one relative to cell H1031: =DATE(YEAR(H1031)+1; MONTH(H1031)+1; DAY(H1031))

    Count how many times a range of cells A2:A4501 has the value 1: =COUNTIF(A2:A4501; 1)

  • For monitoring a system a split between a health-status page and performance details is handy. The first one is to show issues/metrics that would require immediate action. Performance is for informational purposes, and usually does not require immediate action.

  • API design: Even if you have a simple method that just returns, for example, a date (string), always return JSON (and not just a bare string value). Useful for backwards compatibility: more fields can easily be added later.
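For example, instead of returning the bare string value, wrap it in a small object (the field name here is just illustrative):

```json
{
  "date": "2017-08-16"
}
```

A later version of the API can then add, say, a timezone field next to it without breaking existing callers.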

  • When upgrading gitlab, it (gitlab) had changed a repository named 'users' to 'users0'. Turns out 'users' is a reserved repository name in gitlab since version 8.15.

    To change your local git settings to the new users0 perform these steps to update your remote origin:

    # check current setting
    $ git remote -v
    origin  https://gitlab.local/gitlab/backend/users (fetch)
    origin  https://gitlab.local/gitlab/backend/users (push)

    # change it to the new one
    $ git remote set-url origin https://gitlab.local/gitlab/backend/users0

    # see it got changed
    $ git remote -v
    origin  https://gitlab.local/gitlab/backend/users0 (fetch)
    origin  https://gitlab.local/gitlab/backend/users0 (push)

  • Jackson JSON generating (serializing): it is probably good practice to not use @JsonInclude(JsonInclude.Include.NON_EMPTY) or NON_NULL, since that means a key is simply absent from the JSON when its value is empty or null. That can be confusing to the caller: sometimes the key is there, sometimes not. So just leave it in, and it will be set to null. An exception would be a truly dependent field, like amount and currency: if amount is null, currency (maybe) doesn't make sense, so then it could be left out.
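To illustrate with a hypothetical two-field response: with Jackson's default (Include.ALWAYS) the caller always sees every key, while with NON_NULL a key silently disappears:

```
Include.ALWAYS (default):  {"amount": null, "currency": "EUR"}
Include.NON_NULL:          {"currency": "EUR"}
```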

  • Java:

    Comparator<UserByPhoneNumber> userComparator = (o1, o2) -> o1.getCreated().compareTo(o2.getCreated());

    can be replaced in Java 8 by:

    Comparator<UserByPhoneNumber> userComparator = Comparator.comparing(UserByPhoneNumber::getCreated);

  • Kafka partitioning tips: http://blog.rocana.com/kafkas-defaultpartitioner-and-byte-arrays

  • Kafka vs RabbitMQ:

    - Kafka is optimized for producers producing lots of data (batch-oriented producers) and consumers that are usually slower than the producers.
    - Performance: RabbitMQ handles about 20K messages/s; Kafka up to 150K messages/s.
    - Unlike other messaging systems, Kafka brokers are stateless with respect to consumption. This means each consumer has to maintain how much it has consumed (its offsets).
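Conceptually, that consumer-side bookkeeping looks like this (a plain-Java sketch; the class and method names are made up, and the real Kafka consumer API commits offsets rather than keeping them only in memory):

```java
import java.util.HashMap;
import java.util.Map;

// Because Kafka brokers do not track what each consumer has read, the
// consumer itself keeps a map of partition -> next offset to read.
public class OffsetTracker {
    private final Map<Integer, Long> offsets = new HashMap<>();

    // next offset to read for a partition; 0 when nothing consumed yet
    public long position(int partition) {
        return offsets.getOrDefault(partition, 0L);
    }

    // record that 'count' more messages were consumed from the partition
    public void advance(int partition, long count) {
        offsets.put(partition, position(partition) + count);
    }
}
```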

Friday, August 11, 2017

Java: generate random Date between now minus X months plus Y months

Introduction

This blogpost shows a Java 8+ code example on how to generate a random timestamp between two moments relative to today ("now").

The code

This example code creates a random java.util.Date between 12 months ago from today and 1 month ahead of today; the result will be available in the variable randomDate.

// beginTimeInMilliseconds and endTimeInMilliseconds are fields of the enclosing class:
private static long beginTimeInMilliseconds;
private static long endTimeInMilliseconds;

LocalDateTime nowMinusYear = LocalDateTime.now().minusMonths(12);
ZonedDateTime nowMinusYearZdt = nowMinusYear.atZone(ZoneId.of("Europe/Paris"));
beginTimeInMilliseconds = nowMinusYearZdt.toInstant().toEpochMilli();

LocalDateTime nowPlusMonth = LocalDateTime.now().plusMonths(1);
ZonedDateTime nowPlusMonthZdt = nowPlusMonth.atZone(ZoneId.of("Europe/Paris"));
endTimeInMilliseconds = nowPlusMonthZdt.toInstant().toEpochMilli();

System.out.println("System.currentTimeMillis() = " + System.currentTimeMillis() + ", beginTimeInMilliseconds = " + beginTimeInMilliseconds + ", endTimeInMilliseconds = " + endTimeInMilliseconds);

Date randomDate = new Date(getRandomTimeInMillisBetweenTwoDates());
...

private static long getRandomTimeInMillisBetweenTwoDates() {
   long diff = endTimeInMilliseconds - beginTimeInMilliseconds + 1;
   return beginTimeInMilliseconds + (long) (Math.random() * diff);
}
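As a side note, the random pick can also be done with java.util.concurrent.ThreadLocalRandom, which takes an origin and bound directly (a self-contained sketch under the same assumptions; the class name is made up):

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Date;
import java.util.concurrent.ThreadLocalRandom;

public class RandomDateSketch {

    // pick a uniformly random Date in [beginMillis, endMillis]
    static Date randomDateBetween(long beginMillis, long endMillis) {
        // nextLong's upper bound is exclusive, so add 1 to include endMillis
        return new Date(ThreadLocalRandom.current().nextLong(beginMillis, endMillis + 1));
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now(ZoneId.of("Europe/Paris"));
        long begin = now.minusMonths(12).toInstant().toEpochMilli();
        long end = now.plusMonths(1).toInstant().toEpochMilli();
        System.out.println(randomDateBetween(begin, end));
    }
}
```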



How do Kubernetes and its pods behave regarding SIGTERM, SIGKILL and HTTP request routing

Introduction

During a recent project we saw that HTTP requests still arrive in pods (Spring Boot MVC controllers) even though Kubernetes' kubelet told the pod to exit by sending it a SIGTERM.
Not nice, because those HTTP requests that still get routed to the (shutting down) pod will most likely fail, since, for example, the Spring Boot Java process has already closed all its connection pools.

See this post (also shown below) for an overview of the Kubernetes architecture, e.g. regarding kubelets.


Analysis

The process for Kubernetes to terminate a pod is as follows:
  1. The kubelet always sends a SIGTERM before a SIGKILL.
  2. Only when a pod does not finish within the grace period (default 30 sec) after SIGTERM does the kubelet send a SIGKILL.
  3. Kubernetes keeps routing traffic to a pod until the readiness probe fails, even after the pod received a SIGTERM.
So for a pod there is always an interval between receiving the SIGTERM and the next readiness probe request for that pod. In that period requests can (and most likely will) still be routed to that pod, and (business) logic can even still be executed in the terminated pod.

This means that after sending the SIGTERM, the readiness probes must fail as soon as possible to prevent the SIGTERMed pod from receiving more HTTP requests. But there will still be a (small) window in which requests can be routed to the pod.

A solution would be for the pod's process (in this case Spring Boot's embedded webserver) to stop accepting new requests immediately but gracefully after receiving a SIGTERM. This way any requests still routed to the pod before the readiness probe fails are rejected outright.
So you would still have some failing requests being passed on to the pod, but at least no business logic will be executed anymore.
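The idea can be sketched in plain Java (the class name is made up; in Spring Boot you would wire this flag into a readiness/health endpoint instead):

```java
// On SIGTERM the JVM runs shutdown hooks; flipping a flag there makes the
// readiness probe fail as soon as possible, so Kubernetes stops routing
// new requests while in-flight ones can still drain.
public class ReadinessGate {
    private volatile boolean ready = true;

    // readiness endpoint: return 200 while true, 503 once shutting down
    public boolean isReady() { return ready; }

    // to be called from a shutdown hook, i.e. on SIGTERM
    public void beginShutdown() { ready = false; }

    public static void main(String[] args) {
        ReadinessGate gate = new ReadinessGate();
        Runtime.getRuntime().addShutdownHook(new Thread(gate::beginShutdown));
        // ... serve traffic; the probe handler returns gate.isReady() ...
    }
}
```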

This and other options/considerations are discussed here.





Wednesday, August 9, 2017

Cassandra Performance Tips - Batch inserts, Materialized views, Fallback strategy

Introduction

During a recent project we ran into multiple issues with Cassandra's performance. For example with queries being slow or having timeouts on only a specific environment (though they should have the same setup), inconsistently stored results, and how to optimize batch inserts when using Scala.

This blogpost describes how they were solved or attempted to be solved.


Setup: Cassandra running in a cluster with three nodes.

Performance related lessons learned

  1. On certain environments of the DTAP build-street, slow queries (taking seconds) and weird consistency results appeared. Not all 4 environments were the same, though Acceptance and Production were as much the same as possible.

    We found as causes:

    Slow queries and timeouts: the Cassandra driver was logging at both OS and driver level.
    Inconsistently stored results: the clocks of the different clients accessing C* were not in sync, some off by minutes. Since the default in v3 of the Datastax/Cassandra driver protocol is client-side generated timestamps, you can get in trouble of course, since the write with the most recent timestamp always wins. But server-side timestamps are not an obvious fix either, since different C* coordinators can produce timestamps that differ by milliseconds.

  2. For Gatling performance tests written in Scala, we first needed to insert 50K records into a Cassandra database, simulating users already registered to the system. To make this perform well, several options were tried:

    a- Plain string-concatenated or prepared statements were taking over 5 minutes in total
    b- Inserting as a batch (APPLY BATCH) has a limit of 50KiB in text size. That limit is too low for us: 50K records is almost 5MB. Splitting up was too much of a hassle.
    c- Making the calls async, as done here: https://github.com/afiskon/scala-cassandra-example
    But we were getting:

    17:20:57.275 [ERROR] [pool-1-thread-13] TestSimulation - ERROR while inserting row nr 13007, exception =
    com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.20.25.101:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [/172.20.25.101] Pool is busy (no available connection and the queue has reached its max size 256)))
    at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:211)
    at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:46)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:275)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution$1.onFailure(RequestHandler.java:336)
    at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172)
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
    at com.google.common.util.concurrent.Futures$ImmediateFuture.addListener(Futures.java:102)
    at com.google.common.util.concurrent.Futures.addCallback(Futures.java:1184)
    at com.google.common.util.concurrent.Futures.addCallback(Futures.java:1120)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.query(RequestHandler.java:295)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:272)
    at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:115)
    at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:95)
    at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:132)
    at UserDAO.insert(UserDao.scala:58)
    ...


    It turns out it is the driver's local acquisition queue that fills up. You can increase it via poolingOptions.setMaxQueueSize, see: http://docs.datastax.com/en/developer/java-driver/3.1/manual/pooling/#acquisition-queue
    We set it to 50000 to be safe, so it would just queue all 50K records. For a production environment this might not be a good idea of course; you might need to tune it to your needs.
    The thread count we set to 20 in the ExecutionContext (used by the DAO from the github example above). You can set it as in this example: http://stackoverflow.com/questions/15285284/how-to-configure-a-fine-tuned-thread-pool-for-futures
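With the 3.x Datastax Java driver, the queue-size change looks roughly like this (a configuration sketch, not production-tuned; assumes the driver is on the classpath and the contact point is illustrative):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PoolingOptions;

PoolingOptions poolingOptions = new PoolingOptions()
        .setMaxQueueSize(50000);   // default is 256; 50000 queues all our test records

Cluster cluster = Cluster.builder()
        .addContactPoint("172.20.25.101")
        .withPoolingOptions(poolingOptions)
        .build();
```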

  3. Increasing CPUs from 4 to 8 did seem to improve performance: less CPU saturation.

  4. Each materialized view you add decreases insert performance by about 10% (see here)

  5. For consistency and availability when one of the nodes is gone or unreachable due to network problems, we set up Cassandra writes such that EACH_QUORUM is tried first, and if that fails, LOCAL_QUORUM is used as a fallback strategy.
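The fallback itself is just a retry at a lower consistency level. A plain-Java sketch (the WriteAttempt interface is made up; with the Datastax driver you would re-execute the Statement with a different ConsistencyLevel in the catch block):

```java
// Try the write at EACH_QUORUM (a quorum in every datacenter); if that fails,
// e.g. because a remote DC is unreachable, retry once at LOCAL_QUORUM.
public class ConsistencyFallback {

    interface WriteAttempt {
        void execute(String consistencyLevel);
    }

    // returns the consistency level that succeeded
    static String writeWithFallback(WriteAttempt attempt) {
        try {
            attempt.execute("EACH_QUORUM");
            return "EACH_QUORUM";
        } catch (RuntimeException unavailable) {
            attempt.execute("LOCAL_QUORUM");
            return "LOCAL_QUORUM";
        }
    }
}
```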


Tuesday, August 8, 2017

Gatling Lessons learned

Introduction

This post describes a couple of best practices when using Gatling, the Scala based performance and load-testing tool.


Also one or two Scala related tips will be shown.

Lessons learned

  1. You can't set a header-field in the header like this in a scenario:

    .header("request-tracing-id", UUID.randomUUID().toString())

    This is because the scenario is only created once; unless you use session variables, everything in it is static (a function).

    To solve this one can use a feeder, like this:

    val feeder = Iterator.continually(Map("traceHeader" -> UUID.randomUUID().toString))

    And then replace the UUID.randomUUID().toString() line with:

    .header("request-tracing-id", "${traceHeader}")

  2. A Scala 2.11 example of a ternary (maybe not the best solution Scala-wise but readable :)

    .value(availableBalanceDecimal, if (dto.availableBalanceDecimal.isEmpty) null else dto.availableBalanceDecimal.get)

  3. Connecting to a service with a Gatling test in one environment (Nginx, ingress, Kubernetes) for some reason did not work, though the same test could connect to the service under test correctly in another environment. Apparently it had something to do with a proxy in between, because after adding .proxy() it worked:

    val httpConf = http
    .baseURL("http://172.20.33.101:30666") // Here is the root for all relative URLs
    .header(HttpHeaderNames.ContentType, HttpHeaderValues.ApplicationJson)
    .header(HttpHeaderNames.Accept, HttpHeaderValues.ApplicationJson)
    .proxy(Proxy("performance.project.com", 30666))   // note it is the *same* machine as the baseURL, but specified by name...


  4. A .check() containing a .saveAs() will *not* perform the .saveAs() when the earlier expression evaluates to false, or when the conversion fails.
    That kind of makes sense when it evaluates to false, but you might miss this one; or maybe you don't even want the .is() at all, because all it means in the example below is that isFinished will only be saved when the status is FINISHED, and otherwise it won't be set at all.

    someText is always found in the session, but the other one, isFinished, is not.

    .check(jsonPath("$[0].status").saveAs("someText"))
    .check(jsonPath("$[0].status").is("FINISHED").saveAs("isFinished"))
    ...

    .exec(session => {
       val isFinished = session.get("isFinished").asOption[String]
       logger.debug("Generated isFinished = {}", isFinished.getOrElse("Could not find expected isFinished..."))
       session
    })

    .doIf(session =>
       (!session.get("someText").as[String].equals("FINISHED") ||
       session.get("isFinished").as[Boolean].equals(false)
    ))(
    ...


    When running at DEBUG level the above logs:
       Session(Map(someText -> TIMED_OUT, ...)  /// So .saveAs() occurred
       13:45:56.983 [DEBUG] SomeSimulation - Generated isFinished = Could not find expected isFinished...  


    So the second .saveAs() did not occur for isFinished at all: it is not set to true or false, it is not set at all!