Sometimes you need to limit a piece of code's execution time, to guard against infinite loops that can't be detected within the code itself. I ran into this issue with HtmlCleaner on certain HTML, where the architecture made it impossible to track possible loops with a counter.

The code being timed out needs to handle thread interrupts internally by checking Thread.interrupted(), so this approach won't work on arbitrary code.

import java.util.concurrent.*;
import org.htmlcleaner.*;
import org.w3c.dom.Document;

class HTMLParseTask implements Callable<Document> {
    private final HtmlCleaner cleaner = new HtmlCleaner();
    private final DomSerializer domSerializer = new DomSerializer(cleaner.getProperties());
    private final String html;

    HTMLParseTask(String html) {
        this.html = html;
    }

    public Document call() throws Exception {
        TagNode tagNode = cleaner.clean(html);
        return domSerializer.createDOM(tagNode);
    }
}

public Document clean(String html) {
    if (html == null) return null;
    // limit the html cleaning to 5s, to avoid any bad html causing infinite loops
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<Document> future = executor.submit(new HTMLParseTask(html));
    Document result = null;
    try {
        result = future.get(5, TimeUnit.SECONDS);
    } catch (TimeoutException ex) {
        future.cancel(true); // cancel and send a thread interrupt
        log.error("Error parsing HTML. Timed out");
    } finally {
        executor.shutdownNow(); // release the worker thread
    }
    return result;
}

The cause is that Found attaches multiple IPs to your hostname. So if you use a TransportClient with SSL on port 9343 and only add the first IP you find with client.addTransportAddress(new InetSocketTransportAddress(host, port)), it'll eventually stop working because it's stuck with an old, invalid IP. The solution is to look up all the IPs on the hostname and add them to the TransportClient, then repeat this every 1-5min (or anything less than the DNS TTL). The TransportClient checks for duplicates and reachability, so you should end up with a stable system. I wrote a gist for doing this with Spring and Groovy:
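A minimal sketch of the re-resolution loop using only the JDK; the class and method names here are placeholders, and the actual ES 1.x client registration call is shown as a comment since it needs the Elasticsearch client on the classpath:

```java
import java.net.InetAddress;
import java.util.*;
import java.util.concurrent.*;

public class TransportAddressRefresher {

    // look up every IP currently attached to the hostname
    static List<String> resolveAll(String host) throws Exception {
        List<String> ips = new ArrayList<String>();
        for (InetAddress addr : InetAddress.getAllByName(host)) {
            ips.add(addr.getHostAddress());
        }
        return ips;
    }

    // re-resolve on a fixed schedule; pick an interval below your DNS TTL
    static void start(final String host, final int port) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    for (String ip : resolveAll(host)) {
                        // with the ES 1.x client you would register each IP here;
                        // the TransportClient dedupes and drops unreachable nodes:
                        // client.addTransportAddress(new InetSocketTransportAddress(ip, port));
                    }
                } catch (Exception e) {
                    // a transient DNS failure shouldn't kill the refresh loop
                }
            }
        }, 0, 2, TimeUnit.MINUTES);
    }
}
```

Because addTransportAddress ignores duplicates, re-adding the same IPs every cycle is harmless; only genuinely new IPs change the client's node list.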

I got burnt by this little architectural nuance in Elasticsearch recently. While batch processing items in a content store, updating their status, then searching for more items, I kept getting stale data and didn’t understand why. It turned out that Elasticsearch is _near_ realtime, with a default 1s refresh interval. So if you index and query within a second, you’re going to see old data. The best way around this is to do a refresh on the index just before you access it to make sure you have the latest data.
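The refresh is a single POST to the index's _refresh endpoint. A plain-JDK sketch (base URL and index name are placeholders):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class IndexRefresher {

    // build the refresh endpoint URL for an index
    static String refreshUrl(String baseUrl, String index) {
        return baseUrl + "/" + index + "/_refresh";
    }

    // POST to _refresh so documents indexed in the last second become searchable
    static int refresh(String baseUrl, String index) throws Exception {
        URL url = new URL(refreshUrl(baseUrl, index));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        int status = conn.getResponseCode();
        conn.disconnect();
        return status; // 200 means the refresh completed
    }
}
```

Call refresh("http://localhost:9200", "content") just before querying; note that forcing frequent refreshes has an indexing-throughput cost, so do it only where read-your-writes consistency actually matters.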

Checking logs when you have more than one server is painful. Use Logback/Logstash-forwarder to send JSON-formatted logs to a central server running Logstash/ElasticSearch/Kibana, where you can then slice and dice logs to your heart's content with the power of ElasticSearch and Kibana.
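As a sketch, the Logback side can emit JSON with the logstash-logback-encoder library; file paths and appender names below are placeholders:

```xml
<!-- logback.xml sketch, assuming logstash-logback-encoder is on the classpath -->
<configuration>
  <appender name="JSON_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/app/app.json</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>/var/log/app/app.%d{yyyy-MM-dd}.json</fileNamePattern>
    </rollingPolicy>
    <!-- one JSON event per line, ready for Logstash to ship without a grok filter -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON_FILE"/>
  </root>
</configuration>
```

Logstash-forwarder then tails the JSON file and ships events to the central Logstash, which indexes them into ElasticSearch for Kibana.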

Confs and docs available here: