Sometimes you want to limit how long a piece of code may run, to guard against infinite loops that can’t be detected within the code itself. I ran into this with HtmlCleaner on certain HTML, where the architecture offered no way to track possible loops with a counter.

The code under the timeout will need to handle thread interrupts internally by checking Thread.interrupted(), so this won’t always work on arbitrary code.

import java.util.concurrent.*;

import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

class HTMLParseTask implements Callable<Document> {
    private final String html;

    HTMLParseTask(String html) {
        this.html = html;
    }

    @Override
    public Document call() throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode tagNode = cleaner.clean(html);
        return new DomSerializer(cleaner.getProperties()).createDOM(tagNode);
    }
}

public Document clean(String html) {
    if (html == null) return null;
    // limit the html cleaning to 5s, to avoid any bad html causing infinite loops
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<Document> future = executor.submit(new HTMLParseTask(html));
    Document result = null;
    try {
        result = future.get(5, TimeUnit.SECONDS);
    } catch (TimeoutException ex) {
        future.cancel(true); // cancel and send a thread interrupt
        log.error("Error parsing HTML. Timed out");
    } catch (ExecutionException ex) {
        log.error("Error parsing HTML", ex);
    } catch (InterruptedException ex) {
        Thread.currentThread().interrupt(); // restore the interrupt flag
        log.error("Interrupted while parsing HTML", ex);
    } finally {
        executor.shutdownNow();
    }
    return result;
}

I needed a Chrome extension that could open my single-page application, send any text field’s contents to it, and, after editing, send the changes back to the field. Sounds simple, but it led me down many dead ends and complex APIs. The first catch was that chrome.windows.create can take a callback which gets a window object, but this is a https://developer.chrome.com/extensions/windows#type-Window metadata object and not a reference to the actual new window, so all the StackOverflow posts saying you can set variables/functions on it didn’t work out. The second option I tried was chrome.tabs.executeScript, which let me manipulate the DOM in the new window, but not set variables on the window or anywhere else. I could send the initial text via the DOM, but then there was no way to communicate changes back to my extension.

The best way to set up bi-directional comms is with https://developer.chrome.com/extensions/messaging#external-webpage. Let the SPA request the text when it’s ready, then send changes back via the same bus.

Background.js:

var selectedContent = null;
chrome.runtime.onMessageExternal.addListener(
  function(request, sender, sendResponse) {
    console.info("------------------------------- Got request", request);
    if (request.getSelectedContent) {
      sendResponse(selectedContent);        
    }
  });

Manifest.json:

The ids here are the extension IDs allowed to connect to your extension (use “*” to allow any extension), and matches lists the URL patterns of web pages allowed to connect.

"externally_connectable": {
  "ids": ["naonkagfcedpnnhdhjahadkghagenjnc"],
  "matches": ["http://localhost:1338/*"]
},

Web app:

// the extension ID is a hash generated from your extension’s public key, so in development it’ll be generated when you load your unpacked extension, but when packaged it’ll be static

var extensionId = "naonkagfcedpnnhdhjahadkghagenjnc";
chrome.runtime.sendMessage(extensionId, {getSelectedContent: "true"},
  response => {
    console.info("----------------- Got response", response);
    if(response) {
      this.text = response;
    }
});
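
For the return trip, the SPA can push edits over the same bus. A minimal sketch, where setSelectedContent is a hypothetical message name, sourceTabId is assumed to have been saved by background.js when the text was first grabbed, and selectedField is whatever reference the content script kept to the original text field:

// Web app: send the edited text back to the extension over the same bus.
chrome.runtime.sendMessage(extensionId, {setSelectedContent: this.text});

// background.js: inside the onMessageExternal listener above, also handle edits
// by forwarding them to the content script in the original tab.
if (request.setSelectedContent) {
  chrome.tabs.sendMessage(sourceTabId, {updatedContent: request.setSelectedContent});
}

// Content script in the original tab: write the edit back into the field.
chrome.runtime.onMessage.addListener(function(msg) {
  if (msg.updatedContent) {
    selectedField.value = msg.updatedContent;
  }
});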

Found attaches multiple IPs to your 0298347602938ahdf.us-east-1.aws.found.io hostname. So if you use a TransportClient with SSL on port 9343 and only add the first IP you find with client.addTransportAddress(new InetSocketTransportAddress(host, port)), it’ll eventually stop working, because it’s stuck with an old, invalid IP. The solution is to look up all the IPs on the hostname, add them all to the TransportClient, and repeat this every 1-5 minutes (or anything less than the DNS TTL). The TransportClient checks for duplicates and reachability, so this gives you a stable connection. I wrote a gist for doing this with Spring and Groovy: https://gist.github.com/vaughnd/04350e4c5bf51dedabb8
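
That gist has the full Spring/Groovy version; here’s a minimal Java sketch of the same idea, assuming the 1.x-era TransportClient API (foundHost and foundPort stand in for your cluster’s hostname and transport port):

import java.net.InetAddress;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

class FoundAddressRefresher {
    private final TransportClient client;
    private final String foundHost; // e.g. "0298347602938ahdf.us-east-1.aws.found.io"
    private final int foundPort;    // 9343 for the SSL transport

    FoundAddressRefresher(TransportClient client, String foundHost, int foundPort) {
        this.client = client;
        this.foundHost = foundHost;
        this.foundPort = foundPort;
    }

    // Re-resolve the hostname and register every current IP with the client.
    // The TransportClient itself handles duplicates and reachability checks.
    void refresh() {
        try {
            for (InetAddress address : InetAddress.getAllByName(foundHost)) {
                client.addTransportAddress(
                        new InetSocketTransportAddress(address.getHostAddress(), foundPort));
            }
        } catch (Exception e) {
            // log and retry on the next tick
        }
    }

    // Schedule refreshes below the DNS TTL, e.g. once a minute.
    void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() { refresh(); }
        }, 0, 1, TimeUnit.MINUTES);
    }
}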

I got burnt by this little architectural nuance in Elasticsearch recently. While batch processing items in a content store, updating their status, then searching for more items, I kept getting stale data and didn’t understand why. It turned out that Elasticsearch is _near_ realtime, with a default 1s refresh interval. So if you index and query within a second, you’re going to see old data. The best way around this is to do a refresh on the index just before you access it to make sure you have the latest data.
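
With the Java client that looks something like this (a sketch; the “content-store” index and “status” field are just stand-ins for your own):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilders;

// Force a refresh so documents indexed in the last second become searchable...
client.admin().indices().prepareRefresh("content-store").execute().actionGet();

// ...then query against up-to-date data.
SearchResponse response = client.prepareSearch("content-store")
        .setQuery(QueryBuilders.termQuery("status", "pending"))
        .execute().actionGet();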

Nutch generates a list of urls to fetch from the crawldb. In ./bin/crawl it defaults the size of the fetch list to sizeFetchList=50000.
If you use the default setting generate.max.count=-1, which is unrestricted, you can end up with 50000 urls from the same domain in your fetch list. The setting fetcher.queue.mode=byHost then creates a single fetch queue for that host, and only one thread may work on a queue at a time (fetcher.threads.per.queue=1, to force polite crawling and respect crawl delays). So instead of X queues being processed by your 50 default threads, you have 1 queue being processed by one thread while the other 49 sit idle.

To fix this, set generate.max.count=100 (or whatever value works best for your setup). Now, instead of grabbing 50000 urls from a single site, we grab 100 urls each from 500 sites. It takes multiple iterations to finish a single site, but the crawl goes quicker because up to 500 fetch queues are fetching pages instead of only 1.
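
In nutch-site.xml the override looks like this (100 being the example value above; the description is paraphrased, so check nutch-default.xml for the authoritative one):

<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Maximum number of urls per host/domain in a single fetch list (-1 for unlimited).</description>
</property>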

Checking logs when you have more than one server is painful. Use Logback/logstash-forwarder to send JSON-formatted logs to a central server running Logstash/ElasticSearch/Kibana, where you can then slice and dice logs to your heart’s content with the power of ElasticSearch and Kibana.

Confs and docs available here: https://github.com/vaughnd/centralised-logging
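
The repo has the complete setup; the Logback half boils down to something like this logback.xml fragment (a sketch assuming the logstash-logback-encoder library, with a hypothetical log path for logstash-forwarder to watch):

<appender name="JSON_FILE" class="ch.qos.logback.core.FileAppender">
  <file>/var/log/myapp/myapp.json</file>
  <!-- one JSON event per line, ready for logstash-forwarder to ship -->
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>

<root level="INFO">
  <appender-ref ref="JSON_FILE"/>
</root>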

Helm for Emacs is a fantastic Quicksilver-like extension, but it gets quite wordy sometimes. Instead of C-u M-x helm-do-grep *nav to dir* *enter extensions* *enter query* to recursively grep, I defined the following in my init.el. Now hitting F1 will grep actual source across all my projects.

(defun project-search ()
  "Recursively grep Clojure sources under /home/vaughn/src using Helm."
  (interactive)
  (helm-do-grep-1 '("/home/vaughn/src")
                  '(4)  ; non-nil: recurse into subdirectories, as with C-u
                  nil
                  '("*.clj" "*.cljs")))

(global-set-key (kbd "<f1>") 'project-search)

Here’s a quick rundown on getting Datomic Free running on EC2 or any Ubuntu system. This includes an Upstart init script, and a symlinked runtime to make upgrading Datomic less painful. I highly recommend scripting this and the rest of your cloud with Pallet.

  • Start up an EC2 instance (preferably an m1.small, since Datomic wants 1GB of RAM). I used ami-9c78c0f5 for Ubuntu 12.04 LTS.
  • Datomic runtime: Log in as your admin user with sudo rights and run this to install Datomic:
sudo aptitude install unzip
sudo aptitude install openjdk-7-jre
sudo useradd -m -s /bin/bash -d /var/lib/datomic datomic
sudo -i -u datomic
export version="0.8.3599" # use the latest datomic version
mkdir data
wget http://downloads.datomic.com/${version}/datomic-free-${version}.zip
unzip -o datomic-free-${version}.zip
ln -s datomic-free-${version} runtime
  • Datomic configuration: Copy runtime/config/samples/free-transactor-template.properties to /var/lib/datomic/transactor.properties and change “host”. You can’t use 0.0.0.0 to listen on all interfaces, so use 127.0.0.1 for localhost-only access, or the EC2 private IP so other instances can communicate with it:
########### free mode config ###############
protocol=free
host=<PRIVATE IP or 127.0.0.1 if accessing from same host only>
#free mode will use 3 ports starting with this one:
port=4334

## optional overrides if you don't want ./data and ./log
data-dir=/var/lib/datomic/data/
log-dir=/var/log/datomic/
  • Upstart init script: Edit /etc/init/datomic.conf (install upstart if it’s not installed by default):
start on (started network-interface
or started network-manager
or started networking)

stop on runlevel [016]

respawn

pre-start script
mkdir -p /var/log/datomic
chown -R datomic /var/log/datomic
end script

script
exec su - datomic -c 'cd /var/lib/datomic/runtime; /var/lib/datomic/runtime/bin/transactor /var/lib/datomic/transactor.properties >> /var/log/datomic/datomic.log 2>&1'
end script
  • Start datomic with “sudo service datomic start” and view logs in /var/log/datomic/*

Upgrading datomic:

sudo service datomic stop
sudo -i -u datomic
export version="0.8.3611" # use the latest datomic version
wget http://downloads.datomic.com/${version}/datomic-free-${version}.zip
unzip -o datomic-free-${version}.zip
rm runtime
ln -s datomic-free-${version} runtime
exit # drop back to your admin user
sudo service datomic start
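
Because data-dir and log-dir in transactor.properties point outside the versioned datomic-free-* directories, swapping the runtime symlink upgrades the binaries without touching your data or logs.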

Backing up and restoring:

Customize this script and pop it in /etc/cron.daily/backup_datomic. Install rdiff-backup on both the source and target hosts, and make sure the source can ssh to the target without a password. Check the rdiff-backup site for more information. Alternatively, just use scp or rsync; I like rdiff-backup because it keeps incremental backups, in case a backup ends up corrupt.

#!/bin/bash -ex
# cron.daily script

# customize these:
export TARGET_HOST=<TARGET HOST>
export TARGET_USER=datomic-backups
export SSH_KEY=/var/lib/datomic/.ssh/id_rsa
export BACKUP_DIR=/var/lib/datomic/backups
export DATABASE=mydb
export SOURCE_HOST=<IP/host datomic is listening on>

mkdir -p $BACKUP_DIR
cd /var/lib/datomic/runtime/
./bin/datomic backup-db datomic:free://${SOURCE_HOST}:4334/${DATABASE} file://${BACKUP_DIR}/${DATABASE}.dtdb

ssh -i $SSH_KEY -o StrictHostKeyChecking=no ${TARGET_USER}@${TARGET_HOST} "mkdir -p ~/backups/`hostname`/"
rdiff-backup -v5 --create-full-path --remote-schema "ssh -i $SSH_KEY -o StrictHostKeyChecking=no -C %s rdiff-backup --server" ${BACKUP_DIR}/${DATABASE}.dtdb ${TARGET_USER}@${TARGET_HOST}::/home/${TARGET_USER}/backups/`hostname`/${DATABASE}.dtdb

# Restoring
# rdiff-backup --restore-as-of now ${TARGET_USER}@${TARGET_HOST}::/home/${TARGET_USER}/backups/`hostname`/${DATABASE}.dtdb /tmp/${DATABASE}.dtdb
# rdiff-backup -r 2012-11-01 ${TARGET_USER}@${TARGET_HOST}::/home/${TARGET_USER}/backups/`hostname`/${DATABASE}.dtdb /tmp/${DATABASE}.dtdb

Restore from a backup with:

cd /var/lib/datomic/runtime
./bin/datomic restore-db file:///tmp/${DATABASE}.dtdb datomic:free://${SOURCE_HOST}:4334/${DATABASE}