I got burnt by this little architectural nuance in Elasticsearch recently. I was batch-processing items in a content store: updating their status, then searching for the next batch, and I kept getting stale data without understanding why. It turned out that Elasticsearch is _near_ realtime, with a default refresh interval of 1s. So if you index and then query within the same second, you’re going to see old data. The simplest way around this is to explicitly refresh the index just before you query it, so you’re guaranteed to see the latest data.
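For example (a rough sketch, assuming a node on localhost:9200 and an index called content, both placeholders for your own setup), hit the refresh API before searching:

curl -XPOST 'http://localhost:9200/content/_refresh'

This forces a refresh immediately instead of waiting for the next refresh interval, at the cost of a little extra indexing overhead if you do it too often.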
Why does Apache Nutch sometimes get stuck using a single thread and crawling slowly?
Nutch generates a list of urls to fetch from the crawldb. In ./bin/crawl the size of the fetch list defaults to sizeFetchList=50000.
With the default setting generate.max.count=-1 (unrestricted), you can end up with all 50000 urls in your fetch list coming from the same domain. Because fetcher.queue.mode=byHost, those urls all land in a single fetch queue, and because fetcher.threads.per.queue=1 (to force polite crawling and respect crawl delays), only one thread can work on that queue at a time. So instead of many queues being processed by your 50 default threads, you have 1 queue being processed by one thread while the other 49 sit idle.
To fix this, set generate.max.count=100 (or whatever value works best for your setup). Now, instead of grabbing 50000 urls from a single site, we grab 100 urls each from up to 500 sites. It takes multiple iterations to finish a single site, but the crawl goes much quicker because up to 500 fetch queues are actually fetching pages instead of only 1.
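For example (a sketch, assuming the stock conf layout; pick the value that suits your crawl), override the property in conf/nutch-site.xml:

<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Max urls per host/domain in a single fetch list; -1 means unlimited.</description>
</property>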
Centralising Clojure/Java logging with Logback, LogStash, ElasticSearch, and Kibana
Checking logs when you have more than one server is painful. Use Logback with logstash-forwarder to ship json-formatted logs to a central server running Logstash/ElasticSearch/Kibana, where you can slice and dice logs to your heart’s content with the power of ElasticSearch and Kibana.
Confs and docs available here: https://github.com/vaughnd/centralised-logging
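As a rough sketch (the repo above has the real confs; the appender name, file paths, and log level here are just placeholders), the Logback side boils down to an appender that writes JSON via logstash-logback-encoder, which logstash-forwarder then tails and ships to the central Logstash:

<appender name="JSON" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>/var/log/myapp/myapp.json</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
    <fileNamePattern>/var/log/myapp/myapp.%d{yyyy-MM-dd}.json</fileNamePattern>
  </rollingPolicy>
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
<root level="INFO">
  <appender-ref ref="JSON"/>
</root>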
Deploying Datomic free on EC2 or any Ubuntu system
Here’s a quick rundown on getting Datomic free running on EC2 or any Ubuntu system. This includes a startup script and a symlinked runtime to make upgrading Datomic less painful. I highly recommend scripting this and the rest of your cloud with Pallet.
- Start up an EC2 instance (preferably m1.small, since Datomic wants 1GB of RAM). I used ami-9c78c0f5 for Ubuntu 12.04 LTS.
- Datomic runtime: Log in as your admin user with sudo rights and run the following to install Datomic:
sudo aptitude install unzip
sudo aptitude install openjdk-7-jre
sudo useradd -s /bin/bash -d /var/lib/datomic -m datomic
sudo -i -u datomic
export version="0.8.3599" # use the latest datomic version
mkdir data
wget http://downloads.datomic.com/${version}/datomic-free-${version}.zip
unzip -o datomic-free-${version}.zip
ln -s datomic-free-${version} runtime
- Datomic configuration: Edit /var/lib/datomic/transactor.properties and change “host”. You can’t use 0.0.0.0 to listen on all interfaces, so use 127.0.0.1 for localhost-only access, or use the EC2 private IP so other instances can communicate with it:
########### free mode config ###############
protocol=free
host=<PRIVATE IP or 127.0.0.1 if accessing from same host only>
# free mode will use 3 ports starting with this one:
port=4334
## optional overrides if you don't want ./data and ./log
data-dir=/var/lib/datomic/data/
log-dir=/var/log/datomic/
- Upstart init script: Edit /etc/init/datomic.conf (install upstart first if it isn’t already installed):
start on runlevel [2345]

pre-start script
bash << "EOF"
  mkdir -p /var/log/datomic
  chown -R datomic /var/log/datomic
EOF
end script

start on (started network-interface or started network-manager or started networking)
stop on (stopping network-interface or stopping network-manager or stopping networking)

respawn

script
  exec su - datomic -c 'cd /var/lib/datomic/runtime; /var/lib/datomic/runtime/bin/transactor /var/lib/datomic/transactor.properties 2>&1 >> /var/log/datomic/datomic.log'
end script

stop on runlevel [016]
- Start Datomic with “sudo service datomic start” and view logs in /var/log/datomic/*
Upgrading Datomic:

sudo service datomic stop
sudo -i -u datomic
export version="0.8.3611" # use the latest datomic version
wget http://downloads.datomic.com/${version}/datomic-free-${version}.zip
unzip -o datomic-free-${version}.zip
rm runtime
ln -s datomic-free-${version} runtime
exit
sudo service datomic start
Backing up and restoring:
Customize this script and pop it in /etc/cron.daily/backup_datomic. Install rdiff-backup on both the source and target hosts, and make sure you can ssh from the source to the target without a password. Check the rdiff-backup site for more information. Alternatively, just use scp or rsync. I like rdiff-backup because it keeps incremental backups, just in case a backup ends up corrupted.
#!/bin/bash -ex
# cron.daily script
# customize these:
export TARGET_HOST=<TARGET HOST>
export TARGET_USER=datomic-backups
export SSH_KEY=/var/lib/datomic/.ssh/id_rsa
export BACKUP_DIR=/var/lib/datomic/backups
export DATABASE=mydb
export SOURCE_HOST=<IP/host datomic is listening on>

mkdir -p $BACKUP_DIR
cd /var/lib/datomic/runtime/
./bin/datomic backup-db datomic:free://${SOURCE_HOST}:4334/${DATABASE} file://${BACKUP_DIR}/${DATABASE}.dtdb

ssh -i $SSH_KEY -o StrictHostKeyChecking=no ${TARGET_USER}@${TARGET_HOST} "mkdir -p ~/backups/`hostname`/"
rdiff-backup -v5 --create-full-path --remote-schema "ssh -i $SSH_KEY -o StrictHostKeyChecking=no -C %s rdiff-backup --server" ${BACKUP_DIR}/${DATABASE}.dtdb ${TARGET_USER}@${TARGET_HOST}::/home/${TARGET_USER}/backups/`hostname`/${DATABASE}.dtdb

# Restoring
# rdiff-backup --restore-as-of now ${TARGET_USER}@${TARGET_HOST}::/home/${TARGET_USER}/backups/${SOURCE_HOST}/${DATABASE}.dtdb /tmp/${DATABASE}.dtdb
# rdiff-backup -r 2012-11-01 ${TARGET_USER}@${TARGET_HOST}::/home/${TARGET_USER}/backups/`hostname`/${DATABASE}.dtdb /tmp/${DATABASE}.dtdb
Restore from a backup with:
cd /var/lib/datomic/runtime
./bin/datomic restore-db file:///tmp/${DATABASE}.dtdb datomic:free://${SOURCE_HOST}:4334/${DATABASE}