I got burnt by this little architectural nuance in Elasticsearch recently. While batch processing items in a content store, updating their status, then searching for more items, I kept getting stale data and didn't understand why. It turned out that Elasticsearch is _near_ realtime: newly indexed documents only become searchable after the next refresh, which happens every 1s by default. So if you index and then query within the same second, you'll see old data. The simplest way around this is to issue an explicit refresh on the index just before you search, so you know you're reading the latest data.
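As a rough sketch of the fix, assuming a local Elasticsearch instance and a hypothetical `content` index (names are illustrative):

```shell
# Update a document's status; by default it only becomes searchable
# after the next periodic refresh (every 1s out of the box).
curl -X PUT 'localhost:9200/content/_doc/1' \
  -H 'Content-Type: application/json' \
  -d '{"status": "processed"}'

# Force a refresh so the write is visible to the very next search.
curl -X POST 'localhost:9200/content/_refresh'

# This search now sees the updated document instead of stale data.
curl 'localhost:9200/content/_search?q=status:processed'
```

Alternatively, index requests accept a `refresh` parameter (`refresh=true` or `refresh=wait_for`), which makes the write itself searchable before (or as soon as) the request returns; that avoids hammering the whole index with explicit refresh calls in a tight batch loop.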
Nutch generates a list of URLs to fetch from the crawldb. The ./bin/crawl script defaults the size of this fetch list to sizeFetchList=50000.
If you use the default setting generate.max.count=-1, which is unrestricted, you can end up with 50,000 URLs from the same domain in your fetch list. With fetcher.queue.mode=byHost, all of those URLs land in a single fetch queue for that one host. And because fetcher.threads.per.queue=1 (to force polite crawling and respect crawl delays), only one thread can work on a queue at a time. So instead of many queues being serviced by your 50 default fetcher threads, you have 1 queue being processed by one thread while the other 49 sit idle.
To fix this, set generate.max.count=100 (or whatever value works best for your setup). Now, instead of grabbing 50,000 URLs from a single site, we grab 100 URLs each from up to 500 sites. It takes multiple iterations to finish a single site, but the crawl goes much faster because up to 500 fetch queues are fetching in parallel instead of only 1.
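A sketch of the relevant overrides in conf/nutch-site.xml (the value of 100 is illustrative, per the discussion above; tune it for your crawl):

```xml
<!-- Cap how many URLs per host go into each generated fetch list -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>

<!-- Apply the cap per host, matching fetcher.queue.mode below -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>

<!-- One fetch queue per host (the Nutch default) -->
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
</property>

<!-- Keep one thread per queue so crawl delays are respected -->
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
</property>
```

With this in place, each generate/fetch cycle spreads the 50,000-URL fetch list across many hosts, so the fetcher threads all have queues to work on.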