Technological Wanderings - searching http://www.technologicalwanderings.co.uk/taxonomy/term/10 en Lucene Nutch http://www.technologicalwanderings.co.uk/node/6 <div class="field field-name-taxonomy-vocabulary-1 field-type-taxonomy-term-reference field-label-above"><div class="field-label">Keywords:&nbsp;</div><div class="field-items"><div class="field-item even"><a href="/taxonomy/term/10">searching</a></div></div></div><div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even"><p>I've been playing with Nutch recently, partly as a way of getting back into Java development. I've got it creating a crawl database and can run Lucene searches against it through the web interface. It's really fast, which is great. I have this idea of providing specialised search for my own sites, which would involve a lot of customisation of the Nutch web interface. I have yet to get it to compile from source, though!</p> <p>What I can't do yet is update an already-crawled database.
The only way I can see is to delete the existing database and re-crawl the lot, which can't be right.</p> <p>For anyone interested, here's my little how-to, which mostly follows someone else's tutorial with modifications for Nutch 0.9:</p> <p>Based on the most useful tutorial I've found so far:<br /><a href="http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html">http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html</a></p> <p>Assuming a database called "crawl-tinysite" (the directory passed to -dir).</p> <p>Crawl URLs into a new database:<br /> <code>bin/nutch crawl urls -dir crawl-tinysite -depth 3</code></p> <p>Show statistics on the crawl:<br /> <code>bin/nutch readdb crawl-tinysite/crawldb -stats</code></p> <p>(Some of the tutorial's commands are no longer valid for Nutch 0.9.)</p> <p>Show the database segments created by Nutch:<br /> <code>bin/nutch readseg -list -dir crawl-tinysite/segments/</code></p> </div></div></div> Sat, 25 Aug 2007 21:11:01 +0000 techuser 6 at http://www.technologicalwanderings.co.uk http://www.technologicalwanderings.co.uk/node/6#comments
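<p>The three commands in the how-to can be chained into one small script. This is only a rough sketch, not anything that ships with Nutch: the names NUTCH, DRY_RUN, run and crawl_and_report are made up for this example, and it assumes it is run from the Nutch install directory with the seed list in "urls". By default it just prints the commands it would run.</p>

```shell
#!/bin/sh
# Sketch only: wraps the three commands from the how-to.
# NUTCH, DRY_RUN, run and crawl_and_report are hypothetical names for this example.

NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher script
DRY_RUN="${DRY_RUN:-1}"       # 1 = print the commands only, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: crawl_and_report <crawl-dir> [depth]
crawl_and_report() {
    crawl_dir="$1"
    depth="${2:-3}"

    # Crawl the seed URLs into a new database
    run "$NUTCH" crawl urls -dir "$crawl_dir" -depth "$depth" || return 1

    # Show statistics on the crawl
    run "$NUTCH" readdb "$crawl_dir/crawldb" -stats || return 1

    # Show the segments created by Nutch
    run "$NUTCH" readseg -list -dir "$crawl_dir/segments/"
}

crawl_and_report crawl-tinysite 3
```

<p>Setting DRY_RUN=0 runs the commands for real. It still starts a fresh crawl each time, so it does nothing about updating an existing database.</p>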