Lucene Nutch

I've been playing with Nutch recently, partly as a way of getting back into Java development. I've got it building a crawl database and can run Lucene searches through the web interface. It's really fast, which is great. I have an idea of providing specialised search for my own sites, which would involve a lot of customisation of the Nutch web interface. I have yet to get it to compile from source though!

What I'm not able to do yet is update an already crawled database. I can only see how you'd delete the existing database and re-crawl the lot, which can't be right.

For anyone interested, here's my little how-to, which mostly follows someone else's tutorial with modifications for Nutch 0.9:

Based on the most useful tutorial found so far:

Assuming a database of "crawl-tinysite".

Crawl URLs into a new database:
bin/nutch crawl urls -dir crawl-tinysite -depth 3

Show statistics on crawl:
bin/nutch readdb crawl-tinysite/crawldb -stats

(some of the tutorial's commands are no longer valid for Nutch 0.9)

Show database segments created by nutch:
bin/nutch readseg -list -dir crawl-tinysite/segments/
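
On the question of updating an existing crawl without deleting it, here is a minimal sketch. It assumes Nutch 0.9 still supports the generate / fetch / updatedb cycle from the Nutch whole-web crawling tutorial, and it has not been checked against a real 0.9 install. NUTCH, CRAWL_DIR and recrawl_once are names made up for this example, not Nutch settings.

```shell
#!/bin/sh
# Hypothetical helper: one generate/fetch/updatedb round on an existing crawl.
# NUTCH and CRAWL_DIR are illustrative variables, not part of Nutch itself.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl-tinysite}"

recrawl_once() {
    # Remember the newest segment so we can tell whether generate made a new one.
    before=$(ls -d "$CRAWL_DIR"/segments/* 2>/dev/null | tail -1)

    # Ask Nutch for a fetch list of pages that are due, written to a new segment.
    "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" || return 1

    # Segment names are timestamps, so the newest one sorts last.
    segment=$(ls -d "$CRAWL_DIR"/segments/* 2>/dev/null | tail -1)
    if [ -z "$segment" ] || [ "$segment" = "$before" ]; then
        echo "No new segment generated; nothing to fetch."
        return 0
    fi

    # Fetch the pages in the new segment, then fold the results into the crawldb.
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1
}
```

Calling recrawl_once from the Nutch install directory would run one round against crawl-tinysite. Pages should only be picked up once their re-fetch interval has passed, and the search index would still need rebuilding afterwards, which this sketch does not cover.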



Recently I've deployed three Asterisk based VoIP servers. I've used Trixbox ISOs, as I had good experiences using Asterisk@Home a few years ago.

In general, things have gone well. Here are a few things I've come across:

Asterisk in VMware is far from ideal. I've had a lot of problems with call quality and with the start of audio being clipped off. This is perhaps as much due to my massively overloaded server as anything else.

Trixbox 2.2 (the current release) is based on CentOS 4.5, which is a rebuild of RHEL 4 and runs Linux kernel 2.6.9. The hardware support is far from up to date - something like a newer cheap SATA controller will probably be unsupported, even though it claims Linux compatibility, because its drivers are only in the latest kernels. As a side note, modern cheap IDE and SATA controllers tend to be rubbish anyway. Never use onboard software RAID; the bugs and issues I've seen are many and varied.

VoIP phones are immature. They really are. At the high end you have Cisco, which is solid and performs brilliantly with Cisco servers, yet is virtually useless with standard SIP due to bugs and a generally piss-poor implementation. Then you have the cheaper phones, which will actually work (hooray!) but are of such low quality that you wonder why you'd bother. And then there are firmware features that would be nice if they worked but simply don't. Case in point: the Snom/Elmeg IP290 pops up a VMail link when there's voicemail, but it tries to open the wrong SIP URL, and this is NOT configurable in the web interface with the rest of the config options. I later found an Asterisk-side workaround which gets this working.



I've known about being able to run PHP under Java for a while now, but I've been a bit skeptical about how well it would actually run real PHP code.

Well, I've found out: really well. I've installed Quercus on my Tomcat 6 server and run some of my simpler PHP5 stuff on it. It runs well, but not as fast as I was expecting (i.e. slower than Zend PHP5, for filesystem access at least). I suspect this is due to not having Resin Pro and thus not being able to cache compiled PHP scripts. Maybe.

But it's certainly got my interest. I like Java a lot, but my main line of work is in PHP5 so I'm totally out of practice and well behind on the state of the art in Java (or even whatever was current 5 years ago..). Using the ability of Quercus to call Java from PHP5[1] I might look at rewriting some of my more time consuming PHP functions in Java - something like what I used to do with C and assembler long, long ago - except that Java is both more capable and faster than PHP, whereas assembler was faster but left you totally on your own when it came to features.

As an aside, to get it to run on Tomcat, ignore the web.xml on the Quercus website. Use the web.xml from inside the downloadable quercus-3.x.x.war instead.

[1] I know about the PHP/Java bridge, but I never seem to have the right version of PHP to get it to compile...



I have a 12" G4 PowerBook, and I love it! It's the most usable machine I've owned (after my Amigas).

Stuff on the Mac is so easy that I find myself doing stuff I've either not done for years or never done - like making videos.

iMovie is really easy to use. But it's also slow if you use the wrong video formats. I learned this the hard way - I put up with 20 minute imports of a few minutes of video for ages, but found by accident that if you import a video in the same format as the one you asked iMovie to create the project in, it's instant. Obvious, when you know about it... This means that I convert on my Linux box first (literally seconds for 300MB of film), then import the result into iMovie.

On my Mac, however, it is buggy. I can't add transitions: the transition doesn't render and, if I'm really lucky, iMovie crashes. I suspect this is my install. Sadly iMovie 08 doesn't support the G4 so there's no upgrade path, which means it'll have to wait until I get my new Mac laptop (which I've been promising myself for a long time now).

Strangely iMovie is the only app I get along with in the iLife pack. iPhoto is useless to me for example.
iWork I'm quite impressed with, but then Microsoft Office 2004 for the Mac is just excellent in terms of speed, user interface and usability - surprising, since I consider Office 2003 and 2007 on Windows the complete opposite!



Eclipse is a Java oriented IDE, but one which is increasingly useful in my main line of work: PHP.

I recently moved wholesale from Dreamweaver to Eclipse (not by choice but by commercial necessity) and have found myself in a whole new world, one which I've pretty much ignored up until now. If I'd moved to Eclipse a year ago (when the PHP elements started to become useful) I really do think I would have been both more productive and a better coder.

The main thing Eclipse has which Dreamweaver doesn't is code inspection and completion.

I have a class, such as:

class Basket {
    public $items;
}

In Dreamweaver, I'd need to remember that there is an attribute "$items" in Basket when typing $basket->items. But Eclipse knows about $items for me, and will hint to me that it's there when I type "$basket->". If I type "$basket->i", I can immediately hit tab to complete the line.

I've used these sorts of IDEs before (many years before, in fact) but had never even considered that the same might apply to PHP. There are other major wins for Eclipse too, such as the SVN integration. This is massively more reliable than using Tortoise in the Windows shell, as it always ensures that directory deletes and renames are reported to SVN. I usually remember to do these things myself, but there's more than just me in my team and I have spent a great many hours fixing SVN repositories after people have broken their directory layouts.


HP L2045w

For a very long time, I've put off changing my bulky 17" CRT for a TFT. I've worried for a long time about getting quality colour reproduction, knowing full well that most screens out there have only 6 bits per colour channel. I've had a very good CRT for a long time and the thought of subtly altering my working environment made me very nervous. As someone who works on websites, colour is important. One of my machines has a cheap 17" TFT, and it is crap. Even my PowerMac's otherwise high quality screen can be difficult to work with as it's very sensitive to viewing angles.

Well, I've been working a lot in Eclipse recently and have got quite used to having a few panels open for my sources. So much so that I started to push my CRT over its usual 1024x768. I found that even modest increases in screen size made really big productivity improvements - even as the screen started to look blurry.

So I finally decided to get myself a big TFT - I decided on a widescreen 20" early on, as it would provide a lot more space (1680x1050) and fit within the physical constraints of my desk.

My initial choice was an Acer at £250 - a very highly regarded model, and one with a proper 8-bit-per-channel panel. Well, I procrastinated and by the time I was ready to place the order, I couldn't find stock anywhere.

On to choice 2: the HP L2045w. This has a 6-bit panel, but at £180 and with a lot of very nice features, I decided to go for it.

First impressions: excellent screen. The colour is uniform across the whole area, there's virtually no backlight bleed, no dead pixels, excellent viewing angles and a brilliant stand that allows you to point the screen just about anywhere.

The monitor didn't come with a DVI cable so I tried it with VGA at first. It was blurry like this; in fact, if you don't have DVI then don't bother with this monitor. The quality was quite bad. Once I'd bought a DVI cable though, it was completely transformed. Everything is viewed in pin-sharp clarity.

Regarding colour, I was pleasantly surprised that I really do need to go out of my way to look for steps in colour. They are there, but I'll never see them in my normal work. It still seems stupid to feed a perfect 24-bit colour signal into a monitor that will render the colours like a machine from 1992, but as of today the extra money required for a proper monitor isn't worth it to me.
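
To put rough numbers on those colour steps, here is a small sketch of the arithmetic. It assumes the panel simply drops the two low bits of each 8-bit channel; real 6-bit panels generally use dithering to approximate the missing levels, so this is a worst case rather than a description of this monitor. The function names are made up for the example.

```shell
#!/bin/sh
# Number of distinct levels per colour channel for a given bit depth.
levels() { echo $((1 << $1)); }

# Total colours for three channels (red, green, blue) at that bit depth.
colours() { echo $((1 << ($1 * 3))); }

# Assumed behaviour: map an 8-bit channel value (0-255) onto a 6-bit panel
# by dropping the two low bits, so values move in steps of 4.
quantise() { echo $(( ($1 >> 2) << 2 )); }

echo "6-bit panel: $(levels 6) levels per channel, $(colours 6) colours"
echo "8-bit panel: $(levels 8) levels per channel, $(colours 8) colours"
```

Under that assumption a 6-bit panel has 64 levels per channel against 256 for an 8-bit one, and quantise 130 would give 128, which is the kind of step you only notice in a smooth gradient.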

There are no speakers on this monitor, but given that integrated speakers are generally very poor this is a good design choice.

In summary, I'm a happy HP customer.


