For various purposes one needs a system that can visit a web page and grab its main text. This is one of those tasks where human intelligence shines and computers have a very hard time: a human being can easily look at a page from cnn.com or any blog and point out the ‘main text’ of the page, that is, the part without graphics, ads, videos, side texts, etc.
I think the best server-side system among the ones mentioned in the comments so far is boilerpipe: http://code.google.com/p/boilerpipe/. According to its developer Christian Kohlschütter, “The boilerpipe library provides algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings. Extracting content is very fast (milliseconds), just needs the input document (no global or site-level information required) and is usually quite accurate.” The experiments I have done with the system support the claims of its developer, and I plan to use it for some of my projects.
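To give a feel for what "detecting clutter from the input document alone" can look like, here is a toy sketch in Python. It is not boilerpipe's algorithm or API: it is a simplified heuristic that splits a page into text blocks and keeps only the blocks that have enough words and a low share of link text. The names `BlockCollector` and `extract_main_text`, the tag lists, and the thresholds `min_words` and `max_link_density` are all assumptions made up for this illustration.

```python
from html.parser import HTMLParser

# Tags whose content is never readable text.
SKIP_TAGS = {"script", "style", "noscript"}
# Tags treated as block boundaries (a rough, hypothetical choice).
BLOCK_TAGS = {"p", "div", "li", "td", "br", "ul", "ol", "article", "section",
              "nav", "header", "footer", "aside",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts words and link words."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks: (text, n_words, n_link_words)
        self._words = []      # words of the block being built
        self._link_words = 0  # how many of those words sit inside <a>
        self._in_link = 0     # nesting depth of <a>
        self._skip = 0        # nesting depth of SKIP_TAGS

    def flush(self):
        # Close the current block, if it contains any words.
        if self._words:
            text = " ".join(self._words)
            self.blocks.append((text, len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def extract_main_text(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and are not mostly link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit whatever text follows the last block tag
    kept = [text for text, n_words, n_link in parser.blocks
            if n_words >= min_words and n_link / n_words <= max_link_density]
    return "\n".join(kept)
```

Calling `extract_main_text(html_string)` returns the surviving blocks joined by newlines. Navigation bars tend to fail the link-density check and short footers fail the word-count check, which is why even such a crude filter removes a fair amount of clutter. A real library is far more careful than this, so treat the sketch only as an illustration of the idea.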
On the other hand there are client-side systems that try to do similar things:
– Readability: http://lab.arc90.com/experiments/readability/
– Readable: http://readable-app.appspot.com/
Both systems are quite handy if your only concern is a good reading / text extraction experience within the browser. However, it may be a major pain in the important parts of your body to port them to server-side programming languages (and do the necessary testing). It is not impossible, though; see the discussion on the Env.js group and the discussion on StackOverflow: http://stackoverflow.com/questions/2921237/is-there-anything-for-python-that-is-like-readability-js
There are a few other server-side ‘main text’ extraction systems that I have come across, but since I have not experimented with them yet (well, boilerpipe was quite satisfying, I must admit), I will not mention them here.
PS: If you know of a better server-side system for ‘main text’ extraction (open source and free software), feel free to comment or e-mail me.