Printable Format for Any Webpage
(and the “Meat” Algorithm)
Last week, we added a feature to one of our web apps that shows just the main content of any webpage, without all the other stuff. You can think of it as a printable view of the page, with all images, videos, ads, etc. removed. Here is an example of an original webpage vs. the printable view we create:
Feel free to skip straight to our “Meat” Algorithm, as we’ve so endearingly named it, if you’re not interested in the specifics of implementing it.
The Tools: Ruby and Nokogiri
Thanks to Ruby and a Ruby gem called Nokogiri, it’s far easier to create this printable view than you may think. If you haven’t heard of it before, Nokogiri is a gem that parses HTML and XML (it also offers a SAX-style streaming parser) and lets you easily search and manipulate these documents with CSS selectors and XPath.
I should also note that the code here uses Open-URI to fetch the page over HTTP. Nokogiri itself does not depend on it, but it is included in the standard Ruby library, so there is nothing extra to install.
Reading and Parsing in 5 Lines
Nokogiri is really straightforward and easy to use. In fact, it’s so easy, I’m going to show you how to open any webpage and create a print-formatted version of it in 5 lines of code!
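require 'nokogiri'
require 'open-uri'

# Fetch and parse the page. On Ruby 3.0+, URLs must be opened with URI.open rather than plain open.
doc = Nokogiri::HTML(URI.open('http://www.example.com/some-page')) do |config|
  config.noent.noblanks.noerror
end
# Remove scripts, media, forms, and head metadata.
doc.search("//script","//img","//iframe","//object","//embed","//param","//form","//meta","//link","//title").remove
# Blank out class, id, and style so the result carries no styling hooks.
doc.search("//div","//p","//span","//a","//h1","//h2","//h3","//h4","//h5","//h6","//ul","//ol").attr('class','').attr('id','').attr('style','')
# Keep the unique parent elements of all <p> tags.
doc = doc.search("//p").collect{ |p| p.parent }.uniq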
Explanation: The Algorithm
Ok, so now you may be asking, what exactly is this code doing? Let’s take a look, piece-by-piece…
Setting Up the HTML with the Configuration Options
doc = Nokogiri::HTML(URI.open('http://www.example.com/some-page')) do |config|
  config.noent.noblanks.noerror
end
This config block tells Nokogiri to substitute entities (noent), strip blank nodes (noblanks), and suppress the errors reported for malformed HTML (noerror). You can read more about these configuration options on Nokogiri’s site.
Stripping Out the Media
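doc.search("//script","//img","//iframe","//object","//embed","//param","//form","//meta","//link","//title").remove
doc.search("//div","//p","//span","//a","//h1","//h2","//h3","//h4","//h5","//h6","//ul","//ol").attr('class','').attr('id','').attr('style','')
The first line simply removes all JavaScript, images, iframes, and embedded objects such as videos and Flash, along with forms and the metadata tags from the page head. You can modify it to remove pretty much any elements you want from the page.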
UPDATE: Since writing this article, we’ve refined our “algorithm” to strip out more element types. We also added the second line, which overrides the HTML attributes for class, id, and style so that the resulting HTML does not cause unintentional styling conflicts.
The “Meat” Algorithm: Determining What’s Important
doc = doc.search("//p").collect{ |p| p.parent }.uniq
This line, believe it or not, is our entire algorithm for determining which part of the webpage is important. This is how we find the “meat” of the page, the part that needs to be kept.
All we are doing is searching the document for <p> paragraph tags. Then we collect the parent <div> blocks (or whatever the parent elements of the <p> tags happen to be) into an array. However, if there are multiple <p> tags in one block (and let’s face it, there will be), this creates duplicate parent blocks in the array (one for each <p> tag). So, we simply call the .uniq method on the array to get rid of the duplicates.
Note that the reason we grab the parent element of each paragraph (rather than just grabbing the paragraph elements themselves) is to make sure we don’t exclude any headings, unordered lists, ordered lists, or other elements that sit in the body of the page but are not wrapped in paragraph tags.
95% of the Time, It Works Every Time
This is not foolproof, as there could be a webpage that throws semantics and proper markup to the wind and decides that plain text and <br /> tags are better than <p> tags. It may also exclude important information that lives elsewhere in the webpage. However, in our limited experience so far, it works very well in at least 95% of scenarios.

