Monday, May 13, 2013

HTML Scraping - Stack Overflow - Time Machine Backup

I somehow want to crawl over a website at the click of a button, using just HTML and JavaScript, to find a simple piece of information and build my own sitemap for the site. For example: reading the website's copyright year. That's it! But the browser's same-origin security policy doesn't allow cross-domain requests, so I run into JavaScript error messages like "XMLHttpRequest Exception 101" and "Origin null is not allowed by Access-Control-Allow-Origin".

So I google around and come across a concept called "HTML scraping", described in the last paragraph of this post.
While hunting for a JavaScript-based crawler, I land on a Stack Overflow page and extract its important contents into this post, so that I can read it any time later.

I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement, I'm actually interested in hearing about other languages as well.
The story so far:
If you want to see a scraper application, check out Grant's Stack Overflow user page monitor. Nifty!
The Ruby world's equivalent to Beautiful Soup is why_the_lucky_stiff's Hpricot
In the .NET world, I recommend the HTML Agility Pack. It's not nearly as simple as some of the options above (like HTMLSQL), but it's very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.
Python has several options for HTML scraping in addition to Beautiful Soup. Here are some others:
  • mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages.
  • lxml: Python binding to libxml2 and libxslt. Supports various options to traverse and select elements (e.g. XPath and CSS selection); see the sketch after this list.
  • scrapemark: high-level library using templates to extract information from HTML.
  • pyquery: allows you to make jQuery-like queries on XML documents.
  • scrapy: a high-level scraping and web crawling framework. It can be used to write spiders, for data mining, and for monitoring and automated testing.
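To make the lxml option concrete, here is a minimal sketch of my own (not from the Stack Overflow answers) showing both XPath and CSS selection; example.com and the ".copyright" selector are placeholders:

from lxml import html
import urllib2

# Fetch and parse the page (example.com is a placeholder URL)
page = urllib2.urlopen('http://example.com').read()
tree = html.fromstring(page)

# XPath: collect every link target on the page
links = tree.xpath('//a/@href')
print links

# CSS selection: print the text of any element with class "copyright"
for el in tree.cssselect('.copyright'):
    print el.text_content()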
Scraping Stack Overflow is especially easy with Shoes and Hpricot

I've had some success with HtmlUnit, in Java. It's a simple framework for writing unit tests on web UIs, but it's equally useful for HTML scraping.

The templatemaker utility from Adrian Holovaty (of Django fame) uses a very interesting approach: You feed it variations of the same page and it "learns" where the "holes" for variable data are. It's not HTML specific, so it would be good for scraping any other plaintext content as well. I've used it also for PDFs and HTML converted to plaintext (with pdftotext and lynx, respectively).
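To give a feel for templatemaker's learn-and-extract approach, here is a minimal sketch based on my reading of its README; the sample strings are invented, so treat the exact API details as an assumption:

from templatemaker import Template  # Adrian Holovaty's templatemaker

# Feed it variations of the "same" content so it can learn the holes
t = Template()
t.learn('<b>this and that</b>')
t.learn('<b>alex and sue</b>')

# The learned template marks variable data with the given character
print t.as_text('!')  # e.g. '<b>! and !</b>'

# Extraction pulls the hole values out of a new variation
print t.extract('<b>larry and curly</b>')  # e.g. ('larry', 'curly')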

I know and love Screen-Scraper.
Screen-Scraper is a tool for extracting data from websites. It automates:
* Clicking links on websites
* Entering data into forms and submitting
* Iterating through search result pages
* Downloading files (PDF, MS Word, images, etc.)
Common uses:
* Download all products, records from a website
* Build a shopping comparison site
* Perform market research
* Integrate or migrate data
Technical:
* Graphical interface for easy automation
* Cross platform (Linux, Mac, Windows, etc.)
* Integrates with most programming languages (Java, PHP, .NET, ASP, Ruby, etc.)
* Runs on workstations or servers
Three editions of screen-scraper:
* Enterprise: The most feature-rich edition of screen-scraper. All capabilities are enabled.
* Professional: Designed to be capable of handling most common scraping projects.
* Basic: Works great for simple projects, but has nowhere near as many features as its two older brothers.
Try Yahoo! Query Language (YQL). It can be used along with jQuery, AJAX, and JSONP to screen-scrape web pages.
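For a rough idea of the YQL route, here is a minimal Python sketch (Python to match the other snippets in this post, rather than the jQuery/JSONP approach the answer mentions); the endpoint and the html table are as YQL exposed them at the time, so treat those details as assumptions:

import urllib, urllib2, json

# YQL query: pull the <title> element out of an arbitrary page
yql = 'select * from html where url="http://example.com" and xpath="//title"'
endpoint = 'http://query.yahooapis.com/v1/public/yql?' + urllib.urlencode(
    {'q': yql, 'format': 'json'})

# YQL fetches and parses the page server-side and returns JSON
response = json.load(urllib2.urlopen(endpoint))
print response['query']['results']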

Another option for Perl would be Web::Scraper which is based on Ruby's Scrapi. In a nutshell, with nice and concise syntax, you can get a robust scraper directly into data structures.

Although it was designed for .NET web testing, I've been using the WatiN framework for this purpose. Since it is DOM-based, it is pretty easy to capture HTML, text, or images. Recently, I used it to dump a list of links from a MediaWiki All Pages namespace query into an Excel spreadsheet. The VB.NET code fragment I used was pretty crude, but it worked.

Another tool for .NET is MhtBuilder

There is this solution too: Netty HttpClient.

You would be a fool not to use Perl.. Here come the flames..
Bone up on the following modules and ginsu any scrape around.
use LWP;
use HTML::TableExtract;
use HTML::TreeBuilder;
use HTML::Form;
use Data::Dumper;

I have used LWP and HTML::TreeBuilder with Perl and have found them very useful.
LWP (short for libwww-perl) lets you connect to websites and scrape the HTML; you can get the module here, and the O'Reilly book seems to be online here.
There might still be too much heavy lifting to do with an approach like this, though. I have not looked at the Mechanize module suggested by another answer, so I may well do that.
Well, if you want it done from the client side using only a browser, you have jcrawl.com. After designing your scraping service in the web app (http://www.jcrawl.com/app.html), you only need to add the generated script to an HTML page to start using/presenting your data. All the scraping logic happens in the browser via JavaScript. Hope you find it useful.
I've had mixed results in .NET using SgmlReader which was originally started by Chris Lovett and appears to have been updated by MindTouch.
Implementations of the HTML5 parsing algorithm: html5lib (Python, Ruby), Validator.nu HTML Parser (Java, JavaScript; C++ in development), Hubbub (C), Twintsam (C#; upcoming).
I've used Beautiful Soup a lot with Python. It is much better than regular expression checking, because it works like using the DOM, even if the HTML is poorly formatted. You can quickly find HTML tags and text with simpler syntax than regular expressions. Once you find an element, you can iterate over it and its children, which is more useful for understanding the contents in code than it is with regular expressions. I wish Beautiful Soup had existed years ago, when I had to do a lot of screen scraping; it would have saved me a lot of time and headache, since HTML structure was so poor before people started validating it.
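As a concrete illustration of that DOM-style navigation, here is a minimal Beautiful Soup sketch (mine, not the answerer's; the URL and the "footer" class are placeholders):

from bs4 import BeautifulSoup
import urllib2

# Beautiful Soup copes with poorly formed HTML
page = urllib2.urlopen('http://example.com').read()
soup = BeautifulSoup(page)

# Find elements by tag and attribute instead of wrestling with regexes
footer = soup.find('div', {'class': 'footer'})
if footer is not None:
    # Walk its children the way you would walk a DOM
    for child in footer.children:
        print child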
I've also had great success using Aptana's Jaxer + jQuery to parse pages. It's not as fast or 'script-like' in nature, but jQuery selectors + real JavaScript/DOM is a lifesaver on more complicated (or malformed) pages.
Regular expressions work pretty well for HTML scraping as well ;-) Though after looking at Beautiful Soup, I can see why this would be a valuable tool.
You probably have as much already, but I think this is what you are trying to do:
from __future__ import with_statement
import re, os

# Fetch the profile page with wget (the cookie value is a stand-in)
os.system('wget --no-cookies --header "Cookie: soba=(SeCreTCODe)" http://stackoverflow.com/users/30/myProfile.html')

# Read the downloaded page into one string; "with" closes the file for us
with open("myProfile.html") as f:
    profile = f.read()

p = re.compile(r'summarycount">(\d+)</div>')  # Rep is found here
m = p.search(profile)
print m.group(1)

# Speak the result aloud, then clean up the downloaded file
os.system("espeak \"Rep is at " + m.group(1) + " points\"")
os.remove("myProfile.html")

In Java, you can use TagSoup.
Web scraping is the act of programmatically harvesting data from a webpage. It consists of finding a way to format the URLs to pages containing useful information, and then parsing the DOM tree to get at the data. It’s a bit finicky, but our experience is that this is easier than it sounds.
See http://blog.hartleybrody.com/web-scraping/
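Tying that definition back to the original goal of this post (reading a site's copyright year), the two steps might look like the sketch below; the URL and the regex are placeholders, and a real parser such as Beautiful Soup would be more robust than a regex:

import re, urllib2

# Step 1: format the URL and fetch the page
page = urllib2.urlopen('http://example.com').read()

# Step 2: parse out the data, here a copyright year like "(c) 2013"
m = re.search(r'(?:&copy;|\(c\)|copyright)\s*(\d{4})', page, re.IGNORECASE)
if m:
    print 'Copyright year:', m.group(1)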

Still looking around for a workaround. If someone has some live code and examples, please help me out. Take me to ShiftEdit/PasteBin/JSFiddle/SourceKit. Don't tell me I/we can't do this task!

It's just accessing the HTML code of a webpage via JavaScript, the same way we can with Java.
