Let's visit each of these steps one by one.
For this tutorial, we will extract Yahoo's "Today's Top Searches" section, which appears towards the end of its home page (http://www.yahoo.com/ [1]).
Before you begin to write a web scraping program, it's important to understand the pattern of the data that you wish to extract. View the page source to find that pattern.
The string of text that we should parse is given below:
<div id="popsearchbd" class="bd">
<ol start=1><li><a href="r/dy/*-http://search.yahoo.com/search?p=Heidi+Klum&cs=bz&fr=fp-buzzmod">Heidi Klum</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Sarah+Larson&cs=bz&fr=fp-buzzmod">Sarah Larson</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Oscar+Videos&cs=bz&fr=fp-buzzmod">Oscar Videos</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Brad+Renfro&cs=bz&fr=fp-buzzmod">Brad Renfro</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Gary+Busey&cs=bz&fr=fp-buzzmod">Gary Busey</a></li></ol><ol start=6><li><a href="r/dy/*-http://search.yahoo.com/search?p=Barack+Obama&cs=bz&fr=fp-buzzmod">Barack Obama</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Razzie+Awards&cs=bz&fr=fp-buzzmod">Razzie Awards</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Raisin+in+the+Sun&cs=bz&fr=fp-buzzmod">Raisin in the Sun</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Stay+Home+Moms&cs=bz&fr=fp-buzzmod">Stay Home Moms</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Net+Neutrality&cs=bz&fr=fp-buzzmod">Net Neutrality</a></li></ol></div>
</div>
The pattern is that each search phrase is the link text of an <a> element nested inside an <li> element. Therefore, we should extract everything between <a ...> and </a> in this piece of text to get the desired phrases.
If you are writing a script to fetch data that is paginated, remember to validate the structure on three or four pages before you start developing code. The reason is that the presentation of the first page may differ from that of subsequent pages.
You could use any programming language, such as Java, C#, PHP, or Perl, for this processing. I have used PHP for this example.
<?php
// Fetch the Yahoo home page over a raw socket connection.
$str = "";
$fp = fsockopen("www.yahoo.com", 80, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)\n";
} else {
    $out = "GET / HTTP/1.1\r\n";
    $out .= "Host: www.yahoo.com\r\n";
    $out .= "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    // Read the whole response into $str, 1024 bytes at a time.
    while (!feof($fp)) {
        $str .= fgets($fp, 1024);
    }
    fclose($fp);
}

// Print the text of every link whose href ends in fp-buzzmod">.
// The loop header and the "</a>" end marker are reconstructed from the
// pattern described above; the listing had lost its search strings.
$pos = 0;
while (true) {
    $pos = strpos($str, "fp-buzzmod\">", $pos);
    if ($pos === false) {
        break;
    }
    // Skip past the marker so $pos points at the start of the phrase.
    $pos = $pos + strlen("fp-buzzmod\">");
    $temppos = $pos;
    // The phrase ends where the closing anchor tag begins.
    $pos = strpos($str, "</a>", $pos);
    if ($pos === false) {
        break;
    }
    $datalength = $pos - $temppos;
    $data = substr($str, $temppos, $datalength);
    echo $data;
    echo "\n";
}
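As an alternative to searching for string positions, the same extraction can be sketched with PHP's DOM extension, which tends to cope better with small markup changes. This is only a sketch under assumptions: the helper name extract_top_searches() is made up for illustration, and the sample markup is a shortened copy of the snippet shown earlier rather than a live download.

```php
<?php
// Sample markup shaped like the snippet shown earlier (shortened to two entries).
$html = '<div id="popsearchbd" class="bd"><ol start=1>'
      . '<li><a href="r/dy/*-http://search.yahoo.com/search?p=Heidi+Klum&cs=bz&fr=fp-buzzmod">Heidi Klum</a></li>'
      . '<li><a href="r/dy/*-http://search.yahoo.com/search?p=Net+Neutrality&cs=bz&fr=fp-buzzmod">Net Neutrality</a></li>'
      . '</ol></div>';

// Hypothetical helper (not part of the original script): returns the text
// of every link inside the element with id="popsearchbd".
function extract_top_searches($html) {
    // Real-world pages are rarely valid HTML, so collect parser
    // warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $phrases = array();
    foreach ($xpath->query('//div[@id="popsearchbd"]//a') as $link) {
        $phrases[] = trim($link->textContent);
    }
    return $phrases;
}

print_r(extract_top_searches($html));
```

The XPath expression ties the parser to the id of the container rather than to the fp-buzzmod query parameter, so it keeps working if the link URLs change but the container does not.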
You should test your program with all the parameters that the web page can take. I have seen both the layout and the data change depending on the parameters that are passed.
Some pages rely on form data and cookies to render their content. In such cases, check the Request and Response headers and identify what is necessary to get the results you want. If the page requires a cookie value, send that cookie in your Request headers. The note below describes the tool I use to inspect Request and Response headers.
I use Live HTTP Headers (a plug-in for Firefox) to check Request and Response headers. Visit http://livehttpheaders.mozdev.org [2] for more details. To install this plug-in, visit http://livehttpheaders.mozdev.org/installation.html [3] and click the 'download it' link for the latest release. Please read the release notes before installing a particular version.
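Once you know which cookie a page expects, it can be added to the raw request as a Cookie header. The sketch below assumes a hypothetical helper named build_request(); the cookie name and value are placeholders, not real Yahoo cookies.

```php
<?php
// Hypothetical helper: builds a raw HTTP GET request that sends cookies.
// $cookies is an associative array of cookie name => cookie value.
function build_request($host, $path, array $cookies) {
    $pairs = array();
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . "=" . $value;
    }

    $out  = "GET $path HTTP/1.1\r\n";
    $out .= "Host: $host\r\n";
    if (count($pairs) > 0) {
        // Multiple cookies go into a single header, separated by "; ".
        $out .= "Cookie: " . implode("; ", $pairs) . "\r\n";
    }
    $out .= "Connection: Close\r\n\r\n";
    return $out;
}

// Placeholder cookie for illustration only.
echo build_request("www.yahoo.com", "/", array("session" => "abc123"));
```

The returned string can be passed to fwrite() in place of the hand-built $out request from the script above.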
Please feel free to use the comments section below to share other tools that you use to monitor Request and Response Headers.
From a maintenance perspective, you should monitor the page frequently and re-validate the HTML structure, because nothing in this world is constant, much less a website you don't control. A design change could alter the HTML code and break your parser. I recommend scheduling this check at least once a month.
That is all for this tutorial. In the next tutorial, I will guide you step by step through creating a program that checks yahoo.com search engine rankings.
Visit my blog and subscribe to be notified when the next article goes online.
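A scheduled structure check could be as simple as confirming that the markers your parser depends on still appear in the page. The sketch below is one possible approach, not a complete monitoring tool: missing_markers() is a hypothetical helper, and $html holds a shortened sample where a real job would hold the freshly downloaded page.

```php
<?php
// Hypothetical helper: returns the markers that are missing from $html.
// An empty result means the structure the scraper relies on is still there.
function missing_markers($html, array $markers) {
    $missing = array();
    foreach ($markers as $marker) {
        if (strpos($html, $marker) === false) {
            $missing[] = $marker;
        }
    }
    return $missing;
}

// Markers taken from the snippet shown earlier in this tutorial.
$markers = array('id="popsearchbd"', 'fp-buzzmod">');

// Shortened sample; in a scheduled job this would be the downloaded page.
$html = '<div id="popsearchbd" class="bd"><ol start=1>'
      . '<li><a href="search?p=Heidi+Klum&cs=bz&fr=fp-buzzmod">Heidi Klum</a></li>'
      . '</ol></div>';

$missing = missing_markers($html, $markers);
if (count($missing) > 0) {
    echo "Page structure changed, missing: " . implode(", ", $missing) . "\n";
} else {
    echo "Page structure looks unchanged.\n";
}
```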
Links:
[1] http://www.yahoo.com/
[2] http://livehttpheaders.mozdev.org
[3] http://livehttpheaders.mozdev.org/installation.html
[4] http://www.sunilb.com/php/writing-website-scrapers-in-php