Writing Website Scrapers in PHP

02.26.2008
This article shows how to write a website scraper in PHP to extract data from web pages. The concepts apply equally to Java, C#, and any other language with strong string-processing capabilities. It covers the basics of website scraping; a follow-up tutorial will show how to find search rankings from the Yahoo.com search engine.

Steps Involved in Writing a Scraping Program

  1. Visit the URL
  2. Understand the pattern
  3. Validate the structure of pattern on different URLs
  4. Write the program
  5. Test the program using various input parameters

Let's visit each of these steps one by one.

1. Visit the URL

For this tutorial, we will extract Yahoo's "Today's Top Searches" section towards the end of their home page (http://www.yahoo.com/).

2. Understand the pattern

Before you begin to write a web scraping program, it's important to understand the pattern of the data that you wish to extract. View the page source to understand the pattern.

The string of text that we should parse is given below:

<div id="popsearchbd" class="bd">
<ol start=1><li><a href="r/dy/*-http://search.yahoo.com/search?p=Heidi+Klum&cs=bz&fr=fp-buzzmod">Heidi Klum</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Sarah+Larson&cs=bz&fr=fp-buzzmod">Sarah Larson</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Oscar+Videos&cs=bz&fr=fp-buzzmod">Oscar Videos</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Brad+Renfro&cs=bz&fr=fp-buzzmod">Brad Renfro</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Gary+Busey&cs=bz&fr=fp-buzzmod">Gary Busey</a></li></ol><ol start=6><li><a href="r/dy/*-http://search.yahoo.com/search?p=Barack+Obama&cs=bz&fr=fp-buzzmod">Barack Obama</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Razzie+Awards&cs=bz&fr=fp-buzzmod">Razzie Awards</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Raisin+in+the+Sun&cs=bz&fr=fp-buzzmod">Raisin in the Sun</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Stay+Home+Moms&cs=bz&fr=fp-buzzmod">Stay Home Moms</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Net+Neutrality&cs=bz&fr=fp-buzzmod">Net Neutrality</a></li></ol></div>
</div>

The pattern is that each search phrase is the text of an <a> element nested inside an <li> element. Therefore, we should extract everything between each opening <a ...> tag and its closing </a> in this piece of text to get the desired phrases.
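As a minimal sketch of this pattern, the snippet below pulls the link text out of a shortened copy of the markup above using preg_match_all. The function name extractSearchPhrases and the regular expression are my own illustrative assumptions, and an expression this simple only suits markup as regular as this sample.

```php
<?php
// Hypothetical helper: returns the text of every <a> that is wrapped directly in an <li>.
function extractSearchPhrases(string $html): array
{
    // Capture everything between <li><a ...> and </a></li>, non-greedily.
    preg_match_all('#<li><a\s[^>]*>(.*?)</a></li>#s', $html, $matches);
    return $matches[1];
}

// Shortened copy of the markup shown above (two of the ten entries).
$sample = '<div id="popsearchbd" class="bd"><ol start=1>'
    . '<li><a href="r/dy/*-http://search.yahoo.com/search?p=Heidi+Klum&cs=bz&fr=fp-buzzmod">Heidi Klum</a></li>'
    . '<li><a href="r/dy/*-http://search.yahoo.com/search?p=Sarah+Larson&cs=bz&fr=fp-buzzmod">Sarah Larson</a></li>'
    . '</ol></div>';

foreach (extractSearchPhrases($sample) as $phrase) {
    echo $phrase, "\n";
}
```

Run against the sample, this prints Heidi Klum and Sarah Larson on separate lines.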

3. Validate the structure of pattern on different URLs

If you are writing a script to fetch data that has pagination, you should remember to validate the structure on 3 - 4 pages before you start developing code. The reason is that the presentation of the first page could differ from that of subsequent pages.
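One way to make this check repeatable is to look for the marker strings you plan to rely on in each page you have fetched. The sketch below assumes a hypothetical helper named missingMarkers and uses two placeholder pages in place of real downloads; the markers themselves are taken from the markup in step 2.

```php
<?php
// Hypothetical helper: lists the markers that are absent from a page's HTML.
function missingMarkers(string $html, array $markers): array
{
    $missing = [];
    foreach ($markers as $marker) {
        if (strpos($html, $marker) === false) {
            $missing[] = $marker;
        }
    }
    return $missing;
}

// Marker strings taken from the markup in step 2.
$markers = ['id="popsearchbd"', 'fp-buzzmod">', '</a></li>'];

// Placeholder pages; in practice these would hold the HTML of 3 - 4 fetched pages.
$pages = [
    'page 1' => '<div id="popsearchbd" class="bd"><ol><li><a href="?fr=fp-buzzmod">Heidi Klum</a></li></ol></div>',
    'page 2' => '<div id="results"><ul><li><a href="?fr=other">Heidi Klum</a></li></ul></div>',
];

foreach ($pages as $name => $html) {
    $missing = missingMarkers($html, $markers);
    echo $name, ': ', $missing ? 'missing ' . implode(', ', $missing) : 'structure matches', "\n";
}
```

Here the second placeholder page lacks two of the three markers, which is the kind of difference you want to discover before writing the parser.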

4. Write the program

You could use any programming language, such as Java, C#, PHP, or Perl, for this processing. I have used PHP for this example.

<?php
$str = "";

// Open a plain HTTP connection to the Yahoo home page (30-second timeout).
$fp = fsockopen("www.yahoo.com", 80, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)\n";
} else {
    $out = "GET / HTTP/1.1\r\n";
    $out .= "Host: www.yahoo.com\r\n";
    $out .= "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12\r\n";
    $out .= "Connection: Close\r\n\r\n";

    // Send the request and read the whole response into $str.
    fwrite($fp, $out);
    while (!feof($fp)) {
        $str .= fgets($fp, 1024);
    }
    fclose($fp);
}

// Each search phrase sits between the end of its link's opening tag
// (fp-buzzmod">) and the closing </a>.
$marker = "fp-buzzmod\">";
$pos = 0;

while (true) {
    // Find the next link; stop when there are none left.
    $pos = strpos($str, $marker, $pos);

    if ($pos === false) {
        break;
    }

    // The phrase starts right after the marker and runs up to the next </a>.
    $pos = $pos + strlen($marker);
    $temppos = $pos;
    $pos = strpos($str, "</a>", $pos);

    if ($pos === false) {
        break;
    }

    $datalength = $pos - $temppos;

    $data = substr($str, $temppos, $datalength);
    echo $data;
    echo "\n";
}

5. Test the program using various input parameters

You should test your program with every parameter that the web page accepts. In my experience, both the layout and the data can change depending on the parameters that are passed.
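A small sketch of how the test inputs could be generated is shown below. The helper name buildSearchUrl and the parameter sets are assumptions for illustration; the "p", "cs" and "fr" names and the search.yahoo.com address come from the links in step 2. Each URL would then be fetched and run through the scraper.

```php
<?php
// Hypothetical helper: builds a search URL for one set of query parameters.
function buildSearchUrl(array $params): string
{
    // Pass "&" explicitly so the result does not depend on php.ini settings.
    return 'http://search.yahoo.com/search?' . http_build_query($params, '', '&');
}

// Illustrative parameter sets, including an empty search phrase as an edge case.
$cases = [
    ['p' => 'Heidi Klum'],
    ['p' => 'Net Neutrality', 'cs' => 'bz', 'fr' => 'fp-buzzmod'],
    ['p' => ''],
];

foreach ($cases as $params) {
    echo buildSearchUrl($params), "\n";
}
```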

Notes on processing forms and cookies

Some pages use form data and cookies to render data. In such cases, you should check the Request and Response headers and identify what is necessary to get the results that you want. If the page requires a cookie value, you should include that cookie information in your Request headers. The next section describes the tool I use to inspect Request and Response headers.
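To illustrate the cookie case, here is a minimal sketch that reads the Set-Cookie lines from a response and sends the values back in a Cookie request header. The helper names cookiesFromResponse and buildRequest, the host www.example.com and the response headers are all made up for this example, and the sketch deliberately ignores cookie attributes such as path, domain and expiry.

```php
<?php
// Hypothetical helper: collects name=value pairs from Set-Cookie response headers.
function cookiesFromResponse(string $rawHeaders): string
{
    $pairs = [];
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive; keep only the part before the first ";".
        if (preg_match('/^Set-Cookie:\s*([^;]+)/i', $line, $m)) {
            $pairs[] = trim($m[1]);
        }
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: builds a GET request that sends the cookies back.
function buildRequest(string $host, string $path, string $cookies): string
{
    $out  = "GET $path HTTP/1.1\r\n";
    $out .= "Host: $host\r\n";
    if ($cookies !== '') {
        $out .= "Cookie: $cookies\r\n";
    }
    $out .= "Connection: Close\r\n\r\n";
    return $out;
}

// Made-up response headers, for illustration only.
$response = "HTTP/1.1 200 OK\r\n"
    . "Set-Cookie: session=abc123; path=/\r\n"
    . "Set-Cookie: lang=en; path=/; HttpOnly\r\n"
    . "Content-Type: text/html\r\n";

echo buildRequest('www.example.com', '/results', cookiesFromResponse($response));
```

The request string built this way can be written to the socket in the same way as $out in the program from step 4.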

Tool to inspect Request and Response Headers

I use Live HTTP Headers (a plug-in for Firefox) to check Request and Response headers. Visit http://livehttpheaders.mozdev.org for more details. To install the plug-in, visit http://livehttpheaders.mozdev.org/installation.html and click the 'download it' link for the latest release. Please read the release notes before installing a particular version.

Please feel free to use the comments section below to share other tools that you use to monitor Request and Response Headers.

Future Maintenance of the program

From a maintenance perspective, you should monitor the page frequently and re-validate the HTML structure, because nothing in this world is constant, much less a website you don't control. Design changes could result in a change in the HTML code. I recommend scheduling this check at least once a month.
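A scheduled check can be as simple as counting how often the marker you depend on appears and warning when the count changes. The sketch below assumes a hypothetical function named structureWarning and an expected count of ten links, matching the ten entries in the markup from step 2; the two HTML strings are placeholders standing in for a freshly fetched page.

```php
<?php
// Hypothetical check: returns a warning when the marker count is off, or null when it matches.
function structureWarning(string $html, string $marker, int $expectedCount): ?string
{
    $found = substr_count($html, $marker);
    if ($found === $expectedCount) {
        return null;
    }
    return "expected $expectedCount occurrences of '$marker', found $found";
}

// Placeholder pages: one with the known layout, one after an imagined redesign.
$current    = str_repeat('<li><a href="?fr=fp-buzzmod">phrase</a></li>', 10);
$redesigned = str_repeat('<li class="item"><a href="?src=home">phrase</a></li>', 10);

echo structureWarning($current, 'fp-buzzmod">', 10) ?? 'structure unchanged', "\n";
echo structureWarning($redesigned, 'fp-buzzmod">', 10) ?? 'structure unchanged', "\n";
```

A warning from a check like this is the cue to view the page source again and repeat steps 2 and 3.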

That is all for this tutorial. In the next tutorial, I will guide you step by step through creating a program that checks yahoo.com search engine rankings.

Visit my blog to subscribe to get updated as the next article goes online.

Published at DZone with permission of its author, Sunil Bhatia.