Perl? I’ve seen those

So this morning I was asked to get content from a website out in plain text.

Visually, this means that the HTML code over on the left needs to be converted to straight text that can be viewed in Notepad without all of the tags.

As I have done some screen scraping in the past for other jobs, I am familiar with the concept of taking data from a terminal screen and working with it. I have not, however, done any sort of screen scraping for web content. 

My first step in the process is to ask what other people have used. I don’t want to re-invent the wheel if possible. As I am the only developer here, I find it useful to post questions on Twitter, as most of the people I follow have some attachment to the technical industry. I use it as an open messaging system, kind of like shouting down a hall and seeing who answers.

My first response was HTML::Strip. As someone who has been a Windows programmer for most of his career, focused on Microsoft based products (and not really web based platforms), this told me absolutely nothing. Google (or Bing… I’m trying…) tells me that this is essentially a Perl module. Huh?

So begins my morning’s quest for knowledge. I do a bit of digging, and I find that what I need to do first is get some type of Perl interpreter. These sorts of things come with Unix/Linux… but Microsoft pays for my life, so I use Windows. The top of my result list was fine for me, so I went with Strawberry Perl as my interpreter of choice.

As a bonus, I get a CPAN client, which is essentially a universal installer for modules that Perl scripts can use. Meaning that if you need to include a library that reads HTML pages and returns their content to you, you just tell CPAN the name of the library and it magically installs!

I need two things to get started:

  • HTML::Strip – the Perl library that strips the HTML markup out of a page’s content
  • LWP::Simple – the Perl library that fetches web pages over HTTP

After a bit more research I found that all I need to do is launch my CPAN client and, at its command prompt, run install HTML::Strip and then install LWP::Simple. It’s really just that simple! No messing around with installer files. It just works. Now I can write a script that consumes those libraries.

This post is getting long, so rather than drag on about the coding, here’s how we can scrape the text from a web page using Perl:

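#!/usr/bin/perl
use strict;
use warnings;

use HTML::Strip;    # strips markup out of HTML
use LWP::Simple;    # provides get() for fetching a URL

my $hs  = HTML::Strip->new();
my $url = "http://www.google.com";

# get() returns undef when the request fails
my $content = get($url);
die "Could not fetch $url\n" unless defined $content;

my $clean_text = $hs->parse($content);
$hs->eof;    # reset the parser state
print $clean_text;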

Done. That will write the text of our $url variable’s web site, minus the tags, to the screen. I save my script into a text file, C:\strawberry\perl\Scripts\Scraper.pl (I use .pl as the file extension only for convention’s sake).

To execute my script, I open the command prompt and type perl C:\strawberry\perl\Scripts\Scraper.pl and the result is printed to the command window.

Obviously, I still have some work to do to make my little script a viable solution:

  • Scrape a specific target area on the web page rather than the whole page
  • Loop through a list of pages to parse rather than just a single page

…but that’s the bulk of what I needed it to do. For all you Perl experts out there: I probably butchered your favorite language... sorry.
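For what it’s worth, here is a rough sketch of how those two to-do items might look. It is not what I ended up running, just one way it could go, under some assumptions: the URLs are passed on the command line, the helper names strip_html and target_area are made up for illustration, and the start and end markers are hypothetical placeholders for whatever surrounds the interesting part of your page. Cutting on plain string markers is also fragile (a nested </div> would end the area early), so treat it as a starting point only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Strip;
use LWP::Simple qw(get);

# Hypothetical markers around the part of the page we care about.
my ($START, $END) = ('<div id="content">', '</div>');

# Return the plain text of an HTML fragment (illustrative helper).
sub strip_html {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;
    return $text;
}

# Return the markup between the first $start marker and the next $end
# marker, or undef if either marker is missing (illustrative helper).
sub target_area {
    my ($html, $start, $end) = @_;
    my $from = index($html, $start);
    return if $from < 0;
    $from += length $start;
    my $to = index($html, $end, $from);
    return if $to < 0;
    return substr($html, $from, $to - $from);
}

# Loop through every URL given on the command line.
for my $url (@ARGV) {
    my $content = get($url);
    unless (defined $content) {
        warn "Could not fetch $url\n";
        next;
    }
    my $area = target_area($content, $START, $END);
    $area = $content unless defined $area;    # fall back to the whole page
    print "== $url ==\n", strip_html($area), "\n";
}
```

Saved over Scraper.pl, this would be run as perl C:\strawberry\perl\Scripts\Scraper.pl followed by one or more URLs. With no URLs it simply does nothing.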

 
