
Should I Use HTML::Parser or XML::Parser to Extract and Replace Text?

I want to extract all plain text from an HTML/XHTML document, analyse or amend it, and then replace it if needed. Can I do this using HTML::Parser or should it be XML::Pa

Solution 1:

HTML::Parser takes an approach based on tokens and callbacks. I find it very convenient when there are particularly complex conditions on the context in which the data you wish to extract or change occurs.
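As a rough illustration of a context-dependent condition, the sketch below rewrites PERL to Perl in text nodes only when they are not inside a <code> element, and copies everything else through unchanged. The sub name fix_text_outside_code and the sample markup are made up for this example; the handler argspecs (tagname, text) are the ones documented for HTML::Parser version 3.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser;

# Hypothetical helper: replace PERL with Perl in text that is not inside <code>.
sub fix_text_outside_code {
    my ($html) = @_;
    my $out     = '';
    my $in_code = 0;    # nesting depth of <code> elements

    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,
        # Comments, declarations and the like are copied through verbatim.
        default_h => [ sub { $out .= $_[0] }, 'text' ],
        start_h   => [
            sub {
                my ($tag, $text) = @_;
                $in_code++ if $tag eq 'code';
                $out .= $text;
            },
            'tagname, text',
        ],
        end_h => [
            sub {
                my ($tag, $text) = @_;
                $in_code-- if $tag eq 'code' && $in_code;
                $out .= $text;
            },
            'tagname, text',
        ],
        text_h => [
            sub {
                my ($text) = @_;
                $text =~ s/\bPERL\b/Perl/g unless $in_code;
                $out .= $text;
            },
            'text',
        ],
    );
    $p->parse($html);
    $p->eof;    # signal end of input so buffered text is flushed
    return $out;
}

# Made-up sample input.
my $sample = '<p>PERL rocks: <code>print "PERL"</code></p>';
print fix_text_outside_code($sample), "\n";
```

The state kept between callbacks (here a single counter) is what makes this style handy for conditions that a single XPath expression or regex would express awkwardly.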

Otherwise I prefer a tree-based approach. HTML::TreeBuilder::XPath (ultimately based on HTML::Parser) lets you find nodes with XPath and returns them as HTML::Element objects. The documentation is a little sparse (or rather, spread over a couple of modules), but it is still the quick way to dig into HTML.
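A minimal sketch of the tree-based style, assuming HTML::TreeBuilder::XPath is installed from CPAN. The helper texts_for and the sample markup are hypothetical; new_from_content, findnodes, as_text and delete come from HTML::TreeBuilder, HTML::TreeBuilder::XPath and HTML::Element.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Made-up sample input; a real script would read a file or fetch a page.
my $html = <<'HTML';
<html><body>
  <h1>About PERL</h1>
  <p class="intro">I like PERL.</p>
  <p>Other text.</p>
</body></html>
HTML

# Hypothetical helper: return the text of every node matched by an XPath expression.
sub texts_for {
    my ($source, $xpath) = @_;
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($source);
    my @texts = map { $_->as_text } $tree->findnodes($xpath);
    $tree->delete;    # release the tree explicitly once we are done with it
    return @texts;
}

print "$_\n" for texts_for($html, '//p[@class="intro"]');
```

Each match is an HTML::Element, so the methods documented there (as_text, attr, and so on) are available on the results.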

If you are dealing with pure XML, XML::Twig is an outstanding parser: it has very good memory management and lets you combine the tree and stream approaches. The documentation is very good, too.
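To give an idea of how XML::Twig mixes the two styles, here is a small sketch: the handler fires as soon as each <para> element has been parsed (stream-like), yet receives that element as a small tree. The sub name fix_para_text and the sample document are invented for the example, and note that set_text replaces the whole content of the element, so this only suits elements that contain nothing but text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Made-up sample document.
my $xml = '<doc><title>PERL notes</title><para>PERL is fun.</para></doc>';

# Hypothetical helper: rewrite PERL to Perl inside <para> elements only.
sub fix_para_text {
    my ($source) = @_;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called once each <para> element has been completely parsed.
            para => sub {
                my ($t, $elt) = @_;
                (my $text = $elt->text) =~ s/\bPERL\b/Perl/g;
                $elt->set_text($text);    # replaces all content of the element
            },
        },
    );
    $twig->parse($source);
    return $twig->sprint;    # serialise the (modified) document
}

print fix_para_text($xml), "\n";
```

For large documents, a handler can also call purge or flush on the twig to release the parts already processed; that is left out here to keep the sketch short.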

Solution 2:

Say you want to replace all instances of PERL with Perl on someone's Stack Overflow user page. You could do so with:

#! /usr/bin/perl

use warnings;
use strict;

use HTML::Parser;
use LWP::Simple;

my $html = get "http://stackoverflow.com/users/201469/phil-jackson";
die "$0: get failed" unless defined $html;

# Called for every start and end tag: $skipped is the text seen since the
# last reported event, $markup is the tag itself.
sub replace_text {
  my($skipped,$markup) = @_;
  $skipped =~ s/\bPERL\b/Perl/g;
  print $skipped, $markup;
}

my $p = HTML::Parser->new(
  api_version     => 3,
  marked_sections => 1,
  case_sensitive  => 1,
  unbroken_text   => 1,
  xml_mode        => 1,
  start_h => [ \&replace_text => "skipped_text, text" ],
  end_h   => [ \&replace_text => "skipped_text, text" ],
);

# your page may use a different encoding
binmode STDOUT, ":utf8" or die "$0: binmode: $!";
$p->parse($html);

The output is what we expect:

$ wget -O phil-jackson.html http://stackoverflow.com/users/201469
$ ./replace-text >out.html
$ diff -ub phil-jackson.html out.html
--- phil-jackson.html
+++ out.html
@@ -327,7 +327,7 @@

 PERL:  

-#$linkTrue =  &hellip; ">comparing PERL md5() and PHP md5()</a></h3>
+#$linkTrue =  &hellip; ">comparing Perl md5() and PHP md5()</a></h3>

         <div class="tags t-php t-perl t-md5">
             <a href="/questions/tagged/php" class="post-tag" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/perl" class="post-tag" title="show questions tagged 'perl'" rel="tag">perl</a> <a href="/questions/tagged/md5" class="post-tag" title="show questions tagged 'md5'" rel="tag">md5</a> 

The "PERL:" that still sticks out like a sore thumb was left alone because it is part of an element attribute, not a text section.

Solution 3:

You should also look at Web::Scraper. I find this module easier than the HTML::Parser modules, but it helps if you are familiar with XPath. Parsing HTML is very unpredictable and depends on the actual pages: HTML is oriented toward display, much like PDF, rather than toward data.

Solution 4:

Which module you should use depends on what you are trying to do. For starters, HTML::Parser comes with great examples, including a script that extracts plain text from an HTML document.
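The following is not the script shipped with HTML::Parser, just a minimal sketch of the same idea: collect decoded text and ignore whatever sits inside <script> and <style>. The sub name plain_text and the sample markup are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser;

# Hypothetical helper: return the visible text of an HTML string.
sub plain_text {
    my ($html) = @_;
    my $out    = '';
    my $inside = 0;    # depth inside <script> or <style>

    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,
        start_h => [ sub { $inside++ if $_[0] =~ /^(?:script|style)$/ }, 'tagname' ],
        end_h   => [
            sub { $inside-- if $inside && $_[0] =~ /^(?:script|style)$/ },
            'tagname',
        ],
        # dtext is the text with entities such as &amp; already decoded.
        text_h => [ sub { $out .= $_[0] unless $inside }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;    # flush any text still buffered by the parser
    return $out;
}

# Made-up sample input.
my $page = '<html><head><style>p { color: red }</style></head>'
         . '<body><p>Tom &amp; Jerry</p><script>var x = 1;</script></body></html>';
print plain_text($page), "\n";
```

Writing the amended text back requires keeping the markup as well, which is what the skipped_text approach in Solution 2 does.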

Do not try to parse HTML documents using an XML parser: You will find yourself in a world of pain as a lot of valid HTML constructs are not valid XML.

Do not try to parse XML documents using an HTML parser: You will lose all the advantages of the stricter requirement that an XML document be well formed before it can be parsed.
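To make the first warning concrete, here is a small sketch using XML::Parser, which dies on the first well-formedness error. The helper is_well_formed_xml and the two sample strings are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Parser;

# Hypothetical helper: true if the string parses as well-formed XML.
sub is_well_formed_xml {
    my ($source) = @_;
    # XML::Parser dies on malformed input, so trap the exception.
    return eval { XML::Parser->new->parse($source); 1 } ? 1 : 0;
}

my $html  = '<p>first line<br>second line</p>';     # acceptable HTML, unclosed <br>
my $xhtml = '<p>first line<br/>second line</p>';    # XML needs the closed form

for my $case ( [ html => $html ], [ xhtml => $xhtml ] ) {
    my ($name, $source) = @$case;
    printf "%-6s %s\n", "$name:",
        is_well_formed_xml($source) ? 'parses as XML' : 'rejected by XML parser';
}
```

An unclosed <br> is only one of many such constructs; unquoted attribute values and omitted closing tags fail in the same way.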
