# BEGIN SourceDeps(oneline):
BuildRequires: perl(Algorithm/NeedlemanWunsch.pm) perl(Class/Generate.pm) perl(Exporter.pm) perl(ExtUtils/MakeMaker.pm) perl(Fatal.pm) perl(HTML/Entities.pm) perl(HTML/Parser.pm) perl(IPC/Run3.pm) perl(LWP.pm) perl(LWP/UserAgent.pm) perl(Probe/Perl.pm) perl(Test/More.pm)
# END SourceDeps(oneline)
%define module_version 0.08
%define module_name HTML-ListScraper
%define _unpackaged_files_terminate_build 1
BuildRequires: rpm-build-perl perl-devel perl-podlators

Name: perl-%module_name
Version: 0.08
Release: alt1
Summary: generic web page scraping support
Group: Development/Perl
License: perl
Url: %CPAN %module_name

Source0: http://cpan.org.ua/authors/id/V/VB/VBAR/%{module_name}-%{module_version}.tar.gz
BuildArch: noarch

%description
While Perl has good support for extracting machine-friendly data from
HTML pages and is often used for that task, most such scripts are
ad hoc: they parse just one site's HTML and depend on superficial,
transient details of its structure, which makes them brittle and
labor-intensive to maintain. This module tries to support more generic
scraping for a class of pages: those whose most important part is a
list of links.

`HTML::ListScraper' is a subclass of `HTML::Parser', building on its
ability to convert an octet stream - whether strictly valid HTML or
something just vaguely similar to it - to tags and text. HTML parsing
works the same as with `HTML::Parser', except you don't need to
register your own HTML event handlers.

Once the document is parsed, call `find_sequences' to find out which
tags in the document repeat, one after the other, more than once
(attributes, text and comments are ignored for this comparison). Since
there will probably be quite a lot of such sequences,
`HTML::ListScraper' tries to find the "longest one repeating most
often"; specifically, it maximizes `log(number of non-overlapping
runs)*log(number of tags in the sequence)'. More than one sequence can
reach that maximum, which is why the method returns an array (and the
array can also be empty). Your application can then iterate over the
returned structure to find items of interest.

This module includes a script, `scrape', displaying the sequences
found by `HTML::ListScraper', so that you can see which items your
application needs - and if they aren't there, you can try to tweak
`HTML::ListScraper''s settings with the various `scrape' switches to
make it find more.

%prep
%setup -q -n %{module_name}-%{module_version}

%build
%perl_vendor_build

%install
%perl_vendor_install

%files
%doc Changes README
%perl_vendor_privlib/H*

%changelog
