# BEGIN SourceDeps(oneline):
BuildRequires: perl(ExtUtils/MakeMaker.pm) perl(HTML/TreeBuilder.pm) perl(Test/More.pm)
# END SourceDeps(oneline)
%define module_version 0.03
%define module_name HTML-ContentExtractor
%define _unpackaged_files_terminate_build 1
BuildRequires: rpm-build-perl perl-devel perl-podlators

Name: perl-%module_name
Version: 0.03
Release: alt1
Summary: extract the main content from a web page by analyzing the DOM tree
Group: Development/Perl
License: perl
Url: %CPAN %module_name

Source0: http://cpan.org.ua/authors/id/J/JZ/JZHANG/%module_name-%module_version.tar.gz
BuildArch: noarch

%description
Web pages often contain clutter (such as ads, unnecessary images and
extraneous links) around the body of an article that distracts a user
from the actual content. This module reduces the noise in web pages
and thereby identifies the content-rich regions.


A web page is first parsed by an HTML parser, which corrects the
markup and creates a DOM (Document Object Model) tree. A depth-first
traversal of the DOM tree then identifies and removes noise nodes,
leaving the main content. Useless nodes (script, style, etc.) are
removed. Container nodes (table, div, etc.) whose link/text ratio is
higher than the threshold are removed; the link/text ratio is the
ratio of the number of links to the number of non-linked words. Nodes
that contain any string from the predefined spam string list are also
removed.


Note that the input HTML must be encoded in UTF-8 (as must the spam
words), so that the module can handle web pages in any language (it
has been used to process English, Chinese, and Japanese web pages).

=over 4

=item $e = HTML::ContentExtractor->new(%%options);

Constructs a new HTML::ContentExtractor object. The optional
%%options hash can be used to set the options listed below.

=item $e->table_tags();

=item $e->table_tags(@tags);

=item $e->table_tags(\@tags);

Gets or sets the array of table tags. These tags are treated as
container tags.

=item $e->ignore_tags();

=item $e->ignore_tags(@tags);

=item $e->ignore_tags(\@tags);

Gets or sets the array of ignored tags. Elements with these tags are
removed.

=item $e->spam_words();

=item $e->spam_words(@strings);

=item $e->spam_words(\@strings);

Gets or sets the list of spam words. Elements that contain any of
these strings are removed.

=item $e->link_text_ratio();

=item $e->link_text_ratio($ratio);

Gets or sets the link/text ratio; the default is 0.05.

=item $e->min_text_len();

=item $e->min_text_len($len);

Gets or sets the minimum text length; the default is 20. If the text
of an element is shorter than this value, the element is removed.

=item $e->extract($url,$HTML);

Performs the extraction. Note that the input $HTML must be encoded
in UTF-8.

=item $e->as_html();

Returns the extraction result in HTML format.

=item $e->as_text();

Returns the extraction result in text format.

=back



%prep
%setup -n %module_name-%module_version

%build
%perl_vendor_build

%install
%perl_vendor_install

%files
%doc README Changes
%perl_vendor_privlib/H*

%changelog
