%define module_name Text-Shingle
# BEGIN SourceDeps(oneline):
BuildRequires: perl(ExtUtils/MakeMaker.pm) perl(Lingua/Sentence.pm) perl(Module/Build.pm) perl(Text/NGrammer.pm) perl(Unicode/Normalize.pm)
# END SourceDeps(oneline)
%define _unpackaged_files_terminate_build 1
BuildRequires: rpm-build-perl perl-devel perl-podlators

Name: perl-%module_name
Version: 0.07
Release: alt1
Summary: Pure Perl implementation of shingles for pieces of text
Group: Development/Perl
License: perl
Url: %CPAN %module_name

Source0: http://mirror.yandex.ru/mirrors/cpan/authors/id/N/NI/NIDS/%{module_name}-%{version}.tar.gz
BuildArch: noarch

%description
The module provides a way to extract shingles from a piece of text.  The shingles can then be used for other operations such as clustering and deduplication.

Given a document, its w-shingles are the set of sorted groups of *w* adjacent words in the text.  The parameter *w* is also called the *width* of the shingle.  For instance, the sentence "a rose is a rose" contains the following shingles of width 2, or 2-shingles: (a is), (is rose) and (a rose).  The words inside each shingle are sorted, so the adjacent words "rose is" yield the shingle (is rose).  Although the shingle "a rose" occurs twice in the text, it appears only once in the set of shingles.

Since w-shingles are very close relatives of n-grams, this module is built on top of Text::NGrammer, so it can break the text into sentences before shingling in such a way that the shingles do not cross sentence boundaries.  Moreover, the module provides a way to normalize the shingles so that tokens that look the same but are represented by different code points, e.g., composite accents vs. accented letters, collapse onto the same shingle.  The normalization, enabled by default, is done through the Unicode::Normalize module and uses the NFKC normalization form (details in http://www.unicode.org/reports/tr15/).

The output shingles are represented as strings in which the tokens are joined by the space character U+0020, the common space character that is also available in the ASCII set.  This choice was made for two reasons: first, the shingles are usually then used as tokens when computing distances, and this makes life a lot easier; second, breaking them back into their components only requires invoking "split".

%prep
%setup -q -n %{module_name}-%{version}

%build
%perl_vendor_build

%install
%perl_vendor_install

%files
%doc Changes
%perl_vendor_privlib/T*

%changelog
