# BEGIN SourceDeps(oneline):
BuildRequires: perl(CPAN.pm) perl(Carp.pm) perl(Config.pm) perl(Cwd.pm) perl(Exporter.pm) perl(ExtUtils/MM_Unix.pm) perl(ExtUtils/MakeMaker.pm) perl(ExtUtils/Manifest.pm) perl(File/Basename.pm) perl(File/Find.pm) perl(File/Spec.pm) perl(FileHandle.pm) perl(LWP/Simple.pm) perl(Module/Build.pm) perl(Net/FTP.pm) perl(Socket.pm) perl(Test/More.pm) perl(Test/Pod.pm) perl(Test/Pod/Coverage.pm) perl(YAML.pm) perl(inc/Module/Install.pm)
# END SourceDeps(oneline)
%define module_version 1.6
%define module_name String-REPartition
%define _unpackaged_files_terminate_build 1
BuildRequires: rpm-build-perl perl-devel perl-podlators

Name: perl-%module_name
Version: 1.6
Release: alt1
Summary: Generates a regex to partition a data set
Group: Development/Perl
License: perl
Url: %CPAN %module_name

Source0: http://cpan.org.ua/authors/id/A/AV/AVIF/%module_name-%module_version.tar.gz
BuildArch: noarch

%description
This module exports a single function -- make_partition_re.  It takes as its
first argument a number between 0 and 1, representing a percentage, and as
its second argument a reference to a list of strings.  It returns a regular
expression which is guaranteed to match the percentage of the strings in
the list represented by the number in the first argument.  More importantly,
the regex returned will *not* to match the rest of the string in the
list.  That is, if the inputs were '0.6' and a reference to a list of 100
strings, the returned regex would match 60 of the strings in the list and not
match the other 40. 

Keep in mind that, since only integer operations may be performed on these
strings, (that is, there cannot be a regex which matches a fraction of a
string), the target number is rounded down.  If you have 4 strings in your
list and a ratio of 0.4, the resulting regex will match 1 string, not 1.6
strings.  More interestingly, with 4 strings and a ratio of 0.1, the
resulting regex will almost certainly be `/^()$/' -- matching exactly 0 of
the strings in the list.  Furthermore, because of this rounding, the
returned regex may not match precisely the number expected by multiplying
the size of the list by the ratio, but instead be off by a small number
in either direction.

c<make_partition_re()> will return c<undef> on a failure, and print a warning
to STDERR if `$^W' is true.  Currently, the only errors that can occur relate
directly to the validity of the inputs.  Furthermore, if the strings in the
list are not unique, the behaviour of this function is not defined.  For a
small amout of repetition the regex should still work, but it should be clear 
that a solution cannot be found if the input list consists only of many copies
of the same string.

The function finds its solution in roughly O(N) time -- however, in worst cases,
I think it can get as high as O(N^2).  It's also true that certain types
of pathologically constructed data sets can break things and cause it either to
return an invalid regex or to enter an infinite loop.  While I haven't run
into any of this in my testing, I'm not confident that I've tested every
possibility.

So why would you want to use this?  Well, that's a question you'll have to
answer for yourself, mostly.  :)  However, the situations I envisaged while
developing the module were sort of like this: Imagine you have a large set of
data, indexed by a correspondingly large set of keywords.  Let's say you want 
to split this data into two partitions, perhaps in order to store the data 
in two separate locations.  Maintaining a complete list of the remote keys
could be expensive -- instead, you can simply store a regular expression
which matches the keys you keep remotely and does not match the local ones.

Another interesting feature is that a regex generated from a sufficiently 
large subset of your data will, approximately, match the appropriate percentage
of strings from the complete data set.  This means that you do not need to have
all of the data before you generate a regex to partition it.  As an example,
generating a regex from roughly 10%% of the words in /usr/dict/words (selected
randomly) gave me a regex that matched within .3%% of the desired result for all
of the words.


%prep
%setup -n %module_name-%module_version

%build
%perl_vendor_build

%install
%perl_vendor_install

%files
%doc Changes README
%perl_vendor_privlib/S*

%changelog
