mirror of https://github.com/DBD-SQLite/DBD-SQLite

move FTS documentation to a separate file

parent 4d03a92da4, commit b6d9f86716

4 changed files with 521 additions and 300 deletions

MANIFEST | 1 +

@@ -6,6 +6,7 @@ inc/Test/NoWarnings.pm
 inc/Test/NoWarnings/Warning.pm
 lib/DBD/SQLite.pm
 lib/DBD/SQLite/Cookbook.pod
+lib/DBD/SQLite/Fulltext_search.pod
 LICENSE
 Makefile.PL
 MANIFEST		This list of files

lib/DBD/SQLite.pm

@@ -2357,275 +2357,14 @@ need to call the L</create_collation> method directly.
 
 =head1 FULLTEXT SEARCH
 
-The FTS extension module within SQLite allows users to create special
-tables with a built-in full-text index (hereafter "FTS tables"). The
-full-text index allows the user to efficiently query the database for
-all rows that contain one or more instances of a specified word (hereafter
-a "token"), even if the table contains many large documents.
-
-=head2 Short introduction to FTS
-
-The first full-text search modules for SQLite were called C<FTS1> and C<FTS2>
-and are now obsolete. The latest recommended module is C<FTS4>; however
-the former module C<FTS3> is still supported.
-Detailed documentation for both C<FTS4> and C<FTS3> can be found
-at L<http://www.sqlite.org/fts3.html>, including explanations about the
-differences between these two versions.
-
-Here is a very short example of using FTS :
-
-  $dbh->do(<<"") or die DBI::errstr;
-  CREATE VIRTUAL TABLE fts_example USING fts4(content)
-
-  my $sth = $dbh->prepare("INSERT INTO fts_example(content) VALUES (?)");
-  $sth->execute($_) foreach @docs_to_insert;
-
-  my $results = $dbh->selectall_arrayref(<<"");
-  SELECT docid, snippet(fts_example) FROM fts_example WHERE content MATCH 'foo'
-
-The key points in this example are :
-
-=over
-
-=item *
-
-The syntax for creating FTS tables is
-
-  CREATE VIRTUAL TABLE <table_name> USING fts4(<columns>)
-
-where C<< <columns> >> is a list of column names. Columns may be
-typed, but the type information is ignored. If no columns
-are specified, the default is a single column named C<content>.
-In addition, FTS tables have an implicit column called C<docid>
-(or also C<rowid>) for numbering the stored documents.
-
-=item *
-
-Statements for inserting, updating or deleting records
-use the same syntax as for regular SQLite tables.
-
-=item *
-
-Full-text searches are specified with the C<MATCH> operator, and an
-operand which may be a single word, a word prefix ending with '*', a
-list of words, a "phrase query" in double quotes, or a boolean combination
-of the above.
-
-=item *
-
-The builtin function C<snippet(...)> builds a formatted excerpt of the
-document text, where the words pertaining to the query are highlighted.
-
-=back
-
-There are many more details to building and searching
-FTS tables, so we strongly invite you to read
-the full documentation at L<http://www.sqlite.org/fts3.html>.
-
-B<Incompatible change> :
-starting from version 1.31, C<DBD::SQLite> uses the new, recommended
-"Enhanced Query Syntax" for binary set operators (AND, OR, NOT, possibly
-nested with parenthesis). Previous versions of C<DBD::SQLite> used the
-"Standard Query Syntax" (see L<http://www.sqlite.org/fts3.html#section_3_2>).
-Unfortunately this is a compilation switch, so it cannot be tuned
-at runtime; however, since FTS3 was never advertised in versions prior
-to 1.31, the change should be invisible to the vast majority of
-C<DBD::SQLite> users. If, however, there are any applications
-that nevertheless were built using the "Standard Query" syntax,
-they have to be migrated, because the precedence of the C<OR> operator
-has changed. Conversion from old to new syntax can be
-automated through L<DBD::SQLite::FTS3Transitional>, published
-in a separate distribution.
-
-=head2 Tokenizers
-
-The behaviour of full-text indexes strongly depends on how
-documents are split into I<tokens>; therefore FTS table
-declarations can explicitly specify how to perform
-tokenization:
-
-  CREATE ... USING fts4(<columns>, tokenize=<tokenizer>)
-
-where C<< <tokenizer> >> is a sequence of space-separated
-words that triggers a specific tokenizer. Tokenizers can
-be SQLite builtins, written in C code, or Perl tokenizers.
-Both are explained below.
-
-=head3 SQLite builtin tokenizers
-
-SQLite comes with some builtin tokenizers (see
-L<http://www.sqlite.org/fts3.html#tokenizer>) :
-
-=over
-
-=item simple
-
-Under the I<simple> tokenizer, a term is a contiguous sequence of
-eligible characters, where eligible characters are all alphanumeric
-characters, the "_" character, and all characters with UTF codepoints
-greater than or equal to 128. All other characters are discarded when
-splitting a document into terms. They serve only to separate adjacent
-terms.
-
-All uppercase characters within the ASCII range (UTF codepoints less
-than 128), are transformed to their lowercase equivalents as part of
-the tokenization process. Thus, full-text queries are case-insensitive
-when using the simple tokenizer.
-
-=item porter
-
-The I<porter> tokenizer uses the same rules to separate the input
-document into terms, but as well as folding all terms to lower case it
-uses the Porter Stemming algorithm to reduce related English language
-words to a common root.
-
-=item icu
-
-The I<icu> tokenizer uses the ICU library to decide how to
-identify word characters in different languages; however, this
-requires SQLite to be compiled with the C<SQLITE_ENABLE_ICU>
-pre-processor symbol defined. So, to use this tokenizer, you need to
-edit F<Makefile.PL> to add this flag in C<@CC_DEFINE>, and then
-recompile C<DBD::SQLite>; of course, the prerequisite is to have
-an ICU library available on your system.
-
-=item unicode61
-
-The "unicode61" tokenizer works very much like "simple" except that it
-does full unicode case folding according to rules in Unicode Version
-6.1 and it recognizes unicode space and punctuation characters and
-uses those to separate tokens. By contrast, the simple tokenizer only
-does case folding of ASCII characters and only recognizes ASCII space
-and punctuation characters as token separators.
-
-By default, "unicode61" also removes all diacritics from Latin script
-characters. This behaviour can be overridden by adding the tokenizer
-argument "remove_diacritics=0". For example:
-
-  -- Create tables that remove diacritics from Latin script characters
-  -- as part of tokenization.
-  CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
-  CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");
-
-  -- Create a table that does not remove diacritics from Latin script
-  -- characters as part of tokenization.
-  CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
-
-Additional options can customize the set of codepoints that unicode61
-treats as separator characters or as token characters -- see the
-documentation in L<http://www.sqlite.org/fts3.html#unicode61>.
-
-=back
-
-If a more complex tokenizing algorithm is required, for example to
-implement stemming, discard punctuation, or to recognize compound words,
-use the perl tokenizer to implement your own logic, as explained below.
-
-=head3 Perl tokenizers
-
-In addition to the builtin SQLite tokenizers, C<DBD::SQLite>
-implements a I<perl> tokenizer, that can hook to any tokenizing
-algorithm written in Perl. This is specified as follows :
-
-  CREATE ... USING fts4(<columns>, tokenize=perl '<perl_function>')
-
-where C<< <perl_function> >> is a fully qualified Perl function name
-(i.e. prefixed by the name of the package in which that function is
-declared). So for example if the function is C<my_func> in the main
-program, write
-
-  CREATE ... USING fts4(<columns>, tokenize=perl 'main::my_func')
-
-That function should return a code reference that takes a string as
-single argument, and returns an iterator (another function), which
-returns a tuple C<< ($term, $len, $start, $end, $index) >> for each
-term. Here is a simple example that tokenizes on words according to
-the current perl locale
-
-  sub locale_tokenizer {
-    return sub {
-      my $string = shift;
-
-      use locale;
-      my $regex      = qr/\w+/;
-      my $term_index = 0;
-
-      return sub { # closure
-        $string =~ /$regex/g or return; # either match, or no more token
-        my ($start, $end) = ($-[0], $+[0]);
-        my $len           = $end-$start;
-        my $term          = substr($string, $start, $len);
-        return ($term, $len, $start, $end, $term_index++);
-      }
-    };
-  }
-
-There must be three levels of subs, in a kind of "Russian dolls" structure,
-because :
-
-=over
-
-=item *
-
-the external, named sub is called whenever accessing a FTS table
-with that tokenizer
-
-=item *
-
-the inner, anonymous sub is called whenever a new string
-needs to be tokenized (either for inserting new text into the table,
-or for analyzing a query).
-
-=item *
-
-the innermost, anonymous sub is called repeatedly for retrieving
-all terms within that string.
-
-=back
-
-Instead of writing tokenizers by hand, you can grab one of those
-already implemented in the L<Search::Tokenizer> module. For example,
-if you want to ignore differences between accented characters, you can
-write :
-
-  use Search::Tokenizer;
-  $dbh->do(<<"") or die DBI::errstr;
-  CREATE ... USING fts4(<columns>,
-                        tokenize=perl 'Search::Tokenizer::unaccent')
-
-Alternatively, you can use L<Search::Tokenizer/new> to build
-your own tokenizer. Here is an example that treats compound
-words (words with an internal dash or dot) as single tokens :
-
-  sub my_tokenizer {
-    return Search::Tokenizer->new(
-      regex => qr{\p{Word}+(?:[-./]\p{Word}+)*},
-    );
-  }
-
-=head2 Fts4aux - Direct Access to the Full-Text Index
-
-The content of a full-text index can be accessed through the
-virtual table module "fts4aux". For example, assuming that
-our database contains a full-text indexed table named "ft",
-we can declare :
-
-  CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft)
-
-and then query the C<ft_terms> table to access the
-list of terms, their frequency, etc.
-Examples are documented in
-L<http://www.sqlite.org/fts3.html#fts4aux>.
-
-=head2 Database space for FTS
-
-By default, FTS stores a complete copy of the indexed documents, together with
-the fulltext index. On a large collection of documents, this can
-consume quite a lot of disk space. However, FTS has some options
-for compressing the documents, or even for not storing them at all
--- see L<http://www.sqlite.org/fts3.html#fts4_options>.
+SQLite is bundled with an extension module for full-text
+indexing. Tables with this feature enabled can be efficiently queried
+to find rows that contain one or more instances of some specified
+words, in any column, even if the table contains many large documents.
+
+Explanations for using this feature are provided in a separate document:
+see L<DBD::SQLite::Fulltext_search>.
 
 =head1 R* TREE SUPPORT

lib/DBD/SQLite/Cookbook.pod

@@ -135,37 +135,6 @@ The function can then be used as:
   FROM results
   GROUP BY group_name;
 
-=head1 FTS fulltext indexing
-
-=head2 Sparing database disk space
-
-As explained in L<http://www.sqlite.org/fts3.html#fts4_options>,
-several options are available to specify how SQLite should store
-indexed documents.
-
-One strategy is to use SQLite only for the fulltext index and
-metadata, and keep the full documents outside of SQLite; to do so, use
-the C<content=""> option. For example, the following SQL creates
-an FTS4 table with three columns - "a", "b", and "c":
-
-  CREATE VIRTUAL TABLE t1 USING fts4(content="", a, b, c);
-
-Data can be inserted into such an FTS4 table using INSERT
-statements. However, unlike ordinary FTS4 tables, the user must supply
-an explicit integer docid value. For example:
-
-  -- This statement is Ok:
-  INSERT INTO t1(docid, a, b, c) VALUES(1, 'a b c', 'd e f', 'g h i');
-
-  -- This statement causes an error, as no docid value has been provided:
-  INSERT INTO t1(a, b, c) VALUES('j k l', 'm n o', 'p q r');
-
-Of course your application will need an algorithm for finding
-the external resource corresponding to any I<docid> stored within
-SQLite. Furthermore, SQLite C<offsets()> and C<snippet()> functions
-cannot be used, so if such functionality is needed, it has to be
-directly programmed within the Perl application.
-
 =head1 SUPPORT
 
@@ -192,8 +161,6 @@ Create a series of tests scripts that validate the cookbook recipes.
 
 Adam Kennedy E<lt>adamk@cpan.orgE<gt>
 
-Laurent Dami E<lt>dami@cpan.orgE<gt>
-
 =head1 COPYRIGHT
 
 Copyright 2009 - 2012 Adam Kennedy.

lib/DBD/SQLite/Fulltext_search.pod | 514 (new file)

@@ -0,0 +1,514 @@

=head1 NAME

DBD::SQLite::Fulltext_search - Using fulltext searches with DBD::SQLite

=head1 DESCRIPTION

=head2 Introduction

SQLite is bundled with an extension module called "FTS" for full-text
indexing. Tables with this feature enabled can be efficiently queried
to find rows that contain one or more instances of some specified
words (also called "tokens"), in any column, even if the table contains many
large documents.

The first full-text search modules for SQLite were called C<FTS1> and C<FTS2>
and are now obsolete. The latest version is C<FTS4>, but it shares many
features with the former module C<FTS3>, which is why parts of the
API and parts of the documentation still refer to C<FTS3>; from a client
point of view, both can be considered largely equivalent.
Detailed documentation can be found
at L<http://www.sqlite.org/fts3.html>.

=head2 Short example

Here is a very short example of using FTS :

  $dbh->do(<<"") or die DBI::errstr;
  CREATE VIRTUAL TABLE fts_example USING fts4(content)

  my $sth = $dbh->prepare("INSERT INTO fts_example(content) VALUES (?)");
  $sth->execute($_) foreach @docs_to_insert;

  my $results = $dbh->selectall_arrayref(<<"");
  SELECT docid, snippet(fts_example) FROM fts_example WHERE content MATCH 'foo'

The key points in this example are :

=over

=item *

The syntax for creating FTS tables is

  CREATE VIRTUAL TABLE <table_name> USING fts4(<columns>)

where C<< <columns> >> is a list of column names. Columns may be
typed, but the type information is ignored. If no columns
are specified, the default is a single column named C<content>.
In addition, FTS tables have an implicit column called C<docid>
(or also C<rowid>) for numbering the stored documents.

=item *

Statements for inserting, updating or deleting records
use the same syntax as for regular SQLite tables.

=item *

Full-text searches are specified with the C<MATCH> operator, and an
operand which may be a single word, a word prefix ending with '*', a
list of words, a "phrase query" in double quotes, or a boolean combination
of the above.

=item *

The builtin function C<snippet(...)> builds a formatted excerpt of the
document text, where the words pertaining to the query are highlighted.

=back

There are many more details to building and searching
FTS tables, so we strongly invite you to read
the full documentation at L<http://www.sqlite.org/fts3.html>.

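Each element of C<$results> in the example above is an arrayref holding a
C<docid> and the corresponding snippet; a one-line way to display them
(an editorial sketch, not part of the original example) :

  printf "doc %d: %s\n", @$_ for @$results;
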
=head1 QUERY SYNTAX

Here are some explanations about FTS queries, borrowed from
the SQLite documentation.

=head2 Token or token prefix queries

An FTS table may be queried for all documents that contain a specified
term, or for all documents that contain a term with a specified
prefix. The query expression for a specific term is simply the term
itself. The query expression used to search for a term prefix is the
prefix itself with a '*' character appended to it. For example:

  -- Virtual table declaration
  CREATE VIRTUAL TABLE docs USING fts3(title, body);

  -- Query for all documents containing the term "linux":
  SELECT * FROM docs WHERE docs MATCH 'linux';

  -- Query for all documents containing a term with the prefix "lin".
  SELECT * FROM docs WHERE docs MATCH 'lin*';

If a search token (on the right-hand side of the MATCH operator)
begins with "^", then that token must be the first in its field of
the document: so for example C<^lin*> matches
'linux kernel changes ...' but does not match 'new linux implementation'.

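For instance (a sketch following the prose above; this statement is not
in the original document) :

  -- Query for all documents whose matched field starts with a term
  -- having prefix "lin" (e.g. 'linux kernel changes ...').
  SELECT * FROM docs WHERE docs MATCH '^lin*';
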
=head2 Column specifications

Normally, a token or token prefix query is matched against the FTS
table column specified as the right-hand side of the MATCH
operator. Or, if the special column with the same name as the FTS
table itself is specified, against all columns. This may be overridden
by specifying a column-name followed by a ":" character before a basic
term query. There may be space between the ":" and the term to query
for, but not between the column-name and the ":" character. For
example:

  -- Query the database for documents for which the term "linux" appears in
  -- the document title, and the term "problems" appears in either the title
  -- or body of the document.
  SELECT * FROM docs WHERE docs MATCH 'title:linux problems';

  -- Query the database for documents for which the term "linux" appears in
  -- the document title, and the term "driver" appears in the body of the document
  -- ("driver" may also appear in the title, but this alone will not satisfy the
  -- query criteria).
  SELECT * FROM docs WHERE body MATCH 'title:linux driver';

=head2 Phrase queries

A phrase query is a query that retrieves all documents that contain a
nominated set of terms or term prefixes in a specified order with no
intervening tokens. Phrase queries are specified by enclosing a space
separated sequence of terms or term prefixes in double quotes ("). For
example:

  -- Query for all documents that contain the phrase "linux applications".
  SELECT * FROM docs WHERE docs MATCH '"linux applications"';

  -- Query for all documents that contain a phrase that matches "lin* app*".
  -- As well as "linux applications", this will match common phrases such
  -- as "linoleum appliances" or "link apprentice".
  SELECT * FROM docs WHERE docs MATCH '"lin* app*"';

=head2 NEAR queries

A NEAR query is a query that returns documents that contain two or
more nominated terms or phrases within a specified proximity of each
other (by default with 10 or less intervening terms). A NEAR query is
specified by putting the keyword "NEAR" between two phrase, term or
prefix queries. To specify a proximity other than the default, an
operator of the form "NEAR/<N>" may be used, where <N> is the maximum
number of intervening terms allowed. For example:

  -- Virtual table declaration.
  CREATE VIRTUAL TABLE docs USING fts4();

  -- Virtual table data.
  INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');

  -- Search for a document that contains the terms "sqlite" and "database" with
  -- not more than 10 intervening terms. This matches the only document in
  -- table docs (since there are only six terms between "SQLite" and "database"
  -- in the document).
  SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR database';

  -- Search for a document that contains the terms "sqlite" and "database" with
  -- not more than 6 intervening terms. This also matches the only document in
  -- table docs. Note that the order in which the terms appear in the document
  -- does not have to be the same as the order in which they appear in the query.
  SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';

  -- Search for a document that contains the terms "sqlite" and "database" with
  -- not more than 5 intervening terms. This query matches no documents.
  SELECT * FROM docs WHERE docs MATCH 'database NEAR/5 sqlite';

  -- Search for a document that contains the phrase "ACID compliant" and the term
  -- "database" with not more than 2 terms separating the two. This matches the
  -- document stored in table docs.
  SELECT * FROM docs WHERE docs MATCH 'database NEAR/2 "ACID compliant"';

  -- Search for a document that contains the phrase "ACID compliant" and the term
  -- "sqlite" with not more than 2 terms separating the two. This also matches
  -- the only document stored in table docs.
  SELECT * FROM docs WHERE docs MATCH '"ACID compliant" NEAR/2 sqlite';

More than one NEAR operator may appear in a single query. In this case
each pair of terms or phrases separated by a NEAR operator must appear
within the specified proximity of each other in the document. Using
the same table and data as in the block of examples above:

  -- The following query selects documents that contain an instance of the term
  -- "sqlite" separated by two or fewer terms from an instance of the term "acid",
  -- which is in turn separated by two or fewer terms from an instance of the term
  -- "relational".
  SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR/2 acid NEAR/2 relational';

  -- This query matches no documents. There is an instance of the term "sqlite" with
  -- sufficient proximity to an instance of "acid" but it is not sufficiently close
  -- to an instance of the term "relational".
  SELECT * FROM docs WHERE docs MATCH 'acid NEAR/2 sqlite NEAR/2 relational';

Phrase and NEAR queries may not span multiple columns within a row.

=head2 Set operations

The three basic query types described above may be used to query the
full-text index for the set of documents that match the specified
criteria. Using the FTS query expression language it is possible to
perform various set operations on the results of basic queries. There
are currently three supported operations:

=over

=item *

The AND operator determines the intersection of two sets of documents.

=item *

The OR operator calculates the union of two sets of documents.

=item *

The NOT operator may be used to compute the relative complement of one
set of documents with respect to another.

=back

The AND, OR and NOT binary set operators must be entered using capital
letters; otherwise, they are interpreted as basic term queries instead
of set operators. Each of the two operands to an operator may be a
basic FTS query, or the result of another AND, OR or NOT set
operation. Parentheses may be used to control precedence and grouping.

The AND operator is implicit for adjacent basic queries without any
explicit operator. For example, the query expression "implicit
operator" is a more succinct version of "implicit AND operator".

Boolean operations as just described correspond to the so-called
"enhanced query syntax" of SQLite; this is the version compiled
with C<DBD::SQLite>, starting from version 1.31.
A former version, called the "standard query syntax", used to
support tokens prefixed with '+' or '-' signs (for token inclusion
or exclusion); if your application needs to support this old
syntax, use L<DBD::SQLite::FTS3Transitional> (published
in a separate distribution) for doing the conversion.

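As a short combined illustration (an editorial sketch reusing the C<docs>
table from above, not an example from the SQLite documentation) :

  -- Documents containing "sqlite" and at least one of "database" or
  -- "library", minus those containing the term "mysql".
  SELECT * FROM docs WHERE docs MATCH 'sqlite AND (database OR library) NOT mysql';
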
=head1 TOKENIZERS

=head2 Concept

The behaviour of full-text indexes strongly depends on how
documents are split into I<tokens>; therefore FTS table
declarations can explicitly specify how to perform
tokenization:

  CREATE ... USING fts4(<columns>, tokenize=<tokenizer>)

where C<< <tokenizer> >> is a sequence of space-separated
words that triggers a specific tokenizer. Tokenizers can
be SQLite builtins, written in C code, or Perl tokenizers.
Both are explained below.

=head2 SQLite builtin tokenizers

SQLite comes with some builtin tokenizers (see
L<http://www.sqlite.org/fts3.html#tokenizer>) :

=over

=item simple

Under the I<simple> tokenizer, a term is a contiguous sequence of
eligible characters, where eligible characters are all alphanumeric
characters, the "_" character, and all characters with UTF codepoints
greater than or equal to 128. All other characters are discarded when
splitting a document into terms. They serve only to separate adjacent
terms.

All uppercase characters within the ASCII range (UTF codepoints less
than 128) are transformed to their lowercase equivalents as part of
the tokenization process. Thus, full-text queries are case-insensitive
when using the simple tokenizer.

=item porter

The I<porter> tokenizer uses the same rules to separate the input
document into terms, but as well as folding all terms to lower case it
uses the Porter Stemming algorithm to reduce related English language
words to a common root.

=item icu

The I<icu> tokenizer uses the ICU library to decide how to
identify word characters in different languages; however, this
requires SQLite to be compiled with the C<SQLITE_ENABLE_ICU>
pre-processor symbol defined. So, to use this tokenizer, you need to
edit F<Makefile.PL> to add this flag in C<@CC_DEFINE>, and then
recompile C<DBD::SQLite>; of course, the prerequisite is to have
an ICU library available on your system.

=item unicode61

The I<unicode61> tokenizer works very much like "simple" except that it
does full unicode case folding according to rules in Unicode Version
6.1 and it recognizes unicode space and punctuation characters and
uses those to separate tokens. By contrast, the simple tokenizer only
does case folding of ASCII characters and only recognizes ASCII space
and punctuation characters as token separators.

By default, "unicode61" also removes all diacritics from Latin script
characters. This behaviour can be overridden by adding the tokenizer
argument C<"remove_diacritics=0">. For example:

  -- Create tables that remove diacritics from Latin script characters
  -- as part of tokenization.
  CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
  CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");

  -- Create a table that does not remove diacritics from Latin script
  -- characters as part of tokenization.
  CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");

Additional options can customize the set of codepoints that unicode61
treats as separator characters or as token characters -- see the
documentation in L<http://www.sqlite.org/fts3.html#unicode61> and the
example right after this list.

=back

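For instance, extra arguments can turn '.' and '=' into token characters
and 'X' into a separator (adapted from the SQLite documentation; verify
against the page linked above) :

  CREATE VIRTUAL TABLE txt4 USING fts4(
      tokenize=unicode61 "tokenchars=.=" "separators=X");
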
If a more complex tokenizing algorithm is required, for example to
implement stemming, discard punctuation, or to recognize compound words,
use the perl tokenizer to implement your own logic, as explained below.

=head2 Perl tokenizers

=head3 Declaring a perl tokenizer

In addition to the builtin SQLite tokenizers, C<DBD::SQLite>
implements a I<perl> tokenizer, that can hook to any tokenizing
algorithm written in Perl. This is specified as follows :

  CREATE ... USING fts4(<columns>, tokenize=perl '<perl_function>')

where C<< <perl_function> >> is a fully qualified Perl function name
(i.e. prefixed by the name of the package in which that function is
declared). So for example if the function is C<my_func> in the main
program, write

  CREATE ... USING fts4(<columns>, tokenize=perl 'main::my_func')

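Putting it together through DBI -- a minimal sketch (the table name
C<books> and its columns are invented for illustration; C<main::my_func>
must exist before rows are inserted or queried) :

  $dbh->do(<<"") or die DBI::errstr;
  CREATE VIRTUAL TABLE books USING fts4(title, author,
                                        tokenize=perl 'main::my_func')
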
=head3 Writing a perl tokenizer by hand

That function should return a code reference that takes a string as
single argument, and returns an iterator (another function), which
returns a tuple C<< ($term, $len, $start, $end, $index) >> for each
term. Here is a simple example that tokenizes on words according to
the current perl locale

  sub locale_tokenizer {
    return sub {
      my $string = shift;

      use locale;
      my $regex      = qr/\w+/;
      my $term_index = 0;

      return sub { # closure
        $string =~ /$regex/g or return; # either match, or no more token
        my ($start, $end) = ($-[0], $+[0]);
        my $len           = $end-$start;
        my $term          = substr($string, $start, $len);
        return ($term, $len, $start, $end, $term_index++);
      }
    };
  }

There must be three levels of subs, in a kind of "Russian dolls" structure,
because :

=over

=item *

the external, named sub is called whenever accessing a FTS table
with that tokenizer

=item *

the inner, anonymous sub is called whenever a new string
needs to be tokenized (either for inserting new text into the table,
or for analyzing a query).

=item *

the innermost, anonymous sub is called repeatedly for retrieving
all terms within that string.

=back

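To see the three layers in action outside of SQLite, one can drive the
iterator by hand -- a small editorial sketch reusing C<locale_tokenizer>
from above :

  my $string_tokenizer = locale_tokenizer();               # outer sub
  my $iter = $string_tokenizer->("Hello, brave new world"); # middle sub
  while (my ($term, $len, $start, $end, $ix) = $iter->()) { # innermost sub
    print "token $ix: '$term' at offsets $start..$end\n";
  }
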
=head3 Using Search::Tokenizer

Instead of writing tokenizers by hand, you can grab one of those
already implemented in the L<Search::Tokenizer> module. For example,
if you want to ignore differences between accented characters, you can
write :

  use Search::Tokenizer;
  $dbh->do(<<"") or die DBI::errstr;
  CREATE ... USING fts4(<columns>,
                        tokenize=perl 'Search::Tokenizer::unaccent')

Alternatively, you can use L<Search::Tokenizer/new> to build
your own tokenizer. Here is an example that treats compound
words (words with an internal dash or dot) as single tokens :

  sub my_tokenizer {
    return Search::Tokenizer->new(
      regex => qr{\p{Word}+(?:[-./]\p{Word}+)*},
    );
  }

=head1 Fts4aux - Direct Access to the Full-Text Index

The content of a full-text index can be accessed through the
virtual table module "fts4aux". For example, assuming that
our database contains a full-text indexed table named "ft",
we can declare :

  CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft)

and then query the C<ft_terms> table to access the
list of terms, their frequency, etc.
Examples are documented in
L<http://www.sqlite.org/fts3.html#fts4aux>.

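For instance, a minimal sketch of listing the most frequent terms through
C<ft_terms> (the column names C<term>, C<documents> and C<occurrences>,
and the C<col = '*'> aggregate rows, are as described in the fts4aux
documentation linked above) :

  my $terms = $dbh->selectall_arrayref(
    "SELECT term, documents, occurrences FROM ft_terms
     WHERE col = '*' ORDER BY occurrences DESC LIMIT 10");
  printf "%-20s %5d docs %6d occurrences\n", @$_ for @$terms;
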
=head1 How to spare database space

By default, FTS stores a complete copy of the indexed documents,
together with the fulltext index. On a large collection of documents,
this can consume quite a lot of disk space. However, FTS has some
options for compressing the documents, or even for not storing them at
all -- see L<http://www.sqlite.org/fts3.html#fts4_options>.

In particular, the option for I<contentless FTS tables> only stores
the fulltext index, without the original document content. This is
specified as C<content="">, like in the following example :

  CREATE VIRTUAL TABLE t1 USING fts4(content="", a, b)

Data can be inserted into such an FTS4 table using INSERT
statements. However, unlike ordinary FTS4 tables, the user must supply
an explicit integer docid value. For example:

  -- This statement is Ok:
  INSERT INTO t1(docid, a, b) VALUES(1, 'a b c', 'd e f');

  -- This statement causes an error, as no docid value has been provided:
  INSERT INTO t1(a, b) VALUES('j k l', 'm n o');

Of course your application will need an algorithm for finding
the external resource corresponding to any I<docid> stored within
SQLite.

When using placeholders, the docid must be explicitly typed to
INTEGER, because this is a "hidden column" for which SQLite
is not able to automatically infer the proper type. So the following
doesn't work :

  my $sth = $dbh->prepare("INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)");
  $sth->execute(2, 'aa', 'bb'); # constraint error

but it works with an explicit cast :

  my $sql = "INSERT INTO t1(docid, a, b) VALUES(CAST(? AS INTEGER), ?, ?)";
  my $sth = $dbh->prepare($sql);
  $sth->execute(2, 'aa', 'bb');

or with an explicitly typed L<DBI/bind_param> :

  use DBI qw/SQL_INTEGER/;
  my $sql = "INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)";
  my $sth = $dbh->prepare($sql);
  $sth->bind_param(1, 2, SQL_INTEGER);
  $sth->bind_param(2, "aa");
  $sth->bind_param(3, "bb");
  $sth->execute();

It is not possible to UPDATE or DELETE a row stored in a contentless
FTS4 table. Attempting to do so is an error.

Contentless FTS4 tables also support SELECT statements. However, it is
an error to attempt to retrieve the value of any table column other
than the docid column. The auxiliary function C<matchinfo()> may be
used, but C<snippet()> and C<offsets()> may not, so if such
functionality is needed, it has to be directly programmed within the
Perl application.

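Since C<matchinfo()> is the one auxiliary function that remains available,
here is an editorial sketch of retrieving it (per the SQLite docs the
returned blob is an array of 32-bit unsigned integers in machine byte
order; the C<V*> template assumes a little-endian machine) :

  my ($blob) = $dbh->selectrow_array(
    "SELECT matchinfo(t1) FROM t1 WHERE t1 MATCH 'a'");
  my @ints = unpack "V*", $blob;  # counts for the default 'pcx' format
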
=head1 AUTHOR

Laurent Dami E<lt>dami@cpan.orgE<gt>

=head1 COPYRIGHT

Copyright 2014 Laurent Dami.

Some parts borrowed from the L<http://sqlite.org> documentation, copyright 2014.

This documentation is in the public domain; you can redistribute
it and/or modify it under the same terms as Perl itself.