The TAB optimisation in CQPweb & BNCweb

A discussion on Twitter during the recent Corpus Linguistics 2021 conference led Andrew Hardie and me to consider an optimisation for simple queries in CQPweb and BNCweb (using CEQL notation). This TAB optimisation provides up to 10x faster execution of CEQL queries for fixed phrases or part-of-speech patterns such as these:

  in front of
  New York City
  is n't it \?
  _JJ* times
  _VB* _VBG

Queries that skip optional tokens between the elements can also be optimised, e.g. “{cat} *** {dog}” (but neither “{cat} * * {dog}” nor “in front of +”).

The best news is that this optimisation is immediately available for all CQPweb and BNCweb servers, as well as other Web interfaces that build on the CEQL implementation from the CWB/Perl package.  Instructions for enabling the optimisation on existing CQPweb and BNCweb installations can be found at the bottom of this post.

What is the TAB optimisation?

CWB’s corpus query processor CQP matches its finite-state queries in a strict left-to-right manner, using index lookup for the first token expression. Taking the CEQL query “in front of” as an example, CQP will find all occurrences of “in” in the corpus using its index, and then check for each instance whether the following words are “front” and “of”. This is inefficient, of course, because the first element “in” has a very high frequency, so a large number of candidate positions must be checked.
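To make the cost of this strategy concrete, here is a minimal Python sketch of left-to-right matching over a toy corpus. It is only an illustration and not CQP's actual implementation: the functions build_index and match_left_to_right are hypothetical names, and the "index" is a plain dictionary of position lists.

```python
from collections import defaultdict

def build_index(tokens):
    """Map each word form to the sorted list of corpus positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index

def match_left_to_right(tokens, index, phrase):
    """Look up the FIRST word in the index, then verify the remaining words
    token by token. Returns the match start positions and the number of
    candidate positions that had to be checked."""
    hits = []
    checked = 0
    for start in index.get(phrase[0], []):
        checked += 1
        if tokens[start:start + len(phrase)] == phrase:
            hits.append(start)
    return hits, checked

# Toy corpus (an assumption for illustration only)
corpus = "he stood in front of the house in the rain in front of us".split()
index = build_index(corpus)
hits, checked = match_left_to_right(corpus, index, ["in", "front", "of"])
print(hits, checked)  # [2, 10] 3
```

Every occurrence of the first word counts as a candidate, so the work grows with the frequency of “in” rather than with the number of actual matches.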

Most search engines (such as Lucene) as well as more recent corpus query processors building on search engine technology (such as BlackLab or LexiDB, whose benchmarks ultimately led to the TAB optimisation in CQPweb) execute such queries in a more flexible and efficient manner: (a) by choosing the most selective element (“front”) for index lookup and then checking whether the previous and following words are “in” and “of”, respectively; or (b) by looking up all three elements in the index and then performing an efficient intersection of the result lists. To our knowledge, SketchEngine performs similar optimisations for CQP syntax queries.
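Strategies (a) and (b) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code of any of the systems mentioned: all function names are hypothetical, the index is a dictionary of position lists, and the intersection uses Python sets where a real engine would typically merge sorted position lists.

```python
from collections import defaultdict

def build_index(tokens):
    """Map each word form to the sorted list of corpus positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index

def match_most_selective(tokens, index, phrase):
    """Strategy (a): anchor on the rarest word, then check its neighbours."""
    k = min(range(len(phrase)), key=lambda i: len(index.get(phrase[i], [])))
    hits = []
    for pos in index.get(phrase[k], []):
        start = pos - k  # where the phrase would have to begin
        if start >= 0 and tokens[start:start + len(phrase)] == phrase:
            hits.append(start)
    return hits

def match_by_intersection(index, phrase):
    """Strategy (b): shift each position list back to the phrase start
    and intersect the shifted lists."""
    candidates = None
    for offset, word in enumerate(phrase):
        starts = {pos - offset for pos in index.get(word, [])}
        candidates = starts if candidates is None else candidates & starts
    return sorted(candidates or [])

# Toy corpus (an assumption for illustration only)
corpus = "he stood in front of the house in the rain in front of us".split()
index = build_index(corpus)
phrase = ["in", "front", "of"]
print(match_most_selective(corpus, index, phrase))  # [2, 10]
print(match_by_intersection(index, phrase))         # [2, 10]
```

Both functions return the same start positions as left-to-right matching; the difference lies in how many corpus positions have to be inspected when the first word is very frequent.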

CQP does not attempt any such optimisation, partly because it allows for very flexible and complex query expressions that are difficult to rewrite automatically and partly because it is implemented in old-fashioned C89, so the necessary matching and transformation of syntax trees for query expressions would be extremely painful. However, recent versions of CQP (v3.4.30 and newer) provide a special query mode for fixed phrases – TAB queries – that implements strategy (b). See Sec. 8.5 of the CQP Manual for details.

The TAB optimisation for simple queries in CEQL notation detects fixed-phrase patterns that can be matched with TAB queries and translates them into TAB query expressions instead of regular (“finite-state”) CQP queries.
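The same position-list approach extends to patterns with optional tokens between the elements, such as {cat} *** {dog}. The following Python sketch assumes that *** stands for up to three optional tokens and that sorted position lists for the two elements are already available from an index. The function match_with_gap is a hypothetical name, and the sketch simply lists all admissible pairs, which is not necessarily how CQP reports matches for TAB queries.

```python
from bisect import bisect_left

def match_with_gap(first_positions, second_positions, max_gap):
    """Pair each position p of the first element with every position q of the
    second element such that at most max_gap tokens lie between them,
    i.e. p + 1 <= q <= p + 1 + max_gap. Both lists must be sorted."""
    pairs = []
    for p in first_positions:
        lo = bisect_left(second_positions, p + 1)
        hi = bisect_left(second_positions, p + max_gap + 2)
        for q in second_positions[lo:hi]:
            pairs.append((p, q))
    return pairs

# Hypothetical position lists for the lemmas "cat" and "dog"
cat_positions = [3, 20]
dog_positions = [4, 9, 22, 30]
print(match_with_gap(cat_positions, dog_positions, max_gap=3))  # [(3, 4), (20, 22)]
```

Because both lists are sorted, each lookup is a binary search, so the cost depends on the lengths of the two position lists rather than on the size of the corpus.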

How to enable it

Prerequisites

  • Make sure you have installed CWB v3.4.30 or newer for a fully functional and tested implementation of TAB queries.
  • In most cases, you will also need the CEQL implementation from the CWB/Perl interface v3.0.7 or newer. This Perl module translates CEQL queries into CQP syntax (including the special TAB and MU query modes where appropriate).

CQPweb v3.3

Recent versions of CQPweb include a PHP reimplementation of CEQL queries and do not rely on the CWB/Perl interface any more. If you are already running CQPweb v3.3, simply update it to the latest version from the SVN repository, make sure it is using CQP v3.4.30+, and add this line to lib/config.inc.php to enable the TAB optimisation:

$use_ceql_tab_optimisation = true;

CQPweb v3.2

If you’re still stuck with CQPweb v3.2 (like myself), you can also benefit from the TAB optimisation via the CEQL implementation in CWB/Perl.  Make sure that you have upgraded both CWB and CWB/Perl to the required versions and that CQPweb is actually using these versions.

Then you need to make two small changes. First, add this line to lib/config.inc.php in order to make sure that CQPweb uses the CWB/Perl implementation of CEQL rather than its own PHP code:

$use_the_new_ceql = false;

Second, patch the file lib/perl/cqpwebCEQL.pm by adding this line to the parameter settings in the new() method (should be approx. line 90):

$self->SetParam("tab_optimisation", 1);

BNCweb

You can enable the TAB optimisation even on an old BNCweb server (provided that you have updated to a recent release of the XML edition of BNCweb). Make sure you meet the requirements as described above, then patch the file cgi-bin/processQuery.pl by modifying the simple_query() function as follows:

sub simple_query {
  my $query = shift;
  my $parser = new bncCEQL;

  $parser->SetParam("default_ignore_case", !$case_sensitive);
  $parser->SetParam("tab_optimisation", 1); # ADD THIS LINE FOR TAB OPTIMISATION
  $cqp_query = $parser->Parse($query);
  unless (defined $cqp_query) {
    my $theQueryQ = $bncHandle->quote($query);
    $sqlQuery = "insert into history (user, query, hits, simple_query, queryMode) values ('$username', $theQueryQ, -1, $theQueryQ, 'simple')";
    $bncHandle->do($sqlQuery) or bark("Can't insert $sqlQuery: " . $bncHandle::errstr);
    cqp_error_handler_full("<b>Error in Simple Query Syntax</b>", $parser->HtmlErrorMessage);
  }
  return $cqp_query;
}