This document contains notes about the design of bzr search.

Naming documents
++++++++++++++++

I plan to use a single namespace for all documents: revisions, inventories,
file texts, branch tags, etc.

Currently this is:
('r', '', revision_id)
('f', file_id, revision_id)
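
A minimal sketch of building keys in this namespace follows. The helper names
are hypothetical; only the two tuple layouts come from the list above.

```python
def revision_key(revision_id):
    """Key for a revision document: ('r', '', revision_id)."""
    return ('r', '', revision_id)


def file_text_key(file_id, revision_id):
    """Key for a file text document: ('f', file_id, revision_id)."""
    return ('f', file_id, revision_id)


# Both kinds of document live in the same set because the first element
# of the tuple distinguishes them.
keys = {revision_key('rev-1'), file_text_key('readme-id', 'rev-1')}
```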

What to index
+++++++++++++

I think we can sensibly index revisions, tree-shapes, and file texts. One very
interesting question is how to index deltas - for instance should searching for
'foo bar' return only the document-versions where 'foo bar' was introduced? Or
those where it was merged (as bzr annotate does), or any document-version where
'foo bar' is present at all? In the short term, I'm punting on a real answer
to this because it requires serious thought, and I want to deliver something
for people to play with, so I'm going to simply annotate every text, and take
only the terms from lines ascribed to the version being indexed. This will
require some care to avoid either being pathologically slow, or blowing out
memory on large trees.

The current file text indexing solution uses make_mpdifs. This should be
efficient as it has been tuned for bundle use. It allows us to only index
new-in-a-version terms (though still at line granularity).
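
To illustrate what "new-in-a-version terms at line granularity" means, here is
a rough sketch. It does not use make_mpdifs; it stands in a plain set
membership test on lines, and the function name and the tokenising regex are
assumptions made for the example.

```python
import re


def new_terms(parent_texts, new_text):
    """Return terms from lines of new_text that appear in no parent text."""
    parent_lines = set()
    for text in parent_texts:
        parent_lines.update(text.splitlines())
    terms = set()
    for line in new_text.splitlines():
        if line not in parent_lines:
            # The whole line is treated as new, so every term on it is
            # indexed, even terms that already occur on older lines.
            terms.update(re.findall(r'\w+', line.lower()))
    return terms


example = new_terms(["foo bar\nbaz\n"], "foo bar\nfoo qux\n")
```

Here 'foo' is reported as new even though it already exists in the parent,
because it sits on a changed line. That is the line granularity caveat.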

Indexing bzr.dev stabilises at 180MB for me. (Very) large trees have been
observed to need 1GB of RAM. If thrashing occurs, change the group size in
index.py down from 2500 revisions - the memory drop is roughly proportional to
that figure.

Indexing non-text
=================

I think it would be good to have a filtering capability so that e.g. openoffice
files can generate term-lists. This relates to the 'index deltas' question in
that external filters are likely n

Indexing paths
==============

There are several use cases for path indices:

 * Mapping fileid, revisionid search hits back to paths for display during hits
   on documents. This appears to need a concordance of the entire inventory
   state for every file hit we can produce.
 * Searching for a path ("show me the file README.txt")
 * Accelerating per-file log (by identifying path-change events which the 
   file graph index elides, and-or

Mapping fileid, revisionids to paths
------------------------------------

If we insert path ('p', '', PATH) as a document, and as terms insert the
fileid, revisionid pair as a phrase, then we can search for the phrase
fileid, revisionid to return the paths that match - and by virtue of our data
model that will be unique. 
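
A toy sketch of this mapping, using an in-memory dict in place of the real
index. The function names and the dict are assumptions for illustration; the
('p', '', PATH) key shape and the (fileid, revisionid) phrase are from the
paragraph above.

```python
# phrase (file_id, revision_id) -> set of path document keys
path_documents = {}


def index_path(path, file_id, revision_id):
    """Insert the path document with (file_id, revision_id) as its phrase."""
    phrase = (file_id, revision_id)
    path_documents.setdefault(phrase, set()).add(('p', '', path))


def paths_for(file_id, revision_id):
    """Search for the phrase and return the matching paths."""
    hits = path_documents.get((file_id, revision_id), ())
    return sorted(key[2] for key in hits)


index_path('README.txt', 'readme-id', 'rev-1')
index_path('docs/README.txt', 'readme-id', 'rev-2')
```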

Indexing Phrases
================

For now, a naive approach is being used: there is a phrase-index for N-term
phrases. It contains tuples of terms, mapping to a posting list for the phrase.
For generic any-N phrase support we need an N->phrase_N_index mapping. However
to bootstrap the process the simplest approach of adding a start,length tuple
to the names value has been used.
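
As a sketch of the naive approach, the function below records every run of N
consecutive terms as a tuple that maps to a posting set of document ids. The
function name and the use of a dict and sets are assumptions; the on-disk
layout is not modelled here.

```python
def add_phrases(phrase_index, doc_id, terms, n=2):
    """Add every n-term phrase in terms to phrase_index for doc_id."""
    for i in range(len(terms) - n + 1):
        phrase = tuple(terms[i:i + n])
        phrase_index.setdefault(phrase, set()).add(doc_id)


phrase_index = {}
add_phrases(phrase_index, 1, ['foo', 'bar', 'baz'])
add_phrases(phrase_index, 2, ['bar', 'baz'])
```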

Possible future improvements to this are to:
 
 * Use termid tuples rather than terms (to save space)


How to index
++++++++++++

To allow incremental index operations, indices should be committed to regularly
during the index process. To allow cheap 'what is not indexed' queries, we can
use revision ids as an 'indexed' flag in the index. So probably within <batch
size> we should index by file_id (for annotation efficiency) and across <batch
size> we effectively index topologically (though I think partial indices are a 
really good idea to allow automatic-updates).
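
A rough sketch of the batching idea, assuming the revision ids arrive in
topological order and that a set stands in for the 'indexed' flag. The
function and parameter names are hypothetical, and index_batch is whatever
callable does the real per-batch work (for example, grouping by file_id).

```python
def index_in_batches(revision_ids, indexed, index_batch, batch_size=2500):
    """Index revisions not yet flagged, committing after each batch.

    Returns the number of revisions that needed indexing.
    """
    pending = [rev for rev in revision_ids if rev not in indexed]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        index_batch(batch)
        # Commit point: once flagged, an interrupted run can resume
        # from the next batch rather than starting over.
        indexed.update(batch)
    return len(pending)


seen_batches = []
already_indexed = {'r2'}
count = index_in_batches(['r1', 'r2', 'r3', 'r4', 'r5'], already_indexed,
                         seen_batches.append, batch_size=2)
```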

Where to store indices
++++++++++++++++++++++

There are many possible places to store indices. Ideally they get deleted when
a branch is deleted, can be shared between branches, update automatically etc.
For now, I'm going to store them in .bzr/bzr-search when the bzrdir is a
MetaDir, and just refuse to index other branches.  Storing in bzr-search is a
bit of an abstraction violation, but we have a split-out structure for a
reason, and by prefixing it with bzr-search I leave 'search' available for a
'core' version of this feature.

When indexing a bzr-svn branch (actual svn branch, not a branch created *from*
svn), the index is stored in
$bzrconfig/bzr-search/svn-lookaside/SVN_UUID/BRANCHPATH.

Search engine to use
++++++++++++++++++++

I looked at: pylucence (java dependency, hard to integrate into our VCS),
xapwrap (abandonware), lupy (abandonware). So in the interest of expediency,
I'm writing a trivial boolean term index for this project.


Disk Format
+++++++++++

I've modelled this after a pack repository.

So we have a lock dir to control commits.

A names file listing the indices.

Second cut - in progress
========================

Each name in the names file is a pack and the start, length coordinates in the
pack for the three top level indices that make up a component. These are the
revisions index, the documents index and the terms index. These can be
recreated by scanning a component's pack, if they were to get lost.

The pack contains these top level indices and a set of posting list indices.

Posting list
------------

A GraphIndex(0, 1)
(doc_id -> "")

This allows set based querying to identify document ids that are matches. If we
have already selected (say) ids 45 and 200, there is no need to examine all the
document ids in a particular posting list.
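
A small sketch of that point, with a dict of doc_id -> "" standing in for the
GraphIndex. The function name is an assumption. Only the already-selected ids
are probed, so the cost follows the candidate count rather than the posting
list size.

```python
def filter_candidates(candidates, posting_list):
    """Keep the candidate ids that are present in posting_list."""
    return {doc_id for doc_id in candidates if doc_id in posting_list}


# Stand-in for a posting list index: doc_id -> ""
posting_list = {12: "", 45: "", 99: "", 150: ""}
selected = filter_candidates({45, 200}, posting_list)
```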

Document ids
------------

A GraphIndex(0, 1)
(doc_id -> doc_key)

Document ids are a proxy for a specific document in the corpus. Actual corpus
identifiers are long (tens of bytes), so using a simple serial number for each
document we refer to keeps the index size compact. Once a search is complete the
selected document ids are resolved into document keys within the branch - and
they can then be used to show the results to users.
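
A minimal in-memory sketch of the serial number scheme. The class and method
names are hypothetical, and the real index stores the mapping in a GraphIndex
rather than in Python lists and dicts.

```python
class DocumentIds:
    """Assign compact serial ids to long document keys."""

    def __init__(self):
        self._id_by_key = {}
        self._key_by_id = []

    def id_for(self, doc_key):
        """Return the id for doc_key, allocating the next serial if new."""
        if doc_key not in self._id_by_key:
            self._id_by_key[doc_key] = len(self._key_by_id)
            self._key_by_id.append(doc_key)
        return self._id_by_key[doc_key]

    def key_for(self, doc_id):
        """Resolve a document id back to its document key."""
        return self._key_by_id[doc_id]


ids = DocumentIds()
first = ids.id_for(('r', '', 'rev-1'))
second = ids.id_for(('f', 'readme-id', 'rev-1'))
```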


Terms index
-----------

A GraphIndex(0, 1)
(term,) -> document_count, start, end of the terms posting list in the pack.

This allows the construction of a posting list index object for the term, which
is then used to whittle down/expand the set of documents to return.

The document count is used in one fairly cheap optimisation - we select the
term with the smallest number of references to start our search, in the hope
that that will reduce the total IO commensurately.
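
A sketch of that optimisation for an all-terms-must-match query. For brevity
the terms index here maps a term straight to its document count, dropping the
start and end offsets, and load_posting_list is a hypothetical callable that
returns the set of document ids for a term.

```python
def search_all(terms_index, load_posting_list, query_terms):
    """Return doc ids matching every query term, rarest term first."""
    if not query_terms:
        return set()
    if any(term not in terms_index for term in query_terms):
        return set()
    # Start from the term with the fewest documents so the working set
    # is as small as possible before other posting lists are read.
    ordered = sorted(query_terms, key=lambda term: terms_index[term])
    result = set(load_posting_list(ordered[0]))
    for term in ordered[1:]:
        if not result:
            break
        result &= load_posting_list(term)
    return result


terms_index = {'foo': 3, 'bar': 1}
postings = {'foo': {1, 2, 3}, 'bar': {2}}
load_order = []


def load(term):
    load_order.append(term)
    return postings[term]


matches = search_all(terms_index, load, ['foo', 'bar'])
```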

Revisions index
---------------
(revision,) -> nothing

This index is used purely to detect whether a revision is already indexed - if
it is we don't need to index it again.
