+ +compress_inverted_index
Compresses an inverted index
Usage: ../../../build/bin/compress_inverted_index [OPTIONS]

Options:
  -h,--help                      Print this help message and exit
  -c,--collection TEXT REQUIRED  Uncompressed index basename
  -o,--output TEXT REQUIRED      Output inverted index
  --check                        Check the correctness of the index
  -e,--encoding TEXT REQUIRED    Index encoding
  -w,--wand TEXT                 Needs: --scorer  WAND data filename
  -s,--scorer TEXT               Needs: --wand --quantize  Scorer function
  --bm25-k1 FLOAT                Needs: --scorer  BM25 k1 parameter.
  --bm25-b FLOAT                 Needs: --scorer  BM25 b parameter.
  --pl2-c FLOAT                  Needs: --scorer  PL2 c parameter.
  --qld-mu FLOAT                 Needs: --scorer  QLD mu parameter.
  --quantize                     Needs: --scorer  Quantizes the scores
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                       Configuration .ini file
Compresses an inverted index from the uncompressed format using one of the integer encodings.

The input to this command is an uncompressed version of the inverted index described here. The --collection option takes the basename of the uncompressed index.

The postings are compressed using one of the available integer encodings, defined by --encoding. The available encoding values are:
- block_interpolative: Binary Interpolative Coding
- ef: Elias-Fano
- block_maskedvbyte: MaskedVByte
- block_optpfor: OptPForDelta
- pef: Partitioned Elias-Fano
- block_qmx: QMX
- block_simdbp: SIMD-BP128
- block_simple8b: Simple8b
- block_simple16: Simple16
- block_streamvbyte: StreamVByte
- block_varintg8iu: Varint-G8IU
- block_varintgb: Varint-GB

At the time of compressing the index, you can replace frequencies with quantized precomputed scores. To do so, you must pass the --quantize flag, plus some additional options (see the example after this list):
- --scorer: scoring function used to calculate the scores (bm25, dph, pl2, qld)
- --wand: metadata filename path
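For example, a quantized index might be built like this (a sketch only; the paths and the block_simdbp encoding are placeholders for your own files and any supported encoding):

$ ./bin/compress_inverted_index \
    -c path/to/uncompressed_index \
    -o path/to/index.quantized.block_simdbp \
    -e block_simdbp \
    --quantize \
    --scorer bm25 \
    --wand path/to/index.wand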
Computes intersections of posting lists.

Usage: ../../../build/bin/compute_intersection [OPTIONS]

Options:
  -h,--help                    Print this help message and exit
  -e,--encoding TEXT REQUIRED  Index encoding
  -i,--index TEXT REQUIRED     Inverted index filename
  -w,--wand TEXT REQUIRED      WAND data filename
  --compressed-wand            Needs: --wand  Compressed WAND data file
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0          Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT             Path to file containing a list of stop words to filter out
  -q,--queries TEXT            Path to file with queries
  --terms TEXT                 Term lexicon
  --weighted                   Weights scores by query frequency
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                     Configuration .ini file
  --combinations               Compute intersections for combinations of terms in query
  --max-term-count,--mtc FLOAT Needs: --combinations  Max number of terms when computing combinations
  --min-query-len UINT         Minimum query length
  --max-query-len UINT         Maximum query length
  --header                     Write TSV header
Computes an intersection of posting lists given by the input queries. It takes a file with queries and outputs the documents in the intersection of the posting lists. See queries for more details on the input parameters.
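For instance, a hypothetical invocation (file names are placeholders) that computes intersections of all term pairs in each query could look like this:

$ ./bin/compute_intersection \
    -e block_simdbp \
    -i path/to/index.block_simdbp \
    -w path/to/index.wand \
    -q path/to/queries.txt \
    --terms path/to/fwd.termlex \
    --combinations --mtc 2 --header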
Extracts posting counts from an inverted index.
Usage: ../../../build/bin/count-postings [OPTIONS]

Options:
  -h,--help                    Print this help message and exit
  -e,--encoding TEXT REQUIRED  Index encoding
  -i,--index TEXT REQUIRED     Inverted index filename
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0          Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT             Path to file containing a list of stop words to filter out
  -q,--queries TEXT            Path to file with queries
  --terms TEXT                 Term lexicon
  --weighted                   Weights scores by query frequency
  --sep TEXT                   Separator string
  --query-id                   Print query ID at the beginning of each line, separated by a colon
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                     Configuration .ini file
  --sum                        Sum postings across the query terms; by default, individual list lengths will be printed, separated by the separator defined with --sep
Extracts posting counts from an inverted index.

It sums up posting counts for each query term after parsing. See parse_collection for more details about parsing options.
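A sketch of a typical invocation (the paths are placeholders) that prints one summed posting count per query:

$ ./bin/count-postings \
    -e block_simdbp \
    -i path/to/index.block_simdbp \
    -q path/to/queries.txt \
    --terms path/to/fwd.termlex \
    --query-id --sum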
Creates additional data for query processing.
Usage: ../../../build/bin/create_wand_data [OPTIONS]

Options:
  -h,--help                      Print this help message and exit
  -c,--collection TEXT REQUIRED  Collection basename
  -o,--output TEXT REQUIRED      Output filename
  --compress                     Compress additional data
  --quantize                     Quantize scores
  -s,--scorer TEXT REQUIRED      Scorer function
  --bm25-k1 FLOAT                Needs: --scorer  BM25 k1 parameter.
  --bm25-b FLOAT                 Needs: --scorer  BM25 b parameter.
  --pl2-c FLOAT                  Needs: --scorer  PL2 c parameter.
  --qld-mu FLOAT                 Needs: --scorer  QLD mu parameter.
  --range                        Excludes: --block-size --lambda  Create docid-range based data
  --terms-to-drop TEXT           A filename containing a list of term IDs that we want to drop
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                       Configuration .ini file

[Option Group: blocks]
  [At least 1 of the following options are required]
  Options:
    -b,--block-size FLOAT        Excludes: --lambda --range  Block size for fixed-length blocks
    -l,--lambda FLOAT            Excludes: --block-size --range  Lambda parameter for variable blocks
Creates additional data needed for certain query algorithms.

Algorithms such as WAND and MaxScore (among others) need more data than available in posting lists alone. This includes max scores for each term, as well as max scores for ranges of posting lists that can be used as skip lists.

Refer to queries for details about scoring functions.
Each posting list is divided into blocks, and each block gets a precomputed max score. These blocks can be either of equal size throughout the index, defined by --block-size, or variable, based on the lambda parameter --lambda. [TODO: Explanation needed]
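For example, fixed-size blocks of 64 postings might be requested like this (a sketch; the paths, scorer, and block size are placeholder choices):

$ ./bin/create_wand_data \
    -c path/to/uncompressed_index \
    -o path/to/index.wand \
    -s bm25 \
    -b 64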
Retrieves query results in TREC format.
Usage: ../../../build/bin/evaluate_queries [OPTIONS]

Options:
  -h,--help                    Print this help message and exit
  -e,--encoding TEXT REQUIRED  Index encoding
  -i,--index TEXT REQUIRED     Inverted index filename
  -w,--wand TEXT REQUIRED      WAND data filename
  --compressed-wand            Needs: --wand  Compressed WAND data file
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0          Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT             Path to file containing a list of stop words to filter out
  -q,--queries TEXT            Path to file with queries
  --terms TEXT                 Term lexicon
  --weighted                   Weights scores by query frequency
  -k INT REQUIRED              The number of top results to return
  -a,--algorithm TEXT REQUIRED Query processing algorithm
  -s,--scorer TEXT REQUIRED    Scorer function
  --bm25-k1 FLOAT              Needs: --scorer  BM25 k1 parameter.
  --bm25-b FLOAT               Needs: --scorer  BM25 b parameter.
  --pl2-c FLOAT                Needs: --scorer  PL2 c parameter.
  --qld-mu FLOAT               Needs: --scorer  QLD mu parameter.
  -T,--thresholds TEXT         File containing query thresholds
  -j,--threads UINT            Number of threads
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                     Configuration .ini file
  -r,--run TEXT                Run identifier
  --documents TEXT REQUIRED    Document lexicon
  --quantized                  Quantized scores
Returns results for the given queries. The results are printed in the TREC format. See queries for a detailed description of the input parameters.

To print out the string identifiers of the documents (titles), you must provide the document lexicon with --documents.
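Putting it together, a hypothetical run could look like this (all paths are placeholders, and the wand algorithm is just one possible choice):

$ ./bin/evaluate_queries \
    -e block_simdbp \
    -i path/to/index.block_simdbp \
    -w path/to/index.wand \
    -a wand -s bm25 -k 1000 \
    -q path/to/topics.txt \
    --terms path/to/fwd.termlex \
    --documents path/to/fwd.doclex \
    -r my-run > my-run.trec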
Extracts max-scores for query terms from an inverted index.

The max-scores will be printed to the output separated by --sep, which is a tab by default.
Usage: ../../../build/bin/extract-maxscores [OPTIONS]

Options:
  -h,--help                Print this help message and exit
  -w,--wand TEXT REQUIRED  WAND data filename
  --compressed-wand        Needs: --wand  Compressed WAND data file
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0      Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT         Path to file containing a list of stop words to filter out
  -q,--queries TEXT        Path to file with queries
  --terms TEXT             Term lexicon
  --weighted               Weights scores by query frequency
  --sep TEXT               Separator string
  --query-id               Print query ID at the beginning of each line, separated by a colon
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                 Configuration .ini file
  --quantized              Quantized scores
A tool for converting queries from several formats to PISA queries.

Usage: ../../../build/bin/extract_topics [OPTIONS]

Options:
  -h,--help                 Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                  Configuration .ini file
  -i,--input TEXT REQUIRED  TREC query input file
  -o,--output TEXT REQUIRED Output basename
  -f,--format TEXT REQUIRED Input format
  -u,--unique               Unique queries
Constructs an inverted index from a forward index.

Usage: ../../../build/bin/invert [OPTIONS]

Options:
  -h,--help                 Print this help message and exit
  -i,--input TEXT REQUIRED  Forward index basename
  -o,--output TEXT REQUIRED Output inverted index basename
  --term-count FLOAT        Number of distinct terms in the forward index
  -j,--threads UINT         Number of threads
  --batch-size UINT=100000  Number of documents to process at a time
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                  Configuration .ini file
A tool for performing threshold estimation using the k-highest impact score for each term, pair or triple of a query. Pairs and triples are only used if provided with --pairs and --triples respectively.

Usage: ../../../build/bin/kth_threshold [OPTIONS]

Options:
  -h,--help                    Print this help message and exit
  -e,--encoding TEXT REQUIRED  Index encoding
  -i,--index TEXT REQUIRED     Inverted index filename
  -w,--wand TEXT REQUIRED      WAND data filename
  --compressed-wand            Needs: --wand  Compressed WAND data file
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0          Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT             Path to file containing a list of stop words to filter out
  -q,--queries TEXT            Path to file with queries
  --terms TEXT                 Term lexicon
  --weighted                   Weights scores by query frequency
  -k INT REQUIRED              The number of top results to return
  -s,--scorer TEXT REQUIRED    Scorer function
  --bm25-k1 FLOAT              Needs: --scorer  BM25 k1 parameter.
  --bm25-b FLOAT               Needs: --scorer  BM25 b parameter.
  --pl2-c FLOAT                Needs: --scorer  PL2 c parameter.
  --qld-mu FLOAT               Needs: --scorer  QLD mu parameter.
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                     Configuration .ini file
  -p,--pairs TEXT              Excludes: --all-pairs  A tab separated file containing all the cached term pairs
  -t,--triples TEXT            Excludes: --all-triples  A tab separated file containing all the cached term triples
  --all-pairs                  Excludes: --pairs  Consider all term pairs of a query
  --all-triples                Excludes: --triples  Consider all term triples of a query
  --quantized                  Quantizes the scores
Build, print, or query lexicon

Usage: ../../../build/bin/lexicon [OPTIONS] SUBCOMMAND

Options:
  -h,--help       Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config        Configuration .ini file

Subcommands:
  build           Build a lexicon
  lookup          Retrieve the payload at index
  rlookup         Retrieve the index of payload
  print           Print elements line by line
A tool for transforming textual queries to IDs.

Usage: ../../../build/bin/map_queries [OPTIONS]

Options:
  -h,--help          Print this help message and exit
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0  Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT   Path to file containing a list of stop words to filter out
  -q,--queries TEXT  Path to file with queries
  --terms TEXT       Term lexicon
  --weighted         Weights scores by query frequency
  --sep TEXT         Separator string
  --query-id         Print query ID at the beginning of each line, separated by a colon
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config           Configuration .ini file
parse_collection - parse collection and store as forward index.

Usage: ../../../build/bin/parse_collection [OPTIONS] [SUBCOMMAND]

Options:
  -h,--help                  Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  -j,--threads UINT          Number of threads
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0        Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT           Path to file containing a list of stop words to filter out
  --config                   Configuration .ini file
  -o,--output TEXT REQUIRED  Forward index filename
  -b,--batch-size INT=100000 Number of documents to process in one thread
  -f,--format TEXT=plaintext Input format

Subcommands:
  merge  Merge previously produced batch files. When the parsing process was killed during merging, use this command to finish merging without having to restart building batches.
Partition a forward index

Usage: ../../../build/bin/partition_fwd_index [OPTIONS]

Options:
  -h,--help                  Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                   Configuration .ini file
  -i,--input TEXT REQUIRED   Forward index filename
  -o,--output TEXT REQUIRED  Basename of partitioned shards
  -j,--threads INT           Thread count
  -r,--random-shards INT     Excludes: --shard-files  Number of random shards
  -s,--shard-files TEXT ...  Excludes: --random-shards  List of files with shard titles
Benchmarks queries on a given index.

Usage: ../../../build/bin/queries [OPTIONS]

Options:
  -h,--help                    Print this help message and exit
  -e,--encoding TEXT REQUIRED  Index encoding
  -i,--index TEXT REQUIRED     Inverted index filename
  -w,--wand TEXT               WAND data filename
  --compressed-wand            Needs: --wand  Compressed WAND data file
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0          Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT             Path to file containing a list of stop words to filter out
  -q,--queries TEXT            Path to file with queries
  --terms TEXT                 Term lexicon
  --weighted                   Weights scores by query frequency
  -k INT REQUIRED              The number of top results to return
  -a,--algorithm TEXT REQUIRED Query processing algorithm
  -s,--scorer TEXT REQUIRED    Scorer function
  --bm25-k1 FLOAT              Needs: --scorer  BM25 k1 parameter.
  --bm25-b FLOAT               Needs: --scorer  BM25 b parameter.
  --pl2-c FLOAT                Needs: --scorer  PL2 c parameter.
  --qld-mu FLOAT               Needs: --scorer  QLD mu parameter.
  -T,--thresholds TEXT         File containing query thresholds
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                     Configuration .ini file
  --quantized                  Quantized scores
  --extract                    Extract individual query times
  --safe                       Needs: --thresholds  Rerun if not enough results with pruning.
Runs query benchmarks.

Executes each query on the given index multiple times, and takes the minimum of those times as the final value. Then, it aggregates statistics across all queries.

This program takes a compressed index as its input along with a file containing the queries (one per line). Note that you need to specify the correct index encoding with the --encoding option, as this is currently not stored in the index. If the index is quantized, you must pass the --quantized flag.

For certain types of retrieval algorithms, you will also need to pass the so-called "WAND file", which contains some metadata like skip lists and max scores.

There are several parameters you can define to instruct the program on how to parse and process the input queries, including which tokenizer to use, whether to strip HTML from the query, and a list of token filters (such as stemmers). For a more comprehensive description, see parse_collection. You can also pass a file containing stop words, which will be excluded from the parsed queries.

In order for the parsing to actually take place, you also need to provide the term lexicon with --terms. If it is not defined, the queries will be interpreted as lists of document IDs.

You can specify which retrieval algorithm to use with --algorithm. Furthermore, the -k option defines how many results to retrieve for each query.

Use the --scorer option to define which scoring function you want to use (bm25, dph, pl2, qld). Some scoring functions have additional parameters that you may override; see the help message above.

You can also pass a file with a list of initial score thresholds. Any documents that evaluate to a score below this value will be excluded. This can speed up the algorithm, but if the threshold is too high, it may exclude some of the relevant top-k results. If you want to always ensure that the results are as if the initial threshold was zero, you can pass the --safe flag. It forces recomputing the entire query without an initial threshold if it is detected that relevant documents have been excluded. This may be useful if you have mostly accurate threshold estimates but still need the safety: even though some queries will be slower, most will be much faster, thus improving overall throughput and average latency.
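As an illustration, a benchmark run with initial thresholds and the safety fallback might look like this (a sketch with placeholder paths; wand is just one possible algorithm):

$ ./bin/queries \
    -e block_simdbp \
    -i path/to/index.block_simdbp \
    -w path/to/index.wand \
    -a wand -s bm25 -k 10 \
    -q path/to/queries.txt \
    --terms path/to/fwd.termlex \
    --thresholds path/to/thresholds.txt --safe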
Reads binary collection to stdout.

Usage: ../../../build/bin/read_collection [OPTIONS] [SUBCOMMAND]

Options:
  -h,--help                      Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                       Configuration .ini file
  -c,--collection TEXT REQUIRED  Collection file path.
  --maptext TEXT                 Excludes: --maplex  ID to string mapping in text file format. Line n is the string associated with ID n. E.g., if used to read a document from a forward index, this would be the `.terms` file, which maps term IDs to their string representations.
  --maplex TEXT                  Excludes: --maptext  ID to string mapping in lexicon binary file format. E.g., if used to read a document from a forward index, this would be the `.termlex` file, which maps term IDs to their string representations.

Subcommands:
  entry  Reads single entry.
  range  Reads a range of entries.
Reassigns the document IDs.

Usage: ../../../build/bin/reorder-docids [OPTIONS]

Options:
  -h,--help                      Print this help message and exit
  -c,--collection TEXT REQUIRED  Collection basename
  -o,--output TEXT               Output basename
  --documents TEXT               Document lexicon
  --reordered-documents TEXT     Needs: --documents  Reordered document lexicon
  --seed UINT                    Needs: --random  Random seed.
  --store-fwdidx TEXT            Needs: --recursive-graph-bisection  Output basename (forward index)
  --fwdidx TEXT                  Needs: --recursive-graph-bisection  Use this forward index
  -m,--min-len UINT              Needs: --recursive-graph-bisection  Minimum list threshold
  -d,--depth FLOAT:INT in [1 - 64]
                                 Needs: --recursive-graph-bisection Excludes: --node-config  Recursion depth
  --node-config TEXT             Needs: --recursive-graph-bisection Excludes: --depth  Node configuration file
  --nogb                         Needs: --recursive-graph-bisection  No VarIntGB compression in forward index
  -p,--print                     Needs: --recursive-graph-bisection  Print ordering to standard output
  -j,--threads UINT              Number of threads
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                       Configuration .ini file

[Option Group: methods]
  [Exactly 1 of the following options is required]
  Options:
    --random                     Needs: --output  Assign IDs randomly. You can use --seed for deterministic results.
    --from-mapping TEXT          Use the mapping defined in this new-line delimited text file
    --by-feature TEXT            Order by URLs from this file
    --recursive-graph-bisection,--bp
                                 Use recursive graph bisection algorithm
A tool for sampling an inverted index.

Usage: ../../../build/bin/sample_inverted_index [OPTIONS]

Options:
  -h,--help                      Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                       Configuration .ini file
  -c,--collection TEXT REQUIRED  Input collection basename
  -o,--output TEXT REQUIRED      Output collection basename
  -r,--rate FLOAT REQUIRED       Sampling rate (proportional size of the output index)
  -t,--type TEXT REQUIRED        Sampling type
  --terms-to-drop TEXT           A filename containing a list of term IDs that we want to drop
  --seed UINT                    Seed state
Filters selective queries for a given index.

Usage: ../../../build/bin/selective_queries [OPTIONS]

Options:
  -h,--help                    Print this help message and exit
  -e,--encoding TEXT REQUIRED  Index encoding
  -i,--index TEXT REQUIRED     Inverted index filename
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0          Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT             Path to file containing a list of stop words to filter out
  -q,--queries TEXT            Path to file with queries
  --terms TEXT                 Term lexicon
  --weighted                   Weights scores by query frequency
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                     Configuration .ini file
Executes commands for shards.

Usage: ../../../build/bin/shards [OPTIONS] SUBCOMMAND

Options:
  -h,--help         Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config          Configuration .ini file

Subcommands:
  invert            Constructs an inverted index from a forward index.
  reorder-docids    Reorder document IDs.
  compress          Compresses an inverted index
  wand-data         Creates additional data for query processing.
  taily-stats       Extracts Taily statistics from the index and stores it in a file.
  taily-score       Computes Taily shard ranks for queries. NOTE: as term IDs need to be resolved individually for each shard, DO NOT provide already parsed and resolved queries (with IDs instead of terms).
  taily-thresholds  Computes Taily thresholds.
A tool for stemming PISA queries.

Usage: ../../../build/bin/stem_queries [OPTIONS]

Options:
  -h,--help                  Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                   Configuration .ini file
  -i,--input TEXT REQUIRED   Query input file
  -o,--output TEXT REQUIRED  Query output file
  --stemmer TEXT REQUIRED    Stemmer
Extracts Taily statistics from the index and stores it in a file.

Usage: ../../../build/bin/taily-stats [OPTIONS]

Options:
  -h,--help                      Print this help message and exit
  -w,--wand TEXT REQUIRED        WAND data filename
  --compressed-wand              Needs: --wand  Compressed WAND data file
  -s,--scorer TEXT REQUIRED      Scorer function
  --bm25-k1 FLOAT                Needs: --scorer  BM25 k1 parameter.
  --bm25-b FLOAT                 Needs: --scorer  BM25 b parameter.
  --pl2-c FLOAT                  Needs: --scorer  PL2 c parameter.
  --qld-mu FLOAT                 Needs: --scorer  QLD mu parameter.
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  -c,--collection TEXT REQUIRED  Binary collection basename
  -o,--output TEXT REQUIRED      Output file path
  --config                       Configuration .ini file
Estimates query thresholds using Taily cut-offs.

Usage: ../../../build/bin/taily-thresholds [OPTIONS]

Options:
  -h,--help              Print this help message and exit
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0    Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT       Path to file containing a list of stop words to filter out
  -q,--queries TEXT      Path to file with queries
  --terms TEXT           Term lexicon
  --weighted             Weights scores by query frequency
  -k INT REQUIRED        The number of top results to return
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --stats TEXT REQUIRED  Taily statistics file
  --config               Configuration .ini file
Extracts query thresholds.

Usage: ../../../build/bin/thresholds [OPTIONS]

Options:
  -h,--help                    Print this help message and exit
  -e,--encoding TEXT REQUIRED  Index encoding
  -i,--index TEXT REQUIRED     Inverted index filename
  -w,--wand TEXT REQUIRED      WAND data filename
  --compressed-wand            Needs: --wand  Compressed WAND data file
  --tokenizer TEXT:{english,whitespace}=english  Tokenizer
  -H,--html BOOLEAN=0          Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...  Token filters
  --stopwords TEXT             Path to file containing a list of stop words to filter out
  -q,--queries TEXT            Path to file with queries
  --terms TEXT                 Term lexicon
  --weighted                   Weights scores by query frequency
  -k INT REQUIRED              The number of top results to return
  -s,--scorer TEXT REQUIRED    Scorer function
  --bm25-k1 FLOAT              Needs: --scorer  BM25 k1 parameter.
  --bm25-b FLOAT               Needs: --scorer  BM25 b parameter.
  --pl2-c FLOAT                Needs: --scorer  PL2 c parameter.
  --qld-mu FLOAT               Needs: --scorer  QLD mu parameter.
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn}=info  Log level
  --config                     Configuration .ini file
  --quantized                  Quantizes the scores
To create an index use the command compress_inverted_index. The available index types are listed in index_types.hpp.

For example, to create an index using the optimal partitioning algorithm on the test collection, execute the command:
+ -c ../test/test_data/test_collection \
+ -o test_collection.index.opt \
+ --check
+
+where test/test_data/test_collection
is the basename of the
+collection, that is the name without the .{docs,freqs,sizes}
+extensions, and test_collection.index.opt
is the filename of the
+output index. --check
will trigger a verification step to check the
+correctness of the index.
Binary Interpolative Coding (BIC) directly encodes a monotonically +increasing sequence. At each step of this recursive algorithm, the +middle element m is encoded by a number m − l − p, where l is the +lowest value and p is the position of m in the currently encoded +sequence. Then we recursively encode the values to the left and right of +m. BIC encodings are very space-efficient, particularly on clustered +data; however, decoding is relatively slow.
+To compress an index using BIC use the index type block_interpolative
.
++Alistair Moffat, Lang Stuiver: Binary Interpolative Coding for Effective Index Compression. Inf. Retr. 3(1): 25-47 (2000)
+
Given a monotonically increasing integer sequence S of size n, such that \(S_{n-1} < u\), we can encode it in binary using \(\lceil\log u\rceil\) bits. +Elias-Fano coding splits each number into two parts, a low part consisting of \(l = \lceil\log \frac{u}{n}\rceil\) right-most bits, and a high part consisting of the remaining \(\lceil\log u\rceil - l\) left-most bits. The low parts are explicitly written in binary for all numbers, in a single stream of bits. The high parts are compressed by writing, in negative-unary form, the gaps between the high parts of consecutive numbers.
+To compress an index using Elias-Fano use the index type ef
.
++Sebastiano Vigna. 2013. Quasi-succinct indices. In Proceedings of the sixth ACM international conference on Web search and data mining (WSDM ‘13). ACM, New York, NY, USA, 83-92.
+
++Jeff Plaisance, Nathan Kurz, Daniel Lemire, Vectorized VByte Decoding, International Symposium on Web Algorithms 2015, 2015.
+
++Hao Yan, Shuai Ding, and Torsten Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international conference on World wide web (WWW '09). ACM, New York, NY, USA, 401-410. DOI: https://doi.org/10.1145/1526709.1526764
+
++Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned Elias-Fano indexes. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (SIGIR '14). ACM, New York, NY, USA, 273-282. DOI: https://doi.org/10.1145/2600428.2609615
+
Quantities, Multipliers, and eXtractor (QMX) packs as many integers as possible into 128-bit words (Quantities) and stores the selectors (eXtractors) separately in a different stream. The selectors are compressed (Multipliers) with +RLE (Run-Length Encoding).
+To compress an index using QMX use the index type block_qmx
.
++Andrew Trotman. 2014. Compression, SIMD, and Postings Lists. In Proceedings of the 2014 Australasian Document Computing Symposium (ADCS '14), J. Shane Culpepper, Laurence Park, and Guido Zuccon (Eds.). ACM, New York, NY, USA, Pages 50, 8 pages. DOI: https://doi.org/10.1145/2682862.2682870
+
++Daniel Lemire, Leonid Boytsov: Decoding billions of integers per second through vectorization. Softw., Pract. Exper. 45(1): 1-29 (2015)
+
++Vo Ngoc Anh, Alistair Moffat: Index compression using 64-bit words. Softw., Pract. Exper. 40(2): 131-147 (2010)
+
++Jiangong Zhang, Xiaohui Long, and Torsten Suel. 2008. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th international conference on World Wide Web (WWW '08). ACM, New York, NY, USA, 387-396. DOI: https://doi.org/10.1145/1367497.1367550
+
++Daniel Lemire, Nathan Kurz, Christoph Rupp: Stream VByte: Faster byte-oriented integer compression. Inf. Process. Lett. 130: 1-6 (2018). DOI: https://doi.org/10.1016/j.ipl.2017.09.011
+
++Alexander A. Stepanov, Anil R. Gangolli, Daniel E. Rose, Ryan J. Ernst, and Paramjit S. Oberoi. 2011. SIMD-based decoding of posting lists. In Proceedings of the 20th ACM international conference on Information and knowledge management (CIKM '11), Bettina Berendt, Arjen de Vries, Wenfei Fan, Craig Macdonald, Iadh Ounis, and Ian Ruthven (Eds.). ACM, New York, NY, USA, 317-326. DOI: https://doi.org/10.1145/2063576.2063627
+
++ +Jeffrey Dean. 2009. Challenges in building large-scale information retrieval systems: invited talk. In Proceedings of the Second ACM International Conference on Web Search and Data Mining (WSDM '09), Ricardo Baeza-Yates, Paolo Boldi, Berthier Ribeiro-Neto, and B. Barla Cambazoglu (Eds.). ACM, New York, NY, USA, 1-1. DOI: http://dx.doi.org/10.1145/1498759.1498761
+
This section is an overview of how to take a collection +to a state in which it can be queried. +This process is intentionally broken down into several steps, +with a bunch of independent tools performing different tasks. +This is because we want the flexibility of experimenting with each +individual step without recomputing the entire pipeline.
+ +The raw collection is a dataset containing the documents to index. A
+collection is encoded in one of the supported
+formats that stores a list of document
+contents along with some metadata, such as URL and title. The
+parse_collection
tool takes a collection as an input and parses it to
+a forward index (see Forward Index). See
+Parsing for more details.
This is an inverted index in the Common Index File Format.
+It can be converted to an uncompressed PISA index (more information below)
+with the ciff2pisa
tool.
A forward index is the output of the parse_collection
tool.
+It represents each document as a list of tokens (terms) in the order of their appearance.
+To learn more about parsing and the forward index format, see Parsing.
An inverted index is the most fundamental structure in PISA. +For each term in the collection, it contains a list of documents the term appears in. +PISA distinguishes two types of inverted index.
+The uncompressed index stores document IDs and frequencies as 4-byte integers.
+It is an intermediate format between forward index and compressed inverted index.
+It is obtained by running invert
on a forward index.
+To learn more about inverting a forward index, see Inverting.
+Optionally, documents can be reordered with reorder-docids
to obtain another
+instance of uncompressed inverted index with different assignment of IDs to documents.
+More on reordering can be found in Document Reordering.
An uncompressed index is large and therefore before running queries, it +must be compressed with one of many available encoding methods. It is +this compressed index format that is directly used when issuing queries. +See Compress Index to learn more.
+This is a special metadata file containing additional statistics used during query processing. +See Build additional data.
+PISA supports partitioning a forward index into subsets called shards.
+Structures of all shards can be transformed in bulk using shards
command line tool.
+To learn more, read Sharding.
The following steps explain how to build PISA. +First, you need the code checked out from Github. +(Alternatively, you can download the tarball and unpack it on your local machine.)
+$ git clone https://github.com/pisa-engine/pisa.git
+$ cd pisa
+
+Then create a build environment.
+$ mkdir build
+$ cd build
+
+Finally, configure with CMake and compile:
+$ cmake ..
+$ make
+
+There are two build types available:
+Release
(default)Debug
RelWithDebInfo
MinSizeRel
Use Debug
only for development, testing, and debugging. It is much slower at runtime.
Learn more from CMake documentation.
+CMake supports configuring for different build systems. +On Linux and Mac, the default is Makefiles, thus, the following two commands are equivalent:
+$ cmake -G ..
+$ cmake -G "Unix Makefiles" ..
+
+Alternatively to Makefiles, you can configure the project to use Ninja instead:
+$ cmake -G Ninja ..
+$ ninja # instead of make
+
+Other build systems should work in theory but are not tested.
+You can run the unit and integration tests with:
+$ ctest
+
+The directory test/test_data
contains a small document collection used in the
+unit tests. The binary format of the collection is described in a following
+section.
+An example set of queries can also be found in test/test_data/queries
.
Once the parsing phase is complete, use the invert
command to turn a
+forward index into an inverted index. For example, assuming the
+existence of a forward index in the path path/to/forward/cw09b
:
$ mkdir -p path/to/inverted
+$ ./invert -i path/to/forward/cw09b \
+ -o path/to/inverted/cw09b \
+ --term-count `wc -w < path/to/forward/cw09b.terms`
+
+Note that the script requires as parameter the number of terms to be
+indexed, which is obtained by embedding the
+wc -w < path/to/forward/cw09b.terms
instruction.
A binary sequence is a sequence of integers prefixed by its length,
+where both the sequence integers and the length are written as 32-bit
+little-endian unsigned integers. An inverted index consists of 3
+files, <basename>.docs
, <basename>.freqs
, <basename>.sizes
:
<basename>.docs
starts with a singleton binary sequence where its
+only integer is the number of documents in the collection. It is then
+followed by one binary sequence for each posting list, in order of
+term-ids. Each posting list contains the sequence of document-ids
+containing the term.
<basename>.freqs
is composed of a one binary sequence per posting
+list, where each sequence contains the occurrence counts of the
+postings, aligned with the previous file (note however that this file
+does not have an additional singleton list at its beginning).
<basename>.sizes
is composed of a single binary sequence whose
+length is the same as the number of documents in the collection, and
+the i-th element of the sequence is the size (number of terms) of the
+i-th document.
Here is an example of a Python script reading the uncompressed inverted +index format:
+import os
+import numpy as np
+
+class InvertedIndex:
+ def __init__(self, index_name):
+ index_dir = os.path.join(index_name)
+ self.docs = np.memmap(index_name + ".docs", dtype=np.uint32,
+ mode='r')
+ self.freqs = np.memmap(index_name + ".freqs", dtype=np.uint32,
+ mode='r')
+
+ def __iter__(self):
+ i = 2
+ while i < len(self.docs):
+ size = self.docs[i]
+ yield (self.docs[i+1:size+i+1], self.freqs[i-1:size+i-1])
+ i += size+1
+
+ def __next__(self):
+ return self
+
+for i, (docs, freqs) in enumerate(InvertedIndex("cw09b")):
+ print(i, docs, freqs)
+
+
+ A forward index is a data structure that stores the term identifiers +associated to every document. Conversely, an inverted index stores for +each unique term the document identifiers where it appears (usually, +associated to a numeric value used for ranking purposes such as the raw +frequency of the term within the document).
+The objective of the parsing process is to represent a given collection
+as a forward index. To parse a collection, use the parse_collection
+command, for example:
$ mkdir -p path/to/forward
+$ zcat ClueWeb09B/*/*.warc.gz | \ # pass unzipped stream in WARC format
+ parse_collection \
+ -j 8 \ # use up to 8 threads at a time
+ -b 10000 \ # one thread builds up to 10k documents in memory
+ -f warc \ # use WARC
+ -F lowercase porter2 \ # lowercase and stem every term (using the Porter2 algorithm)
+ --html \ # strip HTML markup before extracting tokens
+ -o path/to/forward/cw09b
+
+In case you get the error -bash: /bin/zcat: Argument list too long
,
+you can pass the unzipped stream using:
$ find ClueWeb09B -name '*.warc.gz' -exec zcat -q {} \;
+
+The parsing process will write the following files:
+cw09b
: forward index in binary format.cw09b.terms
: a new-line-delimited list of sorted terms, where term
+having ID N is on line N, with N starting from 0.cw09b.termlex
: a binary representation (lexicon) of the .terms
+file that is used to look up term identifiers at query time.cw09b.documents
: a new-line-delimited list of document titles (e.g.,
+TREC-IDs), where document having ID N is on line N, with N starting
+from 0.cw09b.doclex
: a binary representation of the .documents
file that
+is used to look up document identifiers at query time.cw09b.urls
: a new-line-delimited list of URLs, where URL having ID N
+is on line N, with N starting from 0. Also, keep in mind that each ID
+corresponds with an ID of the cw09b.documents
file.Once the forward index has been generated, a binary document map and
+lexicon file will be automatically built. However, they can also be
+built using the lexicon
utility by providing the new-line delimited
+file as input. The lexicon
utility also allows efficient look-ups and
+dumping of these binary mapping files.
For example, assume we have the following plaintext, new-line delimited
+file, example.terms
:
aaa
+ bbb
+ def
+ zzz
+
+We can generate a lexicon as follows:
+./bin/lexicon build example.terms example.lex
+
+You can dump the binary lexicon back to a plaintext representation:
+./bin/lexicon print example.lex
+
+It should output:
+ aaa
+ bbb
+ def
+ zzz
+
+You can retrieve the term with a given identifier:
+./bin/lexicon lookup example.lex 2
+
+Which outputs:
+def
+
+Finally, you can retrieve the id of a given term:
+./bin/lexicon rlookup example.lex def
+
+It outputs:
+2
+
+NOTE: This requires the initial file to be lexicographically sorted,
+as rlookup
uses binary search for reverse lookups.
Both are English stemmers. Unfortunately, PISA does not have support for +any other languages. Contributions are welcome.
+The following raw collection formats are supported:
+plaintext
: every line contains the document's title first, then any
+number of whitespaces, followed by the content delimited by a new line
+character.trectext
: TREC newswire collections.trecweb
: TREC web collections.warc
: Web ARChive format as defined in the format
+specification.wapo
: TREC Washington Post Corpus.In case you want to parse a set of files where each one is a document (for example, the collection
+wiki-large), use the files2trec.py
script
+to format it to TREC (take into account that each relative file path is used as the document ID).
+Once the file is generated, parse it with the parse_collection
command specifying the trectext
+value for the --format
option.
Now it is possible to query the index. The command queries
treats each
+line of the standard input (or a file if -q
is present) as a separate
+query. A query line contains a whitespace-delimited list of tokens.
+These tokens are either interpreted as terms (if --terms
is defined,
+which will be used to resolve term IDs) or as term IDs (if --terms
is
+not defined). Optionally, a query can contain query ID delimited by a
+colon:
Q1:one two three
+ ^^ ^^^^^^^^^^^^^
+query ID terms
+
+For example:
+$ ./bin/queries \
+ -e opt \ # index encoding
+ -a and \ # retrieval algorithm
+ -i test_collection.index.opt \ # index path
+ -w test_collection.wand \ # metadata file
+ -q ../test/test_data/queries # query input file
+
+This performs conjunctive queries (and
). In place of and
other
+operators can be used (see Query algorithms), and
+also multiple operators separated by colon (and:or:wand
), which will
+run multiple passes, one per algorithm.
If the WAND file is compressed, append --compressed-wand
flag.
To perform BM25 queries it is necessary to build an additional file +containing the parameters needed to compute the score, such as the +document lengths. The file can be built with the following command:
+$ ./bin/create_wand_data \
+ -c ../test/test_data/test_collection \
+ -o test_collection.wand
+
+If you want to compress the file append --compress
at the end of the
+command. When using variable-sized blocks (for VBMW) via the
+--variable-block
parameter, you can also specify lambda with the -l <float>
or --lambda <float>
flags. The value of lambda impacts the
+mean size of the variable blocks that are output. See the VBMW paper
+(listed below) for more details. If using fixed-sized blocks, which is
+the default, you can supply the desired block size using the -b <UINT>
or --block-size <UINT>
arguments.
Here is the list of the supported query processing algorithms.
+Unranked (and
) or ranked (ranked_and
) conjunction.
Unranked (or
) or ranked (ranked_or
) union.
++Howard Turtle and James Flood. 1995. Query evaluation: strategies and optimizations. Inf. Process. Manage. 31, 6 (November 1995), 831-850. DOI=http://dx.doi.org/10.1016/0306-4573(95)00020-H
+
++Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the twelfth international conference on Information and knowledge management (CIKM '03). ACM, New York, NY, USA, 426-434. DOI: https://doi.org/10.1145/956863.956944
+
++Shuai Ding and Torsten Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 993-1002. DOI=http://dx.doi.org/10.1145/2009916.2010048
+
++ +Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, and Rossano Venturini. 2017. Faster BlockMax WAND with Variable-sized Blocks. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17). ACM, New York, NY, USA, 625-634. DOI: https://doi.org/10.1145/3077136.3080780
+
PISA supports reassigning document IDs that were initially assigned in order of parsing. +The point of doing it is usually to decrease the index size or speed up query processing. +This part is done on an uncompressed inverted index. +Depending on the method, you might also need access to some parts of the forward index. +We support the following ways of reordering:
+reorder-docids
.
+Below, we explain each method and show some examples of running the command.All methods can optionally take a path to a document lexicon and make a copy of it that reflects +the produced reordering.
+reorder-docids \
+ --documents /path/to/original/doclex \
+ --reordered-documents /path/to/reordered/doclex \
+ ...
+
+Typically, you will want to do that if you plan to evaluate queries, which will need access to +a correct document lexicon.
+++NOTE: Because these options are common to all reordering methods, we ignore them below for brevity.
+
Random document reordering, as the name suggests, randomly shuffles all document IDs. +Additionally, it can take a random seed. Two executions of the command with the same seed +will produce the same final ordering.
+reorder-docids --random \
+ --collection /path/to/inv \
+ --output /path/to/inv.random \
+ --seed 123456789 # optional
+
+An index can be reordered according to any single document feature, such as URL or TRECID,
+as long as it is stored in a text file line by line, where line n
is the feature of
+document n
in the original order.
In particular, our collection parsing command produces two such feature files:
+*.documents
, which is typically a list of TRECIDs,*.urls
, which is a list of document URLs.To use either, you simply need to run:
+reorder-docids \
+ --collection /path/to/inv \
+ --output /path/to/inv.random \
+ --by-feature /path/to/feature/file
+
+You can also produce a mapping yourself and feed it to the command. +Such mapping is a text file with two columns separated by a whitespace:
+<original ID> <new ID>
+
+Having that, reordering is as simple as running:
+reorder-docids \
+ --collection /path/to/inv \
+ --output /path/to/inv.random \
+ --from-mapping /path/to/custom/mapping
+
+We provide an implementation of the Recursive Graph Bisection (aka BP) algorithm, +which is currently the state-of-the-art for minimizing the compressed space used +by an inverted index (or graph) through document reordering. +The algorithm tries to minimize an objective function directly related to the number +of bits needed to store a graph or an index using a delta-encoding scheme.
+Learn more from the original paper:
+++L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. +Compressing graphs and indexes with recursive graph bisection. +In Proc. SIGKDD, pages 1535–1544, 2016.
+
In PISA, you simply need to pass --recursive-graph-bisection
option (or its alias --bp
)
+to the reorder-docids
command.
reorder-docids --bp \
+ --collection /path/to/inv \
+ --output /path/to/inv.random
+
+Note that --bp
allows for some additional options.
+For example, the algorithm constructs a forward index in memory, which is in a special format
+separate from the PISA forward index that you obtain from the parse_collection
tool.
+You can instruct reorder-docids
to store that intermediate structure (--store-fwdidx
),
+as well as provide a previously constructed one (--fwdidx
), which can be useful if you
+want to reuse it for several runs with different algorithm parameters.
+To see all available parameters, run reorder-docids --help
.
To compile PISA, you will need a compiler supporting at least the C++17 +standard. Our continuous integration pipeline compiles PISA and runs +tests in the following configurations:
+Supporting Windows is planned but is currently not being actively +worked on, mostly due to a combination of man-hour shortage, +prioritization, and no core contributors working on Windows at the +moment. If you want to help us set up a Github workflow for Windows and +work out some issues with compilation, let us know on our Slack +channel.
+Most build dependencies are managed automatically with CMake and git submodules. +However, several dependencies still need to be manually provided:
+CMake >= 3.0
autoconf
, automake
, libtool
, and m4
(for building gumbo-parser
)You can opt in to use some system dependencies instead of those in git +submodules:
+PISA_SYSTEM_GOOGLE_BENCHMARK
): this is a dependency used only for
+compiling and running microbenchmarks.PISA_SYSTEM_ONETBB
):
+both build-time and runtime dependency.PISA_SYSTEM_BOOST
): both build-time
+and runtime dependency.PISA_SYSTEM_CLI11
):
+build-time only dependency used in command line tools.For example, to use all the system installation of Boost in your build:
+cmake -DPISA_SYSTEM_BOOST=ON <source-dir>
+
+
+ We support partitioning a collection into a number of smaller subsets called shards.
+Right now, only a forward index can be partitioned by running partition_fwd_index
command.
+For convenience, we provide shards
command that supports certain bulk operations on all shards.
We support two methods of partitioning: random, and by a defined mapping. +For example, one can partition collection randomly:
+$ partition_fwd_index \
+ -j 8 \ # use up to 8 threads at a time
+ -i full_index_prefix \
+ -o shard_prefix \
+ -r 123 # partition randomly into 123 shards
+
+Alternatively, a set of files can be provided.
+Let's assume we have a folder shard-titles
with a set of text files.
+Each file contains new-line-delimited document titles (e.g., TREC-IDs) for one partition.
+Then, one would call:
$ partition_fwd_index \
+ -j 8 \ # use up to 8 threads at a time
+ -i full_index_prefix \
+ -o shard_prefix \
+ -s shard-titles/*
+
+Note that the names of the files passed with -s
will be ignored.
+Instead, each shard will be assigned a numerical ID from 0
to N - 1
in order
+in which they are passed in the command line.
+Then, each resulting forward index will have appended .ID
to its name prefix:
+shard_prefix.000
, shard_prefix.001
, and so on.
The shards
tool allows to perform some index operations in bulk on all shards at once.
+At the moment, the following subcommands are supported:
All input and output paths passed to the subcommands will be expanded for each individual shards
+by extending it with .<shard-id>
(e.g., .000
) or, if substring {}
is present, then
+the shard number will be substituted there. For example:
shards reorder-docids --by-url \
+ -c inv \
+ -o inv.url \
+ --documents fwd.{}.doclex \
+ --reordered-documents fwd.url.{}.doclex
+
+is equivalent to running the following command for every shard XYZ
:
reorder-docids --by-url \
+ -c inv.XYZ \
+ -o inv.url.XYZ \
+ --documents fwd.XYZ.doclex \
+ --reordered-documents fwd.url.XYZ.doclex
+
+
+ Currently it is possible to perform threshold estimation tasks using the
+kth_threshold
tool. The tool computes the k-highest impact score for
+each term of a query. Clearly, the top-k threshold of a query can be
+lower-bounded by the maximum of the k-th highest impact scores of the
+query terms.
In addition to the k-th highest score for each individual term, it is +possible to use the k-th highest score for certain pairs and triples of +terms.
+To perform threshold estimation use the kth_threshold
command.
PISA is a text search engine able to run on large-scale collections of +documents. It allows researchers to experiment with state-of-the-art +techniques, allowing an ideal environment for rapid development.
+Some features of PISA are listed below:
+PISA is still in its unstable release, and no stability or +backwards-compatibility is guaranteed with each new version. New +features are constantly added, and contributions are welcome!
+ +