MARISA: Matching Algorithm with Recursively Implemented StorAge

Abstract: Matching Algorithm with Recursively Implemented StorAge (MARISA) is a space-efficient trie data structure. libmarisa is a C++ library for an implementation of MARISA. Users can build dictionaries and search keys from the dictionaries. The package also provides command line tools to test basic operations of libmarisa, and the tools are useful to test the performance.

Introduction

Overview

Matching Algorithm with Recursively Implemented StorAge (MARISA) is a space-efficient, fairly fast, and static trie data structure. MARISA serves as a dictionary structure, and by definition, it supports exact match lookup, which is the basic operation of dictionary. In addition, MARISA supports reverse lookup, common prefix search, and predictive search.

In most cases, MARISA is much more compact than a plain text which consists of the registered keys. This means that the traditional dictionary implementations, a binary tree (std::map<std::string, T>) and a hash table (std::unordered_map<std::string, T>), require more and more and more spaces than MARISA. Bloom Filter, a probabilistic data structure, is more space-efficient than MARISA but causes false positives and does not support reverse lookup, common prefix search, and predictive search.

libmarisa is a C++ library for an implementation of MARISA. Users can build dictionaries and search keys from the dictionaries. The package also provides command line tools to test basic operations of libmarisa, and the tools are useful to test the performance.

Functionality

libmarisa associates string keys with unique IDs, from 0 to (n - 1), where n is the number of keys. Note that users cannot specify the IDs because the mapping is automatically generated by MARISA. Every search function takes a string or an ID and returns the search result which is represented by a pair of the key and its ID.

  • Lookup
    • checks whether or not a query string is registered.
  • Reverse lookup
    • restores a key from its ID.
  • Common prefix search
    • searches keys from the possible prefixes of a query string.
  • Predictive search
    • searches keys starting with a query string.

Source

License

libmarisa and its command line tools are dual-licensed under the BSD 2-clause license and the LGPL.

Installation

GCC & Clang

$ tar zxf marisa-0.2.4.tar.gz
$ cd marisa-0.2.4
$ ./configure
$ make
$ make check
$ make install

Users can install libmarisa by using configure and make. make install might require sudo to install libmarisa as the root user. Additionally, ldconfig might be required because libmarisa is installed as a shared library in default settings.

If a POPCNT instruction is available on your environment, you can specify --enable-popcnt, when you run configure, to improve the performance of libmarisa. Likewise, --enable-sse2, --enable-sse3, --enable-ssse3, --enable-sse4.1, --enable-sse4.2, --enable-sse4, --enable-sse4a are available. Also, if you need a static library, specify --enable-static to configure. For other options, see ./configure --help.

Visual C++ 2008

There are project files for Visual C++ 2008 in vs2008/. Users can build a static library libmarisa.lib and the command line tools by using vs2008/vs2008.sln. If your Visual C++ is older than 2008. New projects are required to build libmarisa.

Perl Bindings

$ cd bindings/perl
$ perl Makefile.PL
$ make
$ make install

Users can find a Perl bindings in bindings/perl/, in which the wrapper was generated by SWIG. To install the Perl bindings, run perl Makefile.PL and then make install. See also bindings/perl/sample.pl.

Python Bindings

$ cd bindings/python
$ python setup.py build
$ python setup.py install

Users can find a Python bindings in bindings/python/, in which the wrapper was generated by SWIG. To install the Python bindings, run python setup.py install. See also bindings/python/sample.py.

Ruby Bindings

$ cd bindings/ruby
$ ruby extconf.rb
$ make
$ make install

Users can find a Ruby bindings in bindings/ruby/, in which the wrapper was generated by SWIG. To install the Ruby bindings, run ruby extconf.rb and then make install. See also bindings/ruby/sample.rb.

Others

There is an alternative Cython-based pip-installable Python bindings which is faster than the above Python bindings.

Command Line Tools

marisa-build

$ marisa-build < keyset.txt > keyset.dic
#keys: 9864459
#nodes: 13473881
size: 51044424

marisa-build is a tool to build a dictionary from a set of keys. This tool takes as input newline-delimited keys and writes the dictionary to the standard output.

Users can specify parameters through command line options. See marisa-build -h for the list of options.

If an input line contains horizontal tabs, the last one serves as the delimiter between a key and its weight which is used to optimize the order of nodes. Estimated frequency of each key, given as the weight, may improve the search performance.

marisa-lookup

$ marisa-lookup keyset.dic
Marisa
915465	Marisa
What's_uuup
-1	What's_uuup

marisa-lookup is a tool to test exact match lookup. If a query string is registered, this tool prints the key and its ID. Otherwise, this tool prints the query with -1.

See marisa-lookup -h for the list of options.

marisa-reverse-lookup

$ marisa-reverse-lookup keyset.dic
1234567
1234567	Goma_International_Airport

marisa-reverse-lookup is a tool to test reverse lookup. If a given ID is not out-of-range, this tool restores the associated key and prints it. The available ID is 0 to (n - 1), where n is the number of keys. Note that an out-of-range ID causes an error.

See marisa-reverse-lookup -h for the list of options.

marisa-common-prefix-search

$ marisa-common-prefix-search keyset.dic
USA
3 found
20	U	USA
1526	US	USA
37471	USA	USA

marisa-common-prefix-search is a tool to test common prefix search. This tool searches keys from the possible prefixes of a query string and then prints the first m keys, where m is one of the parameters.

See marisa-common-prefix-search -h for the list of options.

marisa-predictive-search

$ marisa-predictive-search keyset.dic -n 2
Touhou
15 found
975378	Touhou	Touhou
5508004	Touhou_Hisotensoku	Touhou

marisa-predictive-search is a tool to test predictive search. This tool searches keys starting with a query string and then prints the first m keys, where m is one of the parameters.

See marisa-predictive-search -h for the list of options.

marisa-benchmark

$ marisa-benchmark keyset.txt
Number of tries: 1 - 5
TAIL mode: Text mode
Node order: Descending weight order
Cache level: Normal cache
Number of keys: 9864459
Total length: 191858227
------+----------+--------+--------+--------+--------+--------
#tries       size    build   lookup  reverse   prefix  predict
                                      lookup   search   search
          [bytes]    [K/s]    [K/s]    [K/s]    [K/s]    [K/s]
------+----------+--------+--------+--------+--------+--------
     1   69905816   334.84  1368.16  1304.82  1080.44   605.92
     2   53635744   284.03   762.91   773.68   662.04   244.35
     3   51044424   278.89   688.86   703.60   604.44   212.00
     4   50309000   277.01   669.23   680.78   588.57   204.23
     5   50042232   275.93   636.83   674.26   562.08   199.48
------+----------+--------+--------+--------+--------+--------

marisa-benchmark is a tool to benchmark libmarisa. This tool takes the same input as marisa-build and measures the performance of libmarisa for the given set of keys. This tool is useful to fix dictionary settings.

For the search performance, marisa-benchmark measures the time to lookup or search keys in input order. When the keys are given in lexicographic order, few cache misses will occur in the benchmark. In contrast, when the keys are given in random order, many cache misses will occur in the benchmark.

See marisa-benchmark -h for the list of options.

marisa-dump

$ marisa-dump keyset.dic | head -3
input: keyset.dic
S
St
Sta

marisa-build is a tool to dump a dictionary. This tool prints all the keys in a given dictionary.

Users can specify the delimiter through command line options. See marisa-dump -h for the list of options.

Library

How to Use

// sample.cc
#include <iostream>
#include <marisa.h>

int main() {
  marisa::Keyset keyset;
  keyset.push_back("a");
  keyset.push_back("app");
  keyset.push_back("apple");

  marisa::Trie trie;
  trie.build(keyset);

  marisa::Agent agent;
  agent.set_query("apple");
  while (trie.common_prefix_search(agent)) {
    std::cout.write(agent.key().ptr(), agent.key().length());
    std::cout << ": " << agent.key().id() << std::endl;
  }
  return 0;
}
$ g++ sample.cc -lmarisa
$ ./a.out
a: 0
app: 1
apple: 2

libmarisa provides marisa.h in which all the headers are #included. Also, libmarisa uses namespace marisa. All the classes and functions except enumeration types are given as members of this namespace. Note that using namespace marisa may cause a critical error. Finally, gcc and clang require an option, -lmarisa, to link libmarisa with an application.

The core components of libmarisa are Keyset, Agent, and Trie. In addition, libmarisa provides an exception class, Exception, and two more classes, Key and Query, as members of Keyset and Agent.

  • Keyset: A class to store a set of keys. This class is used to build a set of keys for building a dictionary. Also, this class is useful to store search results.
  • Agent: A class to store a query and a result of search operations. Every search function takes a reference to this class.
  • Trie: A dictionary class.

For more examples, you can find the source code of the command line tools in tools/. The source code is useful as an example of error handling, predicive search, etc.

Enumeration Constants

Error Codes

typedef enum marisa_error_code_ {
  MARISA_OK           = 0,
  MARISA_STATE_ERROR  = 1,
  MARISA_NULL_ERROR   = 2,
  MARISA_BOUND_ERROR  = 3,
  MARISA_RANGE_ERROR  = 4,
  MARISA_CODE_ERROR   = 5,
  MARISA_RESET_ERROR  = 6,
  MARISA_SIZE_ERROR   = 7,
  MARISA_MEMORY_ERROR = 8,
  MARISA_IO_ERROR     = 9,
  MARISA_FORMAT_ERROR = 10,
} marisa_error_code;

libmarisa throws an instance of Exception when an error occurs, such as a file I/O error (MARISA_IO_ERROR), a size limitation error (MARISA_SIZE_ERROR), etc. For details, see marisa/base.h.

Number of Tries

typedef enum marisa_num_tries_ {
  MARISA_MIN_NUM_TRIES     = 0x00001,
  MARISA_MAX_NUM_TRIES     = 0x0007F,
  MARISA_DEFAULT_NUM_TRIES = 0x00003,
} marisa_num_tries;

MARISA is a recursive data structure in which a patricia trie is used to represent another patricia trie. A deeper recursion makes a dictionary more compact but degrades the search performance. For this time-space tradeoff, libmarisa provides a parameter to limit the recursion depth, which is equivalent to the number of tries. marisa_num_tries gives the range and the default setting of this parameter.

The best setting depends on the set of keys and the applications. In most cases, libmarisa works well with the default setting, MARISA_DEFAULT_NUM_TRIES, but if the application requires better search performance, MARISA_MIN_NUM_TRIES may be a better choice. Also, if the application uses long and complicated keys, a deeper recursion may achieve much higher spece-efficiency. marisa-benchmark is useful to find the best setting.

Cache Size

typedef enum marisa_cache_level_ {
  MARISA_HUGE_CACHE    = 0x00080,
  MARISA_LARGE_CACHE   = 0x00100,
  MARISA_NORMAL_CACHE  = 0x00200,
  MARISA_SMALL_CACHE   = 0x00400,
  MARISA_TINY_CACHE    = 0x00800,
  MARISA_DEFAULT_CACHE = MARISA_NORMAL_CACHE
} marisa_cache_level;

libmarisa embeds a precomputed table to a dictionary. The table serves as transition cache which improves the search performance but increases the dictionary size. Cache size is the parameter of this time-space tradeoff.

marisa_cache_level gives a list of available cache size. Compared with MARISA_NORMAL_CACHE, MARISA_LARGE_CACHE is 2 times larger, MARISA_HUGE_CACHE is 4 times larger, MARISA_SMALL_CACHE is 2 times smaller, and MARISA_TINY_CACHE is 4 times smaller.

TAIL Mode

typedef enum marisa_tail_mode_ {
  MARISA_TEXT_TAIL    = 0x01000,
  MARISA_BINARY_TAIL  = 0x02000,
  MARISA_DEFAULT_TAIL = MARISA_TEXT_TAIL,
} marisa_tail_mode;

The last patricia trie of MARISA stores its multi-byte labels as strings and marisa_tail_mode gives a list of TAIL implementations.

MARISA_TEXT_TAIL stores labels as zero-terminated strings. If the labels contain '\0', the TAIL mode is automatically switched to MARISA_BINARY_TAIL.

On the other hand, MARISA_BINARY_TAIL uses a bit vector, instead of '\0', to detect the end of labels. This means that MARISA_TEXT_TAIL is more space-efficient than MARISA_BINARY_TAIL when the average length of multi-byte labels is longer than 8 bytes.

Node Order

typedef enum marisa_node_order_ {
  MARISA_LABEL_ORDER   = 0x10000,
  MARISA_WEIGHT_ORDER  = 0x20000,
  MARISA_DEFAULT_ORDER = MARISA_WEIGHT_ORDER,
} marisa_node_order;

A dictionary has one more parameter, which is the order of nodes. There are two choices, MARISA_LABEL_ORDER and MARISA_WEIGHT_ORDER. The former arranges nodes in ascending order of the label and the latter arranges nodes in descending order of the weight. Many trie implementations arrange nodes in the label order but libmarisa uses MARISA_WEIGHT_ORDER as the default setting.

MARISA_WEIGHT_ORDER optimizes the node order for linear search performed in exact match lookup, common prefix search, and predictive search. In practice, experiments for English words/phrases showed that MARISA_WEIGHT_ORDER halved the average search time. On the other hand, MARISA_LABEL_ORDER enables predictive search to restore keys in lexicographic order.

Aliases

namespace marisa {
  typedef ::marisa_error_code ErrorCode;
  typedef ::marisa_cache_level CacheLevel;
  typedef ::marisa_tail_mode TailMode;
  typedef ::marisa_node_order NodeOrder;
}  // namespace marisa

The above enumeration types are defined in the global namespace to avoid collisions of the enumeration constants with macros provided by other modules. libmarisa provides type aliases and users can choose the familiar one.

class Exception

class Exception {
 public:
  const char *filename() const;
  int line() const;
  ErrorCode error_code() const;
  const char *error_message() const;

  const char *what() const;
};

Exception is an exception class. libmarisa throws an instance of Exception with the file name (__FILE__), the line number (__LINE__), and an error code (ErrorCode) when an error is detected. The instance also has an error message formatted __FILE__:__LINE__: error_code: error_message.

class Key

class Key {
 public:
  char operator[](std::size_t i) const;
  const char *ptr() const;
  std::size_t length() const;
  std::size_t id() const;
};

Key is a member of Keyset and Agent. Each key of Keyset is represented by this class. Also, the search result of Agent is represented by this class.

class Query

class Query {
 public:
  char operator[](std::size_t i) const;
  const char *ptr() const;
  std::size_t length() const;
  std::size_t id() const;
};

Query is a member of Agent. This class stores a query string and an ID as input for search functions. Users cannot make changes directly to Query because Agent provides a special interface to update its query.

class Keyset

class Keyset {
 public:
  Keyset();

  void push_back(const Key &key);
  void push_back(const Key &key, char end_marker);

  void push_back(const char *str);
  void push_back(const char *ptr,
                 std::size_t length,
                 float weight = 1.0);

  const Key &operator[](std::size_t i) const;
  Key &operator[](std::size_t i);

  std::size_t num_keys();

  bool empty() const;
  std::size_t size() const;
  std::size_t total_length() const;

  void reset();

  void clear();
  void swap(Keyset &rhs);
};

Overview

Keyset is used to store a set of keys for dictionary construction or to save the results of search functions.

Dictionary Source

For dictionary construction, users append keys to Keyset by using push_back() and then pass the keyset to build() of Trie. weight is an argument to receive the frequency or possibility of each key. If there are same keys, the weights are accumulated in dictionary construction.

After dictionary construction, users can read the associated IDs through operator[](). Instead, the weights are overwritten by the IDs because Key uses a union to store a weight or an ID.

Search Result

Users can save a search result to Keyset by using push_back(). When key() of Agent is given, a copy of the search result is stored in Keyset. If you want to append an end marker, such as '\0', use end_marker of push_back().

If you want to reuse an instance of Keyset, reset() may be a better choice than clear() because reset() keeps allocated memory in order to reduce memory allocation overhead.

num_keys() and size() return the number of keys. empty() checks whether the number of keys is 0 or not. total_length() returns the total length in byte.

class Agent

class Agent {
 public:
  Agent();

  const Query &query() const;
  const Key &key() const;

  void set_query(const char *str);
  void set_query(const char *ptr,
                 std::size_t length);
  void set_query(std::size_t key_id);
};

Agent is actually a tuple of Query, Key, and State. This class is used as I/O of search functions. Also, State is an incomplete type to keep the internal state of search operation.

A lookup operation requires 3 steps as follows: 1. sets a query string by set_query() of Agent, 2. passes the agent to lookup() of Trie, and 3. gets the search result by key() of Agent. The other operations proceed in the same way.

class Trie

class Trie {
 public:
  Trie();

  void build(Keyset &keyset,
             int config_flags = 0);

  void mmap(const char *filename);
  void map(const void *ptr,
           std::size_t size);

  void load(const char *filename);
  void read(int fd);

  void save(const char *filename) const;
  void write(int fd) const;

  bool lookup(Agent &agent) const;
  void reverse_lookup(Agent &agent) const;
  bool common_prefix_search(Agent &agent) const;
  bool predictive_search(Agent &agent) const;

  std::size_t num_tries() const;
  std::size_t num_keys() const;
  std::size_t num_nodes() const;

  TailMode tail_mode() const;
  NodeOrder node_order() const;

  bool empty() const;
  std::size_t size() const;
  std::size_t io_size() const;

  void clear();
  void swap(Trie &rhs);
};

Overview

Trie is a dictionary class, which is the most important component of libmarisa. All the operations are performed through this class.

In fact, Trie is a dictionary handle, and if the handle is invalid, functions other than build(), mmap(), map(), load(), read(), clear(), swap() throw an exception.

Construction

You can build a dictionary by using build(). The arguments are the above mentioned Keyset and a dictionary setting, config_flags, which is represented by a combination of flags. For example, 2 | MARISA_BINARY_TAIL specifies the maximum number of tries (2) and a TAIL mode (MARISA_BINARY_TAIL). Also, in this case, the default settings, MARISA_DEFAULT_ORDER and MARISA_DEFAULT_CACHE, are used for the node order and the cache size.

The IDs associated with the keys are available through operator[]() of keyset, and the IDs are useful to associate the keys with any data types.

File I/O

mmap() is an interface for memory mapped I/O. If an application performs a few search operations, it is unnecessary to read the whole dictionary, and in such a case, mmap() is useful. Also, memory mapped I/O is an easy way to share dictionary data among processes. On the other hand, if an application performs a lot of search operations, a memory mapped dictionary might cause a lot of random disk accesses which considerably increase the search time.

map() restores an instance of Trie from dictionary data on memory. load() and read() read a dictionary from a file or a file descriptor. save() and write() write a dictionary to a file or a file descriptor.

Search

Trie provides 4 search functions lookup(), reverse_lookup(), common_prefix_search(), and predictive_search() as follows:

  • lookup() checks whether a query string is registered or not, and if it is registered, lookup() returns true. In this case, the search result is available through agent.key(). Note that lookup() does not restore a key and agent.key().ptr() points to the query string because the two strings are the same.
  • reverse_lookup() restores a key from its ID. This function has no return value and the key is available through agent.key(). The key is actually stored in agent and it is lost when agent is reset or used for another search operation. If a given ID is out-of-range, reverse_lookup() throws an exception.
  • common_prefix_search() searches keys from the possible prefixes of a query string. If there are matching keys, this function returns true. In this case, the first key is available through agent.key(), and if there are more than one matching keys, the next key will be available after the next common_prefix_search() which returns true until there are no more matching keys. Note that agent.key().ptr() == agent.query().ptr() is always true when common_prefix_search() has returned true.
  • predictive_search() searches keys starting with a query string, and similar to common_prefix_search(), this function returns true until there are no more matching keys.

Note that agent keeps the internal state of common_prefix_search() and predictive_search() until agent is passed to another search function or agent.set_query() is called.

num_keys() and size() return the number of keys. empty() checks whether the number of keys is 0 or not. io_size() returns the dictionary size in byte.

stdio

void fread(std::FILE *file, Trie *trie);
void fwrite(std::FILE *file, const Trie &trie);

The functions for I/O using std::FILE are declared in marisa/stdio.h. If you don't want to #include <cstdio>, use marisa/trie.h instead of marisa.h.

iostream

std::istream &read(std::istream &stream, Trie *trie);
std::ostream &write(std::ostream &stream, const Trie &trie);

std::istream &operator>>(std::istream &stream, Trie &trie);
std::ostream &operator<<(std::ostream &stream, const Trie &trie);

The functions for I/O using std::iostream are declared in marisa/iostream.h. If you don't want to #include <iosfwd>, use marisa/trie.h instead of marisa.h.

Cross-architecture compatibility

The dictionary format of libmarisa depends on the architecture. Dictionaries built on a little endian architecture don't work on a big endian architecture. Also, on a big endian architecture, dictionaries built on a 32-bit machine don't work on a 64-bit machine and vise versa. On a little endian architecture, dictionaries are compatible on 32/64-bit machines.

References

Last Spell

Feel free to contact me for any questions.