According to the definition on Wikipedia, full-text search refers to techniques for searching a single computer-stored document or a collection of documents in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).
In other words, full-text search finds all documents that contain the given query terms and returns them in order of their similarity to the query.
PostgreSQL supports full-text search with the following features:
- Stemming
- Ranking
- Highlighting results
- Multiple languages
- Fuzzy search for misspellings
- Accent support
Features
1. Parsing Documents
A document is the unit of searching in a full-text search system; for example, a magazine article or email message.
PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type, which stores the processed document. The function calls a parser which breaks the textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document.
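As a small illustration (a sketch using the built-in english configuration; the exact output can vary with configuration and PostgreSQL version):

```sql
-- Parse a document: stop words are dropped, the remaining tokens are
-- reduced to lexemes and stored together with their positions.
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
--                       to_tsvector
-- -----------------------------------------------------
--  'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```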
During the process of breaking the document text into tokens, the parser assigns a type to each token. Depending on the token type, each token is processed by a different list of dictionaries.
With the help of dictionaries, some words are recognized as stop words, for example a, on, and the. These words are ignored since they are of little use in searching.
Other tokens are reduced to normalized lexemes that represent them; for example, rats becomes rat.
If no dictionary recognizes the token, the token is ignored.
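To see this processing per token, the built-in ts_debug function reports each token's type, the dictionary that handled it, and the resulting lexemes (an empty list for stop words). A sketch with the english configuration:

```sql
-- Inspect how the parser and dictionaries handle each token
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'a fat rats')
WHERE alias <> 'blank';
--   alias    | token |  dictionary  | lexemes
-- -----------+-------+--------------+---------
--  asciiword | a     | english_stem | {}        <- stop word, discarded
--  asciiword | fat   | english_stem | {fat}
--  asciiword | rats  | english_stem | {rat}     <- reduced to its lexeme
```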
The choices of parser, dictionaries and which types of tokens to index are determined by the selected text search configuration. It is possible to have many different configurations in the same database, and predefined configurations are available for various languages.
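For example (a sketch; the configurations actually available depend on your installation):

```sql
-- Show the configuration used by default and list the predefined ones
SHOW default_text_search_config;      -- e.g. pg_catalog.english
SELECT cfgname FROM pg_ts_config;     -- simple, english, french, german, ...
-- A configuration can also be chosen explicitly per call
SELECT to_tsvector('french', 'les chats');
```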
PostgreSQL provides the functions to_tsquery, plainto_tsquery, phraseto_tsquery and websearch_to_tsquery for converting a query to the tsquery data type.
These functions normalize each token in the query input into a lexeme using the specified or default configuration, and discard any tokens that are stop words according to the configuration.
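A sketch of how these functions differ, using the english configuration (websearch_to_tsquery is available in PostgreSQL 11 and later):

```sql
-- to_tsquery expects operators between tokens; stop words are dropped
SELECT to_tsquery('english', 'The & Fat & Rats');           -- 'fat' & 'rat'
-- plainto_tsquery takes plain text and ANDs the lexemes together
SELECT plainto_tsquery('english', 'The Fat Rats');          -- 'fat' & 'rat'
-- phraseto_tsquery links the lexemes with the FOLLOWED BY operator
SELECT phraseto_tsquery('english', 'The Fat Rats');         -- 'fat' <-> 'rat'
-- websearch_to_tsquery accepts a web-search-like syntax
SELECT websearch_to_tsquery('english', '"fat rat" or cat'); -- 'fat' <-> 'rat' | 'cat'
```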
The query input may consist of single tokens separated by the operators AND, OR, NOT, and FOLLOWED BY, possibly grouped using parentheses.
With the FOLLOWED BY operator, you can perform phrase searches to find words that appear next to each other.
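A sketch of matching such queries against a tsvector with the @@ match operator (english configuration assumed):

```sql
-- AND / NOT: the document must contain 'fat' and must not contain 'dog'
SELECT to_tsvector('english', 'a fat cat sat on a mat')
       @@ to_tsquery('english', 'fat & !dog');              -- true
-- FOLLOWED BY (<->): 'fatal' must be immediately followed by 'error'
SELECT to_tsvector('english', 'fatal error')
       @@ to_tsquery('english', 'fatal <-> error');         -- true
SELECT to_tsvector('english', 'error is not fatal')
       @@ to_tsquery('english', 'fatal <-> error');         -- false
```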
Also, * can be attached to a lexeme to specify prefix matching. Such a lexeme will match any word that begins with the given string.
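A sketch of prefix matching; note that the part before :* is itself normalized by the configuration before the comparison:

```sql
-- 'super:*' matches any lexeme that begins with "super"
SELECT to_tsvector('english', 'supernovae stars')
       @@ to_tsquery('english', 'super:*');                 -- true
```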
to_tsquery can also accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. Without quotes, to_tsquery will generate a syntax error for tokens that are not separated by an AND, OR, or FOLLOWED BY operator.
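A sketch of the quoted-phrase form; it assumes the active configuration includes a thesaurus dictionary with the rule supernovae stars : sn (without such a dictionary the phrase is simply processed word by word):

```sql
-- The quoted phrase is handed to the dictionaries as a whole,
-- so a thesaurus rule "supernovae stars : sn" can replace it
SELECT to_tsquery('''supernovae stars'' & !crab');
--    to_tsquery
-- ----------------
--  'sn' & !'crab'
```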