When a document is added to a database, the words appearing in it are normalized automatically: they are reduced to the nominative case (nouns), singular (nouns, adjectives and verbs), masculine (nouns and adjectives), infinitive (verbs), etc.
The normalization process is based on morphological dictionaries, which allow more than 3 million Russian wordforms to be recognized. For example, the words таможня, таможню, таможни are represented by the single entry таможня, and the words представленный, представляется, представляли by the entry представлять. Words that the morphological dictionaries do not recognize are added to the general index as separate entries, one for each form encountered.
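The dictionary-based normalization described above can be sketched roughly as follows. This is only an illustration, not the system's actual implementation: the `MORPH_DICT` table, the `normalize` and `build_index` functions, the lowercasing step, and the unknown word фубар are all assumptions introduced for the example.

```python
# Toy morphological dictionary: wordform -> normalized entry.
# A real dictionary would cover millions of wordforms; these few
# pairs are taken from the examples in the text.
MORPH_DICT = {
    "таможня": "таможня",
    "таможню": "таможня",
    "таможни": "таможня",
    "представленный": "представлять",
    "представляется": "представлять",
    "представляли": "представлять",
}


def normalize(word):
    """Return the dictionary entry for a wordform.

    Unknown wordforms are returned unchanged (lowercased), so each
    encountered form becomes its own index entry.
    """
    key = word.lower()  # lowercasing is an assumption of this sketch
    return MORPH_DICT.get(key, key)


def build_index(docs):
    """Map each normalized entry to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index


# "фубар" is a made-up word standing in for anything the dictionary lacks.
docs = {1: "таможню представляли", 2: "таможни", 3: "фубар фубара"}
index = build_index(docs)
```

In this sketch, documents 1 and 2 both end up under the entry таможня, while фубар and фубара remain two separate entries because neither form is in the dictionary.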
The morphological analysis also automatically identifies prefixes that have a concrete meaning of their own (prefixoids), such as авиа-.
Each such prefix is converted into a keyword, which is entered in the database. As a result, documents containing авиаперевозки can be found with the query авиационные перевозки.
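One way to picture the prefix handling is the minimal sketch below. The `PREFIXOIDS` and `LEMMAS` tables and the `keywords` and `text_keywords` functions are hypothetical; how the system actually stores prefixes and keywords is not stated in the text.

```python
# Hypothetical table: meaningful prefix -> keyword it stands for.
PREFIXOIDS = {"авиа": "авиационный"}

# Hypothetical lemma table: wordform -> normalized entry.
LEMMAS = {
    "перевозки": "перевозка",
    "авиационные": "авиационный",
}


def keywords(word):
    """Return the keywords stored for a single word."""
    w = word.lower()
    if w in LEMMAS:
        return [LEMMAS[w]]
    for prefix, keyword in PREFIXOIDS.items():
        rest = w[len(prefix):]
        # Split only when the remainder is itself a known wordform.
        if w.startswith(prefix) and rest in LEMMAS:
            return [keyword, LEMMAS[rest]]
    return [w]


def text_keywords(text):
    """Return the set of keywords for every word in a text."""
    result = set()
    for token in text.split():
        result.update(keywords(token))
    return result


doc_keys = text_keywords("авиаперевозки")
query_keys = text_keywords("авиационные перевозки")
```

Under these assumptions the compound word and the two-word phrase yield the same keyword set, which is why one can be found by searching for the other.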
The system also recognizes and normalizes dates written in the following ways:
DD <month> YYYY
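As an illustration of date normalization for the `DD <month> YYYY` pattern, the following sketch rewrites such dates into a single canonical form. Several things here are assumptions rather than facts from the text: that the month is a Russian month name in the genitive case, that one-digit days are accepted, and that the canonical form is an ISO `YYYY-MM-DD` string. The names `MONTHS`, `DATE_RE`, and `normalize_dates` are hypothetical.

```python
import re
from datetime import date

# Russian month names in the genitive case, as used in written dates.
MONTHS = {
    "января": 1, "февраля": 2, "марта": 3, "апреля": 4,
    "мая": 5, "июня": 6, "июля": 7, "августа": 8,
    "сентября": 9, "октября": 10, "ноября": 11, "декабря": 12,
}

# One or two digits, a Cyrillic word, then a four-digit year.
DATE_RE = re.compile(r"\b(\d{1,2})\s+([а-яё]+)\s+(\d{4})\b", re.IGNORECASE)


def normalize_dates(text):
    """Replace 'DD <month> YYYY' dates with ISO 'YYYY-MM-DD' strings."""
    def repl(match):
        day, month_name, year = match.group(1), match.group(2).lower(), match.group(3)
        month = MONTHS.get(month_name)
        if month is None:
            return match.group(0)  # the word is not a month name: leave as is
        try:
            return date(int(year), month, int(day)).isoformat()
        except ValueError:
            return match.group(0)  # impossible date such as 31 февраля

    return DATE_RE.sub(repl, text)


normalized = normalize_dates("Закон принят 5 марта 2003 года")
```

Text that merely resembles a date, or a date that cannot exist, is left unchanged in this sketch rather than guessed at.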
The words in a query are normalized by the same morphological analysis, and words containing prefixoids are automatically expanded into phrases, so that the query matches the keywords stored for the documents.
If a word is not found in the morphological dictionary, every form of it that appears in the documents is added to the general index as a separate entry. When such a word is used in a query, morphological analysis of its inflection still makes it possible to recognize all the wordforms that appear in the documents and belong to the same paradigm.
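The text does not say how the inflection of an unknown word is analyzed. One simple possibility, shown here purely as an assumption, is to strip a common inflectional ending and treat index entries that share the remaining stem as one paradigm. The `ENDINGS` list, the `stem` and `expand_unknown` functions, and the sample word блокчейн (standing in for a word missing from the dictionary) are all hypothetical.

```python
# A small, deliberately incomplete list of Russian noun endings,
# ordered longest first so that "ом" is tried before "м"-like tails.
ENDINGS = sorted(
    ["ами", "ями", "ом", "ем", "ов", "ах", "ях",
     "а", "я", "у", "ю", "е", "ы", "и"],
    key=len,
    reverse=True,
)


def stem(word, min_stem=3):
    """Strip the longest matching ending, keeping at least min_stem letters."""
    w = word.lower()
    for ending in ENDINGS:
        if w.endswith(ending) and len(w) - len(ending) >= min_stem:
            return w[: -len(ending)]
    return w


def expand_unknown(query_word, index_entries):
    """Return the index entries that share a stem with the query word."""
    target = stem(query_word)
    return sorted(entry for entry in index_entries if stem(entry) == target)


# Index entries as they might look after several documents were added.
entries = ["блокчейн", "блокчейна", "блокчейном", "таможня"]
matches = expand_unknown("блокчейну", entries)
```

Here the query form блокчейну does not itself occur in the index, yet the three forms that do occur are all retrieved. A production system would need far more careful ending analysis than this crude stripping.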