ABBYY Linguistic Technologies: From the Complex to the Perfect

The idea of tackling one of the key problems in artificial intelligence, teaching a machine to understand human language, took shape among ABBYY specialists fifteen years ago. It was then that, at the initiative of company founder David Yang, research began, followed by development and engineering work on a new generation of machine translation. That work later grew into a separate project, Compreno (formerly Natural Language Compiler), aimed at a broad range of natural language processing tasks.

How serious ABBYY is about revolutionizing computational linguistics is shown not only by many years of work by more than three hundred employees, but also by the interest in the platform from the Development Fund of the Center for Research and Commercialization of New Technologies (the Skolkovo Foundation), which selects the most promising projects and supports them. The financial side is no less compelling: the Skolkovo Foundation's total investment in Compreno is 475 million rubles, which is half of the project's funding. The other half (475 million rubles) comes from ABBYY itself. These are impressive figures that underline the scope and scale of the project.

The Sum of Technologies

To grasp the mechanisms underlying Compreno and the logic of how they work, one first has to understand the fundamental idea of the project. Whatever languages people speak, the concepts their words denote have far more similarities than differences. We all live in houses, use furniture and telephones, travel by car, go to work in an office, fly on airplanes, and so on. These concepts are shared and do not depend on the language in which we express them. Picking up this connecting thread, ABBYY built a language-independent universal semantic hierarchy of concepts.

The semantic hierarchy of concepts is a tree shared by all languages. Its thick branches are the more general concepts (for example, "motion") and its thin branches are more specific meanings, arranged from the general to the particular ("to crawl", "to fly", "to walk", "to run", and so on). If we are talking about the head of an organization, the lexical class at the top is the concept "leader", and its subclasses are more specific concepts such as "boss", "chief", "manager", "head" and other words and phrases, which are the leaves on the tree of concepts.
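The general-to-specific arrangement described above can be sketched as a small tree in Python. This is only an illustration: the class names (MOTION, TO_RUN and so on), the transliterated Russian entries and the helper functions are assumptions made for the example, not Compreno's actual inventory or API.

```python
# Toy semantic hierarchy: each concept points to its more general parent.
# All class names here are hypothetical, chosen only for illustration.
PARENT = {
    "MOTION": None,
    "TO_CRAWL": "MOTION",
    "TO_FLY": "MOTION",
    "TO_WALK": "MOTION",
    "TO_RUN": "MOTION",
}

# Words from different languages attach to the same
# language-independent concept.
LEXICON = {
    ("en", "run"): "TO_RUN",
    ("ru", "begat"): "TO_RUN",
    ("en", "fly"): "TO_FLY",
    ("ru", "letat"): "TO_FLY",
}


def path_to_root(concept):
    """Return the chain from a concept up to the most general class."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def is_a(concept, ancestor):
    """True if `ancestor` lies on the path from `concept` to the root."""
    return ancestor in path_to_root(concept)


print(path_to_root(LEXICON[("ru", "begat")]))  # ['TO_RUN', 'MOTION']
print(LEXICON[("en", "run")] == LEXICON[("ru", "begat")])  # True
```

The point of the sketch is that the English and the Russian word land on the same node, so anything known about that node applies to both.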

ABBYY Compreno works not with words but with meanings (concepts). One meaning of a word may sit on one branch of the hierarchy, and another meaning on a different branch.

This tree structure lets properties be inherited from parent to child and avoids ambiguity when sentences are translated from one language to another. To illustrate, the developers cite the Russian word "upravlenie", which corresponds to several concepts on different branches of the universal semantic tree: it can mean a department, or it can mean an action. Because the semantic class for "upravlenie" as an organization sits on one branch of the tree and the action on another, the system automatically selects the right word when translating into English, choosing "department" or "management" depending on the context of the phrase. As a result, the semantic descriptions that form the core of Compreno make it possible to translate text from English or Russian into the universal language, and from the universal language into any other language described in the system.
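A minimal sketch of how property inheritance can drive the choice between the two senses of "upravlenie" might look as follows. The class names, the property keys ("kind", "english") and the `choose_sense` helper are all hypothetical; the real system presumably relies on far richer descriptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SemanticClass:
    name: str
    parent: Optional["SemanticClass"] = None
    own_props: dict = field(default_factory=dict)

    def props(self):
        """Collect properties from the root down; children override parents."""
        inherited = self.parent.props() if self.parent else {}
        return {**inherited, **self.own_props}


# Two hypothetical branches of the tree.
ORGANIZATION = SemanticClass("ORGANIZATION", own_props={"kind": "entity"})
DEPARTMENT = SemanticClass("DEPARTMENT", ORGANIZATION, {"english": "department"})
ACTION = SemanticClass("ACTION", own_props={"kind": "event"})
MANAGEMENT = SemanticClass("MANAGEMENT", ACTION, {"english": "management"})

# One Russian word, two senses on different branches.
SENSES = {"upravlenie": [DEPARTMENT, MANAGEMENT]}


def choose_sense(word, required_kind):
    """Pick the sense whose inherited 'kind' matches what the context needs."""
    for sense in SENSES[word]:
        if sense.props()["kind"] == required_kind:
            return sense
    raise LookupError(f"no sense of {word!r} has kind {required_kind!r}")


# Context that calls for an organization versus one that calls for an action.
print(choose_sense("upravlenie", "entity").props()["english"])  # department
print(choose_sense("upravlenie", "event").props()["english"])   # management
```

Note that neither DEPARTMENT nor MANAGEMENT declares "kind" itself; both pick it up from their parent, which is the inheritance the paragraph refers to.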

The second major block of the Compreno platform is syntax. Syntax describes how concepts relate to each other within one or more sentences. To encode these relations, languages use sentence structure, agreement, word order, cases, various function words, conjunctions, prepositions and much else. Syntax is, figuratively speaking, a large construction set made up of these parts.

Different languages use different parts of this construction set. In English, for example, word order is an important part of syntax. Interrogative sentences are formed one way and declarative sentences another, with little room for variation. Some optional adverbials of time and place can be put at the beginning of the sentence, but usually the subject comes first, the predicate second, and the remaining parts of the sentence follow. Russian is different. It is not tied to a fixed word order, but agreement matters a great deal, and that is probably the biggest stumbling block for people learning Russian.

Another important thing to consider when parsing text is the substitutions and links between words that arise when a word is omitted but we understand that it is still there. A clear example is the phrase "The boy likes red apples, and the girl green ones." It is clear that the girl's part of the sentence is also about apples (and about the fact that she likes them), and we understand this perfectly well although several words are missing from the text. There are other, more complex syntactic relations that Compreno parses successfully. For example: "Although the boy wanted to play, he knew that he had little time." Here the word "boy" has been replaced twice by pronouns (in the Russian original, "on" and "nego"; in English, "he" both times), and the machine has to understand that they refer to one and the same entity and restore the missing components.
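Both phenomena can be illustrated with a deliberately naive Python sketch: the omitted words are copied from the preceding clause, and each pronoun is linked to the nearest earlier noun of the same gender. The slot names, the tiny gender tables and both functions are assumptions made for this example; real ellipsis and anaphora resolution is much harder than this.

```python
def restore_ellipsis(full_clause, elliptical_clause):
    """Fill slots left empty in the second clause from the first clause."""
    restored = dict(full_clause)
    restored.update(
        {slot: value for slot, value in elliptical_clause.items() if value is not None}
    )
    return restored


# "The boy likes red apples, and the girl green ones."
first = {"subject": "boy", "verb": "like", "attribute": "red", "object": "apple"}
second = {"subject": "girl", "verb": None, "attribute": "green", "object": None}
print(restore_ellipsis(first, second))
# {'subject': 'girl', 'verb': 'like', 'attribute': 'green', 'object': 'apple'}

# Hypothetical mini-dictionaries, just enough for the example sentence.
NOUN_GENDER = {"boy": "m", "girl": "f"}
PRONOUN_GENDER = {"he": "m", "she": "f"}


def resolve_pronouns(words):
    """Link each pronoun to the nearest earlier noun of the same gender."""
    last_seen = {}
    links = []
    for index, word in enumerate(words):
        if word in NOUN_GENDER:
            last_seen[NOUN_GENDER[word]] = word
        elif word in PRONOUN_GENDER:
            links.append((index, word, last_seen.get(PRONOUN_GENDER[word])))
    return links


sentence = "although the boy wanted to play he knew that he had little time"
print(resolve_pronouns(sentence.split()))
# [(6, 'he', 'boy'), (9, 'he', 'boy')]
```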

ABBYY Compreno aims to determine the meaning of text written in ordinary language, allowing the machine to "understand" the text and transform it into a universal, language-independent representation.

The Compreno block responsible for syntax works out the roles of the different concepts in a sentence and links them to one another. The system analyzes the text and builds a tree of relations whose root is usually some action. From the root follow the object, the subject and other attributes, which attach either to the object or to the subject and convey the meaning of that particular sentence. To make the parse as accurate as possible, Compreno uses semantic analysis based on the universal hierarchy of concepts described above. All this gives the machine a new degree of freedom in processing text: it can "understand" the meaning of the source sentence and then synthesize that meaning in another language.
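As a rough illustration of such a tree, the Python sketch below stores a sentence as an action with a subject, an object and an attribute, and then renders it with English word order. The `Node` class, the role names, the concept labels and the tiny English lexicon are hypothetical; the sketch shows the idea of synthesizing from a language-independent structure, not how Compreno does it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    concept: str                 # language-independent concept label
    role: str                    # role relative to the parent node
    children: list = field(default_factory=list)

    def child(self, role):
        """Return the first child with the given role, or None."""
        return next((c for c in self.children if c.role == role), None)


# "The boy likes red apples": the action is the root of the tree.
tree = Node("TO_LIKE", "predicate", [
    Node("BOY", "subject"),
    Node("APPLE", "object", [Node("RED", "attribute")]),
])

# Hypothetical target-language lexicon for synthesis.
ENGLISH = {"TO_LIKE": "likes", "BOY": "the boy", "APPLE": "apples", "RED": "red"}


def phrase(node, lexicon):
    """Render a node with its attributes placed in front of it."""
    attributes = [lexicon[c.concept] for c in node.children if c.role == "attribute"]
    return " ".join(attributes + [lexicon[node.concept]])


def synthesize_english(root, lexicon):
    """Subject, predicate, object: the fixed English word order."""
    return " ".join([
        phrase(root.child("subject"), lexicon),
        lexicon[root.concept],
        phrase(root.child("object"), lexicon),
    ])


print(synthesize_english(tree, ENGLISH))  # the boy likes red apples
```

A synthesizer for another language would reuse the same tree with its own lexicon and its own ordering and agreement rules.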

Finally, the third important component of ABBYY's linguistic platform is statistics, which helps the system combine phrases correctly and cope better with homonyms, where the same word can mean different things (a typical Russian example is "zamok", which can mean either a castle or a lock). Statistical information is just as important for correctly parsing sentences that can be interpreted in more than one way. For example, the Russian phrase rendered here as "These types are available in our shop" can be analyzed properly only by drawing on data about how often concepts are related to one another, and so on the context of the speech, in other words, the subject under discussion. If the text is about metallurgy, the sentence is about types of steel; if it is about people's behavior, it is more logical to read "types" as some rather unpleasant characters.
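The role of frequency data can be sketched with a toy sense picker for "zamok". The co-occurrence counts below are invented for the example, and the `pick_sense` function is a simple stand-in for whatever statistical model the platform actually uses.

```python
from collections import Counter

# Invented co-occurrence counts: how often each sense of "zamok"
# appears near a given context word. Not real corpus data.
COOCCURRENCE = {
    "CASTLE": Counter({"king": 40, "tower": 35, "siege": 20, "door": 3}),
    "LOCK": Counter({"door": 50, "key": 45, "open": 30, "tower": 1}),
}


def pick_sense(context_words, senses=COOCCURRENCE):
    """Choose the sense seen most often with the words in the context."""
    scores = {
        sense: sum(counts[word] for word in context_words)
        for sense, counts in senses.items()
    }
    return max(scores, key=scores.get)


print(pick_sense(["key", "door"]))    # LOCK
print(pick_sense(["king", "tower"]))  # CASTLE
```

A `Counter` returns 0 for words it has never seen, so unknown context words simply do not contribute to either score.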

The Compreno statistical model is built on a large collection of texts of various topics and genres, which is fed to the system almost daily. Moreover, these are not arbitrary texts but ones written, or translated from one language to another, by a human. This approach reduces the likelihood of errors in the system's decisions and of distortions when semantic structures are synthesized.

What was the outcome? By combining knowledge, imagination, ideas and experience, ABBYY specialists managed to build, on three pillars (the semantic hierarchy of concepts, syntax, and statistics), a language-independent model of data about the world and a model for accessing that data. As a result, the computer has come as close as possible to understanding the meaning of text, which makes it possible to solve a wide range of linguistic tasks. Such as?

Beautiful Mind

Speaking about the practical value of the ABBYY Compreno platform, the developers focus first of all on two key tasks: automatic translation of texts for a variety of language pairs, and intelligent information search.

The first task, translating text, is extremely important in the digital age, which is erasing formal borders and barriers between countries. With ever-growing volumes of multilingual information and the need to involve more and more participants from all over the world in critical projects, what matters today is not only the speed of translation but also the quality of the output text. When it comes to quality, existing machine translation systems do not do as well as it might seem at first glance. The blame lies with fundamental limitations in the scientific principles on which many existing machine translators are based. These limitations include the inability to handle exceptions correctly, the objective complexity of language constructions, disregard for semantics, the inability to capture the actual links within a sentence, and other problems. Compreno is an engineering embodiment of the fundamental linguistic research of many scientists around the world, accumulated over about 50 years. Thanks to this, Compreno can overcome these difficulties and synthesize text whose meaning is the same as in the original language, or very close to it. To let the reader judge the system, below is a fragment of the article "Google's 'Babel fish' heralds future of translation" as translated by ABBYY's tools and by a statistical translation platform. The examples speak for themselves.

Original:

If we tried manually to give the system those languages, it would be a hopeless task. The only possible way we could do this is to harness the power of machine computation. We build statistical models that are automatically training themselves and learning all the time.

ABBYY Compreno:

If we tried to manually give the system these languages, that would be a hopeless task. The only possible way that we could do this, is to use the power of computer calculations. We will create a statistical model, which automatically learn and study all the time.

Statistical Translator:

If we tried to hand to give the system of these languages, it would be a hopeless task. The only way we could do this is to use the capacity of the machine calculations. We build a statistical model, which automatically training themselves and learning all the time.

The importance of the second task, intelligent search, follows from the huge volume of information generated by humanity, which grows exponentially and calls for different approaches to analysis and to finding the desired data. Search today works mainly with words: to retrieve a document, we first think of words it should contain, then enter key phrases, get the results that match, and then manually pick out the information we need. This familiar kind of search has several major drawbacks. First, it is not always possible to formulate a query that accurately describes the information to be found. Second, by choosing specific words we narrow and restrict the search. Finally, going through all combinations of keywords is sometimes very tedious, if not impossible. ABBYY Compreno addresses all of these drawbacks by enabling semantic search over the concepts and relations extracted from a query formulated in ordinary language.
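The difference between the two kinds of search can be shown with a small Python sketch. The word-to-concept table, the sample documents and both search functions are made up for the example; a real system would extract concepts and relations with a full parser rather than a dictionary lookup.

```python
# Hypothetical word-to-concept mapping: synonyms share one concept.
CONCEPT_OF = {
    "boss": "LEADER", "chief": "LEADER", "manager": "LEADER", "head": "LEADER",
    "resigned": "TO_RESIGN", "quit": "TO_RESIGN",
    "company": "ORGANIZATION", "firm": "ORGANIZATION",
}

DOCS = [
    "The chief of the firm quit",
    "The company opened a new office",
]


def concepts(text):
    """Map a text to the set of concepts its known words express."""
    return {CONCEPT_OF[w] for w in text.lower().split() if w in CONCEPT_OF}


def keyword_search(query, docs):
    """Classic search: every query word must literally occur in the document."""
    wanted = set(query.lower().split())
    return [d for d in docs if wanted <= set(d.lower().split())]


def semantic_search(query, docs):
    """Concept search: every query concept must be expressed in the document."""
    wanted = concepts(query)
    return [d for d in docs if wanted and wanted <= concepts(d)]


print(keyword_search("boss resigned", DOCS))   # []
print(semantic_search("boss resigned", DOCS))  # ['The chief of the firm quit']
```

The keyword query fails because the document uses different words for the same ideas, while the concept-level query finds it without the user having to enumerate synonyms.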

The slogan "We help people understand each other" perfectly captures the essence of ABBYY Compreno technology.

The platform's "erudition" and the vast store of knowledge concentrated in it mean that Compreno can be used for many other applied tasks. On its basis the company can create breakthrough solutions for multilingual search and classification, extraction of data and facts and linking of objects, monitoring systems, protection against unauthorized use of information, automatic summarization and annotation of documents, speech recognition, and many other tasks.

A very promising and interesting application area for Compreno is text visualization. A clear example is the creation of animated films and movies from text scripts. The company Bazelevs Innovations works in this direction; it also takes an active part in the Skolkovo project and has already achieved certain results in building software for interactive three-dimensional visualization of text. ABBYY states with pride that no other platform in the world is as universal or can solve so many applied tasks that require high-quality linguistic analysis.

Grand Plans

Today, as mentioned above, more than 300 specialists work on the project, with active involvement of young professionals: students of the ABBYY department at MIPT and graduates of leading universities such as MSU, the Russian State University for the Humanities, Moscow State Linguistic University, St. Petersburg State University and many others. The roots of the work lie in serious research in Russian and world linguistics. This body of scientific knowledge is what ABBYY specialists draw on. The company plans to bring the world's leading experts in language and linguistics into the project and to give it international status.

ABBYY is currently running pilot projects to deploy software solutions based on Compreno. The project's representatives have not yet disclosed details of the products under development, but they assure that their release and widespread adoption will ultimately benefit everyone: software vendors and consumers alike, that is, all of us.

It is too early to say how much ABBYY's ambitious Compreno project will change people's lives in the future. But it is safe to say that in the near future computational linguistics will make significant progress in language modeling and move to an entirely new technological base, the foundation of which is being laid now.
