I had an interesting briefing with the Basis Technology team the other week. They updated me on the latest release of their technology, called Rosette 7. In case you’re not familiar with Basis Technology, its software is the multilingual engine embedded in some of the biggest Internet search engines out there, including Google, Bing, and Yahoo. Enterprises and the government also use it. But the company is not just about keyword search. Its technology also extracts entities (about 18 different kinds), such as organizations, names, and places. What does this mean? It means that the software can discover these kinds of entities across massive amounts of data and perform context-sensitive discovery in many different languages.
Here’s a simple example. Say you’re in the Canadian consulate and you want to understand what is being said about Canada across the world. You type “Canada” into your search engine and get back a listing of documents. How do you make sense of this? Using Basis Technology entity extraction (an enhancement to search and a basic component of text analytics), you could actually perform faceted (i.e., guided) navigation across multiple languages. This is illustrated in the figure below. Here, the user typed “Canada” into the search engine and got back 89 documents. In the main pane of the browser, an arrow highlights the word Canada in a number of different languages, so you know that it is included in these documents. On the left-hand side of the screen is the guided navigation pane. For example, you can see that there are 15 documents that contain a reference to Obama and another 6 that contain a reference to Barack Obama. The reference does not necessarily co-occur with Canada in the same sentence, just in the same document. So each of these articles contains a reference to both Obama and Canada. This would help you determine what Obama might have said about Canada, or what the connection is between Canada and the BBC (listed under organizations). This idea is not necessarily new, but the strong multilingual capabilities make it compelling for global organizations.
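To make the guided navigation idea concrete, here is a minimal Python sketch of how facet counts could be computed once entities have been extracted. It is not Basis Technology's implementation: the sample documents, the entity type labels, and the function names `build_facets` and `drill_down` are all made up for illustration, and the extraction step itself is assumed to have already happened.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: each document carries the entities an extractor
# is assumed to have already found, as (entity_type, text) pairs.
docs = [
    {"id": 1, "entities": [("LOCATION", "Canada"), ("PERSON", "Obama")]},
    {"id": 2, "entities": [("LOCATION", "Canada"), ("ORGANIZATION", "BBC")]},
    {"id": 3, "entities": [("LOCATION", "Canada"), ("PERSON", "Obama"),
                           ("PERSON", "Obama")]},
]


def build_facets(documents):
    """Count, per entity type, how many documents mention each entity."""
    facets = defaultdict(Counter)
    for doc in documents:
        # set() so an entity repeated inside one document is counted once:
        # facets count documents, not mentions.
        for entity_type, text in set(doc["entities"]):
            facets[entity_type][text] += 1
    return facets


def drill_down(documents, entity_type, text):
    """Return the ids of the documents that contain the chosen facet value."""
    return [d["id"] for d in documents if (entity_type, text) in d["entities"]]


facets = build_facets(docs)
for entity_type, counts in sorted(facets.items()):
    print(entity_type, dict(counts))
print("Documents mentioning the BBC:", drill_down(docs, "ORGANIZATION", "BBC"))
```

Note that the counts are per document, which matches the point above: an entity and the search term only need to appear in the same document, not in the same sentence.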
If you have eagle eyes, you will notice that the search on Canada returned 89 documents, but the entity “Canada” only returned 61 documents. This illustrates what entity extraction is all about. When the search for Canada was run on the Rosette Name Indexer tab (see the upper right-hand corner of the screenshot), the query was matched against all of the automatically extracted “Canada” entities in all of the documents. This includes all persons, locations, and organizations that have similar names, such as “Canada Post” and “Canada Life”, which are organizations, not the country itself. Therefore, in the other 28 documents (89 minus 61), the Canada variant refers to an organization or some other entity rather than the country.
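The gap between the two counts can be reproduced on a toy scale. The sketch below is an assumption-laden illustration, not the Rosette Name Indexer: the five documents, the type labels, and the helpers `name_search` and `entity_filter` are hypothetical. It contrasts a name search that matches any entity containing "Canada" with a filter on the specific entity "Canada" typed as a location.

```python
# Hypothetical sample: document id -> entities assumed to be already extracted.
docs = {
    "a": [("LOCATION", "Canada")],
    "b": [("ORGANIZATION", "Canada Post")],
    "c": [("LOCATION", "Canada"), ("ORGANIZATION", "Canada Life")],
    "d": [("ORGANIZATION", "Canada Life")],
    "e": [("PERSON", "Obama")],
}


def name_search(documents, term):
    """Ids of documents with any entity whose name contains the term."""
    term = term.lower()
    return {
        doc_id
        for doc_id, entities in documents.items()
        if any(term in name.lower() for _, name in entities)
    }


def entity_filter(documents, entity_type, name):
    """Ids of documents containing exactly this entity with this type."""
    return {
        doc_id
        for doc_id, entities in documents.items()
        if (entity_type, name) in entities
    }


all_hits = name_search(docs, "Canada")                 # every "Canada" variant
country = entity_filter(docs, "LOCATION", "Canada")    # the country only
others = all_hits - country                            # organizations etc.

print(len(all_hits), len(country), len(others))
```

Here the name search finds four documents and the country entity two, so two documents remain where "Canada" only appears as part of an organization name. That is the same relationship as 89, 61, and 28 in the screenshot.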
There are obviously a number of different use cases where the ability to extract entities across languages can be important. Here are three:
- Watch lists. With the ability to extract entities, such as people, in multiple languages, this kind of technology is good for government or financial watch lists. Basis can resolve matches and translate names in 9 different languages. This includes resolving multiple spelling variations of foreign names. It also enables organizations to match names of people, places, and organizations against entries in a multilingual database.
- Legal discovery. Basis Technology’s software can identify 55 different languages. Companies would use this technology, for example, to identify the languages present within a document and then route the document appropriately. Additionally, Basis can extract entities in 15 different languages (and search in 21), so the technology could be used to process many documents and extract the entities associated with them to find the right set of documents needed in legal discovery.
- Brand image, competitive intelligence. The technology can be used to extract company names across multiple languages. The software can also be used against disparate data sources, such as internal document management systems as well as external sources such as the Internet. This means that it could comb the Internet to extract a company name (and variations on the name) in multiple languages. I would expect this technology to be used by “listening posts” and other “Voice of the Customer” services in the near future.
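The watch-list use case above comes down to deciding whether an extracted name is a spelling variant of a listed one. As a rough illustration only, the Python sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. This is generic string similarity, not Basis Technology's matching method, and it does not handle names written in different scripts. The watch-list entries, the `0.8` threshold, and the function names are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical watch-list entries, invented for this example.
WATCH_LIST = ["Aleksandr Petrov", "Maria Gonzalez"]


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(candidate, watch_list, threshold=0.8):
    """Return the closest watch-list entry, or None if nothing is close enough.

    The 0.8 threshold is an arbitrary assumption, not a recommended value.
    """
    score, entry = max((similarity(candidate, e), e) for e in watch_list)
    return entry if score >= threshold else None


print(best_match("Alexander Petrov", WATCH_LIST))  # spelling variant
print(best_match("John Smith", WATCH_LIST))        # unrelated name
```

A real system would need far more than character similarity, for example transliteration between scripts and knowledge of naming conventions, which is exactly where a dedicated multilingual engine earns its keep.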
While this technology is not a text analytics platform, it does provide an important piece of core functionality needed in a global economy. Look for more announcements from the company in 2010 around enhanced search in additional languages.