The Three C’s – conceptual search, clustering, and categorization – Fern Halper's data makes the world go 'round

May 6, 2009

The Three C’s – conceptual search, clustering, and categorization

I recently had the opportunity to speak with Richard Turner, VP of Marketing at Content Analyst. Content Analyst was originally part of SAIC and spun out about 5 years ago. The company provides content management, eDiscovery, and content workflow solutions to its clients – primarily as an OEM solution.

The tool set is called CAAT (Content Analyst Analytic Technology). It includes five core features:

Concept search: uses concepts rather than keywords to search through documents.
Dynamic Clustering: classifies documents into clusters.
Categorization: classifies documents into user-defined categories.
Summarization: identifies conceptually definitive sentences in documents.
Foreign Language: identifies the language in a document and can work across any language.

Let me briefly touch upon concept search, dynamic clustering, and categorization. Concept search is interesting because when most people think search, they think keyword search. However, keywords may not give you what you’re looking for. For example, I might be interested in finding documents that deal with banks. However, a relevant document might never use the word “bank” explicitly; words like finance, money, and so on might occur instead. So, if you don’t enter the right word into the search engine, you will not get back all of the relevant documents. Concept search allows you to search on the concept (not the keyword) “bank” so you get back documents related to the concept even if they don’t contain the exact word. CAAT learns what the word bank means in a given set of documents from the terms that occur with it, like “money” and “exchange rate”. It also learns that the word bank (as in financial institution) is not the same as the bank on the side of a river because of other terms in the document (such as money, transfer, or ATM).
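To make the keyword problem concrete, here is a minimal Python sketch of exact-word search. The three documents and the `keyword_search` function are invented for illustration; they are not part of CAAT.

```python
# Hypothetical three-document corpus, invented for illustration.
DOCS = {
    "loan_memo": "interest on the loan is paid in money every month",
    "branch_note": "the bank approved the money transfer at the atm",
    "fishing_trip": "we sat on the river bank and went fishing",
}


def keyword_search(query, docs):
    """Return the names of documents that contain the exact query word."""
    query = query.lower()
    return sorted(
        name for name, text in docs.items() if query in text.lower().split()
    )


hits = keyword_search("bank", DOCS)
print(hits)
```

The search for “bank” returns the fishing trip (wrong sense of the word) and misses the loan memo (right topic, but the word never appears). Those are the two failures that concept search is meant to address.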

Dynamic clustering enables the user to organize documents into content-based groups called clusters. You can also categorize documents by example: you tag a document as belonging to a certain category, give the system other documents that are similar to it to train on, and the system learns to recognize similar documents that belong in the same category. In eDiscovery applications, this can dramatically cut down the amount of time needed to find the right documents. In the past, this was done manually, which could be very time intensive.
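Categorizing by example can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the categories, the example documents, and the function names (`vectorize`, `train`, `categorize`) are all hypothetical, and the comparison uses plain word overlap (cosine similarity on word counts), whereas a product like CAAT would presumably compare documents in its concept space.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    shared = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return shared / norm if norm else 0.0


def train(tagged_examples):
    """Build one summed word-count profile per user-defined category."""
    profiles = {}
    for category, texts in tagged_examples.items():
        profile = Counter()
        for text in texts:
            profile.update(vectorize(text))
        profiles[category] = profile
    return profiles


def categorize(text, profiles):
    """Assign the category whose profile is most similar to the document."""
    vec = vectorize(text)
    return max(profiles, key=lambda category: cosine(vec, profiles[category]))


# Hypothetical tagged examples, invented for illustration.
TAGGED = {
    "contracts": [
        "supplier contract signed and countersigned",
        "contract renewal terms for the supplier",
    ],
    "hr": [
        "employee vacation request approved",
        "new employee onboarding schedule",
    ],
}

PROFILES = train(TAGGED)
label = categorize("please review the supplier contract terms", PROFILES)
print(label)
```

The user supplies only the tagged examples; new documents are then routed to the closest category without anyone reading them first.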

How do they do it?

The company uses a technique called Latent Semantic Indexing (LSI), along with other patented technology, to help it accomplish all of this. Here is a good link that explains LSI. The upshot is that LSI uses a vector representation of the information found in the documents to analyze the term space in a document. Essentially, it removes the grammar, then counts and weights the occurrences of the terms in the document (e.g., by how often a word appears on a page or in a document). It does this across all of the documents, which produces a large term-by-document matrix, and then collapses that matrix into far fewer dimensions using a technique patented at Bell Labs. The more negative a term, the greater its distance from a page. Since the approach is mathematical, there is no need to put together a dictionary or thesaurus. And it’s this mathematical approach that makes the system language independent.
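To make the counting-and-collapsing idea concrete, here is a minimal Python sketch. It is not Content Analyst's implementation. The six-document corpus is invented, terms are raw counts with no weighting, the matrix is collapsed to just two dimensions, and a simple power iteration stands in for a production SVD routine. All of the names (`term_vector`, `top_directions`, `concept_scores`) are hypothetical.

```python
import math
import random

# Hypothetical toy corpus, invented for illustration.
DOCS = {
    "f1": "bank money loan",
    "f2": "money loan atm",
    "f3": "bank atm money",
    "r1": "river bank water",
    "r2": "river water fishing",
    "r3": "water fishing river",
}
VOCAB = sorted({word for text in DOCS.values() for word in text.split()})


def term_vector(text):
    """Raw term counts over VOCAB: word order and grammar are ignored."""
    words = text.split()
    return [float(words.count(term)) for term in VOCAB]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_directions(doc_vectors, k, iters=300, seed=0):
    """Approximate the top-k left singular vectors of the term-document
    matrix A by power iteration on A * A^T (a stand-in for a real SVD)."""
    n = len(VOCAB)
    gram = [
        [sum(d[i] * d[j] for d in doc_vectors) for j in range(n)]
        for i in range(n)
    ]
    rng = random.Random(seed)
    found = []
    for _ in range(k):
        v = normalize([rng.gauss(0, 1) for _ in range(n)])
        for _ in range(iters):
            w = [dot(row, v) for row in gram]
            # Stay orthogonal to the directions already found.
            for u in found:
                c = dot(w, u)
                w = [wi - c * ui for wi, ui in zip(w, u)]
            v = normalize(w)
        found.append(v)
    return found


def project(vec, directions):
    """Coordinates of a term vector in the collapsed (k-dimensional) space."""
    return [dot(u, vec) for u in directions]


def cosine(a, b):
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0


DIRECTIONS = top_directions([term_vector(t) for t in DOCS.values()], k=2)


def concept_scores(query):
    """Cosine similarity between the query and every document, measured
    in the collapsed space rather than on raw keywords."""
    q = project(term_vector(query), DIRECTIONS)
    return {
        name: cosine(q, project(term_vector(text), DIRECTIONS))
        for name, text in DOCS.items()
    }


scores = concept_scores("loan")
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

In this toy corpus, document f3 never mentions “loan”, yet it should score well above the river documents, because it shares terms such as “money” and “atm” with documents that do mention it. No dictionary or thesaurus is involved; the association comes purely from which terms occur together.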

Some people have argued this technique can’t scale because the matrix would be too large and it would be hard to keep it in memory. However, when I asked the folks at Content Analyst about this, they told me that they have been working on the problem and that CAAT contains a number of features to optimize memory and throughput. The company regularly works with litigation clients who might get 1-2 TB of information from opposing counsel, and they are using CAAT for clustering, categorization, and search. The company also works with organizations that have created indexes of 45+ million documents (>8 TB). That’s a lot of data!

Conceptual Search and Classification and Content Management

Aside from eDiscovery, Content Analyst is also being used in applications such as improving search in media and publishing and, of course, government applications. The company is also looking into other application areas.

Concept search is definitely a big step up from keyword search and is important for any kind of document that might be stored in a content management system. Automatic classification and clustering would also be huge (as would summarization and foreign language recognition). This could move Content Analyst into other areas including analyzing claims, medical records, and customer feedback. Content management vendors such as IBM and EMC are definitely moving in the direction of providing more intelligence in their content management products. This makes perfect sense, since a lot of unstructured information is stored in these systems. Hopefully, other content management vendors will catch up soon.

@fbhalper

