Why? How? These are key questions that business people ask a lot. Questions such as, “Why did our customer retention rate plummet?” or “How come our product quality has declined?” or even “How did we end up in this mess?” often cannot be answered using structured data alone. However, there is a lot of unstructured data out there in content management systems that is ripe for this kind of analysis. Claims, case files, contracts, call center notes, and various forms of correspondence are all prime sources of insight.
This past year, I have become quite interested in what content management vendors are doing about text analytics. This is, in part, due to some research we had done at Hurwitz & Associates, which indicated that companies were planning to deploy their text analytics solutions in conjunction with content management systems. Many BI vendors have already incorporated text analytics into their BI platforms, yet earlier this year there didn’t seem to be much action on this front on the part of the ECM vendors.
Now, several content management providers are stepping up to the plate with offerings in this space. One of these vendors is IBM. IBM's Content Analyzer, formerly IBM OmniFind Analytics Edition, uses linguistic understanding and trend analysis to let users search, mine, and analyze the combined information from their unstructured content and structured data. Content Analyzer consists of two pieces: a back-end linguistics component and a text mining user interface for visualization and analysis. It also integrates directly with FileNet P8, which means it understands FileNet formats and preserves the integrity of information from the FileNet system as it moves into Content Analyzer.
Last week, Rashmi Vital, the offer manager for content analytics, gave me a demo of the product. It has come a long way since I wrote about what IBM was doing in the space back in February. The demo used data from the National Highway Traffic Safety Administration (NHTSA) complaints database, which logs consumer complaints about vehicles. The data includes structured fields – for example, the incident date and whether there was a crash – as well as unstructured information: the free-text descriptions written by consumers. Rashmi showed me how the text mining capabilities of Content Analyzer can be used for early detection of quality issues.
Let’s suppose that management wants to catch quality problems with cars, trucks, and SUVs before they become major safety issues (the auto industry doesn’t need any more trouble than it already has), and wants to pinpoint which specific component of the car is causing the complaints. The user decides to explore a data source of customer complaints. In this example we are using the NHTSA data, but an analyst could obviously get more insight from his or her own warranty claims, reports, and technician notes stored in the content management system.
The user wants to gather all of the incident reports associated with fire. Since fire isn’t a structured data field, the first thing the user can do is either simply search on fire, or set up a small dictionary of words and phrases that also represent fire – flame, blaze, spark, burst, and so on. Content Analyzer crawls the documents and puts them in an index. That index contains the NLP analytic results for the corpus as well as the documents themselves, because the analyst often wants to see the source.
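To make the dictionary idea concrete, here is a minimal sketch in Python of what that kind of term matching amounts to. This is purely an illustration of the concept, not Content Analyzer’s actual API or linguistic pipeline; the column names, terms, and sample records are hypothetical:

```python
# A minimal sketch (not IBM's API) of the idea behind a fire "dictionary":
# flag any complaint whose free-text description mentions one of the terms.
import re
import pandas as pd

# Hypothetical NHTSA-style complaint records; real data has many more fields.
complaints = pd.DataFrame({
    "component": ["ELECTRICAL SYSTEM:WIRING", "ENGINE AND ENGINE COOLING",
                  "VEHICLE SPEED CONTROL:CRUISE CONTROL"],
    "description": [
        "Smoke and flames came from under the dash while driving.",
        "Engine stalled at highway speed with no warning.",
        "Cruise control module began to spark and burn.",
    ],
})

# The "dictionary": fire plus words and phrases that also represent fire.
fire_terms = ["fire", "flame", "blaze", "spark", "smoke", "burn"]
pattern = re.compile("|".join(map(re.escape, fire_terms)), re.IGNORECASE)

# Flag each document; a real linguistic engine would also handle stemming,
# synonyms, and negation rather than simple pattern matching.
complaints["fire_related"] = complaints["description"].str.contains(pattern)
print(complaints[complaints["fire_related"]])
```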
Here is a screen shot of IBM Content Analyzer’s visualization tool, Text Miner. Text Miner provides facilities for real-time statistical analysis of the index for a source dataset. It lets users analyze the processed data by organizing it into categories, applying search conditions, and drilling down further to examine correlations or patterns over time.
You can see that the search on fire returned about 200,000 documents (there were over 500,000 to start). The user can then correlate fire with specific problems. In this example, the user decides to correlate it with a structured field called “vehicle component”. The vehicle component most highly correlated with fire (and with a high number of hits) is the electrical system wiring. The user can continue to drill down on this information to determine which make and model of car had the problem and, once he or she has distilled the analysis to a manageable number of records, examine the actual descriptions to understand the root cause.
Correlation analysis has another benefit: it can surface highly unusual trends that we would not otherwise consider. Suppose we take the same criteria as above and sort by correlation value (see next figure). It is not surprising to see components like the electrical system or the fuel system listed, since those are the places we would expect potential fires to start. However, just below those components you can see a high correlation between fire and Vehicle Speed Control: Cruise Control. This may not be an area an analyst would have considered a potential fire hazard. The high correlation value is a signal to investigate further by drilling down into that component and into the descriptions that customers submitted. The following view shows the results of analyzing the phrases related to the current analytic criteria. Being able to drill down to the actual incident report description allows the analyst to see the issue in its entirety. This is good stuff.
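Since the correlation value is what does the heavy lifting here, a back-of-the-envelope sketch may help show why it surfaces things that raw hit counts miss. This is not IBM’s algorithm; a simple lift ratio stands in for Content Analyzer’s correlation value, and the column names are the hypothetical ones from the earlier snippet:

```python
# A rough sketch (not IBM's correlation algorithm) of ranking a structured
# field against the text-mining flag. A simple "lift" ratio -- the fire rate
# for a component divided by the overall fire rate -- stands in for Content
# Analyzer's correlation value.
import pandas as pd

def rank_components(complaints: pd.DataFrame) -> pd.DataFrame:
    """Rank vehicle components by how strongly they co-occur with fire."""
    overall_rate = complaints["fire_related"].mean()
    by_component = complaints.groupby("component")["fire_related"].agg(
        hits="sum", rate="mean"
    )
    # Lift > 1 means fire shows up more often for this component than in the
    # dataset as a whole -- the kind of signal that flagged cruise control.
    by_component["lift"] = by_component["rate"] / overall_rate
    return by_component.sort_values(["lift", "hits"], ascending=False)

# Example with toy data; a real analysis would run over the full NHTSA corpus.
toy = pd.DataFrame({
    "component": ["ELECTRICAL SYSTEM:WIRING", "ELECTRICAL SYSTEM:WIRING",
                  "VEHICLE SPEED CONTROL:CRUISE CONTROL", "SUSPENSION"],
    "fire_related": [True, True, True, False],
})
print(rank_components(toy))
```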
Stay tuned as I plan to showcase what other content management vendors are doing in this space.
I’m interested in your company’s plans for text analytics and content management. Please answer my poll below: