Seven considerations for unstructured data

As organizations compete on analytics, they increasingly realize that structured data alone isn’t enough. Unstructured data (i.e., data that has not been organized in a pre-defined manner), such as text, audio, image, and video data, can provide a lot of value. For example, unstructured data can help to provide:

  • Better understanding of behavior. Typical structured data about behavior may include demographic information, billing information, how long someone has been a customer, etc. But if you add unstructured data, such as call center data, to the mix, you can understand the why behind the what. For example, a company may see that sales are decreasing without knowing why. It may then apply text mining to its call center notes. Those notes may contain the entities, concepts, themes, and sentiments that explain why customers are dropping the service. This extracted data can be used to enrich an already structured data set for better insights.
  • Improved model performance. When unstructured data, such as data extracted from text, is combined with structured data, it enriches the data set used for building models. The extracted data becomes an attribute used by a machine learning model.
  • Competitive advantage. Unstructured data can act as a competitive differentiator, both through the insights it provides and through the applications you can build with it. For example, we’re seeing organizations build predictive maintenance models using sound analysis. Natural language chatbots are being used by organizations to help with customer service. Image classification systems are being used in a host of applications from agriculture to auto sales. The list goes on.
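
To make the enrichment idea above concrete, here is a minimal Python sketch. It assumes a simple keyword count as a stand-in for real text mining, and every name in it (the keyword lists, `sentiment_score`, `enrich`, the sample customers and notes) is hypothetical and chosen only for illustration. It shows how a value extracted from call center notes could be joined to structured customer records as a new attribute.

```python
# Hypothetical sketch: the keyword lists, field names, and records are
# illustrative assumptions, not a real text-mining method or data set.
import re

NEGATIVE_TERMS = {"cancel", "slow", "expensive", "outage", "frustrated"}
POSITIVE_TERMS = {"great", "helpful", "fast", "thanks", "resolved"}


def sentiment_score(note: str) -> int:
    """Return positive keyword hits minus negative keyword hits for one note."""
    words = re.findall(r"[a-z']+", note.lower())
    positives = sum(w in POSITIVE_TERMS for w in words)
    negatives = sum(w in NEGATIVE_TERMS for w in words)
    return positives - negatives


def enrich(customers, notes):
    """Add an average note-sentiment attribute to each structured record."""
    scores = {}
    for customer_id, text in notes:
        scores.setdefault(customer_id, []).append(sentiment_score(text))

    enriched = []
    for row in customers:
        s = scores.get(row["customer_id"], [])
        # Customers with no notes get a neutral 0.0 (an arbitrary choice here).
        avg = sum(s) / len(s) if s else 0.0
        enriched.append({**row, "avg_note_sentiment": avg})
    return enriched


# Structured data: one row per customer.
customers = [
    {"customer_id": 1, "tenure_months": 36, "monthly_bill": 80.0},
    {"customer_id": 2, "tenure_months": 4, "monthly_bill": 55.0},
]

# Unstructured data: free-text call center notes keyed by customer.
notes = [
    (1, "Customer frustrated by outage, wants to cancel."),
    (1, "Issue resolved, customer said thanks."),
    (2, "Great service, fast and helpful."),
]

enriched = enrich(customers, notes)
for row in enriched:
    print(row)
```

In practice the keyword count would likely be replaced by a proper entity, theme, or sentiment extraction step, but the shape of the result is the same: one extra column per extracted attribute, ready to feed a model alongside the structured fields.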

TDWI Research indicates that organizations that are utilizing unstructured data are more likely to measure a top- or bottom-line impact from their analytics than those that are not.

Some types of unstructured data are already mainstream. For instance, in a 2022 TDWI survey, 50% of respondents were already managing internal text data. Close to 30% were already managing external text data. Likewise, over 20% of respondents were managing machine-generated data, still image data, and audio data. We’ve seen growth of about 10% since 2020 for audio and image data. Whether they are using this data for analytics and other use cases is less clear: 2023 TDWI survey data suggests that not all of this data is used for analytics.

So, if your organization wants to move forward with unstructured data, what are some of the considerations it should think through?

  • Understand the business need. First, with unstructured data, just as when you begin to collect any other data, you should ask, “What is the business problem?” If there is a clear reason to collect the unstructured data, you should be collecting it. If you think there may be a reason, you should consider it.
  • Storing it. Of course, you need to determine how and where to store the data. Can you store it on-premises? Should you store it in the cloud? That is going to depend on how much you collect and how you plan to use it. Do your homework. While the cloud may provide valuable tools to analyze new data types, sometimes the cloud can cost more than an on-premises solution.
  • Transformation. What does the pipeline look like for the unstructured data? Where does the transformation occur? In the pipeline? On the platform? Chances are that you’ll have to process some of the data. For instance, with unstructured text data you may want to extract entities (people, places, things), themes, or sentiments (positive, negative, neutral) and store those. You need to determine how and where to do that. For machine data, you may decide that you don’t need to store all of it, depending on the use case. For instance, maybe you only want to store the outliers.
  • Keeping the data fresh. Once you start to store unstructured data, how do you keep it fresh, as you do for structured data? You’ll have to consider what the pipeline will look like for new data.
  • Developing new skills. Of course, understanding and analyzing unstructured data requires new skills and talent. If you’re making use of image data, for instance, you’ll need someone with expertise in analyzing it. How do you build those skills? Can business analysts be used? Data scientists? Someone else? This needs to be part of the plan.
  • Governing unstructured data. There may be different issues to consider in governing new kinds of data. What standards need to be in place for it? How do you know whether it is high quality? (For example, how are you going to determine whether text data meets quality standards? What about sound data?) Who decides what the standards and rules are for unstructured data? How will these be put in place?
  • Responsible use of data. The ethical use of unstructured data needs to be considered too. For instance, you might have video data on customers, but should you use it?
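
As an illustration of the transformation consideration above, the following Python sketch shows one way to store only the outliers from machine data. It assumes a simple z-score rule with a threshold of 2.0; the function name `keep_outliers`, the threshold, and the sample sensor readings are all hypothetical, and a real pipeline might well use a different outlier definition.

```python
# Hypothetical sketch: the z-score rule, the threshold, and the readings are
# illustrative assumptions, not a recommended outlier definition.
from statistics import mean, stdev


def keep_outliers(readings, threshold=2.0):
    """Return only the readings whose z-score magnitude exceeds the threshold."""
    if len(readings) < 2:
        return []  # a sample standard deviation needs at least two points
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        return []  # all readings identical, so nothing stands out
    return [r for r in readings if abs(r - mu) / sigma > threshold]


# Example: temperature readings from a machine, with one obvious spike.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0, 20.1, 19.7, 20.0]

stored = keep_outliers(readings)
print(f"Kept {len(stored)} of {len(readings)} readings: {stored}")
```

Whether a filter like this runs in the pipeline or on the platform is exactly the kind of decision the transformation consideration asks you to make, and it depends on the use case.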

Want to learn more? Check out http://www.tdwi.org.
