
Unstructured data: A HIDDEN WORLD OF OPPORTUNITIES

Yoran van Oirschot
Data Engineer

August 15th, 2016

extracting value from unstructured data

There is a world of data out there that you are probably storing but never using. This world is called unstructured data, and it is huge. IT researchers claim that at least 80% of all enterprise data is unstructured. Unstructured data represents the way people think and store information. For a computer, however, this data is extremely hard to interpret. At Building Blocks, we observe an increasing demand from clients to use this unstructured data. It brings many new opportunities that businesses can exploit, for example automatic analysis of spoken or written customer feedback, or insurance claim handling based on pictures of the damage that customers send in.

We believe there is a lot of added value in unstructured data. To extract this value, we should first transform unstructured data into structured data. Structured data is tabular data (rows and columns) that is well defined, meaning we know what kind of data each column contains. On structured data you can run machine learning algorithms and statistical models. So how can we perform this transformation?

the process from unstructured data to added value

Unstructured data 

interpreting unstructured data

Unstructured data is the rawest form of data. It can literally be any kind of file or even a printed piece of paper. The fun thing about unstructured data is that what you extract from it depends entirely on the angle from which you look at it. Take for example the following sentence:


“Unstructured data seems like fun for my organization.”
 

Language: English
Sentiment: moderate positive
Concepts: Unstructured data

These are just three out of many ways to look at this sentence. You could also look at emotion, classify it in a taxonomy, find relations to publications, etc. This process is called information extraction. In information extraction you are transforming unstructured data into structured data.
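To make the idea concrete, here is a minimal Python sketch of information extraction on the example sentence. It is a toy under stated assumptions: the word lists and the function name `extract_record` are made up for this illustration and stand in for the trained language, sentiment and concept models a real solution would use.

```python
# Toy information extraction: one sentence in, one structured row out.
# The word lists below are hypothetical stand-ins for trained models.
import re

POSITIVE_WORDS = {"fun", "great", "good", "love"}
NEGATIVE_WORDS = {"boring", "bad", "hate", "slow"}
KNOWN_CONCEPTS = ["unstructured data", "structured data", "data lake"]
ENGLISH_FUNCTION_WORDS = {"the", "for", "my", "like", "seems", "is", "a"}


def extract_record(sentence):
    """Turn free text into a row with fixed, well-defined columns."""
    text = sentence.lower()
    words = re.findall(r"[a-z']+", text)

    # Language: crude guess based on common English function words.
    is_english = any(w in ENGLISH_FUNCTION_WORDS for w in words)
    language = "English" if is_english else "unknown"

    # Sentiment: number of positive words minus number of negative words.
    score = sum(w in POSITIVE_WORDS for w in words)
    score -= sum(w in NEGATIVE_WORDS for w in words)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"

    # Concepts: match known phrases on word boundaries, so that
    # "structured data" is not found inside "unstructured data".
    concepts = [
        c for c in KNOWN_CONCEPTS
        if re.search(r"\b" + re.escape(c) + r"\b", text)
    ]

    return {
        "text": sentence,
        "language": language,
        "sentiment": sentiment,
        "concepts": concepts,
    }


record = extract_record("Unstructured data seems like fun for my organization.")
print(record)
```

Every sentence you feed in comes back with the same columns (`text`, `language`, `sentiment`, `concepts`), which is exactly what makes the result structured.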


Four types of unstructured data
 

▪ Unstructured text: The example shows just three aspects that can be extracted from text. But there is a lot more! In practice we can use it to build search engines; quickly focus on dissatisfied customers based on their textual feedback via social media; or automatically find an answer to a customer's question. 

▪ Audio: From audio files you could detect the speaker; extract emotion from the tone of voice; match it with similar audio fragments; detect the genre of music. These are again just a few of the possibilities. Spotify, for example, uses these kinds of techniques to recommend you new music. With speech-to-text we can transform audio into text and use the whole range of techniques from text analysis.

▪ Image: Images such as photos, icons and clip art can be analyzed in different ways. We can detect shapes, objects or parts of objects, people, emotion and much more. Scientists have analyzed all of Rembrandt's paintings using these algorithms. They extracted the distance between the eyes and nose, facing direction, the type of hat, the colors of the clothing and much more. With these structured features they were able to train a deep learning model to paint a new Rembrandt.

▪ Video: From video we can extract frames (which are images) and extract audio fragments. Both can be analyzed with all techniques in the corresponding fields. There is, however, also the aspect of motion. After the Boston Marathon bombings such algorithms were used to pinpoint areas of interest in the hours of video from surveillance cameras, cell phones, etc. to quickly find footage of the suspects.
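As a small illustration of the audio case, the sketch below turns a sound fragment into a structured row using only the Python standard library. To stay self-contained it synthesizes its own test tone instead of reading a real recording; the helper names, the 8 kHz sample rate and the zero-crossing pitch estimate are assumptions chosen for this demo, far simpler than what production audio analysis uses.

```python
# Toy audio feature extraction with the standard library only.
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # samples per second, an assumption for this demo


def synthesize_tone(frequency_hz, seconds):
    """Build an in-memory mono 16-bit WAV file containing a sine tone."""
    n_samples = int(SAMPLE_RATE * seconds)
    samples = [
        int(16000 * math.sin(2 * math.pi * frequency_hz * i / SAMPLE_RATE))
        for i in range(n_samples)
    ]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack("<%dh" % n_samples, *samples))
    return buffer.getvalue()


def audio_features(wav_bytes):
    """Read a mono 16-bit WAV file and return a row of structured features."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)
    samples = struct.unpack("<%dh" % n_frames, raw)

    duration = n_frames / rate
    # Root mean square amplitude: a rough measure of loudness.
    rms = math.sqrt(sum(s * s for s in samples) / n_frames)
    # A pure tone crosses zero twice per cycle, so counting sign changes
    # gives a rough pitch estimate.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return {
        "duration_s": duration,
        "rms_amplitude": round(rms, 1),
        "approx_pitch_hz": round(crossings / (2 * duration)),
    }


features = audio_features(synthesize_tone(440, 1.0))
print(features)
```

The same pattern applies to images and video frames: read the raw bytes, compute a handful of measurements, and store them as columns.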

Now that we have seen four categories of unstructured data, please take a moment to think about what kind of unstructured data you have available in your organization. Can you imagine how this data can be used to improve your data science solutions?


Storing and processing unstructured data

Previously we have looked at what kinds of unstructured data are out there and we have seen some examples of what we can extract from it. However, to start extracting value from the unstructured data sources you should first have an infrastructure that can store and process this vast amount of data. 

We believe that in order to get the most value out of your data, you should have an integrated data solution. Get rid of the silos in the organization and create a single version of the truth. Unstructured and structured data should be accessible through the same system. Such a system is often called a data lake. The data lake is fed from your organization's systems. On the data lake you can create data science solutions that consider the organization as a whole.

Data lake


HADOOP
Hadoop has become the de facto industry standard for storing and processing the unstructured data of the data lake. It is developed under the Apache Software Foundation and is fully open source. It is a system that can store and process files across many servers (together called a cluster). Such a cluster scales horizontally: adding servers adds both storage capacity and computational power.

Hadoop consists of two parts:
▪ Hadoop Distributed File System (HDFS): handles redundant and reliable storage. HDFS looks like a regular directory on a PC, except that it may contain hundreds of gigabytes of data or more, spread out over many servers.

▪ Yet Another Resource Negotiator (YARN): handles the resource management of the Hadoop cluster. Since there may be hundreds of servers and many different users of the cluster, these resources should be carefully managed.
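To give a feeling for what "redundant storage spread out over many servers" means, here is a toy Python model. It is not Hadoop code. The block size of 128 MB and the three copies per block match the HDFS defaults, but the round-robin placement, the server names and both function names are simplifications invented for this sketch; real HDFS uses its own rack-aware placement policy.

```python
# Toy model of HDFS-style storage: split a file into blocks and keep
# several copies of every block on different servers.

BLOCK_SIZE_MB = 128  # default HDFS block size in recent Hadoop versions
REPLICATION = 3      # default number of copies per block


def place_blocks(file_size_mb, servers):
    """Return, for every block of the file, the servers that hold a copy."""
    if len(servers) < REPLICATION:
        raise ValueError("need at least as many servers as replicas")
    n_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    return {
        block: [servers[(block + r) % len(servers)] for r in range(REPLICATION)]
        for block in range(n_blocks)
    }


def readable_after_failure(placement, failed_server):
    """A file stays readable if every block has a copy on a surviving server."""
    return all(
        any(server != failed_server for server in copies)
        for copies in placement.values()
    )


servers = ["server-1", "server-2", "server-3", "server-4"]
placement = place_blocks(300, servers)
for block, copies in placement.items():
    print("block", block, "->", copies)
print("readable without server-2:", readable_after_failure(placement, "server-2"))
```

Because every block lives on several servers, losing a single machine does not lose any data, and different servers can work on different blocks at the same time.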

These two parts alone cannot do much with your data except store it. Data processing software such as Apache Spark or Hadoop MapReduce can be used to efficiently extract value from the unstructured data. This software allows you to run your data science algorithms on the data in your cluster.
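The programming model behind MapReduce can be sketched in a few lines of plain Python. The example below counts words in two tiny documents. It only illustrates the map, shuffle and reduce steps on a single machine; the function names are our own and it does not use the Hadoop or Spark APIs, which run the same steps in parallel across the cluster.

```python
# Word count in the MapReduce style, on one machine, in plain Python.
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: combine all values of one key into a single result."""
    return word, sum(counts)


documents = ["Data lake data", "unstructured data"]
mapped = (pair for doc in documents for pair in map_phase(doc))
word_counts = dict(
    reduce_phase(word, counts) for word, counts in shuffle(mapped).items()
)
print(word_counts)
```

On a cluster, each server would run the map step on the blocks it stores locally, and only the much smaller grouped results would travel over the network.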

In this blog we have briefly seen what value unstructured data can bring to your data science solutions. We have differentiated between four kinds: text, audio, image and video. Algorithms can also convert one kind into another, for example audio into text. Each kind can be used to extract structured data which you can use in your data science solution. However, to extract structured data you should have a system in place that can store and process unstructured data. I have briefly introduced you to Hadoop, which can do just that. I hope that this will inspire you to look at unstructured data and think of all the possibilities that may take your data science to the next level.