Chapter 1. What is Unstructured Data?
Databases have been noticeably slow to be adopted for managing unstructured data, multimedia data in particular. The technology for managing multimedia, at both the hardware and software levels, is mature and stable. What keeps sites from storing multimedia in the database is a lack of expertise and understanding, together with a conservative view fostered by a number of factors, including historical issues with performance and integration software.
It is important to first define what multimedia is in relation to structured and unstructured data. Unstructured data is any data that is not stored in a structured format. Structured data is any data that has an enforced composition of atomic data types(1).
A relational database stores data in a structured format. Some non-relational databases also store their data in a structured format, so relational data can be considered a subset of structured data. XML is also considered structured, as is data stored inside object-oriented databases. Because the structure of XML is fluid, however, one can regard XML as semi-structured.
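The distinction between structured, semi-structured, and unstructured data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: SQLite stands in for a relational database, and the photo table, its columns, and the sample XML are hypothetical names that do not come from this chapter.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: the table definition enforces a fixed composition of columns.
# (Hypothetical schema; SQLite is only a stand-in relational database.)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photo ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, width INTEGER NOT NULL)"
)
conn.execute(
    "INSERT INTO photo (id, title, width) VALUES (?, ?, ?)",
    (1, "Sunset", 1920),
)
structured_row = conn.execute("SELECT id, title, width FROM photo").fetchone()

# The database rejects a row that breaks the enforced composition.
try:
    conn.execute(
        "INSERT INTO photo (id, title, width) VALUES (?, ?, ?)",
        (2, None, 800),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Semi-structured: XML is self-describing, but its shape is fluid.
# Here the number of <tag> elements can vary from one document to the next.
xml_text = (
    "<photo id='1'><title>Sunset</title>"
    "<tag>beach</tag><tag>evening</tag></photo>"
)
root = ET.fromstring(xml_text)
tags = [element.text for element in root.findall("tag")]

# Unstructured: a run of bytes with no composition the database can enforce.
raw_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0])
```

In this sketch the database can refuse the second row because the table defines what a valid row looks like, while nothing comparable constrains the contents of raw_bytes.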
There is a large amount of unstructured data in the real world that needs managing. In the last ten years most organizations have begun to recognize that there is a great need to manage it and to understand it. Because unstructured data refers to anything that is not structured, it can become very difficult to understand what is out there and how to deal with it. The traditional thinking has been to just treat it as a blob (binary large object), but with a greater understanding of the variety of unstructured data types that exist, the need to manage them has grown.
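The blob approach, and what it leaves out, can be sketched briefly. This is an illustration under stated assumptions rather than a recommended design: SQLite again stands in for the database, the asset table is hypothetical, and sniff_type is a made-up helper that recognizes only two well-known file signatures.

```python
import sqlite3

# Treating the data as a blob: the database stores the bytes and hands them
# back unchanged, but knows nothing about what they represent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asset (id INTEGER PRIMARY KEY, content BLOB NOT NULL)")

# The 8-byte PNG signature followed by filler bytes (not a valid image).
payload = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
conn.execute("INSERT INTO asset (id, content) VALUES (?, ?)", (1, payload))
stored = conn.execute("SELECT content FROM asset WHERE id = 1").fetchone()[0]


def sniff_type(data: bytes) -> str:
    """Hypothetical helper: guess a media type from leading signature bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    return "application/octet-stream"


# A first step beyond the blob: recognizing what kind of data is stored.
media_type = sniff_type(stored)
```

Storing the bytes is the easy part. Identifying the type is only the beginning of managing the data, since each type has its own internal composition to understand.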
To help understand this point, think of geometry and the rules (mathematics) associated with it. When mathematicians tried to come to grips with circles, triangles, and other shapes, the subject proved so complex that they started with the basic concepts first. This meant dealing with geometry in a two-dimensional world. In this world view, triangles had three sides and three angles that always added up to 180 degrees. Parallel lines never met. By focusing on just this world view, a greater understanding of geometry was formed. Core principles were established, along with many formulas and much of the underlying mathematics. In this analogy, the two-dimensional world is equivalent to structured data.
Once this two-dimensional world had become well studied and understood, the focus moved to the real three-dimensional world to see how it would behave. The three-dimensional world proved to be very complex, and so attention was directed to key areas that could be understood. These included the study of knots, symmetry, surfaces with holes, and curves. Some of the two-dimensional rules flowed through to the three-dimensional world but others did not. On a curved surface, for instance, parallel lines can meet and the angles of a triangle can add up to more or less than 180 degrees.
In this analogy, unstructured data is the three-dimensional world, and there is a need to understand what is in it. Just as there is no thorough understanding of three-dimensional geometry, there is no full understanding of unstructured data. It is an evolving and growing discipline as more information and experiences are gathered, tested, and learnt. So, as with the study of knots, holes, and curves, one can focus on key areas of unstructured data and learn from them. One key component is multimedia, which comprises video, audio, photographs, and documents.
Multimedia is also referred to as rich media. It is not limited to the four types identified, and some might even debate whether documents are a component of multimedia. As will be shown, when breaking down multimedia into its fundamental components, one can classify these multimedia types and then develop new types from them. These include three-dimensional objects, simulation data, and neural network data.
The analogy between three-dimensional geometry and unstructured data works well, but one also has to consider that mathematicians have gone beyond three-dimensional geometry into multi-dimensional geometry in an effort to help explain some key components of string theory, quantum theory, and astronomy. There are still a lot of unknowns with unstructured data. The recent arrival of quantum computing, which uses qubits to store information, will undoubtedly push the field of unstructured data management into completely new areas(2).
Just as there is overlap between the two-dimensional world and the three-dimensional world, so there is overlap between multimedia and structured data. The two are dependent on each other at the moment, but with improvements in technology this might eventually change. It is important to realize that as technology changes, the rules change. Working in multimedia is like trying to hit a moving target: what is right today might be invalidated tomorrow.