Managing Multimedia and Unstructured Data in the Oracle Database
上QQ阅读APP看书,第一时间看更新

Data types

All metadata is text-based; in that, it is composed of characters from a well-defined character set or sets. To ensure consistency when copying or moving data, the same international standard as used for XML, which is UTF-8, should be used minimally. There is UTF-16 and other supersets of UTF-8 that can also be used.

As was covered in Chapter 1, What is Unstructured Data?, data that is stored as raw text is in effect, unstructured. This is due to the fact that there are no rules or controls that govern it. Though text-only metadata is flexible in its entry, it is easy to introduce errors. A good example that most museums encounter is representing the date in a text field.

In the following list, determine the actual dates:

  • 12-Dec-01
  • 10/11/12
  • 19 June
  • 30th February 2010
  • Februry 10th 1870
  • 50-60a.d.

In addition, how easy would it be to do a date search range on these values?

The following are the issues identified with the previous dates:

  • It is not clear what the year is. Is it 2001, 1901, 1801, or some other year? Shorthand for years is what resulted in the year 2000 issue.
  • Is it the 10th November 2012 or 11th October 2012? The date format dd/mm/yy versus mm/dd/yy always results in confusion, and neither date value is internationally accepted.
  • Is the value 19th June or 19th July? Shorthand for months can result in confusing dates.
  • This is not a valid date. It's very easy to enter in a non-existent date as there is no date validation.
  • February is misspelt. This can make it impossible to determine what the correct date is.
  • This is a circa date, not a correct date. Circa dates are of a different data type.

It is possible to break down text values into sub types, which have their own domain rules. Some of these are well-defined and conform to standards, others such as circa are still in the process of being defined. By breaking down the metadata into types, it becomes possible to do intelligent searches on the data and more importantly be guaranteed that the data is accurate.

The classic definition of a relational database is one in which all the data is perceived to be a set of tables. How the data is stored or managed is separate to the core concept that all data is being accessed using the SQL statement and modified using Data Manipulation Language (DML).

For metadata and all the data types, the same holds true. The data can be stored in an XML database, relational database, NoSQL data store, embedded in a digital object in a proprietary format, or a raw text file. When using Oracle, storing the data in an object/relational structure makes it easier to manage, secure, and control. What has been shown with social networks is that it is not necessary to store it in a traditional relational database. How it's stored is a matter of implementation based on the available software, technology, and business requirements.

The following lists some of the more popular and well-known data types. Some storage systems can handle these data types transparently and others cannot easily be represented.

Text

Even though all other data types are based around the text one, not all metadata fits naturally into a data type. In which case, it remains as text. Text fields are commonly seen described in databases as varchar or variable character fields. Text data can still have some domain-boundary rules associated with it:

  • Fixed domain list: A validation rule can be applied to a text field that forces it to belong to a set of well-defined values. This is also referred to as an enumerated type, for example, text metadata values belonging to the set of compass directions, such as north, south, east-west, north-east, south-west, and so on.
  • Repeating value: This is a concept that is forbidden in traditional relational databases and removed when a database is normalized to first normal form. With the introduction of object types in Oracle 8, repeating fields were reintroduced as varrays, as they were a natural feature of objects. As metadata is attached to a digital object, repeating values became a natural fit. An object can have a text field of keywords describing it. One or more values can be assigned. Max limits to the number of entries and the width of each entry can additionally be assigned. Text values can also be grouped together with other data types into a repeating group of values.
  • Fixed size: The length of characters or bytes is limited in size, for example, a text field has a maximum length of 100 characters. It's important when working in the UTF-8 (or greater) character set to differentiate between a character length and byte length. Certain non-traditional characters can be represented using 1 to 4 bytes in UTF-8, resulting in the number of bytes used to represent the text field exceeding the number of characters.

Date

The DATE data type is one that is stored internally as a number. Date is one of the most varying data types and what drives it is the precision required. Oracle internally supports date and timestamp.

Domain rules include enforcement of a valid date. Typical search options include the ability to find all digital objects created within a date range. As the data is stored internally as a number, it becomes quite easy to perform the date arithmetic, including adding a date and date interval together or subtract two dates. There are a large number of date functions available. Traditionally, date and time are stored together with a fine level of precision for time storage.

Date management gets complicated when time zones and daylight savings are factored in for international databases.

For some scientific or astronomical systems, the traditional DATE type is not flexible enough especial when relativistic motion has to be taken into account when doing calculations.

Interval

Oracle supports the concept of intervals being stored, which is essentially the difference between two dates.

Time

Time is generally included in the date, but there are situations where only the time is needed and not the date. For a botanical database, it might be important to note the time of day when a flower blossomed, but there is no requirement to store the date. Oracle does not support time being stored separately and usually requires the time to be filled with a dummy date value. By doing this, standard domain rules for time can be enforced and time calculations can be done.

Season

This data type is generally ignored, but for some systems it can be important. There are four traditional seasons – summer, autumn, winter, and spring, It's not correct to state that one can derive the season based on the date. If I say what season is it if it's 1st January? If you answer winter, the answer is correct in the northern hemisphere, but in the southern hemisphere, it's summer. Additionally, locations near the equator do not have the traditional four seasons, and generally have two, the wet season and the dry season.

Botanic systems need to store the season for describing the plants, flowers, or seeds. For searching, it's also reasonable to query all plants that display some activity in a particular season.

Circa

A circa date is an approximate date. It can include an approximate date range. When museums date old items, they might not know the exact date and use the term circa to denote the approximate date it was created.

When looking at geological dates, the date period can extend back hundreds of millions of years. When looking at astronomical dates, it can go back billions of years. Geology and astronomy generally don't use circa but their own terminology. The concept of using an approximate date is still the same.

Interestingly, the approximation varies geometrically originating from the current date. A circa date of 2000 would be accurate to about plus or minus 1 or 2 years. Whereas, a circa date of the 1900s might plus or minus be 3 or 4 years. A circa date from the 1600s might be plus or minus 10 or 20 years. The further back in time one goes, the more the degree of accuracy diminishes.

Circa

The approximation and degree of accuracy is determined by the dating method used, the type of item, and the length of time.

Circa

When searching on a circa date, a fuzzy logic algorithm is best used, and it needs to be flexible enough to factor in exact date and circa dates. If a person is searching for pottery in 1620, the search engine should query circa dates for anything that overlaps 1620 using fuzzy techniques. If the 1600's period was determined to be plus or minus 20 years, then any object with a circa date of 1600 to 1640 would match. Ranking now becomes important, as the closer the circa date is to the actual date, the higher the ranking.

If a person searches on circa 1620, then they are requesting a broader search range. In this case, it is from 1600 to 1640, meaning any object from 1580 (1600 minus 20) to 1660 (1640 plus 20) would match. Again ranking is important and objects matching closer to the number would have a higher rank.

Circa

There is no official circa standard and the text is free flowing. To address this, a formal syntax for circa has been described and can be found in Appendix D, Chapter References.

Boolean

A boolean value is traditionally thought of one that has two values, TRUE or FALSE. Sometimes, phrased as yes or no. Computationally, they can be represented as 1 and 0.

It's important to consider that the traditional thinking of boolean being binary doesn't hold up well in the real world and the NULL case needs to be correctly addressed. As such, a trinary logic system for dealing with boolean needs to be handled. In this case, the values are TRUE, FALSE, or NULL.

For metadata, this might be a tag that indicates whether the digital object is publicly available. The null case being that it hasn't yet been decided what it should be.

When using boolean, the set of domain functions available enables complex searches to be performed, for example, find all images that are publicly available but not yet licensed.

Number

The number data type is a type that has changed over time to handle the increasing power of computer systems. By default, a number is a real number. Real numbers(21) can be thought of as points on an infinitely long line, called the number line or real line, where the points corresponding to integers are equally spaced. Unfortunately, not all real numbers can be stored in a computer system. Transcendental numbers are ones with an infinite number of decimals (for example, π and e), and some numbers such as i (square root of negative 1) cannot be stored except as a formulaic representation.

What limits the range of numbers that can be stored is their precision. The precision(22) of a value describes the number of digits that are used to express that value. The number of digits also includes decimal values.

In a computer system, storage determines the precision. As a computer system is based on binary, the precision is linked to the number of bytes made available to store the number. When programming, the precision can be limited by the size of the registry values (not to be confused with a Windows registry). Computer chips that are 64 bit have a greater precision than those that are 32 bit. The EXIF metadata standard makes extensive use of numbers to represent a variety of values. The standards indicate what the precision is for the value.

Most database systems today can handle numbers 264 or ± 232.

Although it is commonly done, storing a number as a text limits its usage. It becomes very hard to validate or to perform arithmetic expressions against it. Storing as text also consumes more storage, which can make it harder to tune queries performed against the database.

Metric and imperial

Numbers are used extensively for storing scientific metadata. The challenge is then to use a universally accepted standard. The most common standard used by all countries in the world, except three, is the metric system(23). As the U.S. still uses imperial, this can cause issues when it comes to understanding what a number actually means.

If a metadata value of 5.0 is stored for length, is this in meters or yards? Without knowing what system is used, major errors in calculations can be made(24).

As the accepted international standard is metric (except for altitude), and even as the U.S. has accepted the use of this standard but been unable to implement its usage in the country(25), it should be a fait accomplice that metric is always used for. Experience has shown that this is not realistic. For museums and organizations that are decades old, the data they have might be in the imperial system, even though they are using the metric system.

Rather than trying to force a change to metric, it's best when storing numbers to first determine whether it is metric or imperial. This is easily achieved when using XML, as it is just another attribute. When using a relational database, this requires the value to be stored in another column to indicate whether it's metric or imperial. A possible solution is to create a specialized scientific data type, which has as its attributes, whether it is metric or imperial, and another attribute to store the number. This is an effective strategy but makes it hard to make use of a lot of built-in functionality that is available, when only a number is used. In this case, methods would have to be written to handle number processing. This might not be a cost-effective strategy.

Ultimately, an organization should determine for each of its data types whether they are to be stored in metric or imperial, and then enforce this on ingestion.

Accession number

Most museums use the term accession number to describe the primary key of the object. This is a unique value or number that identifies the digital object. What makes the accession number more robust is that information can be encoded into the number using dot notation. The number might not just be limited to numerics (the digits 0 to 9) but might also include characters.

The term originates from the idea of creating a unique key using a number system that increases by one for each new object. Though this is still possible, over time the characteristics changed.

In a registry system (a system that manages physical files, again not to be confused with the Windows registry), numbers are assigned sequentially to each new file. The full accession number is prefixed by the year, and reset to one at the beginning of the year. Special characters are used to separate the year from the number, but the year and the sequential number combined make the accession number, as only the two combined together form a unique value.

The following are examples of accession numbers based on a year system:

  • 2010/00023
  • 2008.1234

As the year is coded into it, it's immediately apparent when the object was cataloged.

For objects that can be represented in multiple ways, then additional notation can be used to identify it. So, if the object is a car, then pictures can be taken of it from different angles. In this case, these additional shots can be expressed as an additional number with a letter is assigned to it.

The examples given in the following list are all the digital objects that reflect different views of the same object:

  • 2011.1234a
  • 2011.1234b
  • 2001.1234c

or if number dot notation is used

  • 2011.1234.1
  • 2011.1234.2
  • 2011.1234.3

There is no right or wrong way for determining whether numbers or characters are to be used. Those with a computer background will likely gravitate towards numbers, as this is traditional for computer systems, whereas curators will gravitate towards letters as visually it's clearer what the relationship is. The letters stand out more in comparison to the numbers.

It's also possible to encode additional information in the accession number. The digital object can be prefixed using a character such as A (audio), V (video), P (photo), or D (document). Again dot notation can be used, or numbers can be associated with each type.

  • A.2012.1234a
  • P.2012.1234a
  • P.2012.1234b

This indicates that the three digital objects are all based on the same object. Where, the first is an audio of it, and the other two are digital photos of it.

Using the dot notation relationships can be encoded between multiple objects. In the following example, the accession number does not use the year, but rather uses a three-tier concept to categorize the object.

  • 14.5329.498
  • 14.5329.499
  • 14.5329.600

Those digital objects that are related share the same middle number. In artwork, this could be used to indicate that the same artist was used to produce the artwork (for example, Da Vinci or Monet). This, of course, gets more complicated when multiple artists work on the same digital object and the accession number starts to encounter limitations.

The information stored in the accession number hierarchy of dot notation is not limited in the length of information that can be stored in it. There are no exact universal standards that have to be conformed to. The format of the accession number is typically managed by the guidelines of software vendor. There are some accession number standards in some fields, such as bioinformatics, library science, and medicine.

With an accession number, what is important is being able to accurately and clearly define the rules about how the accession number is formed. Unfortunately, a number of well-aged museums have over time employed multiple and conflicting standards for the accession number, with a lot of exceptions to the rules employed within it to get around the changing nature of technology.

It's unfortunate that the concepts behind normalization were never realized by the architects of those who manage accession numbers. When normalization(26) occurs, the goal is to identify fields with multiple values in them and to break them out into separate columns. If this was done, the accession number would become a simple key that uniquely identifies the object and is a unique number. All other information such as the year and relationships between objects would be extracted and stored in other columns. The reality of the real world though doesn't always meet with the neat solution that normalization can bring. Museums display their works to the general public and label them. Those labels have limited spaces and being able to encode additional information in them, they can be of great use to those researching, identifying, and finding objects. So even though one can explain the benefits that normalization can bring to managing digital objects, the accession number will not disappear or be replaced by a normalized structure. There is nothing stopping one utilizing the best of both world's concept and using an accession number and having its information being additionally stored in a normalized or object structure. This feature can be covered in Oracle using a number of other features including materialized views and embedded functions in SQL. This is covered in greater detail in Chapter 4, Searching the Multimedia Warehouse.

Name

Name is a specialized data type used to identify an individual. This can be the person who created the object, took the photo of the object, who last edited details on it, or who is responsible for its copyright.

There are a lot of attributes of name, and it can be stored in a number of ways. It's typically stored as a text field in the format – first name, last name. Other attributes can include title, middle name, birth date, death date, and gender.

Like accession number, it's possible to store multiple attributes for name into the one field, giving added meaning to it. When looking at the person who first classified a botanical specimen, the representation of the name can include values such as first name, middle initial, last name, date classified, date reclassified, and name of person who reclassified(27).

For a museum, where the name is associated with a digital object, it is important that sufficient information is stored to ensure that false positives (covered in Chapter 4, Searching the Multimedia Warehouse) do not occur when a search is done.

If a digital object has the name of John Smith, then it's likely to be retrieved when someone searches for John Smith. But, is it the right John Smith? As there are so many potential false hits that can come back from this, the search is not accurate. This is where a name matching algorithm is needed.

Such an algorithm would have to apply fuzzy techniques and ranking based on the initial information entered. In the example of a memorial that has photos over a 100 years old with basic information identifying the individuals in them, then the name-matching becomes more complex. It needs to factor in values such as the middle initial or name, birth date, death date, nationality, and even death location. If some of these values are not available, then guesses (ranked accordingly) can be made.

For photographs containing multiple people, the circa date can be used along with the relationships in the group. If, for example, the photo is of John Smith, Fred Hangle, Harold Mills, and it was taken at the Somme in 1916 in World War 1, then a search looking for Smith, Hangle, and World War 1, should bring this up (World War 1 should translate to a circa date).

Additional attributes for name can include occupation, address (which is a type within a type), membership number, and contact details (phone, e-mail, mobile). What is important to note is that the information that is captured and managed within this type is determined by the context in which it is used. It is not appropriate to request the sex of a person in an e-Commerce system, as this is confidential medical information. For further information regarding name, sex, gender, and title data types refer to https://sites.google.com/site/sgtndata/.

Address

The address data type, like the name data type is composed of a number of varying attributes. The attributes used are based on the context in which the address is needed. Breaking down an address into country, city, suburb, and street enables geographical-based queries to perform. By using the Google address web service, an address can be translated into the spatial latitude/longitude co-ordinates. Keeping the country attribute system enables e-commerce systems to distinguish between local and international, for purposes of determining the currency for orders and shipping costs.

When used as part of the name data type, address is typically treated as a repeating group, as a person can have one or more addresses. When used in an e-commerce system, address can be used for shipping, billing, and delivering to different addresses, in the case of a gift. Address can also be used with name for matching and searching.

If a history of addresses is stored, then it's possible for security agencies to use time-based queries to establish approximate locations of individuals.

Filename

Most digital objects have a representation of themselves stored in a digital file, which is stored in a location controlled by the operating system. This is commonly referred to as the filesystem. Each file within the filesystem has a name, known as the filename.

The filename can include its physical path or directory structure, especially if the filename is not unique. Information can be stored in this structure and can be used to further identify or be used as metadata for the file. If, for example, the parent directory was the year the photo was taken, then it's valid to capture and extract this information from the filename.

It's also a realistic option for the accession number to be used as the filename. In some cases, a variation of this number can be used, especially if special characters are used and the storage media cannot handle it.

Filenames and their paths can be controlled and limited by the storage they are on. The paths can vary depending on the operating system. When a network drive is used, then understanding the path can get quite tricky. It's possible for a file to be stored on a Network Attached Storage (NAS) and be accessed simultaneously from Unix, Windows, and Mac. The access path to this file can vary in its structure.

In the following table, the same file stored in the same directory could be accessed using different methods. Each one using a different file path syntax for accessing it. If the server stores the Unix path, then a Windows user trying to use that same path is not likely to be able to access it. It will need to be translated.

By storing the digital objects in the database, the need to use or reference the path once loaded is not an issue, as the object can be served straight to the user from within the database. For organizations using a mixed environment, ensuring a consistent file path can be a challenge.

Digital files are not just limited to the two most used operating systems (Unix and Windows). They can also appear on mobile devices such as an iPad and more mature operating systems such as MVS.

A digital file is not necessarily unique, though most operating systems enforce the notion that the name must be unique within its home directory. Even this can get complicated when links to files are used. In this case, the same file can exist in multiple spots with different paths.

As discussed later, the challenge that most multimedia warehouses will experience is matching the files to existing relational (or equivalent) metadata.

It's also possible for searches to be performed against filenames and their paths. In some cases, the path might need to be translated to match the operating system knowledge level of the user. Most users are more familiar with Windows than Unix and users might actually only access a file via a Windows share. If the Unix path is stored, this might be of no use to the user, especially if the Windows share hides some or most of the path name.

Spatial co-ordinate

The traditional view of spatial co-ordinate is one which is a geographical location, such as latitude and longitude. In this case, the co-ordinate is just a point location. A set of co-ordinates can be combined into shapes, such as circles, squares, or polygon to represent an area. By adding additional dimensions, such as altitude, depth, or height, then a three-dimensional co-ordinate can be established. Incorporating time, as well as the planet, the co-ordinates that are based in can add extra information to the position.

As a spatial co-ordinate can be of any shape, it can be of any size. Once a spatial co-ordinate is stored, it's possible to perform spatial queries against it. A common one is find all the images within 5 km of my current location. A spatial co-ordinate lends itself well to being stored in XML, enabling it to be integrated with other images. Integration with Google Maps is a cost-effective solution for taking latitude/longitude points and visualizing them on a map.

A polygon spatial co-ordinate can be used to identify the path of a car under surveillance. A spatial query can then be used to correlate that co-ordinate with others and produce related digital objects. This could be photos or video of physical objects near them.

Spatial queries typically employ fuzzy search concepts such as finding images near us, or finding any digital object in this approximate area.

Even though latitude/longitude with geographic location are the most common applications seen for a spatial co-ordinate, it's not just limited to this. This data type can be used when the point of reference is not geographical but a virtual concept. In an intelligence data warehouse, the ability to create a spatial map of relationships between people, is a useful tool for analysis.