Managing Multimedia and Unstructured Data in the Oracle Database

Metadata standards

Metadata, when conceptualized into a shared standard, can be said to be part of an ontology. An ontology(12) provides a shared vocabulary and taxonomy that models a domain by defining objects and/or concepts together with their properties and relations.

Grouping metadata together and defining a standard from it is useful for searching and for understanding what a digital object is. It is now common practice for digital cameras to capture metadata about each photo and store it in fields conforming to the EXIF standard. This can include the aperture, focal length, brightness, and GPS coordinates of the image.

Most metadata is stored in the XML format, which is an easy-to-use and flexible data storage format. The definitions of the metadata within an XML format can be described using the XMP standard, which is also in XML.

The popularity, simplicity, flexibility, and widespread use of XML have resulted in nearly all metadata being stored in the XML format. Even older formats are being coerced into this format where possible. A good example is IPTC, which is one of the first metadata standards used for TIFF images. It used a proprietary format, but in the latest version of TIFF images, it is stored in the XMP format.

Digital images

This section details some of the most common metadata formats based on the multimedia image type.

IPTC

The International Press Telecommunications Council (IPTC) standard enables the exchange of news. It was originally formulated in 1990 but became popular only after its adoption by Adobe. It was one of the first standards put together and does not use XML but rather a proprietary format embedded in the header of the image. In 2001, Adobe extended IPTC so that it could be incorporated into XMP, enabling large values to be stored. Both XMP and IPTC metadata can currently be stored in the same digital image at the same time. Each metadata item is called a property and has a fixed type of value. Internally, a unique numeric code is assigned to each property to identify what type of value it holds.

Examples of some of the properties that can be found with their identifying number are:

The standard can be found at http://www.iptc.org/std/photometadata/specification.

IPTC values are found in TIF images. Oracle can extract IPTC information from an image.

EXIF

This is a standard designed for still-image cameras; its name is short for exchangeable image file format. First published in 1996, it has since been enhanced to also include audio.

The EXIF standard is designed mainly to capture information about how the camera took the image. This includes values such as aperture and focal length. As digital cameras today are embedded in mobile phones, tablets such as the iPad, and other smart devices, they can also capture additional information, including who took the photo and where it was taken.

In that regard, some of the values in EXIF overlap with IPTC, but the standard does not replace IPTC. Information is stored in tags in a tightly controlled but flexible structure, enabling each tag data type to be identified. The standard can easily be expanded on as technology changes.
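The tag structure described above can be sketched in a few lines of Python. EXIF reuses the TIFF directory layout, in which each tag is a 12-byte entry: a 2-byte tag number, a 2-byte type code, a 4-byte count, and a 4-byte field holding either the value or an offset to it. The function name `decode_entry`, the lookup tables, and the sample bytes are illustrative assumptions; the sketch assumes little-endian byte order and handles only SHORT and LONG values stored inline.

```python
import struct

# Subset of the TIFF/EXIF type codes: code -> (name, struct format, size in bytes).
TYPES = {
    3: ("SHORT", "H", 2),
    4: ("LONG", "I", 4),
}

# A few well-known tag numbers, used only to make the output readable.
TAG_NAMES = {0x0100: "IMAGEWIDTH", 0x0101: "IMAGELENGTH", 0x0112: "ORIENTATION"}


def decode_entry(entry: bytes, byte_order: str = "<") -> dict:
    """Decode one 12-byte directory entry whose value is stored inline."""
    if len(entry) != 12:
        raise ValueError("a directory entry is exactly 12 bytes")
    tag, type_code, count = struct.unpack(byte_order + "HHI", entry[:8])
    if type_code not in TYPES:
        raise ValueError(f"type code {type_code} is not handled in this sketch")
    type_name, fmt, size = TYPES[type_code]
    if count * size > 4:
        raise ValueError("value is stored at an offset, not inline")
    values = struct.unpack(byte_order + fmt * count, entry[8:8 + count * size])
    return {
        "tag": TAG_NAMES.get(tag, hex(tag)),
        "type": type_name,
        "value": values[0] if count == 1 else list(values),
    }


# Hypothetical entry: ORIENTATION (0x0112), SHORT, count 1, value 6, padded to 4 bytes.
sample = struct.pack("<HHIHH", 0x0112, 3, 1, 6, 0)
orientation = decode_entry(sample)
```

Because every entry carries its own type code and count, a reader can skip tags it does not recognize, which is one reason the structure can grow without breaking older software.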

The standard is flexible and is growing. A large amount of information can be captured about the image. Some example values include:

  • APERTUREVALUE
  • ARTIST
  • BATTERYLEVEL
  • COLORSPACE
  • COMPRESSION
  • COPYRIGHT
  • DATETIME
  • EXIFIMAGELENGTH
  • EXIFIMAGEWIDTH
  • EXPOSURETIME
  • FOCALLENGTH
  • GPSALTITUDEREF
  • GPSLATITUDE
  • GPSLONGITUDE
  • IMAGEDESCRIPTION
  • IMAGELENGTH
  • IMAGEWIDTH
  • MODEL
  • ORIENTATION
  • PIXELXDIMENSION
  • SHARPNESS
  • SHUTTERSPEEDVALUE

The standard can be found at http://www.exif.org/specifications.html.

EXIF values are found in JPEG and TIF images. Oracle can extract EXIF information from an image.

XMP

Extensible Metadata Platform (XMP) is a standard designed to be embedded in a digital object and was pioneered by Adobe.

XMP uses XML and has been designed to incorporate other standards, including IPTC and EXIF. Because the standard is controlled by Adobe, how much of it is used depends largely on the camera manufacturer. It is extensible and adaptable, and the values that can be extracted are constantly changing. As it uses XML, the details about the items are stored as attributes in the XML along with the data, making it easier to define new values and embed them in the XMP.
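As a minimal sketch of how values stored as XML attributes can be read, the following Python collects the namespaced attributes from an XMP packet. The packet, the camera model, and the function name `xmp_attributes` are made up for illustration; real packets are larger and may also store properties as child elements, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical XMP packet, trimmed to two properties.
XMP_PACKET = """
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:tiff="http://ns.adobe.com/tiff/1.0/"
        xmlns:exif="http://ns.adobe.com/exif/1.0/"
        tiff:Model="ExampleCam 100"
        exif:FocalLength="50/1"/>
  </rdf:RDF>
</x:xmpmeta>
"""


def xmp_attributes(packet: str) -> dict:
    """Collect property attributes of each rdf:Description, keyed by (namespace, name)."""
    root = ET.fromstring(packet.strip())
    found = {}
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for qualified_name, value in description.attrib.items():
            if not qualified_name.startswith("{"):
                continue  # attribute without a namespace
            namespace, local_name = qualified_name[1:].split("}", 1)
            if namespace == RDF_NS:
                continue  # skip rdf:about and other RDF bookkeeping
            found[(namespace, local_name)] = value
    return found


properties = xmp_attributes(XMP_PACKET)
```

Keying each value by its namespace as well as its name keeps properties from different standards (here the TIFF and EXIF schemas) apart even when they share a local name.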

An XMP example extracted from a camera can be found in Appendix A, The Circa Data type.

XMP values can be found in JPG and TIF images. Oracle can extract XMP information from an image. It can also write XMP back to the image. Even though Oracle easily extracts the XMP, it only extracts the XML component. The XML Schema available in Oracle is needed to extract the various attributes and values from the XML.

The XMP standard can be found at http://www.adobe.com/products/xmp/.

Audio

Unlike video, which has a large number of competing standards, with new ones emerging every year and shifting in popularity depending on which devices can best play them, audio has one dominant standard: MP3. It won out because in the early days, when small files mattered more, it could compress audio dramatically with a loss of quality that varied but was acceptable. An audio CD could hold only about 70 minutes of music, whereas a disc of MP3 files could hold 10 times as much or more. This huge difference in storage led to its extensive use and to most manufacturers adopting it.

Compressing standalone audio is slightly different from storing audio in a video file. In the latter case, the popularity of the codec is driven by quality and by how easily it can be synchronized with the frames in the video. Standalone audio is associated with music albums and podcasts, which require metadata to help describe the content. For audio embedded in a video file, the album concept does not apply and the metadata is not needed. So, even though an MP3 codec can be used in a video file, its usage is determined more by the video.

ID3

The ID3 metadata standard (ID is short for identifier) first appeared in 1996. The first version of the format was simple and stored 128 bytes of data at the end of an audio MP3 file (which initially had no support for embedded metadata). The standard was designed to hold information about music, but with the increase in popularity of podcasts (radio show style audio), it was coerced into working with them as well.

It contained information including the artist name, album, speed, and genre. Its limitations led to a new compatible version, referred to as ID3v2, which can store more information. With the rise of iTunes and other music stores, the ability to tag music became important for managing and categorizing it.
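The simplicity of the first version can be illustrated with a short Python sketch. It assumes the widely documented ID3v1 layout: a fixed 128-byte block in the last 128 bytes of the file, starting with the marker "TAG" and followed by the title (30 bytes), artist (30), album (30), year (4), comment (30), and a one-byte genre index. The function names, the Latin-1 decoding, and the sample bytes are assumptions made for illustration.

```python
import struct

ID3V1_SIZE = 128
# Marker, title, artist, album, year, comment, genre index: 128 bytes in total.
ID3V1_LAYOUT = struct.Struct("<3s30s30s30s4s30sB")


def _text(raw: bytes) -> str:
    """Cut at the first NUL byte and decode; Latin-1 is an assumption of this sketch."""
    return raw.split(b"\x00", 1)[0].decode("latin-1").strip()


def read_id3v1(data: bytes):
    """Return the ID3v1 fields found in the last 128 bytes, or None if there is no tag."""
    if len(data) < ID3V1_SIZE:
        return None
    marker, title, artist, album, year, comment, genre = ID3V1_LAYOUT.unpack(
        data[-ID3V1_SIZE:]
    )
    if marker != b"TAG":
        return None
    return {
        "title": _text(title),
        "artist": _text(artist),
        "album": _text(album),
        "year": _text(year),
        "comment": _text(comment),
        "genre_index": genre,
    }


# Hypothetical file contents: placeholder audio bytes followed by a tag block.
fake_file = b"\x00" * 256 + ID3V1_LAYOUT.pack(
    b"TAG", b"Example Title", b"Example Artist", b"Example Album", b"1996", b"", 17
)
tag_fields = read_id3v1(fake_file)
```

The fixed field widths show where the limitations come from: a title longer than 30 bytes simply does not fit, and there is no room for items such as cover art.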

The ID3 data is not stored in an XML format. At the moment, ID3v2 remains the popular standard in the market, while the Adobe XMP standard, which also supports MP3, has yet to be widely adopted. Its popularity will increase only when major music stores start using it. With ID3v2, album covers (JPG thumbnails) can be embedded in the track.

The following table lists the most common ID3v2 values used:

The standard can be found at http://www.id3.org.

ID3v2 values are found in MP3 audio files. Oracle can extract property information from an MP3 file but not the ID3 values.

Relational

Standards for storing text data have existed for a long time. Botanical systems have been using taxonomies for hundreds of years. In addition, botanical systems have tried to classify the names of the people who identified and classified the plants. This was done before computers existed and employed a flexible form of grammar and syntax to classify names(13). For example, Betula alba L. 1753, nom. rej.

With the rise in popularity of XML and the understanding that data can be grouped together and attached to an object, numerous standards began to appear to control the attribute names and how the data would be grouped. The goal was twofold:

  • To allow data to be intelligently searched on (for example, find me all authors whose name ends with Smith)
  • To allow data for objects to be interchanged between different systems

In the 1990s, with the increase in usage of relational databases, the need arose to interact with these standards and to convert data from the relational format into the XML format and back again.

In this scenario, relational refers not to a relational database but to the potential to store textual data in a relational structure via XML.
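The conversion between the two formats can be sketched as follows. This is a minimal Python illustration, not how any particular database performs the mapping: a row is modeled as a dictionary of column names and string values, and the table name `author`, the column names, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def row_to_xml(table: str, row: dict) -> str:
    """Serialize one relational row as an XML element with one child per column."""
    record = ET.Element(table)
    for column, value in row.items():
        ET.SubElement(record, column).text = str(value)
    return ET.tostring(record, encoding="unicode")


def xml_to_row(xml_text: str) -> dict:
    """Rebuild the row from the XML; every value comes back as a string."""
    record = ET.fromstring(xml_text)
    return {child.tag: child.text or "" for child in record}


row = {"author_id": "42", "name": "Harold Smith", "country": "Australia"}
xml_text = row_to_xml("author", row)
round_trip = xml_to_row(xml_text)

# Once the data is back in column form, a search such as
# "authors whose name ends with Smith" is a simple filter.
smiths = [r for r in [round_trip] if r["name"].endswith("Smith")]
```

Note that column types are lost in this simple mapping, since XML text is untyped; a schema is needed if numbers and dates must survive the round trip.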

CDWA Lite

The Lite version of Categories for the Description of Works of Art (CDWA) is an XML-based schema that uses CDWA as its base. Its goal is to provide an easy-to-use standard to enable data transfer between museums(14).

CDWA uses the concept of a piece of work, which is an object. When digitized, it becomes a digital object. Some of the values used include:

The Dublin Core® Metadata Initiative

Dublin Core(15) is a metadata standard used to describe resources for the purposes of discovery. Initially adopted by museums, the standard can be used by most organizations. It is composed of two levels, simple and qualified.

The simple level consists of these values:

  • Title
  • Creator
  • Subject
  • Description
  • Publisher
  • Contributor
  • Date
  • Type
  • Format
  • Identifier
  • Source
  • Language
  • Relation
  • Coverage
  • Rights

These values overlap with those of the CDWA standard described earlier. A description of each value can be found at http://dublincore.org/documents/dces/. Each value corresponds to one metadata field.

As the standard is primarily designed for discovery, search engines can make use of the individual metadata fields to perform advanced searches. Searches can be done on just the artist name, location, or period of time in which the work was created.

An example of Dublin Core data in XML format is:

<metadata
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:title>Hockey Stick</dc:title>
<dc:creator>Margot Donte</dc:creator>
<dc:subject>abstract art</dc:subject>
<dc:description>Cubist work of art depicting the essence of sport and life</dc:description>
<dc:publisher>Acme Publishing</dc:publisher>
<dc:contributor>Harold Smith</dc:contributor>
<dc:date>circa 1920</dc:date>
<dc:type>Painting</dc:type>
<dc:format>Canvas</dc:format>
<dc:identifier>M.23.32</dc:identifier>
<dc:source>Australia</dc:source>
<dc:language>English</dc:language>
<dc:relation>see M.23.32.b</dc:relation>
<dc:coverage>Pre 1930's Depression</dc:coverage>
<dc:rights>Public Domain</dc:rights>
</metadata>
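A record like this can be read with any namespace-aware XML parser, provided the dc prefix is bound to the Dublin Core element namespace (http://purl.org/dc/elements/1.1/). The following Python sketch uses a shortened copy of the record; the function name `dublin_core_fields` is an illustrative assumption. Values are collected in lists because a Dublin Core element may be repeated.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Shortened copy of the sample record, with the dc prefix declared.
RECORD = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Hockey Stick</dc:title>
  <dc:creator>Margot Donte</dc:creator>
  <dc:date>circa 1920</dc:date>
  <dc:type>Painting</dc:type>
</metadata>
"""


def dublin_core_fields(xml_text: str) -> dict:
    """Map each Dublin Core element name to the list of values found for it."""
    root = ET.fromstring(xml_text.strip())
    fields = {}
    for element in root:
        if element.tag.startswith(f"{{{DC_NS}}}"):
            name = element.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


fields = dublin_core_fields(RECORD)
# A field-level search: does this record list the given creator?
is_by_donte = "Margot Donte" in fields.get("creator", [])
```

Searching a single named field, rather than the full text, is what lets a search engine tell a work created by someone from a work that merely mentions them.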

The goal of the qualified Dublin Core level is to provide a well-defined vocabulary or exemplary qualifiers. These add further structure that is applied to a value. When a value is qualified, it can still be read as if it were a standard sentence. Qualifiers can include phrases such as "Is Version Of" or "Is Part Of".

Darwin Core

Darwin Core(16) is an extension to Dublin Core. It is intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The focus is taxonomy for biological disciplines. The standard includes coverage of:

  • Occurrence: This is a method of uniquely identifying the record.
  • Event: This is an action that occurs at a place and during a period of time.
  • Dublin Core location terms: This is a region or physical spatial location. Overlapping with Dublin Core, which already defines this, allows for integration of data between different systems.
  • Geology: This includes geological classification, covering concepts such as geochronologic era, chronostratigraphic erathem, and the stratigraphic horizon.
  • Identification: This includes information regarding how the identification of the taxonomy was made.
  • Taxon: This includes the taxonomy (phylum, class, order, genus, or species) of the organism.
  • Relationship: This includes the details about the relationship between resources.
  • Measurement (or fact): This includes the value, type, method, creator, and date of information collected.

Media Art Notation System

Media Art Notation System (MANS) is a System of Formal Notation for Scoring Works of Digital and Variable Media Art. It is an attempt to define a standard that uses XML for the categorization of multimedia. It uses Digital Item Declaration Language (DIDL), which allows for more granular descriptions of a multi-component digital object(17).