Comparing
A multimedia warehouse is based on the same concepts as a data warehouse but can vary dramatically, because the nature of the digital objects stored within it enables new capabilities and concepts. Most multimedia warehouses employ some data transformation, cleansing, and cataloging to make the warehouse more efficient to query and report on. Additionally, they might transform the images, or summarize, combine, and restructure them. The most popular usage is to web enable the frontend, enabling the warehouse to be queried and accessed via a web browser by the general public.
The data warehouse
The idea of a data warehouse has been around for a long time, and specialized software vendors have come about purely to address its needs. The data warehouse evolved from a number of different directions simultaneously. Academics then formulated more official rules governing what a data warehouse really is.
For some, the data warehouse came about to solve the performance issue of ad hoc queries causing havoc with the performance of a transactional database. A user running one badly formed query could shut down the database. The need to enable users to run these queries meant moving them to a copy of the database. From this grew the need to perform Extract, Transform, and Load (ETL)(1) against that copy. Various database features grew from this to enable the efficient movement of data from the primary databases to the data warehouse. This ensured that the data warehouse had information that was up-to-date.
Other needs arose, including the requirement for managers to be able to query a number of different databases. From the requirement to produce summary information grew the concept of business analytics, referred to as online analytical processing (OLAP)(2). The introduction of OLAP also opened up the idea that the data itself did not always have to be up-to-date and an exact copy of the source it originated from. OLAP produced summary data focusing on different data dimensions (for example, geographical location, departmental sections, and time), which were useful for performing complex aggregate queries. When time-based summary queries were performed, completely up-to-date information was not required, especially when historical queries were done. This concept is missed by most database administrators and relational academics, who have been trained and brought up to believe that a true database is always consistent. The data warehouse threw this concept out and changed some of the rules. For online transaction processing (OLTP), it was true that the data had to be consistent, but for data warehouses, there were new rules, and data consistency wasn't high on the agenda. For a multimedia warehouse, this same concept is equally true. A multimedia warehouse works to a different set of rules. The way the data is loaded, queried, and secured involves a different focus.
Data consistency(3) summarizes the validity, accuracy, usability, and integrity of related data between applications and across an IT enterprise. Data consistency is an important topic and is central to a relational database. For the user, consistency means that when they view data, the data has to be accurate and correct. It hasn't been changed by the disk or corrupted. It's a core concept of computing, and the Oracle Database has a lot of features built in to ensure consistency.
Data consistency is heavily emphasized in the relational model, and primary keys, foreign keys, and constraints were made available to enforce it. The consistency in the relational model is real time at the transactional level (called atomicity(4)). As the model is mathematically based, it cannot be faulted. It is well-proven and tested.
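As a minimal sketch of how this enforcement looks in practice (the table and column names are invented for illustration), the rules are declared once and the database then rejects any transaction that violates them:

```sql
-- Declarative consistency: the database refuses any transaction that
-- breaks these rules, such as an employee row pointing at a missing department.
CREATE TABLE department (
  dept_id   NUMBER       PRIMARY KEY,
  dept_name VARCHAR2(50) NOT NULL
);

CREATE TABLE employee (
  emp_id    NUMBER       PRIMARY KEY,
  emp_name  VARCHAR2(80) NOT NULL,
  salary    NUMBER(10,2) CHECK (salary >= 0),
  dept_id   NUMBER       NOT NULL REFERENCES department (dept_id)
);
```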
There is a tradeoff. Enforcing this level of consistency requires more computing resources and high-speed networks. The real-time nature of the consistency starts to fall apart in distributed systems. If an application is distributed across multiple databases at different sites, it can be quite difficult to keep them in sync and consistent in real time. Oracle replication initially tried to address this issue by offering synchronous (real-time) and asynchronous (delayed) replication. With the introduction of replication via the redo logs (a common replication feature of most databases), asynchronous replication became stock standard. The notion of a delay existing between when the data is changed and when that change is eventually reflected in other areas negated the real-time requirement of consistency and introduced the idea of eventual consistency(5).
Even with the power of computer systems today, real-time application consistency has scalability limitations. Attempting to enforce foreign keys and a multitude of other constraints can prove to be resource-intensive as the size of the database and the number of users grow. With the rise in popularity of NoSQL(6) came the notion of eventual consistency. It doesn't dispute the concept of data consistency at the transactional level. It says that the need for the data to always be consistent in real time isn't a mandatory requirement in all cases. For a financial system, it's most likely a mandatory requirement to always be consistent, but a social network application doesn't always require the data to be immediately consistent. By introducing eventual consistency, a number of previously encountered scalability and performance issues were overcome, enabling applications such as Facebook and Google to scale to hundreds of millions of users.
A data warehouse can make use of eventual consistency to achieve some of its performance requirements. The materialized view structure that can be used within the database is one such example. A data warehouse has different requirements on top of this and introduces a new concept, which traditional data consistency doesn't fully address.
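As a sketch of this (the table and view names are invented), a materialized view that is refreshed on a schedule deliberately lets its contents lag behind the base tables, trading real-time consistency for query speed:

```sql
-- Eventual consistency inside the database: summary data refreshed once a
-- day rather than kept consistent with the base table in real time.
CREATE MATERIALIZED VIEW sales_by_region_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE
  START WITH SYSDATE NEXT SYSDATE + 1   -- refresh daily
AS
  SELECT region,
         TRUNC(sale_date, 'MM') AS sale_month,
         SUM(sale_amount)       AS total_sales
  FROM   sales
  GROUP  BY region, TRUNC(sale_date, 'MM');
```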
Consistency is currently broken up into three sections:
- Point-in-time: This type covers disk and software. It checks whether the database writes data to the disk correctly.
- Transactional: This type ensures that a set of data items (a logical unit of work) is consistent. Within the database, this ensures that the data remains consistent if a failure occurs.
- Application: This type ensures that data across multiple transactions is consistent.
Each section expands on the capabilities of the previous one to enhance it.
What is missed is the accuracy and consistency of the data itself. In transactional consistency, the model doesn't care if a field containing an integer has the value 10 or 20, provided all other columns that reference it (primary keys, foreign keys) match.
Logical data consistency focuses on the data values themselves and their accuracy. It overlaps with eventual consistency. A good way of highlighting this is with a name field. A name field typically contains a first name and last name, but when a value is entered, is it logically correct?
What if, instead of John Smyth, John Smith is typed? Does it appear to be incorrect? The immediate answer is no; except that the consistency model can't tell if this is right or wrong. Even if the name John Smyth is entered, it still might be incorrect, because the person's full name wasn't entered. Should the name John Paul Smyth have been entered instead? At what point when entering a name is it correct? The same can be said for address or contact details. What if a person changes their name or phone number? In this case, the entered value might have the illusion of being correct when, in fact, it's now incorrect.
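Applications typically fall back on fuzzy string comparison, which can only say that two values are close, not which one is correct. A minimal sketch using Oracle's UTL_MATCH package with the names from the example above:

```sql
-- Similarity scores, not truth: the database can report that the strings
-- are close, but not that either spelling is the person's real name.
SELECT UTL_MATCH.EDIT_DISTANCE('John Smyth', 'John Smith')           AS edit_dist,
       UTL_MATCH.JARO_WINKLER_SIMILARITY('John Smyth', 'John Smith') AS similarity
FROM   dual;
-- A similarity above some chosen threshold (say 90) might be treated as a
-- probable match, with a human left to make the final call.
```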
Another way of looking at this is with dates. If a person enters a date relating to the period in which they were born, is the year sufficient? If they enter their birth date, is that date actually correct? A more precise value is one that includes a time. But is it a time with hour and minute, or hour, minute, and second? And what about hundredths of a second? The precision of the date stored varies based on the context in which the date is used.
True consistency implies accuracy in the data, that is, being able to trust the data and trust the results when it's queried. It has been shown that we can't trust the data, as there is a fuzziness to it, a range of trust values. With the birth date entered, we might trust the year, month, and day, but not the hour, minute, and second.
If a person enters in an e-mail address, is that address a valid one? Is that e-mail address the one that belongs to that person and will it only belong to that person? Some applications can achieve a high degree of comfort in determining that the e-mail matches the person, but to maintain this over time can be difficult. There is a degree of accuracy and trust to be obtained here.
Most of the time, these fuzzy issues with data items are glossed over, as they are too difficult to understand or control, or are beyond the boundaries of the application (fuzzy data is data that has a range of values; fuzzy logic refers to the mathematical manipulation of such data). We have learned to accept logical inconsistency in data as par for the course. It's now taken for granted so much that it's instinctively ignored in a lot of cases. Yet most data items have a degree of fuzziness to them. Any data item defined as an integer indicates that the precision required is not the same as a real number. Dates, timestamps, even spatial co-ordinates have degrees of precision, where we expect a certain level of accuracy but accept that it doesn't have to be fully accurate.
The relational system might have a mathematical model behind it, ensuring the consistency of the data in the transactions, but it can't control whether the data values themselves are fully correct. It can't mathematically enforce that the name entered is 100 percent valid or matches the person's true identity. For a name, it's very hard to even ensure that it has been spelt correctly.
When we take real-world data, it's translated and massaged to fit the computer system. Obvious errors can be checked for and corrected (for example, if an invalid date is entered), but we are never going to get full precision and full accuracy on all data entered. All that can be done is to achieve a level of trust in what is entered.
In a multimedia warehouse, achieving logical data consistency is not even attempted, as it quickly becomes apparent that fuzzy data forms the bulk of most digital objects. The goal is to achieve a level of precision based on each data item and then understand the implications of that precision.
In a warehouse that uses OLAP, when statistical queries are run over large volumes of data, minor issues in the precision of the data can be factored out (averaged). In other cases, data that doesn't fit within the standard deviation can be excluded as anomalous and ignored. Those who work heavily with statistics will know the adage, "Lies, damned lies, and statistics"(7). Manipulating the data, especially when you know its precision isn't high, can enable some users to adjust the results of queries to better fit their expectations or goals. The results can be fudged.
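As a sketch of this kind of exclusion (the sales table and the three-sigma cutoff are invented assumptions), rows lying well outside the standard deviation can be filtered out before averaging:

```sql
-- Exclude anomalous readings before averaging; the three-sigma cutoff is
-- itself a judgement call, not a mathematical certainty.
SELECT region, AVG(sale_amount) AS avg_sales
FROM   sales
WHERE  sale_amount BETWEEN (SELECT AVG(sale_amount) - 3 * STDDEV(sale_amount) FROM sales)
                       AND (SELECT AVG(sale_amount) + 3 * STDDEV(sale_amount) FROM sales)
GROUP  BY region;
```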
Multimedia warehouses take the logical data consistency issue further when it comes to classification of a digital object. Is that John Smith in the photo? Is that a lyrebird singing in the audio track? Is that a photo of a chair? Is this person in the video? Is this digital photo identical to this photo? Is this document a photo? As is covered in this chapter, multimedia databases utilize fuzziness extensively. Data is never accurate. It only has a degree of accuracy that is fluid. It can change based on the circumstances, or even how the query is phrased.
Those used to the traditional data warehouse, especially one based around relational concepts, can have a lot of trouble dealing with the fuzziness of multimedia and the fact that it is not accurate. This can lead to almost comical attempts made by people to classify it:
This PDF file is a document if it contains more than x number of words, but it's a photo if it contains one digital image and less than y number of words.
In most cases, it just doesn't make sense to try and match the relational world to the multimedia one. The two are very different. It has been shown that probability theory is a subset of fuzzy logic(8), meaning that dealing with the fuzziness of data is mathematically sound and a natural extension of data management.
I have worked with a number of people who just want to avoid all unstructured data and insist that it be ignored and not stored in the database (just keep it in the file system and out of harm's way). Based on my personal experience, the large amount of resistance in the computing field to working with multimedia and any form of unstructured data is quite worrying. In a number of cases, it's attributed to the topic just being too difficult to understand. For others, this type of data pushes their knowledge base beyond the traditional comfort zone of relational, which is well understood.
Computer science is a constantly changing environment. New technology and advances in it cause major rethinks in interface use, performance, and data management at least every two years. A newly released database introduces new features and replaces old concepts. Database administrators have to relearn concepts and ideas at least every two to three years. In computing, you can't be conservative and dream of staying in your comfort zone. Yet talking about the fuzziness of multimedia, the ways it impacts the database, and the ways to work with it is constantly ignored. Ironically, that conservatism is found in database vendors, including Oracle. In their case, I have stated many a time to a number of product managers that it's easier to (insert my valid witticism) than it is to convince Oracle of the benefits of multimedia in the database. Interestingly, when looking at the psychology behind this conservatism, one can use a positive aspect of it for designing and tuning databases. This is covered in Chapter 9, Understanding the limitations of Oracle Products, on tuning and why the greatest cause of performance problems is management. So many tuning issues are missed because fuzzy concepts are ignored.
As the data warehouse concept grew, the idea of just throwing any data into a central repository appeared, especially if it originated from older systems where not much was understood about its original structure. It was certainly easier and cheaper to just grab the data, copy it to a central store, and say to the users, "here it is, do with it as you want". Unfortunately, this concept failed because the data warehouse was driven by the database administrators. It was soon learned that a data warehouse was only successful if it was driven by the users themselves. They had queries and questions that needed to be answered. The data warehouse had a key business requirement and function. If that focus was lost, the data warehouse became a Dilapidated Warehouse and an expensive dinosaur. A number of data warehouses have suffered this fate.
But even in this case, all was not lost, as from it came the concept of data mining, where patterns within the data and between the different data items could be calculated automatically. Having a data warehouse that didn't have a core business requirement was not a death sentence. It was still possible to get useful information from it.
Data warehouses have numerous challenges to deal with. The most important ones are security, performance, and preventing information overload.
As more users access a data warehouse, it's important to ensure that only authorized users can access the data they are allowed to. For a security warehouse, information could be marked with different security clearance levels. This can require security to be implemented at the individual row level.
Unfortunately, just restricting access to the data could result in the data warehouse becoming unusable. In a population census database, users doing queries can get summary information about regions (for example, a suburb) but are not allowed to access the data coming from individual households because of legal privacy requirements. Restricting access to these records would mean that the summary queries cannot be performed. The security needs to be configured to resolve this dilemma.
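One way to express this compromise (a sketch only; the schema, table, view, and policy function names are invented) is to expose the raw rows only through a summary view, or to attach a Virtual Private Database policy that appends a restricting predicate to every query:

```sql
-- Expose only aggregated figures to general users; the raw household
-- rows stay inaccessible to them.
CREATE VIEW suburb_summary AS
  SELECT suburb,
         COUNT(*)              AS households,
         AVG(household_income) AS avg_income
  FROM   census_household
  GROUP  BY suburb;

GRANT SELECT ON suburb_summary TO census_reporting_role;

-- Alternatively, a row-level security policy adds a WHERE clause to every
-- query run against the base table; household_predicate is a hypothetical
-- function that returns the predicate text.
BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'CENSUS',
    object_name     => 'CENSUS_HOUSEHOLD',
    policy_name     => 'household_privacy',
    function_schema => 'CENSUS',
    policy_function => 'household_predicate',
    statement_types => 'SELECT');
END;
/
```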
One solution to address security is to use the concept of a data mart. A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse, which is usually oriented to a specific business line or team. Go to http://en.wikipedia.org/wiki/Data_mart for more information on data marts.
The use of a data mart enables the warehouse data to be tightly restricted to a well-defined set of users.
As access to summary information can become important and strategic to the business, especially if business decisions are based on it, the requirement to be able to audit what is queried and what a user actually views also becomes a key component.
Data warehouse queries can become very resource-hungry and expensive to run. Database systems have been constantly evolving to deal with the performance issues. Some of the performance solutions include parallelization, materialized views, smart caching, partitioning, and high-speed intelligent hardware (for example, Oracle Exadata). As the amount of data grows, so does the complexity of the queries users run, because, quite simply, they now can. This means the performance requirements of the data warehouse are always changing.
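As a hedged sketch of two of these techniques (the fact table and partition boundaries are invented), a warehouse table can be range-partitioned by time and then scanned in parallel:

```sql
-- Range partitioning keeps each quarter's data in its own segment, so
-- time-bounded queries only touch the partitions they need.
CREATE TABLE sales_fact (
  sale_id     NUMBER,
  sale_date   DATE,
  region      VARCHAR2(30),
  sale_amount NUMBER(12,2)
)
PARTITION BY RANGE (sale_date) (
  PARTITION p2023_q1 VALUES LESS THAN (DATE '2023-04-01'),
  PARTITION p2023_q2 VALUES LESS THAN (DATE '2023-07-01'),
  PARTITION pmax     VALUES LESS THAN (MAXVALUE)
);

-- A parallel hint asks the optimizer to spread the scan across multiple
-- server processes.
SELECT /*+ PARALLEL(s, 4) */ region, SUM(sale_amount) AS total_sales
FROM   sales_fact s
WHERE  sale_date >= DATE '2023-01-01'
GROUP  BY region;
```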
A data mart can also be useful for performance, as it allows the data warehouse to be partitioned and each data mart can be tuned to the requirements of the set of users using it.
As more and more data is moved into the data warehouse, it can become very hard to work out what sort of queries can be run and how best to run them. To resolve this, data maps or data dictionaries are created, providing a road map that enables users to query the data intelligently. Additionally, data marts allow the key data items of interest to be made available to the user, while hiding structures they have no need to see or access.
The multimedia warehouse is different and has many faces. It can be seen as an extension of the existing data warehouse, with the proviso that the focus is mainly on the digital objects and not as much on the data. A multimedia warehouse is a superset of a data warehouse. It can contain all the traditional data warehouse elements and then contain all the digital objects. In reality, to design and create an efficient and effective multimedia warehouse, it's best to start with the digital object as the core and then load in data relating to it.
Like a data warehouse, a multimedia warehouse should be driven by having a business need. For multimedia warehouses that have an intelligence gathering focus, the requirement for data mining becomes very important.
There is no one type of multimedia warehouse, just like there is no one type of data warehouse, as each exists to satisfy a business requirement. They can be grouped into a number of different types, each with its own characteristics. The location of storage used for the multimedia warehouse can be referred to as a repository.
Types of multimedia warehouses
The following information describes some types of multimedia warehouses. This list does not cover all possible variations and will change as the technology changes.
The traditional multimedia warehouse is based around the same concepts as a data warehouse. The goal is to provide a repository of digital objects and data that have originated from different sources. The data and the objects themselves go through an ETL process. This process would include the need to establish valid relationships between the data and the digital objects.
In a data warehouse, the data can be summarized into a layer, with that layer itself summarized, and so on, into numerous parent layers. The standard example is creating a layered summary structure of sales data, based on regions within a city, the state, a state regional area, and the country. Region is just one dimension of many in which the data can be grouped and summarized. Another dimension is time. Data might lend itself to being moved into these dimensions and summarized, but digital objects do not. That doesn't mean that a similar summary process can't be achieved. Digital photos can be combined into a montage, snippets can be extracted from video and combined, and key pages in different documents can be extracted and then combined. Oracle Text can use its gist capability to automatically summarize a document or extract its key themes.
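For the structured side of this summarization, a single SQL query can produce subtotals at every level of a dimension in one pass. A minimal sketch, assuming a sales_fact table with the geography columns shown:

```sql
-- ROLLUP produces subtotals at each level of the geography dimension,
-- plus a grand total, in one pass over the data.
SELECT country, state, region, SUM(sale_amount) AS total_sales
FROM   sales_fact
GROUP  BY ROLLUP (country, state, region);
```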
The product Cooliris (http://www.cooliris.com/) summarizes photos onto a three-dimensional wall. The website Midomi (http://www.midomi.com) will try and recognize and match a tune that is hummed or sung.
As the repository contains digital objects, the tools used to perform queries need to be enhanced to not only intelligently query these objects but to also display them. For video, this can be quite difficult, especially if the videos have originated from a variety of different sources.
Part of the ETL process for dealing with digital objects involves transforming them into a universally accepted format, enabling all tools accessing them to display them correctly. For digital images, this might involve converting them to JPEG. For video, it might involve converting them all to MPEG; for audio, converting them to MP3; and for documents, converting them to PDF. These formats have the greatest likelihood of being viewed or played by most applications and tools.
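For digital images, this conversion can be performed inside the database with Oracle Multimedia's ORDImage type. The following is a sketch only; the table and column names are invented, and the exact process command options depend on the release:

```sql
-- Convert a stored image to JPEG in place so that every client tool can
-- display it.
DECLARE
  img ORDSYS.ORDImage;
BEGIN
  SELECT photo INTO img
  FROM   photo_library
  WHERE  photo_id = 1
  FOR UPDATE;

  img.process('fileFormat=JFIF');   -- JFIF is the JPEG file format

  UPDATE photo_library
  SET    photo = img
  WHERE  photo_id = 1;
  COMMIT;
END;
/
```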
In a standard data warehouse, even though data can be summarized across multiple dimensions, when displayed, the data is typically shown in one dimension, meaning that only one key piece of information is conveyed within the summarized view. A summarized bar chart might display sales totals within one region. The one dimension of data conveyed is sales.
In a multimedia warehouse, the display requirements and methods inherently encourage multiple dimensions of information to be displayed. These concepts can then be taken back and used in a data warehouse. A chart can use color to convey one dimension of data, while the shape of the graph can be another dimension. Converting the output into three dimensions enables more dimensional information to be shown, including size, movement, icons, and a changing perspective based on the viewing angle. Even audio output can be integrated in. Google Maps utilizes this integration capability by allowing data such as public utilities, traffic information, points of interest, and road conditions to be overlaid and integrated into one map. Applications can even overlay their own dimensions of data. Another example is a tag cloud (covered later in this chapter), which uses the font size of a word to indicate additional information about its usage.
As one key goal of the data warehouse is to extract and process summary information, it soon became obvious that when reading raw figures from the database, it was easier to understand, comprehend, and find useful patterns in the data if it was converted into a visual form. Graphical OLAP tools became popular in the market to address this need. The human mind can absorb a lot of information quickly if it's presented in a visual form compared to presenting it as raw data.
A multimedia warehouse, by the nature of the digital objects that are stored in it, encourages the use of visualization tools to view and process it. There is a temptation for warehouse architects to convert the digital objects into raw data and use that for displaying the information, rather than using the strength of the underlying medium to create a more powerful and visual environment for the warehouse. This temptation originates from a lack of understanding of, and skill in, working with multimedia, and from trying to treat it as raw data just like a data warehouse, because that is the comfort zone of the architects.
Multimedia is referred to as rich media for a reason. It can greatly enhance and add intelligence to a warehouse. It should not be seen as raw binary data that might occasionally be useful for creating a visually appealing interface. The warehouse should have the digital objects as its core focus, with the metadata around them used to drive the summarization and perform analytical queries.
In an image bank warehouse, the goal is to provide a central repository, which all digital objects and applications can access. The metadata is stored in applications outside the warehouse and these applications then just reference the digital objects in the warehouse. The only metadata stored with the digital objects is physical attribute information about the digital object. For a photo, this would be the EXIF metadata.
An important goal of the image bank warehouse is to store the digital object once and have a repository that can be tuned to the special requirements of multimedia. In this environment, it is still reasonable to create a data warehouse, with values in the data warehouse referencing the image bank warehouse. The advantage is that traditional data warehouses do not have to worry about the management and nuances of dealing with multimedia. They do not have to worry about the storage requirement or about trying to handle and detect the duplicated digital objects that might result when different applications migrate parts of their data into the data warehouse.
The disadvantage is that the relationship between the data in the application and the digital object is loosely defined. It's typically a many-to-many relationship, meaning that one digital object can map to zero or more data items in other applications. Also, a data item in an application can map to multiple digital objects. In this scenario, it's possible to get orphaned records if an image is deleted or changed. In addition, the object relationship has to be configured. All the relationships need to be defined. If there are hundreds of thousands of digital objects and hundreds of thousands of data items across many applications, then it can be a very expensive process to build the relationship structure. When digital objects from different application systems are merged, it can be quite complex to look for duplicates, determine which digital object is the correct one, and then adjust the existing application to reference the master digital object.
So, even though an image bank warehouse can offer a lot of benefits, its strength and its weakness center around the object relationship table, how well it's managed, and how accurate the relationships are within it (see Appendix E, Loading and Reading, which can be downloaded from the link given in the Preface).
In a multimedia data mart, the goal is to take a controlled subset of digital objects, which can originate in a multimedia warehouse, possibly transform them, and then make them available for consumption. A popular method is to make these digital objects publicly available, where they can be manipulated, utilized, and even enhanced. Crowdsourcing methods can be applied to these images with the results cleaned and fed back into the parent multimedia warehouse.
The concepts behind a multimedia data mart are very similar to the traditional data mart, in that it is created to address security, performance, or information overload issues.
Another use is to take a well-defined subset of digital objects with a simplified subset of metadata and then locate them on a high-end server (a computer with a lot of resources). The digital objects are then made available within an organization for querying and display.
In a public warehouse, the goal is to take digital objects from one or more internal systems and place them in a database that can be accessed by the general public. The use of crowdsourcing (covered later) enables the general public to attach metadata to the images. When the digital objects are migrated to the public warehouse, they may be transformed into postcard-sized versions. This transformation loses information within the image but provides consistent width, height, and quality, giving a more aesthetic and user-friendly interface.
The public database servers housing the digital objects can be treated like a bastion host (a special-purpose computer on a network specifically designed and configured to withstand attacks(9)).
The queries performed in a public warehouse are a mixture of coarse and fine grained, based on what the core focus of the warehouse is (the definition of coarse and fine-grained queries is covered later in this chapter). Some warehouses are designed for researchers, others just to enable the general public to better understand what the organization offers (see Appendix E, Loading and Reading).
In an e-Sales warehouse, the primary goal is to enable a form of e-commerce selling of the digital objects or what the digital objects represent. The delivery and configuration is detailed in Chapter 5, Loading Techniques.
For this multimedia warehouse, the digital objects are collected from one or more internal systems. The use of metadata is key to driving how the images are found and subsequently purchased. This means that the metadata around the image has to be transformed, cleaned, and made suitable for public consumption. Metadata that is not suited needs to be removed (see Appendix E, Loading and Reading).
A very powerful form of multimedia warehouse is the one used for intelligence gathering. Government departments, defense organizations, police agencies, and security firms can use multimedia warehouses.
The politics within a state or country can encourage the development and use of a multimedia warehouse. Police agencies in different states of a country have a reputation for not trusting each other. This can stem from perceived corruption, personality clashes, or conflicting security procedures. The result is a hesitation to share information when solving a case. Governments then create new agencies with new directives to try to resolve this impasse. They collect the information, transform it, and create an intelligence database. In some cases, they can create a data mart focusing on a particular criminal area of interest, such as drugs, sexual offenses, or organized crime.
Information that is collected, cleansed, and stored in the central warehouse can come to it in either a structured or an unstructured format. Structured would include data where the meaning of each value is well known. This can include case information.
Unstructured can include surveillance video, audio from phone conversations, crime scene photos, and documents such as financial ledgers. The information might not have been digitized or fully cataloged. A crime scene photo might be labeled with a unique ID, ensuring its relationship to a case is established, but it might not be cataloged to the point where all the information in the image is identified. As previously covered, computer systems are still not at the point where they can easily analyze an image or video and determine what or who is in it.
Audio conversations, if clear and of a high quality, can be translated automatically, but auxiliary information in the audio, such as background noises or other simultaneous conversations, is not cataloged. To complicate the handling of audio conversations, a translator might be required if a different language is used. As covered in more detail later in this chapter, an automatic translator could be used, but the resultant translation might result in misinterpretation of the original conversation. The more information extracted, the greater the overall intelligence of the whole warehouse. Improvements in technology will ultimately overcome these limitations.
Additional information captured and stored includes biometric data. This covers fingerprints, voice patterns, DNA, and blood types.
Information can come from a variety of sources, including internal systems and the Internet. All types of information can be captured including public biographies, company histories, and specialized databases (such as entomological databases, furniture, carmakers, and pharmacy information). With storage now being a lot cheaper and increasing in capacity, more of these databases can be captured and stored, enabling more complex and intelligent queries to be performed. The use of robots to trawl for data is a feature that search engines use.
An intelligence warehouse is intrinsically object-focused. An object can be a person, car, or piece of evidence. Information is then captured about the relationship between those objects.
Information also has to be cataloged as to how trustworthy it is. Information gleaned from a blog would not be trustworthy, because it's likely to be just hearsay and personally biased, whereas information coming from an internal system may be highly trustworthy. Generic queries need to use a fuzzy matching system that takes into account the inherent trustworthiness of the data, ensuring that false relationships are not formed because of untrustworthy data. A query might need to be run a number of times, each time looking at different dimensions and using different fuzzy algorithms to do the matching. The different result sets can then be merged, with the aim of producing a result set that is indicative of the original question being asked.
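As a sketch of how trustworthiness might be folded into such a matching query (the tables, columns, and weighting scheme are invented assumptions), the raw name similarity can be scaled by a trust rating carried with each source before the results are ranked:

```sql
-- Each source carries a trust_weight between 0 and 1; the raw name
-- similarity is scaled by it so that hearsay never outranks a strong
-- match from a trusted system.
SELECT *
FROM (
  SELECT p.person_id,
         p.full_name,
         UTL_MATCH.JARO_WINKLER_SIMILARITY(p.full_name, :search_name)
           * s.trust_weight AS weighted_score
  FROM   person_of_interest p
  JOIN   information_source s ON s.source_id = p.source_id
  ORDER  BY weighted_score DESC
)
WHERE ROWNUM <= 20;
```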
The intelligence warehouse is a prime candidate for data mining, especially using a data mining(10) tool that can identify relationships between the different objects that might not normally be obvious. This can include the following (a hedged sketch of building such a model follows the list):
- Association rule learning: Looking for relationships in the data
- Clustering: Looking for groupings in the data
- Anomaly detection: Looking for data of interest that does not seem to fit
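The following is a hedged sketch of building a clustering model with Oracle's DBMS_DATA_MINING package; the model, table, and column names are invented, and the available settings vary between database releases:

```sql
-- Settings table: one row per algorithm setting, as expected by the
-- data mining package.
CREATE TABLE case_cluster_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  -- Ask for k-means clustering over a flattened table of case attributes,
  -- looking for natural groupings between objects.
  INSERT INTO case_cluster_settings
    VALUES (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_KMEANS);
  COMMIT;

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'CASE_CLUSTERS',
    mining_function     => DBMS_DATA_MINING.CLUSTERING,
    data_table_name     => 'CASE_FACTS',
    case_id_column_name => 'CASE_ID',
    settings_table_name => 'CASE_CLUSTER_SETTINGS');
END;
/
```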
The intelligence warehouse is not limited to just its own repository. One that can cross-reference its results with Google, Wikipedia, and other external sources can provide additional information and return unexpected relationships that might not normally have been considered.
The intelligence warehouse has a security requirement that separates itself from the other multimedia repositories. Such a huge and important amount of information requires securing the warehouse in a number of key areas:
- External hacking: Depending on the sensitivity of the data, there might be a requirement for external but authorized-only access to the warehouse. Police officers in the field might need to be able to run queries from remote locations. As soon as the system is made available on the Internet, it is open to potential hacking. Protecting against this requires numerous security systems and authentication methods. In addition, all data should be encrypted to a high level. Always keep in mind that a hacker will use the easiest way in. There is no need to take a sledgehammer to a front door when the back door is wide open. In most cases, the back door is the one left open to social engineering.
- Social engineering: This is an often neglected and not well-understood form of illegal access. The process simply involves getting access to the data by any means other than trying to break through the firewall by brute force. A common method is for a hacker to pose as the local IT person and ask a manager for their password. The only way to combat social engineering is to train all staff, including with numerous practice sessions, in how to avoid giving away information. In response, social engineers target new employees, who have not yet been trained, or staff in other companies that might have access. Hackers and social engineers are highly adaptable and adjust their strategies on a continual basis to new technology.
- Internal theft: This involves a staff member inside the organization stealing the data or performing a query and passing on the results to an external party. This can be done for ideological reasons or for financial gain. Although potentially hard to combat, a system can use its own data mining tools and focus them internally on the queries the staff perform, looking for anomalous or out-of-the-ordinary queries and flagging them. Restricting access to data is also important. Additionally, all queries performed and the results returned should be audited and periodically reviewed (a sketch of such query auditing follows this list). A staff member who is aware that every query they perform is audited and checked is knowingly in a harder position to commit theft.
- Modification: This involves modification of internal data, causing search queries to miss correct results, or setting up bogus information and sites with false data, which are then incorporated into the core warehouse. It's not enough to just protect the warehouse; the source systems where the data comes from also need to be protected. Modification can be deliberate but can also happen accidentally due to human error. Computer systems normally use checksums to ensure that their internal data is not corrupted and is valid. When a person is involved in translating an audio tape or identifying objects in a photo or video, mistakes can be made. The only way to apply the equivalent of a checksum is to have one or more people validate the data entered. Unfortunately, this can be quite an expensive operation, especially if there is a huge amount of information to be ingested and translated and limited resources available to process it. This is where it becomes important to establish the trustworthiness of the data: not just where it originates from, but also how accurately it has been processed.
- Trojans: This method has been used more often as security becomes tighter and better enforced. It basically involves fooling someone internally into installing a Trojan on their computer. This is traditionally done via scam e-mail messages designed to look official, or by tricking someone into plugging a malware-infected USB drive into a computer. This technique has been well documented as being used by companies or government agencies in different countries to spy on each other.
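Returning to the internal theft point, the query auditing it calls for can be sketched with fine-grained auditing; the schema, table, and policy names below are invented:

```sql
-- Record the SQL text of every SELECT that touches the sensitive case
-- table; the audit trail can then be reviewed for unusual queries.
BEGIN
  DBMS_FGA.ADD_POLICY(
    object_schema   => 'INTEL',
    object_name     => 'CASE_EVIDENCE',
    policy_name     => 'audit_case_queries',
    statement_types => 'SELECT');
END;
/

-- Periodic review of what was run and by whom.
SELECT db_user, timestamp, sql_text
FROM   dba_fga_audit_trail
WHERE  policy_name = 'AUDIT_CASE_QUERIES'
ORDER  BY timestamp DESC;
```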