Research Ethics: June 2012

Wednesday, June 13, 2012

Analytics in the Oracle Database

The addition of analytics to databases is a natural direction. As the volume of data increases, data movement dominates the cost of computation. It then makes more sense to move the computation and algorithms to the database than to move the data to an external server for analysis. Furthermore, in most cases, the results of the analysis need to be persisted back to the database and combined with other data in order to make them actionable.

Picking up on this, over the past releases Oracle has continuously added analytic features to its database. Taken as a whole these features make the Oracle Database a powerful platform for developing applications leveraging analytics. However, most users are not aware of the complete set of analytic features available in the database. The following list covers the features present in the Oracle Database 10g Release 2:
Complex data transformations
Data mining
Image feature extraction
Linear algebra
OLAP
Predictive analytics
Spatial analytics
Statistical functions
Text mining

As these features are part of a common server, it is possible to combine them efficiently and with ease. The overall benefit is greater than what could be achieved by integrating separate servers and tools. For example, it is possible to create efficient, arbitrarily complex SQL statements that combine data mining and text processing:
SELECT A.cust_name, A.contact_info
FROM customers A
-- Score each customer with the data mining model (tree_model)
WHERE PREDICTION_PROBABILITY(tree_model,
        'attrite' USING A.*) > 0.8
  AND A.cust_value > 90
  AND A.cust_id IN
      (SELECT B.cust_id
       FROM call_center B
       WHERE B.call_date BETWEEN '01-Jan-2005'
                             AND '30-Jun-2005'
         -- Oracle Text search over the call center notes
         AND CONTAINS(B.notes, 'Checking Plus', 1) > 0);
The above query selects all customers who have a high propensity to attrite (> 80% chance), are valuable customers (customer value rating > 90), and have had a recent conversation with customer services regarding a Checking Plus account. The propensity to attrite information is computed using a data mining model (tree_model). The query uses Oracle Text's CONTAINS operator to search call center notes for references to Checking Plus accounts.

Finally, it is also easy to integrate the results from queries like the one above with Business Intelligence tools such as Oracle Discoverer, Oracle Portal, and Crystal Reports (more on that in future posts).

In future posts I will cover:
How to get the most out of many of these features
How to solve real problems using analytics
The role of analytics in Business Intelligence and databases
In the meantime, the following provides a brief description of each one of the above features with links for further information.

Complex Data Transformations
Data transformation is a key aspect of analytical applications and ETL (extract, transform, and load). Besides support for transformations through SQL expressions, the Oracle Database, since the Oracle Database 10g Release 1, ships with a flexible data transformation package that includes a variety of missing value and outlier treatments, as well as binning and normalization capabilities.
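
To make these treatments concrete, here is a minimal Python sketch of a missing value treatment, min-max normalization, and equal-width binning. It runs outside the database and is not the Oracle transformation package; the function names and the sample `ages` data are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up sample column with one missing value.
ages = [20, None, 40, 60]
filled = fill_missing(ages)
normalized = min_max_normalize(filled)
bins = equal_width_bins(filled, 2)
```

In practice the order of these steps matters: missing values are usually treated before normalization or binning so that the range is computed over complete data.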

Data Mining
Oracle Data Mining (ODM), an option to the Enterprise Edition of the Oracle Database, provides a rich set of data mining functionality. ODM 10g Release 2 has eleven algorithms that can be used for classification, regression, clustering, anomaly detection, feature extraction, association analysis, and attribute ranking.

The database also includes in both Standard Edition and Enterprise Edition the frequent itemset package (DBMS_FREQUENT_ITEMSET). This package enables frequent itemset counting and it is used as a building block for ODM's Association algorithm. Frequent itemsets provide a mechanism for counting how often multiple events occur together. This blog post has a nice discussion of this feature.
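
The idea of frequent itemset counting can be sketched in a few lines of Python. This is a brute-force illustration that enumerates every small itemset; it is not how DBMS_FREQUENT_ITEMSET works internally, and the `frequent_itemsets` function and the `baskets` data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets of up to max_size items and keep those that
    appear in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        # Deduplicate and sort so each itemset has one canonical form.
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Made-up market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
frequent = frequent_itemsets(baskets, min_support=2)
```

Here `frequent` maps each itemset (a sorted tuple of items) to the number of baskets that contain it.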

Image Feature Extraction
Oracle interMedia is a feature of the Oracle Database that is included in both Standard Edition and Enterprise Edition. interMedia supports the extraction of image features (e.g., color histogram, texture, and positional color) that can then be used to characterize and analyze images.

Linear Algebra
Oracle Database 10g Release 2 ships with a new package UTL_NLA. The UTL_NLA package exposes a subset of the popular BLAS and LAPACK (Version 3.0) libraries for operations on vectors and matrices represented as VARRAYs. This package includes procedures to solve systems of linear equations, invert matrices, and compute eigenvalues and eigenvectors.
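
As an illustration of one of the operations mentioned, solving a system of linear equations, here is a small Python sketch using Gaussian elimination with partial pivoting. It does not call UTL_NLA, BLAS, or LAPACK; the `solve_linear_system` function is a hypothetical helper written for this example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every row below the pivot row.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution, from the last unknown up to the first.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        total = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - total) / m[row][row]
    return x


# 2x + y = 3 and x + 3y = 5
solution = solve_linear_system([[2, 1], [1, 3]], [3, 5])
```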

Predictive Analytics
Data mining can uncover useful information buried in vast amounts of data. However, many users who could benefit from these results do not have any data mining expertise. The DBMS_PREDICTIVE_ANALYTICS package addresses this by automating the entire data mining process from data preprocessing through model building to scoring new data. This package provides an important tool that makes data mining possible for a wider audience of users, in particular, business analysts. The capabilities of this package are also exposed through the Oracle Spreadsheet Add-In for Predictive Analytics, which enables Microsoft Excel users to mine their Oracle Database or Excel data using simple, "one click" Predict and Explain predictive analytics features.

Statistical Functions
The Oracle Database provides a long list of SQL statistical functions with support for: hypothesis testing (e.g., t-test, F-test), correlation computation (e.g., Pearson correlation), cross-tab statistics, and descriptive statistics (e.g., median and mode). The package DBMS_STAT_FUNCS adds distribution fitting procedures and a summary procedure that returns descriptive statistics for a column.
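
For readers who want to see what two of these statistics compute, the following Python sketch calculates a Pearson correlation by hand and takes the median and mode from the standard library. It is independent of the Oracle SQL functions; the `pearson` helper and the sample data are hypothetical.

```python
from math import sqrt
from statistics import mean, median, mode


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Made-up data: scores rise in exact proportion to hours.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]
r = pearson(hours, scores)
summary = {"median": median(scores), "mode": mode([1, 2, 2, 3])}
```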

OLAP
Oracle OLAP, an option to Oracle Database 10g Enterprise Edition, has features previously found only in specialized OLAP databases. Moving beyond drill-downs and roll-ups, Oracle OLAP also supports time-series modeling and forecasting.

Text Mining
Oracle Text uses standard SQL to index, search, and analyze text and documents stored in the Oracle database, in files, and on the web. It also supports automatic classification and clustering of document collections. Many of these analytical features are layered on top of ODM functionality.

Spatial Analytics
Oracle Spatial is an option for Oracle Enterprise Edition that provides advanced spatial features to support high-end GIS and LBS solutions. Oracle Spatial's analysis and mining capabilities include functions for binning, detection of regional patterns, spatial correlation, colocation mining, and spatial clustering. Oracle Spatial also includes support for topology and network data models and analytics. The topology data model of Oracle Spatial allows one to work with data about nodes, edges, and faces in a topology. The network data model includes analysis functions for computing shortest paths, minimum cost spanning trees, nearest neighbors, and traveling salesman tours, among others.
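
To make the shortest-path computation concrete, here is a small Python sketch using Dijkstra's algorithm on an in-memory graph with non-negative edge costs. It is not Oracle Spatial's implementation; the `shortest_path` function and the `roads` network are made up for illustration.

```python
import heapq


def shortest_path(graph, source, target):
    """Return (cost, path) for the cheapest route from source to target,
    or (inf, []) if the target cannot be reached."""
    best = {source: 0}
    previous = {}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was found already
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return float("inf"), []
    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path


# Made-up directed road network: node -> {neighbor: travel cost}.
roads = {
    "A": {"B": 4, "C": 1},
    "C": {"B": 2, "D": 7},
    "B": {"D": 3},
    "D": {},
}
cost, path = shortest_path(roads, "A", "D")
```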

Wednesday, June 6, 2012

Opinion Mining and Sentiment Analysis

A number of faculty at SF State are interested in textual analysis, natural language processing, and data mining. The article "Opinion Mining and Sentiment Analysis" by B. Pang and L. Lee (Foundations and Trends in Information Retrieval, 2, 2008) discusses the rapid emergence, since 2001, of the area of sentiment analysis, that is, information extraction performed expressly to summarize sentiment and opinion. With applications from intelligence analysis to marketing and movie reviews, this article is a gold mine of information on this methodologically complex, fast-moving field.

http://www.cs.cornell.edu/home/llee/omsa/omsa-published.pdf

FOUNDATIONS AND TRENDS IN INFORMATION RETRIEVAL BOOKS

AUTHORSHIP ATTRIBUTION

by Patrick Juola (Duquesne University, USA)

Authorship attribution, the science of inferring characteristics of the author from the characteristics of documents written by that author, is a problem with a long history and a wide range of application. It is an important problem not only in information retrieval but in many other disciplines as well, from technology to teaching and from finance to forensics. The idea that authors have a statistical "fingerprint" that can be detected by computers is a compelling one that has received a lot of research attention.

Authorship Attribution surveys the history and present state of the discipline, presenting some comparative results where available. It also provides a theoretical and empirically-tested basis for further work. Many modern techniques are described and evaluated, along with some insights for application for novices and experts alike.

Authorship Attribution will be of particular interest to information retrieval researchers and students who want to keep up with the latest techniques and their applications. It is also a useful resource for people in other disciplines, be it the teacher interested in plagiarism detection or the historian interested in who wrote a particular document.

MUSIC RETRIEVAL

A Tutorial and Review

by Nicola Orio (University of Padova, Italy)

Music Accessing and Retrieval is the first comprehensive survey of the vast new field of Music Information Retrieval (MIR). It describes a number of issues which are peculiar to the language of music — including forms, formats, and dimensions of music — together with the typologies of users and their information needs. To fulfil these needs a number of approaches are discussed, from direct search to information filtering and clustering of music documents. The emphasis is on tools, techniques, and approaches for content-based MIR, rather than on the systems that implement them. The interested reader can, however, find descriptions of more than 35 systems for music retrieval with links to their Web sites.

Music Accessing and Retrieval can be used as both a guide for beginners who are embarking on research in this relatively new area, and a useful reference for established researchers in this field.

OPEN-DOMAIN QUESTION ANSWERING

by John Prager (IBM T.J. Watson Research Center)

Open-Domain Question Answering is an introduction to the field of Question Answering (QA). It covers the basic principles of QA along with a selection of systems that have exhibited interesting and significant techniques, so it serves more as a tutorial than as an exhaustive survey of the field.

Starting with a brief history of the field, it goes on to describe the architecture of a QA system before analysing in detail some of the specific approaches that have been successfully deployed by academia and industry designing and building such systems.

Open-Domain Question Answering is both a guide for beginners who are embarking on research in this area, and a useful reference for established researchers and practitioners in this field.

OPINION MINING AND SENTIMENT ANALYSIS

by Bo Pang (Yahoo! Research, USA) & Lillian Lee (Cornell University, USA)

An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object.

Opinion Mining and Sentiment Analysis covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. The focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. The survey includes an enumeration of the various applications, a look at general challenges, and a discussion of categorization, extraction, and summarization. Finally, it moves beyond just the technical issues, devoting significant attention to the broader implications that the development of opinion-oriented information-access services has: questions of privacy, vulnerability to manipulation, and whether or not reviews can have measurable economic impact. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.

Opinion Mining and Sentiment Analysis is the first such comprehensive survey of this vibrant and important research area and will be of interest to anyone with an interest in opinion-oriented information-seeking systems.

CRIME DATA MINING

INTELLIGENT DATA MINING TECHNIQUES

Traditional data mining techniques such as association analysis, classification and prediction, cluster analysis, and outlier analysis identify patterns in structured data [3]. Newer techniques identify patterns from both structured and unstructured data. As with other forms of data mining, crime data mining raises privacy concerns.

Nevertheless, researchers have developed various automated data mining techniques for both local law enforcement and national security applications. Entity extraction identifies particular patterns from data such as text, images, or audio materials. It has been used to automatically identify persons, addresses, vehicles, and personal characteristics from police narrative reports [5]. In computer forensics, the extraction of software metrics, which include the data structure, program flow, organization and quantity of comments, and use of variable names, can facilitate further investigation by, for example, grouping similar programs written by hackers and tracing their behavior. Entity extraction provides basic information for crime analysis, but its performance depends greatly on the availability of extensive amounts of clean input data.

Clustering techniques group data items into classes with similar characteristics to maximize intraclass similarity and minimize interclass similarity; for example, to identify suspects who conduct crimes in similar ways or to distinguish among groups belonging to different gangs. These techniques do not have a set of predefined classes for assigning items. Some researchers use the statistics-based concept space algorithm to automatically associate different objects such as persons, organizations, and vehicles in crime records. Using link analysis techniques to identify similar transactions, the Financial Crimes Enforcement Network AI System [8] exploits Bank Secrecy Act data to support the detection and analysis of money laundering and other financial crimes. Clustering crime incidents can automate a major part of crime analysis but is limited by the high computational intensity typically required.
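
The grouping step described above can be illustrated with k-means, one common clustering method; the excerpt does not say which algorithm any particular system uses, so this is only a generic sketch. The `k_means` function, its fixed choice of starting centroids, and the sample points are all assumptions made for this example.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Group points into k clusters. Centroids start at the first k
    points, which keeps this small example deterministic."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignment, centroids


# Made-up two-dimensional feature vectors, one per incident.
incidents = [(1, 1), (9, 9), (1, 2), (8, 9), (2, 1), (9, 8)]
assignment, centroids = k_means(incidents, k=2)
```

No class labels are supplied; the two groups emerge from the distances alone. Each iteration compares every point with every centroid, which hints at why clustering large incident collections is computationally expensive.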

Association rule mining discovers frequently occurring item sets in a database and presents the patterns as rules. This technique has been applied in network intrusion detection to derive association rules from users' interaction history. Investigators also can apply this technique to network intruders' profiles to help detect potential future network attacks. Similar to association rule mining, sequential pattern mining finds frequently occurring sequences of items over a set of transactions that occurred at different times. In network intrusion detection, this approach can identify intrusion patterns among time-stamped data. Showing hidden patterns benefits crime analysis, but obtaining meaningful results requires rich and highly structured data.
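
As a rough illustration of how item sets become rules, the Python sketch below derives rules between pairs of items using the standard support and confidence measures. It is a simplified example, not the method of any system cited above; the `pair_rules` function, the thresholds, and the `sessions` data are invented for this purpose.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support, min_confidence):
    """Return sorted rules (antecedent, consequent, support, confidence)
    over item pairs that meet both thresholds."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        # Support: fraction of transactions containing both items.
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: how often rhs appears when lhs is present.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Made-up event sets, one per user session.
sessions = [
    ["login_fail", "port_scan", "root_access"],
    ["login_fail", "port_scan"],
    ["login_fail"],
    ["port_scan", "root_access"],
]
rules = pair_rules(sessions, min_support=0.5, min_confidence=0.6)
```

Each rule reads as "sessions containing the antecedent also tend to contain the consequent". With so few sessions the numbers mean little, which echoes the point that meaningful results need rich, well-structured data.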