Friday, May 25, 2012

Java DataMining Using JDM API

With the standardization of the Java Data Mining (JDM) API, enterprise Java applications can now incorporate predictive technologies.

Data mining is a widely accepted technology used for extracting hidden patterns from data. It is used to solve many business problems like identifying cross-sell or up-sell opportunities for specific customers based on customer profiles and purchase patterns, predicting which customers are likely to churn, creating effective product campaigns, detecting fraud, and finding natural segments.

More and more data mining algorithms are being embedded in databases. Advanced analytics, like data mining, is now widely integrated with applications. The objective of this article is to introduce Java developers to data mining and explain how the JDM standard can be used to integrate this technology with enterprise applications.

Data Mining Functions
Data mining offers different techniques, aka mining functions, that can be used depending on the type of problem to be solved. For example, a marketing manager who wants to find out which customers are likely to buy a new product can use the classification function. Similarly, a supermarket manager who wants to determine which products to put next to milk and eggs, or what coupons to issue to a given customer to promote the purchase of related items, can use the association function.

Data mining functions are divided into two main types called supervised (directed) and unsupervised (undirected).

Supervised functions are used to predict a value. They require a user to specify a set of predictor attributes and a target attribute. Predictors are the attributes used to predict the target attribute value. For example, a customer's age, address, occupation, and products purchased can be used to predict the target attribute "Will the customer buy the new product? (YES/NO)."

Classification and regression are categorized as supervised functions. Classification is used to predict discrete values, e.g., "buy" or "notBuy," and regression is used to predict continuous values, e.g., salary or price.

Unsupervised functions are used to find the intrinsic structure, relations, or affinities in data. Unsupervised mining doesn't use a target. Clustering and association functions come under this category. Clustering is used to find the natural groupings of data, and association is used to infer co-occurrence rules from the data.

The Data Mining Process
Typically data mining projects are initiated by a business problem. For example, a CEO could ask, "How can I target the right customers to maximize profits?" Once the business problem is defined, the next step is to understand the data available and select the appropriate data to solve the problem. Based on the data's characteristics, the data is prepared for mining. The right mining function is then selected and a mining model is built with the data. After building the model, its results are evaluated, and once evaluated the model is deployed. The CRISP-DM standard details the typical data mining process. Figure 1 illustrates a typical data mining process.

Enterprise applications like CRM analytics try to automate the data-mining process for common problems like intelligent marketing campaigns and market-basket analysis.

JDM API Overview
The Java Community Process (JCP) released the JDM 1.0 standard in August of 2004. JDM provides an industry standard API to integrate data mining functionality with applications. It facilitates the development of vendor-neutral data mining tools/solutions. It supports many commonly used mining functions and algorithms.

JDM uses the factory-method pattern to define Java interfaces that can be implemented in a vendor-neutral fashion. In the analytics business there's a broad set of data mining vendors who sell everything from a complete data mining solution to a single mining function. The JDM conformance model reflects this: even a vendor that implements a single algorithm or mining function can be JDM-conformant.

In JDM, javax.datamining is the base package; it defines the infrastructure interfaces and exception classes. The sub-packages are organized into core packages and packages per mining function and algorithm type. The core sub-packages are javax.datamining.resource, javax.datamining.base, and javax.datamining.data. The resource package defines the connection-related interfaces that let applications access the Data Mining Engine (DME). The base package defines primary objects such as the mining model. The data package defines all physical and logical data-related interfaces. The javax.datamining.supervised package defines the supervised-function interfaces, and the javax.datamining.algorithm package contains the sub-packages for the individual mining algorithms.

Solving the Customer Churn Problem Using JDM
Problem Definition
Customer attrition is one of the big problems companies face. Knowing which customers are likely to leave helps in developing a customer-retention strategy. Using data-mining classification, one can predict which customers are likely to leave. In the telecommunications industry this problem is known as customer churn; churn is a measure of the number of customers who leave or switch to competitors.

Understand and Prepare the Data
Based on the problem and its scope, domain experts, data analysts, and database administrators (DBA) will be involved in understanding and preparing data for mining. Domain experts and data analysts specify the data required for solving the problem. A DBA collects it and provides it in the format the analyst asked for.

In the following example, several customer attributes are identified to solve the churn problem. For simplicity's sake, we'll look at 10 predictors. However, a real-world dataset could have hundreds or even thousands of attributes (see Table 1).

Here CUSTOMER_ID is used as the case id, which is the unique identifier of a customer. The CHURN column is the target, which is the attribute to be predicted. All other attributes are used as predictors. For each predictor, the attribute type needs to be defined based on data characteristics.

There are three types of attributes, i.e., categorical, numerical, and ordinal.

A categorical attribute is an attribute where the values correspond to discrete categories. For example, state is a categorical attribute with discrete values (CA, NY, MA, etc.). Categorical attributes are either non-ordered (nominal) like state and gender, or ordered (ordinal) such as high, medium, or low temperatures. A numerical attribute is an attribute whose values are numbers that are either integer or real. Numerical attribute values are continuous as opposed to discrete or categorical values.

For supervised problems like this, historical data must be split into two datasets, i.e., for building and testing the model. Model building and testing operations need historical data about both types of customers, i.e., those who already left and those who are loyal. The model-apply operation requires data about new customers, whose churn details are to be predicted.

The Data Mining Engine (DME)
In JDM, the Data Mining Engine (DME) is the server that provides infrastructure and offers a set of data mining services. Every vendor must have a DME. For example, a database vendor providing embedded data-mining functionality inside the database can refer to the database server as its data mining engine.

JDM provides a Connection object for connecting to the DME. Applications can use the JNDI service to register the DME connection factory to access a DME in a vendor-neutral approach. The javax.datamining.resource.Connection object is used to represent a DME connection, do data-mining operations in the DME, and get metadata from the DME.

Listing 1 illustrates how to connect to a DME using a connection factory that's registered in a JNDI server.
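Listing 1 itself isn't reproduced in this post, but the pattern it describes looks roughly like the sketch below. The JNDI name, credentials, and DME URI are placeholders; the exact values and factory class depend on the vendor's DME.

import javax.naming.InitialContext;
import javax.datamining.resource.Connection;
import javax.datamining.resource.ConnectionFactory;
import javax.datamining.resource.ConnectionSpec;

// Rough sketch: look up the vendor's ConnectionFactory in JNDI and open a DME connection.
public class DmeConnect {
    public static Connection connect() throws Exception {
        InitialContext jndiContext = new InitialContext();
        // The vendor registers its ConnectionFactory under a JNDI name of its choosing (placeholder here).
        ConnectionFactory connFactory =
            (ConnectionFactory) jndiContext.lookup("java:comp/env/jdm/dmeConnectionFactory");

        ConnectionSpec spec = connFactory.getConnectionSpec();
        spec.setName("dm_user");               // placeholder credentials
        spec.setPassword("dm_password");
        spec.setURI("dme://localhost:9999");   // placeholder DME location

        return connFactory.getConnection(spec);
    }
}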

Describe the Data
In data mining, the ability to describe the physical and logical characteristics of the mining data for building a model is important.

JDM defines a detailed API for describing physical and logical characteristics of the mining data.

The javax.datamining.data.PhysicalDataSet object is used to encapsulate location details and physical characteristics of the data.

The javax.datamining.data.LogicalData object is used to encapsulate logical characteristics of the data.

A logical attribute can be defined for each physical attribute in the physical data set. A logical attribute defines the attribute type and the data preparation status. The data preparation status defines whether the data in a column is prepared or not prepared. Some vendors support internal preparation of the data. In JDM, the physical data set and logical data are named objects, which can be stored in and retrieved from the DME.

Listing 2 illustrates how to create a PhysicalDataSet object and save it in the DME. Here PhysicalDataSet encapsulates "CHURN_BUILD_TABLE" details. In this table "CUSTOMER_ID" is used as the caseId.
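Since Listing 2 isn't shown here, the following is a minimal sketch of that step. The saved object name "churnBuildData" is an illustrative choice, and the factory and method names follow the JSR-73 javadoc.

import javax.datamining.data.AttributeDataType;
import javax.datamining.data.PhysicalAttribute;
import javax.datamining.data.PhysicalAttributeFactory;
import javax.datamining.data.PhysicalAttributeRole;
import javax.datamining.data.PhysicalDataSet;
import javax.datamining.data.PhysicalDataSetFactory;
import javax.datamining.resource.Connection;

// Rough sketch: describe the build table, mark the case id column, and save the object in the DME.
public class ChurnPhysicalData {
    public static void describeBuildData(Connection dmeConn) throws Exception {
        PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
            dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
        // The URI here is simply the table name; 'false' means column metadata is not imported now.
        PhysicalDataSet buildData = pdsFactory.create("CHURN_BUILD_TABLE", false);

        PhysicalAttributeFactory paFactory = (PhysicalAttributeFactory)
            dmeConn.getFactory("javax.datamining.data.PhysicalAttribute");
        // CUSTOMER_ID uniquely identifies a customer, so it plays the case id role.
        PhysicalAttribute caseId = paFactory.create(
            "CUSTOMER_ID", AttributeDataType.integerType, PhysicalAttributeRole.caseId);
        buildData.addAttribute(caseId);

        // Physical data sets are named objects stored in the DME; 'true' replaces any existing object.
        dmeConn.saveObject("churnBuildData", buildData, true);
    }
}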

Listing 3 illustrates how to create a LogicalData object and save it in the DME. Here LogicalData is used to specify the attribute types. Some vendors derive some of the logical characteristics of attributes from the physical data. So JDM specifies logical data as an optional feature that vendors can support. Logical data is an input for the model-build operation. Other operations like apply and test get this information from the model.

Build the Mining Model
One important function of data mining is the production of a model. A model can be supervised or unsupervised.

In JDM javax.datamining.base.Model is the base class for all model types. To produce a mining model, one of the key inputs is the build settings object.

javax.datamining.base.BuildSettings is the base class for all build-settings objects; it encapsulates the algorithm settings, function settings, and logical data. The JDM API defines specialized build-settings classes for each mining function.

In this example, the ClassificationSettings object is used to build a classification model to classify churners.

Applications can select an algorithm that works best for solving a business problem. Selecting the best algorithm and its settings values requires some knowledge of how each algorithm works and experimentation with different algorithms and settings. The JDM API defines the interfaces to represent the various mining algorithms.

In this example, we will use the decision-tree algorithm.

In JDM, the javax.datamining.algorithm.tree.TreeSettings object represents the decision-tree algorithm settings. Some vendors support implicit algorithm selection based on the function and data characteristics; in those cases, applications can build models without specifying the algorithm settings.

Listing 4 illustrates how to create classification settings and save them in the DME. Here a classification-settings object encapsulates the logical data, algorithm settings, and target attribute details needed to build the churn model with the decision-tree algorithm. For more details about the algorithm settings, refer to the JDM API documentation.
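Listing 4 itself isn't included in this post; as a rough sketch of that step (object names are illustrative, the logical-data call is optional, and method names should be verified against the JDM javadoc):

import javax.datamining.algorithm.tree.TreeSettings;
import javax.datamining.algorithm.tree.TreeSettingsFactory;
import javax.datamining.resource.Connection;
import javax.datamining.supervised.classification.ClassificationSettings;
import javax.datamining.supervised.classification.ClassificationSettingsFactory;

// Rough sketch: classification settings for the churn model, using a decision tree.
public class ChurnBuildSettings {
    public static void createSettings(Connection dmeConn) throws Exception {
        ClassificationSettingsFactory csFactory = (ClassificationSettingsFactory)
            dmeConn.getFactory("javax.datamining.supervised.classification.ClassificationSettings");
        ClassificationSettings settings = csFactory.create();
        settings.setLogicalDataName("churnLogicalData"); // name the LogicalData was saved under (illustrative)
        settings.setTargetAttributeName("CHURN");        // the attribute to predict

        // Ask for a decision-tree model; default tree settings are used in this sketch.
        TreeSettingsFactory tsFactory = (TreeSettingsFactory)
            dmeConn.getFactory("javax.datamining.algorithm.tree.TreeSettings");
        TreeSettings treeSettings = tsFactory.create();
        settings.setAlgorithmSettings(treeSettings);

        dmeConn.saveObject("churnBuildSettings", settings, true);
    }
}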

Listing 5 illustrates how to build a mining model by executing the build task. Because model building is typically a long-running operation, JDM defines a task object that encapsulates the input and output details of a mining operation. A task object can be executed asynchronously or synchronously by an application, and applications can monitor the task-execution status using an execution handle.

An execution-handle object is created when the task is submitted for execution. For more details about the task execution and the execution handle, refer to the JDM API documentation.

Here the build task is created by specifying the input physical dataset name, build settings name, and output model name. The build task is saved and executed asynchronously in the DME. Applications can either wait for the task to be completed, or execute the task and check the status later.
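A sketch of that flow is shown below; the task and data-set names are illustrative, and the execution-handle calls should be checked against the JDM javadoc.

import javax.datamining.*;
import javax.datamining.resource.Connection;
import javax.datamining.task.BuildTask;
import javax.datamining.task.BuildTaskFactory;

// Rough sketch: create the build task, save it, and execute it asynchronously in the DME.
public class ChurnModelBuild {
    public static void buildModel(Connection dmeConn) throws Exception {
        BuildTaskFactory btFactory =
            (BuildTaskFactory) dmeConn.getFactory("javax.datamining.task.BuildTask");
        // Inputs: physical data set name and build settings name; output: the model name.
        BuildTask buildTask = btFactory.create("churnBuildData", "churnBuildSettings", "CHURN_MODEL");
        dmeConn.saveObject("churnBuildTask", buildTask, true);

        // Execute asynchronously and monitor progress through the execution handle.
        ExecutionHandle handle = dmeConn.execute("churnBuildTask");
        ExecutionStatus status = handle.waitForCompletion(Integer.MAX_VALUE); // or poll the status later
        if (ExecutionState.success.equals(status.getState())) {
            System.out.println("CHURN_MODEL built successfully");
        }
    }
}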

Test the Mining Model
After building a mining model, one can evaluate the model using different test methodologies. The JDM API defines industry standard testing methodologies for supervised models.

For a classification model like the churn model, the ClassificationTestTask is used to compute classification test metrics. This task encapsulates input model name, test data name, and metrics object name. It produces a ClassificationTestMetrics object that encapsulates the accuracy, confusion matrix, and lift metrics details that are computed using the model.

Accuracy provides an estimate of how accurately the model can predict the target. For example, 0.9 accuracy means the model can accurately predict the results 90% of the time.

Confusion matrix is a two-dimensional, N x N table that indicates the number of correct and incorrect predictions a classification model made on specific test data. It provides a measure of how well a classification model predicts the outcome and where it makes mistakes.

Lift is a measure of how much better the prediction results are when using a model as opposed to chance. To explain lift, we'll use a product-campaign example. Say a campaign to all 100,000 existing customers results in sales of 10,000 products. By using the mining model, say we sell 9,000 products by campaigning to only 30,000 selected customers. The response rate rises from 10% to 30%, so the lift is computed as 3, i.e., (9,000/30,000)/(10,000/100,000).

Listing 6 illustrates how to test the churn model by executing the classification test task using "CHURN_TEST_TABLE." After successfully completing the task, a classification test metrics object is created in the DME. It can be retrieved from the DME to explore the test metrics. (Listings 6-8 can be downloaded from www.sys-con.com/java/sourcec.cfm.)
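A rough sketch of that test step follows. It assumes a physical data set for "CHURN_TEST_TABLE" has already been saved under the illustrative name "churnTestData", and the metric accessors are taken from the JSR-73 javadoc (verify them for your vendor).

import javax.datamining.*;
import javax.datamining.resource.Connection;
import javax.datamining.supervised.classification.*;

// Rough sketch: test the churn model and read back the computed metrics.
public class ChurnModelTest {
    public static void testModel(Connection dmeConn) throws Exception {
        ClassificationTestTaskFactory ttFactory = (ClassificationTestTaskFactory)
            dmeConn.getFactory("javax.datamining.supervised.classification.ClassificationTestTask");
        // Inputs: test data name and model name; output: the test metrics object name.
        ClassificationTestTask testTask =
            ttFactory.create("churnTestData", "CHURN_MODEL", "churnTestMetrics");
        dmeConn.saveObject("churnTestTask", testTask, true);

        ExecutionHandle handle = dmeConn.execute("churnTestTask");
        handle.waitForCompletion(Integer.MAX_VALUE);

        // Retrieve the metrics object from the DME and inspect it.
        ClassificationTestMetrics metrics = (ClassificationTestMetrics)
            dmeConn.retrieveObject("churnTestMetrics", NamedObject.testMetrics);
        System.out.println("Accuracy: " + metrics.getAccuracy());
        ConfusionMatrix cm = metrics.getConfusionMatrix(); // correct vs. incorrect predictions per class
    }
}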

Apply the Mining Model
After evaluating the model, the model is ready to be deployed to make predictions. JDM provides an ApplySettings interface that encapsulates the settings related to the apply operation. The apply operation will result in an output table with the predictions for each case. Apply settings can be configured to produce different contents in the output table. For more details on apply settings, refer to JDM API documentation.

In this example, we use the top prediction apply setting to produce the top prediction for each case. The DataSetApplyTask is used to apply the churn model on the "CHURN_APPLY_TABLE." JDM supports RecordApplyTask to compute the prediction for a single record; this task is useful for real-time predictions. In this example, we use the dataset apply task to do the batch apply to make predictions for all the records in the "CHURN_APPLY_TABLE".

Listing 7 illustrates how to apply the "CHURN_MODEL" on "CHURN_APPLY_TABLE" to produce an output table "CHURN_APPLY_RESULTS" that will have the predicted churn value "YES or NO" for each customer.

After the apply task completes, a "CHURN_APPLY_RESULTS" table will be created with two columns, "CUSTOMER_ID" and "PREDICTED_CHURN." The probability associated with each prediction can be obtained by specifying it in the ApplySettings.

Here the mapTopPrediction method is used to map the top prediction value to a column name. The source-destination map is used to carry over some of the columns from the input table to the apply-output table along with the prediction columns; in this case, the "CUSTOMER_ID" column is carried over from the apply-input table to the output table. JDM specifies many other output formats so applications can generate the apply-output table in the required format. A discussion of all the available options is beyond the scope of this article.
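The apply step might look roughly like the sketch below. The mapTopPrediction and source-destination-map calls are the ones mentioned above, but their exact signatures vary; treat this as an outline and consult the JDM javadoc for your vendor.

import java.util.HashMap;
import java.util.Map;
import javax.datamining.resource.Connection;
import javax.datamining.supervised.classification.*;
import javax.datamining.task.apply.DataSetApplyTask;
import javax.datamining.task.apply.DataSetApplyTaskFactory;

// Rough sketch: batch apply of CHURN_MODEL to produce CHURN_APPLY_RESULTS.
public class ChurnModelApply {
    public static void applyModel(Connection dmeConn) throws Exception {
        ClassificationApplySettingsFactory asFactory = (ClassificationApplySettingsFactory)
            dmeConn.getFactory("javax.datamining.supervised.classification.ClassificationApplySettings");
        ClassificationApplySettings applySettings = asFactory.create();

        // Map the top prediction to an output column (approximate signature; see the javadoc).
        applySettings.mapTopPrediction("PREDICTED_CHURN");

        // Carry CUSTOMER_ID over from the apply-input table to the output table.
        Map sourceDestinationMap = new HashMap();
        sourceDestinationMap.put("CUSTOMER_ID", "CUSTOMER_ID");
        applySettings.setSourceDestinationMap(sourceDestinationMap);
        dmeConn.saveObject("churnApplySettings", applySettings, true);

        // Batch apply: input data set, model, apply settings, and the output table to create.
        // "churnApplyData" is assumed to be a saved PhysicalDataSet for CHURN_APPLY_TABLE.
        DataSetApplyTaskFactory atFactory = (DataSetApplyTaskFactory)
            dmeConn.getFactory("javax.datamining.task.apply.DataSetApplyTask");
        DataSetApplyTask applyTask = atFactory.create(
            "churnApplyData", "CHURN_MODEL", "churnApplySettings", "CHURN_APPLY_RESULTS");
        dmeConn.saveObject("churnApplyTask", applyTask, true);
        dmeConn.execute("churnApplyTask").waitForCompletion(Integer.MAX_VALUE);
    }
}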

Figure 2 summarizes the JDM data mining process flow that we did in this example.

Market Basket Analysis Example
To explain the use of unsupervised data mining in a practical scenario, we'll use one of the most popular data mining problems called market basket analysis.

The purpose of market basket analysis is to determine what products customers buy together. Knowing what products people buy together can be helpful to traditional retailers and web stores like Amazon.

The information can be used to design store layouts, web page designs, and catalog designs by keeping all cross-sell and up-sell products together. It can also be used in product promotions like discounts for cross-sell or up-sell products. Direct marketers can use basket analysis results to decide what new products to offer their prior customers.

To do market basket analysis, it's necessary to list the transactions customers made. Sometimes customer demographics and promotion/discount details are used to infer rules related to demographics and promotions. Here we use five transactions at a pizza store. For simplicity's sake, we'll ignore the demographics and promotion/discount details.

Transaction 1: Pepperoni Pizza, Diet Coke, Buffalo wings
Transaction 2: Buffalo wings, Diet Coke
Transaction 3: Pepperoni Pizza, Diet Coke
Transaction 4: Diet Coke, French Fries
Transaction 5: Diet Coke, Buffalo wings

The first step is to transform the transaction data above into a transactional format, i.e., a table with transaction id and product name columns. The table will look like Table 2. Only the items purchased are listed.

An association function is used for market basket analysis. An association model extracts rules along with the support and confidence of each rule. The user can specify the minimum support, minimum confidence, and maximum rule length as build settings before building the model.

Since we have only five transactions, we'll build a model to extract all the possible rules by specifying minimum support as 0.1, minimum confidence as 0.51, and no maximum limit for the rule length. This model produces five rules (see Table 3).

In a typical scenario, you may have millions of transactions with thousands of products, so understanding the support and confidence measures and how these are calculated provides good insight into which rules need to be selected for a business problem.

Support is the percentage of records containing the item combination compared to the total number of records. For example take Rule 1, which says, "If Buffalo wings are purchased then diet coke will also be purchased." To calculate the support for this rule, we need to know how many of the five transactions conform to the rule. Actually, three transactions, i.e., 1, 2 and 5, conform to it. So the support for this rule is 3/5=0.6.

Confidence of an association rule is the support for the combination divided by the support for the condition. Support alone gives an incomplete measure of the quality of an association rule: Rule 1 and Rule 5 have the same support, 0.6, because support is not directional. Confidence is directional. For example, every transaction containing Buffalo wings also contains Diet Coke, so the rule "Buffalo wings implies Diet Coke" has confidence 0.6/0.6 = 1.0, while the rule in the opposite direction has confidence only 0.6/1.0 = 0.6, because Diet Coke appears in all five transactions. That is what makes Rule 1 a better rule than Rule 5.
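These two measures are easy to verify by hand; the small stand-alone snippet below recomputes them for Rule 1 directly from the five transactions (plain Java, no JDM involved).

import java.util.Arrays;
import java.util.List;

// Toy check of the support and confidence arithmetic for the five pizza-store transactions above.
public class RuleStats {
    public static void main(String[] args) {
        List<List<String>> transactions = Arrays.asList(
            Arrays.asList("Pepperoni Pizza", "Diet Coke", "Buffalo wings"),
            Arrays.asList("Buffalo wings", "Diet Coke"),
            Arrays.asList("Pepperoni Pizza", "Diet Coke"),
            Arrays.asList("Diet Coke", "French Fries"),
            Arrays.asList("Diet Coke", "Buffalo wings"));

        // Rule 1: "Buffalo wings" implies "Diet Coke"
        int condition = 0;   // transactions containing the condition (Buffalo wings)
        int combination = 0; // transactions containing both items
        for (List<String> t : transactions) {
            if (t.contains("Buffalo wings")) {
                condition++;
                if (t.contains("Diet Coke")) {
                    combination++;
                }
            }
        }
        double support = (double) combination / transactions.size(); // 3/5 = 0.6
        double confidence = (double) combination / condition;        // (3/5)/(3/5) = 1.0
        System.out.printf("support = %.2f, confidence = %.2f%n", support, confidence);
    }
}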

The maximum rule length limits how many items a rule may contain. When there are thousands of items/products and millions of transactions, rules can become long and complex, so this setting is used to keep the rules in a model to a manageable length.

Using JDM to Solve the Market Basket Problem
So how does one use JDM API to build an association rules model and extract the appropriate rules from the model?

Typically, data for association rules will be in a transactional format. A transactional-format table has three columns: "case id," "attribute name," and "attribute value."

In JDM, transactional-format data is described using the PhysicalAttributeRole enumeration. The AssociationSettings interface is used to specify the build settings for association rules; its minimum support, minimum confidence, and maximum rule-length settings can be used to control the size of the association rules model.

Listing 8 illustrates building a market-basket analysis model using the JDM association function and exploring the rules from the model using rule filters.
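Listing 8 isn't reproduced here, but the build portion looks roughly like the following sketch, using the thresholds discussed above. The transactional data set name is an illustrative placeholder, and the setter names follow the JSR-73 javadoc (verify them for your vendor).

import javax.datamining.association.AssociationSettings;
import javax.datamining.association.AssociationSettingsFactory;
import javax.datamining.resource.Connection;
import javax.datamining.task.BuildTask;
import javax.datamining.task.BuildTaskFactory;

// Rough sketch: association settings and a build task for the market-basket model.
public class BasketModelBuild {
    public static void buildModel(Connection dmeConn) throws Exception {
        AssociationSettingsFactory asFactory = (AssociationSettingsFactory)
            dmeConn.getFactory("javax.datamining.association.AssociationSettings");
        AssociationSettings settings = asFactory.create();
        settings.setMinSupport(0.1);      // keep rules occurring in at least 10% of transactions
        settings.setMinConfidence(0.51);  // keep rules that hold more often than not
        // No maximum rule length is set here, so rules of any length are allowed.
        dmeConn.saveObject("basketBuildSettings", settings, true);

        // "basketTxnData" is assumed to be a saved PhysicalDataSet describing the transactional table,
        // with PhysicalAttributeRole values marking the transaction id and product name columns.
        BuildTaskFactory btFactory =
            (BuildTaskFactory) dmeConn.getFactory("javax.datamining.task.BuildTask");
        BuildTask buildTask = btFactory.create("basketTxnData", "basketBuildSettings", "BASKET_MODEL");
        dmeConn.saveObject("basketBuildTask", buildTask, true);
        dmeConn.execute("basketBuildTask").waitForCompletion(Integer.MAX_VALUE);

        // The resulting association model can then be retrieved and its rules explored with a rule filter.
    }
}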

Conclusion
The use of data mining to solve business problems is on the upswing. JDM provides a standard Java interface for developing vendor-neutral data-mining applications. JDM supports common data-mining operations, as well as the creation, persistence, access, and maintenance of the metadata supporting mining activities. Oracle initiated a new JSR-247 to work on new features for a future version of the JDM standard.

References
Java Data Mining Specification. http://jcp.org/aboutJava/communityprocess/final/jsr073/index.html
Java Data Mining API Javadoc. www.oracle.com/technology/products/bi/odm/JSR-73/index.html
Java Data Mining Project Home. https://datamining.dev.java.net
Cross-Industry Standard Process for Data Mining (CRISP-DM). www.crisp-dm.org
JSR-247. http://jcp.org/en/jsr/detail?id=247


Wednesday, May 16, 2012

Knowledge Discovery and Opinion Mining

Sentiment Classification and Opinion Lexicons


Lexicons are a big part of my current research in opinion mining. Aside from their potential to help supervised learning methods, they can be applied to unsupervised techniques - an appealing idea for research whose goal is domain independence. An opinion lexicon is a database that associates terms with opinion information - normally in the form of a numeric score indicating a term's positive or negative bias.

My dissertation was an investigation of how lexicons perform on sentiment classification of film reviews - this work was later expanded and incorporated as a chapter in the book "Knowledge Discovery Practices and Applications in Data Mining - Trends and New Domains".

  • Opinion Mining with SentiWordNet
A shorter version of this research was presented at Dublin's IT&T 2009 and is available here.

The lexicon used here was SentiWordNet. Built from WordNet, SentiWordNet leverages WordNet's semantic relationships like synonyms and antonyms, and term glosses to expand a set of seeded words into a much larger lexicon. It can be tried online here. (also see Esuli and Sebastiani's SentiWordNet paper).

Using SentiWordNet for sentiment classification involves scanning a document for relevant terms and matching the available information from the lexicon according to part of speech. There are some interesting NLP challenges involved here: we run the text through a part-of-speech tagger first to determine whether terms are adjectives, verbs, etc. Negation detection is then performed to identify parts of the text affected by a negating statement (e.g., "not good" as opposed to "good"). Finally, the document is scored based on the terms found and whether each is negated.
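As a very rough illustration of that pipeline (and not the code used in the dissertation), a bare-bones scorer might look like this; the lexicon entries, token format and negation handling are toy stand-ins.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy lexicon-based scorer: look up POS-tagged terms and flip the score when negated.
// The lexicon is assumed to map tagged terms (e.g. "good#a") to a signed score derived
// from SentiWordNet's positive/negative values; the entries below are made-up examples.
public class LexiconScorer {
    private final Map<String, Double> lexicon = new HashMap<String, Double>();

    public LexiconScorer() {
        lexicon.put("good#a", 0.75);      // hypothetical scores
        lexicon.put("terrible#a", -0.80);
    }

    /** Scores a POS-tagged, tokenized document; tokens look like "term#pos". */
    public double score(List<String> taggedTokens) {
        double total = 0.0;
        boolean negated = false;
        for (String token : taggedTokens) {
            String word = token.split("#")[0];
            if (word.equals("not") || word.equals("n't") || word.equals("never")) {
                negated = true;            // toy negation detection
                continue;
            }
            Double s = lexicon.get(token);
            if (s != null) {
                total += negated ? -s : s; // flip the score inside a negated span
                negated = false;           // crude: negation only affects the next scored term
            }
        }
        return total;                      // > 0 suggests a positive document, < 0 a negative one
    }
}

A real implementation deals with word senses, sentence boundaries and longer negation scopes, but the overall flow is the same: tag, detect negation, look up, and aggregate.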

Parameter Testing - Letting RapidMiner Do The Hard Work

In a previous post we discussed an example of how to perform text classification in RapidMiner, using a data set of film reviews and several word vector schemes to classify documents according to their overall positive or negative sentiment. In this tutorial we show how to look for better results using RapidMiner's parameter testing feature, and evaluate the effect of feature selection on the original classification scheme.

Parameter Testing
There are many factors that come into play in determining the performance of a classification task: for example, tuning parameters on the classification algorithm, the use of outlier detection, feature selection and feature generation can all affect the end result. In general, it is hard to know a priori which combination of parameters will be the optimal one for a given data set or class of problem, and testing several possibilities of parameter values is the only way to better understand their influence and find a better fit.

The number of possible combinations for tuning a classification task, however, grows fast, and testing them manually quickly becomes tedious. This is where parameterization can help. In RapidMiner, under Meta -> Parameter operators, we'll find several parameter optimization schemes:
Parameter iterator
Grid Parameter Optimization
And also algorithms that implement more sophisticated parameter searching schemes:
QuadraticParameterOptimization
EvolutionaryParameterOptimization
Feature Selection
We would like to test the effect of feature selection on our previous sentiment classification experiment. Recall that the word vector for our sentiment classifier generated some 2012 features based on unigram terms found in the source documents, after removal of stop words and stemming. Now, we can apply a scheme for filtering out uncorrelated features before we train the classifier algorithm.

Step 1 - Attribute Weighting and Selection
RapidMiner comes with a wealth of methods for performing feature selection. We extend the sentiment classification example by applying a weighting scheme to the attributes of the feature vector. Then, the top K highest-weighted attributes (which we hope are the ones most correlated with the labels) are chosen for training the classifier algorithm.

We add a pre-processing step before the training algorithm by introducing two operators:
InfoGainRatioWeighting - Calculates a numeric weight for each attribute based on information gain with respect to the positive/negative labels (a toy sketch of this calculation follows the list).
AttributeWeightSelection - Filters attributes based on their associated numeric weights. This operator takes as input the example set containing the data from our feature vector and the result of applying the previous InfoGain weights to that example set. There are several selection criteria to choose from; we use the "top k" most relevant attributes.
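To make the weighting step less of a black box, here is a toy, stand-alone version of the information-gain idea behind InfoGainRatioWeighting. It is not RapidMiner's implementation, just the underlying calculation for a single term against a binary label.

// Toy information-gain weighting for one term: how much does knowing whether the term
// occurs reduce the entropy of the positive/negative label?
public class InfoGain {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    /** Entropy of a binary distribution with probability p for one class. */
    static double entropy(double p) {
        if (p == 0.0 || p == 1.0) return 0.0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    /**
     * posWith/negWith: positive/negative documents containing the term;
     * posWithout/negWithout: positive/negative documents without it.
     */
    static double gain(double posWith, double negWith, double posWithout, double negWithout) {
        double with = posWith + negWith, without = posWithout + negWithout, total = with + without;
        double before = entropy((posWith + posWithout) / total);               // label entropy overall
        double after = (with / total) * entropy(posWith / with)                // ... given term present
                     + (without / total) * entropy(posWithout / without);      // ... given term absent
        return before - after;
    }

    /** The gain-ratio variant divides the gain by the entropy of the term's occurrence split. */
    static double gainRatio(double posWith, double negWith, double posWithout, double negWithout) {
        double with = posWith + negWith, total = with + posWithout + negWithout;
        return gain(posWith, negWith, posWithout, negWithout) / entropy(with / total);
    }

    public static void main(String[] args) {
        // e.g. a term like "lame" appearing in 150 negative and 20 positive reviews out of 1000 each
        System.out.println(gain(20, 150, 980, 850));
    }
}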


Fixing Random Seed
By default, RapidMiner will use a dynamic seed whenever randomization is needed, for instance, when sampling the data set for cross-validation. To make sure our experimental results are repeatable on every run, we can fix our random seed by assigning it a specific value. This should be done on the "root" and "cross validation" operators.

Feature Selection by Feature Weights
The project is now ready to run. Let's see how it fares by keeping, say, only the top 100 features according to the weighting scheme and using those features to train the same classifier algorithm as before. Right-click on the InfoGainRatioWeighting operator and add a "Breakpoint After" stop. When running the experiment, we can then inspect the state of the execution process right after this step has run. At that stage the attribute weights have been created, and we can look at which ones were given the highest weights, an indication of the features most correlated with the positive or negative sentiment label:


In this weighting scheme we notice some familiar terms we would expect to correlate with a good or bad film review. Terms such as "lame", "poorly" and "terrific" score highly. Also, we notice some more unexpected predictors, such as the term "portray", which appears to be relevant to classification on the domain of films.


Once the experiment is complete, we see that in this data set, reducing the total number of features from 2012 to just 100 yielded an average accuracy of 81%. This is worse than running the experiment with all the features (84.05% in our previous experiment), suggesting that pruning the data set to only 100 features might be too severe and could be leaving out many terms that are good predictors. The question then is: is removing potentially uncorrelated features of any benefit to sentiment classification in this experiment?

Step 2 - Parameterization
We'll apply the GridParameterOptimization operator to test a list of potential parameter combinations, using accuracy as the criterion. The operator is added to the project just after the data set read step (ExampleSource), and the remainder of the operators are included as part of its subtree.



From this point on, the parameters affecting the behavior of the operators can be added to the parameter search scheme. Determining which combination works best is based on the results obtained from the "Main Criterion" in the Performance Evaluator operator; in our case, the criterion is accuracy. The operator is configured by selecting the attributes we wish to test and the values each attribute will take. In our example the experiment compares the results of selecting the k topmost relevant features for seven different values of k:


Results
In our experiment, the average classification accuracy improved to 85.80% when using the k = 800 topmost weighted features. This is better than our original baseline of 84.05%, and has the added benefit of using fewer features, therefore reducing the footprint necessary to train and run the algorithm. Not bad for a day's work, considering the tool did most of it :-).

Further improvements could naturally be obtained by testing k at more granular increments, or by including other factors such as support vector machine parameters. It is important, however, to bear in mind that adding more test values results in a much larger search space, thus increasing the time needed to tune the experiment. For instance, searching over 50 values of k in the feature selection approach, combined with 10 possible values for the classifier's tuning parameters, would result in 50 x 10 = 500 iterations. The numbers can add up quickly.

Other Approaches
The GridParameterOptimization operator is quite straightforward: it iterates over a list of parameter combinations and retrieves the one that optimizes a particular error function, in our case accuracy. Finding the best combination of parameters relates to the more general problem of search and optimization, and many more sophisticated strategies have been proposed in the literature. Some of these are also present in RapidMiner, such as the QuadraticParameterOptimization operator and the EvolutionaryParameterOptimization operator, which implements a genetic algorithm for searching parameter combinations.

Further Reading
For those interested in reading more on some of the subjects briefly touched upon in this tutorial, there is a good discussion of parameter search in the context of data mining in the book Principles of Data Mining by Hand, Mannila and Smyth.

Feature selection applied to text mining has been investigated by a number of authors, and a good overview of the topic can be found in the work of Sebastiani, 2002 (retrievable here).

Finally, an approach that uses feature selection techniques to the problem of sentiment classification can be seen in the work of Abbasi et al, 2008.


Opinion Mining with RapidMiner - A Quick Experiment


RapidMiner (formerly YALE) is an open source data mining and knowledge discovery tool written in Java, incorporating most well-known mining algorithms for classification, clustering and regression; it also contains plugins for specialized tasks such as text mining and the analysis of streamed data. RapidMiner is a GUI-based tool, but mining tasks can also be scripted for batch-mode processing. In addition to its numerous operators, RapidMiner also includes the data mining library from the WEKA Toolkit.

The polarity data set is a set of film reviews from IMDB, labelled positive or negative based on author feedback. There are 1000 labelled documents for each class, and the data is presented in plain text format. This data set has been used to analyse the performance of opinion mining techniques, and it can be downloaded from here.

RapidMiner Setup
Get RapidMiner here, and don't forget the text mining plugin. The text mining plugin contains operators specially designed to assist in the preparation of text documents for mining tasks, such as tokenization, stop word removal and stemming. RapidMiner plugins are Java libraries that need to be added to the lib\plugins subdirectory under the installation location.

A word on the JRE

RapidMiner ships with pre-configured scripts for launching its command-line and GUI versions in the JVM. It is worth spending a few moments checking the JRE startup parameters, as larger data sets are likely to hit a memory allocation ceiling. Configuring the JRE for server-side execution (Java HotSpot server VM) is likely to help as well. In the script used for starting RapidMiner (e.g., RapidMinerGUI or RapidMinerGUI.bat under the scripts subdirectory):
- Configure the MAX_JAVA_MEMORY variable to the amount of memory allocated to the JVM. The example below sets it to 1 GB:
MAX_JAVA_MEMORY=1024

- Add the "-server" flag to the JVM startup line in the script being used.

Step 1: From Text to Word Vector
Here we'll create a word vector data set based on a set of documents. The word vector set can then be reused and applied to different classifiers.

The TextInput operator receives a set of tokenized documents, generates a word vector and passes it on to the ExampleSetWriter operator, which writes it to a file. This example is based on one of the samples from the RapidMiner text plugin.

TextInput also creates a special field in the word vector output file that identifies each vector with its original document. This is controlled by the id_attribute_type parameter: a long or short text description based on the document file name, or a unique sequential ID.

Operator Choices
We would like to experiment with different types of word vectors and assess their impact on the classification task. The nested operators under TextInput and their setup are briefly described here, following the execution sequence of the operators:

PorterStemmer - Executes the English Porter stemming algorithm on the document set. Stemming is a technique that reduces words to their common root, or stem. No parameters are allowed on this operator.

TokenLengthFilter - Removes tokens based on string length. We use a minimum string length of 2 characters. This is our preference, as a higher length filter could remove important sentiment information such as "ok" or "no".

StopWordFilterFile - Removes stop words based on a list given in a file. RapidMiner also implements an EnglishStopWord operator; however, we would like to preserve some potentially useful sentiment information such as "ok" and "not", and thus use a scaled-down version based on this stop word list.

StringTokenizer - The final step before building the word vector; it receives the modified text documents from the previous steps and builds a series of term tokens.

There is clearly an argument for getting rid of stemming and word filtering altogether and performing the experiment using each potential word as a feature. The final word vector, however, would be far larger and the process more time consuming (one test run without stemming and with a word length greater than 3 generated over 25K features). On the basis that we'd like to perform a quick experiment to demonstrate the features of the plugin, for now we'll keep the filtering in.

It is also worth mentioning the n-gram tokenizer operator, not used in this test, which generates a list of n-grams based on words occurring in the text. A 2-gram - or bigram - tokenizer generates all possible two-word sequence pairs found in the text. n-grams have the potential of retaining more information regarding opinion polarity - e.g., the words "not nice" become the "not_nice" bigram, which can then be treated as a feature by the classifier. This, however, comes at the expense of classifier overfitting, since for larger values of n it would require a far larger volume of examples to train on all possible relevant n-grams, not to mention the hit in execution time due to a much larger feature space. We thus leave it out of this experiment.

Word Vectors
The TextInput operator is capable of generating several types of word vectors. We create 3 different examples for our test (a rough sketch of the weighting formulas follows the list):
Binary Occurrence: the term receives 1 if present in the document, 0 otherwise.
Term Frequency: the value is based on the normalized number of occurrences of the term in the document.
TFIDF: calculated from the term's frequency in the document and in the entire corpus.
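For reference, the three schemes boil down to roughly the following calculations. This is a simplified sketch; RapidMiner's actual vectors are normalized and its TFIDF variant may differ in detail.

// Toy illustration of the three term-weighting schemes used for the word vectors.
public class TermWeights {

    /** Binary occurrence: 1 if the term appears in the document, 0 otherwise. */
    static double binary(int termCountInDoc) {
        return termCountInDoc > 0 ? 1.0 : 0.0;
    }

    /** Term frequency: occurrences of the term divided by the total number of terms in the document. */
    static double tf(int termCountInDoc, int termsInDoc) {
        return (double) termCountInDoc / termsInDoc;
    }

    /** TF-IDF: term frequency damped by how common the term is across the whole corpus. */
    static double tfidf(int termCountInDoc, int termsInDoc, int docsWithTerm, int docsInCorpus) {
        double idf = Math.log((double) docsInCorpus / docsWithTerm);
        return tf(termCountInDoc, termsInDoc) * idf;
    }
}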

In the TextInput operator we also perform some term pruning by removing the least and most frequent terms in the document set. We set our thresholds to terms appearing in at least 50 documents and in at most 1970 documents, out of a corpus of 2000 documents.

Running the Task
Executing the task generates two output files, determined by the ExampleSetWriter operator:
Word vector set (.dat)
Attribute description file (.aml)
The final word vector contains 2012 features, plus 2 special attributes recording the label and the document name.


Step 2: Training and Cross-Validation
We will employ the Support Vector Machines learner to train a model based on samples from the word vector set we just created, and use the 3-fold cross-validation method to compare the results obtained.
In our experiment, we apply a linear SVM with the exact same configuration to the 3 types of word vectors obtained from the previous step: Binary, Term Frequency and TFIDF. All the hard work is done by the XValidation process, which encapsulates the process of selecting folds from the data set and iterating through the classification execution steps.

The first step on our RapidMiner experiment is reading the word vector from disk. This is the task of the ExampleSource process.

Then, we start training and running the classifier with cross-validation. We have configured ours with 3 folds, meaning the data set is split into three parts and the process runs 3 times; in each run a different fold is held out for testing while the model is trained on the remaining vectors.

The XValidation process takes in a series of sub-processes used in its iterations. First, the learner algorithm to be used. As mentioned earlier, we are using a Linear C-SVC SVM, and at this stage not a lot of tweaking has been done on its parameters.

Then, an OperatorChain is used to actually perform the execution of the classification experiment. It links together the ModelApplier process - which applies the trained model to the input vectors - and a PerformanceEvaluator task, which calculates standard performance metrics on the classification run.

That's it. We then run the experiment, changing only the input vector each time, and compare the results.

Results
The classification process took around 30 minutes on my home PC (Windows XP / Intel Celeron) for each run with a different data set.

Monday, May 14, 2012

Cloud Computing: Database as a Service

In cloud computing we all know about Infrastructure as a Service and Software as a Service; similarly, Database as a Service is emerging and slowly becoming popular among cloud providers. This article is a brief overview of Database as a Service.

Databases are widely used as the foundation of many enterprise applications. In most organizations, however, each project that uses a database acquires its own hardware and software and establishes its own hosting and support arrangements for the application. Consequently, there is often a large proliferation of servers and software within a given enterprise that all perform the same task - providing database services for applications.

Modern databases have advanced in features and functionality to the point where they can be used to provide a larger, shared service to the enterprise as a whole. This shared service allows multiple applications to connect to a single database running on a cluster of machines. The applications are isolated from each other and explicit portions of the database processing power are allocated to each one - a high-use application would receive a larger portion of the processing power than a low-use application. This database-as-a-service (DBaaS) offering provides a number of benefits to large enterprise organizations:
  • Higher availability
  • Cost savings
  • Better service through centralized management
  • Reduced risk
Purpose and Benefits
In many enterprises today, each application development team architects, builds, and deploys each of the individual technologies that make up the application. In the case of databases, this means that each application development team deploys its own database for each application. A large enterprise may have thousands of individual database instances running on a variety of hardware, each configured separately and managed independently.
This article describes an alternate solution to the problem of providing database functionality to application developers: offering the database as a common, shared service to the enterprise as a whole. This architecture uses a pool of database computing resources that are shared across multiple applications. Rather than architecting, building, and deploying individual databases, application development teams simply connect the application to the centralized database service. This architecture delivers a number of benefits, including but not limited to:
  • Higher availability - All participating applications immediately benefit from automatic failover. This is intrinsic to the service offering and does not have to be configured and managed individually per application. It puts high availability within the cost range of even low-budget projects; all applications get it "for free."
  • Cost savings - By pooling hardware resources together and driving up efficiency, the overall cost to the business should be reduced. The more individual database servers currently being used, and the lower their overall utilization, the more savings can be realized. Modern databases also allow the use of many smaller, commodity servers in a cluster configuration, rather than a few large servers. Finally, there are software savings by the consolidation of database software licenses.
  • Better service through centralized management - By centralizing the overall management, each development group does not have to hire its own database administrators. Administration and management is handled by full-time experts rather than part-time resources that are responsible for other tasks.
  • Reduced risk - The whole environment is managed and support is available on request. In the decentralized model, support is often performed ad-hoc by a part-time team and may not be available at the time it is needed, nor with the expertise to solve the problem as soon as required.
The solution is not without risks, however. These include:
  • Common point of failure - While not dissimilar to other existing shared points of failure (SAN, network, etc), the failure of the shared instance would affect all dependent applications.
  • Failure to adopt - While a shared environment offers a number of desirable features, some simple conditions will need to be met that application owners might see as a barrier. For example, an application owner might decide they want control over the specific hardware used to host their application.
  • Failure to recover costs - The thrust of this solution involves trading many simple implementations for a single sophisticated implementation. This sophistication will necessarily include some higher licensing costs and demand a high caliber of support. Without careful management and widespread adoption, this cost structure could outweigh the benefits.

Platform Types
In the current state characterized by the individual deployment of databases throughout the enterprise, each application development team is selecting the platform on which to run the database itself. While an enterprise might standardize on a particular hardware and software platform (e.g., an HP, Dell, or IBM server, running Oracle or SQL Server), these components are typically matched together in an ad-hoc way and may not reflect an optimal configuration.
By running the database as a central service, the enterprise can optimize the compute density, manageability, and complexity of the hardware and software platform combination. The right components can be selected to deliver optimum performance at minimum cost. This can even be taken to the extreme of purchasing optimized database "appliances" such as Oracle Exadata systems.

Reference Architecture
As shown in Figure 1, the reference architecture has a number of different components and layers that implement a very robust database service offering. These are described in a bottom-up fashion.


Data network and SAN - This layer provides basic IP network connectivity to the system for users (application consumers) and also provides a network path to storage. In the case of the reference architecture, "SAN" may also refer to any storage networking technology compatible with the database solution (e.g., NAS, etc.).

OS and Servers - This layer provides the basic hardware and software system on which the database runs. Again, the choice of wording here is a bit arbitrary; the reference architecture would also allow optimized implementations that may not include a separate operating system, for instance.

Storage Management - This layer provides basic capabilities to manage database files in the storage pool.

Disaster Recovery - One goal of DBaaS is to increase the reliability and availability of the database functionality associated with all enterprise applications. By building the service centrally, we should be able to provide advanced functionality such as disaster recovery to individual applications "for free." The disaster recovery layer of the architecture implements data replication and mirroring to a remote location for the purpose of recovering from a total site failure.

Backup and Recovery - Similarly, the backup and recovery layer provides common backup and recovery services for all the data stored in the shared DBaaS implementation. All application data is backed up automatically, simply by virtue of using the service rather than an ad hoc database.

Clustering Software - The clustering software layer provides functionality that coordinates the activities of multiple physical hardware machines to create a larger shared cluster. This allows the DBaaS service to scale up and down as demand increases and decreases, using the horsepower of many individual machines to increase performance when required.

Database - The database layer represents the database software itself (SQL parser, query optimizer, execution engine, etc.). This is what we commonly think of as "the database."

Grid Management - The grid management layer allows DBaaS operators to manage the set of machines on which the DBaaS service runs. Machines can be added to the pool when load requires, or removed. The grid management software can also check on the health of individual machines.

Opinion Mining - Sentiment Mining

Opinion mining is a type of natural language processing for tracking the mood of the public about a particular product. Opinion mining, which is also called sentiment analysis, involves building a system to collect and examine opinions about the product made in blog posts, comments, reviews or tweets.  Automated opinion mining often uses machine learning, a component of artificial intelligence (AI).
Opinion mining can be useful in several ways.  If you are in marketing, for example, it can help you judge the success of an ad campaign or new product launch, determine which versions of a product or service are popular and even identify which demographics like or dislike particular features. For example, a review might be broadly positive about a digital camera, but be specifically negative about how heavy it is. Being able to identify this kind of information in a systematic way gives the vendor a much clearer picture of public opinion than surveys or focus groups, because the data is created by the customer.  
An opinion mining system is often built using software that is capable of extracting knowledge from examples in a database and incorporating new data to improve performance over time.  The process can be as simple as learning a list of positive and negative words, or as complicated as conducting deep parsing of the data in order to understand the grammar and sentence structure used.  
There are several challenges in opinion mining. The first is that a word considered positive in one situation may be considered negative in another. Take the word "long" for instance. If a customer said a laptop's battery life was long, that would be a positive opinion. If the customer said that the laptop's start-up time was long, however, that would be a negative opinion. These differences mean that an opinion system trained to gather opinions on one type of product or product feature may not perform very well on another.
A second challenge is that people don't always express opinions the same way. Most traditional text processing relies on the fact that small differences between two pieces of text don't change the meaning very much.  In opinion mining, however, "the movie was great" is very different from "the movie was not great". 
Finally, people can be contradictory in their statements. Most reviews will have both positive and negative comments, which is somewhat manageable by analyzing sentences one at a time. However, the more informal the medium (twitter or blogs for example), the more likely people are to combine different opinions in the same sentence. For example: "the movie bombed even though the lead actor rocked it" is easy for a human to understand, but more difficult for a computer to parse. Sometimes even other people have difficulty understanding what someone thought based on a short piece of text because it lacks context.  For example, "That movie was as good as his last one" is entirely dependent on what the person expressing the opinion thought of the previous film.

Job Opportunities in Data Mining


Software Developer in Data Mining Technologies


The Oracle Data Mining group is responsible for embedding data mining technology within the Oracle database. We currently have a large breadth of mining functions and algorithms embedded in the Oracle database:
- classification (decision tree, support vector machine, naive bayes, and logistic regression)
- regression (support vector machine and multivariate linear regression)
- clustering (enhanced k-means and o-cluster, an Oracle developed algorithm)
- feature extraction (non-negative matrix factorization)
- associations (apriori)
- attribute importance (mdl)
In addition, we have enhanced Oracle's native, built-in SQL language to support functions which score data mining models (e.g., produce the probability of churn).

Expected background:

strong background in machine learning (Ph.D. preferable)

strong background in mathematics

strong knowledge of C

knowledge of SQL is a plus

Is Matlab the Best Language for Data Mining?


While starting a new project a few days ago, I had to answer the recurrent question: What language do I choose? In research, we have the opportunity of choosing any language, free or not. This is usually not the case in industry where the language can be fixed for many reasons (price, customer choice, boss choice, same as existing system, etc.).

I basically had to choose between Java and Matlab (C++ was soon dropped from my list since I don't like to spend time on pointers and manually freeing up memory, but this is very personal). Of course a lot of other languages are available, but I feel more confident with these two. As most of my work was done with Matlab, I decided to start with Java. Contradictory? Not at all; I just wanted to know how easy it was to use Java for raw data mining tasks (i.e., without using the JDM framework or the like).

When doing data mining, a large part of the work is manipulating data. Indeed, the part of coding the algorithm can be quite short, since Matlab has a lot of toolboxes for data mining. And when manipulating data, Matlab is definitely better. This is natural, since it was designed to work with matrices (MATrix LABoratory). Thus, deleting a row or a column, transposing a matrix, calculating the determinant… all of these can be done in one line of code. To my knowledge, this is not the case with Java, but if you know of a way, feel free to comment.

For more information about using Matlab for data mining, the best place is
http://matlabdatamining.blogspot.in/.