Sentiment Classification and Opinion Lexicons
Lexicons are a big part of my current research in opinion mining. Aside from their potential to help supervised learning methods, they can also be applied in unsupervised techniques - an appealing idea for research whose goal is domain independence. An opinion lexicon is a database that associates terms with opinion information - normally in the form of a numeric score indicating a term's positive or negative bias.
My dissertation investigated how lexicons perform on sentiment classification of film reviews - this work was later expanded and incorporated as a chapter in the book "Knowledge Discovery Practices and Applications in Data Mining - Trends and New Domains".
- Opinion Mining with SentiWordNet
A shorter version of this research was presented at IT&T 2009 in Dublin and is available here.
The lexicon used here was SentiWordNet. Built from WordNet, SentiWordNet leverages WordNet's semantic relationships, such as synonymy and antonymy, together with term glosses, to expand a set of seed words into a much larger lexicon. It can be tried online here (also see Esuli and Sebastiani's SentiWordNet paper).
Using SentiWordNet for sentiment classification involves scanning a document for relevant terms and matching them against the information available in the lexicon according to part of speech. There are some interesting NLP challenges involved here: we first run the text through a part-of-speech tagger to determine whether each term is an adjective, verb, etc. Negation detection is then performed to identify parts of the text affected by a negating statement (e.g. "not good" as opposed to "good"). Finally, the document is scored based on the terms found and whether they are negated.
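To make the pipeline concrete, here is a minimal sketch of this kind of term scoring using NLTK's SentiWordNet interface. This is not the code used in the dissertation: the three-token negation window, the stop-at-first-sense choice and the small negator list are arbitrary simplifications for illustration.

# Minimal SentiWordNet-style scorer; assumes the NLTK 'punkt', tagger,
# 'wordnet' and 'sentiwordnet' resources have been downloaded.
from nltk import word_tokenize, pos_tag
from nltk.corpus import sentiwordnet as swn

# Map Penn Treebank tag prefixes to SentiWordNet part-of-speech codes.
PTB_TO_SWN = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
NEGATORS = {'not', 'no', "n't", 'never'}

def score_document(text, negation_window=3):
    tokens = pos_tag(word_tokenize(text.lower()))
    score, negate_for = 0.0, 0
    for word, tag in tokens:
        if word in NEGATORS:
            negate_for = negation_window      # flip polarity of the next few terms
            continue
        pos = PTB_TO_SWN.get(tag[:1])
        if pos:
            synsets = list(swn.senti_synsets(word, pos))
            if synsets:
                s = synsets[0]                # naive choice: first (most common) sense
                term_score = s.pos_score() - s.neg_score()
                score += -term_score if negate_for > 0 else term_score
        negate_for = max(0, negate_for - 1)
    return score                              # > 0 suggests positive, < 0 negative

print(score_document("The film was not good, but the acting was terrific."))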
Parameter Testing - Letting RapidMiner Do The Hard Work
In a previous post we discussed how to perform text classification in RapidMiner, using a data set of film reviews and several word vector schemes to classify documents according to their overall positive or negative sentiment. In this tutorial we show how to look for better results using RapidMiner's parameter testing feature, and evaluate the effect of feature selection on the original classification scheme.
Parameter Testing
There are many factors that come into play in determining the performance of a classification task: for example, tuning parameters on the classification algorithm, the use of outlier detection, feature selection and feature generation can all affect the end result. In general, it is hard to know a priori which combination of parameters will be the optimal one for a given data set or class of problem, and testing several possibilities of parameter values is the only way to better understand their influence and find a better fit.
The number of possible combinations for tuning a classification task, however, grows fast, and testing them manually can become tedious very quickly. This is where parameterization can help. In RapidMiner, under Meta -> Parameter operators, we'll find several parameter optimization schemes:
Parameter iterator
GridParameterOptimization
And also algorithms that implement more sophisticated parameter searching schemes:
QuadraticParameterOptimization
EvolutionaryParameterOptimization
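Conceptually, a grid parameter search simply enumerates every combination of the candidate values and keeps the best-performing one. Here is a rough sketch of the idea in Python - not RapidMiner code; the parameter names and the evaluate() stub are placeholders:

from itertools import product

# Candidate values for two hypothetical parameters.
param_grid = {
    'k_features': [100, 200, 400, 800],
    'svm_C':      [0.1, 1.0, 10.0],
}

def evaluate(params):
    # Placeholder: in a real experiment this would train and cross-validate
    # the classifier with these parameters and return its accuracy.
    return 0.5

best_params, best_accuracy = None, -1.0
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    accuracy = evaluate(params)               # 4 x 3 = 12 runs in total
    if accuracy > best_accuracy:
        best_params, best_accuracy = params, accuracy

print(best_params, best_accuracy)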
Feature Selection
We would like to test the effect of feature selection on our previous sentiment classification experiment. Recall that the word vector for the sentiment classifier contained some 2012 features based on unigram terms found in the source documents, after stop word removal and stemming. We can now apply a scheme for filtering out uncorrelated features before training the classifier algorithm.
Step 1 - Attribute Weighting and Selection
RapidMiner comes with a wealth of methods for performing feature selection. We extend the sentiment classification example by applying a weighting scheme to the attributes of the feature vector. Then, the top k highest-weighted attributes (which we hope are the ones most correlated with the labels) are kept for training the classifier algorithm.
We add a pre-processing step before the training algorithm by introducing two operators:
InfoGainRatioWeighting - Calculates a numeric weight for each attribute based on its information gain ratio with respect to the positive/negative labels.
AttributeWeightSelection - Filters attributes based on their associated numeric weights. This operator takes as input the example set containing the feature vector data and the weights produced by the previous operator. There are several selection criteria to choose from; we use the "top k" most relevant attributes.
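The same two steps can be sketched outside RapidMiner. Information gain ratio is not directly available in scikit-learn, so the sketch below uses mutual information as a close stand-in for the weighting step, and a small random matrix in place of the real word-vector data:

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy stand-ins for the real data: 20 documents x 50 term features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 50)).astype(float)
y = rng.integers(0, 2, size=20)            # positive/negative labels

# Step 1: weight each attribute (here via mutual information with the label).
# Step 2: keep only the top k highest-weighted attributes.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (20, 10) - only the top 10 features remain
print(selector.scores_[:5])                # the per-feature weights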
Fixing Random Seed
By default, RapidMiner uses a dynamic seed whenever randomization is needed, for instance when sampling the data set for cross-validation. To make sure our experimental results are repeatable on every run, we fix the random seed by assigning it a specific value. This should be done on both the "root" and "cross validation" operators.
Feature Selection by Feature Weights
The project is now ready to run. Let's see how it fares by keeping, say, only the top 100 features according to the weighting scheme and using those features to train the same classifier algorithm as before. Right-click on the InfoGainRatioWeighting operator and add a "Breakpoint After" stop. When running the experiment, we can then inspect the state of the process right after this step has executed. At that stage the attribute weights have been created, and we can look at which ones were given the highest weights, an indication of the features most correlated with the positive or negative sentiment label:
In this weighting scheme we notice some familiar terms we would expect to correlate with a good or bad film review: terms such as "lame", "poorly" and "terrific" score highly. We also notice some more unexpected predictors, such as the term "portray", which appears to be relevant to classification in the film domain.
Once the experiment is complete, we see that on this data set, reducing the total number of features from 2012 to just 100 yields an average accuracy of 81%. This is worse than running the experiment with all the features (84.05% in our previous experiment), suggesting that pruning the data set to only 100 features might be too severe and could be leaving out many terms that are good predictors. The question then is: is removing potentially uncorrelated features of any benefit to sentiment classification in this experiment?
Step 2 - Parameterization
We'll apply the GridParameterOptimization operator to test a list of potential parameter combinations, using accuracy as the criterion. The operator is added to the project just after the data set read step (ExampleSource), and the remaining operators are included as part of its subtree.
From this point on, the parameters affecting the behavior of the operators can be added to the parameter search scheme. Determining which combination works best is based on the results obtained from the "Main Criterion" in the PerformanceEvaluator operator - in our case, accuracy. The operator is configured by selecting the parameters we wish to test and the values each one can take. In our example the experiment compares the results of selecting the k topmost relevant features for seven different values of k:
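For comparison, the same kind of grid over k can be expressed compactly in scikit-learn. This is a hedged sketch, again using mutual information in place of information gain ratio, a linear SVM as the learner, and a toy matrix instead of the real word vectors; the k values shown are illustrative, not the seven used in the RapidMiner experiment:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(60, 200)).astype(float)   # toy document-term matrix
y = rng.integers(0, 2, size=60)

pipeline = Pipeline([
    ('select', SelectKBest(score_func=mutual_info_classif)),
    ('svm', LinearSVC()),
])

# Try several values of k; accuracy is the main criterion, as in the RapidMiner setup.
search = GridSearchCV(pipeline,
                      param_grid={'select__k': [50, 100, 200]},
                      scoring='accuracy', cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)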
Results
In our experiment, the average classification accuracy improved to 85.80% when using the k = 800 topmost weighted features. This is better than our original baseline of 84.05%, and has the added benefit of using fewer features, therefore reducing the footprint needed to train and run the algorithm. Not bad for a day's work, considering the tool did most of it :-).
Further improvements could naturally be obtained by testing k at finer increments, or by including other factors such as support vector machine parameters. It is important, however, to bear in mind that adding more values to test results in a much larger search space, thus increasing the time needed to tune the experiment. For instance, searching over 50 values of k for the feature selection approach, combined with 10 possible values for a classifier tuning parameter, would result in 50 x 10 = 500 iterations. The numbers add up quickly.
Other Approaches
The GridParameterOptimization operator is quite straightforward: it iterates over a list of parameter combinations and retrieves the one that optimizes a particular error function, in our case accuracy. Finding the best combination of parameters relates to the more general problem of search and optimization, and many more sophisticated strategies have been proposed in the literature. Some of these are also present in RapidMiner, such as the QuadraticParameterOptimization operator and EvolutionaryParameterOptimization, which implements a genetic algorithm for searching parameter combinations.
Further Reading
For those interested in reading further on some of the subjects briefly touched upon in this tutorial, there is a good discussion of parameter search in the context of data mining in the book Principles of Data Mining by Hand, Mannila and Smyth.
Feature selection applied to text mining has been investigated by a number of authors, and a good overview of the topic can be found in the work of Sebastiani, 2002 (retrievable here).
Finally, an approach that applies feature selection techniques to the problem of sentiment classification can be seen in the work of Abbasi et al., 2008.
Opinion Mining with RapidMiner - A Quick Experiment
RapidMiner (formerly Yale) is an open source data mining and knowledge discovery tool written in Java, incorporating most well-known mining algorithms for classification, clustering and regression; it also contains plugins for specialized tasks such as text mining and analysis of streamed data. RapidMiner is a GUI-based tool, but mining tasks can also be scripted for batch-mode processing. In addition to its large choice of operators, RapidMiner also includes the data mining library from the WEKA Toolkit.
The polarity data set is a collection of film reviews from IMDB, labelled as positive or negative based on the author's overall verdict. There are 1000 labelled documents for each class, and the data is provided in plain text format. The data set has been widely employed to analyse the performance of opinion mining techniques, and can be downloaded from here.
RapidMiner Setup
Get RapidMiner here, and don't forget the text mining plugin. The text mining plugin contains operators specially designed to assist in the preparation of text documents for mining tasks, such as tokenization, stop word removal and stemming. RapidMiner plugins are Java libraries that need to be added to the lib\plugins subdirectory under the installation location.
A word on the JRE
RapidMiner ships with pre-configured scripts for launching its command line and GUI versions in the JVM. It is worth spending a few moments checking the JRE startup parameters, as larger data sets are likely to hit a memory allocation ceiling; configuring the JRE for server-side execution (Java HotSpot) is also likely to help. In the script used for starting up RapidMiner (e.g. RapidMinerGUI or RapidMinerGUI.bat under the scripts subdirectory):
- Configure the MAX_JAVA_MEMORY variable to the amount of memory allocated to the JVM. The example below sets it to 1 GB:
MAX_JAVA_MEMORY=1024
- Add the "-server" flag to the JVM startup line on the startup script being used.
Step 1: From Text to Word Vector
Here we'll create a word vector data set based on a set of documents. The word vector set can then be reused and applied to different classifiers.
The TextInput operator receives a set of tokenized documents, generates a word vector and passes it on to the ExampleSetWriter operator for output to a file. This example was based on one of the samples from the RapidMiner text plugin.
TextInput will also create a special field in the word vector output file that identifies each vector with its original document; this is controlled by the id_attribute_type parameter: a long or short text description based on the document file name, or a unique sequential ID.
Operator Choices
We would like to experiment with different types of word vectors and assess their impact on the classification task. The nested operators under TextInput and their setup are briefly described below, following their execution sequence:
PorterStemmer - Executes the English Porter stemming algorithm on the document set. Stemming is a technique that reduces words to their common root, or stem. This operator takes no parameters.
TokenLengthFilter - Removes tokens based on string length. We use a minimum length of 2 characters; a higher length filter could remove important sentiment information such as "ok" or "no".
StopWordFilterFile - Removes stop words based on a list given in a file. RapidMiner also implements an EnglishStopWord operator; however, we would like to preserve some potentially useful sentiment information such as "ok" and "not", so we used a scaled-down version based on this stop word list.
StringTokenizer - The final step before building the word vector: it receives the modified text documents from the previous steps and produces a series of term tokens.
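A rough Python analogue of this preprocessing chain is sketched below, tokenizing first (the more common order outside the RapidMiner plugin) and using NLTK's Porter stemmer with a small illustrative stop list that, like the one in the experiment, deliberately keeps "not" and "ok":

from nltk import word_tokenize
from nltk.stem import PorterStemmer

# Small illustrative stop list; the real experiment used a scaled-down file.
STOP_WORDS = {'the', 'a', 'an', 'of', 'and', 'is', 'it', 'this', 'to'}
stemmer = PorterStemmer()

def preprocess(text, min_length=2):
    tokens = word_tokenize(text.lower())                  # StringTokenizer
    tokens = [t for t in tokens if len(t) >= min_length]  # TokenLengthFilter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # StopWordFilterFile
    return [stemmer.stem(t) for t in tokens]              # PorterStemmer

print(preprocess("This film is not an enjoyable experience, the acting is poorly done."))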
There is clearly an argument for dropping stemming and word filtering altogether and performing the experiment using each potential word as a feature. The final word vector, however, would be far larger and the process more time-consuming (one test run without stemming and with word length greater than 3 generated over 25K features). On the basis that we'd like to perform a quick experiment to demonstrate the features of the plugin, we'll keep the filtering in for now.
It is also worth mentioning the n-gram tokenizer operator, not used in this test, which generates a list of n-grams based on words occurring in the text. A 2-gram - or bigram - tokenizer generates all possible two-word sequence pairs found in the text. n-grams have the potential to retain more information regarding opinion polarity - e.g. the words "not nice" become the "not_nice" bigram, which can then be treated as a feature by the classifier. This, however, comes at the expense of classifier overfitting, since training on all possible relevant n-grams for larger values of n would require a far larger volume of examples, not to mention the hit in execution time due to a much larger feature space. We thus leave it out of this experiment.
Word Vectors
The TextInput operator is capable of generating several types of word vectors. We create three different examples for our test:
Binary Occurrence: a term receives 1 if present in the document, 0 otherwise.
Term Frequency: the value is based on the normalized number of occurrences of the term in the document.
TFIDF: calculated from the term's frequency in the document and in the entire corpus.
In the TextInput operator we also perform some term pruning, removing the least and most frequent terms in the document set. We set our thresholds at terms appearing in at least 50 documents and at most 1970 documents, out of a corpus of 2000 documents.
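As a hedged sketch of the three vector types and the pruning idea, scikit-learn's vectorizers can stand in for the TextInput operator. The min_df/max_df arguments, given as absolute document counts, play the role of the pruning thresholds; with the tiny placeholder corpus below they are set to 1 and 2 so the example actually runs, whereas the real experiment used 50 and 1970:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = ["a great film with terrific acting",      # placeholder corpus;
             "a lame plot and poorly acted scenes"]     # the real one has 2000 reviews

prune = dict(min_df=1, max_df=2)                        # pruning thresholds (document counts)

binary_vec = CountVectorizer(binary=True, **prune)      # Binary Occurrence
tf_vec     = TfidfVectorizer(use_idf=False, **prune)    # Term Frequency (normalized counts)
tfidf_vec  = TfidfVectorizer(use_idf=True,  **prune)    # TF-IDF

print(binary_vec.fit_transform(documents).toarray())
print(tfidf_vec.fit_transform(documents).toarray().round(2))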
Running the Task
Executing the task will generate 2 output files determined by the ExampleSetWriter operator:
Word vector set (.dat)
Attribute description file (.aml)
The final word vector contains 2012 features, plus 2 special attributes recording the label and document name.
Step 2: Training and Cross-Validation
We will employ the Support Vector Machine learner to train a model based on samples from the word vector set we just created, and use 3-fold cross-validation to compare the results obtained.
In our experiment, we apply a linear SVM with the exact same configuration to the 3 types of word vectors obtained from the previous step: Binary, Term Frequency and TFIDF. All the hard work is done by the XValidation process, which encapsulates the process of selecting folds from the data set and iterating through the classification execution steps.
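A minimal sketch of the same training and evaluation loop is shown below: a linear SVM with default parameters and 3-fold cross-validation, with a toy matrix in place of the real word vectors and LinearSVC as a stand-in for RapidMiner's C-SVC learner.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)                         # fixed seed, as in the experiment
X = rng.integers(0, 2, size=(90, 300)).astype(float)    # toy word vectors
y = rng.integers(0, 2, size=90)                         # positive/negative labels

# XValidation equivalent: train on 2 folds, test on the third, 3 times over.
scores = cross_val_score(LinearSVC(), X, y, cv=3, scoring='accuracy')
print(scores, scores.mean())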
The first step on our RapidMiner experiment is reading the word vector from disk. This is the task of the ExampleSource process.
Then, we start training/running the classifier with cross-validation. We have configured ours with 3 folds, meaning the process runs 3 times: in each run a different fold is held out, the model is trained on the remaining data, and it is then applied to the vectors in the held-out fold.
The XValidation process takes in a series of sub-processes used in its iterations. The first is the learner algorithm to be used; as mentioned earlier, we are using a linear C-SVC SVM, and at this stage not a lot of tweaking has been done on its parameters.
Then, an OperatorChain is used to actually execute the classification experiment. It links together the ModelApplier process, which applies the trained model to the input vectors, and a PerformanceEvaluator task, which calculates standard performance metrics on the classification run.
That's it. We then run the experiment, changing only the input word vector each time, and compare the results.
Results
The classification process took around 30 minutes on my home PC (Windows XP / Intel Celeron) for each run with a different data set.
Then, we start learning/running the classifier with cross-validation. We have configured ours with 3 folds, meaning the process will run 3 times, using 1/3 of the data set as training, and applying the model to the remaining vectors. It will perform the same operation 3 times, each time using a different fold as training set.
The XValidation process takes in a series of sub-processes used in its iterations. First, the learner algorithm to be used. As mentioned earlier, we are using a Linear C-SVC SVM, and at this stage not a lot of tweaking has been done on its parameters.
Then, an OperatorChain is used to actually perform the execution of the classification experiment. It links together the ModelApplier process - which applied the trained model to the input vectors, and a PerformanceEvaluator task which calculates standard performance metrics on the classification run.
That's it. We then run the experiment, only chaning the input vector each time and compare the results.
Results
The classification process took around 30 minutes on my home PC (Windows XP / Intel Celeron) for each run with a different data set.