Tuesday, February 28, 2012

Data Mining Software

Data Mining Software | DataDetective

DataDetective is a data mining package developed by Sentient Information Systems. It applies data mining techniques to explore patterns and to carry out analysis, prediction, clustering, network analysis and so on, and it uses fuzzy matching to establish relationships between data. It is fast and flexible, and its user interface is simple enough for novice users. Its data interface handles complex data structures and includes integrated ETL for data processing. The Dutch police have used it to analyze crime databases.

Name: Data Detective
Brief Summary: A visual data mining tool with focus on ease of use.
License Type: Commercial Licence
Data Mining Approaches: Classification Discovery, Cluster Discovery, Association Discovery, Text Mining, Outlier Discovery, Data Visualisation, Discovery Visualisation
Currently Available: Currently Available
Website: http://www.sentient.nl/?ddenLink
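
The description above mentions fuzzy matching as a way to establish relationships between data. How DataDetective does this is not documented here, so the following is only a minimal sketch of the general idea, using Python's standard difflib module. The function names, the sample records and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def fuzzy_link(left, right, threshold=0.8):
    """Pair each left record with its most similar right record,
    keeping the pair only if the score clears the (assumed) threshold."""
    links = []
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        score = similarity(name, best)
        if score >= threshold:
            links.append((name, best, round(score, 2)))
    return links


# Hypothetical records from two sources that spell the same people differently.
source_a = ["Jan de Vries", "P. Jansen", "Maria Bakker"]
source_b = ["Jan de Vreis", "Peter Janssen", "Maria Bakker"]

for link in fuzzy_link(source_a, source_b):
    print(link)
```

Raising the threshold gives fewer but safer links; lowering it catches more spelling variants at the cost of more false matches.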


Data Mining Software | Darwin

Darwin is data mining software that is now part of Oracle. It employs data mining techniques such as classification, neural networks, genetic algorithms and regression. It follows a client-server architecture, with a client that runs on both Windows and Unix platforms. It can read data from a wide range of sources.

Most Popular Data Mining Software

Surveys conducted by KDnuggets and Rexer Analytics have asked people involved in data mining what software they use. The most popular software is not necessarily the best for a particular purpose, but the survey results can help guide you in choosing what software to evaluate.

KDnuggets Poll, May 2010

The results of the KDnuggets poll were based on about 900 respondents to the question "Which data mining/analytic tools you used in the past 12 months for a real project?"
  1. RapidMiner
  2. R
  3. KNIME
  4. "Own Code"
  5. Weka or Pentaho

Rexer Analytics Survey 2010

This survey asked data miners about the software they use as a primary tool, frequently, or occasionally. The top 5 most used tools were:
  1. R
  2. SAS
  3. IBM SPSS Statistics
  4. IBM SPSS Modeler
  5. Weka
The top 5 primary tools were:
  1. Statistica
  2. IBM SPSS Modeler
  3. SAS
  4. R
  5. IBM SPSS Statistics


Friday, February 24, 2012

Data Mining Datasets

Web sites I've used to grab data for testing algorithms or software packages include:

UC Irvine Machine Learning Repository: http://archive.ics.uci.edu/ml/

Carnegie Mellon Statlib Archive: http://lib.stat.cmu.edu/datasets/

DELVE Datasets: http://www.cs.utoronto.ca/~delve/data/datasets.html

MIT Broad Institute Cancer Datasets: http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi

Datasets for Data Mining: http://www.kdnuggets.com/datasets/

What Do Data Miners Need to Learn?

I've been asked by several folks recently what they need to learn to succeed in data mining and predictive analytics. This is a different twist on the question I also get, namely what degree should one get to be a good (albeit "green") data miner. Usually, the latter question gets the answer "it doesn't matter" because I know so many great data miners without a statistics or mathematics degree. Admittedly, there are many non-stats/math degrees that have a very strong statistics or mathematics component, such as psychology, political science, and engineering, to name a few. But then again, you don't necessarily have to load up on the stats/math courses in these disciplines either.

So the question of "what to learn" applies across majors, whether undergraduate or graduate. Of course statistics and machine learning courses are directly applicable. However, the answer I've been giving recently to the question of what new data miners need to learn (assuming they will learn algorithms) has centered around two other topics: databases and business.

I had no specific coursework or experience in either when I began my career. In the 80s, databases were not as commonplace in the DoD world where I began my career; we usually worked with flat files provided to us by a customer, even if these files were quite large. Now, most customers I work with have their data stored in databases or data marts, and as a result, we data miners often must lean on DBAs or an IT layer of people to get at the data. This would be fine except that (1) the data that is provided to data miners is often not the complete data we need or at least would like to have before building models, (2) we sometimes won't know how valuable data is until we look at it, and (3) communication with IT is often slow and laden with political issues inherent in many organizations.

On the other hand, IT is often reluctant to give analysts significant freedom to query databases because of the harm they can do. That caution is wise: data miners in general have a poor understanding of how databases work and which queries are dangerous or computationally expensive.
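
One habit that helps here is asking the database how it plans to run a query before running it. The sketch below uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN statement to show the same query switching from a full table scan to an index lookup. The table, column and index names are made up for the illustration, and other database systems expose their plans through different commands.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan descriptions SQLite reports for a query, without running it."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the readable detail text


# Hypothetical in-memory table standing in for a large production table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

sql = "SELECT amount FROM transactions WHERE customer_id = ?"

# Without an index on customer_id, SQLite has to read every row.
plan_before = query_plan(conn, sql, (42,))

# With an index, the same query can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_transactions_customer ON transactions (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print("without index:", plan_before)
print("with index:   ", plan_after)
```

On a table of a thousand rows the difference is invisible, but on a production table a full scan is exactly the kind of query that makes a DBA nervous.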

Therefore, I am becoming more of the opinion that a master's program in data mining, or a data mining certificate program, should contain at least one course on databases. The course should include some database design component but should for the most part emphasize a user's perspective. It is probably more realistic to require this for a degree than for a certificate, but it could be included in both. I know that, for me, in considering new hires, SQL or SAS experience would give a candidate an advantage.

For the second issue, business experience, there are some that might be concerned that "experience" is too narrow for a degree program. After all, if someone has experience in building response models, what good would that do for PayPal if they are looking to build fraud models? My reply is "a lot"! Building models on real data (meaning messy) to solve a real problem (meaning identifying a target variable that conveys the business decision to be improved) requires a thought process that isn't related to knowing algorithms or data.

Building "real-world" models requires a translation of business objectives to data mining objectives (as described in the Business Understanding section of CRISP-DM). When I have interviewed young data miners in the past, it is those who have had to go through this process that are better prepared to begin the job right away, and it is those who recognize the value here who do better at solving problems in a way that impacts decisions rather than finding cool, innovative solutions that never see the light of day. (UPDATE: the crisp-dm.org site is no longer up--see comments section. The CRISP-DM 1.0 document however can still be downloaded here, with higher resolution graphics, by the way!)

My challenge to the universities who are adding degree programs in data mining and predictive analytics, or are offering Certificate programs is then to include courses on how to access data (databases), and how to solve problems (business objectives, perhaps by offering a practicum with a local company).

Top 10 Algorithms in Data Mining

In an effort to identify some of the most influential algorithms that have been widely used in the data mining community, the IEEE International Conference on Data Mining (ICDM) identified the top 10 algorithms in data mining for presentation at ICDM '06 in Hong Kong.

As the first step in the identification process, in September 2006 we invited the ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners to each nominate up to 10 best-known algorithms in data mining. All except one in this distinguished set of award winners responded to our invitation. We asked that each nomination include the following information: (a) the algorithm name, (b) a brief justification, and (c) a representative publication reference. We also advised that each nominated algorithm should have been widely cited and used by other researchers in the field, and the nominations from each nominator as a group should have a reasonable representation of the different areas in data mining.

After the nominations in Step 1, we verified each nomination for its citations on Google Scholar in late October 2006, and removed those nominations that did not have at least 50 citations. All remaining (18) nominations are given on the candidate list below, organized in 10 topics. Please note that for some of these algorithms such as K-means, the citation is not given on the original paper that introduced the algorithm, but a recent paper that highlights the importance of the technique.

  • 18 Candidates for the Top 10 Algorithms in Data Mining

In the third step of the identification process, we had a wider involvement of the research community. We invited the Program Committee members of KDD-06, ICDM '06, and SDM '06 as well as the ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners to each vote for up to 10 well-known algorithms from the above candidate list. The voting results of this step were presented at ICDM '06 and are given in the slides below.

  • Panel Slides with Voting Results

We hope the identification of the top 10 algorithms can promote data mining to wider real-world applications and inspire more researchers in data mining to further explore these 10 algorithms, including their impact and new research issues.


Data Mining Problems

Top 10 challenging problems in data mining

In a previous post, I wrote about the top 10 data mining algorithms, a paper that was published in Knowledge and Information Systems. The same selection process was used to identify the most important data mining problems, according to the survey answers. The paper by Yang and Wu was published in 2006 in the International Journal of Information Technology & Decision Making. It lists the following problems (in no specific order):
  • Developing a unifying theory of data mining
  • Scaling up for high dimensional data and high speed data streams
  • Mining sequence data and time series data
  • Mining complex knowledge from complex data
  • Data mining in a network setting
  • Distributed data mining and mining multi-agent data
  • Data mining for biological and environmental problems
  • Data Mining process-related problems
  • Security, privacy and data integrity
  • Dealing with non-static, unbalanced and cost-sensitive data

I sometimes receive emails from master's students or practitioners interested in data mining. The usual question is “What can I do as research in data mining?” Of course, the answer depends on what you like and the opportunities of the moment. However, this paper may give some hints on possible directions for research.

As usual, the “data mining automation process” issue is mentioned. It is worth noting that researchers argue that they need to find a way to automate data mining, while practitioners say that they can already do it (for example KXEN). Finally, I think that one of the most important issues is pointed out by the following sentence in the paper:

“[...] they’re [data mining systems] unable to relate the results of mining to the real-world decisions they affect [...]”

In my opinion, it is more subjective to rank top problems than top algorithms. Most people will certainly agree on the selected data mining algorithms. The question is more subjective regarding data mining problems since some of them may only be relevant to certain fields of research.