Friday, February 24, 2012

Data mining Problems

Top 10 challenging problems in data mining

In a previous post, I wrote about the top 10 data mining algorithms, a paper that was published in Knowledge and Information Systems. The same selection process was used to identify the most important data mining problems (according to the survey answers). The resulting paper, by Yang and Wu, was published in 2006 in the International Journal of Information Technology & Decision Making. It lists the following problems (in no specific order):
  • Developing a unifying theory of data mining
  • Scaling up for high dimensional data and high speed data streams
  • Mining sequence data and time series data
  • Mining complex knowledge from complex data
  • Data mining in a network setting
  • Distributed data mining and mining multi-agent data
  • Data mining for biological and environmental problems
  • Data mining process-related problems
  • Security, privacy and data integrity
  • Dealing with non-static, unbalanced and cost-sensitive data

I sometimes receive emails from master's students or practitioners interested in data mining. The usual question is “What can I do as research in data mining?” Of course, the answer depends on what you like and the opportunities of the moment. However, this paper may give some hints on possible directions for research.

As usual, the issue of automating the data mining process is mentioned. It is worth noting that researchers argue that they still need to find a way to automate data mining, while practitioners say that they can already do it (for example KXEN). Finally, I think that one of the most important issues is pointed out by the following sentence in the paper:

“[...] they’re [data mining systems] unable to relate the results of mining to the real-world decisions they affect [...]”

In my opinion, ranking the top problems is more subjective than ranking the top algorithms. Most people will certainly agree on the selected data mining algorithms, but some of the problems may only be relevant to certain fields of research.


