Monday, October 15, 2012

Data Mining - PMML in Action

Do you know HTML, the common standard for web pages? PMML is the same idea applied to data mining. PMML stands for Predictive Model Markup Language. It is a standard that uses XML notation to deploy data mining models in industry. The PMML standard covers both data pre-processing and modelling.
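To make this concrete, here is a minimal sketch of what a PMML document looks like, together with a few lines of Python showing how a consuming application could read it with the standard xml.etree module. The toy regression model, its field names and its coefficients are invented for illustration; they are not taken from the book.

import xml.etree.ElementTree as ET

# A toy PMML document: one linear regression predicting "price"
# from "size" (all names and numbers are invented for illustration).
PMML_DOC = """<?xml version="1.0"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Toy linear regression model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="size" optype="continuous" dataType="double"/>
    <DataField name="price" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy" functionName="regression">
    <MiningSchema>
      <MiningField name="size"/>
      <MiningField name="price" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="size" coefficient="2.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

ns = {"pmml": "http://www.dmg.org/PMML-4_1"}
root = ET.fromstring(PMML_DOC)

# A PMML "consumer" only needs to read the XML back; it never sees
# the code that trained the model.
table = root.find(".//pmml:RegressionTable", ns)
intercept = float(table.get("intercept"))
coefs = {p.get("name"): float(p.get("coefficient"))
         for p in table.findall("pmml:NumericPredictor", ns)}

record = {"size": 100.0}
print(intercept + sum(c * record[f] for f, c in coefs.items()))  # 260.0

The point of the standard is visible even in this toy example: the tool that trains the model and the application that scores new records only need to agree on the XML, not on any shared code.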
The book contains a good (but short) introduction to the standard; only a few diagrams of the various steps of the process are given. Most of the book focuses on the different parts of the standard, illustrated with code snippets. The main objective of the standard is to reduce time to market when deploying data mining solutions.
The book is really useful if you want to deploy your solution using PMML. If you only want to learn more about the standard, it is not the right book (but perhaps the only one?). It contains a lot of code, but no case study; one would have been a plus, since most of the code material can certainly be found on the web.

Data Management, Exploration and Mining (DMX)

The Data Management, Exploration and Mining group focuses on solving key problems in information management. Our current focus areas are reducing the total cost of ownership of information management, enabling efficient identification and correction of data quality problems, and enabling flexible and rich modes of interaction with stored information, while recognizing the key role the web plays in information delivery and publishing.
Database management systems provide functionality that is central to developing business applications. Furthermore, as new cloud database services emerge, even more applications are beginning to use database systems. Yet, the problem of tuning database management systems to achieve the required performance is significant, and results in a high cost of ownership. The goal of our research in the AutoAdmin project is to make database systems more self-tuning and self-administering. We approach this by enabling databases to track the usage of their systems and to gracefully adapt to application requirements. Thus, instead of the application having to track and tune the database, the database actively monitors, diagnoses and tunes itself to be responsive to application needs. Our research has led to novel self-tuning components being included in Microsoft SQL Server.
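As a rough illustration of the self-tuning idea, here is a toy Python sketch (my own simplification, not AutoAdmin's actual algorithms): monitor the workload, count which columns queries filter on, and recommend indexes for the most frequently filtered ones.

import re
from collections import Counter

def recommend_indexes(workload, top_k=2):
    # Count (table, column) pairs that appear in WHERE predicates,
    # using a deliberately naive parser.
    counts = Counter()
    for query in workload:
        table = re.search(r"FROM\s+(\w+)", query, re.IGNORECASE)
        where = re.search(r"WHERE\s+(.+)", query, re.IGNORECASE)
        if not (table and where):
            continue
        for col in re.findall(r"(\w+)\s*[=<>]", where.group(1)):
            counts[(table.group(1), col)] += 1
    return [f"CREATE INDEX idx_{t}_{c} ON {t}({c})"
            for (t, c), _ in counts.most_common(top_k)]

workload = [
    "SELECT * FROM orders WHERE customer_id = 42",
    "SELECT * FROM orders WHERE customer_id = 7 AND total > 100",
    "SELECT * FROM orders WHERE ship_date > '2012-01-01'",
]
print(recommend_indexes(workload))
# ['CREATE INDEX idx_orders_customer_id ON orders(customer_id)', ...]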
High quality data is a critical requirement for effective business intelligence. Businesses obtain their data from multiple sources, often with widely varying data quality and representations. Therefore, data cleaning technology plays a crucial role in helping identify and correct such data quality problems. The increasing importance of web data and services has made this problem even more important. For example, data cleaning technology is necessary to capture misspellings and differences in representation when a user looks up addresses in an online address search engine. Traditionally, data cleaning has been driven by consultants and software that is custom made for specific vertical domains. Our goal in the Data Cleaning project is to design and build a domain-independent platform that can be used to develop solutions for specific vertical domains. We have designed data cleaning operators (e.g., Fuzzy Lookup, Fuzzy Grouping) that can be composed with other operators and customized by application developers to develop suitable data cleaning solutions. A key technical challenge that we focus on is efficient implementation of these operators over large data sets. We have also developed tools based on learning from examples that can assist application developers and domain experts in more easily customizing and fine-tuning their solutions to obtain good data quality. Our research has led to operators such as Fuzzy Lookup and Fuzzy Grouping shipping in Microsoft SQL Server Integration Services, as well as being used in the Bing.com/Maps search engine.
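The intuition behind such operators can be sketched in a few lines of Python. This illustrates the general similarity-matching technique, not SQL Server's actual Fuzzy Lookup implementation: strings are compared by the overlap of their character 3-grams, which tolerates misspellings and representation differences.

def ngrams(s, n=3):
    s = f"  {s.lower()}  "  # pad so word edges get their own grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)

def fuzzy_lookup(value, reference, threshold=0.5):
    # Return the closest reference entry, if it is similar enough.
    best = max(reference, key=lambda r: jaccard(value, r))
    return best if jaccard(value, best) >= threshold else None

reference = ["1 Microsoft Way, Redmond", "1600 Amphitheatre Pkwy"]
print(fuzzy_lookup("1 Microsft Way, Redmund", reference))
# -> "1 Microsoft Way, Redmond"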
Keyword search over web and enterprise documents is a very popular mechanism for finding relevant information. In both enterprise and web scenarios, document collections coexist with large structured databases. Therefore, keyword search over structured databases, particularly in collections involving both structured and unstructured documents, is an important problem. In the Data Exploration project, we explore the algorithmic and systems issues arising out of the goal of searching and analyzing document collections and structured databases together. One of our goals is to identify structured database objects or entities relevant to a query, even if query keywords are not present in the entity name or description columns. A second goal is to enable efficient keyword search on logical entities (obtained by joining multiple relations) in databases without materializing them.
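The second goal can be illustrated with a toy Python sketch (my own, not the project's actual algorithms): keywords are matched against the logical entity formed by joining two relations, streaming one joined tuple at a time instead of materializing the full join.

products = {1: "Fabrikam laptop", 2: "Contoso phone"}
reviews = [(1, "great battery life"), (2, "poor camera"),
           (1, "keyboard feels cheap")]

def keyword_search(keywords):
    wanted = {k.lower() for k in keywords}
    for pid, text in reviews:  # stream: no materialized join table
        # The "logical entity" is the product joined with one review.
        words = set(f"{products[pid]} {text}".lower().split())
        if wanted <= words:
            yield products[pid], text

print(list(keyword_search(["laptop", "battery"])))
# -> [('Fabrikam laptop', 'great battery life')]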
While tremendous progress has been made in data capture and storage, the technology for querying, navigating, exploring, visualizing, and summarizing large data stores is still maturing. Traditional approaches to data reduction and analysis break down with massive data sets. Therefore, we aim to exploit Data Mining techniques, i.e., to apply statistical and machine learning techniques to detect patterns in databases. Our research effort in data mining focuses on ensuring that traditional techniques are made effective over enterprise databases. In particular, traditional algorithms need to be made scalable; moreover, it is necessary to enable a seamless integration of data mining technology with Relational/OLAP database infrastructure so that database developers can exploit data mining functionality. Our work has resulted in novel technologies shipping in Microsoft SQL Server and Commerce Server.
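Here is a small Python sketch of the scalability theme (an illustration of the general idea, not the algorithms that shipped in SQL Server): rows are streamed out of a relational table through a cursor and frequent item pairs are counted in a single pass, so memory use is bounded by the number of distinct pairs rather than the size of the table.

import sqlite3
from collections import Counter
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (id INTEGER, items TEXT)")
conn.executemany("INSERT INTO baskets VALUES (?, ?)",
                 [(1, "milk bread eggs"), (2, "milk bread"),
                  (3, "bread eggs"), (4, "milk bread butter")])

pair_counts = Counter()
for (items,) in conn.execute("SELECT items FROM baskets"):
    # Each row is read, counted and discarded, so memory is bounded
    # by the number of distinct pairs, not the number of rows.
    for pair in combinations(sorted(items.split()), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # -> [(('bread', 'milk'), 3)]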

