Friday, October 19, 2012

Future Uses for Data Analysis and Location-based Data


Companies are increasingly using or exploring the potential of location-based data for providing new services or generating new applications for connecting with customers.
Many people are familiar with Starbucks’ use of location-based data to send customers offers when they’re close to Starbucks’ outlets.
Now, new applications are beginning to emerge. Take, for example, the app the City of Boston has developed that uses smartphone technology to collect geotagged vibrations that help the city determine where additional roadwork is needed.

Street Bump, as it’s called, uses a smartphone’s built-in motion sensor that detects changes in the device’s movement relative to its current position – as well as its global positioning system – to locate potholes and get them repaired more quickly than in the past.
When a vehicle hits a pothole, Street Bump sends a signal to a database that includes the location of the car and the size of the pothole. The data can then be analyzed to identify and address the road conditions around Boston.
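To make the idea concrete, here is a minimal sketch of how an app in the spirit of Street Bump might flag potholes from accelerometer readings. The threshold, field names, and report format are illustrative assumptions, not the city's actual implementation.

```python
import json
from dataclasses import dataclass

# Illustrative threshold: vertical jolts stronger than ~3 g are treated
# as possible potholes. Street Bump's real detection logic is not shown here.
BUMP_THRESHOLD_G = 3.0

@dataclass
class Sample:
    timestamp: float   # seconds since epoch
    z_accel_g: float   # vertical acceleration, in g
    lat: float         # GPS fix at the time of the sample
    lon: float

def detect_bumps(samples):
    """Return one geotagged report per sample whose vertical jolt
    exceeds the threshold; the magnitude is a rough severity proxy."""
    reports = []
    for s in samples:
        jolt = abs(s.z_accel_g - 1.0)  # subtract the constant 1 g of gravity
        if jolt >= BUMP_THRESHOLD_G:
            reports.append({"lat": s.lat, "lon": s.lon,
                            "severity": round(jolt, 2),
                            "timestamp": s.timestamp})
    return reports

# A bump report the phone could send to the city's database.
samples = [Sample(1350650000.0, 1.1, 42.3601, -71.0589),
           Sample(1350650001.0, 5.2, 42.3602, -71.0590)]
print(json.dumps(detect_bumps(samples), indent=2))
```

A real deployment would also need to filter out false positives such as speed bumps and railroad crossings before dispatching repair crews.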
Indeed, there are myriad applications for using data analysis with location-based data in the public sector.
San Francisco’s public transit system is transitioning away from paper bus transfers to RFID cards. These RFID cards let city officials collect passenger data to help assess average commute times, passenger density, and popular travel periods by neighborhoods. Officials can then adjust the bus schedules accordingly, thus leading to greater operational efficiency.
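As a rough illustration, the kind of aggregate officials might compute from tap data could look like the following sketch; the records and field layout are invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tap records: (neighborhood, boarding hour, commute minutes).
# Real RFID data would be millions of rows pulled from the fare system.
taps = [
    ("Mission", 8, 34), ("Mission", 8, 41), ("Mission", 17, 38),
    ("Sunset", 8, 52), ("Sunset", 9, 49), ("Sunset", 17, 55),
]

commutes = defaultdict(list)   # neighborhood -> commute times
boardings = defaultdict(int)   # hour of day  -> number of taps
for hood, hour, minutes in taps:
    commutes[hood].append(minutes)
    boardings[hour] += 1

for hood, times in commutes.items():
    print(f"{hood}: average commute {mean(times):.0f} min over {len(times)} trips")
peak = max(boardings, key=boardings.get)
print(f"Busiest boarding hour: {peak}:00 with {boardings[peak]} taps")
```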
Looking ahead, smartphones may be able to help identify not just where you are, but where you might be expected to be.
Researchers at the University of Birmingham in the UK have created an algorithm that can predict your future location using data gathered from your friends’ smartphones. The algorithm predicts a person’s movements by comparing the data from that person’s smartphone with the smartphone data of people in their social group.
For example, if Mary typically goes to yoga classes on Wednesdays but instead stops at the library to drop off some books, the algorithm will examine the activity of her friends, Jean and Michelle. If Jean and Michelle follow their usual routines, the algorithm will predict with a high degree of accuracy that Mary will continue on to her yoga class after stopping at the library.
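The published algorithm is considerably more sophisticated, but a toy version of the intuition (if your friends are on their usual schedules, you probably are too) might look like this; all names, places, and the data layout are invented:

```python
# Toy sketch of the intuition behind the Birmingham work: if a person's
# friends are sticking to their routines, predict that the person will too.
# The real algorithm uses much richer mobility and social features.

routines = {  # person -> {hour of day: usual place}
    "Mary":     {18: "yoga studio", 20: "home"},
    "Jean":     {18: "gym",         20: "home"},
    "Michelle": {18: "cafe",        20: "home"},
}
friends = {"Mary": ["Jean", "Michelle"]}

def predict(person, hour, observed_now):
    """Predict `person`'s place at `hour`, given where friends are now."""
    on_routine = [observed_now.get(f) == routines[f].get(hour)
                  for f in friends[person]]
    if on_routine and all(on_routine):
        return routines[person].get(hour)   # friends are on schedule
    return "uncertain"                      # deviation: lower confidence

# Jean and Michelle are where they usually are at 18:00, so predict
# Mary continues to her yoga class even after the library detour.
now = {"Jean": "gym", "Michelle": "cafe"}
print(predict("Mary", 18, now))  # -> yoga studio
```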
In a study of 200 people, the algorithm predicted some participants’ locations 24 hours in advance to within about 328 feet, and some to within 65 feet. These types of analyses, which are supported by the use of data visualization techniques, offer tremendous opportunities for retailers if they can predict with a high degree of certainty where people are likely to be.
For instance, a retailer can provide offers or incentives if there’s a strong likelihood that people will be at or near one of its outlets.
Companies will also be able to blend behavioral data with other customer information to make more relevant “hyper-local” offers to customers, as blogger Chris Horton notes.
For example, a restaurant chain that has airport venues can identify customers using passive location-based techniques and provide them offers and coupons based on their favorite meals (e.g., “We noticed that you are traveling through Chicago’s O’Hare International Airport. We’d like to offer you a 20% discount on our pulled-pork platter.”).
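Here is a sketch of the geofence check behind such an offer, using the haversine formula to measure distance to a venue. The coordinates, fence radius, and offer copy are illustrative only:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

# Illustrative venue: a restaurant near O'Hare, with a 1 km geofence.
VENUE = (41.9742, -87.9073)
FENCE_KM = 1.0

def maybe_offer(customer_pos, favorite_meal):
    if haversine_km(*customer_pos, *VENUE) <= FENCE_KM:
        return f"We noticed you're near O'Hare - 20% off our {favorite_meal}!"
    return None

print(maybe_offer((41.9790, -87.9040), "pulled-pork platter"))
```

A production system would also debounce repeat triggers and respect customers' opt-in privacy settings.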
Future location-based services applications will only be strengthened by integrating other data streams (existing customer data, or information triggered by an event such as a product scan or entering a store) and then using data analysis to quickly determine the next best action to take.
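At its simplest, that next-best-action step might be little more than a rule table keyed on the triggering event plus what is already known about the customer. The rules below are invented for illustration:

```python
# Hypothetical next-best-action rules: an event (store entry, product scan)
# plus the existing customer profile decides the follow-up.
def next_best_action(event, profile):
    if event["type"] == "enter_store" and profile.get("loyalty_member"):
        return "push 10% loyalty coupon to app"
    if event["type"] == "product_scan" and event["sku"] in profile.get("wishlist", []):
        return "offer price match on scanned item"
    return "no action"

profile = {"loyalty_member": True, "wishlist": ["SKU-123"]}
print(next_best_action({"type": "enter_store"}, profile))
print(next_best_action({"type": "product_scan", "sku": "SKU-123"}, profile))
```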
As evidenced by some of the emerging applications touched on here, we’ve only begun to scratch the surface for the possibilities that exist between data analysis and location-based data.

Monday, October 15, 2012

Data Mining - PMML in Action



Do you know HTML, the common standard for web pages? PMML is the same idea applied to data mining. PMML stands for Predictive Model Markup Language. It is an XML-based standard for deploying data mining models across the industry, and it covers both data pre-processing and modelling.
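To give a feel for the notation, here is a tiny, hand-written PMML document (a one-split decision tree) embedded in a Python script and inspected with the standard library. The model and its fields are invented, and real PMML files produced by mining tools are far richer:

```python
import xml.etree.ElementTree as ET

# A minimal, illustrative PMML document: a decision tree with one split
# that predicts whether a customer buys, based on age.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header copyright="example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="buys" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="toy" functionName="classification">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="buys" usageType="predicted"/>
    </MiningSchema>
    <Node score="no">
      <True/>
      <Node score="yes">
        <SimplePredicate field="age" operator="greaterThan" value="30"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

# Any PMML consumer can read the same model; here we just list its fields.
ns = {"p": "http://www.dmg.org/PMML-4_1"}
root = ET.fromstring(PMML_DOC)
for field in root.findall("p:DataDictionary/p:DataField", ns):
    print(field.get("name"), field.get("optype"), field.get("dataType"))
```

The point of the standard is exactly this interchange: a model trained in one tool can be scored by any other engine that understands PMML.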
The book contains a good (but short) introduction to the standard; only a few diagrams of the various steps of the process are given. Most of the book focuses on the different parts of the standard, illustrated with code snippets. The main objective of the standard is to reduce time to market when deploying data mining solutions.
The book is really useful if you want to deploy your solution using PMML. If you simply want to learn more about the standard itself, this is not the right book (though it may be the only one). It contains a lot of code but no case study; a case study would have been a plus, since most of the code can certainly be found on the web.

Data Management, Exploration and Mining (DMX)

The Data Management, Exploration and Mining group focuses on solving key problems in information management. Our current focus areas are reducing the total cost of ownership of information management, enabling efficient identification and correction of data quality problems, and enabling flexible and rich modes of interaction with stored information, while recognizing the key role the web plays in information delivery and publishing.
Database management systems provide functionality that is central to developing business applications. Furthermore, as new cloud database services emerge, even more applications are beginning to use database systems. Yet tuning database management systems to achieve the required performance remains a significant problem and results in a high cost of ownership. The goal of our research in the AutoAdmin project is to make database systems more self-tuning and self-administering. We approach this by enabling databases to track their own usage and to gracefully adapt to application requirements. Thus, instead of the application having to track and tune the database, the database actively monitors, diagnoses and tunes itself to be responsive to application needs. Our research has led to novel self-tuning components being included in Microsoft SQL Server.
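As a toy illustration of the self-tuning idea (and emphatically not AutoAdmin's actual approach, which evaluates candidate designs using the query optimizer's own cost estimates), one could imagine tallying which columns appear in query predicates and suggesting indexes for the most frequent ones:

```python
import re
from collections import Counter

# Toy workload monitor: count columns used in WHERE clauses across a
# captured workload, then suggest indexes for the hottest columns.
workload = [
    "SELECT * FROM orders WHERE customer_id = 7",
    "SELECT * FROM orders WHERE customer_id = 9 AND status = 'open'",
    "SELECT * FROM orders WHERE ship_date > '2012-01-01'",
]

usage = Counter()
for q in workload:
    where_clause = q.split("WHERE", 1)[1]
    usage.update(re.findall(r"([a-z_]+)\s*[=<>]", where_clause))

for col, n in usage.most_common(2):
    print(f"CREATE INDEX ix_orders_{col} ON orders({col});  -- seen in {n} queries")
```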
High quality data is a critical requirement for effective business intelligence. Businesses obtain their data from multiple sources, often with widely varying data quality and representations. Therefore, data cleaning technology plays a crucial role in helping identify and correct such data quality problems. The increasing importance of web data and services has made this problem even more important. For example, data cleaning technology is necessary to capture misspellings and differences in representation when a user looks up addresses in an online address search engine. Traditionally, data cleaning has been driven by consultants and custom-made software for specific vertical domains. Our goal in the Data Cleaning project is to design and build a domain-independent platform that can be used to develop solutions for specific vertical domains. We have designed data cleaning operators (e.g. Fuzzy Lookup, Fuzzy Grouping) that can be composed with other operators and customized by application developers to develop suitable data cleaning solutions. A key technical challenge that we focus on is efficient implementation of these operators over large data sets. We have also developed tools, based on learning from examples, that help application developers and domain experts more easily customize and fine-tune their solutions to obtain good data quality. Our research has led to operators such as Fuzzy Lookup and Fuzzy Grouping shipping in Microsoft SQL Server Integration Services, as well as being used in the Bing.com/Maps search engine.
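For a feel of what fuzzy grouping does, here is a deliberately naive standard-library sketch that clusters near-duplicate records by string similarity. The shipped Fuzzy Lookup and Fuzzy Grouping operators use much more scalable, domain-independent similarity machinery than this:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold):
    """Crude string similarity: ratio of matching characters."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Naive fuzzy grouping: each record joins the first existing group whose
# representative it resembles, otherwise it starts a new group.
records = ["Microsoft Corp", "Microsft Corporation", "Contoso Ltd",
           "Microsoft Corporation", "Contoso Limited"]
groups = []  # list of (representative, members)
for r in records:
    for rep, members in groups:
        if similar(r, rep, 0.7):
            members.append(r)
            break
    else:
        groups.append((r, [r]))

for rep, members in groups:
    print(rep, "->", members)
```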
Keyword search over web and enterprise documents is a very popular mechanism for finding relevant information. In both enterprise and web scenarios, document collections coexist with large structured databases. Therefore, keyword search over structured databases, particularly in collections involving both structured and unstructured documents, is an important problem. In the Data Exploration project, we explore the algorithmic and systems issues arising out of the goal of searching and analyzing document collections and structured databases together. One of our goals is to identify structured database objects or entities relevant to a query, even if query keywords are not present in the entity name or description columns. A second goal is to enable efficient keyword search on logical entities (obtained by joining multiple relations) in databases without materializing them.
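A toy version of that idea: build an inverted index over "logical entities" produced by joining two relations, so that a single keyword query can match terms that live in different base tables. The relations and data are invented for the example:

```python
from collections import defaultdict

# Two toy relations; a "logical entity" joins an author to their book.
authors = {1: "Jim Gray", 2: "Surajit Chaudhuri"}
books = [(1, "Transaction Processing"), (2, "Self-Tuning Databases")]

# Inverted index over the joined entities, so one query can hit keywords
# from different base tables (e.g. author name plus book title).
index = defaultdict(set)
entities = []
for author_id, title in books:
    entity = f"{authors[author_id]} - {title}"
    entities.append(entity)
    for token in entity.lower().replace("-", " ").split():
        index[token].add(len(entities) - 1)

def search(query):
    ids = None
    for token in query.lower().split():
        ids = index[token] if ids is None else ids & index[token]
    return [entities[i] for i in sorted(ids or [])]

print(search("gray transaction"))  # keywords span two base relations
```

The real research problem is doing this without materializing every join, which is what the sketch above glosses over.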
While tremendous progress has been made in data capture and storage, the technology for querying, navigating, exploring, visualizing, and summarizing large data stores is still maturing. Traditional approaches to data reduction and analysis break down with massive data sets. Therefore, we aim to exploit Data Mining techniques, i.e., to apply statistical and machine learning techniques to detect patterns in databases. Our research effort in data mining focuses on ensuring that traditional techniques are made effective over enterprise databases. In particular, traditional algorithms need to be made scalable; moreover, it is necessary to enable seamless integration of data mining technology with Relational/OLAP database infrastructure so that database developers can exploit data mining functionality. Our work has resulted in novel technologies shipping in Microsoft SQL Server and Commerce Server.
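One flavor of the scalability rewrite this requires: counting-based mining primitives must run in a single pass with bounded memory. The well-known Misra-Gries heavy-hitters sketch illustrates the style (it is not an algorithm this group ships, just a stand-in example):

```python
# One-pass, bounded-memory frequent-item sketch (Misra-Gries): with k-1
# counters, every item occurring more than n/k times in a stream of
# length n is guaranteed to survive in the output.
def heavy_hitters(stream, k=3):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            for key in list(counters):  # decrement all, evict zeros
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters  # superset of the items occurring > n/k times

purchases = ["milk", "bread", "milk", "eggs", "milk", "bread", "jam", "milk"]
print(heavy_hitters(iter(purchases)))  # -> {'milk': 2}: milk survives
```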