Sunday, July 26, 2015

R vs Python

There is an ongoing debate about whether one should use R or Python when performing day-to-day data analysis tasks. Both Python and R are amongst the most popular languages for data analysis, and each has its supporters and opponents. While Python is often praised for being a general-purpose language with an easy-to-understand syntax, R’s functionality was developed with statisticians in mind, thereby giving it field-specific advantages such as great features for data visualization.

Our new infographic “Data Science Wars: R vs Python” is therefore for everyone interested in how these two (statistical) programming languages relate to each other. The infographic explores what the strengths of R are over Python and vice versa, and aims to provide a basic comparison between these two programming languages from a data science and statistics perspective.


Sunday, July 19, 2015

The Top 10 research papers in computer science by Mendeley readership

We ran the Binary Battle to promote applications built on the Mendeley API (now including PLoS as well), and now it’s time to take a look at the data to see what people have to work with. The analysis is focused on our second largest discipline, Computer Science. Biological Sciences is the largest, but I started with this one so that I could look at the data with fresh eyes, and also because it’s got some really cool papers to talk about.

It was a fascinating list of topics, with many of the expected fundamental papers like Shannon’s Theory of Information and the Google paper, a strong showing from MapReduce and machine learning, but also some interesting hints that augmented reality may be becoming more of an actual reality soon.

1. Latent Dirichlet Allocation (available full-text)
LDA is a means of classifying objects, such as documents, based on their underlying topics. I was surprised to see this paper as number one instead of Shannon’s information theory paper (#7) or the paper describing the concept that became Google (#3). It turns out that interest in this paper is very strong among those who list artificial intelligence as their subdiscipline. In fact, AI researchers contributed the majority of readership to 6 out of the top 10 papers. Presumably, those interested in popular topics such as machine learning list themselves under AI, which explains the strength of this subdiscipline, whereas papers like the MapReduce one or the Google paper appeal to a broad range of subdisciplines, giving those papers smaller numbers of readers spread across more subdisciplines. Professor Blei is also a bit of a superstar, so that didn’t hurt. (The irony of a manually-categorized list with an LDA paper at the top has not escaped us.)
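
For a concrete sense of what LDA does, here is a minimal topic-modeling sketch using scikit-learn (an illustrative example, not code from the paper; it assumes a recent scikit-learn where the class takes n_components):

```python
# A minimal sketch of topic modeling with LDA (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets slid",
    "investors traded bonds and stocks",
]

# Convert raw text into a document-term count matrix.
vectorizer = CountVectorizer(stop_words="english").fit(docs)
X = vectorizer.transform(docs)

# Fit a two-topic LDA model and inspect each document's topic mixture.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # rows are documents, columns are topic proportions
```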

2. MapReduce: Simplified Data Processing on Large Clusters
It’s no surprise to see this in the Top 10 either, given the huge appeal of this parallelization technique for breaking down huge computations into easily executable and recombinable chunks. The importance of the monolithic “Big Iron” supercomputer has been on the wane for decades. The interesting thing about this paper is that it had some of the lowest readership scores of the top papers within any single subdiscipline, but folks from across the entire spectrum of computer science are reading it. This is perhaps expected for such a general-purpose technique, but given the above it’s strange that there are no AI readers of this paper at all.
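
The idea is simple enough to sketch in a few lines of plain Python: a map phase emits key-value pairs from each chunk, a shuffle groups them by key, and a reduce phase combines each group. This is of course a toy single-machine illustration; the real system distributes these phases across thousands of machines:

```python
# A toy word-count in the MapReduce style.
from collections import defaultdict

chunks = ["to be or not to be", "to thine own self be true"]

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    grouped = defaultdict(list)          # the "shuffle" step: group by key
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))  # {'to': 3, 'be': 3, ...}
```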

3. The Anatomy of a Large-Scale Hypertextual Web Search Engine
In this paper, Google founders Sergey Brin and Larry Page discuss how Google was created and how it initially worked. This is another paper that has high readership across a broad swath of disciplines, including AI, but wasn’t dominated by any one discipline. I would expect that the largest share of readers have it in their library mostly out of curiosity rather than direct relevance to their research. It’s a fascinating piece of history related to something that has now become part of our everyday lives.
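
The paper’s core idea, PageRank, scores a page by the ranks of the pages that link to it. A toy power-iteration version (purely illustrative, with a made-up three-page graph) looks like this:

```python
# A bare-bones power-iteration PageRank on a tiny link graph.
damping = 0.85
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 / len(links) for page in links}

for _ in range(50):  # iterate until the ranks stabilize
    new_ranks = {}
    for page in links:
        # Each page linking to us contributes its rank split among its outlinks.
        incoming = sum(ranks[p] / len(links[p]) for p in links if page in links[p])
        new_ranks[page] = (1 - damping) / len(links) + damping * incoming
    ranks = new_ranks

print(ranks)  # pages with more (and better) inbound links rank higher
```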

4. Distinctive Image Features from Scale-Invariant Keypoints
This paper was new to me, although I’m sure it’s not new to many of you. It describes how to identify objects in a video stream without regard to how near or far away they are or how they’re oriented with respect to the camera. AI again drove the popularity of this paper in large part, and to understand why, think “Augmented Reality”. AR is the futuristic idea most familiar to the average sci-fi enthusiast as Terminator-vision. Given the strong interest in the topic, AR could be closer than we think, but we’ll probably use it to layer Groupon deals over shops we pass by instead of building unstoppable fighting machines.
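
For a feel of how such scale- and rotation-invariant features are used today, here is a minimal matching sketch with OpenCV (assuming opencv-python 4.4 or later, where SIFT ships in the main module; the image file names are hypothetical):

```python
# A minimal sketch of scale-invariant feature matching with OpenCV SIFT.
import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file names
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)  # keypoints + 128-d descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors between the two images; good matches survive changes
# in scale, rotation, and viewpoint.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # ratio test
print(f"{len(good)} good matches")
```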

5. Reinforcement Learning: An Introduction (available full-text)
This is another machine learning paper and its presence in the top 10 is primarily due to AI, with a small contribution from folks listing neural networks as their discipline, most likely due to the paper being published in IEEE Transactions on Neural Networks. Reinforcement learning is essentially a technique that borrows from biology, where the behavior of an intelligent agent is controlled by the amount of positive stimuli, or reinforcement, it receives in an environment where there are many different interacting positive and negative stimuli. This is how we’ll teach the robots behaviors in a human fashion, before they rise up and destroy us.
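
A tiny tabular Q-learning loop shows the mechanism: the agent’s value estimates, and hence its behavior, are shaped by the reinforcement it receives. The five-state chain environment below is made up purely for illustration:

```python
# Toy tabular Q-learning: rewards (reinforcement) shape behavior.
import random

n_states, n_actions = 5, 2          # states 0..4; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.3
Q = [[0.0] * n_actions for _ in range(n_states)]

for episode in range(500):
    state = 0
    while state != n_states - 1:     # state 4 is the goal
        # Epsilon-greedy: usually exploit, sometimes explore.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Nudge the estimate toward reward + discounted future value.
        Q[state][action] += alpha * (
            reward + gamma * max(Q[next_state]) - Q[state][action]
        )
        state = next_state

print(Q)  # moving right ends up valued higher in every state
```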

6. Toward the Next Generation of Recommender Systems
Popular among AI and information retrieval researchers, this paper discusses recommendation algorithms and classifies them into collaborative, content-based, or hybrid. While I wouldn’t call this paper a groundbreaking event of the caliber of the Shannon paper above, I can certainly understand why it makes such a strong showing here. If you’re using Mendeley, you’re using both collaborative and content-based discovery methods!
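
A minimal user-based collaborative filter illustrates the first of those three categories: predict a missing rating from the ratings of similar users. The ratings matrix here is invented for the example:

```python
# A minimal user-based collaborative filter (illustrative only).
import numpy as np

# Rows = users, columns = items; 0 means "not rated yet".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target, item = 1, 3  # predict user 1's missing rating for item 3
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0.0   # ignore self-similarity

# Weight the other users' ratings for the item by their similarity.
prediction = sims @ ratings[:, item] / sims.sum()
print(prediction)  # similar users' low ratings pull the prediction down
```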

7. A Mathematical Theory of Communication (available full-text)
Now we’re back to more fundamental papers. I would really have expected this to be at least number 3 or 4, but the strong showing by the AI discipline for the machine learning papers in spots 1, 4, and 5 pushed it down. This paper discusses the theory of sending communications down a noisy channel and demonstrates a few key engineering parameters, such as entropy, a measure of the uncertainty (and hence the information content) of a message. It’s one of the more fundamental papers of computer science, founding the field of information theory and enabling the development of the very tubes through which you received this web page you’re reading now. It’s also the first place the word “bit”, short for binary digit, appears in the published literature.
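
The famous formula is compact enough to show here: for a source emitting symbols with probabilities p_i, the entropy H = -Σ p_i log2 p_i is the average number of bits needed per symbol. A few lines of Python make the point:

```python
# Shannon entropy: the average number of bits per symbol from a source.
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin flip
print(entropy([0.9, 0.1]))   # ~0.47 bits: a biased coin is more predictable
print(entropy([0.25] * 4))   # 2.0 bits: four equally likely symbols
```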

8. The Semantic Web (available full-text)
In The Semantic Web, Sir Tim Berners-Lee, the inventor of the World Wide Web, describes his vision for the web of the future. Now, 10 years later, it’s fascinating to look back through it and see on which points the web has delivered on its promise and how far away we still remain in so many others. This is different from the other papers above in that it’s a descriptive piece, not primary research, but it still deserves its place in the list, and readership will only grow as we get ever closer to his vision.

9. Convex Optimization (available full-text)
This is a very popular book on a widely used optimization technique in signal processing. Convex optimization tries to find the provably optimal solution to an optimization problem, as opposed to merely a nearby local maximum or minimum. While this seems like a highly specialized niche area, it’s of importance to machine learning and AI researchers, so it was able to pull in a nice readership on Mendeley. Professor Boyd has a very popular set of video classes at Stanford on the subject, which probably gave this a little boost, as well. The point here is that print publications aren’t the only way of communicating your ideas. Videos of techniques at SciVee or JoVE or recorded lectures can really help spread awareness of your research.
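
To make this concrete, here is a small constrained least-squares problem solved with CVXPY, a Python modeling library associated with Boyd’s group (shown as an illustrative tool, not code from the book). Because the problem is convex, the answer is globally optimal, not just a local minimum:

```python
# Nonnegative least squares as a convex problem (illustrative only).
import cvxpy as cp
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

x = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(A @ x - b))  # convex objective
problem = cp.Problem(objective, [x >= 0])           # convex constraint
problem.solve()

print(x.value)  # provably optimal solution, up to solver tolerance
```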

10. Object Recognition from Local Scale-Invariant Features
This is another paper on the same topic as paper #4, and it’s by the same author. Looking across subdisciplines as we did here, it’s not surprising to see two related papers, of interest to the main driving discipline, appear twice. Adding the readers from this paper to the #4 paper would be enough to put it in the #2 spot, just below the LDA paper.

2015's Hottest Topics In Computer Science Research

There are several major closed-form challenges in Computer Science, such as the P vs. NP question.
However, the hottest topics are broad and intentionally defined with some vagueness, to encourage out-of-the-box thinking. For such topics, zooming in on the right questions often marks significant progress in itself.

Here’s my list for 2015.

Abundant-data applications, algorithms, and architectures are a meta-topic that includes research avenues such as data mining (quickly finding relatively simple patterns in massive amounts of loosely structured data, evaluating and labeling data, etc.), machine learning (building mathematical models that represent structure and statistical trends in data, with good predictive properties), and hardware architectures that can process more data than is possible today. A sketch of the machine-learning avenue appears below.
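
As a concrete (if tiny) instance of that avenue, this sketch fits a model that captures statistical trends in data and checks its predictions on held-out examples, using scikit-learn:

```python
# Fit a model to data, then test its predictions on unseen examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data
```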

Artificial intelligence and robotics - broadly, figuring out how to formalize human capabilities, which currently appear beyond the reach of computers and robots, and then making computers and robots more efficient at them. Self-driving cars and swarms of search-and-rescue robots are a good illustration. In the past, once good models were found for something (such as computer-aided design of electronic circuits), the research moved into a different field – the design of efficient algorithms, statistical models, computing hardware, etc.

Bio-informatics and other uses of CS in biology, biomedical engineering, and medicine, including systems biology (modeling interactions of multiple systems in a living organism, including immune systems and cancer development), computational biophysics (modeling and understanding mechanical, electrical, and molecular-level interactions inside an organism), and computational neurobiology (understanding how organisms process incoming information and react to it, control their bodies, store information, and think). There is a very large gap between what is known about brain structure and the functional capabilities of a living brain – closing this gap is one of the grand challenges in modern science and engineering. DNA analysis and genetics have also become computer-based in the last 20 years. Biomedical engineering is another major area of growth, where microprocessor-based systems can monitor vital signs and even administer life-saving medications without waiting for a doctor. Computer-aided design of prosthetics is very promising.

Computer-assisted education, especially at the high-school level. Even for CS, few high schools offer a competent curriculum, even in developed countries. Research directions include cheat-proof automated support for exams and testing, essay grading, generation of multiple-choice questions, and support for learning specific skills such as programming (immediate feedback on simple mistakes and suggestions on how to fix them, peer grading, style analysis).

Databases, data centers, information retrieval, and natural-language processing: collecting and storing massive collections of data and making them easily available (indexing, search), helping computers understand (structure in) human-generated documents and artifacts of all kinds (speech, video, text, motion, biometrics) and helping people search for the information they need when they need it. There are many interactions with abundant-data applications here, as well as with human-computer interaction, as well as with networking.

Emerging technologies for computing hardware, communication, and sensing: new models of computation (such as optical and quantum computing) and figuring out what they are [not] good for. Best uses for three-dimensional integrated circuits and a variety of new memory chips. Modeling and using new types of electronic switches (memristors, devices using carbon nano-tubes, etc), quantum communication and cryptography, and a lot more.

Human-computer interaction covers human-computer interface design and focused techniques that allow computers to understand people (detect emotions, intent, level of skill), as well as the design of human-facing software (social networks) and hardware (talking smart-phones and self-driving cars).

Large-scale networking: high-performance hardware for data centers, mobile networking, support for more efficient multicast, multimedia, and high-level user-facing services (social networks), networking services for developing countries (without permanent high-bandwidth connections), and various policy issues (who should run the Internet and whether governments should control it). Outer-space communication networks are another direction. Network security (which I also listed under Security) is also a big deal.

Limits of computation and communication at the level of problem types (some problems cannot be solved in principle!), algorithms (sometimes an efficient algorithm is unlikely to exist) and physical resources, especially space, time, energy and materials. This topic covers Complexity Theory from Theoretical CS, but also the practical obstacles faced by the designers of modern electronic systems, hinting at limits that have not yet been formalized.

Multimedia: graphics, audio (speech, music, ambient sound), video – analysis, compression, generation, playback, multi-channel communication etc. Both hardware and software are involved. Specific questions include scene analysis (describing what’s on the picture), comprehending movement, synthesizing realistic multimedia, etc.

Programming languages and environments: automated analysis of programs in terms of correctness and resource requirements, comparisons between languages, software support for languages (i.e., compilation), program optimization, support for parallel programming, domain-specific languages, interactions between languages, systems that assist programmers by inferring their intent.

Security of computer systems and support for digital democracy, including network-level security (intrusion detection and defense), OS-level security (anti-virus software), and physical security (biometrics, tamper-proof packaging, trusted computing on untrusted platforms), support for personal privacy (efficient and user-friendly encryption), free speech (file sharing, circumventing censors and network restrictions imposed by oppressive regimes), as well as issues related to electronic polls and voting. Security is also a major issue in the use of embedded systems and the Internet of Things (IoT).

Verification, proofs, and automated debugging of hardware designs, software, networking protocols, mathematical theorems, etc. This includes formal reasoning (proof systems and new types of logical arguments), finding bugs efficiently and diagnosing them, finding bug fixes, and confirming the absence of bugs (usually by means of automated theorem-proving).

If something is not listed, it may still be a very worthwhile topic, but not necessarily “hot” right now, or perhaps it is lurking in my blind spot.

Now that you have a long answer, let’s revisit the question! Hotness usually refers to how easy it is to make impact in the field and how impactful the field is likely to be in the broader sense. For example, solving P vs. NP would be impactful and outright awesome, but also extremely unlikely to happen any time soon. So, new researchers are advised to stay away from such an established challenge. Quantum computing is roughly in the same category, although apparently the media and the masses have not realized this. On the positive side, applied physicists are building interesting new devices, producing results that are worthwhile by themselves. So, quantum information processing is a hot area in applied physics, but not in computer design.