Showing posts with label Python. Show all posts

Monday, December 12, 2016

Top Analytics, Data Science software

 

 

R, Python Duel As Top Analytics, Data Science software – KDnuggets 2016 Software Poll Results

R remains the leading tool, with 49% share, but Python grows faster and almost catches up to R. RapidMiner remains the most popular general Data Science platform. Big Data tools used by almost 40%, and Deep Learning usage doubles.  

The poll drew tremendous participation from the analytics and data science community and vendors, attracting 2,895 voters, who chose from a record 102 different tools.

R remains the leading tool, with 49% share (up from 46.9% in 2015), but Python usage grew faster and it almost caught up to R with 45.8% share (up from 30.3%). RapidMiner remains the most popular general platform for data mining/data science, with about 33% share. Notable tools with the most growth in popularity include Dato, Dataiku, MLlib, H2O, Amazon Machine Learning, scikit-learn, and IBM Watson.

The increased choice of tools is reflected in wider usage. The average number of tools used was 6.0, vs 4.8 in 2015.

The usage of Hadoop/Big Data tools grew to 39%, up from 29% in 2015 (and 17% in 2014), driven by Apache Spark, MLlib (Spark Machine Learning Library) and H2O.
The participation by region was: US/Canada (40%), Europe (39%), Asia (9.4%), Latin America (5.8%), Africa/MidEast (2.9%), Australia/NZ (2.2%).

Top Analytics/Data Science Tools

The table below shows the top 10 most popular tools in the 2016 poll.
Tool            2016 % share    % change    % alone
R               49%             +4.5%       1.4%
Python          45.8%           +51%        0.1%
SQL             35.5%           +15%        0%
Excel           33.6%           +47%        0.2%
RapidMiner      32.6%           +3.5%       11.7%
Hadoop          22.1%           +20%        0%
Spark           21.6%           +91%        0.2%
Tableau         18.5%           +49%        0.2%
KNIME           18.0%           -10%        4.4%
scikit-learn    17.2%           +107%       0%

In this table 2016 % share is % of voters who used this tool, % change is the change in share vs 2015 poll, and % alone is the percent of voters who used only the reported tool among all voters who used that tool. E.g. 4.4% of KNIME voters reported using only KNIME and nothing else. We note a decrease in such lone voting, with only 9 tools having 5% or more lone votes.
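
As a rough check on the % change column, the relative change can be recomputed from the shares quoted in the summary. This is only a sketch: the helper name `pct_change` is our own, and because the published shares are rounded, recomputed values for tools with small shares may differ slightly from the published ones.

```python
def pct_change(share_2016, share_2015):
    """Relative change in share, in percent, from the 2015 poll to the 2016 poll."""
    return (share_2016 - share_2015) / share_2015 * 100.0

# Shares (in % of voters) quoted in the poll summary: (2016, 2015)
shares = {
    "R": (49.0, 46.9),
    "Python": (45.8, 30.3),
}

# Recompute the relative change for each tool
changes = {tool: pct_change(*pair) for tool, pair in shares.items()}
for tool, change in changes.items():
    print(f"{tool}: {change:+.1f}%")
```

Rounded, these agree with the +4.5% and +51% reported for R and Python.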

Fig 1: KDnuggets Analytics/Data Science 2016 Software Poll: top 10 most popular tools in 2016

Compared to 2015 KDnuggets Analytics/Data Science Poll results, the only newcomer in top 10 was scikit-learn, displacing SAS.

Tools with the highest growth (among tools with at least 15 users in 2015) were:
Tool                       % change    2016 % share    2015 % share
Dato                       377%        2.4%            0.5%
Dataiku                    292%        7.8%            2.0%
MLlib                      253%        11.6%           3.3%
H2O                        233%        6.7%            2.0%
Amazon Machine Learning    171%        1.9%            0.7%
scikit-learn               107%        17.2%           8.3%
IBM Watson                 99%         4.2%            2.1%
Splunk/Hunk                98%         2.2%            1.1%
Spark                      91%         21.6%           11.3%
Scala                      79%         6.2%            3.5%


This year, 86% of voters used commercial software and 75% used free software. About 25% used only commercial software, and 13% used only open source/free software. A majority of 61% used both free and commercial software, similar to 64% in 2015.
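
The overlap figures above follow from simple set arithmetic. The sketch below uses the rounded percentages from this paragraph; the variable names are our own.

```python
# Rounded percentages of voters reported in the poll
used_commercial = 86   # used any commercial software
only_commercial = 25   # used commercial software and nothing free
only_free = 13         # used free software and nothing commercial

# Voters who used both = commercial users minus commercial-only users
used_both = used_commercial - only_commercial

# Free-software users = users of both plus free-only users
used_free = used_both + only_free

print(used_both, used_free)
```

The recomputed share for both kinds of software matches the reported 61%. The recomputed free-software share comes out one point below the reported 75%, which is consistent with rounding in the published numbers.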

New (in this poll) tools that received at least 1% share votes in 2016 were
  • Anaconda, 16%
  • Microsoft other ML/Data Science tools, 1.6%
  • SAP HANA, 1.2%
  • XLMiner, 1.2%
Among tools with at least 15 votes in 2015, the largest declines in 2016 were for the tools below. These probably reflect a combination of declining popularity for free tools like F# and the lack of a voter drive for some commercial tools this year.
  • Ayasdi, down 85%, to 0.3% share from 2.0%
  • Actian, down 83%, to 0.3% share from 2.0%
  • Datameer, down 52%, to 0.4% share from 0.9%
  • SAP Analytics, down 51%, to 1.5% share from 3.0%
  • SAS Enterprise Miner, down 49%, to 5.6% from 10.9%
  • Alteryx, down 46%, to 3.0% share from 5.6%
  • F#, down 42%, to 0.4% share from 0.7%
  • TIBCO Spotfire, down 36%, to 2.8% share from 4.3%
  • JMP, down 36%, to 2.0% share from 3.1%

Hadoop/Big Data Tools

The usage of Hadoop/Big Data tools grew to 39%, up from 29% in 2015 (and 17% in 2014), driven mainly by big growth in Apache Spark, MLlib (Spark Machine Learning Library) and H2O, which we included among Big Data tools.

Here are the Big Data tools with their shares in 2016 and 2015, and the % change.
Tool                             2016 % share    2015 % share    % change
Hadoop                           22.1%           18.4%           +20.5%
Spark                            21.6%           11.3%           +91%
Hive                             12.4%           10.2%           +21.3%
MLlib                            11.6%           3.3%            +253%
SQL on Hadoop tools              7.3%            7.2%            +1.6%
H2O                              6.7%            2.0%            +234%
HBase                            5.5%            4.6%            +18.6%
Apache Pig                       4.6%            5.4%            -16.1%
Apache Mahout                    2.6%            2.8%            -7.2%
Dato                             2.4%            0.5%            +338%
Datameer                         0.4%            0.9%            -52.3%
Other Hadoop/HDFS-based tools    4.9%            4.5%            +7.5%

Deep Learning Tools

For the second year, the KDnuggets poll included Deep Learning tools. This year, 18% of voters used Deep Learning tools, double the 9% in 2015.

Google TensorFlow jumped to first place, displacing last year's leader, the Theano/Pylearn2 ecosystem.

Top tools:
  • Tensorflow, 6.8%
  • Theano ecosystem (including Pylearn2), 5.1%
  • Caffe, 2.3%
  • MATLAB Deep Learning Toolbox, 2.0%
  • Deeplearning4j, 1.7%
  • Torch, 1.0%
  • Microsoft CNTK, 0.9%
  • Cuda-convnet, 0.8%
  • mxnet, 0.6%
  • Convnet.js, 0.3%
  • darch, 0.1%
  • Nervana, 0.1%
  • Veles, 0.1%
  • Other Deep Learning Tools, 3.7%
The Deep Learning field is still in the beginning of its journey, as we see by the large number of options.

Programming Languages

Python, Java, Unix tools, and Scala grew in popularity, while C/C++, Perl, Julia, F#, Clojure, and Lisp declined.

Here are the programming languages sorted by popularity.
  • Python, 45.8% share (was 30.3%), 51% increase
  • Java, 16.8% share (was 14.1%), 19% increase
  • Unix shell/awk/gawk, 10.4% share (was 8.0%), 30% increase
  • C/C++, 7.3% share (was 9.4%), 23% decrease
  • Other programming/data languages, 6.8% share (was 5.1%), 34.1% increase
  • Scala, 6.2% share (was 3.5%), 79% increase
  • Perl, 2.3% share (was 2.9%), 19% decrease
  • Julia, 1.1% share (was 1.1%), 1.6% decrease
  • F#, 0.4% share (was 0.7%), 41.8% decrease
  • Clojure, 0.4% share (was 0.5%), 19.4% decrease
  • Lisp, 0.2% share (was 0.4%), 33.3% decrease

Thursday, March 10, 2016

Most Popular Coding Languages of 2016





The data on the "Most Popular Coding Languages" is based on hundreds of thousands of data points collected by processing more than 1,200,000 challenge submissions in (now) 26 different programming languages. This gives us valuable insight into the trends in hiring demand among tech companies for the upcoming year. It's data we hope will be especially helpful for new computer science graduates or coders looking to stay ahead of the curve. (CodeEval is now being used as a classroom tool in a number of schools, from university programs to boot camps.)
 
Results

For the fifth year in a row, Python retains its #1 position, followed by Java, C++, and JavaScript.
This year's most noticeable changes were a 27% increase in C# submissions, a 15% surge in Java, and a 21% increase in C submissions. Although Python is still the reigning champ, we saw a 14% drop in Python submissions as well as a 17% decline in Ruby usage.

Programming language ranking change by year.


We've seen a triple-digit surge in R and Visual Basic, but they still account for less than 1%. This year we added 5 new languages: D, Fortran, Guile, OCaml, and Scheme.


Programming language change percentage by year.

It's interesting to note the rise of Java after several years of steady decline. Could this be the year that Java overtakes Python? On the TIOBE index, another major index and a good indicator of market share, Java has surpassed both Python and Visual Basic for the top spot. This may indicate big growth in popularity in the coming year. Note: Some of the newer languages we've added (D, Guile, Fortran, OCaml, and Scheme) may have suffered somewhat since they haven't had a full year inside the platform.

Friday, August 28, 2015

PYTHON PACKAGES FOR DATA MINING





The smart move is to use the same hammer for whatever problem you come across. In the same way, when we set out to solve a data mining problem we will face many issues, but we can solve them by using Python intelligently.


Before stepping directly to Python packages, let me clear up any doubts you may have about why you should be using Python.

WHY PYTHON ?

We all know that Python is a powerful programming language, but what does that mean, exactly? What makes Python a powerful programming language?

PYTHON IS EASY

Python has gained a reputation for being easy to learn. Its syntax is designed to be easily readable. Python is also very popular in scientific computing, where the people working in the field are scientists first and programmers second.

PYTHON IS EFFICIENT

Nowadays we work with bulk amounts of data, popularly known as big data. The more data you have to process, the more important it becomes to manage the memory you use. Python can handle this efficiently.

PYTHON IS FAST

We all know Python is an interpreted language, so we may think that it is slow, but some amazing work has been done over the past years to improve Python’s performance. My point is that if you want to do high-performance computing, Python is a viable option today.
I hope that clears up any doubts about “Why Python?”, so let me jump to the Python packages for data mining.

NUMPY

About:
NumPy is the fundamental package for scientific computing with Python. NumPy is an extension to the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications.
Original author(s): Travis Oliphant
Developer(s): Community project
Initial release: As Numeric, 1995; as NumPy, 2006
Stable release: 1.9.0 / 7 September 2014
Written in: Python, C
Operating system: Cross-platform
Type: Technical computing
License: BSD-new license
Website: www.numpy.org
Installing NumPy:
If Python is not installed on your computer, please install it first.
Installing NumPy on Linux
Open your terminal and run these commands:
sudo apt-get update
sudo apt-get install python-numpy
Sample NumPy code using the reshape function
[code language="python"]import numpy as np
# Build the integers 0..11, then view them as a 3 x 2 x 2 array
a = np.arange(12)
a = a.reshape(3, 2, 2)
print(a)[/code]
Script output
[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]
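
Beyond reshaping, the array-oriented mathematical functions mentioned in the description above can be sketched as follows. This is a minimal illustration with made-up data and variable names of our own, not part of the original example.

```python
import numpy as np

# A 3 x 4 array of the integers 0..11
a = np.arange(12).reshape(3, 4)

# Element-wise arithmetic: no explicit Python loop is needed
squared = a ** 2

# Reduction along an axis: axis=0 collapses the rows, giving one sum per column
column_sums = a.sum(axis=0)

# Broadcasting: subtract each column's mean from every row
centered = a - a.mean(axis=0)

print(squared)
print(column_sums)
print(centered)
```

Operations like these run in compiled code over the whole array, which is the main reason NumPy is preferred over plain Python lists for numerical work.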

SCIPY

About:
SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world’s leading scientists and engineers. If you need to manipulate numbers on a computer and display or publish the results, Scipy is the tool for the job.
Original author(s): Travis Oliphant, Pearu Peterson, Eric Jones
Developer(s): Community library project
Stable release: 0.14.0 / 3 May 2014
Written in: Python, Fortran, C, C++
Operating system: Cross-platform
Type: Technical computing
License: BSD-new license
Website: www.scipy.org
 Installing SciPy in linux
Open your terminal and copy these commands:
sudo apt-get update
sudo apt-get install python-scipy
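Sample SciPy code
[code language="python"]import numpy as np
import matplotlib.pyplot as plt
from scipy import special, optimize
# Negate the Bessel function J3 so that minimizing f finds a maximum of J3
f = lambda x: -special.jv(3, x)
sol = optimize.minimize(f, 1.0)
# Plot J3 on [0, 10] and mark the maximum found by the optimizer
x = np.linspace(0, 10, 5000)
plt.plot(x, special.jv(3, x), '-', sol.x, -sol.fun, 'o')
plt.savefig('plot.png', dpi=96)[/code]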
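Script output
[Figure: plot of the Bessel function J3(x) for x from 0 to 10, with the maximum found by the optimizer marked by a circle; saved as plot.png]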

PANDAS

About:
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
Pandas is well suited for many different kinds of data:
  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure.
Installing Pandas in Linux
Open your terminal and copy these commands:
sudo apt-get update
sudo apt-get install python-pandas
Sample Pandas code about Pandas Series
[code language="python"]import numpy as np
import pandas as pd
# Wrap a NumPy array in a Series; pandas adds a default integer index
values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
ser = pd.Series(values)
print(ser)[/code]
Script output
0 2.0000
1 1.0000
2 5.0000
3 0.9700
4 3.0000
5 10.0000
6 0.0599
7 8.0000
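
The tabular, heterogeneously typed data described above is usually held in a DataFrame rather than a Series. The following is a small sketch with made-up data; the column names and values are illustrative assumptions only.

```python
import pandas as pd

# Illustrative, made-up data with heterogeneously typed columns
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "group": ["x", "y", "x", "y"],
    "value": [2.0, 1.0, 5.0, 3.0],
})

# Boolean selection: keep only the rows where group == "x"
x_rows = df[df["group"] == "x"]

# Split-apply-combine: mean of "value" within each group
group_means = df.groupby("group")["value"].mean()

print(x_rows)
print(group_means)
```

The result of `groupby(...).mean()` is itself a Series indexed by the group labels, so it can be queried by label, for example `group_means["x"]`.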

MATPLOTLIB


About:
matplotlib is a plotting library for the Python programming language and its NumPy numerical mathematics extension. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+. There is also a procedural “pylab” interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB. SciPy makes use of matplotlib.
Original author(s): John Hunter
Developer(s): Michael Droettboom, et al.
Stable release: 1.4.2 (26 October 2014)
Written in: Python
Operating system: Cross-platform
Type: Plotting
License: matplotlib license
Website: matplotlib.org
Installing Matplotlib in linux
Open your terminal and copy these commands:
sudo apt-get update
sudo apt-get install python-matplotlib
Sample Matplotlib code to create histograms
[code language="python"]import numpy as np
import matplotlib.pyplot as plt
# example data
mu = 100 # mean of distribution
sigma = 15 # standard deviation of distribution
x = mu + sigma * np.random.randn(10000)
num_bins = 50
# the histogram of the data (density=True replaces the older normed=1 argument)
n, bins, patches = plt.hist(x, num_bins, density=True, facecolor='green', alpha=0.5)
# add a 'best fit' line: the normal PDF, computed directly because
# mlab.normpdf is no longer available in recent matplotlib releases
y = np.exp(-0.5 * ((bins - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
plt.plot(bins, y, 'r--')
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title(r'Histogram of IQ: $\mu=100$, $\sigma=15$')
# Tweak spacing to prevent clipping of ylabel
plt.subplots_adjust(left=0.15)
plt.show()[/code]
Script output
[Figure: histogram of the sampled data with the best-fit normal curve drawn as a red dashed line]

IPYTHON

IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, that offers enhanced introspection, rich media, additional shell syntax, tab completion, and rich history. IPython currently provides the following features:
  • Powerful interactive shells (terminal and Qt-based).
  • A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into one’s own projects.
  • Easy to use, high performance tools for parallel computing.
Original author(s): Fernando Perez and others
Stable release: 2.3 / 1 October 2014
Written in: Python, JavaScript, CSS, HTML
Operating system: Cross-platform
Type: Shell
License: BSD
Website: www.ipython.org
Installing IPython in linux
Open your terminal and copy these commands:
sudo apt-get update
sudo pip install ipython
Sample IPython code
The following code plots a curve and shades the region beneath it, demonstrating the integral as the area under a curve.
[code language="python"]import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
def func(x):
    return (x - 3) * (x - 5) * (x - 7) + 85
a, b = 2, 9 # integral limits
x = np.linspace(0, 10)
y = func(x)
fig, ax = plt.subplots()
plt.plot(x, y, 'r', linewidth=2)
plt.ylim(bottom=0)
# Make the shaded region
ix = np.linspace(a, b)
iy = func(ix)
verts = [(a, 0)] + list(zip(ix, iy)) + [(b, 0)]
poly = Polygon(verts, facecolor='0.9', edgecolor='0.5')
ax.add_patch(poly)
plt.text(0.5 * (a + b), 30, r"$\int_a^b f(x)\mathrm{d}x$",
         horizontalalignment='center', fontsize=20)
plt.figtext(0.9, 0.05, '$x$')
plt.figtext(0.1, 0.9, '$y$')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.xaxis.set_ticks_position('bottom')
ax.set_xticks((a, b))
ax.set_xticklabels(('$a$', '$b$'))
ax.set_yticks([])
plt.show()[/code]
Script output
[Figure: the curve f(x) with the area between x = a and x = b shaded, illustrating the integral]
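
The shaded area can also be estimated numerically. The sketch below applies the composite trapezoidal rule to the same cubic and the same limits; the number of sample points (10001) is an arbitrary choice of ours, and the rule is written out with basic NumPy operations rather than a library integrator.

```python
import numpy as np

def func(x):
    # Same cubic as in the plotting example above
    return (x - 3) * (x - 5) * (x - 7) + 85

a, b = 2, 9  # integral limits, as in the plot

# Sample the curve densely on [a, b]
x = np.linspace(a, b, 10001)
y = func(x)

# Composite trapezoidal rule: average of neighbouring heights times strip width
area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))

print(area)
```

Integrating the cubic by hand gives exactly 624.75, so the numerical estimate should agree with that value to several decimal places.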

SCIKIT-LEARN

The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a “SciKit” (SciPy Toolkit), a separately-developed and distributed third-party extension to SciPy. The original codebase was later extensively rewritten by other developers. Of the various scikits, scikit-learn as well as scikit-image were described as “well-maintained and popular” in November 2012.
Original author(s): David Cournapeau
Initial release: June 2007
Stable release: 0.15.1 / August 1, 2014
Written in: Python, Cython, C and C++
Operating system: Linux, Mac OS X, Microsoft Windows
Type: Library for machine learning
License: BSD License
Website: scikit-learn.org
Installing Scikit-learn in linux
Open your terminal and copy these commands
sudo apt-get update
sudo apt-get install python-sklearn
Sample Scikit-learn code
[code language="python"]import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis]
diabetes_X_temp = diabetes_X[:, :, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X_temp[:-20]
diabetes_X_test = diabetes_X_temp[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error on the test set
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()[/code]
Script output
Coefficients:
[ 938.23786125]
Residual sum of squares: 2548.07
Variance score: 0.47
[Figure: scatter plot of the test data with the fitted regression line]
I have explained the packages which we are going to use in coming posts to solve some interesting problems.
Please leave your comment if you have any other Python data mining packages to add to this list.
Originally published here.

Data Mining - Fruitful and Fun

Orange Tutorial

http://www.orange.biolab.si/tutorial/rst/index.html

Sunday, July 26, 2015

R vs Python






whether one should use R or Python when performing their day-to-day data analysis tasks. Both Python and R are amongst the most popular languages for data analysis, and have their supporters and opponents. While Python is often praised for being a general-purpose language with an easy-to-understand syntax, R’s functionality is developed with statisticians in mind, thereby giving it field-specific advantages such as great features for data visualization.

Our new infographic “Data Science Wars: R vs Python” is therefore for everyone interested in how these two (statistical) programming languages relate to each other. The infographic explores what the strengths of R are over Python and vice versa, and aims to provide a basic comparison between these two programming languages from a data science and statistics perspective.