Great Packages for Data Science in Python and R

This article was contributed by Hang Li at Hulu:

In a talk, Domino’s Chief Data Scientist, Eduardo Ariño de la Rubia, discusses Python and R as the “best” language for data scientists.
Below is a list of useful packages from this talk, first for Python and then for R.


  • Feather – Fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow
  • Ibis  – Productivity-centric Python data analysis framework for SQL systems and the Hadoop platform. Co-founded by the creator of pandas
  • Paratext  – A library for reading text files over multiple cores.
  • Bcolz  – A columnar data container that can be compressed.
  • Altair  – Declarative statistical visualization library for Python
  • Bokeh  – Interactive Web Plotting for Python
  • Blaze  – NumPy and Pandas interface to Big Data
  • xarray  – N-D labeled arrays and datasets in Python
  • Dask  – Versatile parallel programming with task scheduling
  • Keras – High-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano.
  • PyMC3  – Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Theano


  • Feather – Fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow
  • Haven – Import foreign statistical formats into R via the embedded ReadStat C library.
  • readr  – Read flat/tabular text files from disk (or a connection).
  • jsonlite – A fast JSON parser and generator optimized for statistical data and the web.
  • ggplot2 – A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.
  • htmlwidgets – A framework for creating HTML widgets that render in various contexts including the R console, ‘R Markdown’ documents, and ‘Shiny’ web applications.
  • leaflet – Create and customize interactive maps using the ‘Leaflet’ JavaScript library and the ‘htmlwidgets’ package.
  • tilegramsR  – Provide R spatial objects representing Tilegrams.
  • dplyr – A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
  • broom – Convert statistical analysis objects from R into tidy data frames
  • tidytext – Text mining for word processing and sentiment analysis using ‘dplyr’, ‘ggplot2’, and other tidy tools.
  • mxnet – The MXNet R package brings flexible and efficient GPU computing and state-of-the-art deep learning to R.
  • tensorflow – TensorFlow™ is an open source software library for numerical computation using data flow graphs.

[Video] A Huge Debate: R vs. Python for Data Science

Kaggler. Data Scientist.

Solution Sharing for the Allstate Competition at Kaggle

I participated in the Allstate competition at Kaggle and finished 54th out of 3,055 teams. After the competition, I shared my solution in the forum here:

Congrats to the winners and top performers, and thanks to all the contributors in the forum for sharing so much. It’s always a humbling experience to compete at Kaggle. I learn so much at every competition from my fellow Kagglers.

Here I’d like to share my code base and notes for the competition:

My friends and I have been using a Makefile-based framework for competitions for years now, and it has worked well so far.
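
For readers who have not used Makefiles in a data pipeline, the core idea is that each output file (a feature set, a model prediction, a submission) is rebuilt only when it is missing or older than its inputs. The snippet below is a rough Python sketch of that rule, not the actual framework: the helper names (`is_stale`, `build`, `make_features`) and the file names are hypothetical and chosen only for illustration.

```python
import os
import tempfile
from pathlib import Path


def is_stale(target, sources):
    """Return True when target is missing or older than any of its sources."""
    target = Path(target)
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(Path(s).stat().st_mtime > target_mtime for s in sources)


def build(target, sources, recipe):
    """Run recipe(sources, target) only if target is stale; return whether it ran."""
    if is_stale(target, sources):
        recipe(sources, target)
        return True
    return False


def make_features(sources, target):
    # Hypothetical "feature engineering" step: upper-case the raw text.
    Path(target).write_text(Path(sources[0]).read_text().upper())


def demo():
    """Build a feature file twice; only the first call should do any work."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "train.csv"
        feature = Path(tmp) / "feature.csv"
        raw.write_text("id,loss\n1,2.5\n")
        # Backdate the raw file so the timestamp comparison is unambiguous.
        os.utime(raw, (1_000_000_000, 1_000_000_000))
        first = build(feature, [raw], make_features)
        second = build(feature, [raw], make_features)
        return first, second, feature.read_text()
```

Calling `demo()` runs the recipe on the first `build` and skips it on the second, because the feature file is then newer than the raw data. A real Makefile expresses the same dependency as a rule with a target, its prerequisites, and a recipe, and chains many such rules from raw data to the final submission.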

An introduction to the framework is available on the TalkingData forum:

Our code repos for past competitions are also available at:

Hope it’s helpful.
