New Editor – Tam T. Nguyen, Kaggle Grandmaster

I am happy to announce that we have a new editor, Tam T. Nguyen, joining Kaggler.com.

Tam is a Competition Grandmaster at Kaggle. He won first prize at KDD Cup 2015, the IJCAI-15 repeat buyer competition, and the Springleaf marketing response competition.

Currently, he is a Postdoctoral Research Fellow at Ryerson University in Toronto, Canada. Prior to that, he was a Data Analytics Project Lead at I2R, A*STAR. He earned his Ph.D. in Computer Science from NTU Singapore. He’s originally from Vietnam.

Please subscribe to us at Kaggler.com, Facebook, and Twitter.

Keras Backend Benchmark: Theano vs TensorFlow vs CNTK

Inspired by Max Woolf’s benchmark, I compared the performance of three Keras backends (Theano, TensorFlow, and CNTK) on four different GPUs (K80, M60, Titan X, and 1080 Ti) across various neural network tasks.

For TensorFlow and CNTK on the K80, I used the numbers reported in Max Woolf’s benchmark.
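
All tests run the same Keras code; only the backend changes. Below is a minimal sketch (not taken from the benchmark code) of how the backend can be switched per run: the KERAS_BACKEND environment variable overrides the "backend" field in ~/.keras/keras.json, as long as it is set before Keras is imported.

    import os
    # choose "theano", "tensorflow", or "cntk" before the first Keras import
    os.environ["KERAS_BACKEND"] = "tensorflow"

    from keras import backend as K
    print(K.backend())  # confirms which backend Keras actually loaded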

Conclusion

  • The accuracies of the Theano, TensorFlow, and CNTK backends are similar across all benchmark tests, while speeds vary a lot.
    • Theano is significantly (up to 50 times) slower than TensorFlow and CNTK.
    • Between TensorFlow and CNTK, CNTK is a lot (about 2 to 4 times) faster than TensorFlow for LSTM tasks (Bidirectional LSTM on IMDb Data and Text Generation via LSTM), while speeds for other types of neural networks are close to each other.
  • Among K80, M60, Titan X and 1080 Ti GPUs:
    • 1080 Ti is the fastest.
    • K80 is the slowest.
    • M60 is faster than K80 and comparable to Titan X and 1080 Ti.
    • Theano is significantly (up to 14 times) faster on 1080 Ti than on Titan X, while the improvements for TensorFlow and CNTK are moderate.

Detailed results are available at https://github.com/szilard/benchm-dl/blob/master/keras_backend.md

Building Your Own Kaggle Machine

In 2014, I shared the specifications of a 6-core, 64GB RAM desktop system that I purchased for around $2,000. Since then, I added an NVIDIA Titan X to it for deep learning at an additional $1,000, and it has served me well.

However, as other team members started joining me in data science competitions and deep learning competitions became more popular, my team decided to build a more powerful desktop system.

The specifications of the new system that we built are as follows:

  • CPU: Xeon 2.4GHz 14-Core
  • RAM: 128GB DDR4-2400
  • GPU: 4x NVIDIA GTX 1080 Ti 11GB
  • SSD: 960GB
  • HDD: 4TB 7200RPM
  • PSU: 1600W 80+ Titanium certified

Total cost including tax and shipping was around $7,000. Depending on your budget, you can go down to two 1080 Ti GPUs instead of four (-$1,520), or 64GB of RAM instead of 128GB (-$399), and still have a decent system.

You can find the full part list here.
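
Before launching long training jobs on a multi-GPU box like this, it is worth confirming that every card is visible to the framework. Here is a minimal sketch, assuming a TensorFlow installation; it simply lists the local devices.

    from tensorflow.python.client import device_lib

    # enumerate the devices TensorFlow can see on this machine
    devices = device_lib.list_local_devices()
    gpu_names = [d.name for d in devices if d.device_type == "GPU"]
    print("Visible GPUs:", gpu_names)  # expect four entries on the 4x 1080 Ti build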

Additional Resources

Winning Data Science Competitions – Latest Slides


This year I had several occasions to give my “Winning Data Science Competitions” talk – at Microsoft, KSEA-SWC 2017, USC Applied Statistics Club, Spark SC, and Whisper.

I am grateful for all these opportunities to share what I enjoy with the data science community.

I truly believe that working on competitions on a regular basis can make us better data scientists. Hope my talk and slides help other data scientists.

My talk is outlined as follows:

  1. Why compete
    1. For fun
    2. For experience
    3. For learning
    4. For networking
  2. Data science competition intro
    1. Competitions
    2. Structure
    3. Kaggle
  3. Misconceptions of data science competitions
    1. No ETL?
    2. No EDA?
    3. Not worth it?
    4. Not for production?
  4. Best practices
    1. Feature engineering
    2. Diverse algorithms
    3. Cross validation
    4. Ensemble
    5. Collaboration
  5. Personal tips
  6. Additional resources

You can find the latest slides here:

Kaggler 0.5.0 Released

I am glad to announce the release of Kaggler 0.5.0, which brings a significant improvement in the performance of the FTRL algorithm, thanks to Po-Hsien Chu (github, kaggle, linkedin).

Results

We increased the training speed by up to 100 times compared to 0.4.x. Our benchmark shows that one epoch on 1MM records with 8 features takes 1.2 seconds with 0.5.0, compared to 98 seconds with 0.4.x, on an i7 CPU.
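
If you want to run a rough benchmark of your own, here is a minimal sketch. The import path and the FTRL constructor arguments (a, b, l1, l2, n, epoch, interaction) are assumptions based on the package documentation and may differ between versions.

    import time
    import numpy as np
    from scipy import sparse
    from kaggler.online_model import FTRL  # assumed import path

    # toy stand-in for the benchmark data: sparse rows with 8 binary features
    X = sparse.random(100000, 8, density=0.5, format='csr', random_state=42)
    X.data[:] = 1.0
    y = np.random.randint(0, 2, size=X.shape[0])

    clf = FTRL(a=0.1, b=1.0, l1=1.0, l2=1.0, n=2**20, epoch=1, interaction=False)

    start = time.time()
    clf.fit(X, y)
    print('one epoch took {:.2f} seconds'.format(time.time() - start))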

Motivation

The FTRL algorithm has been popular since its first appearance in a paper published by Google. It is well suited for highly sparse data, so it has been widely used for click-through rate (CTR) prediction in online advertising. Many Kagglers use FTRL as one of their base algorithms in CTR prediction competitions. Therefore, we want to improve our FTRL implementation and benefit Kagglers who use our package.

Methods

We profile the code with cProfile and resolve the overheads one by one:

  1. Remove the overhead of SciPy sparse matrix row operations: the SciPy sparse matrix checks many conditions in __getitem__, resulting in a lot of function calls. In fit(), we know that we are fetching each row exactly once and are very unlikely to exceed the bounds, so we can fetch the indices of each row in a faster way (see the sketch after this list). This enhancement makes our FTRL 10x faster.
  2. More C-style enhancements: specify types more explicitly, return a whole list instead of yielding feature indexes, etc. These enhancements make our FTRL 5x faster when interaction==False.
  3. Faster hash function for interaction features: the last enhancement removes the overhead of hashing interaction features. We use MurmurHash3, which scikit-learn uses, to directly hash the product of feature indexes. This enhancement makes our FTRL 5x faster when interaction==True.
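
The following is a minimal sketch (not from the Kaggler source) of the idea behind the first item: reading a CSR row's column indices directly from the underlying indptr/indices arrays instead of slicing with X[i], which goes through SciPy's heavily checked __getitem__.

    import numpy as np
    from scipy import sparse

    X = sparse.random(1000, 50, density=0.1, format='csr', random_state=0)

    def row_indices_slow(X, i):
        # convenient but slow: builds a new one-row sparse matrix on every call
        return X[i].indices

    def row_indices_fast(X, i):
        # direct lookup into the CSR index arrays: no bounds checks, no copies
        return X.indices[X.indptr[i]:X.indptr[i + 1]]

    # both return the same column indices for a given row
    assert np.array_equal(np.sort(row_indices_slow(X, 3)), np.sort(row_indices_fast(X, 3)))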

Contributor

Po-Hsien Chu (github, kaggle, linkedin)

Great Packages for Data Science in Python and R

This article is contributed by Hang Li at Hulu:


Domino’s Chief Data Scientist, Eduardo Ariño de la Rubia, talks about Python and R as the “best” languages for data scientists. Below is a list of useful packages from this talk.

Python

  • Feather – Fast, interoperable binary data frame storage for Python, R, and more, powered by Apache Arrow (a minimal usage sketch follows this list)
  • Ibis  – Productivity-centric Python data analysis framework for SQL systems and the Hadoop platform. Co-founded by the creator of pandas
  • Paratext  – A library for reading text files over multiple cores.
  • Bcolz  – A columnar data container that can be compressed.
  • Altair  – Declarative statistical visualization library for Python
  • Bokeh  – Interactive Web Plotting for Python
  • Blaze  – NumPy and Pandas interface to Big Data
  • Xarray – N-D labeled arrays and datasets in Python
  • Dask  – Versatile parallel programming with task scheduling
  • Keras – High-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano.
  • PyMC3  – Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Theano
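
Feather, the first entry above, is easy to try. Here is a minimal sketch assuming the feather-format package is installed; a frame written this way can be read back from R with feather::read_feather.

    import pandas as pd
    import feather  # pip install feather-format

    df = pd.DataFrame({'id': [1, 2, 3], 'score': [0.1, 0.5, 0.9]})
    feather.write_dataframe(df, 'scores.feather')   # readable from R as well
    df2 = feather.read_dataframe('scores.feather')
    print(df2.equals(df))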

R

  • Feather – Fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow
  • Haven – Import foreign statistical formats into R via the embedded ReadStat C library.
  • readr  – Read flat/tabular text files from disk (or a connection).
  • jsonlite – A fast JSON parser and generator optimized for statistical data and the web.
  • ggplot2 – A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.
  • htmlwidgets – A framework for creating HTML widgets that render in various contexts including the R console, ‘R Markdown’ documents, and ‘Shiny’ web applications.
  • leaflet – Create and customize interactive maps using the ‘Leaflet’ JavaScript library and the ‘htmlwidgets’ package.
  • tilegramsR  – Provide R spatial objects representing Tilegrams.
  • dplyr – A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
  • broom – Convert statistical analysis objects from R into tidy data frames
  • tidytext – Text mining for word processing and sentiment analysis using ‘dplyr’, ‘ggplot2’, and other tidy tools.
  • mxnet – The MXNet R package brings flexible and efficient GPU computing and state-of-the-art deep learning to R.
  • tensorflow – TensorFlow™ is an open source software library for numerical computation using data flow graphs.

[Video] A Huge Debate: R vs. Python for Data Science

Solution Sharing for the Allstate Competition at Kaggle

I participated in the Allstate competition at Kaggle and finished 54th out of 3,055 teams.  I shared my solution in the forum after the competition here:


Congrats to the winners and top performers, and thanks to everyone who shared in the forum. It’s always a humbling experience to compete at Kaggle. I learn so much from fellow Kagglers at every competition.

Here I’d like to share my code base and notes for the competition:

My friends and I have been using a framework based on Makefiles for competitions for years now, and it has worked great so far.

Introduction to the framework is available on the TalkingData forum:

Our code repos for previous competitions are also available at:

Hope it’s helpful.