NIPS 2017 Notes

Jeong and I attended NIPS 2017 in December 2017. Our notes are as follows.

Take-Aways for Professionals

As shown in the statistics shared by the organizers during the opening remarks, the majority of NIPS papers come from academia. Even the papers from industry, which are only a small fraction, are mostly from research organizations. What can professionals take away from this academic conference? In my experience, people from industry can get the following benefits from NIPS.

  • Cutting-edge research: This might not be immediately applicable in practice, but it can still provide important perspectives and directions on each problem.
  • Recruiting: I would say that 90% of sponsors are focused on hiring. All the big companies had their after-parties (a.k.a. recruiting events).
  • Networking: For some people, this is the most important benefit of NIPS. With over 7,000 attendees, NIPS 2017 was the largest academic conference in machine learning. Every day I enjoyed conversations with many people in the same field at the poster sessions, at after-parties, and even on the way back home with UberPool.
Registration line at NIPS 2017. There were over 7,000 attendees.

Technical Trends

We noticed the following technical trends:

  • Meta-learning
  • Interpretability
  • ML systems (or systems for ML)
  • Bayesian modeling
  • Unsupervised learning
  • Probabilistic programming

Below are areas that I would like to investigate further in 2018.

  • Model Interpretation
  • Attention models
  • Online learning
  • Reinforcement learning

Detailed Session-by-Session Notes

Below are more detailed notes:

On 12/4 (Mon)

Tutorials

  • Deep Learning: Practice and Trends
    A very good summary of deep learning's current status and trends. CNNs, RNNs, adversarial networks, and unsupervised domain adaptation are closer to actual application; these models should be in professionals' tool boxes. Meta-learning and graph networks are interesting but further away from application.
  • Deep Probabilistic Modeling with Gaussian Processes
    This talk makes an important point: in real-world applications, we need not only point predictions but also the level of uncertainty in those predictions to support decision making.
  • Geometric Deep Learning on Graphs and Manifolds by Michael Bronstein
    This talk focuses on an interesting trend in deep learning: applying deep learning to graph data. In my opinion, there is still a long way to go before this field produces real applications.

Opening Remarks/ Invited Talk

  • Opening Remarks & Powering the Next 100 Years
    The opening remarks included several interesting statistics about NIPS, showing that it is a very academia-centric conference. The invited talk explained the huge amount of energy humanity needs, and the limitations of fossil fuels and low-carbon technology. It presented ideas on how machine learning can help new energy (fusion) over the next 100 years and have a big impact, including exploration and inference on experimental data and adding human (domain expert) preferences into ML approaches; several Bayesian approaches were mentioned. It is about applied machine learning in physics, which can greatly impact the world. Thanks to many open source frameworks, it has become much easier to apply ML to different problems; ML is becoming a major tool and will have a huge impact across domains.

Poster Sessions

  • SVCCA: Singular Vector Canonical Correlation Analysis for Deep Understanding and Improvement
    Google's blog and paper on understanding deep learning models. It can be used to improve prediction performance. The key idea is using singular vector canonical correlation analysis (SVCCA) to analyze hidden layer parameters.
  • DropoutNet: Addressing Cold Start in Recommender Systems
    This focuses only on item cold start. It needs a metadata-based vector representation of new items.
  • LightGBM: A Highly Efficient Gradient Boosting Decision Tree
    This paper explains the implementation of LightGBM. It uses a different approximation approach from XGBoost's.
  • Discovering Potential Correlations via Hypercontractivity
    An interesting idea for finding potential relationships in subsets of data.
  • Other interesting papers
    * Learning Hierarchical Information Flow with Recurrent Neural Modules
    * Learning ReLUs via Gradient Descent
    * Clone MCMC: Parallel High-Dimensional Gaussian Gibbs Sampling
    * Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems

On 12/5 (Tue)

Invited Talk

  • Why AI Will Make it Possible to Reprogram the Human Genome
    This is one of the most impactful areas of AI/DL. Lately, AI/DL has been used to tackle many challenges in healthcare and has shown some promising results.
  • Test of Time Award: Random Features for Large-Scale Kernel Machines
    This was the spotlight talk of NIPS 2017 and stirred a lot of discussion online. I highly recommend watching the video. Points from both sides of the discussion are valid. Some related discussions: Yann LeCun's rebuttal to Ali's talk; Alchemy, Rigour and Engineering.
  • The Trouble with Bias
    This is a good topic. The data collection and creation process can introduce strong, undesirable bias into a data set, and ML algorithms can reproduce and even reinforce such bias. This is more than a technical problem.

Poster Sessions

  • A Unified Approach to Interpreting Model Predictions
    Uses expectations and Shapley values to interpret model predictions. It unifies several previous approaches, including LIME. Code: https://github.com/slundberg/shap
  • Positive-Unlabeled Learning with Non-Negative Risk Estimator
    One-class classification is very useful in the real world, e.g. ad clicks, content watches, etc. This paper uses a different loss function for PU learning.
  • An Applied Algorithmic Foundation for Hierarchical Clustering
    There are several papers on hierarchical clustering; this is one of them. Hierarchical clustering is also very useful in the real world. This paper focuses more on the foundation (the objective function) of the problem.
  • Affinity Clustering: Hierarchical Clustering at Scale
    Another hierarchical clustering paper: a bottom-up approach that makes many merge decisions at each step.
  • Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
    This is an interesting semi-supervised deep learning approach. I feel it uses the student to prevent overfitting; teacher and student improve each other in a virtuous cycle.
  • Unbiased estimates for linear regression via volume sampling
    Choosing samples wisely can achieve performance similar to using the entire data set. This is useful in scenarios where labels are costly to obtain.
  • A framework for Multi-A(rmed)/B(andit) Testing with Online FDR Control
    There are several papers on multi-armed bandits (MAB); this is one of them. MAB can be very useful in website optimization.
  • Other interesting papers
    * Streaming Weak Submodularity: Interpreting Neural Networks on the Fly
    * Generalization Properties of Learning with Random Features

On 12/6 (Wed)

Invited Talk

  • The Unreasonable Effectiveness of Structure
    This talk discussed structure in inputs and outputs, and then described a way to represent "structure" in data (Probabilistic Soft Logic, http://psl.linqs.org/).
  • Deep Learning for Robotics
    If you work in robotics, this is a must-attend talk. It discussed many unsolved pieces of the AI robotics puzzle and how DL (deep reinforcement learning, meta-learning, etc.) can help. Some ideas might be useful in other domains.

Poster Sessions

  • Clustering with Noisy Queries
    This paper describes and analyzes a way to gather answers to a clustering problem. Instead of asking "does element u belong to cluster A?", it suggests asking "do elements u and v belong to the same cluster?"
  • End-to-End Differentiable Proving
    A very interesting paper that tries to combine neural networks with a first-order-logic expert system by learning vector representations of symbols.
  • ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games
    Looks like a fun platform to try AI on. :)
  • Attention Is All You Need
    A new, simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
  • Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
    Measuring uncertainty is very important. This paper describes a simple non-Bayesian baseline for doing so.
  • Other interesting papers
    * Train longer, generalize better: closing the generalization gap in large batch training of neural networks
    * Unsupervised Image-to-Image Translation Networks
    * A simple neural network module for relational reasoning
    * Style Transfer from Non-parallel Text by Cross-Alignment
* Style Transfer from Non-parallel Text by Cross-Alignment

On 12/7 (Thu)

Invited Talk

  • Learning State Representations
    This is a very interesting talk. It tries to peel back the layers of how humans make decisions and learn. The researcher also designed experiments to test the hypothesis that "we cluster experiences together into task states based on similarity, and learning happens within a cluster, not across cluster borders", and then tried to design model structures to represent these clusters (states).
  • On Bayesian Deep Learning and Deep Bayesian Learning
    This talk is about combining Bayesian learning and deep learning. The topic can be very useful in the future. It also covered several projects in this area.

Symposium – Interpretable ML

  • About this symposium
    I think interpretability is a very important property of models. As mentioned in one of the symposium talks, interpretability is not a purely computational problem and goes beyond technology. The final goal is still to untangle (understand) causal impact. Model interpretability can be valuable in at least two ways: debugging model predictions, and helping generate hypotheses for controlled experiments.
  • Invited talk - The role of causality for interpretability
    This talk discussed how to use causality in model interpretability.
  • Invited talk - Interpretable Discovery in Large Image Data Sets
    This talk presented DEMUD (SVD-based, plus explanations), a method to interpret image data sets.
  • Posters
    * Detecting Bias in Black-Box Models Using Transparent Model Distillation
    * The Intriguing Properties of Model Explanations
    * Feature importance scores and lossless feature pruning using Banzhaf power indices
  • Debate about whether or not interpretability is necessary for machine learning
    An interesting debate about interpretability. Worth watching.

Other Resources

NIPS videos, slides, and notes are available online.

Second Place Solution at CIKM AnalytiCup 2017 – Lazada Product Title Quality Challenge

Lazada Product Title Quality Challenge

In this challenge, the participants were provided with a set of product titles, descriptions, and attributes, together with the associated title quality scores (clarity and conciseness) as labeled by Lazada's internal QC team. The task was to build a product title quality model that can automatically grade the clarity and conciseness of a product title.

Team members

Tam T. Nguyen, Kaggle Grandmaster, Postdoctoral Research Fellow at Ryerson University

Hossein Fani, PhD Student at University of New Brunswick

Ebrahim Bagheri, Associate Professor at Ryerson University

Gilberto Titericz, Kaggle Grandmaster (#1 Kaggler), Data Scientist at Airbnb

Solution Overview

We present our winning approach for the Lazada Product Title Quality Challenge at the CIKM Cup 2017, where the data set was annotated for conciseness and clarity by the Lazada QC team. The participants were asked to build machine learning models to predict the conciseness and clarity of an SKU based on product title, short description, product categories, price, country, and product type. As sellers could freely enter anything for the title and description, these fields may contain typos or misspelled words. Moreover, since many annotators labelled the data, there is bound to be disagreement on the true label of an SKU. This makes the problem difficult to solve using only traditional natural language processing and machine learning techniques.

In our proposed approach, we adapted text mining and machine learning methods that take both feature and label noise into account. Specifically, we used bagging methods to deal with label noise, where the whole training data cannot be used to build our models. Moreover, we reasoned that, for each SKU, conciseness and clarity would have been annotated by the same QC person, so the two labels should be correlated in a certain manner. Therefore, we extended our bagging approach with out-of-fold leakage to take advantage of this correlation information. Our proposed approach achieved root mean squared errors (RMSE) of 0.3294 and 0.2417 on the test data for conciseness and clarity, respectively. You may refer to the paper or the source code for more details.
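To make the out-of-fold idea concrete, here is a minimal sketch under my reading of the description above; it is not the team's actual code, and the model choice and variable names are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def out_of_fold_preds(X, y, n_splits=5, seed=42):
    """Out-of-fold predictions: each row is predicted by a model
    that never saw it during training."""
    oof = np.zeros(len(y))
    for trn_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X[trn_idx], y[trn_idx])
        oof[val_idx] = model.predict(X[val_idx])
    return oof

# X: feature matrix; y_clarity, y_conciseness: the two labels.
# Use out-of-fold clarity predictions as an extra feature for the
# conciseness model, exploiting the correlation between the two labels.
# oof_clarity = out_of_fold_preds(X, y_clarity)
# X_conciseness = np.hstack([X, oof_clarity[:, None]])
```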

Winner’s Solution at Porto Seguro’s Safe Driver Prediction Competition

The Porto Seguro Safe Driver Prediction competition at Kaggle finished 2 days ago. 5,170 teams with 5,798 people competed for 2 months to predict whether a driver will file an insurance claim next year, using anonymized data.

Michael Jahrer, Netflix Grand Prize winner and Kaggle Grandmaster, took the lead from the beginning and finished #1. He graciously shared his solution right after the competition. Let’s check out his secret sauce. (This was initially posted on the Kaggle forum and reposted here with minor format changes with permission from him.)


Thanks to Porto Seguro for providing us with such a nice, leakage-free, time-free, and statistically correct dataset.

A nice playground to test the performance of everything. This competition was statistically similar to Otto: a larger test set than training set, anonymized data, but it differed in the details.

I wanna dive straight into the solution.

It's a blend of 6 models: 1x lightgbm, 5x nn. All on the same features; I just removed the *calc features and added 1-hot encodings of the *cat features. All neural nets are trained on denoising autoencoder hidden activations; they did a great job of learning a better representation of the numeric data. The lightgbm runs on the raw data. Nonlinear stacking failed; simple averaging works best (all weights=1).

That's the final 0.2965 solution. 2 single models would have been enough to win (#1 + #2 gave me 0.29502 on private).

The complete list of models in the final blend was shared as a screenshot in the original post (the font is a bit small, so you may need to zoom in).

The difference to my private .2969 score is that I added bagged versions (nBag=32) of the 6 models above, all with weight=1, and Igor's 287 script with weight=0.05. Not really worth the effort for a .2965 -> .2969 gain, huh!? I selected these 2 blends at the end.

Feature Engineering

I dislike this part the most; my creativity is too low for an average competition's lifetime, and luck also plays a huge role here. That's why I like representation learning; it's also a step towards AI.

Basically I removed the *calc features and added 1-hot encodings of the *cat features. That's all I've done. No missing value replacement or anything like that. This is feature set "f0" in the table. It ends up with exactly 221 dense features. With single-precision floats that's 1.3GB of RAM (1e-9 * 4 * 221 * (595212 + 892816)).

Thanks to the public kernels (e.g. "wheel of fortune") that suggested removing the *calc features; I'm too blind and probably would not have figured this out by myself, since I never remove features.
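A minimal pandas sketch of this preprocessing, assuming the competition's column naming (*_calc, *_cat) and hypothetical file names:

```python
import pandas as pd

# Load train and test; "target" exists only in train.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = pd.concat([train.drop(columns=["target"]), test], ignore_index=True)

# Drop the *calc features and one-hot encode the *cat features.
calc_cols = [c for c in features.columns if "_calc" in c]
cat_cols = [c for c in features.columns if c.endswith("_cat")]
features = features.drop(columns=calc_cols)
features = pd.get_dummies(features, columns=cat_cols)
```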

Local Validation

5-fold CV as usual. Fixed seed, no stratification. Each model has its own random seed in CV (weight init in nn, data_random_seed in lightgbm). Test predictions are arithmetic averages of all fold models. Just standard, as I would use for any other task; see the sketch below. Somebody wrote about bagging and its improvements, so I spent a week re-training all my models in a 32-bag setup (sampling with replacement). The score only improved a little.
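As a sketch, this validation scheme looks roughly as follows with scikit-learn and the LightGBM Python interface (the parameters and settings here are placeholders, not the ones used in the solution):

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold

def cv_train_predict(X, y, X_test, params, n_splits=5, seed=0):
    """5-fold CV: train one model per fold, average their test predictions."""
    test_preds = np.zeros(len(X_test))
    oof = np.zeros(len(y))
    for trn_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = lgb.train(params, lgb.Dataset(X[trn_idx], y[trn_idx]))
        oof[val_idx] = model.predict(X[val_idx])
        test_preds += model.predict(X_test) / n_splits  # arithmetic average
    return oof, test_preds

# params = {"objective": "binary", ...}  # tuned on CV
```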

Normalization

Input normalization for gradient-based models such as neural nets is critical. For lightgbm/xgb it does not matter. The best method I have found, which works straight out of the box, is "RankGauss". It's based on a rank transformation. The first step is to assign a linspace from 0..1 to the sorted features, then apply the inverse error function ErfInv to shape them like a Gaussian, then subtract the mean. Binary features (e.g. the 1-hot ones) are not touched by this transformation. This usually works much better than a standard mean/std scaler or min/max scaling.
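Here is a minimal NumPy/SciPy sketch of RankGauss as I read the description above; the exact handling of the interval endpoints and ties is an assumption:

```python
import numpy as np
from scipy.special import erfinv

def rank_gauss(x):
    # rank-transform: position of each value in the sorted order
    ranks = np.argsort(np.argsort(x)).astype(np.float64)
    # spread ranks evenly over the open interval (-1, 1)
    scaled = ranks / (len(x) - 1) * 2 - 1
    scaled = np.clip(scaled, -1 + 1e-6, 1 - 1e-6)  # avoid the erfinv poles
    gauss = erfinv(scaled)       # shape the values like a Gaussian
    return gauss - gauss.mean()  # subtract the mean

# Apply column-wise to numeric features; binary/one-hot columns are skipped.
```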

Unsupervised Learning

Denoising autoencoders (DAE) are a nice way to find a better representation of the numeric data for later supervised learning with neural nets. One can use the train+test features to build the DAE; the larger the test set, the better 🙂 An autoencoder tries to reconstruct its input features, so features = targets. Linear output layer, minimize MSE. A denoising autoencoder tries to reconstruct from a noisy version of the features: it has to find a representation of the data that allows it to reconstruct the clean one.

With modern GPUs we can throw a lot of computing power at this task, touching peak floating-point performance with huge layers. Sometimes I saw over 300W power consumption when checking nvidia-smi.

So why manually construct 2-, 3-, and 4-way interactions, use target encoding, search for count features, or impute features, when a model can find something similar by itself?

The critical part here is to invent the noise. In tabular datasets we cannot just flip, rotate, or shear like people do with images. Adding Gaussian or uniform additive/multiplicative noise is not optimal, since features have different scales or discrete sets of values where such noise just doesn't make sense. I found a noise schema called "swap noise": I sample from the feature itself with a certain probability, "inputSwapNoise" in the table above. 0.15 means 15% of a row's features are replaced by values from another row.
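A minimal NumPy sketch of swap noise as described above (the vectorized indexing is my own formulation):

```python
import numpy as np

def swap_noise(X, p=0.15, seed=0):
    """Replace each cell, with probability p, by the value of the
    same column taken from a random other row ("swap noise")."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # donor[i, j] = X[random_row, j]
    donor = X[rng.integers(0, n, size=(n, d)), np.arange(d)]
    mask = rng.random((n, d)) < p  # p=0.15 -> ~15% of cells swapped
    return np.where(mask, donor, X)
```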

I used two different topologies. First, deep stack, where the new features are the activations of all hidden layers. Second, bottleneck, where the activations of one middle layer are grabbed as the new dataset. This DAE step usually blows the input dimensionality up to the 1k..10k range.
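For illustration, here is a minimal Keras sketch of the "deep stack" DAE idea, assuming the ingredients above (relu hidden layers, linear output, MSE); the layer sizes and training settings are illustrative, not the ones from the winning models:

```python
import numpy as np
from keras.layers import Dense, Input, concatenate
from keras.models import Model

def build_dae(n_features, hidden=1500):
    inp = Input(shape=(n_features,))
    h1 = Dense(hidden, activation="relu")(inp)
    h2 = Dense(hidden, activation="relu")(h1)
    h3 = Dense(hidden, activation="relu")(h2)
    out = Dense(n_features, activation="linear")(h3)  # linear output, MSE loss
    dae = Model(inp, out)
    dae.compile(optimizer="sgd", loss="mse")
    # "deep stack": concatenate all hidden activations as the new features
    encoder = Model(inp, concatenate([h1, h2, h3]))
    return dae, encoder

# X_all = np.vstack([X_train, X_test])        # unsupervised: use train+test
# dae, encoder = build_dae(X_all.shape[1])
# dae.fit(swap_noise(X_all), X_all, epochs=1000, batch_size=128)
# new_features = encoder.predict(X_all)       # 3 x 1500 = 4500 dims
```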

Learning with Train+Test Features Unsupervised

You might think I am cheating by using the test features for learning too, so I ran an experiment to check the effectiveness of unsupervised learning without the test features. For reference I took model #2: public 0.28970, private 0.29298. With exactly the same params it ends up with a slightly weaker CV gini of 0.2890: public 0.28508, private 0.29235. The private score is similar, the public score is worse. So not a complete breakdown, as expected. Btw, the total scoring time of the test set with this "clean" model is 80s.

Other Unsupervised Models

Yes, I tried GANs (generative adversarial networks) here. No success. Since NIPS 2016 I have been able to code GANs by myself. A brilliant idea. Generated MNIST digits looked fine; CIFAR images, not so much.

For the generator and discriminator I used MLPs. I think they have a fundamental problem in generating both numeric and categorical data. The discriminator won nearly all the time in my setups. I tried various tricks like truncating the generator output, clipping to known values, many architectures, learning params, noise vector lengths, dropout, leakyRelu, etc. Basically I used activations from hidden layers of the discriminator as the new dataset. At the end they were low 0.28x on CV, too low to contribute to the blend. Haven't tried hard enough.

Another idea that came to my mind late was a min/max game, like in a GAN, to generate good noise samples. It's critical to generate good noise for a DAE. I'm thinking of a generator with feature+noise vector as input that maximizes the distance to the original sample, while the autoencoder (taking its input from the generator) tries to reconstruct the sample... maybe more in another competition.

Neural Nets

Feedforward nets trained with backprop, accelerated by minibatch gradient updates. This is what everybody does here. I use vanilla SGD (no momentum or adam), a large number of epochs, and learning rate decay after every epoch. Hidden layers have 'r' = relu activation, the output is sigmoid, trained to minimize logloss. In the bottleneck autoencoder the middle layer activation is 'l' = linear. When dropout != 0, all hidden layers have dropout. Input dropout often improves generalization when training on DAE features. A slight L2 regularization also helps in CV here. A hidden layer size of 1000 works out of the box for most supervised tasks. All trained on GPU with 4-byte floats.
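A hedged Keras sketch of such a supervised net; the actual models were written in C++/CUDA, so the sizes, rates, and schedule here are illustrative only (and Keras's decay shrinks the learning rate per update rather than per epoch):

```python
from keras.layers import Dense, Dropout, Input
from keras.models import Model
from keras.optimizers import SGD
from keras.regularizers import l2

def build_net(n_features, hidden=1000, dropout=0.5, weight_decay=1e-5):
    inp = Input(shape=(n_features,))
    x = Dropout(dropout)(inp)  # input dropout on the DAE features
    x = Dense(hidden, activation="relu", kernel_regularizer=l2(weight_decay))(x)
    x = Dropout(dropout)(x)
    x = Dense(hidden, activation="relu", kernel_regularizer=l2(weight_decay))(x)
    out = Dense(1, activation="sigmoid")(x)  # sigmoid output
    model = Model(inp, out)
    # vanilla SGD (no momentum), trained to minimize logloss
    model.compile(optimizer=SGD(lr=0.01, decay=1e-4), loss="binary_crossentropy")
    return model
```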

LightGBM

A nice library, very fast, sometimes better than xgboost in terms of accuracy. One model in the ensemble. I tuned the params on CV.

XGBoost

I didn't find a setup where xgboost added anything to the blend, so it was not used here in Porto.

Blending

Nonlinear things failed. That's the biggest difference to the Otto competition, where xgb and nn were great stackers. Every competition has its own pitfalls. Whatever. For me even tuning the linear blending weights failed, so I stuck with all w=1.
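With all weights equal to 1, the blend is then just an arithmetic mean of the model predictions; a trivial sketch:

```python
import numpy as np

# preds: shape (6, n_test), one row of test predictions per model
def blend(preds):
    return np.mean(preds, axis=0)  # equal weights, w = 1 for every model
```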

Software Used

Everything I've done here end-to-end was written in C++/CUDA by myself. Of course I used the lightgbm and xgboost C interfaces and a couple of acceleration libs like cuBLAS. I'm a n00b in python and R, where you guys are experts. My approach is still old school and low level; I want to understand what is going on from top to bottom. At some point I'll learn them, but currently there are just too many python/R packages that bust my head, so I stick with loop-based code.

Hardware Used

All the models above can be run on a 32GB RAM machine with clever data swapping. Besides that, I use a GTX 1080 Ti card for all the neural net stuff.

Total Time Spent

Some exaflops and kilowatts of GPU power were wasted on this competition for sure. Models ran longer than I spent writing code. Reading all the forum posts also cost a remarkable amount of time, but here my intention was not to miss anything. At the end it was all worth it. Big hands to all the great writers here like Tilli, CPMP, .. really great job, guys.

What Did Not Work

Upsampling, deeper autoencoders, wider autoencoders, KNNs, KNN on DAE features, nonlinear stacking, some feature engineering (yes, I tried this too), PCA, bagging, factor models (but others had success with them), xgboost (others did well with it), and much, much more..

That's it.

New Editor – Tam T. Nguyen, Kaggle Grandmaster

I am happy to announce that we have a new editor, Tam T. Nguyen, joining Kaggler.com.

Tam is a Competition Grandmaster at Kaggle. He won the 1st prizes at KDD Cup 2015, IJCAI-15 repeat buyer competition, and Springleaf marketing response competition.

Currently, he is a Postdoctoral Research Fellow at Ryerson University in Toronto, Canada. Prior to that, he was a Data Analytics Project Lead at I2R A*STAR. He earned his Ph.D. in Computer Science from NTU Singapore. He's originally from Vietnam.

Please subscribe to us at Kaggler.com, Facebook, and Twitter.

Keras Backend Benchmark: Theano vs TensorFlow vs CNTK

Inspired by Max Woolf's benchmark, I compared the performance of 3 different Keras backends (Theano, TensorFlow, and CNTK) on 4 different GPUs (K80, M60, Titan X, and 1080 Ti) across various neural network tasks.

For the performance of TensorFlow and CNTK on the K80, the numbers reported in Max Woolf's benchmark are used.
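For reference, the Keras backend can be switched via the KERAS_BACKEND environment variable (or the ~/.keras/keras.json config file); a minimal example:

```python
import os

# The backend must be chosen before keras is imported; valid values
# for this era of Keras are "theano", "tensorflow", and "cntk".
os.environ["KERAS_BACKEND"] = "cntk"

import keras                    # prints e.g. "Using CNTK backend."
print(keras.backend.backend())  # -> "cntk"
```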

Conclusion

  • The accuracies of the Theano, TensorFlow, and CNTK backends are similar across all benchmark tests, while speeds vary a lot.
    • Theano is significantly (up to 50 times) slower than TensorFlow and CNTK.
    • Between TensorFlow and CNTK, CNTK is a lot (about 2 to 4 times) faster than TensorFlow for LSTMs (Bidirectional LSTM on IMDb Data and Text Generation via LSTM), while speeds for other types of neural networks are close to each other.
  • Among the K80, M60, Titan X, and 1080 Ti GPUs:
    • The 1080 Ti is the fastest.
    • The K80 is the slowest.
    • The M60 is faster than the K80 and comparable to the Titan X and 1080 Ti.
    • Theano is significantly (up to 14 times) faster on the 1080 Ti than on the Titan X, while the improvements for TensorFlow and CNTK are moderate.

Detailed results are available at https://github.com/szilard/benchm-dl/blob/master/keras_backend.md

Building Your Own Kaggle Machine

In 2014, I shared the specifications of a 6-core, 64GB RAM desktop system that I purchased for around $2,000. Since then, I added an NVIDIA Titan X to it for deep learning at an additional $1,000, and it has served me well.

However, as other team members started joining me on data science competitions and deep learning competitions got more popular, my team decided to build a more powerful desktop system.

The specifications of the new system that we built are as follows:

  • CPU: Xeon 2.4GHz 14-Core
  • RAM: 128GB DDR4-2400
  • GPU: 4x NVIDIA 1080 Ti 11GB
  • SSD: 960GB
  • HDD: 4TB 7200RPM
  • PSU: 1600W 80+ Titanium certified

The total cost including tax and shipping was around $7,000. Depending on your budget, you can go down to 2 1080 Ti GPU cards instead of 4 (-$1,520), or 64GB of RAM instead of 128GB (-$399), and still have a decent system.

You can find the full part lists here.

Additional Resources

Winning Data Science Competitions – Latest Slides


This year I had several occasions to give my “Winning Data Science Competitions” talk – at Microsoft, KSEA-SWC 2017, USC Applied Statistics Club, Spark SC, and Whisper.

I am grateful for all these opportunities to share what I enjoy with the data scientist community.

I truly believe that working on competitions on a regular basis can make us better data scientists. I hope my talk and slides help other data scientists.

My talk is outlined as follows:

  1. Why compete
    1. For fun
    2. For experience
    3. For learning
    4. For networking
  2. Data science competition intro
    1. Competitions
    2. Structure
    3. Kaggle
  3. Misconceptions of data science competitions
    1. No ETL?
    2. No EDA?
    3. Not worth it?
    4. Not for production?
  4. Best practices
    1. Feature engineering
    2. Diverse algorithms
    3. Cross validation
    4. Ensemble
    5. Collaboration
  5. Personal tips
  6. Additional resources

You can find the latest slides here: