Kaggler 0.8.0 Release

Kaggler 0.8.0 has been released. It adds model.BaseAutoML and model.AutoLGB for automatic feature selection and hyper-parameter tuning using hyperopt.

The implementation is based on the solution of the team AvengersEnsmbl at the KDD Cup 2019 AutoML track. Details and the winners’ solutions are available on the competition website.

model.BaseAutoML is the base class, from which you can inherit to implement your own auto ML class. model.AutoLGB is the auto ML class for LightGBM. It’s simple to use as follows:

from kaggler.model import AutoLGB

# tune() performs automatic feature selection and hyper-parameter tuning;
# fit() then trains LightGBM with the selected features and parameters.
model = AutoLGB(objective='binary', metric='auc')
model.tune(X_trn, y_trn)
model.fit(X_trn, y_trn)
p = model.predict(X_tst)

Other updates include:

For more details, please check out the documentation and repository.

Any comments or contributions will be appreciated.

Layoff – My story

Update: I was not affected by the layoff in the news. I’m sharing the story of one I went through a while ago. I’m not looking for a new opportunity right now. Thanks for asking. 🙂

Screenshot captured from https://www.nytimes.com/2019/07/29/technology/uber-job-cuts.html

It was one of the less ideal days at work.

While getting ready for work, I received an email announcing the layoff of over 400 people in marketing.

At work, a series of follow-up meetings were scheduled. I talked to my team, which wasn’t affected, to address any concerns.

Later in the afternoon, farewell emails began to arrive. I replied to some, wished them the best, and connected with them on LinkedIn.

It surely hurts when it happens, whether to me or to my colleagues.


I was laid off in 2001, just 9 months after I started my first full-time job, when the company shut down my department. I was confused, angry, and felt like a failure. Then I cried.

Looking back, I am thankful for the experience.

It opened up better opportunities. With the previous work experience, the job search was much easier. I had a better understanding of what I could offer and which companies needed it. In the end, I landed at a company that was a better fit.

It helped me have the right mindset at work. At the new company, I grew so much and so fast because I was humbled and grateful for the new opportunity every day.

It also gave me a better perspective on my career. Since then, no matter how much I like my job, I keep reminding myself that “this won’t be my last company.” Employment can change, but relationships, skills, and experience will last. It’s wise to focus on what will last.


If it happens again, it will still hurt, but I won’t feel the same confusion, anger or failure. I won’t cry. 🙂

I hope all my colleagues keep their heads up, get stronger and wiser, and find better opportunities.

Best wishes for their career and families.

Winner’s Solution at Porto Seguro’s Safe Driver Prediction Competition

The Porto Seguro Safe Driver Prediction competition at Kaggle finished two days ago. 5,170 teams with 5,798 people competed for two months to predict, from anonymized data, whether a driver would file an insurance claim the next year.

Michael Jahrer, Netflix Grand Prize winner and Kaggle Grandmaster, took the lead from the beginning and finished #1. He graciously shared his solution right after the competition. Let’s check out his secret sauce. (This was originally posted on the Kaggle forum and is reposted here, with minor formatting changes, with his permission.)


Thanks to Porto Seguro for providing us with such a nice, leakage-free, time-free, and statistically correct dataset.

A nice playground to test the performance of everything. This competition was similar to Otto in some respects, like a larger test set than train set and anonymous data, but differed in the details.

I want to dive straight into the solution.

It’s a blend of 6 models: 1x LightGBM and 5x neural nets. All use the same features; I just removed the *calc features and added one-hot encoding for the *cat features. All neural nets are trained on denoising-autoencoder hidden activations, which did a great job of learning a better representation of the numeric data. LightGBM is trained on the raw data. Nonlinear stacking failed; simple averaging worked best (all weights = 1).

That’s the final 0.2965 solution. Two single models would have been enough to win (#1 + #2 give me 0.29502 on private).

The complete list of models in the final blend, along with their parameters, is shown in a table in the original post.

The difference to my private 0.2969 score is that I added bagged versions (nBag=32) of the 6 models mentioned above, all with weight=1, plus Igor’s 287 script with weight=0.05. Not really worth the effort for a 0.2965 -> 0.2969 gain, huh!? I selected these 2 blends at the end.

Feature Engineering

I dislike this part the most; my creativity is too low for an average competition lifetime, and luck also plays a huge role here. That’s why I like representation learning; it’s also a step towards AI.

Basically I removed the *calc features and added one-hot encoding to the *cat features. That’s all I did. No missing-value replacement or anything like that. This is feature set “f0” in the table. It ends up as exactly 221 dense features. With single-precision floats that is about 1.3 GB of RAM (1e-9 * 4 * 221 * (595212 + 892816)).

Thanks to the public kernels (e.g. the wheel-of-fortune one) that suggested removing the *calc features; I’m too blind and probably would not have figured this out by myself. I never remove features.
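
For illustration only, a minimal pandas sketch of this feature preparation could look like the following (it assumes the competition’s ps_*_calc / ps_*_cat column naming; the author’s own pipeline was written in C++):

import pandas as pd

def prepare(df):
    # drop the *calc features
    df = df.drop(columns=[c for c in df.columns if '_calc' in c])
    # one-hot encode the *cat features
    cat_cols = [c for c in df.columns if c.endswith('_cat')]
    return pd.get_dummies(df, columns=cat_cols)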

Local Validation

5-fold CV as usual. Fixed seed. No stratification. Each model has its own random seed in CV (weight initialization in the neural nets, data_random_seed in LightGBM). Test predictions are the arithmetic average of all fold models. Just the standard setup I would use for any other task. Somebody wrote about bagging and its improvements, so I spent a week re-training all my models in a 32-bag setup (sampling with replacement). The score only improved a little.
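
A rough sketch of this validation scheme with scikit-learn and the LightGBM Python API (the function name and params dict are placeholders; the author’s code is C++):

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold

def cv_predict(X, y, X_test, params, n_splits=5, seed=42):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)  # fixed seed, no stratification
    oof = np.zeros(len(X))
    p_test = np.zeros(len(X_test))
    for i_trn, i_val in kf.split(X):
        model = lgb.train(params,
                          lgb.Dataset(X[i_trn], y[i_trn]),
                          valid_sets=[lgb.Dataset(X[i_val], y[i_val])])
        oof[i_val] = model.predict(X[i_val])
        p_test += model.predict(X_test) / n_splits  # arithmetic average of fold models
    return oof, p_test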

Normalization

Input normalization for gradient-based models such as neural nets is critical. For LightGBM/XGBoost it does not matter. The best method I have found, and one that works straight out of the box, is “RankGauss”. It is based on a rank transformation. The first step is to assign a linspace from 0..1 to the sorted features, then apply the inverse error function ErfInv to shape them like a Gaussian, then subtract the mean. Binary features (e.g. the one-hot ones) are not touched by this transformation. This usually works much better than a standard mean/std scaler or min/max scaling.
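
A minimal NumPy/SciPy sketch of the RankGauss transformation as described above (my reading, not the author’s exact implementation):

import numpy as np
from scipy.special import erfinv

def rank_gauss(x, eps=1e-6):
    ranks = np.argsort(np.argsort(x))          # rank of each value: 0 .. n-1
    u = ranks / (len(x) - 1)                   # linspace from 0 to 1 over the sorted values
    u = np.clip(2 * u - 1, -1 + eps, 1 - eps)  # map to (-1, 1), avoiding +/-infinity
    g = erfinv(u)                              # shape the values like a Gaussian
    return g - g.mean()                        # subtract the mean

# Apply per continuous column; binary / one-hot columns are left untouched.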

Unsupervised Learning

Denoising autoencoders (DAE) are a nice way to find a better representation of the numeric data for later supervised learning with neural nets. One can use the train + test features to build the DAE. The larger the test set, the better. 🙂 An autoencoder tries to reconstruct its input features, so features = targets, with a linear output layer, minimizing MSE. A denoising autoencoder is fed a noisy version of the features and tries to reconstruct the clean one; to do that, it has to find a representation of the data that makes the reconstruction possible.

With modern GPUs we can throw a lot of computing power at this task, hitting peak floating-point performance with huge layers. Sometimes I saw over 300 W of power consumption when checking nvidia-smi.

So why manually construct 2-, 3-, and 4-way interactions, use target encoding, search for count features, or impute features, when a model can find something similar by itself?

The critical part here is to invent the noise. In tabular datasets we cannot just flip, rotate, or shear the way people do with images. Adding Gaussian or uniform additive/multiplicative noise is not optimal, since features have different scales or a discrete set of values where such noise just doesn’t make sense. I found a noise schema called “swap noise”. Here I sample from the feature itself with a certain probability (“inputSwapNoise” in the table above). 0.15 means 15% of a row’s features are replaced by values from other rows.
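
A minimal NumPy sketch of swap noise as described above (my reading of the scheme, not the author’s code):

import numpy as np

def swap_noise(X, p=0.15, seed=0):
    # Replace each cell, with probability p, by the same column's value from a random other row.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mask = rng.random((n, d)) < p                  # which cells to corrupt
    donor_rows = rng.integers(0, n, size=(n, d))   # a random donor row for every cell
    X_noisy = X.copy()
    X_noisy[mask] = X[donor_rows[mask], np.nonzero(mask)[1]]
    return X_noisy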

I used two different topologies. First, “deep stack”, where the new features are the activations of all hidden layers. Second, “bottleneck”, where the activations of one middle layer are grabbed as the new dataset. This DAE step usually blows the input dimensionality up to the 1k..10k range.
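
As a rough illustration, a bottleneck DAE along these lines could be sketched in Keras as follows (hypothetical layer sizes; the author’s networks were custom C++/CUDA, and swap_noise refers to the sketch above):

from tensorflow import keras

n_features = 221                                          # the dense feature set described earlier
inp = keras.Input(shape=(n_features,))
h1 = keras.layers.Dense(1500, activation='relu')(inp)
mid = keras.layers.Dense(1500, activation='linear')(h1)   # linear middle (bottleneck) layer
h2 = keras.layers.Dense(1500, activation='relu')(mid)
out = keras.layers.Dense(n_features, activation='linear')(h2)

dae = keras.Model(inp, out)
dae.compile(optimizer='sgd', loss='mse')
# dae.fit(swap_noise(X_all), X_all, ...)                   # noisy inputs, clean targets
encoder = keras.Model(inp, mid)                            # encoder.predict(X) gives the new features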

Learning with Train+Test Features Unsupervised

You might think I am cheating by using the test features for learning as well. So I ran an experiment to check the effectiveness of unsupervised learning without the test features. For reference I took model #2 (public: 0.28970, private: 0.29298). With exactly the same params it ends up with a slightly weaker CV Gini of 0.2890 (public: 0.28508, private: 0.29235). The private score is similar, the public score is worse. So not a complete breakdown, as expected. Btw, the total scoring time on the test set with this “clean” model is 80 s.

Other Unsupervised Models

Yes, I tried GANs (generative adversarial networks) here. No success. Since NIPS 2016 I have been able to code GANs myself. A brilliant idea. Generated MNIST digits looked fine; CIFAR images, not so much.

For the generator and discriminator I used MLPs. I think they have a fundamental problem in generating both numeric and categorical data. The discriminator won nearly all the time in my setups. I tried various tricks like truncating the generator output, clipping to known values, many architectures, learning params, noise-vector length, dropout, leakyReLU, etc. Basically I used the activations from the hidden layers of the discriminator as a new dataset. In the end they were in the low 0.28x range on CV, too low to contribute to the blend. Haven’t tried hard enough.

Another idea that came to mind late was a min/max game, like in a GAN, to generate good noise samples. It is critical to generate good noise for a DAE. I’m thinking of a generator with feature + noise vector as input that maximizes the distance to the original sample, while the autoencoder (taking the generator’s output as input) tries to reconstruct the sample… maybe more in another competition.

Neural Nets

Feedforward nets trained with backprop, accelerated by minibatch gradient updates. This is what everybody does here. I use vanilla SGD (no momentum or Adam), a large number of epochs, and learning-rate decay after every epoch. Hidden layers have ‘r’ = ReLU activation; the output is sigmoid. Trained to minimize logloss. In the bottleneck autoencoder the middle layer activation is ‘l’ = linear. When dropout != 0 it means all hidden layers have dropout. Input dropout often improves generalization when training on DAE features. Here a slight L2 regularization also helps in CV. A hidden layer size of 1000 works out of the box for most supervised tasks. All trained on a GPU with 4-byte floats.
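
A supervised net along those lines might look roughly like this Keras sketch (hypothetical sizes, dropout rates, and regularization strengths; the author’s implementation is custom C++/CUDA):

from tensorflow import keras

n_dae_features = 4500   # hypothetical; DAE features land in the 1k..10k range
lr = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.95)   # decay roughly once per epoch (assumed step count)
model = keras.Sequential([
    keras.Input(shape=(n_dae_features,)),
    keras.layers.Dropout(0.1),                             # input dropout on the DAE features
    keras.layers.Dense(1000, activation='relu',
                       kernel_regularizer=keras.regularizers.l2(1e-5)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1000, activation='relu',
                       kernel_regularizer=keras.regularizers.l2(1e-5)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation='sigmoid'),           # sigmoid output, logloss objective
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=lr),      # vanilla SGD, no momentum
              loss='binary_crossentropy')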

LightGBM

Nice library, very fast, sometimes better than XGBoost in terms of accuracy. There is one LightGBM model in the ensemble. I tuned its params via CV.

XGBoost

I didn’t find a setup where XGBoost added something to the blend, so it was not used here in Porto.

Blending

Nonlinear things failed. That’s the biggest difference to the Otto competition, where xgb and nn were great stackers. Every competition has its own pitfalls. Whatever. For me even tuning the linear blending weights failed, so I stuck with all w=1.

Software Used

Everything I’ve done here end-to-end was written in C++/CUDA by myself. Of course I used the lightgbm and xgboost C interfaces and a couple of acceleration libs like cuBLAS. I’m a n00b in Python and R, where you guys are experts. My approach is still old school and low level; I want to understand what is going on from top to bottom. At some point I’ll learn them, but currently there are just too many Python/R packages that bust my head, so I’m sticking with loop-based code.

Hardware Used

All the models above can be run on a 32GB RAM machine with clever data swapping. Next to that, I use a GTX 1080 Ti card for all the neural net stuff.

Total Time Spent

Some exaflops and kilowatt-hours of GPU power were surely spent on this competition. The models run longer than I spend writing code. Reading all the forum posts also takes a remarkable amount of time, but my intention here was not to miss anything. In the end it was all worth it. A big hand to all the great writers here like Tilli, CPMP, .. really great job, guys.

What Did Not Work

Upsampling, deeper autoencoders, wider autoencoders, KNNs, KNN on DAE features, nonlinear stacking, some feature engineering (yes, I tried this too), PCA, bagging, factor models (but others had success with them), XGBoost (others did well with it), and much, much more…

That’s it.

Quora: How many employed data scientists are able to solve problems from online competitions such as Kaggle’s?

Read Jeong-Yoon Lee's answer to How many employed data scientists are able to solve problems from online competitions such as Kaggle's? on Quora

New Editor – Tam T. Nguyen, Kaggle Grandmaster

I am happy to announce that we have a new editor, Tam T. Nguyen, joining Kaggler.com.

Tam is a Competition Grandmaster at Kaggle. He won first prize at KDD Cup 2015, the IJCAI-15 repeat buyer competition, and the Springleaf marketing response competition.

Currently, he is a Postdoctoral Research Fellow at Ryerson University in Toronto, Canada. Prior to that, he was a Data Analytics Project Lead at I2R, A*STAR. He earned his Ph.D. in Computer Science from NTU Singapore. He’s originally from Vietnam.

Please subscribe to us at Kaggler.com, Facebook, and Twitter.

Keras Backend Benchmark: Theano vs TensorFlow vs CNTK

Inspired by Max Woolf’s benchmark, I compared the performance of 3 different Keras backends (Theano, TensorFlow, and CNTK) on 4 different GPUs (K80, M60, Titan X, and 1080 Ti) across various neural network tasks.

For the performance of TensorFlow and CNTK with the K80, the numbers reported in Max Woolf’s benchmark are used.

Conclusion

  • The accuracies of Theano, TensorFlow and CNTK backends are similar across all benchmark tests, while speeds vary a lot.
    • Theano is significantly (up to 50 times) slower than TensorFlow and CNTK.
    • Between TensorFlow and CNTK, CNTK is a lot (about 2 to 4 times) faster than TensorFlow for LSTM tasks (Bidirectional LSTM on IMDb data and text generation via LSTM), while speeds for other types of neural networks are close to each other.
  • Among K80, M60, Titan X and 1080 Ti GPUs:
    • 1080 Ti is the fastest.
    • K80 is the slowest.
    • M60 is faster than K80 and comparable to Titan X and 1080 Ti.
    • Theano is significantly (up to 14 times) faster on 1080 Ti than on Titan X, while the improvements for TensorFlow and CNTK are moderate.

Detailed results are available at https://github.com/szilard/benchm-dl/blob/master/keras_backend.md

Building Your Own Kaggle Machine

In 2014, I shared the specifications of a 6-core, 64GB RAM desktop system that I purchased for around $2,000. Since then, I have added an NVIDIA Titan X to it for deep learning for an additional $1,000, and it has served me well.

However, as other team members started joining me in data science competitions and deep learning competitions became more popular, my team decided to build a more powerful desktop system.

The specifications of the new system that we built are as follows:

  • CPU: Xeon 2.4GHz 14-Core
  • RAM: 128GB DDR4-2400
  • GPU: 4x NVIDIA 1080 Ti 11GB
  • SSD: 960GB
  • HDD: 4TB 7200RPM
  • PSU: 1600W 80+ Titanium certified

The total cost including tax and shipping was around $7,000. Depending on your budget, you can go down to 2 1080 Ti GPU cards instead of 4 (-$1,520), or 64GB of RAM instead of 128GB (-$399), and still have a decent system.

You can find the full part lists here.

Additional Resources

Winning Data Science Competitions – Latest Slides

 

This year I had several occasions to give my “Winning Data Science Competitions” talk – at Microsoft, KSEA-SWC 2017, USC Applied Statistics Club, Spark SC, and Whisper.

I am grateful for all these opportunities to share what I enjoy with the data scientist community.

I truly believe that working on competitions on a regular basis can make us better data scientists. I hope my talk and slides help other data scientists.

My talk is outlined as follows:

  1. Why compete
    1. For fun
    2. For experience
    3. For learning
    4. For networking
  2. Data science competition intro
    1. Competitions
    2. Structure
    3. Kaggle
  3. Misconceptions of data science competitions
    1. No ETL?
    2. No EDA?
    3. Not worth it?
    4. Not for production?
  4. Best practices
    1. Feature engineering
    2. Diverse algorithms
    3. Cross validation
    4. Ensemble
    5. Collaboration
  5. Personal tips
  6. Additional resources

You can find the latest slides here: