Second Place Solution at CIKM AnalytiCup 2017 – Lazada Product Title Quality Challenge

Lazada Product Title Quality Challenge

In this challenge, the participants were provided with a set of product titles, descriptions, and attributes, together with the associated title quality scores (clarity and conciseness) as labeled by Lazada's internal QC team. The task was to build a product title quality model that can automatically grade the clarity and conciseness of a product title.

Team members

Tam T. Nguyen, Kaggle Grandmaster, Postdoctoral Research Fellow at Ryerson University

Hossein Fani, PhD Student at University of New Brunswick

Ebrahim Bagheri, Associate Professor at Ryerson University

Gilberto Titericz, Kaggle Grandmaster (#1 Kaggler), Data Scientist at Airbnb

Solution Overview

We present our second-place approach for the Lazada Product Title Quality Challenge at CIKM AnalytiCup 2017, where the data set was annotated for conciseness and clarity by Lazada's QC team. The participants were asked to build a machine learning model to predict the conciseness and clarity of an SKU based on its product title, short description, product categories, price, country, and product type. As sellers could freely enter anything for the title and description, these fields might contain typos or misspelled words. Moreover, many annotators labelled the data, so there was bound to be disagreement on the true label of an SKU. This makes the problem difficult to solve using only traditional natural language processing and machine learning techniques. In our proposed approach, we adapted text mining and machine learning methods that take both feature and label noise into account. Specifically, we used bagging to deal with the label noise, so that no single model is built on the whole training data. Moreover, we reasoned that for each SKU, conciseness and clarity would be annotated by the same QC annotator, which means the two labels should be correlated in a certain manner. Therefore, we extended our bagging approach with out-of-fold leakage to take advantage of this correlation. Our approach achieved root mean squared errors (RMSE) of 0.3294 and 0.2417 on the test data for conciseness and clarity, respectively. You may refer to the paper or the source code for more details.
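
To make the out-of-fold idea concrete, here is a minimal sketch in scikit-learn style on synthetic stand-in data; the model choice and names are illustrative, not our actual pipeline (see the paper and source code for the real details):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))   # stand-in for the title/description features
y_clarity = rng.random(500)      # stand-in for the clarity labels

def oof_predictions(make_model, X, y, n_splits=5, seed=42):
    """Out-of-fold predictions: each row is predicted by a model that
    never saw that row during training, avoiding direct target leakage."""
    oof = np.zeros(len(y))
    for trn, val in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = make_model()
        model.fit(X[trn], y[trn])
        oof[val] = model.predict(X[val])
    return oof

# Feed out-of-fold clarity predictions to the conciseness model to
# exploit the correlation between the two labels.
clarity_oof = oof_predictions(GradientBoostingRegressor, X, y_clarity)
X_conciseness = np.hstack([X, clarity_oof.reshape(-1, 1)])
```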

Winner’s Solution at Porto Seguro’s Safe Driver Prediction Competition

The Porto Seguro Safe Driver Prediction competition at Kaggle finished 2 days ago. 5,170 teams with 5,798 people competed for 2 months to predict, from anonymized data, whether a driver would file an insurance claim in the following year.

Michael Jahrer, Netflix Grand Prize winner and Kaggle Grandmaster, took the lead from the beginning and finished #1. He graciously shared his solution right after the competition. Let’s check out his secret sauce. (This was initially posted on the Kaggle forum and reposted here with minor format changes with permission from him.)


Thanks to Porto Seguro for providing us with such a nice, leakage-free, time-free, and statistically correct dataset.

A nice playground to test the performance of everything. This competition was statistically similar to Otto: a larger test set than train set and anonymized data, but it differed in the details.

I wanna dive straight into the solution.

It’s a blend of 6 models: 1x LightGBM, 5x neural nets. All use the same features; I just removed the *calc features and added 1-hot encoding on the *cat features. All neural nets are trained on denoising autoencoder hidden activations, which did a great job of learning a better representation of the numeric data. The LightGBM model runs on the raw data. Nonlinear stacking failed; simple averaging works best (all weights = 1).

That’s the final 0.2965 solution. Two single models would have been enough to win (#1 + #2 gave me 0.29502 on private).

The complete list of models in the final blend is shown as a table image in the original post (the font is a bit small, so you may need to zoom in).

The difference to my private 0.2969 score is that I added bagged versions (nBag=32) of the 6 models mentioned above, all with weight=1, and Igor’s 287 script with weight=0.05. Not really worth the effort for a 0.2965 -> 0.2969 gain, huh!? I selected these 2 blends at the end.

Feature Engineering

I dislike this part the most; my creativity is too low for an average competition’s lifetime, and luck also plays a huge role here. That’s why I like representation learning; it’s also a step towards AI.

Basically I removed *calc and added 1-hot to the *cat features. That’s all I’ve done. No missing value replacement or anything. This is feature set “f0” in the table. It ends up being exactly 221 dense features. With single-precision floats that’s 1.3GB of RAM (1e-9 * 4 * 221 * (595212 + 892816)).

Thanks to the public kernels (e.g. “wheel of fortune”) that suggested removing the *calc features; I’m too blind and probably would not have figured this out by myself. I never remove features.

Local Validation

5-fold CV as usual. Fixed seed. No stratification. Each model has its own random seed in CV (weight init in the nets, data_random_seed in lightgbm). Test predictions are arithmetic averages of all fold models. Just the standard setup I would use for any other task. Somebody wrote about bagging and its improvements, so I spent a week re-training all my models in a 32-bag setup (sampling with replacement). The score only improved a little.
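
A minimal sketch of this validation scheme, assuming LightGBM’s Python API on synthetic data (the actual models were custom C++/CUDA, as described below); the parameters are illustrative, not the winning settings:

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X, y = rng.normal(size=(2000, 221)), rng.integers(0, 2, 2000)
X_test = rng.normal(size=(500, 221))

params = {'objective': 'binary', 'learning_rate': 0.05, 'verbose': -1}

kf = KFold(n_splits=5, shuffle=True, random_state=1)  # fixed seed, no stratification
test_pred = np.zeros(len(X_test))
for trn, val in kf.split(X):
    booster = lgb.train(params, lgb.Dataset(X[trn], label=y[trn]),
                        num_boost_round=200,
                        valid_sets=[lgb.Dataset(X[val], label=y[val])])
    # test prediction = arithmetic average over the 5 fold models
    test_pred += booster.predict(X_test) / kf.get_n_splits()
```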

Normalization

Input normalization for gradient-based models such as neural nets is critical. For lightgbm/xgb it does not matter. The best method I have found, and one that works straight out of the box, is “RankGauss”. It’s based on a rank transformation. The first step is to assign a linspace from 0..1 to the sorted features, then apply the inverse error function ErfInv to shape them like a Gaussian, then subtract the mean. Binary features (e.g. the 1-hot ones) are not touched by this transformation. This usually works much better than a standard mean/std scaler or min/max scaling.
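
A small NumPy/SciPy sketch of RankGauss as I read the recipe above; the exact rescaling of the ranks into erfinv’s (-1, 1) domain is an assumption about the details:

```python
import numpy as np
from scipy.special import erfinv

def rank_gauss(x):
    """RankGauss: rank-transform a numeric feature, then reshape the
    ranks into a Gaussian with the inverse error function."""
    order = np.argsort(x)
    # evenly spaced ranks strictly inside (0, 1) so erfinv stays finite
    ranks = np.linspace(0, 1, len(x) + 2)[1:-1]
    out = np.empty_like(ranks)
    out[order] = ranks
    out = erfinv(2 * out - 1)   # map (0, 1) -> (-1, 1) -> Gaussian shape
    return out - out.mean()     # subtract the mean

# applied column-wise to numeric features; binary/1-hot columns are skipped
x = np.random.lognormal(size=1000)
x_gauss = rank_gauss(x)
```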

Unsupervised Learning

Denoising autoencoders (DAE) are nice for finding a better representation of the numeric data for the later supervised learning with neural nets. One can use train+test features to build the DAE. The larger the test set, the better 🙂 An autoencoder tries to reconstruct its input features, so features = targets. Linear output layer, minimize MSE. A denoising autoencoder takes a noisy version of the features as input and tries to reconstruct the clean one; it has to find some representation of the data to do this well.

With modern GPUs we can throw a lot of computing power at this task, touching peak floating-point performance with huge layers. Sometimes I saw over 300W power consumption when checking nvidia-smi.

So why manually construct 2-, 3-, and 4-way interactions, use target encoding, search for count features, or impute features, when a model can find something similar by itself?

The critical part here is inventing the noise. In tabular datasets we cannot just flip, rotate, or shear like people do with images. Adding Gaussian or uniform additive/multiplicative noise is not optimal, since features have different scales, or a discrete set of values where such noise just doesn’t make sense. I found a noise schema called “swap noise”. Here I sample from the feature itself with a certain probability, “inputSwapNoise” in the table above. 0.15 means 15% of a row’s values are replaced by values from other rows.
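
A sketch of swap noise in NumPy; the function name and arguments are illustrative:

```python
import numpy as np

def swap_noise(X, p=0.15, rng=None):
    """With probability p, replace each cell with the same column's value
    taken from a random row, so noisy values stay in-distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, m = X.shape
    mask = rng.random((n, m)) < p             # cells to corrupt
    donor = rng.integers(0, n, size=(n, m))   # rows to copy values from
    X_noisy = X.copy()
    X_noisy[mask] = X[donor[mask], np.nonzero(mask)[1]]
    return X_noisy

X = np.arange(20.0).reshape(4, 5)
print(swap_noise(X, p=0.15))
```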

I used two different topologies. First, deep stack, where the new features are the activations of all hidden layers. Second, bottleneck, where the activations of one middle layer are taken as the new dataset. This DAE step usually blows the input dimensionality up to the 1k..10k range.
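
Putting the pieces together, here is a sketch of a deep-stack DAE, assuming TensorFlow’s Keras on synthetic data; the 1500-1500-1500 layout, optimizer, and epoch count are illustrative (the original implementation was custom C++/CUDA), and in practice the swap noise would be re-sampled every epoch:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 221)).astype('float32')  # train+test numeric features

# swap noise as above: 15% of cells replaced by values from random rows
mask = rng.random(X.shape) < 0.15
X_noisy = X.copy()
X_noisy[mask] = X[rng.integers(0, len(X), X.shape)[mask], np.nonzero(mask)[1]]

inp = keras.Input(shape=(X.shape[1],))
h1 = keras.layers.Dense(1500, activation='relu')(inp)
h2 = keras.layers.Dense(1500, activation='relu')(h1)
h3 = keras.layers.Dense(1500, activation='relu')(h2)
out = keras.layers.Dense(X.shape[1], activation='linear')(h3)  # linear output

dae = keras.Model(inp, out)
dae.compile(optimizer='sgd', loss='mse')      # reconstruct the clean features
dae.fit(X_noisy, X, epochs=10, batch_size=128)

# "deep stack": all hidden activations become the new 4500-d feature set
stack = keras.Model(inp, keras.layers.Concatenate()([h1, h2, h3]))
X_dae = stack.predict(X)
```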

Unsupervised Learning with Train+Test Features

You might think I am cheating by using the test features for learning, too. So I ran an experiment to check the effectiveness of unsupervised learning without the test features. For reference I took model #2: public 0.28970, private 0.29298. With exactly the same params it ends up with a slightly weaker CV gini of 0.2890: public 0.28508, private 0.29235. The private score is similar; the public score is worse. So not a complete breakdown, as expected. Btw, the total scoring time of the test set with this “clean” model is 80s.

Other Unsupervised Models

Yes, I tried GANs (generative adversarial networks) here. No success. Since NIPS 2016 I have been able to code GANs by myself. A brilliant idea. Generated MNIST digits looked fine; CIFAR images, not so much.

For the generator and discriminator I used MLPs. I think they have a fundamental problem generating both numeric and categorical data. The discriminator won nearly all the time in my setups. I tried various tricks like truncating the generator output, clipping to known values, many architectures, learning params, noise vector lengths, dropout, leakyRelu, etc. Basically I used the activations from the hidden layers of the discriminator as the new dataset. In the end they were at a low 0.28x on CV, too low to contribute to the blend. Haven’t tried hard enough.

Another idea that came to my mind late was a min/max game, like in a GAN, to generate good noise samples. It’s critical to generate good noise for a DAE. I’m thinking of a generator with feature+noiseVec as input that maximizes the distance to the original sample, while the autoencoder (taking its input from the generator) tries to reconstruct the sample… more, maybe, in another competition.

Neural Nets

Feedforward nets trained with backprop, accelerated by minibatch gradient updates. This is what everybody does here. I use vanilla SGD (no momentum or Adam), a large number of epochs, and learning rate decay after every epoch. Hidden layers have ‘r’ = relu activation, the output is sigmoid. Trained to minimize logloss. In the bottleneck autoencoder the middle layer activation is ‘l’ = linear. When dropout != 0, all hidden layers have dropout. Input dropout often improves generalization when training on DAE features. A slight L2 regularization also helps in CV here. A hidden layer size of 1000 works out of the box for most supervised tasks. All trained on a GPU with 4-byte floats.
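
A sketch of such a supervised net, again assuming TensorFlow’s Keras, with illustrative hyperparameters rather than the winning settings:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X_dae = rng.normal(size=(2000, 4500)).astype('float32')  # stand-in DAE features
y = rng.integers(0, 2, 2000)

model = keras.Sequential([
    keras.Input(shape=(X_dae.shape[1],)),
    keras.layers.Dropout(0.1),   # input dropout helps on DAE features
    keras.layers.Dense(1000, activation='relu',
                       kernel_regularizer=keras.regularizers.l2(1e-5)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation='sigmoid'),  # sigmoid output
])
# vanilla SGD, no momentum; minimize logloss
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.05, momentum=0.0),
              loss='binary_crossentropy')
# decay the learning rate after every epoch
decay = keras.callbacks.LearningRateScheduler(lambda epoch, lr: lr * 0.97)
model.fit(X_dae, y, epochs=50, batch_size=128, callbacks=[decay])
```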

LightGBM

Nice library, very fast, sometimes better than xgboost in terms of accuracy. One model in the ensemble. I tuned params on CV.

XGBoost

I didn’t find a setup where xgboost added something to the blend, so it was not used here in Porto.

Blending

Nonlinear things failed. That’s the biggest difference from the Otto competition, where xgb and nets were great stackers. Every competition has its own pitfalls. Whatever. For me, even tuning the linear blending weights failed, so I stuck with all w=1.

Software Used

Everything I’ve done here end-to-end was written in C++/CUDA by myself. Of course, I used the lightgbm and xgboost C interfaces and a couple of acceleration libs like cuBLAS. I’m a n00b in python and R, where you guys are the experts. My approach is still old school and low level; I want to understand what is going on from top to bottom. At some point I’ll learn them, but currently there are just too many python/R packages that bust my head, so I’m sticking with loop-based code.

Hardware Used

All the models above can be run on a 32GB RAM machine with clever data swapping. In addition, I use a GTX 1080 Ti card for all the neural net stuff.

Total Time Spent

Some exaflops and kilowatts of GPU power were wasted on this competition, for sure. The models ran longer than the time I spent writing code. Reading all the forum posts also cost a remarkable amount of time, but here my intention was not to miss anything. In the end it was all worth it. A big hand to all the great writers here like Tilli, CPMP, … really great job, guys.

What Did Not Work

Upsampling, deeper autoencoders, wider autoencoders, KNNs, KNN on DAE features, nonlinear stacking, some feature engineering (yes, I tried this too), PCA, bagging, factor models (but others had success with them), xgboost (others did well with that), and much, much more…

That’s it.

New Editor – Tam T. Nguyen, Kaggle Grandmaster

I am happy to announce that we have a new editor, Tam T. Nguyen, joining Kaggler.com.

Tam is a Competition Grandmaster at Kaggle. He won first prize at KDD Cup 2015, the IJCAI-15 repeat buyer competition, and the Springleaf marketing response competition.

Currently, he is a Postdoctoral Research Fellow at Ryerson University in Toronto, Canada. Prior to that, he was a Data Analytics Project Lead at I2R A*STAR. He earned his Ph.D. in Computer Science from NTU Singapore. He’s originally from Vietnam.

Please subscribe to us at Kaggler.com, Facebook, and Twitter.

Keras Backend Benchmark: Theano vs TensorFlow vs CNTK

Inspired by Max Woolf’s benchmark, I compared the performance of 3 different Keras backends (Theano, TensorFlow, and CNTK) on 4 different GPUs (K80, M60, Titan X, and 1080 Ti) across various neural network tasks.

For the performance of TensorFlow and CNTK with the K80, the numbers reported in Max Woolf’s benchmark are used.

Conclusion

  • The accuracies of Theano, TensorFlow and CNTK backends are similar across all benchmark tests, while speeds vary a lot.
    • Theano is significantly (up to 50 times) slower than TensorFlow and CNTK.
    • Between TensorFlow and CNTK, CNTK is a lot (about 2 to 4 times) faster than TensorFlow for LSTMs (Bidirectional LSTM on IMDb Data and Text Generation via LSTM), while speeds for other types of neural networks are close to each other.
  • Among K80, M60, Titan X and 1080 Ti GPUs:
    • 1080 Ti is the fastest.
    • K80 is the slowest.
    • M60 is faster than K80 and comparable to Titan X and 1080 Ti.
    • Theano is significantly (up to 14 times) faster on 1080 Ti than on Titan X, while the improvements for TensorFlow and CNTK are moderate.

Detailed results are available at https://github.com/szilard/benchm-dl/blob/master/keras_backend.md

Building Your Own Kaggle Machine

In 2014, I shared the specifications of a 6-core, 64GB RAM desktop system that I purchased for around $2,000. Since then, I have added an NVidia Titan X to it for deep learning at an additional $1,000, and it has served me well.

However, as other team members started joining me in data science competitions, and as deep learning competitions became more popular, my team decided to build a more powerful desktop system.

The specifications of the new system that we built are as follows:

  • CPU: Xeon 2.4GHz 14-Core
  • RAM: 128GB DDR4-2400
  • GPU: 4 NVidia 1080 Ti 11GB
  • SSD: 960GB
  • HDD: 4TB 7200RPM
  • PSU: 1600W 80+ Titanium certified

The total cost including tax and shipping was around $7,000. Depending on your budget, you can go down to two 1080 Ti GPU cards instead of four (-$1,520), or 64GB of RAM instead of 128GB (-$399), and still have a decent system.

You can find the full parts list here.

Winning Data Science Competitions – Latest Slides


This year I had several occasions to give my “Winning Data Science Competitions” talk – at Microsoft, KSEA-SWC 2017, USC Applied Statistics Club, Spark SC, and Whisper.

I am grateful for all these opportunities to share what I enjoy with the data scientist community.

I truly believe that working on competitions on a regular basis can make us better data scientists. I hope my talk and slides help other data scientists.

My talk is outlined as follows:

  1. Why compete
    1. For fun
    2. For experience
    3. For learning
    4. For networking
  2. Data science competition intro
    1. Competitions
    2. Structure
    3. Kaggle
  3. Misconceptions of data science competitions
    1. No ETL?
    2. No EDA?
    3. Not worth it?
    4. Not for production?
  4. Best practices
    1. Feature engineering
    2. Diverse algorithms
    3. Cross validation
    4. Ensemble
    5. Collaboration
  5. Personal tips
  6. Additional resources

You can find the latest slides here.

Kaggler 0.5.0 Released

I am glad to announce the release of Kaggler 0.5.0. It brings a significant improvement in the performance of the FTRL algorithm, thanks to Po-Hsien Chu (github, kaggle, linkedin).

Results

We increased the training speed by up to 100x compared to 0.4.x. Our benchmark shows that one epoch on 1MM records with 8 features takes 1.2 seconds with 0.5.0, compared to 98 seconds with 0.4.x, on an i7 CPU.

Motivation

The FTRL algorithm has been popular since it first appeared in a paper published by Google. It is well suited to highly sparse data, so it has been widely used for click-through rate (CTR) prediction in online advertising. Many Kagglers use FTRL as one of their base algorithms in CTR prediction competitions. Therefore, we wanted to improve our FTRL implementation and benefit Kagglers who use our package.

Methods

We profiled the code with cProfile and resolved the overheads one by one (a rough sketch of the first and third tricks follows the list):

  1. Remove the overhead of SciPy sparse matrix row operations: SciPy's sparse matrix checks many conditions in __getitem__, resulting in a lot of function calls. In fit(), we know that we are fetching exactly one row at a time and are very unlikely to exceed the bounds, so we can fetch the indexes of each row in a faster way. This enhancement made our FTRL 10x faster.
  2. More C-style enhancements: specifying types more explicitly, returning a whole list instead of yielding feature indexes, etc. These enhancements made our FTRL 5x faster when interaction==False.
  3. A faster hash function for interaction features: the last enhancement removes the hashing overhead for interaction features. We use MurmurHash3, the same hash that scikit-learn uses, to directly hash the multiplication of feature indexes. This enhancement made our FTRL 5x faster when interaction==True.
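
A rough sketch of the row-fetching and hashing tricks, assuming a SciPy CSR matrix; this is illustrative, not the actual Kaggler code, and the hash key format is an assumption:

```python
from scipy import sparse
from sklearn.utils import murmurhash3_32

X = sparse.random(1000, 2 ** 20, density=1e-4, format='csr', random_state=0)
i = 42

# Slow: row slicing goes through __getitem__ and its many checks.
row = X[i]

# Fast: read the row's column indexes straight from the CSR internals.
start, end = X.indptr[i], X.indptr[i + 1]
cols = X.indices[start:end]

# Hash an interaction of feature indexes j and k into a fixed space
# with MurmurHash3.
j, k = 3, 7
h = murmurhash3_32(f'{j}_{k}', positive=True) % 2 ** 20
```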

Contributor

Po-Hsien Chu (github, kaggle, linkedin)