UPDATE on 5/1/2021

Today, Kaggler v0.9.4 is released with additional features for DAE as follows:

  • In addition to swap noise (swap_prob), Gaussian noise (noise_std) and zero masking (mask_prob) have been added to DAE to help prevent overfitting.
  • Stacked DAE is available through the n_layer input argument (see Figure 3 in Vincent et al. (2010), "Stacked Denoising Autoencoders," for reference).

For example, to build a stacked DAE with three encoder/decoder pairs and all three types of noise, you can do:

from kaggler.preprocessing import DAE

dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_layer=3, noise_std=.05, swap_prob=.2, mask_prob=.1)
X = dae.fit_transform(pd.concat([trn, tst], axis=0))
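
Since fit_transform is applied here to the concatenated train/test frame, the encoded rows come back in the same order as the input, so they can be sliced apart afterwards. A minimal sketch (assuming, as in the notebook below, that the returned X is a row-aligned array):

# split the DAE features back into train and test parts
X_trn = X[:len(trn)]
X_tst = X[len(trn):]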

If you're using previous versions, please upgrade Kaggler using pip install -U kaggler.


Today I released a new version (v0.9.0) of the Kaggler package, which adds a Denoising AutoEncoder (DAE) with swap noise.

Now you can train a DAE with only 2 lines of code as follows:

dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)
X = dae.fit_transform(df[feature_cols])

In addition to the new DAE feature encoder, Kaggler supports many of the feature transformations commonly used in Kaggle competitions, including the following (a short usage sketch follows the list):

  • TargetEncoder: with smoothing and cross-validation to avoid overfitting
  • FrequencyEncoder: that encodes categories by how often they occur
  • LabelEncoder: that imputes missing values and groups rare categories
  • OneHotEncoder: that imputes missing values and groups rare categories
  • EmbeddingEncoder: that transforms categorical features into embeddings
  • QuantileEncoder: that transforms numerical features into quantiles
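
These encoders share the fit_transform interface used throughout this post. As a rough sketch of how the encoders not covered in the notebook below are typically called (the import path and the min_obs argument are assumptions based on the other encoders shown later; treat this as a sketch rather than the definitive API):

from kaggler.preprocessing import FrequencyEncoder, OneHotEncoder

# encode each categorical value by how frequently it occurs in the data
fe = FrequencyEncoder()
X_freq = fe.fit_transform(df[cat_cols])

# one-hot encode while grouping rare categories (fewer than 50 observations)
ohe = OneHotEncoder(min_obs=50)
X_ohe = ohe.fit_transform(df[cat_cols])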

In the notebook below, I will show how to use Kaggler's LabelEncoder, TargetEncoder, and DAE for feature engineering, then use Kaggler's AutoLGB to do feature selection and hyperparameter optimization.

This notebook was originally published here at Kaggle.


Part 1: Data Loading & Feature Engineering

import lightgbm as lgb
import numpy as np
import pandas as pd
from pathlib import Path
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, confusion_matrix
import warnings
!pip install kaggler
import kaggler
from kaggler.model import AutoLGB
from kaggler.preprocessing import DAE, TargetEncoder, LabelEncoder

print(f'Kaggler: {kaggler.__version__}')
warnings.simplefilter('ignore')
pd.set_option('max_columns', 100)
feature_name = 'dae'
algo_name = 'lgb'
model_name = f'{algo_name}_{feature_name}'

data_dir = Path('/kaggle/input/tabular-playground-series-apr-2021/')
trn_file = data_dir / 'train.csv'
tst_file = data_dir / 'test.csv'
sample_file = data_dir / 'sample_submission.csv'
pseudo_label_file = '../input/tps-apr-2021-pseudo-label-dae/tps04-sub-006.csv'

feature_file = f'{feature_name}.csv'
predict_val_file = f'{model_name}.val.txt'
predict_tst_file = f'{model_name}.tst.txt'
submission_file = f'{model_name}.sub.csv'

target_col = 'Survived'
id_col = 'PassengerId'
n_fold = 5
seed = 42
encoding_dim = 64
trn = pd.read_csv(trn_file, index_col=id_col)
tst = pd.read_csv(tst_file, index_col=id_col)
sub = pd.read_csv(sample_file, index_col=id_col)
pseudo_label = pd.read_csv(pseudo_label_file, index_col=id_col)
print(trn.shape, tst.shape, sub.shape, pseudo_label.shape)
# Attach pseudo labels to the test set and combine train and test for feature engineering
tst[target_col] = pseudo_label[target_col]
n_trn = trn.shape[0]
df = pd.concat([trn, tst], axis=0)
df.head()
# Fill missing categorical values and derive CabinType and a ticket-prefix feature
df['Embarked'] = df['Embarked'].fillna('No')
df['Cabin'] = df['Cabin'].fillna('_')
df['CabinType'] = df['Cabin'].apply(lambda x: x[0])
df.Ticket = df.Ticket.map(lambda x: str(x).split()[0] if len(str(x).split()) > 1 else 'X')

# Age: fill missing values with the rounded median age
df['Age'].fillna(round(df['Age'].median()), inplace=True)
df['Age'] = df['Age'].apply(round).astype(int)

# Fare: fill missing values with the median fare of the corresponding Pclass
fare_map = df[['Fare', 'Pclass']].dropna().groupby('Pclass').median().to_dict()
df['Fare'] = df['Fare'].fillna(df['Pclass'].map(fare_map['Fare']))

# Split Name into its two comma-separated parts
df['FirstName'] = df['Name'].str.split(', ').str[0]
df['SecondName'] = df['Name'].str.split(', ').str[1]

# Count how many passengers share each name part
df['n'] = 1

gb = df.groupby('FirstName')
df_names = gb['n'].sum()
df['SameFirstName'] = df['FirstName'].apply(lambda x: df_names[x]).fillna(1)

gb = df.groupby('SecondName')
df_names = gb['n'].sum()
df['SameSecondName'] = df['SecondName'].apply(lambda x: df_names[x]).fillna(1)

df['Sex'] = (df['Sex'] == 'male').astype(int)

df['FamilySize'] = df.SibSp + df.Parch + 1

feature_cols = ['Pclass', 'Age','Embarked','Parch','SibSp','Fare','CabinType','Ticket','SameFirstName', 'SameSecondName', 'Sex',
                'FamilySize', 'FirstName', 'SecondName']
cat_cols = ['Pclass','Embarked','CabinType','Ticket', 'FirstName', 'SecondName']
num_cols = [x for x in feature_cols if x not in cat_cols]
print(len(feature_cols), len(cat_cols), len(num_cols))
# Log-transform skewed count and fare features, then standardize all numerical features
for col in ['SameFirstName', 'SameSecondName', 'Fare', 'FamilySize', 'Parch', 'SibSp']:
    df[col] = np.log2(1 + df[col])

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

Label encoding with rare category grouping and missing value imputation

lbe = LabelEncoder(min_obs=50)
df[cat_cols] = lbe.fit_transform(df[cat_cols]).astype(int)

Target encoding with smoothing and 5-fold cross-validation

cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)
te = TargetEncoder(cv=cv)
df_te = te.fit_transform(df[cat_cols], df[target_col])
df_te.columns = [f'te_{col}' for col in cat_cols]
df_te.head()

DAE

dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)
X = dae.fit_transform(df[feature_cols])
df_dae = pd.DataFrame(X, columns=[f'dae_{i}' for i in range(encoding_dim)])
print(df_dae.shape)

Part 2: Model Training

AutoLGB for Feature Selection and Hyperparameter Optimization

X = pd.concat([df[feature_cols], df_te, df_dae], axis=1)
y = df[target_col]
X_tst = X.iloc[n_trn:]

# Out-of-fold predictions over the combined data and fold-averaged predictions for the test set
p = np.zeros_like(y, dtype=float)
p_tst = np.zeros((tst.shape[0],))
print(f'Training LightGBM models with {n_fold}-fold CV:')
for i, (i_trn, i_val) in enumerate(cv.split(X, y)):
    # Tune features and hyperparameters with AutoLGB on the first fold only
    if i == 0:
        clf = AutoLGB(objective='binary', metric='auc', random_state=seed)
        clf.tune(X.iloc[i_trn], y[i_trn])
        features = clf.features
        params = clf.params
        n_best = clf.n_best
        print(f'{n_best}')
        print(f'{params}')
        print(f'{features}')
    
    trn_data = lgb.Dataset(X.iloc[i_trn], y[i_trn])
    val_data = lgb.Dataset(X.iloc[i_val], y[i_val])
    clf = lgb.train(params, trn_data, n_best, val_data, verbose_eval=100)
    p[i_val] = clf.predict(X.iloc[i_val])
    p_tst += clf.predict(X_tst) / n_fold
    print(f'CV #{i + 1} AUC: {roc_auc_score(y[i_val], p[i_val]):.6f}')
print(f'  CV AUC: {roc_auc_score(y, p):.6f}')
print(f'Test AUC (vs. pseudo labels): {roc_auc_score(pseudo_label[target_col], p_tst):.6f}')

Submission

# Convert probabilities to binary labels by thresholding at a target positive rate of ~34.911%
n_pos = int(0.34911 * tst.shape[0])
th = sorted(p_tst, reverse=True)[n_pos]
print(th)
confusion_matrix(pseudo_label[target_col], (p_tst > th).astype(int))
sub[target_col] = (p_tst > th).astype(int)
sub.to_csv(submission_file)

If you find it useful, please upvote the notebook and leave your feedback. It will be greatly appreciated!

Please check out my previous notebooks as well: