DAE with 2 Lines of Code with Kaggler
A tutorial on Kaggler's new DAE feature transformation
UPDATE on 5/1/2021
Today, Kaggler v0.9.4 is released with additional features for DAE as follows:
- In addition to the swap noise (swap_prob), the Gaussian noise (noise_std) and zero masking (mask_prob) have been added to DAE to overcome overfitting.
- Stacked DAE is available through the n_layer input argument (see Figure 3 in Vincent et al. (2010), "Stacked Denoising Autoencoders" for reference).
For example, to build a stacked DAE with 3 pairs of encoders/decoders and all three types of noise, you can do:
from kaggler.preprocessing import DAE
dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_layer=3, noise_std=.05, swap_prob=.2, mask_prob=.1)
X = dae.fit_transform(pd.concat([trn, tst], axis=0))
If you're using previous versions, please upgrade Kaggler using pip install -U kaggler.
Today I released a new version (v0.9.0) of the Kaggler package, which adds a Denoising AutoEncoder (DAE) with swap noise.
Now you can train a DAE with only 2 lines of code as follows:
dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)
X = dae.fit_transform(df[feature_cols])
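Under the hood, swap noise corrupts each input cell, with probability swap_prob, by replacing its value with the value of the same column taken from a randomly chosen row, and the DAE is trained to reconstruct the clean input from the corrupted one. Below is a minimal NumPy sketch of the idea, not Kaggler's actual implementation; the helper name add_swap_noise is made up for illustration:

import numpy as np

def add_swap_noise(X, swap_prob=.2, random_state=42):
    # For each cell, with probability swap_prob, replace its value with the
    # value of the same column taken from a randomly chosen donor row.
    rng = np.random.RandomState(random_state)
    mask = rng.rand(*X.shape) < swap_prob                   # cells to corrupt
    donor_rows = rng.randint(0, X.shape[0], size=X.shape)   # donor row for every cell
    cols = np.tile(np.arange(X.shape[1]), (X.shape[0], 1))  # column index of every cell
    X_noisy = X.copy()
    X_noisy[mask] = X[donor_rows[mask], cols[mask]]
    return X_noisy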
In addition to the new DAE feature encoder, Kaggler supports many of the feature transformations used in Kaggle competitions, including:
- TargetEncoder: with smoothing and cross-validation to avoid overfitting
- FrequencyEncoder
- LabelEncoder: that imputes missing values and groups rare categories
- OneHotEncoder: that imputes missing values and groups rare categories
- EmbeddingEncoder: that transforms categorical features into embeddings
- QuantileEncoder: that transforms numerical features into quantiles
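These encoders follow the same fit/transform interface as the LabelEncoder and TargetEncoder used in the notebook below. As a rough sketch of how the ones not shown in the notebook could be used (the constructor arguments such as min_obs are assumptions carried over from the LabelEncoder usage; please check the Kaggler documentation for the exact signatures):

from kaggler.preprocessing import FrequencyEncoder, OneHotEncoder, QuantileEncoder

# Encode each category as its frequency in the data.
fe = FrequencyEncoder()
df_freq = fe.fit_transform(df[cat_cols])

# Bin numerical features into quantiles.
qe = QuantileEncoder()
df_quantile = qe.fit_transform(df[num_cols])

# One-hot encode categorical features, grouping rare categories (min_obs is an assumed argument).
ohe = OneHotEncoder(min_obs=50)
X_ohe = ohe.fit_transform(df[cat_cols])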
In the notebook below, I will show how to use Kaggler's LabelEncoder, TargetEncoder, and DAE for feature engineering, then use Kaggler's AutoLGB to do feature selection and hyperparameter optimization.
This notebook was originally published here at Kaggle.
import lightgbm as lgb
import numpy as np
import pandas as pd
from pathlib import Path
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, confusion_matrix
import warnings
!pip install kaggler
import kaggler
from kaggler.model import AutoLGB
from kaggler.preprocessing import DAE, TargetEncoder, LabelEncoder
print(f'Kaggler: {kaggler.__version__}')
warnings.simplefilter('ignore')
pd.set_option('max_columns', 100)
feature_name = 'dae'
algo_name = 'lgb'
model_name = f'{algo_name}_{feature_name}'
data_dir = Path('/kaggle/input/tabular-playground-series-apr-2021/')
trn_file = data_dir / 'train.csv'
tst_file = data_dir / 'test.csv'
sample_file = data_dir / 'sample_submission.csv'
pseudo_label_file = '../input/tps-apr-2021-pseudo-label-dae/tps04-sub-006.csv'
feature_file = f'{feature_name}.csv'
predict_val_file = f'{model_name}.val.txt'
predict_tst_file = f'{model_name}.tst.txt'
submission_file = f'{model_name}.sub.csv'
target_col = 'Survived'
id_col = 'PassengerId'
n_fold = 5
seed = 42
encoding_dim = 64
trn = pd.read_csv(trn_file, index_col=id_col)
tst = pd.read_csv(tst_file, index_col=id_col)
sub = pd.read_csv(sample_file, index_col=id_col)
pseudo_label = pd.read_csv(pseudo_label_file, index_col=id_col)
print(trn.shape, tst.shape, sub.shape, pseudo_label.shape)
tst[target_col] = pseudo_label[target_col]
n_trn = trn.shape[0]
df = pd.concat([trn, tst], axis=0)
df.head()
# Fill missing categorical values and extract the cabin type and ticket prefix.
df['Embarked'] = df['Embarked'].fillna('No')
df['Cabin'] = df['Cabin'].fillna('_')
df['CabinType'] = df['Cabin'].apply(lambda x: x[0])
df.Ticket = df.Ticket.map(lambda x: str(x).split()[0] if len(str(x).split()) > 1 else 'X')
df['Age'].fillna(round(df['Age'].median()), inplace=True,)
df['Age'] = df['Age'].apply(round).astype(int)
# Fare: fill missing values with the median fare of the corresponding Pclass
fare_map = df[['Fare', 'Pclass']].dropna().groupby('Pclass').median().to_dict()
df['Fare'] = df['Fare'].fillna(df['Pclass'].map(fare_map['Fare']))
df['FirstName'] = df['Name'].str.split(', ').str[0]
df['SecondName'] = df['Name'].str.split(', ').str[1]
df['n'] = 1
gb = df.groupby('FirstName')
df_names = gb['n'].sum()
df['SameFirstName'] = df['FirstName'].apply(lambda x:df_names[x]).fillna(1)
gb = df.groupby('SecondName')
df_names = gb['n'].sum()
df['SameSecondName'] = df['SecondName'].apply(lambda x:df_names[x]).fillna(1)
df['Sex'] = (df['Sex'] == 'male').astype(int)
df['FamilySize'] = df.SibSp + df.Parch + 1
feature_cols = ['Pclass', 'Age','Embarked','Parch','SibSp','Fare','CabinType','Ticket','SameFirstName', 'SameSecondName', 'Sex',
'FamilySize', 'FirstName', 'SecondName']
cat_cols = ['Pclass','Embarked','CabinType','Ticket', 'FirstName', 'SecondName']
num_cols = [x for x in feature_cols if x not in cat_cols]
print(len(feature_cols), len(cat_cols), len(num_cols))
# Log-transform skewed count and amount features.
for col in ['SameFirstName', 'SameSecondName', 'Fare', 'FamilySize', 'Parch', 'SibSp']:
    df[col] = np.log2(1 + df[col])
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
lbe = LabelEncoder(min_obs=50)
df[cat_cols] = lbe.fit_transform(df[cat_cols]).astype(int)
cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)
te = TargetEncoder(cv=cv)
df_te = te.fit_transform(df[cat_cols], df[target_col])
df_te.columns = [f'te_{col}' for col in cat_cols]
df_te.head()
dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)
X = dae.fit_transform(df[feature_cols])
df_dae = pd.DataFrame(X, columns=[f'dae_{i}' for i in range(encoding_dim)], index=df.index)  # align with df's index for the concat below
print(df_dae.shape)
X = pd.concat([df[feature_cols], df_te, df_dae], axis=1)
y = df[target_col]
X_tst = X.iloc[n_trn:]
p = np.zeros_like(y, dtype=float)
p_tst = np.zeros((tst.shape[0],))
print('Training LightGBM models with cross-validation:')
for i, (i_trn, i_val) in enumerate(cv.split(X, y)):
    if i == 0:
        # Run feature selection and hyperparameter optimization on the first fold only.
        clf = AutoLGB(objective='binary', metric='auc', random_state=seed)
        clf.tune(X.iloc[i_trn], y[i_trn])
        features = clf.features
        params = clf.params
        n_best = clf.n_best
        print(f'{n_best}')
        print(f'{params}')
        print(f'{features}')

    # Train a LightGBM model on the selected features with the tuned hyperparameters.
    trn_data = lgb.Dataset(X[features].iloc[i_trn], y[i_trn])
    val_data = lgb.Dataset(X[features].iloc[i_val], y[i_val])
    clf = lgb.train(params, trn_data, n_best, val_data, verbose_eval=100)
    p[i_val] = clf.predict(X[features].iloc[i_val])
    p_tst += clf.predict(X_tst[features]) / n_fold
    print(f'CV #{i + 1} AUC: {roc_auc_score(y[i_val], p[i_val]):.6f}')
print(f' CV AUC: {roc_auc_score(y, p):.6f}')
print(f'Test AUC: {roc_auc_score(pseudo_label[target_col], p_tst):.6f}')
# Convert predicted probabilities into binary labels by thresholding at the
# assumed positive rate (~34.911%) in the test set.
n_pos = int(0.34911 * tst.shape[0])
th = sorted(p_tst, reverse=True)[n_pos]
print(th)
confusion_matrix(pseudo_label[target_col], (p_tst > th).astype(int))
sub[target_col] = (p_tst > th).astype(int)
sub.to_csv(submission_file)
If you find it useful, please upvote the notebook and leave your feedback. It will be greatly appreciated!
Please check out my previous notebooks as well:
- AutoEncoder + Pseudo Label + AutoLGB: shows how to build a basic AutoEncoder using Keras, and perform automated feature selection and hyperparameter optimization using Kaggler's AutoLGB.
- Supervised Emphasized Denoising AutoEncoder: shows how to build a more sophisticated version of an AutoEncoder, called a supervised emphasized Denoising AutoEncoder (DAE), which trains the DAE and a classifier simultaneously.
- Stacking Ensemble: shows how to perform stacking ensemble.