Kaggler 0.4.0 Released

UPDATE on 9/15/2015

I found a bug in OneHotEncoder and fixed it.  The fix is not yet available on pip, but you can update Kaggler to the latest version from source as follows:

$ git clone https://github.com/jeongyoonlee/Kaggler.git
$ cd Kaggler
$ python setup.py build_ext --inplace
$ sudo python setup.py install

If you find a bug, please submit a pull request on GitHub or leave a comment here.


I’m glad to announce the release of Kaggler 0.4.0.

Kaggler is a Python package that provides utility functions and online learning algorithms for classification.  I use it for Kaggle competitions along with scikit-learn, Lasagne, XGBoost, and Vowpal Wabbit.

Kaggler 0.4.0 adds a scikit-learn-like interface for preprocessing, metrics, and online learning algorithms.

kaggler.preprocessing

Classes in kaggler.preprocessing now support the fit, fit_transform, and transform methods. Currently, two preprocessing classes are available:

  • Normalizer – transforms the distributions of numerical features toward a normal distribution. Note that it is different from sklearn.preprocessing.Normalizer, which only scales features without changing their distributions.
  • OneHotEncoder – transforms categorical features into dummy variables.  It is similar to sklearn.preprocessing.OneHotEncoder except that it groups infrequent values into a single dummy variable. For example:
from kaggler.preprocessing import OneHotEncoder

# values appearing less than min_obs are grouped into one dummy variable.
enc = OneHotEncoder(min_obs=10, nan_as_var=False)
X_train = enc.fit_transform(train)
X_test = enc.transform(test)
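
Normalizer follows the same fit/fit_transform/transform pattern. Here is a minimal sketch, assuming the same train and test data frames as above and Normalizer's default constructor arguments:

from kaggler.preprocessing import Normalizer

# fit the normalizer on the training data, then apply the same
# transformation to the test data
nm = Normalizer()
X_train = nm.fit_transform(train)
X_test = nm.transform(test)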

kaggler.metrics

Three metrics are available:

  • logloss – calculates the bounded log loss error for classification predictions.
  • rmse – calculates the root mean squared error for regression predictions.
  • gini – calculates the Gini coefficient for regression predictions.
from kaggler.metrics import gini

score = gini(y, p)
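
logloss and rmse are used the same way. A minimal sketch, assuming they share the (y, p) signature shown above:

from kaggler.metrics import logloss, rmse

# classification: y holds labels, p holds predicted probabilities
cls_score = logloss(y, p)

# regression: y holds targets, p holds predicted values
reg_score = rmse(y, p)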

kaggler.online_model

Classes in kaggler.online_model (except ClassificationTree) now support the fit and predict methods. Currently, five online learning algorithms are available:

  • SGD – stochastic gradient descent algorithm with hashing trick and interaction
  • FTRL – follow-the-regularized-leader algorithm with hashing trick and interaction
  • FM – factorization machine algorithm
  • NN (or NN_H2) – neural network algorithm with a single (or double) hidden layer(s)
  • ClassificationTree – decision tree algorithm
from kaggler.online_model import FTRL
from kaggler.data_io import load_data

# load a libsvm format sparse feature file
X, y = load_data('train.sparse', dense=False)

# FTRL
clf = FTRL(a=.1,                # alpha in the per-coordinate rate
           b=1,                 # beta in the per-coordinate rate
           l1=1.,               # L1 regularization parameter
           l2=1.,               # L2 regularization parameter
           n=2**20,             # number of hashed features
           epoch=1,             # number of epochs
           interaction=True)    # use feature interaction or not

# training and prediction
clf.fit(X, y)
p = clf.predict(X)
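
The other online models are used the same way. A minimal sketch with SGD, continuing from the example above and relying on its default hyperparameters (see the package documentation for the exact constructor arguments):

from kaggler.online_model import SGD

# default hyperparameters; check the documentation for options such as
# the learning rate, number of hashed features, and feature interaction
clf = SGD()
clf.fit(X, y)
p = clf.predict(X)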

The latest code is available on GitHub.
Package documentation is available at https://pythonhosted.org/Kaggler/.

Please let me know if you have any comments or want to contribute. 🙂
