UPDATE on 9/15/2015
I found a bug in OneHotEncoder and fixed it. The fix is not available on pip yet, but you can update Kaggler to the latest version from source as follows:
$ git clone https://github.com/jeongyoonlee/Kaggler.git
$ cd Kaggler
$ python setup.py build_ext --inplace
$ sudo python setup.py install
If you find a bug, please submit a pull request on GitHub or comment here.
I’m glad to announce the release of Kaggler 0.4.0.
Kaggler is a Python package that provides utility functions and online learning algorithms for classification. I use it for Kaggle competitions along with scikit-learn, Lasagne, XGBoost, and Vowpal Wabbit.
Classes in kaggler.preprocessing now support fit_transform and transform methods. Currently 2 preprocessing classes are available as follows:
Normalizer – aligns the distributions of numerical features to a normal distribution. Note that it differs from sklearn.preprocessing.Normalizer, which only rescales each sample to unit norm rather than reshaping feature distributions.
OneHotEncoder – transforms categorical features into dummy variables. It is similar to sklearn.preprocessing.OneHotEncoder, except that it groups infrequent values into a single dummy variable.
from kaggler.preprocessing import OneHotEncoder

# values appearing less than min_obs are grouped into one dummy variable.
enc = OneHotEncoder(min_obs=10, nan_as_var=False)
X_train = enc.fit_transform(train)
X_test = enc.transform(test)
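To make the grouping idea concrete, here is a minimal pure-Python sketch. It is not Kaggler's implementation: the helpers `fit_frequent_values` and `one_hot_with_other` are hypothetical names made up for illustration, and sending values never seen in training to the same shared dummy is an assumption.

```python
from collections import Counter


def fit_frequent_values(column, min_obs=10):
    """Return the sorted values that appear at least min_obs times."""
    counts = Counter(column)
    return sorted(v for v, c in counts.items() if c >= min_obs)


def one_hot_with_other(column, frequent_values):
    """Encode a column as dummy rows.

    The last slot is a shared dummy for infrequent values and
    (by assumption) for values never seen during fitting.
    """
    index = {v: i for i, v in enumerate(frequent_values)}
    other = len(frequent_values)  # position of the shared dummy
    rows = []
    for value in column:
        row = [0] * (other + 1)
        row[index.get(value, other)] = 1
        rows.append(row)
    return rows


train = ["a"] * 3 + ["b"] * 2 + ["c"]
# "c" appears only once, so with min_obs=2 it falls into the shared dummy.
frequent = fit_frequent_values(train, min_obs=2)
encoded = one_hot_with_other(["a", "c", "z"], frequent)
```

Grouping rare values this way keeps the number of dummy columns bounded when a categorical feature has a long tail of values that each appear only a handful of times.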
3 metrics are available as follows:
logloss – calculates the bounded log loss error for classification predictions.
rmse – calculates the root mean squared error for regression predictions.
gini – calculates the Gini coefficient for regression predictions.
from kaggler.metrics import gini

score = gini(y, p)
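For readers who want to see what these metrics compute, here is a rough pure-Python sketch. It is not the code in kaggler.metrics: the clipping bound `eps=1e-15` is an assumption, and `normalized_gini` is the variant commonly used on Kaggle, which may or may not match what Kaggler's `gini` returns.

```python
import math


def logloss(y, p, eps=1e-15):
    """Mean log loss with predictions clipped to [eps, 1 - eps] (eps is assumed)."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)  # the bound keeps log() finite
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -total / len(y)


def rmse(y, p):
    """Root mean squared error."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y))


def _gini(actual, pred):
    """Unnormalized Gini: rank by prediction (descending), accumulate actuals."""
    n = len(actual)
    order = sorted(range(n), key=lambda i: (-pred[i], i))
    total = sum(actual)
    cumulative = 0.0
    area = 0.0
    for i in order:
        cumulative += actual[i]
        area += cumulative / total
    return (area - (n + 1) / 2.0) / n


def normalized_gini(actual, pred):
    """Gini of the predictions divided by the Gini of a perfect ranking."""
    return _gini(actual, pred) / _gini(actual, actual)
```

With this definition a perfect ranking scores 1 and a fully reversed ranking scores -1, so only the order of the predictions matters, not their scale.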
Classes in kaggler.online_model (… ClassificationTree) now support fit and predict methods. Currently 5 online learning algorithms are available as follows:
SGD – stochastic gradient descent algorithm with hashing trick and interaction
FTRL – follow-the-regularized-leader algorithm with hashing trick and interaction
FM – factorization machine algorithm
NN (or NN_H2) – neural network algorithm with a single (or double) hidden layer(s)
ClassificationTree – decision tree algorithm
from kaggler.online_model import FTRL
from kaggler.data_io import load_data

# load a libsvm format sparse feature file
X, y = load_data('train.sparse', dense=False)

# FTRL
clf = FTRL(a=.1,              # alpha in the per-coordinate rate
           b=1,               # beta in the per-coordinate rate
           l1=1.,             # L1 regularization parameter
           l2=1.,             # L2 regularization parameter
           n=2**20,           # number of hashed features
           epoch=1,           # number of epochs
           interaction=True)  # use feature interaction or not

# training and prediction
clf.fit(X, y)
p = clf.predict(X)
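To show roughly what the a, b, l1, l2 and n parameters control, here is a small pure-Python sketch of the per-coordinate FTRL-proximal update for logistic regression over hashed binary features. It is an illustration under assumptions, not Kaggler's implementation: the class `FTRLSketch` and its methods are hypothetical, feature interaction and epochs are left out, and `zlib.crc32` is used for hashing only because it is deterministic across runs.

```python
import math
import zlib


class FTRLSketch:
    """Hypothetical FTRL-proximal logistic regression over hashed binary features."""

    def __init__(self, a=0.1, b=1.0, l1=1.0, l2=1.0, n=2 ** 20):
        self.a, self.b, self.l1, self.l2, self.n = a, b, l1, l2, n
        self.z = {}  # per-coordinate accumulated adjusted gradients
        self.c = {}  # per-coordinate sums of squared gradients

    def _indices(self, features):
        # Hashing trick: map each feature string to one of n slots.
        return sorted({zlib.crc32(f.encode("utf-8")) % self.n for f in features})

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 regularization keeps the weight at exactly zero
        sign = 1.0 if z > 0 else -1.0
        scale = (self.b + math.sqrt(self.c.get(i, 0.0))) / self.a + self.l2
        return -(z - sign * self.l1) / scale

    @staticmethod
    def _sigmoid(s):
        s = max(min(s, 35.0), -35.0)  # clamp to avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-s))

    def predict_one(self, features):
        return self._sigmoid(sum(self._weight(i) for i in self._indices(features)))

    def update_one(self, features, y):
        idx = self._indices(features)
        weights = {i: self._weight(i) for i in idx}
        p = self._sigmoid(sum(weights.values()))
        g = p - y  # log loss gradient for a feature whose value is 1
        for i in idx:
            c = self.c.get(i, 0.0)
            sigma = (math.sqrt(c + g * g) - math.sqrt(c)) / self.a
            self.z[i] = self.z.get(i, 0.0) + g - sigma * weights[i]
            self.c[i] = c + g * g
        return p


# Toy data: one feature always labeled 1, another always labeled 0.
model = FTRLSketch(a=0.1, b=1.0, l1=0.0, l2=0.0, n=2 ** 20)
for features, label in [(["color=red"], 1), (["color=blue"], 0)] * 50:
    model.update_one(features, label)
p_red = model.predict_one(["color=red"])
p_blue = model.predict_one(["color=blue"])
```

In this sketch a and b set the per-coordinate learning rate, so features seen often take smaller steps than rare ones, while a larger l1 keeps more weights at exactly zero.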
Please let me know if you have any comments or want to contribute. 🙂
Kaggler. Data Scientist. Father of Five.