In this post, we go over the fundamentals of the Kaggler repository, a machine learning pipeline for data science competitions. The Kaggler pipeline uses Makefiles and Python scripts to coordinate dependencies, and allows quick iteration on new features and models. You can watch the demo at Kaggler TV Episode #4.
The pipeline is driven by running a model training, such as logistic regression, using the corresponding Makefile, e.g.,
$ make -f Makefile.logreg1
Before going into the details of a model run, let’s build our data and features from the bottom up. To start, we need to initialize our repo by going to the Kaggler repository and clicking "Use this template" to name and create our repository, e.g., cat-in-the-dat-ii. Next, clone the repository:
$ git clone https://github.com/YOUR_GITHUB_ID/cat-in-the-dat-ii.git
$ cd cat-in-the-dat-ii
The file Makefile defines the directories and the structure of the pipeline.
# XXX: competition name
COMPETITION := cat-in-the-dat-ii
# gsed on macOS. sed on LINUX
SED := gsed
# directories
DIR_DATA := input
DIR_BUILD := build
DIR_FEATURE := $(DIR_BUILD)/feature
DIR_METRIC := $(DIR_BUILD)/metric
DIR_MODEL := $(DIR_BUILD)/model
# directories for the cross validation and ensembling
DIR_VAL := $(DIR_BUILD)/val
DIR_TST := $(DIR_BUILD)/tst
DIR_SUB := $(DIR_BUILD)/sub
DIRS := $(DIR_DATA) $(DIR_BUILD) $(DIR_FEATURE) $(DIR_METRIC) $(DIR_MODEL) \
        $(DIR_VAL) $(DIR_TST) $(DIR_SUB)
# data files for training and predict
DATA_TRN := $(DIR_DATA)/train.csv
DATA_TST := $(DIR_DATA)/test.csv
SAMPLE_SUBMISSION := $(DIR_DATA)/sample_submission.csv
LABEL_IDX = 25
ID_TST := $(DIR_DATA)/id.tst.csv
HEADER := $(DIR_DATA)/header.csv
Y_TRN:= $(DIR_FEATURE)/y.trn.txt
Y_TST:= $(DIR_FEATURE)/y.tst.txt
data: $(DATA_TRN) $(DATA_TST) $(SAMPLE_SUBMISSION)
$(DIRS):
	mkdir -p $@
$(DATA_TRN) $(DATA_TST) $(SAMPLE_SUBMISSION): | $(DIR_DATA)
	kaggle competitions download -c $(COMPETITION) -p $(DIR_DATA)
	find . -name "*.zip" -exec sh -c 'unzip -d `dirname {}` {}' ';'
$(HEADER): $(SAMPLE_SUBMISSION)
	head -1 $< > $@
$(ID_TST): $(SAMPLE_SUBMISSION)
	cut -d, -f1 $< | tail -n +2 > $@
$(Y_TST): $(SAMPLE_SUBMISSION) | $(DIR_FEATURE)
	cut -d, -f2 $< | tail -n +2 > $@
$(Y_TRN): $(DATA_TRN) | $(DIR_FEATURE)
	cut -d, -f$(LABEL_IDX) $< | tail -n +2 > $@
# cleanup
clean::
	find . -name '*.pyc' -delete
clobber: clean
	-rm -rf $(DIR_DATA) $(DIR_BUILD)
.PHONY: clean clobber mac.setup ubuntu.setup apt.setup pip.setup
First, we need to define the name of the competition in Makefile. After that, running make data will download the specified competition data from Kaggle into the ./input directory. You need to install the Kaggle API and accept the competition rules on Kaggle to be able to download the data. If you do not download the data manually at this point, the pipeline will automatically start the download when you run the first model training. The parameters defined in Makefile are as follows.
$DIR_DATA is the directory for the input data.
$DATA_TRN, $DATA_TST, and $SAMPLE_SUBMISSION are the downloaded train, test and sample submission files.
LABEL_IDX is the 1-based column index of the target variable in the train file, and needs to be specified.
$Y_TRN is the file containing the target labels, and it is created automatically by the pipeline.
$HEADER and $ID_TST are also created by the pipeline, and are used to build submission files.
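To make the role of LABEL_IDX concrete, here is a minimal Python sketch of what the Y_TRN rule does with cut and tail: take one 1-based column and drop the header row. The function name and the toy columns are hypothetical and are not part of the Kaggler repository. Unlike cut, the csv module also handles quoted fields that contain commas.

```python
import csv
import io


def extract_column(csv_text, label_idx):
    """Return the values of the 1-based column `label_idx`, skipping the header row."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows)  # skip the header line, like `tail -n +2`
    return [row[label_idx - 1] for row in rows]


# Toy data standing in for train.csv (hypothetical columns).
toy_train = "id,bin_0,target\n0,1,0\n1,0,1\n2,1,1\n"
print(extract_column(toy_train, label_idx=3))
```

With the real train file, LABEL_IDX := 25 plays the role of label_idx=3 in this toy example.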
Let’s create our first feature by one-hot encoding all the categorical columns [1]. All the columns in this competition are categorical. The feature engineering for a specific feature is defined in a file ./src/generate_$FEATURE_NAME.py, e.g., ./src/generate_e1.py. We also need to create a makefile corresponding to this feature, Makefile.feature.e1, as follows.
#--------------------------------------------------------------------------
# e1: all OHE'd features
#--------------------------------------------------------------------------
include Makefile
FEATURE_NAME := e1
FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.sps
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.sps
FEATURE_MAP := $(DIR_FEATURE)/$(FEATURE_NAME).fmap
$(FEATURE_TRN) $(FEATURE_TST) $(FEATURE_MAP): $(DATA_TRN) $(DATA_TST) | $(DIR_FEATURE)
	python ./src/generate_$(FEATURE_NAME).py --train-file $< \
                                             --test-file $(lastword $^) \
                                             --train-feature-file $(FEATURE_TRN) \
                                             --test-feature-file $(FEATURE_TST) \
                                             --feature-map-file $(FEATURE_MAP)
Feature makefiles include all the parameters from Makefile. The parameters defined in Makefile.feature.e1 are
FEATURE_NAME: the name of the feature.
FEATURE_TRN, FEATURE_TST: train and test feature files, which are the outputs created by ./src/generate_$(FEATURE_NAME).py.
FEATURE_MAP: a file where we keep the names of the features, which is also created by ./src/generate_$(FEATURE_NAME).py.
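The post does not list ./src/generate_e1.py itself, so the following is only a rough, standard-library sketch of the idea behind it: map every (column, value) pair to a feature index, emit each row as a sparse line, and keep the index-to-name mapping as a feature map. The function names, the toy columns, and the svmlight-style text layout are assumptions made for illustration. The .sps extension suggests a sparse format, but the exact on-disk format used by the repository is not stated here.

```python
def fit_feature_map(rows, columns):
    """Assign one output index to every (column, value) pair seen in `rows`."""
    fmap = {}
    for row in rows:
        for col in columns:
            key = (col, row[col])
            if key not in fmap:
                fmap[key] = len(fmap)
    return fmap


def encode_row(row, columns, fmap):
    """Return the sorted indices of the active one-hot features for one row."""
    # Values that were never seen while fitting are silently ignored.
    return sorted(fmap[(c, row[c])] for c in columns if (c, row[c]) in fmap)


def to_sparse_line(label, indices):
    """Format one row as `label idx:1 idx:1 ...` (svmlight-style text)."""
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


columns = ["bin_0", "nom_0"]  # hypothetical column names
train = [
    {"bin_0": "0", "nom_0": "Red"},
    {"bin_0": "1", "nom_0": "Blue"},
    {"bin_0": "0", "nom_0": "Blue"},
]
fmap = fit_feature_map(train, columns)

# Sparse training rows, one line per sample.
for label, row in zip([0, 1, 1], train):
    print(to_sparse_line(label, encode_row(row, columns, fmap)))

# Feature map: index followed by a readable feature name.
for (col, value), idx in sorted(fmap.items(), key=lambda kv: kv[1]):
    print(idx, f"{col}={value}")
```

Fitting the map on the train rows and reusing it for the test rows keeps the column indices aligned between FEATURE_TRN and FEATURE_TST, which is why a script like this would write both files in one run.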
The models are defined in makefiles Makefile.$ALGO_NAME, e.g., Makefile.logreg1. At the top of each model file, we define which feature is going to be included, as shown below. Then, we give the algorithm a short name, ALGO_NAME, for reference. We define the parameters for the algorithm: C, REGULARIZER, CLASS_WEIGHT and SOLVER in this case. We also specify a model name, MODEL_NAME, for reference. The cross validation for each algorithm is run by a script ./src/train_predict_*.py, e.g., ./src/train_predict_logreg1.py for Makefile.logreg1, which produces validation and test predictions, PREDICT_VAL and PREDICT_TST.
After the cross validation, ./src/evaluate.py evaluates the validation predictions for a given metric, and the score is written to the file METRIC_VAL. Finally, the submission file, SUBMISSION_TST, is created using the test predictions.
include Makefile.feature.e1
ALGO_NAME := logreg
C := 1.0
REGULARIZER := l2
CLASS_WEIGHT := balanced
SOLVER := lbfgs
MODEL_NAME := $(FEATURE_NAME)_$(ALGO_NAME)_$(REGULARIZER)_$(C)
METRIC_VAL := $(DIR_METRIC)/$(MODEL_NAME).val.txt
PREDICT_VAL := $(DIR_VAL)/$(MODEL_NAME).val.yht
PREDICT_TST := $(DIR_TST)/$(MODEL_NAME).tst.yht
SUBMISSION_TST := $(DIR_SUB)/$(MODEL_NAME)_sub.csv
all: validation submission
validation: $(METRIC_VAL)
submission: $(SUBMISSION_TST)
retrain: clean_$(ALGO_NAME) submission
submit: $(SUBMISSION_TST)
	kaggle competitions submit -c $(COMPETITION) -f $< -m $(MODEL_NAME)
$(PREDICT_TST) $(PREDICT_VAL): $(FEATURE_TRN) $(FEATURE_TST) | $(DIR_VAL) $(DIR_TST)
	python ./src/train_predict_logreg1.py --train-feature-file $< \
                                          --test-feature-file $(word 2, $^) \
                                          --predict-valid-file $(PREDICT_VAL) \
                                          --predict-test-file $(PREDICT_TST) \
                                          --C $(C) \
                                          --regularizer $(REGULARIZER) \
                                          --class_weight $(CLASS_WEIGHT) \
                                          --solver $(SOLVER) \
                                          --retrain
$(METRIC_VAL): $(PREDICT_VAL) $(Y_TRN) | $(DIR_METRIC)
	python ./src/evaluate.py --predict-file $< \
                             --target-file $(lastword $^) > $@
	cat $@
$(SUBMISSION_TST): $(PREDICT_TST) $(HEADER) $(ID_TST) | $(DIR_SUB)
	paste -d, $(lastword $^) $< > $@.tmp
	cat $(word 2, $^) $@.tmp > $@
	rm $@.tmp
.DEFAULT_GOAL := all
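The SUBMISSION_TST rule is the least obvious part of this makefile, so here is a small Python sketch of what its paste and cat steps produce: the header line followed by one id,prediction row per test sample. The function name and the toy values are hypothetical. The length check is an addition for illustration, since paste itself does not complain when the two files differ in length.

```python
def build_submission(header, ids, predictions):
    """Mimic the SUBMISSION_TST rule: header line, then `id,prediction` rows."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    lines = [header]
    lines += [f"{i},{p}" for i, p in zip(ids, predictions)]
    return "\n".join(lines) + "\n"


header = "id,target"         # contents of HEADER (toy value)
ids = ["600000", "600001"]   # contents of ID_TST (toy values)
preds = ["0.12", "0.87"]     # contents of PREDICT_TST (toy values)
print(build_submission(header, ids, preds), end="")
```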
If we would like to run the same model with a different feature, e.g., j1, all we need to do is to change the first line in Makefile.logreg1 to "include Makefile.feature.j1". The pipeline will automatically create this feature using ./src/generate_j1.py and run cross validation with it.
Similarly, if we would like to run cross validation using a different model, such as LightGBM, we need to include the right feature in Makefile.lgb1, and run
$ make -f Makefile.lgb1
If the train and test features are already created, they will not be created again.
After creating several features and model runs, running the ensemble model is similar to running a single model. The ensemble model uses the predictions from single model runs as features. All we need to do is specify which model predictions should be included in the ensemble as base models in Makefile.feature.esb1. The feature names should be the same as the model names defined in the model makefiles.
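Makefile.feature.esb1 is not shown in this post, so the sketch below only illustrates the idea of using base model predictions as ensemble features: each base model contributes one column, named after its model name. The function name, the second model name, and the prediction values are hypothetical.

```python
def stack_predictions(base_predictions):
    """Turn {model_name: [pred, ...]} into (feature_names, rows), one column per model."""
    names = list(base_predictions)
    lengths = {len(p) for p in base_predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all base models must predict the same number of rows")
    rows = [list(r) for r in zip(*(base_predictions[n] for n in names))]
    return names, rows


# Toy validation predictions keyed by model name (hypothetical values).
base = {
    "e1_logreg_l2_1.0": [0.2, 0.7, 0.4],
    "j1_lgb": [0.1, 0.9, 0.5],
}
names, rows = stack_predictions(base)
print(names)
for row in rows:
    print(row)
```

The same stacking would be applied to the validation predictions (to train the ensemble) and to the test predictions (to predict with it), which is why the feature names need to match the model names.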
The final step is to submit our predictions. The Kaggler pipeline allows submitting predictions through the CLI. You need to have the following lines in your model makefile, as in Makefile.lgb1:
submit: $(SUBMISSION_TST)
	kaggle competitions submit -c $(COMPETITION) -f $< -m $(MODEL_NAME)
To make a submission with the predictions from this model and feature, all you need to do is type the following. The submission will include MODEL_NAME as the submission message.
$ make -f Makefile.lgb1 submit
We covered the main components of the Kaggler repository. Hopefully, this blog helps you become more comfortable with the Kaggler pipeline. Happy Kaggling :)
Kaggler repository: https://github.com/kaggler-tv/kaggler-template
Kaggler-TV Episode 4: https://www.youtube.com/watch?v=861NAO5-XJo&feature=youtu.be
Official Kaggle_API: https://github.com/Kaggle/kaggle-api
Kaggler template for cat-in-the-dat-ii: https://github.com/kaggler-tv/cat-in-the-dat-ii
[1] https://www.kaggle.com/cuijamm/simple-onehot-logisticregression-score-0-80801