I’d like to open up the toolbox that I’ve built for data mining competitions and share it with you.
Let me start with my setup.
I have access to 2 machines:
- Laptop – MacBook Pro Retina 15″, OS X Yosemite, i7 2.3GHz 4 Core CPU, 16GB RAM, GeForce GT 750M 2GB, 500GB SSD
- Desktop – Ubuntu 14.04, i7 5820K 3.3GHz 6 Core CPU, 64GB RAM, GeForce GT 620 1GB, 120GB SSD + 3TB HDD
I purchased the desktop on eBay for around $2,000 a year ago (September 2014).
As the code repository and version control system, I use git.
It’s useful for collaboration with other team members. It makes it easy to share the code base, keep track of changes, and resolve conflicts when two people change the same code.
It’s also useful when I work by myself. It helps me reuse and improve the code from previous competitions.
S3 / Dropbox
I use S3 to share files between my machines. It is cheap – it costs me about $0.10 per month on average.
I use Dropbox to share files between team members.
For flow control or pipelining, I use makefiles (or GNU make).
It modularizes the long process of a data mining competition into feature extraction, single model training, and ensemble model training, and controls workflow between components.
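What makes make a good fit for this kind of pipeline is its rebuild rule: a target is regenerated only when it is missing or older than one of its prerequisites, so changing one feature file only retrains the models that depend on it. Here is a minimal Python sketch of that check, for illustration only. The `needs_rebuild` helper and the file names in `demo` are hypothetical, and real make does much more (phony targets, order-only prerequisites, pattern rules).

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def demo():
    """Walk through the three states: missing, up to date, stale."""
    with tempfile.TemporaryDirectory() as d:
        raw = os.path.join(d, "train.csv")            # hypothetical raw data
        feat = os.path.join(d, "feature3.trn.sps")    # hypothetical feature file

        open(raw, "w").close()
        missing = needs_rebuild(feat, [raw])          # target does not exist yet

        open(feat, "w").close()
        os.utime(raw, (100, 100))                     # raw data is older
        os.utime(feat, (200, 200))                    # feature file is newer
        fresh = needs_rebuild(feat, [raw])

        os.utime(raw, (300, 300))                     # raw data changed afterwards
        stale = needs_rebuild(feat, [raw])
        return missing, fresh, stale
```

Calling `demo()` exercises all three cases: a missing feature file, an up-to-date one, and one that is stale because the raw data changed after it was built.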
For example, I have a top-level makefile that defines the raw data file locations, folder hierarchy, and target variable.
# directories
DIR_DATA := data
DIR_BUILD := build
DIR_FEATURE := $(DIR_BUILD)/feature
DIR_VAL := $(DIR_BUILD)/val
DIR_TST := $(DIR_BUILD)/tst
...
DATA_TRN := $(DIR_DATA)/train.csv
DATA_TST := $(DIR_DATA)/test.csv
...
Y_TRN := $(DIR_DATA)/y.trn.yht
...
$(Y_TRN): $(DATA_TRN)
	cut -d, -f2 $< | tail -n +2 > $@
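The last rule builds the target file by taking the second comma-separated column of the training CSV (`cut -d, -f2`) and dropping the header row (`tail -n +2`). If you prefer to do this step in Python, a rough equivalent might look like the sketch below. The `extract_target` name and the sample columns are made up for illustration. One difference is that `csv.reader` understands quoted fields containing commas, while `cut` does not.

```python
import csv
import io


def extract_target(lines, target_col=1):
    """Return the target column as strings, skipping the header row.

    `lines` is any iterable of CSV lines (an open file works).
    `target_col` is zero-based, so 1 matches `cut -f2`.
    """
    reader = csv.reader(lines)
    next(reader, None)  # drop the header, like `tail -n +2`
    return [row[target_col] for row in reader]


# Hypothetical training data with the target in the second column.
sample = io.StringIO("id,target,x1\n1,0,3.5\n2,1,4.0\n")
targets = extract_target(sample)
```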
Then, I have makefiles for features that include the top-level makefile and define how to generate training and test feature files in various formats (CSV, libSVM, VW, libFFM, etc.).
include Makefile

FEATURE_NAME := feature3

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.sps
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.sps

FEATURE_TRN_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).trn.ffm
FEATURE_TST_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).tst.ffm

$(FEATURE_TRN) $(FEATURE_TST): $(DATA_TRN) $(DATA_TST) | $(DIR_FEATURE)
	src/generate_feature3.py --train-file $< \
	                         --test-file $(lastword $^) \
	                         --train-feature-file $(FEATURE_TRN) \
	                         --test-feature-file $(FEATURE_TST)

%.ffm: %.sps
	src/svm_to_ffm.py --svm-file $< \
	                  --ffm-file $@ \
	                  --feature-name $(FEATURE_NAME)
...
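The `%.ffm: %.sps` pattern rule converts any libSVM-format feature file into libFFM format. A libSVM line looks like `label index:value ...`, while libFFM expects `label field:index:value ...`. My actual `src/svm_to_ffm.py` isn't shown here, so take the following only as a sketch of the idea. The `svm_to_ffm_line` function and its `field_of` argument are hypothetical, and the default of treating every feature as its own field is an assumption; how features map to fields depends on the feature set.

```python
def svm_to_ffm_line(line, field_of=None):
    """Convert one libSVM line to libFFM format.

    `field_of` maps a feature index (int) to a field id. If omitted,
    each feature index is used as its own field (an assumption made
    for this sketch).
    """
    tokens = line.split()
    if not tokens:
        return ""
    label, pairs = tokens[0], tokens[1:]
    out = [label]
    for pair in pairs:
        idx, val = pair.split(":")
        field = field_of(int(idx)) if field_of else int(idx)
        out.append(f"{field}:{idx}:{val}")
    return " ".join(out)


# One field per feature (default), and a made-up grouping of 10 features per field.
default_line = svm_to_ffm_line("1 3:1 10:0.5")
grouped_line = svm_to_ffm_line("1 3:1 10:0.5", field_of=lambda i: i // 10)
```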
Then, I have makefiles for single model training that include a feature makefile and define how to train a single model and produce CV and test predictions.
include Makefile.feature.feature3

N = 400
DEPTH = 8
LRATE = 0.05
ALGO_NAME := xg_$(N)_$(DEPTH)_$(LRATE)
MODEL_NAME := $(ALGO_NAME)_$(FEATURE_NAME)
...
PREDICT_VAL := $(DIR_VAL)/$(MODEL_NAME).val.yht
PREDICT_TST := $(DIR_TST)/$(MODEL_NAME).tst.yht
SUBMISSION_TST := $(DIR_TST)/$(MODEL_NAME).sub.csv

all: validation submission
validation: $(METRIC_VAL)
submission: $(SUBMISSION_TST)
retrain: clean_$(ALGO_NAME) submission

$(PREDICT_TST) $(PREDICT_VAL): $(FEATURE_TRN) $(FEATURE_TST) \
                               | $(DIR_VAL) $(DIR_TST)
	./src/train_predict_xg.py --train-file $< \
	                          --test-file $(word 2, $^) \
	                          --predict-valid-file $(PREDICT_VAL) \
	                          --predict-test-file $(PREDICT_TST) \
	                          --depth $(DEPTH) \
	                          --lrate $(LRATE) \
	                          --n-est $(N)

$(SUBMISSION_TST): $(PREDICT_TST) $(ID_TST) | $(DIR_TST)
	paste -d, $(lastword $^) $< > $@
...
Then, I have makefiles for ensemble features that define which single model predictions are included in ensemble training.
include Makefile

FEATURE_NAME := esb9

BASE_MODELS := xg_600_4_0.05_feature9 \
               xg_400_4_0.05_feature6 \
               ffm_30_20_0.01_feature3 \
               ...

PREDICTS_TRN := $(foreach m, $(BASE_MODELS), $(DIR_VAL)/$(m).val.yht)
PREDICTS_TST := $(foreach m, $(BASE_MODELS), $(DIR_TST)/$(m).tst.yht)

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.csv
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.csv

$(FEATURE_TRN): $(Y_TRN) $(PREDICTS_TRN) | $(DIR_FEATURE)
	paste -d, $^ > $@

$(FEATURE_TST): $(Y_TST) $(PREDICTS_TST) | $(DIR_FEATURE)
	paste -d, $^ > $@
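The two `paste -d,` rules are where the ensemble features get built: the target file and each base model's prediction file are joined line by line, so every row ends up as `target,pred_1,pred_2,...`. To make that concrete, here is a small Python sketch of the same operation. The `paste_columns` helper and the sample values are made up for illustration. Unlike `paste`, which pads short files with empty fields, this version raises an error on a length mismatch, since that usually means a prediction file is broken.

```python
def paste_columns(columns, delimiter=","):
    """Join equally long columns row by row, like `paste -d,`.

    `columns` is a list of columns, each a list of strings
    (for example the lines of one prediction file).
    """
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return [delimiter.join(row) for row in zip(*columns)]


# Hypothetical target values and CV predictions from two base models.
y = ["1", "0"]
pred_a = ["0.9", "0.2"]
pred_b = ["0.8", "0.1"]
ensemble_rows = paste_columns([y, pred_a, pred_b])
```

In practice each column would come from reading one of the `.yht` files with something like `open(path).read().splitlines()`.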
Finally, I can (re)produce the submission from the XGBoost ensemble of 9 single models described in Makefile.feature.esb9 by (1) replacing include Makefile.feature.feature3 in Makefile.xg with include Makefile.feature.esb9 and (2) running:
$ make -f Makefile.xg
When I’m connected to the Internet, I always ssh to the desktop for its computational resources (mainly for RAM).
I followed Julian Simioni’s tutorial to allow remote SSH connections to the desktop. It needs an additional system with a publicly accessible IP address. You can set up an AWS micro (or free tier) EC2 instance for it.
tmux allows you to keep your SSH sessions even when you get disconnected. It also lets you split/add terminal screens in various ways and switch easily between them.
The documentation might look overwhelming, but all you need is:
# If there is no tmux session:
$ tmux
# If you created a tmux session, and want to connect to it:
$ tmux attach
Then, to create a new pane/window and navigate between them:
Ctrl + b, then " – to split the current pane horizontally (into top and bottom panes).
Ctrl + b, then % – to split the current pane vertically (into left and right panes).
Ctrl + b, then o – to move to the next pane in the current window.
Ctrl + b, then c – to create a new window.
Ctrl + b, then n – to move to the next window.
To close a pane/window, just type exit in the pane/window.
Hope this helps.
Next up: the machine learning tools I use.
Please share your setups and thoughts too. 🙂
Kaggler. Data Scientist. Father of Five.