PAKDDCup 2010 dataset

The preconfigured files, including the dataset, can be downloaded here. Extract the dataset to the ./DataFiles directory. Start training with $ ./ELF PAKDDCup2010 t and predict the test set with $ ./ELF PAKDDCup2010 p. The filenames of the training set and the test set are configured in ./DataFiles/settings.txt.

A short technical report can be found here.
The solution described in the report placed 7th. See: PAKDDCup2010-Results

Here are the results of several optimized algorithms. As the final model we use the one with the best AUC on our internal cross-validation set.
| model | notes | training time, 60k samples [s] | prediction time, 20k samples [s] | cross-validation AUC | cross-validation RMSE | cross-validation classification error | leaderboard AUC |
|---|---|---|---|---|---|---|---|
| LR - linear regression | Retraining, ProbablisticNormalization=yes, λ=0.00358632 | 4200 | 1 | 0.651516 | 0.854269 | 26.044% | 0.6250 |
| NN - neural network | Retraining, ProbablisticNormalization=no, 143 epochs, stochastic gradient descent, Net: 10n, η=3e-5, λ=8e-2 | 3900 | 2 | 0.650801 | 0.854371 | 26.088% | 0.6267 |
| KRR - kernel ridge regression | Retraining(12CV), ProbablisticNormalization=yes, gauss kernel, sigma=11.6269, λ=2.95755e-05 | 7255 | 5968 | 0.659685 | 0.851328 | 25.962% | 0.6236 |
| KRR+GBDT - linear ensemble | Retraining(12CV), ProbablisticNormalization=yes, KRR: gauss kernel, sigma=10.1, λ=4.4e-05, GBDT: 1500epochs, subspaceSize=100, maxLeafs=100, learnrate=0.005, optSplit=yes, calcGlobalMean=yes | 268354 | 6446 | 0.660936 | 0.850941 | 25.95% | 0.6249 |
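
For readers who want to reproduce the three cross-validation metrics outside of ELF, the following is a minimal Python sketch, not ELF's own code. It assumes targets encoded as -1/+1 and a decision threshold of 0 for the classification error. Both are assumptions, and the function names and toy data are hypothetical. The last part applies the selection rule stated above (best cross-validation AUC) to the values from the table.

```python
import math


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes labels are -1 or +1.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rmse(labels, scores):
    """Root mean squared error between targets and real-valued predictions."""
    total = sum((y - s) ** 2 for y, s in zip(labels, scores))
    return math.sqrt(total / len(labels))


def classification_error(labels, scores, threshold=0.0):
    """Fraction of samples whose thresholded prediction differs from the label.

    The threshold of 0 is an assumption that matches -1/+1 targets.
    """
    wrong = sum(
        1 for y, s in zip(labels, scores)
        if (1 if s > threshold else -1) != y
    )
    return wrong / len(labels)


# Hypothetical toy data, only to exercise the three functions.
toy_labels = [1, -1, 1, -1]
toy_scores = [0.8, -0.5, -0.2, 0.1]
toy_auc = auc(toy_labels, toy_scores)
toy_rmse = rmse(toy_labels, toy_scores)
toy_error = classification_error(toy_labels, toy_scores)

# Cross-validation AUC values copied from the results table.
cv_auc = {
    "LR": 0.651516,
    "NN": 0.650801,
    "KRR": 0.659685,
    "KRR+GBDT": 0.660936,
}

# Selection rule from the text: take the model with the best CV AUC.
best_model = max(cv_auc, key=cv_auc.get)
print(best_model, cv_auc[best_model])
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for a small check but slow on a 60k-sample set; a rank-based implementation would be the usual choice there.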