Winning Models for Grade Point Average, Grit, and Layoff in the Fragile Families Challenge
In this article, the authors discuss and analyze their approach to the Fragile Families Challenge. The data consisted of more than 12,000 features (covariates) about the children and their parents, schools, and overall environments from birth to age 9. The authors' modular and collaborative approach parallelized prediction tasks and relied primarily on existing data science techniques, including (1) data preprocessing: elimination of low variance features, imputation of missing data, and construction of composite features; (2) feature selection through univariate mutual information and extraction of nonzero least absolute shrinkage and selection operator coefficients; (3) three machine learning models: random forest, elastic net, and gradient-boosted trees; and finally (4) prediction aggregation according to performance. The top-performing submissions produced winning out-of-sample predictions for three outcomes: grade point average, grit, and layoff. However, predictions were at most 20 percent better than a baseline that predicted the mean value of the training data for each outcome.