I use two different approaches to selecting which variables to use in models:
- Simply create lots of models with different combinations of variables and see which are best on a holdout sample, or
- Perform some sort of feature selection process to narrow down the list of possible variables.
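As a rough illustration of the first approach, here is a minimal sketch that fits every combination of variables and ranks them by error on a holdout sample. The names `holdout_mse` and `rank_subsets`, the use of ordinary least squares, the 30% holdout fraction, and the synthetic data are all assumptions made for the example, not part of any particular tool.

```python
from itertools import combinations

import numpy as np


def holdout_mse(X_train, y_train, X_test, y_test, cols):
    """Fit least squares on the chosen columns; return MSE on the holdout rows."""
    cols = list(cols)
    # Prepend a column of ones so the model has an intercept.
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    A_test = np.column_stack([np.ones(len(X_test)), X_test[:, cols]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residuals = y_test - A_test @ coef
    return float(np.mean(residuals ** 2))


def rank_subsets(X, y, holdout_fraction=0.3, seed=0):
    """Score every non-empty subset of columns on one random holdout split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * holdout_fraction)
    test, train = idx[:n_test], idx[n_test:]
    results = []
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), k):
            mse = holdout_mse(X[train], y[train], X[test], y[test], cols)
            results.append((mse, cols))
    return sorted(results)  # lowest holdout error first


# Synthetic data (an assumption): only columns 0 and 2 drive the target.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + data_rng.normal(scale=0.1, size=300)

ranking = rank_subsets(X, y)
best_mse, best_cols = ranking[0]
print(best_cols, best_mse)
```

With p candidate variables this fits 2^p - 1 models, so it is only practical for a fairly small number of variables.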
The feature selection can take various forms, but generally starting with a screening process is useful, e.g.:
- Screen out fields with a high proportion of missing values
- For categorical variables, screen out fields where a single value accounts for too high a proportion of records
- Again for categorical variables, screen out fields with too many categories
- Perhaps screen out fields with too little variability (measured by, say, the coefficient of variation, the standard deviation, or both)
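The screening rules above could be sketched roughly as follows. The function name `screen_field` and every threshold (50% missing, 90% in a single value, 50 categories, a coefficient of variation of 0.01) are illustrative assumptions; sensible cut-offs depend on your data.

```python
from collections import Counter
from statistics import mean, pstdev


def screen_field(values, is_categorical,
                 max_missing=0.5, max_single_value=0.9,
                 max_categories=50, min_cv=0.01):
    """Return the reason a field is screened out, or None if it is kept.

    Missing values are represented as None. All thresholds are assumptions.
    """
    n = len(values)
    present = [v for v in values if v is not None]
    if not present or (n - len(present)) / n > max_missing:
        return "too many missing values"
    if is_categorical:
        counts = Counter(present)
        top_share = counts.most_common(1)[0][1] / len(present)
        if top_share > max_single_value:
            return "single value dominates"
        if len(counts) > max_categories:
            return "too many categories"
    else:
        mu, sd = mean(present), pstdev(present)
        # The coefficient of variation is undefined for a zero mean,
        # so fall back to the standard deviation alone in that case.
        variability = sd / abs(mu) if mu != 0 else sd
        if variability < min_cv:
            return "not enough variability"
    return None


# Made-up fields: name -> (values, is_categorical)
fields = {
    "age": ([23, 35, 41, 58, 62, 29], False),
    "mostly_missing": ([None, None, None, None, 1.0, None], False),
    "country": (["UK"] * 19 + ["FR"], True),
    "customer_id": ([f"id{i}" for i in range(100)], True),
    "constant": ([5.0, 5.0, 5.0, 5.0], False),
}
reasons = {name: screen_field(v, cat) for name, (v, cat) in fields.items()}
kept = [name for name, reason in reasons.items() if reason is None]
print(kept)
```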
Then it is helpful to assign the variables a measure of "importance". Again, there are many ways that this can be done, depending on whether your variables are categorical or continuous. A useful definition is:
Importance. A measure used to rank fields or results on a percentage scale, defined broadly as 1 minus the p value, or the probability of obtaining a result as extreme or more extreme than the observed result by chance alone. The measure used to rank importance depends on whether the predictors and the target are all categorical, all numeric ranges, or a mix of range and categorical. Despite the differences in computation, the use of a standard percentage scale allows comparisons across different types of fields and results.
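To make the "1 minus the p value" idea concrete, here is a minimal sketch for a numeric predictor and a numeric target. The definition above does not say which test produces the p value, so this example swaps in a permutation test on the Pearson correlation purely for illustration; the function names, the number of permutations, and the synthetic data are all assumptions.

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def importance(x, y, n_permutations=1000, seed=0):
    """Importance = 1 - p, with p estimated by shuffling the target."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Count shuffles at least as extreme as the observed correlation.
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # The +1 terms keep the estimated p value strictly above zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return 1.0 - p_value


# Synthetic data (an assumption): the target depends on `signal` only.
data_rng = random.Random(1)
signal = [data_rng.gauss(0, 1) for _ in range(100)]
noise = [data_rng.gauss(0, 1) for _ in range(100)]
target = [2 * s + data_rng.gauss(0, 1) for s in signal]

imp_signal = importance(signal, target)
imp_noise = importance(noise, target)
print(imp_signal, imp_noise)
```

Because every field ends up on the same 0 to 1 scale, fields scored with different tests (for example, a chi-square test for categorical pairs) can still be ranked together, which is the point of the definition.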
In addition to simply selecting existing variables, you might also choose to modify them, either by cleaning the data (always a good idea anyway), perhaps binning or recoding values, or by performing an operation such as principal components analysis to reduce dimensionality. The latter is also a very useful technique if you have a lot of multicollinearity in your dataset.
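As a sketch of the dimensionality-reduction step, principal components can be computed from a singular value decomposition of the centred data. The function name `pca` and the synthetic data are assumptions for illustration; this version only centres the columns, whereas in practice you would usually also standardise them when they are on different scales.

```python
import numpy as np


def pca(X, n_components):
    """Return component scores and the fraction of variance each explains."""
    Xc = X - X.mean(axis=0)  # centre each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    scores = Xc @ Vt[:n_components].T  # project onto the leading components
    return scores, explained[:n_components]


# Synthetic data (an assumption): three nearly collinear columns plus one
# independent column, i.e. heavy multicollinearity.
rng = np.random.default_rng(0)
z = rng.normal(size=200)
X = np.column_stack([
    z,
    z + 0.05 * rng.normal(size=200),
    -z + 0.05 * rng.normal(size=200),
    rng.normal(size=200),
])

scores, explained = pca(X, n_components=2)
print(scores.shape, explained)
```

The resulting component scores are uncorrelated with one another, which is why replacing the original columns with them sidesteps the multicollinearity problem.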