Welcome to the OR-Exchange, your site for questions and answers in operations research.

2

1

With regard to this blog post: I have collected a number of datasets on different variables that may contribute to homicide incidents. I would like to know which of these variables contribute significantly to the number of homicides in an area. I was planning to use our archaic friend, ANOVA, to test whether each variable has a significant effect, and then include the dominant ones in my model.

I just want to know if there is any better (or perhaps more modern) tool for finding good signals. Perhaps Bayesian inference? :)

4 Answers

2

For out-of-the-box analysis of good signals, I recommend decision trees. Good proprietary tools include Angoss KnowledgeSeeker. Open source options include R with the "rpart" library, which implements CART. Here is a good link for an R tutorial.

http://www.statmethods.net/advstats/cart.html
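As a rough illustration of why trees surface good signals, here is a minimal sketch of the first step of a CART-style regression tree: for each predictor, find the single binary split that most reduces the squared error of the target, then rank predictors by that reduction. It is written in plain Python rather than with rpart, the function names are made up for this example, and the data are invented toy values, not real homicide figures.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, error_reduction) for the best binary split on x."""
    pairs = sorted(zip(x, y))
    total = sse([t for _, t in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between two equal predictor values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        gain = total - sse(left) - sse(right)
        if gain > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best


def rank_predictors(columns, y):
    """Rank predictors by the share of target variation removed by one split."""
    total = sse(y)
    if total == 0:
        return [(name, 0.0) for name in columns]
    scores = {name: best_split(x, y)[1] / total for name, x in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy data, for illustration only.
homicide_rate = [2.0, 2.5, 3.0, 7.5, 8.0, 9.0]
columns = {
    "unemployment": [3.0, 4.0, 5.0, 9.0, 10.0, 11.0],
    "rainfall": [50.0, 90.0, 60.0, 55.0, 95.0, 65.0],
}
ranking = rank_predictors(columns, homicide_rate)
for name, score in ranking:
    print(name, round(score, 3))
```

A full tree repeats this search inside each resulting group, which is how a predictor can matter in one branch and not in another.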

2

There are a number of data mining techniques (including CART, which larrydag mentioned) that can create a tree structure where you get locally accurate (or at least sort of accurate) predictions in the leaf nodes. Predictors may be significant in some leaf nodes and insignificant in others.

Another possibility would be to run "best subsets regression" with homicides per capita as the dependent variable and most/all of your available predictors (assuming the number of possible predictors is not too large). Look at the best few models of each size and see which predictors are being used.

A statistician would probably say, though, that you should first posit a model for homicides per capita, including interactions of predictors if appropriate (and nonlinear terms if appropriate), rather than throwing things against the wall and seeing what sticks. I'm of two minds about it myself.
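To make the idea concrete, here is a minimal sketch of best subsets regression using only the Python standard library: it fits ordinary least squares on every combination of predictors and keeps the few with the smallest residual sum of squares (RSS) at each size. The helper names, the `top` parameter, and the data are all assumptions made up for this example; a statistics package would normally do the fitting, and the sketch assumes the predictors are not perfectly collinear.

```python
from itertools import combinations


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]  # fails if predictors are collinear
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(columns, y):
    """RSS of a least-squares fit of y on the columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    return sum((t - f) ** 2 for t, f in zip(y, fitted))


def best_subsets(data, y, top=2):
    """For each subset size, return the `top` subsets with the smallest RSS."""
    result = {}
    for size in range(1, len(data) + 1):
        fits = [(rss([data[n] for n in names], y), names)
                for names in combinations(sorted(data), size)]
        result[size] = sorted(fits)[:top]
    return result


# Invented toy data: the target depends on poverty and density only.
data = {
    "poverty": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "density": [5.0, 3.0, 8.0, 1.0, 7.0, 2.0, 6.0, 4.0],
    "rainfall": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0],
}
homicides_per_capita = [1 + 2 * p + 0.5 * d
                        for p, d in zip(data["poverty"], data["density"])]
result = best_subsets(data, homicides_per_capita)
for size, fits in result.items():
    print(size, [names for _, names in fits])
```

RSS never increases as predictors are added, so it is only meaningful for comparing models of the same size; comparing across sizes needs a penalized criterion such as adjusted R-squared or a holdout sample.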

1

If you are assuming generally linear relationships (though this can be modified for nonlinear ones), I would recommend DOE (Design of Experiments). This method is useful for formulating models of systems. Also, when done properly, it gives you an estimate of the expected noise. I find it especially useful for separating relevant data from irrelevant data. Most people think of it as a way to design an experiment so as to easily find a model, but it can also be used to find models from existing data.

1

I use two different approaches to selecting which variables to use in models:

  1. Simply create lots of models with different combinations of variables and see which are best on a holdout sample, or
  2. Perform some sort of feature selection process to narrow down the list of possible variables.

The feature selection can take various forms, but generally starting with a screening process is useful, e.g.:

  • Screen out fields with a high proportion of missing values
  • For categorical variables, screen out fields that have too high a proportion of a single value
  • Again for categorical variables, screen out fields with too many categories
  • Perhaps screen out fields with not enough variability (using a measure like the coefficient of variation or the standard deviation, or both)
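The screening rules above can be sketched as a single function. This is only an illustration: the function name and all four thresholds (50% missing, 90% single value, 20 categories, coefficient of variation of 0.01) are arbitrary assumptions that you would tune to your own data, missing values are assumed to be encoded as `None`, and the example fields are invented.

```python
from collections import Counter
from statistics import fmean, pstdev


def screen_field(values, categorical, max_missing=0.5, max_single=0.9,
                 max_categories=20, min_cv=0.01):
    """Return a reason to drop the field, or None if it passes screening.

    Assumes `values` is non-empty and missing entries are None.
    """
    present = [v for v in values if v is not None]
    if 1 - len(present) / len(values) > max_missing:
        return "too many missing values"
    if categorical:
        counts = Counter(present)
        if max(counts.values()) / len(present) > max_single:
            return "single value dominates"
        if len(counts) > max_categories:
            return "too many categories"
    else:
        mean = fmean(present)
        spread = pstdev(present)
        # Coefficient of variation; undefined at mean 0, so treat as large.
        cv = spread / abs(mean) if mean != 0 else float("inf")
        if spread == 0 or cv < min_cv:
            return "not enough variability"
    return None


# Invented example fields: name -> (values, is_categorical).
fields = {
    "region": (["north", "south", "east", "west"] * 5, True),
    "country": (["US"] * 19 + ["CA"], True),
    "income": ([None] * 12 + [30.0, 41.5, 28.0, 55.0, 39.0, 47.0, 33.0, 60.0], False),
    "street": ([f"street {i}" for i in range(40)], True),
    "altitude": ([100.0, 100.1, 99.9, 100.0], False),
    "unemployment": ([3.0, 5.5, 8.0, 12.0], False),
}
report = {name: screen_field(vals, cat) for name, (vals, cat) in fields.items()}
for name, reason in report.items():
    print(name, "->", reason or "keep")
```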

Then it is helpful to assign the variables a measure of "importance". Again, there are many ways that this can be done, depending on whether your variables are categorical or continuous. A useful definition is:

Importance. A measure used to rank fields or results on a percentage scale, defined broadly as 1 minus the p value, or the probability of obtaining a result as extreme or more extreme than the observed result by chance alone. The measure used to rank importance depends on whether the predictors and the target are all categorical, all numeric ranges, or a mix of range and categorical. Despite the differences in computation, the use of a standard percentage scale allows comparisons across different types of fields and results.
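As one possible reading of "1 minus the p value", here is a minimal sketch for the case of a numeric predictor and a numeric target. It is not the calculation any particular tool uses: the p value here comes from a permutation test on the Pearson correlation, swapped in because it needs nothing beyond the standard library. The function names, the number of permutations, the seed, and the data are all assumptions for illustration, and the inputs must not be constant.

```python
import random
from statistics import fmean


def correlation(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def importance(x, y, n_permutations=2000, seed=0):
    """Return 1 - p, with p from a two-sided permutation test on correlation."""
    rng = random.Random(seed)
    observed = abs(correlation(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(correlation(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p value strictly above zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return 1 - p_value


# Invented toy data: `signal` tracks the target closely, `noise` does not.
target = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2, 18.0, 19.7, 22.1, 24.0]
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
noise = [5.0, 1.0, 9.0, 3.0, 12.0, 7.0, 2.0, 11.0, 6.0, 10.0, 4.0, 8.0]

signal_importance = importance(signal, target)
noise_importance = importance(noise, target)
print(round(signal_importance, 3), round(noise_importance, 3))
```

Ranking many fields this way invites false positives, since some unrelated fields will score highly by chance, so the ranking is better treated as a screening aid than as a formal test.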

In addition to simply selecting existing variables, you might also choose to modify them, either by cleaning the data (always a good idea anyway), binning values, recoding, etc., or by performing an operation such as principal components analysis to reduce dimensionality. The latter is also a very useful technique if you have a lot of multicollinearity in your dataset.
