Welcome to the OR-Exchange, your site for questions and answers in operations research.

5

1

INFORMS has recently announced its data mining competition. They have posted the famous billion-dollar question of "how to do day trading", and whoever solves the challenge may get some recognition (and no money from INFORMS!).

I am wondering what the interesting models, resources, and software for tackling this question are. I have used R and its glm function to perform logistic regressions. The following is my code, which gets an AUC of 0.611879 (I am currently in 3rd place). Feel free to improve upon it. (Additionally, any citation of this code, or anything else that can help me get a faculty position in a year, is hugely appreciated.)

R code (I ran it on Ubuntu Linux). Please note that I developed this piece of code for personal use, and I have noticed there are wording and grammatical errors in the comments.

# Logit Regression Model
rm(list = ls(all = TRUE))
# Spaces in a path need no escaping inside an R string ('\ ' is an invalid escape)
setwd('~/Desktop/Informs Datamining Contest/Data/')
training.data <- read.csv("TrainingData.csv", header = TRUE) 

head(training.data)

plot(training.data$Timestamp,training.data$Variable141LAST_PRICE, type='l')
attach(training.data)

#Finding the maximum positive and negative correlation
names(training.data)
corrs <- NULL

for (stock.name in names(training.data)) {
    correlation <- cor(eval(as.name(eval(stock.name))),TargetVariable)
    corrs <- rbind(corrs, correlation)
    print(paste("Correlation Between TargetStock and", stock.name," = ", correlation))
}
correls <-data.frame(ComparedStock = names(training.data), Correlations = corrs)
correls[order(correls$Correlations),] 

write.table(correls[order(correls$Correlations),],file="Correlations.csv",sep=",",
            quote=FALSE)

# Prediction Model
# Logistic Regression Model

mylogit <- glm(TargetVariable ~ Variable101OPEN + Variable101LOW + Variable101HIGH +
                   Variable101LAST_PRICE + Variable133LOW + Variable133OPEN + Variable78LOW,
               family = binomial(link = "logit"), na.action = na.pass)

#Reading Result Data
result.data <- read.csv("ResultData.csv", header = TRUE) 
head(result.data)
FinalPrediction<-predict(mylogit,newdata=result.data,type="response")

#Saving Data
template.data <- read.csv("template.csv", header = TRUE) 
my.output<-data.frame(template.data$Timestamp,FinalPrediction)
head(my.output)
names(my.output) <-c("Timestamp","TargetVariable")

write.table(my.output,file="submit.csv",sep=",",row.names=FALSE,
            quote=FALSE)

My email address is linux_jvm@yahoo.com if you have any questions.

That's good. I'll give it a try. I'm curious why trading volume wasn't considered as a variable in the data. I would think it would be somewhat predictive. – larrydag Jun 27 at 20:50

6 Answers

1

Damn... I've had a model running for 4 days and now I have to reboot my laptop before it's finished!

1 
Please tell us about your model once you have found a better one and feel comfortable sharing your thoughts. I am ready to learn about some of the models that others are using. – Mark Jul 5 at 1:04
1 
I would also like to know that. You don't have to go into details; a brief overview is absolutely enough. – mare Jul 8 at 21:38
1 
Sorry, I'm desperately busy... I have to hand in my PhD thesis this week, will update later. Essentially it's all about deriving new variables in addition to those already there to incorporate trends etc. For a basic example: newvar1(t) = var1(t) - var1(t-1) – David Woods Jul 9 at 0:41
David, that is also what I was doing. But this resembles only a one-day trend. Interesting other aspects may be longer time-period trends. But definitely the differences in values are extremely important for the predictions. – mare Jul 11 at 11:32
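A minimal sketch of the derived-variable idea from the comments above, in base R. The data frame, the column name var1, and the window lengths are made up for illustration; nothing here is taken from the contest data or from either poster's actual code.

```r
# Toy series standing in for one price column (hypothetical values).
prices <- data.frame(var1 = c(10, 12, 11, 15, 14, 18))

# Difference over k periods: x(t) - x(t - k).
# The first k entries have no predecessor, so they are set to NA.
lag_diff <- function(x, k = 1) {
  c(rep(NA, k), diff(x, lag = k))
}

prices$var1_d1 <- lag_diff(prices$var1, k = 1)  # one-period change
prices$var1_d3 <- lag_diff(prices$var1, k = 3)  # change over a longer window
print(prices)
```

A larger k is one simple way to capture the longer-period trends mentioned above; which windows actually help would have to be checked against a held-out part of the training data.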
2

FYI: There's another, somewhat related DM competition - hosted @ AnalyticBridge community:

AnalyticBridge Competition: Investigate the spectacular stock market collapse of May 6, 2010

Due date: August 15, 2010

URL: http://www.analyticbridge.com/group/dataminingcompetitions/forum/topics/analyticbridge-competition

1 
And this competition comes with a $1000 prize – larrydag Jul 2 at 11:54
2 
It's a bit undefined though... just "do some analysis". I read a really interesting analysis a few weeks ago; I don't know how accurate it is, but it's thought-provoking: nanex.net/20100506/FlashCrashAnalysis_Intro.html – David Woods Jul 2 at 13:05
Upvote for the $1000 prize :) – Mark Jul 2 at 19:52
1

This reddit thread is also interesting

1 
That is interesting... the first part seems to be the tired old arguments of people who misunderstand economic theory and argue that predicting the market isn't possible! Meanwhile, other people go out and make loads of money doing exactly that. Good distinction between uncertainty and randomness. – David Woods Jun 28 at 23:11
3

I think all variables are stock prices: VariablexxxOPEN should be the price at the beginning of the period, VariablexxxLAST_PRICE the price at the end of the period, and VariablexxxLOW and VariablexxxHIGH the lowest and highest prices observed during the period.

Each period is exactly 5 minutes. Using the diff() function, one can compute the difference between two consecutive timestamps: it is generally 0.003472, except after every 79th observation, where it is larger. If we compute Timestamp[i + 79] - Timestamp[i], we get exactly 1, which should correspond to one day. Finally, we can conclude by the rule of three that every period corresponds to 5 minutes:

1*24*60 = 1440 minutes <=> 1/0.003472 = 288 periods

1440/288 = 5 minutes <=> 1 period
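The timestamp check can be sketched in base R as follows. The timestamps here are synthetic (three days of 79 five-minute bars, with an arbitrary starting value of 40000), built only to mimic the pattern described; with the real file one would use training.data$Timestamp in place of ts.

```r
# Synthetic timestamps measured in days: 3 trading days, 79 bars per day.
# The starting value 40000 is an arbitrary assumption for illustration.
bars_per_day <- 79
step <- 5 / (24 * 60)                       # 5 minutes expressed in days
ts <- as.vector(outer((0:(bars_per_day - 1)) * step, 40000 + 0:2, "+"))

gaps <- diff(ts)                            # differences between consecutive rows
typical_gap <- median(gaps)                 # the usual within-day spacing
minutes_per_period <- typical_gap * 24 * 60
overnight_rows <- which(gaps > 1.5 * typical_gap)  # rows followed by a larger jump
day_jump <- ts[1 + bars_per_day] - ts[1]    # change over 79 rows

cat("minutes per period:", round(minutes_per_period, 3), "\n")
cat("rows followed by a larger gap:", overnight_rows, "\n")
cat("timestamp change over 79 rows:", day_jump, "\n")
```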

(PS: When you compute the correlation, you should use "get(stock.name)" instead of "eval(as.name(eval(stock.name)))".)
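For illustration, here is a rough sketch of the correlation loop written without attach(). It swaps in `[[` indexing, a close relative of the get() suggestion rather than the suggestion itself, and the use = "pairwise.complete.obs" argument is an assumption about how to treat missing values. The toy data frame and its column names are hypothetical.

```r
# Toy data frame standing in for training.data (hypothetical columns).
set.seed(1)
toy <- data.frame(
  TargetVariable = rbinom(50, 1, 0.5),
  Variable1OPEN  = rnorm(50),
  Variable2OPEN  = rnorm(50)
)
toy$Variable2OPEN[1:5] <- NA   # a few missing values, for illustration

# Correlate every predictor column with the target, looking columns up by name.
predictors <- setdiff(names(toy), "TargetVariable")
corrs <- sapply(predictors, function(nm) {
  cor(toy[[nm]], toy$TargetVariable, use = "pairwise.complete.obs")
})

correls <- data.frame(ComparedStock = predictors, Correlations = unname(corrs))
print(correls[order(correls$Correlations), ])
```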

Thanks for get(stock.name) – Mark Jun 28 at 22:01
Good spotting on the timestamp analysis. – David Woods Jun 29 at 1:16
3

I found that fitting a decision tree first helps with choosing variables. In R I used the package rpart. I then used the variables from the decision tree in a logistic regression model similar to Mark's method in R. My first entry is starting out with an AUC of 0.642941.
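One possible reading of this two-step approach, sketched on synthetic data. The answer does not say how the variables were picked from the tree, so taking the top two entries of the tree's variable.importance is purely an assumption, as are the toy columns x1 to x4.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(42)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Hypothetical binary target that depends on x1 and x3 only.
toy$TargetVariable <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1.5 * toy$x3))

# Step 1: fit a classification tree and rank the variables it found useful.
tree <- rpart(TargetVariable ~ ., data = toy, method = "class")
importance <- tree$variable.importance      # named vector, largest first
selected <- names(importance)[seq_len(min(2, length(importance)))]

# Step 2: logistic regression restricted to the selected variables.
logit <- glm(reformulate(selected, response = "TargetVariable"),
             data = toy, family = binomial(link = "logit"))
print(selected)
print(coef(logit))
```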

Nice idea. I will post here if I get a better model. I am trying to use some machine learning models. Perhaps we can post our models here and encourage others to contribute. This way we can make or-exchange more active and bring in more traffic. – Mark Jun 28 at 4:23
2

This is such an interesting challenge - I'm just starting to work on this type of thing anyway, on currency trading, so I might see if I can adapt my techniques...


Is it just me or is that data awful? There are loads of variables with mostly missing values and some zeroes. It would be really useful if they gave some context to what the variables represent; are they the prices of other shares? How come there isn't a variable for the price of the share we're predicting, or if there is why isn't it identified? I know they want to make it anonymous, but they're making it overly hard I think.

Also some concerns about their scoring engine... as one of the entrants there noted, if you multiply all your scores by -1 then you should get 1-AUC... so a low score of 0.25 is equivalent to a high score of 0.75. But this doesn't happen when you resubmit with this transformation!
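The 1-AUC property is easy to check locally. Below is a small sketch using a rank-based AUC (the Mann-Whitney form) on made-up scores and labels; it says nothing about how the contest's scoring engine is actually implemented.

```r
# Rank-based AUC, equivalent to the Mann-Whitney U statistic.
auc <- function(scores, labels) {
  r <- rank(scores)                  # average ranks are used for ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(7)
labels <- rbinom(200, 1, 0.5)
scores <- rnorm(200) + 0.5 * labels  # hypothetical model scores

a <- auc(scores, labels)
a_flipped <- auc(-scores, labels)    # same scores multiplied by -1
cat("AUC:", a, " AUC of negated scores:", a_flipped,
    " sum:", a + a_flipped, "\n")
```

Negating the scores reverses every rank, so the two values should sum to 1 up to floating-point error.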

I should mention I'm InflectionPoint, with a current AUC of 0.672 – David Woods Jun 28 at 4:28
1 
I totally agree. To me, just the fact that I don't know the meaning of these indicators is bothersome. Anybody who has worked on automated trading knows how important these indicators are. Somebody can build a neural network and learn from the dataset quickly, but so what? As long as we don't contribute to the underlying science, it seems useless. I can correct for missing data in some of the variables, but many of them are empty! – Mark Jun 28 at 4:30
I'd also like to know what the timestamp represents... it looks like some sort of tick id, but apparently the data represent 5-minute candlesticks... – David Woods Jun 28 at 6:09
It's also a bit unrealistic that we don't seem to have the values for the series we're trying to predict, just that binary indicator. I understand why they've left it out, but it lessens the potential power of our models. – David Woods Jun 29 at 1:17
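As a rough sketch of one way to deal with the mostly-empty variables discussed in this answer: measure the fraction of missing values per column and drop the columns above a cutoff. The toy data frame and the 50% threshold are arbitrary assumptions, not anything specified by the contest.

```r
# Toy data frame with columns of varying completeness (hypothetical names).
toy <- data.frame(
  full_col   = c(1, 2, 3, 4, 5),
  sparse_col = c(NA, NA, NA, NA, 5),
  empty_col  = rep(NA_real_, 5),
  some_zero  = c(0, 1, NA, 0, 2)
)

# Fraction of missing values in each column.
na_frac <- colMeans(is.na(toy))
print(round(na_frac, 2))

# Keep columns with at most 50% missing; the threshold is an arbitrary choice.
keep <- names(na_frac)[na_frac <= 0.5]
cleaned <- toy[, keep, drop = FALSE]
print(names(cleaned))
```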
