Sunday, April 21, 2013

Generalized linear modeling with highly dimensional data

Question from a student, University of Missouri-Kansas City:

Hi guys,
I have project in Regression class, and we have to use R to do it,but till now I didn't find appropriate code for this project, and I dont now which method I have to use.

I have to analysis of a high dimensional dataset. The data has a total of 500 features.

we have no knowledge as to which of the features are useful and which not. Thus we want to apply model selection techniques to obtain a subset of useful features. What we have to do is the following:

a) There are totally 2000 observations in the data. Use the first 1000 to train or fit your model, and the other 1000 for prediction.

b) You will report the number of features you select and the percentage of response you correctly predict. Your project is considered valid only if the obtained percentage exceeds 54%.

Please help me as much as you can.
Your help would be appreciated..
Thank you!


well, doing batches of 30 variables I came across 88 of the 500 that minimize AIC for each batch:

t1=read.csv("qw.csv", header=FALSE)
# not a good solution -- better to get 1000 records randomly, but this is enough for now:
(bestAIC = bestglm(xy, IC="AIC"))

, and so on, going from  x=train_data[,2:31] to x=train_data[32:61], etc. Each run gives you a list of best variables to minimize AIC (I chose AIC but it can be any other criterion).

If I try to process more than 30 (or 31) columns with bestglm it takes too much time because it uses other programs and optimization is different... and clearly inefficient.

now, the problem seems reduced to using less than 90 variables instead of the original 500. Not the real solution, since I am doing this in a piecemeal basis, but maybe close to what we are looking for, which is to get 54pct of the observed values.

using other methods I got even less candidates to be used as variables, but let's keep the ones we found before

then I tried this: after finding the best candidates I created this object, a data frame:

dat = data.frame(train_data$V1, train_data$V50, train_data$V66, train_data$V325, train_data$V426, train_data$V28, train_data$V44, train_data$V75, train_data$V111, train_data$V128, train_data$V149, train_data$V152, train_data$V154, train_data$V179, train_data$V181, train_data$V189, train_data$V203, train_data$V210, train_data$V213, train_data$V216, train_data$V218, train_data$V234, train_data$V243, train_data$V309, train_data$V311, train_data$V323, train_data$V338, train_data$V382, train_data$V384, train_data$V405, train_data$V412, train_data$V415, train_data$V417, train_data$V424, train_data$V425, train_data$V434, train_data$V483)

then, I invoked this:

model = train(train_data$V1 ~ train_data$V50 + train_data$V66 + train_data$V325 + train_data$V426 + train_data$V28 + train_data$V44 + train_data$V75 + train_data$V111 + train_data$V128 + train_data$V149 + train_data$V152 + train_data$V154 + train_data$V179 + train_data$V181 + train_data$V189 + train_data$V203 + train_data$V210 + train_data$V213 + train_data$V216 + train_data$V218 + train_data$V234 + train_data$V243 + train_data$V309 + train_data$V311 + train_data$V323 + train_data$V338 + train_data$V382 + train_data$V384 + train_data$V405 + train_data$V412 + train_data$V415 + train_data$V417 + train_data$V424 + train_data$V425 + train_data$V434 + train_data$V483,
               trace = FALSE)
ps = predict(model, dat)

if you check the result, ps, you find that most values are the same:

606 are -0.2158001115381
346 are 0.364988437287819

the rest of the 1000 values are very close to these two, the whole thing is this:

just 1 is -0.10
   1  is -0.14
   1  is -0.17
   1  is -0.18
   3  is -0.20
 617 are -0.21
   1  is 0.195
   1  is 0.359
   1  is 0.360
   1  is 0.362
   2  is 0.363
 370  are 0.364

, so I just converged all negative values to -1 and all positive values to 1 (let's assume is propensity not to buy or to buy), and then I found that 380 rows were negative when the original value to be predicted was -1 (499 rows), that is, a success percentage of 76 pct

only 257 values were positive when the original values were positive (success rate of 257/501 = 51.3pct)

the combined success rate in predicting the response variable values is a bit above 63%, which is above the value we aimed at, 54pct

now, I tried with the second data set, test_data (the second 1000 rows)

negative values when original response value was negative too:
          success rate is 453/501 = .90419

impressive?  See how disappointing is this:

positive values when original response value was positive too:
          success rate is 123/499 = .24649

the combined success rate is about 57pct, which is barely above the mark

do I trust my own method?

of course not, I would get all previous consumer surveys (buy/not buy) my company had in the files and then I will check if I can get a success rate at or above 57pct (which to me is too low, to say nothing of 54pct)

for the time and effort I spent maybe I should have tossed an electronic coin, with a bit of luck you can get a bit above 50pct success     : - )

maybe to prevent this they chose 54pct, since in 1000 runs you could be very well near 50pct

refinement, or "If we had all the time of the world..."

since I got enough free time, I tried this (same dat data frame):

model = train(train_data$V1 ~ log(train_data$V50) + log(train_data$V66) + log(train_data$V325) + log(train_data$V426) + log(train_data$V28) + log(train_data$V44) + log(train_data$V75) + log(train_data$V111) + log(train_data$V128) + log(train_data$V149) + log(train_data$V152) + log(train_data$V154) + log(train_data$V179) + log(train_data$V181) + log(train_data$V189) + log(train_data$V203) + log(train_data$V210) + log(train_data$V213) + log(train_data$V216) + log(train_data$V218) + log(train_data$V234) + log(train_data$V243) + log(train_data$V309) + log(train_data$V311) + log(train_data$V323) + log(train_data$V338) + log(train_data$V382) + log(train_data$V384) + log(train_data$V405) + log(train_data$V412) + log(train_data$V415) + log(train_data$V417) + log(train_data$V424) + log(train_data$V425) + log(train_data$V434) + log(train_data$V483),
               trace = FALSE)
ps = predict(model, dat)

negative values when original response value was negative too: .7

positive values when original response value was positive too: .69

combined success rate: 69.4pct

# now we try with the other 1000 values:
[same dat data frame, but using test_data instead of train_data]

model = train(test_data$V1 ~ log(test_data$V50) + log(test_data$V66) + log(test_data$V325) + log(test_data$V426) + log(test_data$V28) + log(test_data$V44) + log(test_data$V75) + log(test_data$V111) + log(test_data$V128) + log(test_data$V149) + log(test_data$V152) + log(test_data$V154) + log(test_data$V179) + log(test_data$V181) + log(test_data$V189) + log(test_data$V203) + log(test_data$V210) + log(test_data$V213) + log(test_data$V216) + log(test_data$V218) + log(test_data$V234) + log(test_data$V243) + log(test_data$V309) + log(test_data$V311) + log(test_data$V323) + log(test_data$V338) + log(test_data$V382) + log(test_data$V384) + log(test_data$V405) + log(test_data$V412) + log(test_data$V415) + log(test_data$V417) + log(test_data$V424) + log(test_data$V425) + log(test_data$V434) + log(test_data$V483),
               trace = FALSE)
ps = predict(model, dat)

negative values when original response value was negative too:
          success rate is 322/499 = .645

positive values when original response value was positive too:
          success rate is 307/501 = .612

combined success rate: 62.9pct

other things I tried failed -- if we had all the time of the world we could try other possibilities and get better results... or not

you'll tell me if you can reproduce the results, which are clearly above the 54pct mark