Microsoft® SQL Server® 2012 Bible




Part IX: Business Intelligence


Cross Validation
Cross validation is useful for evaluating the stability of a model for unseen cases. The
concept is to partition the available data into some number of equal-sized buckets called
“folds,” train the model on all but one fold, and test it with the remaining fold. This is
repeated until each fold has been used for testing. For example, if three folds were
selected, the model would be trained on folds 2 and 3 and tested on fold 1, then trained on
folds 1 and 3 and tested on fold 2, and finally trained on folds 1 and 2 and tested on fold 3.
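The rotation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Analysis Services builds its folds internally: the helper names `make_folds` and `cross_validation_rounds` are hypothetical, and dealing cases round-robin into folds is an assumption made to keep the example short.

```python
def make_folds(cases, fold_count):
    """Deal cases round-robin into fold_count buckets of near-equal size."""
    return [cases[i::fold_count] for i in range(fold_count)]


def cross_validation_rounds(cases, fold_count):
    """Yield one (training_cases, test_cases) pair per fold."""
    folds = make_folds(cases, fold_count)
    for held_out in range(fold_count):
        # Train on every fold except the one held out for testing.
        training = [case
                    for index, fold in enumerate(folds)
                    if index != held_out
                    for case in fold]
        yield training, folds[held_out]


cases = list(range(1, 10))  # nine hypothetical case IDs
rounds = list(cross_validation_rounds(cases, 3))
for number, (training, test) in enumerate(rounds, start=1):
    print(f"round {number}: train on {len(training)} cases, test on {len(test)}")
```

Each case is used for testing exactly once and for training in every other round, which is what lets the per-fold results be compared with one another.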

Switch to the Cross Validation tab, and specify the parameters for the evaluation:

■ Fold Count: The number of partitions the data will be placed into.
■ Max Cases: The number of cases that the folds will be constructed from. For example,
1000 cases and 10 folds result in approximately 100 cases per fold. Setting this
value to 0 results in all cases being used.
■ Target Attribute and State: The prediction to validate.
■ Target Threshold: The minimum probability required before assuming a positive result.
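As a rough illustration of how these parameters interact, the following Python sketch reproduces the arithmetic from the Max Cases example and applies a threshold to a predicted probability. The function names are hypothetical, and treating a probability exactly equal to the threshold as positive is an assumption; the product's own boundary behavior may differ.

```python
def cases_per_fold(total_cases, fold_count, max_cases=0):
    """Approximate number of cases in each fold.

    max_cases=0 means every available case is used, mirroring the
    Max Cases setting described above.
    """
    used = total_cases if max_cases == 0 else min(max_cases, total_cases)
    return used / fold_count


def is_positive(probability, target_threshold):
    """Assume a positive result once the probability reaches the threshold."""
    return probability >= target_threshold


print(cases_per_fold(5000, 10, max_cases=1000))  # 1000 cases over 10 folds
print(cases_per_fold(5000, 10))                  # all 5000 cases over 10 folds
print(is_positive(0.7, 0.5), is_positive(0.3, 0.5))
```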

After the cross-validation has run, a report displays the outcome for each fold across a
number of different measures. The standard deviation of the results for each measure should
be relatively small. A large variation between folds likely indicates that the model will
not work well in actual use.
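To make the stability check concrete, here is a minimal Python sketch that summarizes one measure across folds. The scores are made-up numbers for illustration only, `fold_spread` is a hypothetical helper, and using the population standard deviation (`statistics.pstdev`) is an assumption; the report may compute its figure differently.

```python
from statistics import mean, pstdev


def fold_spread(scores):
    """Return (mean, standard deviation) of one measure across folds."""
    return mean(scores), pstdev(scores)


# Made-up per-fold accuracy values for two hypothetical models.
stable = [0.81, 0.80, 0.82]
unstable = [0.95, 0.60, 0.88]

stable_mean, stable_dev = fold_spread(stable)
unstable_mean, unstable_dev = fold_spread(unstable)
print(f"stable:   mean={stable_mean:.2f} stdev={stable_dev:.3f}")
print(f"unstable: mean={unstable_mean:.2f} stdev={unstable_dev:.3f}")
```

Both hypothetical models have the same average score, so only the spread between folds reveals that the second one behaves inconsistently.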

Troubleshooting Models
Models are rarely perfect in the real world, so you must assume an acceptable margin of
error. Several common problems arise when creating models:

■ The data is split nonrandomly into training and test data sets.
■ Input columns are too case-specific (for example, IDs, names, and so on). Adjust the
mining structure to ignore data that occurs in training data but never in production
data.
■ The training data set has too few rows (cases) to accurately characterize the
population of cases. Add additional data sources or limit the special cases included.
■ If all models are closer to the Random Guess line than the Ideal Model line, then
the input data does not correlate with the predicted outcome.
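The first problem in the list, a nonrandom split, can be illustrated with a short Python sketch that shuffles the cases before dividing them. This is a generic technique, not the mechanism Analysis Services uses; `random_split`, the 30 percent test fraction, and the seed value are all assumptions chosen for the example.

```python
import random


def random_split(cases, test_fraction=0.3, seed=42):
    """Shuffle a copy of the cases, then split into (training, test)."""
    shuffled = list(cases)
    # A fixed seed keeps the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


cases = list(range(1, 11))  # ten hypothetical case IDs, e.g. ordered by date
training, test = random_split(cases)
print("training:", training)
print("test:", test)
```

Without the shuffle, taking the first or last rows as test data would inherit whatever ordering the source table has (by date, region, and so on), so the two sets could describe different populations.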

Some algorithms, such as Time_Series, do not support the mining accuracy chart view.
Always evaluate the model and modify the data and model definition until it meets the
business needs and margin of error.

Deploying
Several methods are available for interfacing applications with data mining functionality:

■ Directly constructing XMLA, communicating with Analysis Services via SOAP. This
exposes all functionality at the price of in-depth programming.