Credit Risk Assessment, Statistical Analysis (ver.1.03)
The plot distinguishes the training and testing data points for the median
neural network. To avoid biasing the cross-validation procedure, the testing
subset is drawn repeatedly and the median performer is selected. The best
median performer (see
"Very Quick Guide to ANN")
determines the optimal number of hidden nodes, reported in the box on the left.
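The selection procedure described above can be sketched as follows. This is a minimal illustration and not the tool's actual implementation: the function names, the number of repeats, the test fraction and the toy scoring function are all assumptions made for the example. The idea is that, for each candidate number of hidden nodes, the testing subset is re-drawn several times, the median test error is recorded, and the candidate with the best median is retained.

```python
import random
import statistics


def split(data, test_fraction, rng):
    """Randomly split data into (train, test) subsets."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def median_test_error(fit_and_score, data, n_hidden,
                      n_repeats=15, test_fraction=0.3, seed=0):
    """Median test error over repeatedly re-drawn testing subsets."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train, test = split(data, test_fraction, rng)
        errors.append(fit_and_score(n_hidden, train, test))
    return statistics.median(errors)


def select_hidden_nodes(fit_and_score, data, candidates):
    """Return the candidate with the lowest median test error."""
    medians = {h: median_test_error(fit_and_score, data, h)
               for h in candidates}
    best = min(medians, key=medians.get)
    return best, medians


def toy_fit_and_score(n_hidden, train, test):
    # Hypothetical stand-in for "train an ANN with n_hidden hidden nodes
    # on `train` and return its error on `test`". The error is smallest
    # at 3 hidden nodes, plus a small term that depends on the drawn subset.
    return (n_hidden - 3) ** 2 + 0.01 * statistics.mean(test)


data = list(range(20))  # placeholder records
best, medians = select_hidden_nodes(toy_fit_and_score, data,
                                    candidates=[1, 2, 3, 4, 5])
print("selected number of hidden nodes:", best)
```

In practice `toy_fit_and_score` would be replaced by a routine that really trains and evaluates a network. Using the median rather than the best run keeps a single lucky split from deciding the architecture.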
The quality of the predictions can be inspected by plotting the predicted
values against the observed values. In our example, observed values can only
be "0" or "1"; these are the values represented by circles in the plot.
The closer the linear regression line (solid) is to the identity line (dashed),
the better the predictions (in this example the two lines coincide).
In addition, the proximity of the data points to the regression line is
quantified by the square of the standard Pearson correlation coefficient, r².
It is interesting to note that, applied to ANN predictions, this is in effect a
non-linear and non-monotonic correlation coefficient, a correlation measure not
available in conventional statistics. Conventional non-linear correlation
coefficients are restricted to monotonic relationships. By contrast, the r²
reported for the ANN predictions is free from any such restriction.
Continuing the analogy with conventional linear regression, the sensitivity
index should be regarded as a measure similar to the regression coefficients.
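The comparison between the regression line and the identity line, and the accompanying r², can be reproduced from the text report with a few lines of code. The sketch below is only an illustration. It assumes that observed values go on the horizontal axis and predicted values on the vertical axis (the page does not state which way round the regression is computed), and the six data points are invented for the example.

```python
import statistics


def regression_line(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y):
    """Square of the Pearson correlation coefficient between x and y."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Invented example: observed outcomes are 0 or 1, predictions are continuous.
observed = [0, 0, 0, 1, 1, 1]
predicted = [0.1, 0.0, 0.2, 0.9, 1.0, 0.8]

slope, intercept = regression_line(observed, predicted)
r2 = r_squared(observed, predicted)

# The identity line has slope 1 and intercept 0; the closer the fitted
# line is to it, the better the predictions.
print("slope:", slope, "intercept:", intercept, "r2:", r2)
```

A slope near 1 together with an intercept near 0 means the regression line lies close to the identity line, and an r² near 1 means the points sit close to that line.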
Here you can evaluate which variables most influence
your output by considering the average sensitivity of the output to each of the
inputs.
In our example, we can see that "income", "number of children" and "divorced"
have a big influence on "payed", while "relincharge" (relatives in charge)
and "pets" have little or no influence.
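One common way to obtain an average sensitivity of this kind is sketched below. The page does not say how the tool computes its sensitivities, so the definition used here (the mean absolute partial derivative of the output with respect to each input, estimated by central finite differences over the data set) is an assumption, as are the function names and the toy model standing in for a trained ANN.

```python
def average_sensitivity(model, rows, delta=1e-4):
    """Mean absolute finite-difference derivative of model output per input."""
    n_inputs = len(rows[0])
    totals = [0.0] * n_inputs
    for row in rows:
        for i in range(n_inputs):
            up = list(row)
            down = list(row)
            up[i] += delta
            down[i] -= delta
            # Central difference approximates d(output)/d(input i) at this row.
            totals[i] += abs(model(up) - model(down)) / (2 * delta)
    return [t / len(rows) for t in totals]


def toy_model(x):
    # Hypothetical stand-in for a trained network: the output depends
    # strongly on input 0, not at all on input 1, non-linearly on input 2.
    return 3.0 * x[0] + 0.0 * x[1] + x[2] ** 2


rows = [[0.0, 1.0, 1.0],
        [1.0, 2.0, 2.0],
        [2.0, 3.0, 3.0]]

sensitivities = average_sensitivity(toy_model, rows)
print(sensitivities)
```

An input whose average sensitivity is close to zero, like the second one here, plays the role of "pets" in the example above: changing it barely moves the prediction.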
We find that the ANN correctly identified the second, fourth and seventh inputs
to be the most reliable as a basis for predictions, followed by the first,
third, eleventh and twelfth input variables, while all others were found to be
practically negligible.
Important note: The ANN training procedure is such that all
information available can be captured. The noisiness of an input variable by
itself will not prevent the underlying signal from being used.
The sensitivity analysis is as important as the regression analysis, as it
quantifies the importance of each input variable for the prediction. This is
particularly useful for reducing the number of variables that need to be
monitored, and it can also be used as a basis for mechanistic
explanations of the association between input and output variables.
Important note: If you use more than one output (imagine you
were also trying to predict how much of the total loan amount the borrower
failed to repay), note that each output is predicted independently. In fact, a
separate ANN is developed for each output, using all inputs each time.
Therefore, there is no need to separate the dependent variables into different
sets according to their interdependencies, which can instead be recovered by
cluster or factor analysis of their sensitivities.
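As a rough illustration of recovering interdependencies from sensitivities, the sketch below compares the sensitivity profiles of several outputs using a plain Pearson correlation. This is a much simpler stand-in for a full cluster or factor analysis, and everything in it is hypothetical: the output names other than "payed", the three inputs and the sensitivity values are invented for the example.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def profile_similarity(sensitivities):
    """Correlation between the sensitivity profiles of every pair of outputs."""
    names = sorted(sensitivities)
    return {(a, b): pearson(sensitivities[a], sensitivities[b])
            for i, a in enumerate(names)
            for b in names[i + 1:]}


# Invented sensitivity profiles: one value per input, one profile per output.
# Each profile would come from the separate ANN trained for that output.
sensitivities = {
    "payed": [0.9, 0.1, 0.5],
    "unpaid_amount": [0.8, 0.2, 0.5],
    "other": [0.1, 0.9, 0.4],
}

similarity = profile_similarity(sensitivities)
for pair, value in similarity.items():
    print(pair, round(value, 3))
```

Outputs whose profiles correlate strongly respond to the same inputs and would fall into the same group in a cluster analysis. Profiles that correlate weakly or negatively point to outputs driven by different inputs.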
All information provided through the graphical interface is also available
as a text file, so that you can use your favorite graphics or statistical
packages to proceed to a more advanced analysis and to produce
publication-quality plots. To get the text-formatted report, look for
the inconspicuous link on both the "Trained ANN Statistics" and the
"Predicting with trained ANN" pages.