 
GPExp: results analysis of the "data" dataset
 
 
DISCLAIMER
 
The information provided below is provided without guarantee and should
be considered as an illustrative example of how GPExp results can be exploited.
Most of the qualitative analysis is based on experience and cannot be
considered as firm rules. Depending on the dataset, the imposed heuristic
boundaries can vary significantly.
 
 
DATASET INFORMATION
 
A 36-point dataset relative to an open-drive scroll expander working with R123.
Data from:
Quoilin, S. (2011). Sustainable energy conversion through the use of Organic Rankine Cycles for waste heat recovery and solar applications. Unpublished doctoral thesis, University of Liège, Liège, Belgium.
 
 
MAIN RESULTS
(NB: These results are stored in the out.CV and out.train variables)
 
Whole training set (i.e. with all the data points):
Normalized mean absolute error: 6.5124 %
Coefficient of determination (R square): 80.7933 %
Normalized root mean square error (NRMSE): 0.122
 
In cross-validation:
Normalized mean absolute error: 8.2883 %
Coefficient of determination (R square): 72.58 %
Normalized root mean square error (NRMSE): 0.14548
 
These results can be interpreted as follows:
 
The lowest average error that can be reached by a model predicting Wdot [kW]
as a function of p_{su} [bar] is 6.5124 %.

When predicting outputs for inputs that lie outside of the training data, an
average error of 8.2883 % can be expected.
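As a point of reference, these metrics can be reproduced outside of GPExp. The sketch below (in Python) shows one common set of definitions; normalization by the output range is an assumption here, and GPExp's exact normalization may differ:

```python
import numpy as np

def fit_metrics(y_true, y_pred):
    """Return (NMAE, R^2, NRMSE) for a set of predictions.

    Normalization by the range of y_true is an illustrative choice,
    not necessarily the one used internally by GPExp."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    span = y_true.max() - y_true.min()          # output range used for normalization
    nmae = float(np.mean(np.abs(y_true - y_pred)) / span)
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                  # coefficient of determination
    nrmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)) / span)
    return nmae, r2, nrmse
```

A perfect model yields NMAE = 0, R^2 = 1 and NRMSE = 0; larger errors lower R^2 and raise the normalized errors.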
 
 
DETECTION OF OVERFITTING
 
Overfitting can be detected by comparing the error in training mode with the
error in cross-validation: in case of overfitting, the training error remains
low while the CV error increases significantly.
However, there is no firm quantitative rule to detect overfitting. This analysis
therefore only provides a warning, which should be checked visually by plotting
the function.
 
The ratio between the CV error and the training error is 1.27.
This value is relatively low, which tends to indicate that there is no overfitting.
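The ratio check above can be expressed as a small helper. The warning threshold of 2.0 below is an illustrative heuristic, not a GPExp rule:

```python
def overfitting_warning(train_error, cv_error, threshold=2.0):
    """Return (ratio, warning) comparing CV error to training error.

    A ratio well above 1 suggests the model fits the training data much
    better than unseen data, i.e. possible overfitting. The 2.0 threshold
    is an assumed heuristic; judge borderline cases visually."""
    ratio = cv_error / train_error
    return ratio, ratio > threshold

# Example with the NMAE values reported above (in %):
ratio, warn = overfitting_warning(6.5124, 8.2883)
```

Here the ratio evaluates to about 1.27 and no warning is raised, consistent with the analysis above.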
  
 
RELEVANCE OF THE SELECTED INPUTS (PERMUTATIONS)
(NB: This result is stored in the out.CV.pmae variable)
 
The dependency of the output on the selected inputs is checked by comparing
the mean absolute error (in cross-validation) of the dataset with that of a
random dataset (the same dataset with randomly permuted outputs). The statistic
reported in the CV.pmae variable corresponds to the probability that such a
random dataset performs better, in terms of mean absolute error, than the
actual dataset.
A pmae lower than 5 percent indicates that there is a significant correlation
between inputs and outputs.
 
Computed pmae value: 0.1996  
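The permutation test described above can be sketched as follows. For brevity, a simple least-squares line stands in for the Gaussian Process model and the in-sample MAE replaces the cross-validated one; GPExp's actual procedure uses its GP model in cross-validation:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def linear_fit_predict(x, y):
    """Fit a least-squares line to (x, y) and return its predictions.

    Stand-in for the GP model, for illustration only."""
    A = np.vstack([x, np.ones_like(x)]).T
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def permutation_pmae(x, y, n_perm=200, seed=0):
    """Estimate pmae: the fraction of randomly permuted datasets that
    achieve a lower MAE than the actual dataset."""
    rng = np.random.default_rng(seed)
    ref = mae(y, linear_fit_predict(x, y))      # MAE on the real dataset
    better = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)             # break the input/output link
        if mae(y_perm, linear_fit_predict(x, y_perm)) < ref:
            better += 1
    return better / n_perm
```

When the output genuinely depends on the input, permuted datasets almost never match the real fit and pmae stays close to zero.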
 
OUTLIERS
(NB: These results are stored in the out.outliers vector)
 
The posterior of the Gaussian Process provides the likelihood function,
with its mean and standard deviation. For each data point, the Grubbs
test for outliers can therefore be applied: the error is expressed as a
multiple of the standard deviation at that particular point. Under the
normal distribution hypothesis, a multiple higher than 1.96 indicates a
significance level lower than 5 percent.
Users should also refer to the error plot to visualize how the errors are
distributed over the normal distribution.
 
The data point with the highest probability of being an outlier is:
Data point number 26, with an error equal to 3.2149 times the standard
deviation of the Gaussian Process function at that particular point.
 
The following data points present a significance level lower than 5 percent.
They are therefore likely to be outliers:
Data point number 26: 3.2149
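The per-point test above amounts to a simple threshold on the standardized error. A minimal sketch, assuming the prediction errors and the GP posterior standard deviations at each point are available as arrays:

```python
import numpy as np

def flag_outliers(errors, stds, z_crit=1.96):
    """Standardize each error by the GP posterior standard deviation at
    that point and flag values beyond z_crit (5 % two-sided significance
    under the normality hypothesis)."""
    z = np.abs(np.asarray(errors, dtype=float)) / np.asarray(stds, dtype=float)
    return z, z > z_crit
```

For instance, a point whose error is 3.2149 posterior standard deviations away (as data point 26 above) is flagged, while points within 1.96 standard deviations are not.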
 
 
PREDICTION
 
Once the analysis has been performed and the results are satisfactory (i.e.
no overfitting, outliers have been removed, accuracy is sufficient, etc.),
the result files can be used for prediction, i.e. to predict the output
for a set of inputs outside of the data.
This can be done by:
1. loading the "in" and "out" structures from the .mat result analysis file:
   "load data-file.mat"
2. calling the GPExp prediction function, assigning a value to each input:
   "GP_prediction(in,results, 'p_{su} [bar]', value1)"
 
 
