Details about the implementation of each routine are provided directly in the code through verbose comments, including lists and explanations of the required arguments, return values and options. Here, instead, is a description of the different routines:
`confidence_band` creates confidence bands for a model optimized by a curve-fit regression with the Least Squares (LS) method. It can handle several different cases: Linear or Ordinary Least Squares (OLS), Non-Linear Least Squares (NLLS), unknown and known variances (the latter very dear to those who work with experimental data), and both homoscedastic and heteroscedastic data. Some technical and statistical details:
- Why: The idea behind this routine is to handle different cases of data fitting. Searching the web, you can find plenty of documents and tutorials explaining how to build confidence or prediction intervals (not only in Python), but most of the time they focus on a few very limited (yet widely requested) cases, such as simple OLS regression or non-experimental data (...in Physics we have errors! Damn!). In other cases a solution exists, but it requires packages or libraries you are not familiar with or cannot use on your machines. This routine tries to address both issues (at least for the author).
- How (code): It is essentially a wrapper around the well-known Scipy function `scipy.optimize.curve_fit` and is built on top of it: shared positional and keyword arguments follow the roles and definitions used in Scipy, so that the routine is quickly usable by anyone already familiar with `curve_fit` itself. It requires Scipy v0.14.0 or later (for the `absolute_sigma` argument). Here are the prototype and description:

```python
def confidence_band(model, xdata, ydata, confidence_level=0.6827, xvals=None,
                    prediction=False, bootstrap=False, num_boots=1000,
                    full_output=False, **kwargs):
    """Compute confidence or prediction bands of a model optimized by a
    curve fit regression with Least Square method.

    Parameters:
    -----------
    model : callable
        The model function f(x, *p) taking the independent variable as
        first argument and the fit parameters as separate remaining
        arguments.
    xdata : scalar or array-like
        Measured independent variable values.
    ydata : scalar or array-like
        Dependent data.
    confidence_level : number, optional
        Desired confidence level used in the definition of the bands.
        It must be a number in the range (0,1). Default value is 0.6827,
        corresponding to one 'gaussian' sigma. Other common values are
        0.9545 ('two sigmas') and 0.9973 ('three sigmas').
    xvals : scalar or array-like, optional
        Value(s) of the independent variable where the band is computed.
        If not given, xdata is used as default. This value is also
        overridden by xdata if prediction=True and if absolute_sigma is
        passed as keyword argument to curve_fit (see below).
    prediction : bool, optional
        If True, compute prediction intervals instead of a confidence band.
    bootstrap : bool, optional
        If True, intervals are computed by using the bootstrap method.
        This should be the preferred choice in the case of a NLLS
        regression.
    num_boots : int, optional
        Number of bootstrap resamples used if the bootstrap method is
        selected. Default is 1000 if absolute_sigma is True, otherwise it
        is set to the minimum between 1000 and N^N, where N is the number
        of data points.
    full_output : bool, optional
        If False, only the upper and lower bounds of the band are
        returned. Otherwise, the central predicted response, the
        optimized parameters and the covariance matrix of the regressor
        are returned as well. Default is False.
    kwargs
        Keyword arguments passed to curve_fit. In particular, the values
        of the keyword arguments 'absolute_sigma' and 'sigma' are crucial
        and are used in the calculation of the bands.

    Returns:
    --------
    upper : scalar or array-like
        Upper bound of the confidence (or prediction) band.
    lower : scalar or array-like
        Lower bound of the confidence (or prediction) band.
    central : scalar or array-like, optional
        Mean predicted response curve, i.e. the model evaluated at the
        optimized values of the regression parameters.
    popt : scalar or array-like, optional
        Optimized values of the regression parameters.
    pcov : 2d array
        The estimated covariance of popt. The diagonal provides the
        variances of the parameter estimates.

    Notes:
    ------
    I) The variance of the values predicted by the regression is defined
    using the (approximate) Jacobian matrix of the model with respect to
    the fit parameters. This method is robust in the case of Ordinary
    Least Square regression (OLS), i.e. when the model is linear in the
    parameters, but it is also used in the non-linear case (NLLS), even
    though a bootstrap procedure should be preferred there (see below).
    II) Approximate bootstrap confidence intervals require different
    methods depending on absolute_sigma. If true errors are given, the
    bootstrap is over the measures, assuming a normal distribution around
    ydata with spread given by sigma. If absolute_sigma is False, the
    residuals are bootstrapped. In this case, the number of total possible
    resamplings with replacement is N^N, where N is the number of data
    points. It is left to the user to decide which value of num_boots is
    adequate.
    III) Prediction intervals can be computed only on the original data
    'xdata' if absolute_sigma is True.
    """
```
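For illustration, here is a minimal usage sketch under two assumptions: the import path is a placeholder (adapt it to wherever the routine lives in this repository), and the call follows the prototype above.

```python
import numpy as np
# Placeholder import: adapt to the actual module name in this repository.
from confidence_band import confidence_band

def model(x, a, b):
    """Straight line, linear in the parameters (OLS case)."""
    return a * x + b

rng = np.random.default_rng(42)
xdata = np.linspace(0.0, 10.0, 25)
sigma = 0.5 * np.ones_like(xdata)          # known experimental errors
ydata = model(xdata, 2.0, 1.0) + rng.normal(0.0, 0.5, xdata.size)

# 'Two sigma' confidence band on a fine grid; 'sigma' and 'absolute_sigma'
# are forwarded to curve_fit exactly as in Scipy.
upper, lower = confidence_band(model, xdata, ydata,
                               confidence_level=0.9545,
                               xvals=np.linspace(0.0, 10.0, 200),
                               sigma=sigma, absolute_sigma=True)
```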
- How (statistics): Confidence bands are computed directly from the observation/regression matrix X appearing in the fit model Y = Xb, where Y is the vector of the dependent variable and b the vector of regression parameters. It is used to estimate the variance of the mean predicted response Y*, which is given by s*^2 = X w^2 X^T, where w^2 is the covariance matrix of the fit parameters. The correct score multiplying the variance depends on the case (z-score for known variances; t-score for unknown variance under the homoscedasticity assumption, using the Mean Square Error (MSE) to estimate the single common variance) and on the desired confidence level (CL). This computation obviously coincides with the formulas easily found on the web for simpler cases such as simple linear regression; a sketch of the calculation follows below. More generally, this method is robust in the OLS case, but it is also used for NLLS by numerically approximating the Jacobian matrix near the optimized values of the fit parameters. In that case, however, the bootstrap method should be preferred (resampling the residuals when the variances are unknown, or sampling randomly around the measures when the variances are given). Prediction bands can also be computed from the data. Finally, some important remarks about use and limitations: the routine can handle correlated data (if the covariance matrix of the observations is given), but robustness may suffer (even in the OLS case). The same may happen if bounds are imposed on the parameters in order to solve a constrained problem.
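To make the formula concrete, here is a self-contained sketch of the calculation just described: the delta method with a numerical Jacobian and a t-score for unknown variances. It illustrates the statistics above, not necessarily the routine's exact code.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import t

def model(x, a, b):
    return a * x + b

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = model(x, 2.0, 1.0) + rng.normal(0.0, 0.8, x.size)

# Fit with unknown variances: curve_fit scales pcov by the MSE.
popt, pcov = curve_fit(model, x, y)

# Numerical Jacobian of the model w.r.t. the parameters, one row per x value.
eps = 1e-8
J = np.empty((x.size, popt.size))
for j in range(popt.size):
    dp = np.zeros_like(popt)
    dp[j] = eps
    J[:, j] = (model(x, *(popt + dp)) - model(x, *(popt - dp))) / (2.0 * eps)

# Variance of the mean predicted response: diag(J pcov J^T),
# i.e. s*^2 = X w^2 X^T row by row in the notation above.
var_mean = np.einsum('ij,jk,ik->i', J, pcov, J)

# Unknown variances -> t-score with N - p degrees of freedom
# (a z-score would be used instead when the true sigmas are known).
cl = 0.6827
ndof = x.size - popt.size
score = t.ppf(0.5 * (1.0 + cl), ndof)

central = model(x, *popt)
upper = central + score * np.sqrt(var_mean)
lower = central - score * np.sqrt(var_mean)
```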
- What: This function returns two arrays containing the upper and lower bounds of the confidence (or prediction) band around the curve defined by the mean predicted response. In its lightest use, it requires two arrays containing, respectively, the values of the independent variable x and of the corresponding dependent variable y, plus the model used in the fit. One can then set the desired confidence level (CL), ask for a prediction interval instead, and/or choose the range over which the band is computed. Every other keyword parameter, such as variances/errors (`sigma`), known variances (`absolute_sigma`) and so on, is exactly the same as required by, and passed to, the `curve_fit` function of Scipy. See the examples!
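Building on the usage sketch above, a prediction band with known errors and full output might look as follows; the return order is taken from the docstring and should be treated as an assumption about the exact calling convention.

```python
# Prediction band evaluated on the measured points: with absolute_sigma=True
# it can only be computed at 'xdata' (note III of the docstring).
upper, lower, central, popt, pcov = confidence_band(
    model, xdata, ydata, sigma=sigma, absolute_sigma=True,
    prediction=True, full_output=True)
```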
The second routine performs a Chi-Square Goodness-of-Fit test to check whether a model is actually able to describe the observed data.
- How (code): Here is the prototype of the function (its docstring):

```python
"""Performs the Chi Square Goodness of Fit test on a model with
parameters optimized through a least square curve fit procedure in 1d.

Parameters:
-----------
model : callable
    The model function f(x, *p) which takes the independent variable as
    first argument and the fit parameters as separate remaining
    arguments.
xdata : array-like
    Measured independent variable values.
ydata : array-like
    Dependent data.
popt : scalar or array-like
    Optimized values of the fit parameters as resulting from the curve
    fit procedure.
sigma : scalar or array-like, optional
    Determines the uncertainty in 'ydata'.
counts : bool, optional
    If True, ydata are considered frequencies or counts, so that sigma
    is replaced by the square root of the expected frequencies.
full_output : bool, optional
    If True, the Mean Squared Error (MSE) is returned together with the
    Sum of Squared Errors (SSE), the degrees of freedom and the p-value
    resulting from the chi2 distribution. Default is False, with only
    the first value returned.

Returns:
--------
MSE : scalar
    The Mean Squared Error (MSE) of the data on the test model, which
    under assumptions is a reduced chi2 variable.
SSE : scalar, optional
    The Sum of Squared Errors of the data on the test model.
ndof : int, optional
    Number of degrees of freedom.
pvalue : scalar, optional
    The obtained p-value.

Notes:
------
In this version data and variances are assumed to be uncorrelated.
"""
```
- How (statistics): The function computes the residuals of the observations with respect to the model and uses their sum of squares to obtain the result of the test. The Mean Square Error (MSE) is returned by default, but other values (such as the p-value obtained according to the Chi-Square distribution) can also be returned. With the keyword argument `counts=True`, the routine automatically handles the case of frequency observations, i.e. it sets the variances equal to the expected frequencies (given by the model). Notice that if the data are not counts and `sigma` is not given, the uncertainties are set to one. A standalone sketch of this computation follows the list below.
- What: By default, this routine returns the Mean Square Error (MSE), or Mean Sum of Squared Residuals (MSSR), of the data with respect to the model. It needs the model, the array of the independent variable and that of the dependent one, together with the optimized fit parameters. Variances can also be passed, or a flag telling the routine to handle frequencies/counts. A full output can be requested with the corresponding keyword argument, so as to also return the Sum of Squared Errors (SSE), or Sum of Squared Residuals (SSR), the degrees of freedom and the p-value according to the Chi-Square distribution. See the examples!
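As anticipated, here is a minimal standalone sketch of the quantities described above (SSE, MSE, degrees of freedom, p-value), assuming uncorrelated data as stated in the docstring notes; it mirrors the description rather than the routine's actual code.

```python
import numpy as np
from scipy.stats import chi2

def chi2_gof_sketch(model, xdata, ydata, popt, sigma=1.0):
    """Illustration of the test statistic described above."""
    resid = (ydata - model(xdata, *popt)) / sigma  # standardized residuals
    SSE = np.sum(resid**2)                         # Sum of Squared Errors: a chi2 variable
    ndof = np.size(ydata) - np.size(popt)          # degrees of freedom
    MSE = SSE / ndof                               # reduced chi2
    pvalue = chi2.sf(SSE, ndof)                    # upper-tail p-value
    return MSE, SSE, ndof, pvalue
```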
Here is a list of some useful references (especially about statistics) used to write this code. The first two cover statistical theory and methods, the second pair describes the bootstrap method, and the last one links to the documentation of the Scipy modules used in this project.
- Del Prete, T. (2000). "Methods of Statistical Data Analysis in High Energy Physics".
- Wikipedia contributors. "Ordinary least squares". Wikipedia, The Free Encyclopedia.
- Efron, B. (1981). "Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods".
- Fox, J. (2002). "Bootstrapping regression models".
- scipy.optimize.curve_fit and scipy.optimize.least_squares documentation, which are heavily used in this module.