Fitdistcp: fitting statistical models using calibrating priors

fitdistcp is a free R package for fitting statistical models using calibrating priors, with the goal of making reliable predictions.

fitdistcp can be called from R or Python (details below), and we hope to release a pure Python version in summer 2025.

Maximum likelihood and other point-estimate methods underestimate predictive tail probabilities. For instance, with maximum likelihood prediction, a GEVD with one predictor and a shape parameter of -0.3, fitted to 50 data points, underestimates the 1-in-50 tail probability by a factor of 4. This is called a reliability bias.

Calibrating prior prediction provides an objective way to reduce this reliability bias by including parameter uncertainty. It eliminates the bias completely in many cases. The benefit of calibrating prior prediction, relative to maximum likelihood, is largest for small samples, models with more parameters, and in the tails.
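
As a rough illustration of both effects, here is a minimal simulation sketch. It uses the normal distribution rather than the GEVD quoted above, and assumes that qnorm_cp follows the same interface and output names as qgev_cp in Example 2 below. It checks how often a future observation exceeds the fitted 1-in-50 (probability 0.98) quantile; for a reliable prediction this should happen 2% of the time.

> library(fitdistcp)
> set.seed(1)
> nrep=2000; n=20
> exceed=function(){x=rnorm(n); q=qnorm_cp(x,p=0.98); xnew=rnorm(1); c(xnew>q$ml_quantiles, xnew>q$cp_quantiles)}
> res=replicate(nrep,exceed())		# 2 x nrep matrix of exceedance indicators (maxlik row, then cp row)
> rowMeans(res)				# the maxlik frequency typically sits noticeably above the target of 0.02;
					# the calibrating prior frequency is typically much closer to 0.02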

fitdistcp uses objective Bayesian methods. The prior is chosen to maximize the reliability of the predictions, i.e., to give predictive probabilities that are as close as possible to the frequencies of future outcomes. The integration is performed analytically if possible, or using DMGS asymptotics otherwise. Posterior sampling is also available as an option (but is slower).
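
In outline (a sketch of the standard Bayesian prediction step, with notation that is mine rather than the package documentation's): given training data x = (x_1, ..., x_n), a sampling density f(y|theta) and a calibrating prior pi(theta), the predictive density is the posterior-weighted average of the sampling density,

p(y \mid x) = \int f(y \mid \theta)\, \pi(\theta \mid x)\, d\theta,
\qquad
\pi(\theta \mid x) \propto \pi(\theta) \prod_{i=1}^{n} f(x_i \mid \theta).

It is this integral that is evaluated analytically where possible and by DMGS asymptotics otherwise; the posterior sampling option approximates the same integral by Monte Carlo instead.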

For many cases, charts are provided below that quantify the size of the maximum likelihood reliability bias, and the reduced bias from calibrating prior prediction with DMGS integration.

Models

This is a list of the models that are currently supported, in alphabetical order. This list is motivated by various climate and actuarial applications. Let me know if you have any suggestions for other models to include.

Model | R Command | H/I | A/D | Maxlik bias | CP bias
Cauchy | cauchy_cp | H | D | N/A | N/A
Cauchy with single predictor on the location | cauchy_p1_cp | H | D | N/A | N/A
Exponential | exp_cp | H | A | all parameters | all parameters
Exponential with single predictor on the rate | exp_p1_cp | H | D | all parameters | all parameters
Frechet with known location | frechet_k1_cp | H | D | all parameters | all parameters
Frechet with known location, and single predictor on the scale | frechet_k1p3_cp | H | D | N/A | N/A
Gamma | gamma_cp | I | D | N/A | N/A
Inverse gamma | invgamma_cp | I | D | N/A | N/A
Inverse Gaussian | invgauss_cp | I | D | N/A | N/A
Generalized normal with known shape | gnorm_cp | H | D | N/A | N/A
GEV | gev_cp | I | D | charts | charts
GEV with 1 predictor, on the location | gev_p1_cp | I | D | charts | charts
GEV with 2 predictors, one each on the location and scale | gev_p12_cp | I | D | charts | charts
GEV with 3 predictors, one each on the location, scale and shape | gev_p123_cp | I | D | N/A | N/A
GEV with known shape | gev_k3_cp | H | D | N/A | N/A
GEV with known shape with a single predictor on the location | gev_k3p1_cp | H | D | N/A | N/A
GPD with known location | gpd_k1_cp | I | D | charts | charts
GPD with known location and shape | gpd_k13_cp | I | D | N/A | N/A
Gumbel | gumbel_cp | H | D | all parameters | all parameters
Gumbel with single predictor on the location | gumbel_p1_cp | H | D | all parameters | all parameters
Half-normal | halfnorm_cp | H | D | N/A | N/A
Logistic | logis_cp | H | D | all parameters | all parameters
Logistic with single predictor on the location | logis_p1_cp | H | D | N/A | N/A
Log-normal | lnorm_cp | H | A | same as norm | same as norm
Log-normal with a single predictor on the log-mean | lnorm_p1_cp | H | A | same as norm_p1 | same as norm_p1
Normal | norm_cp | H | A | all parameters | all parameters
Normal with 1 predictor, on the mean, i.e., simple linear regression | norm_p1_cp | H | A | all parameters | all parameters
Pareto with known scale | pareto_k1_cp | H | A | same as exp | same as exp
Pareto with known scale and a single predictor on the shape parameter | pareto_p1k3_cp | H | D | same as exp_p1 | same as exp_p1
t distribution with known degrees of freedom | lst_k3_cp | H | D | N/A | N/A
t distribution with known degrees of freedom with a single predictor on the location parameter | lst_p1k5_cp | H | D | N/A | N/A
Uniform | unif_cp | H | A | N/A | N/A
Weibull | weibull_cp | H | D | all parameters | all parameters
Weibull with single predictor on the scale | weibull_p2_cp | H | D | N/A | N/A

Notes on the table: "charts" indicates that bias charts are provided for shape parameter xi = -0.4, -0.3, -0.2, -0.1, 0, 0.1, 0.2, 0.3, 0.4 and sample sizes n = 30, 40, 50, 60, 70, 80, 100, 120. The A/D column indicates whether the integration is performed analytically (A) or using DMGS asymptotics (D).

Examples

Example 1: Fitting a normal distribution and calculating the fitted probability density function

> x=rnorm(20)		# make some example training data 
> y=seq(-5,5,0.01)	# define the points at which we wish to calculate the density
> d=dnorm_cp(x,y)	# this command calculates two sets of predictive densities for the normal, one based on maxlik, 
			# and one that includes parameter uncertainty based on a calibrating prior
> print(d$ml_pdf)	# have a look at the maxlik densities
> print(d$cp_pdf)	# have a look at the densities that include parameter uncertainty, which will have fatter tails

					

Example 2: Fitting a GEV distribution and calculating the fitted quantiles

> set.seed(1)
> x=rlogis(20)				# make some example training data 
> p=seq(0.001,0.999,0.001)		# define the probabilities at which we wish to calculate the quantiles
> q=qgev_cp(x,p)			# this command calculates two sets of predictive quantiles for the GEV, 
					# one based on maxlik, and one that includes parameter uncertainty based on a calibrating prior
> q$ml_params 				# have a look at the maxlik parameters
> plot(q$ml_quantiles,p)		# plot the maxlik quantiles
> points(q$cp_quantiles,p,col="red")	# overplot the quantiles that include parameter uncertainty, which will have fatter tails

					

Example 3: Fitting a GPD and calculating the fitted cumulative distribution function

> set.seed(1)
> x=rlnorm(20,sdlog=2)		# make some example training data (all positive values)
> y=seq(0.1,5,0.1)		# define the points at which we wish to calculate the probabilities
> p=pgpd_k1_cp(x,y)		# this command calculates two sets of predictive probabilities for the GPD (with known location parameter, hence k1), 
				# one based on maxlik, and one that includes parameter uncertainty based on a calibrating prior
> plot(y,p$ml_cdf)		# plot the maxlik probabilities
> points(y,p$cp_cdf,col="red")	# overplot the probabilities that include parameter uncertainty, 
				# which will have fatter tails

					

Example 4: Fitting a Gumbel distribution and generating random numbers

> x=rnorm(10)					# make some example training data 
> r=rgumbel_cp(10000,x)				# this command calculates two sets of 10,000 predictive random numbers, 
						# based on maxlik, and then with parameter uncertainty based on a calibrating prior
> plot(density(r$ml_deviates))			# plot the density of the maxlik random numbers
> lines(density(r$cp_deviates),col="red")	# overplot the density of the random numbers including parameter uncertainty,
						# which will have fatter tails

					

Example 5: Generating posterior samples from a Weibull distribution

> x=rlnorm(20)						# make some example training data (all positive values)
> t=tweibull_cp(1000,x)$theta_samples			# this command generates 1000 posterior samples
			

Example 6: Fitting a GEV distribution with a predictor on the location parameter

> set.seed(1)
> n=30
> t=0.1*c(1:n)/n			# make an example predictor
> x=t+rlogis(n)				# make training data based on the predictor
> t0=t[n]				# set the value of the predictor at which to make the prediction
> p=seq(0.001,0.999,0.001)		# specify the probabilities at which to generate quantiles
> q=qgev_p1_cp(x,t,t0=t0,n0=NA,p) 	# specify either t0, or n0, and make the other one NA
			

Example 7: Fitting a GEV distribution with predictors on location and scale parameters

> set.seed(1)
> n=50
> t1=0.1*c(1:n)/n					# make the first predictor
> t2=0.2*c(1:n)/n					# make the second predictor
> x=(t1+rlogis(n))*t2					# make training data based on the predictors
> n0=n							# set the values of the predictors at which to make the prediction
> p=seq(0.001,0.999,0.001)		
> q=qgev_p12_cp(x,t1,t2,t01=NA,t02=NA,n01=n0,n02=n0,p)	# specify either t01/t02 or n01/n02, and make the others NA
			

Detailed Examples by model

The short examples are given in the help pages accessible from within R (for example, in R, type >?norm_cp).

[Table of links to short and long R code examples for each of the models listed above.]

Theory

The theory is discussed in detail in the citation given below. Here are some brief notes.
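
Stated informally (a sketch of the reliability idea described above, not the paper's exact definitions): a prediction system is reliable if, averaging over both the training data X = (X_1, ..., X_n) and the future outcome Y, its predictive quantiles \hat{Q}_p(X) have the correct exceedance frequencies,

P\big( Y \le \hat{Q}_p(X) \big) = p \quad \text{for all } p \in (0,1).

Maximum likelihood prediction fails this criterion, most severely in the tails; the calibrating prior is chosen so that the Bayesian predictive quantiles satisfy it exactly where possible, and as closely as possible otherwise.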

Other

Various other outputs, additional options, and related software tools are included in the package. More details are given in the help pages accessible from within R (e.g., for the GEV, type >?gev_cp).
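
For example, to see everything a particular fit returns, apply names() to the returned object (reusing the qgev_cp call from Example 2; the exact set of components varies by model and by the options chosen):

> q=qgev_cp(x,p)	# fit as in Example 2
> names(q)		# list the components of the returned object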

Python

We plan to release a pure Python version in summer 2025. In the meantime, here is an example that shows how to run the R routines from Python, which is straightforward: it just involves converting the input and output data between Python and R formats.
> import numpy as np 					# regular Python imports
> import os			
> os.environ["R_HOME"] = "C:/PROGRA~1/R/R-43~1.1"	# point to your R installation
> import rpy2.robjects as robjects			# import the routines that make R work in Python
> from rpy2.robjects.packages import importr
> cp=importr('fitdistcp')				# define the R library we want to use
> x = np.arange(1, 11) ** 4				# define the input data, as Python variables
> p = np.arange(1, 4) / 4				# noting that Python counts from 0 not 1
> rx=robjects.FloatVector(x)				# convert the input data to R variables
> rp=robjects.FloatVector(p)
> q=cp.qexp_cp(rx,rp)					# call the R function by prefixing with the library name we defined above
> ml=q.rx2("ml_quantiles")				# extract the parts of the output that we want, into Python variables
> cpq=q.rx2("cp_quantiles")				# use a new name here, so we don't overwrite the library handle cp defined above
> print("ml quantiles:")				# have a look at the results
> print(ml)
> print("cp quantiles:")
> print(cpq)
> # robjects.r.source("script.R", encoding="utf-8")	# how to run the whole thing as a script
					

Citation

Please cite the following paper:

Notes

Tips and Tricks

  • The EVT routines (GEV and GPD) do not produce predictive densities and probabilities using DMGS, only quantiles. For the GEV, this is because when the maxlik shape is negative (positive), the DMGS p and d calculations don't handle the maxlik upper (lower) bound very well, because they can't give values beyond the bound, even though parameter uncertainty does lead to values beyond the bound. The DMGS q routines work fine, and can give values beyond the maxlik bound. To make up for the lack of DMGS densities, the q routine calculates a density derived from the quantiles, as a function of the probabilities, which does go beyond the maxlik bound (see the sketch after this list). Posterior sampling for densities using RUST works fine for all routines.
  • For the GEV, it is sometimes necessary to specify initial parameter values for the maxlik parameter search.
  • GEV with 3 predictors is included because it appears in some papers, but it's likely to be overfitted in almost all cases, and should probably be replaced by the GEV with 1 or 2 predictors. Check the AIC scores.
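
The sketch below illustrates the quantile-derived density mentioned in the first bullet above. It is a rough finite-difference illustration of the idea (density = derivative of probability with respect to the quantile), not necessarily the package's internal calculation, and it reuses the qgev_cp interface from Example 2.

> set.seed(1)
> x=rlogis(50)				# make some example training data
> p=seq(0.001,0.999,0.001)		# probabilities at which to calculate quantiles
> q=qgev_cp(x,p)			# calibrating prior quantiles, which can go beyond the maxlik bound
> d=diff(p)/diff(q$cp_quantiles)	# finite-difference estimate of the predictive density
> qmid=0.5*(q$cp_quantiles[-1]+q$cp_quantiles[-length(p)])	# quantile midpoints
> plot(qmid,d,type="l")			# density derived from the quantiles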

Work in Progress

  • The code-base is new, and is currently undergoing refactoring and improvements.
  • Where possible, numerical derivatives are being switched to analytical derivatives. Most have now been switched.
  • More models are gradually being added.
  • Ultimately the guts could be written in a compiled language, and would presumably run much faster (although it's already pretty fast, so maybe that doesn't matter).

Known Issues

  • None right now.

Bugs

  • There are undoubtedly bugs in the code and in the documentation. Please let me know.
  • Bugs and issues can be reported on the GitHub page.

Acknowledgements

  • Many thanks to Trevor Sweeting for help with the statistical theory, and Lynne Jewson for help with the group theory.
  • Also, many thanks to the anonymous insurance companies that fund this research project.
  • And in addition, many thanks to the people I've discussed this whole topic with over the last couple of years, including Clare Barnes, Chris Paciorek and Stephen Dupon.

Applications

  • I'm working with various people on various projects to apply these routines to, e.g., sea level, temperature, wind-speed, rainfall, weather forecasts and hurricane damages. Let me know if you want to talk about an application.

Research

  • There are various extensions and improvements in the pipeline. I post news about my research on Linkedin.

Contact

  • I can be contacted at stephen.jewson-at-gmail.com