Fitdistcp: fitting statistical models using calibrating priors

fitdistcp is a free R package for fitting statistical models using calibrating priors, with the goal of making reliable predictions.

fitdistcp can be called from R or Python (details below), and we hope to release a pure Python version in summer 2025.

Maximum likelihood and other point-estimate methods underestimate predictive tail probabilities. For instance, with maximum likelihood prediction, a GEVD with one predictor and a shape parameter of -0.3, fitted to 50 data points, underestimates the 1-in-50 tail probability by a factor of 4. This is called a reliability bias.

Calibrating prior prediction provides an objective way to reduce this reliability bias by including parameter uncertainty. It eliminates the bias completely in many cases. The benefit of calibrating prior prediction, relative to maximum likelihood, is largest for small samples, models with more parameters, and in the tails.
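
As a rough illustration of both effects, here is a minimal simulation sketch. It uses the normal distribution rather than the GEVD quoted above, and assumes that qnorm_cp follows the same interface and output names as qgev_cp in Example 2 below. It checks how often a future observation exceeds the fitted 1-in-50 (probability 0.98) quantile; for a reliable prediction this should happen 2% of the time.

> library(fitdistcp)
> set.seed(1)
> nrep=2000; n=20
> exceed=function(){x=rnorm(n); q=qnorm_cp(x,p=0.98); xnew=rnorm(1); c(xnew>q$ml_quantiles, xnew>q$cp_quantiles)}
> res=replicate(nrep,exceed())		# 2 x nrep matrix of exceedance indicators (maxlik row, then cp row)
> rowMeans(res)				# the maxlik frequency typically sits noticeably above the target of 0.02;
					# the calibrating prior frequency is typically much closer to 0.02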

fitdistcp uses objective Bayesian methods. The prior is chosen to maximize the reliability of the predictions, i.e., to give predictive probabilities that are as close as possible to the frequencies of future outcomes. The integration is performed analytically if possible, or using DMGS asymptotics otherwise. Posterior sampling is also available as an option (but is slower).
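
In outline (a sketch of the standard Bayesian prediction step, with notation that is mine rather than the package documentation's): given training data x = (x_1, ..., x_n), a sampling density f(y|theta) and a calibrating prior pi(theta), the predictive density is the posterior-weighted average of the sampling density,

p(y \mid x) = \int f(y \mid \theta)\, \pi(\theta \mid x)\, d\theta,
\qquad
\pi(\theta \mid x) \propto \pi(\theta) \prod_{i=1}^{n} f(x_i \mid \theta).

It is this integral that is evaluated analytically where possible and by DMGS asymptotics otherwise; the posterior sampling option approximates the same integral by Monte Carlo instead.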

For many cases, charts are provided below that quantify the size of the maximum likelihood reliability bias, and the reduced bias from calibrating prior prediction with DMGS integration.

Models

This is a list of the models that are currently supported, in alphabetical order. This list is motivated by various climate and actuarial applications. Let me know if you have any suggestions for other models to include.

Model | R Command | H/I | A/D | Maxlik bias | CP bias
Cauchy | cauchy_cp | H | D | N/A | N/A
Cauchy with single predictor on the location | cauchy_p1_cp | H | D | N/A | N/A
Exponential | exp_cp | H | A | all parameters | all parameters
Exponential with single predictor on the rate | exp_p1_cp | H | D | all parameters | all parameters
Frechet with known location | frechet_k1_cp | H | D | all parameters | all parameters
Frechet with known location, and single predictor on the scale | frechet_k1p3_cp | H | D | N/A | N/A
Gamma | gamma_cp | I | D | N/A | N/A
Inverse gamma | invgamma_cp | I | D | N/A | N/A
Inverse Gaussian | invgauss_cp | I | D | N/A | N/A
Generalized normal with known shape | gnorm_cp | H | D | N/A | N/A
GEV | gev_cp | I | D | charts | charts
GEV with 1 predictor, on the location | gev_p1_cp | I | D | charts | charts
GEV with 2 predictors, one each on the location and scale | gev_p12_cp | I | D | charts | charts
GEV with 3 predictors, one each on the location, scale and shape | gev_p123_cp | I | D | N/A | N/A
GEV with known shape | gev_k3_cp | H | D | N/A | N/A
GEV with known shape with a single predictor on the location | gev_k3p1_cp | H | D | N/A | N/A
GPD with known location | gpd_k1_cp | I | D | charts | charts
GPD with known location and shape | gpd_k13_cp | I | D | N/A | N/A
Gumbel | gumbel_cp | H | D | all parameters | all parameters
Gumbel with single predictor on the location | gumbel_p1_cp | H | D | all parameters | all parameters
Half-normal | halfnorm_cp | H | D | N/A | N/A
Logistic | logis_cp | H | D | all parameters | all parameters
Logistic with single predictor on the location | logis_p1_cp | H | D | N/A | N/A
Log-normal | lnorm_cp | H | A | same as norm | same as norm
Log-normal with a single predictor on the log-mean | lnorm_p1_cp | H | A | same as norm_p1 | same as norm_p1
Normal | norm_cp | H | A | all parameters | all parameters
Normal with 1 predictor, on the mean, i.e., simple linear regression | norm_p1_cp | H | A | all parameters | all parameters
Pareto with known scale | pareto_k1_cp | H | A | same as exp | same as exp
Pareto with known scale and a single predictor on the shape parameter | pareto_p1k3_cp | H | D | same as exp_p1 | same as exp_p1
t distribution with known degrees of freedom | lst_k3_cp | H | D | N/A | N/A
t distribution with known degrees of freedom with a single predictor on the location parameter | lst_p1k5_cp | H | D | N/A | N/A
Uniform | unif_cp | H | A | N/A | N/A
Weibull | weibull_cp | H | D | all parameters | all parameters
Weibull with single predictor on the scale | weibull_p2_cp | H | D | N/A | N/A

Notes on the table: "charts" indicates that bias charts are provided for shape parameter xi = -0.4, -0.3, -0.2, -0.1, 0, 0.1, 0.2, 0.3, 0.4 and sample sizes n = 30, 40, 50, 60, 70, 80, 100, 120. The A/D column indicates whether the integration is performed analytically (A) or using DMGS asymptotics (D).

Examples

Example 1: Fitting a normal distribution and calculating the fitted probability density function

> x=rnorm(20)		# make some example training data 
> y=seq(-5,5,0.01)	# define the points at which we wish to calculate the density
> d=dnorm_cp(x,y)	# this command calculates two sets of predictive densities for the normal, one based on maxlik, 
			# and one that includes parameter uncertainty based on a calibrating prior
> print(d$ml_pdf)	# have a look at the maxlik densities
> print(d$cp_pdf)	# have a look at the densities that include parameter uncertainty, which will have fatter tails

					

Example 2: Fitting a GEV distribution and calculating the fitted quantiles

> set.seed(1)
> x=rlogis(20)				# make some example training data 
> p=seq(0.001,0.999,0.001)		# define the probabilities at which we wish to calculate the quantiles
> q=qgev_cp(x,p)			# this command calculates two sets of predictive quantiles for the GEV, 
					# one based on maxlik, and one that includes parameter uncertainty based on a calibrating prior
> q$ml_params 				# have a look at the maxlik parameters
> plot(q$ml_quantiles,p)		# plot the maxlik quantiles
> points(q$cp_quantiles,p,col="red")	# overplot the quantiles that include parameter uncertainty, which will have fatter tails

					

Example 3: Fitting a GPD and calculating the fitted cumulative distribution function

> set.seed(1)
> x=rlnorm(20,sdlog=2)		# make some example training data (all positive values)
> y=seq(0.1,5,0.1)		# define the points at which we wish to calculate the probabilities
> p=pgpd_k1_cp(x,y)		# this command calculates two sets of predictive probabilities for the GPD (with known location parameter, hence k1), 
				# one based on maxlik, and one that includes parameter uncertainty based on a calibrating prior
> plot(y,p$ml_cdf)		# plot the maxlik probabilities
> points(y,p$cp_cdf,col="red")	# overplot the probabilities that include parameter uncertainty, 
				# which will have fatter tails

					

Example 4: Fitting a Gumbel distribution and generating random numbers

> x=rnorm(10)					# make some example training data 
> r=rgumbel_cp(10000,x)				# this command calculates two sets of 10,000 predictive random numbers, 
						# based on maxlik, and then with parameter uncertainty based on a calibrating prior
> plot(density(r$ml_deviates))			# plot the density of the maxlik random numbers
> lines(density(r$cp_deviates),col="red")	# overplot the density of the random numbers including parameter uncertainty,
						# which will have fatter tails

					

Example 5: Generating posterior samples from a Weibull distribution

> x=rlnorm(20)						# make some example training data (all positive values)
> t=tweibull_cp(1000,x)$theta_samples			# this command generates 1000 posterior samples
			

Example 6: Fitting a GEV distribution with a predictor on the location parameter

> set.seed(1)
> n=30
> t=0.1*c(1:n)/n			# make an example predictor
> x=t+rlogis(n)				# make training data based on the predictor
> t0=t[n]				# set the value of the predictor at which to make the prediction
> p=seq(0.001,0.999,0.001)		# specify the probabilities at which to generate quantiles
> q=qgev_p1_cp(x,t,t0=t0,n0=NA,p) 	# specify either t0, or n0, and make the other one NA
			

Example 7: Fitting a GEV distribution with predictors on location and scale parameters

> set.seed(1)
> n=50
> t1=0.1*c(1:n)/n					# make the first predictor
> t2=0.2*c(1:n)/n					# make the second predictor
> x=(t1+rlogis(n))*t2					# make training data based on the predictors
> n0=n							# set the values of the predictors at which to make the prediction
> p=seq(0.001,0.999,0.001)		
> q=qgev_p12_cp(x,t1,t2,t01=NA,t02=NA,n01=n0,n02=n0,p)	# specify either t01/t02 or n01/n02, and make the others NA
			

Detailed Examples by model

The short examples are given in the help pages accessible from within R (for example, in R, type >?norm_cp).

[Table of links to short and long R code examples for each of the models listed above.]

Theory

The theory is discussed in detail in the citation given below. Here are some brief notes.
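
Stated informally (a sketch of the reliability idea described above, not the paper's exact definitions): a prediction system is reliable if, averaging over both the training data X = (X_1, ..., X_n) and the future outcome Y, its predictive quantiles \hat{Q}_p(X) have the correct exceedance frequencies,

P\big( Y \le \hat{Q}_p(X) \big) = p \quad \text{for all } p \in (0,1).

Maximum likelihood prediction fails this criterion, most severely in the tails; the calibrating prior is chosen so that the Bayesian predictive quantiles satisfy it exactly where possible, and as closely as possible otherwise.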

Other

Various other outputs, additional options, and related software tools are included in the package. More details are given in the help pages accessible from within R (e.g., for the GEV, type >?gev_cp).
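
For example, to see everything a particular fit returns, apply names() to the returned object (reusing the qgev_cp call from Example 2; the exact set of components varies by model and by the options chosen):

> q=qgev_cp(x,p)	# fit as in Example 2
> names(q)		# list the components of the returned object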

Python

We plan to release a pure Python version in summer 2025. In the meantime, here is an example that shows how to run the R routines from Python, which is straightforward: it just involves converting the input and output data between Python and R formats.
> import numpy as np 					# regular Python imports
> import os			
> os.environ["R_HOME"] = "C:/PROGRA~1/R/R-43~1.1"	# point to your R installation
> import rpy2.robjects as robjects			# import the routines that make R work in Python
> from rpy2.robjects.packages import importr
> cp=importr('fitdistcp')				# define the R library we want to use
> x = np.arange(1, 11) ** 4				# define the input data, as Python variables
> p = np.arange(1, 4) / 4				# noting that Python counts from 0 not 1
> rx=robjects.FloatVector(x)				# convert the input data to R variables
> rp=robjects.FloatVector(p)
> q=cp.qexp_cp(rx,rp)					# call the R function by prefixing with the library name we defined above
> ml=q.rx2("ml_quantiles")				# extract the parts of the output that we want, into Python variables
> cpq=q.rx2("cp_quantiles")				# use a new name here, so we don't overwrite the library handle cp defined above
> print("ml quantiles:")				# have a look at the results
> print(ml)
> print("cp quantiles:")
> print(cpq)
> # robjects.r.source("script.R", encoding="utf-8")	# how to run the whole thing as a script
					

Citation

Please cite the following paper:

Notes

Tips and Tricks

  • The EVT routines (GEV and GPD) do not produce predictive densities and probabilities using DMGS, only quantiles. For the GEV, this is because when the maxlik shape is negative (positive), the DMGS p and d calculations don't handle the maxlik upper (lower) bound very well, because they can't give values beyond the bound, even though parameter uncertainty does lead to values beyond the bound. The DMGS q routines work fine, and can give values beyond the maxlik bound. To make up for the lack of DMGS densities, the q routine calculates a density derived from the quantiles, as a function of the probabilities, which does go beyond the maxlik bound (see the sketch after this list). Posterior sampling for densities using RUST works fine for all routines.
  • For the GEV, it is sometimes necessary to specify initial parameter values for the maxlik parameter search.
  • GEV with 3 predictors is included because it appears in some papers, but it's likely to be overfitted in almost all cases, and should probably be replaced by the GEV with 1 or 2 predictors. Check the AIC scores.
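
The sketch below illustrates the quantile-derived density mentioned in the first bullet above. It is a rough finite-difference illustration of the idea (density = derivative of probability with respect to the quantile), not necessarily the package's internal calculation, and it reuses the qgev_cp interface from Example 2.

> set.seed(1)
> x=rlogis(50)				# make some example training data
> p=seq(0.001,0.999,0.001)		# probabilities at which to calculate quantiles
> q=qgev_cp(x,p)			# calibrating prior quantiles, which can go beyond the maxlik bound
> d=diff(p)/diff(q$cp_quantiles)	# finite-difference estimate of the predictive density
> qmid=0.5*(q$cp_quantiles[-1]+q$cp_quantiles[-length(p)])	# quantile midpoints
> plot(qmid,d,type="l")			# density derived from the quantiles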

Work in Progress

  • The code-base is new, and is currently undergoing refactoring and improvements.
  • Where possible, numerical derivatives are being switched to analytical derivatives. Most have now been switched.
  • More models are gradually being added.
  • Ultimately the guts could be written in a compiled language, and would presumably run much faster (although it's already pretty fast, so maybe that doesn't matter).

Known Issues

  • None right now.

Bugs

  • There are undoubtedly bugs in the code and in the documentation. Please let me know.
  • Bugs and issues can be reported on the GitHub page.

Acknowledgements

  • Many thanks to Trevor Sweeting for help with the statistical theory, and Lynne Jewson for help with the group theory.
  • Also, many thanks to the anonymous insurance companies that fund this research project.
  • And in addition, many thanks to the people I've discussed this whole topic with over the last couple of years, including Clare Barnes, Chris Paciorek and Stephen Dupon.

Applications

  • I'm working with various people on various projects to apply these routines to, e.g., sea level, temperature, wind-speed, rainfall, weather forecasts and hurricane damages. Let me know if you want to talk about an application.

Research

  • There are various extensions and improvements in the pipeline. I post news about my research on Linkedin.

Contact

  • I can be contacted at stephen.jewson-at-gmail.com