Auto Insurance Frequency-Severity Modelling

In this project, I build a frequency-severity model for automobile insurance claims using the French Motor Third-Party Liability dataset (freMTPL2freq and freMTPL2sev). The goal is to predict expected pure premiums for policyholders by separately modeling claim frequency and severity, then combining them into a pricing model.

I begin by loading the required R packages for data manipulation, visualization, and modeling.

Loading Libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
library(ggplot2)
library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(dplyr)
library(patchwork)

## Warning: package 'patchwork' was built under R version 4.5.1

## 
## Attaching package: 'patchwork'
## 
## The following object is masked from 'package:MASS':
## 
##     area

library(pscl)

## Warning: package 'pscl' was built under R version 4.5.1

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

The dataset consists of two linked files: one with claim frequency information and one with claim severity. I import both and join them on the policy identifier. Missing claim amounts are set to zero to represent policies with no claims.

freq <- read.csv("freMTPL2freq.csv")
sev <- read.csv("freMTPL2sev.csv")

head(freq)

##   IDpol ClaimNb Exposure VehPower VehAge DrivAge BonusMalus VehBrand  VehGas
## 1     1       1     0.10        5      0      55         50      B12 Regular
## 2     3       1     0.77        5      0      55         50      B12 Regular
## 3     5       1     0.75        6      2      52         50      B12  Diesel
## 4    10       1     0.09        7      0      46         50      B12  Diesel
## 5    11       1     0.84        7      0      46         50      B12  Diesel
## 6    13       1     0.52        6      2      38         50      B12 Regular
##   Area Density             Region
## 1    D    1217        Rhone-Alpes
## 2    D    1217        Rhone-Alpes
## 3    B      54           Picardie
## 4    B      76          Aquitaine
## 5    B      76          Aquitaine
## 6    E    3003 Nord-Pas-de-Calais

str(freq)

## 'data.frame':    678013 obs. of  12 variables:
##  $ IDpol     : num  1 3 5 10 11 13 15 17 18 21 ...
##  $ ClaimNb   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Exposure  : num  0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
##  $ VehPower  : int  5 5 6 7 7 6 6 7 7 7 ...
##  $ VehAge    : int  0 0 2 0 0 2 2 0 0 0 ...
##  $ DrivAge   : int  55 55 52 46 46 38 38 33 33 41 ...
##  $ BonusMalus: int  50 50 50 50 50 50 50 68 68 50 ...
##  $ VehBrand  : chr  "B12" "B12" "B12" "B12" ...
##  $ VehGas    : chr  "Regular" "Regular" "Diesel" "Diesel" ...
##  $ Area      : chr  "D" "D" "B" "B" ...
##  $ Density   : int  1217 1217 54 76 76 3003 3003 137 137 60 ...
##  $ Region    : chr  "Rhone-Alpes" "Rhone-Alpes" "Picardie" "Aquitaine" ...

head(sev)

##     IDpol ClaimAmount
## 1    1552      995.20
## 2 1010996     1128.12
## 3 4024277     1851.11
## 4 4007252     1204.00
## 5 4046424     1204.00
## 6 4073956     1204.00

str(sev)

## 'data.frame':    26639 obs. of  2 variables:
##  $ IDpol      : int  1552 1010996 4024277 4007252 4046424 4073956 4012173 4020812 4020812 4074074 ...
##  $ ClaimAmount: num  995 1128 1851 1204 1204 ...

df <- full_join(freq, sev, by = "IDpol")


df <- df|>
  mutate(ClaimAmount = replace_na(ClaimAmount, 0))

str(df)

## 'data.frame':    679708 obs. of  13 variables:
##  $ IDpol      : num  1 3 5 10 11 13 15 17 18 21 ...
##  $ ClaimNb    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Exposure   : num  0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
##  $ VehPower   : int  5 5 6 7 7 6 6 7 7 7 ...
##  $ VehAge     : int  0 0 2 0 0 2 2 0 0 0 ...
##  $ DrivAge    : int  55 55 52 46 46 38 38 33 33 41 ...
##  $ BonusMalus : int  50 50 50 50 50 50 50 68 68 50 ...
##  $ VehBrand   : chr  "B12" "B12" "B12" "B12" ...
##  $ VehGas     : chr  "Regular" "Regular" "Diesel" "Diesel" ...
##  $ Area       : chr  "D" "D" "B" "B" ...
##  $ Density    : int  1217 1217 54 76 76 3003 3003 137 137 60 ...
##  $ Region     : chr  "Rhone-Alpes" "Rhone-Alpes" "Picardie" "Aquitaine" ...
##  $ ClaimAmount: num  0 0 0 0 0 0 0 0 0 0 ...

To prepare the data, I filter out extreme values such as very old drivers, unusually high claim counts, and vehicles with unrealistic ages. I also convert categorical variables such as vehicle brand, fuel type, and region into factors for modeling.

df <- df |>
  filter(ClaimNb <= 4) |>
  filter(Exposure <= 1) |>
  filter(DrivAge <= 90) |>
  filter(VehAge <= 40) |>
  mutate(across(c(VehBrand, VehGas, Area, Region), as.factor))

str(df)

## 'data.frame':    677581 obs. of  13 variables:
##  $ IDpol      : num  1 3 5 10 11 13 15 17 18 21 ...
##  $ ClaimNb    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Exposure   : num  0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
##  $ VehPower   : int  5 5 6 7 7 6 6 7 7 7 ...
##  $ VehAge     : int  0 0 2 0 0 2 2 0 0 0 ...
##  $ DrivAge    : int  55 55 52 46 46 38 38 33 33 41 ...
##  $ BonusMalus : int  50 50 50 50 50 50 50 68 68 50 ...
##  $ VehBrand   : Factor w/ 11 levels "B1","B10","B11",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ VehGas     : Factor w/ 2 levels "Diesel","Regular": 2 2 1 1 1 2 2 1 1 1 ...
##  $ Area       : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
##  $ Density    : int  1217 1217 54 76 76 3003 3003 137 137 60 ...
##  $ Region     : Factor w/ 21 levels "Alsace","Aquitaine",..: 21 21 18 2 2 16 16 13 13 17 ...
##  $ ClaimAmount: num  0 0 0 0 0 0 0 0 0 0 ...

Before modeling, I explore the distribution of claim counts. Most policies have zero claims, with the frequency dropping sharply as the number of claims increases. This suggests strong overdispersion, motivating the use of a negative binomial or zero-inflated model for frequency.

df |>
  count(ClaimNb) |>
  ggplot(aes(x = factor(ClaimNb), y = n)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
    geom_text(aes(label = scales::comma(n)), 
            vjust = -0.5, size = 3.5, fontface = "bold") +
  labs(title = "Distribution of Claim Counts",
       x = "Number of Claims", y = "Count") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)

I fit a zero-inflated negative binomial (ZINB) model to account for excess zeros and overdispersion in claim counts. The model uses driver and vehicle characteristics as predictors, with exposure included as an offset.

zinb_model <- zeroinfl(
  ClaimNb ~ Area + VehPower + VehAge + DrivAge + BonusMalus + 
             VehBrand + VehGas + Density + Region + offset(log(Exposure)) | 1,
  data = df,
  dist = "negbin"
)

summary(zinb_model)

## Warning in sqrt(diag(object$vcov)): NaNs produced

## 
## Call:
## zeroinfl(formula = ClaimNb ~ Area + VehPower + VehAge + DrivAge + BonusMalus + 
##     VehBrand + VehGas + Density + Region + offset(log(Exposure)) | 1, 
##     data = df, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
##  -0.6705  -0.2652  -0.2088  -0.1176 118.1343 
## 
## Count model coefficients (negbin with log link):
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -3.991e+00  1.008e-01 -39.582  < 2e-16 ***
## AreaB                              5.794e-02  2.272e-02   2.550 0.010787 *  
## AreaC                              1.030e-01  1.884e-02   5.468 4.55e-08 ***
## AreaD                              1.968e-01  1.991e-02   9.884  < 2e-16 ***
## AreaE                              2.618e-01  2.125e-02  12.320  < 2e-16 ***
## AreaF                              2.795e-01  4.124e-02   6.777 1.23e-11 ***
## VehPower                           1.384e-02  2.882e-03   4.802 1.57e-06 ***
## VehAge                            -3.958e-02  1.218e-03 -32.481  < 2e-16 ***
## DrivAge                            6.825e-03  4.222e-04  16.164  < 2e-16 ***
## BonusMalus                         2.424e-02  3.421e-04  70.832  < 2e-16 ***
## VehBrandB10                       -3.545e-03  3.821e-02  -0.093 0.926081    
## VehBrandB11                        9.828e-02  4.092e-02   2.402 0.016322 *  
## VehBrandB12                        1.869e-01  1.834e-02  10.193  < 2e-16 ***
## VehBrandB13                        2.974e-02  4.267e-02   0.697 0.485812    
## VehBrandB14                       -1.765e-01  8.130e-02  -2.171 0.029908 *  
## VehBrandB2                        -1.319e-02  1.597e-02  -0.826 0.408756    
## VehBrandB3                         3.084e-03  2.276e-02   0.135 0.892234    
## VehBrandB4                        -3.599e-02  3.103e-02  -1.160 0.246002    
## VehBrandB5                         7.348e-02  2.597e-02   2.830 0.004656 ** 
## VehBrandB6                        -3.833e-02  2.949e-02  -1.300 0.193575    
## VehGasRegular                      3.790e-02  1.141e-02   3.322 0.000894 ***
## Density                           -1.737e-06        NaN     NaN      NaN    
## RegionAquitaine                   -8.410e-02  9.420e-02  -0.893 0.371986    
## RegionAuvergne                    -3.263e-01  1.171e-01  -2.786 0.005337 ** 
## RegionBasse-Normandie             -2.155e-02  9.927e-02  -0.217 0.828099    
## RegionBourgogne                   -4.333e-02  1.015e-01  -0.427 0.669467    
## RegionBretagne                     5.029e-02  9.243e-02   0.544 0.586367    
## RegionCentre                       2.700e-02  9.102e-02   0.297 0.766735    
## RegionChampagne-Ardenne            1.403e-01  1.223e-01   1.147 0.251302    
## RegionCorse                        2.133e-01  1.106e-01   1.929 0.053673 .  
## RegionFranche-Comte               -9.819e-02  1.645e-01  -0.597 0.550685    
## RegionHaute-Normandie             -7.706e-02  1.078e-01  -0.715 0.474644    
## RegionIle-de-France               -4.296e-03  9.214e-02  -0.047 0.962815    
## RegionLanguedoc-Roussillon        -3.644e-02  9.395e-02  -0.388 0.698140    
## RegionLimousin                     2.062e-01  1.109e-01   1.859 0.062960 .  
## RegionMidi-Pyrenees               -1.573e-01  9.846e-02  -1.597 0.110186    
## RegionNord-Pas-de-Calais          -1.502e-01  9.324e-02  -1.611 0.107183    
## RegionPays-de-la-Loire            -2.761e-02  9.301e-02  -0.297 0.766552    
## RegionPicardie                     2.225e-02  1.033e-01   0.215 0.829426    
## RegionPoitou-Charentes            -5.886e-02  9.617e-02  -0.612 0.540551    
## RegionProvence-Alpes-Cotes-D'Azur  2.036e-02  9.142e-02   0.223 0.823743    
## RegionRhone-Alpes                  7.351e-02  9.117e-02   0.806 0.420052    
## Log(theta)                        -6.293e-01  2.870e-02 -21.926  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.9519     0.2368  -25.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 0.533 
## Number of iterations in BFGS optimization: 156 
## Log-likelihood: -1.499e+05 on 44 Df

The summary confirms that several covariates significantly influence claim frequency.

For claim severity, I restrict the dataset to positive claims only. The claim amount distribution is highly skewed, which is typical for insurance losses. I fit a Gamma GLM with a log link to model the conditional claim amount.

df_sev <- df |> 
  filter(ClaimAmount > 0)

str(df_sev)

## 'data.frame':    26300 obs. of  13 variables:
##  $ IDpol      : num  139 190 414 424 424 463 606 622 811 830 ...
##  $ ClaimNb    : int  1 1 1 2 2 1 1 1 1 1 ...
##  $ Exposure   : num  0.75 0.14 0.14 0.62 0.62 0.31 0.84 0.75 0.76 0.68 ...
##  $ VehPower   : int  7 12 4 10 10 5 10 5 5 4 ...
##  $ VehAge     : int  1 5 0 0 0 0 6 0 0 10 ...
##  $ DrivAge    : int  61 50 36 51 51 45 54 34 44 24 ...
##  $ BonusMalus : int  50 60 85 100 100 50 50 64 50 105 ...
##  $ VehBrand   : Factor w/ 11 levels "B1","B10","B11",..: 4 4 4 4 4 4 4 4 4 1 ...
##  $ VehGas     : Factor w/ 2 levels "Diesel","Regular": 2 1 2 2 2 2 1 2 2 2 ...
##  $ Area       : Factor w/ 6 levels "A","B","C","D",..: 6 2 5 6 6 1 4 4 5 5 ...
##  $ Density    : int  27000 56 4792 27000 27000 12 583 1565 3317 3064 ...
##  $ Region     : Factor w/ 21 levels "Alsace","Aquitaine",..: 12 4 12 12 12 15 20 16 20 12 ...
##  $ ClaimAmount: num  303 1982 1457 990 9844 ...

p1 <- ggplot(df, aes(x = ClaimAmount)) +
  geom_histogram(bins = 15, fill = "blue", color = "black") +
  labs(title = "Claim Amount including zero values",
       x = "Claim Amount")

p2 <- ggplot(df_sev, aes(x = log(ClaimAmount))) +
  geom_histogram(bins = 15, fill = "red", color = "black") +
  labs(title = "Claim Amount excluding zero values",
       x = "Log(Claim Amount)")

combined_plot <- p1 + p2

combined_plot

median(df_sev$ClaimAmount)

## [1] 1172

mean(df_sev$ClaimAmount)

## [1] 2267.601

var(df_sev$ClaimAmount)

## [1] 867343759

sd(df_sev$ClaimAmount)

## [1] 29450.7

sev_model <- glm(ClaimAmount ~ Area + VehPower + VehAge + DrivAge + 
                 BonusMalus + VehBrand + VehGas + Density + Region,
                 data = df_sev,
                 family = Gamma(link = "log")) 

summary(sev_model)

## 
## Call:
## glm(formula = ClaimAmount ~ Area + VehPower + VehAge + DrivAge + 
##     BonusMalus + VehBrand + VehGas + Density + Region, family = Gamma(link = "log"), 
##     data = df_sev)
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        6.538e+00  8.054e-01   8.118 4.94e-16 ***
## AreaB                              1.553e-01  1.841e-01   0.844  0.39889    
## AreaC                             -7.563e-02  1.520e-01  -0.498  0.61873    
## AreaD                             -1.317e-02  1.611e-01  -0.082  0.93486    
## AreaE                             -1.210e-01  2.098e-01  -0.576  0.56434    
## AreaF                             -5.362e-01  7.292e-01  -0.735  0.46218    
## VehPower                           2.446e-02  2.289e-02   1.068  0.28534    
## VehAge                             1.885e-03  9.328e-03   0.202  0.83987    
## DrivAge                           -3.241e-03  3.368e-03  -0.962  0.33583    
## BonusMalus                         7.401e-03  2.465e-03   3.003  0.00268 ** 
## VehBrandB10                        1.469e-01  2.753e-01   0.534  0.59352    
## VehBrandB11                        4.382e-01  2.905e-01   1.509  0.13144    
## VehBrandB12                        1.177e-01  1.534e-01   0.767  0.44286    
## VehBrandB13                        7.219e-02  3.159e-01   0.229  0.81922    
## VehBrandB14                       -1.288e-01  6.189e-01  -0.208  0.83513    
## VehBrandB2                         1.675e-01  1.201e-01   1.394  0.16325    
## VehBrandB3                         2.369e-02  1.667e-01   0.142  0.88704    
## VehBrandB4                         1.549e-01  2.280e-01   0.679  0.49702    
## VehBrandB5                        -1.446e-01  1.915e-01  -0.755  0.45032    
## VehBrandB6                        -2.199e-01  2.155e-01  -1.020  0.30752    
## VehGasRegular                      1.168e-01  8.904e-02   1.312  0.18942    
## Density                            1.358e-05  3.000e-05   0.453  0.65084    
## RegionAquitaine                    3.836e-01  7.596e-01   0.505  0.61353    
## RegionAuvergne                     4.414e-03  9.375e-01   0.005  0.99624    
## RegionBasse-Normandie              5.283e-01  8.000e-01   0.660  0.50900    
## RegionBourgogne                    3.337e-01  8.218e-01   0.406  0.68469    
## RegionBretagne                     5.122e-01  7.472e-01   0.685  0.49306    
## RegionCentre                       7.004e-01  7.358e-01   0.952  0.34118    
## RegionChampagne-Ardenne            1.636e+00  1.081e+00   1.514  0.13001    
## RegionCorse                        8.357e-01  9.513e-01   0.879  0.37967    
## RegionFranche-Comte                4.733e-01  1.347e+00   0.351  0.72542    
## RegionHaute-Normandie              2.126e-01  8.679e-01   0.245  0.80650    
## RegionIle-de-France                3.466e-01  7.471e-01   0.464  0.64270    
## RegionLanguedoc-Roussillon         4.099e-01  7.624e-01   0.538  0.59088    
## RegionLimousin                     1.632e-01  8.851e-01   0.184  0.85372    
## RegionMidi-Pyrenees                1.806e-01  8.159e-01   0.221  0.82481    
## RegionNord-Pas-de-Calais           4.355e-01  7.523e-01   0.579  0.56266    
## RegionPays-de-la-Loire             2.627e-01  7.498e-01   0.350  0.72608    
## RegionPicardie                     5.205e-01  8.283e-01   0.628  0.52972    
## RegionPoitou-Charentes             2.370e-01  7.708e-01   0.307  0.75850    
## RegionProvence-Alpes-Cotes-D'Azur  6.395e-01  7.392e-01   0.865  0.38702    
## RegionRhone-Alpes                  5.654e-01  7.367e-01   0.768  0.44277    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 48.71545)
## 
##     Null deviance: 46301  on 26299  degrees of freedom
## Residual deviance: 43061  on 26258  degrees of freedom
## AIC: 454550
## 
## Number of Fisher Scoring iterations: 14

The results show that driver and vehicle characteristics also explain some variation in claim severity.

df$pred_freq <- predict(zinb_model, type = "response")

df <- df |>
  mutate(pred_decile = ntile(pred_freq, 10))  

calibration <- df |>
  group_by(pred_decile) |>
  summarise(
    avg_pred = mean(pred_freq),
    avg_actual = mean(ClaimNb), 
    .groups = "drop"
  )

ggplot(calibration, aes(x = avg_pred, y = avg_actual)) +
  geom_point(size = 3, color = "blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(title = "Calibration of Frequency Model (Deciles)",
       x = "Average Predicted Frequency",
       y = "Average Actual Frequency") +
  theme_minimal()

To evaluate the models, I compare predicted vs. observed outcomes. For frequency, I assess calibration across deciles of predicted risk. For severity, I compare predicted amounts to observed claim amounts. While the models are not perfectly accurate, they capture general trends in the data.

df_sev$pred_sev <- predict(sev_model, type = "response")

ggplot(df_sev, aes(x = pred_sev, y = ClaimAmount)) +
  geom_point(alpha = 0.3, color = "darkred") +
  xlim(0, 12500)+
  ylim(0, 12500)+
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +
  labs(title = "Observed vs Predicted Severity (Gamma GLM)",
       x = "Predicted Claim Amount", y = "Observed Claim Amount") +
  theme_minimal()

## Warning: Removed 363 rows containing missing values or values outside the scale range
## (`geom_point()`).

df$pred_freq <- predict(zinb_model, newdata = df, type = "response")

df_sev$pred_sev <- predict(sev_model, newdata = df_sev, type = "response")

df_sev_unique <- df_sev |>
  group_by(IDpol) |>
  summarise(pred_sev = mean(pred_sev, na.rm = TRUE))

df <- left_join(df, df_sev_unique, by = "IDpol")

mean_sev <- mean(df$pred_sev, na.rm = TRUE)
df$pred_sev <- ifelse(is.na(df$pred_sev), mean_sev, df$pred_sev)

df$pure_premium <- df$pred_freq * df$pred_sev
loading <- 0.25
df$premium <- df$pure_premium * (1 + loading)

head(df[, c("IDpol", "ClaimNb", "ClaimAmount", "pred_freq", "pred_sev", "pure_premium", "premium")])

##   IDpol ClaimNb ClaimAmount   pred_freq pred_sev pure_premium   premium
## 1     1       1           0 0.015820810 2232.942     35.32695  44.15869
## 2     3       1           0 0.121820235 2232.942    272.01754 340.02193
## 3     5       1           0 0.086869602 2232.942    193.97480 242.46850
## 4    10       1           0 0.009872851 2232.942     22.04551  27.55688
## 5    11       1           0 0.092146613 2232.942    205.75806 257.19757
## 6    13       1           0 0.058369939 2232.942    130.33670 162.92087

With frequency and severity models built, I combine them to compute the pure premium for each policyholder. The pure premium is the expected claim cost, equal to predicted frequency multiplied by predicted severity. A 25% loading factor is then added to obtain the final premium.

ggplot(df, aes(x = pure_premium)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Predicted Pure Premiums",
       x = "Pure Premium", y = "Number of Policyholders") +
  xlim(0, 500)

## Warning: Removed 8174 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

  theme_minimal()

## List of 136
##  $ line                            :List of 6
##   ..$ colour       : chr "black"
##   ..$ linewidth    : num 0.5
##   ..$ linetype     : num 1
##   ..$ lineend      : chr "butt"
##   ..$ arrow        : logi FALSE
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_line" "element"
##  $ rect                            :List of 5
##   ..$ fill         : chr "white"
##   ..$ colour       : chr "black"
##   ..$ linewidth    : num 0.5
##   ..$ linetype     : num 1
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_rect" "element"
##  $ text                            :List of 11
##   ..$ family       : chr ""
##   ..$ face         : chr "plain"
##   ..$ colour       : chr "black"
##   ..$ size         : num 11
##   ..$ hjust        : num 0.5
##   ..$ vjust        : num 0.5
##   ..$ angle        : num 0
##   ..$ lineheight   : num 0.9
##   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : logi FALSE
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ title                           : NULL
##  $ aspect.ratio                    : NULL
##  $ axis.title                      : NULL
##  $ axis.title.x                    :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 2.75points 0points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.title.x.top                :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 0
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 2.75points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.title.x.bottom             : NULL
##  $ axis.title.y                    :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : num 90
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 2.75points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.title.y.left               : NULL
##  $ axis.title.y.right              :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : num -90
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.75points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text                       :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : chr "grey30"
##   ..$ size         : 'rel' num 0.8
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.x                     :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 2.2points 0points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.x.top                 :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 0
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 2.2points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.x.bottom              : NULL
##  $ axis.text.y                     :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 1
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.y.left                : NULL
##  $ axis.text.y.right               :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 0
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.2points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.theta                 : NULL
##  $ axis.text.r                     :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 0.5
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 2.2points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.ticks                      : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ axis.ticks.x                    : NULL
##  $ axis.ticks.x.top                : NULL
##  $ axis.ticks.x.bottom             : NULL
##  $ axis.ticks.y                    : NULL
##  $ axis.ticks.y.left               : NULL
##  $ axis.ticks.y.right              : NULL
##  $ axis.ticks.theta                : NULL
##  $ axis.ticks.r                    : NULL
##  $ axis.minor.ticks.x.top          : NULL
##  $ axis.minor.ticks.x.bottom       : NULL
##  $ axis.minor.ticks.y.left         : NULL
##  $ axis.minor.ticks.y.right        : NULL
##  $ axis.minor.ticks.theta          : NULL
##  $ axis.minor.ticks.r              : NULL
##  $ axis.ticks.length               : 'simpleUnit' num 2.75points
##   ..- attr(*, "unit")= int 8
##  $ axis.ticks.length.x             : NULL
##  $ axis.ticks.length.x.top         : NULL
##  $ axis.ticks.length.x.bottom      : NULL
##  $ axis.ticks.length.y             : NULL
##  $ axis.ticks.length.y.left        : NULL
##  $ axis.ticks.length.y.right       : NULL
##  $ axis.ticks.length.theta         : NULL
##  $ axis.ticks.length.r             : NULL
##  $ axis.minor.ticks.length         : 'rel' num 0.75
##  $ axis.minor.ticks.length.x       : NULL
##  $ axis.minor.ticks.length.x.top   : NULL
##  $ axis.minor.ticks.length.x.bottom: NULL
##  $ axis.minor.ticks.length.y       : NULL
##  $ axis.minor.ticks.length.y.left  : NULL
##  $ axis.minor.ticks.length.y.right : NULL
##  $ axis.minor.ticks.length.theta   : NULL
##  $ axis.minor.ticks.length.r       : NULL
##  $ axis.line                       : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ axis.line.x                     : NULL
##  $ axis.line.x.top                 : NULL
##  $ axis.line.x.bottom              : NULL
##  $ axis.line.y                     : NULL
##  $ axis.line.y.left                : NULL
##  $ axis.line.y.right               : NULL
##  $ axis.line.theta                 : NULL
##  $ axis.line.r                     : NULL
##  $ legend.background               : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ legend.margin                   : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
##   ..- attr(*, "unit")= int 8
##  $ legend.spacing                  : 'simpleUnit' num 11points
##   ..- attr(*, "unit")= int 8
##  $ legend.spacing.x                : NULL
##  $ legend.spacing.y                : NULL
##  $ legend.key                      : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ legend.key.size                 : 'simpleUnit' num 1.2lines
##   ..- attr(*, "unit")= int 3
##  $ legend.key.height               : NULL
##  $ legend.key.width                : NULL
##  $ legend.key.spacing              : 'simpleUnit' num 5.5points
##   ..- attr(*, "unit")= int 8
##  $ legend.key.spacing.x            : NULL
##  $ legend.key.spacing.y            : NULL
##  $ legend.frame                    : NULL
##  $ legend.ticks                    : NULL
##  $ legend.ticks.length             : 'rel' num 0.2
##  $ legend.axis.line                : NULL
##  $ legend.text                     :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : 'rel' num 0.8
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ legend.text.position            : NULL
##  $ legend.title                    :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 0
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ legend.title.position           : NULL
##  $ legend.position                 : chr "right"
##  $ legend.position.inside          : NULL
##  $ legend.direction                : NULL
##  $ legend.byrow                    : NULL
##  $ legend.justification            : chr "center"
##  $ legend.justification.top        : NULL
##  $ legend.justification.bottom     : NULL
##  $ legend.justification.left       : NULL
##  $ legend.justification.right      : NULL
##  $ legend.justification.inside     : NULL
##  $ legend.location                 : NULL
##  $ legend.box                      : NULL
##  $ legend.box.just                 : NULL
##  $ legend.box.margin               : 'margin' num [1:4] 0cm 0cm 0cm 0cm
##   ..- attr(*, "unit")= int 1
##  $ legend.box.background           : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ legend.box.spacing              : 'simpleUnit' num 11points
##   ..- attr(*, "unit")= int 8
##   [list output truncated]
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi TRUE
##  - attr(*, "validate")= logi TRUE

The distribution of predicted pure premiums shows that most policyholders are expected to have relatively low annual claim costs, but a small proportion of policies are associated with much higher expected costs. This matches the typical risk profile observed in insurance portfolios.

In this project, I applied frequency-severity modeling to automobile insurance data. A zero-inflated negative binomial model was used for claim frequency, and a Gamma GLM was used for claim severity. Together, these models allowed me to estimate pure premiums and visualize the distribution of expected risk across the portfolio. While the models could be improved with additional feature engineering or alternative distributions, they demonstrate a standard actuarial approach to pricing and risk modeling.

Auto Insurance Frequency-Severity Modelling

Muhammad Awais

2025-09-19

Loading Libraries