Saturday, October 18, 2025

Inequality and murder, within-country and between nation


So this can be a weblog publish about making and discovering an error in modelling technique, in an apparently easy scenario.

I noticed this publish (toot?) on Mastodon from Daniel Lakens mentioning in passing that “However, the murder charge internationally is properly modeled as exp(ok * income_gini_coefficient).” This struck me as price checking; So I did! It seems it is kind of appropriate.

Here’s a chart I drew of homicides per 100,000 inhabitants on the vertical axis versus revenue or consumption inequality on the horizontal. Consumption is the popular technique to measure poverty and inequality in poorer nations specifically, so on this knowledge, which comes from the World Financial institution, it’s used when out there.

It’s not a giant level, however the mannequin implied by Professor Lakens can be a straight diagonal line, what you’d get for those who simply used geom_smooth(technique = lm, method = y ~ x - 1). I haven’t proven his, however two barely extra complicated fashions.

Right here’s the code that downloads the information, suits these fashions and attracts the chart. A number of issues to notice:

  • There’s a little bit of fancy footwork to outline which nations I label on the chart, and in the long run additionally a couple of I selected manually and advert hoc as a result of I used to be occupied with them.
  • Whereas the y axis on this plot makes use of a logarithm remodel, for the mannequin itself I reworked the response variable not with a logarithm however with a BoxCox transformation that also works when nations have zero homicides in a selected yr. I toyed with alternate options like utilizing a generalized linear mannequin with a log hyperlink (which nonetheless works when knowledge is noticed to be zero, as a result of it will likely be predicted/anticipated to be barely extra) however determined to keep away from this so I may use lme which lets me specify autocorrelated error phrases.
  • As a result of the mannequin and plot are each a bit complicated I set out my very own prediction grid and calculate anticipated values and crude (ie assuming asymptotic normality) confidence intervals reasonably than utilizing marginaleffects package deal to get me one thing out of the field.
  • The ribbons drawn are exhibiting the anticipated murder charge for the nation with random nation impact closest to zero, which occurs to be Czechia with the information on the time of writing (17 October 2025).
  • The ribbons are additionally based mostly on the belief that the hypothetical nation they seek advice from has a median inequality (over all years for which there are observations) as proven on the horizontal axis. This level is fairly essential.
library(tidyverse)
library(WDI)
library(nlme)
library(forecast) # for chosing field cox parameters
library(AICcmodavg) # for predictSE with lmer
library(ggrepel)
library(GGally)

# Evaluation to observe up this comment:
# https://mastodon.sdf.org/@dlakelan/115055789165555934
# "However, the murder charge internationally 
#  is properly modeled as exp(ok * income_gini_coefficient)."

#--------Obtain World Financial institution knowledge and prep it------------
# These searches have been used to seek out collection with numerous knowledge for country-year combos:
# WDIsearch("murder")
# WDIsearch("Gini")

# Metadata for SI.POV.GINI at https://data360files.worldbank.org/data360-data/metadata/WB_WDI/WB_WDI_SI_POV_GINI.pdf
# Key factors embrace:
# * people (not family)
# * adjusted for family measurement
# * could be revenue or consumption relying on what was out there

d <- WDI(indicator = c(murder = "VC.IHR.PSRC.P5", 
                              gini = "SI.POV.GINI")) |> 
  as_tibble()

# Nations we're going to wish to spotlight in our chart
highlights <- c("United States", "Russian Federation", 
                  "Samoa", "Australia", "United Kingdom",
                  "Fiji", "Mexico", "Papua New Guinea")

# There are a couple of country-years with zero homicides so cannot use a easy
# logarithm transformation, however can have a Field-Cox that has the same end result
# Select a lambda that offers a distribution after the transformation the place
# adjustments are comparatively secure in absolute phrases:
lambda <- forecast::BoxCox.lambda(d$murder)

# model of the information we are going to use for plotting and modelling:
d2 <- d |> 
  drop_na() |> 
  group_by(nation) |> 
  mutate(kind = ifelse(yr == max(yr), "Newest", "Earlier")) |> 
  mutate(ctry_avg_gini = imply(gini)) |> 
  ungroup() |> 
  mutate(label = ifelse(kind == "Newest" & (gini > 53 | murder > 35 | 
                                 gini < 26 | murder < 0.26 | nation %in% highlights),
                        nation, ""),
         hom_tran = BoxCox(murder, lambda = lambda),
         nation = as.issue(nation))  |> 
  organize(yr)

#------------Modelling and predictions (for drawing 95% confidence intervals on charts)----
# you could possibly have random slopes too however a good variety of nations have just one
# or two observations
# see https://www.frontiersin.org/journals/schooling/articles/10.3389/feduc.2017.00058/full
# for instance which says you need not less than 6 observations per group to have
# a random slope. Unsure what number of you want, however 1 is not sufficient :)

# Barchart of variety of observations per nation, used later within the weblog publish
p1 <- d2 |> 
  depend(nation, identify = "n_obs") |> 
  depend(n_obs, identify = "n_countries") |> 
  ggplot(aes(x = n_obs, y = n_countries)) +
  geom_col(fill = "steelblue") +
  labs(x = "Variety of observations (ie years)",
       y = "Variety of nations",
      title = "Full knowledge for Gini coefficient and murder charges")

print(p1)

# tremendous easy mannequin, not talked about within the precise weblog publish
model1 <- lm(hom_tran ~ gini, knowledge = filter(d2, kind == "Newest"))

# multilevel mannequin with a rustic random intercept, and inequality solely on the lowest degree
model2 <- lme(hom_tran ~ gini, random = ~1  | nation, 
                    knowledge = d2, correlation = corAR1())


# multilevel mannequin with nation random intercept and nations' common inequality, as properly
# because the lowest degree (country-year) granularity inequality:
model3 <- lme(hom_tran ~ gini + ctry_avg_gini, random = ~1  | nation, 
                    knowledge = d2, correlation = corAR1())

# Pretty excessive (and related) coefficients, about 0.11, however for differ
abstract(model1) # 0.12 for gini
abstract(model3) # 0.11 for ctry_avg_gini; gini not vital

# Comparatively low coefficients - the nation randomness 
# soaks up numerous the randomness:
abstract(model2) # gini not vital

# Is the typical random nation impact principally zero? - test:
stopifnot(spherical(imply(ranef(model2)[[1]]), 10) == 0)
stopifnot(spherical(imply(ranef(model3)[[1]]), 10) == 0)

# Discover the nation with random impact closest to zero. Wanted for predictions
# to attract a median nation ribbon on the chart, though the nation impact
# is definitely not taken under consideration in predictSE
avg_country <- ranef(model3) |> 
  organize(abs(`(Intercept)`)) |> 
  slice(1) |> 
  row.names()

pred_grid <- tibble(gini = 20:65, 
                    ctry_avg_gini = 20:65, 
                    nation = avg_country)

pse1 <- predict(model1, newdata = pred_grid, se = TRUE)
pse2 <- predictSE(model2, newdata = pred_grid, se = TRUE)
pse3 <- predictSE(model3, newdata = pred_grid, se = TRUE)


pred_grid <- pred_grid |> 
  mutate(predicted1 = pse1$match,
         lower1 = predicted1 - 1.96 * pse1$se.match,
        upper1 = predicted1 + 1.96 * pse1$se.match) |> 
  mutate(predicted2 = pse2$match,
         lower2 = predicted2 - 1.96 * pse2$se.match,
        upper2 = predicted2 + 1.96 * pse2$se.match) |> 
  mutate(predicted3 = pse3$match,
         lower3 = predicted3 - 1.96 * pse3$se.match,
        upper3 = predicted3 + 1.96 * pse3$se.match) |> 
  mutate(throughout(predicted1:upper3, operate(x){InvBoxCox(x, lambda = lambda)}))

#------------------Draw chart--------------------

mod_cols <- c("purple", "darkgreen", "brown", "pink")

p2 <- d2 |> 
  ggplot(aes(x = gini, y = murder)) +
  scale_y_log10() +
  xlim(18, 70) +
  scale_shape_manual(values = c(1, 19)) +
  scale_colour_manual(values = c("steelblue", "black")) +
  labs(x = "Inequality (Gini coefficient), based mostly on particular person revenue or in some instances, consumption. Greater means extra inequality.",
       y = "Murder charge (per 100,000)",
      color = "Statement yr:",
      form = "Statement yr:",
      title = "Greater inequality nations have extra homicides.",
      caption = "Supply: World Financial institution, World Improvement Indicators, collection VC.IHR.PSRC.P5 and SI.POV.GINI. Evaluation by freerangestats.information.") 
  
  
p2a <- p2 +
  # model2 - simply country-year degree knowledge, nation random impact
  geom_ribbon(knowledge = pred_grid, aes(ymin = lower2, ymax = upper2, y = NA), 
                fill = mod_cols[2], alpha = 0.2) +
  
  #model3 - country-year but in addition nation common knowledge, nation random impact
  geom_ribbon(knowledge = pred_grid, aes(ymin = lower3, ymax = upper3, y = NA), 
                fill = mod_cols[3], alpha = 0.2) +
  
  geom_point(aes(form = kind, color = kind)) +
  geom_label_repel(aes(label = label), max.overlaps = Inf, color = "grey10", measurement = 2.8,
                   seed = 123, label.measurement = unit(0, "mm"), fill = rgb(0,0,0, 0.04))  +
  # annotations - textual content and arrows - for fashions 2 and three
  annotate("textual content", x = 61.5, y = 0.9, color = mod_cols[2], hjust = 0, vjust = 1,
               label = str_wrap("Mannequin with no such common nation inequality impact.", 24)) +
  annotate("textual content", x = 61.5, y = 200, color = mod_cols[3], hjust = 0, vjust = 1,
               label = str_wrap("Mannequin together with an impact for common inequality in every 
                                nation over time.", 26)) +
  
  annotate("section", x = 64, xend = 64, y = 1, yend = 2, color = mod_cols[2], 
            arrow = arrow(size = unit(2, "mm"))) +
  annotate("section", x = 64, xend = 64, y = 75, yend = 45, color = mod_cols[3], 
            arrow = arrow(size = unit(2, "mm"))) +
  
  labs(subtitle = "Chosen nations highlighted. 
Shaded ribbons present 95% confidence interval of blended results fashions with random nation impact and autocorrelated error phrases.",
)

print(p2a)

Actually at first I simply had that inexperienced line—model2 within the code—which is fairly flat. I used to be naturally struck at the way it doesn’t appear to undergo the information factors. At first I assumed this was as a result of I used to be so intelligent, and had included a rustic degree random impact, and allowed autocorrelation over time for the a number of observations of every nation. That’s, I assumed I’d revealed some basic fact of a non-relationship the place extra naive modelling appeared to point out a relationship.

To extract that mannequin—the one with the inexperienced line that’s conspicuously below the information—from the mess of code above, it’s model2 and it appears like this:

model2 <- lme(hom_tran ~ gini, random = ~1  | nation, 
                    knowledge = d2, correlation = corAR1())

This appeared just like the actually apparent factor to suit for me. Why not just a few extraordinary least squares factor? As a result of most of the nations have a number of observations, as we see on this plot (which was created within the code chunk we’ve already seen above):

I might say that the factor that’s most frequently executed on this scenario is to filter the information all the way down to only one statement per nation—maybe by taking the newest statement for every nation, or the one that’s closest to some yr that’s chosen exactly as a result of most nations have an statement in that yr or near it.

That appeared like a tragic waste of information to me (though in my code above you possibly can see that I did match such a mannequin, model1, and if you wish to have a look at it, it avoids the issue I’m about to speak about!). So my considering was to make use of all the information in a blended results mannequin, with a random intercept on the nation degree to permit for the truth that every new statement on a rustic, whereas positively price one thing, isn’t price as a lot as an unbiased statement as a result of it’s going to be just like the opposite observations for that nation. To do that even higher, in addition to the nation degree random intercept I throw in an autocorrelation of the residuals, exhibiting that the values in a single yr are anticipated to be correlated with these within the earlier yr (as certainly they change into).

That’s all properly and good however on this case it backfired on me. Once I first drew this plot, with solely the inexperienced line of model2 exhibiting, I assumed “Oh that’s cool, once we appropriate for the truth that many of those observations are low-value repeats of the identical nation’s earlier observations, it turns on the market’s little or no relationship between inequality and murder in any case”. However extra considering, and some residuals, and evaluating it to the extra optimistic outcomes from the a lot less complicated model1, shortly made me realise I used to be barking badly up the unsuitable tree.

The issue is finest illustrated by this chart that I drew sooner or later alongside the best way. I realised that as I’ve a random intercept that’s completely different for every nation, the anticipated worth for that nation from the mannequin will range very a lot by nation. However then the modelling of murder on inequality will begin from that time—the mannequin can be looking for to clarify variance in murder charges on this specific nation based mostly on its observations.

If we had a random slope as properly, the end result would look nearer to the darkish gray traces in these plots—a distinct relationship for every nation. However I hadn’t executed this, as a result of I judged there simply weren’t sufficient observations per nation for a random slope (many nations actually have a single statement—financial inequality is tough and expensive to measure, needing a particular family consumption survey if potential). So my fashions look extra just like the pale purple ribbon proven right here:

What’s taking place is that we now have two issues happening:

  • Inequality between nations does fairly a very good job of explaining variations in murder charges.
  • Inside every nation that has observations over a number of years, there may be a lot much less relationship between the 2 variables; and the place there’s a relationship, it differs nation by nation. For instance, judging by the the black dot representing the newest level, Australia and the USA look to have gotten extra unequal over time however much less homicides; Mexico has gotten much less unequal and extra homicides; whereas Russia, Ukraine and the UK have had decreases in each inequality and homicides. In different phrases, for Russia, Ukraine and the UK, the within-country knowledge matches the between-country optimistic rleationship between inequality and murder, however Australia, USA and Mexico have a detrimental relationship between the 2 variables.

One other means of that is to say that the random nation intercept absorbs the variance in murder that could possibly be as a substitute defined by that nation’s enduring common inequality. However we don’t have “enduring, common inequality” as a variable in model2. If we calculate a proxy for that the plain means (common of the inequality values we now have for that nation, disregarding the truth that they arrive from completely different years for every nation), we will have a look at the connection between this “enduring inequality” and the nation intercepts in model2, and we see a robust optimistic relationship. That’s the center left panel (a scatter plot) within the chart beneath, in addition to the center prime correlation (0.608, properly completely different from zero).

The opposite variable in that scatter plot matrix is the nation results from model3, which has been specified like this:

model3 <- lme(hom_tran ~ gini + ctry_avg_gini, random = ~1  | nation, 
                    knowledge = d2, correlation = corAR1())

The distinction from model2 is that now we now have ctry_avg_gini there as our estimate of tolerating common nation inequality. We nonetheless have gini, which is the worth of inequality that varies for this nation yr by yr. That can decide up the typical within-country impact.

Within the scatterplot matrix above we will see that after we embrace every nation’s common inequality within the mannequin, there isn’t any relationship between that variable and the nation degree random results. That is to be anticipated, exactly as a result of the mannequin becoming course of is designed to realize this.

Right here’s the abstract outcomes for model3:

> abstract(model3)
Linear mixed-effects mannequin match by REML
  Information: d2 
       AIC      BIC   logLik
  2542.301 2574.914 -1265.15

Random results:
 Method: ~1 | nation
        (Intercept)  Residual
StdDev:   0.8371227 0.7065144

Correlation Construction: AR(1)
 Method: ~1 | nation 
 Parameter estimate(s):
      Phi 
0.7574314 
Mounted results:  hom_tran ~ gini + ctry_avg_gini 
                  Worth Std.Error   DF   t-value p-value
(Intercept)   -3.193199 0.3950271 1554 -8.083494  0.0000
gini           0.007850 0.0056463 1554  1.390213  0.1647
ctry_avg_gini  0.113641 0.0117143  141  9.701081  0.0000
 Correlation: 
              (Intr) gini  
gini           0.001       
ctry_avg_gini -0.858 -0.482

Standardized Inside-Group Residuals:
         Min           Q1          Med           Q3          Max 
-11.47176745  -0.44184320  -0.04595294   0.42362704   3.06647091 

Variety of Observations: 1698
Variety of Teams: 143 

We see the robust impact of common country-level inequality impact, and no vital proof of an impact of inequality inside nations. Fascinating! However past my scope at present to enter that.

Lastly I wish to end off with a clear plot of my finest mannequin and the phenomenon below query—the robust empirical relationship between financial inequality in a rustic and the murder charge.

Right here’s the code for that ultimate clear plot:

p2b <- p2 +
  
  #model3 - country-year but in addition nation common knowledge, nation random impact
  geom_ribbon(knowledge = pred_grid, aes(ymin = lower3, ymax = upper3, y = NA), 
                fill = mod_cols[3], alpha = 0.2) +
  
  geom_point(aes(form = kind, color = kind)) +
  geom_label_repel(aes(label = label), max.overlaps = Inf, color = "grey10", measurement = 2.8,
                   seed = 123, label.measurement = unit(0, "mm"), fill = rgb(0,0,0, 0.04)) 

print(p2b)

Be aware that this builds on the item p2 that was the idea of the primary plot, the one exhibiting each model2 and model3, to keep away from repetition.

And right here’s the code for the enjoying round with residuals at completely different ranges and drawing that aspect plot model with 12 chosen nations:

#---------------residuals from model2 and model3---------------

# Particular person degree residuals - plot not utilized in weblog
p3 <- d2 |> 
  mutate(`Residuals from Mannequin 2` = residuals(model2),
          `Residuals from Mannequin 3` = residuals(model3)) |> 
  choose(gini, `Residuals from Mannequin 2`, `Residuals from Mannequin 3`) |> 
  collect(variable, worth, -gini) |> 
  ggplot(aes(x = gini, y = worth)) +
  facet_wrap(~variable) +
  geom_point() +
  labs(x = "Inequality",
       y = "Residuals (on Field-Cox reworked scale)",
       title = "Residuals from two mixed-effects fashions of murder",
       subtitle = "Residuals on the most granular degree ie year-country are uncorrelated with inequality")

svg_png(p3, "../img/0303-residuals", w = 8, h = 4)

# out of curiousity what are these low outlier residuals? - Iceland and Malta
spherical(kind(residuals(model2)),2)[1:5]
spherical(kind(residuals(model3)),2)[1:5]

# Nation degree results plot - pairs plot utilized in weblog
p4 <- operate()> 
  ggpairs() +
  labs(title = "Comparability of country-level random results in two fashions",
      subtitle = "Mannequin 3 has a set impact for nations' common inequality; Mannequin 2 would not.
The country-level residuals are correlated with this variable in Mannequin 2, however not Mannequin 3.")
 
    print(p)


p4()


#---------------further illustrations - facets----------------------
d3 <- d2 |> 
  filter(nation %in% c(highlights, "South Africa", "France", "Japan", "Ukraine")) |> 
  group_by(nation) |> 
  summarise(r = coef(lm(log10(murder) ~ gini))[2]) |> 
  mutate(r = replace_na(r, 0)) |> 
  mutate(nation = fct_reorder(nation, r) |>  fct_drop()) |> 
  organize(nation)


pred_grid2 <- d2 |> 
  filter(nation %in% d3$nation) |> 
  distinct(nation, ctry_avg_gini) |> 
  expand_grid(gini = 20:70)

# Be aware that predictSE excludes random nation impact....
more_preds2 <- predictSE(model2, newdata = pred_grid2, se = TRUE)
more_preds3 <- predictSE(model3, newdata = pred_grid2, se = TRUE)


pred_grid2 <- pred_grid2 |> 
  mutate(predicted2 = more_preds2$match,
         lower2 = predicted2 - 1.96 * more_preds2$se.match,
         upper2 = predicted2 + 1.96 * more_preds2$se.match,
         predicted3 = more_preds3$match,
         lower3 = predicted3 - 1.96 * more_preds3$se.match,
        upper3 = predicted3 + 1.96 * more_preds3$se.match) |> 
  mutate(throughout(predicted2:upper3, operate(x){InvBoxCox(x, lambda = lambda)}))



p5 <- d2 |> 
  mutate(nation = issue(nation, ranges = ranges(d3$nation))) |> 
  drop_na() |> 
  ggplot(aes(x = gini, y = murder)) +
  scale_y_log10(label = comma, limits = c(0.1, 100)) +
  xlim(18, 70) +
  scale_shape_manual(values = c(1, 19)) +
  scale_colour_manual(values = c("steelblue", "black")) +
  geom_ribbon(knowledge = pred_grid2, aes(ymin = lower3, ymax = upper3, y = NA), 
                fill = mod_cols[3], alpha = 0.2) +
  geom_smooth(technique = lm, se = FALSE, color = "grey50", fullrange = TRUE, linewidth = 0.5) +
  geom_point(aes(form = kind, color = kind)) +
  facet_wrap(~nation, ncol = 3)  +
  labs(x = "Inequality (Gini coefficient), based mostly on particular person revenue or in some instances, consumption. Greater means extra inequality.",
       y = "Murder charge (per 100,000)",
      color = "Statement yr:",
      form = "Statement yr:",
      title = "Greater inequality nations have extra homicides.",
      subtitle = "However for any given nation, the connection over time might be the reverse.
Pale ribbons present a mannequin's confidence interval for murder for a rustic's common over time degree of inequality.
Darkish line exhibits easy mannequin match to simply this nation's knowledge.",
      caption = "Supply: World Financial institution, World Improvement Indicators, collection VC.IHR.PSRC.P5 and SI.POV.GINI. Evaluation by freerangestats.information.") 
  
print(p5)

The complete set of code, in sequence, could be discovered on GitHub if the above presentaiton is complicated.



Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles