A fairly easy publish at the moment. I wished to have a look at month-to-month tourism numbers in Samoa. In reality I began to do that for Pacific islands typically, however the information wrangling challenges had been adequate that I solely bought so far as Samoa for now. There’s curiosity in these in the intervening time as a result of they’d be a comparatively well timed indicator of potential financial harm from the gasoline disaster associated to the Iran struggle. Customer numbers, inflation, and merchandise commerce are a part of the very choose variety of month-to-month revealed financial statistics on this a part of the world (for a choose subset of nations).
There’s two issues taking place on this publish:
- getting maintain of the Samoa’s customer arrivals numbers; and
- seasonally adjusting them.
The latter is extra attention-grabbing to me than the previous, however sadly the previous took more often than not.
Knowledge wrangling
Right here’s what the customer numbers seem like, as soon as we’ve bought all of them into one information body:
The Samoa Bureau of Statistics has a pleasant Excel workbook as much as Could 2023, however from that date onwards the info is barely out there so far as I can see in PDF reviews. Fortunately these are all out there as hyperlinks from a single web page, however there appears to be both a talent difficulty on my half or some form of block on the web site that stops any systematic obtain of all of them, so I needed to obtain all of the PDFs by hand one after the other.
Then Claude helped me write a parser to seek out the customer arrivals quantity in every PDF. Really, Claude wasn’t any good at really discovering the best quantity, but it surely did give me a sample I may adapt, a lot as within the previous days I’d have used Stack Overflow. There are loads of numbers in every PDF and we want the best one—customer arrivals, not whole arrivals (sure, that’s an em sprint, however I write all of them by hand). On this case it seems that the trick is that the PDFs all embody the sentence “General customer numbers for the month underneath assessment stood at [number]”, and fortuitously there isn’t any different use of the phrase “stood” within the doc.
The opposite fiddly factor was extracting the precise date every PDF referred to. Then every thing wanted to be examined. In the long run, it might have been faster to simply manually kind the 35 numbers I wanted. However right here’s the code that does all the info wrangling.
library(tidyverse)
library(rvest)
library(httr2)
library(lubridate)
library(pdftools)
library(readxl)
library(seasonal)
library(tsibble)
library(fable)
library(feasts)
library(ggtext)
library(scales)
#' Extract date from messy filenames
extract_date <- perform(messy_date){
tibble(path = messy_date) |>
mutate(
stem = instruments::file_path_sans_ext(basename(path)),
# Take away any main phrase adopted by _ or - (e.g. "Migration_")
stem = str_remove(stem, "^([A-Za-z]+[_-])+(?=[A-Z])"),
month_str = str_extract(stem, "[A-Za-z]{3,}"),
# Normalise non-standard abbreviations
month_str = str_replace(month_str, "^Sept$", "Sep"),
year_str = str_extract(stem, "d{4}|d{2}"),
12 months = if_else(nchar(year_str) == 2,
as.integer(paste0("20", year_str)),
as.integer(year_str)),
date = as.Date(paste(12 months, month_str, "01"), format = "%Y %B %d") |>
coalesce(as.Date(paste(12 months, month_str, "01"), format = "%Y %b %d"))
) |>
pull(date)
}
take a look at <- c("samoa_pdfs/April_25.pdf", "samoa_pdfs/Feb_25.pdf", "samoa_pdfs/Feb_26.pdf",
"samoa_pdfs/Jan_26.pdf", "samoa_pdfs/January_25.pdf", "samoa_pdfs/July_25.pdf",
"samoa_pdfs/June-25.pdf", "samoa_pdfs/March_2025.pdf", "samoa_pdfs/March_2026.pdf",
"samoa_pdfs/May_25.pdf", "samoa_pdfs/Migration_April-2026.pdf",
"samoa_pdfs/Sept-24.pdf", "samoa_pdfs/Migration_Rep_June_2023.pdf")
extract_date(take a look at)
#-------------------PDFs for current data------------------
# For the newer years no Excel tables are revealed, so want
# to make use of the PDFs and extract whole from there
# These needed to be downloaded by hand - nothing I attempted was capable of automate
# that. Obtain from https://www.sbs.gov.ws/migration/ and save
# in a subfolder /samoa_pdfs/.
pdf_dir <- "samoa_pdfs"
tbl <- tibble(local_path = checklist.recordsdata(pdf_dir, sample = ".pdf$", full.names = TRUE))
parse_pdf_visitors <- perform(path) {
txt <- tryCatch(pdf_text(path), error = perform(e) {
message(" [WARN] pdftools couldn't learn: ", basename(path))
NULL
})
if (is.null(txt)) return(NA_integer_)
full_text <- paste(txt, collapse = "n")
patterns <- c(
"stood at [^0-9(]{0,10}(?([0-9,]+))?"
)
for (pat in patterns) {
m <- str_match(full_text, pat)[, 2]
if (!is.na(m)) return(as.integer(str_remove_all(m, ",")))
}
message(" [WARN] Couldn't parse customer rely from: ", basename(path))
NA_integer_
}
discovered <- tbl |>
filter(file.exists(local_path))
message(" Parsing ", nrow(discovered), " native PDFs...")
pdf_tbl <- discovered |>
mutate(guests = map_int(local_path, (p) {
message(" ", basename(p))
parse_pdf_visitors(p)
})) |>
filter(!is.na(guests)) |>
mutate(date = extract_date(local_path))
#-----------------Excel variations for older data------------
# For Could 2023 and earlier we are able to get the info for a number of months
# at a time from Desk 1 of the Excel tables. The Could 2023 Excel
# file goes again to 2017 January (though the rows are hidden)
fn <- "May_23.xlsx"
if(!file.exists(fn)){
obtain.file("https://sbs.gov.ws/photographs/sbs-documents/social/Arrival/2023/May_23.xlsx",
destfile = fn, mode = "wb")
}
x <- read_excel(fn, sheet = "Desk 1",
vary = "D48:D130",
col_names = "guests") |>
drop_na() |>
pull(guests)
historic <- tibble(guests = x,
date = seq(as.Date("2017-01-01"), as.Date("2023-05-01"), by = "month"))
#-----------combine and test-----------------------
samoa_visitors <- pdf_tbl |>
choose(date, guests) |>
bind_rows(historic) |>
prepare(date) |>
mutate(date_month = yearmonth(date)) |>
as_tsibble(index = date_month)
# Take a look at - some hand picked take a look at circumstances, 4 from PDFs and three from the Excel
samoa_test <- tribble(~date, ~correct_visitors,
"2023-08-01", 16471,
"2024-04-01", 12644,
"2024-08-01", 17248,
"2026-04-01", 14188,
"2018-02-01", 7413,
"2020-12-01", 195,
"2022-06-01", 866) |>
mutate(date = as.Date(date))
stopifnot(
samoa_test |>
anti_join(samoa_visitors, by = c("date", "correct_visitors" = "guests")) |>
nrow() == 0
)
And right here’s the code that attracts the fundamental time sequence chart I used earlier:
the_caption = "Supply: Samoa Bureau of Statistics"
ggplot(samoa_visitors, aes(x = date, y = guests)) +
geom_line() +
scale_y_continuous(label = comma) +
labs(x = "",
y = "Customer arrivals",
title = "Customer arrivals per 30 days to Samoa",
subtitle = "Unadjusted originals",
caption = the_caption)
Modelling seasonal decomposition
Phew. Okay, on to the extra enjoyable a part of really modelling. I intend to make use of the X-13ARIMA-SEATS software. X‑13ARIMA‑SEATS is a program developed and maintained by the US Census Bureau for seasonal adjustment and time‑sequence decomposition. It’s going to robotically match a SARIMA (seasonal autoregressive built-in shifting common) time sequence mannequin, has in-built strategies for figuring out and coping with outliers, by default adjusts for variety of buying and selling days and the shifting Easter vacation, and permits the consumer to specify extra regression explanatory variables in order for you. It’s the go-to the world over for seasonal adjustment of official statistics.
X-13ARIMA-SEATS is on the market in R via the seasonal package deal (by Christoph Sax and Dirk Eddelbuettel). Downstream of that, the feasts and fable packages by (Mitchell O’Hara-Wild, Rob Hyndman and Earo Wangmake) make it simpler to work with in a tabular, tidyverse method.
The Covid interval is an apparent dominating think about tourism wherever in current many years, and exhibits up within the first chart I confirmed above. I’m additionally within the interval for the reason that USA-Israel-Iran struggle started to see if there may be an impression from that. There’s solely two months of knowledge (March and April 2026) for the reason that struggle began, so an impression must be dramatic to point out up, but it surely’s value checking.
X-13ARIMA-SEATS by default will match forecasts while you ask it to mannequin so it is advisable to present values of x regressor variables to cowl not solely the interval of the info however a couple of intervals out forward. I’m doing this as easy time sequence vectors—I haven’t but labored out the straightforward method to do that within the tabular world of fable. Fortunately, it appears I can use these vectors later even in fable. In a second I’ll use each seasonal instantly and through fable to verify I get the identical outcomes. In reality, updating my dated information of time sequence modelling to make use of fable was one in all my predominant motivations for this complete train.
Right here’s the code that makes the x regressors I’ll be utilizing for the presence of the Covid pandemic and for the Iran struggle:
#----------------x regressor variables-----------------
# Covid time sequence indicator to make use of as a regressor
covid_reg <- ts(
as.numeric(seq(as.Date("2017-01-01"), as.Date("2029-04-01"), by = "month") %in%
seq(as.Date("2020-04-01"), as.Date("2022-07-01"), by = "month")),
begin = c(2017, 1),
frequency = 12
)
# Iran struggle time sequence indicator
war_reg <- ts(
as.numeric(seq(as.Date("2017-01-01"), as.Date("2029-04-01"), by = "month") %in%
seq(as.Date("2026-03-01"), as.Date("2026-07-01"), by = "month")),
begin = c(2017, 1),
frequency = 12
)
Instantly with seasonal
Okay, it’s modelling time. First, utilizing simply an quaint time sequence vector of Samoa’s customer arrivals and the seasonal package deal instantly, right here’s becoming the X-13ARIMA-SEATS mannequin with defaults utilizing each the Covid and Iran struggle regressors:
sa_ts <- ts(samoa_visitors$guests, frequency = 12, begin = c(2017, 1))
fit_ts_war <- seas(sa_ts, xreg = cbind(covid_reg, war_reg))
abstract(fit_ts_war)
Tremendous easy. That will get us this output:
Name:
seas(x = sa_ts, xreg = cbind(covid_reg, war_reg))
Coefficients:
Estimate Std. Error z worth Pr(>|z|)
xreg1 -3.10069 0.14567 -21.286 < 2e-16 ***
xreg2 -0.15937 0.13098 -1.217 0.224
LS2020.Mar -0.85036 0.14567 -5.838 5.29e-09 ***
AO2020.Dec -0.75529 0.12350 -6.115 9.63e-10 ***
LS2021.Could -0.79012 0.13233 -5.971 2.36e-09 ***
AO2021.Jul -1.36973 0.12378 -11.066 < 2e-16 ***
LS2022.Could 0.89068 0.13233 6.731 1.68e-11 ***
LS2022.Aug -1.07689 0.19608 -5.492 3.97e-08 ***
AR-Nonseasonal-01 -0.59006 0.07041 -8.380 < 2e-16 ***
MA-Seasonal-12 0.99961 0.08044 12.427 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
SEATS adj. ARIMA: (1 1 0)(0 1 1) Obs.: 112 Remodel: log
AICc: 1608, BIC: 1634 QS (no seasonality in remaining):3.295
Field-Ljung (no autocorr.): 26.42 Shapiro (normality): 0.9721 *
Some key issues to notice right here.
The information was log-transformed, which is nice—I would definitely have chosen to do that, or at the least a sq. root rework, given the best way customer arrivals variance will increase as its imply does, all around the globe.
The mannequin ultimately adopted is described as ARIMA (1 1 0)(0 1 1). This implies the primary sequence as a autoregression time period on one lag, after one spherical of differencing; and the seasonal half has a shifting common time period on on lag, after one spherical of differencing. It is a very regular mannequin for tourism numbers. it signifies a common development/drive (the differencing in the primary sequence), an inclination for the values in a single month to be associated a method or one other to these within the month earlier than (on this case with a destructive correlation of -0.59) and a robust annual seasonality impact that adjustments slowly over time.
The Covid dummy variable (xreg1) is strongly negatively important. With a coefficient of -3.1 and the log rework of the response variable, because of this throughout the Covid interval the precise arrivals had been on common exp(-3.1) = 0.045 (ie 4.5% or down 95.5%) of throughout the non-Covid intervals.
In distinction, we don’t have a statistically important impact for the Iran struggle (xreg2). For subsequent evaluation I’ll take out that x regressor as I don’t need it complicating the current development and seasonally adjusted figures.
Six months’ information have been singled out as outliers and managed for appropriately. These are all within the difficult-to-model Covid interval of 2020 to 2022.
One level to notice is that we don’t have an Easter impact. I’m 100% certain there may be actually an Easter impact in Samoa’s customer arrivals, however 9 years of knowledge isn’t sufficient to point out it. Easter is usually in April and generally in March. However since 2017, it occurs to have been in April yearly aside from 2024. That’s simply not sufficient variation to differentiate it from common month-to-month seasonal impacts.
To examine my recollection that Easter is certainly checked for by default in X-13ARIMA-SEATS, I match a mannequin to the well-known Field and Jenkins airline information:
> # One other examplefor comparability
+ m <- seas(AirPassengers)
+ abstract(m)
Name:
seas(x = AirPassengers)
Coefficients:
Estimate Std. Error z worth Pr(>|z|)
Weekday -0.0029497 0.0005232 -5.638 1.72e-08 ***
Easter[1] 0.0177674 0.0071580 2.482 0.0131 *
AO1951.Could 0.1001558 0.0204387 4.900 9.57e-07 ***
MA-Nonseasonal-01 0.1156204 0.0858588 1.347 0.1781
MA-Seasonal-12 0.4973600 0.0774677 6.420 1.36e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
SEATS adj. ARIMA: (0 1 1)(0 1 1) Obs.: 144 Remodel: log
AICc: 947.3, BIC: 963.9 QS (no seasonality in remaining): 0
Field-Ljung (no autocorr.): 26.65 Shapiro (normality): 0.9908
Right here we see that the variety of week days in a month and the shifting Easter vacation are certainly within the remaining mannequin, with out me having to have requested for them to be checked. Easter has a constructive impression on this air passengers sequence (1949 to 1960), and the variety of week days in a month has a smaller destructive impression.
So after that Easter-checking digression, I refit the mannequin for my Samoa customer arrivals with out the struggle regressor and get an basically an identical end result:
> fit_ts <- seas(sa_ts, xreg = covid_reg)
> abstract(fit_ts)
Name:
seas(x = sa_ts, xreg = covid_reg)
Coefficients:
Estimate Std. Error z worth Pr(>|z|)
xreg -3.10045 0.14690 -21.106 < 2e-16 ***
LS2020.Mar -0.83292 0.14617 -5.698 1.21e-08 ***
AO2020.Dec -0.75463 0.12449 -6.062 1.35e-09 ***
LS2021.Could -0.78975 0.13356 -5.913 3.36e-09 ***
AO2021.Jul -1.36737 0.12476 -10.960 < 2e-16 ***
LS2022.Could 0.89113 0.13356 6.672 2.52e-11 ***
LS2022.Aug -1.07733 0.19782 -5.446 5.15e-08 ***
AR-Nonseasonal-01 -0.58577 0.07071 -8.284 < 2e-16 ***
MA-Seasonal-12 0.99912 0.07827 12.765 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
SEATS adj. ARIMA: (1 1 0)(0 1 1) Obs.: 112 Remodel: log
AICc: 1607, BIC: 1631 QS (no seasonality in remaining):2.501
Field-Ljung (no autocorr.): 26.63 Shapiro (normality): 0.9706 *
With fable and feasts
tsibble, fable and feasts are an excellent set of packages that allow you to work with time sequence in R in a extra tabular and tidyverse-friendly method than the assorted older time sequence information buildings allow you to. Most of my work with time sequence was earlier than they had been out there, although, so I’m missing confidence in how they work. Fortunately it appears to be fairly easy.
I’d already accomplished the crucial step earlier with as_tsibble(index = date_month), specifying my predominant samoa_visitors tibble is definitely a time sequence tibble. Now the modelling utilizing that tsibble is fairly easy:
#----------fable version-------
fit_fb <- samoa_visitors |>
mannequin(X_13ARIMA_SEATS(
guests ~ xreg(covid_reg)
))
report(fit_fb)
Notice we use report() fairly than abstract() to get the top end result. I’m not going to print it right here as a result of it’s actually an identical to what we bought earier with abstract(fit_ts).
The principle attraction for me in utilizing the fable/feasts method is that it suits higher with each my information wrangling workflow and my method to ggplot2 graphics. so here’s a good decomposition of hte authentic timeseries, produced with autoplot(
fit_fb |>
parts() |>
autoplot()
Notice that on this decomposition, the unique ‘guests’ sequence and the development are on the unique scale, however the ‘seasonal’ and ‘irregular’ parts are expressed as multipliers. So a seasonal worth of 1.2 means in a given month the worth is 20% larger because of the seasonality than in any other case.
And right here is my remaining, presentation model of the info:
Produced with this code:
comp_data <- fit_fb |>
parts() |>
# the 'development' that comes straight from the decomposition does
# not modify ofr the Covid coefficient and appears fairly bizarre
# so it's extra intuitive to current it after adjustment for
# Covid, which we have to calculate by multiplying (due to
# the log rework that SEATS used autoamtically):
mutate(covid = as.numeric(covid_reg)[1:nrow(samoa_visitors)],
trend_adj = development * exp(covid * coef(fit_ts)[1]))
comp_data|>
ggplot(aes(x = date_month, y = season_adjust)) +
geom_line(linewidth = 1.3) +
geom_line(aes(y = trend_adj), color = "steelblue", alpha = 0.9, linewidth = 1.2) +
geom_point(aes(y = guests), color = "grey70") +
scale_y_continuous(label = comma) +
labs(x = "",
y ="Customer arrivals",
title = "Customer arrivals per 30 days to Samoa",
subtitle = "Authentic, seasonally adjusted and development (adjusted for Covid interval).",
caption = "Supply: information from Samoa Bureau of Statistics. Seasonal adjustment by freerangestats.information.") +
theme(plot.subtitle = element_markdown())
In any case that, what perception do we have now? Effectively, not lots actually, aside from the blindingly apparent tendencies of the devastation to the business of Covid and the gradual and noisy progress development since then. We’re at the least properly positioned to make commentary on the impression of the gasoline disaster on tourism, and may say that there isn’t any evident but. If and once we do see the impression, we’ll have the ability to speak about when it comes to tendencies and random variation, after having eliminated the seasonal factor. In order that’s helpful.
Effectively, that’s all. Maybe I’ll get round in a later publish to including the opposite Pacific island international locations with month-to-month tourism information—Fiji, Vanuatu, Cook dinner Islands, French Polynesia being the important thing ones I’m conscious of.
