For full details about the implementation, see Sepulveda MV (2025). Kendallknight: An R package for efficient implementation of Kendall's correlation coefficient computation. PLoS One 20(6): e0326090. https://doi.org/10.1371/journal.pone.0326090
The `kendallknight` package focuses exclusively on Kendall's correlation coefficient and provides additional functions, not available in other packages, to test the statistical significance of the computed correlation, which is particularly useful in econometric and statistical contexts.
The `kendallknight` package is available on CRAN and can be installed with either of the following commands:
```r
# stable version from CRAN
install.packages("kendallknight")

# development version from GitHub
remotes::install_github("pachadotdev/kendallknight")
```
As an illustrative exercise, we can explore the question: is there a relationship between the number of computer science doctorates awarded in the United States and the total revenue generated by arcades? This question is, of course, a purely numerical exercise and not a claim about causal mechanisms.
The following table, obtained from @vigen2015, can be used to illustrate the usage of the `kendallknight` package:
Year | Computer science doctorates awarded in the US | Total revenue generated by arcades (USD billions) |
---|---|---|
2000 | 861 | 1.196 |
2001 | 830 | 1.176 |
2002 | 809 | 1.269 |
2003 | 867 | 1.240 |
2004 | 948 | 1.307 |
2005 | 1129 | 1.435 |
2006 | 1453 | 1.601 |
2007 | 1656 | 1.654 |
2008 | 1787 | 1.803 |
2009 | 1611 | 1.734 |
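The examples below access these data as the `arcade` data frame. If that dataset were not at hand, an equivalent data frame can be recreated directly from the table above (a minimal sketch; only the `doctorates` and `revenue` columns are used later):

```r
# recreate the table above as a data frame (values from @vigen2015)
arcade <- data.frame(
  year = 2000:2009,
  doctorates = c(861, 830, 809, 867, 948, 1129, 1453, 1656, 1787, 1611),
  revenue = c(1.196, 1.176, 1.269, 1.240, 1.307, 1.435, 1.601, 1.654, 1.803, 1.734)
)
```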
The `kendall_cor()` function computes Kendall's correlation coefficient:
```r
library(kendallknight)

kendall_cor(arcade$doctorates, arcade$revenue)
```

```
[1] 0.8222222
```
The `kendall_cor_test()` function tests the null hypothesis that Kendall's correlation coefficient is zero:
```r
kendall_cor_test(
  arcade$doctorates,
  arcade$revenue,
  conf.level = 0.8,
  alternative = "greater"
)
```

```
	Kendall's rank correlation tau

data:  arcade$doctorates and arcade$revenue
tau = 0.82222, p-value = 0.0001788
alternative hypothesis: true tau is greater than 0
80 percent confidence interval:
 0.5038182 1.0000000
```
One important difference from the base R implementation is that `kendallknight` can report confidence intervals for any confidence level (e.g., 95%, 90%, 80%).
With the obtained \(p\)-value and an 80% confidence level (the default is 95%), the null hypothesis is rejected for the two-tailed test (\(H_0: \tau = 0\) versus \(H_1: \tau \neq 0\), the default option) and for the one-tailed greater-than test (\(H_0: \tau = 0\) versus \(H_1: \tau > 0\)), but not for the one-tailed less-than test (\(H_0: \tau = 0\) versus \(H_1: \tau < 0\)). In other words, the empirical evidence from this dataset suggests a positive correlation (e.g., more doctorates are associated with more revenue generated by arcades).
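As a minimal sketch of the comparison just described, the three tests can be run by varying the `alternative` argument; this assumes the returned object exposes a `p.value` element, as base R's `htest` objects do:

```r
# run the test under each alternative hypothesis and collect the p-values
sapply(
  c("two.sided", "greater", "less"),
  function(alt) {
    kendall_cor_test(arcade$doctorates, arcade$revenue, alternative = alt)$p.value
  }
)
```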
With base R or the `Kendall` package, an equivalent result can be obtained with the following code:
```r
cor.test(arcade$doctorates, arcade$revenue, method = "kendall")
```

```
	Kendall's rank correlation tau

data:  arcade$doctorates and arcade$revenue
T = 41, p-value = 0.0003577
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau 
0.8222222 
```
```r
Kendall::Kendall(arcade$doctorates, arcade$revenue)
```

```
tau = 0.822, 2-sided pvalue =0.0012822
```
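As a quick check that is not part of the original output, the estimates from the two implementations can be compared programmatically (assuming, as shown above, that `kendall_cor()` returns a plain numeric and `cor.test()` a standard `htest` object):

```r
# both implementations compute the exact tau, so this should return TRUE
all.equal(
  unname(cor.test(arcade$doctorates, arcade$revenue, method = "kendall")$estimate),
  kendall_cor(arcade$doctorates, arcade$revenue)
)
```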
In an econometric context, the current implementation is particularly useful to compute the pseudo-\(R^2\) statistic, defined as the squared Kendall correlation between the observed and fitted values, in the context of (quasi-)Poisson regression with fixed effects [@santos; @sepulveda]. A local test shows that the pseudo-\(R^2\) computation time drops from roughly fifty percent (base R) to one percent (`kendallknight`) of the time required to fit the model with the `fepois()` function from the `fixest` package [@berge], using a dataset containing fifteen thousand rows [@yotov]:
```r
library(tradepolicy)
library(fixest)

# subset to the years 1986, 1990, and 1994
data8694 <- subset(agtpa_applications, year %in% seq(1986, 1994, 4))

# PPML gravity model with exporter-year and importer-year fixed effects
fit <- fepois(
  trade ~ dist + cntg + lang + clny + rta |
    as.factor(paste0(exporter, year)) +
    as.factor(paste0(importer, year)),
  data = data8694
)
```
```r
# squared Kendall correlation between observed and fitted trade flows
psr <- (cor(data8694$trade, fit$fitted.values, method = "kendall"))^2
psr2 <- (kendall_cor(data8694$trade, fit$fitted.values))^2

c("base R" = psr, "kendallknight" = psr2)
```

```
       base R kendallknight 
     0.263012      0.263012 
```
Comparing the model fitting time against the correlation computation time, the speed advantage of the `kendallknight` package becomes evident (a sketch of how these timings could be measured follows the table):
Operation | Time | Pseudo-\(R^2\) time / model fitting time |
---|---|---|
Model fitting | 3.75 s | |
Pseudo-\(R^2\) (base R) | 1.78 s | 47.58% |
Pseudo-\(R^2\) (kendallknight) | 0.02 s | 0.51% |
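A minimal sketch of how these timings could be reproduced, assuming the `bench` package for measurement (any timing tool would do):

```r
library(bench)

# time the model fit on its own
bench::mark(
  fepois(
    trade ~ dist + cntg + lang + clny + rta |
      as.factor(paste0(exporter, year)) +
      as.factor(paste0(importer, year)),
    data = data8694
  )
)

# time both pseudo-R^2 computations; the results were already shown to match,
# so result checking is disabled to tolerate floating point noise
bench::mark(
  base_r = (cor(data8694$trade, fit$fitted.values, method = "kendall"))^2,
  kendallknight = (kendall_cor(data8694$trade, fit$fitted.values))^2,
  check = FALSE
)
```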
For simulated datasets with variables "x" and "y" created with `rnorm()` and `rpois()`, we observe increasingly large time differences as the number of observations grows (a sketch of the benchmark setup follows the table):
No. of observations | kendallknight median time (s) | Kendall median time (s) | Base R median time (s) |
---|---|---|---|
10,000 | 0.013 | 1.0 | 4 |
20,000 | 0.026 | 3.9 | 16 |
30,000 | 0.040 | 8.7 | 36 |
40,000 | 0.056 | 15.6 | 64 |
50,000 | 0.071 | 24.2 | 100 |
60,000 | 0.088 | 34.8 | 144 |
70,000 | 0.104 | 47.5 | 196 |
80,000 | 0.123 | 61.9 | 256 |
90,000 | 0.137 | 78.2 | 324 |
100,000 | 0.153 | 96.4 | 399 |
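A sketch of the simulation behind this table, again assuming the `bench` package; the seed, the Poisson rate, and the machine differ from the original run, so the exact numbers will vary:

```r
library(bench)
library(kendallknight)
library(Kendall)

set.seed(123) # assumed seed, not taken from the original benchmark

n <- 10000 # increase up to 100,000 to reproduce the later rows
x <- rnorm(n)
y <- rpois(n, lambda = 2) # lambda is an assumed parameter

bench::mark(
  kendallknight = kendall_cor(x, y),
  Kendall = Kendall::Kendall(x, y)$tau[1],
  base_r = cor(x, y, method = "kendall"),
  check = FALSE # the three calls return differently structured objects
)
```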