Datasets have grown from big to massive, and so we increasingly find ourselves refactoring for readability and prioritizing computational efficiency (speed). The computing time for the ever-important sample covariance estimate of a dataset $X$, with $T$ observations and $p$ variables, is $\mathcal{O}(Tp^2)$. While a single covariance calculation for today's big datasets is still manageable, it is computationally prohibitive to use the bootstrap or related resampling methods, which require very many repetitions where each repetition demands its own covariance computation. Without fast computation the bootstrap remains impractical for high-dimensional problems. And that, we surely all agree, is a tragedy.
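To see the scale of the problem, here is a minimal sketch of a row-resampling bootstrap, where every replication pays the full covariance cost (the function name and the choice of B are mine, just for illustration):

# Minimal sketch: B bootstrap replications, each demanding its own covariance.
# The total cost is roughly B times the cost of a single cov() call.
bootstrap_cov <- function(X, B = 1000) {
  n <- nrow(X)
  lapply(seq_len(B), function(b) {
    idx <- sample(n, replace = TRUE)   # resample rows with replacement
    cov(X[idx, , drop = FALSE])        # one full covariance per replication
  })
}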
So, what can we do to restore resampling methods to our toolkit? We can reduce computing times, and appreciably so, if we compute in parallel. We can cut waiting times from overnight to a matter of minutes, or even seconds. Related to this, I wrote a post about Randomized Matrix Multiplication where I offer a computationally cheaper approximation instead of the exact, but slower-to-compute, procedure.
This post was inspired by a question from Laura Balzano (University of Michigan), who asked whether we can get an exact solution (rather than an approximation) using the parallel computing shown in that other post. I spent some time thinking about it, and indeed it is possible, and valuable. So with that context out of the way, here is the Rython (R + Python) code to calculate the sample covariance estimate in parallel, with some indication of the time saved. Use it when you have big matrices and you need the sample covariance matrix or a derivative thereof.
Parallelizing Matrix Computation – Rython code
Assume a matrix $X$ of size $10{,}000 \times 2{,}500$; calculating the sample covariance takes around 40 seconds on my machine:
TT <- 10000
p  <- 2500
X  <- matrix(rnorm(TT * p), TT, p)
Xc <- scale(X, center = TRUE, scale = FALSE)  # demean the columns
no_parallel <- cov(Xc)
system.time(cov(Xc))
#   user  system elapsed
#  41.99    0.00   42.00
It is worth acknowledging NumPy's impressively optimized backend relative to R; I had to increase the matrix dimensions fivefold to get a comparable waiting time.
import numpy as np
from multiprocessing import Pool, cpu_count
from itertools import product
import time

TT = 50000
p = 12500
X = np.random.randn(TT, p)
Xc = X - np.mean(X, axis=0)  # demean the columns

start_time = time.time()
tmpp = np.cov(Xc, rowvar=False)
elapsed = time.time() - start_time
print(f"Done in {elapsed:.2f} seconds.")
# Done in 18.75 seconds.
Now let's parallelize. The trick is to break the big matrix into smaller blocks, compute the covariance of those small blocks, and carefully rearrange everything back to the original dimensions. In the code, block 1 holds our variables indexed from 1 through $p/5$, block 2 holds indices $p/5 + 1$ through $2p/5$, and so on until block 5 (see the block decomposition below). There is no reason to choose exactly 5 blocks; experiment at your leisure (which I am sure you will). Next we create a grid that pairs every possible combination of those blocks. Now you can send your individual "workers" to work on these blocks separately. When they are done, rearrange the results back into covariance form, which shouldn't take long.
library(parallel)

blocks <- list( 1:(p/5),
                (1*(p/5)+1):(2*(p/5)),
                (2*(p/5)+1):(3*(p/5)),
                (3*(p/5)+1):(4*(p/5)),
                (4*(p/5)+1):p )
ij <- expand.grid(i = 1:5, j = 1:5)  # every pair of blocks
cl <- makeCluster(5)
clusterExport(cl, varlist = c("Xc", "ij", "blocks"))
system.time(
  out <- parLapply(cl, 1:nrow(ij),
    function(k) {
      i <- ij$i[k]
      j <- ij$j[k]
      crossprod(Xc[, blocks[[i]]], Xc[, blocks[[j]]])
    }
  )
)
#   user  system elapsed
#   0.04    0.01   10.27
stopCluster(cl)

S <- matrix(0, p, p)
for (k in 1:length(out)) {
  i <- ij$i[k]
  j <- ij$j[k]
  S[blocks[[i]], blocks[[j]]] <- out[[k]]
}
S[lower.tri(S)] <- t(S)[lower.tri(S)]
# thanks to symmetry
S <- S / (nrow(X) - 1)
# because we used crossprod and not the cov function in the loop above
# Now check there is no hiccup
all.equal(no_parallel, S)
[1] TRUE
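By the way, the loop above computes all 25 block pairs, even though S is symmetric. A possible refinement, sketched below under the same Xc and blocks as above (the names ij_upper, out_upper, and S2 are mine), is to send the workers only the 15 pairs with i <= j and recover the lower triangle afterwards:

# Sketch: compute only the upper-triangular block pairs, 15 instead of 25.
ij_upper <- subset(expand.grid(i = 1:5, j = 1:5), i <= j)
cl <- makeCluster(5)
clusterExport(cl, varlist = c("Xc", "ij_upper", "blocks"))
out_upper <- parLapply(cl, 1:nrow(ij_upper), function(k) {
  i <- ij_upper$i[k]
  j <- ij_upper$j[k]
  crossprod(Xc[, blocks[[i]]], Xc[, blocks[[j]]])
})
stopCluster(cl)
S2 <- matrix(0, p, p)
for (k in 1:nrow(ij_upper)) {
  S2[blocks[[ij_upper$i[k]]], blocks[[ij_upper$j[k]]]] <- out_upper[[k]]
}
S2[lower.tri(S2)] <- t(S2)[lower.tri(S2)]  # recover the lower triangle
S2 <- S2 / (nrow(X) - 1)
all.equal(S, S2)  # should be TRUE

This cuts the number of cross-products from 25 to 15, at no cost to exactness.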
For completeness, here is the Python code, but you won't need it, as I explain immediately after.
# from multiprocessing import Pool, cpu_count  # as another option
import time
from joblib import Parallel, delayed

block_size = p // 5
blocks = [np.arange(i * block_size, (i + 1) * block_size) for i in range(5)]
ij = list(product(range(5), repeat=2))

def compute_block(i, j):
    return Xc[:, blocks[i]].T @ Xc[:, blocks[j]]

start_time = time.time()
out = Parallel(n_jobs=5, backend="threading")(
    delayed(compute_block)(i, j) for i, j in ij
)
elapsed = time.time() - start_time
print(f"Done in {elapsed:.2f} seconds.")
# Done in 38.36 seconds.

S = np.zeros((p, p))
# Fill in blocks from parallel output
for k, (i, j) in enumerate(ij):  # ij = list of (i, j) pairs
    rows = blocks[i]
    cols = blocks[j]
    S[np.ix_(rows, cols)] = out[k]
# Fill the lower triangle by symmetry
i_lower = np.tril_indices(p, -1)
S[i_lower] = S.T[i_lower]
# Normalize
S = S / (X.shape[0] - 1)
np.allclose(tmpp, S)
True
As you can see, it takes longer, which caught me by surprise. After some digging, making sure there are no bugs, I understood that NumPy's impressive speed is due to the fact that it is already optimized to run in parallel in the backend. You can print np.__config__.show() to see the lower-level software used behind the scenes, with some heavy-hitters like LAPACK and BLAS, which are specifically designed for matrix math on modern CPUs. Explicit parallelization then only creates extra overhead. Interesting stuff. Plus one for Python.
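If you want the analogous peek on the R side, base R (in recent versions) can report which BLAS/LAPACK libraries the session is linked against:

# Which BLAS/LAPACK is this R session linked against? (recent base R)
sessionInfo()             # prints the BLAS/LAPACK paths near the top
extSoftVersion()["BLAS"]  # the BLAS in use
La_version()              # the LAPACK version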
Appendix
When you refactor code, you will also do well to look under the hood at what a function actually does, and perhaps remove all sorts of checks that may not be needed (class/type checks, NA checks, etc.).
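As a small illustration of that advice (a sketch of mine, assuming Xc is the centered matrix from above): once the data is demeaned, cov() adds little beyond argument checking and NA handling, and a bare crossprod() gets straight to the matrix multiplication.

# Sketch: with Xc already centered, the covariance is just a scaled
# cross-product; this skips the checks that cov() performs internally.
S_fast <- crossprod(Xc) / (nrow(Xc) - 1)  # t(Xc) %*% Xc in one BLAS call
all.equal(cov(Xc), S_fast)                # TRUE, up to numerical noise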
Also, if you wish to refer to this post, you can cite it like so:
Eran Raviv (2025). “Parallelizing Matrix Computation.”