Consider a labelled data set $$D = \{(x_1, y_1),...,(x_n, y_n)\} $$ on which we want to evaluate a machine learning algorithm using $k$-fold cross-validation with $m$ different random seeds. In total we therefore train the algorithm on $mk$ distinct data splits, where each split consists of its own training set and test set.
For each of these $mk$ data splits we use two different performance metrics $f$ and $g$ to evaluate the trained machine learning algorithm on the test set.
Running this experimental setup delivers two matrices of performance measures $A^f, A^g \in \mathbb{R}^{m \times k}$. For example, $A^f_{ij}$ is the performance of the trained algorithm, measured with metric $f$, on the test set of the $j$-th split generated from the $i$-th random seed.
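For concreteness, here is a minimal sketch of how such matrices could be generated; every concrete choice in it (synthetic regression data standing in for $D$, a random forest as the algorithm, $R^2$ and MAE as the metrics $f$ and $g$) is an arbitrary placeholder:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import KFold

m, k = 10, 5  # number of random seeds, number of CV folds

# Synthetic stand-in for the labelled data set D.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

A_f = np.empty((m, k))  # performance under metric f (here: R^2)
A_g = np.empty((m, k))  # performance under metric g (here: MAE)

for i in range(m):
    # Each random seed i induces its own k-fold partition of D.
    cv = KFold(n_splits=k, shuffle=True, random_state=i)
    for j, (train_idx, test_idx) in enumerate(cv.split(X)):
        model = RandomForestRegressor(random_state=0)
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        A_f[i, j] = r2_score(y[test_idx], y_pred)
        A_g[i, j] = mean_absolute_error(y[test_idx], y_pred)
```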
I am interested in studying the statistical association between the values in $A^f$ and $A^g$ using measures such as Pearson correlation and mutual information. I have the $mk$ statistical realisations $(A^f_{ij}, A^g_{ij}) \in \mathbb{R}^2$ to work with.
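For instance, estimating both quantities from the pooled $mk$ pairs might look like this (scipy's `pearsonr` and sklearn's kNN-based `mutual_info_regression` are just convenient stand-in estimators; `A_f` and `A_g` are the matrices from the sketch above):

```python
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

# Pool all mk value-pairs (A_f[i,j], A_g[i,j]) as if they were i.i.d.
pairs_f = A_f.ravel()
pairs_g = A_g.ravel()

pc_pooled, _ = pearsonr(pairs_f, pairs_g)
# kNN-based MI estimator; expects a 2D feature matrix as first argument.
mi_pooled = mutual_info_regression(
    pairs_f.reshape(-1, 1), pairs_g, random_state=0
)[0]
```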
But what worries me is that the value-pairs $(A^f_{ij}, A^g_{ij})$ are not the product of independent and identically distributed (i.i.d.) sampling: for each fixed matrix row $i$, the value-pairs
$$(A^f_{i1}, A^g_{i1}),...,(A^f_{ik}, A^g_{ik}) $$
are not independent (and thus not i.i.d.), because within a single cross-validation run the $k$ training sets overlap heavily and the $k$ test sets are disjoint pieces of the same sample. On the other hand, standard estimators of statistical association measures such as Pearson correlation (PC) or mutual information (MI) assume that you have i.i.d. samples of your random variables.
To me it seems that, for each fixed column $j$, the value-pairs
$$(A^f_{1j}, A^g_{1j}),...,(A^f_{mj}, A^g_{mj}) $$
are in fact i.i.d. Should I compute the PC and MI column-wise and then average the resulting $k$ values, as in the sketch below?
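Operationally, this column-wise proposal would look as follows (continuing the snippets above; note that each column-wise estimate then rests on only $m$ pairs, so the kNN MI estimate in particular becomes noisy for small $m$):

```python
# Column-wise estimates: the m pairs within column j come from m
# independent re-runs of the CV procedure with different seeds.
pc_cols = np.empty(k)
mi_cols = np.empty(k)
for j in range(k):
    pc_cols[j], _ = pearsonr(A_f[:, j], A_g[:, j])
    mi_cols[j] = mutual_info_regression(
        A_f[:, j].reshape(-1, 1), A_g[:, j], random_state=0
    )[0]

pc_colwise = pc_cols.mean()  # average of the k column-wise PC estimates
mi_colwise = mi_cols.mean()  # average of the k column-wise MI estimates
```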
More generally, what is a statistically rigorous way to study the statistical association between the samples in $A^f$ and $A^g$ given the stochastic dependency structure in the matrix data imposed by the cross-validation scheme?
Am I overthinking this? That is, should I simply compute the PC and MI naively using all $mk$ value-pairs $(A^f_{ij}, A^g_{ij})$ as if they were i.i.d. samples, i.e. the pooled computation sketched above?