The R function caret::findCorrelation searches a correlation matrix and returns a vector of integers corresponding to variables which, if removed, would reduce pair-wise correlations among the remaining variables. Here is the R code for this function:

    function (x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)
    {
        if (names & is.null(colnames(x)))
            stop("'x' must have column names when `names = TRUE`")
        out <- if (exact)
            findCorrelation_exact(x = x, cutoff = cutoff, verbose = verbose)
        else findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose)
        out
        if (names)
            out <- colnames(x)[out]
        out
    }

And the function findCorrelation_fast, which is the one I am interested in (with optional arguments removed):

    findCorrelation_fast <- function(x, cutoff = .90) {
      if(any(!complete.cases(x)))
        stop("The correlation matrix has some missing values.")
      averageCorr <- colMeans(abs(x))
      averageCorr <- as.numeric(as.factor(averageCorr))
      x[lower.tri(x, diag = TRUE)] <- NA
      combsAboveCutoff <- which(abs(x) > cutoff)

      colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
      rowsToCheck <- combsAboveCutoff %% nrow(x)

      colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
      rowsToDiscard <- !colsToDiscard

      deletecol <- c(colsToCheck[colsToDiscard], rowsToCheck[rowsToDiscard])
      deletecol <- unique(deletecol)
      deletecol
    }

I am writing a function that emulates the intent of this function in Python 3 with help from pandas. My implementation contains a nested for loop, which I understand is far from the most efficient way to achieve the desired result. The original R function does the job without any looping.

My two questions are:

1. Based on my implementation below, is there a Pythonic way to replace the nested for loop with a vectorised implementation?
2. Related to (1), the R function findCorrelation_fast uses the line averageCorr <- as.numeric(as.factor(averageCorr)). This construction seems both very alien to me and also crucial to the success of the loopless R implementation. Can anyone shed any light on what this line is doing? My intuition tells me that it is being incredibly clever and leveraging some unique behaviour of R.

My Python implementation and an example of its usage:

    import numpy as np
    import pandas as pd

    # calculate pair-wise correlations
    def findCorrelated(corrmat, cutoff = 0.8):
        ### search correlation matrix and identify pairs that if removed would reduce pair-wise correlations
        # args:
        #   corrmat: a correlation matrix
        #   cutoff: pairwise absolute correlation cutoff
        # returns:
        #   variables to be removed

        if(len(corrmat) != len(corrmat.columns)) : return 'Correlation matrix is not square'

        averageCorr = corrmat.abs().mean(axis = 1)

        # set lower triangle and diagonal of correlation matrix to NA
        # (k = 1 keeps only the strict upper triangle; np.bool no longer exists in recent NumPy)
        corrmat = corrmat.where(np.triu(np.ones(corrmat.shape), k = 1).astype(bool))

        # where a pairwise absolute correlation is greater than the cutoff value,
        # check whether mean abs. corr of a or b is greater and cut it
        to_delete = list()
        for col in range(0, len(corrmat.columns)):
            for row in range(0, len(corrmat)):
                if(abs(corrmat.iloc[row, col]) > cutoff):
                    if(averageCorr.iloc[row] > averageCorr.iloc[col]): to_delete.append(row)
                    else: to_delete.append(col)

        to_delete = list(set(to_delete))

        return to_delete

    # generate some data
    df = pd.DataFrame(np.random.randn(50,25))

    # demonstrate usage of function
    removeCols = findCorrelated(df.corr(), cutoff = 0.01) # set v. low cutoff as data is uncorrelated
    print('Columns to be removed:')
    print(removeCols)
    # drop by column label (df.columns), since removeCols holds column positions
    uncorrelated = df.drop(df.columns[removeCols], axis = 1, inplace = False)
    print('Uncorrelated variables:')
    print(uncorrelated)
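On the second question: as.factor() builds a factor whose levels are the sorted distinct values of the vector, and as.numeric() then returns each element's integer level code. The line averageCorr <- as.numeric(as.factor(averageCorr)) therefore replaces every mean with its dense rank (1 for the smallest distinct mean, equal means sharing a code), which preserves the ordering that the later > comparison relies on. The loop-free part comes from which() and index arithmetic rather than from this line. Below is a minimal sketch of how the same steps might look in NumPy. The names dense_rank and find_correlated_vectorised are hypothetical, not part of caret, pandas or NumPy, and the sketch returns sorted 0-based positions rather than R's 1-based indices in order of discovery.

```python
import numpy as np


def dense_rank(values):
    """1-based dense ranks: smallest distinct value -> 1, ties share a rank.

    Mirrors what as.numeric(as.factor(v)) yields for a numeric vector v.
    """
    _, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes + 1


def find_correlated_vectorised(corrmat, cutoff=0.9):
    """Hypothetical loop-free variant following the steps of findCorrelation_fast."""
    corr = np.abs(np.asarray(corrmat, dtype=float))
    if corr.ndim != 2 or corr.shape[0] != corr.shape[1]:
        raise ValueError("Correlation matrix is not square")
    if np.isnan(corr).any():
        raise ValueError("The correlation matrix has some missing values.")

    # Rank of each variable's mean absolute correlation (colMeans + factor codes).
    avg_rank = dense_rank(corr.mean(axis=0))

    # Positions of strict-upper-triangle pairs whose |correlation| exceeds the cutoff.
    rows, cols = np.nonzero(np.triu(corr > cutoff, k=1))

    # For each pair, discard the column member if its rank is higher,
    # otherwise discard the row member (ties go to the row, as in the R code).
    drop_col = avg_rank[cols] > avg_rank[rows]
    to_delete = np.concatenate([cols[drop_col], rows[~drop_col]])

    # Deduplicate; note np.unique also sorts, unlike R's unique().
    return np.unique(to_delete).tolist()
```

With the example data from the post, this could be called as find_correlated_vectorised(df.corr(), cutoff = 0.01), with the returned positions passed through df.columns[...] before dropping. Working on the boolean mask with np.nonzero avoids the ceiling and modulo arithmetic that the R code needs to turn linear indices back into row and column numbers.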
01-20-2017
10:11 PM