6.4 Multicollinearity and Variable Selection
Multicollinearity refers to the situation in which some of the independent variables can be expressed as (nearly exact) linear combinations of the other independent variables. It arises when the independent variables are not mutually independent but strongly correlated with one another. This violates the condition that the covariance matrix of the independent variables must have full rank.
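To see what breaks when the full-rank condition fails, here is a minimal sketch (using arbitrary synthetic data, not the data analyzed below) showing that the condition number of \(X^TX\) explodes once one column is an exact linear combination of the others.
import numpy as np
np.random.seed(0)
x1 = np.random.randn(100)
x2 = np.random.randn(100)
# Independent columns: the condition number of X'X stays moderate
X_indep = np.column_stack([x1, x2, np.random.randn(100)])
print(np.linalg.cond(X_indep.T @ X_indep))
# Third column is an exact linear combination of the first two:
# X'X loses full rank and its condition number blows up
X_collinear = np.column_stack([x1, x2, x1 + x2])
print(np.linalg.cond(X_collinear.T @ X_collinear))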
The following data set contains U.S. macroeconomic indicators (the Longley data).
TOTEMP - Total Employment
GNPDEFL - GNP deflator
GNP - GNP
UNEMP - Number of unemployed
ARMED - Size of armed forces
POP - Population
YEAR - Year (1947 - 1962)
As the scatter plots below show, the independent variables are strongly correlated with one another.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.datasets.longley import load_pandas
# Longley macroeconomic data: TOTEMP is the target, the others are regressors
dfy = load_pandas().endog
dfX = load_pandas().exog
df = pd.concat([dfy, dfX], axis=1)
sns.pairplot(dfX)
plt.show()
The correlations can also be examined through the correlation coefficient matrix.
dfX.corr()
|         | GNPDEFL  | GNP      | UNEMP     | ARMED     | POP      | YEAR     |
|---------|----------|----------|-----------|-----------|----------|----------|
| GNPDEFL | 1.000000 | 0.991589 | 0.620633  | 0.464744  | 0.979163 | 0.991149 |
| GNP     | 0.991589 | 1.000000 | 0.604261  | 0.446437  | 0.991090 | 0.995273 |
| UNEMP   | 0.620633 | 0.604261 | 1.000000  | -0.177421 | 0.686552 | 0.668257 |
| ARMED   | 0.464744 | 0.446437 | -0.177421 | 1.000000  | 0.364416 | 0.417245 |
| POP     | 0.979163 | 0.991090 | 0.686552  | 0.364416  | 1.000000 | 0.993953 |
| YEAR    | 0.991149 | 0.995273 | 0.668257  | 0.417245  | 0.993953 | 1.000000 |
cmap = sns.light_palette("darkgray", as_cmap=True)
sns.heatmap(dfX.corr(), annot=True, cmap=cmap)
plt.show()
When multicollinearity is present, the condition number of the covariance matrix of the independent variables increases.
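As a quick sanity check (a sketch, assuming the dfX loaded above), the condition number of the regressor matrix can be computed directly and turns out to be very large.
# Condition number of the regressor matrix; a huge value is the numerical
# symptom of multicollinearity (or of badly scaled columns)
print(np.linalg.cond(dfX.values))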
from sklearn.model_selection import train_test_split
def get_model1(seed):
    df_train, df_test = train_test_split(df, test_size=0.5, random_state=seed)
    model = sm.OLS.from_formula("TOTEMP ~ GNPDEFL + POP + GNP + YEAR + ARMED + UNEMP", data=df_train)
    return df_train, df_test, model.fit()
df_train, df_test, result1 = get_model1(3)
print(result1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: TOTEMP R-squared: 1.000
Model: OLS Adj. R-squared: 0.997
Method: Least Squares F-statistic: 437.5
Date: Sun, 23 Jun 2019 Prob (F-statistic): 0.0366
Time: 18:13:38 Log-Likelihood: -44.199
No. Observations: 8 AIC: 102.4
Df Residuals: 1 BIC: 103.0
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.235e+07 2.97e+06 -4.165 0.150 -5e+07 2.53e+07
GNPDEFL 106.2620 75.709 1.404 0.394 -855.708 1068.232
POP 2.2959 0.725 3.167 0.195 -6.915 11.506
GNP -0.3997 0.120 -3.339 0.185 -1.920 1.121
YEAR 6300.6231 1498.900 4.203 0.149 -1.27e+04 2.53e+04
ARMED -0.2450 0.402 -0.609 0.652 -5.354 4.864
UNEMP -6.3311 1.324 -4.782 0.131 -23.153 10.491
==============================================================================
Omnibus: 0.258 Durbin-Watson: 1.713
Prob(Omnibus): 0.879 Jarque-Bera (JB): 0.304
Skew: 0.300 Prob(JB): 0.859
Kurtosis: 2.258 Cond. No. 2.01e+10
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.01e+10. This might indicate that there are
strong multicollinearity or other numerical problems.
In addition, if we split the data into a training set and a test set as follows and compare the regression performance, we can see that overfitting has occurred.
def calc_r2(df_test, result):
    target = df.loc[df_test.index].TOTEMP
    predict_test = result.predict(df_test)
    RSS = ((predict_test - target)**2).sum()
    TSS = ((target - target.mean())**2).sum()
    return 1 - RSS / TSS
test1 = []
for i in range(10):
    df_train, df_test, result = get_model1(i)
    test1.append(calc_r2(df_test, result))
test1
[0.9815050656837723,
0.9738497543069347,
0.9879366369871746,
0.7588861967897188,
0.980720608930437,
0.8937889315168234,
0.8798563810651999,
0.9314665778963799,
0.8608525682180641,
0.9677198735170137]
When the independent variables are dependent on one another, this kind of overfitting occurs and the stability of the regression results is more likely to be harmed. The following methods can prevent this.
Delete the dependent variables using variable selection
Delete the dependent components using PCA (principal component analysis); see the sketch after this list
Use regularized methods
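As a rough illustration of the PCA idea (a sketch only; standardizing first and keeping two components are arbitrary choices, not taken from this text), the six correlated regressors can be replaced by a few mutually orthogonal principal components.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize first so that no single regressor dominates the components,
# then keep only the leading components (here 2, chosen for illustration)
X_std = StandardScaler().fit_transform(dfX)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)  # variance share kept by each component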
VIF
The most basic way to remove multicollinearity is to drop the variables that depend on the other independent variables. The VIF (variance inflation factor) can be used to find the most dependent independent variable. The VIF expresses how well an independent variable can be explained by a linear regression on the other independent variables. The VIF of the \(i\)-th variable is computed as

\[ \text{VIF}_i = \frac{1}{1 - R^2_i} \]

where \(R^2_i\) is the performance (coefficient of determination) of the linear regression of the \(i\)-th variable on the other variables. The more dependent a variable is on the others, the larger its VIF.
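For concreteness, the sketch below applies the formula directly to one variable (GNP, picked arbitrarily); whether an intercept is added to the auxiliary regression changes the exact value, so the number need not match the statsmodels output that follows exactly.
# VIF of GNP from the definition: regress GNP on the remaining regressors
# and apply 1 / (1 - R_i^2)
others = dfX.drop(columns="GNP")
r2_i = sm.OLS(dfX["GNP"], sm.add_constant(others)).fit().rsquared
print(1.0 / (1.0 - r2_i))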
In StatsModels, the VIF is computed with the variance_inflation_factor function.
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(
dfX.values, i) for i in range(dfX.shape[1])]
vif["features"] = dfX.columns
vif
|   | VIF Factor    | features |
|---|---------------|----------|
| 0 | 12425.514335  | GNPDEFL  |
| 1 | 10290.435437  | GNP      |
| 2 | 136.224354    | UNEMP    |
| 3 | 39.983386     | ARMED    |
| 4 | 101193.161993 | POP      |
| 5 | 84709.950443  | YEAR     |
Selecting independent variables with the help of the correlation coefficients and the VIF shows that a comparable level of performance can be obtained with only the three variables GNP, ARMED, and UNEMP.
def get_model2(seed):
    df_train, df_test = train_test_split(df, test_size=0.5, random_state=seed)
    model = sm.OLS.from_formula("TOTEMP ~ scale(GNP) + scale(ARMED) + scale(UNEMP)", data=df_train)
    return df_train, df_test, model.fit()
df_train, df_test, result2 = get_model2(3)
print(result2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: TOTEMP R-squared: 0.989
Model: OLS Adj. R-squared: 0.981
Method: Least Squares F-statistic: 118.6
Date: Sun, 23 Jun 2019 Prob (F-statistic): 0.000231
Time: 18:13:39 Log-Likelihood: -57.695
No. Observations: 8 AIC: 123.4
Df Residuals: 4 BIC: 123.7
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 6.538e+04 163.988 398.686 0.000 6.49e+04 6.58e+04
scale(GNP) 4338.7051 406.683 10.669 0.000 3209.571 5467.839
scale(ARMED) -812.1407 315.538 -2.574 0.062 -1688.215 63.933
scale(UNEMP) -1373.0426 349.316 -3.931 0.017 -2342.898 -403.187
==============================================================================
Omnibus: 0.628 Durbin-Watson: 2.032
Prob(Omnibus): 0.731 Jarque-Bera (JB): 0.565
Skew: 0.390 Prob(JB): 0.754
Kurtosis: 1.958 Cond. No. 4.77
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
When multicollinearity has been removed, the gap between training performance and test performance shrinks; in other words, overfitting no longer occurs.
test2 = []
for i in range(10):
    df_train, df_test, result = get_model2(i)
    test2.append(calc_r2(df_test, result))
test2
[0.9763608388904907,
0.9841984331185702,
0.9687069366140136,
0.9397304053201785,
0.9773357061188462,
0.9561262155732316,
0.980385249669863,
0.9917361722470804,
0.9837134067639467,
0.9789512977093212]
plt.subplot(121)
plt.plot(test1, 'ro', label="test performance")
plt.hlines(result1.rsquared, 0, 9, label="training performance")
plt.legend()
plt.xlabel("seed")
plt.ylabel("performance (coefficient of determination)")
plt.title("before removing multicollinearity")
plt.ylim(0.5, 1.2)
plt.subplot(122)
plt.plot(test2, 'ro', label="test performance")
plt.hlines(result2.rsquared, 0, 9, label="training performance")
plt.legend()
plt.xlabel("seed")
plt.ylabel("performance (coefficient of determination)")
plt.title("after removing multicollinearity")
plt.ylim(0.5, 1.2)
plt.suptitle("Performance before and after removing multicollinearity", y=1.04)
plt.tight_layout()
plt.show()
Application to the Boston House Price Prediction Problem
from sklearn.datasets import load_boston
boston = load_boston()
dfX0 = pd.DataFrame(boston.data, columns=boston.feature_names)
from patsy import dmatrix
formula = "scale(CRIM) + scale(I(CRIM ** 2)) + " + \
"scale(ZN) + scale(I(ZN ** 2)) + scale(INDUS) + " + \
"scale(NOX) + scale(RM) + scale(AGE) + " + \
"scale(np.log(DIS)) + scale(RAD) + scale(TAX) + " + \
"scale(np.log(PTRATIO)) + scale(B) + scale(np.log(LSTAT)) + CHAS"
dfX = dmatrix(formula, dfX0, return_type="dataframe")
dfy = pd.DataFrame(boston.target, columns=["MEDV"])
idx_outlier = \
np.array([7, 54, 148, 152, 160, 214, 253, 267, 364, 365, 367, 368, 369,
371, 372, 374, 380, 385, 397, 398, 399, 400, 401, 405, 409, 410,
412, 413, 414, 415, 416, 418, 419, 426, 445, 489, 490, 492, 505,
161, 162, 163, 166, 186, 195, 204, 225, 257, 267, 283, 368, 369,
370, 371, 372])
idx = list(set(range(len(dfX))).difference(idx_outlier))
dfX = dfX.iloc[idx, :].reset_index(drop=True)
dfy = dfy.iloc[idx, :].reset_index(drop=True)
cmap = sns.light_palette("black", as_cmap=True)
sns.heatmap(dfX.corr(), annot=True, fmt='3.1f', cmap=cmap)
plt.show()
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(
dfX.values, i) for i in range(dfX.shape[1])]
vif["features"] = dfX.columns
vif = vif.sort_values("VIF Factor").reset_index(drop=True)
vif
|    | VIF Factor | features               |
|----|------------|------------------------|
| 0  | 1.061624   | CHAS                   |
| 1  | 1.338325   | scale(B)               |
| 2  | 1.478553   | Intercept              |
| 3  | 1.780320   | scale(np.log(PTRATIO)) |
| 4  | 2.596496   | scale(RM)              |
| 5  | 3.748931   | scale(AGE)             |
| 6  | 3.807459   | scale(INDUS)           |
| 7  | 4.682812   | scale(np.log(LSTAT))   |
| 8  | 5.071802   | scale(NOX)             |
| 9  | 5.215025   | scale(np.log(DIS))     |
| 10 | 9.107858   | scale(TAX)             |
| 11 | 10.218588  | scale(I(CRIM ** 2))    |
| 12 | 11.254736  | scale(RAD)             |
| 13 | 11.751869  | scale(I(ZN ** 2))      |
| 14 | 14.646056  | scale(ZN)              |
| 15 | 21.260182  | scale(CRIM)            |
model_boston1 = sm.OLS(np.log(dfy), dfX)
result_boston1 = model_boston1.fit()
print(result_boston1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: MEDV R-squared: 0.872
Model: OLS Adj. R-squared: 0.868
Method: Least Squares F-statistic: 199.9
Date: Sun, 23 Jun 2019 Prob (F-statistic): 1.56e-185
Time: 18:13:41 Log-Likelihood: 317.45
No. Observations: 456 AIC: -602.9
Df Residuals: 440 BIC: -536.9
Df Model: 15
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 3.0338 0.007 433.880 0.000 3.020 3.048
scale(CRIM) -0.3471 0.044 -7.976 0.000 -0.433 -0.262
scale(I(CRIM ** 2)) 0.3075 0.071 4.331 0.000 0.168 0.447
scale(ZN) -0.0465 0.022 -2.110 0.035 -0.090 -0.003
scale(I(ZN ** 2)) 0.0440 0.020 2.206 0.028 0.005 0.083
scale(INDUS) 0.0037 0.012 0.323 0.747 -0.019 0.026
scale(NOX) -0.0652 0.013 -5.001 0.000 -0.091 -0.040
scale(RM) 0.0999 0.011 9.195 0.000 0.079 0.121
scale(AGE) -0.0273 0.011 -2.438 0.015 -0.049 -0.005
scale(np.log(DIS)) -0.1008 0.014 -7.368 0.000 -0.128 -0.074
scale(RAD) 0.1634 0.020 8.106 0.000 0.124 0.203
scale(TAX) -0.0934 0.018 -5.153 0.000 -0.129 -0.058
scale(np.log(PTRATIO)) -0.0699 0.008 -8.872 0.000 -0.085 -0.054
scale(B) 0.0492 0.007 6.699 0.000 0.035 0.064
scale(np.log(LSTAT)) -0.1487 0.013 -11.074 0.000 -0.175 -0.122
CHAS 0.0659 0.026 2.580 0.010 0.016 0.116
==============================================================================
Omnibus: 28.653 Durbin-Watson: 1.309
Prob(Omnibus): 0.000 Jarque-Bera (JB): 43.266
Skew: 0.465 Prob(JB): 4.03e-10
Kurtosis: 4.188 Cond. No. 35.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
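The reduced model below keeps a hand-picked subset of low-VIF variables. As a hedged alternative (the threshold of 10 is a common rule of thumb, not a value taken from this text), the choice could be automated by repeatedly dropping the column with the largest VIF.
def drop_high_vif(df_exog, threshold=10.0):
    # Iteratively drop the column with the largest VIF until every VIF
    # (excluding the intercept column) falls below the threshold
    cols = list(df_exog.columns)
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(df_exog[cols].values, i)
             for i in range(len(cols))],
            index=cols).drop("Intercept", errors="ignore")
        if vifs.max() < threshold:
            return cols
        cols.remove(vifs.idxmax())
print(drop_high_vif(dfX))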
cols = ["Intercept", "CHAS", "scale(B)", "scale(CRIM)",
"scale(np.log(PTRATIO))", "scale(RM)", "scale(np.log(LSTAT))"]
model_boston2 = sm.OLS(np.log(dfy), dfX[cols])
result_boston2 = model_boston2.fit()
print(result_boston2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: MEDV R-squared: 0.836
Model: OLS Adj. R-squared: 0.834
Method: Least Squares F-statistic: 380.7
Date: Sun, 23 Jun 2019 Prob (F-statistic): 1.42e-172
Time: 18:13:41 Log-Likelihood: 260.52
No. Observations: 456 AIC: -507.0
Df Residuals: 449 BIC: -478.2
Df Model: 6
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 3.0192 0.007 445.252 0.000 3.006 3.033
CHAS 0.0884 0.028 3.141 0.002 0.033 0.144
scale(B) 0.0558 0.008 6.989 0.000 0.040 0.072
scale(CRIM) -0.1179 0.013 -9.120 0.000 -0.143 -0.092
scale(np.log(PTRATIO)) -0.0508 0.007 -6.936 0.000 -0.065 -0.036
scale(RM) 0.1153 0.011 10.828 0.000 0.094 0.136
scale(np.log(LSTAT)) -0.1570 0.011 -14.179 0.000 -0.179 -0.135
==============================================================================
Omnibus: 29.141 Durbin-Watson: 1.113
Prob(Omnibus): 0.000 Jarque-Bera (JB): 42.637
Skew: 0.483 Prob(JB): 5.51e-10
Kurtosis: 4.145 Cond. No. 5.91
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.