Learning with Large-Scale Data¶
With large-scale data (big data), memory and similar constraints often rule out training certain models in a single pass. In that case,
- generative models whose prior distribution can be set, and
- models whose initial weights can be set
are used together with incremental learning: the full data set is split into chunks small enough to process, and the model is trained one chunk at a time. A minimal sketch of this pattern follows; the rest of the section applies it to several model families.
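Here is a small self-contained sketch of the pattern; the load_chunk function and its synthetic data are hypothetical stand-ins for reading one chunk from storage:
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in for reading one chunk of data from storage.
def load_chunk(i, size=1000):
    rng = np.random.default_rng(i)
    X = rng.random((size, 4))
    y = (X.sum(axis=1) > 2).astype(int)
    return X, y

model = SGDClassifier()
for i in range(10):
    X_chunk, y_chunk = load_chunk(i)
    # classes must list every label, since one chunk may not contain them all
    model.partial_fit(X_chunk, y_chunk, classes=[0, 1])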
import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

covtype = fetch_covtype(shuffle=True, random_state=0)
X_covtype = covtype.data
y_covtype = covtype.target - 1  # shift labels to start at 0
classes = np.unique(y_covtype)
X_train, X_test, y_train, y_test = train_test_split(X_covtype, y_covtype)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
def read_Xy(start, end):
    # In practice the chunk would be read from a file or a database.
    idx = list(range(start, min(len(y_train), end)))
    X = X_train[idx, :]
    y = y_train[idx]
    return X, y
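As the comment notes, in practice the chunks would come from storage rather than from a slice of an in-memory array. A minimal sketch of streaming a large CSV with pandas; the file name "covtype.csv" and the label column "target" are assumptions for illustration:
import pandas as pd

# Hypothetical out-of-core reader: stream "covtype.csv" in chunks of 10,000 rows.
# The file name and the "target" label column are assumed names.
for chunk in pd.read_csv("covtype.csv", chunksize=10000):
    X = chunk.drop(columns="target").to_numpy()
    y = chunk["target"].to_numpy()
    # ... pass (X, y) to the model's partial_fit here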
SGD¶
Models trained by stochastic gradient descent, such as the perceptron, update their weights continuously, so the weights obtained from one portion of the data can be used as the initial weights for the next step.
%%time
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

model = SGDClassifier(random_state=0)
n_split = 10
n_X = len(y_train) // n_split
n_epoch = 10
for epoch in range(n_epoch):
    for n in range(n_split):
        X, y = read_Xy(n * n_X, (n + 1) * n_X)
        model.partial_fit(X, y, classes=classes)
    accuracy_train = accuracy_score(y_train, model.predict(X_train))
    accuracy_test = accuracy_score(y_test, model.predict(X_test))
    print("epoch={:d} train acc={:5.3f} test acc={:5.3f}".format(epoch, accuracy_train, accuracy_test))
epoch=0 train acc=0.707 test acc=0.707
epoch=1 train acc=0.710 test acc=0.710
epoch=2 train acc=0.711 test acc=0.710
epoch=3 train acc=0.711 test acc=0.711
epoch=4 train acc=0.711 test acc=0.711
epoch=5 train acc=0.711 test acc=0.711
epoch=6 train acc=0.711 test acc=0.711
epoch=7 train acc=0.710 test acc=0.710
epoch=8 train acc=0.711 test acc=0.711
epoch=9 train acc=0.711 test acc=0.711
CPU times: user 15.2 s, sys: 330 ms, total: 15.6 s
Wall time: 9.21 s
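One detail worth noting: SGDClassifier trains a linear SVM by default (loss="hinge"). To get the perceptron update rule mentioned above, the loss can be selected explicitly:
from sklearn.linear_model import SGDClassifier

# loss="perceptron" selects the perceptron update rule;
# the default loss="hinge" trains a linear SVM instead.
model = SGDClassifier(loss="perceptron", random_state=0)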
Naive Bayes Model¶
Generative models such as naive Bayes can use the probability distribution estimated from one portion of the data as the prior distribution for the next.
%%time
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

model = BernoulliNB(alpha=0.1)
n_split = 10
n_X = len(y_train) // n_split
for n in range(n_split):
    X, y = read_Xy(n * n_X, (n + 1) * n_X)
    model.partial_fit(X, y, classes=classes)
    accuracy_train = accuracy_score(y_train, model.predict(X_train))
    accuracy_test = accuracy_score(y_test, model.predict(X_test))
    print("n={:d} train accuracy={:5.3f} test accuracy={:5.3f}".format(n, accuracy_train, accuracy_test))
n=0 train accuracy=0.630 test accuracy=0.628
n=1 train accuracy=0.630 test accuracy=0.629
n=2 train accuracy=0.632 test accuracy=0.630
n=3 train accuracy=0.633 test accuracy=0.632
n=4 train accuracy=0.633 test accuracy=0.631
n=5 train accuracy=0.632 test accuracy=0.631
n=6 train accuracy=0.632 test accuracy=0.631
n=7 train accuracy=0.632 test accuracy=0.630
n=8 train accuracy=0.632 test accuracy=0.630
n=9 train accuracy=0.632 test accuracy=0.630
CPU times: user 10.1 s, sys: 920 ms, total: 11 s
Wall time: 2.78 s
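This works so cleanly because BernoulliNB only accumulates class and feature counts, so partial_fit over chunks produces exactly the same model as a single fit on all the data. A small self-contained check on synthetic data (for illustration only):
# Sketch: incremental BernoulliNB equals batch BernoulliNB,
# since partial_fit merely accumulates counts. Synthetic data for illustration.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
Xd = (rng.random((1000, 5)) > 0.5).astype(float)
yd = rng.integers(0, 2, 1000)

batch = BernoulliNB(alpha=0.1).fit(Xd, yd)
inc = BernoulliNB(alpha=0.1)
for i in range(0, 1000, 100):
    inc.partial_fit(Xd[i:i + 100], yd[i:i + 100], classes=np.array([0, 1]))

print(np.allclose(batch.feature_log_prob_, inc.feature_log_prob_))  # True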
Gradient Boosting¶
In gradient boosting, a model trained on one portion of the data can be used as the initial committee members (the starting ensemble of trees) for the next round.
%%time
from lightgbm import train, Dataset
from sklearn.metrics import accuracy_score

params = {
    "objective": "multiclass",
    "num_class": len(classes),
    "learning_rate": 0.2,
    "seed": 0,
}

n_split = 10
n_X = len(y_train) // n_split
num_tree = 10
model = None
for n in range(n_split):
    X, y = read_Xy(n * n_X, (n + 1) * n_X)
    model = train(params, init_model=model, train_set=Dataset(X, y),
                  keep_training_booster=False, num_boost_round=num_tree)
    accuracy_train = accuracy_score(y_train, np.argmax(model.predict(X_train), axis=1))
    accuracy_test = accuracy_score(y_test, np.argmax(model.predict(X_test), axis=1))
    print("n={:d} train accuracy={:5.3f} test accuracy={:5.3f}".format(n, accuracy_train, accuracy_test))
n=0 train accuracy=0.769 test accuracy=0.767
n=1 train accuracy=0.792 test accuracy=0.788
n=2 train accuracy=0.806 test accuracy=0.802
n=3 train accuracy=0.816 test accuracy=0.812
n=4 train accuracy=0.812 test accuracy=0.807
n=5 train accuracy=0.808 test accuracy=0.804
n=6 train accuracy=0.805 test accuracy=0.800
n=7 train accuracy=0.793 test accuracy=0.788
n=8 train accuracy=0.795 test accuracy=0.790
n=9 train accuracy=0.794 test accuracy=0.789
CPU times: user 3min 17s, sys: 3.59 s, total: 3min 20s
Wall time: 1min 3s
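For reference, the same continued training is available through lightgbm's scikit-learn wrapper, whose fit method accepts an init_model argument in recent versions; a sketch under that assumption:
from lightgbm import LGBMClassifier

# Sketch: continued training via the scikit-learn API. Each round starts a new
# classifier from the previous one through init_model (assumed available).
clf = None
for n in range(n_split):
    X, y = read_Xy(n * n_X, (n + 1) * n_X)
    new_clf = LGBMClassifier(n_estimators=num_tree, learning_rate=0.2, random_state=0)
    new_clf.fit(X, y, init_model=clf)
    clf = new_clf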
Random Forest¶
In ensemble models such as random forests, models trained on portions of the data can be used as the individual classifiers.
%%time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

n_split = 10
n_X = len(y_train) // n_split
num_tree_ini = 10
num_tree_step = 10
model = RandomForestClassifier(n_estimators=num_tree_ini, warm_start=True)
for n in range(n_split):
    X, y = read_Xy(n * n_X, (n + 1) * n_X)
    model.fit(X, y)
    accuracy_train = accuracy_score(y_train, model.predict(X_train))
    accuracy_test = accuracy_score(y_test, model.predict(X_test))
    print("epoch={:d} train accuracy={:5.3f} test accuracy={:5.3f}".format(n, accuracy_train, accuracy_test))
    # grow the forest: the next fit() keeps existing trees and adds new ones
    model.n_estimators += num_tree_step
epoch=0 train accuracy=0.868 test accuracy=0.855
epoch=1 train accuracy=0.892 test accuracy=0.874
epoch=2 train accuracy=0.898 test accuracy=0.880
epoch=3 train accuracy=0.902 test accuracy=0.884
epoch=4 train accuracy=0.904 test accuracy=0.885
epoch=5 train accuracy=0.906 test accuracy=0.886
epoch=6 train accuracy=0.906 test accuracy=0.887
epoch=7 train accuracy=0.907 test accuracy=0.887
epoch=8 train accuracy=0.907 test accuracy=0.888
epoch=9 train accuracy=0.907 test accuracy=0.888
CPU times: user 4min 46s, sys: 11.4 s, total: 4min 57s
Wall time: 1min 42s
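The warm_start=True flag is what makes this incremental: each fit() call keeps the trees already grown and trains only the newly added ones on the chunk passed to that call. A minimal illustration:
from sklearn.ensemble import RandomForestClassifier

# With warm_start=True, fit() keeps existing trees and only grows
# the n_estimators increment on the newly supplied data.
rf = RandomForestClassifier(n_estimators=10, warm_start=True)
rf.fit(X_train[:1000], y_train[:1000])
print(len(rf.estimators_))  # 10
rf.n_estimators += 10
rf.fit(X_train[1000:2000], y_train[1000:2000])
print(len(rf.estimators_))  # 20; the first 10 trees are unchanged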