Author: admin Posted: 2018-12-21 17:59:58 Views: 253 Downloads: 23

Learning from Large-Scale Data

Incremental Learning

In [1]:
import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

covtype = fetch_covtype(shuffle=True, random_state=0)
X_covtype = covtype.data
y_covtype = covtype.target - 1  # shift labels so they start at 0
classes = np.unique(y_covtype)
X_train, X_test, y_train, y_test = train_test_split(X_covtype, y_covtype)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

def read_Xy(start, end):
    # In practice, each chunk would be read from a file or a database.
    idx = list(range(start, min(len(y_train), end)))
    X = X_train[idx, :]
    y = y_train[idx]
    return X, y
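In real streaming workloads the chunk boundaries would not be computed by hand at each call site; a generator that yields successive mini-batches is a common pattern. The sketch below mirrors `read_Xy` in-memory (the toy arrays are hypothetical stand-ins for data coming off disk):

```python
import numpy as np

def iter_chunks(X, y, chunk_size):
    # Yield successive (X, y) mini-batches, mirroring read_Xy above;
    # in production each chunk would come from a file or database query.
    for start in range(0, len(y), chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
chunks = list(iter_chunks(X, y, chunk_size=4))
# 10 rows in chunks of 4 -> chunk sizes 4, 4, 2
```

Unlike the index-list version, slicing never runs past the end of the array, so no explicit `min()` guard is needed.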

SGD

In [2]:
%%time

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

model = SGDClassifier(random_state=0)
n_split = 10
n_X = len(y_train) // n_split
n_epoch = 10
for epoch in range(n_epoch):
    for n in range(n_split):
        X, y = read_Xy(n * n_X, (n + 1) * n_X)
        model.partial_fit(X, y, classes=classes)
    accuracy_train = accuracy_score(y_train, model.predict(X_train))
    accuracy_test = accuracy_score(y_test, model.predict(X_test))
    print("epoch={:d} train acc={:5.3f} test acc={:5.3f}".format(epoch, accuracy_train, accuracy_test))
epoch=0 train acc=0.704 test acc=0.704
epoch=1 train acc=0.707 test acc=0.706
epoch=2 train acc=0.708 test acc=0.707
epoch=3 train acc=0.709 test acc=0.708
epoch=4 train acc=0.710 test acc=0.709
epoch=5 train acc=0.710 test acc=0.709
epoch=6 train acc=0.710 test acc=0.709
epoch=7 train acc=0.710 test acc=0.709
epoch=8 train acc=0.711 test acc=0.709
epoch=9 train acc=0.711 test acc=0.710
CPU times: user 9.21 s, sys: 230 ms, total: 9.44 s
Wall time: 9.45 s
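The one subtlety of the `partial_fit` API is that the first call must be told the full label set, since any single chunk may be missing some classes. A minimal sketch on synthetic data (the toy arrays are assumptions, not the covtype data above):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical two-class data standing in for one stream of chunks.
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = (X[:, 0] > 0).astype(int)

model = SGDClassifier(random_state=0)
# First call: `classes` is required so the model knows every possible label.
model.partial_fit(X[:50], y[:50], classes=np.array([0, 1]))
# Subsequent calls may omit `classes`.
model.partial_fit(X[50:], y[50:])
```

Omitting `classes` on the first call raises a `ValueError`, which is why the loop above passes `classes=classes` on every iteration (harmless after the first).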

Naive Bayes Model

In [3]:
%%time

from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

model = BernoulliNB(alpha=0.1)

n_split = 10
n_X = len(y_train) // n_split
for n in range(n_split):
    X, y = read_Xy(n * n_X, (n + 1) * n_X)
    model.partial_fit(X, y, classes=classes)
    accuracy_train = accuracy_score(y_train, model.predict(X_train))
    accuracy_test = accuracy_score(y_test, model.predict(X_test)) 
    print("n={:d} train accuracy={:5.3f} test accuracy={:5.3f}".format(n, accuracy_train, accuracy_test))
n=0 train accuracy=0.631 test accuracy=0.632
n=1 train accuracy=0.632 test accuracy=0.632
n=2 train accuracy=0.634 test accuracy=0.635
n=3 train accuracy=0.633 test accuracy=0.634
n=4 train accuracy=0.632 test accuracy=0.633
n=5 train accuracy=0.632 test accuracy=0.633
n=6 train accuracy=0.632 test accuracy=0.633
n=7 train accuracy=0.632 test accuracy=0.633
n=8 train accuracy=0.633 test accuracy=0.633
n=9 train accuracy=0.632 test accuracy=0.633
CPU times: user 3.18 s, sys: 730 ms, total: 3.91 s
Wall time: 3.87 s
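Naive Bayes is especially well suited to incremental learning because its sufficient statistics are just counts, which are additive across chunks. So chunked `partial_fit` should reproduce a single batch `fit` exactly. A small check on synthetic binary data (the random arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 5))
y = rng.randint(0, 2, size=200)

# Batch training on all the data at once.
full = BernoulliNB(alpha=0.1).fit(X, y)

# Incremental training on two halves; counts accumulate across calls.
inc = BernoulliNB(alpha=0.1)
inc.partial_fit(X[:100], y[:100], classes=np.array([0, 1]))
inc.partial_fit(X[100:], y[100:])
```

Both models end up with identical `feature_log_prob_` and `class_log_prior_`, which is why a single pass over the chunks suffices here, with no epoch loop as in the SGD case.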

Gradient Boosting

In [4]:
%%time

from lightgbm import train, Dataset
from sklearn.metrics import accuracy_score

params = {
    'objective': 'multiclass',
    "num_class": len(classes),
    'learning_rate': 0.2,
    'seed': 0,
}

n_split = 10
n_X = len(y_train) // n_split
num_tree = 10
model = None
for n in range(n_split):
    X, y = read_Xy(n * n_X, (n + 1) * n_X)
    model = train(params, init_model=model, train_set=Dataset(X, y),
                  keep_training_booster=False, num_boost_round=num_tree)
    accuracy_train = accuracy_score(y_train, np.argmax(model.predict(X_train), axis=1))
    accuracy_test = accuracy_score(y_test, np.argmax(model.predict(X_test), axis=1)) 
    print("n={:d} train accuracy={:5.3f} test accuracy={:5.3f}".format(n, accuracy_train, accuracy_test))
n=0 train accuracy=0.770 test accuracy=0.768
n=1 train accuracy=0.792 test accuracy=0.789
n=2 train accuracy=0.807 test accuracy=0.803
n=3 train accuracy=0.819 test accuracy=0.813
n=4 train accuracy=0.824 test accuracy=0.818
n=5 train accuracy=0.816 test accuracy=0.810
n=6 train accuracy=0.819 test accuracy=0.812
n=7 train accuracy=0.818 test accuracy=0.811
n=8 train accuracy=0.803 test accuracy=0.797
n=9 train accuracy=0.801 test accuracy=0.796
CPU times: user 2min 37s, sys: 520 ms, total: 2min 37s
Wall time: 41.1 s

Random Forest

In [5]:
%%time

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

n_split = 10
n_X = len(y_train) // n_split
num_tree_ini = 10
num_tree_step = 10
model = RandomForestClassifier(n_estimators=num_tree_ini, warm_start=True)
for n in range(n_split):
    X, y = read_Xy(n * n_X, (n + 1) * n_X)
    model.fit(X, y)
    accuracy_train = accuracy_score(y_train, model.predict(X_train))
    accuracy_test = accuracy_score(y_test, model.predict(X_test))
    print("epoch={:d} train accuracy={:5.3f} test accuracy={:5.3f}".format(n, accuracy_train, accuracy_test))
    
    model.n_estimators += num_tree_step
epoch=0 train accuracy=0.871 test accuracy=0.859
epoch=1 train accuracy=0.892 test accuracy=0.875
epoch=2 train accuracy=0.899 test accuracy=0.881
epoch=3 train accuracy=0.902 test accuracy=0.882
epoch=4 train accuracy=0.904 test accuracy=0.885
epoch=5 train accuracy=0.906 test accuracy=0.887
epoch=6 train accuracy=0.907 test accuracy=0.887
epoch=7 train accuracy=0.907 test accuracy=0.888
epoch=8 train accuracy=0.907 test accuracy=0.887
epoch=9 train accuracy=0.908 test accuracy=0.888
CPU times: user 1min 8s, sys: 680 ms, total: 1min 9s
Wall time: 1min 9s
