다운로드
작성자: admin 작성일시: 2017-07-02 14:12:24 조회수: 2761 다운로드: 215
카테고리: 시계열 분석 태그목록:

StatsModels의 샘플 데이터

StatsModels 패키지의 개발 목표 중 하나는 기존에 R을 사용하여 통계 분석 및 시계열 분석을 하던 사용자가 파이썬에서 동일한 분석을 할 수 있도록 하는 것이다. 따라서 R에서 제공하던 명령어 뿐만 아니라 Rdatasets 이라는 프로젝트를 통해 R에서 사용하던 1000개 이상의 표준 데이터셋도 동일하게 제공하기 위해 노력하고 있다. 자세한 사항은 다음 프로젝트 홈페이지에서 확인할 수 있다.

다음은 위 프로젝트에서 제공하는 데이터셋의 목록이다.

이 목록에 있는 데이터를 가져오려면 우선 "Package"이름과 "Item"을 알아낸 후 다음에 설명하는 get_rdataset 명령을 이용한다.

get_rdataset 함수

StatsModels의 datasets 서브패키지에 있는 get_rdataset 함수를 사용하면 표준 데이터셋을 가져올 수 있다. 사용법은 다음과 같다.

get_rdataset(item, [package="datasets"])

itempackage 인수로 해당 데이터의 "Package"이름과 "Item"을 넣는다. "Package"이름이 datasets인 경우에는 생략할 수 있다.

이 함수는 인터넷에서 데이터를 다운로드 받으므로 인터넷에 연결되어 있어야 한다.

이렇게 받은 데이터는 다음과 같은 속성을 가지고 있다.

  • data: 데이터를 담고 있는 데이터프레임
  • __doc__: 데이터에 대한 설명 문자열. 이 설명은 R 패키지의 내용을 그대로 가져온 것이므로 예제 코드가 R로 되어 있어 파이썬에서 바로 사용할 수 없다.

샘플데이터의 예 1: 타이타닉 생존자

In [1]:
data = sm.datasets.get_rdataset("Titanic", package="datasets")
In [2]:
df = data.data
df.tail()
Out:
Name PClass Age Sex Survived SexCode
1308 Zakarian, Mr Artun 3rd 27.0 male 0 0
1309 Zakarian, Mr Maprieder 3rd 26.0 male 0 0
1310 Zenni, Mr Philip 3rd 22.0 male 0 0
1311 Lievens, Mr Rene 3rd 24.0 male 0 0
1312 Zimmerman, Leo 3rd 29.0 male 0 0
In [3]:
print(data.__doc__)
+--------------------------------------+--------------------------------------+
| Titanic                              |
| R Documentation                      |
+--------------------------------------+--------------------------------------+

Survival of passengers on the Titanic
-------------------------------------

Description
~~~~~~~~~~~

This data set provides information on the fate of passengers on the
fatal maiden voyage of the ocean liner ‘Titanic’, summarized according
to economic status (class), sex, age and survival.

Usage
~~~~~

::

    Titanic

Format
~~~~~~

A 4-dimensional array resulting from cross-tabulating 2201 observations
on 4 variables. The variables and their levels are as follows:

+--------------------------+--------------------------+--------------------------+
| No                       | 1                        | 2                        |
| Name                     | Class                    | Sex                      |
| Levels                   | 1st, 2nd, 3rd, Crew      | Male, Female             |
+--------------------------+--------------------------+--------------------------+

Details
~~~~~~~

The sinking of the Titanic is a famous event, and new books are still
being published about it. Many well-known facts—from the proportions of
first-class passengers to the ‘women and children first’ policy, and the
fact that that policy was not entirely successful in saving the women
and children in the third class—are reflected in the survival rates for
various classes of passenger.

These data were originally collected by the British Board of Trade in
their investigation of the sinking. Note that there is not complete
agreement among primary sources as to the exact numbers on board,
rescued, or lost.

Due in particular to the very successful film ‘Titanic’, the last years
saw a rise in public interest in the Titanic. Very detailed data about
the passengers is now available on the Internet, at sites such as
*Encyclopedia Titanica* (http://www.rmplc.co.uk/eduweb/sites/phind).

Source
~~~~~~

Dawson, Robert J. MacG. (1995), The ‘Unusual Episode’ Data Revisited.
*Journal of Statistics Education*, **3**.
https://www.amstat.org/publications/jse/v3n3/datasets.dawson.html

The source provides a data set recording class, sex, age, and survival
status for each person on board of the Titanic, and is based on data
originally collected by the British Board of Trade and reprinted in:

British Board of Trade (1990), *Report on the Loss of the ‘Titanic’
(S.S.)*. British Board of Trade Inquiry Report (reprint). Gloucester,
UK: Allan Sutton Publishing.

Examples
~~~~~~~~

::

    require(graphics)
    mosaicplot(Titanic, main = "Survival on the Titanic")
    ## Higher survival rates in children?
    apply(Titanic, c(3, 4), sum)
    ## Higher survival rates in females?
    apply(Titanic, c(2, 4), sum)
    ## Use loglm() in package 'MASS' for further analysis ...


샘플데이터의 예 2: 미국 강수량

In [19]:
data = sm.datasets.get_rdataset("precip")
In [20]:
print(data.__doc__)
+--------------------------------------+--------------------------------------+
| precip                               |
| R Documentation                      |
+--------------------------------------+--------------------------------------+

Annual Precipitation in US Cities
---------------------------------

Description
~~~~~~~~~~~

The average amount of precipitation (rainfall) in inches for each of 70
United States (and Puerto Rico) cities.

Usage
~~~~~

::

    precip

Format
~~~~~~

A named vector of length 70.

Source
~~~~~~

Statistical Abstracts of the United States, 1975.

References
~~~~~~~~~~

McNeil, D. R. (1977) *Interactive Data Analysis*. New York: Wiley.

Examples
~~~~~~~~

::

    require(graphics)
    dotchart(precip[order(precip)], main = "precip data")
    title(sub = "Average annual precipitation (in.)")


In [21]:
df = data.data
df.tail()
Out:
x
65 17.4
66 40.8
67 29.1
68 14.6
69 59.2
In [22]:
df.plot()
plt.show()

샘플데이터의 예 3: 황체형성 호르몬 수치 시계열

In [4]:
data = sm.datasets.get_rdataset("lh")
In [5]:
print(data.__doc__)
+--------------------------------------+--------------------------------------+
| lh                                   |
| R Documentation                      |
+--------------------------------------+--------------------------------------+

Luteinizing Hormone in Blood Samples
------------------------------------

Description
~~~~~~~~~~~

A regular time series giving the luteinizing hormone in blood samples at
10 mins intervals from a human female, 48 samples.

Usage
~~~~~

::

    lh

Source
~~~~~~

P.J. Diggle (1990) *Time Series: A Biostatistical Introduction.* Oxford,
table A.1, series 3

In [6]:
df = data.data
df.tail()
Out:
time lh
43 44 2.6
44 45 2.1
45 46 3.4
46 47 3.0
47 48 2.9
In [7]:
df.plot(x="time", y="lh")
plt.show()

샘플데이터의 예 4: 호흡기질환 사망자수

In [8]:
data = sm.datasets.get_rdataset("deaths", "MASS")
In [9]:
print(data.__doc__)
+--------------------------------------+--------------------------------------+
| deaths                               |
| R Documentation                      |
+--------------------------------------+--------------------------------------+

Monthly Deaths from Lung Diseases in the UK
-------------------------------------------

Description
~~~~~~~~~~~

A time series giving the monthly deaths from bronchitis, emphysema and
asthma in the UK, 1974-1979, both sexes (``deaths``),

Usage
~~~~~

::

    deaths

Source
~~~~~~

P. J. Diggle (1990) *Time Series: A Biostatistical Introduction.*
Oxford, table A.3

References
~~~~~~~~~~

Venables, W. N. and Ripley, B. D. (2002) *Modern Applied Statistics with
S.* Fourth edition. Springer.

See Also
~~~~~~~~

This the same as dataset ``ldeaths`` in **R**'s datasets package.

In [10]:
df = data.data
df.tail()
Out:
time deaths
67 1979.583333 1354
68 1979.666667 1333
69 1979.750000 1492
70 1979.833333 1781
71 1979.916667 1915

이 시계열 데이터에서는 시간이 1년을 1.0으로, 1개월을 1/12로 하는 값(year-fraciton)으로 인코딩되어 있다. 이 값을 파이썬의 datatime 포맷으로 바꾸려면 다음과 같은 함수를 사용해야 한다.

In [11]:
def yearfraction2datetime(yearfraction, startyear=0):
    import datetime, dateutil
    year = int(yearfraction) + startyear
    month = int(round(12 * (yearfraction - year)))
    delta = dateutil.relativedelta.relativedelta(months=month)
    date = datetime.datetime(year, 1, 1) + delta
    return date
In [12]:
df["datetime"] = df.time.map(yearfraction2datetime)
df.tail()
Out:
time deaths datetime
67 1979.583333 1354 1979-08-01
68 1979.666667 1333 1979-09-01
69 1979.750000 1492 1979-10-01
70 1979.833333 1781 1979-11-01
71 1979.916667 1915 1979-12-01
In [13]:
df.plot(x="datetime", y="deaths")
plt.show()

샘플데이터의 예 5: 항공 운송인원

In [14]:
data = sm.datasets.get_rdataset("AirPassengers")
In [15]:
print(data.__doc__)
+--------------------------------------+--------------------------------------+
| AirPassengers                        |
| R Documentation                      |
+--------------------------------------+--------------------------------------+

Monthly Airline Passenger Numbers 1949-1960
-------------------------------------------

Description
~~~~~~~~~~~

The classic Box & Jenkins airline data. Monthly totals of international
airline passengers, 1949 to 1960.

Usage
~~~~~

::

    AirPassengers

Format
~~~~~~

A monthly time series, in thousands.

Source
~~~~~~

Box, G. E. P., Jenkins, G. M. and Reinsel, G. C. (1976) *Time Series
Analysis, Forecasting and Control.* Third Edition. Holden-Day. Series G.

Examples
~~~~~~~~

::

    ## Not run: 
    ## These are quite slow and so not run by example(AirPassengers)

    ## The classic 'airline model', by full ML
    (fit <- arima(log10(AirPassengers), c(0, 1, 1),
                  seasonal = list(order = c(0, 1, 1), period = 12)))
    update(fit, method = "CSS")
    update(fit, x = window(log10(AirPassengers), start = 1954))
    pred <- predict(fit, n.ahead = 24)
    tl <- pred$pred - 1.96 * pred$se
    tu <- pred$pred + 1.96 * pred$se
    ts.plot(AirPassengers, 10^tl, 10^tu, log = "y", lty = c(1, 2, 2))

    ## full ML fit is the same if the series is reversed, CSS fit is not
    ap0 <- rev(log10(AirPassengers))
    attributes(ap0) <- attributes(AirPassengers)
    arima(ap0, c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
    arima(ap0, c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12),
          method = "CSS")

    ## Structural Time Series
    ap <- log10(AirPassengers) - 2
    (fit <- StructTS(ap, type = "BSM"))
    par(mfrow = c(1, 2))
    plot(cbind(ap, fitted(fit)), plot.type = "single")
    plot(cbind(ap, tsSmooth(fit)), plot.type = "single")

    ## End(Not run)


In [16]:
df = data.data
df.tail()
Out:
time AirPassengers
139 1960.583333 606
140 1960.666667 508
141 1960.750000 461
142 1960.833333 390
143 1960.916667 432
In [17]:
df["datetime"] = df.time.map(yearfraction2datetime)
df.tail()
Out:
time AirPassengers datetime
139 1960.583333 606 1960-08-01
140 1960.666667 508 1960-09-01
141 1960.750000 461 1960-10-01
142 1960.833333 390 1960-11-01
143 1960.916667 432 1960-12-01
In [18]:
df.plot(x="datetime", y="AirPassengers")
plt.show()

질문/덧글

아직 질문이나 덧글이 없습니다. 첫번째 글을 남겨주세요!