Visualization Features in Pandas

A Pandas Series or DataFrame has a built-in visualization method called plot. Internally, plot imports and uses matplotlib.

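import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)  # fix the seed so the random values are reproducible
# cumulative sum of 100 days of random values in three columns
df1 = pd.DataFrame(np.random.randn(100, 3),
                   index=pd.date_range('1/1/2018', periods=100),
                   columns=['A', 'B', 'C']).cumsum()
df1.tail()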

	A	B	C
2018-04-06	9.396256	6.282026	-11.198087
2018-04-07	10.086074	7.583872	-11.826175
2018-04-08	9.605047	9.887789	-12.886190
2018-04-09	9.469097	11.024680	-12.788465
2018-04-10	10.052051	10.625231	-12.418409

df1.plot()
plt.title("Example of the Pandas plot Method")
plt.xlabel("Time")
plt.ylabel("Data")
plt.show()

../_images/f77505f613125817c9466a9f95ff5739d91134e4b5adaf0aeb3d031a59ce8ee0.png
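
Because plot is built on matplotlib, its return value can be handled like any other matplotlib object. The following is a minimal sketch, assuming a small made-up series (the name s and its values are not part of the original example), that checks the type of the returned object and then adjusts it with ordinary Axes methods.

```python
import pandas as pd
from matplotlib.axes import Axes

# Hypothetical data used only for this illustration
s = pd.Series([1, 3, 2, 5, 4], name="value")

ax = s.plot()  # for a Series, the default kind is a line plot

# The return value is a regular matplotlib Axes object
is_axes = isinstance(ax, Axes)
print(is_axes)

# so the usual Axes methods can be applied to it
ax.set_title("Series.plot returns a matplotlib Axes")
ax.set_xlabel("index")
ax.set_ylabel("value")
```

In a script, calling plt.show() afterwards displays the figure, as in the examples in this section.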

Changing the kind argument of the plot method produces different types of plots. Besides the default line plot, the supported values include the following.

bar
pie
hist
kde
box
scatter
area
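
Two of the values listed above, scatter and area, can be sketched as follows. This is only an illustration under stated assumptions: the data frame df and its columns x and y are made up for the example. A scatter plot needs the column names for both axes, and an area plot stacks the columns by default, so all values must have the same sign.

```python
import numpy as np
import pandas as pd

# Hypothetical data used only for this illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=50),
    "y": rng.normal(size=50),
})

# scatter needs the names of the columns to use for each axis
ax_scatter = df.plot(kind="scatter", x="x", y="y")
ax_scatter.set_title("scatter")

# area stacks the columns by default, so every value must share
# the same sign; absolute values are taken to guarantee that here
ax_area = df.abs().plot(kind="area")
ax_area.set_title("area")
```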

import seaborn as sns

iris = sns.load_dataset("iris")    # iris flower data
titanic = sns.load_dataset("titanic")    # Titanic passenger data

iris.sepal_length[:20].plot(kind='bar', rot=0)
plt.title("Sepal Length Visualization")
plt.xlabel("Data")
plt.ylabel("Sepal length")
plt.show()

../_images/8f476a70c7a741ab9f38d0e0009588c99d07033a3e4392a35d4e90d14a30b9af.png

Instead of passing a string to the kind argument, you can also call the plot type directly as a method, as in plot.bar.

iris[:5].plot.bar(rot=0)
plt.title("Bar Plot of the Iris Data")
plt.xlabel("Data")
plt.ylabel("Value of each feature")
plt.ylim(0, 7)
plt.show()

../_images/1bd0d15254fd2b5648858a0acb97e45e98c1e291c308e79827aeb6fa4bc90d69.png
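
The equivalence of the two calling styles can be checked with a minimal sketch. The data frame df below is hypothetical and exists only for this comparison. Both calls should draw the same bars, one for each cell of the data frame.

```python
import pandas as pd

# Hypothetical data used only for this illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]}, index=["p", "q", "r"])

ax_kind = df.plot(kind="bar", rot=0)   # kind passed as a string
ax_method = df.plot.bar(rot=0)         # same plot through the accessor method

# Both calls draw one bar per cell: 3 rows x 2 columns
n_kind = len(ax_kind.patches)
n_method = len(ax_method.patches)
print(n_kind, n_method)
```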

The following uses group analysis to compute the mean of each feature for each iris species.

# group by column name so that the species key itself is excluded from the mean
df2 = iris.groupby("species").mean()
df2.columns.name = "feature"
df2

feature	sepal_length	sepal_width	petal_length	petal_width
species
setosa	5.006	3.428	1.462	0.246
versicolor	5.936	2.770	4.260	1.326
virginica	6.588	2.974	5.552	2.026

The result of the group analysis is also a DataFrame, so it can be visualized in the same way.

df2.plot.bar(rot=0)
plt.title("Feature Means for Each Species")
plt.xlabel("Species")
plt.ylabel("Mean")
plt.ylim(0, 8)
plt.show()

../_images/68744cf7618f587de878b2587192e094aa17ad5eee9972c06bfec07825736180.png

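Transposing the DataFrame changes how the same data is arranged in the plot: the features move to the x-axis and the species become the legend entries.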

df2.T.plot.bar(rot=0)
plt.title("Species Means for Each Feature")
plt.xlabel("Feature")
plt.ylabel("Mean")
plt.show()

../_images/98cd477334f49d60af022d31976ad7044a864ccbc7961407dbb6f8145f885fda.png
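
The effect of the transpose can also be read off the axes directly. The sketch below assumes a small made-up table of group means (the name means and its labels are hypothetical) and compares the x-axis tick labels and legend entries before and after transposing.

```python
import pandas as pd

# Hypothetical group means used only for this illustration
means = pd.DataFrame(
    {"height": [1.0, 2.0], "width": [3.0, 4.0]},
    index=["group1", "group2"],
)

ax = means.plot.bar(rot=0)       # rows on the x-axis, columns in the legend
ax_t = means.T.plot.bar(rot=0)   # after transposing, the roles are swapped

ticks = [t.get_text() for t in ax.get_xticklabels()]
ticks_t = [t.get_text() for t in ax_t.get_xticklabels()]
legend_t = [t.get_text() for t in ax_t.get_legend().get_texts()]
print(ticks, ticks_t, legend_t)
```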

The following are examples of pie, hist, kde, and box plots.

df3 = titanic.pclass.value_counts()
df3.plot.pie(autopct='%.2f%%')
plt.title("Share of Passengers by Cabin Class")
plt.axis('equal')
plt.show()

../_images/250cfc17c664a6348cedf1af5a56502c5c54c530a02738acc45701531c301047.png

iris.plot.hist()
plt.title("Histogram of the Frequency of Each Feature's Values")
plt.xlabel("Data value")
plt.show()

../_images/4ba186551eb1608fa2cab7377495ef6a80362873e56e52742020c5a4052bb0e5.png
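
When several columns share one histogram, the bars overlap. One possible way to keep them readable, sketched here with made-up data (df and its columns a and b are assumptions for illustration), is to set the number of intervals with bins and make the bars translucent with alpha.

```python
import numpy as np
import pandas as pd

# Hypothetical data used only for this illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(loc=0.0, size=200),
    "b": rng.normal(loc=2.0, size=200),
})

# bins sets the number of intervals; alpha < 1 makes the bars translucent
ax = df.plot.hist(bins=20, alpha=0.5)
ax.set_xlabel("value")

n_bars = len(ax.patches)  # 20 bars for each of the 2 columns
print(n_bars)
```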

iris.plot.kde()
plt.title("Kernel Density Plot of Each Feature's Values")
plt.xlabel("Data value")
plt.show()

../_images/2e1299507457b854512c8d505fa6d1a1536215d5546bcab40a9af73f468e55d4.png

iris.plot.box()
plt.title("Box Plot of the Distribution of Each Feature's Values")
plt.xlabel("Feature")
plt.ylabel("Data value")
plt.show()

../_images/0fca71144bc407f09f04651098b5e56c6e73418dd6dbcb3f32068e0c98e3feac.png

For box plots, there is also a separate boxplot method with additional features, such as grouping the data with the by argument.

iris.boxplot(by='species')
plt.tight_layout(pad=3, h_pad=1)
plt.suptitle("Box Plot of Each Feature by Species")  # figure-level title; plt.title would overwrite one panel's title
plt.show()

../_images/65278a2a82f6a52be48a09624e9b3f66a0e7ff8b4420da47276d5c1516f79ac7.png
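
The boxplot method can also be restricted to selected columns. The sketch below assumes a made-up data frame (df with the columns height, weight, and group is hypothetical) and combines the column and by arguments, which should produce one panel per selected column, each split by group.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data used only for this illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(size=60),
    "weight": rng.normal(size=60),
    "group": ["a", "b", "c"] * 20,
})

# column selects the features to draw; by splits each one by group
df.boxplot(column=["height", "weight"], by="group")

fig = plt.gcf()  # the figure that boxplot has just drawn into
panel_titles = sorted(a.get_title() for a in fig.axes)
print(panel_titles)
```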