๋จธ์‹ ๋Ÿฌ๋‹

์ฐจ์› ์ถ•์†Œ, PCA python ์‚ฌ์ดํ‚ท๋Ÿฐ

์ฃผ์˜ ๐Ÿฑ 2022. 12. 5. 12:49

๋จธ์‹ ๋Ÿฌ๋‹์˜ ๋งŽ์€ ๋ฌธ์ œ๋Š” train sample ์ด ์ˆ˜์ฒœ์—์„œ ์ˆ˜๋ฐฑ๋งŒ๊ฐœ์˜ ํŠน์„ฑ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์‚ฌ๋žŒ์€ 3์ฐจ์› ๊ณต๊ฐ„์—์„œ ์‚ด๊ณ  ์žˆ๊ธฐ์— ์šฐ๋ฆฌ๊ฐ€ ๋ณด๊ณ  ๋Š๋‚„ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ๋“ค์€ 1,2,3์ฐจ์› ์ž…๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ๋Š” ๋‹ค๋ฃจ๋Š” ์ฐจ์›์˜ ์ˆ˜๊ฐ€ ์ •๋ง ํฝ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์šฐ๋ฆฌ๋Š” ๋งค์šฐ ๊ฐ„๋‹จํ•œ ํ˜•ํƒœ์ด๋”๋ผ๋„ ์ฐจ์›์˜ ์ˆ˜๊ฐ€ ๋†’์•„์ง€๋ฉด ์ดํ•ดํ•  ์ˆ˜ ์—†์–ด์ง‘๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋งŽ์€ ํŠน์„ฑ์€ triain ์‹œ๊ฐ„์„ ๋Š๋ฆฌ๊ฒŒ ํ•  ๋ฟ๋งŒ์•„๋‹ˆ๋ผ, ์ข‹์€ ์†”๋ฃจ์…˜์„ ์ฐพ๊ธฐ ํž˜๋“ค๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ข…์ข… ์ด๋Ÿฐ ๋ฌธ์ œ๋ฅผ ์ฐจ์›์˜ ์ €์ฃผ(CURSE OF DIMENSIONALITY)๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค. 

ํ•˜์ง€๋งŒ ํŠน์„ฑ ์ˆ˜๋ฅผ ํฌ๊ฒŒ ์ค„์—ฌ ๊ณ ์ฐจ์› ๊ณต๊ฐ„์„ ์šฐ๋ฆฌ๊ฐ€ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ์ €์ฐจ์›์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ธฐ์ˆ ์„ ์—ฐ๊ตฌํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ dimensionality reduction์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด MNIST ์ด๋ฏธ์ง€๋ฅผ ๋ณด๋ฉด, ์ด๋ฏธ์ง€ ๊ฒฝ๊ณ„์— ์žˆ๋Š” ํ”ฝ์…€์€ ํฐ์ƒ‰์ด๋ฏ€๋กœ ์ด๋Ÿฐ ํ”ฝ์…€์„ ์™„์ „ํžˆ ์ œ๊ฑฐํ•ด๋„ ๋งŽ์€ ์ •๋ณด๋ฅผ ์žƒ์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค. 

mnist์ด๋ฏธ์ง€, ์†๊ธ€์”จ ์ˆซ์ž ์ด๋ฏธ์ง€

์ฐจ์›์„ ์ถ•์†Œ์‹œ์ผœ ์ผ๋ถ€ ์ •๋ณด๋ฅผ ์—†์•ค๋‹ค๋ฉด ํ›ˆ๋ จ์†๋„๊ฐ€ ๋นจ๋ผ์ง€๋Š” ๊ฒƒ ์™ธ์— ๋ฐ์ดํ„ฐ ์‹œ๊ฐํ™”์—๋„ ์•„์ฃผ ์œ ์šฉํ•˜๊ฒŒ ์“ฐ์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฐจ์› ์ˆ˜๋ฅผ ๋‘˜์ด๋‚˜ ์…‹์œผ๋กœ ์ค„์ด๋ฉด ํ•˜๋‚˜์˜ ์••์ถ•๋œ ๊ทธ๋ž˜ํ”„๋กœ ๊ทธ๋ฆด ์ˆ˜ ์žˆ๊ณ  ๊ตฐ์ง‘๊ฐ™์€ ์‹œ๊ฐ์ ์ธ ํŒจํ„ด์„ ๊ฐ์ง€ํ•ด ์ธ์‚ฌ์ดํŠธ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 

์ฐจ์› ์ถ•์†Œ ๊ธฐ๋ฒ•์€ ํˆฌ์˜projection ๊ณผ ๋งค๋‹ˆํด๋“œ ํ•™์Šตmanifold learning, ๊ฐ€์žฅ ์ธ๊ธฐ์žˆ๋Š” PCA๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. 

 

PCA(์ฃผ์„ฑ๋ถ„ ๋ถ„์„)

 

๊ฐ€์žฅ ์ธ๊ธฐ ์žˆ๋Š” ์ฐจ์› ์ถ•์†Œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. 

์‚ฌ์ดํ‚ท๋Ÿฐ์˜ PCA - SVD ๋ถ„ํ•ด ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. 

 

MNIST๋Š” ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ์…‹์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ 28 x 28 pixel ์˜ ์ˆซ์ž ์ด๋ฏธ์ง€๋กœ ๊ตฌ์„ฑ๋˜์–ด์žˆ์Šต๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถˆ๋Ÿฌ์™€์„œ ํ™•์ธํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

๋จผ์ € ๋ถˆ๋Ÿฌ์˜จ ๋ฐ์ดํ„ฐ๋ฅผ ์ด๋ฏธ์ง€์™€ ๊ทธ ์ˆซ์ž๊ฐ€ ๋ฌด์—‡์ธ์ง€๋ฅผ ์•Œ๋ ค์ฃผ๋Š” label๋กœ ๋‚˜๋ˆ ์ฃผ๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ฐ๊ฐ์˜ ์ด๋ฏธ์ง€๋Š” 28 x 28 pixel๋“ค์„ ๊ฐ€์ง€๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์šฐ๋ฆฌ๋Š” 28×28=784 ์ฐจ์›์˜ ๋ฒกํ„ฐ๋ฅผ ๊ฐ€์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ 784์ฐจ์›์˜ ๊ณต๊ฐ„์—์„œ ์šฐ๋ฆฌ์˜ MNIST ๊ฐ€ ์ฐจ์ง€ํ•˜๋Š” ๊ณต๊ฐ„์€ ๋งค์šฐ ์ž‘์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค. 784 ์ฐจ์›์—๋Š” ๋งค์šฐ๋งค์šฐ ๋งŽ์€ ๋ฒกํ„ฐ๋“ค์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml


# download the MNIST dataset (70,000 samples, 784 features) from OpenML
mnist = fetch_openml('mnist_784', cache=False)

X = mnist.data.astype('float32').to_numpy()
y = mnist.target.astype('int64').to_numpy()

PCA ๋Š” Principal Components Analysis ์˜ ์•ฝ์ž๋กœ, ๋ฐ์ดํ„ฐ๊ฐ€ ๊ฐ€์žฅ ํฉ์–ด์ ธ์žˆ๋Š” ์ถ•์„ ์ฐพ์•„์„œ ๊ทธ๊ณณ์œผ๋กœ ์‚ฌ์˜ํ•ด์„œ ์›ํ•˜๋Š” ์ฐจ์› ๊ฐœ์ˆ˜๋งŒํผ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๊ฐ€ ๊ฐ€์žฅ ํฉ์–ด์ ธ์žˆ๋Š” ์ถ•์ด๋ผ๋Š” ๋ง์€ ๊ฐ€์žฅ variance ๊ฐ€ ์ปค์ง€๊ฒŒ ํ•˜๋Š” ์ถ•์ด๋ผ๋Š” ๋ง๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

PCA๋ฅผ scikit-learn ํŒจํ‚ค์ง€๋ฅผ ํ™œ์šฉํ•ด์„œ ๋‚˜ํƒ€๋‚ด๋ฉด์„œ ์ดํ•ดํ•ด๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. 42000๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋Š” ๊ฐœ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ๋งŽ๊ธฐ ๋•Œ๋ฌธ์— ๊ฐœ์ˆ˜๋ฅผ ์ข€ ์ค„์—ฌ์„œ 15000๊ฐœ๋ฅผ ๊ฐ€์ง€๊ณ  ์ง„ํ–‰ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

labels = y[:15000]
data = X[:15000]

print("the shape of sample data = ", data.shape)

๊ทธ๋ฆฌ๊ณ  feature์˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งค์šฐ ๋งŽ์ด ๋•Œ๋ฌธ์— ์ •๊ทœํ™”๋ฅผ ์‹œ์ผœ์ค๋‹ˆ๋‹ค.  Sklearn ํŒจํ‚ค์ง€ ์•ˆ์˜ StandardScaler ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด์„œ  z-score ์ •๊ทœํ™”๋ฅผ ์‹œ์ผœ์ฃผ๊ฒ ์Šต๋‹ˆ๋‹ค

from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
print(standardized_data.shape)

sample_data = standardized_data

2์ฐจ์›์œผ๋กœ ์ถ•์†Œ๋ฅผ ํ•  ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— number of components๋ฅผ 2๋กœ ํ•ด์ฃผ๊ฒ ์Šต๋‹ˆ๋‹ค.

from sklearn import decomposition

# configuring the parameters: the number of components = 2
pca = decomposition.PCA(n_components=2)
pca_data = pca.fit_transform(sample_data)

# pca_data contains the 2-d projection of sample_data
print("the shape of pca_data = ", pca_data.shape)

์›๋ž˜ ์šฐ๋ฆฌ๊ฐ€ ๊ฐ€์ง€๊ณ  ์žˆ๋˜ ๋ฐ์ดํ„ฐ๋Š” 784์ฐจ์›์ด์—ˆ๋Š”๋ฐ PCA๋ฅผ ํ†ตํ•ด์„œ 2๋กœ ์ค„์–ด๋“  ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด์ œ ์ด๊ฒƒ์„ ์‹œ๊ฐํ™”ํ•ด์„œ ๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ๋ผ๋ฒจ๋งˆ๋‹ค ์ƒ‰์„ ๋ถ€์—ฌํ•ด์„œ ์‹œ๊ฐํ™”ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

# attaching the label to each 2-d data point
pca_data = np.vstack((pca_data.T, labels)).T

import seaborn as sn

# creating a new data frame which helps us plot the result
pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal", "label"))
pca_df["label"] = pca_df["label"].astype(int)  # vstack cast the labels to float
sn.FacetGrid(pca_df, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()

์ด๋ ‡๊ฒŒ ์šฐ๋ฆฌ์˜ MNIST ๋ฐ์ดํ„ฐ์…‹์„ 2D๋กœ ์ฐจ์›์„ ์ถ•์†Œํ•ด์„œ ์‹œ๊ฐํ™”๋ฅผ ํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค. ๋น„์Šทํ•œ ๋ผ๋ฒจ์˜ ์ด๋ฏธ์ง€๋“ค๋ผ๋ฆฌ ๋ชจ์—ฌ์žˆ๋Š” ๊ฒƒ์„ ๋ณด์•„ ์ž˜ ์ถ•์†Œ๋œ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๋ฐ˜์‘ํ˜•