
[๋จธ์‹ ๋Ÿฌ๋‹3] Multiple Linear Regression ๋‹ค์ค‘์„ ํ˜•ํšŒ๊ท€ python


When making predictions in practice, we usually have to consider more than one variable. Multiple linear regression is a prediction model that uses multiple input variables.

์œ„ ์ด๋ฏธ์ง€๋ฅผ ์˜ˆ๋กœ ์„ค๋ช…ํ•˜๋ฉด,

์ง‘๊ฐ€๊ฒฉ(y)๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค๊ณ  ํ•  ๋•Œ, x1(์นจ์‹ค์ˆ˜), x2=์ธต ์ˆ˜, x3=์ง€์–ด์ง„์—ฐ์ˆ˜, x4=ํฌ๊ธฐ 4๊ฐ€์ง€ feature(n=4)๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค .

feature = dimension = attribute; these three terms all refer to the same thing.

Writing x^(i) for the inputs of the i-th training example and x_j^(i) for feature j of that example, we have x^(2) = [3 2 40 127] (as a column vector), and x_3^(2) = 40.

By default, x^(i) is always a column vector; if we want to write it as the row vector [3 2 40 127], we apply a transpose.
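For instance, in numpy (a quick illustration; the variable name x2 is ours):

import numpy

x2 = numpy.array([[3], [2], [40], [127]])   # x^(2) as a 4x1 column vector
print(x2.T)        # transpose -> the row vector [[  3   2  40 127]]
print(x2[2, 0])    # x_3^(2) = 40 (the third feature, 0-based index 2)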

 

์˜ˆ์ธก ๋ชจ๋ธ ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. 

previously = simple linear regression, now= multiple

์„ธํƒ€0,1,2,3์€ ๊ฐ ๋ณ€์ˆ˜์˜ ๊ฐ€์ค‘์น˜์ด๊ณ , x1,2,3๋Š” ๊ฐ feature์ž…๋‹ˆ๋‹ค. 

 

๊ฐ feature์˜ ๊ฐ’ ํฌ๊ธฐ๊ฐ€ ์ œ๊ฐ๊ฐ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ๋“ค์–ด ์ง‘ ํฌ๊ธฐ๋Š”1-2000๊นŒ์ง€ ์ด๊ณ , ์นจ์‹ค ์ˆ˜๋Š” 1-5๋ฒ”์œ„์ž…๋‹ˆ๋‹ค.

์ด๋Ÿฌํ•œ ์ฐจ์ด๋Š” ์Šค์ผ€์ผ๋Ÿฌ(min-max ๋“ฑ)๋ฅผ ํ†ตํ•ด 0~1์‚ฌ์ด์˜ ๊ฐ’์œผ๋กœ ์Šค์ผ€์ผ๋งํ•˜์—ฌ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. 
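As a reminder, min-max scaling maps each value x of a feature to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. A minimal numpy sketch (the helper name minmax_scale_columns is ours, for illustration only):

import numpy

def minmax_scale_columns(X):
    '''Scale each column of X into the [0, 1] range (min-max scaling).'''
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)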

 

We find suitable weights with gradient descent: starting from an initial guess, we repeatedly move the parameter vector a small step (the step size, or learning rate) in the direction opposite to the gradient of the cost function, i.e. the vector of its partial derivatives, until we reach a point where the cost is at a minimum.
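In symbols, each step updates every weight simultaneously as

θ_j ← θ_j - α · ∂J(θ)/∂θ_j

where J(θ) is the cost function and α is the learning rate.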


Let's work through an example.

์ž๋™์ฐจ์˜ ์—ฌ๋Ÿฌ ๊ธฐ์ˆ ์ ์ธ ์‚ฌ์–‘๋“ค์„ ๊ณ ๋ คํ•˜์—ฌ ์—ฐ๋น„๋ฅผ ์˜ˆ์ธกํ•˜๋Š” auto miles per gallon(MPG) dataset์„ ์˜ˆ์‹œ ๋ฐ์ดํ„ฐ ์…‹์œผ๋กœ ์‚ฌ์šฉํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

1.1 Dataset

1. ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

import pandas
import seaborn
seaborn.set()

from urllib.request import urlretrieve

# download the CSV to the working directory
URL = 'https://go.gwu.edu/engcomp6data3'
urlretrieve(URL, 'auto_mpg.csv')

mpg_data = pandas.read_csv('auto_mpg.csv')
mpg_data.head()

Let's look at an overview of the data with mpg_data.info().

 

์ด 392๊ฐœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๊ณ  9๊ฐœ์˜ ์ •๋ณด๋“ค์ด ์žˆ์Šต๋‹ˆ๋‹ค.

์—ฌ๊ธฐ์„œ car name ์€ object๋กœ ์ž๋™์ฐจ์˜ ์ด๋ฆ„์„ ๋‹ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๊ทธ๋ฆฌ๊ณ  origin์€ int๋กœ ์ •์ˆ˜ ํ˜•ํƒœ์ด์ง€๋งŒ ์ด๊ฒƒ์ด ๋งŒ๋“ค์–ด์ง„ ๋„์‹œ๋กœ categorical ํ•œ ๊ฐ’์ž…๋‹ˆ๋‹ค(ex. ์„œ์šธ : 1, ๊ฒฝ๊ธฐ : 2, ... ).

๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ์ด๋ฒˆ์— linear regression์„ ํ•  ๋•Œ๋Š” car name, origin ๊ฐ’์€ ์ œ์™ธํ•˜๊ณ  ์ƒ๊ฐํ•˜๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

y_col = 'mpg'
x_cols = mpg_data.columns.drop(['car name', 'origin', 'mpg'])  # also drop mpg column

print(x_cols)
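As an aside, if we did want to use origin, the usual approach for a categorical feature is one-hot encoding rather than feeding the raw integer codes into a linear model. A quick sketch with pandas.get_dummies (not used in the rest of this post):

# one-hot encode the categorical 'origin' column (illustration only)
origin_dummies = pandas.get_dummies(mpg_data['origin'], prefix='origin')
origin_dummies.head()  # columns such as origin_1, origin_2, origin_3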

1.2 Data exploration

Before running the linear regression, let's look at the one-to-one relationship between each car attribute and fuel efficiency.

Visualizing the data is the most intuitive way to understand it.

 

seaborn.pairplot(data=mpg_data, height=5, aspect=1,
                 x_vars=x_cols, y_vars=y_col);

Acceleration and model_year are positively correlated with mpg, while the remaining attributes are negatively correlated.

These roughly linear relationships suggest that a linear model is a reasonable choice for predicting fuel efficiency.

 

1.3 Linear model in matrix form

from autograd import numpy
from autograd import grad

X = mpg_data[x_cols].values
X = numpy.hstack((numpy.ones((X.shape[0], 1)), X))  # prepend a column of 1s for the intercept term
y = mpg_data[y_col].values

print("X.shape = {}, y.shape = {}".format(X.shape, y.shape))
# X.shape = (392, 7), y.shape = (392,)
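With the column of 1s in place, the whole model is a single matrix-vector product: the vector of predictions is ŷ = Xθ, where row i of X is [1, x1^(i), ..., x6^(i)] and θ = [θ0, θ1, ..., θ6]. This product is exactly what numpy.dot(X, params) computes below.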

We define the cost function from the mean squared error. (Strictly speaking, the code below returns the sum of squared errors; it differs from the mean only by the constant factor 1/m, where m is the number of data points, so it has the same minimizing parameters.)

def linear_regression(params, X):
    '''
    The linear regression model in matrix form.
    Arguments:
      params: 1D array of weights for the linear model
      X     : 2D array of input values
    Returns:
      1D array of predicted values
    '''
    return numpy.dot(X, params)

def cost_function(params, model, X, y):
    '''
    The sum-of-squared-errors loss function (m times the MSE,
    so it has the same minimizing parameters).
    Arguments:
      params: 1D array of weights for the linear model
      model : function for the linear regression model
      X     : 2D array of input values
      y     : 1D array of observed target values
    Returns:
      float, sum of squared errors
    '''
    y_pred = model(params, X)
    return numpy.sum((y - y_pred)**2)
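A quick sanity check that the pieces fit together (a usage sketch only; the name params0 is ours, and the exact cost value depends on the data, so it is not shown here):

params0 = numpy.zeros(X.shape[1])                        # all-zero weights
print(cost_function(params0, linear_regression, X, y))   # cost of predicting 0 for every car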


1.4 Find the weights using gradient descent

Now let's find the weights that minimize the cost function with gradient descent. We obtain the gradient with autograd's grad() function, which returns a new function that differentiates cost_function with respect to its first argument, params.

gradient = grad(cost_function)

๊ธฐ์šธ๊ธฐ๊ฐ’์ด ์ž˜ ๊ตฌํ•ด์ง€๋Š”์ง€ ๋žœ๋คํ•œ ๊ฐ’์„ ํ†ตํ•ด ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

gradient(numpy.random.rand(X.shape[1]), linear_regression, X, y)
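As a cross-check, the gradient of the sum-of-squared-errors cost has a simple closed form, 2·Xᵀ(Xθ - y), so we can compare autograd's output against it (a sketch; the variable names are ours, and the comparison should print True):

params_test = numpy.random.rand(X.shape[1])
auto_g = gradient(params_test, linear_regression, X, y)
manual_g = 2 * X.T @ (X @ params_test - y)   # analytic gradient of the SSE cost
print(numpy.allclose(auto_g, manual_g))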

1.5 Feature scaling

When we ran gradient descent on the raw X, the loss diverged to infinity. This happened because some of the input features take much larger values than others, which makes the gradient steps blow up. We therefore rescale every feature into the [0, 1] range with scikit-learn's MinMaxScaler, and then print the max and min of each column to verify:

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(mpg_data[x_cols])
X_scaled = numpy.hstack((numpy.ones((X_scaled.shape[0], 1)), X_scaled))  # re-attach the intercept column of 1s

pandas.DataFrame(X_scaled).describe().loc[['max', 'min']]

0๋ฒˆ์งธ ํ–‰์€ ์ฒ˜์Œ์— 1์„ ์ถ”๊ฐ€ํ•ด์ค€ ํ–‰์ด๋ฏ€๋กœ 1๋กœ ์œ ์ง€๋˜๋Š”๊ฒŒ ๋งž์Šต๋‹ˆ๋‹ค. ์ด์ œ ๋ณ€ํ™”๋œ ๋ฐ์ดํ„ฐ๋กœ ๋‹ค์‹œ gradient descent๋ฅผ ์ง„ํ–‰ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

max_iter = 1000   # number of gradient descent iterations
alpha = 0.001     # learning rate (step size)
params = numpy.zeros(X_scaled.shape[1])  # start from all-zero weights

for i in range(max_iter):
    descent = gradient(params, linear_regression, X_scaled, y)
    params = params - descent * alpha   # step against the gradient
    loss = cost_function(params, linear_regression, X_scaled, y)
    if i % 100 == 0:
        print("iteration {}, loss = {}".format(i, loss))


1.6 How accurate is the model?

Now let's see how accurate our model is. For regression problems there are two commonly used basic metrics: the mean absolute error (MAE) and the root mean squared error (RMSE). The two formulas are given below.
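For m data points with targets y^(i) and predictions ŷ^(i), the standard definitions are:

MAE  = (1/m) · Σ |y^(i) - ŷ^(i)|
RMSE = √( (1/m) · Σ (y^(i) - ŷ^(i))² )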

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_pred_gd = linear_regression(params, X_scaled)  # predictions from the weights found by gradient descent

mae = mean_absolute_error(y, y_pred_gd)
rmse = mean_squared_error(y, y_pred_gd, squared=False)  # squared=False returns the root of the MSE
print("mae  = {}".format(mae))
print("rmse = {}".format(rmse))

mae  = 2.613991601156043
rmse = 3.40552056741184
๋ฐ˜์‘ํ˜•