Machine Learning

[๋จธ์‹ ๋Ÿฌ๋‹1] ์„ ํ˜•ํšŒ๊ท€ Linear Regression , gradient descent pyhton

์ฃผ์˜ ๐Ÿฑ 2022. 11. 28. 14:21

๊ณผ๊ฑฐ ๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜์œผ๋กœ, ์ž…๋ ฅ ์ฃผ์–ด์กŒ์„ ๋•Œ ์ถœ๋ ฅ(์˜ˆ์ธก๊ฐ’)์ด ๋‚˜์™€์•ผ ํ•œ๋‹ค. 

 

The variable we want to predict = the target variable

 

If the target variable is a real number = a regression problem

If the target variable is categorical = classification (a representative method: logistic regression)

์ด ๋‘˜์€ supervised learning(์ง€๋„ ํ•™์Šต)์ด๋‹ค. 

 

Unsupervised learning includes methods such as clustering (e.g., k-means).

 

 

Linear Regression

 

- ์ข…์† ๋ณ€์ˆ˜ ๐‘ฆ์™€ ํ•œ๊ฐœ ์ด์ƒ์˜ ๋…๋ฆฝ ๋ณ€์ˆ˜ ๐‘‹์™€์˜ ์„ ํ˜• ๊ด€๊ณ„๋ฅผ ๋ชจ๋ธ๋ง(=1์ฐจ๋กœ ์ด๋ฃจ์–ด์ง„ ์ง์„ ์„ ๊ตฌํ•œ๋‹ค)ํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก 

- ์ตœ์ ์˜ ์ง์„ ์„ ์ฐพ์•„ ๋…๋ฆฝ ๋ณ€์ˆ˜์™€ ์ข…์† ๋ณ€์ˆ˜ ์‚ฌ์ด์˜ ๊ด€๊ณ„๋ฅผ ๋„์ถœํ•˜๋Š” ๊ณผ์ •

 

Independent variable = the input or cause (input)

์ข…์† ๋ณ€์ˆ˜ = ๋…๋ฆฝ ๋ณ€์ˆ˜์— ์˜ํ•ด ์˜ํ–ฅ์„ ๋ฐ›๋Š” ๋ณ€์ˆ˜(output)

 

<Simple linear regression> (when there is a single independent variable x)
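Concretely, the model is a straight line with slope 𝑤 and intercept 𝑏:

𝑓(𝑥ᵢ) = 𝑤·𝑥ᵢ + 𝑏

and the goal is to choose 𝑤 and 𝑏 so that 𝑓(𝑥ᵢ) is as close as possible to the observed value 𝑦ᵢ.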

 

๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์žฅ ์ž˜ ์„ค๋ช…ํ•˜๋Š” ์ง์„ ์€ ์˜ˆ์ธกํ•œ ๊ฐ’์ด ์‹ค์ œ ๋ฐ์ดํ„ฐ์˜ ๊ฐ’๊ณผ ๊ฐ€์žฅ ๋น„์Šทํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์šฐ๋ฆฌ์˜ ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ ๊ฐ’์€ ์œ„์—์„œ ์•Œ ์ˆ˜ ์žˆ๋“ฏ ๐‘“(๐‘ฅ๐‘–)์ž…๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์‹ค์ œ ๋ฐ์ดํ„ฐ๋Š” ๐‘ฆ ์ž…๋‹ˆ๋‹ค.

์‹ค์ œ ๋ฐ์ดํ„ฐ(์œ„ ๊ทธ๋ฆผ์—์„œ ๋นจ๊ฐ„ ์ ) ๊ณผ ์ง์„  ์‚ฌ์ด์˜ ์ฐจ์ด๋ฅผ ์ค„์ด๋Š” ๊ฒƒ์ด ์šฐ๋ฆฌ์˜ ๋ชฉ์ ์ž…๋‹ˆ๋‹ค.

๊ทธ๊ฒƒ์„ ๋ฐ”ํƒ•์œผ๋กœ cost function์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

(n: number of samples, i: the i-th data point)

cost(𝑤, 𝑏) = (1/n) · Σᵢ (𝑤·𝑥ᵢ + 𝑏 − 𝑦ᵢ)²
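For reference, the same cost can be written as a small Python function; here x and y are assumed to be numpy arrays of inputs and targets defined elsewhere.

import numpy

def cost(w, b, x, y):
    # Mean squared error of the line w*x + b against the targets y
    return numpy.mean((w * x + b - y) ** 2)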

 

 

We need to find the 𝑤 and 𝑏 that minimize the cost function.

์ด์ฐจํ•จ์ˆ˜์ด๋ฏ€๋กœ ์ด์ฐจํ•จ์ˆ˜์˜ ์ตœ์†Ÿ๊ฐ’์„ ๊ตฌํ•˜๋Š” ๋ฐฉ๋ฒ•์€

1. Find the point where the derivative equals 0


2. gradient descent 

 

gradient descent 

A method that approaches the answer iteratively, rather than reaching it in a single step.

1. Build a function that computes the gradient
 
import sympy
fpnum = sympy.lambdify(w, fprime)  # turn the symbolic derivative fprime into a numeric function of the symbol w

2. ์ฒ˜์Œ ๐‘ค ๊ฐ’์„ ์„ค์ •ํ•œ ๋’ค, ๋ฐ˜๋ณต์ ์œผ๋กœ ์ตœ์†Ÿ๊ฐ’์„ ํ–ฅํ•ด์„œ ์ ‘๊ทผ

w = 10.0                      # starting guess for the minimum

for i in range(1000):
    w = w - fpnum(w) * 0.01   # 0.01 is the step size (learning rate)

print(w)

The result matches the value obtained analytically by setting the derivative to zero.
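Putting the two steps together, here is a minimal runnable sketch. The quadratic cost (w − 3)² + 2 is only an assumed example for illustration; any differentiable single-variable function works the same way.

import sympy

# Illustrative cost in a single parameter w (assumed example, minimum at w = 3)
w_sym = sympy.Symbol('w', real=True)
cost = (w_sym - 3)**2 + 2

fprime = cost.diff(w_sym)                # symbolic derivative d(cost)/dw
fpnum = sympy.lambdify(w_sym, fprime)    # numeric gradient function

w = 10.0                                 # starting guess for the minimum
for i in range(1000):
    w = w - fpnum(w) * 0.01              # gradient descent step, step size 0.01

print(w)                                 # converges to about 3, matching the analytic minimum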

 

Applying it to real data:

Analyzing how the Earth's temperature has changed over time using linear regression

We will run the analysis on a metric called the global temperature anomaly.

Here, a temperature anomaly is the difference between a measured temperature and a fixed baseline temperature. For example, a large positive temperature anomaly means the temperature was warmer than usual, and a negative value means it was colder than usual.

Since temperatures differ from region to region around the world, we will use the global temperature anomaly for the analysis. You can find more details at the link below.

https://www.ncdc.noaa.gov/monitoring-references/faq/anomalies.php

 


 

Step 1 : Read a data file

We will get the data from the NOAA (National Oceanic and Atmospheric Administration) website.

We download the data with the code below and load it using the numpy package.

from urllib.request import urlretrieve
import numpy

# Download the NOAA land global temperature anomaly data (1880-2016)
URL = 'http://go.gwu.edu/engcomp1data5?accessType=DOWNLOAD'
fname = 'land_global_temperature_anomaly-1880-2016.csv'
urlretrieve(URL, fname)

# Skip the 5 header lines and unpack the two columns into separate arrays
year, temp_anomaly = numpy.loadtxt(fname, delimiter=',', skiprows=5, unpack=True)
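A quick way to sanity-check the load is to look at the shapes and the first few values of the two arrays:

print(year.shape, temp_anomaly.shape)   # two 1-D arrays of the same length
print(year[:3], temp_anomaly[:3])       # first few years and anomaly values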

Step 2 : Plot the data

We will draw a 2D plot using pyplot from the Matplotlib package.

from matplotlib import pyplot
%matplotlib inline

 

pyplot.rc('font', family='serif', size='18')

#You can set the size of the figure by doing:
pyplot.figure(figsize=(10,5))

#Plotting
pyplot.plot(year, temp_anomaly, color='#2929a3', linestyle='-', linewidth=1) 
pyplot.title('Land global temperature anomalies. \n')
pyplot.xlabel('Year')
pyplot.ylabel('Land temperature anomaly [°C]')
pyplot.grid();

Step 3 : Analytic solution

Linear regression์„ ํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋จผ์ € ์ง์„ ์„ ์ •์˜ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

๊ทธ ๋‹ค์Œ  ์ตœ์†Œํ™” ํ•ด์•ผ ํ•  cost function์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

์ด์ œ cost function ์„ ๊ตฌํ•˜๊ณ ์ž ํ•˜๋Š” ๋ณ€์ˆ˜๋กœ ๋ฏธ๋ถ„ํ•œ ๋’ค 0์ด ๋˜๋„๋ก ํ•˜๋Š” ๊ฐ’์„ ์ฐพ์œผ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
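Setting the derivatives to zero gives the standard least-squares solution, written out here for reference (it is what the code below computes):

𝑤 = Σᵢ 𝑦ᵢ·(𝑥ᵢ − x̄) / Σᵢ 𝑥ᵢ·(𝑥ᵢ − x̄)
𝑏 = ȳ − 𝑤·x̄

where x̄ and ȳ are the means of x and y. In this example, x is the year and y is the temperature anomaly.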

Now let's apply this in code.
w = numpy.sum(temp_anomaly*(year - year.mean())) / numpy.sum(year*(year - year.mean()))  # least-squares slope
b = temp_anomaly.mean() - w*year.mean()                                                  # least-squares intercept

print(w)   # 0.01037028394347266
print(b)   # -20.148685384658464

 

Now let's plot the result to check it.

reg = b + w * year

pyplot.figure(figsize=(10, 5))

pyplot.plot(year, temp_anomaly, color='#2929a3', linestyle='-', linewidth=1, alpha=0.5) 
pyplot.plot(year, reg, 'k--', linewidth=2, label='Linear regression')
pyplot.xlabel('Year')
pyplot.ylabel('Land temperature anomaly [°C]')
pyplot.legend(loc='best', fontsize=15)
pyplot.grid();
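For comparison, the same line can also be fit with gradient descent. Below is a minimal sketch; standardizing the years and the step size of 0.1 are choices made here so the iteration converges, and the arrays year and temp_anomaly are the ones loaded in Step 1.

# Gradient descent on the mean squared error, using the data from Step 1
x = (year - year.mean()) / year.std()    # standardize years so a fixed step size works
y = temp_anomaly
n = len(x)

w_gd, b_gd = 0.0, 0.0
lr = 0.1                                 # step size (chosen for the standardized inputs)

for _ in range(1000):
    resid = w_gd * x + b_gd - y          # prediction error for the current w, b
    w_gd -= lr * (2 / n) * numpy.sum(resid * x)   # d(cost)/dw
    b_gd -= lr * (2 / n) * numpy.sum(resid)       # d(cost)/db

# Undo the standardization to compare with the analytic w and b above
print(w_gd / year.std(), b_gd - w_gd * year.mean() / year.std())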

๋ฐ˜์‘ํ˜•