Machine Learning

[Machine Learning 1] Linear Regression, gradient descent in Python

주영 🐱 2022. 11. 28. 14:21

Given an input, the model should produce an output (a predicted value) based on past data.

The variable we want to predict = target variable

If the target variable is a real number = regression problem

If the target variable is a categorical variable = classification (a representative method: logistic regression)

Both of these are supervised learning.

Unsupervised learning includes clustering (e.g. k-means).

 

 

Linear Regression

- A method that models the linear relationship between a dependent variable y and one or more independent variables X (that is, it finds a first-degree line)

- The process of finding the best-fit line in order to derive the relationship between the independent and dependent variables

Independent variable = the input value or cause (input)

Dependent variable = the variable affected by the independent variable (output)

<simple linear regression> (when there is one independent variable x)

 

데이터λ₯Ό κ°€μž₯ 잘 μ„€λͺ…ν•˜λŠ” 직선은 μ˜ˆμΈ‘ν•œ 값이 μ‹€μ œ λ°μ΄ν„°μ˜ κ°’κ³Ό κ°€μž₯ λΉ„μŠ·ν•΄μ•Ό ν•©λ‹ˆλ‹€.

우리의 λͺ¨λΈμ΄ μ˜ˆμΈ‘ν•œ 값은 μœ„μ—μ„œ μ•Œ 수 μžˆλ“― π‘“(π‘₯𝑖)μž…λ‹ˆλ‹€. 그리고 μ‹€μ œ λ°μ΄ν„°λŠ” π‘¦ μž…λ‹ˆλ‹€.

μ‹€μ œ 데이터(μœ„ κ·Έλ¦Όμ—μ„œ λΉ¨κ°„ 점) κ³Ό 직선 μ‚¬μ΄μ˜ 차이λ₯Ό μ€„μ΄λŠ” 것이 우리의 λͺ©μ μž…λ‹ˆλ‹€.

그것을 λ°”νƒ•μœΌλ‘œ cost function을 λ‹€μŒκ³Ό 같이 μ •μ˜ν•΄λ³΄κ² μŠ΅λ‹ˆλ‹€.

(n: number of samples, i: index of the i-th data point)

$$J(w, b) = \frac{1}{n}\sum_{i=1}^{n}\left(w x_i + b - y_i\right)^2$$
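As a minimal sketch of this cost, the snippet below averages the squared error `(w*x + b - y)**2` over a tiny made-up dataset. The function name `cost`, the arrays `x` and `y`, and the choice to average over the n samples are assumptions for illustration, not taken from the original post.

```python
import numpy

def cost(w, b, x, y):
    """Mean of the squared differences between predictions w*x + b and targets y."""
    return numpy.mean((w * x + b - y) ** 2)

# Hypothetical data lying exactly on the line y = 2x + 1
x = numpy.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

perfect = cost(2.0, 1.0, x, y)  # the true line, so every residual is zero
worse = cost(1.0, 1.0, x, y)    # wrong slope, so the residuals are -x
print(perfect, worse)
```

The line that generated the data gives a cost of zero, and any other choice of w and b gives a larger value, which is what the minimization exploits.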

 

 

μš°λ¦¬λŠ” cost function을 μ΅œμ†Œλ‘œ ν•˜λŠ” π‘€μ™€ π‘λ₯Ό μ°Ύμ•„μ•Ό ν•©λ‹ˆλ‹€.

μ΄μ°¨ν•¨μˆ˜μ΄λ―€λ‘œ μ΄μ°¨ν•¨μˆ˜μ˜ μ΅œμ†Ÿκ°’μ„ κ΅¬ν•˜λŠ” 방법은

1. λ―ΈλΆ„ν•œ 값이 0이 λ˜λŠ” 지점찾기

[3/2​]

2. gradient descent 
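For the first approach, a short worked example may help. The quadratic below is an assumption (the function used in the original post is not shown), chosen so that its minimizer agrees with the [3/2] result listed above:

```latex
f(w) = w^2 - 3w + 5, \qquad
f'(w) = 2w - 3 = 0 \;\Longrightarrow\; w = \tfrac{3}{2}
```

Because the coefficient of w^2 is positive, this critical point is the minimum.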

 

gradient descent 

Instead of reaching the answer in one step, this method approaches it iteratively.

1. Build a function that returns the slope (derivative)

# w is a sympy symbol and fprime the symbolic derivative of the function to minimize
fpnum = sympy.lambdify(w, fprime)

2. Choose an initial value for w, then step repeatedly toward the minimum

w = 10.0 # starting guess for the min

for i in range(1000):
    w = w - fpnum(w)*0.01 # with 0.01 the step size

print(w)

The result matches the value obtained by setting the derivative to zero.
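The fragments above rely on a symbol `w` and a derivative `fprime` that are defined elsewhere. A self-contained sketch might look like the following, assuming a hypothetical function f(w) = w**2 - 3*w + 5 whose minimum is at w = 3/2; the names `f`, `w_val` and `step` are illustrative and not from the original post.

```python
import sympy

# Symbolic variable and a hypothetical quadratic to minimize (an assumption)
w = sympy.Symbol('w', real=True)
f = w**2 - 3*w + 5

# Symbolic derivative, then a plain numeric function built from it
fprime = f.diff(w)
fpnum = sympy.lambdify(w, fprime)

w_val = 10.0  # starting guess for the minimum
step = 0.01   # step size (learning rate)

for i in range(1000):
    # Move against the slope; the update shrinks as the slope approaches zero
    w_val = w_val - step * fpnum(w_val)

print(w_val)
```

With this step size each iteration shrinks the distance to the minimum by a constant factor, so after 1000 iterations the value is very close to 1.5.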

 

Applying it in practice:

Using linear regression to analyze how the Earth's temperature changes over time

We will use an indicator called the global temperature anomaly for the analysis.

Here, a temperature anomaly is the difference from a chosen reference temperature. For example, a large positive temperature anomaly means the temperature was warmer than usual, and a strongly negative anomaly means it was colder than usual.
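To make the definition concrete, here is a small sketch that computes anomalies as the difference from a reference value. The temperatures and the baseline are made-up numbers for illustration only, not NOAA data.

```python
import numpy

# Hypothetical yearly mean temperatures in °C (made-up values)
temps = numpy.array([13.8, 14.0, 14.1, 14.5])
baseline = 14.0  # assumed reference temperature

# Anomaly = observed temperature minus the reference temperature
anomaly = temps - baseline

# Positive anomalies mark years that were warmer than the reference
warmer = anomaly > 0
print(anomaly, warmer)
```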

세계 μ—¬λŸ¬ μ§€μ—­μ˜ μ˜¨λ„κ°€ 각각 λ‹€ λ‹€λ₯΄κΈ° λ•Œλ¬Έμ— global temperature anomalyλ₯Ό μ‚¬μš©ν•΄μ„œ 뢄석을 ν•˜λ„λ‘ ν•˜κ² μŠ΅λ‹ˆλ‹€. μžμ„Έν•œ λ‚΄μš©μ€ μ•„λž˜ λ§ν¬μ—μ„œ ν™•μΈν•˜μ‹€ 수 μžˆμŠ΅λ‹ˆλ‹€.

https://www.ncdc.noaa.gov/monitoring-references/faq/anomalies.php

 


 

Step 1 : Read a data file

We will get the data from the NOAA (National Oceanic and Atmospheric Administration) website.

The commands below download the data, and we then load it with the numpy package.

from urllib.request import urlretrieve
import numpy

URL = 'http://go.gwu.edu/engcomp1data5?accessType=DOWNLOAD'
fname = 'land_global_temperature_anomaly-1880-2016.csv'
# Save the download under fname so the same path is used for loading
urlretrieve(URL, fname)

# skiprows=5 skips the first five lines; unpack=True returns one array per column
year, temp_anomaly = numpy.loadtxt(fname, delimiter=',', skiprows=5, unpack=True)
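To see what `skiprows` and `unpack` do without downloading anything, the sketch below runs the same `numpy.loadtxt` call on an in-memory stand-in. The header lines and the three data rows are placeholders invented for this example; they are not the contents of the NOAA file.

```python
import io
import numpy

# Hypothetical file contents: five non-data lines, then "year,value" rows
sample = io.StringIO(
    "Title line\n"
    "Units line\n"
    "Base period line\n"
    "Missing value line\n"
    "Year,Value\n"
    "1880,-0.47\n"
    "1881,-0.45\n"
    "1882,-0.41\n"
)

# skiprows=5 drops the five non-data lines; unpack=True splits the columns
year_demo, anomaly_demo = numpy.loadtxt(sample, delimiter=',', skiprows=5, unpack=True)
print(year_demo, anomaly_demo)
```

Each returned array holds one column, which is why the call can be unpacked directly into two variables.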

Step 2 : Plot the data

We will use pyplot from the Matplotlib package to draw a 2D plot.

from matplotlib import pyplot
%matplotlib inline

 

pyplot.rc('font', family='serif', size=18)

# Set the size of the figure
pyplot.figure(figsize=(10,5))

# Plot the anomaly time series
pyplot.plot(year, temp_anomaly, color='#2929a3', linestyle='-', linewidth=1) 
pyplot.title('Land global temperature anomalies. \n')
pyplot.xlabel('Year')
pyplot.ylabel('Land temperature anomaly [°C]')
pyplot.grid();

Step 3 : Analytically

To perform linear regression, we first define the line:

$$f(x_i) = w x_i + b$$

Next, the cost function to minimize is:

$$J(w, b) = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - y_i\right)^2$$

Now we differentiate the cost function with respect to each variable we want to find, and look for the values that make the derivatives 0.

이제 μ½”λ“œλ₯Ό ν†΅ν•΄μ„œ μ μš©ν•΄λ³΄λ„λ‘ ν•˜κ² μŠ΅λ‹ˆλ‹€.
w = numpy.sum(temp_anomaly*(year - year.mean())) / numpy.sum(year*(year - year.mean())) 
b = a_0 = temp_anomaly.mean() - w*year.mean()

print(w)
print(b)

#0.01037028394347266

#-20.148685384658464
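One way to sanity-check these formulas is to compare them with `numpy.polyfit`, which fits a least-squares polynomial and, for degree 1, returns the slope and intercept. The sketch below uses synthetic data (an assumption, generated with a fixed seed) rather than the temperature file.

```python
import numpy

# Hypothetical noisy data around the line y = 0.5x - 2 (fixed seed for repeatability)
rng = numpy.random.default_rng(0)
x = numpy.linspace(0.0, 10.0, 50)
y = 0.5 * x - 2.0 + rng.normal(scale=0.1, size=x.size)

# Closed-form least-squares slope and intercept, same formulas as above
w = numpy.sum(y * (x - x.mean())) / numpy.sum(x * (x - x.mean()))
b = y.mean() - w * x.mean()

# numpy.polyfit with degree 1 returns [slope, intercept]
w_ref, b_ref = numpy.polyfit(x, y, 1)
print(w, b, w_ref, b_ref)
```

The two pairs should agree to within floating-point precision, since both solve the same least-squares problem.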

 

이제 κ·Έλž˜ν”„λ‘œ κ·Έλ €μ„œ 확인해보도둝 ν•˜κ² μŠ΅λ‹ˆλ‹€

reg = b + w * year

pyplot.figure(figsize=(10, 5))

pyplot.plot(year, temp_anomaly, color='#2929a3', linestyle='-', linewidth=1, alpha=0.5) 
pyplot.plot(year, reg, 'k--', linewidth=2, label='Linear regression')
pyplot.xlabel('Year')
pyplot.ylabel('Land temperature anomaly [°C]')
pyplot.legend(loc='best', fontsize=15)
pyplot.grid();
