
[Data Analysis for Marketing 2] Segmentation Using Clustering

주의 🐱 2022. 12. 2. 16:26

Clustering reveals patterns in data that are not visible on the surface.

The key is finding the number of clusters that segments the data well.

A representative clustering method is k-means clustering.

k-means clustering 

- group similar data points

- iterative approach

- Starting point: randomly selected cluster centers. Variables: whatever you're interested in (location, demographics, ...)

----> re-evaluate how good your random choice was and improve it!

 

 

Process

1. Define k (apply the elbow criterion using the elbow method).

2. Pick k centroids at random (if k=4, four data points are chosen at random and assigned as the cluster centers).

3. Compute the Euclidean distance (the shortest distance between two points) between each point and each centroid.

4. Set the baseline clusters.

5. Find which centroid each data point is most similar to, then assign each data point to that centroid.

6. Recompute the cluster centroids. That is, recompute each centroid from the clusters as reassigned in step 5.

7. Repeat steps 5 and 6 until no data point changes cluster.

8. Validate the result (evaluate how well the cluster analysis worked).
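The steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the course's implementation: the function names (`euclidean`, `mean_point`, `kmeans`, `wcss`), the `seed` and `max_iter` parameters, and the sample data are all hypothetical. The last loop prints the within-cluster sum of squares for several values of k, which is the quantity usually inspected for an elbow.

```python
import math
import random


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 2: pick k data points at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: measure distances, assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 7: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 6: recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return centroids, labels


def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances (smaller means tighter clusters)."""
    return sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical 2-D customer data (for example, scaled spend vs. visit frequency).
data = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0),
        (5.0, 5.2), (5.1, 4.9), (4.8, 5.0),
        (9.0, 1.0), (9.2, 1.1), (8.9, 0.8)]

# Step 1 (elbow method): compute WCSS for several k and look for the bend.
for k in range(1, 6):
    cents, labs = kmeans(data, k)
    print(k, round(wcss(data, cents, labs), 3))
```

Because the starting centroids are random, different seeds can settle on different clusterings, so in practice it is common to run several seeds and keep the result with the lowest WCSS.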

Two evaluation approaches are covered here.

1: Variance. When the variance is low (tight), the data points sit close together.

               1-1: Population variance formula (when all of the data has been collected)

               1-2: Sample variance (produces an estimate)

2: Dunn Index. It shows both how tight each cluster is and how far apart the clusters are.
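Both checks can be tried out with the standard library. The sketch below is an assumption-laden illustration: `dunn_index` is a hypothetical helper using one common definition (smallest distance between points in different clusters, divided by the largest distance between points in the same cluster), and the sample values and clusters are made up. Other variants of the index measure separation and tightness differently.

```python
import math
import statistics
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dunn_index(clusters):
    """Smallest between-cluster distance divided by largest within-cluster diameter.

    Assumes at least two clusters and at least one cluster with two or more points.
    """
    # Separation: closest pair of points that belong to different clusters.
    min_between = min(
        euclidean(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Tightness: widest pair of points inside any single cluster.
    max_within = max(
        euclidean(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_between / max_within


# Hypothetical one-dimensional values taken from a single cluster.
values = [4.8, 5.0, 5.1, 5.3]
print(statistics.pvariance(values))  # population variance: divides by n
print(statistics.variance(values))   # sample variance: divides by n - 1

# Hypothetical clusterings: compact and far apart vs. spread out and close together.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (0, 4)], [(5, 0), (5, 4)]]
print(dunn_index(tight), dunn_index(loose))
```

A higher Dunn Index indicates compact, well-separated clusters, so the `tight` clustering scores higher than the `loose` one.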

 

 

 

Reference:

마케팅을 위한 데이터 분석 방법론 (Data Analysis Methodology for Marketing)

https://ko.wikipedia.org/wiki/K-%ED%8F%89%EA%B7%A0_%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98

 
