
Natural Language Processing (35)

Word Embedding

Word embedding is a way of representing words as vectors.
1. Sparse Representation
The one-hot vectors produced by the one-hot encoding covered earlier set only the index of the word being represented to 1 and every other index to 0. A representation in which most values of a vector or matrix are 0 is called a sparse representation, so a one-hot vector is a sparse vector. The problem with sparse vectors is that the vector dimension grows without bound as the number of words increases. Because only the position matching the word's index is 1 and everything else must be 0, the larger the vocabulary, the higher the dimension of the vector. Such vectors..
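The point about dimensionality can be illustrated with a minimal sketch. The vocabulary below and the helper name `one_hot` are hypothetical and chosen only for illustration; the takeaway is that the vector length always equals the vocabulary size.

```python
def one_hot(word, vocab):
    """Return a one-hot list for `word`; its length equals the vocabulary size."""
    index = vocab[word]
    vector = [0] * len(vocab)
    vector[index] = 1
    return vector

# Hypothetical toy vocabulary mapping each word to an index.
vocab = {"I": 0, "am": 1, "a": 2, "student": 3}
vec = one_hot("student", vocab)
print(vec)       # [0, 0, 0, 1]
print(len(vec))  # 4, the same as len(vocab)
```

With a vocabulary of 100,000 words, each vector would hold 100,000 entries, of which 99,999 are 0.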

[Natural Language Processing] PyTorch LSTM Implementation

This time we will actually implement an LSTM.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    torch.manual_seed(1)
An LSTM can be created with `torch.nn` as follows.
* `input_size` : The number of expected features in the input x
* `hidden_size` : The number of features in the hidden state h
    lstm = nn.LSTM(input_size, hidden_size)  # with input_size: 3, hidden_size: 3, L..

[Natural Language Processing] LSTM and GRU, Which Compensate for the Weaknesses of RNNs

https://getacherryontop.tistory.com/125
RNN: Token order is very important in natural language processing. 나는 집에 간다. ("I go home.") (o) 나는 간다 집에 (the same words out of order) (x) RNN-type models were developed to take this token order into account. The meaning and structure of an RNN: the same weight
An RNN works well on sequential input data, but its memory is poor: the longer the time steps run, the more of the earlier content it forgets. The LSTM architecture was proposed to compensate for this.
LSTM (Long short-term memory)
- Uses a separate cell state variable so that the model can remember. Several gates handle remembering, forgetting, and what to output..
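As a rough illustration of the cell state and gates mentioned above, here is a sketch of one step of the standard LSTM update with scalar values. The weights in `params`, the gate names, and the function `lstm_step` are assumptions made for this example, not code from the original post; real implementations use weight matrices and learned parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. `p` maps a gate name to hypothetical (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # candidate value to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # updated cell state (the "memory")
    h = o * math.tanh(c)              # updated hidden state (the output)
    return h, c

# Hypothetical fixed weights, identical for every gate.
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

The cell state `c` is carried forward additively, scaled by the forget gate, which is what lets information persist across many time steps.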

[Natural Language Processing] Tokenization and Types of Tokenizers

In natural language processing, a token is a user-defined unit. Splitting a raw string into tokens is called tokenization, and these tokens are then used to build the vocab dictionary. In other words, tokenization converts the given input data into units that an NLP model can recognize. For example, the sentence I am student is split by tokenization into I, am, student. Once we have I am student = [I, am, student], we convert these words into their index values in the vocab dictionary. If I is at position 10, am at position 12, and student at position 100, then [I, am, student] = [10, 12, 100], and this index vector is fed to the model as input. Tokenizer: token..
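The two steps described above (split into tokens, then look up vocab indices) can be sketched as follows. The whitespace tokenizer, the `<unk>` entry, and the function names are assumptions for illustration; the indices 10, 12, and 100 are the ones used in the text.

```python
def tokenize(text):
    """Whitespace tokenization, the simplest possible tokenizer."""
    return text.split()

def encode(tokens, vocab, unk_id=0):
    """Map each token to its vocab index; unknown tokens fall back to unk_id."""
    return [vocab.get(tok, unk_id) for tok in tokens]

# Hypothetical vocab using the positions from the example in the text.
vocab = {"<unk>": 0, "I": 10, "am": 12, "student": 100}

tokens = tokenize("I am student")
ids = encode(tokens, vocab)
print(tokens)  # ['I', 'am', 'student']
print(ids)     # [10, 12, 100]
```

A word that is not in the vocab, such as teacher here, maps to the `<unk>` index instead of raising an error.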

[Natural Language Processing] Spelling Correction in Preprocessing: Py-Hanspell Example

    !pip install git+https://github.com/ssut/py-hanspell.git
    from hanspell import spell_checker
    sent = "어제는 않잤어. 감기 빨리 낳아! "
    spelled_sent = spell_checker.check(sent)
    hanspell_sent = spelled_sent.checked
    print(hanspell_sent)
Output:
    어제는 안 잤어. 감기 빨리 나아!

[Natural Language Processing] Spacing Correction in Preprocessing: PyKoSpacing Example

!pip install git+https://github.com/haven-jeon/PyKoSpacing.git sentence = '์ฝ”๋”ฉ๊ณผ AI ๊ฐœ๋ฐœ์ด ๋‘˜๋‹ค ๊ฐ€๋Šฅํ•œ ์‚ฌ๋žŒ์€ ๋งŽ์ง€ ์•Š๋‹ค.' new = sentence.replace(" ", '') # ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์—†๋Š” ๋ฌธ์žฅ ์ž„์˜๋กœ ๋งŒ๋“ค๊ธฐ print(new) from pykospacing import Spacing spacing = Spacing() kospacing_sen = spacing(new) print('๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์—†๋Š” ๋ฌธ์žฅ :\n', new) print('์ •๋‹ต ๋ฌธ์žฅ:\n', sentence) print('๋„์–ด์“ฐ๊ธฐ ๊ต์ • ํ›„:\n', kospacing_sen) ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์—†๋Š” ๋ฌธ์žฅ : ์ฝ”๋”ฉ๊ณผAI๊ฐœ๋ฐœ์ด๋‘˜๋‹ค๊ฐ€๋Šฅํ•œ์‚ฌ๋žŒ์€๋งŽ์ง€์•Š๋‹ค. ์ •๋‹ต ๋ฌธ์žฅ: ์ฝ”๋”ฉ๊ณผ AI ๊ฐœ๋ฐœ..

[Natural Language Processing] Korean Tokenization and POS Tagging with KoNLPy (Hannanum, Kkma) and Khaiii

Installation
    !pip install konlpy
Hannanum
    from konlpy.tag import Hannanum
    hannanum = Hannanum()
    text = '안녕하세요! 오늘 많이 추워요'
    print(hannanum.morphs(text))  # Parse phrase to morphemes
    print(hannanum.nouns(text))   # Noun extractors
    print(hannanum.pos(text))     # POS tagger
Output:
    ['안녕', '하', '세', '요', '!', '오늘', '많', '이', '춥', '어요']
    ['안녕', '오늘']
    [('안녕', 'N'), ('하', 'X'), ('세', 'E'), ('요', 'J'), ('!', 'S'), ('오늘', 'N'..

[Natural Language Processing] Korean Preprocessing with re

    import re
    re.sub('[0-9]+', 'num', '1 2 3 4 hello')  # find only the digits and replace them with num
Simple string substitution is possible with re.sub('pattern', 'replacement', 'string', count), but with re.compile() a pattern is compiled once and can then be reused for repeated operations.
    # . matches any single character
    r = re.compile("a.c")
    r.search("abc")
    # ? : the preceding character may or may not be present
    r = re.compile("a?c")
    # * : zero or more of the preceding character
    r = re.compile("ab*c")  # b is absent or appears several times
    # + : the preceding character must appear at least once
    r = ..
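The quantifiers above can be checked with a short, self-contained sketch using only the standard `re` module. The test strings and the patterns `ab?c` and `ab+c` are chosen here for illustration and are not from the original post.

```python
import re

# count=2 limits how many matches re.sub replaces.
replaced = re.sub(r"[0-9]+", "num", "1 2 3 4 hello", count=2)
print(replaced)  # num num 3 4 hello

# Compile each pattern once, then reuse it on several strings.
dot = re.compile(r"a.c")        # exactly one arbitrary character between a and c
optional = re.compile(r"ab?c")  # zero or one b
star = re.compile(r"ab*c")      # zero or more b
plus = re.compile(r"ab+c")      # one or more b

samples = ["ac", "abc", "abbc"]
results = {
    name: [bool(p.search(s)) for s in samples]
    for name, p in [("dot", dot), ("optional", optional),
                    ("star", star), ("plus", plus)]
}
for name, flags in results.items():
    print(name, flags)
```

Each list shows, in order, whether the pattern matches ac, abc, and abbc, which makes the difference between `?`, `*`, and `+` easy to compare side by side.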

[Natural Language Processing] Text Preprocessing

Text preprocessing steps
1. normalization : cleaning the raw string (removing whitespace, converting case)
2. pre-tokenization : cutting text into words
3. tokenization : cutting into even smaller pieces
1. normalization
- stemming, lemmatization : 간다, 갔다, 가는, ... -> stem = "가"
- uncased : convert everything to lowercase (to make He = he explicit; not applicable to Korean)
- stopword removal : special characters and words not needed for training (for English, words of length 1 are sometimes removed as well)
- pattern removal using regular expressions (re, NLTK)
2. pre-tokenization
- The most basic approach is removing special characters, with whitespace as the basis..
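The normalization and pre-tokenization steps listed above can be sketched for English text as follows. The stopword list, the function names, and the sample sentence are hypothetical, and the character class only covers ASCII letters and digits, so Korean text would need a different pattern.

```python
import re

STOPWORDS = {"a", "the", "is"}  # hypothetical stopword list for illustration

def normalize(text):
    """Trim, lowercase, drop special characters, collapse whitespace."""
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def pre_tokenize(text):
    """Split the normalized text into words on whitespace."""
    return text.split()

def remove_stopwords(tokens):
    """Drop stopwords and words of length 1."""
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

raw = "  He is a Student!!  He studies NLP.  "
tokens = remove_stopwords(pre_tokenize(normalize(raw)))
print(tokens)  # ['he', 'student', 'he', 'studies', 'nlp']
```

Stopwords are filtered after splitting here simply because a token list is easier to filter than a raw string; the order of these steps is a design choice rather than a fixed rule.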
