๋ฐ˜์‘ํ˜•


[์ž์—ฐ์–ด ์ฒ˜๋ฆฌ] ํ† ํฐํ™”์™€ ํ† ํฌ๋‚˜์ด์ € ์ข…๋ฅ˜

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ ํ† ํฐ์€ ์‚ฌ์šฉ์ž ์ง€์ • ๋‹จ์œ„์ž…๋‹ˆ๋‹ค. raw string์„ ํ† ํฐ ๋‹จ์œ„๋กœ ์ชผ๊ฐœ๋Š” ๊ฒƒ์„ ํ† ํฐํ™”๋ผ๊ณ  ํ•˜๊ณ , ์šฐ๋ฆฌ๋Š” ์ด ํ† ํฐ๋“ค๋กœ vocab ์‚ฌ์ „์„ ๋งŒ๋“ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, ํ† ํฐํ™”๋Š” ์ฃผ์–ด์ง„ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ชจ๋ธ์ด ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋Š” ๋‹จ์œ„๋กœ ๋ณ€ํ™˜ํ•ด์ฃผ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด I am student๋ผ๋Š” ๋ฌธ์žฅ์€ ํ† ํฐํ™”๋ฅผ ๊ฑฐ์น˜๋ฉด I, am, student๋กœ ๋‚˜๋ˆ ์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. I am student=[I, am, student] ๊นŒ์ง€ ๋˜๋ฉด ์šฐ๋ฆฌ๋Š” ์ด ๋‹จ์–ด๋“ค์ด vocab ์‚ฌ์ „์—์„œ์˜ ์ธ๋ฑ์Šค๊ฐ’์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๋งŒ์•ฝ I๊ฐ€ 10๋ฒˆ์งธ, am์ด 12๋ฒˆ์งธ, student๊ฐ€ 100๋ฒˆ์งธ์— ์œ„์น˜ํ•˜๊ณ  ์žˆ๋‹ค๋ฉด [I, am, student] =[10,12,100]์ด ๋˜๊ณ  ํ•ด๋‹น ๋ฒกํ„ฐ๋“ค์ด ๋ชจ๋ธ์˜ ์ธํ’‹์œผ๋กœ ๋“ค์–ด๊ฐ€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ํ† ํฌ๋‚˜์ด์ € ํ† ํฐ..

[์ž์—ฐ์–ด ์ฒ˜๋ฆฌ] ๋งž์ถค๋ฒ• ์ „์ฒ˜๋ฆฌ ๊ต์ • Py-Hanspell ์˜ˆ์ œ

!pip install git+https://github.com/ssut/py-hanspell.git

from hanspell import spell_checker

sent = "์–ด์ œ๋Š” ์•Š์žค์–ด. ๊ฐ๊ธฐ ๋นจ๋ฆฌ ๋‚ณ์•„! "
spelled_sent = spell_checker.check(sent)
hanspell_sent = spelled_sent.checked
print(hanspell_sent)
# ์–ด์ œ๋Š” ์•ˆ ์žค์–ด. ๊ฐ๊ธฐ ๋นจ๋ฆฌ ๋‚˜์•„!

[์ž์—ฐ์–ด ์ฒ˜๋ฆฌ]์ „์ฒ˜๋ฆฌ ๋„์–ด์“ฐ๊ธฐ ๊ต์ • ์ˆ˜์ • PyKoSpacing ์˜ˆ์ œ

!pip install git+https://github.com/haven-jeon/PyKoSpacing.git

sentence = '์ฝ”๋”ฉ๊ณผ AI ๊ฐœ๋ฐœ์ด ๋‘˜๋‹ค ๊ฐ€๋Šฅํ•œ ์‚ฌ๋žŒ์€ ๋งŽ์ง€ ์•Š๋‹ค.'
new = sentence.replace(" ", '')  # build a test sentence with no spaces
print(new)

from pykospacing import Spacing

spacing = Spacing()
kospacing_sen = spacing(new)
print('๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์—†๋Š” ๋ฌธ์žฅ :\n', new)
print('์ •๋‹ต ๋ฌธ์žฅ:\n', sentence)
print('๋„์–ด์“ฐ๊ธฐ ๊ต์ • ํ›„:\n', kospacing_sen)

# ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์—†๋Š” ๋ฌธ์žฅ :
#  ์ฝ”๋”ฉ๊ณผAI๊ฐœ๋ฐœ์ด๋‘˜๋‹ค๊ฐ€๋Šฅํ•œ์‚ฌ๋žŒ์€๋งŽ์ง€์•Š๋‹ค.
# ์ •๋‹ต ๋ฌธ์žฅ:
#  ์ฝ”๋”ฉ๊ณผ AI ๊ฐœ๋ฐœ..

[์ž์—ฐ์–ด ์ฒ˜๋ฆฌ] ํ•œ๊ตญ์–ด ํ† ํฐํ™”, ํ’ˆ์‚ฌํƒœ๊น… ๊ตฌํ˜„ KoNLPy (Hannanum,Kkma),Khaiii

# Installation
!pip install konlpy

# Hannanum
from konlpy.tag import Hannanum

hannanum = Hannanum()
text = '์•ˆ๋…•ํ•˜์„ธ์š”! ์˜ค๋Š˜ ๋งŽ์ด ์ถ”์›Œ์š”'
print(hannanum.morphs(text))  # parse phrase into morphemes
print(hannanum.nouns(text))   # noun extraction
print(hannanum.pos(text))     # POS tagging

# ['์•ˆ๋…•', 'ํ•˜', '์„ธ', '์š”', '!', '์˜ค๋Š˜', '๋งŽ', '์ด', '์ถฅ', '์–ด์š”']
# ['์•ˆ๋…•', '์˜ค๋Š˜']
# [('์•ˆ๋…•', 'N'), ('ํ•˜', 'X'), ('์„ธ', 'E'), ('์š”', 'J'), ('!', 'S'), ('์˜ค๋Š˜', 'N'..

[์ž์—ฐ์–ด ์ฒ˜๋ฆฌ]ํ•œ๊ตญ์–ด ์ „์ฒ˜๋ฆฌ re

import re
re.sub('[0-9]+', 'num', '1 2 3 4 hello')  # find every run of digits and replace it with 'num'

re.sub('pattern', 'replacement', 'string', count) handles a simple one-off substitution, but re.compile() lets you prepare a pattern once and reuse it for repeated work.

# . matches any single character
r = re.compile("a.c")
r.search("abc")

# ? the preceding character may or may not be present
r = re.compile("a?c")

# * the preceding character appears zero or more times
r = re.compile("ab*c")  # matches when there is no b at all, or several b's

# + the preceding character must appear at least once
r = ..

[์ž์—ฐ์–ด์ฒ˜๋ฆฌ] ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ(Text Preprocessing)

Text preprocessing steps:
1. Normalization: cleaning the raw string (removing whitespace, case conversion)
2. Pre-tokenization: splitting the text into words
3. Tokenization: splitting into even smaller units

1. Normalization
- stemming / lemmatization: ๊ฐ„๋‹ค, ๊ฐ”๋‹ค, ๊ฐ€๋Š”, ... -> stem "๊ฐ€"
- uncasing: convert everything to lowercase (to make He = he explicit; not applicable to Korean)
- stopword removal: special characters and words not needed for training (in English, length-1 words are sometimes removed)
- pattern removal using regular expressions (re, NLTK)

2. Pre-tokenization
- the most basic approach is to remove special characters and split on whitespa..
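The normalization and pre-tokenization steps above can be sketched as one small function. This is a minimal illustration for English text, assuming the length-1-word stopword heuristic mentioned in the notes; `normalize` is a hypothetical helper name.

```python
import re

def normalize(text):
    """Sketch of normalization + whitespace pre-tokenization."""
    text = text.lower()                        # uncasing: He -> he
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip special characters
    words = text.split()                       # split on whitespace
    return [w for w in words if len(w) > 1]    # drop length-1 words as stopwords

print(normalize("He said: Hi, I am a student!"))
# ['he', 'said', 'hi', 'am', 'student']
```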

Bag-of-words

Bag-of-words (BoW) is a statistical language model used to analyze text and documents based on word count. BoW is based on word frequency: word embeddings built on the BoW assumption ignore the order in which words appear in a sentence. As the image above shows, applying BoW turns the text into a matrix, and as you can see, order is not considered. Simply counting the words also works: {'it': 6, 'I': 5, 'the': 4, 'to': 3, ...}. In Python, you can build this matrix easily with Counter from the collections module or with CountVectorizer from scikit-learn. TF-IDF: does a word that simply appears often count as an important..
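The counting approach above can be sketched with collections.Counter; the token list here is constructed artificially to reproduce the counts quoted in the text.

```python
from collections import Counter

# A bag-of-words is just per-word counts; word order is discarded.
# Token list built to match the example counts above.
tokens = ["it"] * 6 + ["I"] * 5 + ["the"] * 4 + ["to"] * 3
bow = Counter(tokens)
print(dict(bow))  # {'it': 6, 'I': 5, 'the': 4, 'to': 3}
```

scikit-learn's CountVectorizer produces the same kind of counts, but stacked across documents into a document-term matrix.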

ํŒŒ์ดํ† ์น˜๋กœ ๊ฐ„๋‹จํ•œ ์ธ๊ณต์‹ ๊ฒฝ๋ง ๊ตฌํ˜„ํ•˜๊ธฐ (๋ถ„๋ฅ˜)

ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์ž„ํฌํŠธ import torch import numpy as np from sklearn.datasets import make_blobs import matplotlib.pyplot as plt import torch.nn.functional as F ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ n_dim=2 x_train, y_train = make_blobs(n_samples=50, n_features=n_dim, centers=[[1,1],[-1,-1],[1,-1],[-1,1]], shuffle=True, cluster_std=0.3) x_test, y_test = make_blobs(n_samples=20, n_features=n_dim, centers=[[1,1],[-1,-1],[1,-1],[-1,1]], s..

RNN

์ž์—ฐ์–ด์ฒ˜๋ฆฌ์—์„œ ํ† ํฐ์˜ ์ˆœ์„œ๊ฐ€ ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ๋‚˜๋Š” ์ง‘์— ๊ฐ„๋‹ค. (o) ๋‚˜๋Š” ๊ฐ„๋‹ค ์ง‘์— (x) ์ด๋Ÿฌํ•œ ํ† ํฐ์˜ ์ˆœ์„œ๋ฅผ ๊ณ ๋ คํ•˜๊ธฐ ์œ„ํ•ด RNN ํ˜•ํƒœ์˜ ๋ชจ๋ธ์ด ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. RNN์˜ ์˜๋ฏธ์™€ ๊ตฌ์กฐ ๋˜‘๊ฐ™์€ weight๋ฅผ ํ†ตํ•ด ์žฌ๊ท€์ ์œผ๋กœ(Recurrent) ํ•™์Šตํ•œ๋‹ค. = RNN(Recurrent Neural Network) xt๋ผ๋Š” input์ด ๋“ค์–ด๊ฐ€๊ฒŒ ๋˜๋ฉด ์ด์ „์— xt-1์—์„œ ํ•™์Šต๋œ A๋ผ๋Š” weight๋ฅผ ํ†ตํ•ด ht๋ฅผ ๋ฆฌํ„ดํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. RNN์˜ ์—ฌ๋Ÿฌ ํ˜•ํƒœ์™€ ์‚ฌ์šฉ ๋ถ„์•ผ ๋นจ๊ฐ„๋ฐ•์Šค๋Š” ์ธํ’‹, ์ดˆ๋ก๋ฐ•์Šค๋Š” RNN Block, ํŒŒ๋ž€๋ฐ•์Šค๋Š” y(์ •๋‹ต) ํ˜น์€ y^(์˜ˆ์ธก๊ฐ’) ์•„์›ƒํ’‹์ž…๋‹ˆ๋‹ค. one-to-many ์‚ฌ์šฉ๋ถ„์•ผ : image captioning ( input: image, output: sequence of words/toke..

๋ฐ˜์‘ํ˜•