๋”ฅ๋Ÿฌ๋‹/Today I learned :

๋”ฅ๋Ÿฌ๋‹์„ ์ด์šฉํ•œ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ NLP

์ฃผ์˜ ๐Ÿฑ 2021. 3. 28. 01:55

Natural language: the speech and text we use in everyday life.

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(Natural Language Processing, NLP) :  ์ž์—ฐ์–ด๋ฅผ ์ปดํ“จํ„ฐ๊ฐ€ ์ธ์‹ํ•˜๊ณ  ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ

 

1. Text preprocessing

Tokenization: the process of splitting input text into small pieces (tokens).

 

The text_to_word_sequence() function in Keras's text module: splits a sentence into words.

from tensorflow.keras.preprocessing.text import text_to_word_sequence

text = 'ํ•ด๋ณด์ง€ ์•Š์œผ๋ฉด ํ•ด๋‚ผ ์ˆ˜ ์—†๋‹ค'
result = text_to_word_sequence(text)
print(result)
['ํ•ด๋ณด์ง€', '์•Š์œผ๋ฉด', 'ํ•ด๋‚ผ', '์ˆ˜', '์—†๋‹ค']

 

๋จผ์ € ํ…์ŠคํŠธ์˜ ๊ฐ ๋‹จ์–ด๋ฅผ ๋‚˜๋ˆ„์–ด ํ† ํฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
ํ…์ŠคํŠธ์˜ ๋‹จ์–ด๋ฅผ ํ† ํฐํ™”ํ•ด์•ผ ๋”ฅ๋Ÿฌ๋‹์—์„œ ์ธ์‹๋ฉ๋‹ˆ๋‹ค.
ํ† ํฐํ™” ํ•œ ๊ฒฐ๊ณผ๋Š” ๋”ฅ๋Ÿฌ๋‹์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ผ€๋ผ์Šค์˜ Tokenizer() ํ•จ์ˆ˜ : ๋‹จ์–ด์˜ ๋นˆ๋„ ์ˆ˜ ๊ณ„์‚ฐ

from tensorflow.keras.preprocessing.text import Tokenizer

docs = ['๋จผ์ € ํ…์ŠคํŠธ์˜ ๊ฐ ๋‹จ์–ด๋ฅผ ๋‚˜๋ˆ„์–ด ํ† ํฐํ™” ํ•ฉ๋‹ˆ๋‹ค.', 'ํ…์ŠคํŠธ์˜ ๋‹จ์–ด๋กœ ํ† ํฐํ™” ํ•ด์•ผ ๋”ฅ๋Ÿฌ๋‹์—์„œ ์ธ์‹๋ฉ๋‹ˆ๋‹ค.', 'ํ† ํฐํ™” ํ•œ ๊ฒฐ๊ณผ๋Š” ๋”ฅ๋Ÿฌ๋‹์—์„œ ์‚ฌ์šฉ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.']

# Preprocess with Tokenizer()
token = Tokenizer()         # create the tokenizer
token.fit_on_texts(docs)    # fit the tokenizer on the sentences
print(token.word_counts)    # print the word-frequency counts

# word_counts: an attribute holding each word's frequency count
OrderedDict([('๋จผ์ €', 1), ('ํ…์ŠคํŠธ์˜', 2), ('๊ฐ', 1), ('๋‹จ์–ด๋ฅผ', 1), ('๋‚˜๋ˆ„์–ด', 1), ('ํ† ํฐํ™”', 3), ('ํ•ฉ๋‹ˆ๋‹ค', 1), ('๋‹จ์–ด๋กœ', 1), ('ํ•ด์•ผ', 1), ('๋”ฅ๋Ÿฌ๋‹์—์„œ', 2), ('์ธ์‹๋ฉ๋‹ˆ๋‹ค', 1), ('ํ•œ', 1), ('๊ฒฐ๊ณผ๋Š”', 1), ('์‚ฌ์šฉ', 1), ('ํ• ', 1), ('์ˆ˜', 1), ('์žˆ์Šต๋‹ˆ๋‹ค', 1)])

'ํ† ํฐํ™”' appears 3 times, 'ํ…์ŠคํŠธ์˜' and '๋”ฅ๋Ÿฌ๋‹์—์„œ' appear twice each, and every other word appears once.

The result is printed as an OrderedDict, a dictionary class that remembers insertion order.

 

 

The document_count attribute: how many sentences (documents) were fed in?

print(token.document_count)

Execution result: 3

 

 

The word_docs attribute: in how many of the sentences does each word appear? (The print order is arbitrary.)

print(token.word_docs)
{'ํ•œ': 1, '๋จผ์ €': 1, '๋‚˜๋ˆ„์–ด': 1, 'ํ•ด์•ผ': 1, 'ํ† ํฐํ™”': 3, '๊ฒฐ๊ณผ๋Š”': 1, '๊ฐ': 1, '๋‹จ์–ด๋ฅผ': 1, '์ธ์‹๋ฉ๋‹ˆ๋‹ค': 1, '์žˆ์Šต๋‹ˆ๋‹ค': 1, 'ํ• ': 1, '๋‹จ์–ด๋กœ': 1, '์ˆ˜': 1, 'ํ•ฉ๋‹ˆ๋‹ค': 1, '๋”ฅ๋Ÿฌ๋‹์—์„œ': 2, '์‚ฌ์šฉ': 1, 'ํ…์ŠคํŠธ์˜': 2}

 

The word_index attribute: the index assigned to each word.

print(token.word_index)
{'๋”ฅ๋Ÿฌ๋‹์—์„œ': 3, '๋‹จ์–ด๋ฅผ': 6, '๊ฒฐ๊ณผ๋Š”': 13, '์ˆ˜': 16, 'ํ•œ': 12, '์ธ์‹๋ฉ๋‹ˆ๋‹ค': 11, 'ํ•ฉ๋‹ˆ๋‹ค': 8, 'ํ…์ŠคํŠธ์˜': 2, 'ํ† ํฐํ™”': 1, 'ํ• ': 15, '๊ฐ': 5, '์žˆ์Šต๋‹ˆ๋‹ค': 17, '๋จผ์ €': 4, '๋‚˜๋ˆ„์–ด': 7, 'ํ•ด์•ผ': 10, '์‚ฌ์šฉ': 14, '๋‹จ์–ด๋กœ': 9}

 


 ๋‹จ์–ด์˜ ์›-ํ•ซ ์ธ์ฝ”๋”ฉ

 

‘์˜ค๋žซ๋™์•ˆ ๊ฟˆ๊พธ๋Š” ์ด๋Š” ๊ทธ ๊ฟˆ์„ ๋‹ฎ์•„๊ฐ„๋‹ค’

 

์›-ํ•ซ ์ธ์ฝ”๋”ฉ = ๊ฐ ๋‹จ์–ด๋ฅผ ๋ชจ๋‘ 0์œผ๋กœ ๋ฐ”๊พธ์–ด ์ฃผ๊ณ  ์›ํ•˜๋Š” ๋‹จ์–ด๋งŒ 1๋กœ ๋ฐ”๊พธ์–ด ์ฃผ๋Š” ๊ฒƒ

To do this, first convert each word into a vector space filled with as many zeros as there are words.

 

 

ํŒŒ์ด์ฌ ๋ฐฐ์—ด์˜ ์ธ๋ฑ์Šค๋Š” 0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋ฏ€๋กœ, ๋งจ ์•ž์— 0์ด ์ถ”๊ฐ€๋จ

์ด์ œ ๊ฐ ๋‹จ์–ด๊ฐ€ ๋ฐฐ์—ด ๋‚ด์—์„œ ํ•ด๋‹นํ•˜๋Š” ์œ„์น˜๋ฅผ 1๋กœ ๋ฐ”๊ฟ”์„œ ๋ฒกํ„ฐํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค.


1. Load the tokenizer, tokenize the text word by word, and print each word's index value:

from tensorflow.keras.preprocessing.text import Tokenizer
text=“์˜ค๋žซ๋™์•ˆ ๊ฟˆ๊พธ๋Š” ์ด๋Š” ๊ทธ ๊ฟˆ์„ ๋‹ฎ์•„๊ฐ„๋‹ค”


token = Tokenizer()
token.fit_on_texts([text])
print(token.word_index)
{‘๊ฟˆ์„’: 5, ‘๊ฟˆ๊พธ๋Š”’: 2, ‘๊ทธ’: 4, ‘๋‹ฎ์•„๊ฐ„๋‹ค’: 6, ‘์ด๋Š”’: 3, ‘์˜ค๋žซ๋™์•ˆ’: 1}

 

2. One-hot encoding

 

์ผ€๋ผ์Šค์—์„œ ์ œ๊ณตํ•˜๋Š” Tokenizer์˜ texts_to_sequences() ํ•จ์ˆ˜ : ๋งŒ๋“ค์–ด์ง„ ํ† ํฐ์˜ ์ธ๋ฑ์Šค๋กœ๋งŒ ์ฑ„์›Œ์ง„ ์ƒˆ๋กœ์šด ๋ฐฐ์—ด ์ƒ์„ฑ

x = token.texts_to_sequences([text])
print(x)
[[1,2,3,4,5,6]]

 

The to_categorical() function: converts the sequence, currently indexed with the integers 1 through 6, into an array of only 0s and 1s, i.e. performs the one-hot encoding.

Note: since a 0 is added at the front of the array, the number of classes must be set to one more than the number of words!

from tensorflow.keras.utils import to_categorical

# ์ธ๋ฑ์Šค ์ˆ˜์— ํ•˜๋‚˜๋ฅผ ์ถ”๊ฐ€ํ•ด์„œ ์›-ํ•ซ ์ธ์ฝ”๋”ฉ ๋ฐฐ์—ด ๋งŒ๋“ค๊ธฐ
word_size = len(t.word_index) +1

x = to_categorical(x, num_classes=word_size)


print(x)
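For the sentence above, the printed result should look along these lines: a (1, 6, 7) array in which each row is one word's one-hot vector (the leading column is the reserved index 0):

[[[0. 1. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 0. 0. 1.]]]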

 


์›-ํ•ซ ์ธ์ฝ”๋”ฉ์„ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•˜๋ฉด ๋ฒกํ„ฐ์˜ ๊ธธ์ด๊ฐ€ ๋„ˆ๋ฌด ๊ธธ์–ด์ง„๋‹ค๋Š” ๋‹จ์ ( ์˜ˆ๋ฅผ ๋“ค์–ด 1๋งŒ ๊ฐœ์˜ ๋‹จ์–ด ํ† ํฐ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๋ง๋ญ‰์น˜๋ฅผ ๋‹ค๋ฃฌ๋‹ค๊ณ  ํ•  ๋•Œ, ์ด ๋ฐ์ดํ„ฐ๋ฅผ ์›-ํ•ซ ์ธ์ฝ”๋”ฉ์œผ๋กœ ๋ฒกํ„ฐํ™”ํ•˜๋ฉด 9,999๊ฐœ์˜ 0๊ณผ ํ•˜๋‚˜์˜ 1๋กœ ์ด๋ฃจ์–ด์ง„ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ 1๋งŒ ๊ฐœ๋‚˜ ๋งŒ๋“ค์–ด์•ผ ํ•œ๋‹ค.) ์ด๋Ÿฌํ•œ ๊ณต๊ฐ„์  ๋‚ญ๋น„๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋“ฑ์žฅํ•œ ๊ฒƒ์ด ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ(word embedding)์ด๋ผ๋Š” ๋ฐฉ๋ฒ•

 

๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์€ ์ฃผ์–ด์ง„ ๋ฐฐ์—ด์„ ์ •ํ•ด์ง„ ๊ธธ์ด๋กœ ์••์ถ•

The result of word embedding carries densely packed information and wastes little space.

This is possible because the similarity between words is taken into account.

 


๋‹จ์–ด ๊ฐ„ ์œ ์‚ฌ๋„๋Š” ์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ผ๊นŒ?

 

Through error backpropagation: the embedding vectors are weights that are adjusted as the model trains.

 

์ผ€๋ผ์Šค์—์„œ ์ œ๊ณตํ•˜๋Š” Embedding()ํ•จ์ˆ˜

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

model = Sequential()
model.add(Embedding(16, 4))

What Embedding(16, 4) means: the sizes of the 'input' and the 'output'. The total number of input words is 16, and the vector output after embedding has size 4.

On top of this, you can additionally specify how many words are fed in each time.

Embedding(16, 4, input_length=2) means that the total number of input words is 16, but only 2 are fed in at a time.
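A minimal sketch (not from the original post) that makes the shapes concrete; the word indices 3 and 7 are arbitrary examples:

import tensorflow as tf

# Embedding layer: vocabulary of 16 words, 4-dimensional vectors, 2 words per input
layer = tf.keras.layers.Embedding(16, 4, input_length=2)
out = layer(tf.constant([[3, 7]]))  # one sample containing 2 word indices
print(out.shape)                    # (1, 2, 4): each of the 2 words becomes a 4-d vector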

 

 

 

ํ…์ŠคํŠธ ๊ฐ์ •์„ ์˜ˆ์ธกํ•˜๋Š” ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ -์˜ํ™” ๋ฆฌ๋ทฐ๋ฅผ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ๋กœ ํ•™์Šตํ•ด์„œ, ๊ฐ ๋ฆฌ๋ทฐ๊ฐ€ ๊ธ์ •์ ์ธ์ง€ ๋ถ€์ •์ ์ธ์ง€ ์˜ˆ์ธก

 

 

 1. ์งง์€ ๋ฆฌ๋ทฐ 10๊ฐœ๋ฅผ ๋ถˆ๋Ÿฌ์™€ ๊ฐ๊ฐ ๊ธ์ •์ด๋ฉด 1์ด๋ผ๋Š” ํด๋ž˜์Šค๋ฅผ, ๋ถ€์ •์ ์ด๋ฉด 0์ด๋ผ๋Š” ํด๋ž˜์Šค๋กœ ์ง€์ •

# Text review data
docs = ['๋„ˆ๋ฌด ์žฌ๋ฐŒ๋„ค์š”','์ตœ๊ณ ์˜ˆ์š”','์ฐธ ์ž˜ ๋งŒ๋“  ์˜ํ™”์˜ˆ์š”','์ถ”์ฒœํ•˜๊ณ  ์‹ถ์€ ์˜ํ™”์ž…๋‹ˆ๋‹ค.','ํ•œ๋ฒˆ ๋” ๋ณด๊ณ ์‹ถ๋„ค์š”','๊ธ€์Ž„์š”','๋ณ„๋กœ์˜ˆ์š”','์ƒ๊ฐ๋ณด๋‹ค ์ง€๋ฃจํ•˜๋„ค์š”','์—ฐ๊ธฐ๊ฐ€ ์–ด์ƒ‰ํ•ด์š”','์žฌ๋ฏธ์—†์–ด์š”']

# ๊ธ์ • ๋ฆฌ๋ทฐ๋Š” 1, ๋ถ€์ • ๋ฆฌ๋ทฐ๋Š” 0์œผ๋กœ ํด๋ž˜์Šค ์ง€์ •
class = array([1,1,1,1,1,0,0,0,0,0])

 

2. Tokenization

Tokenizer() ํ•จ์ˆ˜์˜ fit_on_text: ๊ฐ ๋‹จ์–ด๋ฅผ ํ•˜๋‚˜์˜ ํ† ํฐ์œผ๋กœ ๋ณ€ํ™˜

# Tokenization
token = Tokenizer()
token.fit_on_texts(docs)
print(token.word_index)   # print the tokenized result to check it
{'์ƒ๊ฐ๋ณด๋‹ค': 16, '๋งŒ๋“ ': 6, '์˜ํ™”์ž…๋‹ˆ๋‹ค': 10, 'ํ•œ ๋ฒˆ': 11, '์˜ํ™”์˜ˆ์š”': 7, '์‹ถ์€': 9, '๋ณด๊ณ ์‹ถ๋„ค์š”': 13, '์–ด์ƒ‰ํ•ด์š”': 19, '์žฌ๋ฏธ์—†์–ด์š”': 20, '๋”': 12, '์ถ”์ฒœํ•˜๊ณ ': 8, '์ง€๋ฃจํ•˜๋„ค์š”': 17, '์ตœ๊ณ ์˜ˆ์š”': 3, '์ž˜': 5, '์ฐธ': 4, '์žฌ๋ฐŒ๋„ค์š”': 2, '๋ณ„๋กœ์˜ˆ์š”': 15, '๊ธ€์Ž„์š”': 14, '์—ฐ๊ธฐ๊ฐ€': 18, '๋„ˆ๋ฌด': 1}.

 

3. Build a new array from the indices assigned to the tokens:

x = token.texts_to_sequences(docs)
print(x)
[[1, 2], [3], [4, 5, 6, 7], [8, 9, 10], [11, 12, 13], [14], [15], [16, 17], [18, 19], [20]]

The words have been tokenized as the numbers 1 through 20.

๊ทธ๋Ÿฐ๋ฐ ์ž…๋ ฅ๋œ ๋ฆฌ๋ทฐ ๋ฐ์ดํ„ฐ์˜ ํ† ํฐ ์ˆ˜๊ฐ€ ๊ฐ๊ฐ ๋‹ค๋ฅด๋‹ค๋Š” ๊ฒƒ์— ์œ ์˜ํ•˜์‹œ๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค.

For example, '์ตœ๊ณ ์˜ˆ์š”' is a single token ([3]), while '์ฐธ ์ž˜ ๋งŒ๋“  ์˜ํ™”์˜ˆ์š”' has 4 tokens ([4, 5, 6, 7]).

๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์— ์ž…๋ ฅ์„ ํ•˜๋ ค๋ฉด ํ•™์Šต ๋ฐ์ดํ„ฐ์˜ ๊ธธ์ด๊ฐ€ ๋™์ผํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.=>ํŒจ๋”ฉ( padding)

The pad_sequences() function (from tensorflow.keras.preprocessing.sequence):

padded_x = pad_sequences(x, 4)  # pad the different-length sequences to length 4
print(padded_x)
[[ 0 0 1 2]
[ 0 0 0 3]
[ 4 5 6 7]
[ 0 8 9 10]
[ 0 11 12 13]
[ 0 0 0 14]
[ 0 0 0 15]
[ 0 0 16 17]
[ 0 0 18 19]
[ 0 0 0 20]]
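
By default pad_sequences() inserts the zeros at the front (padding='pre'), which is why they appear before the tokens above. A small variant (not in the original post) pads at the end instead:

padded_post = pad_sequences(x, maxlen=4, padding='post')  # zeros after the tokens
print(padded_post[0])  # first review: [1 2 0 0] instead of [0 0 1 2]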

 

 

4. ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์„ ํฌํ•จ, ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ

 

 

The three parameters the embedding layer needs: 'input, output, word count'.

1. ์ด ๋ช‡ ๊ฐœ์˜ ๋‹จ์–ด ์ง‘ํ•ฉ์—์„œ(์ž…๋ ฅ),

2. ๋ช‡ ๊ฐœ์˜ ์ž„๋ฒ ๋”ฉ ๊ฒฐ๊ณผ๋ฅผ ์‚ฌ์šฉํ•  ๊ฒƒ์ธ์ง€(์ถœ๋ ฅ),

3. And how many words to feed in each time (word count).

 

1 . word_size๋ผ๋Š” ๋ณ€์ˆ˜๋ฅผ ๋งŒ๋“  ๋’ค, ๊ธธ์ด๋ฅผ ์„ธ๋Š” len() ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•ด word_index ๊ฐ’์„ ์•ž์„œ ๋งŒ๋“  ๋ณ€์ˆ˜์— ๋Œ€์ž…ํ•ฉ๋‹ˆ๋‹ค. ์ด๋•Œ ์ „์ฒด ๋‹จ์–ด์˜ ๋งจ ์•ž์— 0์ด ๋จผ์ € ๋‚˜์™€์•ผ ํ•˜๋ฏ€๋กœ ์ด ๋‹จ์–ด ์ˆ˜์— 1์„ ๋”ํ•˜๋Š” ๊ฒƒ์„ ์žŠ์ง€ ๋งˆ์‹œ๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค.

word_size = len(token.word_index) +1

2. ์ด๋ฒˆ ์˜ˆ์ œ์—์„œ๋Š” word_size๋งŒํผ์˜ ์ž…๋ ฅ ๊ฐ’์„ ์ด์šฉํ•ด 8๊ฐœ์˜ ์ž„๋ฒ ๋”ฉ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค๊ฒ ์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ 8์ด๋ผ๋Š” ์ˆซ์ž๋Š” ์ž„์˜๋กœ ์ •ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ์ ์ ˆํ•œ ๊ฐ’์œผ๋กœ ๋ฐ”๊ฟ€ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋•Œ ๋งŒ๋“ค์–ด์ง„ 8๊ฐœ์˜ ์ž„๋ฒ ๋”ฉ ๊ฒฐ๊ณผ๋Š” ์šฐ๋ฆฌ ๋ˆˆ์— ๋ณด์ด์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋‚ด๋ถ€์—์„œ ๊ณ„์‚ฐํ•˜์—ฌ ๋”ฅ๋Ÿฌ๋‹์˜ ๋ ˆ์ด์–ด๋กœ ํ™œ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

3. Since the padding step fixed every sequence to length 4, set the layer to take 4 words at a time, and the embedding step becomes the single line:

Embedding(word_size, 8, input_length=4)

๋ชจ๋ธ ์ƒ์„ฑ

# ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์„ ํฌํ•จํ•˜์—ฌ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ
model = Sequential()
model.add(Embedding(word_size, 8, input_length=4))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(padded_x, classes, epochs=20)
print("\n Accuracy: %.4f" % (model.evaluate(padded_x, classes)[1]))

์ตœ์ ํ™” ํ•จ์ˆ˜๋กœ adam()์„ ์‚ฌ์šฉํ•˜๊ณ  ์˜ค์ฐจ ํ•จ์ˆ˜๋กœ๋Š” binary_crossentropy()๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. 30๋ฒˆ ๋ฐ˜๋ณตํ•˜๊ณ ๋‚˜์„œ ์ •ํ™•๋„๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ์ถœ๋ ฅํ•˜๊ฒŒ ํ–ˆ์Šต๋‹ˆ๋‹ค.

 


import numpy
import tensorflow as tf
from numpy import array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Flatten,Embedding
  
# Text review data
docs = ['๋„ˆ๋ฌด ์žฌ๋ฐŒ๋„ค์š”','์ตœ๊ณ ์˜ˆ์š”','์ฐธ ์ž˜ ๋งŒ๋“  ์˜ํ™”์˜ˆ์š”','์ถ”์ฒœํ•˜๊ณ  ์‹ถ์€ ์˜ํ™”์ž…๋‹ˆ๋‹ค.','ํ•œ๋ฒˆ ๋” ๋ณด๊ณ ์‹ถ๋„ค์š”','๊ธ€์Ž„์š”','๋ณ„๋กœ์˜ˆ์š”','์ƒ๊ฐ๋ณด๋‹ค ์ง€๋ฃจํ•˜๋„ค์š”','์—ฐ๊ธฐ๊ฐ€ ์–ด์ƒ‰ํ•ด์š”','์žฌ๋ฏธ์—†์–ด์š”']
  
# ๊ธ์ • ๋ฆฌ๋ทฐ๋Š” 1, ๋ถ€์ • ๋ฆฌ๋ทฐ๋Š” 0์œผ๋กœ ํด๋ž˜์Šค ์ง€์ •
classes = array([1,1,1,1,1,0,0,0,0,0])
  
# Tokenization
token = Tokenizer()
token.fit_on_texts(docs)
print(token.word_index)
  
# Build index sequences from the tokens
x = token.texts_to_sequences(docs)

# Padding: pad the different-length sequences to length 4
padded_x = pad_sequences(x, 4)
print("\nPadding result\n", padded_x)
  
# ์ž„๋ฒ ๋”ฉ์— ์ž…๋ ฅ๋  ๋‹จ์–ด ์ˆ˜ ์ง€์ •
word_size = len(token.word_index)+1
  
# ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์„ ํฌํ•จํ•˜์—ฌ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ๊ฒฐ๊ณผ ์ถœ๋ ฅ
model = Sequential()
model.add(Embedding(word_size, 8, input_length=4))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(padded_x, classes, epochs=20)
 
print("\n Accuracy: %.4f" % (model.evaluate(padded_x, classes)[1]))
Execution result:
Train on 10 samples
Epoch 1/20
10/10 [==============================] - 0s 7ms/sample - loss: 0.7047 - accuracy: 0.3000
Epoch 2/20
10/10 [==============================] - 0s 183us/sample - loss: 0.7027 - accuracy: 0.4000
(omitted)
Epoch 20/20
10/10 [==============================] - 0s 150us/sample - loss: 0.6668 - accuracy: 0.9000
10/10 [==============================] - 0s 2ms/sample - loss: 0.6648 - accuracy: 0.9000
ํ•™์Šต ํ›„ 10๊ฐœ์˜ ๋ฆฌ๋ทฐ ์ƒ˜ํ”Œ ์ค‘ 9๊ฐœ์˜ ๊ธ์ • ๋˜๋Š” ๋ถ€์ •์„ ๋งžํ˜”์Œ
๋ฐ˜์‘ํ˜•