์ž์—ฐ์–ด ์ฒ˜๋ฆฌ/Today I learned :

[์ž์—ฐ์–ด์ฒ˜๋ฆฌ] ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ(Text Preprocessing)

์ฃผ์˜ ๐Ÿฑ 2023. 1. 3. 15:52
728x90
๋ฐ˜์‘ํ˜•

ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •

1. normalization : raw string์„ cleaningํ•˜๋Š” ์ž‘์—…(์—ฌ๋ฐฑ ์ œ๊ฑฐ, ๋Œ€์†Œ๋ฌธ์ž ๋ณ€ํ™˜)

2. pre-tokenization : text๋ฅผ word๋กœ ์ž๋ฅด๊ธฐ

3. tokenization: ๋” ์ž‘๊ฒŒ ์ž๋ฅด๋Š” ๊ณผ์ •

 

1. normalization

- stemming , lestemming : ๊ฐ„๋‹ค, ๊ฐ”๋‹ค, ๊ฐ€๋Š”,,,,-> ์–ด๊ทผ="๊ฐ€"

- uncased : ๋ชจ๋‘ ์†Œ๋ฌธ์ž๋กœ ๋ฐ”๊ฟˆ(He=he๋ฅผ ๋ช…์‹œํ•˜๊ธฐ์œ„ํ•ด, ํ•œ๊ตญ์–ด๋Š” ํ•ด๋‹น๋˜์ง€ ์•Š์Œ)

- ๋ถˆ์šฉ์–ด ์ œ๊ฑฐ : ํŠน์ˆ˜๋ฌธ์ž, ํ•™์Šต์— ํ•„์š”ํ•˜์ง€ ์•Š์€ ๋‹จ์–ด(์˜์–ด์˜ ๊ฒฝ์šฐ ๊ธธ์ด๊ฐ€ 1์ธ ๋‹จ์–ด๋ฅผ ์ œ๊ฑฐํ•˜๊ธฐ๋„ํ•จ)

- ์ •๊ทœ ํ‘œํ˜„์‹ ํ™œ์šฉํ•œ ํŒจํ„ด ์ œ๊ฑฐ- (re, NLTK)

 

2. pre-tokenization 

- ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ๋ฐฉ๋ฒ•์€ ํŠน์ˆ˜๋ฌธ์ž ์ œ๊ฑฐ, ๋„์–ด์“ฐ๊ธฐ ๊ธฐ์ค€์œผ๋กœ ์ž๋ฅด๊ธฐ

- ํ•œ๊ตญ์–ด ํ† ํฌ๋‚˜์ด์ €๋กœ๋Š” KoNLPy, Khaiii, soynlp ๊ฐ€ ์žˆ์Œ(soynlp๋Š” ๋‹ค๋ฅธ ๋‘ ๋ฐฉ๋ฒ•๊ณผ ๋‹ค๋ฅด๊ฒŒ ํ™•๋ฅ ์  ํ†ต๊ณ„ ๋ฐฉ๋ฒ•์„ ์“ฐ๋Š” ๋น„์ง€๋„์ ์ธ ํ•™์Šต์„ ์‚ฌ์šฉ)

KoNLPy

- Hannanum, KKma, komoran, Mecab, Okt ๊ฐ€ ์žˆ์ง€๋งŒ ์ด์ค‘์—์„œ๋„ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹๋‹ค๊ณ  ์•Œ๋ ค์ง„(๊ธ€์ž์ˆ˜๊ฐ€ ๋งŽ์•„์ ธ๋„ ์‹œ๊ฐ„์ด ๋ณ„๋กœ ์•ˆ๊ฑธ๋ฆผ) Mecab, Okt๋ฅผ ์“ฐ๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋Œ€๋‹ค์ˆ˜

๋„์–ด์“ฐ๊ธฐ ๊ต์ •

from pykospacing import spacing

spacing=Spacing()

kospacy = spacing(text)

๋งž์ถค๋ฒ• ๊ต์ •

from hanspell import spell_checker
๋ฐ˜์‘ํ˜•