In natural language processing, a token is a user-defined unit of text.
Splitting a raw string into tokens is called tokenization, and we build a vocab (vocabulary) from these tokens.
In other words, tokenization converts the given input data into units that an NLP model can recognize.
For example, tokenizing the sentence "I am student" splits it into I, am, student.
Once we have I am student = [I, am, student], we convert each word into its index in the vocab. If I is the 10th entry, am the 12th, and student the 100th,
then [I, am, student] = [10, 12, 100], and this sequence of indices is what gets fed into the model as input.
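As a minimal sketch of this pipeline (the toy indices below come from the example above and are made up for illustration):

vocab = {"I": 10, "am": 12, "student": 100}   # toy vocab; real vocabularies are built from a corpus

tokens = "I am student".split()               # ['I', 'am', 'student']
ids = [vocab[token] for token in tokens]      # [10, 12, 100]
print(tokens, ids)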
Tokenizers
There are three main kinds of tokenizers.
1. Word-based tokenizer - tokenizes at the word level; it suffers from the out-of-vocabulary (OOV) problem, because any word not in the vocabulary is mapped to an unknown token.
"I have a meal"์ด๋ผ๊ณ ํ๋ ๋ฌธ์ฅ์ ๊ฐ์ง๊ณ word tokenization์ ํ๋ฉด ๋ค์๊ณผ ๊ฐ์ต๋๋ค.
- ['I', 'have', 'a', 'meal']
In English, words are mostly delimited by spaces, so word tokenization can be implemented easily with .split() (as in the sketch above).
For English, word tokenization can therefore also be called space tokenization,
and it is widely used as a pre-tokenization step before subword tokenization.
"๋๋ ๋ฐฅ์ ๋จน๋๋ค"๋ผ๋ ๋ฌธ์ฅ์ word tokenizationํ๋ฉด ๋ค์๊ณผ ๊ฐ์ต๋๋ค.
- ['๋', '๋', '๋ฐฅ', '์', '๋จน๋๋ค']
ํ๊ตญ์ด์์ "๋จ์ด"๋ ๊ณต๋ฐฑ(space)์ ๊ธฐ์ค์ผ๋ก ์ ์๋์ง ์์ต๋๋ค. ์ด๋ ํ๊ตญ์ด๊ฐ ๊ฐ๊ณ ์๋ "๊ต์ฐฉ์ด"๋ก์์ ํน์ง ๋๋ฌธ์
๋๋ค.
์ฒด์ธ ๋ค์ ์กฐ์ฌ๊ฐ ๋ถ๋ ๊ฒ์ด ๋ํ์ ์ธ ํน์ง์ด๋ฉฐ ์๋ฏธ ๋จ์๊ฐ ๊ตฌ๋ถ๋๊ณ ์๋ฆฝ์ฑ์ด ์๊ธฐ ๋๋ฌธ์ ์กฐ์ฌ๋ "๋จ์ด"์
๋๋ค.
ํ๊ตญ์ด์์๋ pre-tokenization ๋ฐฉ๋ฒ์ผ๋ก space tokenization์ ์ฌ์ฉํ์ง ์๊ณ ํํ์ ๋ถ์๊ธฐ๋ฅผ ํ์ฉํ๊ณ ์์ต๋๋ค.
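As a hedged sketch, Korean pre-tokenization with a morphological analyzer could look like the following. This assumes the KoNLPy package and its Okt analyzer are installed; Okt is just one of several analyzers, not the only choice.

from konlpy.tag import Okt

okt = Okt()
print(okt.morphs("๋๋ ๋ฐฅ์ ๋จน๋๋ค"))  # e.g., ['๋', '๋', '๋ฐฅ', '์', '๋จน๋๋ค']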
2. Character-based tokenizer - splits English text into individual letters and Korean text into individual syllables. Training is harder, sequences become long, and it does not perform well.
Character-based tokenization gives the following:
"I have a meal" -> ['I', 'h', 'a', 'v', 'e', 'a', 'm', 'e', 'a', 'l']
"๋๋ ๋ฐฅ์ ๋จน๋๋ค" -> ['๋', '๋', '๋ฐฅ', '์', '๋จน', '๋', '๋ค']
3. Subword-based tokenizer - the most commonly used approach; sequences become shorter and it performs well.
*What is a subword?
A subword is one of the pieces obtained when a single word is split into several units.
"subword"๋ฅผ subword ๋จ์๋ก ๋ํ๋ธ ํ๋์ ์์๋ ๋ค์๊ณผ ๊ฐ์ต๋๋ค.
"sub" + "word"
Here the word "subword" is written as two subwords: the prefix sub and the root word.
It can also be split in many other ways (e.g., "su" + "bword", "s" + "ubword", "subwor" + "d").
When subword tokenization is applied, the sentences might be tokenized as follows.
Example 1
"I have a meal" -> ['I', 'hav', 'e', 'a', 'me', 'al']
"๋๋ ๋ฐฅ์ ๋จน๋๋ค" -> ['๋', '๋', '๋ฐฅ', '์', '๋จน๋', '๋ค']
The sentences are tokenized into subword units, which are smaller than whole words.
As mentioned above, several different segmentations are possible.
Example 2
"I have a meal" -> ['I', 'ha', 've', 'a', 'mea', 'l']
"๋๋ ๋ฐฅ์ ๋จน๋๋ค" -> ['๋', '๋', '๋ฐฅ', '์', '๋จน', '๋๋ค']
However, subwords are in principle never formed across whitespace boundaries.
For example, tokenization is not performed like this:
Example 3
"I have a meal" -> ['Iha', 've', 'am', 'ea', 'l']
"๋๋ ๋ฐฅ์ ๋จน๋๋ค" -> ['๋๋๋ฐฅ', '์๋จน', '๋๋ค']
An advantage of subword tokenization is that it is relatively free from the out-of-vocabulary (OOV) problem.
Subword vocabularies are usually built by starting from the smallest spelling units and adding progressively longer subwords one at a time.
For example, in English you start from the letters a-z and keep adding two-letter, three-letter, four-letter subwords, and so on;
since any word can then be composed from these subwords and tokenization is performed on that basis, you are largely free of the OOV problem as long as you are not tokenizing a different language.
Subword tokenization algorithms
1. BPE (Byte-Pair Encoding) - used by GPT (a small training sketch follows this list)
2. WordPiece - probabilistically merges the pair with the highest likelihood - BERT, ELECTRA, DistilBERT
3. Unigram - starts from a large set of candidate subwords and prunes it down based on a unigram language-model likelihood
4. SentencePiece - ALBERT, XLNet, T5
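As an illustration only (not the exact procedure used by any particular model), a small BPE tokenizer can be trained with the Hugging Face tokenizers library, assuming it is installed; the toy corpus and vocab_size below are placeholders:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()                 # space-level pre-tokenization
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[PAD]"])

toy_corpus = ["I have a meal", "I am student"]             # placeholder corpus
bpe_tokenizer.train_from_iterator(toy_corpus, trainer)
print(bpe_tokenizer.encode("I have a meal").tokens)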
Let's use the subword tokenization algorithm used by the BERT model to run a language modeling task.
A subword tokenizer can be loaded easily with the transformers library.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# example of subword tokenization
print(tokenizer.tokenize('Natural language expert training'))
print(tokenizer.tokenize('NewJeans release new single OMG'))
['Natural', 'language', 'expert', 'training']
['New', '##J', '##ean', '##s', 'release', 'new', 'single', 'O', '##MG']
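Next, we build a corpus whose dictionary is populated with these subword tokens. The Corpus class below relies on a Dictionary class that maps tokens to indices; it is not defined in this post, but the version in the referenced pytorch word_language_model example looks roughly like this:

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}   # token -> index
        self.idx2word = []   # index -> token

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)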
import os
import torch

class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = tokenizer.tokenize(line.strip()) + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = tokenizer.tokenize(line.strip()) + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids
Let's declare the model and look at the number of parameters.
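The snippet below uses args, RNNModel, and count_parameters, which are not defined in this post; args and RNNModel come from the referenced pytorch word_language_model example, and count_parameters is assumed to be a small helper such as:

def count_parameters(module):
    # number of trainable parameters in a module
    return sum(p.numel() for p in module.parameters() if p.requires_grad)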
subword_corpus = Corpus('./data/text-2')
ntokens = len(subword_corpus.dictionary)
subwordmodel = RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout)
print(f"Word embedding parameter ๊ฐ์: {count_parameters(subwordmodel.encoder)}")
print(f"RNN parameter ๊ฐ์: {count_parameters(subwordmodel.rnn)}")
Let's look at the performance of the subword-based language model.
###############################################################################
# Load data
###############################################################################
# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# โ a g m s โ
# โ b h n t โ
# โ c i o u โ
# โ d j p v โ
# โ e k q w โ
# โ f l r x โ.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
# batch processing.
def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)
eval_batch_size = 10
train_data = batchify(subword_corpus.train, args.batch_size)
val_data = batchify(subword_corpus.valid, eval_batch_size)
test_data = batchify(subword_corpus.test, eval_batch_size)
###############################################################################
# Build the model
###############################################################################
model = RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout).to(device)
criterion = nn.NLLLoss()
###############################################################################
# Training code1 - define functions
###############################################################################
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)
# get_batch subdivides the source data into chunks of length args.bptt.
# If source is equal to the example output of the batchify function, with
# a bptt-limit of 2, we'd get the following two Variables for i = 0:
# โ a g m s โ โ b h n t โ
# โ b h n t โ โ c i o u โ
# Note that despite the name of the function, the subdivision of data is not
# done along the batch dimension (i.e. dimension 1), since that was handled
# by the batchify function. The chunks are along dimension 0, corresponding
# to the seq_len dimension in the LSTM.
def get_batch(source, i):
    seq_len = min(args.bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target
def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(subword_corpus.dictionary)
    hidden = model.init_hidden(eval_batch_size)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, args.bptt):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)
            total_loss += len(data) * criterion(output, targets).item()
    return total_loss / (len(data_source) - 1)
def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(subword_corpus.dictionary)
    hidden = model.init_hidden(args.batch_size)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, args.bptt)):
        data, targets = get_batch(train_data, i)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        model.zero_grad()
        hidden = repackage_hidden(hidden)
        output, hidden = model(data, hidden)
        loss = criterion(output, targets)
        loss.backward()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
        for p in model.parameters():
            p.data.add_(p.grad, alpha=-lr)

        total_loss += loss.item()

        if batch % args.log_interval == 0 and batch > 0:
            cur_loss = total_loss / args.log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                  'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // args.bptt, lr,
                elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()
        if args.dry_run:
            break
###############################################################################
# Training code2 - run
###############################################################################
# Loop over epochs.
lr = args.lr
best_val_loss = None
# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, args.epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
              'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                         val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(args.save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')
# Load the best saved model.
with open(args.save, 'rb') as f:
    model = torch.load(f)
    # after load the rnn params are not a continuous chunk of memory
    # this makes them a continuous chunk, and will speed up forward pass
    # Currently, only rnn model supports flatten_parameters function.
    if args.model in ['RNN_TANH', 'RNN_RELU', 'LSTM', 'GRU']:
        model.rnn.flatten_parameters()
# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
test_loss, math.exp(test_loss)))
print('=' * 89)
We load the trained model, feed it a random word as input, and generate a fixed number of words.
The generated sequence is decoded (i.e., each id is converted back to a word using idx2word) and written to the file generate.txt.
import torch
import easydict

# Model parameters.
test_args = easydict.EasyDict({
    "data": './data/text-2',      # location of data corpus
    "checkpoint": './model.pt',   # model checkpoint to use
    "outf": 'generate.txt',       # output file for generated text
    "words": 1000,                # number of words to generate
    "seed": 1111,                 # random seed
    "cuda": True,                 # use CUDA
    "temperature": 1.0,           # temperature - higher will increase diversity
    "log_interval": 100           # reporting interval
})
# Set the random seed manually for reproducibility.
torch.manual_seed(test_args.seed)
if torch.cuda.is_available():
    if not test_args.cuda:
        print("WARNING: You have a CUDA device, so you should probably run with --cuda")
device = torch.device("cuda" if test_args.cuda else "cpu")

if test_args.temperature < 1e-3:
    raise ValueError("temperature has to be greater or equal to 1e-3")
with open(test_args.checkpoint, 'rb') as f:
    model = torch.load(f).to(device)
model.eval()
# corpus = Corpus(test_args.data)
# ntokens = len(subword_corpus.dictionary)
hidden = model.init_hidden(1)
input = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)
with open(test_args.outf, 'w') as outf:
    with torch.no_grad():  # no tracking history
        for i in range(test_args.words):
            output, hidden = model(input, hidden)
            word_weights = output.squeeze().div(test_args.temperature).exp().cpu()
            word_idx = torch.multinomial(word_weights, 1)[0]
            input.fill_(word_idx)

            word = subword_corpus.dictionary.idx2word[word_idx]
            outf.write(word + ('\n' if i % 20 == 19 else ' '))

            if i % test_args.log_interval == 0:
                print('| Generated {}/{} words'.format(i, test_args.words))
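Note that BERT's WordPiece tokenizer marks word-internal pieces with a '##' prefix, so the generated subwords can be merged back into readable words. One way to do this, assuming the same transformers tokenizer loaded earlier, is:

generated = ['New', '##J', '##ean', '##s', 'release', 'new', 'single']  # example pieces
print(tokenizer.convert_tokens_to_string(generated))  # "NewJeans release new single"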
Source
https://github.com/pytorch/examples/tree/main/word_language_model