In natural language processing, a token is a user-defined unit of text.
Splitting a raw string into tokens is called tokenization, and we build a vocab (vocabulary) from these tokens.
In other words, tokenization converts the given input data into units that an NLP model can recognize.
For example, tokenizing the sentence "I am student" splits it into I, am, student.
Once we have I am student = [I, am, student], we convert each word into its index in the vocab. If I is the 10th entry, am the 12th, and student the 100th,
then [I, am, student] = [10, 12, 100], and this sequence of indices is what gets fed into the model as input.
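As a minimal sketch of this pipeline (the toy indices below come from the example above and are made up for illustration):

vocab = {"I": 10, "am": 12, "student": 100}   # toy vocab; real vocabularies are built from a corpus

tokens = "I am student".split()               # ['I', 'am', 'student']
ids = [vocab[token] for token in tokens]      # [10, 12, 100]
print(tokens, ids)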
Tokenizers
There are three main kinds of tokenizers.
1. Word-based tokenizer - tokenizes at the word level; it suffers from the out-of-vocabulary (OOV) problem, because any word not in the vocabulary is mapped to an unknown token.
"I have a meal"์ด๋ผ๊ณ ํ๋ ๋ฌธ์ฅ์ ๊ฐ์ง๊ณ word tokenization์ ํ๋ฉด ๋ค์๊ณผ ๊ฐ์ต๋๋ค.
- ['I', 'have', 'a', 'meal']
In English, words are mostly delimited by spaces, so word tokenization can be implemented easily with .split() (as in the sketch above).
For English, word tokenization can therefore also be called space tokenization,
and it is widely used as a pre-tokenization step before subword tokenization.
"๋๋ ๋ฐฅ์ ๋จน๋๋ค"๋ผ๋ ๋ฌธ์ฅ์ word tokenizationํ๋ฉด ๋ค์๊ณผ ๊ฐ์ต๋๋ค.
- ['๋', '๋', '๋ฐฅ', '์', '๋จน๋๋ค']
ํ๊ตญ์ด์์ "๋จ์ด"๋ ๊ณต๋ฐฑ(space)์ ๊ธฐ์ค์ผ๋ก ์ ์๋์ง ์์ต๋๋ค. ์ด๋ ํ๊ตญ์ด๊ฐ ๊ฐ๊ณ ์๋ "๊ต์ฐฉ์ด"๋ก์์ ํน์ง ๋๋ฌธ์
๋๋ค.
์ฒด์ธ ๋ค์ ์กฐ์ฌ๊ฐ ๋ถ๋ ๊ฒ์ด ๋ํ์ ์ธ ํน์ง์ด๋ฉฐ ์๋ฏธ ๋จ์๊ฐ ๊ตฌ๋ถ๋๊ณ ์๋ฆฝ์ฑ์ด ์๊ธฐ ๋๋ฌธ์ ์กฐ์ฌ๋ "๋จ์ด"์
๋๋ค.
ํ๊ตญ์ด์์๋ pre-tokenization ๋ฐฉ๋ฒ์ผ๋ก space tokenization์ ์ฌ์ฉํ์ง ์๊ณ ํํ์ ๋ถ์๊ธฐ๋ฅผ ํ์ฉํ๊ณ ์์ต๋๋ค.
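As a hedged sketch, Korean pre-tokenization with a morphological analyzer could look like the following. This assumes the KoNLPy package and its Okt analyzer are installed; Okt is just one of several analyzers, not the only choice.

from konlpy.tag import Okt

okt = Okt()
print(okt.morphs("๋๋ ๋ฐฅ์ ๋จน๋๋ค"))  # e.g., ['๋', '๋', '๋ฐฅ', '์', '๋จน๋๋ค']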
2. Character-based tokenizer - splits English text into individual letters and Korean text into individual syllables. Training is harder, sequences become long, and it does not perform well.
Character-based tokenization gives the following:
"I have a meal" -> ['I', 'h', 'a', 'v', 'e', 'a', 'm', 'e', 'a', 'l']
"๋๋ ๋ฐฅ์ ๋จน๋๋ค" -> ['๋', '๋', '๋ฐฅ', '์', '๋จน', '๋', '๋ค']
3. Subword-based tokenizer - the most commonly used approach; sequences become shorter and it performs well.
*What is a subword?
A subword is one of the pieces obtained when a single word is split into several units.
"subword"๋ฅผ subword ๋จ์๋ก ๋ํ๋ธ ํ๋์ ์์๋ ๋ค์๊ณผ ๊ฐ์ต๋๋ค.
"sub" + "word"
Here the word "subword" is written as two subwords: the prefix sub and the root word.
It can also be split in many other ways (e.g., "su" + "bword", "s" + "ubword", "subwor" + "d").
When subword tokenization is applied, the sentences might be tokenized as follows.
Example 1
"I have a meal" -> ['I', 'hav', 'e', 'a', 'me', 'al']
"๋๋ ๋ฐฅ์ ๋จน๋๋ค" -> ['๋', '๋', '๋ฐฅ', '์', '๋จน๋', '๋ค']
The sentences are tokenized into subword units, which are smaller than whole words.
As mentioned above, several different segmentations are possible.
Example 2
"I have a meal" -> ['I', 'ha', 've', 'a', 'mea', 'l']
"๋๋ ๋ฐฅ์ ๋จน๋๋ค" -> ['๋', '๋', '๋ฐฅ', '์', '๋จน', '๋๋ค']
However, subwords are in principle never formed across whitespace boundaries.
For example, tokenization is not performed like this:
Example 3
"I have a meal" -> ['Iha', 've', 'am', 'ea', 'l']
"๋๋ ๋ฐฅ์ ๋จน๋๋ค" -> ['๋๋๋ฐฅ', '์๋จน', '๋๋ค']
An advantage of subword tokenization is that it is relatively free from the out-of-vocabulary (OOV) problem.
Subword vocabularies are usually built by starting from the smallest spelling units and adding progressively longer subwords one at a time.
For example, in English you start from the letters a-z and keep adding two-letter, three-letter, four-letter subwords, and so on;
since any word can then be composed from these subwords and tokenization is performed on that basis, you are largely free of the OOV problem as long as you are not tokenizing a different language.
Subword tokenization algorithms
1. BPE (Byte-Pair Encoding) - used by GPT (a small training sketch follows this list)
2. WordPiece - probabilistically merges the pair with the highest likelihood - BERT, ELECTRA, DistilBERT
3. Unigram - starts from a large set of candidate subwords and prunes it down based on a unigram language-model likelihood
4. SentencePiece - ALBERT, XLNet, T5
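As an illustration only (not the exact procedure used by any particular model), a small BPE tokenizer can be trained with the Hugging Face tokenizers library, assuming it is installed; the toy corpus and vocab_size below are placeholders:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()                 # space-level pre-tokenization
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[PAD]"])

toy_corpus = ["I have a meal", "I am student"]             # placeholder corpus
bpe_tokenizer.train_from_iterator(toy_corpus, trainer)
print(bpe_tokenizer.encode("I have a meal").tokens)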
Let's use the subword tokenization algorithm used by the BERT model to run a language modeling task.
A subword tokenizer can be loaded easily with the transformers library.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# example of subword tokenization
print(tokenizer.tokenize('Natural language expert training'))
print(tokenizer.tokenize('NewJeans release new single OMG'))
['Natural', 'language', 'expert', 'training']
['New', '##J', '##ean', '##s', 'release', 'new', 'single', 'O', '##MG']
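Next, we build a corpus whose dictionary is populated with these subword tokens. The Corpus class below relies on a Dictionary class that maps tokens to indices; it is not defined in this post, but the version in the referenced pytorch word_language_model example looks roughly like this:

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}   # token -> index
        self.idx2word = []   # index -> token

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)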
import os
import torch

class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = tokenizer.tokenize(line.strip()) + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = tokenizer.tokenize(line.strip()) + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids
Let's declare the model and look at the number of parameters.
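The snippet below uses args, RNNModel, and count_parameters, which are not defined in this post; args and RNNModel come from the referenced pytorch word_language_model example, and count_parameters is assumed to be a small helper such as:

def count_parameters(module):
    # number of trainable parameters in a module
    return sum(p.numel() for p in module.parameters() if p.requires_grad)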
subword_corpus = Corpus('./data/text-2')
ntokens = len(subword_corpus.dictionary)
subwordmodel = RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout)
print(f"Word embedding parameter ๊ฐ์: {count_parameters(subwordmodel.encoder)}")
print(f"RNN parameter ๊ฐ์: {count_parameters(subwordmodel.rnn)}")
Let's look at the performance of the subword-based language model.
###############################################################################
# Load data
###############################################################################
# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# โ a g m s โ
# โ b h n t โ
# โ c i o u โ
# โ d j p v โ
# โ e k q w โ
# โ f l r x โ.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
# batch processing.
def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)
eval_batch_size = 10
train_data = batchify(subword_corpus.train, args.batch_size)
val_data = batchify(subword_corpus.valid, eval_batch_size)
test_data = batchify(subword_corpus.test, eval_batch_size)
###############################################################################
# Build the model
###############################################################################
model = RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout).to(device)
criterion = nn.NLLLoss()
###############################################################################
# Training code1 - define functions
###############################################################################
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)
# get_batch subdivides the source data into chunks of length args.bptt.
# If source is equal to the example output of the batchify function, with
# a bptt-limit of 2, we'd get the following two Variables for i = 0:
# โ a g m s โ โ b h n t โ
# โ b h n t โ โ c i o u โ
# Note that despite the name of the function, the subdivision of data is not
# done along the batch dimension (i.e. dimension 1), since that was handled
# by the batchify function. The chunks are along dimension 0, corresponding
# to the seq_len dimension in the LSTM.
def get_batch(source, i):
    seq_len = min(args.bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target
def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(subword_corpus.dictionary)
    hidden = model.init_hidden(eval_batch_size)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, args.bptt):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)
            total_loss += len(data) * criterion(output, targets).item()
    return total_loss / (len(data_source) - 1)
def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(subword_corpus.dictionary)
    hidden = model.init_hidden(args.batch_size)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, args.bptt)):
        data, targets = get_batch(train_data, i)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        model.zero_grad()
        hidden = repackage_hidden(hidden)
        output, hidden = model(data, hidden)
        loss = criterion(output, targets)
        loss.backward()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
        for p in model.parameters():
            p.data.add_(p.grad, alpha=-lr)

        total_loss += loss.item()

        if batch % args.log_interval == 0 and batch > 0:
            cur_loss = total_loss / args.log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                  'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // args.bptt, lr,
                elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()
        if args.dry_run:
            break
###############################################################################
# Training code2 - run
###############################################################################
# Loop over epochs.
lr = args.lr
best_val_loss = None
# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, args.epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
              'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                         val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(args.save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')
# Load the best saved model.
with open(args.save, 'rb') as f:
    model = torch.load(f)
    # after load the rnn params are not a continuous chunk of memory
    # this makes them a continuous chunk, and will speed up forward pass
    # Currently, only rnn model supports flatten_parameters function.
    if args.model in ['RNN_TANH', 'RNN_RELU', 'LSTM', 'GRU']:
        model.rnn.flatten_parameters()
# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
test_loss, math.exp(test_loss)))
print('=' * 89)
We load the trained model, feed it a random word as input, and generate a fixed number of words.
The generated sequence is decoded (i.e., each id is converted back to a word using idx2word) and written to the file generate.txt.
import torch
import easydict

# Model parameters.
test_args = easydict.EasyDict({
    "data": './data/text-2',      # location of data corpus
    "checkpoint": './model.pt',   # model checkpoint to use
    "outf": 'generate.txt',       # output file for generated text
    "words": 1000,                # number of words to generate
    "seed": 1111,                 # random seed
    "cuda": True,                 # use CUDA
    "temperature": 1.0,           # temperature - higher will increase diversity
    "log_interval": 100           # reporting interval
})
# Set the random seed manually for reproducibility.
torch.manual_seed(test_args.seed)
if torch.cuda.is_available():
    if not test_args.cuda:
        print("WARNING: You have a CUDA device, so you should probably run with --cuda")
device = torch.device("cuda" if test_args.cuda else "cpu")

if test_args.temperature < 1e-3:
    raise ValueError("temperature has to be greater or equal to 1e-3")
with open(test_args.checkpoint, 'rb') as f:
    model = torch.load(f).to(device)
model.eval()
# corpus = Corpus(test_args.data)
# ntokens = len(subword_corpus.dictionary)
hidden = model.init_hidden(1)
input = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)
with open(test_args.outf, 'w') as outf:
    with torch.no_grad():  # no tracking history
        for i in range(test_args.words):
            output, hidden = model(input, hidden)
            word_weights = output.squeeze().div(test_args.temperature).exp().cpu()
            word_idx = torch.multinomial(word_weights, 1)[0]
            input.fill_(word_idx)

            word = subword_corpus.dictionary.idx2word[word_idx]
            outf.write(word + ('\n' if i % 20 == 19 else ' '))

            if i % test_args.log_interval == 0:
                print('| Generated {}/{} words'.format(i, test_args.words))
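Note that BERT's WordPiece tokenizer marks word-internal pieces with a '##' prefix, so the generated subwords can be merged back into readable words. One way to do this, assuming the same transformers tokenizer loaded earlier, is:

generated = ['New', '##J', '##ean', '##s', 'release', 'new', 'single']  # example pieces
print(tokenizer.convert_tokens_to_string(generated))  # "NewJeans release new single"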
Source
https://github.com/pytorch/examples/tree/main/word_language_model