์ž์—ฐ์–ด ์ฒ˜๋ฆฌ/Today I learned :

[์ž์—ฐ์–ด ์ฒ˜๋ฆฌ] ํ† ํฐํ™”์™€ ํ† ํฌ๋‚˜์ด์ € ์ข…๋ฅ˜

์ฃผ์˜ ๐Ÿฑ 2023. 1. 5. 11:30

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ ํ† ํฐ์€ ์‚ฌ์šฉ์ž ์ง€์ • ๋‹จ์œ„์ž…๋‹ˆ๋‹ค. 

raw string์„ ํ† ํฐ ๋‹จ์œ„๋กœ ์ชผ๊ฐœ๋Š” ๊ฒƒ์„ ํ† ํฐํ™”๋ผ๊ณ  ํ•˜๊ณ , ์šฐ๋ฆฌ๋Š” ์ด ํ† ํฐ๋“ค๋กœ vocab ์‚ฌ์ „์„ ๋งŒ๋“ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. 

์ฆ‰, ํ† ํฐํ™”๋Š” ์ฃผ์–ด์ง„ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ชจ๋ธ์ด ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋Š” ๋‹จ์œ„๋กœ ๋ณ€ํ™˜ํ•ด์ฃผ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. 

 

For example, after tokenization the sentence "I am student" is split into I, am, student.

Once "I am student" = [I, am, student], we convert these words into their indices in the vocab. If I is at index 10, am at index 12, and student at index 100, then

[I, am, student] = [10, 12, 100], and these index vectors go into the model as input.
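A minimal sketch of this token-to-index lookup (the vocab and indices here are the toy values from the example above):

# Toy example: tokenize by whitespace, then map each token to its vocab index.
sentence = "I am student"
tokens = sentence.split()                      # ['I', 'am', 'student']
vocab = {"I": 10, "am": 12, "student": 100}    # assumed toy vocab
input_ids = [vocab[token] for token in tokens]
print(input_ids)                               # [10, 12, 100]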

 

ํ† ํฌ๋‚˜์ด์ €

There are three kinds of tokenizers that perform tokenization.

1. Word-based tokenizer - tokenizes at the word level. It suffers from the OOV (out-of-vocabulary) problem, because any word not in the vocab is treated as unknown.

"I have a meal"์ด๋ผ๊ณ  ํ•˜๋Š” ๋ฌธ์žฅ์„ ๊ฐ€์ง€๊ณ  word tokenization์„ ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. 

- ['I', 'have', 'a', 'meal']

์˜์–ด์˜ ๊ฒฝ์šฐ ๋Œ€๋ถ€๋ถ„ space๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‹จ์–ด๊ฐ€ ์ •์˜๋˜๊ธฐ ๋•Œ๋ฌธ์— .split()์„ ์ด์šฉํ•ด ์‰ฝ๊ฒŒ word tokenization์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
์˜์–ด์—์„œ word tokenization์€ space tokenization์ด๋ผ๊ณ ๋„ ํ•  ์ˆ˜ ์žˆ๊ณ , 
subword tokenization ์ด์ „์— ์ˆ˜ํ–‰๋˜๋Š” pre-tokenization ๋ฐฉ๋ฒ•์œผ๋กœ๋„ ๋งŽ์ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

"๋‚˜๋Š” ๋ฐฅ์„ ๋จน๋Š”๋‹ค"๋ผ๋Š” ๋ฌธ์žฅ์„ word tokenizationํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
- ['๋‚˜', '๋Š”', '๋ฐฅ', '์„', '๋จน๋Š”๋‹ค']

ํ•œ๊ตญ์–ด์—์„œ "๋‹จ์–ด"๋Š” ๊ณต๋ฐฑ(space)์„ ๊ธฐ์ค€์œผ๋กœ ์ •์˜๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด๋Š” ํ•œ๊ตญ์–ด๊ฐ€ ๊ฐ–๊ณ  ์žˆ๋Š” "๊ต์ฐฉ์–ด"๋กœ์„œ์˜ ํŠน์ง• ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. 
์ฒด์–ธ ๋’ค์— ์กฐ์‚ฌ๊ฐ€ ๋ถ™๋Š” ๊ฒƒ์ด ๋Œ€ํ‘œ์ ์ธ ํŠน์ง•์ด๋ฉฐ ์˜๋ฏธ ๋‹จ์œ„๊ฐ€ ๊ตฌ๋ถ„๋˜๊ณ  ์ž๋ฆฝ์„ฑ์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์กฐ์‚ฌ๋Š” "๋‹จ์–ด"์ž…๋‹ˆ๋‹ค.

ํ•œ๊ตญ์–ด์—์„œ๋Š” pre-tokenization ๋ฐฉ๋ฒ•์œผ๋กœ space tokenization์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  ํ˜•ํƒœ์†Œ ๋ถ„์„๊ธฐ๋ฅผ ํ™œ์šฉํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

 

2. Character-based tokenizer - splits English text into individual letters of the alphabet, and Korean text into individual syllables. It is hard to train, produces long sequences, and does not show good performance.

Character-based tokenization looks like this:

"I have a meal" -> ['I', 'h', 'a', 'v', 'e', 'a', 'm', 'e', 'a', 'l']
"๋‚˜๋Š” ๋ฐฅ์„ ๋จน๋Š”๋‹ค" -> ['๋‚˜', '๋Š”', '๋ฐฅ', '์„', '๋จน', '๋Š”', '๋‹ค']

 

3. Subword-based tokenizer - the most commonly used approach; sequences become shorter and it shows good performance.

*What is a subword?


Subword๋Š” ํ•˜๋‚˜์˜ ๋‹จ์–ด๋ฅผ ์—ฌ๋Ÿฌ๊ฐœ์˜ ๋‹จ์œ„๋กœ ๋ถ„๋ฆฌํ–ˆ์„ ๋•Œ ํ•˜๋‚˜์˜ ๋‹จ์œ„๋ฅผ ๊ฐ€๋ฆฌํ‚ต๋‹ˆ๋‹ค. 

"subword"๋ฅผ subword ๋‹จ์œ„๋กœ ๋‚˜ํƒ€๋‚ธ ํ•˜๋‚˜์˜ ์˜ˆ์‹œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

"sub" + "word"

sub๋ผ๋Š” ์ ‘๋‘์‚ฌ์™€ word๋ผ๊ณ  ํ•˜๋Š” ์–ด๊ทผ์œผ๋กœ ๋‚˜๋ˆ„์–ด "subword"๋ผ๊ณ  ํ•˜๋Š” word๋ฅผ 2๊ฐœ์˜ subword๋กœ ๋‚˜ํƒ€๋ƒˆ์Šต๋‹ˆ๋‹ค.

์ด์™ธ์—๋„ ๋‹ค์–‘ํ•œ ํ˜•ํƒœ์˜ subword๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (e.g., "su" + "bword", "s" + "ubword", "subwor" + "d")

 

Subword tokenization์„ ์ ์šฉํ–ˆ์„ ๋•Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด tokenization์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Example 1

"I have a meal" -> ['I', 'hav', 'e', 'a', 'me', 'al']
"๋‚˜๋Š” ๋ฐฅ์„ ๋จน๋Š”๋‹ค" -> ['๋‚˜', '๋Š”', '๋ฐฅ', '์„', '๋จน๋Š”', '๋‹ค']

The sentence is tokenized not at the word level, but into subword units that are finer than words.

์œ„์—์„œ ๋ง์”€๋“œ๋ฆฐ ๊ฒƒ๊ณผ ๊ฐ™์ด ์—ฌ๋Ÿฌ๊ฐ€์ง€ ๊ฒฝ์šฐ์˜ ์ˆ˜๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

Example 2

"I have a meal" -> ['I', 'ha', 've', 'a', 'mea', 'l']
"๋‚˜๋Š” ๋ฐฅ์„ ๋จน๋Š”๋‹ค" -> ['๋‚˜', '๋Š”', '๋ฐฅ', '์„', '๋จน', '๋Š”๋‹ค']

๊ทธ๋ ‡์ง€๋งŒ ๊ธฐ๋ณธ์ ์œผ๋กœ ๊ณต๋ฐฑ์„ ๋„˜์–ด์„  subword๋ฅผ ๊ตฌ์„ฑํ•˜์ง„ ์•Š์Šต๋‹ˆ๋‹ค.
์˜ˆ๋ฅผ ๋“ค์–ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด tokenizaiton์„ ์ˆ˜ํ–‰ํ•˜์ง„ ์•Š์Šต๋‹ˆ๋‹ค.

Example 3

"I have a meal" -> ['Iha', 've', 'am', 'ea', 'l']
"๋‚˜๋Š” ๋ฐฅ์„ ๋จน๋Š”๋‹ค" -> ['๋‚˜๋Š”๋ฐฅ', '์„๋จน', '๋Š”๋‹ค']

 

subword tokenization์˜ ์žฅ์ ์€ Out-of-vocabulary (OOV) ๋ฌธ์ œ์—์„œ ์ƒ๋Œ€์ ์œผ๋กœ ์ž์œ ๋กญ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ผ๋ฐ˜์ ์œผ๋กœ subword๋“ค์€ ์ตœ์†Œ ์ฒ ์ž ๋‹จ์œ„์—์„œ ํ•˜๋‚˜์”ฉ ๋” ๊ธด subword๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, ์˜์–ด์˜ ๊ฒฝ์šฐ a~z์˜ ์•ŒํŒŒ๋ฒณ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด์„œ ๋‘๊ธ€์ž, ์„ธ๊ธ€์ž, ๋„ค๊ธ€์ž subword ๋“ฑ์œผ๋กœ ํ™•์žฅํ•ด๋‚˜๊ฐ€๋ฉฐ 
subword๋ฅผ ์ถ”๊ฐ€ํ•ด ๋‹จ์–ด๋ฅผ ๊ตฌ์„ฑํ•˜๊ณ  ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ subword tokenization์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค๋ฅธ ์–ธ์–ด๋ฅผ tokenizationํ•˜์ง€ ์•Š๋Š”๋‹ค๋ฉด 
OOV ๋ฌธ์ œ์—์„œ ์ž์œ ๋กญ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

Subword tokenization algorithms

 

1. BPE (Byte-Pair Encoding) - used in GPT

2. WordPiece - probabilistically merges the pairs with the highest likelihood - BERT, ELECTRA, DistilBERT

3. Unigram - starts from a large set of the most frequently used substrings and prunes the vocabulary down probabilistically

4. SentencePiece - ALBERT, XLNet, T5
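A quick way to see how these algorithms split text differently is to load pretrained tokenizers from the transformers library (a sketch; the checkpoint names are the standard Hugging Face ones, and the vocabularies are downloaded on first use):

# Compare subword splits from a BPE, a WordPiece, and a SentencePiece tokenizer.
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                         # BPE (GPT-2)
wordpiece = AutoTokenizer.from_pretrained("bert-base-cased")        # WordPiece (BERT)
sentencepiece = AutoTokenizer.from_pretrained("xlnet-base-cased")   # SentencePiece (XLNet)

for tok in (bpe, wordpiece, sentencepiece):
    print(tok.tokenize("Tokenization algorithms differ"))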

 


 

BERT ๋ชจ๋ธ์—์„œ ์‚ฌ์šฉํ•œ subword tokenization algorithm์„ ์ด์šฉํ•ด language modeling task๋ฅผ ์ˆ˜ํ–‰ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 
subword tokenizer๋Š” transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ด์šฉํ•ด ์‰ฝ๊ฒŒ ๋ถˆ๋Ÿฌ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# subword tokenization ์˜ˆ์‹œ
print(tokenizer.tokenize('Natural language expert training'))
print(tokenizer.tokenize('NewJeans release new single OMG'))

['Natural', 'language', 'expert', 'training']

['New', '##J', '##ean', '##s', 'release', 'new', 'single', 'O', '##MG']
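Next, we build a corpus whose dictionary is filled with these subword tokens. The Corpus class below is adapted from the pytorch/examples word_language_model code linked at the end of this post; it relies on a small Dictionary class (word2idx / idx2word) and on os/torch imports, shown here for reference:

import os
import torch

class Dictionary(object):
    """Maps tokens to integer ids and back (from the word_language_model example)."""
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)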

 

class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = tokenizer.tokenize(line.strip()) + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = tokenizer.tokenize(line.strip()) + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids

๋ชจ๋ธ์„ ์„ ์–ธํ•˜๊ณ  parameter์˜ ๊ฐœ์ˆ˜๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

subword_corpus = Corpus('./data/text-2')
ntokens = len(subword_corpus.dictionary)
subwordmodel = RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout)
print(f"Word embedding parameter ๊ฐœ์ˆ˜: {count_parameters(subwordmodel.encoder)}")
print(f"RNN parameter ๊ฐœ์ˆ˜: {count_parameters(subwordmodel.rnn)}")

subword ๊ธฐ๋ฐ˜์˜ ์–ธ์–ด ๋ชจ๋ธ ์„ฑ๋Šฅ์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค

###############################################################################
# Load data
###############################################################################

# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# โ”Œ a g m s โ”
# โ”‚ b h n t โ”‚
# โ”‚ c i o u โ”‚
# โ”‚ d j p v โ”‚
# โ”‚ e k q w โ”‚
# โ”” f l r x โ”˜.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
# batch processing.

def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

eval_batch_size = 10
train_data = batchify(subword_corpus.train, args.batch_size)
val_data = batchify(subword_corpus.valid, eval_batch_size)
test_data = batchify(subword_corpus.test, eval_batch_size)
###############################################################################
# Build the model
###############################################################################

model = RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout).to(device)
criterion = nn.NLLLoss()
###############################################################################
# Training code1 - define functions
###############################################################################

def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""

    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)


# get_batch subdivides the source data into chunks of length args.bptt.
# If source is equal to the example output of the batchify function, with
# a bptt-limit of 2, we'd get the following two Variables for i = 0:
# โ”Œ a g m s โ” โ”Œ b h n t โ”
# โ”” b h n t โ”˜ โ”” c i o u โ”˜
# Note that despite the name of the function, the subdivison of data is not
# done along the batch dimension (i.e. dimension 1), since that was handled
# by the batchify function. The chunks are along dimension 0, corresponding
# to the seq_len dimension in the LSTM.

def get_batch(source, i):
    seq_len = min(args.bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target


def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(subword_corpus.dictionary)
    hidden = model.init_hidden(eval_batch_size)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, args.bptt):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)
            total_loss += len(data) * criterion(output, targets).item()
    return total_loss / (len(data_source) - 1)


def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(subword_corpus.dictionary)
    hidden = model.init_hidden(args.batch_size)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, args.bptt)):
        data, targets = get_batch(train_data, i)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        model.zero_grad()

        hidden = repackage_hidden(hidden)
        output, hidden = model(data, hidden)

        loss = criterion(output, targets)
        loss.backward()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
        for p in model.parameters():
            p.data.add_(p.grad, alpha=-lr)

        total_loss += loss.item()

        if batch % args.log_interval == 0 and batch > 0:
            cur_loss = total_loss / args.log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // args.bptt, lr,
                elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()
        if args.dry_run:
            break
###############################################################################
# Training code2 - run 
###############################################################################

# Loop over epochs.
lr = args.lr
best_val_loss = None

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, args.epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(args.save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

# Load the best saved model.
with open(args.save, 'rb') as f:
    model = torch.load(f)
    # after load the rnn params are not a continuous chunk of memory
    # this makes them a continuous chunk, and will speed up forward pass
    # Currently, only rnn model supports flatten_parameters function.
    if args.model in ['RNN_TANH', 'RNN_RELU', 'LSTM', 'GRU']:
        model.rnn.flatten_parameters()

# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

ํ•™์Šต์ด ์™„๋ฃŒ๋œ ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์™€ random ํ•œ ๋‹จ์–ด๋ฅผ input ์œผ๋กœ ๋„ฃ์–ด์ค€ ํ›„ ์ •ํ•ด์ง„ ๊ฐœ์ˆ˜์˜ ๋‹จ์–ด๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
์ƒ์„ฑํ•œ ๋ฌธ์žฅ์„ decode ํ•˜์—ฌ (์ฆ‰, idx2word ๋ฅผ ์ด์šฉํ•ด id ๋ฅผ word ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ) generate.txt ํŒŒ์ผ์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.

import torch
import easydict

# Model parameters.
test_args = easydict.EasyDict({
    "data"      : './data/text-2',  # location of data corpus
    "checkpoint": './model.pt',         # model checkpoint to use
    "outf"      : 'generate.txt',       # output file for generated text
    "words"     : 1000,                 # number of words to generate
    "seed"      : 1111,                 # random seed
    "cuda"      : True,                 # use CUDA
    "temperature": 1.0,                 # temperature - higher will increase diversity
    "log_interval": 100                 # reporting interval
})

# Set the random seed manually for reproducibility.
torch.manual_seed(test_args.seed)
if torch.cuda.is_available():
    if not test_args.cuda:
        print("WARNING: You have a CUDA device, so you should probably run with --cuda")

device = torch.device("cuda" if test_args.cuda else "cpu")

if test_args.temperature < 1e-3:
    raise ValueError("temperature has to be greater or equal 1e-3")

with open(test_args.checkpoint, 'rb') as f:
    model = torch.load(f).to(device)
model.eval()

# corpus = Corpus(test_args.data)
# ntokens = len(subword_corpus.dictionary)

hidden = model.init_hidden(1)
input = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)

with open(test_args.outf, 'w') as outf:
    with torch.no_grad():  # no tracking history
        for i in range(test_args.words):
            output, hidden = model(input, hidden)
            word_weights = output.squeeze().div(test_args.temperature).exp().cpu()
            word_idx = torch.multinomial(word_weights, 1)[0]
            input.fill_(word_idx)

            word = subword_corpus.dictionary.idx2word[word_idx]

            outf.write(word + ('\n' if i % 20 == 19 else ' '))

            if i % test_args.log_interval == 0:
                print('| Generated {}/{} words'.format(i, test_args.words))

 

Source

https://github.com/pytorch/examples/tree/main/word_language_model

๋ฐ˜์‘ํ˜•