์ž์—ฐ์–ด ์ฒ˜๋ฆฌ/Today I learned :

[์ž์—ฐ์–ด ์ฒ˜๋ฆฌ] ํŒŒ์ดํ† ์น˜ LSTM ๊ตฌํ˜„

์ฃผ์˜ ๐Ÿฑ 2023. 1. 5. 17:40
728x90

์ด๋ฒˆ์—๋Š” ์‹ค์ œ๋กœ LSTM์„ ๊ตฌํ˜„ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)
`torch.nn` ์„ ํ™œ์šฉํ•˜์—ฌ LSTM cell ์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

* `input_size` : The number of expected features in the input x

* `hidden_size` : The number of features in the hidden state h



# Signature: nn.LSTM(input_size, hidden_size)
# Create an LSTM with input_size=3 and hidden_size=3.
lstm = nn.LSTM(3, 3)

LSTM cell ์„ ์ƒ์„ฑํ•œ ํ›„์—๋Š”, ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด๊ฐˆ input x, hidden state h, cell state c ๋ฅผ ์ƒ์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์œ„์—์„œ ์ •ํ•œ input_size ์™€ hidden_size ๋ฅผ ๊ณ ๋ คํ•˜์—ฌ inputs ์™€ hidden (h ์™€ c) ์„ ์ƒ์„ฑํ•ด ๋ด…์‹œ๋‹ค.

# Create an input with sequence length 5.
# Since input_size was set to 3, we need five 3-dimensional vectors.
inputs = [torch.randn(1, 3) for _ in range(5)]

# The lstm takes both the input x and a hidden state h, so we create the hidden state as well.
# Since hidden_size was set to 3, each state is a 3-dimensional vector.
# The h passed to the lstm consists of the hidden state (as in an RNN) and the cell state,
# a concept introduced by the LSTM, so hidden must be a pair of tensors,
# each of shape (num_layers, batch, hidden_size) = (1, 1, 3).
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))

Method 1: For the input of sequence length 5, pass one element at a time through the LSTM, as in the sketch below.
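
A minimal sketch of Method 1, stepping through the sequence one element at a time (each element is reshaped to (1, 1, 3), i.e. (seq_len, batch, input_size), before being passed in):

for i in inputs:
    # Step through the sequence one element at a time.
    # After each step, hidden holds the latest hidden and cell states.
    out, hidden = lstm(i.view(1, 1, -1), hidden)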

Method 2: You can also pass the entire sequence through at once, as in the code below.

LSTM ์ด ๋ฐ˜ํ™˜ํ•˜๋Š” ์ถœ๋ ฅ์˜ ์ฒซ ๋ฒˆ์งธ ๊ฐ’์€ ์ „์ฒด ์‹œํ€€์Šค์— ๋Œ€ํ•œ ํ†ต๊ณผํ•œ hidden state ์ด๊ณ , ๋‘ ๋ฒˆ์งธ ๊ฐ’์€ ๋งˆ์ง€๋ง‰ step ์˜ hidden state ์ž…๋‹ˆ๋‹ค. out ๊ณผ hidden ์˜ size ๋ฅผ ๋น„๊ตํ•ด๋ณด์„ธ์š”.

inputs = torch.cat(inputs).view(len(inputs), 1, -1)  # For Method 2, concatenate the list of inputs into a single tensor.
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # Re-initialize the hidden state for Method 2.
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)
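
With these settings, out has shape (5, 1, 3), one hidden state per step of the sequence, while hidden is a tuple of two (1, 1, 3) tensors: the final hidden state and the final cell state.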

LSTM ์„ ์ด์šฉํ•ด Part-of-Speech (PoS) Tagging ์„ ํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์ค€๋น„ํ•ฉ๋‹ˆ๋‹ค.

  • training_data contains word sequences along with the PoS tag of each word.
  • word_to_ix: maps each word to an id so it can be used as input to the model.
  • tag_to_ix: maps each PoS tag to an id as well.
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    # Tags are: DET - determiner; NN - noun; V - verb
    # For example, the word "The" is a determiner
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}  # Assign each tag with a unique index

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}

 

Define the LSTMTagger module, which contains an embedding layer, an LSTM, and an output layer.

  • embeds: encodes the input ids with the embedding layer to produce the embeddings for the input.
  • lstm_out: passes the embeddings through the lstm and stores the hidden states for the entire sequence.
  • tag_space: linearly transforms the lstm hidden states into the space of possible tags (DET, NN, V).
  • tag_scores: applies a (log-)softmax to get a score for each tag.
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

model ์„ build ํ•˜๊ณ , ํ•™์Šต์— ํ•„์š”ํ•œ loss ํ•จ์ˆ˜์™€ optimizer ๋ฅผ ์„ ์–ธํ•ฉ๋‹ˆ๋‹ค.

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
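
nn.NLLLoss expects log-probabilities as input, which is why the forward pass above ends with F.log_softmax; the combination of the two is equivalent to applying nn.CrossEntropyLoss directly to the raw tag_space logits.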

Now we train the model on the training data. That is, we pass the input through the LSTMTagger to predict the PoS tag of each word, compare the predictions with the gold tags to compute the loss, and backpropagate the loss to update the model parameters.

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
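
After training, a minimal way to sanity-check the tagger is to run a training sentence through the model under torch.no_grad() and take the argmax over the tag scores (this only checks fit on the training data, not generalization):

# Inspect predictions on the first training sentence (no gradients needed for inference).
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

    # The predicted tag for each word is the index with the highest score.
    ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}
    predicted = [ix_to_tag[i.item()] for i in tag_scores.argmax(dim=1)]
    print(predicted)  # should be close to ["DET", "NN", "V", "DET", "NN"] after training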

 

 

LSTM ์ด ์•„๋‹Œ GRU ๋ฅผ ์‚ฌ์šฉํ•˜๋ ค๋ฉด nn.GRU ๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

If, as above, we evaluate the model using only the training data, it is hard to tell how well it generalizes. The model should be evaluated on new data it did not see during training, either by splitting the available data into train and test sets or by using cross-validation, as sketched below.
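
A minimal sketch of such a split and a held-out evaluation, reusing the helpers defined above (purely illustrative here, since training_data has only two sentences; in practice the corpus would be much larger, and word_to_ix would need a strategy for unseen words):

import random

# Shuffle the (sentence, tags) pairs and hold out part of them for evaluation.
data = list(training_data)
random.shuffle(data)
split = int(0.8 * len(data))
train_data, test_data = data[:split], data[split:]

# Train only on train_data (as in the loop above), then measure per-word tag accuracy on test_data.
with torch.no_grad():
    correct = total = 0
    for sentence, tags in test_data:
        scores = model(prepare_sequence(sentence, word_to_ix))
        predictions = scores.argmax(dim=1)
        targets = prepare_sequence(tags, tag_to_ix)
        correct += (predictions == targets).sum().item()
        total += len(tags)
    print(correct / max(total, 1))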

 

Reference: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

 

๋ฐ˜์‘ํ˜•