๋Œ€ํšŒ ํ”„๋กœ์ ํŠธ/ํ”„๋กœ์ ํŠธ

Classifying News Article Categories with BERT

์ฃผ์˜ ๐Ÿฑ 2022. 12. 12. 12:18

์ž‘์—… ์ธ์› : 1๋ช…, ์†Œ์š”๊ธฐ๊ฐ„ : 1์ผ, ์ž‘์—…ํ™˜๊ฒฝ : google colab, python, KoBERT, ๋ฐ์ดํ„ฐ : ๋‰ด์Šค๊ธฐ์‚ฌ ํ…์Šคํฌ์™€ ํ•ด๋‹น ์นดํ…Œ๊ณ ๋ฆฌ

์ •ํ™•๋„ 0.847, F1 Score - 0.845

 

์ž์—ฐ์–ด์ฒ˜๋ฆฌ์—์„œ ๋ถ„๋ฅ˜์— ์‚ฌ์šฉ๋˜๋Š” ๋ชจ๋ธ์€ ๊ฐ„๋‹จํ•œ RNN, LSTM๋ถ€ํ„ฐ BERT , ๋“ฑ ๋‹ค์–‘ํ•œ ๋ชจ๋ธ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฒˆ์—๋Š” KoBERT๋ฅผ ์‚ฌ์šฉํ•ด์„œ Fine-tuningํ•ด๋ณด์•˜๋Š”๋ฐ, KoBERT๋Š” BERT์— ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”๊ฐ€๋กœ ๋” ํ•™์Šต์‹œํ‚จ ๋ชจ๋ธ๋กœ ํ•œ๊ตญ์–ด์— ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๊ฒƒ์ด ํŠน์ง•์ž…๋‹ˆ๋‹ค. ์•„๋ž˜ ๋งํฌ์—์„œ ๋ชจ๋ธ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 

https://github.com/SKTBrain/KoBERT

There are six target categories: '๊ฒฝ์ œ' (economy), '์‚ฌํšŒ' (society), '์ƒํ™œ' (lifestyle), '์Šคํฌ์ธ ' (sports), '์—ฐ์˜ˆ/๋ฌธํ™”' (entertainment/culture), and '์ •์น˜' (politics).
 
 
 

์‹คํ–‰ ์ „ ์ฝ”๋žฉ์—์„œ ๋Ÿฐํƒ€์ž„ ์œ ํ˜•์„ GPU๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.

1. ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ, ๋ชจ๋ธ ๋‹ค์šด๋กœ๋“œ

!pip install mxnet
!pip install gluonnlp pandas tqdm
!pip install sentencepiece
!pip install transformers==3.0.2
!pip install torch

# Download and install KoBERT
!pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

import os
import random

import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp

from kobert.utils import get_tokenizer
from kobert.pytorch_kobert import get_pytorch_kobert_model

from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup

# Fix random seeds for reproducibility
RANDOM_SEED = 2022
torch.manual_seed(RANDOM_SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

#BERT ๋ชจ๋ธ, Vocabulary ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
bertmodel, vocab = get_pytorch_kobert_model()

 

2. Preprocessing (everything except Korean, English letters, and digits is removed)

import re

# Keep only Hangul syllables, English letters, and digits
# (note: this also removes whitespace, since spaces are not in the character class)
df['text'] = [re.sub('[^A-Za-z0-9๊ฐ€-ํžฃ]', '', s) for s in df['text']]
๋‰ด์Šค๊ธฐ์‚ฌ์˜ ํŠน์„ฑ์€ ํŠน์ˆ˜๊ธฐํ˜ธ, ํ•œ์ž๊ฐ€ ๋“ค์–ด๊ฐ€๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ๊ณ , ๋งˆ์ง€๋ง‰์— ๊ธฐ์ž์˜ ๋ฉ”์ผ์ฃผ์†Œ๋„ ์žˆ๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์ •ํ™•๋„๋ฅผ ๋†’์ด๋ ค๋ฉด ์ด๋Ÿฌํ•œ ๋…ธ์ด์ฆˆ ๋˜ํ•œ ์ •๊ทœํ‘œํ˜„์‹์„ ์‚ฌ์šฉํ•˜์—ฌ ์ผ์ผ์ด ์ œ๊ฑฐํ•ด์ฃผ๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. 

+ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•์œผ๋กœ ์ˆ˜์ง‘ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ ์นดํ…Œ๊ณ ๋ฆฌ๋ณ„๋กœ 8000~8500 ๊ฐœ๋กœ ๋งž์ท„์Šต๋‹ˆ๋‹ค. ์ฐธ๊ณ !

https://getacherryontop.tistory.com/109

 

3. ๋ผ๋ฒจ ์ธ์ฝ”๋”ฉ 

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(df['category'])
df['category'] = encoder.transform(df['category'])

mapping = dict(zip(range(len(encoder.classes_)), encoder.classes_))

๋ฌธ์ž์ธ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ์ˆซ์ž๋กœ ๋ณ€๊ฒฝํ•ด์ค๋‹ˆ๋‹ค. 

data_list = []
for q, label in zip(df['text'], df['category']):
    data = []
    data.append(q)
    data.append(str(label))
    data_list.append(data)
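
As a quick sanity check, mapping should send each integer label back to its category name (the exact indices depend on how LabelEncoder sorted the classes):

# Sanity check of the label encoding (indices shown are illustrative)
print(mapping)        # e.g. {0: '๊ฒฝ์ œ', 1: '์‚ฌํšŒ', 2: '์ƒํ™œ', 3: '์Šคํฌ์ธ ', 4: '์—ฐ์˜ˆ/๋ฌธํ™”', 5: '์ •์น˜'}
print(data_list[0])   # [article text, encoded label as a string]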

 

4. ํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •

Bert์˜ ํŠน์„ฑ์€ 512ํ† ํฐ๊นŒ์ง€๋ฐ–์— ๋“ค์–ด๊ฐ€์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.(์•ฝ 2~3๋ฌธ์žฅ ์ •๋„) ์•„๋ฌด๋ฆฌ ๊ธด ํ…์ŠคํŠธ๋ฅผ ๋„ฃ์–ด๋„ ํ•ด๋‹น ๊ธธ์ด๋ฐ–์— ํ•™์Šต์ด ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ max_length๋กœ ์•ž์—์„œ๋ถ€ํ„ฐ ์ž˜๋ผ ๋„ฃ์–ด์ค๋‹ˆ๋‹ค. ๋ฐฐ์น˜์‚ฌ์ด์ฆˆ๋Š” ์•„๋ฌด๋ž˜๋„ gpu์˜ ํ• ๋‹น๋Ÿ‰์ด ์žˆ์–ด ์ตœ๋Œ€ํ•œ ์ž‘๊ฒŒ ๋„ฃ์—ˆ์Šต๋‹ˆ๋‹ค. 

max_len = 128
batch_size = 32

class CustomDataset(Dataset):

    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, max_len,
                 pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)
        
        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))

__init__: initialization, __len__: returns the length of the dataset, __getitem__: returns the item at a given index.

4. ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ์…‹ ๋‚˜๋ˆ„๊ธฐ

from sklearn.model_selection import train_test_split
                                                         
dataset_train, dataset_val = train_test_split(data_list, test_size=0.1, random_state=42)

This gives 44,066 training examples and 4,897 validation examples.
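
The split above is purely random; if you want every category to keep the same proportion in both sets, a stratified variant (an option, not what was run here) would be:

# Optional: stratified split so class ratios match in train and validation
labels = [label for _, label in data_list]
dataset_train, dataset_val = train_test_split(
    data_list, test_size=0.1, random_state=42, stratify=labels)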

6. Tokenization

# Set up the KoBERT SentencePiece tokenizer
tokenizer = get_tokenizer()
tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)

data_train = CustomDataset(dataset_train, 0, 1, tok, max_len, True, False)
data_val = CustomDataset(dataset_val, 0, 1, tok, max_len, True, False)

train_dataloader = torch.utils.data.DataLoader(data_train, batch_size=batch_size, shuffle=True, num_workers=5)
valid_dataloader = torch.utils.data.DataLoader(data_val, batch_size=batch_size, shuffle=True, num_workers=5)
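
Each item coming out of CustomDataset is a (token_ids, valid_length, segment_ids, label) tuple; a quick look at one example (shapes assume the BERTSentenceTransform settings above) also shows how truncation to max_len works:

# Inspect one tokenized training example (illustrative)
token_ids, valid_length, segment_ids, label = data_train[0]
print(token_ids.shape)    # (128,)  -> padded or truncated to max_len
print(valid_length)       # number of real (non-padding) tokens, capped at 128
print(segment_ids.shape)  # (128,)  -> all zeros for single-sentence inputs
print(label)              # encoded category id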

 

6. ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ, ๋ถ„๋ฅ˜๋ชจ๋ธ ์„ค์ •

learning_rate =  2e-5
num_epochs = 4
log_interval = 200
warmup_ratio = 0.1

# Path settings (Google Drive mounted in Colab)
ROOT_PATH = '/content/drive/MyDrive'
DATA_DIR = '/content/drive/MyDrive'

class BertClassifier(nn.Module):
    def __init__(self,
                 bert,
                 hidden_size=768,
                 num_classes=6,   ## adjust to the number of classes ##
                 dr_rate=None,
                 params=None):
        super(BertClassifier, self).__init__()
        self.bert = bert
        self.dr_rate = dr_rate

        self.classifier = nn.Linear(hidden_size, num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)

    def gen_attention_mask(self, token_ids, valid_length):
        # 1 for real tokens, 0 for padding positions
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length, segment_ids):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)

        # pooler: the [CLS] representation used for classification
        _, pooler = self.bert(input_ids=token_ids, token_type_ids=segment_ids.long(), attention_mask=attention_mask.float().to(token_ids.device))
        out = self.dropout(pooler) if self.dr_rate else pooler
        return self.classifier(out)
    
    
    #BERT ๋ชจ๋ธ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
model = BertClassifier(bertmodel,  dr_rate=0.5).to(device)

# Set up the optimizer and learning-rate scheduler
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()

t_total = len(train_dataloader) * num_epochs
warmup_step = int(t_total * warmup_ratio)

scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_step, num_training_steps=t_total)
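
With the numbers used here (about 44,000 training examples and batch size 32), the schedule works out roughly as follows:

# Rough step counts for the schedule above (illustrative):
# len(train_dataloader) ≈ 44066 / 32 ≈ 1378 batches per epoch
# t_total ≈ 1378 * 4 = 5512 optimizer steps in total
# warmup_step ≈ 5512 * 0.1 ≈ 551 warmup steps before the cosine decay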

#์ •ํ™•๋„ ์ธก์ •์„ ์œ„ํ•œ ํ•จ์ˆ˜ ์ •์˜
def calc_accuracy(X,Y):
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy()/max_indices.size()[0]
    return train_acc
    
8. Training

criterion = 0  # best validation accuracy seen so far
print('Training Start...')

for e in range(num_epochs):
    train_acc = 0.0
    val_acc = 0.0
    model.train()
    
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    print("epoch {} train loss {} train acc {}".format(e+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    
    model.eval()
    with torch.no_grad():  # no gradients needed during validation
        for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm(valid_dataloader)):
            token_ids = token_ids.long().to(device)
            segment_ids = segment_ids.long().to(device)
            label = label.long().to(device)
            out = model(token_ids, valid_length, segment_ids)
            loss = loss_fn(out, label)
            val_acc += calc_accuracy(out, label)
        
    print("epoch {} valid loss {} valid acc {}".format(e+1, loss.data.cpu().numpy(), val_acc / (batch_id+1)))
    if val_acc / (batch_id+1) > criterion:
        # model improved -> update the best validation score
        criterion = val_acc / (batch_id+1)
        check_point = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'scheduler': scheduler.state_dict()
        }
        # model improved -> save the checkpoint
        print("Update model save...")
        torch.save(check_point, os.path.join(ROOT_PATH, 'model_best.pt'))
        
print('Training Finish!')
 
epoch 4 valid loss 0.03084656223654747 valid acc 0.8474025974025974
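
The F1 score quoted at the top is not computed in the training loop; a minimal sketch for getting a macro-averaged F1 on the validation set (assuming the best weights are still loaded in model, and that macro averaging is the metric you want) would be:

# Sketch: macro F1 on the validation set with scikit-learn
from sklearn.metrics import f1_score

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for token_ids, valid_length, segment_ids, label in valid_dataloader:
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        all_preds.extend(out.argmax(dim=1).cpu().tolist())
        all_labels.extend(label.long().tolist())

print('macro F1:', f1_score(all_labels, all_preds, average='macro'))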

8. ํ•™์Šต๋œ ๋ชจ๋ธ ์ ์šฉ

class TestDataset(Dataset):
    def __init__(self, dataset, sent_idx, bert_tokenizer, max_len,
                 pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)
        self.categories = mapping
        self.num_labels = len(self.categories)

        self.sentences = [transform([i]) for i in dataset]

    def label_decoder(self, labels):
        # map predicted label ids back to category names
        try:
            return list(map(lambda x: self.categories[x], labels))
        except KeyError:
            raise ValueError('Invalid label id')

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, i):
        return self.sentences[i]
 
dataset_test = testdf['text'].tolist()   # testdf holds the unlabeled test articles
data_test = TestDataset(dataset_test, 0, tok, max_len, True, False)
test_dataloader = torch.utils.data.DataLoader(data_test, batch_size=batch_size, num_workers=5)

#MODEL_DIR = os.path.join(ROOT_PATH, 'model_best.pt')
#model = BertClassifier(bertmodel,  dr_rate=0.5).to(device)
#model.load_state_dict(torch.load(MODEL_DIR)['model'])

model.eval()
pred = []

with torch.no_grad():
    for batch_id, (token_ids, valid_length, segment_ids) in enumerate(test_dataloader):
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length = valid_length.long().to(device)

        out = model(token_ids, valid_length, segment_ids)
        pred.extend(out.argmax(dim=1).tolist())

        # print progress
        if batch_id % 10 == 0:
            print(f'Prediction: {batch_id}/{len(test_dataloader)} completed')

pred = data_test.label_decoder(pred)
print('Decoding completed!')
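
To keep the output, the decoded labels can simply be written back onto the test dataframe; a small sketch (the file name is just an example):

# Attach predictions to the test dataframe and save (file name is illustrative)
testdf['category_pred'] = pred
testdf.to_csv(os.path.join(ROOT_PATH, 'news_predictions.csv'), index=False)
print(testdf[['text', 'category_pred']].head())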

Results

 

๋‚˜๋ฆ„ ์ž˜ ๋‚˜์™”์Œ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 

๋ฐ˜์‘ํ˜•