์ž์—ฐ์–ด ์ฒ˜๋ฆฌ/Today I learned :

A simple classification example using transformers (BertForSequenceClassification)

์ฃผ์˜ ๐Ÿฑ 2023. 1. 12. 14:03

https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification

 


๊ฑฐ์˜ ๋ชจ๋“  ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ํ…Œ์Šคํฌ์—๋Š” ํŠธ๋žœ์Šคํฌ๋จธ ๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•˜๊ฒŒ ๋œ๋‹ค.

์ด๋ฒˆ์—๋Š” ์‚ฌ์ „ํ•™์Šต๋œ ๋ชจ๋ธ์ธ BERT๋ฅผ ๊ฐ€์ง€๊ณ  ๊ฐ„๋‹จํ•œ ๋ถ„๋ฅ˜ ์˜ˆ์ œ๋ฅผ ํ•ด๋ณด๋ฉฐ transformers๋ฅผ ๋ง›๋ณด๊ธฐ?ํ•  ๊ณ„ํš์ด๋‹ค. 

๋จผ์ € BERT์˜ MLM์„ ํ™•์ธํ•ด๋ณด์ž.

pip install transformers

from pprint import pprint
from transformers import BertForMaskedLM
from transformers.models.bert.tokenization_bert_fast import BertTokenizerFast

# Tokenize a sentence that contains a [MASK] token
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encodings2 = tokenizer("We are very happy to [MASK] you the Transformers library.", return_tensors="pt")
pprint(encodings2)

# Run the MLM head and take the argmax over the vocabulary at every position
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
outputs = model(**encodings2)
print(outputs)

print(outputs.logits.argmax(dim=-1))
 
 
As a result, the sentence "We are very happy to [MASK] you the Transformers library."

which was encoded as

'input_ids': tensor([[ 101, 2057, 2024, 2200, 3407, 2000, 103, 2017, 1996, 19081,
                       3075, 1012, 102]]),

came out as

tensor([[ 1012, 2057, 2024, 2200, 7537, 2000, 2265, 2017, 1996, 19081,
          3075, 1012, 1012]])

Comparing the two, 3407 -> 7537 and 103 -> 2265. Here 103 is the [MASK] token.
 
So, to see the changed sentence, i.e. the sentence BERT predicted:
 
print(tokenizer.decode(outputs.logits.argmax(dim=-1).squeeze(0)))

we are very pleased to show you the transformers library..

happy was replaced by pleased, and show had the highest probability at the masked position, which confirms that MLM is working well.
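To see which other candidates were considered for the masked slot, the predicted distribution at that position can be inspected directly; a minimal sketch reusing encodings2 and outputs from above:

import torch

# position of the [MASK] token in the input
mask_index = (encodings2["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# top-5 candidate tokens and their probabilities at that position
probs = torch.softmax(outputs.logits[0, mask_index], dim=-1)
top5 = torch.topk(probs, k=5)
print(tokenizer.convert_ids_to_tokens(top5.indices[0].tolist()))
print(top5.values[0])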

 

 


Training a classifier on movie review data

pip install transformers
pip install datasets

from datasets import load_dataset
data = load_dataset("imdb")

import torch
from transformers import BertTokenizer, BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
# bert-base-uncased was pretrained with MLM, so only the encoder weights are pretrained;
# the classification head of BertForSequenceClassification is newly initialized
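Loading the classification model from an MLM-only checkpoint prints a warning that some weights are newly initialized; the added head is just a small linear layer on top of the pooled [CLS] representation:

print(model.classifier)   # Linear(in_features=768, out_features=2, bias=True)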
 
# Preprocessing: strip HTML tags such as <br /> from the review text
import re

def preprocess(sample):
    return {
        'text': ' '.join(re.sub(r'<[^(?:/>)]+/>', ' ', sample['text']).split()),
        'label': sample['label']
    }

preprocessed = data.map(preprocess)
preprocessed

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

 

Comparing with the text before preprocessing, HTML tags such as <br /> are gone.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# or, via AutoTokenizer (use_fast=True returns a BertTokenizerFast)

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

preprocessed = preprocessed.map(
    lambda sample: tokenizer(sample['text'], truncation=True),  # truncation: sequences over the 512-token limit are cut
    remove_columns=['text'],
    batched=True
)
# batched=True tokenizes the examples in batches (1000 at a time by default)
#ํŒจ๋”ฉ์ด ์•ˆ๋งž๋Š”๊ฒฝ์šฐ ํ•˜๋‚˜์˜ ๋ฐฐ์น˜๋ฅผ ๋งŒ๋“ค์–ด์•ผํ•˜๋Š”๊ฒฝ์šฐ
from transformers import DataCollatorWithPadding
collator = DataCollatorWithPadding(tokenizer)
from torch.utils.data import DataLoader
train_loader = DataLoader(preprocessed['train'],batch_size=16, collate_fn=collator, shuffle=True)
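To see the dynamic padding at work, pull a single batch from the loader; every tensor is padded only to the longest sequence in that particular batch:

batch = next(iter(train_loader))
print({key: value.shape for key, value in batch.items()})
# e.g. input_ids, token_type_ids, attention_mask: [16, <longest length in this batch>]; labels: [16]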
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

#or

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
import torch   # when fine-tuning, we want to train the part that was not learned during MLM pretraining (the classifier head)
optimizer = torch.optim.AdamW(
    [
        {"params": model.bert.parameters(), "lr": 3e-5},        # pretrained encoder: small learning rate
        {"params": model.classifier.parameters(), "lr": 1e-3},  # newly initialized head: larger learning rate
    ]
)

model.cuda()   # the batches below are moved to the GPU, so the model has to be as well
model.train()
for epoch in range(3):
  print(f"Epoch: {epoch}")
  for encodings in train_loader:
    encodings = {key: value.cuda() for key, value in encodings.items()}
    outputs = model(**encodings)   # the batch includes labels, so outputs.loss is returned
    outputs.loss.backward()
    print('\rLoss: ', outputs.loss.item(), end='')
    optimizer.step()
    optimizer.zero_grad(set_to_none=False)

loss๊ฐ€ ์ ์  ์ค„์–ด๋“ฆ์„ ํ™•์ธ

 


# Assumes the cells above, up through the DataCollatorWithPadding and
# AutoModelForSequenceClassification loads, have already been run
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    num_train_epochs=3.0,
    per_device_train_batch_size=16,
    output_dir='dump/test'
)
trainer = Trainer(
    model=model,
    args=training_args, 
    train_dataset= preprocessed['train'],
    eval_dataset=preprocessed['test'],
    data_collator=collator
)
trainer.train()
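The Trainer can report test-set metrics as well; a minimal sketch that rebuilds the Trainer with an accuracy function passed as compute_metrics and then calls trainer.evaluate():

import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=preprocessed['train'],
    eval_dataset=preprocessed['test'],
    data_collator=collator,
    compute_metrics=compute_metrics,
)
trainer.evaluate()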
๋ฐ˜์‘ํ˜•