BERT Fine-tuning

24 min read · Nov 22, 2020

Task: text classification
Model: Bert/Transformer
Difficulty: hard

BERT

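In the previous posts we talked about using BERT for fill-in-the-blank (cloze) exercises. The original BERT model, however, is too general: if your task is a specific prediction, such as the topic classification we covered earlier, the model has to be trained further before it can make useful predictions. This post covers how to start from a pre-trained BERT model and train a topic classification model with a relatively small amount of data.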

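BERT differs from Word2Vec and GloVe. Its contextual word representations give the same word a different representation (usually a 1 x N matrix) in different contexts, whereas traditional word vectors give every token of the same word type the same representation regardless of context.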

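This comes from how Google pre-trained BERT, which had it perform two tasks at the same time:

Fill in the blank (the Cloze task, proposed in 1953; the more academic name is Masked Language Model, MLM)
Decide whether the second sentence follows the first sentence in the original text (Next Sentence Prediction, NSP)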
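
To make the two pre-training tasks above more concrete, here is a minimal sketch in plain Python of how training examples for them could be built. The function names are hypothetical, the 15% masking rate and the 80/10/10 replacement rule follow the BERT paper, and the negative sampling for NSP is simplified (BERT draws the negative sentence from a different document). It assumes the input has at least three sentences.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels) for one masked-language-model example.

    labels[i] holds the original token where a prediction is required, else None.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def make_nsp_pair(sentences, index, rng):
    """Return (sentence_a, sentence_b, is_next) for next sentence prediction."""
    sentence_a = sentences[index]
    if rng.random() < 0.5 and index + 1 < len(sentences):
        return sentence_a, sentences[index + 1], True
    # Negative example: any sentence that is neither sentence_a nor its true successor
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False


rng = random.Random(0)
tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, vocab=tokens, rng=rng, mask_prob=0.5)
print(masked, labels)
print(make_nsp_pair(["s0", "s1", "s2"], 0, rng))
```

The model only has to predict the positions where labels is not None, which is why a masked position sometimes keeps its original token: the model cannot rely on seeing [MASK] to know where a prediction is needed.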

Fine-tuning BERT

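Training a BERT model from scratch takes a very long time, so people generally start from an already pretrained model. Today we use the transformers GitHub repo developed by the Hugging Face team, in particular its fine-tuning module.

BERT is essentially the Encoder of the Transformer, just with many layers. What makes it attractive is that it is a general architecture that can handle all kinds of NLP tasks directly: you pre-train one BERT model that applies to many NLP tasks, then fine-tune it for each downstream task. The rough steps are: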

Preprocess the dataset into InputExample
Convert InputExample into a BERT-compatible format
Define BERT's training and prediction functions
Actually fine-tune BERT on the classification task

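In one sentence, fine-tuning means attaching a new classification layer to the last layer of the original BERT model for the downstream task, then training the whole network on a smaller amount of text, with a loss function that targets the new downstream task. The rest of this post walks through how to implement these four steps with Hugging Face: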

pip3 install transformers==3.5.0 # latest version 4.0.0 breaks things
pip3 install torch==1.4.0
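
Before going through the steps, the "new classification layer" idea described above can be sketched in plain Python. This is only an illustration under stated assumptions: the sentence vector stands in for what the pretrained encoder would produce, and all numbers and function names are made up. In the real model the layer is a torch linear layer and the loss is CrossEntropyLoss.

```python
import math


def linear(vector, weights, bias):
    """Apply a linear layer: one output score (logit) per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def cross_entropy(logits, label):
    """Softmax cross-entropy loss for a single example (numerically stable)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


# Stand-in for the sentence vector produced by the pretrained encoder (made-up numbers)
sentence_vector = [0.2, -0.1, 0.4]
# New classification layer: 2 classes x 3 hidden dimensions (made-up weights)
weights = [[0.5, 0.0, 1.0], [-0.5, 1.0, 0.0]]
bias = [0.0, 0.0]

logits = linear(sentence_vector, weights, bias)
loss = cross_entropy(logits, label=0)
print(logits, loss)
```

During fine-tuning the gradient of this loss flows back through the new layer and into the encoder, so both are updated.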

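1.) Preprocess the dataset into InputExample

This part is similar to doing text classification with Word2Vec + CNN (see the earlier post). In short, we classify an image's alt-text (usually a short sentence) into the following categories: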

class_labels = {
    0: 'Lifestyle&Activity',
    1: 'Food',
    2: 'Entertainment',
    3: 'Sports',
    4: 'Home',
    5: 'Automotive',
    7: 'Technology',
    8: 'Entertainment',
    9: 'Travel',
    10: 'Retail',
    11: 'Politics',
}

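First, create MyProcessor, a subclass of the DataProcessor class from the Hugging Face transformers utils, to handle the data and turn each row of the CSV into an InputExample. Hugging Face already ships built-in processors for common datasets (for example GLUE and MRPC; examples are available in the transformers repo), but if you fine-tune on your own data you will inevitably need to write your own MyProcessor class, unless the way you read your data and its format are exactly the same as GLUE or MRPC: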

import logging

import numpy as np
import pandas as pd
from transformers import DataProcessor, InputExample

logger = logging.getLogger(__name__)


class MyProcessor(DataProcessor):
    """Processor for my custom dataset"""

    def __init__(self, total_data_rows=47963, test_sample_size=.25, filename="all.tsv"):
        # Boolean mask, one entry per row: True rows go to train, False rows go to dev.
        # total_data_rows must match the number of rows in the file.
        self.msk = np.random.rand(total_data_rows) < 1 - test_sample_size
        self.data_filename = filename
        self.df = None

    def _load_csv(self, filename):
        # Column 0 holds the label, column 1 holds the text
        self.df = pd.read_csv(filename, encoding='utf-8', engine='c', header=None)
        self.df[0] = self.df[0].map(lambda x: str(x))

    def _mask_df(self, msk, set_type):
        if self.df is None:
            self._load_csv(self.data_filename)
        target = self.df[msk]
        labels = target[0].tolist()
        data = target[1].tolist()
        logger.debug("Loaded: {} points, {} labels for task {}".format(len(data), len(labels), set_type))
        return data, labels

    def _get_masked_examples(self, msk, set_type):
        data, labels = self._mask_df(msk, set_type)
        return self._create_examples(data, labels, set_type)

    def get_example_from_tensor_dict(self, tensor_dict):
        raise NotImplementedError  # this is a tensorflow thing, ignored

    def get_train_examples(self, data_dir):
        return self._get_masked_examples(self.msk, "train")

    def get_dev_examples(self, data_dir):
        return self._get_masked_examples(~self.msk, "dev")

    def get_labels(self):
        return [str(i) for i in range(24)]

    def _create_examples(self, data, labels, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for i, (d, l) in enumerate(zip(data, labels)):
            guid = "%s-%s" % (set_type, i)
            examples.append(InputExample(guid=guid, text_a=d, label=l))
        return examples


processor = MyProcessor(filename="/path/to/data.csv")
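
The train/dev split in MyProcessor relies on a random boolean mask with one entry per row. The sketch below shows the same idea in plain Python, using the standard library instead of numpy and pandas. The function name and the sample rows are made up for illustration.

```python
import random


def split_by_mask(rows, test_sample_size=0.25, seed=0):
    """Split rows into (train, dev) with a random boolean mask, one draw per row."""
    rng = random.Random(seed)
    mask = [rng.random() < 1 - test_sample_size for _ in rows]
    train = [row for row, keep in zip(rows, mask) if keep]
    dev = [row for row, keep in zip(rows, mask) if not keep]
    return train, dev


# Made-up (label, alt-text) rows standing in for the CSV contents
rows = [
    ("1", "a bowl of ramen"),
    ("3", "a soccer match"),
    ("9", "a beach at sunset"),
    ("5", "a red sports car"),
]
train, dev = split_by_mask(rows)
print(len(train), len(dev))
```

Because every row is drawn independently, the dev set holds 25% of the rows only on average. Every row ends up in exactly one of the two sets, which is what negating the mask with ~ achieves in the processor.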

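2.) Convert InputExample into a BERT-compatible format

Once we have InputExample objects, we need one more conversion into a format that BERT accepts. After preparing the example data with MyProcessor, we can use Hugging Face's glue_convert_examples_to_features to turn the examples into a format BERT can read (Hugging Face calls these features):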

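token tensors: produced by the BERT tokenizer and called input_ids in the Hugging Face repo; each entry is the index of the token in the BERT vocabulary.
segment tensors: called token_type_ids in the Hugging Face repo (every token of the first sentence is 0, every token of the second sentence is 1, and so on)
mask tensors: called attention_mask in the Hugging Face repo; they tell BERT whether a token is a real token or blank padding (filler tokens that make every input sentence the same length). Padding positions do not need any attention.

Fine-tuning BERT requires a token tensor, a segment tensor, and a mask tensor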
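
The three tensors listed above can be sketched in plain Python for a single sentence pair. This is an illustration only: the vocabulary and its ids are made up, the function name is hypothetical, real BERT tokenization uses WordPiece, and the sketch assumes the pair fits within max_length (no truncation).

```python
def to_features(tokens_a, tokens_b, vocab, max_length):
    """Build input_ids, token_type_ids and attention_mask for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)           # first sentence -> segment 0
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # second sentence -> segment 1
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    attention_mask = [1] * len(input_ids)        # 1 marks a real token
    # Pad all three lists up to max_length; padding positions get mask 0
    padding = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    token_type_ids += [0] * padding
    attention_mask += [0] * padding
    return input_ids, token_type_ids, attention_mask


# Toy vocabulary with made-up ids
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "red": 4, "car": 5, "fast": 6}
features = to_features(["red", "car"], ["fast"], vocab, max_length=8)
print(features)
# ([2, 4, 5, 3, 6, 3, 0, 0], [0, 0, 0, 0, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0])
```

In this post each example only has a text_a, so in practice every token_type_ids entry is 0.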

import torch
from torch.utils.data import TensorDataset
from transformers import glue_convert_examples_to_features as convert_examples_to_features


def load_data_as_examples(args, task, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0] and not evaluate:
        # Make sure only the first process in distributed training processes the dataset,
        # and the others will use the cache
        torch.distributed.barrier()

    output_mode = "classification"
    # Load data features from dataset file
    label_list = processor.get_labels()
    examples = (
        processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
    )
    features = convert_examples_to_features(
        examples, tokenizer, max_length=args.max_seq_length, label_list=label_list, output_mode=output_mode,
    )

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset

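3.) Define BERT's training and prediction functions

Every deep learning model needs training and prediction functions, and fine-tuning is no different. This includes defining the optimizer, the learning rate scheduler (here the popular warmup), the loss function, and backprop. One thing to note is that the Hugging Face model returns a tuple, so the index you take depends on whether you want the loss or the logits (see the source code of BertForSequenceClassification). The upside is that the forward function is already written for you, which is very convenient.

The forward function that is written for you does something very simple: it adds a new linear layer after the BERT output vector to do text classification and returns the CrossEntropyLoss, on which you then run backward propagation: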

import torch
from torch.utils.data import DataLoader, RandomSampler
from tqdm import tqdm, trange
from transformers import AdamW, get_linear_schedule_with_warmup


def train(args, train_dataset, model, tokenizer):
    """ Train the model """
    train_sampler = RandomSampler(train_dataset)
    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
    # Total number of optimizer steps, needed by the scheduler
    t_total = len(train_dataloader) * int(args.num_train_epochs)

    # No weight decay for bias and LayerNorm weights, as in the Hugging Face example scripts
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
         "weight_decay": args.weight_decay},
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
         "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    global_step = 0
    epochs_trained = 0
    tr_loss, logging_loss = 0.0, 0.0
    logs = {}
    model.zero_grad()
    train_iterator = trange(epochs_trained, int(args.num_train_epochs), desc="Epoch")
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration")
        for step, batch in enumerate(epoch_iterator):
            model.train()  # set model to train mode
            batch = tuple(t.to(args.device) for t in batch)
            inputs = {"input_ids": batch[0], "attention_mask": batch[1],
                      "token_type_ids": batch[2], "labels": batch[3]}
            outputs = model(**inputs)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)
            loss.backward()
            tr_loss += loss.item()
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
            optimizer.step()
            scheduler.step()  # Update learning rate schedule
            model.zero_grad()
            global_step += 1
            if args.logging_steps > 0 and global_step % args.logging_steps == 0:
                logs["learning_rate"] = scheduler.get_lr()[0]
                logs["loss"] = (tr_loss - logging_loss) / args.logging_steps
                logging_loss = tr_loss
    return global_step, tr_loss / global_step
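
The warmup scheduler used in the training function scales the base learning rate by a factor that rises linearly from 0 to 1 during warmup and then decays linearly back to 0. A minimal sketch of that factor in plain Python, with a hypothetical function name and made-up step counts:

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Afterwards: decay linearly from 1 down to 0 at the last step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_factor(step, num_warmup_steps=100, num_training_steps=1000)
    print(step, factor, base_lr * factor)
```

With 100 warmup steps out of 1000, the factor is 0.5 at step 50, reaches 1.0 at step 100, is back to 0.5 at step 550, and hits 0.0 at step 1000. Starting with a small learning rate keeps the freshly initialized classification layer from disturbing the pretrained weights too much in the first updates.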

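After training the model we need to know its accuracy, so we run the prediction function (evaluation) on the evaluation dataset we held out at the start. The prediction function is essentially the same as the training function: both use a DataLoader to fetch batches, preprocess them, and pass them to the model for computation. The difference is that evaluation does not back-propagate the loss to improve the model. It simply reports what the current model predicts for the input data: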

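import numpy as np
import torch
from torch.utils.data import DataLoader, SequentialSampler
from tqdm import tqdm


def evaluate(args, model, tokenizer, prefix=""):
    results = {}
    eval_dataset = load_data_as_examples(args, args.task_name, tokenizer, evaluate=True)
    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    preds = None
    out_label_ids = None
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        model.eval()
        batch = tuple(t.to(args.device) for t in batch)
        with torch.no_grad():
            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "token_type_ids": batch[2], "labels": batch[3]}
            outputs = model(**inputs)
            tmp_eval_loss, logits = outputs[:2]
            eval_loss += tmp_eval_loss.mean().item()
        nb_eval_steps += 1
        # Accumulate logits and gold labels across batches
        batch_preds = logits.detach().cpu().numpy()
        batch_labels = inputs["labels"].detach().cpu().numpy()
        preds = batch_preds if preds is None else np.append(preds, batch_preds, axis=0)
        out_label_ids = batch_labels if out_label_ids is None else np.append(out_label_ids, batch_labels, axis=0)

    eval_loss /= nb_eval_steps
    if args.output_mode == "classification":
        preds = np.argmax(preds, axis=1)
    elif args.output_mode == "regression":
        preds = np.squeeze(preds)
    result = {"acc": float((preds == out_label_ids).mean())}  # simple accuracy
    results.update(result)
    return results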
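
The last few lines of the evaluation function turn logits into class predictions and compare them with the gold labels. The same computation in plain Python, with made-up logits and hypothetical function names:

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the gold label."""
    preds = [argmax(row) for row in logits]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)


# Made-up logits for 4 examples and 3 classes
logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 0.9, 0.4], [1.2, 1.1, 0.0]]
labels = [0, 2, 2, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```

The third example is predicted as class 1 while its label is 2, so three out of four predictions are correct.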

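4.) Actually fine-tune BERT on the classification task

With the data prepared and the model defined, we can finally write the main function to do the actual training.

Every machine learning model comes with many parameters that tell the program how to train it, so the first thing the main function does is read in the arguments. Hugging Face helpfully provides an argument parser library that lets users manage different kinds of arguments with @dataclass. The arguments for fine-tuning BERT come from three sources: ModelArguments, DataProcessingArguments, and TrainingArguments.

ModelArguments defines the arguments related to model selection: model_name_or_path, model_type, tokenizer_name
DataProcessingArguments defines the arguments related to data preprocessing: task_name, data_dir (the folder holding the CSV data file), max_seq_length (longer inputs are truncated and shorter ones padded so that every tensor has the same length)
TrainingArguments uses the general-purpose Hugging Face arguments, such as --do_train and --do_eval.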

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path or name to pre-trained model "}
    )
    model_type: str = field(metadata={"help": "Model type"})
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name"}
    )


@dataclass
class DataProcessingArguments:
    task_name: str = field(
        metadata={"help": "this only supports the selected task"}
    )
    data_dir: str = field(
        metadata={"help": "The input data dir."}
    )
    max_seq_length: int = field(
        default=128,
        metadata={
            "help": "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded."
        },
    )
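
To show the idea behind parsing command-line flags into a dataclass, here is a minimal sketch that uses only the standard library. It is a stand-in, not how HfArgumentParser is implemented: DataArgs and parse_into_dataclass are hypothetical names, and only plain str and int fields are handled.

```python
import argparse
from dataclasses import MISSING, dataclass, fields


@dataclass
class DataArgs:
    """Hypothetical stand-in for DataProcessingArguments."""
    task_name: str
    data_dir: str
    max_seq_length: int = 128


def parse_into_dataclass(cls, argv):
    """Create one --flag per dataclass field, then parse argv into an instance."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        if f.default is MISSING:
            # Fields without a default become required flags
            parser.add_argument("--" + f.name, type=f.type, required=True)
        else:
            parser.add_argument("--" + f.name, type=f.type, default=f.default)
    return cls(**vars(parser.parse_args(argv)))


args = parse_into_dataclass(DataArgs, ["--task_name", "mytest", "--data_dir", "/tmp/data"])
print(args)
```

The field names double as flag names and the type annotations as converters, so adding a new argument only requires adding a field to the dataclass.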

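The Hugging Face framework provides three basic classes: configuration, models, and tokenizer. With just these three classes you can use any pretrained model that Hugging Face provides. The calling convention is consistent too: use the from_pretrained() method to download the model you want.

Here we first define our model. If you want to try more than one model and pick the best, you can use the AutoModel class from Hugging Face. AutoModel is a generic model class, and AutoModel.from_pretrained() can load any of the base model classes Hugging Face provides, such as DistilBERT, BERT, and GPT2. For text classification, though, what you need is AutoModelForSequenceClassification: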

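from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

config = AutoConfig.from_pretrained(
    # fall back to the model name when no separate config name is given
    args.config_name if args.config_name else args.model_name_or_path,
    num_labels=num_labels,
    finetuning_task=args.task_name,
    cache_dir=args.cache_dir,
)
tokenizer = AutoTokenizer.from_pretrained(
    args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
    cache_dir=args.cache_dir,
)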
model = AutoModelForSequenceClassification.from_pretrained(
    args.model_name_or_path,
    from_tf=bool(".ckpt" in args.model_name_or_path),
    config=config,
    cache_dir=args.cache_dir,
)

To avoid over-complicating things, we use BertModel directly, or more precisely BertForSequenceClassification and BertTokenizer:

from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    HfArgumentParser,
    TrainingArguments)


def main():
    # use HfArgumentParser to read in arguments
    parser = HfArgumentParser((ModelArguments, DataProcessingArguments, TrainingArguments))
    model_args, dataprocessing_args, training_args = parser.parse_args_into_dataclasses()
    # train() and evaluate() read from one `args` object, so copy the data settings onto it
    args = training_args
    args.task_name = dataprocessing_args.task_name
    args.data_dir = dataprocessing_args.data_dir
    args.max_seq_length = dataprocessing_args.max_seq_length
    args.output_mode = "classification"
    label_list = processor.get_labels()
    num_labels = len(label_list)

    # download pretrained Bert tokenizer and model
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)
    model.to(args.device)

    # Training
    global_step = 0
    if args.do_train:
        train_dataset = load_data_as_examples(args, args.task_name, tokenizer, evaluate=False)  # step 1 & 2
        global_step, tr_loss = train(args, train_dataset, model, tokenizer)  # step 3
        model.save_pretrained(args.output_dir)  # save model so evaluation can reload it
        tokenizer.save_pretrained(args.output_dir)

    # Evaluation
    results = {}
    if args.do_eval:
        # Load trained model that you have fine-tuned to evaluate
        checkpoint = args.output_dir
        model = BertForSequenceClassification.from_pretrained(checkpoint)
        model.to(args.device)
        result = evaluate(args, model, tokenizer, prefix="")  # step 3
        result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
        results.update(result)
    return results


if __name__ == "__main__":
    main()

Finally, just add the extra arguments to the command that runs the main function, and you can borrow the power of BERT to get a new classification model on your own data!

--model_type bert
--model_name_or_path bert-base-uncased
--task_name mytest
--data_dir /Users/jeching/Downloads/
--max_seq_length 128
--per_gpu_eval_batch_size=8
--per_gpu_train_batch_size=8
--learning_rate 2e-5
--num_train_epochs 3.0
--output_dir /tmp/mytest/
--overwrite_output_dir
--do_train
--do_eval