WPE Self-Study from Scratch (WPE Professional Edition Tutorial) - 原点资讯

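How to Train a GPT

Next, let's get hands-on and train a GPT model. As the running example, we train a code-completion GPT model from scratch.

What is code completion good for? Suppose we give the model a prompt like this: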

from transformers import AutoTokenizer, AutoModelForSequenceClassification
# build a BERT classifier

The model can then output:

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

This defines a BERT-based classifier for us.

To train such a model, we first need training data. A commonly used dataset for code completion is codeparrot.

https://huggingface.co/codeparrot/codeparrot

Here we print one sample (truncated, since it is too long otherwise). As you can see, it looks just like the code we normally write.

[Figure: a truncated code sample from the dataset]

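However, a model cannot consume this kind of raw text directly, so before training an NLP model we usually tokenize the text, converting it into a sequence of integers. We can create a tokenizer like this: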

tokenizer = AutoTokenizer.from_pretrained("./code-search-net-tokenizer")

Tokenizing the code above yields a sequence of ids like the following:

[3, 41082, 17023, 26, 11334, 13, 24, 41082, 173, 2745, 756, 173, 2745, 4397, 173, 2745, 1893, 173, 2745, 3857, 442, 2604, 173, 973, 7880, 978, 3399, 173, 973, 10888, 978, 4582, 173, 173, 973, 309, 65, 552, 978, 6336, 4391, 173, 295, 6472, 8, ...

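The example above shows the result of tokenizing a single sample. In practice we define a tokenization function (it involves details such as whether to truncate and what maximum length to use, which we will not go into here) and then map it over the whole dataset to tokenize everything at once.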

# context_length (the window size in tokens) must be defined beforehand
def tokenize(element):
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,  # keep the overflow as extra chunks
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        # keep only chunks that fill the whole context window
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
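To make the chunk-and-filter logic of the tokenize function concrete, here is a minimal pure-Python sketch. It assumes, as a simplification, that return_overflowing_tokens splits a long id sequence into consecutive, non-overlapping windows of at most max_length ids. The helper names chunk_ids and keep_full_chunks are hypothetical and not part of transformers.

```python
def chunk_ids(ids, max_length):
    """Split a list of token ids into consecutive windows of at most max_length."""
    return [ids[i:i + max_length] for i in range(0, len(ids), max_length)]


def keep_full_chunks(chunks, context_length):
    """Keep only the chunks whose length equals context_length."""
    return [c for c in chunks if len(c) == context_length]


ids = list(range(10))               # stand-in for the ids of one tokenized file
chunks = chunk_ids(ids, 4)          # the last chunk holds only 2 ids
full = keep_full_chunks(chunks, 4)  # the short trailing chunk is dropped
print(full)
```

The short trailing chunk is discarded, so a small part of each file never reaches the model; in exchange every training example has the same length and no padding is needed.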

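With the data ready, the next step is to create (initialize) a model. GPT is essentially a stack of transformer blocks, and plenty of implementations already exist, so we will not reinvent the wheel. The most common choice is the transformers library, which lets us define a model quickly through a configuration.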

from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

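This is a GPT-2 with 124.2M parameters. The model structure is printed below (see the transformers source code for the full implementation). It mainly consists of embedding layers at the front, 12 transformer blocks in the middle, and a linear layer at the end.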

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50000, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      ################# the 10 repeated blocks in between are omitted #################
      (11): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50000, bias=False)
)
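As a sanity check on the 124.2M figure, the parameter count can be reproduced by hand from the layer shapes. The sketch below is only an estimate under stated assumptions: the printout does not show the Conv1D shapes, so it assumes the standard GPT-2 small layout (c_attn maps 768 to 3 x 768, the MLP hidden size is 4 x 768, every linear layer has a bias), and it assumes lm_head shares its weight with wte so that it is not counted a second time.

```python
vocab_size, n_positions, d_model, n_layer = 50000, 1024, 768, 12
d_ff = 4 * d_model  # assumed MLP hidden size (3072)


def linear(n_in, n_out):
    """Parameters of a dense layer with bias: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * d_model  # scale and shift vectors
embeddings = vocab_size * d_model + n_positions * d_model  # wte + wpe
attention = linear(d_model, 3 * d_model) + linear(d_model, d_model)  # c_attn + c_proj
mlp = linear(d_model, d_ff) + linear(d_ff, d_model)  # c_fc + c_proj
block = layer_norm + attention + layer_norm + mlp  # ln_1, attn, ln_2, mlp

# lm_head is assumed to be tied to wte, so it adds no parameters of its own
total = embeddings + n_layer * block + layer_norm  # last term is ln_f
print(f"{total / 1000**2:.1f}M parameters")  # prints "124.2M parameters"
```

Under these assumptions the 12 blocks account for roughly 85M parameters and the token embedding for about 38M, so the vocabulary size has a large effect on the size of a model this small.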

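The model structure diagrams of GPT-2 and GPT are also given here for those who want a closer look. GPT-2 (right) changes the GPT structure (left) slightly. Within a GPT-2 Transformer block, the first LayerNormalization module is moved in front of the Masked Multi-Head Self-Attention module, and the second LayerNormalization module is moved in front of the Feed-Forward module. The residual connections are correspondingly placed after the Masked Multi-Head Self-Attention module and after the Feed-Forward module.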

[Figure: model structure of GPT (left) and GPT-2 (right)]
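The difference in ordering is easier to see in code. The following is a minimal pure-Python sketch, not the transformers implementation: layer_norm has no learned scale or shift, and toy_attn and toy_ffn are hypothetical stand-ins for the real attention and feed-forward sublayers.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Element-wise sum, used for the residual connection."""
    return [u + v for u, v in zip(a, b)]


def post_ln_block(x, attn, ffn):
    """GPT ordering: sublayer, residual add, then LayerNorm."""
    x = layer_norm(add(x, attn(x)))
    return layer_norm(add(x, ffn(x)))


def pre_ln_block(x, attn, ffn):
    """GPT-2 ordering: LayerNorm, sublayer, then residual add."""
    x = add(x, attn(layer_norm(x)))
    return add(x, ffn(layer_norm(x)))


# Hypothetical toy sublayers, just to make the blocks runnable
toy_attn = lambda x: [0.5 * v for v in x]
toy_ffn = lambda x: [v * v for v in x]

x = [1.0, 2.0, 4.0]
post = post_ln_block(x, toy_attn, toy_ffn)
pre = pre_ln_block(x, toy_attn, toy_ffn)
print(post, pre)
```

In the GPT-2 ordering the input x reaches the block output through the residual additions without passing through a LayerNorm, whereas in the GPT ordering the block output is always normalized.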

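Once the data and the model structure are settled, we need a training procedure or framework. The simplest option is to call the Trainer provided by transformers directly, passing in some configuration along with the model, the tokenizer, and the datasets.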

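from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
# mlm=False selects causal language modeling: the labels are a copy of the input ids
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="codeparrot-ds",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
)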

Then training is a single call:

trainer.train()
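Two of the settings above are worth unpacking: the effective batch size and the learning-rate schedule. The sketch below is an approximation written for illustration, not the library's code. It assumes that lr_scheduler_type="cosine" means a linear warmup from 0 to the base learning rate followed by a half-cosine decay to 0, and total_steps=50_000 is a hypothetical value, since the real number depends on the dataset size.

```python
import math


def cosine_lr(step, base_lr=5e-4, warmup_steps=1_000, total_steps=50_000):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sequences consumed per optimizer step on one device:
# per_device_train_batch_size * gradient_accumulation_steps
effective_batch = 32 * 8

for step in (0, 500, 1_000, 25_500, 50_000):
    print(step, f"{cosine_lr(step):.6f}")
```

With these settings each optimizer step on a single device sees 256 sequences, and the learning rate peaks at 5e-4 right after the 1,000 warmup steps.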

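For more flexibility you can also write your own training loop: fetch each batch in turn, feed it to the model, compute the loss, and backpropagate. For large models, the accelerate library is commonly used to help with this, for example for mixed precision and gradient accumulation.

All of the code above (training with the Trainer or with the accelerate library) is in the official transformers tutorial; if you are interested, try running it yourself.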

https://huggingface.co/course/chapter7/6?fw=pt
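The hand-written loop described above, including gradient accumulation, can be sketched without any deep learning library. This is a toy illustration only, not the tutorial's code: the model is a single weight w fitting y = w * x, the gradient is computed analytically instead of by backpropagation, and the function names are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, micro_batch_size, accumulation_steps, lr=0.01, epochs=200):
    w = 0.0
    for _ in range(epochs):
        accumulated, seen = 0.0, 0
        for start in range(0, len(data), micro_batch_size):
            batch = data[start:start + micro_batch_size]  # fetch one micro-batch
            grad = grad_mse(w, batch)                     # forward, loss, and gradient
            accumulated += grad / accumulation_steps      # average over micro-batches
            seen += 1
            if seen % accumulation_steps == 0:            # update every N micro-batches
                w -= lr * accumulated
                accumulated = 0.0
    return w


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples of y = 3x
w_accum = train(data, micro_batch_size=2, accumulation_steps=4)
w_full = train(data, micro_batch_size=8, accumulation_steps=1)
print(w_accum, w_full)
```

Accumulating four micro-batches of 2 samples gives the same update as one batch of 8, which is why gradient accumulation is a common way to reach a large effective batch size when memory is limited.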

After training, let's look at the model's code generation ability, starting with a hello world.

Given the prompt:

def print_hello_world():
    """Print 'Hello World!'."""

we get:

def print_hello_world():
    """Print 'Hello World!'."""
    print('Hello World!')

Given the prompt:

import numpy as

it knows that the abbreviation we usually use is:

import numpy as np

Given the prompt:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# create training data
X = np.random.randn(100, 100)
y = np.random.randint(0, 1, 100)

# setup train test split

it can split the data into training and test sets for us:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# create training data
X = np.random.randn(100, 100)
y = np.random.randint(0, 1, 100)

# setup train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

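In this section we walked through the process of training a GPT. The difference from the first step of training ChatGPT is that ChatGPT's first step fine-tunes an already pretrained GPT on a corpus of human-written answers, whereas this section only pretrains a model on a small corpus.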
