OpenELM FastApi 部署调用

OpenELM是由苹果公司开发的一款先进语言模型，通过一种新的层级缩放策略优化每个Transformer层的参数分配，从而提升模型的效率和准确性。OpenELM还提供了一个开放的训练和推理框架，包含数据集、训练日志和检查点等资源，支持研究的可重复性和透明性。

模型地址：https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca

项目地址：https://github.com/apple/corenet/tree/main/projects/openelm

官方报道：https://machinelearning.apple.com/research/openelm

论文链接：https://arxiv.org/abs/2404.14619

环境准备

本文基础环境如下：

----------------
ubuntu 22.04
cuda 12.4
----------------

本文默认学习者已安装好以上 Pytorch(cuda) 环境，如未安装请自行安装。

首先pip换源加速下载并安装依赖包

# 更换 pypi 源加速库的安装
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

pip install accelerate
pip install tiktoken
pip install einops
pip install transformers_stream_generator==0.0.4
pip install scipy
pip install torchvision
pip install pillow
pip install tensorboard
pip install matplotlib
pip install transformers==4.32.0
pip install gradio==3.39.0
pip install nest_asyncio

模型下载

使用 modelscope 命令行下载模型，参数model为模型名称，参数 local_dir 为模型的下载路径。
注：由于OpenELM使用的是Llama2的Tokenizer，所以我们在下载Llama2-7b时可将权重排除在外打开终端输入以下命令下载模型和Tokenizer

modelscope download --model shakechen/Llama-2-7b-hf  --local_dir /OpenELM/compshared-tmp/Llama-2-7b-hf --exclude .bin *.safetensors configuration.json 
modelscope download --model LLM-Research/OpenELM-3B-Instruct --local_dir /OpenELM/compshared-tmp/OpenELM-3B-Instruct

代码准备

在 /OpenELM/compshared-tmp 路径下新建 api.py 文件并在其中输入以下内容，粘贴代码后请及时保存文件。下面的代码有很详细的注释，大家如有不理解的地方，欢迎提出 issue。

from fastapi import FastAPI, Request
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import uvicorn
import json
import datetime
import torch

# 设置设备参数
DEVICE = cuda  # 使用CUDA
DEVICE_ID = 0  # CUDA设备ID，如果未设置则为空
CUDA_DEVICE = f{DEVICE}:{DEVICE_ID} if DEVICE_ID else DEVICE  # 组合CUDA设备信息

# 清理GPU内存函数
def torch_gc():
    if torch.cuda.is_available():  # 检查是否可用CUDA
        with torch.cuda.device(CUDA_DEVICE):  # 指定CUDA设备
            torch.cuda.empty_cache()  # 清空CUDA缓存
            torch.cuda.ipc_collect()  # 收集CUDA内存碎片

# 创建FastAPI应用
app = FastAPI()

# 处理POST请求的端点
@app.post(/)
async def create_item(request: Request):
    global model, tokenizer  # 声明全局变量以便在函数内部使用模型和分词器
    json_post_raw = await request.json()  # 获取POST请求的JSON数据
    json_post = json.dumps(json_post_raw)  # 将JSON数据转换为字符串
    json_post_list = json.loads(json_post)  # 将字符串转换为Python对象
    prompt = json_post_list.get(prompt)  # 获取请求中的提示

    # 调用模型进行对话生成
    model_inputs = tokenizer(prompt, add_special_tokens=True, return_tensors=pt)[input_ids].cuda()
    generated_ids = model.generate(model_inputs, max_length=384)
    response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    now = datetime.datetime.now()  # 获取当前时间
    time = now.strftime(%Y-%m-%d %H:%M:%S)  # 格式化时间为字符串
    # 构建响应JSON
    answer = {
        response: response,
        status: 200,
        time: time
    }
    # 构建日志信息
    log = [ + time + ]  + , prompt: + prompt + , response: + repr(response) + 
    print(log)  # 打印日志
    torch_gc()  # 执行GPU内存清理
    return answer  # 返回响应

# 主函数入口
if __name__ == __main__:
    # 加载预训练的分词器和模型
    model_path = /OpenELM/compshared-tmp/OpenELM-3B-Instruct
    tokenizer_path = /OpenELM/compshared-tmp/Llama-2-7b-hf
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map=auto, torch_dtype=torch.bfloat16, trust_remote_code=True)

    # 启动FastAPI应用
    # 用6006端口可以将autodl的端口映射到本地，从而在本地使用api
    uvicorn.run(app, host=0.0.0.0, port=11434, workers=1)  # 在指定端口和主机上启动应用

Api 部署

在终端输入以下命令启动api服务：

cd /OpenELM/compshared-tmp
python api.py

默认部署在 11434 端口，通过 POST 方法进行调用，可以使用 curl 调用，如下所示：

curl -X POST http://127.0.0.1:11434 \
     -H Content-Type: application/json \
     -d {prompt: Once upon a time there was}

也可以使用 python 中的 requests 库进行调用，我们在同级目录下创建ask.py，如下所示：

import requests
import json

def get_completion(prompt):
    headers = {Content-Type: application/json}
    data = {prompt: prompt}
    response = requests.post(url=http://127.0.0.1:11434, headers=headers, data=json.dumps(data))
    print(json.dumps(response.json()))
    return response.json()[response]

if __name__ == __main__:
    response = get_completion(Once upon a time there was)

得到的返回值如下所示：

{response: Once upon a time there was a little girl named Rosie. Rosie loved to play dress-up, and her favorite costume was a princess dress. Rosies mommy and daddy dressed her up in her princess dress every chance they got. Rosie loved her princess dress so much, she even wore it to bed! Rosies mommy and daddy loved Rosie very much, and they wanted Rosie to have the best life possible.\n\nOne day Rosies mommy and daddy took Rosie to visit their friends, the Johnsons. Rosies mommy and daddy told Rosie all about their friends, and Rosie couldnt wait to meet them. Rosies mommy and daddy dropped Rosie off at the Johnsons house, and Rosie ran inside to find her princess dress waiting for her. Rosie put on her princess dress and tiptoed upstairs to meet her friends.\n\nWhen Rosie walked into the Johnsons living room, she saw her friends sitting on the couch, dressed in jeans and t-shirts. Rosie smiled and ran over to give her friends a hug. Rosies friends were so happy to see her dressed up like a princess! They asked Rosie lots of questions about her princess dress, and Rosie told them all about her mommy and daddys special tradition. Rosies friends loved hearing all about her princess dress, and they promised to wear their princess dresses whenever Rosie visited them.\n\nAfter playing dress-up for a while, Rosies friends asked Rosie if she wanted to play outside. Rosie loved playing with her friends, but she also loved playing princess dress-up., status: 200, time: 2024-08-24 21:28:34}