The model can be run with either llama-cli or llama-server.
llama-cli -m chinese_q4_0.gguf -p "You are a helpful assistant" -cnv -ngl 24
where:
- -m is followed by the model file to run
- -cnv runs the model in conversation mode
- -ngl: when the build has GPU support, this option offloads some layers to the GPU for computation, which generally improves performance
Other parameters are described in the official documentation: llama.cpp/examples/main/README.md at master · ggerganov/llama.cpp (github.com)
llama.cpp provides an API fully compatible with the OpenAI API; the API service is started with the compiled llama-server executable. If the build includes a GPU execution environment, the -ngl N or --n-gpu-layers N parameter specifies how many layers to offload so the model runs inference on the GPU. Without -ngl N or --n-gpu-layers N, the program runs on the CPU by default.
./llama-server -m /mnt/workspace/my-llama-13b-q4_0.gguf -ngl 28
The following key startup log lines show that the model is executing on the GPU:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 1002.00 MiB
llm_load_tensors: CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
This starts a web-server-like process on port 8080 by default. The API service is now up and can be tested with curl:
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "What color is the sun?", "n_predict": 512}'
{"content": ".....", "generation_settings": {"frequency_penalty": 0.0, "grammar": "", "ignore_eos": false, "logit_bias": [], "mirostat": 0, "mirostat_eta": 0.10000000149011612, "mirostat_tau": 5.0, ......}}
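The same endpoint can also be called from Python. Below is a minimal sketch using the third-party requests library (an assumption on our part, not something llama.cpp ships), posting the same payload to the /completion endpoint shown above:

# Minimal sketch: POST to llama-server's /completion endpoint.
# Assumes the server started above is listening on localhost:8080
# and that requests is installed (pip install requests).
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "What color is the sun?", "n_predict": 512},
)
resp.raise_for_status()
data = resp.json()
print(data["content"])  # the generated text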
The service can also be accessed through the web page or the OpenAI API. Install the openai dependency:
pip install openai
Access via the OpenAI API:
import openai

client = openai.OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="sk-no-key-required"  # the local server does not require a real key
)
completion = client.chat.completions.create(
    model="qwen",  # model name can be chosen arbitrarily
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "tell me something about michael jordan"}
    ]
)
print(completion.choices[0].message.content)
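For longer generations it can be useful to stream tokens as they are produced. The sketch below reuses the client defined above and passes stream=True; the chunk fields follow the openai client's streaming convention, and support on the server side is assumed from llama-server's OpenAI-compatible endpoint:

# Streaming variation: iterate over chunks as the server generates them.
stream = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "tell me something about michael jordan"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content  # may be None for the final chunk
    if delta:
        print(delta, end="", flush=True)
print()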