优云智算 | llama.cpp一键部署

LLM

WebUI

llama.cpp

llama.cpp环境配置以及模型转换

0元/小时

v1.0

可通过llama-cli或llama-server运行模型。

llama-cli -m chinese_q4_0.gguf -p you are a helpful assistant -cnv -ngl 24

其中：

-m参数后跟要运行的模型
-cnv表示以对话模式运行模型
-ngl：当编译支持 GPU 时，该选项允许将某些层卸载到 GPU 上进行计算。一般情况下，性能会有所提高。

其他参数详见官方文档llama.cpp/examples/main/README.md at master · ggerganov/llama.cpp (github.com)

模型API服务

llama.cpp提供了完全与OpenAI API兼容的API接口，使用经过编译生成的llama-server可执行文件启动API服务。如果编译构建了GPU执行环境，可以使用-ngl N或 --n-gpu-layers N参数，指定offload层数，让模型在GPU上运行推理。未使用-ngl N或 --n-gpu-layers N参数，程序默认在CPU上运行

./llama-server -m /mnt/workspace/my-llama-13b-q4_0.gguf -ngl 28

可从以下关键启动日志看出，模型在GPU上执行

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  1002.00 MiB
llm_load_tensors:      CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0

会启动一个类似web服务器的进程，默认端口号为8080，这样就启动了一个 API 服务，可以使用 curl 命令进行测试。

curl --request POST \
    --url http://localhost:8080/completion \
    --header Content-Type: application/json \
    --data {prompt: What color is the sun?,n_predict: 512}

{content:.....,generation_settings:{frequency_penalty:0.0,grammar:,ignore_eos:false,logit_bias:[],mirostat:0,mirostat_eta:0.10000000149011612,mirostat_tau:5.0,......}}

此外可通过web页面或者OpenAI api等进行访问。安装openai依赖

pip install openai

使用OpenAI api访问：

import openai

client = openai.OpenAI(
    base_url=http://127.0.0.1:8080/v1,
    api_key = sk-no-key-required
)

completion = client.chat.completions.create(
    model=qwen, # model name can be chosen arbitrarily
    messages=[
        {role: system, content: You are a helpful assistant.},
        {role: user, content: tell me something about michael jordan}
    ]
)
print(completion.choices[0].message.content)