Building a RAG Chatbot at the On-site Workshop #AI 动手实战营#

Preface:

CentOS 7.9 + 8 GB RAM + 50 GB disk: this configuration turned out to be insufficient and errored out, so I did not continue the deployment on that machine.
Recommended minimum: 10 GB RAM + 70 GB storage.

On-site setup at the AI Hands-on Workshop (steps 1 ~ 5 were skipped on site)

1. Install Python 3.9.5 and pip

The required version range appears to be greater than 3.9 and less than 4.0; I used 3.9.5 (downloads from the official site are slow, so I used a domestic mirror):

wget https://mirrors.huaweicloud.com/python/3.9.5/Python-3.9.5.tgz

Install the build dependencies:

yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make libffi-devel
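The tarball downloaded above still needs to be built and installed; a minimal sketch, assuming the /usr/local/python3 prefix that step 2's PATH export relies on:

```shell
# Build Python 3.9.5 from the tarball downloaded above and install it under
# /usr/local/python3 (the prefix that step 2 expects)
tar -xzf Python-3.9.5.tgz
cd Python-3.9.5
./configure --prefix=/usr/local/python3
make -j"$(nproc)"
make install          # custom prefix, so this does not clobber the system python
cd ..
/usr/local/python3/bin/python3 --version
```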

2. Install Poetry

export PATH="$PATH:/usr/local/python3/bin"
/usr/local/python3/bin/python3 -m pip install poetry

I also ran into a problem here:

    raise ImportError(
ImportError: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'OpenSSL 1.0.2k-fips  26 Jan 2017'. See: https://github.com/urllib3/urllib3/issues/2168
Uninstall and reinstall urllib3:
pip3 uninstall urllib3
pip3 install urllib3==1.26.15

Downloads were very slow for me. For a faster approach, see the community post “#AI 实战营 #使用OB搭建RAG 聊天机器人” on the OceanBase community Q&A.

3. Clone the project

git clone https://gitee.com/oceanbase-devhub/ai-workshop-2024.git

4. Install Docker

yum install -y docker-ce
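On a stock CentOS 7 system, `yum install -y docker-ce` fails with "no package available" until the Docker CE repository has been added; a sketch of the usual repo setup (official Docker repo URL, swap in a mirror if it is slow):

```shell
# Add the Docker CE yum repository first, then install and start the daemon
yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install -y docker-ce
systemctl enable --now docker
docker --version
```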

5. Deploy the OceanBase cluster

5.1 Start the Docker service
systemctl start docker
5.2 Start an OceanBase docker container
docker run --ulimit stack=4294967296 --name=ob433 -e MODE=mini -e OB_MEMORY_LIMIT=8G -e OB_DATAFILE_SIZE=10G -e OB_CLUSTER_NAME=ailab2024 -p 127.0.0.1:2881:2881 -d quay.io/oceanbase/oceanbase-ce:4.3.3.1-101000012024102216

If the command above succeeds, it will print the container ID, like this:

f7095ace669670874d67bf43c42bdbd046430d954e6d3dda9a708ee99d4bd607
5.3 Check whether OceanBase initialization has completed

Once the container is up, you can check the initialization status of the OceanBase database with:

docker logs -f ob433

Initialization takes roughly 2 ~ 3 minutes. When you see the message below (the boot success! at the bottom is required), OceanBase initialization is complete.

On my machine it errored out due to insufficient resources:

[root@localhost ~]# docker logs -f ob433
+--------------------------------------------------+
|                   Cluster List                   |
+------+-------------------------+-----------------+
| Name | Configuration Path      | Status (Cached) |
+------+-------------------------+-----------------+
| demo | /root/.obd/cluster/demo | stopped         |
+------+-------------------------+-----------------+
Trace ID: 715efe88-a87d-11ef-b400-0242ac110002
If you want to view detailed obd logs, please run: obd display-trace 715efe88-a87d-11ef-b400-0242ac110002
repository/
.....................................................
.....................................................
.....................................................
[WARN] OBD-1007: (172.17.0.2) The recommended number of stack size is unlimited (Current value: 4194304)
[WARN] OBD-1017: (172.17.0.2) The value of the "vm.max_map_count" must be within [327600, 1310720] (Current value: 65530, Recommended value: 655360)
[WARN] OBD-1017: (172.17.0.2) The value of the "fs.file-max" must be greater than 6573688 (Current value: 760168, Recommended value: 6573688)
[ERROR] OBD-2000: (172.17.0.2) not enough memory. (Free: 248M, Buff/Cache: 4G, Need: 8G), Please reduce the `memory_limit` or `memory_limit_percentage`
[WARN] OBD-1012: (172.17.0.2) clog and data use the same disk (/)
[ERROR] OBD-2003: (172.17.0.2) / not enough disk space. (Avail: 14G, Need: 19G), Please reduce the `datafile_size` or `datafile_disk_percentage`

See https://www.oceanbase.com/product/ob-deployer/error-codes .
Trace ID: 786cb5f8-a87d-11ef-a08d-0242ac110002
If you want to view detailed obd logs, please run: obd display-trace 786cb5f8-a87d-11ef-a08d-0242ac110002
boot failed!
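For reference: the two [ERROR] lines (memory and disk) can only be fixed by freeing resources or moving to a larger machine, but the kernel-parameter [WARN] lines can be cleared with sysctl, using the recommended values straight from the log; a sketch (run as root):

```shell
# Apply the kernel parameters recommended by the OBD warnings above (run as root)
sysctl -w vm.max_map_count=655360
sysctl -w fs.file-max=6573688
# Persist the settings across reboots
cat >> /etc/sysctl.conf <<'EOF'
vm.max_map_count=655360
fs.file-max=6573688
EOF
sysctl -p
```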

6. Provision an OB Cloud personal instance

  1. Use the free trial of OB Cloud; for platform registration and instance provisioning, see “OB Cloud 云数据库 365 天免费试用”.

  2. Create a database and an account.

  3. Import the ~25 MB dataset.

  4. Configure the IP allowlist and note the connection string.

7. Register an 阿里云百炼 (Alibaba Cloud Bailian) account and obtain an API Key

8. Configure environment variables in .env

Environment variables required by the chatbot:

cp .env.example .env
# Update the values in the .env file, especially API_KEY and the database connection info
vi .env
API_KEY=  # the API Key from step 7
LLM_MODEL="qwen-turbo-2024-11-01"
LLM_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"

HF_ENDPOINT=https://hf-mirror.com
BGE_MODEL_PATH=BAAI/bge-m3

OLLAMA_URL=
OLLAMA_TOKEN=

OPENAI_EMBEDDING_API_KEY= # the API Key from step 7
OPENAI_EMBEDDING_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings"
OPENAI_EMBEDDING_MODEL=text-embedding-v3

UI_LANG="zh"

# If you are using an OB Cloud instance, update the variables below with its connection info
DB_HOST="IP from the step-6 connection string"
DB_PORT="port from the step-6 connection string"
DB_USER="user created in step 6"
DB_NAME="test"
DB_PASSWORD="password of the step-6 account"

9. Test the remote database connection

# Test the local docker instance from step 5 directly:
mysql -h127.0.0.1 -P2881 -uroot@test -A -e "show databases"
# Test using the connection info configured in .env (covers the OB Cloud instance from step 6):
bash utils/connect_db.sh
# If you land in a MySQL session, the environment variables are set correctly

10. Prepare the document data

10.1 Clone the documentation repository

First, use git to clone the OceanBase documentation (oceanbase-doc) locally.

git clone --single-branch --branch V4.3.4 https://github.com/oceanbase/oceanbase-doc.git doc_repos/oceanbase-doc
# If access to the GitHub repository is slow, use the following command to clone the Gitee mirror instead
git clone --single-branch --branch V4.3.4 https://gitee.com/oceanbase-devhub/oceanbase-doc.git doc_repos/oceanbase-doc
10.2 Normalize the document format
# Convert the document headings to standard markdown format
poetry run python convert_headings.py doc_repos/oceanbase-doc/zh-CN
10.3 Convert the documents to vectors and insert them into OceanBase
# Generate document embeddings and metadata
poetry run python embed_docs.py --doc_base doc_repos/oceanbase-doc/zh-CN/640.ob-vector-search
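To sanity-check that the embedding step actually inserted rows, you can count the corpus table afterwards (corpus is the script's default table name); a sketch against the local docker instance from step 5:

```shell
# Sanity check: count the embedded rows; assumes the default table name "corpus"
# and the local docker instance from step 5 (use your OB Cloud connection otherwise)
mysql -h127.0.0.1 -P2881 -uroot@test -A test -e "SELECT COUNT(*) FROM corpus"
```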

11. Start the chat UI

Run the following command to start the chat UI:

poetry run streamlit run --server.runOnSave false chat_ui.py

Open the URL shown in the terminal to access the chatbot UI.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.xxx.xxx.xxx:8501
  External URL: http://xxx.xxx.xxx.xxx:8501 # this is the URL you can open from a browser
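If the Network/External URL cannot be opened from another machine, the port may simply be blocked; on CentOS with firewalld, a sketch:

```shell
# Open the Streamlit port so the Network/External URL is reachable (CentOS firewalld)
firewall-cmd --permanent --add-port=8501/tcp
firewall-cmd --reload
```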

12. References:

  1. 试用 OceanBase 4.3.3 构建《黑神话:悟空》智能游戏助手
  2. ai-workshop-2024: OceanBase 2024 产品发布会 AI 动手实战营项目
  3. #AI 实战营 #使用OB搭建RAG 聊天机器人
  4. 免费使用OceanBase Cloud搭建RAG聊天机器人
7 likes

I hit an SSL error:

ImportError: urllib3 v2.0 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with LibreSSL 2.8.3

The following fix did not help:

pip uninstall urllib3
pip install urllib3==1.26.15

My guess is that for the version pin to take effect, it has to be managed by poetry.
Edit pyproject.toml and add:

urllib3 = "^1.26.15"

Then run:

poetry lock
poetry install

After that, the error was gone.

However, when embedding the documents I still hit a different error:

[mq@server065 ai-workshop-2024]$ poetry run python embed_docs.py --doc_base doc_repos/oceanbase-doc/zh-CN/640.ob-vector-search
args Namespace(doc_base='doc_repos/oceanbase-doc/zh-CN/640.ob-vector-search', table_name='corpus', skip_patterns=['oracle'], batch_size=4, component='observer', limit=300, echo=False)
Using RemoteOpenAI
  0%|                                                                                                                                                                                                                                                 | 0/9 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/mq/ai-workshop-2024/embed_docs.py", line 128, in <module>
    insert_batch(batch, comp=args.component)
  File "/home/mq/ai-workshop-2024/embed_docs.py", line 112, in insert_batch
    vs.add_documents(
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/langchain_core/vectorstores/base.py", line 287, in add_documents
    return self.add_texts(texts, metadatas, **kwargs)
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/langchain_community/vectorstores/oceanbase.py", line 300, in add_texts
    embeddings = self.embedding_function.embed_documents(texts)
  File "/home/mq/ai-workshop-2024/rag/embeddings.py", line 74, in embed_documents
    res = requests.post(
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 716, in urlopen
    httplib_response = self._make_request(
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/home/mq/.pyenv/versions/3.9.5/lib/python3.9/http/client.py", line 1253, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/mq/.pyenv/versions/3.9.5/lib/python3.9/http/client.py", line 1294, in _send_request
    self.putheader(hdr, value)
  File "/home/mq/.cache/pypoetry/virtualenvs/ai-workshop-2NuhJKpg-py3.9/lib/python3.9/site-packages/urllib3/connection.py", line 224, in putheader
    _HTTPConnection.putheader(self, header, *values)
  File "/home/mq/.pyenv/versions/3.9.5/lib/python3.9/http/client.py", line 1226, in putheader
    values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 9-15: ordinal not in range(256)
[mq@server065 ai-workshop-2024]$ 

The default encoding is already utf-8, so this error seems quite strange.

[mq@server065 ai-workshop-2024]$ poetry run python -c "import sys; print(sys.getdefaultencoding())"
utf-8
[mq@server065 ai-workshop-2024]$ 
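The utf-8 default is actually a red herring here: Python's http.client always encodes HTTP header values as latin-1, so any non-ASCII text that ends up in a header (for example, a Chinese placeholder left in an API key variable in .env) triggers exactly this UnicodeEncodeError, regardless of the default encoding. A minimal reproduction (the placeholder string below is hypothetical):

```shell
python3 - <<'EOF'
# Header values go through str.encode('latin-1') inside http.client;
# the process default encoding (utf-8) plays no role in that step.
import sys
print(sys.getdefaultencoding())                  # utf-8
header_value = "Bearer 步骤7中的API-KE"           # hypothetical leftover placeholder
try:
    header_value.encode("latin-1")
except UnicodeEncodeError as e:
    print("UnicodeEncodeError:", e.reason)
EOF
```

So if a non-ASCII placeholder or comment ends up in the Authorization header, this is exactly the failure you would see.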
4 likes

Teacher Deng's material this time is really top-notch :+1:

1 like

Hi, could you try explicitly specifying the request character set in the headers around line 76 of rag/embeddings.py? Something along those lines.

2 likes

Thanks, I made that change but it still fails.

[mq@server065 ai-workshop-2024]$ poetry run python embed_docs.py --doc_base doc_repos/oceanbase-doc/zh-CN/640.ob-vector-search
args Namespace(doc_base='doc_repos/oceanbase-doc/zh-CN/640.ob-vector-search', table_name='corpus', skip_patterns=['oracle'], batch_size=4, component='observer', limit=300, echo=False)
Using RemoteOpenAI
  0%|                                                                                                                                              | 0/9 [00:00<?, ?it/s]
Traceback (most recent call last):
  ... (identical to the traceback above) ...
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 9-15: ordinal not in range(256)
2 likes

What operating system and kernel version is your machine running? Also, could you try it in a Python 3.10 environment?

2 likes

:clap::clap::clap::clap::clap::clap:

2 likes

WSL2 Ubuntu-22.04

2 likes

Hi, have you filled in the OPENAI_EMBEDDING_API_KEY entry in your .env? If not, set it to the same value as API_KEY; that should solve this problem.

2 likes

Nice work, keep it up!

2 likes

I think I roughly know the cause now.
The source code is being updated continuously; after pulling again, it works.
A question about the .env config file: how should these two parameters be filled in?

OLLAMA_URL=
OLLAMA_TOKEN=

I have also deployed ollama locally, with several models available.

C:\Users\MQ>curl http://192.168.0.102:8000/
Ollama is running                                                                                                                                                        
C:\Users\MQ>ollama list
NAME                       ID              SIZE      MODIFIED                                                                                                            
nomic-embed-text:latest    0a109f422b47    274 MB    2 weeks ago                                                                                                         
gemma:7b-instruct-fp16     f689ad351c8d    17 GB     2 weeks ago                                                                                                         
qwen2:7b                   dd314f039b9d    4.4 GB    2 weeks ago                                                                                                         
qwen:1.8b  
1 like

Very detailed write-up~

1 like

Great work

1 like

Keep it up

1 like

Hmm, judging by the README on GitHub, the deployment did succeed.
This example currently still calls the 阿里云百炼 LLM.
I see the config file has
OLLAMA_URL
OLLAMA_TOKEN
Can these be configured to call a local Ollama service?

My deployment notes: 笔记本A卡玩转大模型应用场景

Very detailed