Client-side embedding question

左涛 · November 21, 2025, 08:54

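I set up a client/server architecture. seekdb is installed on the server and works fine. When I run the client (the example program from the official site), I have a question: I noticed that the client (Windows) downloaded the all-MiniLM-L6-v2 model. Can I assume that, because of this model's limits, my chunks can be at most 256 in size?

Also, since my seekdb is installed on the server (Linux), why can't the embedding be handled on the server side, which is the more powerful machine?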

import pyseekdb

# Alternative: Server mode (connecting to remote SeekDB server)
client = pyseekdb.Client(
    host="__my_server_ip__",
    port=2881,
    database="test",
    user="root",
    password=""
)

# ==================== Step 2: Create a Collection with Embedding Function ====================
# A collection is like a table that stores documents with vector embeddings
collection_name = "my_simple_collection"

# Create collection with default embedding function
# The embedding function will automatically convert documents to embeddings
collection = client.create_collection(
    name=collection_name,
)

print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")

# ==================== Step 3: Add Data to Collection ====================
# With embedding function, you can add documents directly without providing embeddings
# The embedding function will automatically generate embeddings from documents

documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text"
]

ids = ["id1", "id2", "id3", "id4", "id5"]

# Add data with documents only - embeddings will be auto-generated by embedding function
collection.add(
    ids=ids,
    documents=documents,  # embeddings will be automatically generated
    metadatas=[
        {"category": "AI", "index": 0},
        {"category": "Programming", "index": 1},
        {"category": "Database", "index": 2},
        {"category": "AI", "index": 3},
        {"category": "NLP", "index": 4}
    ]
)

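Running the program, I can clearly see that the client (running on Windows) uses all-MiniLM-L6-v2, and the embeddings have only 384 dimensions, which feels like it drags down performance a bit.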
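
On the chunk-size question: by default all-MiniLM-L6-v2 truncates input longer than 256 word pieces, so longer chunks are not rejected, but text past that limit does not contribute to the embedding. Below is a minimal sketch of keeping chunks under such a budget. It assumes whitespace-separated words as a rough proxy for word pieces (real tokenization usually yields more pieces than words, hence the conservative default, and text without spaces, such as Chinese, would need a real tokenizer). The name `chunk_words` and its defaults are illustrative and not part of pyseekdb.

```python
from typing import List


def chunk_words(text: str, max_words: int = 180, overlap: int = 20) -> List[str]:
    """Split text into overlapping chunks of at most max_words whitespace words.

    Hypothetical helper: word counts only approximate the model's
    word-piece count, so max_words is kept well below 256.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current window already reached the end of the text
    return chunks
```

Each returned chunk can then be passed as one entry of `documents` in `collection.add(...)`, with one id per chunk.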

王小A · November 24, 2025, 11:35

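Surely the design isn't that naive; custom embedding functions must be supported.
Take a look at this: embedding_function.py. You can install sentence-transformers and then use the other models from that package.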
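
A minimal sketch of the general shape such a custom embedding function might take: a callable that maps a list of documents to a list of fixed-length vectors. The class name `CustomEmbeddingFunction`, the `toy_encode` stand-in, and the idea of passing the object to `create_collection` through an `embedding_function` argument are assumptions for illustration; check embedding_function.py for the interface pyseekdb actually expects.

```python
import hashlib
from typing import Callable, List, Sequence, Union

Vector = List[float]
EncodeFn = Callable[[List[str]], Sequence[Sequence[float]]]


def toy_encode(texts: List[str]) -> List[Vector]:
    """Deterministic 8-dimensional stand-in for a real model (not semantic)."""
    return [
        [b / 255.0 for b in hashlib.sha256(t.encode("utf-8")).digest()[:8]]
        for t in texts
    ]


class CustomEmbeddingFunction:
    """Hypothetical wrapper: turns any encode callable into documents -> vectors."""

    def __init__(self, encode: EncodeFn, dimension: int) -> None:
        self._encode = encode
        self.dimension = dimension

    def __call__(self, documents: Union[str, Sequence[str]]) -> List[Vector]:
        # Accept a single string as well as a sequence of strings.
        texts = [documents] if isinstance(documents, str) else list(documents)
        vectors = [[float(x) for x in row] for row in self._encode(texts)]
        for row in vectors:
            if len(row) != self.dimension:
                raise ValueError(
                    f"expected {self.dimension} dimensions, got {len(row)}"
                )
        return vectors


# With sentence-transformers installed, a real model could be plugged in:
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-mpnet-base-v2")  # 768-dimensional output
# ef = CustomEmbeddingFunction(
#     lambda texts: model.encode(texts).tolist(),
#     dimension=model.get_sentence_embedding_dimension(),
# )
```

Whichever model is used, the collection's dimension has to match the model's output dimension, so an existing 384-dimensional collection would likely need to be recreated.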

左涛 · November 24, 2025, 13:22

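Thanks. The minilm-l6 default was probably provided just to make getting started quick. And yes, we do plan to replace it with a professional embedding model.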