Since the emergence of Large Language Models (LLMs), implementing locally deployed knowledge bases using LLM+RAG has become a hot topic.
This article will provide a complete guide to deploying an LLM+RAG knowledge base system on Ubuntu 24.04 and share tuning tips to help developers build efficient, private AI knowledge bases.
Basic Information Overview
Overview of Ollama
Ollama is an open-source tool designed to simplify the process of running large language models (LLMs) locally. It supports various models like Llama 2, Mistral, etc., and provides a user-friendly interface, suitable for developers or researchers using LLMs in a local environment. Its main advantages are data privacy and local execution, reducing reliance on cloud services.
Overview of AnythingLLM
AnythingLLM is a full-featured AI application supporting RAG (Retrieval-Augmented Generation), AI agents, and document management. It allows users to choose different LLM providers (like Ollama, OpenAI, etc.) and embedding models, making it suitable for building private knowledge bases. Its desktop version supports Linux, Windows, and macOS, emphasizing local execution and multi-user management.
Overview of DeepSeek V3 Model
DeepSeek V3 is an open-source large language model (LLM) developed by DeepSeek. It employs a Mixture-of-Experts (MoE) architecture with a total of 671 billion parameters, of which 37 billion are activated per token. By activating only a portion of the expert networks, it achieves computational efficiency, reducing the cost of both inference and training.
Overview of DeepSeek-R1-Distill-Llama-70B Model
DeepSeek R1 is a large language model (LLM) developed by the Chinese AI company DeepSeek, focusing on areas like logical reasoning, mathematics, and coding. It is built on a Mixture-of-Experts (MoE) architecture with 671 billion total parameters, of which 37 billion are activated per token.
In addition to releasing the DeepSeek R1 model itself, DeepSeek has also open-sourced several distilled models. These distilled models are created by transferring the knowledge from DeepSeek R1 into smaller open-source models, aiming to retain most of the performance while reducing model size and improving computational efficiency.
Overview of QwQ-32B Model
QwQ-32B is a reasoning model developed by the Qwen team (Alibaba Cloud) with 32.5 billion parameters, excelling at coding and mathematical reasoning. The model supports long contexts (up to 131K tokens), making it suitable for complex tasks, but requires at least 24GB of GPU memory to run (when quantized).
Overview of Bge-m3 Embedding Model
Bge-m3 is an embedding model developed by BAAI (Beijing Academy of Artificial Intelligence). It supports multifunctionality (dense retrieval, multi-vector retrieval, sparse retrieval), multilingual capabilities (100+ languages), and multi-granularity (from short sentences to 8192 token documents). It excels in multilingual retrieval tasks and is suitable for document embedding in knowledge bases.
Deployment Prerequisites
Software Environment
- System: Ubuntu 24.04 (Latest LTS version recommended)
Hardware Requirements
- GPU memory: at least 16GB. If sufficient GPU memory is unavailable, more heavily quantized versions of the models can be used.
- Sufficient disk space to store model files and knowledge base data.
Deployment Process
Install Ollama
Detailed documentation can be found in the Official Ollama Documentation.
Installation
- Open a terminal and execute the following command:
curl -fsSL https://ollama.com/install.sh | sh
- Start the ollama service:
ollama serve
- Verify ollama version:
ollama --version
- List locally downloaded models:
ollama list
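- Optionally, confirm the API itself is responding; a minimal check against the default address (http://127.0.0.1:11434) might look like this:
curl http://127.0.0.1:11434/api/version
The response should be a small JSON object containing the installed Ollama version.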
Set up Service Autostart
- Create user and group:
sudo useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
sudo usermod -a -G ollama $(whoami)
- Create the service file /etc/systemd/system/ollama.service:
sudo vim /etc/systemd/system/ollama.service
- Add the following content to the file:
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=$PATH"

[Install]
WantedBy=multi-user.target
- After saving the file modifications, execute:
sudo systemctl daemon-reload
sudo systemctl enable ollama
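- Enabling the service only registers it for autostart; to start it now and confirm it is healthy, the standard systemd commands can be used:
sudo systemctl start ollama
systemctl status ollama
journalctl -u ollama -f    # follow the service logs; press Ctrl+C to stop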
Configuration
- By default, Ollama serves its API at http://127.0.0.1:11434 and accepts only local connections. If you need to allow API access from other hosts, configure the OLLAMA_HOST environment variable:
sudo vim /etc/systemd/system/ollama.service
- Add the Environment line under the [Service] section of the file:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
- After saving the file modifications, execute:
sudo systemctl daemon-reload
sudo systemctl restart ollama
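- To verify the new binding took effect, you can check the listening address on the server and then call the API from another machine (replace <SERVER_IP> with the server's actual address):
ss -ltn | grep 11434    # should show the service listening on 0.0.0.0:11434
curl http://<SERVER_IP>:11434/api/version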
Download LLM Models
Download QwQ-32B Model
- Run the following command in the terminal and wait for the download to complete:
ollama pull qwq
- Check if the model already exists:
ollama list
Download DeepSeek R1 Model
- Run the following command in the terminal and wait for the download to complete. By default, this downloads the 4-bit quantized model:
ollama pull deepseek-r1:1.5b   # download the 1.5B model
ollama pull deepseek-r1:7b     # download the 7B model
ollama pull deepseek-r1:8b     # download the 8B model
ollama pull deepseek-r1:14b    # download the 14B model
ollama pull deepseek-r1:32b    # download the 32B model
ollama pull deepseek-r1:70b    # download the 70B model
- To download a non-quantized (fp16) model, use the corresponding tag, for example:
ollama pull deepseek-r1:32b-qwen-distill-fp16
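- After a model finishes downloading, a quick smoke test confirms it loads and responds. The prompt below is only an example; substitute whichever tag you actually pulled:
ollama run deepseek-r1:7b "Introduce yourself in one sentence."
The same check can also be made through the API:
curl http://127.0.0.1:11434/api/generate -d '{"model": "deepseek-r1:7b", "prompt": "Hello", "stream": false}'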
Download Embedding Models
Download Bge-m3 Model
- Run the following command in the terminal and wait for the download to complete:
ollama pull bge-m3
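- To confirm the embedding model works, you can request a test embedding through the Ollama API; the sentence used here is arbitrary:
curl http://127.0.0.1:11434/api/embeddings -d '{"model": "bge-m3", "prompt": "A test sentence for embedding."}'
The response should contain an "embedding" array (1024 dimensions for bge-m3).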
Install and Deploy AnythingLLM
AnythingLLM has a desktop version available for download from the Desktop Download Page. Download the version corresponding to your operating system, install it, and run it directly.
If you need to deploy the server version to support multiple users, use the following commands:
Installation
- Set the AnythingLLM installation directory:
sudo mkdir -p /home/anythingllm && cd /home/anythingllm
- Download the Docker image:
docker pull mintplexlabs/anythingllm:master
- Set environment variables:
export STORAGE_LOCATION="/home/anythingllm" && touch "$STORAGE_LOCATION/.env"
- Run the container:
docker run -d -p 3001:3001 \
  --cap-add SYS_ADMIN \
  -v ${STORAGE_LOCATION}:/app/server/storage \
  -v ${STORAGE_LOCATION}/.env:/app/server/.env \
  -e STORAGE_DIR="/app/server/storage" \
  --add-host=host.docker.internal:host-gateway \
  mintplexlabs/anythingllm:master
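- To confirm the container started correctly, you can check its status and follow its logs (the filter below assumes the image tag used above):
docker ps --filter ancestor=mintplexlabs/anythingllm:master
docker logs -f $(docker ps -q --filter ancestor=mintplexlabs/anythingllm:master)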
Configuration
- Access http://<YOUR_REACHABLE_IP>:3001/ to configure. Click the icon in the bottom left corner of the page to enter the configuration page and begin AnythingLLM setup.
- Configure Admin User: On the initial login, enter a username and password, then proceed with the configuration. Once configured, you can access the system via http://<YOUR_REACHABLE_IP>:3001/.
- Configure LLM Preference: The “LLM Provider” option allows selecting Ollama, OpenAI, etc.
  - If choosing an external provider like OpenAI, configure the API Key.
  - If selecting Ollama, note that for AnythingLLM running via Docker, the Ollama Base URL is http://172.17.0.1:11434 (a quick connectivity check for this address is shown after this list). If the Ollama service connects successfully, the “Ollama Model” option will display a list of available models. Currently selected: qwq:latest.
- Configure Embedder Preference: The “Embedding Engine Provider” option allows selecting AnythingLLM’s default embedding engine, Ollama, or another third-party service.
  - AnythingLLM’s default embedding engine is all-MiniLM-L6-v2, which is primarily optimized for English documents and has limited multilingual support.
  - If choosing another third-party service such as OpenAI, configure the API Key.
  - If selecting Ollama, the Ollama Base URL for a Docker deployment is likewise http://172.17.0.1:11434. If the Ollama service connects successfully, the “Ollama Model” option will display a list of available models. Currently selected: bge-m3:latest.
- Finish Configuration and Return to the Main Page.
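- Connectivity check mentioned above: before selecting Ollama in the UI, you can verify from the host that Ollama is reachable on the Docker bridge address the container will use. This assumes the default bridge network and the OLLAMA_HOST=0.0.0.0:11434 setting configured earlier:
curl http://172.17.0.1:11434/api/tags
A successful response lists the locally downloaded models (qwq, bge-m3, etc.).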
Creating a Local Knowledge Base
Data files uploaded in AnythingLLM can be used by multiple knowledge bases.
The process for creating a new knowledge base is as follows:
Create a new workspace -> Upload or select existing data files -> Add data files to the workspace -> Vectorize data
Once data vectorization is complete, the knowledge base for that workspace is created.
Create New Workspace
- New Workspace: Click the “New Workspace” button and name the new workspace.
- Workspace Settings: Click the icon to enter the workspace settings page. Select Ollama for “Workspace LLM Provider”; this will display the “Workspace Chat Model” option, where you can choose the desired model. Currently selected: qwq:latest. Use default values for the other workspace settings.
Upload Local Files
- Open Upload Files Dialog
Click the icon shown in the image to open the upload files dialog.
- Upload Files and Complete Vectorization
- Drag and drop the files you want to add to the knowledge base into the upload dialog.
- Select the knowledge base data files you wish to import into the workspace from the list.
- Click the “Move to Workspace” button to import the selected files into the workspace.
- Click the “Save and Embed” button and wait for the system to vectorize the data; once vectorization finishes, the knowledge base for this workspace is ready.
Tuning the Local Knowledge Base
Embedding Vector Model Selection
Different embedding vector models have different characteristics:
| Feature | all-MiniLM-L6-v2 | BAAI/bge-m3 |
| --- | --- | --- |
| Model Size | ~25MB | ~1.5GB |
| Hardware Requirements | Runs on CPU, >=2GB RAM | GPU recommended |
| Multilingual Support | Primarily English | Supports 100+ languages |
| Inference Speed (CPU) | Fast, suitable for low-end hardware | Slower, GPU recommended |
Text Chunk Size and Overlap Selection
A text chunk refers to a smaller segment of a document after it has been split, often necessary due to the input length limitations of Large Language Models (LLMs).
Overlap refers to the shared portion between adjacent chunks, intended to maintain contextual continuity and prevent loss of information at boundaries.
- Impact of Text Chunk Size: Chunk size directly affects the semantic representation of vectors and downstream task performance:
- Contextual Understanding: Very small chunks might not capture sufficient context.
- Computational Efficiency: Larger chunks require more computational resources; increased input length significantly raises processing time.
- Granularity and Task Matching: Smaller chunks are suitable for tasks needing fine-grained information (e.g., specific information retrieval); larger chunks are better for tasks requiring overall context (e.g., document summarization).
- Impact of Overlap Amount: Overlap ensures contextual continuity at chunk boundaries:
- Context Preservation: Overlap helps prevent information loss, e.g., a sentence spanning across chunks remains intact through overlap.
- Redundancy vs. Efficiency: Excessive overlap (e.g., >20%) increases computational cost.
- Task Relevance: The amount of overlap should be adjusted based on the task. For instance, in information retrieval, overlap ensures queries match information near chunk boundaries; in question-answering systems, overlap helps maintain context integrity.
The choice of text chunk size and overlap should be adjusted to the task and hardware environment, and different types of text may require different strategies: technical documents often benefit from smaller chunks to capture details, while narrative texts may need larger chunks to maintain story context. A common starting point is a chunk size of a few hundred to roughly a thousand tokens with about 10-15% overlap, tuned against retrieval quality.
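To make the chunk/overlap interaction concrete, here is a minimal command-line sketch (not part of AnythingLLM, which performs splitting internally) that splits a plain-text file into overlapping word-based chunks; words stand in for tokens, and the input.txt filename, 200-word chunk size, and 20-word overlap are arbitrary illustrative values:
CHUNK=200     # words per chunk (proxy for tokens)
OVERLAP=20    # words shared between adjacent chunks (10%)
awk -v chunk="$CHUNK" -v overlap="$OVERLAP" '
{ for (i = 1; i <= NF; i++) words[++n] = $i }   # collect all words of the input
END {
  step = chunk - overlap                        # stride between chunk starts
  for (start = 1; start <= n; start += step) {
    end = start + chunk - 1; if (end > n) end = n
    line = ""
    for (i = start; i <= end; i++) line = line (i > start ? " " : "") words[i]
    printf("--- chunk %d (words %d-%d) ---\n%s\n", ++c, start, end, line)
    if (end >= n) break                         # last chunk reached
  }
}' input.txt
With these settings, adjacent chunks share 20 words, so a sentence crossing a boundary still appears intact in at least one chunk.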
Chat Prompt Optimization
Prompts for the LLM chat model are used to control the model’s behavior, helping it better understand and answer questions.
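For example, a workspace prompt for a documentation knowledge base might read as follows (the wording is only a suggestion, not a built-in default):
You are a technical support assistant for our internal documentation. Answer only from the provided context. If the context does not contain the answer, say that you do not know instead of guessing. Keep answers concise and mention the source document name when possible.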
LLM Temperature Setting
The LLM temperature setting controls the randomness of the model’s output: lower values (e.g., 0.1-0.3) give more deterministic, repeatable answers, which generally suits knowledge-base Q&A, while higher values produce more varied, creative output.
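If you want to observe the effect directly against Ollama (outside of AnythingLLM), temperature can be passed in the options field of a generate request; the model name and values below are only examples:
curl http://127.0.0.1:11434/api/generate -d '{"model": "qwq", "prompt": "Summarize what RAG is in one sentence.", "stream": false, "options": {"temperature": 0.2}}'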
Vector Database Settings
- Max Context Chunks: Controls the number of context chunks sent to the LLM with each request. A higher number means the LLM receives more data, potentially improving result quality but requiring more computational resources.
- Document Similarity Threshold: Controls the minimum similarity a document chunk must have with the query before it is included as context. A higher threshold means the returned results are more strongly related to the query, but it may filter out so many chunks that the model lacks enough context to answer accurately.