hra-ftu-rag-supporting-information

Usage Guide (usage.md)

This document provides step-by-step instructions for using the main components of HRAftu-LM-RAG. Each section covers how to run scripts and services in the src/ directory, along with example commands.


Prerequisites

  1. Configuration File
    Before invoking any script, ensure you have copied and edited config/example_config.yaml to config/config.yaml, filling in:
    • fastgpt.api_key & fastgpt.host
    • Vision model device/port settings
    • ClickHouse connection details (if using similarity evaluation)
  2. Environment Activation
    cd HRAftu-LM-RAG
    source venv/bin/activate       # or: conda activate hraftu
    
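For reference, a filled-in config/config.yaml might look like the sketch below. This is only an illustration of the three groups of settings listed above; take the authoritative key names from config/example_config.yaml:

```yaml
# Sketch only — mirror the actual keys in config/example_config.yaml
fastgpt:
  api_key: "fastgpt-xxxxxxxx"      # your FastGPT API key
  host: "http://localhost:3000"    # FastGPT server URL

vision:
  device: "cuda:0"                 # device for the vision models
  port: 55500                      # port the LVM service listens on

clickhouse:                        # only needed for similarity evaluation
  host: "localhost"
  port: 9000
  user: "default"
  password: ""
  database: "clkg"
```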

1. Data Import

The Data Import script (import_data_to_knowledge_datatabase.py) uploads a directory of local files (PDF, TXT, DOCX, etc.) into a FastGPT dataset/collection.

1.1 Command Syntax

python src/import_data/import_data_to_knowledge_datatabase.py \
  --directory_path <PATH_TO_FILES> \
  --database <FASTGPT_DATASET_NAME> \
  [--collect_name <COLLECTION_NAME>] \
  [--parm <create|update|delete>] \
  [--parentId <PARENT_COLLECTION_ID>]

1.2 Example

python src/import_data/import_data_to_knowledge_datatabase.py \
  --directory_path ./data/technical_papers \
  --database ResearchCorpus \
  --collect_name "TechPapers" \
  --parm create

This will:

  1. Read every file under ./data/technical_papers.
  2. Create the collection TechPapers inside the ResearchCorpus dataset (or update it if it already exists).
  3. Skip any files that are already indexed.

2. Embedding Service

The Embedding Service (embedding_web.py) is a Flask-based HTTP server that loads multiple embedding models. Clients send text and specify which model to use, and the service returns normalized vectors.

2.1 Starting the Service

python src/embedding_service/embedding_web.py --port 55443

Logs

When the service starts, you’ll see console output similar to:

[INFO] Loading model: pubmedbert
[INFO] Loading model: all-MiniLM-L6-v2
[INFO] Loading model: bge-large-en-v1.5
[INFO] Loading model: gte-large
[INFO] Flask server running on http://0.0.0.0:55443

2.2 API Endpoint

The service exposes a single OpenAI-style endpoint, POST /v1/embeddings. The JSON body carries an input list of texts and a model name matching one of the models loaded at startup; the response contains one normalized vector per input text.

2.3 Example curl Request

curl -X POST http://localhost:55443/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": ["The quick brown fox jumps over the lazy dog."],
    "model": "all-MiniLM-L6-v2"
  }'

You should receive a JSON array containing one embedding vector.
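The same request can be issued from Python. The sketch below builds the request body (its shape is taken from the curl example above) and checks that a returned vector is L2-normalized, as the service advertises; the helper names are ours:

```python
import json
import math

def build_embedding_request(texts: list[str], model: str) -> str:
    """Serialize a /v1/embeddings request body (shape from the curl example)."""
    return json.dumps({"input": texts, "model": model})

def is_unit_vector(vec: list[float], tol: float = 1e-3) -> bool:
    """Check that an embedding is L2-normalized."""
    norm = math.sqrt(sum(x * x for x in vec))
    return abs(norm - 1.0) < tol
```

Send the body with any HTTP client (e.g. `requests.post("http://localhost:55443/v1/embeddings", data=body, headers={"Content-Type": "application/json"})`) and verify each returned vector with is_unit_vector.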


3. LLM Batch Query

The LLM Batch Query script (llm_query.py) sends multiple prompts in parallel to FastGPT’s Chat Completions endpoint and collects responses.

3.1 Command Syntax

python src/llm_query/llm_query.py \
  --input_file <PATH_TO_PROMPTS_FILE> \
  --output_file <PATH_TO_OUTPUT_JSON> \
  [--max_workers <NUM_THREADS>] \
  [--timeout <SECONDS>]

Input File Format

The input file is a CSV of prompts; see examples/sample_prompts.csv for the expected layout.

3.2 Example

python src/llm_query/llm_query.py \
  --input_file examples/sample_prompts.csv \
  --output_file output/llm_responses.json \
  --max_workers 8
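Conceptually, the script fans the prompts out over a thread pool and collects the responses in input order. A minimal sketch of that pattern (the send function here is a stand-in for the actual FastGPT chat-completions call):

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(prompts, send, max_workers=8):
    """Send prompts in parallel; returns responses in the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

In the real script, send POSTs each prompt to FastGPT's Chat Completions endpoint with the configured api_key, honoring --timeout per request.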

4. LVM Query

LVM Query refers to querying vision-language models separately from the LLM batch query. Each model (LLaVa, Llama-3.2-Vision, Phi-3-Vision, Phi-3.5-Vision, Pixtral) exposes its own RESTful endpoint. Below are example commands for each.

Note: Ensure you have already started the relevant LVM service (see docs/installation.md).

4.1 LLaVa:34B (Ollama)

4.2 Llama-3.2-11B-Vision (vLLM)

4.3 Phi-3-Vision-128k-Instruct (vLLM)

4.4 Phi-3.5-Vision-Instruct (vLLM)

4.5 Pixtral-12B-2409 (vLLM)
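The four vLLM-hosted models above serve an OpenAI-compatible chat completions API, so they can typically be queried with an image passed inline as a base64 data URL. A hedged sketch of building such a request body (the model name in the test and the service port are placeholders; check how each service was actually launched):

```python
import base64
import json

def build_vision_request(image_bytes: bytes, prompt: str, model: str) -> str:
    """Build an OpenAI-style chat request with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    })
```

POST the resulting body to the service's /v1/chat/completions endpoint (e.g. http://localhost:8000/v1/chat/completions for a default vLLM launch).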


5. Similarity Evaluation

The Similarity Evaluation script (jaccard_similarity.py) compares LVM outputs against ground truth annotations and writes an Excel report containing Jaccard similarity (and optional EMD) scores.

5.1 Command Syntax

python src/similarity/jaccard_similarity.py \
  --ground_truth <PATH_TO_GROUND_TRUTH_XLSX> \
  --clickhouse_table <CLICKHOUSE_TABLE_NAME> \
  --output <PATH_TO_OUTPUT_XLSX> \
  [--host <CLICKHOUSE_HOST>] \
  [--port <CLICKHOUSE_PORT>] \
  [--user <CLICKHOUSE_USER>] \
  [--password <CLICKHOUSE_PASSWORD>] \
  [--database <CLICKHOUSE_DB>]

5.2 Ground Truth Format

Your ground truth Excel should have columns such as:

id    | entity_name    | entity_label    | relation_type    | ...
---------------------------------------------------------------
img_1 | "cat"          | "Animal"        | "has_tail"       | ...
img_2 | "benzene"      | "Molecule"      | "..."            | ...

Each row corresponds to one entity or relation annotation for a specific id.
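The core Jaccard computation compares the set of predicted entities (or relations) for an id against the ground-truth set. A minimal sketch:

```python
def jaccard_similarity(predicted: set[str], ground_truth: set[str]) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    if not predicted and not ground_truth:
        return 1.0
    return len(predicted & ground_truth) / len(predicted | ground_truth)
```

For example, predicted {cat, dog} against ground truth {cat, mouse} yields 1/3. The script computes this per id and writes the scores into the Excel report.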

5.3 Example

python src/similarity/jaccard_similarity.py \
  --ground_truth data/schema-test.xlsx \
  --clickhouse_table clkg.vision_outputs \
  --output results_with_all_similarity_and_emd5.xlsx

6. Example Workflow

  1. Import Data

    python src/import_data/import_data_to_knowledge_datatabase.py \
      --directory_path ./data/technical_papers \
      --database ResearchCorpus
    
  2. Start Embedding Service

    python src/embedding_service/embedding_web.py --port 55443
    
  3. Verify Embedding API

    curl -X POST http://localhost:55443/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{
        "input": ["Deep learning for NLP."],
        "model": "all-MiniLM-L6-v2"
      }'
    
  4. Run LLM Batch Query

    python src/llm_query/llm_query.py \
      --input_file examples/sample_prompts.csv \
      --output_file output/llm_responses.json
    
  5. Query LVM (LLaVa Example)

    curl -X POST http://localhost:11434/api/generate \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llava:34b",
        "prompt": "Describe the following image.",
        "images": ["<Base64-encoded-image>"]
      }'
    
  6. Store LVM Output in ClickHouse

    • Assuming your inference client writes to clkg.vision_outputs in ClickHouse.
  7. Run Similarity Evaluation

    python src/similarity/jaccard_similarity.py \
      --ground_truth data/schema-test.xlsx \
      --clickhouse_table clkg.vision_outputs \
      --output results_with_all_similarity_and_emd5.xlsx
    

After following these steps, you should have:

  • An indexed FastGPT dataset (ResearchCorpus)
  • A running embedding service on port 55443
  • Batch LLM responses in output/llm_responses.json
  • A similarity report in results_with_all_similarity_and_emd5.xlsx