hra-ftu-rag-supporting-information

Usage Guide (usage.md)

This document provides step-by-step instructions for using the main components of HRAftu-LM-RAG. Each section covers how to run scripts and services in the src/ directory, along with example commands.


Prerequisites

  1. Configuration File
    Before invoking any script, ensure you have copied and edited config/example_config.yaml to config/config.yaml, filling in:
    • fastgpt.api_key & fastgpt.host
    • Vision model device/port settings
    • ClickHouse connection details (if using similarity evaluation)
  2. Environment Activation
    cd HRAftu-LM-RAG
    source venv/bin/activate       # or: conda activate hraftu
    
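For reference, a filled-in config/config.yaml might look like the sketch below. This is only an illustration of the three groups of settings listed above; take the authoritative key names from config/example_config.yaml:

```yaml
# Sketch only — mirror the actual keys in config/example_config.yaml
fastgpt:
  api_key: "fastgpt-xxxxxxxx"      # your FastGPT API key
  host: "http://localhost:3000"    # FastGPT server URL

vision:
  device: "cuda:0"                 # device for the vision models
  port: 55500                      # port the LVM service listens on

clickhouse:                        # only needed for similarity evaluation
  host: "localhost"
  port: 9000
  user: "default"
  password: ""
  database: "clkg"
```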

1. Data Import

The Data Import script (import_data_to_knowledge_datatabase.py) uploads a directory of local files (PDF, TXT, DOCX, etc.) into a FastGPT dataset/collection.

1.1 Command Syntax

python src/import_data/import_data_to_knowledge_datatabase.py \
  --directory_path <PATH_TO_FILES> \
  --database <FASTGPT_DATASET_NAME> \
  [--collect_name <COLLECTION_NAME>] \
  [--parm <create|update|delete>] \
  [--parentId <PARENT_COLLECTION_ID>]

1.2 Example

python src/import_data/import_data_to_knowledge_datatabase.py \
  --directory_path ./data/technical_papers \
  --database ResearchCorpus \
  --collect_name "TechPapers" \
  --parm create

This will:

  1. Read every file under ./data/technical_papers.
  2. Create the collection TechPapers inside the ResearchCorpus dataset (or update it if it already exists).
  3. Skip any files that are already indexed.

2. Embedding Service

The Embedding Service (embedding_web.py) is a Flask-based HTTP server that loads multiple embedding models. Clients send text and specify which model to use, and the service returns normalized vectors.

2.1 Starting the Service

python src/embedding_service/embedding_web.py --port 55443

Logs

When the service starts, you’ll see console output similar to:

[INFO] Loading model: pubmedbert
[INFO] Loading model: all-MiniLM-L6-v2
[INFO] Loading model: bge-large-en-v1.5
[INFO] Loading model: gte-large
[INFO] Flask server running on http://0.0.0.0:55443

2.2 API Endpoint

The service exposes a single OpenAI-style endpoint, POST /v1/embeddings. The JSON body carries an input list of texts and a model name matching one of the models loaded at startup; the response contains one normalized vector per input text.

2.3 Example curl Request

curl -X POST http://localhost:55443/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": ["The quick brown fox jumps over the lazy dog."],
    "model": "all-MiniLM-L6-v2"
  }'

You should receive a JSON array containing one embedding vector.
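The same request can be issued from Python. The sketch below builds the request body (its shape is taken from the curl example above) and checks that a returned vector is L2-normalized, as the service advertises; the helper names are ours:

```python
import json
import math

def build_embedding_request(texts: list[str], model: str) -> str:
    """Serialize a /v1/embeddings request body (shape from the curl example)."""
    return json.dumps({"input": texts, "model": model})

def is_unit_vector(vec: list[float], tol: float = 1e-3) -> bool:
    """Check that an embedding is L2-normalized."""
    norm = math.sqrt(sum(x * x for x in vec))
    return abs(norm - 1.0) < tol
```

Send the body with any HTTP client (e.g. `requests.post("http://localhost:55443/v1/embeddings", data=body, headers={"Content-Type": "application/json"})`) and verify each returned vector with is_unit_vector.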


3. LLM Batch Query

The LLM Batch Query script (llm_query.py) sends multiple prompts in parallel to FastGPT’s Chat Completions endpoint and collects responses.

3.1 Command Syntax

python src/llm_query/llm_query.py \
  --input_file <PATH_TO_PROMPTS_FILE> \
  --output_file <PATH_TO_OUTPUT_JSON> \
  [--max_workers <NUM_THREADS>] \
  [--timeout <SECONDS>]

Input File Format

The input file is a CSV of prompts; see examples/sample_prompts.csv for the expected layout.

3.2 Example

python src/llm_query/llm_query.py \
  --input_file examples/sample_prompts.csv \
  --output_file output/llm_responses.json \
  --max_workers 8
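Conceptually, the script fans the prompts out over a thread pool and collects the responses in input order. A minimal sketch of that pattern (the send function here is a stand-in for the actual FastGPT chat-completions call):

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(prompts, send, max_workers=8):
    """Send prompts in parallel; returns responses in the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

In the real script, send POSTs each prompt to FastGPT's Chat Completions endpoint with the configured api_key, honoring --timeout per request.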

4. LVM Query

LVM Query refers to querying vision-language models separately from the LLM batch query. Each model (LLaVa, Llama-3.2-Vision, Phi-3-Vision, Phi-3.5-Vision, Pixtral) exposes its own RESTful endpoint. Below are example commands for each.

Note: Ensure you have already started the relevant LVM service (see docs/installation.md).

4.1 LLaVa:34B (Ollama)

4.2 Llama-3.2-11B-Vision (vLLM)

4.3 Phi-3-Vision-128k-Instruct (vLLM)

4.4 Phi-3.5-Vision-Instruct (vLLM)

4.5 Pixtral-12B-2409 (vLLM)
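The four vLLM-hosted models above serve an OpenAI-compatible chat completions API, so they can typically be queried with an image passed inline as a base64 data URL. A hedged sketch of building such a request body (the model name in the test and the service port are placeholders; check how each service was actually launched):

```python
import base64
import json

def build_vision_request(image_bytes: bytes, prompt: str, model: str) -> str:
    """Build an OpenAI-style chat request with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    })
```

POST the resulting body to the service's /v1/chat/completions endpoint (e.g. http://localhost:8000/v1/chat/completions for a default vLLM launch).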


5. Similarity Evaluation

The Similarity Evaluation script (jaccard_similarity.py) compares LVM outputs against ground truth annotations and writes an Excel report containing Jaccard similarity (and optional EMD) scores.

5.1 Command Syntax

python src/similarity/jaccard_similarity.py \
  --ground_truth <PATH_TO_GROUND_TRUTH_XLSX> \
  --clickhouse_table <CLICKHOUSE_TABLE_NAME> \
  --output <PATH_TO_OUTPUT_XLSX> \
  [--host <CLICKHOUSE_HOST>] \
  [--port <CLICKHOUSE_PORT>] \
  [--user <CLICKHOUSE_USER>] \
  [--password <CLICKHOUSE_PASSWORD>] \
  [--database <CLICKHOUSE_DB>]

5.2 Ground Truth Format

Your ground truth Excel should have columns such as:

id    | entity_name    | entity_label    | relation_type    | ...
---------------------------------------------------------------
img_1 | "cat"          | "Animal"        | "has_tail"       | ...
img_2 | "benzene"      | "Molecule"      | "..."            | ...

Each row corresponds to one entity or relation annotation for a specific id.
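The core Jaccard computation compares the set of predicted entities (or relations) for an id against the ground-truth set. A minimal sketch:

```python
def jaccard_similarity(predicted: set[str], ground_truth: set[str]) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    if not predicted and not ground_truth:
        return 1.0
    return len(predicted & ground_truth) / len(predicted | ground_truth)
```

For example, predicted {cat, dog} against ground truth {cat, mouse} yields 1/3. The script computes this per id and writes the scores into the Excel report.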

5.3 Example

python src/similarity/jaccard_similarity.py \
  --ground_truth data/schema-test.xlsx \
  --clickhouse_table clkg.vision_outputs \
  --output results_with_all_similarity_and_emd5.xlsx

6. Example Workflow

  1. Import Data

    python src/import_data/import_data_to_knowledge_datatabase.py \
      --directory_path ./data/technical_papers \
      --database ResearchCorpus
    
  2. Start Embedding Service

    python src/embedding_service/embedding_web.py --port 55443
    
  3. Verify Embedding API

    curl -X POST http://localhost:55443/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{
        "input": ["Deep learning for NLP."],
        "model": "all-MiniLM-L6-v2"
      }'
    
  4. Run LLM Batch Query

    python src/llm_query/llm_query.py \
      --input_file examples/sample_prompts.csv \
      --output_file output/llm_responses.json
    
  5. Query LVM (LLaVa Example)

    curl -X POST http://localhost:11434/api/generate \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llava:34b",
        "prompt": "Describe the following image.",
        "images": ["<Base64-encoded-image>"]
      }'
    
  6. Store LVM Output in ClickHouse

    • Assuming your inference client writes to clkg.vision_outputs in ClickHouse.
  7. Run Similarity Evaluation

    python src/similarity/jaccard_similarity.py \
      --ground_truth data/schema-test.xlsx \
      --clickhouse_table clkg.vision_outputs \
      --output results_with_all_similarity_and_emd5.xlsx
    

After following these steps, you should have:

  • An indexed FastGPT dataset (ResearchCorpus)
  • A running embedding service on port 55443
  • Batch LLM responses in output/llm_responses.json
  • A similarity report in results_with_all_similarity_and_emd5.xlsx