Query Your Excel Files Locally: An Offline RAG Pipeline with Specialized AI Agents for SQL Generation

Published April 2, 2025 by Joel Thyberg


The Challenge: Querying Sensitive Data in Excel Files

Unlocking insights hidden in local Excel files using plain language queries is a significant challenge. It often requires deep Excel knowledge for complex analysis, and sending large or sensitive files to cloud-based AI services is rarely an option due to privacy risks and practical limitations. Furthermore, raw, large-scale Excel structures are difficult for most AI models to handle effectively.

This project demonstrates a fully offline AI pipeline that solves these problems. By combining Retrieval-Augmented Generation (RAG), vector databases, and specialized, locally-run AI agents (including a quantized 7B model), complex questions in natural language—even with negations—are converted into secure SQL queries. The result is a powerful, local tool that makes Excel data accessible for analysis without requiring SQL skills or an internet connection, guaranteeing complete data privacy and control.

Goals and Criteria

The main goal was to create a system that allows users to interact with their local Excel data using natural Swedish. Specific sub-goals and criteria included:

  • Interpretation of Complex Language: The system must understand nuances such as negations ("not," "lacks") and the relationships between concepts in the query.
  • Handling of Unstructured Data: The ability to extract information even from Excel files that lack a perfect tabular structure.
  • Automatic & Secure SQL: Generate correct SQL queries automatically and implement strict security barriers to only allow data reading (SELECT) and block malicious operations.
  • Local Execution (Offline): The entire process, including the AI models, must run locally to ensure data privacy (no data leaves the user's computer) and independence from an internet connection.
  • Availability on Standard Hardware: Design the system to run on standard consumer hardware, achieved through optimizations like model quantization (e.g., using a quantized 7B LLM).

The Solution: A Modular AI Pipeline

The solution is a modular AI pipeline built in Python that locally orchestrates several AI models and technologies:

  1. Data Conversion & Schema Analysis: Excel files (.xlsx) are converted into SQLite tables. An analysis module identifies the table structure and data types, even in less structured files.
  2. Vectorization (Embedding): Column names and unique cell values are converted into vectors using a local embedding model (via Ollama, e.g., Nomic Embed) and stored in a local vector database (Qdrant) for semantic search. Hash values are used for efficient updates when data changes.
  3. Advanced Query Processing:
    • The Swedish query undergoes: Lemmatization (Stanza), Negation Analysis (spaCy), Term Extraction, and Named Entity Recognition (NER with spaCy for persons, places, organizations).
    • Selective Translation: A hybrid strategy is used where general parts of the query are translated to English using the Helsinki-NLP/opus-mt-sv-en model (for better LLM compatibility), while identified entities and domain-specific terms (with low translation confidence) are kept in Swedish. This maximizes the LLM's understanding while preserving critical values exactly.
  4. Contextual Search (Retrieval): The processed terms and their vectors are used to search in Qdrant. The system finds the most relevant database tables, columns (via name embedding), and specific values (via value embedding) for the query.
  5. SQL Generation (the generation step of RAG):
    • A local LLM agent via Ollama (specifically a quantized version of qwen2.5-s-7b-instruct-q5_K_M) is called to generate SQL.
    • Dynamic Prompts: Instead of a generic prompt, a system with over 900 lines of specialized prompts is used. The correct prompt is dynamically selected based on the analysis of the query (e.g., match type—column/value/mixed, subtype, and whether negation was identified). This gives the LLM extremely specific guidance to generate correct and secure SQL for that exact scenario.
    • The prompt includes context from the vector search (matched columns/values) and strict rules for syntax and security.
  6. Secure Execution: The generated SQL code is rigorously validated: only SELECT statements are allowed, and a list of forbidden commands (UPDATE, DELETE, DROP, etc.) is actively blocked. Only after validation is the query executed against the local SQLite database. The results are collected and presented.
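The validation gate in step 6 can be sketched as follows. This is a minimal illustration, not the project's actual validator: the function name and the exact blocklist are assumptions, and the check is deliberately conservative (it rejects any query containing a forbidden word, even inside a string literal).

```python
import re

# Commands that must never appear in generated SQL (assumed blocklist;
# the real project's list may differ).
FORBIDDEN = {"UPDATE", "DELETE", "DROP", "INSERT", "ALTER", "CREATE",
             "REPLACE", "ATTACH", "PRAGMA"}

def is_safe_select(sql: str) -> bool:
    """Allow only a single, read-only SELECT statement (sketch)."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:                       # no statement chaining
        return False
    if not re.match(r"(?is)^\s*SELECT\b", stripped):
        return False
    # Conservative word scan: reject if any forbidden keyword appears.
    tokens = {t.upper() for t in re.findall(r"[A-Za-z_]+", stripped)}
    return tokens.isdisjoint(FORBIDDEN)
```

Only queries that pass a gate like this would ever reach the SQLite connection; everything else is discarded before execution.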

This entire pipeline, from query analysis to SQL execution, runs entirely locally without sending any data externally.
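The dynamic prompt selection described in step 5 can be pictured as a lookup keyed on the query analysis. The prompt texts and key structure below are invented stubs standing in for the project's 900+ lines of specialized prompts:

```python
# Hypothetical registry: (match type, negation present) -> prompt stub.
# The real system's prompts are far longer and more specific.
PROMPTS = {
    ("value", False):  "Generate a SELECT that filters on the matched value ...",
    ("column", False): "Generate a SELECT over the matched column ...",
    ("mixed", False):  "Combine the value filter with the matched column ...",
    ("mixed", True):   "Use a NOT IN subquery to exclude rows matching the value ...",
}

def select_prompt(match_type: str, has_negation: bool) -> str:
    """Pick the specialized prompt for this exact query scenario (sketch)."""
    return PROMPTS.get((match_type, has_negation), PROMPTS[("mixed", False)])
```

The point of the design is that the LLM never sees a one-size-fits-all instruction: a negated mixed query gets a prompt written specifically for negated mixed queries.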

Result: A Concrete Example

The result is a robust and secure system that successfully bridges the gap between human language (Swedish) and structured data hidden in Excel files. The system was tested and validated on a dataset of three Excel files, each with about 1000 rows and 60 columns, containing bird observation data. For a typical complex query requiring analysis across all three files, the entire process from entering the Swedish query to receiving the presented answer took about 40 seconds on standard hardware. The system's capability can be illustrated with a concrete example based on the bird observation data:

Example Query: "Hur många unika artnamn har inte Joel Thyberg rapporterat?" (How many unique species names has Joel Thyberg not reported?)

  1. Query Processing: The query is lemmatized, negation ("inte") is identified, and key terms ("joel thyberg", "artnamn") are extracted. "Joel Thyberg" is identified as a person (NER). The query is selectively translated (important names/terms are kept in Swedish).
  2. Contextual Search (Retrieval): The system searches the vector database (Qdrant) and finds that "joel thyberg" is a value in the "Rapportör" (Reporter) column and "artnamn" matches the "Artnamn" (Species Name) column in several tables (e.g., faglar_2022, faglar_2023, faglar_2024).
  3. Dynamic Prompt & SQL Generation: Since the query contains a specific value ("Joel Thyberg"), a column reference ("Artnamn"), and a negation, the system selects a specialized negation prompt for mixed value/column queries. This prompt instructs the local LLM (quantized qwen2.5-s-7b-instruct-q5_K_M) to:
    • Create a subquery to find all "Artnamn" that have been reported by "Joel Thyberg" in each respective table.
    • Create a main query that counts (COUNT(DISTINCT ...) ) the unique "Artnamn" that are not in the subquery's result (WHERE "Artnamn" NOT IN (SELECT ...)), while also filtering out invalid values (NULL, empty strings, and the column header 'Artnamn' itself).
    • Generate separate, validated SQL queries for each relevant table (2022, 2023, 2024).
  4. Secure Execution & Results: The generated SELECT queries are validated (no forbidden operations) and then run against the local SQLite database. (In these less-structured files the species-name data was imported under the auto-generated header "Unnamed: 6", which is why that name appears in the raw results below.) The results from each table are collected:
    • faglar_2022: {'COUNT(DISTINCT "Unnamed: 6")': 25}
    • faglar_2023: {'COUNT(DISTINCT "Unnamed: 6")': 15}
    • faglar_2024: {'COUNT(DISTINCT "Unnamed: 6")': 10}
  5. Presentation: The answer is presented to the user in an aggregated and easy-to-understand format.
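The shape of the generated negation query (steps 3-4 above) can be reproduced against a toy in-memory SQLite table. The table contents here are invented; only the query structure follows the example:

```python
import sqlite3

# Toy data mirroring the example's schema (rows are invented).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE faglar_2022 ("Artnamn" TEXT, "Rapportör" TEXT)')
conn.executemany(
    "INSERT INTO faglar_2022 VALUES (?, ?)",
    [("Talgoxe", "Joel Thyberg"), ("Blåmes", "Anna"), ("Koltrast", "Anna")],
)

# Main query with a NOT IN subquery, filtering out invalid values,
# as the specialized negation prompt instructs the LLM to produce.
sql = '''
SELECT COUNT(DISTINCT "Artnamn")
FROM faglar_2022
WHERE "Artnamn" NOT IN (
        SELECT "Artnamn" FROM faglar_2022 WHERE "Rapportör" = 'Joel Thyberg'
    )
  AND "Artnamn" IS NOT NULL
  AND "Artnamn" != ''
  AND "Artnamn" != 'Artnamn'   -- exclude a stray header row
'''
print(conn.execute(sql).fetchone()[0])  # Blåmes and Koltrast -> 2
```

The same statement is generated once per table (2022, 2023, 2024), and the per-table counts are aggregated for presentation.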

This example shows how users can ask complex questions and get precise answers without SQL knowledge, with full insight into the process via logging. The system demonstrates the power of combining:

  • RAG: To retrieve relevant context.
  • Specialized AI Agents/Prompts: For high-quality, secure SQL.
  • Local AI Processing: For data integrity, security, and offline functionality.
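One supporting detail worth spelling out is the hash-based change detection from the vectorization step, which keeps updates efficient by re-embedding only when data actually changes. A minimal sketch, with invented function names:

```python
import hashlib
import json

def column_fingerprint(values: list[str]) -> str:
    """Order-independent hash of a column's unique values (sketch)."""
    payload = json.dumps(sorted(set(values)), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def needs_reembedding(values: list[str], stored_hash: str) -> bool:
    # Only re-run the embedding model and upsert into the vector
    # store when the column's content has actually changed.
    return column_fingerprint(values) != stored_hash
```

Comparing a stored fingerprint against a freshly computed one avoids re-embedding thousands of unchanged cell values on every run.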

This project provides a concrete example of how a well-designed AI pipeline can solve real-world problems and make data more accessible.