Researchers, myself included, confront an escalating challenge: the exponential growth of scholarly literature threatens to overwhelm traditional research methodologies. I built this “LLM Text Analysis toolkit” to address that challenge by creating an intelligent interface between Zotero reference management and modern language models, transforming how I process and analyse my academic library.

What My System Does

My system establishes an intelligent layer atop Zotero’s open-source reference management platform. Rather than requiring manual PDF review, I can now:

  • Extract and summarise academic papers automatically from my existing Zotero library
  • Query my documents by tags or collections for targeted analysis
  • Generate structured insights using state-of-the-art language models
  • Maintain my literature reference workflows through Obsidian
  • Process diverse content types, including podcast transcripts for strategic analysis

I achieve this through direct SQLite database integration, eliminating manual exports or complex API configurations. I can process entire document collections, extract full text, and generate comprehensive summaries across multiple AI providers.

My system also relies heavily on Obsidian, including its Zotero integration.

My Technical Architecture

I designed my system with a modular architecture built on a unified LLM interface supporting multiple AI providers: Ollama (open-source models run locally), Claude, and OpenAI.

Core Components

Zotero Database Integration

The foundation component I created (zotero_obsidian.py) directly queries Zotero’s SQLite database to:

  • Extract papers by tags, collections, or specific criteria
  • Retrieve comprehensive metadata including authors, publication dates, and DOIs
  • Access PDF file paths within Zotero’s storage hierarchy
  • Generate structured CSV exports for my research planning
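As a minimal sketch of the CSV export step (the column names here are illustrative assumptions, not the toolkit's actual schema), assuming each extracted paper arrives as a (key, path, title, tags) tuple:

```python
import csv

def export_papers_csv(rows, out_path="papers.csv"):
    """Write extracted paper metadata to a CSV for research planning."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["key", "path", "title", "tags"])  # assumed columns
        writer.writerows(rows)
```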

PDF Text Extraction Engine

I utilise PyMuPDF (https://pymupdf.readthedocs.io/en/latest/) in my system to perform robust text extraction from academic PDFs, managing:

  • Complex multi-column layouts characteristic of academic publications
  • Mathematical formulas and specialised characters
  • Tabular data and figure descriptions
  • Error handling for corrupted or protected documents

Multi-Provider LLM Interface

I built a unified abstraction layer that provides consistent access to:

  • Ollama: Local inference using models such as Llama 3.3 and Mistral
  • Claude: Anthropic’s API, featuring models like Claude-3.5-Sonnet
  • OpenAI: GPT-4 and related ChatGPT variants

Literature Reference Management

My system integrates seamlessly with Obsidian-style markdown workflows, including:

  • Automated citekey extraction and matching
  • Cross-referencing with my existing literature databases
  • Missing reference tracking and comprehensive reporting
  • Timestamped overview generation
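Citekey extraction and matching can be sketched roughly as follows; the `[[@citekey]]` wiki-link pattern is an assumption about the Obsidian vault layout, not necessarily the exact format the toolkit uses:

```python
import re
from pathlib import Path

# Assumed Obsidian-style citation links, e.g. [[@smith2021moment]]
CITEKEY_RE = re.compile(r"\[\[@([A-Za-z]+\d{4}[A-Za-z]*)\]\]")

def extract_citekeys(vault_dir: str) -> set[str]:
    """Collect every citekey referenced in the markdown vault."""
    found = set()
    for note in Path(vault_dir).rglob("*.md"):
        found.update(CITEKEY_RE.findall(note.read_text(encoding="utf-8")))
    return found

def missing_citekeys(vault_dir: str, known: set[str]) -> set[str]:
    """Citekeys that appear in notes but not in the literature database."""
    return extract_citekeys(vault_dir) - known
```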

Implementation Technologies

PyMuPDF: My PDF Processing Foundation

I chose PyMuPDF as my text extraction engine because it provides:

  • High-fidelity text extraction from complex academic documents
  • Robust handling of varied document structures
  • Efficient processing of large document collections
  • Cross-platform compatibility


# How I extract text from PDFs
import fitz  # PyMuPDF

doc = fitz.open(full_path)
text = "\n".join([page.get_text() for page in doc])
doc.close()

Direct SQLite Database Access

I bypass Zotero’s API limitations through direct database queries I crafted:

# My SQL query for extracting papers with metadata
# (the author lookup via itemCreators/creators is omitted for brevity)
query = """
SELECT i.key, ia.path, idv.value AS title,
       GROUP_CONCAT(t.name) AS tags
FROM items i
JOIN itemAttachments ia ON ia.parentItemID = i.itemID
JOIN itemTags it ON it.itemID = i.itemID
JOIN tags t ON t.tagID = it.tagID
JOIN itemData id ON id.itemID = i.itemID
JOIN fields f ON f.fieldID = id.fieldID AND f.fieldName = 'title'
JOIN itemDataValues idv ON idv.valueID = id.valueID
WHERE t.name = ? AND ia.contentType = 'application/pdf'
GROUP BY i.key, ia.path, idv.value
"""

Multi-Provider LLM Implementation

I designed my system to provide consistent interfaces across three distinct AI providers:

My Local Ollama Integration

# How I query local models
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": model, "prompt": prompt, "stream": False}
)

My Anthropic Claude API Implementation

# How I connect to Claude
response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={"x-api-key": api_key, "anthropic-version": "2023-06-01"},
    json={"model": model, "max_tokens": 4000, "messages": [...]}
)

My OpenAI Integration

# How I interface with OpenAI
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={"model": model, "messages": [...], "max_tokens": 4000}
)

How I Use My Research Workflow

Document Discovery and Analysis

I initiate analysis using Zotero’s organisational systems with commands like:

# How I analyze papers tagged with "Momentum" using Claude
... --tag "Momentum" --provider claude --model claude-3-5-sonnet-20241022

# How I process a collection with local Ollama
... --collection "Trading Papers" --provider ollama --model llama3.3:latest

My Intelligent Content Processing

I extract complete text from PDFs and apply targeted analytical prompts using code like:

# How I customize prompts for different analysis types
prompt = f"{custom_prompt}\n\n{extracted_text}" if custom_prompt else f"Summarize the following academic paper:\n{extracted_text}"

Literature Integration and Cross-Referencing

Each document I process undergoes systematic cross-referencing where:

  • Documents with existing citekeys utilise my established reference systems
  • Missing references are catalogued in dedicated tracking files I maintain
  • Obsidian-compatible citation links are generated automatically
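The missing-reference tracking step could be sketched as follows; the tracking file name and the `[[@citekey]]` link format are assumptions about the vault layout rather than fixed parts of the toolkit:

```python
from pathlib import Path

def record_missing(citekeys, tracking_file="missing_references.md"):
    """Append unresolved citekeys to a tracking note as Obsidian wiki-links."""
    path = Path(tracking_file)
    seen = set(path.read_text(encoding="utf-8").splitlines()) if path.exists() else set()
    with path.open("a", encoding="utf-8") as fh:
        for key in citekeys:
            line = f"- [[@{key}]]"
            if line not in seen:  # avoid duplicate entries across runs
                fh.write(line + "\n")
                seen.add(line)
```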

My Structured Output Generation

I organise results as markdown files, featuring:

  • Standardised naming conventions ({citekey}_{model}.md)
  • Comprehensive timestamped processing logs
  • Integration with my established note-taking workflows in Obsidian
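A minimal sketch of writing one summary under the {citekey}_{model}.md convention; the timestamp header format and the colon replacement are my assumptions, added so that model names such as llama3.3:latest remain valid filenames:

```python
from datetime import datetime
from pathlib import Path

def write_summary(citekey: str, model: str, summary: str, out_dir: str = ".") -> Path:
    """Save a summary as {citekey}_{model}.md with a processing timestamp."""
    safe_model = model.replace(":", "-")  # colons are illegal in some filesystems
    out_path = Path(out_dir) / f"{citekey}_{safe_model}.md"
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    out_path.write_text(f"Processed: {stamp}\n\n{summary}\n", encoding="utf-8")
    return out_path
```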

Technical Considerations

The Provider Abstraction Layer

The unified LLM interface enables seamless transitions between local and cloud-based models:

# My function for switching between AI providers
def query_llm(text: str, provider: str = "ollama", model: Optional[str] = None,
              custom_prompt: Optional[str] = None):
    if provider == "claude":
        return query_claude(text, model, custom_prompt)
    elif provider == "openai":
        return query_openai(text, model, custom_prompt)
    elif provider == "ollama":
        return query_ollama(text, model, custom_prompt)
    raise ValueError(f"Unsupported provider: {provider}")

Advanced File Management I Built

My system navigates Zotero’s complex storage architecture by:

  • Resolving “storage:” prefixed paths to actual file locations
  • Managing database attachment relationships
  • Handling diverse PDF naming conventions and storage structures
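Resolving a "storage:" prefixed path can be sketched like this; Zotero keeps managed attachments under <data dir>/storage/<item key>/, while linked files are stored as plain absolute paths:

```python
from pathlib import Path

def resolve_attachment_path(zotero_dir: str, item_key: str, db_path_value: str) -> Path:
    """Turn a path value from the database into an absolute file path."""
    if db_path_value.startswith("storage:"):
        filename = db_path_value[len("storage:"):]
        return Path(zotero_dir) / "storage" / item_key / filename
    return Path(db_path_value)  # linked file: already absolute
```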

Robust Error Handling and Rate Management

I implemented comprehensive protections, including:

  • Configurable request delays
  • Extensive timeout handling
  • Graceful degradation when models become unavailable
  • Detailed logging for troubleshooting and optimisation
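The delay-and-retry logic could be sketched as below; the retry count, back-off factor, and wrapper shape are illustrative assumptions, not the toolkit's exact implementation:

```python
import logging
import time

def query_with_retries(query_fn, text, retries: int = 3, delay: float = 2.0):
    """Call an LLM query function with a configurable delay and retries."""
    for attempt in range(1, retries + 1):
        try:
            time.sleep(delay)  # rate limiting between requests
            return query_fn(text)
        except Exception as exc:  # network errors, timeouts, 429s
            logging.warning("attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt == retries:
                raise  # let callers fall back to another provider
            delay *= 2  # exponential back-off before the next try
```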

Real-World Impact on My Research

My system represents a substantial advancement in my academic workflow automation. I can now:

  • Process comprehensive document collections in minutes rather than days
  • Maintain analytical consistency across extensive literature reviews
  • Leverage multiple AI providers to optimise model selection for specific domains
  • Integrate seamlessly with my established note-taking and reference management systems

The Zotero integration proves particularly valuable because it adapts to my existing library organisation without requiring workflow modifications or file reorganisation.

I specifically want to leverage the Zotero platform fully for this purpose. The simpler alternative is to run an analysis one PDF at a time; however, without a robust system, organising both the original papers and the insights gained from them quickly becomes unmanageable.

Future Implications of My Work

AI’s potential to augment human expertise is clear. By automating time-intensive initial processing and summarisation tasks, it enables me to focus on higher-level analysis, synthesis, and insight generation.

The modular architecture I designed facilitates extension to additional domains—legal document review, medical literature analysis, and corporate research workflows represent immediate applications. My combination of direct database integration, robust text extraction, and flexible AI provider support establishes a framework for AI-augmented knowledge work across diverse fields.

As large language models continue advancing and becoming more accessible, systems like mine will likely become standard components of academic research infrastructure, fundamentally altering how we interact with the expanding corpus of human knowledge.
