The Best Way to Parse Complex PDFs for RAG: Hybrid Multimodal Parsing

Discover the best way to make complex PDFs AI-ready for Retrieval-Augmented Generation, from LlamaParse to Hybrid Multimodal Parsing with Instill VDP.

November 28, 2024

An organization's most vital data is often held within documents. To tap into this valuable unstructured data resource, businesses are increasingly turning to AI-powered document interaction features. These enable users to access and interact with critical information quickly, improving decision-making and operational efficiency. Ensuring your documents are AI-ready is essential for staying ahead in today's competitive landscape, but low-quality, poorly parsed documents can lead to inaccurate AI outputs, undermining the effectiveness of these features.

In this article, we will explain the challenges and complexities that can arise when parsing complex documents for AI model ingestion. We will also introduce and explore Instill AI's Hybrid Multimodal Approach to advanced document parsing.

👉 Before we go any further, try it out for yourself by clicking the button below!

#Bridging the Gap Between Documents and AI Models

For an LLM (Large Language Model) to make effective use of the information stored in documents, the files need to be converted into a format that the model can process. We recommend using Markdown-formatted text because it strikes a good balance between simplicity and structure. Markdown captures essential elements like headings, lists, and links without excessive metadata. Its clean syntax ensures readability, easy tokenization, and good alignment with typical LLM training datasets. Additionally, it supports rich content and can be extended with custom annotations, making it flexible and efficient for preparing content for LLMs.
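To make the target format concrete, here is a minimal sketch of what rendering parsed document elements into Markdown might look like. The element schema (`heading`, `paragraph`, `list` tuples) is purely hypothetical and for illustration only.

```python
# Illustrative sketch: rendering parsed document elements as Markdown.
# The (kind, payload) element schema is a made-up example, not any real API.

def to_markdown(elements):
    """Render a list of (kind, payload) document elements as Markdown text."""
    lines = []
    for kind, payload in elements:
        if kind == "heading":
            level, text = payload
            lines.append("#" * level + " " + text)
        elif kind == "list":
            lines.extend("- " + item for item in payload)
        elif kind == "paragraph":
            lines.append(payload)
    return "\n\n".join(lines)

doc = [
    ("heading", (1, "Quarterly Report")),
    ("paragraph", "Revenue grew in all regions."),
    ("list", ["EMEA: +12%", "APAC: +8%"]),
]
print(to_markdown(doc))
```

The output keeps headings, lists, and body text distinguishable with almost no syntactic overhead, which is exactly the property that makes Markdown a good intermediate format for LLM ingestion.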

In our previous tutorial, "What is Retrieval Augmented Generation, and how to build an AI RAG system on Instill Cloud?", we demonstrated how to build a simple RAG-style "talk to your docs" system. The documents used were a mix of relatively simple DOCX and PDF files that could easily be parsed into high-quality Markdown using our standard Document Operator.

In many real-world scenarios, however, documents can be much more complex, containing tables, images, and intricate layouts that require more sophisticated parsing techniques. Imagine handwritten notes, scanned and partially redacted medical files, or historical archives with faded text and irregular formatting. These types of documents, typically PDF files, pose unique challenges and demand advanced techniques for accurate and meaningful data extraction!

Here are some examples of the types of complex documents we're talking about:

Low-quality scanned PDF document.
Example invoice with tabular data.
Healthcare report with graphs and tables.
RAG document with bullet points and multi-page tables.


👉 See for yourself how our Hybrid Multimodal Approach can handle these examples in our demo!

#Why is PDF Parsing Challenging?

While converting simple document formats like DOCX to Markdown is relatively straightforward, parsing PDF files requires more care.

The primary reason is that PDFs are built to look good on screen or paper, not to make data easy to extract. As such, the text in a PDF is often stored in non-linear order, with characters or lines positioned absolutely on the page, making it difficult to reconstruct logical flows like paragraphs. Additionally, layout features such as columns, tables, and embedded images add complexity, as they aren't inherently structured in the file itself.

Moreover, PDF files vary in how they store content. Some PDFs use vector graphics or images for text, which requires Optical Character Recognition (OCR) to extract. Encoding issues, special characters, and the lack of standardized metadata further complicate the process. These factors all make PDF parsing one of the more challenging and frustrating aspects of data cleaning for AI applications.
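To see why absolute positioning is a problem, consider a minimal sketch of the kind of reading-order heuristic an extractor must apply: text arrives as positioned spans, and a simple (and fragile) fix is to sort them top-to-bottom, then left-to-right. The span format and tolerance value here are illustrative assumptions, not any particular library's behavior.

```python
# Sketch of a reading-order heuristic. PDF extractors often receive text as
# absolutely positioned fragments; sorting by vertical position, then
# horizontal, is a simple way to rebuild logical flow. Coordinates here use
# a top-left origin; real PDFs use a bottom-left origin, one of the many
# details that makes this harder in practice.

def reading_order(spans, line_tolerance=2.0):
    """Order (x, y, text) spans top-to-bottom, left-to-right."""
    # Group spans into lines whose y positions fall within the tolerance.
    lines = []
    for x, y, text in sorted(spans, key=lambda s: s[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within each line, order fragments left to right and join everything.
    return " ".join(
        " ".join(t for _, t in sorted(frags)) for _, frags in lines
    )

# Fragments stored out of visual order, as a PDF might hold them:
spans = [(120.0, 50.1, "world"), (10.0, 50.0, "Hello"), (10.0, 70.0, "Next line")]
print(reading_order(spans))
```

Even this toy version needs a tolerance parameter to decide what counts as "the same line"; multi-column pages, rotated text, and tables break the heuristic entirely, which is why heuristic parsers struggle on complex layouts.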

#It's all About the Data

The bottom line is that data is the foundation on which all AI systems are built, and as the machine-learning adage goes, "garbage in, garbage out." If the quality of input data for building AI systems isn't up to par, then the AI systems will have suboptimal performance. In other words, if we can't properly parse PDFs into high-quality Markdown text, our RAG retrieval and response capabilities will typically be low-quality and often erroneous.

👉 The right document parsing solution is crucial.

#Heuristic vs. AI-based Document Parsing

There are numerous heuristics-based tools for parsing PDFs, such as pdfminer.six, pypdf, pdfplumber, and pdf2md. These tools are designed to extract text, tables, and structural elements directly from PDFs, providing efficient solutions for a wide range of document processing needs. We also have our own Document Operator designed specifically for efficiently processing documents for AI tasks. The problem with these heuristic-based methods is that they typically struggle with complex layouts, tables, and images, often leading to inaccuracies and incomplete outputs.
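To illustrate how a heuristic parser reasons about structure, here is a toy sketch that treats runs of two or more spaces as column separators and emits a Markdown table. This is an illustrative heuristic of our own, not the algorithm used by any of the tools named above.

```python
import re

# Toy heuristic: split on runs of two or more spaces to recover columns,
# then emit a Markdown table. This works on cleanly aligned text and fails
# on merged cells, wrapped cells, or ragged layouts -- exactly the cases
# where heuristic parsers tend to break down.

def text_to_markdown_table(lines):
    rows = [re.split(r" {2,}", line.strip()) for line in lines if line.strip()]
    header, *body = rows
    out = ["| " + " | ".join(header) + " |",
           "| " + " | ".join("---" for _ in header) + " |"]
    out += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(out)

invoice = [
    "Item        Qty   Price",
    "Widget      2     9.99",
    "Gadget      1     24.50",
]
print(text_to_markdown_table(invoice))
```

A rule this brittle has no way to recover if a cell value itself contains double spaces or if a column is empty, which is the general weakness of purely heuristic approaches.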

To address these limitations, there is an emerging alternative that involves converting the pages of a document into images and using multimodal Visual Language Models (VLMs) directly to parse these images into structured text via document-level OCR. Tools like LlamaParse and Zerox OCR employ this approach. The problem here is that VLMs often hallucinate or randomly omit information, which can lead to inaccuracies in the output that are difficult to detect.
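In practice, the image-based approach amounts to packaging each page image into a request for a vision-capable model. The sketch below builds such a request in the widely used OpenAI-style chat format; the model name is a placeholder and no request is actually sent.

```python
import base64
import json

# Sketch of how a page image might be packaged for a vision-capable model.
# The message shape follows the common OpenAI-style chat format; the model
# name is a placeholder and nothing is sent over the network here.

def build_vlm_request(page_png_bytes, model="gpt-4o"):
    image_b64 = base64.b64encode(page_png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page into well-structured Markdown."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }

payload = build_vlm_request(b"\x89PNG...")  # stand-in bytes, not a real image
print(json.dumps(payload)[:80])
```

Note that the model receives only pixels and a prompt: there is no ground-truth text to check against, which is why hallucinations and silent omissions in the transcription are hard to detect downstream.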

#A Hybrid Multimodal Approach to Document Parsing

We believe that effective document parsing doesn't have to be an "either-or" scenario. Instead, we adopt a Hybrid Multimodal Approach in our advanced document parsing pipeline, which combines heuristic methods and multimodal VLMs to get the best of both worlds.

This hybrid approach helps to mitigate issues like hallucinations and random omissions that sometimes occur with purely AI-based methods. It also effectively captures information hierarchies that are often implicit in the layout and formatting of documents but are typically overlooked by non-visual heuristic methods.

To illustrate this, please see the example below, where we compare the parsed responses from heuristic, AI-based, and our hybrid parsing approaches on a challenging PDF file:

Example complex PDF document containing multi-column text and tables.
Heuristic-based parsing using our Document Operator. The result lacks table formatting and mixes up the layout of multi-column text.
Document-level OCR parsing using a purely VLM-based approach. Randomly omits the final section of the page.
Multimodal hybrid parsing. The result is accurate and complete, capturing and properly formatting the full content of the document.


#How it Works

Under the hood, our document parsing demo simply calls the production-grade API provided by the adv-complex-doc-parser Instill VDP pipeline on Instill Cloud. Please check out the following flow chart for a visual explanation of how this pipeline operates:
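For readers who want to call the pipeline programmatically, here is a hedged sketch of what an HTTP trigger might look like. The endpoint path and payload shape below are assumptions for illustration only; consult the Instill Cloud API reference for the exact trigger URL and request schema. The request is constructed but never sent.

```python
import json
import urllib.request

# Sketch of triggering a pipeline over HTTP. The URL path and payload shape
# are ASSUMPTIONS for illustration; check the Instill Cloud API reference
# for the real trigger endpoint and schema. Nothing is sent here.

API_TOKEN = "YOUR_INSTILL_API_TOKEN"  # placeholder credential
url = ("https://api.instill.tech/v1beta/namespaces/instill-ai/"
       "pipelines/adv-complex-doc-parser/trigger")  # assumed path
body = json.dumps({"inputs": [{"document_url": "https://example.com/report.pdf"}]})

request = urllib.request.Request(
    url,
    data=body.encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(request) would send it; omitted in this sketch.
print(request.get_header("Content-type"))
```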

Flow chart showing the key stages in the hybrid multimodal document parsing pipeline.

Here's a more detailed walkthrough of the key stages displayed in this diagram:

  1. Document-to-Image Conversion: The pipeline begins by converting the pages of the document into high-resolution images (300 DPI). This enables it to capture the visual layout and implicit hierarchies often overlooked by heuristic methods.

  2. Parsing with Document Operator: Simultaneously, the pipeline processes the document using our heuristics-based Document Operator to generate an initial Markdown draft. This provides a structured textual representation of the document, offering a solid starting point for further refinement.

  3. Batch Processing for Parallelization: The images and Markdown drafts are split into manageable batches, typically groups of four. This enables efficient parallel processing, which speeds up the pipeline while maintaining alignment between images and text and improving formatting consistency across pages.

  4. Refinement with Visual Language Models (VLMs): Each batch of Markdown is paired with its corresponding batch of images, and these batches are then fed, in parallel, to a Visual Language Model. The VLM iteratively enhances each batch of Markdown by:

    • Correcting inaccuracies in text formatting and structure.
    • Formatting complex elements like tables.
    • Adding descriptive details for visual elements (e.g., images or diagrams).
    • Ensuring coherence and completeness across the content.
  5. Fallback Mechanism with Document OCR: In cases where an initial Markdown draft cannot be generated (e.g., scanned PDFs or non-textual documents), the pipeline activates a document-level OCR fallback process. This extracts text directly from the high-resolution images.

  6. Final Integration: Refined Markdown batches are merged into a cohesive and rich output, resulting in a complete and well-structured representation of the original document.
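The batching, refinement, and integration stages above can be sketched in a few lines. Everything here is a simplified stand-in: `refine` represents the VLM call, and the batch size of four mirrors the typical grouping described in stage 3.

```python
# Minimal sketch of stages 3, 4, and 6 above: split page images and the
# heuristic Markdown draft into aligned groups of four, refine each batch
# (stand-in for the VLM call), then merge the results.

def batch(items, size=4):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def parse_document(page_images, draft_pages, refine):
    image_batches = batch(page_images)
    draft_batches = batch(draft_pages)
    # Each image batch is paired with its Markdown batch; real pipelines
    # would run these refine calls in parallel.
    refined = [refine(imgs, drafts)
               for imgs, drafts in zip(image_batches, draft_batches)]
    return "\n\n".join(refined)  # final integration step

# Toy stand-ins: 6 pages split into batches of 4 and 2.
images = [f"page-{i}.png" for i in range(6)]
drafts = [f"draft {i}" for i in range(6)]
result = parse_document(images, drafts, lambda im, dr: " / ".join(dr))
print(result)
```

Keeping images and drafts in lock-step batches is the key design point: the VLM always sees the page image alongside the draft text it is correcting.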

👉 Click the button below to check out the pipeline in more detail:

From here, you can clone and customize it to suit your needs using our user-friendly low-code pipeline editor, as shown below:

Advanced complex document parsing pipeline from inside the pipeline editor.

#Conclusion

Extracting the knowledge contained in complex documents to make it accessible to LLMs is a significant challenge, particularly for PDF files. The variety of data types and formats, such as text, tables, and images, adds to the complexity. However, by using a Hybrid Multimodal Approach, combining heuristics-based parsing with multimodal VLMs, you can unlock the full AI potential of your documents. This approach ensures accurate extraction and processing, resulting in high-quality, Markdown-formatted text ready for AI model ingestion.

#Next Steps

Parsing documents to Markdown-formatted text is the first, and arguably most critical, step in building a RAG-ready knowledge base, as it lays the groundwork for everything that follows. With Instill Artifact, you can easily automate the entire knowledge base creation process, ensuring that your data is well structured, embedded, and stored for efficient retrieval. Here's a breakdown of the process that Artifact automates:

  1. Convert Documents to Markdown: The input files are first converted into Markdown-formatted text using our Hybrid Multimodal Approach.
  2. Context-Aware Chunking: The text is then chunked using the inherent structure of Markdown, ensuring that the chunks align with natural semantic boundaries.
  3. Embedding and Vectorization: Each chunk is embedded into vectors that capture its semantic meaning, enabling fast and precise retrieval.
  4. Storage in a Knowledge Base: The chunks are stored in a vector database, forming a searchable, AI-ready knowledge base.
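Step 2 above is the one that depends most directly on parsing quality. As a minimal illustration of context-aware chunking, the sketch below splits Markdown on its heading lines so each chunk follows a natural semantic boundary; it is a simplified example, not the algorithm Artifact itself uses.

```python
import re

# Sketch of context-aware chunking: split Markdown at heading lines so each
# chunk aligns with a semantic boundary, rather than cutting at a fixed
# character count mid-sentence. Simplified illustration only.

def chunk_by_headings(markdown):
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Intro\nSome text.\n## Details\nMore text.\n## Summary\nDone."
chunks = chunk_by_headings(doc)
print(len(chunks))
```

Because each chunk starts at a heading, the embedded vectors in step 3 carry a coherent topic, which tends to improve retrieval precision over naive fixed-size chunking.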

By leveraging the power of VDP and Artifact, you can make even the most complex documents RAG-ready and prepared for deep integration into AI systems.

👉 Sign up for free and make your data AI-Ready today.

