Maximizing Data Extraction Precision with Dual LLMs Integration and Human-in-the-Loop

PeterTHLee · Sep 3, 2024

While improving data extraction accuracy is vital, validating the correctness of the extracted data is equally important. The Layout model in Document Intelligence, combined with markdown output and semantic chunking, plays a key role in dividing documents into clear sections and subsections. This approach enhances navigation, comprehension, and information retrieval by preserving the relationships between sections and other structured elements (such as tables, paragraphs, and figures). This structure helps LLMs understand data more contextually and accurately during extraction. To learn more details on this concept:

More accuracy and Human-in-the-Loop

However, our customers continue to face challenges in achieving nearly 100% accuracy. They also seek ways to incorporate human validation into the process through a Human-in-the-Loop (HITL) approach, so that critical data points (such as financial figures, legal terms, or medical data) are captured accurately, especially in the initial stages, before human intervention is potentially phased out.

This article proposes a dual approach built on two Large Language Model (LLM) components, Data Extraction and Data Validation, akin to the "two heads are better than one" concept. Data Extraction converts the document to markdown and uses an LLM (e.g., GPT-4o) to extract data as JSON based on a predefined schema, then passes the result back to the system. The system then calls Data Validation with the same schema, which extracts the data again using Document Intelligence or a different LLM and compares it against the output of the first extraction. Any discrepancies are sent to the front-end UI for human validation.

The Data Validation component in this setup must be interoperable, allowing easy integration with other LLMs as models evolve with new capabilities. The goal is a flexible system that can adapt to future improvements in LLM technology.
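
One way to keep the validation side swappable is to hide each backend behind a common callable signature. The following is a minimal sketch under that assumption, not the article's implementation; `Validator`, `register_validator`, `run_validation`, and the stub backend are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical contract: a validator takes a file path and the shared schema,
# and returns the extracted values keyed by field name.
Validator = Callable[[str, Dict[str, Any]], Dict[str, Any]]

_VALIDATORS: Dict[str, Validator] = {}


def register_validator(name: str, validator: Validator) -> None:
    """Make a validation backend available under a short name."""
    _VALIDATORS[name] = validator


def run_validation(name: str, file_path: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Run the named backend; the caller does not need to know which model it wraps."""
    if name not in _VALIDATORS:
        raise KeyError(f"No validator registered under '{name}'")
    return _VALIDATORS[name](file_path, schema)


def stub_validator(file_path: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in backend for illustration: returns an empty value for every field."""
    return {field: "" for field in schema}


register_validator("stub", stub_validator)
```

With this shape, replacing Document Intelligence with another LLM would mean registering one more function, while the comparison and UI code stay unchanged.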

For our demonstration, we use the latest Document Field Extraction model, which harnesses generative AI to accurately extract specific fields from documents, regardless of their visual templates. This custom model combines Document Intelligence's specialized algorithms with Large Language Models (LLMs) and precise custom extraction schemas. It also provides a confidence score for each field and offers training capabilities to further enhance accuracy.

Below is a summary that illustrates how the process works.

Overall data extraction and validation process with human in the loop

Define the schema in JSON format to extract data.
The system calls Data Extraction, which converts the PDF or image file to markdown and sends the markdown, along with your predefined schema, in the prompt message. The completion, in JSON format, is sent back to the system.
The system initiates validation by calling Data Validation. Documents can be submitted for analysis using the REST API or client libraries. The custom generative AI model is effective at extracting straightforward fields without labeled samples, but providing labeled examples can significantly enhance accuracy, especially for complex fields like tables. Note: If a different LLM is used, consider applying the same approach (markdown) as in the initial data extraction.
The validation process compares the two sets of extracted values field by field, based on the schema. Mismatched values are flagged and sent to the user interface (UI) for human validation.
Users validate the mismatched data, selecting the correct value based on the displayed PDF or image file with highlighted discrepancies. They also have the option to input a new value if both presented values are incorrect. This approach, which focuses on reviewing only the mismatched data rather than every field, leverages two LLMs to enhance accuracy while minimizing the need for extensive human involvement.
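
The comparison step in the process above can be sketched as follows. This is a minimal illustration that assumes both passes return a flat dictionary keyed by the schema's field names; `find_mismatches` is a hypothetical helper, and the sample values are made up for the example.

```python
from typing import Any, Dict, List


def find_mismatches(
    extracted: Dict[str, Any],
    validated: Dict[str, Any],
    schema: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Return one record per schema field whose two extracted values differ."""
    mismatches = []
    for field in schema:
        first = extracted.get(field)
        second = validated.get(field)
        if first != second:
            mismatches.append(
                {"field": field, "extraction": first, "validation": second}
            )
    return mismatches


# Made-up values, only to show the shape of the result
schema = {"borrower_name": {"type": "string"}, "apn_number": {"type": "number"}}
first_pass = {"borrower_name": "JANE DOE", "apn_number": 123}
second_pass = {"borrower_name": "JANE DOE", "apn_number": 124}
print(find_mismatches(first_pass, second_pass, schema))
# prints [{'field': 'apn_number', 'extraction': 123, 'validation': 124}]
```

Only the records returned here would need to reach the reviewer. The sketch uses exact equality, so a real system might normalize whitespace or punctuation before comparing.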

JSON Schema Definition and prompt in Data extraction:

Code:

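def extract_fields(document_content):
    # NOTE: the function name is illustrative. `client` (an Azure OpenAI client)
    # and `azure_openai_model` are assumed to be configured elsewhere.
    system_message = """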
    ### you are AI assistant that helps extract information from given context.
    - context will be given by the user.
    - you will extract the relevant information using this json schema:
        ```json
        {
            "amount_of_consideration": {
                "type": "number"
            },
            "borrower_name": {
                "type": "string"
            },
            "trustor": {
                "type": "string"
            },
            "apn_number": {
                "type": "number"
            },
            "title_order_number": {
                "type": "number"
            }
        }
        ```
    - if you are unable to extract the information, return JSON with the keys and empty strings or 0 as values.
    """

    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": document_content}
    ]
    try:
        response = client.chat.completions.create(
            model=azure_openai_model, # The deployment name you chose when you deployed the GPT-35-Turbo or GPT-4 model.
            messages=messages,
            response_format={ "type": "json_object" },
        )
        response_message = response.choices[0].message
        return response_message.content
    except Exception as e:
        # report the failure and signal it to the caller
        print(f"Error: {e}")
        return None
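
The model returns its answer as a JSON string, so the caller still has to parse it and make sure every schema field is present. A minimal sketch of that step, assuming the flat schema shown in the prompt; `parse_extraction` and `DEFAULTS` are hypothetical names, and the defaults mirror the prompt's instruction to use empty strings or 0.

```python
import json
from typing import Any, Dict

# Fallback values per schema type, following the prompt's instruction
DEFAULTS = {"string": "", "number": 0}


def parse_extraction(raw: str, schema: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Parse the model's JSON reply and fill in any field the model left out."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {
        field: data.get(field, DEFAULTS.get(spec.get("type"), ""))
        for field, spec in schema.items()
    }


schema = {"borrower_name": {"type": "string"}, "apn_number": {"type": "number"}}
print(parse_extraction('{"borrower_name": "JANE DOE"}', schema))
# prints {'borrower_name': 'JANE DOE', 'apn_number': 0}
```

Keys the model adds outside the schema are dropped, which keeps the later field-by-field comparison aligned with the schema.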

Data read from Document Intelligence in Data validation:

Code:

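import time

import requests

# docintel_endpoint, docintel_key and docintel_custom_model_name are assumed
# to be defined elsewhere (for example, loaded from environment variables).


def get_schema_from_model():
    """
    Get the field schema from the custom Document Intelligence model.
    Returns None if the request fails.
    """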

    url = f"{docintel_endpoint}documentintelligence/documentModels/{docintel_custom_model_name}"
    headers = {
        "Ocp-Apim-Subscription-Key": docintel_key
    }
    params = {
        "api-version": "2024-07-31-preview"
    }
    try:
        response = requests.get(url, headers=headers, params=params)
        resp = response.json()
        # print(resp)
        field_schema = resp["docTypes"][docintel_custom_model_name]["fieldSchema"]
        return field_schema
    except Exception as e:
        print(f"Error: {e}")

    return None


def get_response_from_ai_doc_intel(target_file):
    """
    Submit a local file to the custom model and poll until the analysis
    finishes. Returns the analyzeResult, or None on failure.
    """
    url = f"{docintel_endpoint}documentintelligence/documentModels/{docintel_custom_model_name}:analyze"
    headers = {
        "Ocp-Apim-Subscription-Key": docintel_key,
        "Content-Type": "application/octet-stream"
    }
    params = {
        "api-version": "2024-07-31-preview",
        "outputContentFormat": "markdown"
    }

    # get file from documents folder in the main directory and submit it
    with open(target_file, "rb") as f:
        submit_analysis = requests.post(url, params=params, headers=headers, data=f)

    if submit_analysis.status_code != 202:
        print(f"Error: {submit_analysis.json()}")
        return None

    # get the operation location, which is the URL to poll for the result
    operation_location = submit_analysis.headers["Operation-Location"]
    print(operation_location)

    # poll until the analysis reaches a terminal state
    while True:
        response = requests.get(operation_location, headers={"Ocp-Apim-Subscription-Key": docintel_key})

        if response.status_code != 200:
            print(f"Error: {response.json()}")
            return None

        analysis_results = response.json()

        # "notStarted" is also a non-terminal state, so keep polling for it too
        if analysis_results["status"] in ("notStarted", "running"):
            # wait for 5 seconds
            print("Analysis is still running...")
            time.sleep(5)
            continue

        if analysis_results["status"] != "succeeded":
            print(f"Error: {analysis_results}")
            return None

        return analysis_results["analyzeResult"]
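
Before the two results can be compared, the analysis result has to be reduced to plain field values. The sketch below assumes the usual Document Intelligence layout, where fields sit under `documents[0]["fields"]` and each field carries a `type`, a typed value key such as `valueString` or `valueNumber`, a raw `content`, and a `confidence`. `flatten_fields` is a hypothetical helper, and the exact payload should be checked against the API version in use.

```python
from typing import Any, Dict


def flatten_fields(analyze_result: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Reduce an analyzeResult to {field_name: {"value": ..., "confidence": ...}}."""
    documents = analyze_result.get("documents") or []
    if not documents:
        return {}

    flat = {}
    for name, field in documents[0].get("fields", {}).items():
        field_type = field.get("type") or "string"
        # typed values are stored under keys such as valueString or valueNumber
        value_key = "value" + field_type[0].upper() + field_type[1:]
        flat[name] = {
            # fall back to the raw text when no typed value is present
            "value": field.get(value_key, field.get("content")),
            "confidence": field.get("confidence"),
        }
    return flat
```

Keeping the confidence next to each value lets the front end show it alongside a mismatch.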

Output:

Code:

{
 "amount_of_consideration": 751741195400000,
 "borrower_name": "TAYLER A GARDNER AND RYAN D LUTZ, TENANCY BY ENTIRETY.",
 "trustor": "TAYLER A GARDNER AND RYAN D LUTZ, TENANCY BY ENTIRETY,",
 "apn_number": 751714195400000,
 "title_order_number": 2209095561
}

The front-end interface below demonstrates how users can manage discrepancies by correcting only the mismatched fields, significantly improving the accuracy of data extraction to nearly 100%. In the left column under "Fields," you'll find a list of extracted fields. Selecting a field's radio button will display the comparison results under "Field Information." A green highlight indicates an exact match, while a red highlight points to a mismatch, accompanied by a lower confidence score (e.g., 0.159), as shown in the figure below. Users should focus on these red-highlighted fields, either accepting the correct value or overwriting it with a new one in the editable text box if both options are incorrect.
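
The reviewer's three options (keep the extraction value, keep the validation value, or type a new one) can be sketched as a small resolution function. This is an illustration only and is not taken from the repository; `resolve_field` and the choice labels are hypothetical, and the mismatch record is assumed to hold the two candidate values under `extraction` and `validation`.

```python
from typing import Any, Dict, Optional


def resolve_field(mismatch: Dict[str, Any], choice: str, override: Optional[Any] = None) -> Any:
    """Return the final value for one mismatched field based on the reviewer's decision."""
    if choice == "extraction":
        return mismatch["extraction"]
    if choice == "validation":
        return mismatch["validation"]
    if choice == "override":
        # the reviewer typed a new value because both candidates were wrong
        if override is None:
            raise ValueError("An override value is required")
        return override
    raise ValueError(f"Unknown choice: {choice}")


# Made-up mismatch record for illustration
mismatch = {"field": "apn_number", "extraction": 123, "validation": 124}
print(resolve_field(mismatch, "validation"))
# prints 124
```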

Front App for Human Validation

For the detailed implementation and a clearer image, please visit our GitHub repository.

LLMs Selection

To select the right LLMs, you typically need a language model that can understand and process text effectively, including the structure and syntax specific to markdown, and that has validation capabilities. AI Studio, as a platform, offers a variety of language models (LLMs) to choose from.

Possible options include:

Document Field Extraction within Document Intelligence (LLM + Document Intelligence specialized algorithms)
GPT-4 or GPT-3.5 Turbo
BERT-based models: Models like BERT, RoBERTa, or DistilBERT are excellent for understanding context and can be trained or fine-tuned on specific validation tasks
