<img height="1" width="1" style="display:none;" alt="" src="https://px.ads.linkedin.com/collect/?pid=2604436&amp;fmt=gif">
Docugami Science Team at BayLearn 2024, presenting research on the science of agentic reasoning with Small Language Models (SLMs)
Document Engineering

Docugami’s Small Agentic Reasoning Model Outperforms Virtually All Expensive GPT-4 Frameworks in Public Industry Benchmarks


Docugami was honored to be selected to present agentic research at the prestigious BayLearn machine learning symposium earlier this year. Last week, this work was evaluated in several public industry benchmarks built on top mathematical reasoning datasets, showing that Docugami’s small agentic reasoning models outperform all open-source reasoning models of comparable size and virtually all GPT-4-based frameworks. Our small agentic reasoning models achieve these exceptional results without the need for costly proprietary LLMs or extensive human prompting.

Without question, the mathematical reasoning capabilities of AI are increasing rapidly. Most of the attention has focused on approaches using Large Language Models (LLMs) and intensive human prompt engineering to address complex mathematical reasoning problems.

However, these approaches have important limitations and dependencies. Prompt engineering is labor-intensive, often requiring hand-crafted prompts that may not capture the full complexity and diversity of the data. Proprietary Large Language Models raise data privacy and security concerns, especially with sensitive business documents, and present scalability issues due to their high cost and latency.

Our work in this area stems from Docugami’s roots as a business document foundation model, focused on the unique business documents of individual organizations. Docugami’s world-class Science team is working to advance AI approaches that can tackle complex mathematical reasoning problems over business documents with extraordinary precision, lower costs, greater efficiency, and improved data privacy and security for sensitive information.

We recently developed a novel, cost-effective method for training AI agents to solve tabular data problems through reasoning, planning, and tool use. With a progressive self-improvement paradigm and iterative weak supervision, Docugami’s small agentic reasoning model requires only 3.8B- and 8B-parameter Small Language Models (SLMs), making it especially well suited for local hosting and for sensitive business contexts where data privacy is vital.
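
To make the training paradigm concrete, below is a minimal sketch of one round of iterative weak supervision, under our own simplifying assumptions rather than Docugami’s actual pipeline: the SLM samples tool-using reasoning traces, a trace is kept only if executing it reproduces the known final answer, and the surviving traces become fine-tuning data for the next round. The helpers `generate_traces`, `execute_trace`, and `fine_tune` are hypothetical stand-ins.

```python
# Illustrative sketch of one round of iterative weak supervision for a
# tool-using SLM agent. generate_traces, execute_trace, and fine_tune
# are hypothetical stand-ins, not Docugami's actual code.

def answers_match(pred: float, gold: float, tol: float = 1e-4) -> bool:
    """Treat a trace as valid if its final answer matches the gold answer."""
    return abs(pred - gold) <= tol * max(1.0, abs(gold))

def weak_supervision_round(model, dataset, samples_per_question=8):
    """Collect self-generated training data validated only by final answers."""
    accepted = []
    for question, table, gold_answer in dataset:
        # Sample several candidate reasoning traces (plans plus tool calls).
        traces = generate_traces(model, question, table, n=samples_per_question)
        for trace in traces:
            # Run the trace's tool calls and keep the trace only if it
            # reproduces the known final answer (the weak supervision signal).
            predicted = execute_trace(trace)
            if answers_match(predicted, gold_answer):
                accepted.append((question, table, trace))
                break  # one validated trace per question is enough here
    # Fine-tune the SLM on its own validated traces; the improved model
    # seeds the next round, giving progressive self-improvement.
    return fine_tune(model, accepted)
```

Because the only supervision signal is the final answer, a loop like this needs no human-written reasoning annotations, which is what makes the paradigm inexpensive.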

Using flexible tools that can be reused across different datasets, Docugami’s small agentic model achieves exceptional performance and scales well across a variety of shared tasks.
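
As an illustration of what a flexible, reusable tool can look like in practice, here is a small, self-contained arithmetic evaluator that an agent could call with values extracted from any of the benchmark tables; the interface is our own illustrative assumption, not Docugami’s published tooling.

```python
# Illustrative example of a reusable, dataset-agnostic tool: a safe
# arithmetic evaluator an agent can call with values it extracted from
# any table (FinQA, TAT-QA, TabMWP, ...). The interface is hypothetical.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Evaluate a plain arithmetic expression such as '(231.2 - 217.9) / 217.9'."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval"))

# The same tool serves every benchmark, e.g. a FinQA-style growth rate:
# calculator("(231.2 - 217.9) / 217.9")  ->  0.0610...
```

Because the tool only knows about arithmetic, not about any particular dataset, the same implementation can be reused unchanged wherever the agent needs a numerical calculation.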

We recently evaluated Docugami’s small agentic model using several of the most widely used datasets for benchmarking mathematical problem-solving. The FinQA and TAT-QA datasets represent complex real-world reasoning scenarios that mix tabular and textual data and require reasoning over financial information in intricate contexts. The TabMWP dataset involves mathematical word problems over tabular data. All three datasets require multi-step reasoning, information extraction, data manipulation, tabular and contextual understanding, and numerical calculations, making them useful evaluators for AI intended for use with lengthy business documents and complex business scenarios.
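
For intuition about what these benchmarks demand, consider a toy problem in the TabMWP style (our own example, not an actual dataset item): given a small price table, answer “How much do 3 notebooks and 2 pens cost?” Answering requires extracting the right cells and then delegating the arithmetic to code.

```python
# Toy TabMWP-style problem (our own example, not an actual dataset item),
# showing the extract-then-compute pattern these benchmarks require.
prices = {"notebook": 1.50, "pen": 0.75, "eraser": 0.40}  # the "table"

# Step 1: information extraction - look up the relevant cells.
notebook_price = prices["notebook"]
pen_price = prices["pen"]

# Step 2: numerical calculation - delegate arithmetic to a tool, not the LM.
total = 3 * notebook_price + 2 * pen_price
print(total)  # 6.0
```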

For the FinQA dataset, the experimental results demonstrate that Docugami’s small agentic model (identified as Docugami MATATA, for “MAthematical Tool-Assisted reasoning for Tabular Applications”) outperforms all existing models, including large closed-source models based on GPT-4 and ChatGPT and open-source models that use human expert annotations for training.

Table 1. FinQA Dataset Accuracy for Proprietary and Open-Source Models. Leaderboard of accuracy scores for mathematical reasoning with the FinQA dataset for various proprietary and open-source approaches.
PE: Prompt Engineering; FT: Fine-Tuning.

| # | Framework | Closed-/Open-source | Model | Method | FinQA Accuracy |
|---|-----------|---------------------|-------|--------|----------------|
| 1 | Docugami MATATA | Open-source | Ministral-8B | FT | 77.59 |
| 2 | TAT-LLM | Open-source | Llama2-70B | FT | 76.81 |
| 3 | EEDP | Closed-source | GPT-4 | PE | 76.05 |
| 4 | TAT-LLM | Open-source | Llama2-13B | FT | 71.93 |
| 5 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 70.10 |
| 6 | PoT | Closed-source | PoT-SC-Codex | PE | 68.1 |
| 7 | TAT-LLM | Open-source | Llama2-7B | FT | 65.13 |
| 8 | EEDP | Closed-source | ChatGPT | PE | 61.88 |
| 9 | FinQANet | Open-source | RoBERTa | FT | 61.24 |

For the TAT-QA dataset, the experimental results demonstrate that Docugami’s small agentic models achieve outstanding results, outperforming virtually all other approaches, including many built on much larger and more expensive models.

Table 2. TAT-QA Dataset Exact Match Accuracy for Proprietary and Open-Source Models. Leaderboard of exact match accuracy scores for mathematical reasoning with the TAT-QA dataset for various proprietary and open-source approaches. Source: https://nextplusplus.github.io/TAT-QA/
PE: Prompt Engineering; FT: Fine-Tuning.

| # | Framework | Closed-/Open-source | Model | Method | TAT-QA Accuracy |
|---|-----------|---------------------|-------|--------|-----------------|
| 1 | TAT-LLM | Open-source | Llama2-70B | FT | 81.4 |
| 2 | Docugami MATATA | Open-source | Ministral-8B | FT | 77.6 |
| 3 | TAT-LLM | Open-source | Llama2-13B | FT | 77.5 |
| 4 | Code Generation for Table-Text Question | n/a | n/a | n/a | 76.8 |
| 5 | TAT-LLM | Open-source | Llama2-7B | FT | 76.4 |
| 6 | AeNER | n/a | n/a | n/a | 75.0 |
| 7 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 74.2 |
| 8 | Code Generation for Table-Text Question | n/a | n/a | n/a | 73.7 |
| 9 | Encore | Open-source | BART-Large | FT | 71.8 |
| 10 | KFEX-N | Open-source | DeBERTa-V3-Large | FT | 71.0 |


Similarly, for the TabMWP dataset, the experimental results demonstrate that Docugami’s small agentic model outperforms all existing open-source models, even those that rely on much larger LLMs or on training data annotated by GPT-4. Our approach also outperforms all but one closed-source model, nearly matching the performance of the much larger and much more expensive GPT-4 model, with 98.13 percent accuracy compared to 98.78 percent for GPT-4.

Table 3. TabMWP Dataset Accuracy for Proprietary and Open-Source Models. Leaderboard of accuracy scores for mathematical reasoning with the TabMWP dataset for various proprietary and open-source approaches.
Source: https://promptpg.github.io/leaderboard.html
PE: Prompt Engineering; FT: Fine-Tuning.

| # | Framework | Closed-/Open-source | Model | Method | TabMWP Accuracy |
|---|-----------|---------------------|-------|--------|-----------------|
| 1 | Chameleon | Closed-source | GPT-4 | PE | 98.78 |
| 2 | Docugami MATATA | Open-source | Ministral-8B | FT | 98.13 |
| 3 | PoT GPT-4 | Closed-source | GPT-4 | PE | 96.93 |
| 4 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 96.66 |
| 5 | CREATOR | Closed-source | ChatGPT | PE | 94.7 |
| 6 | Chameleon | Closed-source | ChatGPT | PE | 93.28 |
| 7 | TaCo | Open-source | TAPEX-large | FT | 92.91 |
| 8 | PoT ChatGPT + Doc | Closed-source | PoT-SC-Codex | PE | 92.69 |
| 9 | CoT GPT-4 | Closed-source | GPT-4 | PE | 90.81 |
| 10 | CoS-Planning | Closed-source | ChatGPT | PE | 90.00 |
| 11 | PoT ChatGPT | Closed-source | ChatGPT | PE | 89.49 |


As a company focused on transforming businesses of all sizes and across all sectors by unlocking the data and insights contained in complex business documents, Docugami aims with this work to reduce reliance on proprietary Large Language Models and human prompt engineering, delivering comparable or better performance with lower costs, faster local results, and increased data privacy. This work will be particularly important in business settings where cost-effectiveness, operational efficiency and flexibility, and data privacy and security are key considerations.

We’re excited by the strong initial support for our work at BayLearn, and we see enormous potential in our novel approach to unlock complex document data and advance mathematical reasoning. Early results from several public industry benchmarks demonstrate that Docugami’s agentic approach, using tool-augmented Small Language Models with weak supervision, can meet or exceed the performance of existing reasoning methods that rely on costly LLMs like GPT-4 and ChatGPT and on extensive human prompt engineering. Our dedicated Science team will continue to accelerate progress in the important area of agentic models.


Note: You can read our detailed paper here: https://arxiv.org/abs/2411.18915. Ministral-8B was used under the Mistral AI Non-Production License (link), for research purposes only; Phi3-mini was used under the MIT License (link). Funding for this work was provided in part by Mitacs, the Canadian innovation organization.

 
