Docugami’s Small Agentic Reasoning Model Outperforms Virtually All Expensive GPT-4 Frameworks in Public Industry Benchmarks
Docugami was honored to be selected to present agentic research at the prestigious BayLearn machine learning symposium earlier this year. Last week, this work was evaluated on several public industry benchmarks using top mathematical reasoning datasets, showing that Docugami’s small agentic reasoning models outperform all open-source reasoning models of comparable size and virtually all GPT-4 based frameworks. Our small agentic reasoning models achieve exceptional results without the need for costly proprietary LLMs or extensive human prompting.
Without question, the mathematical reasoning capabilities of AI are increasing rapidly. Most of the attention has focused on approaches using Large Language Models (LLMs) and intensive human prompt engineering to address complex mathematical reasoning problems.
However, these approaches have important limitations and dependencies. Prompt engineering is labor-intensive, often requiring hand-crafted prompts, and may not capture the full complexity and diversity of the data. Proprietary Large Language Models raise data privacy and security concerns, especially with sensitive business documents, and present scalability issues due to their high cost and latency.
Our work in this area stems from Docugami’s roots as a business document foundation model, focused on the unique business documents of individual organizations. Docugami’s world-class Science team is working to advance AI approaches that can tackle complex mathematical reasoning problems over business documents with extraordinary precision, lower costs, greater efficiency, and improved data privacy and security for sensitive information.
We recently developed a novel, cost-effective method for training AI agents for tabular data problem-solving through reasoning, planning, and tool use. With a progressive self-improvement paradigm and iterative weak supervision, Docugami’s small agentic reasoning model requires only 3.8B- and 8B-parameter Small Language Models (SLMs), making it especially well-suited for local hosting and sensitive business contexts where data privacy is vital.
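To make the idea of iterative weak supervision concrete, here is a minimal sketch assuming the only training signal is whether an executed plan reproduces the known final answer. The names and structure below (Problem, collect_verified_trajectories, the stub plan generator) are illustrative assumptions for exposition, not Docugami’s actual implementation.

```python
# Hypothetical sketch: the only supervision is final-answer matching, so no
# human-written rationales or prompts are needed.
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    table: dict          # flattened table: cell name -> numeric value
    gold_answer: float   # the only supervision available

def execute_plan(plan: str, table: dict) -> float:
    # Run a model-generated arithmetic plan against the table. Here a "plan"
    # is simply a formula over cell names, e.g. "revenue - cost".
    return float(eval(plan, {"__builtins__": {}}, dict(table)))

def collect_verified_trajectories(generate_plan, problems):
    # Keep only self-generated plans whose executed result matches the gold
    # answer; these survivors become the next round's fine-tuning data.
    accepted = []
    for p in problems:
        plan = generate_plan(p)
        try:
            if abs(execute_plan(plan, p.table) - p.gold_answer) < 1e-6:
                accepted.append((p, plan))
        except Exception:
            pass  # malformed plans are simply discarded
    return accepted

# Demo with a toy problem and a stub "model" that proposes one plan:
toy = Problem("What is revenue minus cost?", {"revenue": 10.0, "cost": 4.0}, 6.0)
kept = collect_verified_trajectories(lambda p: "revenue - cost", [toy])
print(len(kept))  # 1 -- the executed answer matched, so the plan is kept

# Each progressive self-improvement round would then regenerate plans with
# the current SLM, filter by final-answer match, and fine-tune on survivors:
#   for _ in range(rounds):
#       data = collect_verified_trajectories(model.generate_plan, train_set)
#       model = fine_tune_on(model, data)   # hypothetical fine-tuning step
```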
By sharing flexible, reusable tools across different datasets, Docugami’s small agentic model achieves exceptional performance and excellent scalability on a variety of shared tasks.
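As an illustration of what flexible, reusable tools can look like, the sketch below registers two dataset-agnostic tools in a simple registry; because they operate on generic tables and numbers rather than any one dataset’s format, the same tools could serve FinQA-, TAT-QA-, and TabMWP-style questions alike. The registry and tool names are hypothetical, not Docugami’s actual toolset.

```python
# Hypothetical tool registry: an agent plan invokes tools by name.
TOOLS = {}

def tool(fn):
    """Register a function so an agent plan can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def lookup(table: dict, cell: str) -> float:
    """Extract a numeric value from a flattened table."""
    return float(table[cell])

@tool
def percent_change(new: float, old: float) -> float:
    """Percentage change, a step that recurs across financial QA datasets."""
    return (new - old) / old * 100.0

# An agent plan is then just a sequence of named tool calls:
table = {"revenue 2023": "120", "revenue 2022": "100"}
new = TOOLS["lookup"](table, "revenue 2023")
old = TOOLS["lookup"](table, "revenue 2022")
print(TOOLS["percent_change"](new, old))  # 20.0
```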
We recently evaluated Docugami’s small agentic model using several of the most widely used datasets for benchmarking mathematical problem-solving. The FinQA and TAT-QA datasets represent complex real-world reasoning scenarios, combining tabular and text data and requiring reasoning over financial information in intricate contexts. The TabMWP dataset involves mathematical word problems over tabular data. All three datasets require multi-step reasoning, information extraction, data manipulation, tabular and contextual understanding, and numerical calculation, making them useful evaluators for AI intended for use with lengthy business documents and complex business scenarios.
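To see why these benchmarks are demanding, consider a small TabMWP-style example (invented here for exposition, not drawn from the dataset): answering even a simple shopping question requires extracting the right cells, manipulating them row by row, and then aggregating.

```python
# An illustrative tabular word problem and the multi-step chain it demands.
table = {
    "notebook": {"price": 1.50, "quantity": 4},
    "pencil":   {"price": 0.25, "quantity": 10},
    "backpack": {"price": 12.00, "quantity": 1},
}
# Question: "How much does it cost to buy everything on the list?"
# Step 1: extract price and quantity per row   (information extraction)
# Step 2: multiply within each row             (data manipulation)
# Step 3: sum across rows                      (numerical calculation)
total = sum(row["price"] * row["quantity"] for row in table.values())
print(total)  # 20.5
```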
For the FinQA dataset, the experimental results demonstrate that Docugami’s small agentic model (identified as Docugami MATATA, for “MAthematical Tool-Assisted reasoning for Tabular Applications”) outperforms all existing models, including large closed-source frameworks built on GPT-4 and ChatGPT and open-source models trained with human expert annotations.
Table 1. FinQA Dataset Accuracy for Proprietary and Open-Source Models. Leaderboard of accuracy scores for mathematical reasoning with the FinQA dataset for various proprietary and open-source approaches.
PE: Prompt Engineering; FT: Fine-Tuning.
Rank | Framework | Closed-source/Open-source | Model | Method | FinQA Accuracy
1 | Docugami MATATA | Open-source | Ministral-8B | FT | 77.59
2 | TAT-LLM | Open-source | Llama2-70B | FT | 76.81
3 | EEDP | Closed-source | GPT-4 | PE | 76.05
4 | TAT-LLM | Open-source | Llama2-13B | FT | 71.93
5 | Docugami MATATA | Open-source | Phi3-mini 3.8B | FT | 70.10
6 | PoT | Closed-source | PoT-SC-Codex | PE | 68.1
7 | TAT-LLM | Open-source | Llama2-7B | FT | 65.13
8 | EEDP | Closed-source | ChatGPT | PE | 61.88
9 | FinQANet | Open-source | RoBERTa | FT | 61.24
For the TAT-QA dataset, the experimental results demonstrate that Docugami’s small agentic models achieve outstanding results, outperforming virtually all other approaches, including many built on much larger and more expensive models.
Table 2. TAT-QA Dataset Exact Match Accuracy for Proprietary and Open-Source Models. Leaderboard of exact match accuracy scores for mathematical reasoning with the TAT-QA dataset for various proprietary and open-source approaches. Source: https://nextplusplus.github.io/TAT-QA/
PE: Prompt Engineering; FT: Fine-Tuning.
Rank | Framework | Closed-source/Open-source | Model | Method | TAT-QA Accuracy
1 | TAT-LLM | Open-source | Llama2-70B | FT | 81.4
2 | Docugami MATATA | Open-source | Ministral-8B | FT | 77.6
3 | TAT-LLM | Open-source | Llama2-13B | FT | 77.5
4 | Code Generation for Table-Text Question | na | na | na | 76.8
5 | TAT-LLM | Open-source | Llama2-7B | FT | 76.4
6 | AeNER | na | na | na | 75.0
7 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 74.2
8 | Code Generation for Table-Text Question | na | na | na | 73.7
9 | Encore | Open-source | BART-Large | FT | 71.8
10 | KFEX-N | Open-source | DeBERTa-V3-Large | FT | 71.0
Similarly, for the TabMWP dataset, the experimental results demonstrate that Docugami’s small agentic model outperforms all existing open-source models, even those that rely on much larger LLMs or on training data annotated by GPT-4. Our approach also outperforms all but one closed-source model, and nearly matches the performance of the much larger and much more expensive GPT-4 model, with 98.13 percent accuracy compared to 98.78 percent for GPT-4.
Table 3. TabMWP Dataset Accuracy for Proprietary and Open-Source Models. Leaderboard of accuracy scores for mathematical reasoning with the TabMWP dataset for various proprietary and open-source approaches.
Source: https://promptpg.github.io/leaderboard.html
PE: Prompt Engineering; FT: Fine-Tuning.
Rank | Framework | Closed-source/Open-source | Model | Method | TabMWP Accuracy
1 | Chameleon | Closed-source | GPT-4 | PE | 98.78
2 | Docugami MATATA | Open-source | Ministral-8B | FT | 98.13
3 | PoT GPT-4 | Closed-source | GPT-4 | PE | 96.93
4 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 96.66
5 | CREATOR | Closed-source | ChatGPT | PE | 94.7
6 | Chameleon | Closed-source | ChatGPT | PE | 93.28
7 | TaCo | Open-source | TAPEX-large | FT | 92.91
8 | PoT ChatGPT + Doc | Closed-source | PoT-SC-Codex | PE | 92.69
9 | CoT GPT-4 | Closed-source | GPT-4 | PE | 90.81
10 | CoS-Planning | Closed-source | ChatGPT | PE | 90.00
11 | PoT ChatGPT | Closed-source | ChatGPT | PE | 89.49
As a company focused on transforming businesses of all sizes and across all sectors by unlocking the data and insights contained in complex business documents, Docugami aims with this work to reduce reliance on proprietary Large Language Models and human prompt engineering, delivering comparable or better performance with lower costs, faster local results, and increased data privacy. This work will be particularly important in business settings where cost-effectiveness, operational efficiency and flexibility, and data privacy and security are key considerations.
We’re excited by the strong initial support for our work at BayLearn, and we see enormous potential in our novel approach to unlocking complex document data and advancing mathematical reasoning. Early results from several public industry benchmarks demonstrate that Docugami’s agentic approach, using tool-augmented Small Language Models with weak supervision, can meet or exceed the performance of existing reasoning methods that depend on costly LLMs like GPT-4 and ChatGPT and on extensive human prompt engineering. Our dedicated Science team will continue to accelerate progress in the important area of agentic models.
Note: You can read our detailed paper here: https://arxiv.org/abs/2411.18915. Ministral-8B was used under the "Mistral AI Non-Production License" (link), for research purposes only; Phi3-mini was used under the MIT License (link). Funding for this work was provided in part by Mitacs, the Canadian innovation organization.