Docugami’s Small Agentic Reasoning Model Outperforms Virtually All Expensive GPT-4 Frameworks in Public Industry Benchmarks
Docugami was honored to be selected to present agentic research at the prestigious BayLearn machine learning symposium earlier this year. Last week, this work was evaluated on several public industry benchmarks using top mathematical reasoning datasets, showing that Docugami’s small agentic reasoning models outperform all open-source reasoning models of comparable size and virtually all GPT-4 based frameworks. Our small agentic reasoning models achieve exceptional results without the need for costly proprietary LLMs or extensive human prompting.
Without question, the mathematical reasoning capabilities of AI are increasing rapidly. Most of the attention has focused on approaches using Large Language Models (LLMs) and intensive human prompt engineering to address complex mathematical reasoning problems.
However, these approaches have important limitations and dependencies. Prompt engineering is labor-intensive, often requiring hand-crafted prompts, and may not capture the full complexity and diversity of the data. Proprietary Large Language Models raise data privacy and security concerns, especially with sensitive business documents, and present scalability issues due to their high cost and latency.
Our work in this area stems from Docugami’s roots as a business document foundation model, focused on the unique business documents of individual organizations. Docugami’s world-class Science team is working to advance AI approaches that can tackle complex mathematical reasoning problems over business documents with extraordinary precision, lower costs, greater efficiency, and improved data privacy and security for sensitive information.
We recently developed a novel, cost-effective method for training AI agents for tabular data problem-solving through reasoning, planning, and tool use. With a progressive self-improvement paradigm and iterative weak supervision, Docugami’s small agentic reasoning model requires only 3.8B- and 8B-parameter Small Language Models (SLMs), making it especially well-suited for local hosting and sensitive business contexts where data privacy is vital.
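To make the idea of iterative weak supervision concrete, here is a minimal sketch assuming the only training signal is whether an executed plan reproduces the known final answer. The names and structure below (Problem, collect_verified_trajectories, the stub plan generator) are illustrative assumptions for exposition, not Docugami’s actual implementation.

```python
# Hypothetical sketch: the only supervision is final-answer matching, so no
# human-written rationales or prompts are needed.
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    table: dict          # flattened table: cell name -> numeric value
    gold_answer: float   # the only supervision available

def execute_plan(plan: str, table: dict) -> float:
    # Run a model-generated arithmetic plan against the table. Here a "plan"
    # is simply a formula over cell names, e.g. "revenue - cost".
    return float(eval(plan, {"__builtins__": {}}, dict(table)))

def collect_verified_trajectories(generate_plan, problems):
    # Keep only self-generated plans whose executed result matches the gold
    # answer; these survivors become the next round's fine-tuning data.
    accepted = []
    for p in problems:
        plan = generate_plan(p)
        try:
            if abs(execute_plan(plan, p.table) - p.gold_answer) < 1e-6:
                accepted.append((p, plan))
        except Exception:
            pass  # malformed plans are simply discarded
    return accepted

# Demo with a toy problem and a stub "model" that proposes one plan:
toy = Problem("What is revenue minus cost?", {"revenue": 10.0, "cost": 4.0}, 6.0)
kept = collect_verified_trajectories(lambda p: "revenue - cost", [toy])
print(len(kept))  # 1 -- the executed answer matched, so the plan is kept

# Each progressive self-improvement round would then regenerate plans with
# the current SLM, filter by final-answer match, and fine-tune on survivors:
#   for _ in range(rounds):
#       data = collect_verified_trajectories(model.generate_plan, train_set)
#       model = fine_tune_on(model, data)   # hypothetical fine-tuning step
```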
By sharing flexible, reusable tools across different datasets, Docugami’s small agentic model achieves exceptional performance and excellent scalability on a variety of shared tasks.
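As an illustration of what flexible, reusable tools can look like, the sketch below registers two dataset-agnostic tools in a simple registry; because they operate on generic tables and numbers rather than any one dataset’s format, the same tools could serve FinQA-, TAT-QA-, and TabMWP-style questions alike. The registry and tool names are hypothetical, not Docugami’s actual toolset.

```python
# Hypothetical tool registry: an agent plan invokes tools by name.
TOOLS = {}

def tool(fn):
    """Register a function so an agent plan can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def lookup(table: dict, cell: str) -> float:
    """Extract a numeric value from a flattened table."""
    return float(table[cell])

@tool
def percent_change(new: float, old: float) -> float:
    """Percentage change, a step that recurs across financial QA datasets."""
    return (new - old) / old * 100.0

# An agent plan is then just a sequence of named tool calls:
table = {"revenue 2023": "120", "revenue 2022": "100"}
new = TOOLS["lookup"](table, "revenue 2023")
old = TOOLS["lookup"](table, "revenue 2022")
print(TOOLS["percent_change"](new, old))  # 20.0
```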
We recently evaluated Docugami’s small agentic model using several of the most widely used datasets for benchmarking mathematical problem-solving. The FinQA and TAT-QA datasets represent complex real-world reasoning scenarios, combining tabular and text data and requiring reasoning over financial information in intricate contexts. The TabMWP dataset involves mathematical word problems over tabular data. All three datasets require multi-step reasoning, information extraction, data manipulation, tabular and contextual understanding, and numerical calculation, making them useful evaluators for AI intended for use with lengthy business documents and complex business scenarios.
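To see why these benchmarks are demanding, consider a small TabMWP-style example (invented here for exposition, not drawn from the dataset): answering even a simple shopping question requires extracting the right cells, manipulating them row by row, and then aggregating.

```python
# An illustrative tabular word problem and the multi-step chain it demands.
table = {
    "notebook": {"price": 1.50, "quantity": 4},
    "pencil":   {"price": 0.25, "quantity": 10},
    "backpack": {"price": 12.00, "quantity": 1},
}
# Question: "How much does it cost to buy everything on the list?"
# Step 1: extract price and quantity per row   (information extraction)
# Step 2: multiply within each row             (data manipulation)
# Step 3: sum across rows                      (numerical calculation)
total = sum(row["price"] * row["quantity"] for row in table.values())
print(total)  # 20.5
```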
For the FinQA dataset, the experimental results demonstrate that Docugami’s small agentic model (identified as Docugami MATATA, for “MAthematical Tool-Assisted reasoning for Tabular Applications”) outperforms all existing models, including large closed-source frameworks built on GPT-4 and ChatGPT and open-source models trained with human expert annotations.
Table 1. FinQA Dataset Accuracy for Proprietary and Open-Source Models. Leaderboard of accuracy scores for mathematical reasoning with the FinQA dataset for various proprietary and open-source approaches.
PE: Prompt Engineering; FT: Fine-Tuning.
Rank | Framework | Closed-source/Open-source | Model | Method | FinQA Accuracy
1 | Docugami MATATA | Open-source | Ministral-8B | FT | 77.59
2 | TAT-LLM | Open-source | Llama2-70B | FT | 76.81
3 | EEDP | Closed-source | GPT-4 | PE | 76.05
4 | TAT-LLM | Open-source | Llama2-13B | FT | 71.93
5 | Docugami MATATA | Open-source | Phi3-mini 3.8B | FT | 70.10
6 | PoT | Closed-source | PoT-SC-Codex | PE | 68.1
7 | TAT-LLM | Open-source | Llama2-7B | FT | 65.13
8 | EEDP | Closed-source | ChatGPT | PE | 61.88
9 | FinQANet | Open-source | RoBERTa | FT | 61.24
For the TAT-QA dataset, the experimental results demonstrate that Docugami’s small agentic models achieve outstanding results, outperforming virtually all other approaches, including many built on much larger and more expensive models.
Table 2. TAT-QA Dataset Exact Match Accuracy for Proprietary and Open-Source Models. Leaderboard of exact match accuracy scores for mathematical reasoning with the TAT-QA dataset for various proprietary and open-source approaches. Source: https://nextplusplus.github.io/TAT-QA/
PE: Prompt Engineering; FT: Fine-Tuning.
Rank | Framework | Closed-source/Open-source | Model | Method | TAT-QA Accuracy
1 | TAT-LLM | Open-source | Llama2-70B | FT | 81.4
2 | Docugami MATATA | Open-source | Ministral-8B | FT | 77.6
3 | TAT-LLM | Open-source | Llama2-13B | FT | 77.5
4 | Code Generation for Table-Text Question | na | na | na | 76.8
5 | TAT-LLM | Open-source | Llama2-7B | FT | 76.4
6 | AeNER | na | na | na | 75.0
7 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 74.2
8 | Code Generation for Table-Text Question | na | na | na | 73.7
9 | Encore | Open-source | BART-Large | FT | 71.8
10 | KFEX-N | Open-source | DeBERTa-V3-Large | FT | 71.0
Similarly, for the TabMWP dataset, the experimental results demonstrate that Docugami’s small agentic model outperforms all existing open-source models, even those that rely on much larger LLMs or on training data annotated by GPT-4. Our approach also outperforms all but one closed-source model, and nearly matches the performance of the much larger and much more expensive GPT-4 model, with 98.13 percent accuracy compared to 98.78 percent for GPT-4.
Table 3. TabMWP Dataset Accuracy for Proprietary and Open-Source Models. Leaderboard of accuracy scores for mathematical reasoning with the TabMWP dataset for various proprietary and open-source approaches.
Source: https://promptpg.github.io/leaderboard.html
PE: Prompt Engineering; FT: Fine-Tuning.
Rank | Framework | Closed-source/Open-source | Model | Method | TabMWP Accuracy
1 | Chameleon | Closed-source | GPT-4 | PE | 98.78
2 | Docugami MATATA | Open-source | Ministral-8B | FT | 98.13
3 | PoT GPT-4 | Closed-source | GPT-4 | PE | 96.93
4 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 96.66
5 | CREATOR | Closed-source | ChatGPT | PE | 94.7
6 | Chameleon | Closed-source | ChatGPT | PE | 93.28
7 | TaCo | Open-source | TAPEX-large | FT | 92.91
8 | PoT ChatGPT + Doc | Closed-source | PoT-SC-Codex | PE | 92.69
9 | CoT GPT-4 | Closed-source | GPT-4 | PE | 90.81
10 | CoS-Planning | Closed-source | ChatGPT | PE | 90.00
11 | PoT ChatGPT | Closed-source | ChatGPT | PE | 89.49
As a company focused on transforming businesses of all sizes and across all sectors by unlocking the data and insights contained in complex business documents, Docugami aims with this work to reduce reliance on proprietary Large Language Models and human prompt engineering, delivering comparable or better performance with lower costs, faster local results, and increased data privacy. This work will be particularly important in business settings where cost-effectiveness, operational efficiency and flexibility, and data privacy and security are key considerations.
We’re excited by the strong initial support for our work at BayLearn, and we see enormous potential in our novel approach to unlocking complex document data and advancing mathematical reasoning. Early results from several public industry benchmarks demonstrate that Docugami’s agentic approach, using tool-augmented Small Language Models with weak supervision, can meet or exceed the performance of existing reasoning methods that depend on costly LLMs like GPT-4 and ChatGPT and on extensive human prompt engineering. Our dedicated Science team will continue to accelerate progress in the important area of agentic models.
Note: You can read our detailed paper here: https://arxiv.org/abs/2411.18915. Ministral-8B was used under the "Mistral AI Non-Production License" (link), for research purposes only; Phi3-mini was used under the MIT License (link). Funding for this work was provided in part by Mitacs, the Canadian innovation organization.