Today, we are delighted to announce an initial integration of LlamaIndex with Docugami, via the Llama Hub.
The LlamaIndex framework provides a simple, flexible interface between a user’s information and Large Language Models (LLMs). Coupling LlamaIndex with Docugami’s unique ability to generate a Document XML Knowledge Graph Representation of long-form Business Documents opens exciting new opportunities for LlamaIndex developers to build powerful and accurate LLM applications that connect users to their own Business Documents, without being limited by document size or context window restrictions.
While a number of LLMs are in use today, general-purpose LLMs cannot deliver the accuracy needed for business, financial, legal, and scientific settings because they are trained on the public internet, which introduces a wide range of irrelevant and low-quality source material. By contrast, Docugami is trained exclusively for business scenarios, resulting in greater accuracy and reliability.
- To get started, follow the quick start guide here.
- Tag us @docugami on Twitter to share your results and experience.
- We welcome your technical questions; please post them on our new Docugami Discord.
We are excited to see what you will build with this integration!
How Docugami Enhances Document Understanding
Systems that aim to understand the content of documents, such as retrieval and question-answering systems, will benefit greatly from Docugami's semantic Document XML Knowledge Graph Representation. Our unique approach to document chunking allows for better understanding and processing of your documents:
- Intelligent Chunking: Docugami breaks down every document into a hierarchical semantic XML tree of chunks of varying sizes, from single words or numerical values to entire sections. These chunks follow the semantic contours of the document, providing a more meaningful representation than arbitrary length or simple whitespace-based chunking.
- Structured Representation: In addition, the XML tree indicates the structural contours of every document, using attributes that denote headings, paragraphs, lists, tables, and other common elements, and does so consistently across all supported document formats, such as scanned PDFs or DOCX files. It appropriately handles long-form document characteristics like page headers/footers or multi-column flows for clean text extraction.
- Semantic Annotations: Chunks are annotated with semantic tags that are coherent across the document set, facilitating consistent hierarchical queries across multiple documents, even if they are written and formatted differently. For example, in a set of lease agreements of completely different styles and formats, you can easily identify key provisions like the Landlord, Tenant, or Renewal Date, as well as more complex information such as the existence and/or wording of any sub-lease provision or whether a specific jurisdiction has an exception section within a Termination Clause.
- Additional Metadata: If a user has already been using Docugami, chunks are also annotated with additional metadata. This metadata can be used for high-accuracy Document QA without context window restrictions. See the detailed code walk-through in this notebook.
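To make the ideas in the list above concrete, here is a minimal sketch in Python that uses only the standard library. It does not call Docugami or LlamaIndex. The XML samples, the tag names (`Landlord`, `Tenant`, `RenewalDate`), the `structure` attribute, and the helper functions are all hypothetical, invented for illustration; real Docugami output follows its own schema. The sketch only shows why a semantically tagged XML tree lets the same query run across documents that are laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, invented for this example. The tag and attribute names
# are assumptions and do not reflect Docugami's actual output schema.
SAMPLE_LEASES = {
    "lease_a.xml": """<Lease>
  <Parties structure="p">This lease is between <Landlord>Acme Properties LLC</Landlord> and <Tenant>Jane Doe</Tenant>.</Parties>
  <TerminationClause structure="section">
    <RenewalDate>2025-01-31</RenewalDate>
  </TerminationClause>
</Lease>""",
    "lease_b.xml": """<Agreement>
  <Title structure="h1">Parties</Title>
  <Body structure="p">
    <Tenant>Bob Smith</Tenant> rents from <Landlord>Hillside Realty</Landlord>.
  </Body>
  <RenewalDate>2026-06-30</RenewalDate>
</Agreement>""",
}


def chunk_text(element):
    """Return the full text of a chunk, including nested chunks, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def find_by_tag(xml_text, tag):
    """Return the text of every chunk carrying the given semantic tag, at any depth."""
    root = ET.fromstring(xml_text)
    return [chunk_text(el) for el in root.iter(tag)]


def chunks_by_structure(xml_text, kind):
    """Return the text of every chunk whose (hypothetical) structure attribute equals kind."""
    root = ET.fromstring(xml_text)
    return [chunk_text(el) for el in root.iter() if el.get("structure") == kind]


def query_documents(docs, tag):
    """Run the same semantic-tag query over every document in the set."""
    return {name: find_by_tag(text, tag) for name, text in docs.items()}


if __name__ == "__main__":
    # The same tag query works on both leases despite their different layouts.
    print(query_documents(SAMPLE_LEASES, "Landlord"))
    # A larger chunk (a whole paragraph) still contains its smaller tagged chunks.
    print(chunks_by_structure(SAMPLE_LEASES["lease_a.xml"], "p"))
```

Because the two sample leases share tag names even though their nesting differs, one query returns the landlord from both. A paragraph-level chunk also keeps its smaller tagged chunks inside it, which mirrors the hierarchical chunking described above.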
With this Docugami integration, users of LlamaIndex-powered apps can now unlock the full potential of their own Business Documents. We can't wait to see the innovative solutions you create.
To celebrate this initial integration with LlamaIndex, we are expanding our standard 14-day free trial for LlamaIndex developers: start a free Docugami trial and file a support ticket mentioning LlamaIndex to receive an extended 30-day trial with an upgraded 2,000-page limit.
Happy building!