Mateusz Jacek Płonka (Silesian University of Technology) et al. have posted “Evaluating the Effectiveness of Document Splitters for Large Language Models in Legal Contexts” on SSRN. Here is the abstract:
The study explores the development and application of an advanced artificial intelligence-based system aimed at improving the efficiency and accuracy of legal document processing. Due to the high volume, specialised vocabulary, and complexity of legal texts, traditional document management techniques often prove inadequate and error-prone, creating significant challenges for legal practitioners. The proposed method leverages natural language processing and machine learning algorithms to automate key processes such as summarisation, analysis, search, and classification. By utilising vector embedding techniques, the system enables precise information retrieval from large legal document collections, while advanced summarisation methods generate concise and relevant summaries of extensive texts. The study employs a Retrieval-Augmented Generation approach, combining large language models (LLMs) with external knowledge bases to enhance the accuracy and contextual relevance of generated responses, addressing common issues such as hallucinations and outdated information in traditional LLMs. The research provides an in-depth analysis of the application of various text-splitting algorithms in the context of legal document databases. The findings highlight the characteristics of appropriate algorithms and offer recommendations on the conditions under which specific mechanisms should be employed.
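For readers unfamiliar with the term, a "text splitter" cuts a long document into chunks small enough to embed and retrieve individually. The abstract does not name the algorithms the authors compared, so the following is only a minimal sketch of two common strategies, not the paper's method: a fixed-size window with overlap, and a paragraph-aware packer. The function names `split_fixed` and `split_by_paragraph`, the default sizes, and the sample clauses are all hypothetical.

```python
def split_fixed(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Slice text into windows of chunk_size characters; each window
    repeats the last `overlap` characters of the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def split_by_paragraph(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole paragraphs (separated by blank lines) into chunks of at
    most max_chars characters. A single paragraph longer than max_chars
    is kept whole rather than cut mid-sentence."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}" if current else p
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the paper.
    sample = "Clause 1. Short.\n\nClause 2. Short.\n\nClause 3 is a much longer clause."
    print(split_fixed("abcdefghij", chunk_size=4, overlap=1))
    print(split_by_paragraph(sample, max_chars=40))
```

The fixed-size splitter is simple and predictable but can cut a clause in half; the paragraph-aware splitter respects the document's own boundaries at the cost of uneven chunk sizes. Trade-offs of this kind are presumably what the paper's recommendations address.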
