The value of structured, machine-readable data in training generative AI
In the era of artificial intelligence (AI), there is a growing interest in the use of AI models like ChatGPT to enhance financial reporting processes. However, it is crucial to recognise the importance of structured, machine-readable data in training AI models effectively. A new research paper uses XBRL-tagged textual disclosures to train a large language model to classify accounting topics, with additional benefits gained from the use of XBRL-tagged data over unstructured data.
The researchers, Jenna Burke (University of Colorado Denver), Rani Hoitash (Bentley University), Udi Hoitash, and Summer Xiao (Northeastern University), leverage the U.S. requirement to tag each financial statement note with a standardised label that maps it to a specific accounting concept in the FASB Accounting Standards. XBRL tagging gave the authors a vast set of structured data (more than 350,000 XBRL tags) with which to train the large language model more accurately, and without having to rely on non-expert humans to label the training data.
The paper focuses on the most frequent taxonomy tags for financial statement notes, which capture 92.5 percent of the tags used in annual 10-K filings. For example, the most prevalent TextBlock tag is “IncomeTaxDisclosureTextBlock”, which is used by almost all companies.
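To make the idea of "most frequent tags" and their coverage concrete, here is a minimal sketch in Python. It is not the authors' code: the function name `top_tag_coverage` and the toy list of tags are hypothetical, chosen only to show how one might count TextBlock tags and measure what share of all tag uses the top few account for.

```python
from collections import Counter

def top_tag_coverage(tags, k):
    """Return the k most frequent tags and the share of all tag uses they cover."""
    if not tags:
        return [], 0.0
    counts = Counter(tags)
    top = counts.most_common(k)  # sorted by frequency, highest first
    covered = sum(n for _, n in top)
    return [tag for tag, _ in top], covered / len(tags)

# Hypothetical toy data: one entry per tagged note, not real filing counts.
sample_tags = [
    "IncomeTaxDisclosureTextBlock",
    "IncomeTaxDisclosureTextBlock",
    "IncomeTaxDisclosureTextBlock",
    "DebtDisclosureTextBlock",
    "DebtDisclosureTextBlock",
    "SubsequentEventsTextBlock",
]

top_tags, share = top_tag_coverage(sample_tags, k=2)
print(top_tags, round(share, 3))
```

On real data the input would be the TextBlock tag attached to each note across many filings, and `k` would be raised until the covered share reached the desired level.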
The authors chose a large language model trained on a large amount of finance-specific data. Using the XBRL-tagged data, the authors “taught” the model to classify text into accounting topics. After teaching the model, the authors examined its performance on previously unseen, out-of-sample data. The model accurately classified text into topics 95 percent of the time. Following this, the researchers put the model to work on a problem area: classifying untagged paragraphs, for example in management discussion and analysis, with some success.
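The workflow described above (train on tagged text, test on held-out text, then apply to untagged text) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it swaps the authors' large language model for a tiny bag-of-words Naive Bayes classifier so that it runs with the standard library alone, and every text snippet and function name below is invented for the example.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(examples):
    """Count words per tag. `examples` is a list of (text, tag) pairs,
    where the tag plays the role of the XBRL TextBlock label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, tag in examples:
        doc_counts[tag] += 1
        for word in tokenize(text):
            word_counts[tag][word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the most probable tag under Naive Bayes with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in doc_counts:
        score = math.log(doc_counts[tag] / total_docs)
        denom = sum(word_counts[tag].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[tag][word] + 1) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

def accuracy(model, examples):
    hits = sum(classify(model, text) == tag for text, tag in examples)
    return hits / len(examples)

# Hypothetical tagged notes used for training (invented text).
tagged = [
    ("deferred tax assets and effective tax rate", "IncomeTaxDisclosureTextBlock"),
    ("provision for income taxes and tax rate", "IncomeTaxDisclosureTextBlock"),
    ("senior notes and revolving credit facility", "DebtDisclosureTextBlock"),
    ("term loan interest and credit facility covenants", "DebtDisclosureTextBlock"),
]
# Hypothetical held-out notes the model never saw during training.
held_out = [
    ("the effective tax rate decreased", "IncomeTaxDisclosureTextBlock"),
    ("borrowings under the credit facility", "DebtDisclosureTextBlock"),
]

model = train(tagged)
print("out-of-sample accuracy:", accuracy(model, held_out))

# An untagged paragraph, standing in for text from management discussion and analysis.
untagged = "interest expense on the term loan increased"
print("predicted topic:", classify(model, untagged))
```

The point of the sketch is the shape of the process, not the model: the XBRL tags supply the labels for free, so no human coder is needed to build the training set.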
The study demonstrates how tagged data can be instrumental in training large language models, removing the need for manual coding by humans and the accuracy risks that come with it. Combining structured, machine-readable data with AI can mitigate the risks associated with AI, enabling accurate training, consistency, and interpretability.
This kind of research suggests that next-generation Augmented Intelligence should be able to increase the accuracy of tagging. However, management must have the final say over what is published for investors. High-quality, well-thought-out official XBRL taxonomies will continue to grow in value and relevance.
Read the paper here.