Technological results. Developing an ETL Pipeline and Graph Database for Historical Data with LLMs
At Historica Tech Lab, we recently developed an ETL (Extract, Transform, Load) pipeline integrated with a graph database to transform and analyze historical data. The goal of this research was to prototype a "historical ontology," a knowledge repository of human history. Using large language models (LLMs), the pipeline converts unstructured historical texts into structured data that can be stored, queried, and analyzed, which we see as a meaningful step forward for digital humanities.
ETL Pipeline Innovation
The pipeline consists of three stages:
1) Data Extraction
2) Data Transformation
3) Data Loading
The pipeline extracts data from historical texts, transforms it into a structured form, and loads it into a graph database. The ontology is stored in Amazon Neptune, which we chose for its flexibility, scalability, and performance. These properties are crucial for modeling complex relationships in historical data and for executing intricate queries.
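To make the three stages concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not the pipeline described in this article: the function names, the `GraphStore` class, and the `subject | relation | object` input format are all hypothetical. The extraction step is a hand-written parser standing in for the LLM call, and the store is an in-memory stand-in for Amazon Neptune, so the example runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class GraphStore:
    """In-memory stand-in for the graph database (hypothetical, not the Neptune API)."""
    nodes: dict = field(default_factory=dict)   # node id -> display label
    edges: list = field(default_factory=list)   # (source id, relation, target id)


def extract(text: str) -> list[dict]:
    """Stage 1: pull raw facts out of text.

    Placeholder for the LLM step: each 'subject | relation | object' line
    is split by hand, and lines that do not fit the pattern are skipped.
    """
    facts = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            facts.append({"subject": parts[0], "relation": parts[1], "object": parts[2]})
    return facts


def slug(name: str) -> str:
    """Build a stable node id from a display name."""
    return "_".join(name.lower().split())


def transform(facts: list[dict]) -> tuple[dict, list]:
    """Stage 2: normalize raw facts into graph nodes and edges."""
    nodes, edges = {}, []
    for fact in facts:
        src, dst = slug(fact["subject"]), slug(fact["object"])
        nodes[src] = fact["subject"]
        nodes[dst] = fact["object"]
        edges.append((src, fact["relation"], dst))
    return nodes, edges


def load(store: GraphStore, nodes: dict, edges: list) -> None:
    """Stage 3: write nodes and edges into the store."""
    store.nodes.update(nodes)
    store.edges.extend(edges)


def run_pipeline(text: str) -> GraphStore:
    """Run extract, transform, and load in order and return the filled store."""
    store = GraphStore()
    nodes, edges = transform(extract(text))
    load(store, nodes, edges)
    return store


sample = "Julius Caesar | participated_in | Battle of Alesia\nthis line has no separators"
store = run_pipeline(sample)
```

Keeping each stage as a separate function means the placeholder parser could later be swapped for a real LLM call, and the in-memory store for a real database client, without touching the other two stages.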
Historical Ontology Development
The "historical ontology" categorizes data into Units of Topography (UT), Units of Stratigraphy (US), and Actors (AC), detailing events, material evidence, and participants.
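One way to picture these three categories in code is as a small typed data model. The Python sketch below is an assumption made for illustration: only the category names and their abbreviations come from the text above, while the `Node` and `Relation` classes, their fields, the sample identifiers, and the `dangling_relations` check are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The three ontology categories named in the article."""
    UT = "Unit of Topography"
    US = "Unit of Stratigraphy"
    AC = "Actor"


@dataclass(frozen=True)
class Node:
    """A single ontology entry (field names are illustrative assumptions)."""
    node_id: str
    category: Category
    label: str


@dataclass(frozen=True)
class Relation:
    """A directed link between two nodes, identified by their ids."""
    source: str
    kind: str
    target: str


def dangling_relations(nodes: list[Node], relations: list[Relation]) -> list[Relation]:
    """Return the relations that point at a node id that does not exist."""
    known = {node.node_id for node in nodes}
    return [r for r in relations if r.source not in known or r.target not in known]


# Placeholder entries, one per category.
nodes = [
    Node("ut-1", Category.UT, "sample topography unit"),
    Node("us-1", Category.US, "sample stratigraphy unit"),
    Node("ac-1", Category.AC, "sample actor"),
]
relations = [
    Relation("ac-1", "linked_to", "ut-1"),
    Relation("ac-1", "linked_to", "us-9"),  # "us-9" is not defined above
]
problems = dangling_relations(nodes, relations)
```

A check like `dangling_relations` is useful before loading, because output extracted from unstructured text can easily reference an entity that was never created.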
Through integration with OpenAI models, the system lets users query the database in natural language, making it accessible to researchers who have no specialized knowledge of query languages.
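The natural-language interface can be sketched as a thin translation layer: a prompt describes the graph schema, a model turns the question into a query, and a guard rejects anything that is not read-only. The Python example below is a hedged sketch, not the implementation used here. It assumes openCypher as the target language (one of the query languages Amazon Neptune supports; the article does not say which one was used), and the model call is passed in as a plain function, `complete`, so no specific OpenAI API signature is assumed. All names are hypothetical.

```python
import re
from typing import Callable

# Schema description given to the model (labels taken from the ontology categories).
SCHEMA_HINT = "Node labels: UT (Unit of Topography), US (Unit of Stratigraphy), AC (Actor)."

# Keywords that would modify the graph; a read-only interface should never emit them.
WRITE_KEYWORDS = {"CREATE", "MERGE", "DELETE", "DETACH", "SET", "REMOVE", "DROP"}


def build_prompt(question: str) -> str:
    """Combine the schema hint and the user's question into one prompt."""
    return (
        "Translate the question into a single read-only openCypher query.\n"
        f"{SCHEMA_HINT}\n"
        f"Question: {question}\n"
        "Query:"
    )


def question_to_query(question: str, complete: Callable[[str], str]) -> str:
    """Ask the model for a query and accept it only if it looks read-only.

    `complete` is any function that maps a prompt to the model's text reply.
    The keyword check is deliberately conservative: it also rejects a query
    that merely mentions a write keyword inside a string literal.
    """
    query = complete(build_prompt(question)).strip()
    words = set(re.findall(r"[A-Z]+", query.upper()))
    if not query.upper().startswith("MATCH") or words & WRITE_KEYWORDS:
        raise ValueError(f"refusing non-read-only query: {query!r}")
    return query


def fake_model(prompt: str) -> str:
    """Stand-in for the LLM so the example runs without network access."""
    return "MATCH (a:AC) RETURN a.label LIMIT 10"


query = question_to_query("Which actors are in the database?", fake_model)
```

Validating the generated query before it reaches the database matters because model output is not guaranteed to follow the prompt's instructions.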
Future Prospects and Challenges
The ETL pipeline and historical ontology prototype have demonstrated high efficiency and accuracy. We expect this approach to change how historical data is processed and analyzed, enabling new research and discoveries.
Challenges remain in extracting data accurately from poorly structured texts and in optimizing the ETL pipeline for large data volumes. Security is also crucial for sensitive historical data. We plan to expand our historical ontology to more regions and periods and to improve our LLM-based solutions for more accurate data extraction. We expect these advancements to have broad applications in digital humanities and beyond.