Technological results. Developing an ETL Pipeline and Graph Database for Historical Data with LLMs

July 17, 2024


At Historica Tech Lab, we recently developed an ETL (Extract, Transform, Load) pipeline integrated with a graph database to transform and analyze historical data. The goal of this research was to prototype a "historical ontology," a knowledge repository of human history. The approach uses large language models (LLMs) to convert unstructured historical texts into structured data that can be stored and queried, a step we see as a significant advance for the digital humanities.

ETL Pipeline Innovation

The pipeline consists of three stages:

1) Data Extraction

2) Data Transformation

3) Data Loading

The pipeline extracts data from historical texts, transforms it into a structured form, and loads it into a database. We built the ontology on Amazon Neptune for efficient storage of and access to this data. We chose Neptune for its flexibility, scalability, and performance, which are crucial for modeling complex relationships in historical data and for running intricate queries.
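The three stages can be sketched as follows. This is a minimal illustration under stated assumptions, not our production code: the `Record` shape, the JSON output format, and `fake_llm` (a stand-in for a real model call) are all hypothetical, and a plain Python list takes the place of the graph database.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Record:
    """One structured fact pulled from a text (hypothetical shape)."""
    kind: str    # e.g. "event" or "actor"
    name: str
    source: str  # identifier of the text the record came from


def extract(texts: dict[str, str], llm: Callable[[str], str]) -> list[tuple[str, str]]:
    """Stage 1: ask the model for a JSON list of facts for each document."""
    return [(doc_id, llm(body)) for doc_id, body in texts.items()]


def transform(raw: Iterable[tuple[str, str]]) -> list[Record]:
    """Stage 2: parse model output, drop malformed items, deduplicate."""
    seen: set[tuple[str, str]] = set()
    records: list[Record] = []
    for doc_id, payload in raw:
        try:
            items = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip output that is not valid JSON
        if not isinstance(items, list):
            continue
        for item in items:
            if not isinstance(item, dict):
                continue
            kind = str(item.get("kind", "")).strip().lower()
            name = str(item.get("name", "")).strip()
            key = (kind, name.lower())
            if not kind or not name or key in seen:
                continue
            seen.add(key)
            records.append(Record(kind, name, doc_id))
    return records


def load(records: list[Record], sink: list[Record]) -> int:
    """Stage 3: write records to a sink (a list standing in for the database)."""
    sink.extend(records)
    return len(records)


def fake_llm(text: str) -> str:
    """Placeholder for a real model call; returns canned output."""
    if "Hastings" in text:
        return json.dumps([
            {"kind": "event", "name": "Battle of Hastings"},
            {"kind": "actor", "name": "William of Normandy"},
        ])
    return "not json"  # simulates an unusable model response


texts = {
    "doc1": "In 1066 the Battle of Hastings was fought.",
    "doc2": "illegible fragment",
}
store: list[Record] = []
loaded = load(transform(extract(texts, fake_llm)), store)
```

Keeping the model call behind a plain callable means the transformation and loading steps can be tested without network access, and malformed model output is dropped in the transform stage rather than reaching the database.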

Historical Ontology Development

The created "historical ontology" categorizes data into Units of Topography (UT), Units of Stratigraphy (US), and Actors (AC), detailing events, material evidence, and participants.
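One way to picture the three categories is as node labels in a property graph. The sketch below is an assumption-heavy illustration: it reads the sentence above as mapping UT to events, US to material evidence, and AC to participants, and the relationship types (`PARTICIPATED_IN`, `EVIDENCE_FOR`) and the `id`/`name` properties are hypothetical. It only builds parameterized openCypher statements, one of the query languages Neptune supports, and does not connect to a database.

```python
from dataclasses import dataclass
from enum import Enum


class Category(str, Enum):
    """The three ontology categories; the glosses are our assumed reading."""
    UT = "UT"  # Unit of Topography (assumed: events)
    US = "US"  # Unit of Stratigraphy (assumed: material evidence)
    AC = "AC"  # Actor (assumed: participants)


@dataclass(frozen=True)
class Node:
    id: str
    category: Category
    name: str


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    rel: str  # hypothetical relationship type


# Labels and relationship types cannot be passed as query parameters,
# so they are checked against a fixed set before being interpolated.
ALLOWED_RELS = {"PARTICIPATED_IN", "EVIDENCE_FOR"}


def node_statement(node: Node) -> tuple[str, dict[str, str]]:
    """Build an idempotent upsert for one node."""
    query = f"MERGE (n:{node.category.value} {{id: $id}}) SET n.name = $name"
    return query, {"id": node.id, "name": node.name}


def edge_statement(edge: Edge) -> tuple[str, dict[str, str]]:
    """Build an idempotent upsert for one relationship."""
    if edge.rel not in ALLOWED_RELS:
        raise ValueError(f"unknown relationship type: {edge.rel}")
    query = (
        "MATCH (a {id: $src}), (b {id: $dst}) "
        f"MERGE (a)-[:{edge.rel}]->(b)"
    )
    return query, {"src": edge.src, "dst": edge.dst}


battle = Node("ut-1", Category.UT, "Battle of Hastings")
william = Node("ac-1", Category.AC, "William of Normandy")
statements = [
    node_statement(battle),
    node_statement(william),
    edge_statement(Edge(william.id, battle.id, "PARTICIPATED_IN")),
]
```

Using `MERGE` rather than `CREATE` keeps repeated pipeline runs from duplicating nodes, which matters when the same actor or event appears in many source texts.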

Integration with OpenAI models lets users query the database in natural language, making it accessible to researchers who do not know a specialized query language.
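A natural language interface of this kind can be sketched as a translate-then-run loop. The example below is illustrative only: `translate` stands in for a call to a language model and `run_query` for a database client, both hypothetical, and the schema hint and the read-only check are assumptions rather than a description of our implementation.

```python
import re
from typing import Any, Callable

# Assumed schema summary given to the model alongside the question.
SCHEMA_HINT = "Node labels: UT, US, AC. Every node has properties id and name."

# Keywords that indicate a query could modify the graph.
WRITE_KEYWORDS = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def build_prompt(question: str) -> str:
    """Combine the instruction, schema hint, and user question."""
    return (
        "Translate the question into a single read-only openCypher query.\n"
        f"{SCHEMA_HINT}\n"
        f"Question: {question}\n"
        "Query:"
    )


def is_read_only(query: str) -> bool:
    """Conservative check: reject any query containing a write keyword."""
    return WRITE_KEYWORDS.search(query) is None


def ask(
    question: str,
    translate: Callable[[str], str],
    run_query: Callable[[str], Any],
) -> Any:
    """Translate a question to a query, validate it, then execute it."""
    query = translate(build_prompt(question)).strip()
    if not is_read_only(query):
        raise ValueError("refusing to run a query that could modify the graph")
    return run_query(query)


# Stand-ins for the model and the database, for demonstration only.
def fake_translate(prompt: str) -> str:
    return "MATCH (a:AC) RETURN a.name LIMIT 5"


def fake_run(query: str) -> list[dict[str, str]]:
    return [{"a.name": "William of Normandy"}]


rows = ask("Which actors are in the database?", fake_translate, fake_run)
```

The keyword check is deliberately crude: it also rejects harmless queries whose string literals happen to contain a word such as "set". Erring in that direction is the safer default when the query text comes from a model rather than from a person.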

Future Prospects and Challenges

In our prototype, the ETL pipeline and historical ontology have demonstrated high efficiency and accuracy. We expect this approach to change how historical data is processed and analyzed, enabling new research and discoveries.

Challenges remain in extracting data accurately from poorly structured texts and in optimizing the ETL pipeline for large data volumes. Security is also crucial for sensitive historical data. We plan to expand our historical ontology to more regions and periods and to improve our LLM-based solutions for more accurate data extraction. We expect these advancements to have broad applications in digital humanities and beyond.

