Technology

Historica is working to employ AI to process and manage vast amounts of data from various scientific fields. We are sharing our technological diary about our experience with using AI to create a digital map of human history.

Special thanks to Anadea and Vyacheslav Balakhontsev

Part 1. A “Toy” Dataset for the Initial Learning

This article covers the first stage of our research into generating historical maps using neural networks. Our initial work focuses on a simple "toy" dataset for preliminary learning. Here, we examine how the StableDiffusion model responds to textual prompts and what results can be expected at this early stage.

Project Goals and our ideas

The map generation project was built around the idea that accurate historical maps can be generated by neural networks from textual information.

Our research aimed to test the hypothesis that high-quality historical maps can be generated with diffusion models from prompts that describe the requested region and historical period. Our task was to check whether StableDiffusion specifically can be used for this, and to understand its possibilities and limitations in how it interprets prompts.

We decided to start with the following ideas in mind:

  • Use StableDiffusion as our base network.
  • Explore how to generate maps from different prompts, starting from the easiest and moving to the hardest.
  • Conduct experiments with one prompt at a time.
  • Give each experiment its own training data.
  • During each experiment, additionally train the network to generate maps from textual prompts like those in the training data.
  • After the first phase of experiments is done, collect a large corpus of historical texts and maps, and train StableDiffusion on various maps and real historical texts.

Data used

To speed up our research, we used a “toy” dataset of historical maps for experiment purposes. It was based on this video and consisted of yearly maps of Europe in a single style. This dataset was not historically accurate, yet it let us understand the challenges of real-world data.

The maps we used for experiments were limited to the period from 1000 A.D. to 1800 A.D.

Original map of Europe in the year 1400. Prompts differ from task to task.

Each image in the dataset had a corresponding prompt that could mention a region, a year and a list of countries.

We used StableDiffusion (mostly v2.1) and trained the U-Net part of the model on <image, prompt> pairs, then evaluated it with prompts inside and outside the given time period. A typical dataset for such an experiment consisted of 700 to 8000 image-text pairs, which was enough for experimental purposes.
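As a rough illustration of what such <image, prompt> pairs could look like, here is a minimal sketch. The file naming scheme (`maps/europe_<year>.png`), the `Sample` class, both function names and the way countries are appended to the prompt are assumptions made for this example; only the base prompt wording ("A map of Europe in a year 1400") comes from the experiments described here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    image_path: str  # hypothetical location of the yearly map image
    prompt: str      # text the model is conditioned on


def build_prompt(year: int, region: str = "Europe",
                 countries: Optional[List[str]] = None) -> str:
    """Compose a prompt mentioning a region, a year and, optionally, countries."""
    prompt = f"A map of {region} in a year {year}"
    if countries:
        # The "with A, B" suffix is an assumed format, not the original one.
        prompt += " with " + ", ".join(countries)
    return prompt


def build_dataset(first_year: int, last_year: int) -> List[Sample]:
    """Create one <image, prompt> pair per year, inclusive on both ends."""
    return [
        Sample(image_path=f"maps/europe_{year}.png", prompt=build_prompt(year))
        for year in range(first_year, last_year + 1)
    ]


dataset = build_dataset(1000, 1800)
print(len(dataset), dataset[400].prompt)
```

Keeping prompt construction in one small function makes it easy to switch between the prompt formats of the individual experiments without touching the rest of the data pipeline.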

Experiments and results

We ran five major experiments, each time feeding the model more complex prompts:

1. Generate a map given a prompt with a year and an image of a map.

A map of Europe in a year 1400

The goal was to check whether generated maps would change based on the year in the prompt, and to see how well SD captures the map style.

We got good results:

- Borders on the generated maps are highly accurate

- There were almost no artifacts, and generation quality is good

- We even got readable country names, which was not expected

Generated examples:

Prompts: “A map of Europe in a year 1400” | “A map of Europe in a year 1600”

From these results we can conclude that the model can change countries and their borders according to the year given in the prompt.

2. Generate a part of a map given a prompt with a year and a list of countries.

A part of a map of Europe in a year 1561

The goal was to check if the model can understand connections between country and region, and generate smaller maps where countries from the prompt would be displayed.

Results for this experiment were great: the model drew accurate crops of different regions. By scaling the dataset with different crops we could potentially train a model to generate very small parts of the map, or to control the scale of the map through the prompt.

a map of a part of Europe in a year 1800 and a map of a part of Europe in a year 1000

3. Generate a map given a prompt with a year and a message like “country A conquered country B”.

a map of  Europe in a year 1022

In our next experiment we wanted to check whether parts of the generated map can be manipulated through the prompt. The results were mixed: the final “accuracy” of the model was ~40% (based on the number of correct re-drawings of country B in the colors of country A).

Although the results were poor, the experiment was still useful:

  • We discovered LoRA, a method for quickly learning new “concepts”, such as the borders of specific countries, in literally seconds.
  • We learned that not all prompts suit diffusion models, partly because most prompts in the datasets used for pretraining only describe an image and do not give any instructions on how to change it. We concluded that manipulating a generated map is a different task.
  • We found a special version of SD trained specifically to edit images: InstructPix2Pix. Training InstructPix2Pix to edit maps became our next experiment.

Three maps of Europe in 1050 (Hungary, Poland and France conquered Holy Roman Empire)

4. Edit a map given a prompt with a year and a message like “country A conquered country B”.

Original image and edited image - "HRE conquered France"

In this experiment we wanted to check whether a map can be edited with a prompt that describes certain changes (e.g., the year is the same, but a certain country conquered another country). We trained InstructPix2Pix to edit the original map according to a given edit instruction.

The results were accurate, and the model did exactly what the prompt asked:

Five images of maps in 1009 (Original image, France conquered Hungary, Hungary annexed France, Rus conquered Poland, Poland occupied Holy Roman Empire)

5. Edit and generate maps based on historical texts of arbitrary length

A fundamental limitation of the StableDiffusion model is that the maximum input length of its text encoder (CLIP) is 77 tokens. To overcome this limitation, we split the input text into chunks of 77 tokens, encoded each chunk into its own embedding, and fed the model the average of these embeddings. To verify that averaged embeddings do not harm the model’s generation capabilities, we trained both a vanilla SD model that generates maps from long inputs and an InstructPix2Pix model that edits maps based on longer texts.
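The chunk-and-average idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code used in the experiments: `toy_encode` is a made-up stand-in for the CLIP text encoder, embeddings are plain lists of floats, and details such as padding the last chunk or reserving positions for start/end tokens are left out.

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 77  # context length of the CLIP text encoder


def split_into_chunks(token_ids: Sequence[int],
                      size: int = MAX_TOKENS) -> List[List[int]]:
    """Cut a token sequence into consecutive chunks of at most `size` tokens."""
    return [list(token_ids[i:i + size]) for i in range(0, len(token_ids), size)]


def average_embedding(token_ids: Sequence[int],
                      encode: Callable[[List[int]], List[float]],
                      size: int = MAX_TOKENS) -> List[float]:
    """Encode every chunk separately and return the element-wise mean."""
    chunks = split_into_chunks(token_ids, size)
    if not chunks:
        raise ValueError("cannot embed an empty token sequence")
    embeddings = [encode(chunk) for chunk in chunks]
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]


def toy_encode(chunk: List[int]) -> List[float]:
    """Hypothetical encoder: a 2-dimensional 'embedding' of a chunk."""
    return [float(len(chunk)), float(sum(chunk))]


token_ids = list(range(200))  # a prompt that is longer than 77 tokens
chunks = split_into_chunks(token_ids)
avg = average_embedding(token_ids, toy_encode)
print([len(c) for c in chunks], avg)
```

Because the averaged embedding has the same shape as a single chunk embedding, the rest of the model does not need to change. The cost is that the order of the chunks is no longer visible to the model.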

For longer prompts, we decided to query DBPedia for information on historical events:

- For the 15th century, we got descriptions of 286 events; 213 of them were longer than 77 tokens.

- For the InstructPix2Pix model we used the scheme from the previous experiment. Having a map and an event for a year X, we can:

  • Use the map for year X as the edited image
  • Use the event description as the prompt
  • Use the map for year X-N (N close to 10) as the original map
  • Generate samples with different N, rewriting the prompts with an LLM

- This gave ~2200 pairs, enough for basic training.
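The pairing scheme above can be sketched as follows. The `EditSample` class, the function name, the file paths and the concrete offsets (8, 10, 12) are assumptions for illustration; the LLM-based rewriting of prompts is omitted.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class EditSample:
    original_map: str  # map for year X - N (input image)
    instruction: str   # event description used as the edit prompt
    edited_map: str    # map for year X (target image)


def build_edit_samples(maps_by_year: Dict[int, str],
                       events_by_year: Dict[int, List[str]],
                       offsets: Iterable[int] = (8, 10, 12)) -> List[EditSample]:
    """Pair each event in year X with maps from years X - N and X."""
    offsets = tuple(offsets)
    samples: List[EditSample] = []
    for year, descriptions in sorted(events_by_year.items()):
        if year not in maps_by_year:
            continue  # no target map for this event
        for n in offsets:
            source_year = year - n
            if source_year not in maps_by_year:
                continue  # no earlier map to start the edit from
            for text in descriptions:
                samples.append(
                    EditSample(maps_by_year[source_year], text, maps_by_year[year])
                )
    return samples


# Hypothetical file layout: one map image per year of the 15th century.
maps_by_year = {y: f"maps/europe_{y}.png" for y in range(1400, 1500)}
events_by_year = {1450: ["The Battle of Formigny, fought on 15 April 1450"]}

samples = build_edit_samples(maps_by_year, events_by_year)
for s in samples:
    print(s.original_map, "->", s.edited_map)
```

Using several values of N for the same event multiplies the number of training samples without requiring more event descriptions.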

Results:

Image of three maps (original map of 1442 and The Battle of Formigny, fought on 15 April 1450)

Results demonstrated that the model understands long prompts, uses texts from various parts of the prompt and can work with very long texts.

Conclusions from the experiments

The experiments above were a necessary foundation for understanding model behavior when training on real maps and texts.

During experiments, we discovered a few important things that allow us to speed up future development. Here are just a few of them:

  1. Found an effective way to increase generated image size and quality by up to 16x using super-resolution. This potentially allows us to generate images of up to 12288x12288. (Generated examples in the doc are 3072x3072 and are already of decent quality.)
  2. Started using LPIPS as an evaluation metric. It allows us to measure how similar the generated image is to the original.
  3. Sped up the training process at least 3 times (using memory-efficient optimizers, experimental implementations of certain network layers, and more aggressive training strategies that let us reduce the number of training steps 10 times).
  4. Experimented with training models from different checkpoints at different resolutions, and concluded that the highest image quality is achieved with SD-2.1 at 768x768, with SD-2.1 being the best checkpoint available so far overall (SD-1.5 may be better at generating text, but is generally worse and cannot work at 768x768 without proper fine-tuning on high-res images). Most of our experiments used 512x512 images due to time and resource constraints.

Part 2. Experiments on Generating Historical Maps Using the StableDiffusion Model on Real Data

This article covers our experiments with applying the StableDiffusion model to real-world maps. We explore the challenges and nuances of training the model on a diverse dataset of maps and historical texts.

Training StableDiffusion model on real maps

The second part of our work was dedicated to building a foundation model for various maps and texts. Our plan was to fine-tune Stable Diffusion on a large dataset of maps paired with the historical texts relevant to them.

As for the data, we decided to proceed with the WiT dataset, as it already includes both texts and maps of high quality and is a good basis for building a foundation model. WiT consists of images and related text fields; for each map we used a combination of page title, abstract and image caption as its accompanying text. To train our model only on relevant information we built two supplementary models, one for filtering images and another for filtering texts. We worked with the `en`-only subset of WiT (5.4 million entries).

We selected ~80000 maps with an image classification model (‘map’, ‘no map’) and ~4 million relevant texts with a text classification model (‘relevant’, ‘irrelevant’) based on a set of criteria. We then combined the relevant maps and descriptions and got a dataset of 100k map-text pairs.
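The two-classifier filtering step might look roughly like the sketch below. It assumes a pair is kept only when both classifiers accept it, which is one plausible reading of the description. The record keys (`image_url`, `page_title`, `abstract`, `caption`) are hypothetical and not the real WiT schema, and the two `toy_*` functions are keyword checks standing in for the trained models.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Entry = Dict[str, str]  # hypothetical record layout, not the real WiT schema


def build_text(entry: Entry) -> str:
    """Join page title, abstract and image caption into one description."""
    parts = (entry.get("page_title", ""),
             entry.get("abstract", ""),
             entry.get("caption", ""))
    return " ".join(p.strip() for p in parts if p and p.strip())


def filter_pairs(entries: Iterable[Entry],
                 is_map: Callable[[str], bool],
                 is_relevant: Callable[[str], bool]) -> List[Tuple[str, str]]:
    """Keep (image, text) pairs accepted by both classifiers."""
    pairs = []
    for entry in entries:
        text = build_text(entry)
        if is_map(entry["image_url"]) and is_relevant(text):
            pairs.append((entry["image_url"], text))
    return pairs


def toy_is_map(image_url: str) -> bool:
    """Stand-in for the image classifier ('map' vs 'no map')."""
    return "map" in image_url.lower()


def toy_is_relevant(text: str) -> bool:
    """Stand-in for the text classifier ('relevant' vs 'irrelevant')."""
    return any(word in text.lower() for word in ("kingdom", "empire", "battle"))


entries = [
    {"image_url": "img/crimea_map_1475.png", "page_title": "Crimean Khanate",
     "abstract": "The khanate was a protectorate of the Ottoman Empire.",
     "caption": "Map of the khanate"},
    {"image_url": "img/portrait.png", "page_title": "Alexander III",
     "abstract": "King of Scotland.", "caption": "Portrait"},
    {"image_url": "img/metro_map.png", "page_title": "Metro",
     "abstract": "A rapid transit system.", "caption": "Network map"},
]

pairs = filter_pairs(entries, toy_is_map, toy_is_relevant)
print(pairs)
```

Passing the classifiers in as functions keeps the pairing logic independent of how the image and text models are actually implemented.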

We trained Stable Diffusion 2.1 on this data to see what it would generate from a free-form historical text. Although training took only 100k steps (single GPU, 5 days of training), the model achieved significant results in generating maps from historical data. We looked at the following:

  • Ability of the model to generate maps from prompt
  • Map quality - lack of artifacts, map details, etc.
  • Ability to understand time period and region described in prompts

First, we compared the output of the default Stable Diffusion checkpoint (image below, left) with the output of our trained model (right):

Alexander III image and map image
When Wallace was growing up, King Alexander III ruled Scotland. His reign had seen a period of peace and economic stability. On 19 March 1286, however, Alexander died after falling from his horse.[18][19] The heir to the throne was Alexander's granddaughter, Margaret, Maid of Norway. As she was still a child and in Norway, the Scottish lords set up a government of guardians. Margaret fell ill on the voyage to Scotland and died in Orkney in late September 1290.[20] The lack of a clear heir led to a period known as the "Great Cause", with a total of thirteen contenders laying claim to the throne…

Two maps of 16th-century Japan
With supply difficulties hampering both sides, neither the Japanese nor the combined Ming and Joseon forces were able to mount a successful offensive or gain any additional territory, resulting in a military stalemate in the areas between Hanseong and Kaesong. The war continued in this manner for five years, and was followed by a brief interlude between 1596 and 1597 during which Japan and the

As you can see, the model trained on our data is able to identify the historical region where the events take place and draw its map. The prompt used for inference was not in the training dataset (and is of a different format). The maps are very diverse.

Let’s take a look at some more examples:

map
El Cid fought against the Moorish stronghold of Zaragoza, making its emir al-Muqtadir a vassal of Sancho. In the spring of 1063, El Cid fought in the Battle of Graus, where Ferdinand's half-brother, Ramiro I of Aragon, was laying siege to the Moorish town of Graus, which was fought on Zaragozan lands in the valley of the river Cinca. Al-Muqtadir, accompanied by Castilian troops including El Cid, fought against the Aragonese.
map
The sons of Hacı I Giray contended against each other to succeed him. The Ottomans intervened and installed one of the sons, Meñli I Giray, on the throne. Menli I Giray, took the imperial title "Sovereign of Two Continents and Khan of Khans of Two Seas.

Although the maps from this version do not yet meet the goals of the project, one may argue that increasing the dataset scale, applying better filters and using more preprocessing tricks, combined with much longer pre-training, will give accurate results of the desired quality. Note that this version of the model was trained with less than 0.001% of the compute used to train the original StableDiffusion.

Additionally, we’d like to demonstrate how the quality of generated maps improves with longer training. Results after 40000 training steps are on the left, results after 100000 training steps are on the right.

Map of Gaul (modern day France)
Constantine III (died 411) was a common Roman soldier who was declared emperor in Roman Britain in 407.  He moved to Gaul (modern France), taking all of the mobile troops from Britain, with their commander Gerontius, to confront bands of Germanic invaders who had crossed the Rhine the previous winter. With a mixture of fighting and diplomacy Constantine stabilized the situation and established control over Gaul and Hispania (modern Spain and Portugal), establishing his capital at Arles
map
The sons of Hacı I Giray contended against each other to succeed him. The Ottomans intervened and installed one of the sons, Meñli I Giray, on the throne. Menli I Giray, took the imperial title "Sovereign of Two Continents and Khan of Khans of Two Seas.
Map of 1475 Greek Principality of Theodoro and the Genoese colonies at Cembalo, Soldaia, and Caffa (modern Feodosiya).
In 1475 the Ottoman forces, under the command of Gedik Ahmet Pasha, conquered the Greek Principality of Theodoro and the Genoese colonies at Cembalo, Soldaia, and Caffa (modern Feodosiya). Thenceforth the khanate was a protectorate of the Ottoman Empire. The Ottoman sultan enjoyed veto power over the selection of new Crimean khans. The Empire annexed the Crimean coast but recognized the legitimacy of the khanate rule of the steppes, as the khans were descendants of Genghis Khan

Conclusions

During our research we demonstrated that diffusion models can be used to generate maps.

We showed that, depending on the prompt, diffusion models are capable of generating parts of a map (region, province or even smaller scale), accurately re-drawing borders according to a historical period, and editing already-existing maps based on new context.

We’ve shown that with a generalized map dataset we can create a maps-only version of a diffusion network, and that with the power of unsupervised pre-training at scale it will be able to achieve high generalization capabilities.

Map quality, accuracy and diversity depend heavily on the exact task and the dataset used, but it is possible to state the problem in such a way that the generated map can be used externally.

Additional steps that may be taken into consideration

  1. Train a model for longer and on a larger dataset
  2. Explore different text2image networks
  3. Tune the model on “downstream tasks”, e.g., editing maps based on a prompt, generating a map given a year and region, etc.
  4. Improve the text understanding of the network by training its textual part on text only

Next steps with regard to overall project development

To verify the hypothesis in full and make generated maps usable on physical maps, the following tasks need to be addressed:

  1. Learn how to translate generated maps into an external format (e.g., OpenStreetMap). This requires “reading” generated maps. In our view, this problem has to be split into two parts: identifying countries and borders, and matching them with textual information (e.g., country names). It is possible to integrate the identification of countries and borders into a diffusion network.
  2. Maps should be generated in different styles. For that, one may assign various labels to maps in the dataset, based on content (e.g., political map, religious map) and style (e.g., lithography map, globe). These labels would then be inserted into prompts and later used for generation.
  3. Finally, we could add some input-output layers to the model and get different outputs from it (e.g., provide each map with the coordinates of its boundaries, the map type and any other information).

Part 3. How we use the Perceptual similarity metric (LPIPS)

In the ongoing series of experiments surrounding the generation of historical maps, this article introduces a crucial tool for evaluating the fidelity of generated images: the Perceptual Similarity Metric, or LPIPS. Rather than relying on mere pixel-by-pixel comparisons, LPIPS leverages the power of neural networks to provide a more nuanced understanding of image similarity.

This document describes how similarity between generated and original images can be evaluated with a perceptual similarity metric (LPIPS), and how we can use it to compare “quality” of generated images during training.

Reminder about data

A map of Europe in a year 1400.

Metric description

LPIPS is a metric that compares similarity between two images.

Instead of comparing two images by pixels, it uses features that can be extracted from a pre-trained neural network - meaning, we feed a network an image, and get some information from hidden layers of the network as an output.

Perceptual image similarity metric has two properties:

  • It is large when a human observer sees a large difference between the images
  • It is small when observers consider the images similar

We used this library to compute LPIPS between our images. To evaluate generation results we brought generated and original images to 1080x1080 (original images were cropped, generated images were downscaled from 3072x3072). Additionally, we compared LPIPS values for the same images downsized to 512x512.

How can LPIPS be used?

  • As a validation metric during training (to check whether generation has improved)
  • To compare different models with one another
  • To select the best image from multiple generated samples
  • To select the best model version after training
  • Given multiple maps in the same style but from different periods, LPIPS can tell which map is “the closest in time” to a third map, following the general rule that the more years lie between two maps, the more border changes there are. This way, LPIPS can be used to cluster maps by period.
  • To identify specific years or periods where generated images are of lower quality, and then work on the related parts of the dataset.
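Two of the uses listed above, picking the best sample and finding the map that is closest in time, reduce to taking a minimum over distances. The sketch below shows this with a pluggable distance function. `toy_distance` (mean absolute difference over flat lists of floats) is only a stand-in so that the example runs without a neural network; in practice the distance would be an LPIPS score. All names here are illustrative.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[float]  # toy representation: a flat list of pixel values
Distance = Callable[[Image, Image], float]


def toy_distance(a: Image, b: Image) -> float:
    """Stand-in for LPIPS: mean absolute difference between two images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_best(reference: Image, candidates: Sequence[Image],
                distance: Distance) -> Tuple[int, float]:
    """Return the index and score of the candidate closest to the reference."""
    scores = [distance(reference, c) for c in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def closest_in_time(query: Image, maps_by_year: Dict[int, Image],
                    distance: Distance) -> int:
    """Return the year whose map is most similar to the query map."""
    return min(maps_by_year, key=lambda year: distance(query, maps_by_year[year]))


reference = [0.0, 0.0, 0.0, 0.0]
candidates = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.4], [0.0, 1.0, 0.0, 0.0]]
best_index, best_score = select_best(reference, candidates, toy_distance)

maps_by_year = {1400: [0.0] * 4, 1500: [0.5] * 4, 1600: [1.0] * 4}
nearest_year = closest_in_time([0.45] * 4, maps_by_year, toy_distance)
print(best_index, best_score, nearest_year)
```

Keeping the distance as a parameter means the same selection code works for any perceptual metric, not only LPIPS.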

Let’s say we have an original image of a map of Europe in the year 1400.

A map of Europe in a year 1400.

We want to compare it with a generated image:

A map of Europe in a year 1400.

LPIPS(original_1400, generated_1400) = 0.13760228

Now, let’s calculate LPIPS between original image and image generated for, let’s say, year 1600:

A map of Europe in a year 1600

LPIPS(original_1400, generated_1600) = 0.4240757

And the same for an image generated for the year 1500:

A map of Europe in a year 1500

LPIPS(original_1400, generated_1500) = 0.3257943

Comparing LPIPS for similar images

LPIPS differences are smaller if generated images are very similar.

For instance, let’s compare the original image for the year 1400 with three generated images for the same year:

LPIPS equals 0.14 (left), 0.17 (center) and 0.15 (right)

The image in the center, which has the higher LPIPS, uses different colors for two countries in the middle of the map (Lithuania and Moldova). Comparing all three images, we can see that only some parts of the map are visually different. On closer inspection, you may notice that only the Ottoman and Polish-Lithuanian borders differ, while on other parts of the map only a few artifacts differ.

This case is interesting because it demonstrates that LPIPS captures “smaller”, local changes on a map.

Relation to the age difference between maps

Table values are the mean LPIPS between the original image and 10 generated images.

*Note that the years 950 and 1800 were not present in the training data, so the generated images for them are purely fictional.

It may be observed that the bigger the age gap between maps, the bigger LPIPS becomes.
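The observation can be checked mechanically. The sketch below takes the three LPIPS values quoted earlier for the 1400 original (against images generated for 1400, 1500 and 1600) and the three per-sample values for 1400, then verifies that the score does not decrease as the gap in years grows. The function names are illustrative, and the values of the full table are not reproduced here.

```python
from statistics import mean
from typing import Dict, List


def mean_lpips(scores: List[float]) -> float:
    """Average LPIPS over several generated samples for the same prompt."""
    return mean(scores)


def grows_with_gap(reference_year: int, score_by_year: Dict[int, float]) -> bool:
    """True if LPIPS never decreases as the distance in years increases.

    Years at the same distance from the reference are compared in
    arbitrary order, so this check is meant for one-sided comparisons.
    """
    ordered = sorted(score_by_year.items(),
                     key=lambda item: abs(item[0] - reference_year))
    values = [score for _, score in ordered]
    return all(a <= b for a, b in zip(values, values[1:]))


# LPIPS(original_1400, generated_<year>) as quoted in the text.
score_by_year = {1400: 0.13760228, 1500: 0.3257943, 1600: 0.4240757}
monotonic = grows_with_gap(1400, score_by_year)

# Three generated samples for the year 1400 (left, center, right).
mean_1400 = mean_lpips([0.14, 0.17, 0.15])
print(monotonic, round(mean_1400, 4))
```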

A note on resolution

We also compared LPIPS at a resolution of 512x512.

Comparing the results in the two tables (1080x1080 vs 512x512 resolution), it may be seen that when LPIPS@1080 is larger (e.g., 0.45), LPIPS@512 is roughly the same or slightly smaller (roughly -3% difference).

However, smaller LPIPS@1080 values (e.g., 0.15) lead to a bigger difference from LPIPS@512 (~25% difference). This implies that we should upscale our images to detect smaller differences on the map.
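The text does not state how the percentages were computed. A natural assumption is the change of LPIPS@512 relative to LPIPS@1080, which the sketch below implements. The value 0.4365 is not a measurement; it is simply 0.45 reduced by 3%, used to illustrate the formula.

```python
def relative_difference(lpips_1080: float, lpips_512: float) -> float:
    """Change of LPIPS@512 relative to LPIPS@1080, as a fraction."""
    if lpips_1080 == 0:
        raise ValueError("LPIPS@1080 must be non-zero")
    return (lpips_512 - lpips_1080) / lpips_1080


# Illustrative pair: 0.4365 is 0.45 reduced by 3%, not a measured value.
high = relative_difference(0.45, 0.4365)
print(f"{high:+.1%}")
```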

Conclusions

LPIPS greatly extends our ability to assess the quality of generated maps.

It can be used for validation, for selecting the best generated image and for evaluating the model’s predictive power in general.

Moreover, the LPIPS approach can be extended and potentially used to compare original and generated maps and show the regions where the model makes mistakes.

FAQs

How can I contribute to or collaborate with the Historica project?
If you're interested in contributing to or collaborating with Historica, you can use the contact form on the Historica website to express your interest and detail how you would like to be involved. The Historica team will then be able to guide you through the process.
What role does Historica play in the promotion of culture?
Historica acts as a platform for promoting cultural objects and events by local communities. It presents these in great detail, from previously inaccessible perspectives, and in fresh contexts.
How does Historica support educational endeavors?
Historica serves as a powerful tool for research and education. It can be used in school curricula, scientific projects, educational software development, and the organization of educational events.
What benefits does Historica offer to local cultural entities and events?
Historica provides a global platform for local communities and cultural events to display their cultural artifacts and historical events. It offers detailed presentations from unique perspectives and in fresh contexts.
Can you give a brief overview of Historica?
Historica is an initiative that uses artificial intelligence to build a digital map of human history. It combines different data types to portray the progression of civilization from its inception to the present day.
What is the meaning of Historica's principles?
The principles of Historica represent its methodological, organizational, and technological foundations:
  • Methodological principle of interdisciplinarity: This principle involves integrating knowledge from various fields to provide a comprehensive and scientifically grounded view of history.
  • Organizational principle of decentralization: This principle encourages open collaboration from a global community, allowing everyone to contribute to the digital depiction of human history.
  • Technological principle of reliance on AI: This principle focuses on extensively using AI to handle large data sets, reconcile different scientific domains, and continuously enrich the historical model.
Who are the intended users of Historica?
Historica is beneficial to a diverse range of users. In academia, it's valuable for educators, students, and policymakers. Culturally, it aids workers in museums, heritage conservation, tourism, and cultural event organization. For recreational purposes, it serves gamers, history enthusiasts, authors, and participants in historical reenactments.
How does Historica use artificial intelligence?
Historica uses AI to process and manage vast amounts of data from various scientific fields. This technology allows for the constant addition of new facts to the historical model and aids in resolving disagreements and contradictions in interpretation across different scientific fields.
Can anyone participate in the Historica project?
Yes, Historica encourages wide-ranging collaboration. Scholars, researchers, AI specialists, bloggers and all history enthusiasts are all welcome to contribute to the project.