1. Collection and processing of the initial data.
To begin with, we need to collect data about geographical locations (cities). The basic parameters are the geographical coordinates and the founding date of the city (or the date of its first mention in historical chronicles). The primary source of this information is Wikipedia. In addition to the basic parameters, we also need the following information about each geolocation: the states the city has belonged to and the history of its name (if applicable).
Dedicated data miner services download pages, articles or data in any other format from reliable sources and store them on the Historica server for further processing. This step is needed to work around the anti-bot protection that most serious data sources implement: once a copy is stored locally, later processing does not have to touch the original. We store the copies with almost no processing and attach notes that allow us to determine, during the next crawl, whether the original has changed. If the original changes or new data appears, the entire Historica pipeline restarts, so that the map ultimately displays verified facts.
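The change check between crawls can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Historica implementation: the `Snapshot` record, the `needs_reprocessing` function and the use of a SHA-256 digest as the stored note are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass
class Snapshot:
    """Stored copy of a source page plus the note used to detect changes."""
    url: str
    content: bytes
    digest: str  # assumed form of the "note": SHA-256 of the raw content


def make_snapshot(url: str, content: bytes) -> Snapshot:
    return Snapshot(url, content, hashlib.sha256(content).hexdigest())


def needs_reprocessing(store: Dict[str, Snapshot], url: str, content: bytes) -> bool:
    """Return True when the page is new or differs from the stored copy.

    The store is updated in place, so the next crawl compares against
    the most recent copy.
    """
    fresh = make_snapshot(url, content)
    previous = store.get(url)
    store[url] = fresh
    return previous is None or previous.digest != fresh.digest
```

A crawler would call `needs_reprocessing` for every downloaded page and restart the pipeline only for the pages where it returns `True`.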
Next come services called fact miners. Their task is to select the facts we are interested in: information about cities, countries or individuals. All extracted facts are stored in a separate database, the raw facts database. To obtain facts from unstructured and poorly structured information, fact miners use natural language processing (NLP) algorithms. While raw facts are being selected, all references found in the data source, together with a link to the source itself where possible, are added to the processing queue. The processing queue serves as input for the data miners mentioned above and is also used for reprocessing the original sources. One of the fact miners' tasks is to cycle endlessly through the stored originals. Unlike data miners, fact miners can run in parallel under heavy load without fear of being banned by information sources, because all the data is already stored on our side.
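The interaction between fact miners and the processing queue might look like the sketch below. All names here (`run_fact_miner`, `toy_extract`, the `fact|ref1,ref2` page format) are invented for illustration; in particular, `toy_extract` only stands in for the real NLP step.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# extract(text) -> (facts, referenced source ids); stand-in for the NLP step.
Extractor = Callable[[str], Tuple[List[str], Iterable[str]]]


def toy_extract(text: str) -> Tuple[List[str], List[str]]:
    """Toy extractor for pages written as 'fact|ref1,ref2' (hypothetical format)."""
    fact, _, refs = text.partition("|")
    return [fact], [r for r in refs.split(",") if r]


def run_fact_miner(
    stored_pages: Dict[str, str], extract: Extractor, seeds: List[str]
) -> Tuple[List[str], List[str]]:
    """Process stored copies and return (raw facts, sources still to download)."""
    queue = deque(seeds)
    seen = set(seeds)
    raw_facts: List[str] = []
    to_download: List[str] = []
    while queue:
        source = queue.popleft()
        if source not in stored_pages:
            to_download.append(source)  # left for the data miners
            continue
        facts, references = extract(stored_pages[source])
        raw_facts.extend(facts)
        for ref in references:
            if ref not in seen:  # enqueue every reference only once
                seen.add(ref)
                queue.append(ref)
    return raw_facts, to_download
```

Because the function reads only from `stored_pages`, several such workers could run in parallel without issuing any request to the original sources.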
To build the raw facts database, we plan to use sources written in various languages. Before such articles are processed with NLP algorithms, all texts must be translated into the base language, which is English. We assume that machine translation systems will be used to prepare the data. In Wikipedia, for example, each significant geographical object (city) usually has a whole set of articles written in different languages. In some cases these are not mere translations from one language to another, but texts created independently of each other. The facts about the same city given in these articles may contradict, complement or even disprove each other. In any case, the system receives enough basic material for the fact selector algorithms, which verify and select real facts. Such algorithms should use AI / ML systems, which make it possible to keep only facts that contradict neither each other nor common sense. All facts obtained are stored in a form that:
- Makes it very cheap (algorithmically) to check consistency;
- Allows facts to be displayed without additional transformations and calculations. For example, in some cases it is cheaper to store the fact that city A appeared in year X and disappeared in year X + 34 than to store a renaming event and re-evaluate it every time the year displayed on the map changes.
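One storage form with both properties is a validity interval per fact. The sketch below is an assumption about what such a form could look like, not the schema Historica uses; `NameFact`, `name_in_year` and `is_consistent` are hypothetical names, and years are plain integers with negative values standing for BC.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NameFact:
    city_id: str
    name: str
    start_year: int          # first year the name is valid
    end_year: Optional[int]  # first year it is no longer valid; None = still valid


def name_in_year(facts: List[NameFact], city_id: str, year: int) -> Optional[str]:
    """Display lookup: a plain interval test, no event replay needed."""
    for fact in facts:
        if fact.city_id != city_id:
            continue
        if fact.start_year <= year and (fact.end_year is None or year < fact.end_year):
            return fact.name
    return None


def is_consistent(facts: List[NameFact]) -> bool:
    """Cheap consistency check: no two intervals of the same city overlap."""
    ordered = sorted(facts, key=lambda f: (f.city_id, f.start_year))
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.city_id != cur.city_id:
            continue
        if prev.end_year is None or cur.start_year < prev.end_year:
            return False
    return True
```

After sorting, only neighbouring intervals need to be compared, so the consistency check costs one sort plus one linear pass.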
2. Data display on the map.
2.1. General principles and approaches.
At the user's request, Map Presenter displays all the facts selected at the previous stages on the world map for the year chosen by the user. To display the map, we use OpenLayers. This solution is designed for working with GIS data and lets us manipulate data without worrying about the rendering engine, so we can focus on the data itself and its presentation. The built-in pipeline allows us to supplement and update the set of displayed facts in response to changes in the primary sources, so the map stays up to date.
2.2. Drawing the countries’ borders in historical dynamics.
At first, we assumed that state borders could be derived with machine learning methods based on linear classification algorithms. One such algorithm is the Support Vector Machine (SVM). Using supervised learning, it would construct a set of hyperplanes, each separating groups of geolocations that belong to two different states. Connected in sequence, these hyperplanes would form a broken line along the interstate border, reasonably close to the border that existed in a given historical era. To implement this idea, we would need to extract complete information about all possible geolocations in each country, down to the smallest ones (such as villages). The more information we could collect (especially for border areas), the more accurate the mathematically calculated border would be. When studying the dataset that Wikipedia provides, we found that despite the abundance of articles on small geolocations with geographical coordinates, they lack data on when these settlements were founded, which makes it almost impossible to display them correctly on the map in historical dynamics.
In addition, there are many places on the world map where real borders pass through areas with practically no settlements, for example in the desert zones of the African continent. Borders in these places are, as a rule, straight lines with few bends. Besides, the natural borders of states can be the coastlines of lakes, seas or oceans, or the fairways of rivers. All of this creates a large number of intractable problems for algorithms that calculate border lines from settlement data.
If we applied such approaches to calculating borders from the depths of history up to our time, we would risk accumulating significant errors, and the calculated borders between modern states could differ significantly from the real ones. Moreover, the further back we move along the timeline, the fewer geographic features are available, so the accuracy of the drawn borders also decreases.
In light of the above, we decided to abandon such methods of calculating borders in favour of the so-called “deductive” method, which works backwards in time. In this technique, the actual, currently existing borders are taken as the initial conditions. Then, based on historical facts from the database that influenced the borders, we gradually change the coordinates of the border lines. This method may also introduce inaccuracies, but the closer the chosen era is to the present, the more accurate the borders become. On returning to the present moment, the borders are drawn as accurately as possible, since their coordinates coincide with the real ones that were set initially. This method removes the need to extract information about minor geolocations from the database and lets us focus only on relatively large objects, for which the historical details are fairly complete and can be displayed accurately on the timeline.
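The backward reconstruction can be illustrated on a deliberately simplified model in which the state of the map is reduced to "which country owns which city" instead of border coordinates. The sketch assumes that each historical fact is a dated transfer of one city between two owners; `TransferEvent` and `owners_in_year` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TransferEvent:
    year: int       # year in which the transfer takes effect
    city: str
    old_owner: str
    new_owner: str


def owners_in_year(
    current: Dict[str, str], events: List[TransferEvent], year: int
) -> Dict[str, str]:
    """Start from the present-day state and undo every later event."""
    owners = dict(current)  # initial conditions: the current reality
    for event in sorted(events, key=lambda e: e.year, reverse=True):
        if event.year <= year:
            break  # this event and all earlier ones already hold in `year`
        owners[event.city] = event.old_owner
    return owners
```

With no events to undo, the function returns the present-day state unchanged, which mirrors the property that the map is exact at the present moment.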
However, a new difficulty arises: how should the borders change when, for example, a city that first belonged to one country is later captured by another? In this case the borders should be moved, and the corresponding section of the map should be repainted in the appropriate colour. It must be taken into account, however, that a city displayed on the map is a geometric point, not a polygonal area. Obviously, the capture of a city is accompanied by the capture of the territory around it (the province). It would therefore be advisable to operate with a set of polygons, each binding an area to a specific city. Each country can consist of a large number of such landmasses, which may change shape and size depending on the historical era. The principles of the formation and dynamic change of such landmasses are yet to be developed. In any case, they will have the same initial conditions as the borders, that is, they will be based on data taken from current reality.
2.3. Country colouring on the map.
When all borders are drawn, the system will face the task of colouring the map. Once the coordinates of all borders are available, each state can be treated as a separate classified polygonal object that can be filled with a particular colour within its borders. The algorithm will also have to choose a colour for each country so that neighbouring countries contrast with each other. There is a classic mathematical result, the four colour theorem (the “four colour problem”), which states that any map can be coloured using just four colours so that any two areas sharing a common border have different colours. Our task is simpler in that we are not limited to a set of four colours: our services can use the entire colour spectrum. Although international law recognizes state borders not only on land but also on water (seas, oceans), when colouring we will assume that the borders are the coastlines (at least, that is how they are usually displayed on political maps of the world). However, there are several problems. For example, history knows many periods when some territories were not part of any state but belonged to tribes (or tribal unions). On conventional maps, such facts are shown not by shading the territories but only by writing the names of these peoples (ethnic groups) approximately where they lived. We will have to develop separate algorithms to identify such territories and the facts associated with them, in order to visualize reasonably correctly the information extracted and processed by the NLP algorithms. In such cases, the concepts of “border” and “fill colour” will, of course, be absent.
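Because the palette is not limited to four colours, a simple greedy strategy is enough to give neighbouring countries different colours. The sketch below is one possible approach, not a decision made in this document; `colour_countries`, the adjacency dictionary and the palette are illustrative, and the function assumes the adjacency is symmetric.

```python
from typing import Dict, List, Set


def colour_countries(
    neighbours: Dict[str, Set[str]], palette: List[str]
) -> Dict[str, str]:
    """Greedy colouring: no two countries sharing a border get the same colour.

    Countries with the most neighbours are coloured first. Greedy colouring
    does not guarantee that four colours suffice, so the palette may need
    to be larger than four.
    """
    colours: Dict[str, str] = {}
    order = sorted(neighbours, key=lambda c: (-len(neighbours[c]), c))
    for country in order:
        used = {colours[n] for n in neighbours[country] if n in colours}
        for colour in palette:
            if colour not in used:
                colours[country] = colour
                break
        else:
            raise ValueError(f"palette too small to colour {country}")
    return colours
```

Territories without state borders, such as tribal lands, would simply not appear in `neighbours` and would be handled by a separate labelling step.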
2.4. Nonlinear timeline.
The historical timeline covers the period from the emergence of the first civilizations, starting at about the tenth millennium BC, to the present day. The minimum step (MS) of the timeline for displaying the historical states of the political map changes on a roughly logarithmic scale and depends on the historical period in question. At the moment, we intend to use the following nonlinear values of the minimum step as the initial conditions:
- From the 10th millennium BC to the 4th millennium BC – 1 MS = 1000 years
- From the 4th millennium BC to the 1st millennium BC – 1 MS = 100 years
- From the 1st millennium BC to 500 BC – 1 MS = 10 years
- From 500 BC until 1500 AD – 1 MS = 1 year
- From 1500 AD to the present – 1 MS = 1 month
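The table above can be expressed as a lookup function. The exact boundary years are an assumption, since the list names millennia rather than years: the sketch places the boundaries at 10000 BC, 4000 BC, 1000 BC, 500 BC and 1500 AD, represents BC years as negative integers, and ignores the absence of a year zero. The name `minimum_step` is hypothetical.

```python
from typing import List, Tuple

Step = Tuple[int, str]  # (amount, unit)

# (first year of the period, minimum step); negative years are BC.
# Boundary years are assumed, see the note above.
_PERIODS: List[Tuple[int, Step]] = [
    (-10000, (1000, "year")),
    (-4000, (100, "year")),
    (-1000, (10, "year")),
    (-500, (1, "year")),
    (1500, (1, "month")),
]


def minimum_step(year: int) -> Step:
    """Return the minimum timeline step for the period containing `year`."""
    if year < _PERIODS[0][0]:
        raise ValueError("year is before the start of the timeline")
    step = _PERIODS[0][1]
    for start, candidate in _PERIODS:
        if year < start:
            break
        step = candidate
    return step
```

Keeping the periods in a single table means the initial values can be tuned later without changing the lookup logic.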