24. October 2022
My internship at QFS: Sorting and comparing XML files (and other things)
"Geodata systems and optimization: Evaluation and optimization of the testability of geodata systems, in particular using QF-Test" was the topic of my internship semester at QFS. It was part of my studies in "Geoinformatics and Navigation" at the University of Applied Sciences Munich and lasted from September 2021 until the beginning of February 2022.
Geodata and geoinformation systems were often covered in lectures during my studies, so I had an idea of how complex these applications are. They are used to manage, store and/or display geodata (i.e. information that can somehow be drawn on a map). Often such systems have an interface where various georeferenced geometries (such as estates, roads, or boundaries) are displayed on maps. Since the functionality can be very diverse and far-reaching, the topic of testability is very important and exciting.
Beyond the internship semester, I have been working as a student trainee at QFS for quite some time and was able to continue and complete my projects even after the official end of the internship.
In search of a problem
In order to get a deeper insight into the use of QF-Test at geo-related companies, I conducted several interviews with customers. On the one hand, the questions were aimed at general topics, such as whether there are special requirements for the geospatial industry in the area of quality assurance. On the other hand, I asked what daily work with QF-Test looks like and which processes are tested.
Another question was about future use: Is there room for optimization? Which areas might perhaps not yet have a working solution? With these questions I also hoped to find a main project for my practical semester.
The result of the interviews was that in most cases QF-Test already works great and is used in many ways. There were only a few difficulties concerning QF-Test, for example with the image comparison (as this can be unstable and time-consuming to maintain).
The decisive idea for my project came from Uwe Päsler from the "Staatsbetrieb Geobasisinformation und Vermessung Sachsen" (GeoSN). Among other things, he suggested improving the procedure for comparing XML files that is part of QF-Test.
Since the data exchange of the applications used by GeoSN mostly takes place via a NAS interface, corresponding tests have been carried out using XML files for some time. In addition to the informational and geometric data of the individual objects, these XML files also contain references of the objects to each other, which leads to very time-consuming tests with correspondingly large files (10 MB and up).
An improved procedure for comparing large XML files
For a long time it was possible to compare XML files via the procedure qfs.utils.xml.compareXMLFiles of the standard library.
The procedure provides a variety of ways to customize the comparison, such as ignoring nodes or attributes.
The previous comparison was a Jython script that parses both files with the minidom parser and then compares them. The problem with minidom is that each file must be parsed completely and represented as a DOM tree before the comparison can start. For large files this requires a great deal of memory, which also explains the long runtime of the previous comparison.
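To illustrate why the DOM approach is memory-hungry, here is a minimal sketch of a minidom-based comparison. It is not the original QF-Test script; the function names (child_elements, direct_text, elements_equal, compare_with_minidom) are hypothetical, and it is written for a current Python 3 rather than Jython. The point to notice is that both documents are turned into complete trees before a single node is compared.

```python
from xml.dom import minidom


def child_elements(node):
    """Return only the element children, skipping text and comment nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def direct_text(node):
    """Concatenate the stripped text found directly inside an element."""
    return "".join(
        c.data.strip() for c in node.childNodes if c.nodeType == c.TEXT_NODE
    )


def elements_equal(a, b):
    """Recursively compare tag name, attributes, text and child elements."""
    if a.tagName != b.tagName:
        return False
    if dict(a.attributes.items()) != dict(b.attributes.items()):
        return False
    if direct_text(a) != direct_text(b):
        return False
    kids_a, kids_b = child_elements(a), child_elements(b)
    return len(kids_a) == len(kids_b) and all(
        elements_equal(x, y) for x, y in zip(kids_a, kids_b)
    )


def compare_with_minidom(xml_a, xml_b):
    # Both complete DOM trees exist in memory before the comparison starts.
    doc_a = minidom.parseString(xml_a)
    doc_b = minidom.parseString(xml_b)
    return elements_equal(doc_a.documentElement, doc_b.documentElement)
```

For a pair of small documents this is perfectly fine; the cost only becomes noticeable once each tree holds many thousands of nodes.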
With my new script, the speed problem is now solved. The comparison is no longer implemented with the minidom parser but is based on the SAX parser. The abbreviation "SAX" stands for "Simple API for XML"; like minidom, it is available as a module of the Python standard library. In contrast to the minidom parser, the SAX parser reads the file sequentially and reports each element, together with its attributes, as it is encountered. This, together with the structure of the script, prevents the entire file from being loaded into memory and thus speeds up the comparison enormously, so that a 10 MB file takes only about one minute!
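The event-based style of SAX can be sketched as follows. This is only an illustration under my own assumptions, not the script shipped with QF-Test: the class EventCollector and the helper sax_events are hypothetical names, and the example targets a current Python 3. The handler receives one callback per element start, text chunk and element end, so nothing forces the whole document to be kept in memory.

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Forward each parsing event to a callback instead of building a tree."""

    def __init__(self, emit):
        super().__init__()
        self.emit = emit
        self._text = []  # the parser may deliver text in several chunks

    def _flush_text(self):
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.emit(("text", text))

    def startElement(self, name, attrs):
        self._flush_text()
        self.emit(("start", name, tuple(sorted(attrs.items()))))

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.emit(("end", name))


def sax_events(xml_text):
    """Collect all events of a small document in a list (for demonstration only)."""
    events = []
    xml.sax.parseString(xml_text.encode("utf-8"), EventCollector(events.append))
    return events
```

Collecting the events in a list, as sax_events does, is only useful for looking at them; a streaming comparison would consume each event as soon as it arrives.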
The comparison with minidom quickly reached a CPU utilization of over 90 percent for large files, or exhausted the RAM available to QF-Test. This could cause the program to crash, or to abort the comparison with an error message.
In my new version, each of the two files is parsed by its own thread, which stores every node it reads in a special queue. A third thread runs in parallel, compares the contents of the two queues, and clears them afterwards. This keeps the amount of memory required within reasonable limits.
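The interplay of two parser threads and one comparing thread could look roughly like the sketch below. It is a simplified illustration, not the code from the standard library: the names QueueingHandler, parse_into and compare_streams are hypothetical, the bounded queue size max_pending is my own assumption for keeping memory use limited, and the example is written for a current Python 3.

```python
import queue
import threading
import xml.sax

_DONE = object()  # sentinel that marks the end of one file


class QueueingHandler(xml.sax.ContentHandler):
    """Put every SAX event into a queue for the comparing thread."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self._text = []

    def _flush_text(self):
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.out.put(("text", text))

    def startElement(self, name, attrs):
        self._flush_text()
        self.out.put(("start", name, tuple(sorted(attrs.items()))))

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.out.put(("end", name))


def parse_into(xml_bytes, out, errors):
    """Parser thread body: stream one document into its queue."""
    try:
        xml.sax.parseString(xml_bytes, QueueingHandler(out))
    except Exception as exc:  # remember parse errors for the caller
        errors.append(exc)
    finally:
        out.put(_DONE)


def compare_streams(xml_a, xml_b, max_pending=1000):
    # Bounded queues make a parser wait when the comparer falls behind.
    q_a = queue.Queue(maxsize=max_pending)
    q_b = queue.Queue(maxsize=max_pending)
    errors, result = [], {}

    def compare():
        equal, done_a, done_b = True, False, False
        # Keep draining both queues so that no parser thread blocks forever.
        while not (done_a and done_b):
            ev_a = _DONE if done_a else q_a.get()
            ev_b = _DONE if done_b else q_b.get()
            done_a, done_b = ev_a is _DONE, ev_b is _DONE
            if ev_a != ev_b:
                equal = False
        result["equal"] = equal

    threads = [
        threading.Thread(target=parse_into, args=(xml_a, q_a, errors)),
        threading.Thread(target=parse_into, args=(xml_b, q_b, errors)),
        threading.Thread(target=compare),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise errors[0]
    return result["equal"]
```

Because each event is removed from its queue as soon as it has been compared, only a small window of each file is held in memory at any time. This sketch reports only whether the files are equal; options such as ignoring nodes or attributes are left out.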
The new SAX-based XML comparison function as seen in the QF-Test standard library.
All functionalities offered by the previous procedure are also offered in the new procedure.
Sorting before comparing is also possible. In that case, however, the content of the file is stored differently, namely in a custom class derived from TreeMap, which in turn consists of at least two TreeMaps. These store the nodes of a given level. As soon as a node contains a child node, a new instance of this TreeMap class is created and stored in the child-node TreeMap of the original node. In the end you get a single TreeMap that holds the entire file. You can then iterate through it and compare the individual contents.
The reason a TreeMap is used here is that a comparator keeps the contents sorted as they are inserted at runtime. The TreeMap also makes inserting and reading entries very fast.
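The idea of a nested, always-sorted structure can be sketched as follows. Plain Python has no TreeMap, so this illustration swaps in a key list kept in order with the bisect module; it is a stand-in for the technique, not the custom TreeMap class from the actual script. All names (SortedLevel, Node, TreeBuilder, flatten, sorted_equal) are hypothetical, and sorting siblings by tag name and attributes is my own assumption about a reasonable comparator.

```python
import bisect
import xml.sax


class SortedLevel:
    """Children of one node, kept ordered by key on insertion (TreeMap stand-in)."""

    def __init__(self):
        self._keys = []
        self._nodes = []

    def add(self, key, node):
        # bisect_right keeps siblings with equal keys in document order.
        index = bisect.bisect_right(self._keys, key)
        self._keys.insert(index, key)
        self._nodes.insert(index, node)

    def items(self):
        return list(zip(self._keys, self._nodes))


class Node:
    def __init__(self):
        self.text = ""
        self.children = SortedLevel()  # one nested sorted level per node


class TreeBuilder(xml.sax.ContentHandler):
    """Build the nested sorted structure while the file is being parsed."""

    def __init__(self):
        super().__init__()
        self.root = Node()  # artificial holder for the document element
        self._stack = [self.root]

    def startElement(self, name, attrs):
        node = Node()
        key = (name, tuple(sorted(attrs.items())))
        self._stack[-1].children.add(key, node)
        self._stack.append(node)

    def characters(self, content):
        self._stack[-1].text += content

    def endElement(self, name):
        self._stack.pop()


def flatten(node):
    """Walk the sorted tree into a nested structure that can be compared directly."""
    return [
        (key, child.text.strip(), flatten(child))
        for key, child in node.children.items()
    ]


def sorted_equal(xml_a, xml_b):
    trees = []
    for data in (xml_a, xml_b):
        builder = TreeBuilder()
        xml.sax.parseString(data, builder)
        trees.append(flatten(builder.root))
    return trees[0] == trees[1]
```

With this sketch, two files that contain the same siblings in a different order compare as equal. Siblings that share tag name and attributes but differ only in their text are not reordered, which is a limitation of the simple key chosen here.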
Since the entire file must still be kept in memory in this case, sorting beforehand increases the duration of the comparison, although it is of course still faster than the same action with the previous minidom procedure.
Conclusion of my practical semester
Besides this big project of my internship semester, I did many smaller tasks and activities. These made the period highly varied and interesting.
During the entire internship semester, I also had support from my supervisor, who helped me a lot with the implementation of this project.
Thanks to this internship, I was able to apply, deepen and expand the skills I had acquired during my previous studies. The cooperation with QFS customers was especially exciting, and I gained many new impressions.
PS from a colleague: Sarah's work is very valuable and useful for many who work with XML data. Therefore, her solution has been built into QF-Test and released as part of the standard library for all with version 6.0.2.