The PopulAid data generation tool supports developers in testing applications. It facilitates the gathering of high-quality data and the overall testing process.
Motivation
Real customer data is considered the best basis for application testing. Unfortunately, using real data is not possible in many cases, for several reasons. First, there are legal requirements that must be met. In Germany, for example, access to and distribution of personal data, such as patient records, are regulated by law. Second, there are situations in which the amount of available real customer data is unsuitable for testing a certain aspect of an application. In those cases, the data set must be either reduced or enlarged, while preserving the original data’s characteristics, e.g. value distributions and relations between records.
Also, even if real data is accessible, not all developers should get access to the customer’s business-critical data, for security reasons. Hybrid solutions are a common way to address this problem. In those approaches, a selected group of developers gets access to the real data while others use anonymized or generated data in their day-to-day work. Because anonymization that keeps the characteristics of the original data is at least as difficult as generating new data with those properties, test data generation is a frequently used approach in software development. The quality of the generated data is crucial for the viability of those approaches.
The reasons for generating test and development data mentioned above are not new. Developers often address them by writing custom data generation scripts for a specific application. The obvious disadvantage of this approach is its focus on a single use case and therefore its reduced reusability. On the other hand, writing such scripts is in most cases a simple task, because giving up generality reduces the overall complexity. In contrast to custom scripts, numerous general-purpose test data generators are available, which promise generality to varying degrees. Even though those generators are more reusable, they are often too unspecific to satisfy most developers’ needs, especially when it comes to generating domain-specific data. The basic idea for PopulAid is based on research work at the Hasso Plattner Institute. The proof of concept built there was transferred to the Innovation Center in Potsdam, where a new version, adapted to the needs of XS Engine, was implemented.
To overcome the aforementioned issues, we have equipped PopulAid with a broad toolchain, which can be broken down into the following features:
Generation of standard data types
PopulAid supports the standard features common to the majority of existing data generators. It can generate random data for numerical and text-based entries as well as dates and timestamps. PopulAid defines a separate generator for each column. Within the configuration of a generator for a field, it is possible to specify the value distribution and data quality properties, such as pollution with NULL values.
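The idea of one generator per column, each with its own value distribution and a configurable share of NULL values, can be illustrated with a minimal Python sketch. This is not PopulAid's API or configuration format: the function names, the `null_ratio` parameter, and the column names are assumptions made purely for illustration.

```python
import random
from typing import Callable, Dict, List, Optional


def make_column_generator(
    draw: Callable[[random.Random], object],
    null_ratio: float = 0.0,
) -> Callable[[random.Random], Optional[object]]:
    """Wrap a value-drawing function so that a share of the results is None (NULL)."""
    if not 0.0 <= null_ratio <= 1.0:
        raise ValueError("null_ratio must be between 0 and 1")

    def generate(rng: random.Random) -> Optional[object]:
        # Decide on NULL pollution first, then draw from the configured distribution.
        if rng.random() < null_ratio:
            return None
        return draw(rng)

    return generate


# One generator per column (hypothetical column names and distributions).
columns: Dict[str, Callable[[random.Random], Optional[object]]] = {
    # Normally distributed amounts, roughly 10 % NULL values.
    "AMOUNT": make_column_generator(lambda r: round(r.gauss(100.0, 15.0), 2), null_ratio=0.1),
    # Uniformly distributed integers, never NULL.
    "QUANTITY": make_column_generator(lambda r: r.randint(1, 10)),
}


def generate_rows(n: int, seed: int = 42) -> List[Dict[str, Optional[object]]]:
    """Generate n rows by calling every column generator once per row."""
    rng = random.Random(seed)
    return [{name: gen(rng) for name, gen in columns.items()} for _ in range(n)]
```

Keeping the NULL decision in a wrapper separates the data quality setting from the distribution itself, so either can be changed per column without touching the other.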
Data Pools
Another feature is the possibility to generate data based on a pool of valid field values. For example, you can specify a pool of values for a status field, and values are then chosen randomly from that pool only. Since pools can also be specified via SQL, data that already exists on the system can be referenced. Data pools are also the proposed means of satisfying foreign key dependencies between tables.
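The following Python sketch illustrates both kinds of pools: a fixed list of values and a pool filled by an SQL query, which keeps foreign keys valid. It uses an in-memory SQLite database as a stand-in for the target system. The table names, column names, and pool values are hypothetical, and the code does not reflect how PopulAid defines pools.

```python
import random
import sqlite3

# In-memory SQLite database as a stand-in for the target system (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEPARTMENT (ID INTEGER PRIMARY KEY, NAME TEXT)")
conn.executemany(
    "INSERT INTO DEPARTMENT VALUES (?, ?)",
    [(1, "Sales"), (2, "Research"), (3, "HR")],
)
conn.execute(
    "CREATE TABLE EMPLOYEE ("
    "ID INTEGER PRIMARY KEY, STATUS TEXT, "
    "DEPT_ID INTEGER REFERENCES DEPARTMENT(ID))"
)

# Pool 1: a fixed list of valid status values.
status_pool = ["ACTIVE", "ON_LEAVE", "RETIRED"]

# Pool 2: filled via SQL from data that already exists in the database.
dept_pool = [row[0] for row in conn.execute("SELECT ID FROM DEPARTMENT")]

rng = random.Random(7)
employees = [
    (emp_id, rng.choice(status_pool), rng.choice(dept_pool))
    for emp_id in range(1, 101)
]
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?)", employees)

# Because DEPT_ID is only ever drawn from existing keys, no row is orphaned.
orphans = conn.execute(
    "SELECT COUNT(*) FROM EMPLOYEE "
    "WHERE DEPT_ID NOT IN (SELECT ID FROM DEPARTMENT)"
).fetchone()[0]
```

Drawing the foreign key column from a query over the referenced table is what makes the dependency hold by construction, rather than by checking and repairing rows afterwards.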
Intra-row dependencies
PopulAid is able to ensure relations within a single data record. This feature allows previously generated fields to be reused to construct other column values. In PopulAid, this is done via patterns.
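As a rough illustration of the idea, the Python sketch below generates the fields of a row in order and lets a later field refer to earlier ones through a template. The `{FIELD}` placeholder syntax, the `pattern` helper, and the field names are assumptions for this sketch only and are not PopulAid's pattern syntax.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
FieldGenerator = Callable[[random.Random, Row], object]


def pattern(template: str) -> FieldGenerator:
    """Build a generator that fills {FIELD} placeholders from earlier fields of the row."""

    def generate(_rng: random.Random, row: Row) -> str:
        return template.format(**row)

    return generate


# Fields are generated in list order, so a pattern may only refer to fields above it.
fields: List[Tuple[str, FieldGenerator]] = [
    ("FIRST_NAME", lambda rng, row: rng.choice(["Anna", "Ben", "Clara"])),
    ("LAST_NAME", lambda rng, row: rng.choice(["Meyer", "Schmidt"])),
    ("DISPLAY_NAME", pattern("{LAST_NAME}, {FIRST_NAME}")),
]


def generate_row(rng: random.Random) -> Row:
    """Generate one row, passing the partially built row to each field generator."""
    row: Row = {}
    for name, generator in fields:
        row[name] = generator(rng, row)
    return row
```

Passing the partially built row to every generator is the simplest way to express such dependencies. A fuller implementation would have to order the fields by their dependencies instead of relying on list order.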
Complex hierarchies between records
Sometimes, data records depend on each other in such a way that one data row acts as a master for subsequent rows. PopulAid supports the generation of this kind of hierarchical data.
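A master-detail relationship of this kind can be sketched in Python as two nested loops, where every master value is followed by a fixed number of dependent rows. The ID formats and the function name are hypothetical and only serve the illustration. PopulAid's own hierarchy configuration is not shown here.

```python
from typing import Dict, Iterator


def generate_hierarchy(n_masters: int, details_per_master: int) -> Iterator[Dict[str, str]]:
    """Yield detail rows grouped under consecutively numbered master IDs."""
    for m in range(1, n_masters + 1):
        org_id = f"ORG{m:03d}"  # master key, e.g. ORG001 (hypothetical format)
        for d in range(1, details_per_master + 1):
            # Each detail row carries the key of the master it belongs to.
            yield {"ORG_ID": org_id, "GROUP_ID": f"{org_id}-G{d}"}


# Three organizations with four groups each.
rows = list(generate_hierarchy(3, 4))
```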
Performance and focus on SAP developers
With its focus on HANA XS Engine and SAPUI5, PopulAid supports major technologies of SAP HANA. The generator can be installed on a system as a Delivery Unit. As a result, PopulAid operates directly on the data and is able to generate values in parallel, which benefits overall performance and enables PopulAid to be used for the generation of large datasets.
Example
An example of the features covered by PopulAid is depicted below. The scenario represents a highly simplified employee table. Predefined pools provide readable values for first and last names, which are selected randomly.
PopulAid can generate consecutive text-based ID values for a field (blue rectangle). Subsequent layers in a hierarchy are possible as well (red rectangle). In this case, each ORG_ID is modelled to have four groups.
The generator for the year of birth field is defined to create numerical values in a 10-year range starting in 1960. This value is reused in two places. First, the entry date of an employee is defined to lie in a value range based on the year of birth. This ensures that only rows are generated in which the employee was between 18 and 21 years old when hired (green arrows). Second, the email address uses the last two digits of the year.
The gender field is realized using a pattern which chooses randomly between “m” and “f”. Even though the same result could be achieved with a data pool, PopulAid offers a more convenient way to define such generators with patterns like [“m”|“f”]. A more complex use case for a pattern is the email field (green rectangle). As mentioned before, it reuses the year of birth as well as the first and last name. Additionally, a constant text is appended, along with a random domain chosen from two possibilities.
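To make the dependencies in this scenario concrete, here is a minimal Python sketch of the employee row described above. It is an illustration under stated assumptions, not PopulAid's configuration: the name pools, the two domains, the exact layout of the email address, and the choice of the first day of a random month as entry date are all invented, and the hiring age is measured in calendar years only.

```python
import random
from datetime import date
from typing import Dict

# Hypothetical pools; the real example uses predefined name pools.
FIRST_NAMES = ["Anna", "Ben", "Clara", "David"]
LAST_NAMES = ["Meyer", "Schmidt", "Fischer"]
DOMAINS = ["example.com", "example.org"]


def generate_employee(rng: random.Random) -> Dict[str, object]:
    """Generate one employee row whose fields depend on each other."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)

    # Year of birth: 10-year range starting in 1960.
    year_of_birth = rng.randint(1960, 1969)

    # Entry date derived from the year of birth: hired at age 18 to 21.
    entry_year = year_of_birth + rng.randint(18, 21)
    entry_date = date(entry_year, rng.randint(1, 12), 1)

    # Equivalent of a two-way choice pattern.
    gender = rng.choice(["m", "f"])

    # Email reuses the names and the last two digits of the year of birth,
    # followed by a constant "@" and one of two domains (assumed layout).
    domain = rng.choice(DOMAINS)
    email = f"{first}.{last}{year_of_birth % 100:02d}@{domain}".lower()

    return {
        "FIRST_NAME": first,
        "LAST_NAME": last,
        "YEAR_OF_BIRTH": year_of_birth,
        "ENTRY_DATE": entry_date,
        "GENDER": gender,
        "EMAIL": email,
    }
```

Calling `generate_employee(random.Random(0))` repeatedly with the same generator object yields rows in which the entry date and email address always stay consistent with the year of birth.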
Future Work
PopulAid is currently ready to use with the features described above. For a preview version and further questions, feel free to contact michael.kusber or thomas.klingbeil. The next development steps are usability improvements along with new features developed specifically for the needs of SAP developers. One of those features is the integration of a tool to directly access SAP data dictionary information.