Hi folks, this is my first blog post, and it is actually a summary of a presentation I gave. I’ve shared the slides on SlideShare below. If you would like to share and discuss ideas on this topic, I will respond to the comments below.
Which framework you use depends entirely on the context of your project. I’m not going to tell you that one is better or that you should use a particular framework. The right choice depends on your team’s capabilities and even on your software development environment. These are referred to as automation frameworks in many places. What about manual testing? Of course, these frameworks can be applied manually, but they are more meaningful with automation.
According to the ISTQB glossary, a scripting technique that uses data files is called data-driven testing. Do you use data files in your automation projects? It does not have to be a file. I mean, do you prepare a data source before automation starts?
We do not prepare any kind of data before test automation. Each case creates its own data while it is running. That’s why I gave this post/presentation its name. Data-driven testing is more than preparing a data source.
OK, when the data consists of keywords, mostly verbs, it is called keyword-driven testing. When the data consists of behaviours, it is called behaviour-driven testing. Of course, it is not that simple, but we can summarize them this way.
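To make the keyword idea a bit more concrete, here is a minimal sketch in Python. It is not taken from any particular framework; the keywords, the shopping cart and the `run_steps` helper are all made up for illustration. The point is only that the test steps are plain data (a verb plus its arguments) and a small interpreter executes them.

```python
# Keyword-driven sketch: every test step is data, a keyword (a verb) plus arguments.
# All names here (the keywords, the cart dictionary, run_steps) are hypothetical.

def add_item(cart, name, price):
    cart[name] = price

def remove_item(cart, name):
    cart.pop(name, None)

def assert_total(cart, expected):
    total = sum(cart.values())
    assert total == expected, f"expected {expected}, got {total}"

# The keyword table maps the verbs used in the test data to real code.
KEYWORDS = {
    "add item": add_item,
    "remove item": remove_item,
    "assert total": assert_total,
}

def run_steps(steps):
    """Execute the steps in order against a fresh cart and return the cart."""
    cart = {}
    for keyword, *args in steps:
        KEYWORDS[keyword](cart, *args)
    return cart

# The test case itself contains no code, only keywords and values.
steps = [
    ("add item", "pen", 3),
    ("add item", "book", 12),
    ("remove item", "pen"),
    ("assert total", 12),
]
final_cart = run_steps(steps)
```

Because the steps are only data, they could just as well live in a spreadsheet or a file, which is where this meets data-driven testing.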
Test Data Management Concepts
As the complexity of our software project increases, managing test data becomes harder and harder to untangle. Especially if you are testing legacy software, it may not even be possible to find the test data. Here we will name the approaches to test data management. Subsetting is the name given to transferring a certain part of the data from one data source (usually the live one) to a target data source (for our purposes, think of the test environment). Masking means that the confidential data is either completely obscured or left with a few verifiable hints. It’s like a puzzle. Find my name: 6 letters, the initial letter is m and the last letter is t. Synthetic data generation is the name given to generating data that does not correspond to a real person, without breaking data integrity. Speaking of data integrity, we need to elaborate on it.
There are many aspects of data integrity. I’ve listed the most important ones. To illustrate: referential integrity is about the absence of orphan records. If a foreign key points to a record that does not exist in the related table, referential integrity is broken, and orphan records in your test environment are very likely to produce false-positive results. The same is true if your data is incomplete and swarming with nulls. There could also be two customer records that belong to the same identity number, and such duplicate records may result in false-positive bugs. Invalid values, for example a $10 million bill or a customer born in 1880, may generate false-positive results in your test environment as well. You should also keep your domain-related tables stable. These are mostly lookup or definition tables. For example, you have a customer type table where you keep your customer types. You don’t want to see meaningless customer types in your CRM test environment, because those will produce false-positive results too. Consequently, maintaining data integrity is about creating accurate and usable test data.
You need to pay attention to data integrity in every approach you use. The approaches are tools for profiling test data. If you know how to use them, your test automation or manual tests will be credible. Otherwise, you will have conversations like “The bug you opened is not real, so I am rejecting it” or “Hey man! This is working in the live environment. Your test data could be the problem. I will reject this bug”. Rejecting, rejecting and rejecting. This will bother you.
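These aspects can be checked with a few queries. The sketch below uses Python’s built-in `sqlite3` module and an in-memory database; the `customer` and `invoice` tables, their columns and the thresholds for “invalid” values are my own assumptions for illustration, not a real schema.

```python
import sqlite3

# Hypothetical tables with deliberately broken rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, identity_no TEXT, birth_year INTEGER);
    CREATE TABLE invoice (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, '111', 1985), (2, '111', 1990), (3, NULL, 1880);
    INSERT INTO invoice VALUES (10, 1, 120.0), (11, 99, 45.5), (12, 2, 10000000.0);
""")

# Referential integrity: invoices whose customer does not exist (orphans).
orphans = conn.execute("""
    SELECT i.id FROM invoice i
    LEFT JOIN customer c ON c.id = i.customer_id
    WHERE c.id IS NULL
""").fetchall()

# Uniqueness: identity numbers shared by more than one customer.
duplicates = conn.execute("""
    SELECT identity_no FROM customer
    WHERE identity_no IS NOT NULL
    GROUP BY identity_no HAVING COUNT(*) > 1
""").fetchall()

# Completeness: customers with a missing identity number.
incomplete = conn.execute(
    "SELECT id FROM customer WHERE identity_no IS NULL"
).fetchall()

# Validity: values outside a plausible range (both thresholds are assumptions).
invalid_birth = conn.execute(
    "SELECT id FROM customer WHERE birth_year < 1900"
).fetchall()
invalid_amount = conn.execute(
    "SELECT id FROM invoice WHERE amount >= 1000000"
).fetchall()

print(orphans, duplicates, incomplete, invalid_birth, invalid_amount)
```

Running checks like these against a test environment before a test run is one way to find out whether a failure comes from the software or from the data.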
Let’s start with subsetting.
We said subsetting is transferring a particular set of data from one source to another. This is how it works. The circle in the middle of the live environment is your subset. Let’s say it is the first hundred records of your customer table. Now you need to build your dependency tree, which means finding the records around your customer table. These arrows represent the direction of the relations between your tables. These records will be transferred to the test environment. However, you cannot simply insert all of them. Some records, such as the domain-related records we just mentioned, have to be appended to what is already in the target, and your subsetting process needs to know which ones. Regardless of which tool you use, if you want to profile test data with the subsetting approach, you should model and discover your data. That means specifying the dependencies (some of your tables may not have foreign keys) and exploring your data (not-null constraints, default values and so on). It will also make you more aware of your database and help you improve it. Of course, we have to apply masking to meet our regulatory responsibilities, such as GDPR and KVKK.
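Here is a rough sketch of the dependency idea in Python. It assumes a tiny, hypothetical schema (`customer_type`, `customer`, `invoice`) held in plain dictionaries instead of a real database, and it uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+) to work out a load order in which parents come before children.

```python
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the tables it references (its parents).
# Relations that have no declared foreign key must be added here by hand.
references = {
    "customer_type": set(),
    "customer": {"customer_type"},
    "invoice": {"customer"},
}

# Parents come before children, so inserts never break a foreign key.
load_order = list(TopologicalSorter(references).static_order())

# Stand-in for the live environment.
live = {
    "customer_type": [{"id": 1, "name": "individual"}, {"id": 2, "name": "corporate"}],
    "customer": [{"id": 1, "type_id": 1}, {"id": 2, "type_id": 2}, {"id": 3, "type_id": 1}],
    "invoice": [
        {"id": 10, "customer_id": 1},
        {"id": 11, "customer_id": 3},
        {"id": 12, "customer_id": 2},
    ],
}

def take_subset(source, seed_count):
    """Take the first seed_count customers plus the records around them."""
    customers = source["customer"][:seed_count]
    customer_ids = {c["id"] for c in customers}
    return {
        # Domain (lookup) table: copied whole so the subset stays meaningful.
        "customer_type": list(source["customer_type"]),
        "customer": customers,
        # Child records: only those that belong to the selected customers.
        "invoice": [i for i in source["invoice"] if i["customer_id"] in customer_ids],
    }

subset = take_subset(live, seed_count=2)
for table in load_order:
    print(table, len(subset[table]))
```

A real tool has to discover `references` from the database and from your own knowledge of it, which is exactly the modelling and discovery effort described above.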
Let’s take a quick look at the masking techniques. Masking can be done statically or dynamically. In static masking, the data is masked or encrypted first and then saved in the database.
In dynamic data masking, you place a masking proxy between the application and your database. Of course, here you need to mask according to the end user’s authority, so your proxy layer needs to know about it. I do not know how widely it is used, but database vendors are also working on masking directly on the database server.
We have been using Postgres for 15 years. We have just started doing dynamic masking with a proxy built on pgbouncer, which is specific to Postgres. Thanks to Mehmet Emin Karakaş, it is open source and you can use it: https://github.com/emin100/pg_ddm
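As a small illustration of static masking, the sketch below replaces a confidential column with a keyed hash before the rows would be saved. This is only one possible technique (pseudonymization with HMAC-SHA256), chosen because the same input always gives the same token, so records that were related before masking stay related. The key, the `email` column and the `pseudonymize` helper are assumptions of mine.

```python
import hashlib
import hmac

# Hypothetical key; a real one should never live in source control.
SECRET_KEY = b"replace-me"

def pseudonymize(value: str, length: int = 12) -> str:
    """Return a stable token for a confidential value that cannot be read back."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

rows = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "email": "alice@example.com"},
]

# Static masking: the masked copy is what gets written to the test database.
masked_rows = [{**row, "email": pseudonymize(row["email"])} for row in rows]

for row in masked_rows:
    print(row)
```

Rows 1 and 3 end up with the same token, so a join or a duplicate check on `email` behaves as it did before masking.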
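The sketch below shows the idea only, not how any particular proxy works: the stored data is never changed, and what a reader sees depends on the role. The column names, the role names and the `read_row` helper are hypothetical.

```python
# Hypothetical configuration of a masking layer.
MASKED_COLUMNS = {"identity_no", "phone"}  # confidential columns
PRIVILEGED_ROLES = {"dba", "auditor"}      # roles allowed to see clear data

def mask_value(value: str) -> str:
    """Hide everything except the last two characters, kept as a verifiable hint."""
    if len(value) <= 2:
        return "*" * len(value)
    return "*" * (len(value) - 2) + value[-2:]

def read_row(row: dict, role: str) -> dict:
    """Return the row as the given role may see it; the stored row is untouched."""
    if role in PRIVILEGED_ROLES:
        return dict(row)
    return {
        key: mask_value(str(value)) if key in MASKED_COLUMNS else value
        for key, value in row.items()
    }

stored = {
    "id": 7,
    "name": "Test Customer",
    "identity_no": "12345678902",
    "phone": "5551234567",
}

tester_view = read_row(stored, role="tester")
dba_view = read_row(stored, role="dba")
print(tester_view)
print(dba_view)
```

A real masking proxy does this between the application and the database, usually by rewriting queries or results, but the decision it makes is the same: who is asking, and which columns are they allowed to see?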
Synthetic Data Generation
And now I want to touch on synthetic data generation, which I think is the future of test data management. I divide it into two levels: the database level and the application level.
In general, most of the tools on the market generate data at the database level. Actually, there is more effort here than in subsetting. The data modelling and data discovery you do for subsetting are needed here too. If you work at the database level, these two are unavoidable. But here you will also need the rule sets that are used in production in order to prepare your data. For example, you keep your national identification number (NID) in the database as an integer value. However, if you give the NID any integer value, your test will probably fail if the NID follows an algorithm such as “the last digit should be an even number”. You should be able to define the NID algorithm in the application that will generate the synthetic data.
The other is application-level generation. You have more responsibilities and challenges here. However, you get very good quality output. Of course, this one has a limitation: your applications need to have an API that can generate test data. If you do not have an API, this method may not be very suitable for you. With this approach, you focus on your application. You will not need to model and discover data, and because no real personal data is involved, you won’t have to deal with masking and regulations like GDPR. But here you will also need to mock the 3rd-party APIs.
How can we virtualize a 3rd-party application? We need to use service virtualization so that we can test without limits.
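To show what “defining the algorithm” could look like, here is a minimal Python sketch. The only rule it implements is the example rule from the paragraph above (the last digit is even); the length of 11 digits and the function names are my assumptions, and a real NID algorithm will have more rules than this.

```python
import random

NID_LENGTH = 11  # assumption: the length is not part of the example rule

def is_valid_nid(nid: int) -> bool:
    """Example rule from the text: the last digit must be an even number."""
    return len(str(nid)) == NID_LENGTH and nid % 2 == 0

def generate_nid(rng: random.Random) -> int:
    """Build a NID that satisfies the rule instead of picking any integer."""
    first = rng.randint(1, 9)  # no leading zero, so the value survives integer storage
    middle = [rng.randint(0, 9) for _ in range(NID_LENGTH - 2)]
    last = rng.choice([0, 2, 4, 6, 8])
    return int("".join(str(digit) for digit in [first, *middle, last]))

rng = random.Random(42)  # a fixed seed keeps the generated data reproducible
nids = [generate_nid(rng) for _ in range(5)]

for nid in nids:
    print(nid, is_valid_nid(nid))
```

Keeping the validator and the generator side by side makes it easy to check the generator against the same rule the application enforces.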
Stubs, drivers and mocks are different from service virtualization. Service virtualization is an extended simulation of dummy stubs, drivers and mock services. In service virtualization, you can also simulate the network: you can change response times or create congestion on the network. While service virtualization is used at the performance, integration and system levels, stubs, drivers and mock services are mostly used at the component level. Here you can see an example. Let’s call X the main component of your application, and say it uses component Y for its operation. When Y is not ready and you want to test your main component, you should put something in place of Y to test X. This something is called a stub. In the opposite case, I mean when your main application is not ready, you need to write drivers. Mock services also behave like stubs, but they are commonly used for 3rd-party integrations.
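The X and Y example can be sketched in Python with `unittest.mock` from the standard library. The `price_of` method, the prices and the `VirtualY` class are assumptions for illustration, and `VirtualY` is only a toy: a real service virtualization tool simulates a whole service over the network, not a single method call.

```python
import time
from unittest.mock import Mock

class X:
    """Main component; it relies on Y to look up a price."""

    def __init__(self, y):
        self.y = y

    def total(self, item, quantity):
        return self.y.price_of(item) * quantity

# Stub: Y is not ready, so a canned answer stands in for it.
y_stub = Mock()
y_stub.price_of.return_value = 5
stub_total = X(y_stub).total("pen", 3)

# Driver: code that calls the component the way its missing caller would.
def driver(component):
    return [component.total("pen", quantity) for quantity in (1, 2)]

driver_totals = driver(X(y_stub))

class VirtualY:
    """Toy stand-in for service virtualization: canned data plus simulated latency."""

    def __init__(self, prices, delay_seconds):
        self.prices = prices
        self.delay_seconds = delay_seconds

    def price_of(self, item):
        time.sleep(self.delay_seconds)  # simulate a slow or congested network
        return self.prices[item]

start = time.perf_counter()
virtual_total = X(VirtualY({"pen": 5}, delay_seconds=0.05)).total("pen", 2)
elapsed = time.perf_counter() - start

print(stub_total, driver_totals, virtual_total, round(elapsed, 2))
```

The stub only answers; the virtual service also behaves like something on the other end of a network, which is what makes it useful for performance and integration tests.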
Of course, we need tools to do this much work. The featured ones come from the companies we have all heard of. They have many products for what we have talked about: subsetting, masking, synthetic data generation and service virtualization.
You should take into account interoperability, speed, cost, learning curve, regulations and decentralization when choosing a tool. Interoperability is very important here. Ask for a proof of concept (POC) of the tool in your own environment.
My next post will cover application-level synthetic data generation in detail. I’m looking forward to your comments. Please do not hesitate to share your opinions with me.