In this project, the database testing came into the picture for the first time from an application migration point of view. There had been many ‘hit and trial’ and the strategies kept on evolving. Now at this point of time we have figured out few strategies which can be used as a reference for the similar projects. Therefore this white paper lists down the best practices which can be followed for Database Testing projects.
The data/application migration involves the migrations of a number of objects — database objects and application objects. Application migration includes converting batch jobs or Night Cycles. Batch jobs are the combination of the above mentioned objects which are run in a Night Cycle or schedule. Application migration includes the migration of such batch jobs. Hence, testing a database from this prospective is slightly different from conventional methods of database testing. This white paper deals with database testing from the application migration perspective.
In application migration projects, the data needs to be tested after the application has been converted or migrated from one database to another. Therefore, the data is assumed to be migrated before the application conversion starts. This involves testing steps to check whether the data which is present in both the databases (the original database and the new one) match or not. For example, assume Sybase is the original database and Oracle is the new database. The data has already been migrated from Sybase to Oracle. Now the applications are being converted from Sybase to Oracle. After the application is converted, the data in both the databases should be in sync to result in a successful conversion.
Data Integrity and Query retrieval are the key testing areas in this domain. Since this involves a huge volume of data to be tested, the testing becomes a little complex.
Key Concepts
Database testing requires some knowledge of key concepts. Various table types are defined below:
Read Tables — This is the classification given to the tables from which SPs, scripts etc. Read the data to be processed. These are the input to the SP’s, scripts etc.
Write Tables — This is the classification given to the tables to which the SP’s, scripts etc. write the processed data. Insert, Update & Delete happen on these output tables.
Trigger Tables — When any modification is done to a physical table in both the databases, the data needs to be validated. In order to do this, a trigger table is created for every physical table, which stores only the modified data. Now these trigger tables can be compared against both the databases to validate the data. These types of tables are called trigger tables. In order to track the modifications done on huge tables like insert, delete, update, trigger is created for each table. A trigger code is written for each table in the data base. A trigger table is created for each physical table. This trigger code inserts the modified record of the physical table into the corresponding trigger table.
The advantage of having trigger tables is, during the validation process, entire tables need not be validated and instead the trigger table having only the modified records can be validated.
Back-up Tables — When a script is executed at both the databases, the data gets modified. Validation of these tables may take time, hence to achieve continuous execution, if this data is modified, data is backed up for time being and can be validated further to ensure progress. Hence, separate tables are created which are the replica of the trigger table or physical table which is called as backup table. This is the classification given to the tables which are created from the write tables and are used only for testing purposes. The write tables are backed up (after the job run is complete) for testing.
Environments:
Gold Copy environment — This environment should contain the master data for all other environments. This data should only be used to refresh other environments. There should be no modifications done on the data of this environment.
Development environment — This environment should be strictly used for development purposes.
System testing environment — This environment should be strictly used for system testing purpose
BETA/Staging Environment — This environment should be used to test the data before loading the converted piece into production and after SIT is complete
Production environment — This environment has real time data and is an unstable environment as the data keeps on changing
Refresh (Environment): All the environments being used for different purposes should be regularly refreshed with production data. This helps to have a stable development and testing environments as well as have almost real time data.
Following lists down the standard testing steps followed in the data base testing for the application conversion projects:
1. Table Refresh - This step is done mainly to ensure that the data between the environments (different database environments- Sybase and Oracle) are in sync. When a job needs to be tested, the end validation involves data validation of the write tables of the job. This is executed in both of the database environments. Hence, the data to be tested should be in sync before execution of the job. This is the reason for table refresh of the job. All the read /write tables of the job should be refreshed.
The table refresh process includes the below.
· Truncate the physical table.
· Disable the trigger of the table.
· Insert the data into the physical table from the gold copy data.
· Enable the trigger of the table.
· Truncate the trigger table.
BCP Refresh:
To refresh huge tables having more than 5 million records, especially in slow environments (old data bases like Sybase) instead of manual refresh, a bcp refresh could speed up the process. This method suggests loading the data into the physical table, after truncation, from the flat file rather than from the gold copy table. Since bulk copy is made, the process completes fast as compared to manual refresh from the gold copy table.
The below command can be used to bcp any table.
bcp "db..table" out file.dat -t "<EOFD>" -r "<EORD>" -b 50000 -c -U <user_name> -P <password> -S <server_name>
In the case of Oracle, the huge table may not take much time to be refreshed, but this process can still be improved by breaking the refresh process into threads (perhaps 8 or 9) which run parallel, reducing the overall time taken.
The refresh process can be automated using excel or a tool.
<Please refer the section 'Tools' for the details of the tool used >
1. Environment Check/Refresh Check
An environment check involves making sure that no processes are running in the environment currently in use.
The refresh check involves making sure that all the tables are refreshed properly by checking the trigger table count to zero (Trigger truncation is the last step in the refresh process). Another way to do this is by checking the physical tables’ count to match the count of the corresponding gold copy table.
2. Job Run
After the environment check/refresh check is complete, jobs are run in both the databases (Sybase & Oracle).The job can be executed in the path where the job code is deployed. The log files for the running job should be checked for runtime errors and execution completion.
After the job is complete, the trigger tables as well as the write tables should be backed up from both the databases (Sybase & Oracle). Backup of the tables is necessary because, the environment shall be freed for other jobs and the data in the backed up table can remain the same until it is dropped. Back-up of tables is also important when more than one job runs parallel. Back-ups are taken for the validation step which follows next.
3. Validation
After the write tables and the trigger tables are backed-up, the data validation is done to determine whether the data matches or not. The validation can be done manually using excel or it can be automated using a Unix Script or QTP script etc.
4. Analysis
After validation, the analysis of the validation result is very important. Analysis of the validation results reveals whether the conversion was successful or not. If the data between the two databases (Sybase & oracle) match, then the conversion is successful. If the data doesn’t match, then another level of analysis is required to find out the reasons for the failure. This extra analysis helps the developers to fix the issue efficiently.
Challenges and Solutions
A. Production testing issue
Challenge: In case of unit, system and integration testing, the data between Sybase and Oracle is static and remains unchanged until the entire environment refresh is carried out bi weekly in both Sybase and Oracle environments. But in the case of production testing, the data is dynamic and keeps changing every day since the client executes the entire cycle in the old database for their business day. Hence the data between two data bases may not be in Sync always.
Solution: In order to overcome this problem, care should be taken so that the Oracle data exactly reflects the Sybase data till the execution completes. This can be achieved by using a pre-loader. Pre-loader is code which loads all the mentioned tables in the Oracle data base with the data of Sybase. Before Cycle execution, if the pre-loader code is executed, then the data between Sybase and Oracle shall be in sync and the cycle may be started at the same time in both of the databases.
B. Trigger issue
Challenge: If the same record in a physical table is modified more than once, all such records get inserted into the trigger table. As a result, during validation of the trigger table, the sort order specification will become a problem since the primary or composite key will no longer be unique.
Solution: This issue can be handled by creating a view which contains only the unique modified records from the physical table. This can be achieved using the following query:
Create view <view_name> as
select * from <physical_table> where <primary_key> in
Select distinct <primary_key> from <trigger_table>.
if Column1 and column2 are the composite key then
create view <view_name> as
select * from <physical_table> where
physical_table.column1=trigger_table.column1 and
physical_table.column2=trigger_table.column2;
C. Environmental issues
Challenge: Availability of only one environment may result in loss of efficiency since only independent Jobs can be executed in parallel. Jobs may or may not have the same set of read/write tables. If the write table of one Job is the read table for another, running both of the Jobs in parallel will create issues. So the jobs will be required to run one after the other, resulting in a time consuming process.
It turns out to be challenging to execute one or more dependent jobs in parallel in a single environment.
Setting different environments for unit testing, development, the master database environment and system testing is also a big challenge.
Solution: Environment issues can be resolved to some extent by setting up different environment like development, testing, gold copy, beta/staging etc. Setting up different environments for different purposes minimizes the rework. Also these environments can be refreshed regularly (say once in 2 weeks) so as to have the latest data from production.
After different environments are set-up for different purposes, the running of required Batch Jobs in an environment is another challenge. Following are the identified techniques which can be used to solve the challenges related to how to run different batch jobs:
1. Sequential: A sequential technique for testing involves running one Job at a time. This is a time consuming process. Also, more than one system environment will be required to increase the efficiency. No code changes will be required.
Sequential approach deals with the normal way of testing a job between two environments by simply backing up the write tables after the job run is completed. Here the constraint is that only one job can be executed at a time in one environment when the read /write tables used by the jobs are common.
1. Parallel: A parallel technique for testing involves running more than one job simultaneously in an environment. To achieve this, some modifications may be required in the actual code for testing purposes. The write tables should be renamed and created for each job in the environment before the job run starts. The Stored Procedures/scripts need to be modified with the newly created write table names before running the Jobs.
This technique will save a lot of time by running Jobs in parallel. In addition, the testing can be carried out with just one environment. This saves time, because now every job will have renamed write tables that they can update. As a result even if a write table of ‘job 1” happens to be the read table for another “job 2”, the original write table is still not affected.
A. Data volume
Challenge: Validating a huge volume of data is a very big challenge.
Solution: There are two ways to determine which method to follow for database testing. The write tables of a job are always validated because these are the only ones being affected by the Job run.
1. Large volume of data: If the write tables of a job have huge volumes of data (example: 1million or above), then it is always better to validate its Trigger Table. Trigger tables will get updated whenever the physical table gets updated during the Job run. This will benefit as Trigger tables will only get updated by the changes made to the Physical table during a Job run. This means that the trigger tables will have a much smaller volume of data as compared to the Physical table.
2. Smaller volume of data: If the write tables of a Job have a smaller volume of data (example: less than 1 million entries), then it is always better to validate the Physical table itself.
Whatever method we choose, following are the steps which need to be followed to complete a data validation:
· After the job run is complete, take back-ups of the write tables so that the environment is free for another Job run.
· As per the design strategy, create views is required to smoothen the testing process.
· Once the views/tables are in place, use Excel Comparison or any automated scripts to validate the data.
Excel Comparison can be used only for a limited volume of data. It is always better to use some automated scripts to do the data validation. In our project we used two types of automated scripts. One was a .net script running in QTP; another was a UNIX script running in UNIX Box. Using any kind of automated scripts increases the efficiency and also makes the analysis easier.
— Excel validation
The above figure shows an example of excel validation using the ‘one in a million formula’. If the result satisfies ‘one in a million condition’, the result will be TRUE else FALSE. TRUE means data validation has passed whereas FALSE means data validation has failed. The above example is just for one value of a column.
Following are the important points to keep in mind while validating the data:
Sorting — The manner in which the columns of table/view are sorted plays a very important role in data validation. If the sorting order is wrong, the results will not be accurate. The results will show data mismatches resulting from an incorrect sorting order.
Validation exceptions — While any migration or conversion of database, the migrated/converted database may be slightly different from the original ones. This may be due to few enhancements. So from a table’s point of view, there may be few data which is new or deleted in the new table. Therefore it is very important to consider all these exceptions before doing the data validation. This helps to analyze data mismatches better in the later stages.
Base table validation — Data mismatches due to the Base Tables not being in sync is very common. Base tables are the main tables which are used in almost all jobs. If these tables are not in sync, then it is likely that there are mismatches in all of the tables. Therefore, it is very important to have these base tables in sync.
Read table validation — As well as write tables, read tables also play a very important role. If the table from which the job is reading data is out of sync, then the write table data will be affected. It is important that the read tables are validated at least once because of this.
Reduced time for validation — In case of very huge tables, the trigger tables can be validated instead of physical tables since the trigger table contains only the affected records. The other way is that, a view can be created which shall contain only the affected records (by getting the primary key say acct_no from the trigger table). Validate the view to avoid the time and sorting issue.
A. Precision issues
Challenge: Whenever there is a data calculation involved, different databases behave differently in terms of precision. For example- Oracle calculates 1/1.5 as 0.666666667 whereas Sybase will calculate 1/1.5 as .66667. Such differences later result in inconsistent final values. In other words, you will be dealing with mismatches. Such scenarios should be taken into consideration while testing.
Solution: To overcome the challenge of precision, there might be some threshold values for each column given by the client. These threshold values help to decide if the column has passed or failed the data validation.
For example: if the acceptable precision mismatch (threshold value) for column “amount” is e-06, then the mismatches beyond this will pass the data validation.
The above mentioned thresh-hold value can be calculated using the following formula — ((Sybase Value — Oracle Value)/Sybase Value)
This represents a one in a million scenario. Hence anything beyond this is in acceptable mismatch range.
Sometimes, when there is a data mismatch which seem like precision issue, then such data along with few data points like
· Column to which the data belongs.
· Account (primary key) in which the mismatch is seen.
· Technical reason for the mismatch which may be due to more accurate calculation etc.
should be incorporated in a slide and should be presented to client. These kinds of issues shall be documented for future reference.