Challenges In Big Data Testing
Big Data is rapidly becoming an efficient substitute for traditional computing techniques for storing and processing large data sets. A variety of techniques, frameworks and tools are available for testing the multiple aspects of Big Data, including its creation, storage and analysis.
This article explains the challenges in Big Data testing while walking through the various stages of the process. Big Data applications are tested in two steps: Data Validation and Architecture / Performance Testing.
1. Data Validation – This stage is carried out in three steps:
a. Data Staging Validation
Also known as the 'pre-Hadoop stage', this is the first step in Big Data testing and covers validation of the data ingestion process. To check that the correct data is being fed into the system, the data is validated against its various sources, such as an RDBMS (Relational Database Management System), social media feeds and weblogs. The sourced data is then compared with the data in the Hadoop system to confirm that the two match. The last step of this stage is verifying that the extracted data has been loaded into the right HDFS locations.
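A minimal sketch of a staging check along these lines, assuming a PostgreSQL source reachable via psycopg2 and the hdfs command-line client; the table name and HDFS path are hypothetical:

```python
# Hypothetical staging check: compare the row count in the source RDBMS
# with the number of records landed in HDFS. Table and path names are
# illustrative, not taken from a real pipeline.
import subprocess
import psycopg2  # assumes a PostgreSQL source system

def source_row_count(table):
    # Count rows in the source table
    conn = psycopg2.connect(dbname="sales", user="etl", host="rdbms-host")
    with conn, conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]

def hdfs_record_count(path):
    # Count lines in the staged HDFS files (one record per line assumed)
    cat = subprocess.run(["hdfs", "dfs", "-cat", path],
                         capture_output=True, check=True)
    return len(cat.stdout.splitlines())

if __name__ == "__main__":
    src = source_row_count("transactions")
    staged = hdfs_record_count("/staging/transactions/*")
    assert src == staged, f"Record count mismatch: source={src}, HDFS={staged}"
    print(f"Staging validation passed: {src} records")
```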
b. "MapReduce" Validation
This is the second step of Big Data testing and focuses on validating the MapReduce jobs. The tester verifies the business logic on each node individually and then validates it again after running the job against multiple nodes. This step ensures the correct functioning of MapReduce, proper implementation of the data aggregation rules, generation of the key-value pairs, and validation of the data after the MapReduce process.
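As an illustration, the business logic of a job can be exercised on a small local sample before it runs on the cluster. The sum-per-key rule below is a generic stand-in for the real aggregation logic, not a method described in this article:

```python
# Illustrative single-node check of MapReduce business logic: the mapper
# emits (key, value) pairs and the reducer applies the aggregation rule.
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # Emit (product_id, amount) for each sales record
    product_id, amount = record.split(",")
    yield product_id, float(amount)

def reducer(key, values):
    # Aggregation rule under test: total amount per product
    return key, sum(values)

def run_local(records):
    # Simulate the shuffle/sort phase on a single node
    pairs = sorted(kv for r in records for kv in mapper(r))
    return dict(reducer(k, [v for _, v in group])
                for k, group in groupby(pairs, key=itemgetter(0)))

# Verify the aggregation rule against hand-computed expectations
sample = ["p1,10.0", "p2,5.0", "p1,2.5"]
assert run_local(sample) == {"p1": 12.5, "p2": 5.0}
print("MapReduce aggregation logic validated on sample data")
```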
c. Output Validation
This is the last and most important step of Big Data testing, as it involves validating the output. In this step, the generated output data files are moved to an Enterprise Data Warehouse (or a related system). The three most important activities of this step are checking that the transformation rules have been applied correctly, checking that the correct data has been loaded into the target system with its integrity intact, and comparing the target data with the data in the HDFS file system to rule out any data corruption.
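A rough sketch of comparing the HDFS output of a job with what was loaded into the warehouse; the connection details, table name and output path are hypothetical:

```python
# Hypothetical output check: compare per-key totals written to HDFS by
# the job with the totals loaded into the Enterprise Data Warehouse.
import subprocess
import psycopg2  # assumes the warehouse speaks the PostgreSQL protocol

def hdfs_output_totals(path):
    # Read "key<TAB>value" output files and total the values per key
    out = subprocess.run(["hdfs", "dfs", "-cat", path],
                         capture_output=True, check=True, text=True)
    totals = {}
    for line in out.stdout.splitlines():
        key, value = line.split("\t")
        totals[key] = totals.get(key, 0.0) + float(value)
    return totals

def warehouse_totals():
    conn = psycopg2.connect(dbname="edw", user="qa", host="warehouse-host")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT product_id, SUM(total) FROM fact_sales "
                    "GROUP BY product_id")
        return {k: float(v) for k, v in cur.fetchall()}

if __name__ == "__main__":
    hdfs = hdfs_output_totals("/output/sales/part-*")
    edw = warehouse_totals()
    bad = {k for k in hdfs if abs(hdfs[k] - edw.get(k, 0.0)) > 1e-6}
    assert not bad, f"Totals differ for keys: {sorted(bad)}"
    print("Output validation passed: HDFS output matches warehouse")
```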
2. Architecture Testing – Architecture testing is another important stage of Big Data testing. Because the Hadoop environment is resource intensive and processes huge amounts of data, architecture testing is key to the success of a Big Data project. If it is not done properly, performance can degrade and the project may fail to meet its requirements. Performance testing, in particular, needs to be carried out in the Hadoop environment.
Performance Testing – The three important activities in Big Data performance testing are:
a. Ingestion of Data -
In this step, the tester verifies how fast the system can consume data from different data sources. The tester also identifies how many messages the queue can process in a given time frame. The rate at which data can be inserted into the underlying data store, such as a Cassandra or MongoDB database, is also measured in this testing.
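A rough way to measure the insertion rate, assuming a MongoDB store reachable via pymongo; the collection, document shape and batch sizes are illustrative:

```python
# Rough ingestion-rate measurement against a MongoDB data store.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://test-store:27017")
collection = client["perf_test"]["events"]

BATCH = 1000
N_BATCHES = 100
docs = [{"sensor": i % 50, "value": float(i)} for i in range(BATCH)]

start = time.perf_counter()
for _ in range(N_BATCHES):
    # Fresh copies each batch, since insert_many adds _id to the dicts
    collection.insert_many([dict(d) for d in docs])
elapsed = time.perf_counter() - start

rate = (BATCH * N_BATCHES) / elapsed
print(f"Inserted {BATCH * N_BATCHES} documents in {elapsed:.1f}s "
      f"({rate:.0f} docs/sec)")
```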
b. Processing of Data -
In this step, the speed of query execution is tested, along with the data processing in isolation. An example would be running MapReduce jobs on the underlying HDFS.
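A simple timing wrapper around such a job might look like the sketch below; the jar path and input/output locations are illustrative, with the stock word-count example used as a stand-in for a real job:

```python
# Rough timing of a MapReduce job run against the underlying HDFS.
import subprocess
import time

cmd = [
    "hadoop", "jar",
    "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar",
    "wordcount", "/perf/input", "/perf/output",
]

start = time.perf_counter()
subprocess.run(cmd, check=True)   # blocks until the job finishes
elapsed = time.perf_counter() - start
print(f"MapReduce job completed in {elapsed:.1f}s")
```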
c. Performance of Sub-Components
Because a Big Data system is composed of many components, each one needs to be tested in isolation. One example is testing the speed at which messages are indexed and consumed; others include search, MapReduce and query performance.
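As one example, the consumption speed of a messaging sub-component could be measured on its own. The sketch below assumes a Kafka topic and the kafka-python client; the broker and topic names are hypothetical:

```python
# Isolated check of a single sub-component: how fast messages can be
# consumed from a queue.
import time
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "perf-topic",
    bootstrap_servers="broker:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,   # stop iterating once the topic runs dry
)

start = time.perf_counter()
count = sum(1 for _ in consumer)
elapsed = time.perf_counter() - start
print(f"Consumed {count} messages in {elapsed:.1f}s "
      f"({count / elapsed:.0f} msg/sec)")
```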
Comparison: Traditional Database Testing Vs Big Data Testing
a. Difference in Type of Data
While in traditional database testing the tester only works with structured data, in Big Data testing the tester has to deal with both structured and unstructured data.
b. Difference in Testing Approach
While the traditional testing approach is well defined in advance, the Big Data testing approach only gets defined after significant R&D effort.
c. Difference in Infrastructure
Because file sizes are limited, traditional testing does not require any special testing environment. Big Data testing, however, requires a special testing environment due to the large data sizes and number of files.
d. Difference in Validation Tools
For traditional testing, the tester uses automation tools based on the UI or Excel-based macros. In Big Data testing, on the other hand, there is no single defined validation tool; it is picked from a range of available tools such as HiveQL or MapReduce.
Big Data Testing Challenges – Each stage of Big Data testing brings its own challenges.