Q.5 What is data cleaning? Write down its importance and benefits. How to ensure it before the analysis of data?
Course: Introduction to Educational Statistics
Course Code 8614
Topics
- What is Data Cleaning?
- Importance of Data Cleaning
- Benefits of Data Cleaning
- Data Cleansing for a Cleaner Database
Answer:
Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools or as batch processing through scripting.
After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data.
The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records). Some data cleansing solutions will clean data by cross-checking with a validated data set. A common data cleansing practice is data enhancement, where data is made more complete by adding related information, for example, appending to each address any phone numbers related to it.
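The difference between strict and fuzzy validation can be sketched in a few lines of Python. This is only an illustration: the list of city names, the function names, and the similarity cutoff of 0.8 are assumptions made for the example, not part of any particular cleansing tool.

```python
from difflib import get_close_matches

# Hypothetical reference list of known, valid entities (assumed for this example).
KNOWN_CITIES = ["Islamabad", "Karachi", "Lahore", "Peshawar", "Quetta"]


def validate_strict(value, known=KNOWN_CITIES):
    """Strict validation: accept the value only if it exactly matches a known entity."""
    return value if value in known else None


def validate_fuzzy(value, known=KNOWN_CITIES, cutoff=0.8):
    """Fuzzy validation: correct the value to the closest known entity, if similar enough."""
    matches = get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(validate_strict("Lahor"))  # None: rejected, not an exact match
print(validate_fuzzy("Lahor"))   # Lahore: corrected to the nearest known record
```

Strict validation rejects the misspelled value outright, while fuzzy validation repairs it. The cutoff controls how aggressive the correction is and would need tuning on real data.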
Data cleansing may also involve activities such as harmonization and standardization of data. An example of harmonization is expanding short codes (st, rd, etc.) into the actual words (street, road, and so on). Standardization means changing a reference data set to a new standard, for example, the use of standard codes.
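Harmonization of short codes can be sketched as a simple lookup-and-replace. The abbreviation table and the function name below are assumptions chosen for the example. A real data set would need its own table and some care with ambiguous codes (for instance, "St" can also stand for "Saint").

```python
import re

# Hypothetical mapping from short codes to full words (assumed for this example).
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}

# Match a short code as a whole word, optionally followed by a full stop.
_PATTERN = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b\.?", re.IGNORECASE)


def harmonize(address):
    """Replace each known short code in the address with its full word."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], address)


print(harmonize("12 Main St."))  # 12 Main street
print(harmonize("Stadium Rd"))   # Stadium road ("St" inside "Stadium" is left alone)
```

Matching on whole words only is what prevents the "St" in "Stadium" or the "rd" in "3rd" from being expanded by mistake.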
Data cleansing is a valuable process that can help companies save time and increase their efficiency. Various organizations use data cleansing software tools to remove duplicate data and to fix badly formatted, incorrect, and incomplete data in marketing lists, databases, and CRMs. These tools can achieve in a short period what could take an administrator days or weeks to fix manually. This means that companies can save not only time but also money by acquiring data-cleaning tools.
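As an illustration of the duplicate removal mentioned above, the following sketch drops repeated records from a small list. The field names ("name", "email") and the rule that two records are duplicates when these fields match after trimming spaces and ignoring case are assumptions for the example. Real tools apply more elaborate matching rules.

```python
def deduplicate(records, key_fields=("name", "email")):
    """Keep the first occurrence of each record, comparing on normalized key fields."""
    seen = set()
    unique = []
    for record in records:
        # Normalize: trim surrounding spaces and ignore letter case.
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data.
records = [
    {"name": "Ali Khan", "email": "ali@example.com"},
    {"name": " ali khan ", "email": "ALI@example.com"},  # same person, different formatting
    {"name": "Sara", "email": "sara@example.com"},
]
print(len(deduplicate(records)))  # 2
```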
Data cleansing is of particular value to organizations that have vast amounts of data to deal with. These organizations can include banks or government organizations, but small to medium enterprises can also find good use for the programs. In fact, many sources suggest that any firm that works with and holds data should invest in cleansing tools. The tools should also be used regularly, as the amount of inaccurate data can grow quickly, compromising the database and decreasing business efficiency.
Data Cleansing for a Cleaner Database
Companies may also find that cleansing enables them to remain compliant with standards that are legally expected of them. In most territories, companies are duty-bound to ensure that their data is as accurate and current as possible. The tools can be used for everything from correcting spelling mistakes to fixing postcodes, as well as removing unnecessary records from systems. This means that space can be preserved and that information that is no longer needed, or data that companies are no longer permitted to keep, can be removed simply, quickly, and efficiently.
Users of data cleansing software can set their own rules to increase the efficiency of a database, making the capabilities of the cleansing software as applicable to the company's needs and requirements as possible. Common problems with databases also include incorrectly formatted phone numbers and e-mail addresses, which leave clients and customers uncontactable.
The software can be used to put things right in a matter of seconds. This makes it a perfect tool for companies that need to stay in touch with outside parties. Meanwhile, companies that employ more than one database – companies that are spread across various branches or offices for example – can use the tools to ensure that each branch of their organization can share the same accurate information.
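A user-defined rule for phone numbers and e-mail addresses might look like the sketch below. The expected phone length of 11 digits and the deliberately simple e-mail pattern are assumptions for the example. A record that fails a rule is returned as None so that it can be flagged for review rather than silently kept.

```python
import re

# Deliberately simple pattern: something@something.something, with no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_email(raw):
    """Trim and lowercase an e-mail address; return None if it is badly formed."""
    email = raw.strip().lower()
    return email if EMAIL_RE.match(email) else None


def clean_phone(raw, expected_digits=11):
    """Strip everything except digits; return None if the length is unexpected."""
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == expected_digits else None


print(clean_email("  Student@Example.COM "))  # student@example.com
print(clean_phone("0300-123 4567"))           # 03001234567
print(clean_phone("12345"))                   # None: too short, flag for review
```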