Data cleansing

Data cleansing is the process of detecting and correcting corrupt or inaccurate records in a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data, and then replacing, modifying, or deleting the affected data. Data entry and acquisition are inherently prone to errors, both simple and complex. Considerable effort goes into this frontend process, but errors remain common in large datasets. With respect to big data management, data cleansing is particularly important, for the following reasons:

  • The main data is usually spread across different legacy systems, including spreadsheets, text files, and web pages
  • By ensuring that the data is as accurate as possible, an organization can maintain good relationships with its customers and improve its operational efficiency
  • Correct and complete data provides better insights into the processes the data describes

There are libraries for Python (Pandas) and R (dplyr) that can help with this process. In addition, there are dedicated data-preparation tools available in the market, both commercial and open source, such as Trifacta, OpenRefine, and Paxata.
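As a minimal sketch of what such a cleansing pass looks like in Pandas, the following example walks through a few of the defect types mentioned above: duplicates, missing values, inconsistent formatting, and impossible values. The column names and sample records are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample records with typical defects: a near-duplicate
# differing only in casing, a missing name, a missing city, and an
# impossible (negative) age.
df = pd.DataFrame({
    "name": ["Alice", "alice", "Bob", "Carol", None],
    "age": [30, 30, -5, 41, 25],
    "city": ["Boston", "Boston", "Denver", None, "Austin"],
})

# Normalize casing so near-duplicates become exact duplicates.
df["name"] = df["name"].str.title()

# Drop exact duplicates after normalization.
df = df.drop_duplicates()

# Remove records missing a required field.
df = df.dropna(subset=["name"])

# Mark impossible values as missing, then drop those records.
df["age"] = df["age"].where(df["age"] >= 0)
df = df.dropna(subset=["age"])

# Fill missing optional fields with a default value.
df["city"] = df["city"].fillna("Unknown")

print(df)
```

Whether a bad value should be dropped, imputed, or flagged for review depends on the dataset and the downstream use; the choices above (dropping invalid ages, defaulting missing cities) are only one reasonable policy.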
