The world is messy, and so is the data that comes out of it. A recent survey found that data scientists spend 60% of their time organizing data. Unfortunately, 57% of them also consider it the most tedious part of their job.
Organizing data is time-consuming, but many tools have been developed to make this crucial step a little more bearable. The Python community offers plenty of libraries for the job, from formatting DataFrames to anonymizing datasets.
Tell us which libraries you find useful - we're always working on optimizing the libraries that go into Mode Python Notebooks.
Dora
Dora is designed for exploratory analysis, and specifically for automating its most painful parts: feature selection and extraction, visualization, and, you guessed it, data cleaning. Its data cleaning functions can:
Read data tables that contain missing and poorly scaled values
Impute missing values
Scale the values of input variables
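To make "impute" and "scale" concrete, here is a minimal sketch in plain pandas. It is not Dora's API, only an illustration of the two operations; the table, the column names, and the choice of mean imputation are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy table; "price" is the output column and is left untouched.
df = pd.DataFrame({
    "sqft": [850.0, None, 1200.0, 1550.0],
    "rooms": [2.0, 3.0, None, 4.0],
    "price": [150, 210, 260, 330],
})

inputs = df.columns.drop("price")

# Impute: replace each missing input value with the mean of its column.
df[inputs] = df[inputs].fillna(df[inputs].mean())

# Scale: shift every input column to zero mean and unit variance.
df[inputs] = (df[inputs] - df[inputs].mean()) / df[inputs].std(ddof=0)

print(df)
```

A library like Dora wraps steps of this kind behind its own methods, so you do not have to repeat them for every dataset.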
Developer: Nathan Epstein
More information: https://github.com/NathanEpstein/Dora
datacleaner
As the name suggests, datacleaner cleans your data - but only if your data is a pandas DataFrame instance. Developer Randy Olson said: "Datacleaner is not magic. It cannot magically parse your unstructured data."
It can drop rows that contain missing values, or fill missing values with the mode or median of their column, and it encodes non-numeric variables as numeric equivalents. The library is very new, but since the DataFrame is the basic data structure for data analysis in Python, it is worth a try.
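The three steps described above can be sketched in plain pandas. This is an illustration of the idea, not datacleaner's own code: the helper `clean`, its `drop_missing` parameter, and the toy data are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a text column.
df = pd.DataFrame({
    "age": [34.0, None, 51.0, 29.0],
    "city": ["Oslo", "Lima", None, "Lima"],
})

def clean(frame: pd.DataFrame, drop_missing: bool = False) -> pd.DataFrame:
    """Illustrative helper: drop or fill missing values, then encode text."""
    out = frame.dropna() if drop_missing else frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric column: fill gaps with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Text column: fill gaps with the most frequent value,
            # then replace each category with an integer code.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
            out[col] = out[col].astype("category").cat.codes
    return out

cleaned = clean(df)
print(cleaned)
```

With the real library, the equivalent work is meant to happen in a single call on your DataFrame, which is the whole appeal.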
Developer: Randy Olson
More information: https://github.com/rhiever/datacleaner
PrettyPandas
DataFrames are powerful, but they don't produce tables you can show directly to your boss. PrettyPandas uses the pandas style API to turn a DataFrame into a presentation-ready table. You can generate summaries, set styles, and adjust the formatting of numbers, columns, and rows. Bonus: robust, highly readable documentation.
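For a feel of what "summaries and formats" means in practice, here is a rough sketch that does the same kind of thing by hand in plain pandas. It does not use PrettyPandas or the style API; the data and column names are invented for the example.

```python
import pandas as pd

# Hypothetical figures for two sales regions.
df = pd.DataFrame(
    {"revenue": [1200.0, 800.0], "growth": [0.125, 0.04]},
    index=["North", "South"],
)

# Summary row: total revenue; growth is left blank rather than summed.
report = df.copy()
report.loc["Total"] = [df["revenue"].sum(), float("nan")]

# Presentation formatting: currency and percent strings.
report["revenue"] = report["revenue"].map("${:,.0f}".format)
report["growth"] = report["growth"].map(
    lambda g: "" if pd.isna(g) else f"{g:.1%}"
)

print(report.to_string())
```

Doing this manually turns numbers into strings and gets tedious fast, which is the gap a styling library is meant to fill.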
Developer: Henry Hammond
More information: https://github.com/HHammond/PrettyPandas
tabulate
tabulate allows you to generate small and attractive tables with just one function call. Great for making tables more readable by adjusting decimal column alignment, data formatting, table headers and more.
It has a super cool feature: the table can be output in different formats, such as HTML or PHP Markdown Extra, so that you can keep working with your tabulated data in other tools or languages.
Developer: Sergey Astanin
More information: https://pypi.python.org/pypi/tabulate
scrubadub
Data scientists in healthcare and finance often need to anonymize datasets. Scrubadub can remove personally identifiable information (PII) from text, for example:
Names
Email address
Internet link
Phone number
Username/password combinations
Skype username
Social Security Number
The documentation does a good job of showing how you can customize scrubadub's behavior, such as defining new types of PII or retaining specific ones.
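The idea of replacing PII with placeholders, and of retaining selected types, can be sketched with the standard library alone. This is not scrubadub's implementation: the `scrub` helper, the `PATTERNS` table, the `keep` parameter, and the placeholder style are assumptions for illustration, and the patterns are far too simple for real anonymization.

```python
import re

# Deliberately simple patterns for the sketch; real PII detection
# needs far more care than these regular expressions provide.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str, keep: frozenset = frozenset()) -> str:
    """Replace each detected item with a {{TYPE}} placeholder.

    Types listed in `keep` are left untouched.
    """
    for label, pattern in PATTERNS.items():
        if label not in keep:
            text = pattern.sub("{{" + label + "}}", text)
    return text

message = "Reach me at jane.doe@example.com, SSN 078-05-1120."
print(scrub(message))
print(scrub(message, keep=frozenset({"EMAIL"})))
```

Adding a new kind of PII here means adding one entry to `PATTERNS`; a dedicated library gives you the same kind of extension point with much better detectors behind it.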
Developer: Datascope Analytics
More information: http://scrubadub.readthedocs.io/en/stable/index.html
Arrow
Let’s be honest: dealing with dates and times in Python is a pain. The local time zone is not recognized automatically, and it takes several awkward lines of code to convert time zones and timestamps.
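Here is roughly what that looks like with only the standard library. The date is arbitrary, and the fixed UTC-7 offset standing in for Pacific Daylight Time is an assumption made to keep the sketch free of any time zone database.

```python
from datetime import datetime, timedelta, timezone

# A plain datetime is "naive": it carries no time zone at all.
naive = datetime(2013, 5, 11, 21, 23, 58)
print(naive.tzinfo)  # None

# To convert, first attach a zone explicitly, then build the target zone.
utc_time = naive.replace(tzinfo=timezone.utc)
pacific_daylight = timezone(timedelta(hours=-7), "PDT")  # fixed offset, assumed
local_time = utc_time.astimezone(pacific_daylight)

# Timestamps need the zone spelled out as well.
from_epoch = datetime.fromtimestamp(1368307438, tz=timezone.utc)

print(local_time.isoformat())
```

None of this is hard, but every step is one more place to forget a time zone and end up with a naive value.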
Arrow aims to solve this problem and fill the functional gap, so that you can handle dates and times with less code and fewer imports. Unlike Python's standard library, Arrow is time zone aware and uses UTC by default. You can convert between time zones or parse time strings with just one line of code.
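A short example of parsing and converting with Arrow, assuming the library is installed (`pip install arrow`). The timestamp string and the target zone are arbitrary choices for illustration.

```python
import arrow

# Parse an ISO 8601 string; the result is time zone aware.
utc_time = arrow.get("2013-05-11T21:23:58+00:00")

# One call converts it to another zone.
local_time = utc_time.to("America/Los_Angeles")

# "Now" is available directly in UTC.
now = arrow.utcnow()

print(local_time.isoformat())
```

Each operation is a single call on a single type, which is where the savings in code and imports come from.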
Developer: Chris Smith
More information: http://arrow.readthedocs.io/en/latest/
Beautifier
Beautifier’s mission is simple: clean up URLs and email addresses and make them look prettier. You can parse emails by domain and username, and URLs by domain and parameters (for example, UTM parameters or tags).
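To show what "parsing by domain, username, and parameters" amounts to, here is a standard-library sketch. It is not Beautifier's API; the helpers `split_email` and `split_url` and the sample addresses are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

def split_email(address: str) -> tuple[str, str]:
    """Return (username, domain) for a simple email address."""
    username, _, domain = address.rpartition("@")
    return username, domain

def split_url(url: str) -> tuple[str, dict[str, str]]:
    """Return the host name and the query parameters of a URL."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; keep the first value of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.hostname or "", params

print(split_email("me@emailprovider.com"))
host, params = split_url(
    "https://example.com/landing?utm_source=newsletter&utm_medium=email"
)
print(host, params)
```

Once the UTM parameters are split out like this, grouping traffic by source or medium becomes a simple dictionary lookup.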
Developer: Sachin Philip Mathew
More information: https://github.com/sachinvettithanam/beautifier
ftfy
ftfy (fixes text for you) takes in bad Unicode and outputs good Unicode. Simply put, it fixes all the junk characters: â€œquotesâ€\x9d becomes "quotes", and Ã¼ becomes ü.
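Where do those junk characters come from? Usually from text that was encoded as UTF-8 and then decoded with the wrong codec. The standard-library sketch below reproduces one such case and undoes it by hand; it only demonstrates the underlying idea, which ftfy detects and repairs for you (typically through its `fix_text` function).

```python
# "ü" encoded as UTF-8 is the two bytes C3 BC. Decoded as Latin-1,
# those two bytes show up as the two characters "Ã¼".
garbled = "ü".encode("utf-8").decode("latin-1")
print(garbled)  # Ã¼

# Reversing the wrong step recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # ü
```

The manual fix only works when you already know which wrong codec was used; guessing that automatically, and safely, is the hard part a library takes off your hands.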
Developer: Luminoso
More information: https://github.com/LuminosoInsight/python-ftfy