My Top 5 Data Cleaning and EDA Techniques

Rachael Friedman
4 min read · Aug 10, 2021


It’s been a few weeks since I last posted on the blog. In that time, I have gotten to work on some interesting regression and classification modeling projects in the General Assembly Data Science Immersive course. For each project, I devoted a significant amount of time to data cleaning and exploratory data analysis (EDA) and found out just how important this part of the modeling process is. In this post, I want to share what I learned by walking through my top 5 favorite data cleaning and EDA techniques.

Data cleaning and EDA are extremely important for modeling. It’s the phase of the project where you really get to know your data, and I’ve learned that the better you know your data, the better you can make your models. I like to approach the data cleaning and EDA phase as a puzzle to solve: it starts off messy and unorganized, and slowly you work through it and clean it up until you have a clear picture. My top 5 favorite techniques are listed below, with some example snippets, inspired by a regression project on housing prices, to show them in use. Hopefully they can be helpful to you as well!

1. Value counts — This might be one of my most used Pandas functions. I like to call value_counts on just about every field in my dataset, and I find it helpful to look at both the regular counts and the normalized counts, which show each category’s share of the total. I highly recommend using this function to get a better understanding of the categories in your data and their distributions. It is also handy for catching category misspellings.
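Here’s a minimal sketch of both calls, using a made-up garage_type column for illustration:

```python
import pandas as pd

# Tiny stand-in for a housing dataset (column name is hypothetical)
df = pd.DataFrame({"garage_type": ["Attchd", "Detchd", "Attchd", "BuiltIn", "Attchd", None]})

# Raw counts for each category (dropna=False also surfaces missing values)
print(df["garage_type"].value_counts(dropna=False))

# Normalized counts show each category's proportion of the whole
print(df["garage_type"].value_counts(normalize=True))
```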

2. Mapping for ordinal data — The map function comes in handy whenever you are working with ordinal data. It’s extremely easy to tack .map() onto the end of a column and specify the ordinal value that each category should be transformed into. You can also create an ordinal dictionary if you need to refer to the same ordinal values more than once.
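For example, a quick sketch with a hypothetical kitchen_qual column and a reusable quality dictionary:

```python
import pandas as pd

df = pd.DataFrame({"kitchen_qual": ["Ex", "Gd", "TA", "Fa", "Gd"]})

# Reusable ordinal dictionary: a higher number means better quality
qual_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Tack .map() onto the column to swap each category for its ordinal value
df["kitchen_qual"] = df["kitchen_qual"].map(qual_map)
print(df)
```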

3. Masking — This is the best filtering technique! I use it whenever I want to slice my data and view it based on a certain condition while doing EDA. I also use it to filter out outliers or create new data frames based on specified conditions.
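A minimal sketch with made-up housing columns:

```python
import pandas as pd

df = pd.DataFrame({
    "gr_liv_area": [1200, 2400, 5600, 1800],
    "sale_price": [150000, 250000, 180000, 210000],
})

# Boolean mask: view only the rows that meet a condition during EDA
large_homes = df[df["gr_liv_area"] > 2000]

# The same idea filters outliers out into a new data frame
df_trimmed = df[df["gr_liv_area"] < 4000].copy()

# Masks combine with & and | (note the parentheses around each condition)
mid_range = df[(df["sale_price"] > 150000) & (df["sale_price"] < 250000)]
```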

4. Creating a data cleaning function — If you have repetitive data cleaning tasks, this one is for you. A cleaning function is perfect when data needs to be transformed in the same way across multiple data frames, files, or even training and testing sets. Instead of coding the data cleaning transformations line by line each time, build a master function that does all of it in one fell swoop when you call it. I usually include all my repetitive transformations in the function: reformatting column names, filtering out outliers, filling or removing nulls, creating ordinal or dummy variables, changing data types, and creating new columns.
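Here’s a rough sketch of what mine tend to look like; every column name below is hypothetical:

```python
import pandas as pd

def clean(df, qual_map):
    """Run the same cleaning steps on any data frame (train, test, new files)."""
    df = df.copy()
    # Reformat column names: lowercase with underscores
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    # Fill nulls in a numeric field
    df["lot_frontage"] = df["lot_frontage"].fillna(0)
    # Ordinal encoding via a shared dictionary
    df["kitchen_qual"] = df["kitchen_qual"].map(qual_map)
    # Dummy variables for a nominal field
    df = pd.get_dummies(df, columns=["neighborhood"], drop_first=True)
    # Create a new column
    df["total_sf"] = df["1st_flr_sf"] + df["2nd_flr_sf"]
    return df

# One call applies every transformation consistently:
# train = clean(train_raw, qual_map)
# test = clean(test_raw, qual_map)
```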

5. Pairplots — I have found that Seaborn’s pairplot is an awesome and quick way to visualize relationships in data. You can set your target variable as the y variable across the board and view it against each x variable, or you can plot all x variables against each other. I recommend doing it both ways, because together they provide the most information about any relationships in the data and what the pattern looks like (e.g., linear, exponential), which helps with your modeling decisions.
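Both versions in a minimal sketch, again with hypothetical columns:

```python
import pandas as pd
import seaborn as sns

# Tiny stand-in for the housing data
df = pd.DataFrame({
    "gr_liv_area": [1200, 1500, 1800, 2400, 3000],
    "overall_qual": [5, 6, 6, 8, 9],
    "sale_price": [130000, 165000, 190000, 280000, 350000],
})

# Every variable plotted against every other one
sns.pairplot(df)

# Target on the y-axis across the board, features on the x-axis
sns.pairplot(df, x_vars=["gr_liv_area", "overall_qual"], y_vars=["sale_price"])
```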

Try these techniques out and let me know how they work for you!
