Hello there, here we are again. Today I will tell you about the first step we data scientists take in data analysis: the great “EDA” :)
EDA stands for “Exploratory Data Analysis”. So what does EDA do? Why do we use it? How do we use it? Don’t worry, you will get answers to these questions and more. So if you’re ready, let’s get started.
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
I would like to show you EDA on one of the most well-known projects on Kaggle: Titanic.
- Load and Investigate the file
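Loading the file with pandas is the usual first step. A minimal sketch, using a tiny hypothetical stand-in for Kaggle’s `train.csv` so the snippet runs anywhere (in the real project you would simply call `pd.read_csv("train.csv")`):

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for Kaggle's train.csv; the rows here are illustrative.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,Fare
1,0,3,"Braund, Mr. Owen Harris",male,22,7.25
2,1,1,"Cumings, Mrs. John Bradley",female,38,71.2833
3,1,3,"Heikkinen, Miss. Laina",female,26,7.925
"""

# Load the CSV into a DataFrame -- the starting point of every EDA.
df = pd.read_csv(StringIO(csv_text))
print(df.shape)  # (3, 7): 3 rows, 7 columns
```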
df.head() shows the first 5 rows of the data by default. If you put a number in the parentheses, e.g. df.head(20), it shows that many rows instead.
df.sample() shows one random row of the data. If you type a number in the parentheses, it shows that many random rows.
df.tail() shows the last 5 rows of the data. Again, if you type a number in the parentheses, it shows that many rows from the end of the data.
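The three calls above can be sketched on a small toy DataFrame (hypothetical values, not the real Titanic rows):

```python
import pandas as pd

# Toy data standing in for the Titanic frame.
df = pd.DataFrame({"PassengerId": range(1, 11),
                   "Age": [22, 38, 26, 35, 35, 28, 54, 2, 27, 14]})

print(df.head())     # first 5 rows (the default)
print(df.head(3))    # pass a number to change how many rows are shown
print(df.tail(2))    # last 2 rows
print(df.sample(4))  # 4 randomly chosen rows
```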
df.isnull().sum() shows the number of missing values in the data for each feature. For example, there are 177 missing values in the ‘Age’ column. This function is very important because we need to fill in missing data with appropriate techniques before doing machine learning.
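A minimal sketch of counting and filling missing values, on a toy frame that mimics the Titanic ‘Age’ column (median imputation here is one common simple choice, not the only technique):

```python
import numpy as np
import pandas as pd

# Toy frame with missing ages, mimicking the Titanic data.
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0],
                   "Fare": [7.25, 71.28, 7.92, 53.1, 8.05]})

missing = df.isnull().sum()
print(missing)  # Age has 2 missing values, Fare has 0

# One simple fix before modeling: fill missing ages with the median age.
df["Age"] = df["Age"].fillna(df["Age"].median())
print(df.isnull().sum())  # all zeros now
```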
df.info() lets us see the dtype of each column and the total number of rows and columns in the data. It also shows the number of non-null values for each column. Thanks to this function, we can detect the columns that are objects and convert them into a form suitable for machine learning.
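A short sketch of inspecting dtypes and picking out the object columns (toy data again; the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Braund", "Cumings", "Heikkinen"],
                   "Age": [22.0, 38.0, None],
                   "Survived": [0, 1, 1]})

df.info()  # prints dtypes, row/column counts, and non-null counts

# Columns of dtype 'object' (like Name) usually need encoding before ML.
object_cols = df.select_dtypes(include="object").columns.tolist()
print(object_cols)  # ['Name']
```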
- Analyze the data
df.describe() computes the following for each numeric column: count, mean, std (standard deviation), min, max, and the 25th, 50th, and 75th percentiles.
This function makes it much easier for us to detect outliers. Values more than 3 standard deviations above or below the mean are treated as outliers.
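The 3-standard-deviation rule above can be sketched like this (a hypothetical fare column with one extreme value, not the real Titanic fares):

```python
import pandas as pd

# Mostly typical fares plus one extreme value.
fares = pd.Series([10.0, 11, 9, 10, 12, 8, 10, 11, 9, 10,
                   10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 1000], name="Fare")
print(fares.describe())  # count, mean, std, min, 25%, 50%, 75%, max

# Flag values more than 3 standard deviations from the mean as outliers.
mean, std = fares.mean(), fares.std()
outliers = fares[(fares < mean - 3 * std) | (fares > mean + 3 * std)]
print(outliers)  # the 1000 fare is flagged
```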
df.corr() calculates the correlation of each column with the other features. This makes it easier for us to identify the columns we will use for machine learning.
We do not take features that correlate more than 90% with our target column, because such a feature usually duplicates the target in another form.
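A minimal sketch of that filtering step, on a toy frame where a hypothetical `Alive` column simply re-encodes the target (the 0.9 threshold and column names are illustrative assumptions):

```python
import pandas as pd

# Toy frame: 'Alive' carries the same information as 'Survived',
# just in another form, so its correlation with the target is -1.0.
df = pd.DataFrame({"Survived": [0, 1, 1, 0, 1, 0],
                   "Fare": [7.25, 71.28, 7.92, 8.05, 53.1, 9.5]})
df["Alive"] = 1 - df["Survived"]

# Correlation of every feature with the target column.
target_corr = df.corr()["Survived"].drop("Survived")
print(target_corr)

# Keep only features whose correlation with the target is at most 0.9.
keep = target_corr[target_corr.abs() <= 0.9].index.tolist()
print(keep)  # 'Alive' is dropped, 'Fare' is kept
```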
And that’s it. These are our basic EDA functions, and they make it much easier for us to explore and understand data.
We have reached the end of another article. I hope it was clear. If you have anything to ask or want to share with me, leave a comment. See you next time, take care :)