Data Exploration In Machine Learning 

Data exploration is one of the most important parts of machine learning. It is necessary to understand the data, because raw data has many problems, such as missing values and outliers. Before training a model, we need to prepare the raw data so that we can get the best result.





The steps of data exploration and preparation:

 There is no fixed number of steps in data exploration and preparation, but the following steps usually give good results.

  1.  Identify the variables
  2.  Univariate analysis
  3.  Bivariate analysis
  4.  Missing value treatment
  5.  Outlier detection and treatment
  6.  Variable transformation and creation

1. Identify the variable :

First of all, we identify the predictor (input) and target (output) variables. Then we identify the data type and the category of each variable.

Type of variable:      Predictor or Target
Data type:             Numerical or Character
Variable category:     Categorical or Continuous
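
As a rough illustration of this step, the sketch below labels each column of a small, made-up dataset with its role, data type and category. The column names, the `describe_variables` helper and the rule "numeric means continuous" are assumptions made for the example, not a general rule.

```python
# Hypothetical toy dataset: each row is a dict, "purchased" is the target.
rows = [
    {"age": 25, "city": "Pune", "income": 32000.0, "purchased": "yes"},
    {"age": 41, "city": "Delhi", "income": 58000.0, "purchased": "no"},
    {"age": 33, "city": "Pune", "income": 45000.0, "purchased": "yes"},
]
target = "purchased"


def describe_variables(rows, target):
    """Return {column: (role, data type, category)} for a list of row dicts."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        role = "target" if column == target else "predictor"
        data_type = "numerical" if numeric else "character"
        # Simplifying assumption: numeric columns are treated as continuous,
        # everything else as categorical.
        category = "continuous" if numeric else "categorical"
        summary[column] = (role, data_type, category)
    return summary


variable_summary = describe_variables(rows, target)
for name, info in variable_summary.items():
    print(name, info)
```

In real data a numeric column can still be categorical (for example a code such as 0/1), so the result should be reviewed by hand.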

2. Univariate analysis :

  In this step, we explore the variables one by one. The method of univariate analysis depends on whether the variable is categorical or continuous. For a continuous variable, we need to understand its central tendency and spread. For a categorical variable, we use a frequency table to understand its distribution.
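
The two cases can be sketched with Python's standard library. The sample values and the helper names `summarize_continuous` and `frequency_table` are made up for illustration.

```python
import statistics
from collections import Counter

# Hypothetical sample data.
ages = [22, 25, 25, 31, 38, 45, 52]
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi", "Mumbai"]


def summarize_continuous(values):
    """Central tendency and spread of a continuous variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def frequency_table(values):
    """Count and share of each category of a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {category: (count, count / total) for category, count in counts.items()}


continuous_summary = summarize_continuous(ages)
city_frequencies = frequency_table(cities)
print(continuous_summary)
print(city_frequencies)
```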

3. Bivariate analysis :

  Bivariate analysis finds out the relationship between two variables. Here we look for association or disassociation between two variables.

The possible combinations are:
  1.  continuous & continuous
  2.  continuous & categorical
  3.  categorical & categorical
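
One common choice for each of the three combinations is sketched below: a Pearson correlation for continuous & continuous, group means for continuous & categorical, and a two-way count table for categorical & categorical. These particular techniques, the data and the function names are assumptions chosen for the example; other measures are possible.

```python
import math
from collections import Counter, defaultdict


def pearson_correlation(xs, ys):
    """Continuous & continuous: linear association between -1 and 1."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def mean_by_group(groups, values):
    """Continuous & categorical: mean of the continuous variable per category."""
    collected = defaultdict(list)
    for group, value in zip(groups, values):
        collected[group].append(value)
    return {group: sum(vals) / len(vals) for group, vals in collected.items()}


def two_way_table(first, second):
    """Categorical & categorical: count of each pair of categories."""
    return Counter(zip(first, second))


# Hypothetical sample data.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 74]
gender = ["F", "M", "F", "M", "F"]
passed = ["no", "no", "yes", "yes", "yes"]

correlation = pearson_correlation(hours, scores)
group_means = mean_by_group(gender, scores)
table = two_way_table(gender, passed)
print(correlation, group_means, dict(table))
```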

4. Missing value treatment :

First of all, a question arises: why is missing value treatment required? Missing data in the training data set can reduce the power of the model (here power means accuracy), or it can lead to a biased model, and the model may over-fit or under-fit. All of these can lead to wrong predictions and classifications.
  • Why does data have missing values? There are many reasons, but the main ones are problems during:
    1.   data extraction
    2.   data collection


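Methods to treat missing values

A. Deletion:

There are two types of deletion: listwise and pairwise. Listwise deletion drops every observation in which any variable is missing. Pairwise deletion keeps, for each analysis, all observations in which the variables of interest are present. Both shrink the sample, so deletion can reduce the power of the machine learning model. Deletion is appropriate when the data is missing completely at random.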
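
The difference between the two kinds of deletion can be sketched as follows. Missing values are represented by `None`, and the rows and helper names are hypothetical.

```python
# Hypothetical dataset; None marks a missing value.
rows = [
    {"age": 25, "income": 32000, "score": 7},
    {"age": None, "income": 41000, "score": 6},
    {"age": 33, "income": None, "score": 8},
    {"age": 47, "income": 58000, "score": None},
]


def listwise_delete(rows):
    """Keep only rows in which every variable is present."""
    return [row for row in rows if all(v is not None for v in row.values())]


def pairwise_select(rows, columns):
    """Keep rows in which the variables needed for one analysis are present."""
    return [row for row in rows if all(row[c] is not None for c in columns)]


complete_rows = listwise_delete(rows)
age_income_rows = pairwise_select(rows, ["age", "income"])
income_score_rows = pairwise_select(rows, ["income", "score"])

print(len(complete_rows), len(age_income_rows), len(income_score_rows))
```

Listwise deletion leaves a single row here, while each pairwise analysis still has two rows to work with, at the cost of using a different sample for each analysis.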

B. Mean/Mode/Median Imputation:

  In this method, we fill the missing values with an estimated value. The mean and median are used for quantitative attributes, and the mode is used for qualitative attributes.
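
A minimal sketch of this kind of imputation, assuming missing values are stored as `None`. The `impute` helper and the sample lists are made up for the example.

```python
import statistics


def impute(values, strategy):
    """Replace None with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical data: a quantitative and a qualitative attribute.
incomes = [30, 40, None, 50, 100]
colors = ["red", "blue", None, "red"]

mean_filled = impute(incomes, "mean")
median_filled = impute(incomes, "median")
mode_filled = impute(colors, "mode")
print(mean_filled, median_filled, mode_filled)
```

Note that the large value 100 pulls the mean above the median, which is one reason to prefer the median when the attribute is skewed.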

C. Prediction Model:

 This is one of the more sophisticated methods for handling missing data, because here we create a predictive model to estimate the missing values.

In this case, we divide our dataset into two data sets:

  1.  the rows with no missing values (we use this as the training data set)
  2.  the rows with missing values (we use this as the test data set, and the variable with missing values is the target)

Drawbacks

1) The estimated values are usually more well-behaved than the true values.
2) If there is no relationship between the variable with missing values and the other variables in the data set, the estimates will not be precise.
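
The split described above can be sketched with a one-variable least-squares line as the predictive model. The choice of model, the column names and the `fit_line` helper are assumptions for the example; any regression or classification model could take its place.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: salary is missing for one row.
rows = [
    {"experience": 1, "salary": 30},
    {"experience": 2, "salary": 35},
    {"experience": 3, "salary": None},
    {"experience": 4, "salary": 45},
    {"experience": 5, "salary": 50},
]

# Rows without missing values train the model; the rest are to be estimated.
train = [row for row in rows if row["salary"] is not None]
to_fill = [row for row in rows if row["salary"] is None]

slope, intercept = fit_line(
    [row["experience"] for row in train], [row["salary"] for row in train]
)

# Fill each missing salary with the model's prediction (updates rows in place).
for row in to_fill:
    row["salary"] = slope * row["experience"] + intercept

print(rows[2])
```

The filled value falls exactly on the fitted line, which illustrates the first drawback: estimated values have less scatter than real ones.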

D. KNN Imputation:

In this method, we fill a missing value using the k observations that are most similar to the observation with the missing value. Similarity is determined by a distance function.

This method has advantages and disadvantages.
Advantage: it can predict both qualitative and quantitative attributes.
Disadvantage: it is very time-consuming on a large data set, because it searches the whole data set for the most similar observations.
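
A small sketch of the idea, assuming Euclidean distance on unscaled numeric features, the mean of the neighbours for a quantitative attribute and the most frequent value for a qualitative one. The `knn_impute` helper, the data and `k=2` are illustrative choices only.

```python
import math
from collections import Counter


def knn_impute(rows, target, features, k=2):
    """Fill missing `target` values from the k nearest complete rows."""
    complete = [row for row in rows if row[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        point = [row[f] for f in features]
        # Sort the complete rows by Euclidean distance to this row.
        neighbours = sorted(
            complete, key=lambda other: math.dist(point, [other[f] for f in features])
        )[:k]
        neighbour_values = [n[target] for n in neighbours]
        if all(isinstance(v, (int, float)) for v in neighbour_values):
            estimate = sum(neighbour_values) / len(neighbour_values)  # quantitative
        else:
            estimate = Counter(neighbour_values).most_common(1)[0][0]  # qualitative
        filled.append({**row, target: estimate})
    return filled


# Hypothetical data; None marks a missing value.
rows = [
    {"age": 25, "experience": 2, "income": 30, "segment": "junior"},
    {"age": 27, "experience": 3, "income": 34, "segment": "junior"},
    {"age": 45, "experience": 20, "income": 90, "segment": "senior"},
    {"age": 26, "experience": 2, "income": None, "segment": None},
]

income_filled = knn_impute(rows, "income", ["age", "experience"])
segment_filled = knn_impute(rows, "segment", ["age", "experience"])
print(income_filled[3]["income"], segment_filled[3]["segment"])
```

Every missing value requires a distance to every complete row, which is where the cost on large data sets comes from. In practice the features would normally be scaled first so that no single feature dominates the distance.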

5. Outlier detection and treatment :

Outliers :

An outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample. In simple words, an outlier is a value that is not similar to the rest of the sample. Example: if an age column contains the value 1000, we can say that it is not possible, so it is an outlier.


Causes of outliers
1) Data entry error
2) Measurement error
3) Sampling error
4) Natural outlier (a genuine but extreme value)
5) Data processing error

To detect outliers, we commonly use visualizations such as a box plot, histogram or scatter plot.
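
A box plot usually draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied without plotting. The sketch below assumes that rule and a made-up list of ages; the 1.5 factor is a convention, not a fixed law.

```python
import statistics

# Hypothetical age data with one impossible value.
ages = [22, 25, 27, 30, 31, 35, 38, 41, 1000]


def iqr_bounds(values, factor=1.5):
    """Lower and upper fences of the box plot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


lower, upper = iqr_bounds(ages)
outliers = [v for v in ages if v < lower or v > upper]
print(lower, upper, outliers)
```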

How to treat outliers
Dealing with outliers is similar to missing data treatment. There are several methods to treat outliers:
1) Deleting the outliers
2) Transforming and binning values
3) Imputing
4) Treating them separately
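
The four treatments can be sketched on the age example. The plausible range of 0 to 120, the bin edges and the variable names are assumptions made for illustration.

```python
import math
import statistics

# Hypothetical age data with one impossible value.
ages = [22, 25, 27, 30, 31, 35, 38, 41, 1000]
LOWER, UPPER = 0, 120  # assumed plausible range for an age


def is_outlier(value):
    return not (LOWER <= value <= UPPER)


# 1) Deleting: drop the outlying observations.
deleted = [v for v in ages if not is_outlier(v)]

# 2) Transforming and binning: a log shrinks extreme values,
#    and bins cap the influence an extreme value can have.
log_transformed = [math.log(v) for v in ages]


def age_bin(value):
    if value < 30:
        return "under 30"
    if value < 60:
        return "30 to 59"
    return "60 and over"


binned = [age_bin(v) for v in ages]

# 3) Imputing: replace outliers with the median of the remaining values.
median_of_valid = statistics.median(deleted)
imputed = [median_of_valid if is_outlier(v) else v for v in ages]

# 4) Treating separately: analyse the outliers as their own group.
regular_group = deleted
outlier_group = [v for v in ages if is_outlier(v)]

print(deleted, imputed, binned, outlier_group)
```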

6. Art of Feature Engineering :

Here, feature engineering means extracting more information from existing data. Feature engineering is divided into two parts:
1) Variable transformation
2) Variable (feature) creation
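
Both parts can be sketched on a tiny, made-up dataset: a log transformation of a skewed variable, and new features derived from a date. The column names and the `add_features` helper are hypothetical.

```python
import math
from datetime import date

# Hypothetical customer rows.
rows = [
    {"income": 1000.0, "signup": date(2023, 1, 2)},
    {"income": 100000.0, "signup": date(2024, 7, 6)},
]


def add_features(row):
    """Return a copy of the row with transformed and newly created features."""
    enriched = dict(row)
    # Variable transformation: log10 compresses a wide, skewed range.
    enriched["log_income"] = math.log10(row["income"])
    # Variable creation: new features extracted from the existing date.
    enriched["signup_month"] = row["signup"].month
    enriched["signup_weekday"] = row["signup"].weekday()  # Monday is 0
    enriched["signup_on_weekend"] = row["signup"].weekday() >= 5
    return enriched


engineered = [add_features(row) for row in rows]
for row in engineered:
    print(row)
```

No new data is collected here; the extra columns only make information that was already in the data easier for a model to use.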


If you have any problem with data exploration, you can ask about it in the comments or through the contact us page.