Applied feature engineering techniques to find the factors that influence price negotiations while buying a house. Seggergated the source data in to numerical(continuous) and categorical(discrete) dataframes and used pearson correlation method to find out the correlation amongst the numerical data. Used OneHot Encoding to prepare the categorical dataframe for further testing and determined the P-Value to perform the Chi square test on the categorical fields to finally realise what are the important variable fields
- Pearson Correlation principle
- OneHot Encoding
- Chi-Square test (P-Value)
- Data Viz using Pairplot, Boxplot, Countplot and Heatmap
There are a total of 80 variable fields in the dataset, some of which are listed below -
- MsZoning : Identifies the general zoning classification of the sale |
- MsSubclass : Identifies the type of dwelling involved in sale |
- LotFrontage : Linear feet of street connected to property |
- LotArea : Lot size in square feet |
- Street : Type of road access to property |
- Alley : Type of alley access to property |
- Neighborhood : Physical locations within the city limits |
- Foundation : Type of foundation |
- ExterCond : Evaluates the present condition of the material on the exterior |
- BsmntQual : Evaluates the height of the basement
- GarageCars and GarageArea are highly correlated with each other hence we'll drop one feature from these two which has less correlation value with the sale price.
- TotalBsmtSF and 1stFlrSF are also highy correlated with each other hence we'll drop one feature from these two which has less correlation value with the sale price.
- TotRmsAbvGrd and GrLivArea are also highly correlated with each other hence we'll drop one feature from these two which has less correlation value with the sale price.
- Also We can drop both FullBath and YearRemodAdd features as well as these feautures inter-correlated with other features.
A total of 24 columns out of the initial 80 variable fields have been dropped since they are not in any case useful for any type of analysis as they do not correlate or impact our target field, sale price. In the end after cleaning and analysing our dataset we are finally left with 56 variable fields that can be important to us and maybe loaded in to our database for further investigation.