December 27, 2021

Exploratory Data Analysis
Ames House

The Ames housing data set (De Cock 2011) is an excellent resource for learning about models, and we use it throughout this project. It contains data on 2,930 properties in Ames, Iowa, including columns related to house characteristics, location, lot information, condition and quality ratings, and sale price. Arie will walk through the EDA and draw insights from the data with visualizations in the R programming language.

Data Fields

  • SalePrice — the property's sale price in dollars. This is the target variable that you're trying to predict.
  • MSSubClass — The building class
  • MSZoning — The general zoning classification
  • LotFrontage — Linear feet of street connected to property
  • LotArea — Lot size in square feet
  • Street — Type of road access
  • Alley — Type of alley access
  • LotShape — General shape of property
  • LandContour — Flatness of the property
  • Utilities — Type of utilities available
  • LotConfig — Lot configuration
  • LandSlope — Slope of property
  • Neighborhood — Physical locations within Ames city limits
  • Condition1 — Proximity to main road or railroad
  • Condition2 — Proximity to main road or railroad (if a second is present)
  • BldgType — Type of dwelling
  • HouseStyle — Style of dwelling
  • OverallQual — Overall material and finish quality
  • OverallCond — Overall condition rating
  • YearBuilt — Original construction date
  • YearRemodAdd — Remodel date
  • RoofStyle — Type of roof
  • RoofMatl — Roof material
  • Exterior1st — Exterior covering on house
  • Exterior2nd — Exterior covering on house (if more than one material)
  • MasVnrType — Masonry veneer type
  • MasVnrArea — Masonry veneer area in square feet
  • ExterQual — Exterior material quality
  • ExterCond — Present condition of the material on the exterior
  • Foundation — Type of foundation
  • BsmtQual — Height of the basement
  • BsmtCond — General condition of the basement
  • BsmtExposure — Walkout or garden level basement walls
  • BsmtFinType1 — Quality of basement finished area
  • BsmtFinSF1 — Type 1 finished square feet
  • BsmtFinType2 — Quality of second finished area (if present)
  • BsmtFinSF2 — Type 2 finished square feet
  • BsmtUnfSF — Unfinished square feet of basement area
  • TotalBsmtSF — Total square feet of basement area
  • Heating — Type of heating
  • HeatingQC — Heating quality and condition
  • CentralAir — Central air conditioning
  • Electrical — Electrical system
  • 1stFlrSF — First floor square feet
  • 2ndFlrSF — Second floor square feet
  • LowQualFinSF — Low quality finished square feet (all floors)
  • GrLivArea — Above grade (ground) living area square feet
  • BsmtFullBath — Basement full bathrooms
  • BsmtHalfBath — Basement half bathrooms
  • FullBath — Full bathrooms above grade
  • HalfBath — Half baths above grade
  • BedroomAbvGr — Number of bedrooms above basement level
  • KitchenAbvGr — Number of kitchens above grade
  • KitchenQual — Kitchen quality
  • TotRmsAbvGrd — Total rooms above grade (does not include bathrooms)
  • Functional — Home functionality rating
  • Fireplaces — Number of fireplaces
  • FireplaceQu — Fireplace quality
  • GarageType — Garage location
  • GarageYrBlt — Year garage was built
  • GarageFinish — Interior finish of the garage
  • GarageCars — Size of garage in car capacity
  • GarageArea — Size of garage in square feet
  • GarageCond — Garage condition
  • PavedDrive — Paved driveway
  • WoodDeckSF — Wood deck area in square feet
  • OpenPorchSF — Open porch area in square feet
  • EnclosedPorch — Enclosed porch area in square feet
  • 3SsnPorch — Three season porch area in square feet
  • ScreenPorch — Screen porch area in square feet
  • PoolArea — Pool area in square feet
  • PoolQC — Pool quality
  • Fence — Fence quality
  • MiscFeature — Miscellaneous feature not covered in other categories
  • MiscVal — Dollar value of miscellaneous feature
  • MoSold — Month Sold
  • YrSold — Year Sold
  • SaleType — Type of sale
  • SaleCondition — Condition of sale

    Import Package and Data

    We start by importing the basic libraries needed throughout the case: dplyr, tidyr, stringr, and lubridate. For visualization we use ggplot2 and plotly.

    library(dplyr)
    library(tidyr)
    library(stringr)
    library(lubridate)
    library(ggplot2)
    library(plotly)
    

    For this exercise, the data set (.csv format) is downloaded to a local folder.

    df <- read.csv('House Price.csv')
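
The field list above names columns such as 1stFlrSF, while the code later in this post refers to X1stFlrSF. As a minimal sketch, assuming read.csv is called with its default check.names = TRUE as above, base R's make.names shows the renaming applied to column names that start with a digit:

```r
# Column names as they appear in the field list above
raw_names <- c("1stFlrSF", "2ndFlrSF", "3SsnPorch", "SalePrice")

# read.csv() runs names through make.names() when check.names = TRUE (the default),
# so names that start with a digit get an "X" prefix
fixed_names <- make.names(raw_names)
print(fixed_names)
```

Passing check.names = FALSE to read.csv would keep the original names, which then have to be quoted with backticks, for example df$`1stFlrSF`.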
    

    Start With the Target

    Perform a univariate analysis of the target variable, SalePrice, and describe the insights that can be obtained.

    # Count missing values per column
    df %>% is.na() %>% colSums()
    # SalePrice has no missing values
    
    # Check the mean, mode, minimum, maximum, and standard deviation
    
    # R has no built-in statistical mode, so define a helper
    getmode <- function(v) {
      uniqv <- unique(v)
      uniqv[which.max(tabulate(match(v, uniqv)))]
    }
    
    getmode(df$SalePrice)  # mode
    sd(df$SalePrice)       # standard deviation
    summary(df$SalePrice)  # min, quartiles, median, mean, max
    
    # Histogram of SalePrice with 50 bins
    ggplot(data = df, mapping = aes(x = SalePrice)) + 
      geom_histogram(bins = 50, color = "white", fill = "coral")
    
    # Histogram of SalePrice with a bin width of 100000
    ggplot(df, aes(SalePrice)) +
      geom_histogram(binwidth = 100000)
    

    The second plot shows the same histogram with a bin width of 100000.

    Insight

    • The histogram shows that the sale price is not normally distributed. It is skewed to the right (positively skewed): most houses sit at the lower prices, with a long tail toward the expensive ones.
    • The most common sale price (the mode) is 140000.
    • The most expensive house sold for 755000, which is an outlier in the data.
    • Most sale prices fall below 250000.
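
The direction of the skew can also be checked numerically instead of by eye. The sketch below uses made-up, lognormal toy prices (not the Ames data) and a hypothetical helper named skewness that computes the moment-based sample skewness; a positive value indicates a longer right tail.

```r
# Toy, made-up prices (NOT the Ames data): many cheaper houses, a few expensive ones
set.seed(1)
price <- round(rlnorm(500, meanlog = 12, sdlog = 0.4))

# Hypothetical helper: moment-based sample skewness
# (positive = longer right tail, negative = longer left tail)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_value <- skewness(price)
print(skew_value)

# With a long right tail the mean is usually pulled above the median
print(mean(price) > median(price))
```

On the real data the same helper could be applied as skewness(df$SalePrice).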

    Find the 5 variables with the strongest correlation (positively or negatively) with SalePrice

    cor.test(df$MSSubClass,df$SalePrice) # -0.084
    cor.test(df$LotFrontage,df$SalePrice) # 0.35
    cor.test(df$LotArea,df$SalePrice) # 0.26
    cor.test(df$OverallQual,df$SalePrice) # 0.790
    cor.test(df$OverallCond,df$SalePrice) # -0.077
    cor.test(df$YearBuilt,df$SalePrice) # 0.522
    cor.test(df$YearRemodAdd,df$SalePrice) # 0.507
    cor.test(df$MasVnrArea,df$SalePrice) # 0.477
    cor.test(df$BsmtFinSF1,df$SalePrice) # 0.386
    cor.test(df$BsmtFinSF2,df$SalePrice) # -0.011
    cor.test(df$BsmtUnfSF,df$SalePrice) # 0.214
    cor.test(df$TotalBsmtSF,df$SalePrice) # 0.613
    cor.test(df$X1stFlrSF,df$SalePrice) # 0.605
    cor.test(df$X2ndFlrSF,df$SalePrice) # 0.319
    cor.test(df$LowQualFinSF,df$SalePrice) # -0.0256
    cor.test(df$GrLivArea,df$SalePrice) # 0.708
    cor.test(df$BsmtFullBath,df$SalePrice)# 0.227
    cor.test(df$BsmtHalfBath,df$SalePrice) # -0.0168
    cor.test(df$BedroomAbvGr,df$SalePrice) # 0.168
    cor.test(df$KitchenAbvGr,df$SalePrice) # -0.135
    cor.test(df$TotRmsAbvGrd,df$SalePrice) # 0.533
    cor.test(df$Fireplaces,df$SalePrice) # 0.466
    cor.test(df$GarageYrBlt,df$SalePrice) # 0.486
    cor.test(df$GarageCars,df$SalePrice) # 0.640
    cor.test(df$GarageArea,df$SalePrice) # 0.623
    cor.test(df$WoodDeckSF,df$SalePrice) # 0.324
    cor.test(df$OpenPorchSF,df$SalePrice) # 0.315
    cor.test(df$EnclosedPorch,df$SalePrice) # -0.128
    cor.test(df$X3SsnPorch,df$SalePrice) # 0.044
    cor.test(df$ScreenPorch,df$SalePrice) # 0.111
    cor.test(df$PoolArea,df$SalePrice) # 0.0924
    cor.test(df$MiscVal,df$SalePrice) # -0.0211
    cor.test(df$MoSold,df$SalePrice) # 0.0464
    cor.test(df$YrSold,df$SalePrice) # -0.0289
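
Calling cor.test once per column works, but the same ranking can be produced in one pass. The following is a minimal sketch on a made-up toy data frame (the columns and values are assumptions for illustration, not the Ames data): it keeps the numeric columns, correlates each one with the target, and sorts by absolute correlation.

```r
# Toy, made-up data frame standing in for df (NOT the Ames data)
set.seed(42)
n <- 200
toy <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  GrLivArea   = round(runif(n, 600, 3000)),
  MoSold      = sample(1:12, n, replace = TRUE),
  Street      = sample(c("Pave", "Grvl"), n, replace = TRUE)  # non-numeric, skipped below
)
toy$SalePrice <- 20000 * toy$OverallQual + 50 * toy$GrLivArea + rnorm(n, sd = 20000)

# Keep numeric columns only, excluding the target itself
num_cols <- names(toy)[sapply(toy, is.numeric)]
num_cols <- setdiff(num_cols, "SalePrice")

# Pearson correlation of each column with the target;
# use = "complete.obs" drops rows with missing values
cors <- sapply(num_cols, function(col) {
  cor(toy[[col]], toy$SalePrice, use = "complete.obs")
})

# Rank by absolute correlation, strongest first
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

Taking head(ranked, 5) would then give the five strongest correlations regardless of sign.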
    

    Insight

    STRONG POSITIVE CORRELATION

    • OverallQual (0.790): this makes sense, because houses with better material and finish quality (rated 1-10) tend to sell for more.
    • GrLivArea (0.708): this makes sense, because houses with more above-grade living area tend to sell for more.
    • GarageCars (0.640): this makes sense, because expensive houses tend to have garages that fit more cars than cheaper houses.
    • GarageArea (0.623): this makes sense, because houses with a high sale price tend to have a large garage area too.
    • TotalBsmtSF (0.613): this makes sense, because houses with a high sale price tend to have a large basement too.

    NEGATIVE CORRELATION (ALL WEAK)

    • EnclosedPorch (-0.128): SalePrice is only weakly negatively correlated with the enclosed porch area, so it can be ignored. The relationship has no clear interpretation, because on average neither cheap nor expensive houses have an enclosed porch.
    • KitchenAbvGr (-0.135): SalePrice has a weak negative correlation with KitchenAbvGr, so it can be ignored. It does not meaningfully affect SalePrice, because the typical house has one kitchen whether it is cheap or expensive.
    • LowQualFinSF (-0.0256): SalePrice has a weak negative correlation with LowQualFinSF. The direction is reasonable, since a higher price goes with less low-quality finished area, but the correlation is too low to matter.
    • MSSubClass (-0.084): SalePrice has a weak negative correlation with MSSubClass, so it can be ignored. The correlation is also not meaningful, because the MSSubClass numbers are category codes rather than quantities; the highest SalePrice falls in class 60, which is a newer two-story house.
    • OverallCond (-0.077): SalePrice has a weak negative correlation with OverallCond, so it can be ignored. It is not informative, because the highest and the lowest SalePrice both occur at the same OverallCond value of 5.

    It never hurts to test basic knowledge

    A common view is that houses with a low OverallQual tend to sell for less and houses with a high OverallQual tend to sell for more. Analyze the relationship between OverallQual and SalePrice.

    df %>% ggplot(aes(x=OverallQual,y=SalePrice)) + geom_point()
    cor.test(df$OverallQual,df$SalePrice)
    

    Insight

    OverallQual is effectively an ordered categorical variable coded as the numbers 1-10, where 1 is very bad and 10 is very good. Expensive houses tend to have a high OverallQual. SalePrice and OverallQual have a very strong correlation of 0.790, so OverallQual is strongly associated with the SalePrice of a home.
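
Because OverallQual is an ordered rating rather than a true measurement, a rank-based correlation and a per-level summary are natural complements to the Pearson test. The sketch below uses made-up toy data (not the Ames data) in which the price rises with the rating; the variable names qual and price are assumptions for illustration.

```r
# Toy, made-up data (NOT the Ames data): price rises with an ordinal 1-10 rating
set.seed(7)
qual  <- sample(1:10, 300, replace = TRUE)
price <- 30000 * qual + rnorm(300, sd = 25000)

# Spearman's rank correlation only uses the ordering of the rating levels
rho <- cor(qual, price, method = "spearman")
print(rho)

# Median price per rating level
median_by_level <- tapply(price, qual, median)
print(round(median_by_level))
```

On the real data, the same two calls could take df$OverallQual and df$SalePrice in place of the toy vectors.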

    Beware of False Correlation

    New homes tend to have a higher price. However, we should not rush to conclude that a new house must have a higher selling price: if a newly built house is poorly built, its price will not be high either. What do you think gives a new home a higher value?

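    # Compare the construction year with the remodel year
    df %>% ggplot(aes(x=YearBuilt,y=YearRemodAdd)) + geom_point()
    
    # usia_rumah ("house age" in Indonesian) holds the years between construction and remodel
    df %>% 
      mutate(usia_rumah = YearRemodAdd-YearBuilt, .after ="SalePrice" ) %>% 
      ggplot(aes(x=usia_rumah,y=SalePrice)) + geom_point()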
    

    Insight

    What gives new homes a higher value is a short gap between YearBuilt and YearRemodAdd. The visualization shows that houses whose YearRemodAdd and YearBuilt fall in the same year tend to have a higher sale price.
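
The comparison above can also be summarized as a number per group. This is a minimal sketch on a few made-up rows (not the Ames data, so the resulting medians say nothing about real houses); the column name remod_gap is an assumption for illustration. In the Ames data description, YearRemodAdd equals the construction date when there was no remodeling or addition, so a gap of 0 usually means no remodel was recorded.

```r
# Toy, made-up rows (NOT the Ames data)
toy <- data.frame(
  YearBuilt    = c(1950, 1975, 2005, 2007, 1990, 2006),
  YearRemodAdd = c(1995, 1975, 2005, 2008, 2004, 2006),
  SalePrice    = c(120000, 110000, 250000, 270000, 160000, 240000)
)

# Years between construction and the recorded remodel date
toy$remod_gap <- toy$YearRemodAdd - toy$YearBuilt

# Median price for houses with a gap of 0 (TRUE) versus a later remodel (FALSE)
median_by_gap <- tapply(toy$SalePrice, toy$remod_gap == 0, median)
print(median_by_gap)
```

Since older houses can also have a gap of 0, it is worth checking YearBuilt alongside the gap before reading too much into the group medians.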

    Haunted places(?)

    Pay attention to the following scatter plot

    df %>% ggplot(aes(x=GrLivArea,y=SalePrice)) + geom_point()
    

    On the right there are two houses with a very large GrLivArea (above-grade living area) but a low SalePrice. Let's analyze why these two houses are cheap.

    # Zoom in on the houses with GrLivArea > 4500
    df %>% ggplot(aes(x=GrLivArea,y=SalePrice)) + 
      geom_point()+
      coord_cartesian(xlim=c(4500,6000))
    # Keep only the houses with GrLivArea above 4500 for inspection
    x <- filter(df, GrLivArea>4500)
    

    Insight

    The two houses are probably cheap because, despite their large GrLivArea, they sit on a bad land contour, positioned at a sharp turn and on an incline. Houses with LandContour type 'Bnk' do not have a high selling value.
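
To back up a reading like this, it helps to look at the filtered rows next to a per-category summary. The sketch below uses a few made-up rows (not the Ames data; the Id column and all values are assumptions for illustration) and base R's subset and tapply.

```r
# Toy, made-up rows (NOT the Ames data)
toy <- data.frame(
  Id          = 1:5,
  GrLivArea   = c(1500, 4700, 2100, 5600, 1800),
  SalePrice   = c(180000, 185000, 240000, 160000, 200000),
  LandContour = c("Lvl", "Bnk", "Lvl", "Bnk", "HLS"),
  stringsAsFactors = FALSE
)

# Keep only the very large houses and show the columns worth comparing
big <- subset(toy, GrLivArea > 4500, select = c(Id, GrLivArea, SalePrice, LandContour))
print(big)

# Median price per land contour type, to compare the categories overall
median_by_contour <- tapply(toy$SalePrice, toy$LandContour, median)
print(median_by_contour)
```

With the real data, the filtered data frame x from the code above could be inspected the same way, adding any other columns of interest to the select list.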

    EDA for Top Month Sold

    # Count the houses sold in each month (MoSold) and plot the counts
    df %>% 
      count(MoSold) %>% 
      ggplot(aes(x=factor(MoSold),y=n)) + geom_col()
    

    Insight

    • The most homes are sold in June, July, and May.
    • The fewest homes are sold in February, January, and December.
    • Houses sell best in the middle of the year and worst at the beginning and end of the year.
    • One possible reason is that household finances are more stable in the middle of the year, when there are no major holidays.
    • At the beginning and end of the year, people spend more on vacations and on the Christmas and New Year holidays.
    • People tend to be reluctant to buy a house while they are spending heavily, which may explain the drop in home sales in those months.
    • To counter this, discounts could be offered at the end and the beginning of the year to attract buyers.
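
A ranking of months like the one above can be obtained by counting sales per month and sorting the counts. This is a minimal sketch on a made-up vector of sale months (not the Ames data); the name mo_sold is an assumption standing in for df$MoSold.

```r
# Toy, made-up sale months (NOT the Ames data)
mo_sold <- c(6, 7, 6, 5, 1, 6, 7, 12, 5, 6, 7, 2, 5, 6)

# Count sales per month; fixing the levels to 1:12 keeps months with zero sales
counts <- table(factor(mo_sold, levels = 1:12, labels = month.abb))

# Sort from the busiest to the quietest month
ranked_months <- sort(counts, decreasing = TRUE)
print(ranked_months)
```

Using month.abb as labels makes the output readable without changing the underlying month numbers.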