Dreaming House

Explanatory Data Analysis to Build some Buy&Sell House Company Insights

Greener Houses is an 8 yrs company with a business model focused on buying and selling houses through its digital platform. It is a very technological company and it performs all services based on market analysis decision making.

But as time goes by, Greener Houses has been noticing that its profits and incomes haven't increased since their last big negotiation seven months ago.

Holding this important role as a technological symbol in the market, the company decided it was time to hire a Data Scientist to analyze the situation and to come up with better insights and business opportunities on the house market. The CEO is very proud of his idea and is counting on finding a professional very well qualified to help him increase Greener Houses revenue.

The company's main strategy is to buy some good standard houses in attractive locations at low prices, and then sell these houses at higher prices. The more the difference between buying and selling, the higher the company's profit on the deal; thus, the business revenue increases. However, that is not as simple as it sounds. Houses have many attributes which could turn them into great deals or not for both buyers and sellers. Location and period of the year (summer or winter, for instance) could also directly affect the houses pricing.

And that is where a Data Scientist will greatly come in handy for Greener Houses businesses to come.

Table of Contents

  1. Business and Analysis Understanding

    Business Problem
    Business Assumptions
    Dataset Description
    Solution Planning
    Hypothesis

  2. Imports

  3. Descriptive Analysis
  4. Data Cleaning and Preparation
  5. Exploratory Data Analysis

    Business Problem Nº1
    Business Problem Nº2

  6. Communicating Results

    Evaluating Hypothesis
    Graphical Data Visualization

  7. Conclusion

Business and Analysis Understanding

Business Problem

There are still some important bullet points about Greener Houses we must take into consideration. The CEO has been very clear about their strategies and he doesn't accept revenues losses. Also, he expects to receive some valuable insights that can be useful and quickly applicable. This way, to come into a solution for the company's business, there are three core questions to be answered:

At the very end, Greener Houses will receive a final report with all the renovation, buying, and selling recommendations, along with the description of the decision making.

Business Assumptions

It is not unusual that we face outliers on datasets. Outliers are observations that lie a notable difference from other observations. In other words, an outlier can be defined as a data point that differs significantly from other values in a random sample from a population. Outliers may cause serious problems in statistical analyses due to the unexpected variability they reflect in many statistical measures. Thereby, these so-called errors are sometimes excluded from the dataset.

However, it is very important to understand the reason for these outliers to be there in the first hand because not always an outlier will be seen as an error, since it could merely be a reflection from seasonality, for example. Consequently, it is also the data scientist role to identify these outliers, to define the reasons for their appearance, and to decide if they should be considered abnormal, meaning they would be excluded or not from the statistical analysis.

IQR Method of Outlier Detection

For Greener Houses business problems applied the IQR method to detect outliers on the dataset the company provided. This method consists in finding the Inter-Quartile Range (IQR), and then calculating the Lower and Upper Bounds to determine the values that will be considered as outliers or not. To have a better understanding on this outlier detection method, here are some useful definitions:

The difference between Q3 and Q1 is called the Inter-Quartile Range or simply IQR.

    
   IQR = Q3 - Q1
    

After defining the IQR range, we created new boundaries that we called decision ranges, where any point lying outside these ranges were considered as outliers. They were as given below:

    
   Lower Bound: (Q1 - 1.5 * IQR)
   Upper Bound: (Q3 + 1.5 * IQR)
    

This way, any data more than the Upper Bound or less than the Lower Bound were considered as an outlier.

Dataset Description

Greener Houses provided a dataset containing houses sale prices and many attributes for each one of them. This is where the company wants to start evaluating the opportunities and it is important to understand what this data tell us about the houses pricing.

Atributte Type Description
id Integer Unique ID for each home sold.
date Object Date of the home sale.
price Float Price of each home sold.
bedrooms Integer Number of bedrooms.
bathrooms Float Number of bathrooms, where .5 accounts for a room with a toilet but no shower.
sqft_living Integer Square footage of the apartments interior living space.
sqft_lot Integer Square footage of the land space.
floors Float Number of floors.
waterfront Integer A dummy variable for whether the apartment was overlooking the waterfront or not.
view Integer An index from 0 to 4 of how good the view of the property was.
condition Integer An index from 1 to 5 on the condition of the apartment.
grade Integer An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
sqft_above Integer The square footage of the interior housing space that is above ground level.
sqft_basement Integer The square footage of the interior housing space that is below ground level.
yr_built Integer The year the house was initially built.
yr_renovated Integer The year of the house’s last renovation.
zipcode Integer What zipcode area the house is in.
lat Float Lattitude.
long Float Longitude.
sqft_living15 Integer The square footage of interior housing living space for the nearest 15 neighbors.
sqft_lot15 Integer The square footage of the land lots of the nearest 15 neighbors.

Solution Planning

To come up with a solution, we will need to analyze how the houses prices are affected by each attribute presented on the company's dataset. With this understanding, it'll be possible to answer the business problem questions and also to evaluate some hypothesis created upon the data.

But it is very important to make a solution planning prior to start analyzing the data, and that is because a it allows us to see the steps we need to take to find the results wee seek. A solution planning is nothing more than our goals roadmap. With that having said, here are the steps we'll take for each question to be answered:

Each region has its specific set of considerations for pricing the houses. It means that it wouldn't be very accurate to directly compare prices for houses in different regions, even though they have the same attributes. With this in mind, the houses will be first separated by region, represented by its zipcodes. Then, the median will be the value for which houses prices are split in two halves: the 50% below and the 50% above the median. Finally, the houses in good conditions and priced below the median value will have a Buy status. On the other hand, houses priced above the median will have a Not to buy status.

Some set of conditions will be up to be evaluated prior to the final decision and so the house will hold an On Hold status up to this moment. This status will then be changed on the next few steps. For instance, houses in poorly conditions and priced way below the median would have to be carefully evaluated to see if it's worthy to buy and renovate it or not.

Similar to the region variance, it is very likely that seasonality might affect the houses pricing along the year. Thus, it is highly recommended that the company avoids to buy houses in high seasons and try to sell it in low seasons. Understanding how the seasonality works on the houses pricing might be a great deal for the future businesses, since it might prevent the company to face difficulties to sell the houses at the intended prices.

If Greener Houses gets to buy a house cheaper than the corresponding median region and seasonality, a suggested selling price is to increase in 30% the amount paid for the purchase. That is because sometimes good houses can be sold at lower prices if the owner needs to make a deal in a rush. This way, the company can go for a higher markup.

On the other hand, if the house is bought at a higher price than the median, it is not very reasonable to look for big markups, or the company might find it difficult to get the deal done. A suggestion is that Greener Houses sell these houses targeting a 10% markup average.

Hypothesis

Business hypothesis are assumption that the value proposition and strategy builds on. In other words, they are all the things that would have to be true for the business idea to work. We will test these hypothesis and, as a result, support or reject them; these might help us having insights about the business other than only answering the business problems. They are evaluated under the section Evaluating Hypothesis, right after the main business problems are analyzed.

1. Imports

Importing packages and functions:

Loading the dataset into a pandas dataframe:

The data shape is (21613 rows x 21 columns). It means that it has 21613 registered houses and 21 attributes for each one of them.

2. Descriptive Analysis

Descriptive analysis is the stage at which we become familiar with the data. It allows us to have a better understanding of the data we are manipulating, and to answer questions like "What is happening?", helping decision makers to make decisions right away. Also, it is represented by brief descriptive coefficients that summarize a given set of data, which can be either a representation of the entire or a sample of a population.

Usually, descriptive analysis might be useful for two main purposes:

  1. To provide basic information about variables in a dataset;
  2. And to highlight potential relationships between variables.
Besides that, it can be broken down into three types of measures:
  • Measures of Central Tendency: describe the "average" member of the population of interest. They include:
  • - Mean: the sum of a variable's values divided by the total number of values.
    - Median: the middle value of a variable.
    - Mode: the value that occurs most often.
  • Measures of Variability (Dispersion): provide information about the spread of data. They include:
  • - Range: the difference between the smallest and largest values in the data.
    - Variance: the most commonly used measure of dispersion. It is calculated by taking the average of the squared differences between each value and the mean.
    - Standard Deviation: the square root of the variance.
  • Measures of Association: indicate whether two variables are related.
  • - Correlation: to measure the strength of the relationship between numeric variables.
These are some of the statistical parameters available to describe a dataset, but they are not the only ones. Finally, most of the descriptive analysis results are displayed graphically.

To perform this descriptive analysis, the four bullet points are:

For the purpose of this data analysis, there will be no need for changing data types at this moment. The attribute date will have its type modified later on for the sections to come.

This dataset has no missing values for any of its columns.

There are some ids duplicated here; but before removing anything, we have to make sure they are really duplicated and not only meant to be there more than once. As this set of data relates to houses for sale and it includes the date they were first announced, it wouldn't be wrong to infer that if the duplicated ids have different announced dates, it could probably be because they couldn't be sold by the time of its first announcing. So, the owner took his second shoot and put it for sale once again in another season, changing its sale price, or not, for better suiting hers or his needs.

We will then make sure these are truly undesirable data.

Since there are no records of the same house being announced on the same day, we'll consider these as "partial duplication", and they'll not be removed because they'll have influence on the median price calculation later on.

There are some real outliers on this dataset, since it is unlikely to have a house with 33 bedrooms but less than 2 bathrooms. Also, some of the square footage of the houses are too small to support a house this big. But there's no need to bother with these data for now, since a predictive model will not be performed here and these houses will probably fall into the "Not to Buy" category for prices reason. If they occasionally fall into the "Buy" category, we'll track them on a reduced dataset and it will be easier to identify and remove them.

3. Data Cleaning and Preparation

Regarding the fact the dataset the company provided does not include any missing data, data with undesirable types, or any other issue that could interfere on the analysis, we'll consider the set of data as already cleaned and prepared. In fact, there might be some suspicious data if we were to perform a predictive model, since it appears on the descriptive statistics that we could have some outliers presented (like houses with 33 bedrooms). However, because we are only performing an explanatory and descriptive analysis, there will be no data removed from this dataset up to this point.

Moreover, we may be able to see how the attributes correlate with one another. This can be very useful when creating hypothesis. But it's important to remember that correlation does not imply causality, and every result must be prior analyzed before drawing conclusions.

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. This piece of information allows analysts to hypothesize under pairs of variables that are most likely to affect each other.

To have a better understanding on what these values mean:

Yet, sometimes a spurious correlation may occur. When two random variables track each other closely on the matrix, it is easy to suspect correlation, or a relationship between the two factors, meaning that a change affects the other. However, many times these aligned movements are coincidental or caused by a third factor that affects the first two. This can be also called a confounding factor and scientists who analyze data must be on the lookout for spurious relationships all the time.

A practical example can be seen on the correlation results for the variables price and sqft_living15. It is very unlikely that the price of a house is so highly affected by the square footage of interior housing living of its nearest neighbors. Maybe this result is seen because the nearest houses have similar interior living square footage as the house we are analyzing, since they are located on a close or even on the same neighborhood. Still, it doesn't necessarily imply that the house we are interested in will always have a similar square footage as its neighbors. Thus, to say that a house price is directly correlated to its own living space square footage is more likely to be accurate than to imply that it is directly affected by its neighbors living spaces.

4. Exploratory Data Analysis

After understanding some of the data behavior, we have tools to answer the questions we have stated on the Business Problem section. We will start grouping houses by region to calculate the median price for that specific region, and then setting a condition for whether the company should buy a specific house or not. Then, the price variation due to seasonality will be analyzed to update the previous analysis.

Business Problem Nº1

When the house's price is below the median price of its region and has a condition index of 3 or higher, it is evaluated as a good deal and the company is advised to buy it. When the house's price is higher than the median price of its region and has a condition index of 3 or lower, it is evaluated as a risky deal and the company is advised to not buy it.

On the other hand, if the house's price is just a little bit higher than the median price of its region and both its condition and grade indexes are well rated, than the deal is put on hold for now. The final decision for these cases will be make on the next steps, after verifying some hypothesis and also the seasonality interference in sale prices.

If we were to conclude our analysis here, Greener Houses would have 10554 houses analyzed as good deals to buy, 10899 houses as businesses to avoid, and another 160 houses still in need of some more analysis prior decision making. However, there are still some important information to take into consideration, such as seasonality, that we'll consider on the next few topics, and that might change some of the results presented so far.

Business Problem Nº2

Now we need to understand how seasonality might affect houses pricing. Grouping houses by region and by seasonality (Summer, Spring, Fall and Winter), and then calculating the median price for each specific set of conditions will give us a broader view of price variation. Thus, it might interfere with the decision making process.

With data properly organized and some meaningful statistic parameters calculated, the houses will get an updated status considering new set of conditions to buy or not to buy a house.

Differently from the previous set of considerations, here we also consider as good deals houses with prices higher than the median for its own season, but lower than the max median among all seasons calculated over the same zipcode. This is because the company can make a good negotiation if it sells houses in a season when prices rise. Moreover, houses priced 60% or less of its season's median and that shape a good condition, holding a grade of at least 4, are considered as houses to renovate.

Finally, if the house is either overpriced or depreciated, it will be receive a not to buy label. The houses we put on hold on the previous section might have its status updated into "buy to renovate" or "not to buy", considering these new conditions.

To see the impact of these changes, we can print a new summary counting the labels and compare it with the one we generated before.

After setting up the new considerations, 1286 more houses now hold a "Buy" label and only 14 houses are up to be renovated. This is a significant changing on the business, since the company has now many other good deals available if some actions are taken.

Knowing this, it is now possible to indicate Greener Houses when to sell the houses and at what prices. The following table will only display houses over a "Buy" status.

The next table will display houses holding a "Buy to renovate" status. At this point, we'll take some other issues in consideration.

The first considerations is that houses will have up to 40% their own prices available to spend in renovation so their condition ratings increase. Both second and third conditions relate to the profits. If a house's price is 50% or less the median price for the season, its selling price will be calculated as 60% more the house's price after the investments on renovation. On the other hand, if the house's price is more than 50% the median price for the season, its selling price will be 40% more the house's price after the outlay on renovation.

This analysis provides Greener Houses with a complete summary containing all the houses status, sales prices, sales seasons, profits, and outlays on renovation when required. Now the company can use these information to improve its decision-making process, and also to have a better understanding on its own businesses, incomes and outcomes on each particular deal.

However, to have an even better looking on the results of this analysis, we'll provide Greener Houses with a neat graphical data visualization, performed on the next section.

5. Communicating Results

Evaluating Hypothesis

Here we can see whether to support or reject a hypothesis. All hypothesis here are evaluated both by means of graphs and calculations.

H1: Overlooking the Waterfront

Testing if houses overlooking the waterfront are 30% more expensive, on average.

From a first look on the chart, it is possible to assume that this hypothesis will be reject even without confirming through calculations. However, to make it totally visible, the numbers below describe the accurate relationship between the two cases.

REJECTED

H2: Built Prior to 1995

Testing if houses with a construction date prior to 1995 are 50% cheaper, on average.

We can also see that the second bar represents more than 50% of the first one. It means that it is not possible that houses built prior to 1995 are 50% cheaper than houses built after 1995. However, the accurate relationship between the two cases are calculated below.

REJECTED

H3: Prices for 2 or 3 Bathrooms

Testing if houses with 2 bathrooms are 35% cheaper than houses with 3 bathrooms, on average.

It's possible to confirm from the graph that houses with 3 bathrooms are more expensive than houses with 2 bathrooms on average. We can also see that the first bar represents more than 50% of the second one. It means that it is possible that houses with 2 bathrooms are close to 35% cheaper than houses with 3 bathrooms. However, to have the number calculating the accurate relationship between the two cases, the percentage is shown as below.

SUPPORTED

H4: Grades and Prices

Testing if houses with grades 9 or higher are 75% more expensive than houses with grades 7 or lower, on average.

From the graph, it's noticeable that houses graded 9 or higher have more than twice the price of houses graded 7 or less. It means that this hypothesis will probably be rejected. However, to have an accurate relationship, the numbers below calculate the average prices for the two cases, and the percentage as displayed.

REJECTED

H5: With or Without Basements

Testing if houses with basements have a total area (sqft lot) 40% larger than houses without basements.

From a first look on the chart, it is possible to assume that this hypothesis will be reject. However, to make it clear, the numbers below describe the accurate relationship between the two cases.

REJECTED

H6: MoM positive Growth

Testing if houses with 3 bathrooms have a MoM (Month over Month) positive growth within time.

Houses with 3 bathrooms have a miscellaneous MoM (Month over Month) price variation. Thus, we are not able to state, form this set of data, that this variation will always grow positively within time. Additionally, to support the results, we can also see how the price behave over time on the graphic below.

REJECTED

Graphical Data Visualization

Under this section, the business problems answers are found in a graphical shape, communicating the tables results on the previous sections in a different and more readable format. Here are the graphics you might find:

Prices Dispersion

Prices Dispertion for Houses to Buy

Median Prices by Zipcode

Total Number of Houses to Buy

Average Profit on Houses to Buy by Region

Total Sales by Season

Houses to Renovate Summary

6. Conclusion

As a result of this analysis, Greener Houses has about 11840 good deals on the construction market. In addition, the company could still have other 14 opportunities if it is willing to invest on renovation prior to selling.

If the company chooses to invest only on the two more profitable regions out of the seventy available, it would still have 216 houses stated as good deals. This would represent a rate of 3 to 4 houses per month over the next 4 to 6 years, totalizing an average profit of U\$234.344,38 per house and about U\$700.000 to U\$1.000.000 per month.

Now Greener Houses is capable of making its decisions based on the market opportunities, the financial returns, and also on the company revenue available to invest on these businesses. Moreover, the best time for selling and the profits for each deal will also be available for the company to evaluate prior to the decision-making.