Have you heard a lot about the power of machine learning but found it too complex to know where to start? Today I'll walk you through predicting house prices with simple linear regression, one of the most basic machine learning algorithms. Working through this hands-on project should give you a more intuitive feel for how machine learning works.
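Before we start, let's talk about linear regression. The concept is simple: imagine plotting some points on a coordinate system and drawing the straight line that passes as close to them as possible. More precisely, linear regression picks the line that minimizes the sum of squared vertical distances between the points and the line.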
Predicting house prices is a good application. Suppose we want to understand the relationship between house size and price. If you plot each house's size against its price, what pattern would you expect? In many cases the points fall roughly along a straight line, with larger houses tending to cost more.
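To make "fitting a line" concrete, here is a minimal sketch of the least-squares calculation for a single feature, using only NumPy. The five data points are invented for illustration, and the variable names are my own choices; the library we use later does the equivalent work for us.

```python
import numpy as np

# Five made-up (size, price) points, purely for illustration
size = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
price = np.array([155.0, 205.0, 275.0, 325.0, 395.0])

# Least squares for one feature:
#   slope     = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
#   intercept = y_mean - slope * x_mean
size_mean, price_mean = size.mean(), price.mean()
slope = np.sum((size - size_mean) * (price - price_mean)) / np.sum((size - size_mean) ** 2)
intercept = price_mean - slope * size_mean

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
# prints: slope = 3.000, intercept = 1.000
```

The points do not lie exactly on a line, yet the formula still returns one best-fitting slope and intercept. That is the whole idea behind the model we train below.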
Before we officially start, we need to prepare some tools. Python's scientific computing ecosystem is very powerful; we'll mainly use these libraries:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
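# Optional: use the SimHei font so Chinese characters in labels render correctly
# (only needed for Chinese labels, and SimHei must be installed)
plt.rcParams['font.sans-serif'] = ['SimHei']
# Keep the minus sign rendering correctly when a non-default font is used
plt.rcParams['axes.unicode_minus'] = False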
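Here is a quick explanation of this code. NumPy handles numeric arrays, pandas handles tabular data, scikit-learn provides the regression model and the train/test split, and matplotlib draws the plots. The last two lines configure a font that can display Chinese characters and make sure minus signs still render correctly.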
Next, let's create some simulated data. In actual work, you might read data from a CSV file or database, but to make the concept clearer, we'll start with simulated data:
np.random.seed(42)  # Set random seed for reproducibility
# Generate 100 house sizes (square meters): mean 100, standard deviation 20
house_size = np.random.normal(100, 20, 100)
# Price is 3 times size plus Gaussian noise (mean 0, standard deviation 10)
house_price = house_size * 3 + np.random.normal(0, 10, 100)
data = pd.DataFrame({
    'Size': house_size,
    'Price': house_price
})
print(data.head())
I personally think simulated data has a big advantage: you know exactly how the data was generated (here, price is 3 times size plus random noise), so when you check the model later you can tell whether it has really learned the pattern in the data.
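For comparison, loading real data from a CSV file, as mentioned earlier, might look roughly like the sketch below. The column names match our simulated data, but the three rows and the file name `houses.csv` are hypothetical; an in-memory string stands in for the file so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for a real file. In practice you would pass a path such as
# "houses.csv" (a hypothetical name) to pd.read_csv instead.
csv_text = """Size,Price
85.0,260.5
102.3,301.2
120.8,372.9
"""

real_data = pd.read_csv(io.StringIO(csv_text))
print(real_data.head())    # first rows, same layout as our simulated DataFrame
print(real_data.dtypes)    # confirm both columns were parsed as numbers
```

Once the data is in a DataFrame with `Size` and `Price` columns, the rest of the workflow stays the same.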
Before modeling, it's important to develop the habit of observing the data. Let's plot the data to see:
plt.figure(figsize=(10, 6))
plt.scatter(data['Size'], data['Price'], alpha=0.5)
plt.xlabel('House Size (square meters)')
plt.ylabel('House Price (ten thousand yuan)')
plt.title('Relationship between House Size and Price')
plt.grid(True)
plt.show()
What do you observe from this graph? Can you see that there is indeed a linear relationship between size and price? This is exactly why we can use linear regression.
Now for the exciting part: modeling. We split the data into training and testing sets, fit the model on the training set, and print the parameters it learned:
# scikit-learn expects a 2D feature array, hence the double brackets
X = data[['Size']].values
y = data['Price'].values
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(f'Slope (price per square meter): {model.coef_[0]:.2f}')
print(f'Intercept: {model.intercept_:.2f}')
I'd like to share a little tip: in actual work, I often print out the model parameters to take a look. Why? Because these parameters often have a real-world meaning. In this case, the slope represents how much the price increases for each additional square meter. Since we generated the data with a slope of 3, the printed slope should come out close to 3.
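Because we know the generating formula, we can check the learned parameters against it. The sketch below rebuilds the same simulated data and model so it runs on its own; `true_slope` and `true_intercept` are names I introduced, with values taken from the formula `house_size * 3` plus zero-mean noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the simulated data: price = 3 * size + noise
np.random.seed(42)
house_size = np.random.normal(100, 20, 100)
house_price = house_size * 3 + np.random.normal(0, 10, 100)

X_train, X_test, y_train, y_test = train_test_split(
    house_size.reshape(-1, 1), house_price, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Values implied by the generating formula (assumed names, not part of the article's code)
true_slope, true_intercept = 3.0, 0.0
print(f"slope error:     {model.coef_[0] - true_slope:+.3f}")
print(f"intercept error: {model.intercept_ - true_intercept:+.3f}")
```

Expect the slope to land much closer to its true value than the intercept does. All the houses are far from size 0, so a small change in slope shifts the intercept a lot.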
After training the model, let's see how it performs on the test data it has not seen:
y_pred = model.predict(X_test)
# For regressors, score() returns the coefficient of determination (R²)
r2_score = model.score(X_test, y_test)
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', alpha=0.5, label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('House Size (square meters)')
plt.ylabel('House Price (ten thousand yuan)')
plt.title(f'Linear Regression Prediction Results (R² = {r2_score:.2f})')
plt.legend()
plt.grid(True)
plt.show()
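If you are curious what the R² in the plot title measures, the sketch below computes it by hand and compares it with `model.score`. It uses its own small simulated dataset, with a seed and sample size chosen arbitrarily, so that it runs independently of the code above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small simulated dataset with the same shape of relationship as in the article
rng = np.random.default_rng(0)
X = rng.normal(100, 20, size=(50, 1))
y = 3 * X[:, 0] + rng.normal(0, 10, size=50)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"manual R²:   {r2_manual:.4f}")
print(f"model.score: {model.score(X, y):.4f}")
```

An R² of 1 means the predictions match the data perfectly, while 0 means the model does no better than always predicting the mean price.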
The ultimate goal of learning algorithms is to solve real-world problems. Let's use the trained model to predict prices for some new houses:
new_houses = np.array([[80], [120], [150]])
predicted_prices = model.predict(new_houses)
for size, price in zip(new_houses, predicted_prices):
    # Each row of new_houses is a one-element array, so size[0] is the size
    print(f'Predicted price for a house of {size[0]} square meters: {price:.2f} ten thousand yuan')
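Under the hood, a prediction is just the line equation: intercept plus slope times size. The sketch below checks that and also flags any new size that falls outside the sizes the model saw, since predictions there are extrapolations. For brevity it fits on all 100 simulated points rather than on the training split, which is a simplification of the workflow above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the simulated data (fit on all points here, for brevity)
np.random.seed(42)
house_size = np.random.normal(100, 20, 100)
house_price = house_size * 3 + np.random.normal(0, 10, 100)
model = LinearRegression().fit(house_size.reshape(-1, 1), house_price)

new_sizes = np.array([80.0, 120.0, 150.0])
by_predict = model.predict(new_sizes.reshape(-1, 1))
by_formula = model.intercept_ + model.coef_[0] * new_sizes  # same line, written out

# Flag sizes outside the range seen during fitting
smallest, largest = house_size.min(), house_size.max()
for s, p in zip(new_sizes, by_formula):
    note = "" if smallest <= s <= largest else " (outside the observed size range)"
    print(f"{s:.0f} square meters: {p:.2f} ten thousand yuan{note}")
```

A linear model will happily produce a number for any input, so it is worth checking whether the input resembles the data the model was fitted on.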
Here are a few practical tips I've picked up from real projects:
Data cleaning is important. Actual data often has missing values, outliers, and other issues that need to be addressed before modeling.
Feature engineering is key. In addition to size, price may also be related to location, age, and other factors. Considering all these factors can improve the model's predictive performance.
Model evaluation should be comprehensive. In addition to the R-squared score, you can also look at metrics like Mean Squared Error (MSE) and Mean Absolute Error (MAE).
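As a small illustration of the last tip, the sketch below computes MSE, MAE, and R² with functions from `sklearn.metrics`. The four actual and predicted prices are invented so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented actual and predicted prices (ten thousand yuan), for illustration only
y_true = np.array([240.0, 300.0, 360.0, 450.0])
y_pred = np.array([250.0, 290.0, 370.0, 430.0])

mse = mean_squared_error(y_true, y_pred)   # average of squared errors
mae = mean_absolute_error(y_true, y_pred)  # average of absolute errors
rmse = np.sqrt(mse)                        # same units as the price
r2 = r2_score(y_true, y_pred)

print(f"MSE:  {mse:.2f}")   # 175.00
print(f"MAE:  {mae:.2f}")   # 12.50
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.3f}")
```

MSE punishes large errors more heavily than MAE, so comparing the two gives a hint about whether a few big misses dominate. One caution: the evaluation code earlier stores its score in a variable named `r2_score`, which would shadow the imported function of the same name if both lived in one script.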
Today, through a simple house price prediction case, we learned the basic application of linear regression. Did you find that machine learning isn't as difficult as you imagined? Of course, this is just the beginning; many interesting algorithms await our exploration.
What do you think of this example? Feel free to share your thoughts in the comments. If you're interested in other machine learning algorithms, let me know, and we can discuss them next time.
By the way, if you want to continue learning, I suggest you:
1. Try collecting real house price data to train the model
2. Add more features, such as house age and location
3. Explore other regression algorithms, such as polynomial regression
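As a starting point for the third suggestion, here is a rough sketch of polynomial regression using scikit-learn's `PolynomialFeatures` in a pipeline. The curved price formula is invented for this example and does not come from real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical curved relationship between size and price, invented for this sketch
rng = np.random.default_rng(42)
size = rng.uniform(50, 150, 200)
price = 0.02 * size**2 + 1.0 * size + rng.normal(0, 10, 200)
X = size.reshape(-1, 1)

# Plain straight-line fit
linear = LinearRegression().fit(X, price)

# Degree-2 fit: the pipeline adds a size² column, then runs linear regression on both columns
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, price)

print(f"linear R² (training data):   {linear.score(X, price):.3f}")
print(f"degree-2 R² (training data): {poly.score(X, price):.3f}")
```

On the training data, the degree-2 model can never score worse than the straight line, because the straight line is a special case of it. A fair comparison therefore needs a held-out test set, just like the one used earlier.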
Let's keep moving forward on the path of Python data analysis. Learning is a journey that never really ends.