Linear regression is a statistical method used to model the relationship between a dependent variable (target) and one or more independent variables (predictors). It is one of the simplest and most widely used techniques in machine learning, predictive analytics, and statistics.
Types of Linear Regression
Simple Linear Regression:
- Involves one independent variable.
- Equation: y = β₀ + β₁x + ε, where:
  - y is the dependent variable.
  - x is the independent variable.
  - β₀ is the intercept.
  - β₁ is the slope (coefficient of x).
  - ε is the error term.
Multiple Linear Regression:
- Involves two or more independent variables.
- Equation: y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε.
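For simple linear regression, the intercept β₀ and slope β₁ can be estimated in closed form from the normal equations, β̂ = (XᵀX)⁻¹Xᵀy. Here is a minimal NumPy sketch using made-up data (the numbers are illustrative, not from the post):

```python
import numpy as np

# Hypothetical data: one predictor, plus a column of ones for the intercept
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]

# Ordinary least squares via the normal equations: solve (X^T X) beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = beta
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
# prints: intercept = 2.20, slope = 0.60
```

In practice a library routine (such as scikit-learn's `LinearRegression`, shown later in this post) is preferred, since it handles numerical edge cases more carefully than inverting XᵀX directly.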
Assumptions of Linear Regression
- Linearity: The relationship between the dependent and independent variables is linear.
- Independence: The residuals (errors) are independent.
- Homoscedasticity: The variance of residuals is constant across all levels of the independent variable.
- Normality: The residuals are normally distributed.
- No Multicollinearity (in multiple regression): Independent variables are not highly correlated with each other.
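As a quick illustration of the multicollinearity check, one simple diagnostic is the pairwise correlation matrix of the predictors; values near ±1 flag near-duplicate variables. This sketch uses synthetic data where one predictor is deliberately an almost-exact copy of another (the variables and the 0.9 threshold are illustrative assumptions, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictors: x2 is deliberately almost identical to x1
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)  # highly collinear with x1
x3 = rng.normal(size=100)                   # independent predictor
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between predictors; |r| close to 1 signals multicollinearity
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

A more formal check is the variance inflation factor (VIF), but the correlation matrix is often enough to spot the worst offenders.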
Key Metrics in Linear Regression
R-squared (R²):
- Measures the proportion of variance in the dependent variable explained by the model.
- Values range from 0 to 1, with higher values indicating better fit.
Adjusted R-squared:
- Similar to R² but adjusts for the number of predictors in the model.
Mean Squared Error (MSE):
- Measures the average squared difference between observed and predicted values.
- Lower MSE indicates better fit.
Coefficients:
- Represent the change in the dependent variable for a one-unit change in an independent variable, holding others constant.
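The metrics above follow directly from their definitions: MSE averages the squared residuals, R² compares the residual sum of squares to the total sum of squares, and adjusted R² applies a penalty for the number of predictors p. A small NumPy sketch with hypothetical observed and predicted values:

```python
import numpy as np

# Hypothetical observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.8])
n, p = len(y_true), 1  # n observations, p predictors

mse = np.mean((y_true - y_pred) ** 2)                 # average squared residual
ss_res = np.sum((y_true - y_pred) ** 2)               # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)      # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE = {mse:.4f}, R² = {r2:.4f}, adjusted R² = {adj_r2:.4f}")
```

Note that adjusted R² is always less than or equal to R², and the gap widens as more predictors are added without improving the fit.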
Applications
- Predicting house prices.
- Estimating sales based on advertising spend.
- Analyzing the impact of temperature on energy consumption.
- Financial forecasting.
Example in Python
Here's how you might perform simple linear regression using Python's scikit-learn library:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Sample data (tiny, illustrative dataset)
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [2, 4, 5, 4, 5, 7, 8, 9]
# Split into training and testing sets
# (R-squared is undefined on a single test point, so keep at least two)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))
This is a foundational approach, and additional techniques can make the analysis more robust, such as handling outliers, scaling features, or performing feature selection. Let me know if you'd like to dive deeper into any specific aspect!