Regression Algorithms in Machine Learning

Imagine trying to predict the price of a house based on its size, location, and features. This is where regression algorithms come into play. These algorithms are predictive modelling techniques that identify relationships between variables and use those relationships to forecast future values. From estimating prices to projecting sales, regression provides a sound mathematical framework for making accurate predictions from historical data, making it an essential tool for uncovering patterns and driving data-informed decisions.

Regression algorithms play a vital role in fields such as finance, healthcare, and marketing, where reliable predictions are essential. In finance, for example, a regression model can forecast stock prices or assess investment risk. In healthcare, it can help predict disease progression or patient outcomes. By analysing trends in data, businesses can make better, more consistent decisions.

What are Regression Algorithms?

Regression algorithms are a subset of supervised learning, where the model is trained on a labelled dataset consisting of input features (independent variables) and a continuous target (dependent) variable. The primary goal is to learn the relationship between these variables so that the model can make accurate predictions on new data.

Picture yourself as a data scientist at a healthcare organization, working with a large volume of patient data that includes age, weight, and medical history. With a regression algorithm, you can build a model that predicts a continuous outcome, such as expected recovery time, from those variables.
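
As a quick illustration, the minimal sketch below fits a linear regression model to a handful of fictional patient records; the feature values, the recovery-time target, and all variable names are invented purely for demonstration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Fictional patient data: [age in years, weight in kg]
    patients = np.array([
        [34, 70], [45, 82], [52, 90], [29, 60], [61, 95],
        [48, 78], [37, 72], [55, 88], [42, 76], [66, 99],
    ])
    # Fictional continuous target: recovery time in days
    recovery_days = np.array([12, 16, 20, 10, 25, 18, 13, 22, 15, 28])

    # Train a regression model on the labelled data
    model = LinearRegression()
    model.fit(patients, recovery_days)

    # Predict recovery time for a new patient (age 50, weight 85 kg)
    new_patient = np.array([[50, 85]])
    print(f"Predicted recovery time: {model.predict(new_patient)[0]:.1f} days")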

How is Regression Used in Machine Learning?

Regression is a widely used supervised learning technique that enables machine learning models to perform the following tasks:

  • Predict Continuous Values

    Regression analysis allows machine learning models to predict numerical values, such as forecasting sales revenue for an e-commerce business from historical sales data or advertising spend.


  • Identify Relationships Between Variables

    Regression also reveals which features have the most significant impact on the target variable. For example, in energy demand forecasting, a regression model can show how much temperature, humidity, and wind speed each contribute to overall energy consumption.


  • Make Long-term Forecasts

    Regression algorithms allow machine learning systems to perform time series forecasting, for example predicting stock prices or future product demand.


  • Detect Anomalies

    Regression helps identify outliers by flagging data points that deviate sharply from the values the model expects. For example, companies often use this technique to support fraud detection; a minimal sketch of residual-based anomaly detection follows this list.
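
To make the anomaly-detection idea concrete, here is a minimal, hedged sketch: it fits a linear regression to a small set of fictional transactions and flags points whose residuals (actual minus predicted values) are unusually large. The data, threshold, and variable names are invented for illustration only.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Fictional data: an account's typical monthly spend vs. a single transaction amount
    monthly_spend = np.array([200, 400, 600, 800, 1000, 1200, 1400, 1600]).reshape(-1, 1)
    transaction_amount = np.array([25, 48, 70, 95, 120, 140, 900, 190])  # one suspicious value

    # Fit a regression model describing the expected relationship
    model = LinearRegression()
    model.fit(monthly_spend, transaction_amount)

    # Residuals measure how far each point falls from the fitted line
    residuals = transaction_amount - model.predict(monthly_spend)
    threshold = 2 * residuals.std()  # simple illustrative threshold
    anomalies = np.where(np.abs(residuals) > threshold)[0]
    print("Indices of suspicious transactions:", anomalies)

In practice, extreme outliers also distort the fitted line itself, so robust regression or dedicated anomaly-detection methods are usually preferred for production fraud detection.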

Types of Regression Algorithms

Understanding the different types of regression algorithms helps businesses choose the right tool for their objectives. Each type is designed to model and predict continuous outcomes. Let’s explore them in detail below:

  • Linear Regression

    Linear regression comes in two main forms: simple and multiple. Simple linear regression models the relationship between a dependent variable and a single independent variable by fitting a straight line to the data. Multiple linear regression uses several independent variables to predict the dependent variable, offering a broader perspective on how multiple factors affect the outcome together.

    For example: Estimating a student's academic performance based on the number of hours spent studying (assuming a linear relationship between study time and performance).

    Python Based Example:

    # pip install scikit-learn numpy matplotlib
    
    import numpy as np
    from sklearn.linear_model import LinearRegression
    import matplotlib.pyplot as plt
    
    # Sample data: hours studied vs academic performance
    # For simplicity, we'll use fictional data here.
    hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
    academic_performance = np.array([50, 55, 60, 65, 70, 75, 80, 85, 90, 95])
    
    # Create and train the model
    model = LinearRegression()
    model.fit(hours_studied, academic_performance)
    
    # Predict performance for a new set of hours
    new_hours = np.array([2.5, 4.5, 6.5, 8.5]).reshape(-1, 1)
    predicted_performance = model.predict(new_hours)
    
    # Print the predictions
    for hours, performance in zip(new_hours, predicted_performance):
       print(f"Hours studied: {hours[0]}, Predicted academic performance: {performance:.2f}")
    
    # Plotting the data and the regression line
    plt.scatter(hours_studied, academic_performance, color='blue', label='Data points')
    plt.plot(hours_studied, model.predict(hours_studied), color='red', label='Regression line')
    plt.xlabel('Hours Studied')
    plt.ylabel('Academic Performance')
    plt.title('Linear Regression for Academic Performance')
    plt.legend()
    plt.show()
            
  • Polynomial Regression

    Polynomial regression replaces the straight line of linear regression with a polynomial curve, introducing non-linearity into the model. As the polynomial degree increases, the model can capture more complex patterns; however, higher degrees risk overfitting, where the model fits the training data too closely and generalizes poorly to new data.

    For example: Predicting house prices in a rapidly developing area, where prices may follow a non-linear trend driven by changing economic factors and market dynamics.

    Python Based Example:

    # pip install scikit-learn numpy matplotlib
    
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    import matplotlib.pyplot as plt
    
    # Sample data: house size in square feet vs house price in dollars
    # For simplicity, we'll use fictional data here.
    house_size = np.array([500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, 2750]).reshape(-1, 1)
    house_price = np.array([150000, 180000, 210000, 250000, 290000, 330000, 370000, 420000, 470000, 530000])
    
    # Create a pipeline with polynomial features and linear regression
    polynomial_degree = 3  # Degree of the polynomial
    model = make_pipeline(PolynomialFeatures(degree=polynomial_degree), LinearRegression())
    
    # Train the model
    model.fit(house_size, house_price)
    
    # Predict prices for new house sizes
    new_house_sizes = np.array([600, 1100, 1600, 2100, 2600]).reshape(-1, 1)
    predicted_prices = model.predict(new_house_sizes)
    
    # Print the predictions
    for size, price in zip(new_house_sizes, predicted_prices):
        print(f"House size: {size[0]} sq ft, Predicted price: ${price:.2f}")
    
    # Plotting the data and the regression curve
    plt.scatter(house_size, house_price, color='blue', label='Data points')
    
    # Generate a range of values for house size to plot the polynomial regression curve
    size_range = np.linspace(house_size.min(), house_size.max(), 500).reshape(-1, 1)
    plt.plot(size_range, model.predict(size_range), color='red', label='Polynomial regression curve')
    
    plt.xlabel('House Size (sq ft)')
    plt.ylabel('House Price ($)')
    plt.title('Polynomial Regression for House Prices')
    plt.legend()
    plt.show()
            
  • Ridge and Lasso Regression

    These approaches address overfitting in linear regression by adding penalties that discourage overly complex models:

    Ridge Regression

    In Ridge Regression, a penalty proportional to the sum of the squared coefficients (an L2 penalty) is added to the cost function, which shrinks the coefficients towards zero and simplifies the model.

    For example: In gene expression data analysis, Ridge Regression can model many correlated genes at once while shrinking their coefficients, reducing the risk of overfitting.

    Python Based Example:

    # pip install scikit-learn numpy matplotlib
    
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    import matplotlib.pyplot as plt
    
    # Simulate gene expression data
    np.random.seed(0)  # For reproducibility
    num_samples = 100
    num_genes = 20
    
    # Generate random gene expression data (features)
    X = np.random.randn(num_samples, num_genes)
    # Generate random target variable (e.g., disease severity)
    true_coefficients = np.random.randn(num_genes)
    y = X.dot(true_coefficients) + np.random.randn(num_samples) * 0.1
    
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Initialize and train the Ridge Regression model
    alpha = 1.0  # Regularization strength
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    
    # Predict on test data
    y_pred = model.predict(X_test)
    
    # Calculate the Mean Squared Error
    mse = mean_squared_error(y_test, y_pred)
    print(f"Mean Squared Error: {mse:.2f}")
    
    # Plot the actual vs predicted values
    plt.scatter(y_test, y_pred, color='blue', label='Predicted vs Actual')
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.title('Ridge Regression: Actual vs Predicted Values')
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--')
    plt.legend()
    plt.show()
    
    # Print the coefficients of the model
    print("Ridge Regression Coefficients:")
    for i, coef in enumerate(model.coef_):
        print(f"Gene {i + 1}: {coef:.2f}")
    
    # Print the intercept
    print(f"Intercept: {model.intercept_:.2f}")
            

    Lasso Regression

    Similar to Ridge Regression, Lasso adds a penalty based on the absolute values of the coefficients (an L1 penalty). This pushes some coefficients to exactly zero, effectively performing feature selection.

    For example: When analyzing customer churn, Lasso regression can isolate the factors that most strongly affect churn, such as specific usage habits, while zeroing out the less relevant ones.

    Python Based Example:

    # pip install scikit-learn numpy matplotlib
    
    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    import matplotlib.pyplot as plt
    
    # Simulate customer data
    np.random.seed(0)  # For reproducibility
    num_samples = 200
    num_features = 15
    
    # Generate random customer data (features)
    X = np.random.randn(num_samples, num_features)
    # Generate random target variable (e.g., churn: 1 if churn, 0 if not)
    true_coefficients = np.random.randn(num_features)
    y = (X.dot(true_coefficients) + np.random.randn(num_samples) * 0.5 > 0).astype(int)
    
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Initialize and train the Lasso Regression model
    alpha = 0.1  # Regularization strength
    model = Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    
    # Predict on test data
    y_pred = model.predict(X_test)
    y_pred_binary = (y_pred > 0.5).astype(int)  # Threshold the continuous outputs at 0.5 (Lasso outputs are not probabilities)
    
    # Calculate the Mean Squared Error
    mse = mean_squared_error(y_test, y_pred_binary)
    print(f"Mean Squared Error: {mse:.2f}")
    
    # Plot the actual vs predicted values
    plt.scatter(y_test, y_pred_binary, color='blue', label='Predicted vs Actual')
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.title('Lasso Regression: Actual vs Predicted Values')
    plt.plot([0, 1], [0, 1], color='red', linestyle='--')
    plt.legend()
    plt.show()
    
    # Print the coefficients of the model
    print("Lasso Regression Coefficients:")
    for i, coef in enumerate(model.coef_):
        print(f"Feature {i + 1}: {coef:.2f}")
    
    # Print the intercept
    print(f"Intercept: {model.intercept_:.2f}")
            
  • Decision Tree Regression

    This approach divides the data into smaller subsets based on feature values, creating a tree-like structure for making predictions. Each split in the tree represents a decision on a feature value, and each leaf holds the predicted outcome for the observations that reach it. The method is highly interpretable, as it shows how specific features contribute to each prediction.

    For example: Forecasting student performance based on their study plans, hours, and previous grades. The decision tree groups students based on these factors to estimate outcomes or scores.

    Python Based Example:

    # pip install scikit-learn numpy matplotlib
    
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import mean_squared_error
    import matplotlib.pyplot as plt
    
    # Simulate data
    np.random.seed(0)  # For reproducibility
    num_samples = 200
    
    # Simulated features
    study_hours = np.random.uniform(1, 10, num_samples)  # Hours spent studying
    study_plans = np.random.choice(['daily', 'weekly', 'monthly'], num_samples)  # Study plans
    previous_grades = np.random.uniform(50, 100, num_samples)  # Previous grades
    
    # Target variable (e.g., future performance score)
    performance_scores = (0.3 * study_hours + 
                          0.5 * (study_plans == 'daily').astype(int) + 
                          0.2 * previous_grades + 
                          np.random.normal(0, 5, num_samples))
    
    # Combine the features into a single array (the categorical column is encoded later in the pipeline)
    features = np.column_stack([study_hours, study_plans, previous_grades])
    
    # Define preprocessing for the study_plans column
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), [0, 2]),  # Standardize numerical features
            ('cat', OneHotEncoder(), [1])        # One-hot encode categorical features
        ])
    
    # Define the pipeline with preprocessing and decision tree regressor
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', DecisionTreeRegressor())
    ])
    
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(features, performance_scores, test_size=0.3, random_state=42)
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on test data
    y_pred = model.predict(X_test)
    
    # Calculate and print Mean Squared Error
    mse = mean_squared_error(y_test, y_pred)
    print(f"Mean Squared Error: {mse:.2f}")
    
    # Plot actual vs predicted values
    plt.scatter(y_test, y_pred, color='blue', label='Predicted vs Actual')
    plt.xlabel('Actual Performance Scores')
    plt.ylabel('Predicted Performance Scores')
    plt.title('Decision Tree Regression: Actual vs Predicted Performance')
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--')
    plt.legend()
    plt.show()
    
    # Print feature importances (names follow the transformed feature space:
    # scaled numeric columns first, then the one-hot encoded study plan categories)
    print("Feature Importances:")
    feature_names = model.named_steps['preprocessor'].get_feature_names_out()
    importances = model.named_steps['regressor'].feature_importances_
    for name, importance in zip(feature_names, importances):
        print(f"{name}: {importance:.2f}")
            

How to Implement Regression Algorithms in Practice

Regression algorithms are crucial tools in machine learning, enabling predictive modelling by identifying relationships between variables. The steps below walk through a typical implementation, using a house price dataset as the running example.

  • Step 1: Import Necessary Libraries

    The first step is to import the libraries required for data manipulation, visualization, and model building.

    # Importing necessary libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
            
  • Step 2: Load and Explore the Dataset

    Load the dataset and perform an initial exploration to understand its structure. Here we use a sample house price dataset (house_prices.csv) as the example.

    # Load dataset
    data = pd.read_csv('house_prices.csv')
    
    # Explore the dataset
    print(data.head())
    print(data.describe())
            
  • Step 3: Data Preprocessing

    Handle any missing values, encode categorical variables, and split the data into features (x) and the target variable (y).

    # Handle missing values (if any) by filling numeric columns with their column mean
    data.fillna(data.mean(numeric_only=True), inplace=True)
    
    # Convert categorical variables using one-hot encoding
    data = pd.get_dummies(data, drop_first=True)
    
    # Split the data into features (x) and target variable (y)
    x = data.drop('Price', axis=1)
    y = data['Price']
            
  • Step 4: Split the Data into Training and Testing Sets

    Split the data into training and testing sets so the model can be evaluated on data it has not seen.

    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
            
  • Step 5: Train the Regression Model

    Train a simple linear regression model on the training data.

    # Initialize the Linear Regression model
    model = LinearRegression()
    
    # Train the model
    model.fit(X_train, y_train)
            
  • Step 6: Predict Outcomes

    Use the trained model to predict outcomes on the test data.

    # Make predictions on the test set
    y_pred = model.predict(X_test)
            
  • Step 7: Evaluate the Model

    Evaluate the model’s performance using metrics such as mean squared error (MSE) and R-squared.

    # Calculate Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, y_pred)
    print(f'Mean Squared Error: {mse}')
    
    # Calculate R-Squared (R**2)
    r2 = r2_score(y_test, y_pred)
    print(f'R-squared: {r2}')
            
  • Step 8: Visualize the Results

    Finally, plot the actual vs predicted values to visually assess the model’s accuracy.

    # Plot actual vs predicted values
    plt.scatter(y_test, y_pred)
    plt.xlabel('Actual Prices')
    plt.ylabel('Predicted Prices')
    plt.title('Actual vs Predicted Prices')
    plt.show()
            

Future Trends in Regression Algorithms

Regression algorithms continue to evolve, with several emerging trends shaping their application and effectiveness.

  • Automated Machine Learning (AutoML)

    Automated Machine Learning (AutoML) automates tasks such as feature selection and hyperparameter tuning. This makes it easier to create accurate models efficiently, reducing the need for extensive manual tuning and making machine learning more accessible.

    For example: Tools like Google AutoML or H2O.ai can automatically handle tasks like feature engineering, model selection, and hyperparameter optimization, allowing users to focus more on interpreting results and making business decisions.

  • Deep Learning for Regression

    Deep learning methods are particularly useful for large and complex datasets, capturing intricate patterns that traditional methods might miss. They are especially valuable in tasks such as time series forecasting and predictive analytics where large amounts of data and complex relationships are involved.

    For example: Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are frequently used for predicting stock prices or weather conditions because of their ability to capture temporal dependencies; a simple neural-network forecasting sketch appears at the end of this list.

  • Advancements in Interpretability

    As models become more complex, understanding how predictions are made becomes increasingly important. New techniques in interpretability help users understand model decisions, which is crucial for fields like healthcare and finance where transparency and trust are vital.

    For example: Techniques like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) provide insights into how individual features influence predictions, making complex models more understandable.

  • Speculating on the Future

    Looking ahead, we may see more hybrid approaches combining traditional machine learning and deep learning methods. There will be a growing focus on explainable AI to enhance model transparency. Additionally, the potential of quantum computing could revolutionize regression analysis, offering new ways to process and analyze data.

    For example: Quantum machine learning models could potentially handle large-scale datasets and complex computations more efficiently, leading to advancements in predictive analytics and decision-making.
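
To ground the deep learning trend mentioned above, here is a minimal sketch that forecasts the next points of a synthetic time series from lagged values. It uses scikit-learn's MLPRegressor, a small feed-forward neural network, purely as a simple stand-in; a production forecasting system would more likely use a recurrent model such as an LSTM built in a dedicated deep learning framework. The series, window size, and train/test split are invented for illustration.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Synthetic time series: a noisy sine wave standing in for demand or prices
    rng = np.random.default_rng(0)
    series = np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.05, 200)

    # Build a supervised dataset: predict the next value from the previous 10 values (lag features)
    window = 10
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]

    # A small feed-forward neural network used as a regression forecaster
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
    model.fit(X[:-20], y[:-20])  # train on all but the last 20 points

    # One-step-ahead forecasts on the held-out tail of the series
    predictions = model.predict(X[-20:])
    print("Mean absolute error on held-out points:", np.mean(np.abs(predictions - y[-20:])).round(3))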

Conclusion

Regression algorithms are key to predicting future trends and understanding how multiple factors are related. They are applied across industries such as finance, healthcare, and marketing to make smarter, more accurate decisions, and as the technology advances they will deliver even clearer, more accurate predictions.

AAHENT can be your go-to partner for all your regression algorithm needs. Our team offers expert guidance in model selection, implementation, and optimization, and AAHENT’s customized solutions help your business obtain the most accurate predictions and insights. Connect with AAHENT for support with data preparation and analysis.