Machine Learning Techniques for Accurate Vehicle Price Forecasting

VEHICLE PRICE PREDICTION

NAME :

STUDENT NUMBER :

INTAKE :

LECTURER :

HANDS IN DATE :

TABLE OF CONTENTS

CONTENT

ABSTRACT

PART A

INTRODUCTION, AIM & OBJECTIVES

1.1 Background

1.2 Dataset Overview

1.3 Aim & Objectives

1.4 Scope of work

RELATED WORKS

2.1 Dataset-related study

2.2 Kaggle kernels

PART B

METHODS

1.1 Data

1.2 Software packages

1.3 Machine learning methods

1.4 Evaluation metrics

DATASET PREPARATION

2.1 Dataset Overview

2.2 Exploratory data analysis (EDA)

2.3 Data selection, cleaning, formatting, and exploring

MODEL IMPLEMENTATION

3.1 Modelling

3.2 Model optimization – hyperparameter tuning

MODEL VALIDATION

4.1 Model result evaluation

ANALYSIS & RECOMMENDATION

5.1 Model result analysis

5.2 Result comparison with related works

5.3 Recommendation

CONCLUSION

REFERENCES

ACKNOWLEDGEMENT

ABSTRACT

This project addresses the challenge of estimating the selling price of vehicles based on their attributes. Both buyers and sellers must accurately estimate the value of used cars in order to make informed decisions and ensure fair transactions, which creates a demand for accurate pricing data. Previous studies have explored various data engineering and machine learning techniques for this task; this work aims to contribute by incorporating extensive pre-processing steps, namely missing data imputation, feature engineering, and feature selection, to improve predictive performance. We extend previous work by applying feature engineering informed by exploratory data analysis and by imputing missing data with an unsupervised K-Nearest Neighbour (KNN) algorithm. By addressing these pre-processing issues, our method seeks to deliver more precise and trustworthy predictions of vehicle prices. Additionally, we compare several regression models and tune their hyperparameters to determine which is best for this task, using root mean squared error (RMSE) and R-squared (R2) as performance metrics. Four regression models are implemented and compared: Linear Regression, Random Forest, K-Nearest Neighbour (KNN) Regression, and Ridge Regression. According to the results, the Random Forest Regression model outperforms the other models in this study, obtaining a cross-validated R2 score of 99.83% on the entire dataset. These results support the effectiveness of the selected approach in forecasting car prices.

PART A

INTRODUCTION, AIM & OBJECTIVES

1.1 Background

As consumers, it’s always our goal to get the best deal possible on the things we buy, especially big-ticket items like properties and cars. Families place a high value on getting a car at a fair price because good financial management is crucial for people with average incomes. Families can compare options within the same price range and make better decisions by knowing the general price range for vehicles with particular features. By developing a prediction model that can precisely predict vehicle prices, this project aims to address this problem.

1.2 Dataset Overview

The dataset used in this project originates from the 1985 Ward’s Automotive Yearbook and was downloaded from the UCI Machine Learning Repository. It contains 5130 instances and 25 attributes. These attributes cover a range of car details, including make, engine type, engine location, and fuel type. Each row also carries an insurance risk rating for the vehicle (symboling), and the attributes include the relative average loss payment per insured vehicle year (normalized-losses). This dataset offers helpful information for analysing and forecasting car prices based on various factors, making it a suitable resource for applying machine learning techniques to a real-world problem.

1.3 Aim & Objectives

The development of a predictive model that can precisely predict vehicle selling prices is the main goal of this project. The goal is to use the attributes of the dataset to build a strong, understandable model that makes reliable predictions. The project also seeks to learn more about the variables that have a big impact on car prices.

1.4 Scope of work

This project’s scope includes performing exploratory data analysis (EDA) to understand the features of the dataset and to discover trends and relationships between the features and the target variable. Numerical and categorical variables will be handled using feature engineering techniques backed by EDA. In addition, missing data will be imputed using an unsupervised machine learning algorithm. The project also covers the development, tuning and evaluation of four supervised regression models, with an emphasis on choosing the best model based on performance indicators such as root mean squared error (RMSE) and R2 score.

RELATED WORKS

2.1 Dataset-related study

Table 2.1: Papers citing this dataset

No | References | Summary
1 | “Bayesian Inverse Regression for Supervised Dimension Reduction with Small Datasets” by Cai et al. (2019). | For supervised learning with limited data, the authors propose a Bayesian inverse regression (BIR) framework for dimension reduction.
2 | “Spectral Ranking and Unsupervised Feature Selection for Point, Collective, and Contextual Anomaly Detection” by Zhang et al. (2018). | The authors introduce a framework for anomaly detection that combines unsupervised feature selection with spectral ranking.
3 | “Improving the Interpretability of Classification Rules Discovered by an Ant Colony Algorithm: Extended Results” by Otero et al. (2013). | To make the generated rules easier to understand, the authors propose a method that incorporates domain knowledge constraints and additional post-processing techniques.
4 | “A study of different quality evaluation functions in the cAnt-Miner(PB) classification algorithm” by Medland et al. (2012). | The authors compare and analyse the effects of various evaluation functions on the classification accuracy and rule complexity of the cAnt-Miner(PB) classification algorithm.
5 | “Integration of Data Mining and Data Warehousing: A Practical Methodology” by Usman et al. (2010). | To take advantage of the insights concealed within sizable datasets and improve decision-making processes, the authors propose a methodology that outlines a step-by-step procedure for integrating data mining techniques into the data warehousing process.
6 | “SSC: statistical subspace clustering” by Candillier et al. (2005). | The authors introduce SSC (Statistical Subspace Clustering), a technique for grouping high-dimensional data based on statistical characteristics. It takes the most important statistical features and applies a subspace projection approach to find clusters in the data.

Other researchers’ work in the field served as an inspiration for the techniques used in this assignment. For instance, even though dimension reduction techniques were not directly used in this project, the study “Bayesian Inverse Regression for Supervised Dimension Reduction with Small Datasets” by Cai et al. (2019) served as inspiration for the investigation of these techniques. Consideration of various feature engineering techniques to address categorical variables and enhance the performance of the predictive model was based on the concept of feature selection and anomaly detection from “Spectral Ranking and Unsupervised Feature Selection for Point, Collective, and Contextual Anomaly Detection” by Zhang et al. (2018). The emphasis on feature engineering and choosing significant variables that could offer significant insights into vehicle pricing was motivated by the concept of interpretability in rule-based models as discussed in “Improving the Interpretability of Classification Rules Discovered by an Ant Colony Algorithm: Extended Results” by Otero et al. (2013). Additionally, as stated in “Integration of Data Mining and Data Warehousing: A Practical Methodology” by Usman et al. (2010), the significance of integrating data mining techniques into real-world processes prompted the consideration of practical implementation aspects and ensured the applicability of the developed model in real-world scenarios. Although the clustering method described in “SSC: Statistical Subspace Clustering” by Candillier et al. (2005) was not directly used, it served as inspiration for the investigation of statistical features in the data to find features important for the task of predicting vehicle prices. These studies served as a source of inspiration for the methods used in this assignment, which were created to tackle the particular issue of predicting vehicle selling prices based on their characteristics.

2.2 Kaggle kernels

Table 2.2: Related works from Kaggle Kernel

No | Kaggle References | Technique | Result
1 | “Automobile Dataset – EDA & Linear Regression” by Yashsdholam (March, 2023) | The author used Ordinal Encoder to transform the categorical attributes in the dataset into numerical data types. | R2 = 66.78%
2 | “Linear Regression from scratch using numpy” by Sanskriti (January, 2023). | The author used a single attribute, engine size, to predict vehicle price. | R2 = 76.10%
3 | “Automobile Price Prediction (XGBoost)” by Parth (December, 2022). | The XGBoost supervised machine learning model is used after feature engineering with One Hot Encoding on categorical variables. | R2 = 45.00%
4 | “Automobile Price Modeling with Custom Estimator” by Alephvnull (June, 2022). | The author pre-selects features (engine size, curb weight, horsepower, city mpg, highway mpg, and make) based on domain knowledge. | R2 = 83%

This project used a different strategy from the Kaggle works above. Instead of applying Ordinal Encoder or One Hot Encoding to all categorical variables, as demonstrated by Yashsdholam (2023) and Parth (2022), it used feature engineering based on exploratory data analysis to extract useful information from the categorical attributes. Moreover, unlike the works by Sanskriti (2023) and Alephvnull (2022), which considered only a limited set of characteristics, our approach took into account a wider range of vehicle attributes. This broader scope of features allowed for a more comprehensive analysis and prediction of vehicle prices.

PART B

METHODS

1.1 Data

The dataset for this project, Automobile, was obtained from the UCI Machine Learning Repository. It includes details on a variety of vehicle attributes, including make, horsepower, fuel economy, and price.

1.2 Software packages

Python was chosen as the main programming language for this project because of its extensive library of machine learning tools. Jupyter Notebook was selected as the development environment because its features enable efficient reporting and visualisation during data analysis. For data mining and machine learning tasks, the project depends heavily on well-known libraries such as Pandas, NumPy, and scikit-learn (sklearn). Matplotlib and Seaborn provide convenient charting capabilities for data visualisation and for turning raw data into insights.

1.3 Machine learning methods

The first phase of this project entails a data overview, which includes an analysis of data distributions, a review of data types, and summary statistics of attributes. The goal of this step is to comprehend the dataset better. The project moves on to missing value analysis after the data understanding phase and employs imputation techniques during the data cleaning process to handle missing values appropriately. Exploratory data analysis (EDA) is used to gain insights and design new features from a data-driven perspective after the data has been cleaned. This procedure aids in the discovery of connections, patterns, and potentially crucial variables.

The data is prepared for ingestion into machine learning models after the EDA is finished. The following four regression models are created: Linear Regression, Random Forest Regression, KNN Regression and Ridge Regression. The cleaned data is fitted to each model using 3-fold cross-validation for training and evaluation. The parameters of the baseline models are then fine-tuned, and the tuned results are compared with the baseline results. The best-performing models are those with the lowest RMSE (root mean squared error) and the highest R2 (coefficient of determination). The top model is then trained with the entire dataset, and its final performance metrics are noted. In this step, the model’s ability to forecast car prices is evaluated based on the selected evaluation metrics.

1.4 Evaluation metrics

Two evaluation metrics were used to assess the models’ performance: root mean squared error (RMSE) and the coefficient of determination (R2). R2 measures the proportion of variance in the target variable that is explained by the model, while RMSE is the square root of the mean squared difference between predicted and actual vehicle prices, expressed in the same unit as the price. These measurements provide useful information about the models’ precision and dependability, allowing for comparison and the choice of the most effective strategy.
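The two metrics can be written out directly from their definitions. The sketch below is for illustration only: the function names and the four price pairs are made up and are not taken from the project notebook or the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up prices for illustration only (not taken from the dataset).
actual = [10000.0, 15000.0, 20000.0, 25000.0]
predicted = [11000.0, 14000.0, 21000.0, 24000.0]

print(f"RMSE: {rmse(actual, predicted):.1f}")    # RMSE: 1000.0
print(f"R2:   {r2_score(actual, predicted):.3f}")  # R2:   0.968
```

Every prediction here is off by exactly 1000, so the RMSE is 1000 in price units, while R2 stays close to 1 because these errors are small relative to the spread of the actual prices.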

DATASET PREPARATION

2.1 Dataset Overview

The raw data includes 205 observations (rows), 25 features (columns) and 1 target variable. Table 2.1a below shows the first 5 observations of the automotive dataset.

Table 2.1a: Dataset overview

Here, 16 of the 26 attributes were found to be of non-numerical (object) data types. Because the regression models used in this project require numerical input, these non-numerical attributes had to be pre-processed and converted into numerical data types, such as integers, binaries, or floats.

Table 2.1b: Numerical data summary

Table 2.1b provided above presents an overview of the numerical data in the dataset. Based on the summary, the following observations can be made:

  • At first glance, the count of 5130 in all numerical columns suggests that there are no missing values in these columns (to be verified in later section).
  • The columns symboling, curb-weight, engine-size, city-mpg, and highway-mpg contain only whole numbers, indicating their discrete nature.
  • There is a possibility of skewed data as indicated by the inconsistency between the mean and median values of certain columns.
  • It is worth noting that there are no negative values present in the dataset.

Table 2.1c: Categorical data summary

Analyzing the summary of the non-numerical data in Table 2.1c, the following observations can be made:

  • There are missing values denoted by “?” in some of the non-numerical columns.
  • It is important to note that columns such as normalized-losses, bore, stroke, horsepower, peak-rpm, and price are originally of numerical data type. However, the presence of “?” values in these columns categorizes them as non-numerical data.

Figure 2.1a shows the data distribution for all numerical data in the automotive dataset. Most of the variables exhibit a slight right-skewness. This is particularly evident in the wheel-base, engine-size, compression-ratio, horsepower, and price columns. In a right-skewed distribution, most values are concentrated at the lower end and a long tail extends towards higher values, which indicates potential outliers or extreme values in these variables.

Figure 2.1a: Data distribution
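Right-skewness can also be checked numerically rather than only by eye. The following is a minimal sketch assuming pandas is available; the price values are toy numbers chosen to have a long right tail and are not taken from the dataset.

```python
import pandas as pd

# Toy values for illustration only; a few expensive cars create a long right tail.
prices = pd.Series(
    [7000, 7500, 8000, 8500, 9000, 10000, 12000, 35000, 45000], name="price"
)

skewness = prices.skew()        # positive value indicates a longer right tail
mean_price = prices.mean()      # pulled upwards by the expensive cars
median_price = prices.median()  # robust to the extreme values

print(f"skewness={skewness:.2f}, mean={mean_price:.0f}, median={median_price:.0f}")
```

A positive skewness together with a mean above the median is the numerical counterpart of the mean and median inconsistency noted for Table 2.1b.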

2.2 Exploratory data analysis (EDA)

2.2.1 Correlation analysis

A correlation analysis is carried out to study the linear relationship between each attribute and the target variable, price. Figure 2.2.1a shows the correlation heatmap. From the heatmap, engine-size has the highest correlation with price (~86%), followed by curb-weight and horsepower (~79%), width (~68%) and length (~63%). Highway-mpg and city-mpg are negatively correlated with price, at ~68% and ~65% respectively.

Figure 2.2.1a: Correlation analysis
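A ranking of attributes by their correlation with price, as read from the heatmap, could be produced along the following lines. This is a sketch only: the rows are toy values, and only the column names are taken from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the Automobile dataset.
df = pd.DataFrame({
    "engine-size": [97, 109, 120, 136, 152, 183, 209],
    "highway-mpg": [38, 34, 31, 27, 25, 22, 19],
    "price": [7000, 9500, 11000, 15000, 17500, 28000, 36000],
})

# Pearson correlation of every numeric column with the target,
# ordered by absolute strength so negative correlations are not buried.
corr_with_price = (
    df.select_dtypes("number")
    .corr()["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

Sorting by absolute value keeps strongly negative attributes, such as the mpg columns, next to the strongly positive ones.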

2.2.2 Univariate analysis – vehicle size

Feature engineering technique was demonstrated here by creating a new feature called vehicle size using equation: . Figure 2.2.2 shows the relationship between this new feature with our target variable price.

Figure 2.2.2: Vehicle Size

2.2.3 Multivariate analysis – engine size vs horsepower vs price

A multivariate analysis was performed to investigate the relationship between engine size and horsepower and their impact on vehicle price. Figure 2.2.3 shows the multivariate plots. The scatter plot shows that vehicles with larger engines tend to have greater horsepower and, in turn, higher prices.

Figure 2.2.3: Multivariate analysis

2.3 Data selection, cleaning, formatting, and exploring

Prior to proceeding with data cleaning, an analysis of missing values was conducted to identify the pattern of missing values within the dataset. To achieve this, the “?” values were replaced with NaN. This conversion allows for a more comprehensive assessment of missing values and facilitates subsequent data cleaning steps. Figure 2.3a shows the percentage of missing data for each variable in the dataset.

Figure 2.3a: Missing data analysis

The analysis of missing values revealed several notable findings. The variable with the highest number of missing values is normalized-losses, accounting for approximately 27.1% (3740 rows) of the dataset. Additionally, price, stroke, and bore each contain 3.1% of missing values. Furthermore, peak-rpm and horsepower exhibit 0.4% of missing values each.

Upon identifying the missing values, the next step is to address them through imputation. First, the normalized-losses column is removed because of its substantial number of missing values. For the remaining columns with missing values, the KNN Imputer method is employed. This is an unsupervised learning algorithm that fills each missing value using the values of that feature in the most similar rows (the nearest neighbours), where similarity is measured on the other features. This imputation approach is chosen to retain as much data as possible. By removing the normalized-losses column and performing imputation, the dataset’s predictive power can be preserved.
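The cleaning steps described above can be sketched as follows. This is a minimal illustration assuming pandas and scikit-learn; the rows are toy values, and `n_neighbors=2` is an assumption made for this small example because the report does not state the value that was used.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy rows for illustration only; "?" marks a missing value as in the raw file.
raw = pd.DataFrame({
    "normalized-losses": ["?", "164", "?", "158", "?"],
    "bore": ["3.47", "?", "3.19", "3.19", "3.13"],
    "horsepower": ["111", "154", "102", "?", "110"],
    "engine-size": [130, 152, 109, 136, 131],
})

# Step 1: turn the "?" placeholder into NaN and restore numeric dtypes.
df = raw.replace("?", np.nan).apply(pd.to_numeric)

# Step 2: percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Step 3: drop the column with too many missing values.
df = df.drop(columns=["normalized-losses"])

# Step 4: fill the remaining gaps from the nearest rows.
# n_neighbors=2 is an assumption for this tiny example.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(missing_pct.round(1))
print(imputed)
```

Because the neighbours are found by distance, features on very different scales can dominate the search, so scaling the columns before imputation may be worth considering.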

The table below provides a summary of the feature engineering applied to each non-numerical column based on exploratory data analysis (EDA). The categorical variables have been ranked according to vehicle price and converted into ordinal form. It is important to note that this methodology is data-driven, in contrast to the domain knowledge-driven approach described in the work by Otero et al. (2013).

Table 2.4: Feature engineering on categorical variables

No | Categorical Variables | Transformed variables
1 | Make | Luxury (ordinal value of 3): jaguar, mercedes-benz, porsche. Medium (ordinal value of 2): alfa-romero, audi, bmw, mercury, saab, volvo, peugot. Other (ordinal value of 1): the rest.
2 | Fuel-type | Create a binary column called is_diesel that indicates whether a vehicle uses diesel or not.
3 | Aspiration | Create a binary column called is_turbo that indicates whether a vehicle is a turbo type or a standard type.
4 | Body-style | Luxury (ordinal value of 2): convertible, hardtop. Other (ordinal value of 1): the rest.
5 | Drive-wheels | Create a binary column called is_rwd_drive_wheels that indicates whether a vehicle uses rwd or not.
6 | Engine-location | Create a binary column called is_rear_engine that indicates whether the engine is located at the rear or not.
7 | Engine-type | Map the engine type into ordinal form based on median price: ohcf: 1, ohc: 2, rotor: 3, l: 4, dohc: 5, ohcv: 6, dohcv: 7.
8 | Number of Cylinders | Convert text forms into numerical format. For example: two = 2.
9 | Fuel-system & Number of Doors | An analysis of fuel system and number of doors against vehicle price shows no clear relationship, so these columns are dropped from the dataset.
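The transformations in Table 2.4 can be sketched in pandas as shown below. The category groupings, the engine-type ranks and the names is_diesel, is_turbo, is_rwd_drive_wheels and is_rear_engine come from the table. The raw column names are assumed to follow the UCI dataset, while make_tier, body_style_tier, engine_type_rank and num_cylinders are hypothetical names chosen for this example, and the three rows are toy data.

```python
import pandas as pd

# Toy rows for illustration only; raw column names assumed from the UCI dataset.
df = pd.DataFrame({
    "make": ["jaguar", "audi", "toyota"],
    "fuel-type": ["gas", "diesel", "gas"],
    "aspiration": ["std", "turbo", "std"],
    "body-style": ["sedan", "convertible", "hatchback"],
    "drive-wheels": ["rwd", "fwd", "fwd"],
    "engine-location": ["front", "front", "front"],
    "engine-type": ["dohc", "ohc", "ohc"],
    "num-of-cylinders": ["six", "five", "four"],
    "fuel-system": ["mpfi", "mpfi", "2bbl"],
    "num-of-doors": ["four", "two", "two"],
})

LUXURY_MAKES = {"jaguar", "mercedes-benz", "porsche"}
MEDIUM_MAKES = {"alfa-romero", "audi", "bmw", "mercury", "saab", "volvo", "peugot"}
ENGINE_TYPE_RANK = {"ohcf": 1, "ohc": 2, "rotor": 3, "l": 4,
                    "dohc": 5, "ohcv": 6, "dohcv": 7}
CYLINDER_COUNT = {"two": 2, "three": 3, "four": 4, "five": 5,
                  "six": 6, "eight": 8, "twelve": 12}


def make_tier(make):
    """Rank a make as luxury (3), medium (2) or other (1)."""
    if make in LUXURY_MAKES:
        return 3
    if make in MEDIUM_MAKES:
        return 2
    return 1


# fuel-system and num-of-doors are dropped simply by not carrying them over.
features = pd.DataFrame({
    "make_tier": df["make"].map(make_tier),
    "is_diesel": (df["fuel-type"] == "diesel").astype(int),
    "is_turbo": (df["aspiration"] == "turbo").astype(int),
    "body_style_tier": df["body-style"].isin(["convertible", "hardtop"]).astype(int) + 1,
    "is_rwd_drive_wheels": (df["drive-wheels"] == "rwd").astype(int),
    "is_rear_engine": (df["engine-location"] == "rear").astype(int),
    "engine_type_rank": df["engine-type"].map(ENGINE_TYPE_RANK),
    "num_cylinders": df["num-of-cylinders"].map(CYLINDER_COUNT),
})
print(features)
```

One point to keep in mind with this design is that ranks derived from price on the full dataset carry target information into the features, so deriving them inside each training fold would give a stricter evaluation.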

MODEL IMPLEMENTATION

3.1 Modelling

In this section, four supervised machine learning regression models were built: Linear Regression, Random Forest, K-Nearest Neighbours (KNN) Regression and Ridge Regression. Figure 3.1 compares these models on the evaluation metrics before tuning. Here, KNN performs well, with lower MSE and RMSE values that indicate lower prediction errors than Linear Regression and Ridge Regression. In general, the models have broadly similar R-squared (R2) values, indicating that they explain around 87-94% of the variance in the target variable.

Figure 3.1: Model comparisons before tuning
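A baseline comparison of this kind can be set up with scikit-learn as sketched below. The feature matrix and target here are synthetic stand-ins for the prepared dataset, and the default hyperparameters, `n_estimators=50` and the shuffled 3-fold split are assumptions, so the printed scores are not the project’s results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the prepared feature matrix and price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (15000 + 4000 * X[:, 0] + 2500 * X[:, 1] - 1500 * X[:, 2]
     + rng.normal(scale=500, size=120))

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "KNN Regression": KNeighborsRegressor(),
    "Ridge Regression": Ridge(),
}

# 3-fold cross-validation, as described in the Methods section.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring=("neg_mean_squared_error", "r2")
    )
    mse = -scores["test_neg_mean_squared_error"].mean()
    # RMSE is taken as the square root of the mean cross-validated MSE.
    results[name] = {"MSE": mse, "RMSE": np.sqrt(mse), "R2": scores["test_r2"].mean()}

for name, metrics in results.items():
    print(f"{name:18s} RMSE={metrics['RMSE']:.1f}  R2={metrics['R2']:.3f}")
```

Using the same `KFold` object for every model keeps the folds identical, so differences in the scores reflect the models rather than the split.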

3.2 Model optimization – hyperparameter tuning

The models performed better after being tuned using the supplied parameter grid. The tuned Random Forest model achieved the best scores among the models, with a lower MSE and RMSE than its baseline. The KNN model improved in the same way, and the tuned Ridge model also showed a lower MSE and RMSE. The Linear Regression model remained unchanged after tuning.

Additionally, the best parameter values discovered through tuning for each model are given. The ideal values for RandomForestRegressor are max_depth: None, min_samples_leaf: 1, and min_samples_split: 5. The ideal KNN parameters are algorithm: ‘auto’, leaf_size: 30, n_neighbors: 10, and weights: ‘distance’. Finally, for Ridge, the ideal values for alpha and solver are 10.0 and saga, respectively.

Figure 3.2: Model comparisons after tuning

Sections 3.2.1 to 3.2.3 explain the hyperparameters that were tuned for each regression model.

3.2.1 Random Forest Regressor

  • max_depth: This setting regulates the decision trees’ maximum depth in the random forest. By limiting the depth of the trees, it aids in the control of overfitting.
  • min_samples_split: It specifies the minimum number of samples needed to split an internal node. Higher values stop the tree from splitting nodes that hold only a few samples.
  • min_samples_leaf: It determines the bare minimum of samples that must be present at a leaf node. By defining a minimal threshold for the number of samples in a leaf node, it helps prevent overfitting.

3.2.2 K-Nearest Neighbour (KNN)

  • n_neighbors: It sets the number of neighbours taken into account when making a prediction. A higher value of n_neighbors averages over more neighbours, which gives smoother and less noisy predictions but can blur local patterns in the data.
  • weights: The weight function used for prediction is specified by this parameter. It can be set to “distance,” where closer neighbours have a greater influence, or “uniform,” where points in each neighbourhood are weighted equally.
  • algorithm: It chooses the algorithm that is used to calculate the nearest neighbours. The available options include “auto,” “ball_tree,” “kd_tree,” and “brute,” each with a unique set of computational characteristics.
  • leaf_size: In the KD tree or Ball tree algorithms, this parameter sets the size of the leaf nodes. It affects the speed of tree construction and querying, as well as the memory needed to store the tree.

3.2.3 Ridge Regression

  • alpha: In a Ridge regression, alpha regulates the degree of regularisation. A higher alpha value strengthens regularisation, which lessens model complexity and may help avoid overfitting.
  • solver: The solver used to compute the ridge regression is specified by this parameter. Options include “auto,” “svd,” “cholesky,” “lsqr,” “sparse_cg,” and “saga,” each of which has a unique set of computational characteristics.
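A grid search over the parameters described above could look like the following sketch. The grids are illustrative assumptions that happen to include the best values reported earlier; the report does not list the grids that were actually searched. The data is synthetic, and the Ridge solver is left at its default here to keep the example short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the prepared feature matrix and price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))
y = 15000 + 4000 * X[:, 0] - 2000 * X[:, 1] + rng.normal(scale=500, size=90)

# Illustrative grids (assumptions), one per tunable model.
search_space = {
    "Random Forest": (
        RandomForestRegressor(n_estimators=30, random_state=0),
        {"max_depth": [None, 5], "min_samples_leaf": [1, 2], "min_samples_split": [2, 5]},
    ),
    "KNN": (
        KNeighborsRegressor(),
        {"n_neighbors": [5, 10], "weights": ["uniform", "distance"]},
    ),
    "Ridge": (
        Ridge(),
        {"alpha": [0.1, 1.0, 10.0]},
    ),
}

best = {}
for name, (estimator, grid) in search_space.items():
    # Every grid combination is scored with 3-fold cross-validated RMSE.
    search = GridSearchCV(estimator, grid, cv=3, scoring="neg_root_mean_squared_error")
    search.fit(X, y)
    best[name] = {"params": search.best_params_, "rmse": -search.best_score_}

for name, info in best.items():
    print(f"{name}: RMSE={info['rmse']:.1f}, best params={info['params']}")
```

Linear Regression is left out because it has no regularisation or tree parameters to search over, which is consistent with its results being unchanged after tuning.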

MODEL VALIDATION

4.1 Model result evaluation

Based on the performance of the regression models discussed above, the best model is the Random Forest (RF) Regression Model, which obtained a cross-validated R-Squared score of 99.87%. This model is now trained on the complete dataset so that it can draw on all available data. Using the entire dataset is expected to improve the model’s predictive ability by exposing it to all the information and patterns found in the data. The following outcomes are obtained after training the RF model with the entire dataset:

Cross-Validated MSE: 127269.739296875

Cross-Validated RMSE: 356.7488462446305

Cross-Validated R2: 0.9982607813220263

These evaluation metrics show that the RF model, which was trained on the entire dataset, excels at predicting vehicle selling prices. The model’s predictions are very accurate and have small errors, as shown by the low MSE and RMSE values. The model’s high predictive power and accuracy are evidenced by the high R2 score of 0.9983, which shows that a significant portion of the variance in the target variable is explained by the model.

To illustrate how our best model’s predictions compare with the true values, a scatter plot of Prediction versus True Value is shown in Figure 4.1. For an ideal model, the data points lie close to the diagonal line, where the prediction equals the true value.

Figure 4.1: Model Validation

ANALYSIS & RECOMMENDATION

5.1 Model result analysis

The high cross-validated R-Squared score of 0.9983 demonstrates the Random Forest model’s strong performance in predicting vehicle selling prices: the model accounts for a sizable proportion of the variance in the target variable. The RF model is known for its efficiency in handling regression tasks and capturing complex relationships in the data, so the results match expectations. The high R-Squared score shows that the model recognises the patterns and trends in the dataset, which results in reliable predictions. The results showed no notable surprises or irregularities. The RF model’s superior performance can be attributed to its capacity to take advantage of the relationships between vehicle characteristics and selling prices. It produces its predictions by averaging the outputs of many decision trees, each of which groups vehicles with similar characteristics.

5.2 Result comparison with related works

The obtained results show better performance than the related works listed in Table 2.2 in Part A. The R-Squared score (0.9983) of the RF model was significantly higher than the results in the related works. This implies that the model tuning and feature engineering techniques used in this study were successful in raising the predictive accuracy. Table 5.2 shows the side-by-side result comparison with the other Kaggle kernels.

Table 5.2: Result comparison with other Kaggle kernels

No | Technique | Result | My Result
1 | Ordinal Encoder was used to transform all the categorical attributes in the dataset into numerical data types. | R2 = 66.78% | R2 = 99.83%
2 | A single attribute, engine size, was used to predict vehicle price. | R2 = 76.10% | R2 = 99.83%
3 | The XGBoost supervised machine learning model is used after feature engineering with One Hot Encoding on categorical variables. | R2 = 45.00% | R2 = 99.83%
4 | The author pre-selects features (engine size, curb weight, horsepower, city mpg, highway mpg, and make) based on domain knowledge. | R2 = 83% | R2 = 99.83%

5.3 Recommendation

Based on the analysis of the findings, it is advised to carry out additional research to investigate the integration of extra data sources or different models such as XGBoost, SVM, AdaBoost, and so on, to improve prediction accuracy even more.

CONCLUSION

In conclusion, the goal of this project to use machine learning to create a predictive model for vehicle selling prices was achieved. Important insights and productive results were obtained through the processes of data exploration, cleaning, feature engineering, model training, and evaluation. A significant accomplishment of this project was the Random Forest regression model’s high R-Squared score of 0.9983. This shows that roughly 99.83% of the variation in vehicle selling prices could be explained by the model. The successful application of feature engineering techniques and the careful selection and tuning of the RF regression model are both reflected in the high R-Squared score.

The main limitation is the right-skewed target variable: the dataset used for this project contained fewer observations of high-priced vehicles than of lower-priced ones. Despite the model’s high accuracy, this slightly unbalanced distribution might limit how well it generalises.

Despite the project’s successes, there are still opportunities for future work. One possible research area is the use of other ensemble techniques, such as Gradient Boosting, to further improve the precision and robustness of predictions. In addition, various feature selection techniques could be investigated to find the most important features for predicting vehicle prices and possibly to reduce dimensionality.

REFERENCES

Cai, X., Lin, G., & Li, J. (2019). Bayesian inverse regression for supervised dimension reduction with small datasets. Journal of Statistical Computation and Simulation, 91, 2817-2832.

Zhang, H., Nian, K., Coleman, T.F., & Li, Y. (2018). Spectral ranking and unsupervised feature selection for point, collective, and contextual anomaly detection. International Journal of Data Science and Analytics, 9, 57-75.

Otero, F.E., & Freitas, A.A. (2013). Improving the Interpretability of Classification Rules Discovered by an Ant Colony Algorithm: Extended Results. Evolutionary Computation, 24, 385-409.

Medland, M., & Otero, F.E. (2012). A study of different quality evaluation functions in the cAnt-Miner(PB) classification algorithm. Annual Conference on Genetic and Evolutionary Computation.

Usman, M., & Pears, R. (2010). Integration of Data Mining and Data Warehousing: A Practical Methodology. Int. J. Adv. Comp. Techn., 2, 31-46.

Candillier, L., Tellier, I., Torre, F., & Bousquet, O. (2005). SSC: statistical subspace clustering. European Grid Conference.

Yashsdholam (2023) Automobile Dataset – EDA & Linear Regression. Kaggle [Online]. Available from: https://www.kaggle.com/code/yashsdholam/automobile-dataset-eda-linear-regression#Best-fit-line [Accessed: 25 June 2023].

Sanskriti (2023) Linear Regression from scratch using numpy. Kaggle [Online]. Available from: https://www.kaggle.com/code/csanskriti/linear-regression-from-scratch-using-numpy [Accessed: 25 June 2023].

Parth (2022) Automobile Price Prediction (XGBoost). Kaggle [Online]. Available from: https://www.kaggle.com/code/parthchittawar/automobile-price-prediction-xgboost [Accessed: 25 June 2023].

Alephvnull (2022) Automobile Price Modeling with Custom Estimator. Kaggle [Online]. Available from: https://www.kaggle.com/code/alephvnull/automobile-price-modeling-with-custom-estimator/notebook [Accessed: 25 June 2023].

ACKNOWLEDGEMENT

I want to extend my sincere gratitude to everyone who helped this project be completed successfully. First and foremost, I want to express my gratitude to my supervisor for their support, advice, and insightful comments throughout the project. Their knowledge and suggestions were extremely helpful in determining the course of this work.

I also want to thank the creators of the software packages used in this project as well as the open-source community. The implementation and analysis processes were greatly aided by the availability of tools like sklearn and Jupyter Notebook.

I also want to express my gratitude to the writers of the studies and works that were cited in this project. The methodologies and techniques used benefited greatly from their contributions to the fields of machine learning and data analysis, which also served as an important source of inspiration.

Without all of the aforementioned people’s combined efforts and contributions, this project would not have been possible. I sincerely appreciate their help and encouragement, and I recognise their significant contributions to this work.