Wine Quality Predicting with Python ML

5 Jan 2025 | 7 min read

Introduction to Wine Classification

Around the world, a wide variety of wines are accessible, such as sparkling wines, dessert wines, pop wines, table wines, and vintage wines.

You may be wondering how to tell a good wine from a poor one. Machine learning offers one answer: a classifier can learn to predict a wine's quality rating from its measured properties.

Many different machine learning algorithms can be used to classify wines. Several of them are listed below:

Logistic Regression
SVM
Naïve Bayes
CART
Random forest
Perceptron
KNN
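As a rough orientation, each algorithm in the list above has a counterpart in scikit-learn. The mapping below is a sketch, not the article's own code: the dictionary name `classifiers` is hypothetical, and `DecisionTreeClassifier` is used for CART because scikit-learn's decision trees are based on an optimized version of that algorithm.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical lookup table: algorithm name -> scikit-learn estimator
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "CART": DecisionTreeClassifier(),
    "Random forest": RandomForestClassifier(),
    "Perceptron": Perceptron(),
    "KNN": KNeighborsClassifier(),
}

# Every estimator exposes the same fit/predict interface
for name, model in classifiers.items():
    print(f"{name}: {type(model).__name__}")
```

Because all of these estimators share the `fit` and `predict` methods, swapping one classifier for another usually requires changing only a single line.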

Implementing Wine Classification in Python

Now let's walk through a very basic wine classification implementation in Python. It introduces classifiers and shows how to use them in Python for a variety of real-world applications.

1. Modules import

The first step is to import the required modules and libraries into the application. A few foundational modules are needed for handling and plotting the data. The next step is to import each model used in the application from the sklearn library, along with a few more sklearn utility functions.
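The import step could look roughly like the sketch below. The exact set of imports in the original program is not shown, so this selection is an assumption based on the steps that follow (loading, plotting, splitting, normalizing, and the two models used in section 3). `MinMaxScaler` in particular is one possible choice for the normalization step.

```python
# Foundational modules for data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Models used later in the tutorial
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Utility functions for splitting, scaling, and scoring
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler  # assumed choice of normalizer
from sklearn.metrics import accuracy_score
```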

2. Dataset Preparation

The next step is to get our dataset ready. Let me start by providing an overview of the dataset before importing it into our application.

2.1 Introduction to Dataset

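There are 12 features overall. The combined red and white wine data has 6497 observations, while the file loaded in section 2.2 contains 1599 (a count that matches the red wine subset). None of the variables have NaN values. The data is freely available for download.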

The following are the names and descriptions of the 12 features:

Fixed acidity: The wine's fixed acidity level
Volatile acidity: The amount of acetic acid in the wine
Citric acid: The amount of citric acid in the wine
Residual sugar: The amount of sugar left over after fermentation
Chlorides: The amount of salts in the wine
Free sulfur dioxide: The quantity of sulfur dioxide in its free form
Total sulfur dioxide: The quantity of sulfur dioxide in its whole form, including both bound and free forms
Density: The wine's density (mass per unit volume)
pH: How acidic or basic the wine is, on a scale from 0 to 14
Sulphates: A wine additive that can contribute to sulfur dioxide gas (SO2) levels
Alcohol: The amount of alcohol in the wine
Quality: The wine's overall quality score

2.2 Loading the Dataset

Load the dataset and print its basic information, such as column names and data types.
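A minimal sketch of the loading step follows. The helper `load_wine_data` and the file name `winequality-red.csv` are hypothetical, and the four inline rows are only an illustrative sample in the same 12-column layout, so that the example runs without the real file.

```python
import io

import pandas as pd


def load_wine_data(source, sep=","):
    """Read a wine-quality CSV from a file path or file-like object."""
    return pd.read_csv(source, sep=sep)


# Illustrative stand-in for the real file. With the actual data, pass a
# path instead, e.g. load_wine_data("winequality-red.csv")  # hypothetical name
sample_csv = io.StringIO(
    "fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,"
    "free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality\n"
    "7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5\n"
    "7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.9968,3.20,0.68,9.8,5\n"
    "7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.9970,3.26,0.65,9.8,5\n"
    "11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.9980,3.16,0.58,9.8,6\n"
)

data = load_wine_data(sample_csv)
print(data.columns.tolist())  # column names
data.info()                   # data types and non-null counts
```

If the real file uses a different delimiter (some distributions of this dataset use semicolons), pass it through the `sep` argument.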

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

2.3 Cleaning of Data

Cleaning the dataset involves dropping any unnecessary columns and any rows that contain NaN values.
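One way this cleaning step might be written is sketched below. The helper `clean_wine_data`, the column name `Unnamed: 0`, and the small demo frame are all hypothetical. Which columns count as unnecessary depends on how the file was exported.

```python
import numpy as np
import pandas as pd


def clean_wine_data(df, drop_columns=()):
    """Drop the given columns (if present) and any rows containing NaN."""
    cleaned = df.drop(columns=list(drop_columns), errors="ignore")
    cleaned = cleaned.dropna()
    return cleaned.reset_index(drop=True)


# Hypothetical demo frame with a stray index column and two missing values
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "alcohol": [9.4, np.nan, 9.8, 10.0],
    "pH": [3.51, 3.20, 3.26, np.nan],
    "quality": [5, 5, 6, 6],
})

cleaned = clean_wine_data(raw, drop_columns=["Unnamed: 0"])
print(cleaned)
```

On a dataset that has no NaN values, `dropna()` simply returns all rows unchanged.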

2.4 Data Visualization

It is important to visualize the data before processing it any further. The visualization is done in two forms:

Histograms
Scatter plots

Plotting Histograms
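A sketch of how per-column histograms could be drawn with pandas and matplotlib is given here. The data is synthetic (random values standing in for four of the wine columns), so the shapes will not match the real dataset. The output file name is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file; use plt.show() in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for a few wine columns (NOT the real data)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "pH": rng.normal(3.3, 0.15, 200),
    "density": rng.normal(0.997, 0.002, 200),
    "alcohol": rng.uniform(8.5, 14.0, 200),
    "quality": rng.integers(3, 9, 200),
})

# DataFrame.hist draws one histogram per numeric column
axes = data.hist(bins=20, figsize=(10, 8))
plt.tight_layout()
plt.savefig("wine_histograms.png")  # hypothetical output file name
```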

Output:

The histograms show the distribution of each variable's values. The "pH" and "density" values follow a roughly normal distribution.

The majority of the "fixed_acidity" variable's values fall between 7 and 8;
The majority of the "volatile_acidity" variable's values fall between 0.4 and 0.7;
The majority of the "citric_acid" variable's values fall between 0.0 and 0.1;
The majority of the "residual sugar" variable's values fall between 1 and 2.5;
The majority of the "chlorides" variable's values fall between 0.085 and 0.15;
The "free_sulfur_dioxide" variable's majority.
The majority of the "total_sulfur_dioxide" variable's values fall between 0 and 30;
The majority of the "density" variable's values fall between 0.996 and 0.998;
The majority of the "pH" variable's values fall between 3.2 and 3.4;
The majority of the "sulphates" variable's values fall between 0.50 and 0.75;
The majority of the "alcohol" variable's values fall between 9 and 10;
The majority of the "quality" variable's values are 5 and 6.

Plotting Scatterplot
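The sketch below shows one way to draw a scatter plot of two variables with matplotlib. The original article does not say which pairs of columns were plotted, so the choice of "alcohol" against "density" is an assumption, and the values are synthetic rather than taken from the wine data.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file; use plt.show() in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: density is made to fall slightly as alcohol rises
rng = np.random.default_rng(1)
alcohol = rng.uniform(8.5, 14.0, 200)
data = pd.DataFrame({
    "alcohol": alcohol,
    "density": 1.002 - 0.0005 * alcohol + rng.normal(0, 0.0005, 200),
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(data["alcohol"], data["density"], s=10, alpha=0.6)
ax.set_xlabel("alcohol")
ax.set_ylabel("density")
ax.set_title("alcohol vs. density (synthetic data)")
fig.savefig("wine_scatter.png")  # hypothetical output file name
```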

Output:

In a statistical setting, two or more variables are said to be related if a change in the value of one is accompanied by a change in the value of the other (though possibly in the opposite direction). For instance, there is a relationship between the variables "hours worked" and "income earned" if a rise in hours worked is linked to an increase in income earned. Similarly, if "price" and "purchasing power" are considered, an individual's capacity to purchase items diminishes as their price rises (assuming a constant income).

Correlation is a statistical measure, expressed as a number, that indicates the strength and direction of a relationship between two or more variables.

However, a correlation between two variables does not always imply that changes in one variable are the result of changes in the values of the other.

Causation, by contrast, means that one event is the result of the occurrence of the other, so there is a causal link between the two events. Another name for this is cause and effect.

In theory, the distinction between the two kinds of relationships is clear: an event or action can cause another (smoking raises the risk of lung cancer, for example), or it can merely correlate with another (smoking is correlated with alcoholism, but it does not cause alcoholism). In practice, though, it remains challenging to establish cause and effect with certainty.
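The correlation measure discussed above can be computed with pandas. The sketch below uses synthetic values for the "hours worked"/"income earned" and "price"/"purchasing power" examples. All column names and numbers are made up for illustration. `DataFrame.corr()` returns Pearson correlation coefficients by default, where values near +1 or -1 indicate a strong linear relationship and the sign gives its direction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
hours = rng.uniform(20, 60, 100)
price = rng.uniform(5, 50, 100)

df = pd.DataFrame({
    "hours_worked": hours,
    # Income rises with hours worked, plus some noise
    "income_earned": 15 * hours + rng.normal(0, 20, 100),
    "price": price,
    # With a constant budget, fewer items can be bought as the price rises
    "purchasing_power": 1000 / price,
})

corr = df.corr()  # Pearson correlation matrix
print(corr.round(2))
```

The same call on the wine DataFrame would give the pairwise correlations between its 12 columns. A high coefficient there would still say nothing about which variable, if any, causes the other.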

2.5 Train-Test Split and Data Normalization

There is no single optimal percentage for splitting the data into training and testing sets.

A common rule of thumb is the 80/20 split, where 80% of the data is used for training and the remaining 20% for testing.

This step also involves normalizing the dataset.
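The split and normalization described above can be sketched as follows. The feature matrix is random stand-in data with 11 columns, and `MinMaxScaler` is an assumed choice, since the article does not name the normalization method. The scaler is fitted on the training set only, so that no information from the test set leaks into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 100 samples, 11 features, integer quality labels
rng = np.random.default_rng(3)
X = rng.uniform(0, 15, size=(100, 11))
y = rng.integers(3, 9, size=100)

# 80/20 split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```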

3. Wine Classification Model

This program uses two algorithms: SVM and Logistic Regression.

3.1 Support Vector Machine (SVM) Algorithm

The accuracy of the model is around 50%.
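A minimal sketch of training and scoring an SVM classifier with scikit-learn is shown below. It runs on synthetic data from `make_classification` rather than the wine dataset, so its accuracy will differ from the figure quoted above. The kernel, the scaler, and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in with 11 features and 3 classes (NOT the wine data)
X, y = make_classification(
    n_samples=500, n_features=11, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Normalize using statistics from the training set only
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the SVM and measure accuracy on the held-out test set
svm_model = SVC(kernel="rbf")
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_pred)
print(f"SVM accuracy: {svm_accuracy:.2f}")
```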

3.2 Logistic Regression Algorithm

Output:

In this instance, the accuracy also comes out to be around 50%. The primary cause of this is the simple model that we have used.
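For comparison, a logistic regression classifier can be trained in much the same way. As with any sketch on synthetic data, the score printed here says nothing about the wine dataset itself. The `max_iter` value and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with 11 features and 3 classes (NOT the wine data)
X, y = make_classification(
    n_samples=500, n_features=11, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Normalize using statistics from the training set only
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# A higher max_iter gives the solver room to converge
logreg_model = LogisticRegression(max_iter=1000)
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
logreg_accuracy = accuracy_score(y_test, logreg_pred)
print(f"Logistic Regression accuracy: {logreg_accuracy:.2f}")
```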
