5. Descriptive data analysis#

What do we use Python for?#

In ‘DaLi topic 1: Basics of data evaluation’, the most important forms of visualisation and key figures required for data evaluation were presented.

As soon as we work with real data sets, which are usually very large, software helps us to compute key figures and create visualisations. This DaLi topic is therefore about analysing and visualising such data sets using software (Python or R).

To get a first impression of how Python can be used in this context, we use the ‘Tips’ dataset that we already know from ‘DaLi Topic 1: Fundamentals of data analysis’.

Brief description of the data set Tips#

A waiter recorded information on every tip he received in a restaurant over a period of several months. Several variables were recorded:

  • Bill amount in dollars (total_bill)

  • Tip in dollars (tip)

  • Gender of the bill payer (sex)

  • Smokers among the guests (smoker)

  • Day of the week (day)

  • Time of day (time)

  • Size of the group (size)

For each of the following steps, we first briefly explain what the code does; the code and its output follow.

Get initial information about a data set#

The following two methods are available on every pandas DataFrame. They are useful for gaining an initial overview of a dataset and the characteristics of its features.

  • info() provides details about the data structure, such as column names, data types, and the number of non-missing values.

  • describe(include='all') returns summary statistics for each column, including count, mean, standard deviation, minimum, quartiles, and maximum. With include='all', categorical columns are summarised as well (number of unique values, most frequent value, and its frequency).

These functions are ideal for quickly identifying potential issues (e.g. missing values or incorrect data types) and getting a basic sense of the data distribution.

import pandas as pd
tips = pd.read_csv("tips.csv")

tips.info()
tips.describe(include='all')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
        total_bill         tip   sex  smoker  day    time        size
count   244.000000  244.000000   244     244  244     244  244.000000
unique         NaN         NaN     2       2    4       2         NaN
top            NaN         NaN  Male      No  Sat  Dinner         NaN
freq           NaN         NaN   157     151   87     176         NaN
mean     19.785943    2.998279   NaN     NaN  NaN     NaN    2.569672
std       8.902412    1.383638   NaN     NaN  NaN     NaN    0.951100
min       3.070000    1.000000   NaN     NaN  NaN     NaN    1.000000
25%      13.347500    2.000000   NaN     NaN  NaN     NaN    2.000000
50%      17.795000    2.900000   NaN     NaN  NaN     NaN    2.000000
75%      24.127500    3.562500   NaN     NaN  NaN     NaN    3.000000
max      50.810000   10.000000   NaN     NaN  NaN     NaN    6.000000
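The kind of issue these overviews help to spot can also be checked directly. The following is a minimal sketch, assuming a small made-up DataFrame (the name demo and its values are hypothetical and not taken from tips.csv): it counts the missing values per column and stores a text column as a categorical variable.

```python
import pandas as pd

# Hypothetical mini data set with one missing tip (not the real tips data)
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, None, 3.31],
    "day": ["Sun", "Sun", "Sat", "Sat"],
})

# Number of missing values per column
missing = demo.isna().sum()
print(missing)

# Store the text column as a categorical variable
demo["day"] = demo["day"].astype("category")
print(demo.dtypes)
```

A non-zero entry in missing points to a column that needs attention before key figures are computed.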

You can obtain similar information by using the skim() function from the skimpy library.

from skimpy import skim
import pandas as pd
tips = pd.read_csv("tips.csv")

skim(tips)
skimpy summary

Data Summary
  Number of rows     244
  Number of columns  7

Data Types
  string   4
  float64  2
  int64    1

number
  column      NA  NA %  mean   sd      p0    p25    p50   p75    p100   hist
  total_bill  0   0     19.79  8.902   3.07  13.35  17.8  24.13  50.81  ▂█▄▂▁▁
  tip         0   0     2.998  1.384   1     2      2.9   3.562  10     ██▃▁
  size        0   0     2.57   0.9511  1     2      2     3      6      █▂▂

string
  column  NA  NA %  shortest  longest  min     max    chars per row  words per row  total words
  sex     0   0     Male      Female   Female  Male   4.71           1              244
  smoker  0   0     No        Yes      No      Yes    2.38           1              244
  day     0   0     Sun       Thur     Fri     Thur   3.25           1              244
  time    0   0     Lunch     Dinner   Dinner  Lunch  5.72           1              244

Exercise 1#

🧠 What was the maximum tip amount, in dollars, that the waiter received and recorded?

Create bar chart#

Below you can see one way to create a bar chart. The bar() function from the matplotlib library is used to display the frequencies of the categories ‘Female’ and ‘Male’ in the sex column of the dataset.

First, the Pandas library is used to load the dataset tips.csv. Then, the value_counts() function is applied to the sex column to count how often each gender appears. These counts are stored in the variable sex_counts, and the corresponding category labels and values are extracted.

Finally, a simple vertical bar chart is created with plt.bar(). The chart includes a title and axis labels to make the information easier to interpret.

import pandas as pd
import matplotlib.pyplot as plt
tips = pd.read_csv("tips.csv")

sex_counts = tips["sex"].value_counts()
labels = sex_counts.index
values = sex_counts.values

plt.bar(labels, values)

plt.title("Number of guests by gender")
plt.xlabel("Gender")
plt.ylabel("Number")
plt.show()
[Figure: bar chart "Number of guests by gender"]
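The counting step behind the bar chart can be tried in isolation. This is a small sketch on a made-up Series (the values are hypothetical, not the real tips data); it also shows the normalize=True option of value_counts(), which returns relative instead of absolute frequencies.

```python
import pandas as pd

# Made-up payer genders (not the real tips data)
sex = pd.Series(["Male", "Female", "Male", "Male", "Female"])

counts = sex.value_counts()                # absolute frequencies
shares = sex.value_counts(normalize=True)  # relative frequencies, summing to 1

print(counts)
print(shares)
```

Passing shares.index and shares.values to plt.bar() instead of the counts would plot proportions on the y-axis.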

Split bar chart#

The following code creates a split bar chart that shows the number of bills by the gender of the payer, separated by time of day (Lunch vs. Dinner). This is achieved using seaborn’s catplot() function, which creates one subplot for each value of a categorical feature.

The argument col="time" specifies that a separate chart should be created for each value in the time column. The result is two side-by-side bar charts showing the distribution of genders for lunch and dinner, respectively.

This visualization is helpful to compare how the gender distribution of guests differs between the two meal times.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
tips = pd.read_csv("tips.csv")

g = sns.catplot(
    data=tips,
    x="sex",
    kind="count",
    col="time",
)

g.set_titles("Time: {col_name}")
g.set_axis_labels("Gender", "Count")
plt.show()
[Figure: two side-by-side bar charts of the gender counts, one panel per time of day]
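The numbers that such a split bar chart displays can also be obtained as a table. The sketch below uses pd.crosstab() on a made-up DataFrame (the name demo and its six rows are hypothetical, not the real tips data) to count every combination of gender and time of day.

```python
import pandas as pd

# Made-up observations (not the real tips data)
demo = pd.DataFrame({
    "sex":  ["Male", "Female", "Male", "Male", "Female", "Male"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner"],
})

# Rows: gender of the payer, columns: time of day, cells: number of bills
table = pd.crosstab(demo["sex"], demo["time"])
print(table)
```

Each column of the table corresponds to one panel of the chart, and each cell to the height of one bar.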

Exercise 2#

🧠 During lunch the number of female and male guests who paid was roughly equal, but at dinner male guests paid for the meal significantly more often (more than twice as often). True or false?

Create histogram#

The following code creates a histogram that visualizes the frequency distribution of invoice amounts in the dataset. The function plt.hist() from the Matplotlib library is used to display how often invoice amounts within specific value ranges occur.

edgecolor="black" adds clear borders to each bar. Because no bins argument is passed, Matplotlib uses its default of 10 equal-width intervals.

The x-axis represents the invoice amounts in dollars, while the y-axis shows the number of invoices that fall into each interval.

import pandas as pd
import matplotlib.pyplot as plt
tips = pd.read_csv("tips.csv")

plt.hist(
    tips["total_bill"],
    edgecolor="black"
)
plt.title("Frequency distribution of the invoice amount")
plt.xlabel("Invoice amount in $")
plt.ylabel("Number")
plt.show()
[Figure: histogram "Frequency distribution of the invoice amount"]
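To see how a histogram assigns values to intervals, the binning can be computed without drawing anything. This is a sketch with NumPy's histogram() function on made-up bill amounts (the array bills and the choice of five bins are assumptions for illustration, not the real tips data).

```python
import numpy as np

# Made-up bill amounts in dollars (not the real tips data)
bills = np.array([8.5, 12.0, 14.2, 15.9, 17.3, 18.8, 21.4, 24.0, 29.5, 48.3])

# Five equal-width intervals between the smallest and the largest value
counts, edges = np.histogram(bills, bins=5)

# One line per interval: left edge, right edge, number of bills inside
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {n}")
```

plt.hist() accepts the same bins argument, so the number of intervals in the plot can be changed in the same way.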

Create scatter plot#

The following code creates a scatter plot to visualize the relationship between the total bill and the tip amount. The plt.scatter() function from the Matplotlib library is used to plot each observation as a point, where:

  • the x-axis represents the total bill in dollars

  • the y-axis represents the corresponding tip amount

Each point in the diagram corresponds to one row in the dataset. This type of visualization is useful for identifying patterns or trends, for example whether higher bills are associated with higher tips.

import pandas as pd
import matplotlib.pyplot as plt
tips = pd.read_csv("tips.csv")

plt.scatter(
    tips["total_bill"],
    tips["tip"]
)

plt.title("Tip vs. Total Bill")
plt.xlabel("Total Bill ($)")
plt.ylabel("Tip ($)")
plt.show()
[Figure: scatter plot "Tip vs. Total Bill"]
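A pattern seen in a scatter plot can be summarised in a single number, the Pearson correlation coefficient. The following sketch uses the corr() method of a pandas Series on made-up values (the four bills and tips are hypothetical, not the real tips data).

```python
import pandas as pd

# Made-up bills and tips in dollars (not the real tips data)
demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.5, 3.0, 4.0, 6.5],
})

# Pearson correlation: +1 is a perfect increasing linear relationship,
# 0 means no linear relationship, -1 a perfect decreasing one
r = demo["total_bill"].corr(demo["tip"])
print(round(r, 3))
```

A value close to 1, as in this made-up example, means that the points lie close to an increasing straight line.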

Exercise 3#

🧠 What was the tip amount for a bill of approximately $48? (An approximate reading is sufficient.)

Create scatter plot with regression line#

The following code creates a scatter plot that shows the relationship between the total bill and the tip amount. In addition to the individual data points, it includes a regression line, which models the linear relationship between the two variables.

This is done using the lmplot() function from the seaborn library.

  • The parameter x="total_bill" defines the variable on the x-axis,

  • y="tip" defines the variable on the y-axis,

  • and seaborn automatically fits and draws a linear regression line through the data, together with a shaded confidence band (95% by default).

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
tips = pd.read_csv("tips.csv")

sns.lmplot(
    data=tips,
    x="total_bill",
    y="tip"
)

plt.title("Tip vs. Total Bill with Regression Line")
plt.xlabel("Total Bill ($)")
plt.ylabel("Tip ($)")
plt.show()
[Figure: scatter plot "Tip vs. Total Bill with Regression Line"]
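The plot shows the regression line but not its equation. As a sketch of how slope and intercept could be computed separately, the example below fits a straight line with NumPy's polyfit() on made-up values (the arrays are hypothetical, not the real tips data, and this is not necessarily the routine seaborn uses internally).

```python
import numpy as np

# Made-up bills and tips in dollars (not the real tips data)
total_bill = np.array([10.0, 20.0, 30.0, 40.0])
tip = np.array([1.5, 3.0, 4.0, 6.5])

# Least-squares fit of a straight line: tip = intercept + slope * total_bill
slope, intercept = np.polyfit(total_bill, tip, deg=1)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")

# Tip predicted by the fitted line for a bill of $25
predicted = intercept + slope * 25.0
print(f"predicted tip = {predicted:.2f}")
```

The slope states by how many dollars the tip increases on average per additional dollar on the bill.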

Create mosaic plot#

The following code creates a mosaic plot to visualize the relationship between the categorical variables gender (sex) and time of day (time) in the dataset.

A mosaic plot shows the relative frequencies of combinations of categorical values using rectangles whose areas are proportional to the number of observations. In this example:

The x-axis is split according to the values of sex (Female / Male), and each section is further divided based on time (Lunch / Dinner). This allows you to easily see whether, for example, a higher proportion of the bills was paid by men or by women at a particular time of day.

The plot is created using the mosaic() function from the statsmodels library, which is specifically designed for this type of categorical visualization.

import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
tips = pd.read_csv("tips.csv")

mosaic(tips, ['sex', 'time'])

plt.title("Mosaic Plot: Gender vs. Time")
plt.xlabel("Sex")
plt.ylabel("Proportion")
plt.show()
[Figure: mosaic plot "Mosaic Plot: Gender vs. Time"]
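The proportions that determine the size of the rectangles can be computed as a table. The sketch below assumes a made-up DataFrame (the name demo and its six rows are hypothetical, not the real tips data) and uses value_counts() and pd.crosstab() with normalize="index".

```python
import pandas as pd

# Made-up observations (not the real tips data)
demo = pd.DataFrame({
    "sex":  ["Male", "Female", "Male", "Male", "Female", "Male"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner"],
})

# Share of each gender among all bills: the split along the x-axis
sex_share = demo["sex"].value_counts(normalize=True)

# Share of each time of day within each gender: the split inside each section
time_given_sex = pd.crosstab(demo["sex"], demo["time"], normalize="index")

print(sex_share)
print(time_given_sex)
```

Multiplying a gender's share by one of its row entries gives the share of all bills in that combination, which corresponds to the area of one rectangle.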

Create boxplot#

This code uses the boxplot() function from the seaborn library (imported as sns) to create a box-and-whisker plot comparing the distribution of total bill amounts for the two time categories: Lunch and Dinner.

Each box represents the spread of the data for one group and includes:

  • the median (horizontal line inside the box),

  • the interquartile range (the box itself),

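  • the whiskers (lines extending to the most extreme values that lie within 1.5 times the interquartile range of the box),

  • and potential outliers (individual dots beyond the whiskers).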

The x-axis shows the two time categories (time), while the y-axis displays the corresponding invoice amounts (total_bill). This visualization makes it easy to compare whether bills are generally higher at dinner than at lunch.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
tips = pd.read_csv("tips.csv")

sns.boxplot(
    data=tips,
    x="time",
    y="total_bill",
)

plt.title("Boxplot of Total Bill by Time")
plt.xlabel("Time of Day")
plt.ylabel("Total Bill ($)")
plt.show()
[Figure: box plot "Boxplot of Total Bill by Time"]
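The elements of a box plot can be reproduced by hand. The following sketch uses only the Python standard library on made-up bill amounts (the list bills is hypothetical, not the real tips data) and assumes the common 1.5 × IQR rule for marking outliers.

```python
import statistics

# Made-up bill amounts in dollars (not the real tips data)
bills = [10, 12, 13, 14, 15, 16, 17, 18, 19, 45]

# Quartiles with linear interpolation between neighbouring data points
q1, median, q3 = statistics.quantiles(bills, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Fences of the 1.5 * IQR rule; values beyond them count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [b for b in bills if b < lower_fence or b > upper_fence]

print(q1, median, q3, iqr)
print(outliers)
```

In this made-up example only the bill of 45 dollars lies above the upper fence of 24.5 dollars, so it would be the single dot in the plot.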

Determine distribution function#

This code uses the ECDF() class from the statsmodels.distributions.empirical_distribution module to compute the empirical cumulative distribution function of the total_bill values.

An ECDF shows, for each value on the x-axis, the proportion of observations that are less than or equal to that value. The result is a step-shaped curve that increases from 0 to 1. This type of plot is useful to understand how values are distributed across the dataset, for example to estimate what share of bills are below a certain amount.

The function plt.step() is used to draw the ECDF as a step function, and grid lines are added to improve readability.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
tips = pd.read_csv("tips.csv")

ecdf = ECDF(tips["total_bill"])

plt.step(ecdf.x, ecdf.y, where="post")
plt.title("Empirical CDF of Total Bill")
plt.xlabel("Total Bill ($)")
plt.ylabel("Cumulative Probability")
plt.grid(True)
plt.show()
[Figure: step plot "Empirical CDF of Total Bill"]
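The definition of the ECDF is simple enough to implement directly. This is a minimal sketch using only the Python standard library on made-up bill amounts (the list bills and the helper function ecdf are hypothetical and not part of statsmodels).

```python
from bisect import bisect_right

# Made-up bill amounts in dollars (not the real tips data)
bills = [8.5, 12.0, 14.2, 15.9, 17.3, 18.8, 21.4, 24.0, 29.5, 48.3]
sorted_bills = sorted(bills)


def ecdf(x):
    """Proportion of observations that are less than or equal to x."""
    return bisect_right(sorted_bills, x) / len(sorted_bills)


print(ecdf(15.0))  # 0.3, because 3 of the 10 bills are at most $15
print(ecdf(48.3))  # 1.0, because all bills are at most the maximum
```

Reading a value off the plotted curve corresponds to evaluating this function at the chosen amount.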

Exercise 4#

🧠 What percentage of the bills were at most $20?