4. pandas and matplotlib

4. pandas and matplotlib#

Remember the library we installed in the last chapter? That library is called pandas and is arguably the most commonly used library in Python. It allows us to turn any data into a table, which we can then plot and analyze.

Let’s look at it in action:

import pandas as pd
names = ["Alice", "Bob", "Charlie"]
ages = [25, 30, 35]
hobby = ["Completing Python tutorials", "sports", "Completing Python tutorials"]
data = {"Name": names, "Age": ages, "Hobby": hobby}
df = pd.DataFrame(data)
print(df)
      Name  Age                        Hobby
0    Alice   25  Completing Python tutorials
1      Bob   30                       sports
2  Charlie   35  Completing Python tutorials

As you can see, pandas allows us to neatly organize our data into a table.

We can now use pandas group_by function to group our data by a certain column and then apply another function to each group. In this case, we will group by the “Hobby” column and check how many people have the same hobby:

df_grouped = df.groupby("Hobby").size()
print(df_grouped)
Hobby
Completing Python tutorials    2
sports                         1
dtype: int64

Now, lets say we have a csv-file called “prices.csv” that we would normally open in Microsoft Excel that contains data we want to analyze. pandas allows us to read Excel files and turn them into a pandas DataFrame in just one line.

price_data = pd.read_csv("prices.csv")
print(price_data)
      timestamp  difference
0    2022-01-01     -842.52
1    2022-01-02     1490.70
2    2022-01-03     -495.08
3    2022-01-04     -734.92
4    2022-01-05     -610.64
..          ...         ...
360  2022-12-27      112.44
361  2022-12-28     -242.95
362  2022-12-29     -154.67
363  2022-12-30       86.83
364  2022-12-31      -21.54

[365 rows x 2 columns]

It contains timestamps and the differences in a stock course that happened between the timestamps. If we only want to see the data for a specific time period, we can filter the DataFrame by using the “loc” function. In this case, we want to see the data from 2023-01-01 to 2023-01-31:

start_date = "2022-02-01"
end_date = "2023-09-01"
filtered_data = price_data.loc[(price_data["timestamp"] >= start_date) & (price_data["timestamp"] <= end_date)]
print(filtered_data)
      timestamp  difference
31   2022-02-01      503.88
32   2022-02-02      317.67
33   2022-02-03    -1836.18
34   2022-02-04      337.87
35   2022-02-05     4293.08
..          ...         ...
360  2022-12-27      112.44
361  2022-12-28     -242.95
362  2022-12-29     -154.67
363  2022-12-30       86.83
364  2022-12-31      -21.54

[334 rows x 2 columns]

To plot this time-dependent data, we can first turn the timestamp-column into a date format and then plot it.

import matplotlib.pyplot as plt
price_data["timestamp"] = pd.to_datetime(price_data["timestamp"])
price_data.plot(x="timestamp", y="difference")
plt.show()
_images/beb05df876cf7d5e266e0dbb48311ca5da0507ef03411f59a34fc28c62132084.png