Python Implementation on "Storytelling with Data"——Figure 2.6, 2.7: Scatterplot

Dong

本篇文章由 VeriMake 旧版论坛中备份出的原帖的 Markdown 源码生成

原帖标题为：Python Implementation on "Storytelling with Data"——Figure 2.6, 2.7: Scatterplot
原帖网址为：~~https://verimake.com/topics/166~~ （旧版论坛网址，已失效）

原帖作者为：Felix（旧版论坛 id = 28，注册于 2020-04-18 19:59:47）
原帖由作者初次发表于 2020-10-10 17:11:14，最后编辑于 2020-10-10 17:11:14（编辑时间可能不准确）

截至 2021-12-18 14:27:30 备份数据库时，原帖已获得 792 次浏览、0 个点赞、0 条回复

Introduction

"Scatterplots can be useful for showing the relationship between two things". [1]

Code

Import modules

import numpy as np
import matplotlib.pyplot as plt

Set style

plt.style.use('seaborn-whitegrid')

Data

Create data

We use numpy to create our dataset randomly. All the constants are set to make our plot look appropriate.

np.random.seed(7)

x1 = np.random.randint(1000, 1800, 10)
x2 = np.random.randint(1800, 3200, 10)
x3 = np.random.randint(3200, 4000, 10)

data_miles = np.concatenate((x1, x2))
data_miles = np.concatenate((data_miles, x3))

y1 = np.random.randint(1700, 3000, 10)
y2 = np.random.randint(500, 1500, 10)
y3 = np.random.randint(1500, 2500, 10)

data_cost = np.append(y1, y2)
data_cost = np.concatenate((data_cost, y3))
data_cost = data_cost/1000

x = np.mean(data_miles)
y = np.mean(data_cost)

We create variable data_miles as values of data points on x-axis, and data_cost as those on y-axis. The average value of these two variables are assigned to x and y.

Data preview

print(data_miles)
print(data_cost)

Output:

[1175 1196 1537 1502 1579 1211 1615 1348 1185 1398 2335 2145 2166 2354
 2530 2704 2991 2892 2191 2740 3712 3275 3450 3206 3987 3644 3244 3903
 3525 3352]
[1.883 2.649 2.836 2.463 2.913 1.99  2.012 1.901 2.25  2.472 0.994 0.637
 1.355 0.983 0.695 0.572 1.351 1.444 0.757 1.204 1.9   1.874 2.427 1.849
 2.44  2.104 2.264 2.083 1.779 2.022]

Plot

fig, ax = plt.subplots(1, 2, figsize=(9, 3), dpi=150)


"""First plot."""
# scatter plot [2]
ax[0].scatter(data_miles, data_cost, c='grey', s=30) # s parameter changes size of points
# set axis [3]
ax[0].axis([0, 4000, 0.00, 3.00])
# plot the point indicating average x and y values
ax[0].scatter(x, y, c='black', s=60)
# annotate [4]
ax[0].annotate('AVG', [x+100, y])


"""Second plot."""
# for every points in the second plot,
# color orange if it's greater than average value,
# color grey if it's greater than average value,
for i in range (0, len(data_miles)):
    if data_cost[i] > y:
        ax[1].scatter(data_miles[i], data_cost[i], c='orange', s=30)
    else:
        ax[1].scatter(data_miles[i], data_cost[i], c='grey', s=30)
# same as the first plot
ax[1].axis([0, 4000, 0.00, 3.00])
ax[1].scatter(x, y, c='black', s=60)
ax[1].annotate('AVG', [x+100, y])
# plot the line: y = average 
a = np.arange(0, 5000, 1000)
b = np.ones(len(a))*y
ax[1].plot(a, b, '--', c='black', linewidth=0.8)


"""Set some formats."""
# title
ax[0].set_title("Cost per mile by miles driven")
ax[1].set_title("Cost per mile by miles driven")
# x label and y label
ax[0].set_xlabel("Miles driven per month", fontsize=10)
ax[0].set_ylabel("Cost per mile", fontsize=10)
ax[1].set_xlabel("Miles driven per month", fontsize=10)
ax[1].set_ylabel("Cost per mile", fontsize=10)
# remove grid
ax[0].grid(False)
ax[1].grid(False)

Result

Compare between a normal scatterplot and a modified one:

Reference

[1] Cole Nussbaumer Knaflic, Storytelling with Data

[2] matplotlib.axes.Axes.scatter

[3] matplotlib.axes.Axes.axis

[4] matplotlib.axes.Axes.annotate