Plotting two columns at the interval level
One large advantage of having two columns of data at the interval level or higher is that it opens us up to scatter plots, where we put one column on each axis and visualize data points as literal points on the graph. The year and AverageTemperature columns of our climate change dataset are both at the interval level, as differences between their values are meaningful, so let's take a crack at plotting all of the monthly recorded US temperatures as a scatter plot, where the x axis will be the year and the y axis will be the temperature. We hope to notice an upward trend in temperature, as the line graph previously suggested:
x = climate_sub_us['year']
y = climate_sub_us['AverageTemperature']
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(x, y)
plt.show()
The following is the output of the preceding code:
Oof, that's not pretty. There seems to be a lot of noise, and that is to be expected. Every year has multiple towns reporting multiple average temperatures, so it makes sense that we see a vertical column of points at each year.
Let's employ a groupby on the year column to remove much of this noise:
# Let's use a groupby to reduce the amount of noise in the US
climate_sub_us.groupby('year')['AverageTemperature'].mean().plot()
The following is the output of the preceding code:
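To make the effect of the groupby concrete, here is a minimal sketch on made-up data. The toy frame, its City column, and its values are assumptions for illustration only, not rows from the climate dataset. It shows several readings per year collapsing into a single mean per year, which is what removes the vertical spread from the plot.

```python
import pandas as pd

# Made-up readings: three hypothetical towns report a temperature each year.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "City": ["A", "B", "C", "A", "B", "C"],
    "AverageTemperature": [10.0, 14.0, 18.0, 11.0, 15.0, 19.0],
})

# Select the numeric column before aggregating, so that text columns
# such as City are never passed to mean().
yearly_mean = toy.groupby("year")["AverageTemperature"].mean()

# One value per year remains instead of three.
print(yearly_mean)
```

Selecting the column first also avoids asking pandas to average non-numeric columns, which recent pandas versions refuse to do.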
Better! We can definitely see the increase over the years, but let's smooth it out slightly by taking a 10-year rolling mean of the yearly averages:
# A moving average to smooth it all out:
climate_sub_us.groupby('year')['AverageTemperature'].mean().rolling(10).mean().plot()
The following is the output of the preceding code:
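For intuition about what the rolling mean does, the following sketch applies a window of 3 (rather than 10, to keep it short) to a made-up yearly series. The values and the variable names are assumptions for illustration. Each smoothed point is the mean of the current year and the years before it inside the window, so the first window-minus-one entries have no value unless min_periods is lowered.

```python
import pandas as pd

# Made-up yearly averages, indexed by year.
yearly = pd.Series(
    [10.0, 12.0, 11.0, 13.0, 12.0, 14.0],
    index=pd.Index([2000, 2001, 2002, 2003, 2004, 2005], name="year"),
)

# Trailing window of 3 years: the first two entries are NaN because
# a full window is not yet available.
smoothed = yearly.rolling(3).mean()

# Allowing partial windows fills in the early years, at the cost of
# averaging over fewer points there.
smoothed_partial = yearly.rolling(3, min_periods=1).mean()

print(smoothed)
print(smoothed_partial)
```

The same reasoning suggests that the line drawn with rolling(10) starts nine years after the first year in the data.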
So, our ability to plot two columns of data at the interval level has reconfirmed what the previous line graph suggested: there does seem to be a general upward trend in average temperature across the US.
The interval level of data provides a whole new level of understanding of our data, but we aren't done yet.