Re-indexing a Series
Re-indexing in pandas is a process that makes the data in a Series conform to a set of labels. It is used by pandas to perform much of the alignment process and is hence a fundamental operation.
Re-indexing achieves several things:
- Re-ordering existing data to match a set of labels
- Inserting NaN markers where no data exists for a label
- Possibly filling missing data for a label using some type of logic (defaulting to adding NaN values)
Re-indexing can be as simple as assigning a new index to the .index property of a Series. The following demonstrates changing the index of a Series in this manner:
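As a minimal sketch of this, assume a small Series of random values; the data, the seed, and the letter labels are illustrative choices and not taken from any particular dataset:

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): five random values with
# the default integer index 0..4
np.random.seed(1)
s = pd.Series(np.random.randn(5))

# Assigning to .index relabels the existing values in place;
# the new index must supply exactly one label per value
s.index = ['a', 'b', 'c', 'd', 'e']
print(s)
```

Note that the values themselves do not move; only the labels attached to them change.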
Greater flexibility in creating a new index is provided by the .reindex() method. Direct assignment to the .index property requires exactly one label per value, whereas .reindex() accepts a set of labels whose length does not match the number of values:
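The difference can be sketched as follows, assuming an illustrative five-element Series (the values and labels are made up for the example). The direct assignment is attempted on a copy so that the original Series is left untouched:

```python
import numpy as np
import pandas as pd

# Illustrative Series (an assumption): values 0.0..4.0 labelled 'a'..'e'
s = pd.Series(np.arange(5.0), index=['a', 'b', 'c', 'd', 'e'])

# Direct assignment needs one label per value, so supplying
# three labels for five values raises a ValueError
attempt = s.copy()
try:
    attempt.index = ['a', 'b', 'c']
    assignment_failed = False
except ValueError:
    assignment_failed = True
print(assignment_failed)

# .reindex() accepts any number of labels and returns a new Series
shorter = s.reindex(['a', 'b', 'c'])
print(shorter)
```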
The following re-indexes the Series using a set of labels that has new, missing, and overlapping values:
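A minimal sketch, assuming an illustrative Series labelled 'a' through 'e' (the values are arbitrary). The new label list keeps 'a' and 'c', omits 'b', 'd', and 'e', and introduces 'g':

```python
import numpy as np
import pandas as pd

# Illustrative Series (an assumption)
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
              index=['a', 'b', 'c', 'd', 'e'])

# 'a' and 'c' overlap with the original labels, 'g' is new,
# and 'b', 'd', 'e' are missing from the new label list
s2 = s.reindex(['a', 'c', 'g'])
print(s2)

# The original Series still holds all five values
print(len(s))
```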
There are several things that are important to point out about .reindex(). The first is that the result of the .reindex() method is a new Series and not an in-place modification. The new Series has an index made up of the labels passed to the method. The data is copied for each label that exists in the original Series. If a label is not found in the original Series, then NaN is assigned as its value. Finally, rows in the original Series whose labels are not in the new index are dropped.
Re-indexing is also useful when you want to perform an operation on the values of two Series whose labels do not align for some reason. A common scenario is that one Series has labels of integer type and the other has string labels, but the underlying meaning of the labels is the same (this is common when getting data from remote sources). Take the following Series objects as an example:
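A minimal sketch of such a pair, assuming two small illustrative Series: one labelled with the integers 0, 1, 2 and the other with the strings '0', '1', '2'. Adding them shows what happens when pandas tries to align the labels:

```python
import pandas as pd

# Illustrative data (an assumption): same logical labels,
# stored as integers in s1 and as strings in s2
s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])

# The integer 0 and the string '0' are different labels, so no
# pair of values lines up and every entry in the result is NaN
total = s1 + s2
print(total)
```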
Although the meaning of the labels in the two Series is the same, they will not align because their data types are different. This is easily fixed once the problem is identified:
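One possible fix, sketched under the assumption that the string labels are all valid integers, is to convert the index of the second Series to an integer type before performing the operation. The data is the same illustrative pair of Series, recreated here so the example is self-contained:

```python
import pandas as pd

# Illustrative data (an assumption)
s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])

# Convert the string labels to integers so both indexes share a type
s2.index = s2.index.astype(int)

# The labels now match, so the values are added pairwise
total = s1 + s2
print(total)
```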
The .reindex() method has the default action of inserting NaN as a missing value when labels are not found in the source Series. This value can be changed by using the fill_value parameter. The following example demonstrates using 0 instead of NaN:
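A minimal sketch, assuming an illustrative three-element Series; the label 'f' does not exist in it and therefore receives the fill value:

```python
import pandas as pd

# Illustrative Series (an assumption)
s = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# 'f' is not in the original index, so it gets 0 instead of NaN
s2 = s.reindex(['a', 'f'], fill_value=0)
print(s2)
```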
When performing a re-index on ordered data such as a time series, it is possible to perform interpolation, or filling of values. There will be a more elaborate discussion on interpolation and filling in Chapter 10, Time Series Data, but the following examples introduce the concept. Let's start with the following Series:
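A minimal sketch of such a Series, assuming three colour strings at the non-contiguous labels 0, 3, and 5, which is consistent with the fill results described in this section:

```python
import pandas as pd

# Assumed data: three values at labels 0, 3 and 5, leaving
# gaps at labels 1, 2 and 4
s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])
print(s3)
```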
The following example demonstrates the concept of forward filling, often referred to as last known value. The Series is re-indexed to create a contiguous integer index, and by using the method='ffill' parameter, any new index labels are assigned the previously known non-NaN value:
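A minimal sketch, assuming a Series with the values red, green, and blue at labels 0, 3, and 5, re-indexed to the contiguous labels 0 through 6:

```python
import numpy as np
import pandas as pd

# Assumed data: values at labels 0, 3 and 5
s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])

# Re-index to 0..6; each new label takes the value of the
# nearest existing label at or before it
filled_forward = s3.reindex(np.arange(0, 7), method='ffill')
print(filled_forward)
```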
Index labels 1 and 2 are matched to red from label 0, 4 to green from label 3, and 6 to blue from label 5. Labels 0, 3, and 5 already exist and keep their own values.
The following example fills backward using method='bfill':
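A minimal sketch, again assuming a Series with the values red, green, and blue at labels 0, 3, and 5:

```python
import numpy as np
import pandas as pd

# Assumed data: values at labels 0, 3 and 5
s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])

# Re-index to 0..6; each new label takes the value of the
# nearest existing label at or after it
filled_backward = s3.reindex(np.arange(0, 7), method='bfill')
print(filled_backward)
```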
Label 6 has no later label to fill from, so it is set to NaN; 4 is set to the value of label 5 (blue); and 2 and 1 are set to the value of label 3 (green).