Python Algorithmic Trading Cookbook
上QQ阅读APP看书,第一时间看更新

DataFrame manipulation—applying, sorting, iterating, and concatenating

Adding to the previous recipe, this recipe demonstrates more operations that can be performed on DataFrame objects: applying a function to all elements in a column, sorting based on a column, iterating over the rows, and concatenating multiple DataFrame objects vertically and horizontally.

Getting ready

Make sure you have followed the previous recipe before trying out this recipe. Ensure you have df in your Python namespace from the previous recipe.

How to do it…

Execute the following steps for this recipe:

  1. Import the necessary modules
>>> import random
>>> import pandas
  1. Modify the values in the timestamp column of df with a different date and time format DD-MM-YYYY HH:MM:SS:
>>> df['timestamp'] = df['timestamp'].apply(
lambda x: x.strftime("%d-%m-%Y %H:%M:%S"))
>>> df

We get the following output:

            timestamp    open    high     low   close volume
0 13-11-2019 09:00:00 71.8075 71.8450 71.7775 71.7925 219512
1 13-11-2019 09:15:00 71.7925 71.8000 71.7800 71.7925 59252
2 13-11-2019 09:30:00 71.7925 71.8125 71.7600 71.7625 57187
3 13-11-2019 09:45:00 71.7600 71.7650 71.7350 71.7425 43048
4 13-11-2019 10:00:00 71.7425 71.7800 71.7425 71.7775 45863
5 13-11-2019 10:15:00 71.7750 71.8225 71.7700 71.8150 42460
6 13-11-2019 10:30:00 71.8150 71.8300 71.7775 71.7800 62403
7 13-11-2019 10:45:00 71.7750 71.7875 71.7475 71.7525 34090
8 13-11-2019 11:00:00 71.7525 71.7825 71.7475 71.7625 39320
9 13-11-2019 11:15:00 71.7625 71.7925 71.7600 71.7875 20190
  1. Create a new DataFrame object by sorting the close column in ascending order:
>>> df.sort_values(by='close', ascending=True)

We get the following output:

            timestamp    open    high     low   close volume
3 13-11-2019 09:45:00 71.7600 71.7650 71.7350 71.7425 43048
7 13-11-2019 10:45:00 71.7750 71.7875 71.7475 71.7525 34090
2 13-11-2019 09:30:00 71.7925 71.8125 71.7600 71.7625 57187
8 13-11-2019 11:00:00 71.7525 71.7825 71.7475 71.7625 39320
4 13-11-2019 10:00:00 71.7425 71.7800 71.7425 71.7775 45863
6 13-11-2019 10:30:00 71.8150 71.8300 71.7775 71.7800 62403
9 13-11-2019 11:15:00 71.7625 71.7925 71.7600 71.7875 20190
0 13-11-2019 09:00:00 71.8075 71.8450 71.7775 71.7925 219512
1 13-11-2019 09:15:00 71.7925 71.8000 71.7800 71.7925 59252
5 13-11-2019 10:15:00 71.7750 71.8225 71.7700 71.8150 42460
  1. Create a new DataFrame object by sorting the open column in descending order:
>>> df.sort_values(by='open', ascending=False)

We get the following output:

            timestamp    open    high     low   close volume
6 13-11-2019 10:30:00 71.8150 71.8300 71.7775 71.7800 62403
0 13-11-2019 09:00:00 71.8075 71.8450 71.7775 71.7925 219512
2 13-11-2019 09:30:00 71.7925 71.8125 71.7600 71.7625 57187
1 13-11-2019 09:15:00 71.7925 71.8000 71.7800 71.7925 59252
7 13-11-2019 10:45:00 71.7750 71.7875 71.7475 71.7525 34090
5 13-11-2019 10:15:00 71.7750 71.8225 71.7700 71.8150 42460
9 13-11-2019 11:15:00 71.7625 71.7925 71.7600 71.7875 20190
3 13-11-2019 09:45:00 71.7600 71.7650 71.7350 71.7425 43048
8 13-11-2019 11:00:00 71.7525 71.7825 71.7475 71.7625 39320
4 13-11-2019 10:00:00 71.7425 71.7800 71.7425 71.7775 45863
  1. Iterate over df to find the average of open, close, high, and low values for each row:
>>> for _, row in df.iterrows():
avg = (row['open'] + row['close'] + row['high'] +
row['low'])/4
print(f"Index: {_} | Average: {avg}")

We get the following output:

Index: 0 | Average: 71.805625
Index: 1 | Average: 71.79124999999999
Index: 2 | Average: 71.781875
Index: 3 | Average: 71.750625
Index: 4 | Average: 71.760625
Index: 5 | Average: 71.795625
Index: 6 | Average: 71.800625
Index: 7 | Average: 71.765625
Index: 8 | Average: 71.76124999999999
Index: 9 | Average: 71.775625
  1. Iterate column-wise over all the values of the first row of df:
>>> for value in df.iloc[0]:
print(value)

We get the following output:

13-11-2019 09:00:00
71.8075
71.845
71.7775
71.7925
219512
  1. Create a sample time-series data as a list of dictionary objects. Assign it to df_new:
>>> df_new = pandas. DataFrame([
{'timestamp': datetime.datetime(2019, 11, 13, 11, 30),
'open': 71.7875,
'high': 71.8075,
'low': 71.77,
'close': 71.7925,
'volume': 18655},
{'timestamp': datetime.datetime(2019, 11, 13, 11, 45),
'open': 71.7925,
'high': 71.805,
'low': 71.7625,
'close': 71.7625,
'volume': 25648},
{'timestamp': datetime.datetime(2019, 11, 13, 12, 0),
'open': 71.7625,
'high': 71.805,
'low': 71.75,
'close': 71.785,
'volume': 37300},
{'timestamp': datetime.datetime(2019, 11, 13, 12, 15),
'open': 71.785,
'high': 71.7925,
'low': 71.7575,
'close': 71.7775,
'volume': 15431},
{'timestamp': datetime.datetime(2019, 11, 13, 12, 30),
'open': 71.7775,
'high': 71.795,
'low': 71.7725,
'close': 71.79,
'volume': 5178}])
>>> df_new

We get the following output:

            timestamp    open    high     low   close volume
0 2019-11-13 11:30:00 71.7875 71.8075 71.7700 71.7925 18655
1 2019-11-13 11:45:00 71.7925 71.8050 71.7625 71.7625 25648
2 2019-11-13 12:00:00 71.7625 71.8050 71.7500 71.7850 37300
3 2019-11-13 12:15:00 71.7850 71.7925 71.7575 71.7775 15431
4 2019-11-13 12:30:00 71.7775 71.7950 71.7725 71.7900 5178
  1. Create a new DataFrame by concatenating df and df_new vertically:
>>> pandas.concat([df, df_new]).reset_index(drop=True)

We get the following output:

             timestamp    open    high     low   close volume
0 13-11-2019 09:00:00 71.8075 71.8450 71.7775 71.7925 219512
1 13-11-2019 09:15:00 71.7925 71.8000 71.7800 71.7925 59252
2 13-11-2019 09:30:00 71.7925 71.8125 71.7600 71.7625 57187
3 13-11-2019 09:45:00 71.7600 71.7650 71.7350 71.7425 43048
4 13-11-2019 10:00:00 71.7425 71.7800 71.7425 71.7775 45863
5 13-11-2019 10:15:00 71.7750 71.8225 71.7700 71.8150 42460
6 13-11-2019 10:30:00 71.8150 71.8300 71.7775 71.7800 62403
7 13-11-2019 10:45:00 71.7750 71.7875 71.7475 71.7525 34090
8 13-11-2019 11:00:00 71.7525 71.7825 71.7475 71.7625 39320
9 13-11-2019 11:15:00 71.7625 71.7925 71.7600 71.7875 20190
10 2019-11-13 11:30:00 71.7875 71.8075 71.7700 71.7925 18655
11 2019-11-13 11:45:00 71.7925 71.8050 71.7625 71.7625 25648
12 2019-11-13 12:00:00 71.7625 71.8050 71.7500 71.7850 37300
13 2019-11-13 12:15:00 71.7850 71.7925 71.7575 71.7775 15431
14 2019-11-13 12:30:00 71.7775 71.7950 71.7725 71.7900 5178

How it works...

In step 1, you import the pandas package.

Applying: In step 2, you modify all the values in the timestamp column of df by using the apply method. This method takes a function as an input to be applied. You pass a lambda function here which expects a datetime object as a single input, and converts it to a string in the required format using strftime(). (Refer to Converting a datetime object to a string recipe for more details on strftime()). The apply method is called on the timestamp column of df, which is a pandas.Series object. The lambda function is applied to each value in the column. This call returns a new pandas.Series object, which you assign back to the timestamp column of df. Note, after this, the timestamp column of df holds timestamps as string objects, and not datetime objects as earlier.

Sorting: In step 3, you create a new DataFrame object by sorting the close column of df in ascending order. You use the sort_values() method to perform the sorting. Similarly, in step 4, you create a new DataFrame object by sorting the open column of df in descending order.

Iterating: In step 5, you iterate over df using the iterrows() method to find and print the average of open, close, high, and low values for each row. The iterrows() method iterates over each row as an (index, pandas.Series) pair. In step 6, you iterate over all the values of the first row of df using df.iloc[0]. You get the timestamp, open, high, low, close, and volume column values for the first row as the output.

Concatenation: In step 6, you create a new DataFrame similar to the one created in step 2 of Creating a pandas.DataFrame object recipe, and assign it to df_new. You use the pandas.concat() function to create a new DataFrame by vertically concatenating dt and df_new. This implies that a new DataFrame would be created with the rows of df_new appended below the rows of df. You pass a list containing df and df_new as an argument to the pandas.concat() function. Also, to create a fresh index starting from 0, you use the reset_index() method with the argument drop passed as True. If you don't use reset_index(), the indices of the concatenated DataFrame would look something like this—0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4. (Refer to Creating a pandas.DataFrame object recipe to know more about the DataFrame index.)

There's more

You can also use the pandas.concat() function to concatenate two DataFrame objects together horizontally, which is column-wise by, passing the axis argument a value of 1 to the pandas.concat() method. This is shown in the following steps:

  1. Import random module from the Python standard library:
>>> import random
  1. Create a DataFrame object with a single column, open, and random values. Assign it to df1 and print it:
>>> df1 = pandas.DataFrame([random.randint(1,100) for i in 
range(10)], columns=['open'])
>>> df1

We get the following output. Your output may differ:

   open
0 99
1 73
2 16
3 53
4 47
5 74
6 21
7 22
8 2
9 30
  1. Create another DataFrame object with a single column, close, and random values. Assign it to df2 and print it:
>>> df2 = pandas.DataFrame([random.randint(1,100) for i in 
range(10)], columns=['close'])
>>> df2

We get the following output:

   close
0 63
1 84
2 44
3 56
4 25
5 1
6 41
7 55
8 93
9 82
  1. Create a new DataFrame by concatenating df1 and df2 horizontally
>>> pandas.concat([df1, df2], axis=1)

We get the following output. Your output may differ:

    open  close
0 99 93
1 73 42
2 16 57
3 53 56
4 47 25
5 74 1
6 21 41
7 22 55
8 2 93
9 30 82