Python for Finance: Analyze Big Financial Data

c    3.50  Francesc       30
d    4.50      Yves       40
z    5.75     Henry      100

One of the strengths of pandas is working with missing data. To this end, consider the following code that adds a new column, but with a slightly different index. We use the rather flexible join method here:

In [16]: df.join(pd.DataFrame([1, 4, 9, 16, 25],
                     index=['a', 'b', 'c', 'd', 'y'],
                     columns=['squares',]))
           # temporary object
Out[16]:    floats     names  numbers  squares
         a    1.50     Guido       10        1
         b    2.50     Felix       20        4
         c    3.50  Francesc       30        9
         d    4.50      Yves       40       16
         z    5.75     Henry      100      NaN

What you can see here is that pandas by default accepts only values for those indices that already exist. We lose the value for the index y and have a NaN value (i.e., “Not a Number”) at index position z. To preserve both indices, we can provide an additional parameter to tell pandas how to join. In our case, we use how='outer' to use the union of all values from both indices:

In [17]: df = df.join(pd.DataFrame([1, 4, 9, 16, 25],
                          index=['a', 'b', 'c', 'd', 'y'],
                          columns=['squares',]),
                      how='outer')
         df
Out[17]:    floats     names  numbers  squares
         a    1.50     Guido       10        1
         b    2.50     Felix       20        4
         c    3.50  Francesc       30        9
         d    4.50      Yves       40       16
         y     NaN       NaN      NaN       25
         z    5.75     Henry      100      NaN

Indeed, the index is now the union of the two original indices. All missing data points, given the new enlarged index, are replaced by NaN values. Other options for the join operation include inner for the intersection of the index values, left (default) for the index values of the object on which the method is called, and right for the index values of the object to be joined.
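To make the four options concrete, the following is a minimal sketch that rebuilds the example data from the outputs shown in this section and collects the resulting index for each value of how. The variable names squares and index_by_how are introduced here for illustration only and do not appear in the original listings.

```python
import pandas as pd

# Rebuild the example DataFrame as it looked before the outer join
# (values taken from the outputs shown in this section).
df = pd.DataFrame({'floats': [1.5, 2.5, 3.5, 4.5, 5.75],
                   'names': ['Guido', 'Felix', 'Francesc', 'Yves', 'Henry'],
                   'numbers': [10, 20, 30, 40, 100]},
                  index=['a', 'b', 'c', 'd', 'z'])

# The object to be joined; its index contains 'y' instead of 'z'.
squares = pd.DataFrame([1, 4, 9, 16, 25],
                       index=['a', 'b', 'c', 'd', 'y'],
                       columns=['squares'])

# Collect the index labels that survive under each join type.
index_by_how = {how: list(df.join(squares, how=how).index)
                for how in ('left', 'right', 'inner', 'outer')}

for how, labels in index_by_how.items():
    print(how, labels)
```

Only the outer join keeps both y and z, while the inner join keeps neither of them.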

Although there are missing values, the majority of method calls will still work. For example:

In [18]: df[['numbers', 'squares']].mean()
           # column-wise mean
Out[18]: numbers    40
         squares    11
         dtype: float64

In [19]: df[['numbers', 'squares']].std()
           # column-wise standard deviation
Out[19]: numbers    35.355339
         squares     9.669540
         dtype: float64
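These calls work because pandas reductions such as mean and std skip NaN values by default (their skipna parameter defaults to True). The following minimal sketch illustrates this on a standalone Series holding the same values as the squares column after the outer join. The names mean_skip, mean_strict, and n_valid are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the squares column after the outer join; 'z' has no value.
squares = pd.Series([1, 4, 9, 16, 25, np.nan],
                    index=['a', 'b', 'c', 'd', 'y', 'z'])

mean_skip = squares.mean()                # NaN is ignored: mean of the 5 valid values
mean_strict = squares.mean(skipna=False)  # NaN propagates into the result
n_valid = squares.count()                 # number of non-missing entries

print(mean_skip, mean_strict, n_valid)
```

Passing skipna=False is one way to make missing data visible in a result instead of having it silently dropped from the calculation.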

Second Steps with DataFrame Class

From now on, we will work with numerical data. We will add further features as we go, like a DatetimeIndex to manage time series data. To have a dummy data set to work with, generate a numpy.ndarray with, for example, nine rows and four columns of pseudorandom, standard normally distributed numbers: