Python for Finance: Analyze Big Financial Data

c    3.50  Francesc       30
d    4.50      Yves       40
z    5.75     Henry      100

One of the strengths of pandas is working with missing data. To this end, consider the following code that adds a new column, but with a slightly different index. We use the rather flexible join method here:


In [16]: df.join(pd.DataFrame([1, 4, 9, 16, 25],
                     index=['a', 'b', 'c', 'd', 'y'],
                     columns=['squares',]))
           # temporary object
Out[16]:    floats     names  numbers  squares
         a    1.50     Guido       10        1
         b    2.50     Felix       20        4
         c    3.50  Francesc       30        9
         d    4.50      Yves       40       16
         z    5.75     Henry      100      NaN

What you can see here is that, by default, join keeps only the index values of the DataFrame on which it is called. We lose the value for the index y and get a NaN value (i.e., “Not a Number”) at index position z. To preserve both indices, we can provide an additional parameter to tell pandas how to join. In our case, we use how='outer' to take the union of all values from both indices:


In [17]: df = df.join(pd.DataFrame([1, 4, 9, 16, 25],
                     index=['a', 'b', 'c', 'd', 'y'],
                     columns=['squares',]),
                     how='outer')
         df
Out[17]:    floats     names  numbers  squares
         a    1.50     Guido       10        1
         b    2.50     Felix       20        4
         c    3.50  Francesc       30        9
         d    4.50      Yves       40       16
         y     NaN       NaN      NaN       25
         z    5.75     Henry      100      NaN

Indeed, the index is now the union of the two original indices. All missing data points, given the new enlarged index, are replaced by NaN values. Other options for the join operation include inner for the intersection of the index values, left (default) for the index values of the object on which the method is called, and right for the index values of the object to be joined.
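
To make the four join options concrete, the following sketch rebuilds the two objects from the values shown in the outputs above and records which index labels survive each join. The variable names squares and indices are illustrative only and do not come from the text.

```python
import pandas as pd

# Rebuild the DataFrame from the values shown in the outputs above.
df = pd.DataFrame(
    {"floats": [1.5, 2.5, 3.5, 4.5, 5.75],
     "names": ["Guido", "Felix", "Francesc", "Yves", "Henry"],
     "numbers": [10, 20, 30, 40, 100]},
    index=["a", "b", "c", "d", "z"],
)

# The object to be joined; its index has 'y' instead of 'z'.
squares = pd.DataFrame([1, 4, 9, 16, 25],
                       index=["a", "b", "c", "d", "y"],
                       columns=["squares"])

# Collect the index labels that result from each join type.
indices = {how: list(df.join(squares, how=how).index)
           for how in ("left", "right", "inner", "outer")}

for how, labels in indices.items():
    print(how, labels)
```

With these inputs, left keeps the labels a, b, c, d, and z; right keeps a, b, c, d, and y; inner keeps only the four shared labels; and outer keeps all six.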


Although there are missing values, the majority of method calls will still work, because aggregations such as mean and std skip NaN values by default. For example:


In [18]: df[['numbers', 'squares']].mean()
           # column-wise mean
Out[18]: numbers    40
         squares    11
         dtype: float64

In [19]: df[['numbers', 'squares']].std()
           # column-wise standard deviation
Out[19]: numbers    35.355339
         squares     9.669540
         dtype: float64
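
The results above depend on how pandas treats NaN in aggregations. As a hedged illustration, the sketch below rebuilds only the two numeric columns from the outer-join output and compares the default behavior with skipna=False. The names data, skipped, strict, and counts are illustrative and not part of the original session.

```python
import numpy as np
import pandas as pd

# The two numeric columns after the outer join; NaN marks missing values.
data = pd.DataFrame(
    {"numbers": [10, 20, 30, 40, np.nan, 100],
     "squares": [1, 4, 9, 16, 25, np.nan]},
    index=["a", "b", "c", "d", "y", "z"],
)

skipped = data.mean()              # default skipna=True: NaN entries are ignored
strict = data.mean(skipna=False)   # NaN propagates into the result
counts = data.count()              # number of non-missing values per column

print(skipped)
print(strict)
print(counts)
```

Here the default means are computed over the five available values per column, while the strict variant yields NaN for both columns. Checking count() alongside an aggregate is one simple way to see how many observations actually entered the calculation.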

Second Steps with DataFrame Class


From now on, we will work with numerical data. We will add further features as we go, like a DatetimeIndex to manage time series data. To have a dummy data set to work with, generate a numpy.ndarray with, for example, nine rows and four columns of pseudorandom, standard normally distributed numbers:
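
A minimal sketch of such a data set might look as follows. It assumes NumPy's Generator interface and an arbitrary seed of 42 for reproducibility; neither choice comes from the text, and the variable names rng and a are illustrative.

```python
import numpy as np

# Seeded generator so the pseudorandom numbers are reproducible (seed is an assumption).
rng = np.random.default_rng(42)

# Nine rows and four columns of standard normally distributed numbers.
a = rng.standard_normal((9, 4))

print(a.shape)
print(a.round(6))
```

The resulting array has the shape (9, 4) and a floating-point dtype, so it can be passed directly to the DataFrame constructor.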
