c 3.50 Francesc 30
d 4.50 Yves 40
z 5.75 Henry 100
One of the strengths of pandas is working with missing data. To this end, consider the
following code that adds a new column, but with a slightly different index. We use the
rather flexible join method here:
In [16]: df.join(pd.DataFrame([1, 4, 9, 16, 25],
                     index=['a', 'b', 'c', 'd', 'y'],
                     columns=['squares',]))
           # temporary object
Out[16]:    floats     names  numbers  squares
         a    1.50     Guido       10        1
         b    2.50     Felix       20        4
         c    3.50  Francesc       30        9
         d    4.50      Yves       40       16
         z    5.75     Henry      100      NaN
What you can see here is that pandas by default accepts only values for those index
labels that already exist in the calling object. We lose the value for the index y and
have a NaN value (i.e., "Not a Number") at index position z. To preserve both indices,
we can provide an additional parameter to tell pandas how to join. In our case, we use
how='outer' to use the union of all values from both indices:
In [17]: df = df.join(pd.DataFrame([1, 4, 9, 16, 25],
                          index=['a', 'b', 'c', 'd', 'y'],
                          columns=['squares',]),
                      how='outer')
         df
Out[17]:    floats     names  numbers  squares
         a    1.50     Guido       10        1
         b    2.50     Felix       20        4
         c    3.50  Francesc       30        9
         d    4.50      Yves       40       16
         y     NaN       NaN      NaN       25
         z    5.75     Henry      100      NaN
Indeed, the index is now the union of the two original indices. All missing data points,
given the new enlarged index, are replaced by NaN values. Other options for the join
operation include inner for the intersection of the index values, left (default) for the
index values of the object on which the method is called, and right for the index values of
the object to be joined.
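As a minimal sketch of the other join options, the following rebuilds the two objects from the outputs shown above and joins them with how='inner' and how='right'. The variable names squares, inner, and right are introduced here for illustration only and are not part of the original session:

```python
import pandas as pd

# Rebuild the original DataFrame from the values shown in the outputs above.
df = pd.DataFrame({'floats': [1.5, 2.5, 3.5, 4.5, 5.75],
                   'names': ['Guido', 'Felix', 'Francesc', 'Yves', 'Henry'],
                   'numbers': [10, 20, 30, 40, 100]},
                  index=['a', 'b', 'c', 'd', 'z'])

# The object to be joined (hypothetical name: squares).
squares = pd.DataFrame([1, 4, 9, 16, 25],
                       index=['a', 'b', 'c', 'd', 'y'],
                       columns=['squares'])

# inner: keep only index values present in both objects.
inner = df.join(squares, how='inner')
# right: keep the index values of the object being joined.
right = df.join(squares, how='right')

print(list(inner.index))  # ['a', 'b', 'c', 'd']
print(list(right.index))  # ['a', 'b', 'c', 'd', 'y']
```

With how='right', the row for y keeps its squares value of 25, while the columns coming from df are filled with NaN for that row.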
Although there are missing values, most method calls still work, because pandas skips
NaN values by default when computing column statistics. For example:
In [18]: df[['numbers', 'squares']].mean()
           # column-wise mean
Out[18]: numbers    40
         squares    11
         dtype: float64
In [19]: df[['numbers', 'squares']].std()
           # column-wise standard deviation
Out[19]: numbers    35.355339
         squares     9.669540
         dtype: float64
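To make the treatment of missing values explicit, here is a small sketch that rebuilds the two numerical columns from Out[17] and compares the default behavior with skipna=False. The name stats_df is hypothetical and used only for this illustration:

```python
import numpy as np
import pandas as pd

# The two numerical columns as shown in Out[17], including the NaN entries.
stats_df = pd.DataFrame({'numbers': [10, 20, 30, 40, np.nan, 100],
                         'squares': [1, 4, 9, 16, 25, np.nan]},
                        index=['a', 'b', 'c', 'd', 'y', 'z'])

# Number of non-missing values per column.
print(stats_df.count())

# Default behavior (skipna=True): NaN entries are ignored.
print(stats_df.mean())

# With skipna=False, a single NaN makes the column result NaN.
print(stats_df.mean(skipna=False))
```

The default call reproduces the means of 40 and 11 shown above, since each column has five valid values. Passing skipna=False is useful when a missing value should surface in the result rather than be silently ignored.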
Second Steps with DataFrame Class
From now on, we will work with numerical data. We will add further features as we go,
like a DatetimeIndex to manage time series data. To have a dummy data set to work with,
generate a numpy.ndarray with, for example, nine rows and four columns of
pseudorandom, standard normally distributed numbers: