Introducing the Series and DataFrame

Introducing the Series and DataFrame

This blog is the first note of Master Pandas for finance . This notes introduce the basics of the Series and DataFrame objects.

The Basics of The Series and DataFrame Objects

Creating a Series and accessing elements

It;s very easy to use numpy create a pandas Series:

1
2
3
4
import numpy as np
import pandas as pd
np.random.seed(1)
s = pd.Series(np.random.randn(100))

Individual elements of a series can be retrieved using the [] operator of the series object. The item with the index label 2 can be retrieved using the following code:

1
2
3
4
5
s[2]
## Multiple values can be retrieved using an array of label values, as shown here:
s[[2, 5, 20]]
## A series supports slicing using : slice notation.
s[3:8] ## Include 3 but not 8

The .head() and .tail() methods are provided by pandas to examine just the first or the last few records in s Series. By default, these return the first five orlast five rows, respectively, but you can use the n parameter or just pass in an interger to specify the number of rows.

1
2
s.head()
s.tail()

Retrieve index and values

1
2
3
4
## Retrieve index
s.index
## Retrieve values
s.values

Note the type of index and values:

1
2
3
4
type(s.index)
Out[23]: pandas.core.indexes.range.RangeIndex
type(s.values)
Out[24]: numpy.ndarray

Creates a series and set the index:

1
2
3
4
5
6
7
8
s2 = pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
s2
Out[25]:
a 1
b 2
c 3
d 4
dtype: int64

A series can be directly initialized from a Python dictionary

1
2
3
4
5
6
7
8
9
s3 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e':5})
s3
Out[26]:
a 1
b 2
c 3
d 4
e 5
dtype: int64

Size, shape, uniqueness, and counts of values

length

1
2
3
s = pd.Series([10, 0, 1, 1, 2, 3, 4, 5, np.nan])
len(s)
Out[27]: 9

shape: return a tuple which contain the object’s dimension

1
2
s.shape
Out[28]: (9,)

count: the count of elements declude from NaN

1
2
s.count()
Out[29]: 8

unique values in a series

1
2
s.unique()
Out[30]: array([ 10., 0., 1., 2., 3., 4., 5., nan])

Obtaining the count of each of the unique items in a series.

Also declude from NaN:

1
2
3
4
5
6
7
8
9
10
s.value_counts()
Out[31]:
1.0 2
5.0 1
4.0 1
3.0 1
2.0 1
0.0 1
10.0 1
dtype: int64

Alignment via index labels

1
2
s3 = pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
s4 = pd.Series([4, 3, 2, 1], index = ['d', 'c', 'b', 'a'])

The process of adding two Series objects differs from an array as it first aligns data upon the index label values instead of simple applying the operations to elements in the same position.

1
2
3
4
5
6
7
8
s3 + s4

Out[35]:
a 2
b 4
c 6
d 8
dtype: int64

While for numpy array:

1
2
3
4
5
a1 = np.array([1, 2, 3, 4])
a2 = np.array([4, 3, 2, 1])
a1 + a2

Out[36]: array([5, 5, 5, 5])

Creating a DataFrame

the most straightforward method is creating it from a NumPy array:

1
2
3
4
5
6
pd.DataFrame(np.array([[10, 11], [20, 21]]))

Out[37]:
0 1
0 10 11
1 20 21

A DataFrame can also be initialized by passing a list of Series objects:

1
2
3
4
5
6
7
df1 = pd.DataFrame([pd.Series(np.arange(10, 15)),
pd.Series(np.arange(15, 20))])
df1
Out[40]:
0 1 2 3 4
0 10 11 12 13 14
1 15 16 17 18 19

the demensions of it:

1
2
df1.shape
Out[39]: (2, 5)

column names can be specified at the time of creating the DataFrame using the columns parameter of the DataFrame constructor:

1
2
3
4
5
6
7
8
9
10
11
df = pd.DataFrame(np.array([[10, 11], [20, 21]]), columns = ['a', 'b'])
df

Out[41]:
a b
0 10 11
1 20 21

df.columns

Out[42]: Index(['a', 'b'], dtype='object')

change the column names:

1
2
3
4
5
6
df.columns = ['c1', 'c2']
df
Out[43]:
c1 c2
0 10 11
1 20 21

columns and index

1
2
3
4
5
6
df = pd.DataFrame(np.array([[0, 1], [2, 3]]), columns=['c1', 'c2'], index = ['r1', 'r2'])
df
Out[45]:
c1 c2
r1 0 1
r2 2 3

A DataFrame can also be created by passing a dictionary containing one or more Series objects, where the dictionary keys contain the column names and each Series is one column of data.

1
2
3
4
5
6
7
8
9
10
s1 = pd.Series(np.arange(1, 6, 1))
s2 = pd.Series(np.arange(6, 11, 1))
pd.DataFrame({'c1': s1, 'c2': s2})
Out[46]:
c1 c2
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10

注意:如果直接用数组和序列创建数据框,原来的数组和序列都是按行排列的,但是如果使用字典创建数据框会按列排列。
A DataFrame also does automatic alignment of the data for each Series passed in by a dictionary:

1
2
3
4
5
6
7
8
9
s3 = pd.Series(np.arange(12, 14), index=[1, 2])
pd.DataFrame({'c1': s1, 'c2': s2, 'c3': s3})
Out[47]:
c1 c2 c3
0 1 6 NaN
1 2 7 12.0
2 3 8 13.0
3 4 9 NaN
4 5 10 NaN

Example data

First I use TuShare package create a DataFrame df:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
import tushare as ts
df = ts.get_k_data('000001')
df.head()
Out[54]:
date open close high low volume code
0 2015-12-15 9.577 9.489 9.624 9.441 364487.0 000001
1 2015-12-16 9.537 9.465 9.561 9.457 392762.0 000001
2 2015-12-17 9.529 9.608 9.664 9.521 648274.0 000001
3 2015-12-18 9.577 9.736 9.991 9.569 1022944.0 000001
4 2015-12-21 9.664 9.959 10.126 9.640 1281142.0 000001

df.index
Out[55]: RangeIndex(start=0, stop=641, step=1)

## 可以比较下面两个用法的小区别:
#############################################
df['date'].head(3)
Out[63]:
0 2015-12-15
1 2015-12-16
2 2015-12-17
Name: date, dtype: object
df[['date']].head(3)
Out[64]:
date
0 2015-12-15
1 2015-12-16
2 2015-12-17
#############################################

df[['date', 'open']]
Out[57]:
date open
0 2015-12-15 9.577
1 2015-12-16 9.537
.. ... ...
640 2018-07-30 9.320
[641 rows x 2 columns]

Selecting columns of a DataFrame

1
2
3
4
5
6
7
8
9
10
11
12
df.date.head(3)
Out[59]:
0 2015-12-15
1 2015-12-16
2 2015-12-17
Name: date, dtype: object
df[['date', 'open']].head(3)
Out[60]:
date open
0 2015-12-15 9.577
1 2015-12-16 9.537
2 2015-12-17 9.529

Slecting rows of a DataFrame using the index

Slicing using the [] operator

1
2
3
4
5
6
7
df[:3]

Out[65]:
date open close high low volume code
0 2015-12-15 9.577 9.489 9.624 9.441 364487.0 000001
1 2015-12-16 9.537 9.465 9.561 9.457 392762.0 000001
2 2015-12-17 9.529 9.608 9.664 9.521 648274.0 000001

Slecting rows by the index label and location .loc[] & .iloc[]

if DataFrame has a numeric index, .loc[] & .iloc[] have same function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
df.loc[14]
Out[100]:
date 2016-01-05
open 8.972
close 9.075
high 9.21
low 8.876
volume 663269
code 000001
Name: 14, dtype: object
df.loc[[12, 13]]
Out[101]:
date open close high low volume code
12 2015-12-31 9.632 9.545 9.656 9.537 491258.0 000001
13 2016-01-04 9.553 8.995 9.577 8.940 563497.0 000001
df.iloc[14]
Out[102]:
date 2016-01-05
open 8.972
close 9.075
high 9.21
low 8.876
volume 663269
code 000001
Name: 14, dtype: object
df.iloc[[12, 13]]
Out[103]:
date open close high low volume code
12 2015-12-31 9.632 9.545 9.656 9.537 491258.0 000001
13 2016-01-04 9.553 8.995 9.577 8.940 563497.0 000001
df.index = df.date
df.loc['2018-06-20']
Out[105]:
date 2018-06-20
open 9.713
close 9.752
high 9.792
low 9.605
volume 763637
code 000001
Name: 2018-06-20, dtype: object
df.iloc[[1]]
Out[106]:
date open close high low volume code
date
2015-12-16 2015-12-16 9.537 9.465 9.561 9.457 392762.0 000001

It’s possible to look up the location in index of a specific label value, which can be used to retrieve the row(s):

1
2
3
4
5
6
7
8
i1 = df.index.get_loc('2018-06-20')
i2 = df.index.get_loc('2018-06-21')
df.iloc[[i1, i2]]
Out[107]:
date open close high low volume code
date
2018-06-20 2018-06-20 9.713 9.752 9.792 9.605 763637.0 000001
2018-06-21 2018-06-21 9.772 9.703 9.880 9.703 840070.0 000001

Selecting rows by the index label and/or location - .ix[]
the method is deprecated. please use .loc[] for label based indexing or .iloc[] for positional indexing.
df.ix[[‘2018-06-20’, ‘2018-06-21’]]
df.ix[[1, 2, 3]]

Scalar lookup by label or location using .at[] and .iat[]

Scalar values can be looked up by label using .at[] by passing the row label and then the column name/value:

1
2
3
4
df.at['2018-06-20', 'open']
Out[114]: 9.7129999999999992
df.iat[0, 0]
Out[115]: '2015-12-15'

Selecting rows using the Boolean selection

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
df.open < 10
Out[118]:
date
2015-12-15 True
·····
2018-07-30 True
Name: open, Length: 641, dtype: bool
df[df.open < 8]
Out[119]:
date open close high low volume code
date
2016-01-27 2016-01-27 7.905 7.865 7.945 7.642 569037.0 000001
······
2016-03-02 2016-03-02 7.762 8.040 8.064 7.738 676613.0 000001
open = df[df.open < 9][['open']]

Arithmetic on a DataFrame

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
np.random.seed(123456)
df = pd.DataFrame(np.random.randn(5, 4), columns = ['A', 'B', 'C', 'D'])
df
Out[11]:
A B C D
0 0.469112 -0.282863 -1.509059 -1.135632
1 1.212112 -0.173215 0.119209 -1.044236
2 -0.861849 -2.104569 -0.494929 1.071804
3 0.721555 -0.706771 -1.039575 0.271860
4 -0.424972 0.567020 0.276232 -1.087401
df * 2
Out[12]:
A B C D
0 0.938225 -0.565727 -3.018117 -2.271265
1 2.424224 -0.346429 0.238417 -2.088472
2 -1.723698 -4.209138 -0.989859 2.143608
3 1.443110 -1.413542 -2.079150 0.543720
4 -0.849945 1.134041 0.552464 -2.174801
df = df - df.iloc[0]
subframe = df[1:4][['A', 'B']]
subframe
Out[15]:
A B
1 0.743000 0.109649
2 -1.330961 -1.821706
3 0.252443 -0.423908

The following demonstrates subtraction along a column axis by using the DataFrame object; the .sub() method subtracts the A column from every column

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
df
Out[16]:
A B C D
0 0.000000 0.000000 0.000000 0.000000
1 0.743000 0.109649 1.628267 0.091396
2 -1.330961 -1.821706 1.014129 2.207436
3 0.252443 -0.423908 0.469484 1.407492
4 -0.894085 0.849884 1.785291 0.048232

a_col = df['A']
df.sub(a_col, axis = 0)
Out[17]:
A B C D
0 0.0 0.000000 0.000000 0.000000
1 0.0 -0.633351 0.885267 -0.651603
2 0.0 -0.490745 2.345090 3.538397
3 0.0 -0.676351 0.217041 1.155049
4 0.0 1.743968 2.679375 0.942316

Reindexing the Series and DataFrame objects

The process of performing a reindex does the following:

  1. Reorders existing data to match a set of labels;
  2. Inserts NaN markers where no data exists for a label;
  3. Fills missing data for a label using a type of logic(defaulting to adding NaNs)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
np.random.seed(1)
s = pd.Series(np.random.randn(5))
s
Out[21]:
0 1.624345
1 -0.611756
2 -0.528172
3 -1.072969
4 0.865408
dtype: float64
s.index = ['a', 'b', 'c', 'd', 'e']
s
Out[23]:
a 1.624345
b -0.611756
c -0.528172
d -1.072969
e 0.865408
dtype: float64

Creater flexibility in creating a new index is provide using the .reindex() method. One example of flexibility of .reindex() over assigning the .index property directly is that the list provided to .reindex() can be of a different length than the number of rows in the Series:

1
2
3
4
5
6
7
8
9
10
s2 = s.reindex(['a', 'c', 'e', 'g'])
s2
Out[26]:
a 1.624345
c -0.528172
e 0.865408
g NaN
dtype: float64
s['a']
Out[27]: 1.6243453636632417

注意下面的两个序列由于索引格式不一致而无法对应相加

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
s1 = pd.Series([0, 1, 2], index = [0, 1, 2])
s1
Out[30]:
0 0
1 1
2 2
dtype: int64
s2 = pd.Series([3, 4, 5], index = ['0', '1', '2'])
s1 + s2
Out[32]:
0 NaN
1 NaN
2 NaN
0 NaN
1 NaN
2 NaN
dtype: float64

把索引变成数值型索引之后即可相加

1
2
3
4
5
6
7
8
9
10
11
12
13
s2
Out[34]:
0 3
1 4
2 5
dtype: int64
s2.index = s2.index.values.astype(int)
s1 + s2
Out[36]:
0 3
1 5
2 7
dtype: int64

Filling

The default action of inserting NaN as a missing value during .reindex() can be changed using fill_value of the method. The following command demonstrates using 0 instead of NaN:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
s
Out[39]:
a 1.624345
b -0.611756
c -0.528172
d -1.072969
e 0.865408
dtype: float64
s2 = s.copy()
s2.reindex(['a', 'f'], fill_value = 0)
Out[41]:
a 1.624345
f 0.000000
dtype: float64

前向填充:使用前面的一个元素填充
The following command demonstrated forward filling, often referred to as the last known value. The Series is reindexed to create a continuous integer index, and using the method = ‘ffill’ parameter, any new index labels are assigned a value from the previously seen value value along the Series. Here’s the command:

1
2
3
4
5
6
7
8
9
10
11
s3 = pd.Series(['red', 'green', 'blue'], index = [0, 3, 5])
s3.reindex(np.arange(0, 7), method = 'ffill')
Out[43]:
0 red
1 red
2 red
3 green
4 green
5 blue
6 blue
dtype: object

后向填充:使用后一个值填充,如果没有则使用缺失值填充

1
2
3
4
5
6
7
8
9
10
s3.reindex(np.arange(0, 7), method = 'bfill')
Out[44]:
0 red
1 green
2 green
3 green
4 blue
5 blue
6 NaN
dtype: object

# Python

评论

程振兴

程振兴 @czxa.top
截止今天,我已经在本博客上写了659.4k个字了!

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×