Often you want to sort Pandas data frame in a specific way. Typically, one may want to sort pandas data frame based on the values of one or more columns or sort based on the values of row index or row names of pandas dataframe. Pandas data frame has two useful functionspython
Each of these functions come with numerous options, like sorting the data frame in specific order (ascending or descending), sorting in place, sorting with missing values, sorting by specific algorithm and so on.ui
Here is a quick Pandas tutorial on multiple ways of using sort_values() and sort_index() to sort pandas data frame using a real data set (gapminder).this
Let us first load the gapminder data from software carpentry URL.url
1
2
3
4
5
|
# read data from url as pandas dataframe
gapminder
=
pd.read_csv(data_url)
# print the first three rows
print
(gapminder.head(n
=
3
))
|
We can sort pandas dataframe based on the values of a single column by specifying the column name wwe want to sort as input argument to sort_values(). For example, we can sort by the values of 「lifeExp」 column in the gapminder data likespa
1
|
sort_by_life
=
gapminder.sort_values(
'lifeExp'
)
|
1
2
3
4
5
|
print
(sort_by_life.head(n
=
3
))
country year pop continent lifeExp gdpPercap
1292
Rwanda
1992
7290203.0
Africa
23.599
737.068595
0
Afghanistan
1952
8425333.0
Asia
28.801
779.445314
552
Gambia
1952
284320.0
Africa
30.000
485.230659
|
Note that by default sort_values sorts and gives a new data frame. The new sorted data frame is in ascending order (small values first and large values last). With head function we can see that the first rows have smaller life expectancy. Using tail function the sorted data frame, we can see that the last rows have higher life expectancy.code
1
2
3
4
5
|
print
(sort_by_life.tail(n
=
3
))
country year pop continent lifeExp gdpPercap
802
Japan
2002
127065841.0
Asia
82.000
28604.59190
671
Hong Kong China
2007
6980412.0
Asia
82.208
39724.97867
803
Japan
2007
127467972.0
Asia
82.603
31656.06806
|
2. How to Sort Pandas Dataframe based on the values of a column (Descending order)?three
To sort a dataframe based on the values of a column but in descending order so that the largest values of the column are at the top, we can use the argument ascending=False.ip
1
|
sort_by_life
=
gapminder.sort_values(
'lifeExp'
,ascending
=
False
)
|
In this example, we can see that after sorting the dataframe by lifeExp with ascending=False, the countries with largest life expectancy are at the top.ci
1
2
3
4
5
|
print
(sort_by_life.head(n
=
3
))
country year pop continent lifeExp gdpPercap
803
Japan
2007
127467972.0
Asia
82.603
31656.06806
671
Hong Kong China
2007
6980412.0
Asia
82.208
39724.97867
802
Japan
2002
127065841.0
Asia
82.000
28604.59190
|
Often a data frame might contain missing values and when sorting a data frame on a column with missing value, we might want to have rows with missing values to be at the first or at the last.get
We can specify the position we want for missing values using the argument na_position. With na_position=’first’, it will have the rows with missing values first.
1
|
sort_na_first
=
gapminder.sort_values(
'lifeExp'
,na_position
=
'first'
)
|
In this example, there are NO missing values and that is why there is no na values at the top when sorted with the option na_position=’first’.
1
2
3
4
5
|
sort_na_first.head()
country year pop continent lifeExp gdpPercap
1292
Rwanda
1992
7290203.0
Africa
23.599
737.068595
0
Afghanistan
1952
8425333.0
Asia
28.801
779.445314
552
Gambia
1952
284320.0
Africa
30.000
485.230659
|
By default sorting pandas data frame using sort_values() or sort_index() creates a new data frame. If you don’t want create a new data frame after sorting and just want to do the sort in place, you can use the argument 「inplace = True」. Here is an example of sorting a pandas data frame in place without creating a new data frame.
1
|
gapminder.sort_values(
'lifeExp'
, inplace
=
True
, ascending
=
False
)
|
We can see that the data frame sorted as lifeExp values at the top are smallest and the row indices are not in order.
1
2
3
4
5
|
print
(gapminder.head(n
=
3
))
country year pop continent lifeExp gdpPercap
1292
Rwanda
1992
7290203.0
Africa
23.599
737.068595
0
Afghanistan
1952
8425333.0
Asia
28.801
779.445314
552
Gambia
1952
284320.0
Africa
30.000
485.230659
|
Note that, the row index of the sorted data frame is different from the data frame before sorting.
We can use sort_index() to sort pandas dataframe to sort by row index or names. In this example, row index are numbers and in the earlier example we sorted data frame by lifeExp and therefore the row index are jumbled up. We can sort by row index (with inplace=True option) and retrieve the original dataframe.
1
|
gapminder.sort_index(inplace
=
True
)
|
Now we can see that row indices start from 0 and sorted in ascending order. Compare it to the previous example, where the first row index is 1292 and row indices are not sorted.
1
2
3
4
5
|
print
(gapminder.head(n
=
3
))
country year pop continent lifeExp gdpPercap
0
Afghanistan
1952
8425333.0
Asia
28.801
779.445314
1
Afghanistan
1957
9240934.0
Asia
30.332
820.853030
2
Afghanistan
1962
10267083.0
Asia
31.997
853.100710
|
Often, you might want to sort a data frame based on the values of multiple columns. We can specify the columns we want to sort by as a list in the argument for sort_values(). For example, to sort by values of two columns, we can do.
1
|
sort_by_life_gdp
=
gapminder.sort_values([
'lifeExp'
,
'gdpPercap'
])
|
We can see that lifeExp column is sorted in ascending order and for each values of lifeExp, gdpPercap is sorted.
1
2
3
4
5
|
print
(sort_by_life_gdp.head())
country year pop continent lifeExp gdpPercap
1292
Rwanda
1992
7290203.0
Africa
23.599
737.068595
0
Afghanistan
1952
8425333.0
Asia
28.801
779.445314
552
Gambia
1952
284320.0
Africa
30.000
485.230659
|
Note that when sorting by multiple columns, pandas sort_value() uses the first variable first and second variable next. We can see the difference by switching the order of column names in the list.
1
|
sort_by_life_gdp
=
gapminder.sort_values([
'gdpPercap'
,
'lifeExp'
])
|
1
2
3
4
5
|
print
(sort_by_life_gdp.head())
country year pop continent lifeExp gdpPercap
334
Congo Dem. Rep.
2002
55379852.0
Africa
44.966
241.165877
335
Congo Dem. Rep.
2007
64606759.0
Africa
46.462
277.551859
876
Lesotho
1952
748747.0
Africa
42.138
298.846212
|