Joakim Lustig's blog

How to iterate over a dataframe in pandas

April 19, 2018


When most pandas beginners want to iterate over a dataframe, the go-to solution is to write a standard Python loop. This is not surprising, since loops are how you normally interact with Python objects. I also did this when I started using pandas.

# don't do this
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3], 'col2': [3,4,5], 'col3': [0,0,0]})

for i in range(0,len(df)):
    df.iloc[i,2] = df.iloc[i,0] + df.iloc[i,1]

In this post I want to describe a better way of doing it. Pandas dataframes are a bit special since they have a lot of built-in optimizations, and by using a plain loop you effectively circumvent all the hard work the pandas developers have put into optimizing the library.

So let's take a look at a few ways to do this better.

Looping with .iterrows()

If you really need to write a loop for whatever reason, a better way to do it is with the .iterrows() generator, which yields the index of each row together with the row itself as a Series.

import pandas as pd
df = pd.DataFrame({'col1': [1,2,3], 'col2': [3,4,5], 'col3': [0,0,0]})

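for i, row in df.iterrows():
    # write through the dataframe: assigning to row is not guaranteed to update df
    df.at[i, 'col3'] = row['col1'] + row['col2']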

This is usually faster than the index-based loop above, and it is also more readable. But it's still not the best way to do it.
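One thing to watch out for: the pandas documentation warns that, depending on the dtypes, .iterrows() can hand you a copy of each row, so writing to row is not guaranteed to change the dataframe. Here is a minimal sketch of one way around that, assuming the same three-column dataframe as above (the list name totals is just my choice for illustration): read from each row, collect the results, and assign the whole column after the loop.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [0, 0, 0]})

totals = []
for i, row in df.iterrows():
    # read from the row, but do not write to it
    totals.append(row['col1'] + row['col2'])

# assign the whole column in one go once the loop is done
df['col3'] = totals
print(df)
```

This also avoids modifying the dataframe while you are iterating over it.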

Looping with .apply()

For most use cases .apply() should be your go-to way to iterate over a dataframe. It applies a function to each row or each column of your dataframe.

The two arguments you will use most are the function you want to apply and axis, which controls whether the function is called on rows or columns: with axis=1 the function receives each row in turn, and with axis=0 (the default) it receives each column.
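To make the axis argument concrete, here is a small sketch on the same example dataframe. The helper name total is my own, made up for this illustration.

```python
import pandas as pd

def total(values):
    # values is a pandas Series: one row when axis=1, one column when axis=0
    return values.sum()

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [0, 0, 0]})

row_sums = df.apply(total, axis=1)  # one result per row
col_sums = df.apply(total, axis=0)  # one result per column

print(row_sums.tolist())  # [4, 6, 8]
print(col_sums.tolist())  # [6, 12, 0]
```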

import pandas as pd

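def summarize(row):
    # work on a copy: pandas does not support mutating the object passed in by .apply()
    row = row.copy()
    # use column labels rather than positions
    row['col3'] = row['col1'] + row['col2']
    return row

df = pd.DataFrame({'col1': [1,2,3], 'col2': [3,4,5], 'col3': [0,0,0]})

# .apply() returns a new dataframe, so assign the result back
df = df.apply(lambda row: summarize(row), axis=1)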

Now this might look a little scary if you've never used lambda functions in Python before. But it shouldn't be. Lambda functions are just a way of creating anonymous functions in Python, meaning functions without a name that don't need to be saved for later use. They are mostly used when you call a function or method that takes another function as input, like .apply() in pandas. In the example above the lambda only forwards each row to summarize, so passing summarize directly would work just as well.

If you want you can also save lambda functions to a variable and use them:

f = lambda x, y : x + y
print(f(1,1))

Which is equivalent to writing a normal Python function:

def f(x,y):
  return x + y

print(f(1,1))

So a lambda function is just a way to write a python function without having to name it, but you still have the option to do so if you want.

Using .apply() is usually faster than writing the loop by hand, and I hope I have shown that it is no more difficult to use.
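If you would rather measure than take my word for it, here is a rough sketch using timeit from the standard library. The helper names (make_df, with_iloc, with_iterrows, with_apply) and the row count are made up for this illustration, and the timings depend on your pandas version, your machine and the size of the dataframe, so I'm not quoting any numbers.

```python
import timeit
import pandas as pd

def make_df(n=500):
    # hypothetical helper: n rows with the same three columns as the examples
    return pd.DataFrame({'col1': list(range(n)), 'col2': list(range(n)), 'col3': [0] * n})

def with_iloc(df):
    df = df.copy()
    for i in range(len(df)):
        df.iloc[i, 2] = df.iloc[i, 0] + df.iloc[i, 1]
    return df

def with_iterrows(df):
    df = df.copy()
    df['col3'] = [row['col1'] + row['col2'] for _, row in df.iterrows()]
    return df

def with_apply(df):
    df = df.copy()
    df['col3'] = df.apply(lambda row: row['col1'] + row['col2'], axis=1)
    return df

df = make_df()
for func in (with_iloc, with_iterrows, with_apply):
    # time three runs of each approach on the same input
    seconds = timeit.timeit(lambda: func(df), number=3)
    print(f'{func.__name__}: {seconds:.3f} s for 3 runs')
```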

If you need an even bigger performance boost you should look into vectorization in numpy, which lets you do the computation on whole columns at once instead of one row at a time. I might do a write-up about it later.
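As a small taste of what that looks like, here is a minimal sketch on the same example dataframe. The variable name col3_np is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [0, 0, 0]})

# add the two columns element-wise in a single operation, with no explicit loop
df['col3'] = df['col1'] + df['col2']

# the same computation on the underlying numpy arrays
col3_np = df['col1'].to_numpy() + df['col2'].to_numpy()

print(df)
```

For a simple sum like this, there's no need for a loop or .apply() at all.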


Joakim Lustig

Software Developer & Data Scientist