An efficient way to count missing values per row in Pandas




While working on a project assigned in the Udacity Nanodegree program I'm currently attending, I had to count the number of null values in each row and display the counts in a histogram. However, the Pandas dataset contained 891221 rows, and I had to wait quite a long time for the rows to be processed by the following code:

df.apply(lambda row: sum_of_nulls_in_row(row), axis=1)
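The body of sum_of_nulls_in_row is not shown above. Here is a minimal sketch of what such a row-wise helper could look like, assuming it simply counts the missing entries in a single row. The small DataFrame is made up for illustration and is not the project dataset.

```python
import numpy as np
import pandas as pd


def sum_of_nulls_in_row(row):
    # Hypothetical helper: count the missing values in one row (a Series).
    return int(row.isnull().sum())


# Made-up sample data, not the Udacity dataset.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": ["x", None, "z"],
})

# apply(axis=1) calls the Python function once per row, which is why it
# gets slow on a dataset with hundreds of thousands of rows.
nulls_per_row = df.apply(lambda row: sum_of_nulls_in_row(row), axis=1)
print(nulls_per_row.tolist())
```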

Although it was suggested in this post that using apply() is much faster than using iterrows(), it was still too slow to finish the project efficiently. After several searches, I found this discussion. In Icyblade's answer, he mentioned this:

"When using pandas, try to avoid performing operations in a loop, including apply, map, applymap etc. That's slow!"

Icyblade's suggestion was to use the following code:

df.isnull().sum(axis=1)

I applied it to my code, and boom! It worked like a charm. The long wait was gone and the result was there in a blink. A good lesson learned.
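To see the difference for yourself, here is a rough sketch that runs both approaches on randomly generated data and checks that they give the same counts. The data, the 20,000-row size, and the variable names are assumptions for illustration. The actual timings depend on your machine and on the shape of the real dataset.

```python
import time

import numpy as np
import pandas as pd

# Made-up data: 20,000 rows x 10 columns with roughly 20% missing values.
rng = np.random.default_rng(0)
values = rng.random((20_000, 10))
values[rng.random(values.shape) < 0.2] = np.nan
df = pd.DataFrame(values)

# Row-wise: a Python-level function call for every row.
start = time.perf_counter()
row_wise = df.apply(lambda row: row.isnull().sum(), axis=1)
apply_seconds = time.perf_counter() - start

# Vectorized: build one boolean mask, then sum across the columns.
start = time.perf_counter()
vectorized = df.isnull().sum(axis=1)
vectorized_seconds = time.perf_counter() - start

# Both approaches should produce identical per-row counts.
assert (row_wise == vectorized).all()
print(f"apply:      {apply_seconds:.4f} s")
print(f"vectorized: {vectorized_seconds:.4f} s")

# Number of rows for each null count, i.e. the data behind the histogram.
rows_per_null_count = vectorized.value_counts().sort_index()
print(rows_per_null_count)
```

The vectorized version works on the whole boolean mask at once instead of building a Series for each row, which is where the time savings come from.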

[This article was copied from hojoongchung.wordpress.com. The original article was written on Sep 15, 2018, 6:35 PM]
