Efficient way of counting missing values per row in Pandas
While working on a project assigned in the Udacity Nanodegree program I'm currently attending, I had to count the number of null values in each row and display the result in a histogram. However, the Pandas DataFrame contained 891,221 rows, so it took quite a long time to iterate through the rows with the following code:
df.apply(lambda row: sum_of_nulls_in_row(row), axis=1)
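For context, here is a minimal, self-contained sketch of this row-wise approach. The DataFrame is made up for illustration, and sum_of_nulls_in_row is assumed to simply count the null entries in a row, which is what its name suggests:

# A sketch of the slow row-wise approach, assuming sum_of_nulls_in_row
# just counts the null entries in a single row.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
})

def sum_of_nulls_in_row(row):
    # Count how many values in this row are null.
    return row.isnull().sum()

# apply() calls the helper once per row -- effectively a Python-level loop.
nulls_per_row = df.apply(lambda row: sum_of_nulls_in_row(row), axis=1)
print(nulls_per_row)  # row 0 -> 1, row 1 -> 2, row 2 -> 0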
Although it was suggested in this post that using apply() is much faster than using iterrows(), it was still too slow to finish the project efficiently. After some searching, I found this discussion. In Icyblade's answer, he mentioned this:
"When using pandas, try to avoid performing operations in a loop, including apply, map, applymapetc. That's slow!"
Icyblade's suggestion was to use the following code:
df.isnull().sum(axis=1)
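Here isnull() builds a boolean DataFrame and sum(axis=1) adds it up per row, all in vectorized, compiled code instead of a Python loop. A small, self-contained sketch of the whole step, including the histogram mentioned above (the DataFrame is again made up for illustration):

# Vectorized per-row null count, then a histogram of the result.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
})

nulls_per_row = df.isnull().sum(axis=1)

# Plot the distribution of missing values per row.
nulls_per_row.hist()
plt.xlabel('Null values in row')
plt.ylabel('Number of rows')
plt.show()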
I applied it to my code, and boom! It worked like a charm. The long wait was gone and the result appeared in a blink. A good lesson learned.
[This article has been copied from hojoongchung.wordpress.com. The original article was written on Sep 15, 2018, 6:35 PM]