Prevent index name collision on index reset for interpolated df #16

kuanb · 2017-04-20T21:54:25Z

Small PR. In the associated file, interpolatestoptimes resets the dataframe df_for_interpolation. If df_for_interpolation is indexed by one of the columns, the reset_index operation will error because the existing column name will conflict with the one to be inserted during the index reset.

Here's an example traceback:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "urbanaccess/gtfs/network.py", line 301, in interpolatestoptimes
    df_for_interpolation.reset_index(inplace=True)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3041, in reset_index
    new_obj.insert(0, name, values)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2511, in insert
    allow_duplicates=allow_duplicates)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3763, in insert
    raise ValueError('cannot insert %s, already exists' % item)
ValueError: cannot insert unique_trip_id, already exists

By setting drop to True, we avoid this.

…ine 306

…ion just to skip the merge step, when there really is not significant perf gains to be achieved through this

pksohn · 2017-04-20T22:39:48Z

@kuanb Thanks for taking a look at this! I see you're still pushing commits so let me know when this is ready to review.

kuanb · 2017-04-20T22:40:34Z

Ready to review :)
Noticed there was a downstream side effect as well - and resolved that (the last 3 commits).

pksohn

Looks good to me 👍

pksohn · 2017-04-21T00:13:28Z

Thanks again for this. As I think you figured out, we're only using the reset_index() so as not to lose the index in the merge process (like in this example).

I'm curious though, how did you end up with a DataFrame at this point (i.e. line 301) with unique_trip_id as the index? I wasn't able to replicate that. Maybe I'm using the wrong transit network. I ask because I want to make sure to handle this in the unit test.

kuanb · 2017-04-21T00:17:12Z

No prob!

I've got this running example file here. Running that will create the situation. I didn't flow upstream to where the unique_trip_id was being cast as the index, but assumed that as I was following a pretty straightforward workflow this is something that could easily happen and ought to be accounted for.

kuanb · 2017-04-21T00:24:17Z

Walking through it right now to provide you with a better answer:

Logged the index name to both inputs to interpolatestoptimes and neither was unique_trip_id. So unique_trip_id is being set within the function itself.

So I walked through the function and pinpointed where it occurs (line 284):

    print('df_for_interpolation 1')
    print(df_for_interpolation.index.name)
    df_for_interpolation = df_for_interpolation.merge(last_valid_stop_df,
                                                      left_on='unique_trip_id',
                                                      right_index=True)
    print('df_for_interpolation 2')
    print(df_for_interpolation.index.name)

That merge sets the index to unique_trip_id.

Output from print statements:

df_for_interpolation 1
None
df_for_interpolation 2
unique_trip_id

pksohn · 2017-04-21T00:44:22Z

That's really interesting. My df_for_interpolation.index.name doesn't change after the merge. Could it be a pandas version thing? I'm on 0.19.2 (I think latest).

kuanb · 2017-04-21T15:59:58Z

Not sure off the top of my head but if it is, that would be an excellent argument for locking in dependency versions (right now they are all >=).

Also, on a related note, if dependencies are creating different results, this would be another argument in support of developing in a prespecified environment. This is what I've been doing - my Dockerfile (#12) creates the exact environment described in the README and that's what I have been developing and executing the UA library within.

Note: I am using the same version of Pandas:

>>> import pandas as pd
>>> pd.__version__
u'0.19.2'

pksohn · 2017-04-21T18:08:05Z

I think I figured out why this is happening. The Madison data is clean (i.e. no data to interpolate), so the df_for_interpolation is empty, which causes slightly different behavior like the index column assignment you caught. Your PR takes care of this edge case well, though we should also add a return statement early in the function if there's no interpolation to do. I'll take care of that outside this PR, and add a test case for this behavior.

Thanks again! 👍

need to reset index and drop index name if exists in col of dataframe

69c7e8c

sablanchard requested a review from pksohn April 20, 2017 22:03

kuanb added 3 commits April 20, 2017 15:30

set index if _check_if_index_name_in_cols true to ensure no snag on l…

b7c32d6

…ine 306

dropped interpolated_times empty check, it over complicates the funct…

7979189

…ion just to skip the merge step, when there really is not significant perf gains to be achieved through this

readd drop_bool = False

cd25462

pksohn approved these changes Apr 21, 2017

View reviewed changes

pksohn merged commit 6a48626 into UDST:master Apr 21, 2017

kuanb deleted the kuanb-interpolate-df-reset-index branch April 21, 2017 18:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent index name collision on index reset for interpolated df #16

Prevent index name collision on index reset for interpolated df #16

kuanb commented Apr 20, 2017

pksohn commented Apr 20, 2017

kuanb commented Apr 20, 2017 •

edited

Loading

pksohn left a comment

pksohn commented Apr 21, 2017

kuanb commented Apr 21, 2017

kuanb commented Apr 21, 2017 •

edited

Loading

pksohn commented Apr 21, 2017

kuanb commented Apr 21, 2017 •

edited

Loading

pksohn commented Apr 21, 2017

Prevent index name collision on index reset for interpolated df #16

Prevent index name collision on index reset for interpolated df #16

Conversation

kuanb commented Apr 20, 2017

pksohn commented Apr 20, 2017

kuanb commented Apr 20, 2017 • edited Loading

pksohn left a comment

Choose a reason for hiding this comment

pksohn commented Apr 21, 2017

kuanb commented Apr 21, 2017

kuanb commented Apr 21, 2017 • edited Loading

pksohn commented Apr 21, 2017

kuanb commented Apr 21, 2017 • edited Loading

pksohn commented Apr 21, 2017

kuanb commented Apr 20, 2017 •

edited

Loading

kuanb commented Apr 21, 2017 •

edited

Loading

kuanb commented Apr 21, 2017 •

edited

Loading