Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent index name collision on index reset for interpolated df #16

Merged
merged 4 commits into from
Apr 21, 2017

Conversation

kuanb
Copy link
Member

@kuanb kuanb commented Apr 20, 2017

Small PR. In the associated file, interpolatestoptimes resets the dataframe df_for_interpolation. If df_for_interpolation is indexed by one of the columns, the reset_index operation will error because the existing column name will conflict with the one to be inserted during the index reset.

Here's an example traceback:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "urbanaccess/gtfs/network.py", line 301, in interpolatestoptimes
    df_for_interpolation.reset_index(inplace=True)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3041, in reset_index
    new_obj.insert(0, name, values)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2511, in insert
    allow_duplicates=allow_duplicates)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3763, in insert
    raise ValueError('cannot insert %s, already exists' % item)
ValueError: cannot insert unique_trip_id, already exists

By setting drop to True, we avoid this.

@sablanchard sablanchard requested a review from pksohn April 20, 2017 22:03
kuanb added 3 commits April 20, 2017 15:30
…ion just to skip the merge step, when there really is not significant perf gains to be achieved through this
@pksohn
Copy link
Collaborator

pksohn commented Apr 20, 2017

@kuanb Thanks for taking a look at this! I see you're still pushing commits so let me know when this is ready to review.

@kuanb
Copy link
Member Author

kuanb commented Apr 20, 2017

Ready to review :)
Noticed there was a downstream side effect as well - and resolved that (the last 3 commits).

Copy link
Collaborator

@pksohn pksohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍

@pksohn
Copy link
Collaborator

pksohn commented Apr 21, 2017

Thanks again for this. As I think you figured out, we're only using the reset_index() so as not to lose the index in the merge process (like in this example).

I'm curious though, how did you end up with a DataFrame at this point (i.e. line 301) with unique_trip_id as the index? I wasn't able to replicate that. Maybe I'm using the wrong transit network. I ask because I want to make sure to handle this in the unit test.

@kuanb
Copy link
Member Author

kuanb commented Apr 21, 2017

No prob!

I've got this running example file here. Running that will create the situation. I didn't flow upstream to where the unique_trip_id was being cast as the index, but assumed that as I was following a pretty straightforward workflow this is something that could easily happen and ought to be accounted for.

@kuanb
Copy link
Member Author

kuanb commented Apr 21, 2017

Walking through it right now to provide you with a better answer:

Logged the index name to both inputs to interpolatestoptimes and neither was unique_trip_id. So unique_trip_id is being set within the function itself.

So I walked through the function and pinpointed where it occurs (line 284):

    print('df_for_interpolation 1')
    print(df_for_interpolation.index.name)
    df_for_interpolation = df_for_interpolation.merge(last_valid_stop_df,
                                                      left_on='unique_trip_id',
                                                      right_index=True)
    print('df_for_interpolation 2')
    print(df_for_interpolation.index.name)

That merge sets the index to unique_trip_id.

Output from print statements:

df_for_interpolation 1
None
df_for_interpolation 2
unique_trip_id

@pksohn
Copy link
Collaborator

pksohn commented Apr 21, 2017

That's really interesting. My df_for_interpolation.index.name doesn't change after the merge. Could it be a pandas version thing? I'm on 0.19.2 (I think latest).

@kuanb
Copy link
Member Author

kuanb commented Apr 21, 2017

Not sure off the top of my head but if it is, that would be an excellent argument for locking in dependency versions (right now they are all >=).

Also, on a related note, if dependencies are creating different results, this would be another argument in support of developing in a prespecified environment. This is what I've been doing - my Dockerfile (#12) creates the exact environment described in the README and that's what I have been developing and executing the UA library within.

Note: I am using the same version of Pandas:

>>> import pandas as pd
>>> pd.__version__
u'0.19.2'

@pksohn
Copy link
Collaborator

pksohn commented Apr 21, 2017

I think I figured out why this is happening. The Madison data is clean (i.e. no data to interpolate), so the df_for_interpolation is empty, which causes slightly different behavior like the index column assignment you caught. Your PR takes care of this edge case well, though we should also add a return statement early in the function if there's no interpolation to do. I'll take care of that outside this PR, and add a test case for this behavior.

Thanks again! 👍

@pksohn pksohn merged commit 6a48626 into UDST:master Apr 21, 2017
@kuanb kuanb deleted the kuanb-interpolate-df-reset-index branch April 21, 2017 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants