NYC dataset changed format and S3 url #2

kyleaddis · 2022-05-27T17:21:57Z

NYC.gov has changed all their files to Parquet. The csv files are no longer available through the provided S3 links.
The new link is https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.parquet
But it requires some additional processing to follow along. This mostly applies to video DE Zoomcamp 1.2.2 - Ingesting NY Taxi Data to Postgres, but it may pop up in other places throughout the course.

First, install pyarrow:
pip install pyarrow

Then convert the parquet to pandas:

import pyarrow.parquet as pq
trips = pq.read_table('yellow_tripdata_2021-01.parquet')
df = trips.to_pandas()

Finally, run this command and wait. It will take a while, then return a number when it is finished.
df.to_sql(name='yellow_taxi_data', con=engine, if_exists='replace', chunksize=100000)

Alternatively, the .csv files could be added to the repo with links to those instead.


erick093 · 2022-11-09T16:06:06Z

Changed again; the link is now: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page#:~:text=January-,Yellow%20Taxi%20Trip%20Records,-(PARQUET)
