handle youtube #110

Merged
merged 1 commit into from
Aug 4, 2023

Conversation

@mruwnik mruwnik commented Aug 1, 2023

No description provided.

@ccstan99 ccstan99 left a comment

Yay!! I think this is one of the major final missing pieces! Now to clean up the metadata of the special docs and make sure new arXiv papers get scraped.

"source_type": "youtube",
"date_published": self._get_published_date(video),
"authors": self.extract_authors(video),
})

Collaborator

Did you find that we were able to get fairly decent metadata this way?

Collaborator Author

Yes. The main thing missing is the authors: it just uses the channel name, but even that should be OK.
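
For reference, a minimal sketch of the kind of metadata mapping discussed here. It assumes each `video` is a dict shaped like a YouTube Data API v3 item with a `snippet` entry; `build_metadata` and `parse_published` are hypothetical names, not the code in this PR. It mainly illustrates why the channel name ends up standing in for the authors.

```python
from datetime import datetime
from typing import Optional


def parse_published(raw: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp such as '2023-08-01T12:00:00Z'."""
    if not raw:
        return None
    # Assumption: timestamps are UTC with a trailing 'Z'.
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


def build_metadata(video: dict) -> dict:
    """Hypothetical helper: map one API item to a dataset entry."""
    snippet = video.get("snippet", {})
    channel = (snippet.get("channelTitle") or "").strip()
    return {
        "title": snippet.get("title"),
        "source_type": "youtube",
        "date_published": parse_published(snippet.get("publishedAt")),
        # Only the channel name is available, so it is used as the author.
        "authors": [channel] if channel else [],
    }
```

If better author data turns up later (for example, parsed from the video description), only the `authors` line would need to change.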

return None
except TranscriptsDisabled:
logger.error(f'Transcripts disabled for https://www.youtube.com/watch?v={video_id} - skipping')
return None

Collaborator

Were there many that had transcriptions disabled or unavailable? Would something like this help? https://huggingface.co/spaces/SteveDigital/free-fast-youtube-url-video-to-text-using-openai-whisper

Collaborator Author

Yes, or some kind of alternative. Issue added: #112
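
A rough sketch of what such a fallback could look like, so the idea is concrete. Everything here is an assumption for illustration: `TranscriptUnavailable` is a local stand-in for library errors such as `TranscriptsDisabled`, and `fetch_captions` / `transcribe_audio` are hypothetical callables (the second could wrap a Whisper-style speech-to-text step). It is not the implementation in this PR or in #112.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class TranscriptUnavailable(Exception):
    """Hypothetical stand-in for errors such as TranscriptsDisabled."""


def get_transcript(
    video_id: str,
    fetch_captions: Callable[[str], str],
    transcribe_audio: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Try published captions first, then an optional speech-to-text fallback."""
    url = f"https://www.youtube.com/watch?v={video_id}"
    try:
        return fetch_captions(video_id)
    except TranscriptUnavailable:
        if transcribe_audio is None:
            # Current behaviour: log the problem and skip the video.
            logger.error("Transcripts disabled for %s - skipping", url)
            return None
    # Captions were unavailable but a fallback was supplied.
    logger.info("No captions for %s - falling back to speech-to-text", url)
    return transcribe_audio(video_id)
```

Passing the two steps in as callables keeps the skip-on-failure behaviour as the default and lets a fallback be added later without touching the callers.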

'PLAPVC5uNprwY0q4_nyeeHqIT07wZqwjGO',
'PLCRVRLd2RhZTpdUdEzJjo3qhmX3y3skWA',
'PLTYHZYmxohXpn5uf8JZ2OouB1PsDJAk-x',
]

Collaborator

Nice! Each channel is a dataset and all the playlists are a dataset? So to add an individual video, we add it to a playlist we're already tracking? Might have quite a few duplicates since all of Rob's videos will also be in a playlist too? And are we skipping "Robert Miles 2" channel since it's only a handful of videos and more off topic?

Collaborator Author

Duplicates are fine: they will be ignored. This is part of the reason why I put all the playlists together :D
As for the "Robert Miles 2" channel, that's part of the reason I skipped it: I just wanted the main body of this to be merged, as later additions will be simple.
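
To illustrate why overlapping playlists are harmless, here is one possible way duplicates could be dropped when several playlists feed a single dataset. The PR does not show how duplicates are actually ignored, so `unique_video_ids` and the dict-of-lists input are assumptions made for this sketch.

```python
from typing import Dict, Iterator, List, Set


def unique_video_ids(playlists: Dict[str, List[str]]) -> Iterator[str]:
    """Yield each video ID once, in first-seen order, across all playlists."""
    seen: Set[str] = set()
    for video_ids in playlists.values():
        for video_id in video_ids:
            if video_id in seen:
                # Already yielded from an earlier playlist: ignore it.
                continue
            seen.add(video_id)
            yield video_id
```

With this shape, adding an individual video only means adding it to any tracked playlist; it does not matter if it also appears elsewhere.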

@mruwnik mruwnik merged commit 5ab84e7 into main Aug 4, 2023
@mruwnik mruwnik deleted the youtube-datasets branch August 4, 2023 09:52
2 participants