handle youtube #110
Yay!! I think this is one of the major final missing pieces! Now to clean metadata of special docs and make sure new arxiv papers get scraped.
    "source_type": "youtube",
    "date_published": self._get_published_date(video),
    "authors": self.extract_authors(video),
})
Did you find that we were able to get fairly decent metadata this way?
Yes. The main thing missing is the authors: it just uses the channel name, but even that should be OK.
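For illustration, a minimal sketch of the metadata shape discussed above. It assumes each video is a dict with a `snippet` holding `publishedAt` and `channelTitle`; that layout and the standalone helper names (`build_metadata`, `extract_authors`, `get_published_date`) are assumptions for the example, not the PR's actual code.

```python
from typing import Any, Dict, List, Optional


def extract_authors(video: Dict[str, Any]) -> List[str]:
    """Fall back to the channel name, since per-video authors are not available."""
    # Assumed field name: "channelTitle" inside "snippet".
    channel = video.get("snippet", {}).get("channelTitle", "").strip()
    return [channel] if channel else []


def get_published_date(video: Dict[str, Any]) -> Optional[str]:
    """Return the publication timestamp as given, or None when it is missing."""
    return video.get("snippet", {}).get("publishedAt")


def build_metadata(video: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the metadata fields shown in the diff into one dict."""
    return {
        "source_type": "youtube",
        "date_published": get_published_date(video),
        "authors": extract_authors(video),
    }
```

Returning an empty list rather than `None` when the channel name is missing keeps `authors` the same type in every record.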
    return None
except TranscriptsDisabled:
    logger.error(f'Transcripts disabled for https://www.youtube.com/watch?v={video_id} - skipping')
    return None
Were there many that had transcriptions disabled or unavailable? Would something like this help? https://huggingface.co/spaces/SteveDigital/free-fast-youtube-url-video-to-text-using-openai-whisper
yes. Or some kind of alternative. Issue added - #112
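One possible shape for such an alternative, sketched under assumptions: each transcript source is a plain callable that takes a video ID and returns text or `None`, and sources are tried in order (for example captions first, then a speech-to-text fallback). `fetch_transcript` and the `TranscriptSource` type are hypothetical names; no real transcription API is called here.

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)

# Hypothetical signature: takes a video ID, returns transcript text or None.
TranscriptSource = Callable[[str], Optional[str]]


def fetch_transcript(video_id: str, sources: Sequence[TranscriptSource]) -> Optional[str]:
    """Try each source in order and return the first non-empty transcript."""
    url = f"https://www.youtube.com/watch?v={video_id}"
    for source in sources:
        try:
            text = source(video_id)
        except Exception as err:  # a real version would catch specific errors
            name = getattr(source, "__name__", repr(source))
            logger.error("Transcript source %s failed for %s: %s", name, url, err)
            continue
        if text:
            return text
    # Every source failed or returned nothing: skip the video, as the PR does.
    logger.error("No transcript available for %s - skipping", url)
    return None
```

Passing the sources in as callables keeps the fallback order configurable and lets the function be exercised without network access.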
    'PLAPVC5uNprwY0q4_nyeeHqIT07wZqwjGO',
    'PLCRVRLd2RhZTpdUdEzJjo3qhmX3y3skWA',
    'PLTYHZYmxohXpn5uf8JZ2OouB1PsDJAk-x',
]
Nice! Each channel is a dataset and all the playlists are a dataset? So to add an individual video, we add it to a playlist we're already tracking? Might have quite a few duplicates since all of Rob's videos will also be in a playlist too? And are we skipping "Robert Miles 2" channel since it's only a handful of videos and more off topic?
Duplicates are fine - they will be ignored. This is part of the reason why I put all the playlists together :D
That's part of the reason I skipped it - I just wanted the main body of this to be merged, as later additions will be simple
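The thread does not show how duplicates are ignored, so the following is only a sketch of one way to do it: track the video IDs already seen while walking all playlists, and skip repeats. `unique_video_ids` and the playlist-to-IDs mapping are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Set


def unique_video_ids(playlists: Dict[str, Iterable[str]]) -> Iterator[str]:
    """Yield each video ID once, in first-seen order, across all playlists."""
    seen: Set[str] = set()
    for video_ids in playlists.values():
        for video_id in video_ids:
            if video_id in seen:
                continue  # already yielded from this or an earlier playlist
            seen.add(video_id)
            yield video_id
```

With this approach, a video that appears in several tracked playlists is processed only the first time it is encountered.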