handle youtube #110

mruwnik · 2023-08-01T18:43:50Z

No description provided.

ccstan99

Yay!! I think this is one of the major final missing pieces! Now to clean metadata of special docs and make sure new arxiv papers get scraped.

ccstan99 · 2023-08-04T07:18:46Z

align_data/sources/youtube/youtube.py

+            "source_type": "youtube",
+            "date_published": self._get_published_date(video),
+            "authors": self.extract_authors(video),
+        })


Did you find that we were able to get fairly decent metadata this way?

Yes. The main thing missing is the authors: it just uses the channel name, but even that should be OK.
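For readers following along, here is a minimal sketch of the kind of mapping discussed above, with the channel name standing in for the authors. It is not the code from this PR: `build_metadata`, `extract_authors`, and `get_published_date` are hypothetical helpers, and the input is assumed to be a dict shaped like a YouTube Data API item with `snippet.title`, `snippet.channelTitle`, and `snippet.publishedAt`.

```python
from datetime import datetime, timezone
from typing import List, Optional


def extract_authors(video: dict) -> List[str]:
    """Use the channel name as the only author (assumption: no better source)."""
    channel = (video.get("snippet", {}).get("channelTitle") or "").strip()
    return [channel] if channel else []


def get_published_date(video: dict) -> Optional[datetime]:
    """Parse an ISO 8601 UTC timestamp such as '2023-08-01T18:43:50Z'."""
    raw = video.get("snippet", {}).get("publishedAt")
    if not raw:
        return None
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def build_metadata(video: dict) -> dict:
    """Collect the metadata fields mentioned in the diff into one dict."""
    return {
        "title": video.get("snippet", {}).get("title"),
        "source_type": "youtube",
        "date_published": get_published_date(video),
        "authors": extract_authors(video),
    }
```

A video with no channel name yields an empty author list rather than a placeholder, so missing authors stay easy to detect later.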

ccstan99 · 2023-08-04T07:19:32Z

align_data/sources/youtube/youtube.py

+            return None
+        except TranscriptsDisabled:
+            logger.error(f'Transcripts disabled for https://www.youtube.com/watch?v={video_id} - skipping')
+            return None


Were there many videos with transcripts disabled or unavailable? Would something like this help? https://huggingface.co/spaces/SteveDigital/free-fast-youtube-url-video-to-text-using-openai-whisper

Yes, or some kind of alternative. Issue added: #112
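One way the "alternative" idea could be structured is a fallback chain: try each transcript source in order and skip the video only when all of them fail. The sketch below is an illustration under stated assumptions, not what this PR or issue #112 implements. `first_available_transcript` is a hypothetical name, and `TranscriptUnavailable` is a local stand-in for library errors such as `TranscriptsDisabled`; a speech-to-text fetcher would be just another callable in the list.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger(__name__)


class TranscriptUnavailable(Exception):
    """Stand-in for errors like TranscriptsDisabled (hypothetical wrapper)."""


def first_available_transcript(
    video_id: str,
    fetchers: Iterable[Callable[[str], Optional[str]]],
) -> Optional[str]:
    """Return the first non-empty transcript, or None if every source fails."""
    url = f"https://www.youtube.com/watch?v={video_id}"
    for fetch in fetchers:
        name = getattr(fetch, "__name__", repr(fetch))
        try:
            text = fetch(video_id)
        except TranscriptUnavailable as err:
            # This source cannot help; fall through to the next one.
            logger.warning("%s failed for %s: %s", name, url, err)
            continue
        if text:
            return text
    logger.error("No transcript source worked for %s - skipping", url)
    return None
```

Keeping each source behind the same callable signature means a new fallback can be added without touching the skipping logic.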

ccstan99 · 2023-08-04T07:21:14Z

align_data/sources/youtube/__init__.py

+            'PLAPVC5uNprwY0q4_nyeeHqIT07wZqwjGO',
+            'PLCRVRLd2RhZTpdUdEzJjo3qhmX3y3skWA',
+            'PLTYHZYmxohXpn5uf8JZ2OouB1PsDJAk-x',
+        ]


Nice! So each channel is a dataset, and all the playlists together are one dataset? To add an individual video, do we add it to a playlist we're already tracking? We might get quite a few duplicates, since all of Rob's videos will also be in a playlist. And are we skipping the "Robert Miles 2" channel because it has only a handful of videos and is more off topic?

Duplicates are fine: they will be ignored. This is part of the reason why I put all the playlists together :D
That's part of the reason I skipped the "Robert Miles 2" channel. I just wanted the main body of this to be merged, as later additions will be simple.
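The thread does not say how duplicates are ignored, so the following is only a sketch of one common approach: track the video IDs already seen while walking all playlists in a single pass. `unique_video_ids` and the playlist and video IDs used with it are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set


def unique_video_ids(playlists: Dict[str, Iterable[str]]) -> Iterator[str]:
    """Yield each video ID once, in first-seen order, across all playlists."""
    seen: Set[str] = set()
    for _playlist_id, video_ids in playlists.items():
        for video_id in video_ids:
            if video_id in seen:
                # Already yielded from an earlier playlist; ignore it.
                continue
            seen.add(video_id)
            yield video_id
```

Grouping the playlists into one pass is what makes a single `seen` set sufficient; separate datasets would each need to consult a shared record of processed IDs.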

handle youtube

0cd71b9

mruwnik requested review from ccstan99, henri123lemoine, Aprillion and Thomas-Lemoine August 1, 2023 18:44

ccstan99 approved these changes Aug 4, 2023

mruwnik merged commit 5ab84e7 into main Aug 4, 2023

mruwnik deleted the youtube-datasets branch August 4, 2023 09:52
