How to convert simple image url and text to dataset #5215
Hi, I want to prepare a dataset. I created a CSV file (or, alternatively, an HTML file) that lists image URLs with their text.
How can I turn this into a Hugging Face dataset like https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions? 🙄
Answered by mariosasko, Nov 9, 2022
Replies: 2 comments
Hi! You can find the solution in this thread: https://discuss.huggingface.co/t/how-to-change-the-format-of-a-dataset/25104
Answer selected by camenduru
-
thanks @mariosasko ♥ I did it like this for HTML. The input file (/content/plushies.txt) looks like this:

    <img src="https://image1.jpg" alt="dog">
    <img src="https://image2.jpg" alt="cat">
    <img src="https://image3.jpg" alt="panda">

Install the dependencies and log in to the Hub:

    !pip install datasets bs4

    from huggingface_hub import notebook_login
    !git config --global credential.helper store
    notebook_login()

Download every image referenced in the HTML file:

    !mkdir plushies

    import requests
    from bs4 import BeautifulSoup

    with open('/content/plushies.txt') as html:
        content = html.read()

    # Name the parser explicitly so BeautifulSoup does not have to guess one.
    soup = BeautifulSoup(content, 'html.parser')
    for imgtag in soup.find_all('img'):
        url = imgtag['src']
        # Use the last URL segment as the local file name.
        name = url.split('/')[-1]
        headers = {'user-agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=headers)
        with open(f"/content/plushies/{name}", 'wb') as f:
            f.write(r.content)

Build the dataset from the local image paths and the alt texts, then push it:

    from datasets import Dataset, Image

    with open('/content/plushies.txt') as html:
        content = html.read()

    texts = []
    images = []
    soup = BeautifulSoup(content, 'html.parser')
    for imgtag in soup.find_all('img'):
        texts.append(imgtag['alt'])
        images.append(f"/content/plushies/{imgtag['src'].split('/')[-1]}")

    ds = Dataset.from_dict({"image": images, "text": texts})
    # Casting to Image() makes the column load the files as images, not plain strings.
    ds = ds.cast_column("image", Image())

    ds.push_to_hub("camenduru/plushies")
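
The original question also mentioned a CSV file. The same approach could probably be adapted along the lines below. This is only a sketch under assumptions: the CSV layout was not preserved in the thread, so the column names `image_url` and `text`, the sample rows, and the helper names `csv_to_columns` and `build_dataset` are all hypothetical. It also assumes the images were already downloaded into the image directory under the last segment of their URL, as in the HTML version.

```python
import csv
import io


def csv_to_columns(csv_text, image_dir, url_field="image_url", text_field="text"):
    """Turn CSV rows into the column dict that Dataset.from_dict accepts.

    url_field and text_field are hypothetical header names; rename them
    to match the real CSV header.
    """
    images, texts = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Same naming rule as the download step: last URL segment = file name.
        filename = row[url_field].split("/")[-1]
        images.append(f"{image_dir}/{filename}")
        texts.append(row[text_field])
    return {"image": images, "text": texts}


def build_dataset(columns):
    """Create a Dataset whose "image" column is decoded as images."""
    # Imported lazily so the CSV parsing above runs without `datasets` installed.
    from datasets import Dataset, Image

    ds = Dataset.from_dict(columns)
    return ds.cast_column("image", Image())


# Made-up sample rows mirroring the HTML example.
sample = "image_url,text\nhttps://image1.jpg,dog\nhttps://image2.jpg,cat\n"
columns = csv_to_columns(sample, "/content/plushies")
```

Once the image files exist on disk, `build_dataset(columns)` should give a dataset that can be pushed with `push_to_hub`, in the same way as the HTML version.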