
the dataset source #8

Open
Delicious-Bitter-Melon opened this issue Dec 1, 2024 · 12 comments

Comments

@Delicious-Bitter-Melon

Thank you for your contribution to open-source datasets. I would like to confirm whether these video resources were collected independently by your team, or whether they are based on public datasets such as CelebV-Text that you processed or used further?

@SHYuanBest
Member

SHYuanBest commented Dec 1, 2024

Thanks for your interest. The current open-source dataset was collected from YouTube independently by our team; the other part of our training data, which cannot be released publicly, is sourced from Open-Sora Plan. The CelebV-Text dataset appears to have the following problems, so we did not use it:
[image attachment listing issues with CelebV-Text]

@Delicious-Bitter-Melon
Author

The dataset uploaded in the preprint version appears to be incomplete. While the dataset contains 54,239 videos, masks are only available for 31,947 of them. @SHYuanBest

@SHYuanBest
Member

We only used 31,947 of them.

@Delicious-Bitter-Melon
Author

> We only used 31,947 of them.

Thank you for your reply. Could you clarify whether you use a face mask or a head mask to obscure unrelated areas during training?

@SHYuanBest
Member

We use the face mask first, and fall back to the head mask if the face mask is missing.
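(Editor's note: the fallback rule above can be sketched as follows. The `masks` dict and `pick_mask` helper are hypothetical, not the repository's code.)

```python
# Minimal sketch of the mask-selection rule: prefer the face mask,
# and fall back to the head mask when the face mask is missing.
# `masks` is a hypothetical per-video dict of precomputed mask paths.
def pick_mask(masks: dict):
    if masks.get("face") is not None:
        return masks["face"]
    return masks.get("head")

print(pick_mask({"face": "face.png", "head": "head.png"}))  # face.png
print(pick_mask({"face": None, "head": "head.png"}))        # head.png
```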

@Delicious-Bitter-Melon
Author

> We use the face mask first, and fall back to the head mask if the face mask is missing.

Thank you again for your reply. I am still a little unclear about the process: do you first concatenate the VAE features of the keypoint maps and the VAE features of the face images along the frame dimension, and then concatenate these with the noise video along the channel dimension, for a total of 32 channels?

@SHYuanBest
Member

Yes.
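(Editor's note: a shape-level sketch of the confirmed conditioning scheme. All sizes here, including the 16-channel VAE latent and the frame counts, are illustrative assumptions, not the repository's actual configuration.)

```python
import numpy as np

# Hypothetical latent shapes (channel-first: C, T, H, W).
C, T, H, W = 16, 12, 30, 45
noise_video  = np.random.randn(C, T, H, W)      # noisy video latents
face_latents = np.random.randn(C, 1, H, W)      # VAE features of the face image
kps_latents  = np.random.randn(C, T - 1, H, W)  # VAE features of the keypoint maps
# (frame counts chosen so the two streams sum to T; an assumption for the sketch)

# 1) concatenate face-image and keypoint-map features along the frame axis
cond = np.concatenate([face_latents, kps_latents], axis=1)  # (C, T, H, W)

# 2) concatenate the conditioning with the noise video along the channel axis
model_input = np.concatenate([noise_video, cond], axis=0)
print(model_input.shape)  # (32, 12, 30, 45) -> 32 channels in total
```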

@Delicious-Bitter-Melon
Author

> Yes.

Thanks.

@Delicious-Bitter-Melon
Author

Hi. Could you provide further clarification regarding the preference for "half-body or full-body images" as mentioned in the Hugging Face Space documentation? From my understanding, the system utilizes FaceXLib to detect and crop faces, which should effectively handle face images, half-body images, and full-body images alike. However, your statement seems to emphasize half-body and full-body images while not explicitly mentioning face images. Could you explain how this preference impacts the processing of face images and why there might be a specific emphasis on half-body and full-body shots?

@SHYuanBest
Member

If we only input a cropped face, FaceXLib is likely to fail to detect the face. However, using half-body or full-body images significantly reduces this likelihood.
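(Editor's note: a rough illustration of why extra context helps. Padding a tight crop with a margin is a hypothetical workaround sketched here, not the repository's code, and replicated border pixels are only a stand-in for real background context.)

```python
import numpy as np

# Hypothetical workaround: a tightly cropped face often fails detection,
# so pad the crop with a margin (here, replicated edge pixels) to give
# the detector more surrounding context, approximating a half-body framing.
def pad_crop(img: np.ndarray, margin_ratio: float = 0.5) -> np.ndarray:
    h, w = img.shape[:2]
    mh, mw = int(h * margin_ratio), int(w * margin_ratio)
    return np.pad(img, ((mh, mh), (mw, mw), (0, 0)), mode="edge")

face = np.zeros((112, 112, 3), dtype=np.uint8)
print(pad_crop(face).shape)  # (224, 224, 3)
```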

@Delicious-Bitter-Melon
Author

> If we only input a cropped face, FaceXLib is likely to fail to detect the face. However, using half-body or full-body images significantly reduces this likelihood.

Got it. Thank you very much for your prompt response.

@SHYuanBest
Member

> The dataset uploaded in the preprint version appears to be incomplete. While the dataset contains 54,239 videos, masks are only available for 31,947 of them. @SHYuanBest

Oh, I missed some important details. The 31,947 videos are single-face, and the remaining ones are multi-face. See #21.
