Check out our projects at https://github.com/FaceOnLive
The field of Artificial Intelligence (AI) is rapidly evolving, with new breakthroughs and technologies emerging at a swift pace. This document highlights some of the trending research areas within AI and links to papers and source code where enthusiasts and professionals alike can find resources and projects related to these cutting-edge topics.
Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time.
- Paper: https://arxiv.org/pdf/2403.18802v2.pdf
- Github: https://github.com/google-deepmind/long-form-factuality
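In outline, SAFE splits a long-form response into individual facts and rates each one against search results. A minimal sketch of that pipeline, assuming stand-in `call_llm` and `web_search` backends (the actual prompts and multi-step search logic live in the linked repo):

```python
# Sketch of a SAFE-style factuality check; `call_llm` and `web_search` are
# placeholders for your LLM and search backends, not part of the repo's API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any chat-completions client

def web_search(query: str) -> list[str]:
    raise NotImplementedError  # plug in a search backend returning snippets

def split_into_facts(response: str) -> list[str]:
    # Step 1: decompose the long-form answer into individual facts.
    out = call_llm(f"List each individual fact in the text, one per line:\n{response}")
    return [line.strip() for line in out.splitlines() if line.strip()]

def rate_fact(fact: str) -> bool:
    # Step 2: check each fact against retrieved evidence.
    evidence = "\n".join(web_search(fact)[:5])
    verdict = call_llm(
        f"Fact: {fact}\nEvidence:\n{evidence}\nAnswer 'supported' or 'not supported'."
    )
    return verdict.strip().lower().startswith("supported")

def supported_fraction(response: str) -> float:
    facts = split_into_facts(response)
    return sum(map(rate_fact, facts)) / max(len(facts), 1)
```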
We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation.
Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs).
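For context, DM-based inpainting is already exposed in common toolkits; a minimal sketch with the diffusers inpainting pipeline (checkpoint id, file names, and prompt are assumptions, and this shows the generic DM approach rather than this paper's specific method):

```python
# Generic diffusion-model inpainting with diffusers (not this paper's method).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("corrupted.png").convert("RGB")  # image to restore
mask = Image.open("mask.png").convert("RGB")        # white = region to fill
result = pipe(prompt="a clean photo of the scene",
              image=image, mask_image=mask).images[0]
result.save("restored.png")
```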
Inspired by these challenges, this paper presents AIOS, an LLM agent operating system, which embeds large language models into the operating system (OS) as the brain of the OS, enabling an operating system "with soul", an important step towards AGI.
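A toy sketch of the core idea, one shared "LLM kernel" scheduling requests from many agents (class and method names here are illustrative, not the AIOS API):

```python
# Toy "LLM kernel": agents submit prompts to one shared queue and a single
# scheduler serves them. Names are illustrative, not the AIOS API.
import queue

class LLMKernel:
    def __init__(self):
        self.requests = queue.Queue()

    def submit(self, agent_id, prompt):
        reply = queue.Queue(maxsize=1)  # per-request reply channel
        self.requests.put((agent_id, prompt, reply))
        return reply

    def serve_forever(self, llm):
        # FIFO here; the paper studies richer scheduling of agent requests.
        while True:
            agent_id, prompt, reply = self.requests.get()
            reply.put(llm(prompt))
```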
In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image.
We introduce VoiceCraft, a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.
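The token-infilling trick, in outline: move the span to be edited to the end of the codec-token sequence so a causal model can regenerate it conditioned on both sides. A minimal sketch (the mask-token handling is an assumption; the paper's rearrangement also spans multiple codebooks):

```python
# Rearranging codec tokens so a causal LM can infill an edited span.
# Token handling here is illustrative, not VoiceCraft's exact scheme.
def rearrange_for_infilling(tokens, span_start, span_end, mask_id):
    prefix = tokens[:span_start]           # audio before the edit
    suffix = tokens[span_end:]             # audio after the edit
    target = tokens[span_start:span_end]   # span the model must regenerate
    # The model sees prefix + <mask> + suffix, then predicts the span at the end.
    model_input = prefix + [mask_id] + suffix + [mask_id]
    return model_input, target
```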
In this paper we propose to study generalization of neural networks on small algorithmically generated datasets.
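The paper's datasets are tiny binary-operation tables, for example addition modulo a prime; a minimal builder for one such dataset (p = 97 and a 50/50 train split follow the paper's setup):

```python
# Build the modular-addition dataset used in grokking experiments:
# all pairs (a, b) with label (a + b) mod p, split into train/val.
import itertools
import torch

def modular_addition_dataset(p: int = 97, train_frac: float = 0.5):
    pairs = list(itertools.product(range(p), repeat=2))
    x = torch.tensor(pairs)            # inputs (a, b)
    y = (x[:, 0] + x[:, 1]) % p        # labels (a + b) mod p
    perm = torch.randperm(len(x))
    cut = int(train_frac * len(x))
    return (x[perm[:cut]], y[perm[:cut]]), (x[perm[cut:]], y[perm[cut:]])
```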
We release the first open-access decompilation LLMs, ranging from 1B to 33B parameters, pre-trained on 4 billion tokens of C source code and the corresponding assembly code.
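Since the released checkpoints are standard causal LMs, loading one follows the usual Hugging Face pattern; a hedged sketch (the model id below is a placeholder, and the repo defines the exact prompt format for assembly-to-C generation):

```python
# Hedged sketch: decompiling assembly with a causal LM via transformers.
# The model id is a placeholder; see the repo for released checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llm4decompile-1b"  # placeholder id, check the repo's model list
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

asm = "push rbp\nmov rbp, rsp\nmov eax, edi\nadd eax, esi\npop rbp\nret"
prompt = f"# Assembly:\n{asm}\n# C source:\n"  # illustrative prompt format
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```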
In this paper, we introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
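In outline, the spatial-temporal constraint pairs a temporal term (features should follow optical flow between frames) with a spatial term (each frame's self-similarity should match the source frame). A rough loss sketch, assuming precomputed flow-warped features and occlusion masks rather than the paper's exact formulation:

```python
# Sketch of FRESCO-style consistency losses on per-frame feature maps (C, H, W).
# Flow warping and occlusion masks are assumed to be computed elsewhere.
import torch
import torch.nn.functional as F

def inter_frame_loss(feat_t, feat_prev_warped, occlusion_mask):
    # Temporal term: penalize drift from the flow-warped previous features.
    return (occlusion_mask * (feat_t - feat_prev_warped) ** 2).mean()

def intra_frame_loss(feat_edit, feat_src):
    # Spatial term: match the within-frame self-similarity of the source.
    sim_edit = torch.einsum("cn,cm->nm", feat_edit.flatten(1), feat_edit.flatten(1))
    sim_src = torch.einsum("cn,cm->nm", feat_src.flatten(1), feat_src.flatten(1))
    return F.mse_loss(sim_edit, sim_src)
```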
Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks.
- Paper: https://arxiv.org/pdf/2403.13187v1.pdf
- Github: https://github.com/sakanaai/evolutionary-model-merge
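One slice of the model-merging idea in code: treat per-tensor merge weights as a genome and search them against a task score. A toy hill-climbing sketch (the fitness function and mutation loop are assumptions; the paper uses CMA-ES and also evolves data-flow across layers, not just parameter mixing):

```python
# Toy evolutionary search over per-tensor interpolation weights for merging
# two model state dicts. Illustrative only; not Sakana's actual recipe.
import random

def merge(state_a, state_b, w):
    # Linear interpolation of parameters, one weight per tensor.
    return {k: w[k] * state_a[k] + (1 - w[k]) * state_b[k] for k in state_a}

def evolve(state_a, state_b, fitness, generations=20, pop=8, sigma=0.1):
    best = {k: 0.5 for k in state_a}
    best_score = fitness(merge(state_a, state_b, best))
    for _ in range(generations):
        for _ in range(pop):
            cand = {k: min(1.0, max(0.0, v + random.gauss(0, sigma)))
                    for k, v in best.items()}
            score = fitness(merge(state_a, state_b, cand))
            if score > best_score:
                best, best_score = cand, score
    return best, best_score
```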
Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components.
- Paper: https://arxiv.org/pdf/2211.00593v1.pdf
- Github: https://github.com/openai/transformer-debugger
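A workhorse of this line of research is activation patching: cache a component's activation from a "clean" run, splice it into a "corrupted" run, and measure the change in output to estimate that component's causal role. A minimal PyTorch hook sketch (the module path and metric are assumptions about your model):

```python
# Activation patching via forward hooks. Assumes `module`'s output is a
# single tensor; returning a value from a forward hook replaces the output.
import torch

def patch_activation(model, module, clean_inputs, corrupted_inputs, metric):
    cache = {}

    def save_hook(mod, inp, out):
        cache["act"] = out.detach()  # cache the clean activation

    def patch_hook(mod, inp, out):
        return cache["act"]          # overwrite with the clean activation

    h = module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(**clean_inputs)
    h.remove()

    h = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_out = model(**corrupted_inputs)
    h.remove()
    return metric(patched_out)
```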
The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks.
- Paper: https://arxiv.org/pdf/2403.05525v2.pdf
- Github: https://github.com/deepseek-ai/deepseek-vl
- HuggingFace Space: https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B
We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models.
- Paper: https://arxiv.org/pdf/2403.07815v1.pdf
- Github: https://github.com/amazon-science/chronos-forecasting
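The core trick is to make a time series look like text: mean-scale the values, quantize them into a fixed vocabulary, and train a standard language model on the resulting tokens. A minimal sketch of that tokenizer (the bin count and range are assumptions; the repo's defaults differ in detail):

```python
# Chronos-style tokenization: mean scaling + uniform quantization into bins.
import numpy as np

def chronos_tokenize(series, n_bins=4094, low=-15.0, high=15.0):
    scale = float(np.abs(series).mean())
    scale = scale if scale > 0 else 1.0      # avoid dividing by zero
    scaled = series / scale                  # mean scaling
    edges = np.linspace(low, high, n_bins - 1)
    tokens = np.digitize(scaled, edges)      # token ids in [0, n_bins - 1]
    return tokens, scale

def chronos_detokenize(tokens, scale, n_bins=4094, low=-15.0, high=15.0):
    centers = np.linspace(low, high, n_bins)
    return centers[tokens] * scale           # map tokens back to values
```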
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work adapts the Mamba architecture to the video domain.
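Concretely, a video clip must first become one long token sequence that a Mamba-style scan can consume; a sketch of that patchify-and-flatten step (patch sizes and embedding dimension are assumptions, not the paper's exact configuration):

```python
# Flatten 3D video patches into a 1D token sequence for a state-space scan.
import torch
import torch.nn as nn

class VideoPatchify(nn.Module):
    def __init__(self, in_ch=3, dim=192, patch=16, tpatch=2):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim,
                              kernel_size=(tpatch, patch, patch),
                              stride=(tpatch, patch, patch))

    def forward(self, video):
        # video: (B, C, T, H, W) -> tokens: (B, T'*H'*W', dim)
        x = self.proj(video)                 # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)  # time-major raster scan order
```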
To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator.
- Paper: https://arxiv.org/pdf/2403.06738v1.pdf
- Github: https://github.com/heheyas/v3d
- HuggingFace Space: https://huggingface.co/spaces/heheyas/V3D
Despite recent advances in image-to-video generation, better controllability and local animation are less explored.
The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing.
- Paper: https://arxiv.org/pdf/2403.09055v1.pdf
- Github: https://github.com/ironjr/streammultidiffusion
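The region-control idea underneath (from MultiDiffusion, which this work builds on): run the denoiser once per (prompt, mask) pair at each step and blend the noise predictions by the masks. A minimal sketch with a diffusers-style UNet; the repo's streaming pipeline adds batching and latency optimizations on top:

```python
# MultiDiffusion-style blending of per-region noise predictions.
import torch

def blended_noise_pred(unet, latents, t, text_embs, masks):
    # text_embs: list of prompt embeddings; masks: list of {0,1} latent masks
    total = torch.zeros_like(latents)
    weight = torch.zeros_like(latents)
    for emb, mask in zip(text_embs, masks):
        eps = unet(latents, t, encoder_hidden_states=emb).sample
        total += mask * eps      # each prompt only affects its region
        weight += mask
    return total / weight.clamp(min=1e-6)
```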
Despite success in specific tasks and scenarios, existing foundation agents, empowered by large models (LMs) and advanced tools, still cannot generalize across scenarios, mainly due to dramatic differences in their observations and actions.
We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation.