Check out our projects at https://github.com/FaceOnLive
The field of Artificial Intelligence (AI) is rapidly evolving, with new breakthroughs and technologies emerging at a swift pace. This document highlights some of the trending research areas within AI and links to papers and source code where enthusiasts and professionals alike can find resources and projects related to these cutting-edge topics.
Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time.
- Paper: https://arxiv.org/pdf/2403.18802v2.pdf
- Github: https://github.com/google-deepmind/long-form-factuality
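In outline, SAFE splits a long-form response into individual facts and rates each one against search results. A minimal sketch of that pipeline, assuming stand-in `call_llm` and `web_search` backends (the actual prompts and multi-step search logic live in the linked repo):

```python
# Sketch of a SAFE-style factuality check; `call_llm` and `web_search` are
# placeholders for your LLM and search backends, not part of the repo's API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any chat-completions client

def web_search(query: str) -> list[str]:
    raise NotImplementedError  # plug in a search backend returning snippets

def split_into_facts(response: str) -> list[str]:
    # Step 1: decompose the long-form answer into individual facts.
    out = call_llm(f"List each individual fact in the text, one per line:\n{response}")
    return [line.strip() for line in out.splitlines() if line.strip()]

def rate_fact(fact: str) -> bool:
    # Step 2: check each fact against retrieved evidence.
    evidence = "\n".join(web_search(fact)[:5])
    verdict = call_llm(
        f"Fact: {fact}\nEvidence:\n{evidence}\nAnswer 'supported' or 'not supported'."
    )
    return verdict.strip().lower().startswith("supported")

def supported_fraction(response: str) -> float:
    facts = split_into_facts(response)
    return sum(map(rate_fact, facts)) / max(len(facts), 1)
```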
We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation.
Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs).
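For context, DM-based inpainting is already exposed in common toolkits; a minimal sketch with the diffusers inpainting pipeline (checkpoint id, file names, and prompt are assumptions, and this shows the generic DM approach rather than this paper's specific method):

```python
# Generic diffusion-model inpainting with diffusers (not this paper's method).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("corrupted.png").convert("RGB")  # image to restore
mask = Image.open("mask.png").convert("RGB")        # white = region to fill
result = pipe(prompt="a clean photo of the scene",
              image=image, mask_image=mask).images[0]
result.save("restored.png")
```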
Inspired by these challenges, this paper presents AIOS, an LLM agent operating system, which embeds large language models into the operating system (OS) as the brain of the OS, enabling an operating system "with soul", an important step towards AGI.
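A toy sketch of the core idea, one shared "LLM kernel" scheduling requests from many agents (class and method names here are illustrative, not the AIOS API):

```python
# Toy "LLM kernel": agents submit prompts to one shared queue and a single
# scheduler serves them. Names are illustrative, not the AIOS API.
import queue

class LLMKernel:
    def __init__(self):
        self.requests = queue.Queue()

    def submit(self, agent_id, prompt):
        reply = queue.Queue(maxsize=1)  # per-request reply channel
        self.requests.put((agent_id, prompt, reply))
        return reply

    def serve_forever(self, llm):
        # FIFO here; the paper studies richer scheduling of agent requests.
        while True:
            agent_id, prompt, reply = self.requests.get()
            reply.put(llm(prompt))
```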
In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image.
We introduce VoiceCraft, a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.
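The token-infilling trick, in outline: move the span to be edited to the end of the codec-token sequence so a causal model can regenerate it conditioned on both sides. A minimal sketch (the mask-token handling is an assumption; the paper's rearrangement also spans multiple codebooks):

```python
# Rearranging codec tokens so a causal LM can infill an edited span.
# Token handling here is illustrative, not VoiceCraft's exact scheme.
def rearrange_for_infilling(tokens, span_start, span_end, mask_id):
    prefix = tokens[:span_start]           # audio before the edit
    suffix = tokens[span_end:]             # audio after the edit
    target = tokens[span_start:span_end]   # span the model must regenerate
    # The model sees prefix + <mask> + suffix, then predicts the span at the end.
    model_input = prefix + [mask_id] + suffix + [mask_id]
    return model_input, target
```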
In this paper we propose to study generalization of neural networks on small algorithmically generated datasets.
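The paper's datasets are tiny binary-operation tables, for example addition modulo a prime; a minimal builder for one such dataset (p = 97 and a 50/50 train split follow the paper's setup):

```python
# Build the modular-addition dataset used in grokking experiments:
# all pairs (a, b) with label (a + b) mod p, split into train/val.
import itertools
import torch

def modular_addition_dataset(p: int = 97, train_frac: float = 0.5):
    pairs = list(itertools.product(range(p), repeat=2))
    x = torch.tensor(pairs)            # inputs (a, b)
    y = (x[:, 0] + x[:, 1]) % p        # labels (a + b) mod p
    perm = torch.randperm(len(x))
    cut = int(train_frac * len(x))
    return (x[perm[:cut]], y[perm[:cut]]), (x[perm[cut:]], y[perm[cut:]])
```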
We release the first open-access decompilation LLMs, ranging from 1B to 33B parameters, pre-trained on 4 billion tokens of C source code and the corresponding assembly code.
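Since the released checkpoints are standard causal LMs, loading one follows the usual Hugging Face pattern; a hedged sketch (the model id below is a placeholder, and the repo defines the exact prompt format for assembly-to-C generation):

```python
# Hedged sketch: decompiling assembly with a causal LM via transformers.
# The model id is a placeholder; see the repo for released checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llm4decompile-1b"  # placeholder id, check the repo's model list
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

asm = "push rbp\nmov rbp, rsp\nmov eax, edi\nadd eax, esi\npop rbp\nret"
prompt = f"# Assembly:\n{asm}\n# C source:\n"  # illustrative prompt format
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```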
In this paper, we introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
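In outline, the spatial-temporal constraint pairs a temporal term (features should follow optical flow between frames) with a spatial term (each frame's self-similarity should match the source frame). A rough loss sketch, assuming precomputed flow-warped features and occlusion masks rather than the paper's exact formulation:

```python
# Sketch of FRESCO-style consistency losses on per-frame feature maps (C, H, W).
# Flow warping and occlusion masks are assumed to be computed elsewhere.
import torch
import torch.nn.functional as F

def inter_frame_loss(feat_t, feat_prev_warped, occlusion_mask):
    # Temporal term: penalize drift from the flow-warped previous features.
    return (occlusion_mask * (feat_t - feat_prev_warped) ** 2).mean()

def intra_frame_loss(feat_edit, feat_src):
    # Spatial term: match the within-frame self-similarity of the source.
    sim_edit = torch.einsum("cn,cm->nm", feat_edit.flatten(1), feat_edit.flatten(1))
    sim_src = torch.einsum("cn,cm->nm", feat_src.flatten(1), feat_src.flatten(1))
    return F.mse_loss(sim_edit, sim_src)
```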
Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks.
- Paper: https://arxiv.org/pdf/2403.13187v1.pdf
- Github: https://github.com/sakanaai/evolutionary-model-merge
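One slice of the model-merging idea in code: treat per-tensor merge weights as a genome and search them against a task score. A toy hill-climbing sketch (the fitness function and mutation loop are assumptions; the paper uses CMA-ES and also evolves data-flow across layers, not just parameter mixing):

```python
# Toy evolutionary search over per-tensor interpolation weights for merging
# two model state dicts. Illustrative only; not Sakana's actual recipe.
import random

def merge(state_a, state_b, w):
    # Linear interpolation of parameters, one weight per tensor.
    return {k: w[k] * state_a[k] + (1 - w[k]) * state_b[k] for k in state_a}

def evolve(state_a, state_b, fitness, generations=20, pop=8, sigma=0.1):
    best = {k: 0.5 for k in state_a}
    best_score = fitness(merge(state_a, state_b, best))
    for _ in range(generations):
        for _ in range(pop):
            cand = {k: min(1.0, max(0.0, v + random.gauss(0, sigma)))
                    for k, v in best.items()}
            score = fitness(merge(state_a, state_b, cand))
            if score > best_score:
                best, best_score = cand, score
    return best, best_score
```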
Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components.
- Paper: https://arxiv.org/pdf/2211.00593v1.pdf
- Github: https://github.com/openai/transformer-debugger
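A workhorse of this line of research is activation patching: cache a component's activation from a "clean" run, splice it into a "corrupted" run, and measure the change in output to estimate that component's causal role. A minimal PyTorch hook sketch (the module path and metric are assumptions about your model):

```python
# Activation patching via forward hooks. Assumes `module`'s output is a
# single tensor; returning a value from a forward hook replaces the output.
import torch

def patch_activation(model, module, clean_inputs, corrupted_inputs, metric):
    cache = {}

    def save_hook(mod, inp, out):
        cache["act"] = out.detach()  # cache the clean activation

    def patch_hook(mod, inp, out):
        return cache["act"]          # overwrite with the clean activation

    h = module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(**clean_inputs)
    h.remove()

    h = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_out = model(**corrupted_inputs)
    h.remove()
    return metric(patched_out)
```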
The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks.
- Paper: https://arxiv.org/pdf/2403.05525v2.pdf
- Github: https://github.com/deepseek-ai/deepseek-vl
- HuggingFace Space: https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B
We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models.
- Paper: https://arxiv.org/pdf/2403.07815v1.pdf
- Github: https://github.com/amazon-science/chronos-forecasting
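The core trick is to make a time series look like text: mean-scale the values, quantize them into a fixed vocabulary, and train a standard language model on the resulting tokens. A minimal sketch of that tokenizer (the bin count and range are assumptions; the repo's defaults differ in detail):

```python
# Chronos-style tokenization: mean scaling + uniform quantization into bins.
import numpy as np

def chronos_tokenize(series, n_bins=4094, low=-15.0, high=15.0):
    scale = float(np.abs(series).mean())
    scale = scale if scale > 0 else 1.0      # avoid dividing by zero
    scaled = series / scale                  # mean scaling
    edges = np.linspace(low, high, n_bins - 1)
    tokens = np.digitize(scaled, edges)      # token ids in [0, n_bins - 1]
    return tokens, scale

def chronos_detokenize(tokens, scale, n_bins=4094, low=-15.0, high=15.0):
    centers = np.linspace(low, high, n_bins)
    return centers[tokens] * scale           # map tokens back to values
```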
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work adapts the Mamba architecture to the video domain.
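Concretely, a video clip must first become one long token sequence that a Mamba-style scan can consume; a sketch of that patchify-and-flatten step (patch sizes and embedding dimension are assumptions, not the paper's exact configuration):

```python
# Flatten 3D video patches into a 1D token sequence for a state-space scan.
import torch
import torch.nn as nn

class VideoPatchify(nn.Module):
    def __init__(self, in_ch=3, dim=192, patch=16, tpatch=2):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim,
                              kernel_size=(tpatch, patch, patch),
                              stride=(tpatch, patch, patch))

    def forward(self, video):
        # video: (B, C, T, H, W) -> tokens: (B, T'*H'*W', dim)
        x = self.proj(video)                 # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)  # time-major raster scan order
```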
To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator.
- Paper: https://arxiv.org/pdf/2403.06738v1.pdf
- Github: https://github.com/heheyas/v3d
- HuggingFace Space: https://huggingface.co/spaces/heheyas/V3D
Despite recent advances in image-to-video generation, better controllability and local animation are less explored.
The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing.
- Paper: https://arxiv.org/pdf/2403.09055v1.pdf
- Github: https://github.com/ironjr/streammultidiffusion
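The region-control idea underneath (from MultiDiffusion, which this work builds on): run the denoiser once per (prompt, mask) pair at each step and blend the noise predictions by the masks. A minimal sketch with a diffusers-style UNet; the repo's streaming pipeline adds batching and latency optimizations on top:

```python
# MultiDiffusion-style blending of per-region noise predictions.
import torch

def blended_noise_pred(unet, latents, t, text_embs, masks):
    # text_embs: list of prompt embeddings; masks: list of {0,1} latent masks
    total = torch.zeros_like(latents)
    weight = torch.zeros_like(latents)
    for emb, mask in zip(text_embs, masks):
        eps = unet(latents, t, encoder_hidden_states=emb).sample
        total += mask * eps      # each prompt only affects its region
        weight += mask
    return total / weight.clamp(min=1e-6)
```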
Despite success in specific tasks and scenarios, existing foundation agents, empowered by large models (LMs) and advanced tools, still cannot generalize across scenarios, mainly due to dramatic differences in their observations and actions.
We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation.