-
World of Bits: An Open-Domain Platform for Web-Based Agents. ICML 2017
Tianlin (Tim) Shi, Andrej Karpathy, Linxi (Jim) Fan, Jonathan Hernandez, Percy Liang [pdf], 2017
-
Rico: A Mobile App Dataset for Building Data-Driven Design Applications. UIST 2017
Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, Ranjitha Kumar [pdf], 2017
-
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration. ICLR 2018
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, Percy Liang [pdf], 2018.2
-
Mapping Natural Language Instructions to Mobile UI Action Sequences. ACL 2020
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge [pdf], 2020.5
-
AndroidEnv: A Reinforcement Learning Platform for Android. ViGIL at NAACL 2021
Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, Doina Precup [pdf], 2021.5
-
Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments. ViGIL at NAACL 2021
Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer [pdf], 2021.4
-
A data-driven approach for learning to control computers. PMLR
Peter C Humphreys, David Raposo, Toby Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, Timothy Lillicrap [pdf], 2022.2
-
META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI. Arxiv
Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, Kai Yu [pdf], 2022.5
-
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. Arxiv
Shunyu Yao, Howard Chen, John Yang, Karthik Narasimhan [pdf], 2022.7
-
Enabling Conversational Interaction with Mobile UI using Large Language Models. CHI 2023
Bryan Wang, Gang Li, Yang Li [pdf], 2022.9
-
UGIF: UI Grounded Instruction Following. Arxiv
Sagar Gubbi Venkatesh, Partha Talukdar, Srini Narayanan [pdf], 2022.11
-
Multimodal Web Navigation with Instruction-Finetuned Foundation Models. ICLR 2023 Workshop ME-FoMo
Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, Izzeddin Gur [pdf], 2023.5
-
Hierarchical Prompting Assists Large Language Model on Web Navigation. ACL 2023 NLRSE workshop
Abishek Sridhar, Robert Lo, Frank F. Xu, Hao Zhu, Shuyan Zhou [pdf], 2023.5
-
From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces. Arxiv
Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, Kristina Toutanova [pdf], 2023.6
-
Mind2Web: Towards a Generalist Agent for the Web. Arxiv
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, Yu Su [pdf], 2023.6
-
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. Arxiv
Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust [pdf], 2023.7
-
WebArena: A Realistic Web Environment for Building Autonomous Agents. Arxiv
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig [pdf], 2023.7
-
Empowering LLM to use Smartphone for Intelligent Task Automation. Arxiv
Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu [pdf], 2023.8
-
Android in the Wild: A Large-Scale Dataset for Android Device Control. Arxiv
Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap [pdf], 2023.7
-
An Empirical Study & Evaluation of Modern CAPTCHAs. Arxiv
Andrew Searles, Yoshimichi Nakatsuka, Ercan Ozturk, Andrew Paverd, Gene Tsudik, Ai Enkoji [pdf], 2023.7
-
LASER: LLM Agent with State-Space Exploration for Web Navigation. Arxiv
Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, Dong Yu [pdf], 2023.9
-
You Only Look at Screens: Multimodal Chain-of-Action Agents. Arxiv
Zhuosheng Zhang, Aston Zhang [pdf], 2023.9
-
HeaP: Hierarchical Policies for Web Actions using LLMs. Arxiv
Paloma Sodhi, S.R.K. Branavan, Ryan McDonald [pdf], 2023.10
-
The Unsolved Challenges of LLMs as Generalist Web Agents: A Case Study. Arxiv
Rim Assouel, Tom Marty, Massimo Caccia, Issam H. Laradji, Alexandre Drouin, Sai Rajeswar, Hector Palacios, Quentin Cappart, David Vazquez, Nicolas Chapados, Maxime Gasse, Alexandre Lacoste [pdf], 2023.12
-
"What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces. Arxiv
Faria Huq, Jeffrey P. Bigham, Nikolas Martelaro [pdf], 2023.12
-
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation. Arxiv
Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou [pdf], 2023.12
-
GPT-4V(ision) is a Generalist Web Agent, if Grounded. Arxiv
Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su [pdf], 2024.1
-
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. Arxiv
Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu [pdf], 2024.1
-
ScreenAgent: A Vision Language Model-driven Computer Control Agent. Arxiv
Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang [pdf], 2024.2
-
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web. Arxiv
Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov [pdf], 2024.2
-
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue. Arxiv
Xing Han Lù, Zdeněk Kasner, Siva Reddy [pdf], 2024.2
-
Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study. Arxiv
Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu [pdf], 2024.3
-
AgentStudio: A Toolkit for Building General Virtual Agents. Arxiv
Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan [pdf], 2024.3
-
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? Arxiv
Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue [pdf], 2024.4
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. Arxiv
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu [pdf], 2024.4
-
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents. Arxiv
Michael Lutz, Arth Bohra, Manvel Saroyan, Artem Harutyunyan, Giovanni Campagna [pdf], 2024.4
-
Autonomous Evaluation and Refinement of Digital Agents. Arxiv
Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr [pdf], 2024.4
-
MMInA: Benchmarking Multihop Multimodal Internet Agents. Arxiv
Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu [pdf], 2024.4
-
SteP: Stacked LLM Policies for Web Actions. Arxiv
Paloma Sodhi, S.R.K. Branavan, Yoav Artzi, Ryan McDonald [pdf], 2024.4