<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-vKruj+a13U8yHIkAyGgK1J3ArTLzrFGBbBc0tDp4ad/EyewESeXE/Iv67Aj8gKZ0" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-PwRUT/YqbnEjkZO0zZxNqcxACrXe+j766U2amXcgMg5457rve2Y7I6ZJSm2A0mS4" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/@alpinejs/[email protected]/dist/cdn.min.js"></script>
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/cdn.min.js"></script>
</head>
<body>
<div class="relative mx-auto h-full max-w-2xl text-md">
<table class="table-auto">
<tbody>
<tr>
<td></td>
<td>
<h1 class="text-4xl pt-4 font-bold"><span class="underline">Adam's</span> Arxiv FrontPage</h1>
<br>
<p>Generated on 2024-10-17.</p><br/>
<p class="text-sm text-gray-500 pt-2">This frontpage is made by scraping arXiv and running a sentence model that detects whether an abstract describes a paper about a topic of interest. One cool feature: nearly all of it runs via GitHub Actions.</p>
<br>
</td>
</tr><tr>
<td></td>
<td>
<h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Unveiling the Limits of Alignment: Multi-modal Dynamic Local Fusion Network and A Benchmark for Unaligned RGBT Video Object Detection
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Current RGB-Thermal Video Object Detection (RGBT VOD) methods still depend on manually aligning data at the image level, which hampers its practical application in real-world scenarios since image pairs captured by multispectral sensors often differ in both fields of view and resolution. To address this limitation, we propose a Multi-modal Dynamic Local fusion Network (MDLNet) designed to handle unaligned RGBT image pairs. Specifically, our proposed Multi-modal Dynamic Local Fusion (MDLF) module includes a set of predefined boxes, each enhanced with random Gaussian noise to generate a dynamic box. Each box selects a local region from the original high-resolution RGB image. This region is then fused with the corresponding information from another modality and reinserted into the RGB. This method adapts to various data alignment scenarios by interacting with local features across different ranges. Simultaneously, we introduce a Cascaded Temporal Scrambler (CTS) within an end-to-end architecture. This module leverages consistent spatiotemporal information from consecutive frames to enhance the representation capability of the current frame while maintaining network efficiency.<span class='px-1 mx-1 bg-yellow-200'>We have curated an open dataset called UVT-VOD2024 for unaligned RGBT VOD. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.715</span></span><span class='px-1 mx-1 bg-yellow-200'>It consists of 30,494 pairs of unaligned RGBT images captured directly from a multispectral camera. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.756</span></span>We conduct a comprehensive evaluation and comparison with MDLNet and state-of-the-art (SOTA) models, demonstrating the superior effectiveness of MDLNet. We will release our code and UVT-VOD2024 to the public for further research.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12143v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
VisAnatomy: An SVG Chart Corpus with Fine-Grained Semantic Labels
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Chart corpora, which comprise data visualizations and their semantic labels, are crucial for advancing visualization research. However, the labels in most existing chart corpora are high-level (e.g., chart types), hindering their utility for broader interactive applications like chart reuse, animation, and accessibility.<span class='px-1 mx-1 bg-yellow-200'>In this paper, we contribute VisAnatomy, a chart corpus containing 942 real-world SVG charts produced by over 50 tools, encompassing 40 chart types and featuring structural and stylistic design variations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.715</span></span>Each chart is augmented with multilevel fine-grained labels on its semantic components, including each graphical element's type, role, and position, hierarchical groupings of elements, group layouts, and visual encodings. We demonstrate the richness of the semantic labels by comparing VisAnatomy with existing corpora. We illustrate the usefulness of VisAnatomy through four applications: chart type classification, chart decomposition, animation authoring, and content navigation for accessibility. Finally, we discuss our plan to improve VisAnatomy and the research opportunities VisAnatomy presents.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12268v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents.<span class='px-1 mx-1 bg-yellow-200'>Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.921</span></span>Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and close-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12361v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Learning to Predict Usage Options of Product Reviews with LLM-Generated Labels
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>Annotating large datasets can be challenging. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.849</span></span>However, crowd-sourcing is often expensive and can lack quality, especially for non-trivial tasks. We propose a method of using LLMs as few-shot learners for annotating data in a complex natural language task where we learn a standalone model to predict usage options for products from customer reviews. We also propose a new evaluation metric for this scenario, HAMS4, that can be used to compare a set of strings with multiple reference sets. Learning a custom model offers individual control over energy efficiency and privacy measures compared to using the LLM directly for the sequence-to-sequence task. We compare this data annotation approach with other traditional methods and demonstrate how LLMs can enable considerable cost savings. We find that the quality of the resulting data exceeds the level attained by third-party vendor services and that GPT-4-generated labels even reach the level of domain experts. We make the code and generated labels publicly available.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12470v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Data-Driven Gyroscope Calibration
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Gyroscopes are inertial sensors that measure the angular velocity of the platforms to which they are attached. To estimate the gyroscope deterministic error terms prior mission start, a calibration procedure is performed. When considering low-cost gyroscopes, the calibration requires a turntable as the gyros are incapable of sensing the Earth turn rate. In this paper, we propose a data-driven framework to estimate the scale factor and bias of a gyroscope.<span class='px-1 mx-1 bg-yellow-200'>To train and validate our approach, a dataset of 56 minutes was recorded using a turntable. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.769</span></span>We demonstrated that our proposed approach outperforms the model-based approach, in terms of accuracy and convergence time. Specifically, we improved the scale factor and bias estimation by an average of 72% during six seconds of calibration time, demonstrating an average of 75% calibration time improvement. That is, instead of minutes, our approach requires only several seconds for the calibration.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12485v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
A Claim Decomposition Benchmark for Long-form Answer Verification
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The advancement of LLMs has significantly boosted the performance of complex long-form question answering tasks. However, one prominent issue of LLMs is the generated "hallucination" responses that are not factual. Consequently, attribution for each claim in responses becomes a common solution to improve the factuality and verifiability. Existing researches mainly focus on how to provide accurate citations for the response, which largely overlook the importance of identifying the claims or statements for each response. To bridge this gap, we introduce a new claim decomposition benchmark, which requires building system that can identify atomic and checkworthy claims for LLM responses.<span class='px-1 mx-1 bg-yellow-200'>Specifically, we present the Chinese Atomic Claim Decomposition Dataset (CACDD), which builds on the WebCPM dataset with additional expert annotations to ensure high data quality. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.761</span></span>The CACDD encompasses a collection of 500 human-annotated question-answer pairs, including a total of 4956 atomic claims. We further propose a new pipeline for human annotation and describe the challenges of this task. In addition, we provide experiment results on zero-shot, few-shot and fine-tuned LLMs as baselines. The results show that the claim decomposition is highly challenging and requires further explorations. All code and data are publicly available at https://github.com/FBzzh/CACDD.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12558v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Adaptive Prompt Learning with SAM for Few-shot Scanning Probe Microscope Image Segmentation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The Segment Anything Model (SAM) has demonstrated strong performance in image segmentation of natural scene images. However, its effectiveness diminishes markedly when applied to specific scientific domains, such as Scanning Probe Microscope (SPM) images. This decline in accuracy can be attributed to the distinct data distribution and limited availability of the data inherent in the scientific images. On the other hand, the acquisition of adequate SPM datasets is both time-intensive and laborious as well as skill-dependent. To address these challenges, we propose an Adaptive Prompt Learning with SAM (APL-SAM) framework tailored for few-shot SPM image segmentation. Our approach incorporates two key innovations to enhance SAM: 1) An Adaptive Prompt Learning module leverages few-shot embeddings derived from limited support set to learn adaptively central representatives, serving as visual prompts. This innovation eliminates the need for time-consuming online user interactions for providing prompts, such as exhaustively marking points and bounding boxes slice by slice; 2) A multi-source, multi-level mask decoder specifically designed for few-shot SPM image segmentation is introduced, which can effectively capture the correspondence between the support and query images.<span class='px-1 mx-1 bg-yellow-200'>To facilitate comprehensive training and evaluation, we introduce a new dataset, SPM-Seg, curated for SPM image segmentation. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.767</span></span>Extensive experiments on this dataset reveal that the proposed APL-SAM framework significantly outperforms the original SAM, achieving over a 30% improvement in terms of Dice Similarity Coefficient with only one-shot guidance. Moreover, APL-SAM surpasses state-of-the-art few-shot segmentation methods and even fully supervised approaches in performance.<span class='px-1 mx-1 bg-yellow-200'>Code and dataset used in this study will be made available upon acceptance. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.732</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12562v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs rely on a single method of visual grounding, whereas complex medical tasks demand more versatile approaches. Additionally, while most VLMs process only 2D images, a large portion of medical images are 3D. The lack of medical data further compounds these obstacles. To address these challenges, we present VividMed, a vision language model with versatile visual grounding for medicine. Our model supports generating both semantic segmentation masks and instance-level bounding boxes, and accommodates various imaging modalities, including both 2D and 3D data.<span class='px-1 mx-1 bg-yellow-200'>We design a three-stage training procedure and an automatic data synthesis pipeline based on open datasets and models. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.709</span></span>Besides visual grounding tasks, VividMed also excels in other common downstream tasks, including Visual Question Answering (VQA) and report generation. Ablation studies empirically show that the integration of visual grounding ability leads to improved performance on these tasks. Our code is publicly available at https://github.com/function2-llx/MMMM.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12694v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
MultiCamCows2024 -- A Multi-view Image Dataset for AI-driven Holstein-Friesian Cattle Re-Identification on a Working Farm
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>We present MultiCamCows2024, a farm-scale image dataset filmed across multiple cameras for the biometric identification of individual Holstein-Friesian cattle exploiting their unique black and white coat-patterns. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.742</span></span><span class='px-1 mx-1 bg-yellow-200'>Captured by three ceiling-mounted visual sensors covering adjacent barn areas over seven days on a working dairy farm, the dataset comprises 101,329 images of 90 cows, plus the underlying original CCTV footage. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.716</span></span>The dataset is provided alongside full computer vision recognition baselines, that is both a supervised and self-supervised learning framework for individual cow identification trained on cattle tracklets. We report a performance above 96% single image identification accuracy from the dataset and demonstrate that combining data from multiple cameras during learning enhances self-supervised identification. We show that our framework enables fully automatic cattle identification, barring only the simple human verification of tracklet integrity during data collection. Crucially, our study highlights that multi-camera, supervised and self-supervised components in tandem not only deliver highly accurate individual cow identification but also achieve this efficiently with no labelling of cattle identities by humans at all. We argue that this improvement in efficacy has practical implications for livestock management, behaviour analysis, and agricultural monitoring. For full reproducibility and practical ease of use, we publish all key software and code including re-identification components and the species detector with this paper.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12695v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins.<span class='px-1 mx-1 bg-yellow-200'>We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.792</span></span>Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12705v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Drillboards: Adaptive Visualization Dashboards for Dynamic Personalization of Visualization Experiences
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>We present drillboards, a technique for adaptive visualization dashboards consisting of a hierarchy of coordinated charts that the user can drill down to reach a desired level of detail depending on their expertise, interest, and desired effort. This functionality allows different users to personalize the same dashboard to their specific needs and expertise. The technique is based on a formal vocabulary of chart representations and rules for merging multiple charts of different types and data into single composite representations. The drillboard hierarchy is created by iteratively applying these rules starting from a baseline dashboard, with each consecutive operation yielding a new dashboard with fewer charts and progressively more abstract and simplified views.<span class='px-1 mx-1 bg-yellow-200'>We also present an authoring tool for building drillboards and show how it can be applied to an agricultural dataset with hundreds of expert users. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.845</span></span>Our evaluation asked three domain experts to author drillboards for their own datasets, which we then showed to casual end-users with favorable outcomes.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12744v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-15</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>Multimodal remote sensing data, collected from a variety of sensors, provide a comprehensive and integrated perspective of the Earth's surface. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.73</span></span>By employing multimodal fusion techniques, semantic segmentation offers more detailed insights into geographic scenes compared to single-modality approaches. Building upon recent advancements in vision foundation models, particularly the Segment Anything Model (SAM), this study introduces a novel Multimodal Adapter-based Network (MANet) for multimodal remote sensing semantic segmentation. At the core of this approach is the development of a Multimodal Adapter (MMAdapter), which fine-tunes SAM's image encoder to effectively leverage the model's general knowledge for multimodal data. In addition, a pyramid-based Deep Fusion Module (DFM) is incorporated to further integrate high-level geographic features across multiple scales before decoding. This work not only introduces a novel network for multimodal fusion, but also demonstrates, for the first time, SAM's powerful generalization capabilities with Digital Surface Model (DSM) data. Experimental results on two well-established fine-resolution multimodal remote sensing datasets, ISPRS Vaihingen and ISPRS Potsdam, confirm that the proposed MANet significantly surpasses current models in the task of multimodal semantic segmentation. The source code for this work will be accessible at https://github.com/sstary/SSRS.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.11160v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-15</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Enhancing Assamese NLP Capabilities: Introducing a Centralized Dataset Repository
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>This paper introduces a centralized, open-source dataset repository designed to advance NLP and NMT for Assamese, a low-resource language. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.779</span></span> The repository supports various tasks like sentiment analysis, named entity recognition, and machine translation by providing both pre-training and fine-tuning corpora. We review existing datasets, highlighting the need for standardized resources in Assamese NLP, and discuss potential applications in AI-driven research, such as LLMs, OCR, and chatbots. While promising, challenges like data scarcity and linguistic diversity remain. The repository aims to foster collaboration and innovation, promoting Assamese language research in the digital age.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.11291v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-15</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Do LLMs Have the Generalization Ability in Conducting Causal Inference?
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>In causal inference, generalization capability refers to the ability to apply causal inference methods to new data to estimate the causal effect between unknown phenomena, which is crucial for expanding the boundaries of knowledge. Studies have evaluated the causal inference capabilities of Large Language Models (LLMs) concerning known phenomena, yet the generalization capabilities of LLMs concerning unseen phenomena remain unexplored. In this paper, we select four tasks: Causal Path Discovery (CP), Backdoor Adjustment (BA), Factual Inference (FI), and Counterfactual Inference (CI) as representatives of causal inference tasks. To generate evaluation questions about previously unseen phenomena in new data on the four tasks, we propose a benchmark generation framework, which employs randomly generated graphs and node names to formulate questions within hypothetical new causal scenarios. <span class='px-1 mx-1 bg-yellow-200'>Based on this framework, we compile a benchmark dataset of varying levels of question complexity. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.739</span></span> We extensively tested the generalization capabilities of five leading LLMs across the four tasks. Experimental results reveal that while LLMs exhibit good generalization performance in solving simple CP, FI, and complex CI questions, they encounter difficulties when tackling BA questions and show obvious performance fluctuations as the problem complexity changes. Furthermore, when the names of phenomena incorporate existing terms, even if these names are entirely novel, their generalization performance can still be hindered by interference from familiar terms.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.11385v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-15</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Network Representation Learning for Biophysical Neural Network Analysis
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The analysis of biophysical neural networks (BNNs) has been a longstanding focus in computational neuroscience. A central yet unresolved challenge in BNN analysis lies in deciphering the correlations between neuronal and synaptic dynamics, their connectivity patterns, and learning processes. To address this, we introduce a novel BNN analysis framework grounded in network representation learning (NRL), which leverages attention scores to uncover intricate correlations between network components and their features. Our framework integrates a new computational graph (CG)-based BNN representation, a bio-inspired graph attention network (BGAN) that enables multiscale correlation analysis across BNN representations, and an extensive BNN dataset. The CG-based representation captures key computational features, information flow, and structural relationships underlying neuronal and synaptic dynamics, while BGAN reflects the compositional structure of neurons, including dendrites, somas, and axons, as well as bidirectional information flows between BNN components. <span class='px-1 mx-1 bg-yellow-200'>The dataset comprises publicly available models from ModelDB, reconstructed using Python and standardized in NeuroML format, and is augmented with data derived from canonical neuron and synapse models. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.753</span></span> To our knowledge, this study is the first to apply an NRL-based approach to the full spectrum of BNNs and their analysis.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.11503v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-15</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Recent advancements in Multi-modal Large Language Models (MLLMs) have opened new avenues for applications in Embodied AI. Building on previous work, EgoThink, we introduce VidEgoThink, a comprehensive benchmark for evaluating egocentric video understanding capabilities. To bridge the gap between MLLMs and low-level control in Embodied AI, we design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding, and reward modeling. <span class='px-1 mx-1 bg-yellow-200'>To minimize manual annotation costs, we develop an automatic data generation pipeline based on the Ego4D dataset, leveraging the prior knowledge and multimodal capabilities of GPT-4o. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.85</span></span> Three human annotators then filter the generated data to ensure diversity and quality, resulting in the VidEgoThink benchmark. We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs. Experimental results indicate that all MLLMs, including GPT-4o, perform poorly across all tasks related to egocentric video understanding. These findings suggest that foundation models still require significant advancements to be effectively applied to first-person scenarios in Embodied AI. In conclusion, VidEgoThink reflects a research trend towards employing MLLMs for egocentric vision, akin to human capabilities, enabling active observation and interaction in complex real-world environments.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.11623v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-15</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Robotic Arm Platform for Multi-View Image Acquisition and 3D Reconstruction in Minimally Invasive Surgery
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Minimally invasive surgery (MIS) offers significant benefits such as reduced recovery time and minimised patient trauma, but poses challenges in visibility and access, making accurate 3D reconstruction a significant tool in surgical planning and navigation. This work introduces a robotic arm platform for efficient multi-view image acquisition and precise 3D reconstruction in MIS settings. We adapted a laparoscope to a robotic arm and captured ex-vivo images of several ovine organs across varying lighting conditions (operating room and laparoscopic) and trajectories (spherical and laparoscopic). We employed recently released learning-based feature matchers combined with COLMAP to produce our reconstructions. The reconstructions were evaluated quantitatively against high-precision laser scans. Our results show that whilst reconstructions suffer most under realistic MIS lighting and trajectory, many versions of our pipeline achieve close to sub-millimetre accuracy with an average of 1.05 mm Root Mean Squared Error and 0.82 mm Chamfer distance. Our best reconstruction results occur with operating room lighting and spherical trajectories. <span class='px-1 mx-1 bg-yellow-200'>Our robotic platform provides a tool for controlled, repeatable multi-view data acquisition for 3D generation in MIS environments, which we hope leads to new datasets for training learning-based models. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.731</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.11703v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-15</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Time Series Forecasting (TSF) is a key functionality in numerous fields, including finance, weather services, and energy management. While new TSF methods continue to emerge, many of them require domain-specific data collection and model training and struggle with poor generalization performance on new domains. Foundation models aim to overcome this limitation. Pre-trained on large-scale language or time series data, they exhibit promising inference capabilities on new or unseen data. This has spurred a surge in new TSF foundation models. We propose a new benchmark, FoundTS, to enable thorough and fair evaluation and comparison of such models. First, FoundTS covers a variety of TSF foundation models, including those based on large language models and those pretrained on time series. Next, FoundTS supports different forecasting strategies, including zero-shot, few-shot, and full-shot, thereby facilitating more thorough evaluations. Finally, FoundTS offers a pipeline that standardizes evaluation processes such as dataset splitting, loading, normalization, and few-shot sampling, thereby facilitating fair evaluations. Building on this, we report on an extensive evaluation of TSF foundation models on a broad range of datasets from diverse domains and with different statistical characteristics. Specifically, we identify pros and cons and inherent limitations of existing foundation models, and we identify directions for future model design. <span class='px-1 mx-1 bg-yellow-200'>We make our code and datasets available at https://anonymous.4open.science/r/FoundTS-C2B0. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.887</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.11802v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-15</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Large language models (LLMs) combined with tool learning have achieved impressive results in real-world applications. During tool learning, LLMs may call multiple tools in nested orders, where a later tool call may take an earlier response as its input parameters. However, nested tool learning capabilities remain under-explored, since existing benchmarks lack relevant data instances. To address this problem, we introduce NesTools to bridge the current gap in comprehensive nested tool learning evaluations. NesTools comprises a novel automatic data generation method to construct large-scale nested tool calls with different nesting structures. <span class='px-1 mx-1 bg-yellow-200'>With manual review and refinement, the dataset is of high quality and closely aligned with real-world scenarios. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.843</span></span> Therefore, NesTools can serve as a new benchmark to evaluate the nested tool learning abilities of LLMs. We conduct extensive experiments on 22 LLMs and provide in-depth analyses with NesTools, which show that current LLMs still struggle with the complex nested tool learning task.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.11805v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>As large language models rapidly evolve to support longer context, there is a notable disparity in their capability to generate output at greater lengths. A recent study suggests that the primary cause of this imbalance may be the lack of data with long output during alignment training. In light of this observation, attempts have been made to re-align foundation models with data that fills the gap, which results in models capable of generating lengthy output when instructed. In this paper, we explore the impact of data quality in tuning a model for long output, and the possibility of doing so from the starting points of human-aligned (instruct or chat) models. With careful data curation, we show that it is possible to achieve similar performance improvement in our tuned models, with only a small fraction of training data instances and compute. In addition, we assess the generalizability of such approaches by applying our tuning recipes to several models. Our findings suggest that, while capacities for generating long output vary across different models out of the box, our approach of tuning them with high-quality data using light compute consistently yields notable improvement across all models we experimented on. <span class='px-1 mx-1 bg-yellow-200'>We have made public our curated dataset for tuning long-writing capability, the implementations of model tuning and evaluation, as well as the fine-tuned models, all of which can be openly accessed. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.791</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10210v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.787</span></span> Automated Data Exploration (ADE) systems use goal-oriented methods with Large Language Models and Reinforcement Learning towards full automation. However, these methods require human involvement to anticipate goals that may limit insight extraction, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions in iterations, refining them from previous iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10270v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Machine Translation Evaluation Benchmark for Wu Chinese: Workflow and Analysis
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>We introduce a FLORES+ dataset as an evaluation benchmark for modern Wu Chinese machine translation models and showcase its compatibility with existing Wu data. Wu Chinese is mutually unintelligible with other Sinitic languages such as Mandarin and Yue (Cantonese), but uses a set of Hanzi (Chinese characters) that profoundly overlaps with others. The population of Wu speakers is the second largest among languages in China, but the language has been suffering from a significant drop in usage, especially among the younger generations. We identify Wu Chinese as a textually low-resource language and address challenges for its machine translation models. <span class='px-1 mx-1 bg-yellow-200'>Our contributions include: (1) an open-source, manually translated dataset, (2) full documentation of the process of dataset creation and validation experiments, (3) preliminary tools for Wu Chinese normalization and segmentation, and (4) benefits and limitations of our dataset, as well as implications for other low-resource languages. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.899</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10278v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Time series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training. However, the advancement of these models has been hindered by the lack of comprehensive benchmarks. To address this gap, we introduce the General Time Series Forecasting Model Evaluation, GIFT-Eval, a pioneering benchmark aimed at promoting evaluation across diverse datasets. <span class='px-1 mx-1 bg-yellow-200'>GIFT-Eval encompasses 28 datasets over 144,000 time series and 177 million data points, spanning seven domains, 10 frequencies, multivariate inputs, and prediction lengths ranging from short to long-term forecasts. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.799</span></span><span class='px-1 mx-1 bg-yellow-200'>To facilitate the effective pretraining and evaluation of foundation models, we also provide a non-leaking pretraining dataset containing approximately 230 billion data points. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.779</span></span> Additionally, we provide a comprehensive analysis of 17 baselines, which includes statistical models, deep learning models, and foundation models. We discuss each model in the context of various benchmark characteristics and offer a qualitative analysis that spans both deep learning and foundation models. We believe the insights from this analysis, along with access to this new standard zero-shot time series forecasting benchmark, will guide future developments in time series foundation models. <span class='px-1 mx-1 bg-yellow-200'>The codebase, datasets, and a leaderboard showing all the results in detail will be available soon. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.851</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10393v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Improve Meta-learning for Few-Shot Text Classification with All You Can Acquire from the Tasks
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Meta-learning has emerged as a prominent technology for few-shot text classification and has achieved promising performance. However, existing methods often encounter difficulties in drawing accurate class prototypes from support set samples, primarily due to probable large intra-class differences and small inter-class differences within the task. Recent approaches attempt to incorporate external knowledge or pre-trained language models to augment data, but this requires additional resources and thus does not suit many few-shot scenarios. In this paper, we propose a novel solution to address this issue by adequately leveraging the information within the task itself. Specifically, we utilize label information to construct a task-adaptive metric space, thereby adaptively reducing the intra-class differences and magnifying the inter-class differences. We further employ the optimal transport technique to estimate class prototypes jointly with query set samples, mitigating the problem of inaccurate and ambiguous support set samples caused by large intra-class differences. We conduct extensive experiments on eight benchmark datasets, and our approach shows obvious advantages over state-of-the-art models across all the tasks on all the datasets. <span class='px-1 mx-1 bg-yellow-200'>For reproducibility, all the datasets and code are available at https://github.com/YvoGao/LAQDA. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.935</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10454v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Cultural Fidelity in Large-Language Models: An Evaluation of Online Language Resources as a Driver of Model Performance in Value Representation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The training data for LLMs embeds societal values, increasing their familiarity with the language's culture. Our analysis found that 44% of the variance in the ability of GPT-4o to reflect the societal values of a country, as measured by the World Values Survey, correlates with the availability of digital resources in that language. Notably, the error rate was more than five times higher for the lowest-resource languages compared to the highest-resource languages. For GPT-4-turbo, this correlation rose to 72%, suggesting efforts to improve the familiarity with the non-English language beyond the web-scraped data. <span class='px-1 mx-1 bg-yellow-200'>Our study developed one of the largest and most robust datasets in this topic area with 21 country-language pairs, each of which contains 94 survey questions verified by native speakers. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.867</span></span> Our results highlight the link between LLM performance and digital data availability in target languages. Weaker performance in low-resource languages, especially prominent in the Global South, may worsen digital divides. We discuss strategies proposed to address this, including developing multilingual LLMs from the ground up and enhancing fine-tuning on diverse linguistic datasets, as seen in African language initiatives.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10489v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. <span class='px-1 mx-1 bg-yellow-200'>In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.721</span></span> Instead of unifying these problems into standard multi-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10563v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
BrainMVP: Multi-modal Vision Pre-training for Brain Image Analysis using Multi-parametric MRI
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Accurate diagnosis of brain abnormalities is greatly enhanced by the inclusion of complementary multi-parametric MRI imaging data. There is significant potential to develop a universal pre-training model that can be quickly adapted for image modalities and various clinical scenarios. However, current models often rely on uni-modal image data, neglecting the cross-modal correlations among different image modalities or struggling to scale up pre-training in the presence of missing modality data. In this paper, we propose BrainMVP, a multi-modal vision pre-training framework for brain image analysis using multi-parametric MRI scans. <span class='px-1 mx-1 bg-yellow-200'>First, we collect 16,022 brain MRI scans (over 2.4 million images), encompassing eight MRI modalities sourced from a diverse range of centers and devices. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.801</span></span> Then, a novel pre-training paradigm is proposed for the multi-modal MRI data, addressing the issue of missing modalities and achieving multi-modal information fusion. Cross-modal reconstruction is explored to learn distinctive brain image embeddings and efficient modality fusion capabilities. A modality-wise data distillation module is proposed to extract the essential representation of each MR image modality for both pre-training and downstream application purposes. Furthermore, we introduce a modality-aware contrastive learning module to enhance the cross-modality association within a study. Extensive experiments on downstream tasks demonstrate superior performance compared to state-of-the-art pre-training methods in the medical domain, with a Dice Score improvement of 0.28%-14.47% across six segmentation benchmarks and a consistent accuracy improvement of 0.65%-18.07% in four individual classification tasks.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10604v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Adapting medical Large Language Models to local languages can reduce barriers to accessing healthcare services, but data scarcity remains a significant challenge, particularly for low-resource languages.<span class='px-1 mx-1 bg-yellow-200'>To address this, we first construct a high-quality medical dataset and conduct analysis to ensure its quality. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.808</span></span>In order to leverage the generalization capability of multilingual LLMs to efficiently scale to more resource-constrained languages, we explore the internal information flow of LLMs from a multilingual perspective using Mixture of Experts (MoE) modularity. Technically, we propose a novel MoE routing method that employs language-specific experts and cross-lingual routing. Inspired by circuit theory, our routing analysis revealed a Spread Out in the End information flow mechanism: while earlier layers concentrate cross-lingual information flow, the later layers exhibit language-specific divergence. This insight directly led to the development of the Post-MoE architecture, which applies sparse routing only in the later layers while keeping the other layers dense. Experimental results demonstrate that this approach enhances the generalization of multilingual models to other languages while preserving interpretability. Finally, to efficiently scale the model to 50 languages, we introduce the concept of language family experts, drawing on linguistic priors, which enables scaling the number of languages without adding parameters.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10626v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues.<span class='px-1 mx-1 bg-yellow-200'>We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.78</span></span>Beyond text modality, we have also acquired a set of images and rationally paired stereo audio through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for latent diffusion models, we introduce the SpatialSonic model utilizing spatial-aware encoders and azimuth state matrices to reveal reasonable spatial guidance. By leveraging spatial guidance, our unified model not only achieves the objective of generating immersive and controllable spatial audio from text and image but also enables interactive audio generation during inference. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10676v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs' knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1.<span class='px-1 mx-1 bg-yellow-200'>We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.882</span></span>We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10700v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in signal-processing tools. However, recent works show that Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems. To explore this potential, we construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective.<span class='px-1 mx-1 bg-yellow-200'>The benchmark incorporates diverse real-world sensor datasets for various tasks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.757</span></span>The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges in processing compositional tasks with parameter selections compared to engineering experts. Additionally, we investigate four prompting strategies for sensor processing and show that self-verification can outperform all other baselines in 48% of tasks. Our study provides a comprehensive benchmark and prompting analysis for future developments, paving the way toward an LLM-based sensor processing copilot.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10741v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models, we propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs (VQA). This is done without any human-in-the-loop, using the multi-modal content in the manuscripts, like graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. This significantly reduces the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities, avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (&lt;2.5%).<span class='px-1 mx-1 bg-yellow-200'>Our dataset is available online on HuggingFace, and our code will be available here. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.966</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10783v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-14</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Depth Any Video with Scalable Synthetic Data
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations.<span class='px-1 mx-1 bg-yellow-200'>First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.85</span></span>Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.10815v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Human activity recognition is an important task for an intelligent robot, especially in the field of human-robot collaboration, where it requires not only the labels of sub-activities but also the temporal structure of the activity. In order to automatically recognize both the label and the temporal structure in a sequence of human-object interactions, we propose a novel Pyramid Graph Convolutional Network (PGCN), which employs a pyramidal encoder-decoder architecture consisting of an attention-based graph convolution network and a temporal pyramid pooling module for downsampling and upsampling the interaction sequence on the temporal axis, respectively. The system represents the 2D or 3D spatial relation of human and objects from the detection results in video data as a graph. To learn the human-object relations, a new attention graph convolutional network is trained to extract condensed information from the graph representation. To segment action into sub-actions, a novel temporal pyramid pooling module is proposed, which upsamples compressed features back to the original time scale and classifies actions per frame. We explore various attention layers, namely spatial attention, temporal attention and channel attention, and combine different upsampling decoders to test the performance on action recognition and segmentation.<span class='px-1 mx-1 bg-yellow-200'>We evaluate our model on two challenging datasets in the field of human-object interaction recognition, i.e. Bimanual Actions and IKEA Assembly datasets. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.746</span></span>We demonstrate that our classifier significantly improves both framewise action recognition and segmentation, e.g., F1 micro and F1@50 scores on the Bimanual Actions dataset are improved by 4.3% and 8.5%, respectively.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.07912v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Offline Hierarchical Reinforcement Learning via Inverse Optimization
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset might have been generated using a different policy structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO: a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the inverse problem, recovering the unobservable high-level actions that likely generated the observed data under our hierarchical policy.<span class='px-1 mx-1 bg-yellow-200'>This approach constructs a dataset suitable for off-the-shelf offline training. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.788</span></span>We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.07933v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
AI Surrogate Model for Distributed Computing Workloads
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>Large-scale international scientific collaborations, such as ATLAS, Belle II, CMS, and DUNE, generate vast volumes of data. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.721</span></span>These experiments necessitate substantial computational power for varied tasks, including structured data processing, Monte Carlo simulations, and end-user analysis. Centralized workflow and data management systems are employed to handle these demands, but current decision-making processes for data placement and payload allocation are often heuristic and disjointed. This optimization challenge could potentially be addressed using contemporary machine learning methods, such as reinforcement learning, which, in turn, require access to extensive data and an interactive environment. Instead, we propose a generative surrogate modeling approach to address the lack of training data and concerns about privacy preservation. We have collected and processed real-world job submission records, totaling more than two million jobs over 150 days, and applied four generative models for tabular data -- TVAE, CTAGGAN+, SMOTE, and TabDDPM -- to these datasets, thoroughly evaluating their performance. Along with measuring the discrepancy among feature-wise distributions separately, we also evaluate pair-wise feature correlations, distance to closest record, and responses to pre-trained models. Our experiments indicate that SMOTE and TabDDPM can generate similar tabular data, almost indistinguishable from the ground truth. Yet, as a non-learning method, SMOTE ranks the lowest in privacy preservation. As a result, we conclude that the probabilistic-diffusion-model-based TabDDPM is the most suitable generative model for managing job record data.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.07940v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Background: Machine learning methods for clinical named entity recognition and entity normalization systems can utilize both labeled corpora and Knowledge Graphs (KGs) for learning. However, infrequently occurring concepts may have few mentions in training corpora and lack detailed descriptions or synonyms, even in large KGs. For Disease Entity Recognition (DER) and Disease Entity Normalization (DEN), this can result in fewer high-quality training examples relative to the number of known diseases. Large Language Model (LLM) generation of synthetic training examples could improve performance in these information extraction tasks. Methods: We fine-tuned a LLaMa-2 13B <span class='px-1 mx-1 bg-yellow-200'>Chat LLM to generate a synthetic corpus containing normalized mentions of concepts from the Unified Medical Language System (UMLS) Disease Semantic Group. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.769</span></span>We measured overall and Out of Distribution (OOD) performance for DER and DEN, with and without synthetic data augmentation. We evaluated performance on 3 different disease corpora using 4 different data augmentation strategies, assessed using BioBERT for DER and SapBERT and KrissBERT for DEN. Results: Our synthetic data yielded a substantial improvement for DEN: in all 3 training corpora, the top-1 accuracy of both SapBERT and KrissBERT improved by 3-9 points in overall performance and by 20-55 points on OOD data. A small improvement (1-2 points) was also seen for DER in overall performance, but only one dataset showed OOD improvement. 
Conclusion: LLM generation of normalized disease mentions can improve DEN relative to normalization approaches that do not utilize LLMs to augment data with synthetic mentions. Ablation studies indicate that performance gains for DEN were only partially attributable to improvements in OOD performance. The same approach has only a limited ability to improve DER.<span class='px-1 mx-1 bg-yellow-200'>We make our software and dataset publicly available. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.9</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.07951v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
PubMed knowledge graph 2.0: Connecting papers, patents, and clinical trials in biomedical science
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Papers, patents, and clinical trials are indispensable types of scientific literature in biomedicine, crucial for knowledge sharing and dissemination. However, these documents are often stored in disparate databases with varying management standards and data formats, making it challenging to form systematic, fine-grained connections among them.<span class='px-1 mx-1 bg-yellow-200'>To address this issue, we introduce PKG2.0, a comprehensive knowledge graph dataset encompassing over 36 million papers, 1.3 million patents, and 0.48 million clinical trials in the biomedical field. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.777</span></span>PKG2.0 integrates these previously dispersed resources through various links, including biomedical entities, author networks, citation relationships, and research projects. Fine-grained biomedical entity extraction, high-performance author name disambiguation, and multi-source citation integration have played a crucial role in the construction of the PKG dataset.<span class='px-1 mx-1 bg-yellow-200'>Additionally, project data from the NIH Exporter enriches the dataset with metadata of NIH-funded projects and their scholarly outputs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.839</span></span>Data validation demonstrates that PKG2.0 excels in key tasks such as author disambiguation and biomedical entity recognition.<span class='px-1 mx-1 bg-yellow-200'>This dataset provides valuable resources for biomedical researchers, bibliometric scholars, and those engaged in literature mining. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.76</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.07969v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses. Striving to maximize the implicit reward gap between the chosen and the slightly inferior rejected responses can cause overfitting and unnecessary unlearning of the high-quality rejected responses. The unawareness of the reward scores also drives the LLM to indiscriminately favor the low-quality chosen responses and fail to generalize to responses with the highest rewards, which are sparse in data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.<span class='px-1 mx-1 bg-yellow-200'>This dataset is easily integrated with existing direct alignment algorithms and is applicable to any preference dataset. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.829</span></span>The experimental results across instruction-following benchmarks including AlpacaEval, MT-Bench, and Arena-Hard-Auto demonstrate that our approach consistently boosts the performance of DPO by a considerable margin across diverse models. Additionally, our method improves the average accuracy on various academic benchmarks. When applying our method to on-policy data, the resulting DPO model achieves SOTA results on AlpacaEval. Through ablation studies, we demonstrate that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere dataset expansion. Our code is available at https://github.com/shenao-zhang/reward-augmented-preference.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.08067v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Large Language Models (LLMs) exhibit impressive performance across various domains but still struggle with arithmetic reasoning tasks. Recent work shows the effectiveness of prompt design methods in enhancing reasoning capabilities. However, these approaches overlook crucial requirements for prior knowledge of specific concepts, theorems, and tricks to tackle most arithmetic reasoning problems successfully. To address this issue, we propose a novel and effective Teaching-Inspired Integrated Framework, which emulates the instructional process of a teacher guiding students. This method equips LLMs with essential concepts, relevant theorems, and similar problems with analogous solution approaches, facilitating the enhancement of reasoning abilities.<span class='px-1 mx-1 bg-yellow-200'>Additionally, we introduce two new Chinese datasets, MathMC and MathToF, both with detailed explanations and answers. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.862</span></span>Experiments on nine benchmarks demonstrate that our approach improves the reasoning accuracy of LLMs. With GPT-4 and our framework, we achieve new state-of-the-art performance on four math benchmarks (AddSub, SVAMP, Math23K and AQuA) with accuracies of 98.2% (+3.3%), 93.9% (+0.2%), 94.3% (+7.2%) and 81.1% (+1.2%). Our data and code are available at https://github.com/SallyTan13/Teaching-Inspired-Prompting.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.08068v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Large language models (LLMs) have achieved reasonable quality improvements in machine translation (MT). However, most current research on MT-LLMs still faces significant challenges in maintaining translation consistency and accuracy when processing entire documents. In this paper, we introduce DelTA, a Document-levEL Translation Agent designed to overcome these limitations. DelTA features a multi-level memory structure that stores information across various granularities and spans, including Proper Noun Records, Bilingual Summary, Long-Term Memory, and Short-Term Memory, which are continuously retrieved and updated by auxiliary LLM-based components. Experimental results indicate that DelTA significantly outperforms strong baselines in terms of translation consistency and quality across four open/closed-source LLMs and two representative document translation datasets, achieving an increase in consistency scores by up to 4.58 percentage points and in COMET scores by up to 3.16 points on average. DelTA employs a sentence-by-sentence translation strategy, ensuring no sentence omissions and offering a memory-efficient solution compared to the mainstream method. Furthermore, DelTA improves pronoun translation accuracy, and the summary component of the agent also shows promise as a tool for query-based summarization tasks.<span class='px-1 mx-1 bg-yellow-200'>We release our code and data at https://github.com/YutongWang1216/DocMTAgent. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.882</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.08143v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs), where visual information contradicts model's internal commonsense knowledge (see Figure 1). To study this issue, we introduce an automated pipeline, augmented with human-in-the-loop quality control, to establish a benchmark aimed at simulating and assessing the conflicts in MLLMs. Utilizing this pipeline, we have crafted a diagnostic benchmark comprising 374 original images and 1,122 high-quality question-answer (QA) pairs. This benchmark covers two types of conflict target and three question difficulty levels, providing a thorough assessment tool. Through this benchmark, we evaluate the conflict-resolution capabilities of nine representative MLLMs across various model families and find a noticeable over-reliance on textual queries. Drawing on these findings, we propose a novel prompting strategy, "Focus-on-Vision" (FoV), which markedly enhances MLLMs' ability to favor visual data over conflicting textual knowledge. Our detailed analysis and the newly proposed strategy significantly advance the understanding and mitigating of vision-knowledge conflicts in MLLMs. <span class='px-1 mx-1 bg-yellow-200'>The data and code are made publicly available. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.774</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.08145v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td></td>
<td>
<h2 class="text-2xl tracking-tight pt-4 font-bold">Data Quality</h2>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>When annotators disagree, predicting the labels given by individual annotators can capture nuances overlooked by traditional label aggregation. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.813</span></span> We introduce three approaches to predicting individual annotator ratings on the toxicity of text by incorporating individual annotator-specific information: a neural collaborative filtering (NCF) approach, an in-context learning (ICL) approach, and an intermediate embedding-based architecture. We also study the utility of demographic information for rating prediction. NCF showed limited utility; however, integrating annotator history, demographics, and survey information permits both the embedding-based architecture and ICL to substantially improve prediction accuracy, with the embedding-based architecture outperforming the other methods. We also find that, if demographics are predicted from survey information, using these imputed demographics as features performs comparably to using true demographic data. This suggests that demographics may not provide substantial information for modeling ratings beyond what is captured in survey responses. Our findings raise considerations about the relative utility of different types of annotator information and provide new approaches for modeling annotators in subjective NLP tasks.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.12217v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-10-15</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
AIC CTU system at AVeriTeC: Re-framing automated fact-checking as a simple RAG task
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>This paper describes our 3rd place submission in the AVeriTeC shared task in which we attempted to address the challenge of fact-checking with evidence retrieved in the wild using a simple scheme of Retrieval-Augmented Generation (RAG) designed for the task, leveraging the predictive power of Large Language Models. We release our codebase and explain its two modules - the Retriever and the Evidence & Label generator - in detail, justifying their features such as MMR-reranking and Likert-scale confidence estimation. We evaluate our solution on AVeriTeC dev and test set and interpret the results, picking the GPT-4o as the most appropriate model for our pipeline at the time of our publication, with Llama 3.1 70B being a promising open-source alternative. <span class='px-1 mx-1 bg-yellow-200'>We perform an empirical error analysis to see that faults in our predictions often coincide with noise in the data or ambiguous fact-checks, provoking further research and data augmentation. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.671</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2410.11446v1' target="_blank">
link
</a>
</p>