Multimodal Doc QA
Ca' Foscari University of Venice · 2026

Selecting the right evidence for long multimodal documents.

Two complementary approaches to the central bottleneck of multimodal RAG — not how much evidence is retrieved, but which evidence reaches the reader. We treat selection as structured optimisation over a document graph.

Ambuj Mehrish · Sebastiano Vascon · CASPER · MSCA Horizon Europe
Constrained Dominant Sets · Graph Selection

Constrained Dominant Sets for Multimodal Document Question Answering

Evidence selection reframed as a constrained dominant-set problem on a query-augmented affinity graph. The query becomes a hard structural constraint; relevance and redundancy balance automatically via a spectral bound — no tuning, no training.

66.99 VisDoMBench Avg · SOTA
+37.1 over no-retrieval
+4.8 MMLongBench-Doc
[Figure: CDS vs cosine selection comparison]
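
To make the constrained dominant-set idea above concrete, here is a minimal sketch in Python. It assumes the standard constrained dominant-set program: maximise x^T (A - alpha * I_hat) x over the simplex, where I_hat is 1 on the diagonal for unconstrained nodes and alpha exceeds the largest eigenvalue of the affinity submatrix over those nodes (our reading of the "spectral bound"). The affinity values, the function name `constrained_dominant_set`, and the support threshold are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def constrained_dominant_set(A, constraint, iters=5000, tol=1e-10):
    """Replicator dynamics for max x^T (A - alpha * I_hat) x on the simplex.

    A          : symmetric, non-negative affinity matrix with zero diagonal
    constraint : indices that must appear in the solution (here, the query node)
    """
    n = A.shape[0]
    free = np.setdiff1d(np.arange(n), constraint)
    # Spectral bound: alpha must exceed the largest eigenvalue of the
    # affinity submatrix restricted to the unconstrained nodes.
    alpha = np.linalg.eigvalsh(A[np.ix_(free, free)]).max() + 1e-3
    penalty = np.zeros(n)
    penalty[free] = alpha
    # Adding the constant alpha to every entry keeps payoffs non-negative
    # and does not move the maximisers on the simplex.
    B = A - np.diag(penalty) + alpha
    x = np.full(n, 1.0 / n)              # start from the barycentre
    for _ in range(iters):
        Bx = B @ x
        x_new = x * Bx / (x @ Bx)        # discrete replicator update
        done = np.abs(x_new - x).sum() < tol
        x = x_new
        if done:
            break
    return x

# Hypothetical graph: node 0 is the query, nodes 1-3 are related chunks,
# nodes 4-5 are a tightly coupled but off-topic pair.
A = np.array([
    [0.0, 0.8, 0.7, 0.6, 0.1, 0.1],
    [0.8, 0.0, 0.5, 0.4, 0.1, 0.1],
    [0.7, 0.5, 0.0, 0.4, 0.1, 0.1],
    [0.6, 0.4, 0.4, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.0, 0.9],
    [0.1, 0.1, 0.1, 0.1, 0.9, 0.0],
])
x = constrained_dominant_set(A, constraint=[0])
selected = [i for i in range(1, len(x)) if x[i] > 1e-3]  # evidence nodes in the support
print(np.round(x, 3), selected)
```

In this toy graph the off-topic pair (nodes 4 and 5) has the strongest mutual affinity, yet the constraint on the query node drives their weights to zero, so the support consists of the query and the chunks tied to it. How the actual affinity graph is built from multimodal content is not specified here.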
Min-Cost Flow · Dual-Process Reasoning

FlowReader: Min-Cost Flow Optimization for Multi-Modal Long Document Q&A

Evidence assembly as a min-cost flow problem on a multimodal node graph. A single scoring vector controls source selection, sink selection, and every edge cost — with a dual-process gate that triggers a System-2 refinement pass only when needed.

58.40 PaperTab · best
72.93 SlideVQA · best
65.47 Macro avg
[Figure: FlowReader replicator dynamics]
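
As a rough illustration of the min-cost flow framing above, the following sketch derives sources, sinks, and edge costs from a single score vector and solves the resulting problem with a textbook successive-shortest-paths routine. Everything specific is an assumption made for the example: the scores, the links, the rule "highest-scoring node is the source, second-highest is the sink", and the cost `round(100 * (1 - score[v]))` for entering node `v`. The dual-process gate is not modelled, and FlowReader's actual graph construction and solver may differ.

```python
from math import inf

def min_cost_flow(n, edges, s, t, demand):
    """Successive shortest paths with Bellman-Ford on the residual graph.

    edges: list of (u, v, capacity, cost). Returns (sent, total_cost, flows),
    where flows[k] is the flow on edges[k].
    """
    graph = [[] for _ in range(n)]   # arc = [to, residual_cap, cost, reverse_index]
    handles = []
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])
        handles.append((u, len(graph[u]) - 1, cap))
    sent = total = 0
    while sent < demand:
        dist = [inf] * n
        prev = [None] * n
        dist[s] = 0
        for _ in range(n - 1):       # Bellman-Ford handles negative residual costs
            changed = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for i, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        changed = True
            if not changed:
                break
        if dist[t] == inf:
            break                    # no augmenting path left
        push, v = demand - sent, t
        while v != s:                # bottleneck capacity along the path
            u, i = prev[v]
            push = min(push, graph[u][i][1])
            v = u
        v = t
        while v != s:                # augment along the path
            u, i = prev[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            v = u
        sent += push
        total += push * dist[t]
    flows = [cap - graph[u][i][1] for u, i, cap in handles]
    return sent, total, flows

# Hypothetical relevance scores for five document nodes and their links.
scores = [0.9, 0.3, 0.6, 0.2, 0.8]
links = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 4), (3, 4), (1, 2)]

ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
source, sink = ranked[0], ranked[1]            # assumed selection rule
edges = [(u, v, 1, round(100 * (1 - scores[v]))) for u, v in links]

sent, cost, flows = min_cost_flow(len(scores), edges, source, sink, demand=2)
selected = sorted({node for (u, v), f in zip(links, flows) if f > 0 for node in (u, v)})
print(sent, cost, selected)   # 2 150 [0, 1, 2, 4]
```

With two units of flow and unit capacities, the solver routes evidence through nodes 1 and 2 and skips node 3, the lowest-scoring node, because entering it is the most expensive step. Changing the score vector changes the source, the sink, and every edge cost at once, which is the property the description above emphasises.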
Shared thesis

Long multimodal documents are engineered to be redundant — findings recur across figures, captions, and prose. Similarity-based retrieval amplifies this, spending its budget on near-duplicates. Both works replace flat top-k ranking with graph-native selection that explicitly rewards relevance and non-redundancy, solved to a global equilibrium with replicator dynamics from evolutionary game theory.