M Multimodal Doc QA
← All papers
Min-Cost Flow · Dual-Process Reasoning

FlowReader: Min-Cost Flow Optimization for Multi-Modal Long Document Q&A

Ambuj Mehrish1  ·  Sebastiano Vascon1

1 Ca' Foscari University of Venice — {ambuj.mehrish, sebastiano.vascon}@unive.it

§ Abstract

Long, multimodal documents force retrieval-augmented systems to assemble answers from evidence fragmented across text, tables, and slides — broken across cells in a long table, spread over multiple slides, or split between a figure and its discussion. Top-k chunk retrieval treats each fragment independently and cannot represent how evidence connects. We introduce FlowReader, which reframes evidence assembly as a min-cost flow problem on a multimodal node graph: a single scoring vector h controls source selection (via MMR), sink selection (via a length-aware answerability proxy), and the costs and capacities of every edge. The optimal flow is decomposed into candidate evidence paths; a compact non-redundant subset is selected by entropy-regularised replicator dynamics; and parallel VLM workers under a dual-process gate produce the answer with a single System-2 refinement pass triggered when answer consistency is low or routed flow is strained. On VisDoMBench, FlowReader is best on the two subsets dominated by fragmented evidence — PaperTab (58.40, +1.30 over G²-Reader) and SlideVQA (72.93, +0.62) — and competitive elsewhere. Macro-averaged across all five subsets, FlowReader (65.47) sits within 0.74 of the strongest baseline.

58.40
PaperTab · best (+1.30)
72.93
SlideVQA · best (+0.62)
65.47
Macro average
1
Conditional System-2 pass

01 Fragmented evidence breaks top-k retrieval

The answer to a question about a long report is rarely contained in a single chunk. It is split between a figure and its discussion, spread over several slides, or scattered across the cells of a long table. Top-k chunk retrieval scores each fragment independently and has no way to represent how those fragments connect — so it either misses a piece of the chain or floods the reader with redundant near-duplicates.

FlowReader treats the problem as routing: how should a fixed budget of evidence flow from the most query-relevant sources to the most answerable sinks, through the cheapest coherent paths in the document graph?

02 Evidence assembly as min-cost flow

A single learned scoring vector h over graph nodes drives the entire pipeline: it selects sources via maximal marginal relevance, selects sinks via a length-aware answerability proxy, and parameterises the cost and capacity of every edge. Solving the resulting min-cost flow (OR-Tools, network simplex) gives an optimal routing of evidence, which is then decomposed into candidate paths.

FlowReader full pipeline diagram
Architecture One scoring vector h unifies source selection (MMR), sink selection (answerability), and all edge costs/capacities on the multimodal graph. The optimal flow is decomposed into evidence paths, a non-redundant subset is chosen by entropy-regularised replicator dynamics, and parallel VLM workers read the surviving paths under a dual-process gate.

Unified scoring vector

A single vector h controls source selection, sink selection, and every edge cost and capacity — collapsing scoring, routing, and selection into one controllable object.

Min-cost flow routing

The optimal flow at demand F=6 is solved with OR-Tools and decomposed into up to 60 candidate evidence paths — capturing how fragmented evidence connects, which top-k cannot.

Entropy-regularised path selection

A compact, non-redundant subset of paths is selected by replicator dynamics with an entropy regulariser that prevents premature collapse onto a single route.

Dual-process System-2 gate

Parallel VLM workers answer in System 1. A single System-2 refinement pass fires only when answer consistency is low or routed flow is strained — bounding extra compute a priori.

Entropy-regularised replicator dynamics

Path selection is a quality–diversity game over the candidate paths. An entropy term smooths the trajectory and preserves exploration across plausible routes; the fixed point approximates a Nash equilibrium, after which low-mass paths are pruned.

Replicator dynamics convergence over candidate paths
Replicator dynamics Probability mass concentrates on paths that are individually strong and mutually non-overlapping, while entropy regularisation keeps the intermediate population spread across multiple candidate routes.

When System 2 fires

System 2 trigger conditions and edits
System-2 triggers The gate fires on low flow saturation or low answer consistency, applying a trigger-specific graph edit — bridging a saturated min-cut (T1), re-scoring high-flow edges (T2), or expanding sources by BFS (T3) — then re-solving the pipeline once.

03 Results

Main results on the full VisDoMBench benchmark, averaged over three runs (± standard deviation). Bold = best per column, underline = second best.

ModelTypeSPIQAFetaTabPaperTabSciGraphQASlideVQA
GPT-5VLM55.2263.9437.0864.0845.06
Qwen3-VL-32BVLM29.8637.3934.3223.0624.87
Deepseek-OCROCR63.6070.3251.5861.9165.69
RAGAnythingRAG67.6957.7642.0241.6052.18
MA-RAGRAG45.5227.7033.4329.3229.40
GraphRAGGraph-RAG62.6561.3542.9065.7621.68
LightRAGGraph-RAG73.8864.7151.0275.0029.63
MMGraphRAGGraph-RAG69.9172.4056.3664.1154.20
VisDoMRAGGraph-RAG75.4461.0256.2163.3669.03
ViDoRAGGraph-RAG68.1858.7443.6737.8671.71
G²-ReaderGraph-RAG73.1966.8957.1061.5672.31
FlowReaderOurs74.2364.4558.4057.3272.93
— w/o System 2Ablation74.1163.2657.9555.8972.36

FlowReader leads on PaperTab and SlideVQA — the subsets dominated by evidence fragmented across cells and slides, exactly where independent top-k scoring fails. The System-2 pass contributes a consistent lift across subsets.

04 Bounded, adaptive compute

Unlike iterative replanning loops that are unbounded in their replanning cap, FlowReader's cost has an a-priori bound: System-1 calls plus a single conditional System-2 budget. The gate spends extra compute only on the queries that need it.

15.6
avg System-1 VLM calls
F=6
flow demand routed
≤11
paths read by workers
a-priori cost bound

Generator: Qwen3-VL-32B-Instruct (vLLM, tensor-parallel 4, bf16) on 4× A100 80GB. Embedder: nomic-embed-text-v1.5. Parser: MinerU. Flow solver: Google OR-Tools (network simplex). All hyperparameters fixed once on a 50-query dev split and reused unchanged.

§ Citation

@article{mehrish2026flowreader,
  title   = {FlowReader: Min-Cost Flow Optimization for Multi-Modal Long Document Q&A},
  author  = {Mehrish, Ambuj and Vascon, Sebastiano},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://arxiv.org/abs/2606.07235}
}