FlowReader — Min-Cost Flow Optimization for Multi-Modal Long Document Q&A

§ Abstract

Long, multimodal documents force retrieval-augmented systems to assemble answers from evidence fragmented across text, tables, and slides — broken across cells in a long table, spread over multiple slides, or split between a figure and its discussion. Top-k chunk retrieval treats each fragment independently and cannot represent how evidence connects. We introduce FlowReader, which reframes evidence assembly as a min-cost flow problem on a multimodal node graph: a single scoring vector h controls source selection (via MMR), sink selection (via a length-aware answerability proxy), and the costs and capacities of every edge. The optimal flow is decomposed into candidate evidence paths; a compact non-redundant subset is selected by entropy-regularised replicator dynamics; and parallel VLM workers under a dual-process gate produce the answer with a single System-2 refinement pass triggered when answer consistency is low or routed flow is strained. On VisDoMBench, FlowReader is best on the two subsets dominated by fragmented evidence — PaperTab (58.40, +1.30 over G²-Reader) and SlideVQA (72.93, +0.62) — and competitive elsewhere. Macro-averaged across all five subsets, FlowReader (65.47) sits within 0.74 of the strongest baseline.

58.40

PaperTab · best (+1.30)

72.93

SlideVQA · best (+0.62)

65.47

Macro average

1

Conditional System-2 pass

01 Fragmented evidence breaks top-k retrieval

The answer to a question about a long report is rarely contained in a single chunk. It is split between a figure and its discussion, spread over several slides, or scattered across the cells of a long table. Top-k chunk retrieval scores each fragment independently and has no way to represent how those fragments connect — so it either misses a piece of the chain or floods the reader with redundant near-duplicates.

FlowReader treats the problem as routing: how should a fixed budget of evidence flow from the most query-relevant sources to the most answerable sinks, through the cheapest coherent paths in the document graph?

02 Evidence assembly as min-cost flow

A single learned scoring vector h over graph nodes drives the entire pipeline: it selects sources via maximal marginal relevance, selects sinks via a length-aware answerability proxy, and parameterises the cost and capacity of every edge. Solving the resulting min-cost flow (OR-Tools, network simplex) gives an optimal routing of evidence, which is then decomposed into candidate paths.

FlowReader full pipeline diagram — **Architecture** One scoring vector h unifies source selection (MMR), sink selection (answerability), and all edge costs/capacities on the multimodal graph. The optimal flow is decomposed into evidence paths, a non-redundant subset is chosen by entropy-regularised replicator dynamics, and parallel VLM workers read the surviving paths under a dual-process gate.

Unified scoring vector

A single vector h controls source selection, sink selection, and every edge cost and capacity — collapsing scoring, routing, and selection into one controllable object.

Min-cost flow routing

The optimal flow at demand F=6 is solved with OR-Tools and decomposed into up to 60 candidate evidence paths — capturing how fragmented evidence connects, which top-k cannot.

Entropy-regularised path selection

A compact, non-redundant subset of paths is selected by replicator dynamics with an entropy regulariser that prevents premature collapse onto a single route.

Dual-process System-2 gate

Parallel VLM workers answer in System 1. A single System-2 refinement pass fires only when answer consistency is low or routed flow is strained — bounding extra compute a priori.

Entropy-regularised replicator dynamics

Path selection is a quality–diversity game over the candidate paths. An entropy term smooths the trajectory and preserves exploration across plausible routes; the fixed point approximates a Nash equilibrium, after which low-mass paths are pruned.

Replicator dynamics convergence over candidate paths — **Replicator dynamics** Probability mass concentrates on paths that are individually strong and mutually non-overlapping, while entropy regularisation keeps the intermediate population spread across multiple candidate routes.

When System 2 fires

System 2 trigger conditions and edits — **System-2 triggers** The gate fires on low flow saturation or low answer consistency, applying a trigger-specific graph edit — bridging a saturated min-cut (T1), re-scoring high-flow edges (T2), or expanding sources by BFS (T3) — then re-solving the pipeline once.

03 Results

Main results on the full VisDoMBench benchmark, averaged over three runs (± standard deviation). Bold = best per column, underline = second best.

Model	Type	SPIQA	FetaTab	PaperTab	SciGraphQA	SlideVQA
GPT-5	VLM	55.22	63.94	37.08	64.08	45.06
Qwen3-VL-32B	VLM	29.86	37.39	34.32	23.06	24.87
Deepseek-OCR	OCR	63.60	70.32	51.58	61.91	65.69
RAGAnything	RAG	67.69	57.76	42.02	41.60	52.18
MA-RAG	RAG	45.52	27.70	33.43	29.32	29.40
GraphRAG	Graph-RAG	62.65	61.35	42.90	65.76	21.68
LightRAG	Graph-RAG	73.88	64.71	51.02	75.00	29.63
MMGraphRAG	Graph-RAG	69.91	72.40	56.36	64.11	54.20
VisDoMRAG	Graph-RAG	75.44	61.02	56.21	63.36	69.03
ViDoRAG	Graph-RAG	68.18	58.74	43.67	37.86	71.71
G²-Reader	Graph-RAG	73.19	66.89	57.10	61.56	72.31
FlowReader	Ours	74.23	64.45	58.40	57.32	72.93
— w/o System 2	Ablation	74.11	63.26	57.95	55.89	72.36

FlowReader leads on PaperTab and SlideVQA — the subsets dominated by evidence fragmented across cells and slides, exactly where independent top-k scoring fails. The System-2 pass contributes a consistent lift across subsets.

04 Bounded, adaptive compute

Unlike iterative replanning loops that are unbounded in their replanning cap, FlowReader's cost has an a-priori bound: System-1 calls plus a single conditional System-2 budget. The gate spends extra compute only on the queries that need it.

15.6

avg System-1 VLM calls

F=6

flow demand routed

≤11

paths read by workers

✓

a-priori cost bound

Generator: Qwen3-VL-32B-Instruct (vLLM, tensor-parallel 4, bf16) on 4× A100 80GB. Embedder: nomic-embed-text-v1.5. Parser: MinerU. Flow solver: Google OR-Tools (network simplex). All hyperparameters fixed once on a 50-query dev split and reused unchanged.

§ Citation

@article{mehrish2026flowreader,
  title   = {FlowReader: Min-Cost Flow Optimization for Multi-Modal Long Document Q&A},
  author  = {Mehrish, Ambuj and Vascon, Sebastiano},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://arxiv.org/abs/2606.07235}
}