<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://theleprechaun25.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://theleprechaun25.github.io/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-03-01T08:26:47+00:00</updated><id>https://theleprechaun25.github.io/feed.xml</id><title type="html">Andoni Irazusta Garmendia</title><subtitle>Personal Page of Andoni Irazusta Garmendia.
</subtitle><entry><title type="html">Can a Weekend + LLMs Produce a New Neural Solver for the Quadratic Assignment Problem?</title><link href="https://theleprechaun25.github.io/blog/2026/qap-weekend/" rel="alternate" type="text/html" title="Can a Weekend + LLMs Produce a New Neural Solver for the Quadratic Assignment Problem?" /><published>2026-02-23T00:00:00+00:00</published><updated>2026-02-23T00:00:00+00:00</updated><id>https://theleprechaun25.github.io/blog/2026/qap-weekend</id><content type="html" xml:base="https://theleprechaun25.github.io/blog/2026/qap-weekend/"><![CDATA[<h2 id="motivation">Motivation</h2>

<p>LLMs have gotten surprisingly competent at the core task of research projects: reading papers, writing code, and turning messy thoughts into structured text. 
That naturally raises a question I couldn’t stop thinking about:</p>

<p>If I compress the entire research loop into a single weekend—survey → idea → implementation → experiments → write-up—how far can I get with LLMs as a co-pilot?</p>

<p>This post is that experiment.</p>

<p>I’m going to try it on the Quadratic Assignment Problem (QAP), an NP-hard optimization problem that is notoriously difficult in practice. 
QAP is a “permutation” problem (like TSP), but the objective is quadratic: choices interact with each other in a way that quickly becomes globally entangled. 
Many well-known combinatorial problems can be expressed as special cases or close relatives of QAP, including variants related to routing, matching, layout, and assignment.</p>

<p>Despite that importance, QAP has received much less attention in neural combinatorial optimization than problems like TSP. 
One reason is practical: QAP is harder to “package” into the neural architectures that work nicely for other permutation problems. 
The signal you need depends on a lot of pairwise interactions, so learning good local decisions from partial context is simply harder. 
I’ll unpack this properly later, because it’s exactly where a new model design might help.</p>

<p>So the goal is simple and slightly reckless:</p>

<p>Can I, with the help of LLMs, produce a working neural method for QAP in a single weekend? And if I can, how competitive will it be?
We’ll find out.</p>

<hr />

<h2 id="preliminaries">Preliminaries</h2>

<p>Before diving into the sprint, a quick note on how I’ll run it.</p>

<p>This is a time-boxed project: the point is not to claim a polished, state-of-the-art solver, but to see what kind of research artifact can be produced under extreme constraints with modern LLM support. The output I’m aiming for is:</p>
<ul>
  <li>a working implementation,</li>
  <li>a reproducible experimental setup,</li>
  <li>and a coherent write-up (call it a short paper or technical report).</li>
</ul>

<p>I’ll use the following tools:</p>
<ul>
  <li>ChatGPT (GPT-5.2 Thinking) for quick literature scanning, problem framing, and architecture brainstorming.</li>
  <li>Codex (gpt-5.3-codex) for coding iterations and refinements.</li>
  <li>Prism (LLM-assisted LaTeX projects) for the final report formatting.</li>
</ul>

<p>Let’s start.</p>

<hr />

<h2 id="stage-1-literature-review">Stage 1: Literature Review</h2>

<p>Before reading papers, I wanted to make sure I had the QAP formulation fresh in my mind.
It’s been a while, so I simply asked the LLM for a quick refresh.</p>

<p><strong>Q.</strong> <em>Remind me of the Quadratic Assignment Problem (QAP) formulation.</em></p>

<p><strong>A.</strong> The <strong>Quadratic Assignment Problem (QAP)</strong> assigns $n$ facilities to $n$ locations. Let:</p>

<ul>
  <li>$F \in \mathbb{R}^{n\times n}$ be the <strong>flow</strong> matrix, where $F_{ij}$ quantifies the interaction between facilities $i$ and $j$.</li>
  <li>$D \in \mathbb{R}^{n\times n}$ be the <strong>distance</strong> matrix, where $D_{ab}$ is the distance between locations $a$ and $b$.</li>
  <li>$\pi \in S_n$ be a <strong>permutation</strong>, where $\pi(i)$ is the location assigned to facility $i$.</li>
</ul>

<p>The QAP objective is:</p>

\[\min_{\pi \in S_n} \ \sum_{i=1}^{n}\sum_{j=1}^{n} F_{ij}\,D_{\pi(i)\,\pi(j)}.\]

<p>Equivalently, using a permutation matrix $P \in \{0,1\}^{n\times n}$ with $P_{i,\pi(i)}=1$ and the feasible set $\Pi$ of all permutation matrices:</p>

\[\min_{P \in \Pi} \ \langle F,\ P D P^\top \rangle,\]

<p>where the Frobenius inner product is:</p>

\[\langle A,B \rangle = \sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}B_{ij}.\]
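<p>To make the equivalence of the two forms concrete, here is a minimal sketch in plain Python. The helper names (<code>qap_cost</code>, <code>perm_matrix</code>, <code>frobenius</code>) and the 3x3 instance are illustrative assumptions, not code from this project.</p>

```python
def qap_cost(F, D, perm):
    """Permutation form: sum of F[i][j] * D[perm[i]][perm[j]] over all i, j."""
    n = len(perm)
    return sum(F[i][j] * D[perm[i]][perm[j]] for i in range(n) for j in range(n))


def perm_matrix(perm):
    """P[i][perm[i]] = 1, matching the convention used in the text."""
    n = len(perm)
    return [[1 if perm[i] == a else 0 for a in range(n)] for i in range(n)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def frobenius(A, B):
    """Frobenius inner product: elementwise product, then sum."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


# Hypothetical toy instance (symmetric flows and distances).
F = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]
D = [[0, 5, 4], [5, 0, 6], [4, 6, 0]]
perm = [2, 0, 1]  # facility 0 at location 2, facility 1 at location 0, facility 2 at location 1

P = perm_matrix(perm)
cost_sum = qap_cost(F, D, perm)
cost_matrix = frobenius(F, matmul(matmul(P, D), transpose(P)))
```

<p>Both <code>cost_sum</code> and <code>cost_matrix</code> evaluate to the same number, because $(P D P^\top)_{ij} = D_{\pi(i)\,\pi(j)}$ under this convention.</p>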

<hr />

<p>Next, I asked the LLM to do what I’d normally do first: a fast literature scan, focused specifically on <strong>neural / deep learning methods for QAP</strong>.</p>

<p><strong>Q.</strong> <em>Give me a concise literature review of deep learning approaches for the Quadratic Assignment Problem (QAP): main method families, key papers, benchmarks/baselines, and the most promising gaps to explore.</em></p>

<p><strong>A.</strong> <em>(Literature review.)</em></p>

<p>After a fairly broad scan (24 minutes, 277 sources), it returned a structured summary and a handful of recent papers that were new to me.</p>

<div class="table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Paper / year (real title)</th>
        <th>Cite</th>
        <th>Short description</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><strong>Revised Note on Learning Algorithms for Quadratic Assignment with Graph Neural Networks</strong> (2018)</td>
        <td><d-cite key="nowak2018revised_qap_gnn"></d-cite></td>
        <td>Supervised GNN method trained on synthetic/planted regimes to learn QAP-like assignment behavior (primarily studied outside “full” QAPLIB facility-layout evaluation).</td>
      </tr>
      <tr>
        <td><strong>Neural Graph Matching Network: Learning Lawler’s Quadratic Assignment Problem With Extension to Hypergraph and Multiple-Graph Matching</strong> (2019; TPAMI 2022)</td>
        <td><d-cite key="wang2022ngm_tpami"></d-cite></td>
        <td>Association-graph neural matching + Sinkhorn projection to predict permutations for Lawler-QAP / graph matching; extends to hypergraph and multi-graph matching settings.</td>
      </tr>
      <tr>
        <td><strong>Solving the Quadratic Assignment Problem using Deep Reinforcement Learning</strong> (2023)</td>
        <td><d-cite key="bagga2023drl_qap_double_pointer"></d-cite></td>
        <td>Constructive RL for Koopmans–Beckmann QAP using an alternating “double pointer” style policy over facilities/locations, trained with actor–critic.</td>
      </tr>
      <tr>
        <td><strong>Revocable Deep Reinforcement Learning with Affinity Regularization for Outlier-Robust Graph Matching</strong> (ICLR 2023)</td>
        <td><d-cite key="liu2023rgm_affinity"></d-cite></td>
        <td>RL-based solver for Lawler-QAP/graph matching with outliers; uses revocable decisions and affinity regularization to improve robustness to unmatched nodes.</td>
      </tr>
      <tr>
        <td><strong>Learning Solution-Aware Transformers for Efficiently Solving Quadratic Assignment Problem</strong> (ICML 2024)</td>
        <td><d-cite key="tan2024sawt_qap"></d-cite></td>
        <td>Learn-to-improve transformer that conditions on the incumbent assignment (“solution-aware”) and outputs swap-style improvement actions; evaluated on synthetic KB-QAP and QAPLIB generalization.</td>
      </tr>
      <tr>
        <td><strong>Unsupervised Learning for Quadratic Assignment</strong> (2025)</td>
        <td><d-cite key="min2025ul_qap_plume"></d-cite></td>
        <td>Unsupervised permutation learning (no solved targets) to generate strong initial solutions that are refined by tabu search; studies instance-regime and size variation.</td>
      </tr>
      <tr>
        <td><strong>BiQAP: Neural Bi-level Optimization-based Framework for Solving Quadratic Assignment Problems</strong> (KDD 2025)</td>
        <td><d-cite key="shi2025biqap"></d-cite></td>
        <td>Bi-level framework where a network guides a differentiable entropic relaxation (inner solver) while optimizing the original QAP objective at the outer level.</td>
      </tr>
      <tr>
        <td><strong>Learning the Quadratic Assignment Problem with Warm-Started MCMC Finetuning and Cross-Graph Attention</strong> (ICLR 2026 submission)</td>
        <td><d-cite key="pan2026plma_qap_mcmc"></d-cite></td>
        <td>Cross-graph attention + warm-started MCMC finetuning/adaptation at inference time; energy-based formulation for efficient swap proposals and runtime–quality tradeoffs.</td>
      </tr>
    </tbody>
  </table>

</div>

<p>For this project, I want to isolate what a neural policy can do <strong>without leaning on an external metaheuristic</strong>. Those hybrid pipelines are often excellent in practice, but they also blur the real potential of a neural solver, treating it as a neural component inside a classical solver.</p>

<p>So I’ll treat the following as <em>hybrid baselines</em> rather than the main object of study:</p>
<ul>
  <li>
    <d-cite key="min2025ul_qap_plume"></d-cite>
    <p>(explicit tabu-search backbone; the network primarily provides an initialization)</p>
  </li>
  <li>
    <d-cite key="pan2026plma_qap_mcmc"></d-cite>
    <p>(explicit inference-time MCMC finetuning/adaptation)</p>
  </li>
</ul>

<p>Instead, the most relevant direction for me is the learn-to-improve family, especially <d-cite key="tan2024sawt_qap"></d-cite>—because it looks structurally close to what makes classical QAP solvers strong: <strong>swap-based improvement trajectories</strong>.</p>

<p>I’ll start from methods with public code so the comparisons are not based on re-implementations:</p>

<ul>
  <li>
    <d-cite key="nowak2018revised_qap_gnn"></d-cite>
    <p>— <code class="language-plaintext highlighter-rouge">alexnowakvila/QAP_pt</code></p>
  </li>
  <li>
    <d-cite key="wang2022ngm_tpami"></d-cite>
    <p>— <code class="language-plaintext highlighter-rouge">Thinklab-SJTU/ThinkMatch</code> (NGM model folder)</p>
  </li>
  <li>
    <d-cite key="bagga2023drl_qap_double_pointer"></d-cite>
    <p>— <code class="language-plaintext highlighter-rouge">psbagga17/drlqap</code></p>
  </li>
  <li>
    <d-cite key="liu2023rgm_affinity"></d-cite>
    <p>— <code class="language-plaintext highlighter-rouge">Thinklab-SJTU/RGM</code></p>
  </li>
  <li>
    <d-cite key="tan2024sawt_qap"></d-cite>
    <p>— <code class="language-plaintext highlighter-rouge">PKUTAN/SAWT</code></p>
  </li>
</ul>

<p>A key point already shows up here: these baselines actually split into <strong>two partially different “QAPs”</strong> in practice:</p>
<ul>
  <li><em>facility-layout / Koopmans–Beckmann</em> settings (often geometry-structured),</li>
  <li><em>Lawler-QAP / graph matching</em> settings (affinity-driven, association-graph flavored).</li>
</ul>

<hr />

<h2 id="stage-2-gaps-within-neural-qap">Stage 2: Gaps within Neural QAP</h2>

<p>The literature report came back with the usual “checkbox gaps”: scalability beyond small $n$, generalization to QAPLIB, hybridization with exact solvers, lack of standardized benchmarks, and a <em>“theory and guarantees”</em> bucket (help!). All true, but also a bit too generic.</p>

<p>What I <em>did</em> take seriously is the evaluation angle. 
QAP papers are hard to compare because each one uses its own evaluation protocol: different instance generators, different time budgets, different baselines. 
So rather than jumping straight into model design, I decided to first produce a <strong>clean benchmark + reproducible baselines</strong>.</p>

<p>That said, I didn’t want this weekend to become an exercise of reviewing theoretical proofs, or re-implementing plain tabu search for the tenth time. 
My point is to explore something <strong>algorithmic</strong>: a representation choice, an inference procedure, or a learning pipeline that is genuinely better aligned with QAP’s structure.</p>

<p>So I asked the LLM again:</p>

<p><b>Q.</b> <em>Ignore generic “future work” (scale, more benchmarks, theory). Based on existing neural QAP methods, list 5–8 specific, mechanism-level gaps and a concrete improvement for each.</em></p>

<p><strong>A.</strong> It came back with several mechanism-level gaps. Here are the ones I found most interesting:</p>

<ul>
  <li>
    <p><strong>“What problem are we solving?” (KB-QAP vs Lawler-QAP mismatch).</strong> Several neural papers labeled as “QAP” are really solving different formulations (layout-flavored KB-QAP vs matching-flavored Lawler-QAP). This makes comparisons misleading unless the benchmark is explicit about which regime it targets. Some neural pipelines implicitly rely on coordinates and Euclidean geometry (a habit from TSP). In QAP, and particularly in QAPLIB, $D$ is often just an arbitrary cost matrix (grid/manhattan, random, structured, etc.), so a coordinate-based inductive bias can break silently.</p>
  </li>
  <li>
    <p><strong>Association-graph materialization blows up memory.</strong> Methods that build the full association graph quickly hit GPU-memory limits as $n$ grows, making large QAPLIB instances impractical.</p>
  </li>
  <li>
    <p><strong>Constructive RL tends to be underpowered unless it gets strong structure priors.</strong> Constructive decoders face hard credit assignment because early choices interact quadratically with everything that comes later. This is visible in small-$n$ RL papers that do not scale cleanly.</p>
  </li>
  <li>
    <p><strong>Additive scoring/energy misses quadratic coupling.</strong> Surrogates that score assignments as $\sum_i \phi_{i,\pi(i)}$ cannot represent the true quadratic interactions, biasing move ranking and exploration, especially on dense instances.</p>
  </li>
  <li>
    <p><strong>Sparse/ambiguous reward in learn-to-improve RL.</strong> Treating non-improving steps as “zero signal” (and not explicitly penalizing worsening moves) makes plateaus look identical and wastes steps under tight budgets.</p>
  </li>
</ul>

<hr />

<h2 id="stage-3-the-proposal-architecture-and-traininginference-recipe">Stage 3: The Proposal: Architecture and Training/Inference Recipe</h2>

<p>Before touching model design, I implemented the <em>classical baselines</em> (greedy construction + swap local search), <em>neural baselines</em> (RGM <d-cite key="liu2023rgm_affinity"></d-cite>, SAWT <d-cite key="tan2024sawt_qap"></d-cite>) and a synthetic evaluation pack. This gave me a stable loop: generate instances → run solver under a fixed step budget → report objective / gap / runtime.</p>

<h3 id="a-minimal-neural-improvement-starting-point">A minimal neural-improvement starting point</h3>

<p>The first neural baseline is deliberately simple: a <strong>swap-based neural improvement</strong> policy.</p>

<ul>
  <li><strong>State:</strong> an incumbent permutation $\pi$ (starting from a sequential initialization).</li>
  <li><strong>Action:</strong> choose a swap $(i,j)$ and apply it.</li>
  <li><strong>Rollout:</strong> fixed horizon of $T=4$ steps.</li>
  <li><strong>Training:</strong> RL on improvement trajectories, with the normalized improvement over $T$ steps as reward.</li>
</ul>
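<p>For reference, the effect of a single swap action can be evaluated in $O(n)$ instead of recomputing the full $O(n^2)$ objective, since only the terms touching the two swapped facilities change. The sketch below is one straightforward way to do that; the function names and the random instance are assumptions for illustration, not the implementation used in this project.</p>

```python
import random


def qap_cost(F, D, perm):
    n = len(perm)
    return sum(F[i][j] * D[perm[i]][perm[j]] for i in range(n) for j in range(n))


def touched_cost(F, D, perm, r, s):
    """Sum of the objective terms whose row or column index is r or s."""
    n = len(perm)
    total = 0
    for i in (r, s):                      # rows r and s, all columns
        for j in range(n):
            total += F[i][j] * D[perm[i]][perm[j]]
    for j in (r, s):                      # columns r and s, remaining rows only
        for i in range(n):
            if i != r and i != s:         # rows r and s were already counted
                total += F[i][j] * D[perm[i]][perm[j]]
    return total


def swap_delta(F, D, perm, r, s):
    """Cost change if facilities r and s exchange locations (r != s), in O(n)."""
    before = touched_cost(F, D, perm, r, s)
    perm[r], perm[s] = perm[s], perm[r]
    after = touched_cost(F, D, perm, r, s)
    perm[r], perm[s] = perm[s], perm[r]   # restore the incumbent
    return after - before


# Hypothetical random instance; asymmetric matrices are handled as well.
rng = random.Random(0)
n = 6
F = [[rng.randint(0, 9) for _ in range(n)] for _ in range(n)]
D = [[rng.randint(0, 9) for _ in range(n)] for _ in range(n)]
perm = list(range(n))                     # sequential initialization
delta_01 = swap_delta(F, D, perm, 0, 1)
```

<p>A negative <code>delta_01</code> means the swap improves the incumbent. Terms that involve neither facility cancel out, which is why they never need to be visited.</p>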

<p>The main design question then became: <em>what inductive bias should the policy have to pick good swaps reliably?</em><br />
Instead of trying to design the final model in one shot, I iterated quickly with a coding agent: propose a modification, implement, train a short run, keep the changes that improve the validation curve.</p>

<p>Below is a compact log of the most informative iterations (validation on 100 instances of size $n=5$, reporting the best gap to optimal, in %, found within 20 steps).</p>

<hr />

<h3 id="iteration-log">Iteration log</h3>

<h4 id="1-baseline-swap-policy-trained-on-n5">1) Baseline swap policy (trained on $n=5$)</h4>
<p><strong>Result (gap to optimal @ $n=5$):</strong> 1.8522%<br />
<strong>What it is:</strong> a basic swap policy trained end-to-end with RL, without strong QAP structure injected.</p>

<h4 id="2-minmax-instance-normalization">2) Min–max instance normalization</h4>
<p><strong>Result (gap to optimal @ $n=5$):</strong> 1.6824%<br />
<strong>Change:</strong> normalize the instance matrices (flow and distance) with min–max scaling.<br />
<strong>Why it helped:</strong> reduced sensitivity to absolute scale and improved optimization stability.</p>
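<p>A minimal sketch of what min-max scaling of one matrix could look like. It assumes each matrix is rescaled independently over all of its entries; the post does not pin down that detail, and <code>min_max_normalize</code> is a hypothetical name.</p>

```python
def min_max_normalize(M):
    """Rescale all entries of a matrix to [0, 1]; a constant matrix maps to zeros."""
    lo = min(min(row) for row in M)
    hi = max(max(row) for row in M)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in M]
    return [[(x - lo) / span for x in row] for row in M]


# Hypothetical flow matrix.
F = [[2, 4], [6, 10]]
F_norm = min_max_normalize(F)
```

<p>Under this assumption the rescaling is harmless for the optimization itself: the sum of all entries of $F$, and of $D$ under any permutation, is constant, so normalizing both matrices changes the objective only by a positive scale and a constant offset, and the ranking of permutations is preserved.</p>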

<h4 id="3-hetero_rel_transformer-2n-heterogeneous-tokens--relational-attention">3) <code class="language-plaintext highlighter-rouge">hetero_rel_transformer</code> (2$n$ heterogeneous tokens + relational attention)</h4>
<p><strong>Result (gap to optimal @ $n=5$):</strong> 1.5965%<br />
<strong>Change:</strong> represent each instance with <strong>2$n$ tokens</strong>:</p>
<ul>
  <li>$n$ facility tokens + $n$ location tokens,<br />
and inject instance structure via <strong>relation-biased attention</strong> (flows on facility–facility edges, distances on location–location edges, assignment links on cross edges).<br />
<strong>Why it helped:</strong> forced the policy to treat QAP structure as pairwise relations rather than flat features.</li>
</ul>

<h4 id="4-hetero_rel_transformer-v2-solution-aware-relations--stronger-attention-control">4) <code class="language-plaintext highlighter-rouge">hetero_rel_transformer v2</code> (solution-aware relations + stronger attention control)</h4>
<p><strong>Result (gap to optimal @ $n=5$):</strong> 1.1659%<br />
<strong>Changes (core ones):</strong></p>
<ul>
  <li>add <strong>induced distance</strong> $D_{\pi(i)\pi(j)}$ as a first-class relation feature,</li>
  <li>add <strong>per-head gates</strong> that balance content attention ($qk$) vs relation bias,</li>
  <li>optionally modulate value aggregation with relation features.<br />
<strong>Why it helped:</strong> made the encoder explicitly <em>solution-aware</em> in the same way swap delta computations are.</li>
</ul>

<h4 id="5-1-dual-encoder-layer-facility-only--location-only-pre-encoding">5) +1 dual-encoder layer (facility-only + location-only pre-encoding)</h4>
<p><strong>Result (gap to optimal @ $n=5$):</strong> 1.0301%<br />
<strong>Change:</strong> before the 2$n$ hetero encoder, run a small “static” encoder:</p>
<ul>
  <li>facility-only attention over $F$,</li>
  <li>location-only attention over $D$,
then fuse these static embeddings with the dynamic (solution-conditioned) tokens.<br />
<strong>Why it helped:</strong> improved token quality by encoding within-type structure (flow communities, distance geometry) before mixing types.</li>
</ul>

<h4 id="6-pre-norm-gtlayer--relation-features-as-both-bias-and-gate-e1e2">6) Pre-norm <code class="language-plaintext highlighter-rouge">GTLayer</code> + relation features as both bias and gate (e1/e2)</h4>
<p><strong>Result (gap to optimal @ $n=5$):</strong> 0.6666%<br />
<strong>Change:</strong> switch the main layers to a <strong>pre-norm transformer</strong> where relation features affect attention in two ways:</p>
<ul>
  <li><strong>e1:</strong> additive bias on attention logits,</li>
  <li><strong>e2:</strong> multiplicative gate on attention probabilities (relation-conditioned filtering).<br />
<strong>Why it helped:</strong> the model can learn both <em>which relations matter</em> (gate) and <em>how strongly</em> (bias), which is closely aligned with swap evaluation.</li>
</ul>
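<p>To illustrate the bias-plus-gate idea, here is a single-query, single-head sketch in plain Python. Several details are assumptions made for the example rather than facts about <code>GTLayer</code>: <code>e1</code> and <code>e2</code> are scalars per query-key pair, the gate is a sigmoid, and the gated weights are not renormalized.</p>

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relational_attention(q, keys, values, e1, e2):
    """One query attending over keys, with relation features e1 (bias) and e2 (gate)."""
    d = len(q)
    # e1: additive bias on the scaled dot-product logits.
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + bias
              for k, bias in zip(keys, e1)]
    probs = softmax(logits)
    # e2: multiplicative gate applied to the attention probabilities.
    gated = [p * sigmoid(g) for p, g in zip(probs, e2)]
    out = [sum(w * v[c] for w, v in zip(gated, values)) for c in range(len(values[0]))]
    return out, gated


# Hypothetical toy inputs: the gate switches off the third key.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = relational_attention(q, keys, values, e1=[0.0, 0.0, 0.0], e2=[10.0, 10.0, -10.0])
```

<p>In this toy call the third key receives almost no weight even though its dot-product score ties with the first key, which is the kind of relation-conditioned filtering described above.</p>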

<h4 id="7-3-static-pre-layers-stronger-dual-encoder">7) +3 static (pre) layers (stronger dual encoder)</h4>
<p><strong>Result (gap to optimal @ $n=5$):</strong> 0.5694%<br />
<strong>Change:</strong> deepen the static pre-encoder (3 layers).<br />
<strong>Observation:</strong> more capacity helps in-distribution, but can start to over-specialize and hurt out-of-distribution generalization.</p>

<h4 id="8-hetero_rel_transformer_3n_assign_pairwl-3n-tokens--explicit-assignment-tokens--pairwise-reasoning">8) <code class="language-plaintext highlighter-rouge">hetero_rel_transformer_3n_assign_pairwl</code> (3$n$ tokens + explicit assignment tokens + pairwise reasoning)</h4>
<p><strong>Result (gap to optimal @ $n=5$):</strong> 0.3677%<br />
<strong>Change:</strong> move from 2$n$ to <strong>3$n$ tokens</strong> by adding an <strong>assignment token</strong> $A_i$ for each facility:</p>
<ul>
  <li>$F_i$ (facility token), $L_a$ (location token), and $A_i=(i,\pi(i))$ (current assignment token).<br />
Then decode swaps with a <strong>pair-state module</strong> (PairWL): an approximate 2-WL-style update over assignment-token pairs $(i,j)$ using triadic composition $\sum_k (i,k)\circ(k,j)$.<br />
<strong>Why it helped:</strong> swap choice is inherently pairwise; making pair structure explicit reduces the burden on facility embeddings.</li>
</ul>
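<p>The triadic composition $\sum_k (i,k)\circ(k,j)$ can be sketched as follows. This is a deliberately stripped-down version: it assumes $\circ$ is an elementwise product of pair feature vectors and adds the result back as a residual, with no learned projections, so it should be read as an illustration of the update pattern rather than as the actual PairWL module.</p>

```python
def triadic_update(h):
    """h[i][j] is a feature vector for pair (i, j); returns h + sum_k h[i][k] * h[k][j]."""
    n = len(h)
    c = len(h[0][0])
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            # Compose the pair (i, j) through every intermediate node k, per channel.
            msg = [sum(h[i][k][ch] * h[k][j][ch] for k in range(n)) for ch in range(c)]
            row.append([a + b for a, b in zip(h[i][j], msg)])
        out.append(row)
    return out


# Hypothetical one-channel pair states: the adjacency matrix of a single edge.
h = [[[0], [1]],
     [[1], [0]]]
h_next = triadic_update(h)
```

<p>With one channel the message is just the matrix product of the pair-state matrix with itself, so the update costs $O(n^3)$ per channel. That cubic cost is consistent with the 3$n$ + PairWL variants being slower.</p>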

<h4 id="9---pairwl_use_token_feedback">9) <code class="language-plaintext highlighter-rouge">--pairwl_use_token_feedback</code></h4>
<p><strong>Result (gap to optimal @ $n=5$):</strong> 0.3128%<br />
<strong>Change:</strong> add <strong>pair-to-token feedback</strong>, pooling pair states around each assignment token and writing back into token embeddings.<br />
<strong>Why it helped (and hurt):</strong> improves in-distribution consistency, but can increase coupling and reduce robustness when scaling.</p>

<h4 id="10---pairwl_use_relation_aware_triad----pairwl_use_strong_token_feedback">10) <code class="language-plaintext highlighter-rouge">--pairwl_use_relation_aware_triad + --pairwl_use_strong_token_feedback</code></h4>
<p><strong>Result (gap to optimal @ $n=5$):</strong> 0.2972%<br />
<strong>Changes:</strong></p>
<ul>
  <li>make the triadic composition <strong>relation-aware</strong> by gating intermediate nodes $k$ using learned functions of local pair features,</li>
  <li>strengthen the token feedback pathway (optionally top-$k$ pooling).<br />
<strong>Why it helped:</strong> pushes the pair module closer to swap-delta reasoning, where triadic context is filtered by instance- and solution-dependent cues.</li>
</ul>

<p>To summarize the effect of these design iterations, the figure below shows the best validation gap found within 20 steps (on $n=5$) after each architecture update.</p>

<p style="text-align:center;">
  <img src="/assets/img/qap_llm_gap_vs_iteration.png" alt="Best validation gap to optimal after each architecture iteration" style="width:80%; max-width:900px;" />
</p>

<hr />

<h3 id="what-i-take-from-these-iterations">What I take from these iterations</h3>

<p>1) The best improvements came from features that mimic <em>how swap local search evaluates moves</em>:</p>
<ul>
  <li>induced distances under the incumbent ($D_{\pi(i)\pi(j)}$),</li>
  <li>strong control of relation influence (bias + gate),</li>
  <li>explicit pair reasoning when decoding swaps.</li>
</ul>

<p>2) Extra capacity helps, but only if it reinforces the right structure.<br />
   The 3$n$ + PairWL variants are slower, but they improved generalization markedly at larger $n$ under the same step budget.</p>

<p>In the next stage, I’ll formalize the evaluation protocol and compare these variants against the baselines under consistent budgets.</p>

<hr />

<h2 id="stage-4-experiments-and-evaluation-protocol">Stage 4: Experiments and Evaluation Protocol</h2>

<p>After several architecture iterations, the next step is to evaluate under a more realistic protocol: more test instances, multiple sizes, and direct comparison against both learned and classical baselines.</p>

<p>The model is trained exclusively on instances of size $n=5$. We now evaluate how well it generalizes across sizes
$n \in \{5,10,15,20\}$.</p>

<h3 id="baselines">Baselines</h3>

<p>We compare against four baselines that cover both “neural” and “classical” points of reference:</p>

<ul>
  <li><strong>Learned baselines:</strong>  RGM (neural constructive trained on $n=20$) and SAWT (neural improvement trained on $n=20$).</li>
  <li><strong>Greedy constructive:</strong> a simple non-learned heuristic that builds a permutation once.</li>
  <li><strong>Swap local search:</strong> a classical improvement baseline operating in the same 2-swap neighborhood, run until it gets stuck in a local optimum.</li>
</ul>
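<p>For concreteness, a swap local search of this kind can be sketched in a few lines. The version below uses best-improvement move selection and recomputes the full objective for every candidate; both choices are assumptions made to keep the example short, and the baseline used in the experiments may differ (for example first-improvement, or incremental deltas).</p>

```python
import random


def qap_cost(F, D, perm):
    n = len(perm)
    return sum(F[i][j] * D[perm[i]][perm[j]] for i in range(n) for j in range(n))


def swap_local_search(F, D, perm):
    """Best-improvement search in the pairwise-swap neighborhood until no swap helps."""
    perm = list(perm)
    n = len(perm)
    best = qap_cost(F, D, perm)
    history = [best]                      # best-so-far cost after each accepted move
    while True:
        best_move, best_cost = None, best
        for r in range(n):
            for s in range(r + 1, n):
                perm[r], perm[s] = perm[s], perm[r]
                cost = qap_cost(F, D, perm)
                perm[r], perm[s] = perm[s], perm[r]
                if cost < best_cost:
                    best_move, best_cost = (r, s), cost
        if best_move is None:             # local optimum: no swap improves the cost
            return perm, best, history
        r, s = best_move
        perm[r], perm[s] = perm[s], perm[r]
        best = best_cost
        history.append(best)


# Hypothetical random instance, started from the sequential permutation.
rng = random.Random(1)
n = 7
F = [[rng.randint(0, 9) for _ in range(n)] for _ in range(n)]
D = [[rng.randint(0, 9) for _ in range(n)] for _ in range(n)]
local_opt, local_cost, history = swap_local_search(F, D, list(range(n)))
```

<p>The returned <code>history</code> is strictly decreasing, and the search stops at the first permutation that no single swap can improve.</p>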

<p>We report <strong>anytime performance</strong> (best-so-far objective versus time).</p>

<p style="text-align:center;">
  <img src="/assets/img/anytime_euclid_panel.png" alt="Anytime performance across sizes" style="width:80%; max-width:900px;" />
</p>

<p>Results show a clear pattern: the policy trained on $n=5$ remains competitive on $n=5$ and exhibits non-trivial transfer to larger sizes. Under the same evaluation budget, it achieves the strongest final objective among the tested methods, indicating that the improvements learned in the small regime are not purely size-specific.</p>

<hr />

<h2 id="stage-5-wrapping-up">Stage 5: Wrapping up</h2>

<p>I did not expect the final outcome to look this strong.</p>

<p>The most surprising part is not that a neural model can learn to propose useful swaps (that is plausible in hindsight), but that <em>the iteration loop</em> was fast enough to make that progress in a couple of days. In the past, I had explored similar ideas, but the cost of implementation details (plumbing, debugging, profiling, refactors) made each iteration much longer. Ideas accumulated in unfinished branches and abandoned folders, long before reaching a clean evaluation protocol.</p>

<p>This time, the workflow was different: iterate on the architecture and training loop quickly, validate with a tight benchmark, keep only the changes that move the curve. That speed changes what is feasible.</p>

<p>There is a trade-off, though. Delegating implementation to a coding agent reduces friction, but it also reduces direct contact with every line of code. That loss of “full manual control” is not free: it can hide bugs, bake in unexamined assumptions, and make it easier to mistake artifacts for progress. The right posture is to embrace the speed while staying disciplined about verification.</p>

<p>I initially planned to end with a short, paper-like write-up, but the results were promising enough to justify a more complete treatment. Instead, I’m turning this into a proper article: cleaner experimental design, stronger baselines, and a thorough set of benchmarks and ablations to pinpoint what actually drives the gains. More soon.</p>

<d-bibliography></d-bibliography>]]></content><author><name>Andoni Irazusta Garmendia</name></author><category term="nco" /><summary type="html"><![CDATA[Weekend project with LLMs.]]></summary></entry><entry><title type="html">Scale Generalization Problem in NCO</title><link href="https://theleprechaun25.github.io/blog/2026/scale-generalization-problem/" rel="alternate" type="text/html" title="Scale Generalization Problem in NCO" /><published>2026-02-17T00:00:00+00:00</published><updated>2026-02-17T00:00:00+00:00</updated><id>https://theleprechaun25.github.io/blog/2026/scale-generalization-problem</id><content type="html" xml:base="https://theleprechaun25.github.io/blog/2026/scale-generalization-problem/"><![CDATA[<h2 id="neural-improvement-for-the-tsp">Neural Improvement for the TSP</h2>

<p>Neural Combinatorial Optimization (NCO) often gets introduced through <em>constructive</em> models: a policy that builds a solution from scratch, one decision at a time. Neural Improvement (NI) flips the script: start from an existing solution and repeatedly apply small edits that make it better.</p>

<p>The Traveling Salesperson Problem (TSP) will be our playground to explain NI.</p>

<p>A TSP instance with $N$ cities is a set of coordinates</p>

\[X^{(N)} = (x_1,\ldots,x_N)\in \mathbb{R}^{N\times 2}.\]

<p>A tour is a Hamiltonian cycle, commonly represented as a permutation $\pi\in S_N$, with cyclic indexing $\pi_{N+1}=\pi_1$. The tour length is</p>

\[L(\pi; X^{(N)}) = \sum_{t=1}^{N} \|x_{\pi_t}-x_{\pi_{t+1}}\|_2.\]
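<p>As a quick sanity check of the formula, here is a small sketch of the tour length with cyclic indexing. The function name and the unit-square instance are illustrative assumptions.</p>

```python
import math


def tour_length(coords, tour):
    """Sum of Euclidean edge lengths along the tour, closing the cycle at the end."""
    n = len(tour)
    return sum(math.dist(coords[tour[t]], coords[tour[(t + 1) % n]]) for t in range(n))


# Hypothetical instance: the four corners of the unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
perimeter_tour = tour_length(square, [0, 1, 2, 3])   # walks around the square
crossing_tour = tour_length(square, [0, 2, 1, 3])    # uses both diagonals
```

<p>The perimeter tour has length 4, while the crossing tour has length $2+2\sqrt{2}$, which is the kind of inefficiency a 2-opt move removes.</p>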

<p>In NI, we also define a <em>move operator</em> $\Phi$ (e.g., 2-opt), and an action $a_t$ that selects a particular move. Starting from an initial tour $\pi^{(0)}$, NI produces a sequence</p>

\[\pi^{(t+1)} = \Phi(\pi^{(t)}, a_t),\qquad t=0,1,\ldots,T-1.\]
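<p>For 2-opt, $\Phi$ can be sketched as a segment reversal. The indexing convention below (0-based positions $i&lt;j$, reversing the segment between them) and the helper names are assumptions for illustration.</p>

```python
def two_opt(tour, i, j):
    """Reverse tour[i+1 .. j], replacing edges (i, i+1) and (j, j+1) by (i, j) and (i+1, j+1)."""
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]


def two_opt_actions(n):
    """All position pairs (i, j) that actually change the cyclic tour."""
    return [(i, j)
            for i in range(n)
            for j in range(i + 2, n)             # j = i + 1 would reverse a single city
            if not (i == 0 and j == n - 1)]      # this pair only reverses the direction


# Uncrossing a hypothetical tour over four cities.
uncrossed = two_opt([0, 2, 1, 3], 0, 2)
num_actions = len(two_opt_actions(10))
```

<p>With this convention there are $N(N-3)/2$ distinct moves per state, so <code>num_actions</code> is 35 for $N=10$ and the count grows quadratically with $N$.</p>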

<p>Given a step budget $T$, the goal is to quickly drive the tour cost down.</p>

<blockquote>
  <p><strong>One sentence mental model:</strong> NI is a learned local-search heuristic whose policy is applied repeatedly; scale generalization asks whether that heuristic remains valid when $N$ grows.</p>
</blockquote>

<hr />

<h2 id="why-scale-generalization-is-hard">Why scale generalization is hard</h2>

<p>If you train on TSP instances with $N\in[20,100]$, you’d love the same learned improver to work on $N=500$ or $N=1000$. That is the <em>cross-scale generalization</em> problem.</p>

<h3 id="1-the-action-set-grows-with-n">1) The action set grows with $N$</h3>

<p>For 2-opt, an action is typically “pick two breakpoints” $(i,j)$, leading to $\lvert\mathcal{A}_N\rvert=\Theta(N^2)$. Even if your model outputs a score for every candidate move, the decision surface changes with $N$.</p>

<p>In plain terms: at $N=50$ you might have a handful of obviously good 2-opt moves; at $N=500$, you have <em>many</em> plausible moves, and the best ones can be annoyingly close.</p>

<h3 id="2-the-state-distribution-shifts-with-n">2) The state distribution shifts with $N$</h3>

<p>A random tour at $N=50$ looks different than at $N=500$. So does a “partially improved” tour after $t$ edits. The density of crossings, typical edge lengths, and the distribution of “available easy wins” all change.</p>

<p>So even if your architecture can technically process any $N$, the model still faces a distribution shift:</p>

\[s \sim \mathcal{D}_N \quad \text{changes with } N.\]

<h3 id="3-the-horizon-grows">3) The horizon grows</h3>

<p>If your inference budget scales like $T(N)=4N$ for example, then the policy is used for much longer rollouts on large instances. Small systematic biases that are harmless at $N=50$ can compound at $N=500$.</p>

<hr />

<h2 id="2-step-imitation-learning-across-sizes">2-step imitation learning across sizes</h2>

<p>Let’s think about how we could train an NI policy using imitation learning (IL).</p>

<p>We could build a “teacher” that chooses the best move, then train a network to imitate it. But in NI, a <em>myopic</em> teacher can be misleading because some moves are “setup moves” that enable better improvements later.</p>

<p>That motivates a <strong>k-step optimal teacher</strong>. Let’s use <strong>k=2</strong> to start.</p>

<p>Let $s=(X^{(N)},\pi)$ be the state. For a first action $a_1$, define the 2-step lookahead value</p>

\[Q^{(2)}(s,a_1) = \min_{a_2\in \mathcal{A}_N(\Phi(\pi,a_1))}
L\big(\Phi(\Phi(\pi,a_1),a_2); X^{(N)}\big).\]

<p>Then the teacher action is</p>

\[a^\star(s) = \arg\min_{a_1\in\mathcal{A}_N(\pi)} Q^{(2)}(s,a_1).\]
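<p>A brute-force sketch of this teacher for 2-opt is shown below. It follows the definition literally, so the second move is mandatory (there is no "do nothing" action), and all helper names and the random instance are assumptions for illustration.</p>

```python
import math
import random


def tour_length(coords, tour):
    n = len(tour)
    return sum(math.dist(coords[tour[t]], coords[tour[(t + 1) % n]]) for t in range(n))


def two_opt(tour, i, j):
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]


def two_opt_actions(n):
    return [(i, j) for i in range(n) for j in range(i + 2, n)
            if not (i == 0 and j == n - 1)]


def q2(coords, tour, a1):
    """Best tour length reachable by applying a1 and then the best second move."""
    after_first = two_opt(tour, *a1)
    return min(tour_length(coords, two_opt(after_first, *a2))
               for a2 in two_opt_actions(len(tour)))


def teacher_action(coords, tour):
    """First move with the lowest 2-step lookahead value."""
    return min(two_opt_actions(len(tour)), key=lambda a1: q2(coords, tour, a1))


# Hypothetical small instance with a random starting tour.
rng = random.Random(0)
n = 7
coords = [(rng.random(), rng.random()) for _ in range(n)]
tour = list(range(n))
rng.shuffle(tour)
a_star = teacher_action(coords, tour)
```

<p>Each label requires evaluating $\Theta(N^2)\cdot\Theta(N^2)$ move pairs, which is one practical reason an exact depth-2 teacher may only be affordable at small sizes.</p>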

<p>Your IL objective becomes</p>

\[\min_\theta\; \mathbb{E}_{s\sim\mathcal{D}}
\big[ -\log \pi_\theta(a^\star(s)\mid s) \big].\]

<h3 id="a-simple-setup-move-example">A simple “setup move” example</h3>

<p>Suppose three candidate first moves have immediate improvements:</p>

<ul>
  <li>$a$: improves by $+1.0$, but then nothing good remains $\Rightarrow$ total $+1.0$.</li>
  <li>$b$: improves by $+0.2$, but unlocks a second move of $+3.0$ $\Rightarrow$ total $+3.2$.</li>
  <li>$c$: improves by $+0.8$, then $+0.1$ $\Rightarrow$ total $+0.9$.</li>
</ul>

<p>A greedy teacher picks $a$. A 2-step teacher picks $b$.<br />
So 2-step IL teaches “planning” even in a local-search setting.</p>
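<p>The same comparison, written out as a tiny sketch (the dictionary layout is just an illustrative encoding of the three moves above):</p>

```python
# (immediate gain, best follow-up gain) for each candidate first move.
moves = {"a": (1.0, 0.0), "b": (0.2, 3.0), "c": (0.8, 0.1)}

greedy_choice = max(moves, key=lambda m: moves[m][0])        # looks one step ahead
two_step_choice = max(moves, key=lambda m: sum(moves[m]))    # looks two steps ahead
```

<p>Here <code>greedy_choice</code> is <code>"a"</code> and <code>two_step_choice</code> is <code>"b"</code>.</p>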

<p>A useful tweak is <strong>soft imitation</strong>:</p>

\[p_T(a\mid s)\propto \exp\!\big(-Q^{(2)}(s,a)/\tau\big),
\qquad
\min_\theta\; \mathbb{E}\big[ \mathrm{KL}(p_T(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s))\big].\]

<blockquote>
  <p><strong>Why this matters for scale:</strong> the number of near-tied 2-step choices tends to increase with $N$, so one-hot labels become brittle. A soft teacher distribution can stabilize training.</p>
</blockquote>
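<p>As a small illustration of the soft teacher, the sketch below turns a list of $Q^{(2)}$ values into $p_T$ and evaluates the KL term. The $Q^{(2)}$ values and temperatures are made up for the example, and <code>softTeacher</code> and <code>kl</code> are hypothetical helper names.</p>

```javascript
// Soft teacher: p_T(a|s) proportional to exp(-Q2(s,a)/tau).
// Subtracting the max logit keeps the exponentials numerically stable.
function softTeacher(q2Values, tau) {
  const logits = q2Values.map((q) => -q / tau);
  const maxLogit = Math.max(...logits);
  const weights = logits.map((l) => Math.exp(l - maxLogit));
  const z = weights.reduce((s, w) => s + w, 0);
  return weights.map((w) => w / z);
}

// KL(p || q) for two discrete distributions over the same action set.
// Terms with p[k] = 0 contribute nothing; q[k] is assumed positive elsewhere.
function kl(p, q) {
  let total = 0;
  for (let k = 0; k < p.length; k++) {
    if (p[k] > 0) total += p[k] * Math.log(p[k] / q[k]);
  }
  return total;
}

// Two near-tied actions and one clearly worse one (hypothetical values).
const q2Values = [10.0, 10.01, 12.0];
const sharp = softTeacher(q2Values, 0.001); // almost one-hot on the first action
const soft = softTeacher(q2Values, 0.1);    // the two near-ties share the mass

// Loss contribution of a uniform student under the soft teacher.
const uniform = [1 / 3, 1 / 3, 1 / 3];
const loss = kl(soft, uniform);
```

<p>With a small $\tau$ the teacher collapses back to the one-hot label, so $\tau$ controls how strongly near-ties are treated as equally acceptable.</p>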

<hr />

<h2 id="what-to-measure-and-how-to-stress-test">What to measure and how to stress-test</h2>

<p>If your goal is <em>cross-scale generalization</em>, you want evaluations that separate “works on the training distribution” from “learned the right invariances and planning heuristics”.</p>

<h3 id="1-cross-scale-rollout-curves-anytime-performance">1) Cross-scale rollout curves (anytime performance)</h3>

<p>For each $N$, plot best-so-far tour length versus steps $t$, normalized by a strong reference:</p>

\[\text{gap}(t) = \frac{L(\pi^{(t)}) - L_{\text{ref}}}{L_{\text{ref}}}\times 100\%.\]

<p>The reference can be a classical solver/heuristic (or best-known on synthetic benchmarks).</p>

<p>The key is not only final gap, but how the curve degrades as $N$ increases.</p>
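<p>A minimal sketch of how such a curve could be computed from one rollout: it keeps a running minimum of the tour lengths (the “best-so-far” part) and converts it to a percentage gap. The function name <code>gapCurve</code> and the numbers are illustrative only.</p>

```javascript
// Best-so-far gap (%) after each step, relative to a reference length lRef.
function gapCurve(lengths, lRef) {
  const curve = [];
  let best = Infinity;
  for (const len of lengths) {
    best = Math.min(best, len);              // running minimum over the rollout
    curve.push(((best - lRef) / lRef) * 100);
  }
  return curve;
}

// Hypothetical tour lengths along one rollout, with a reference length of 10.
const curve = gapCurve([12.0, 11.0, 11.5, 10.5], 10.0);
// curve is approximately [20, 10, 10, 5]: the worsening step 11.0 -> 11.5 is ignored.
```

<p>Averaging these curves over many instances of the same $N$, and overlaying the averages for several values of $N$, gives a direct picture of how anytime performance degrades with scale.</p>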

<h3 id="2-teacher-consistency-across-sizes">2) Teacher consistency across sizes</h3>

<p>If the teacher changes implementation with $N$ (e.g., exact depth-2 at small sizes, approximate at large sizes), quantify the discrepancy on overlapping sizes. This helps you separate “student failed to generalize” from “teacher changed”.</p>

<h3 id="3-start-from-different-initializations">3) Start from different initializations</h3>

<p>NI performance depends heavily on the initial tour distribution: random, greedy, constructive neural, POMO-style sampling, etc. Cross-scale generalization can look strong from one initializer and collapse from another.</p>

<p>A simple takeaway: <strong>changing the initializer changes the state distribution</strong>, and the learned improver is only as robust as the diversity of states it saw during training.</p>

<hr />

<h2 id="references">References</h2>]]></content><author><name>Andoni Irazusta Garmendia</name></author><category term="nco" /><summary type="html"><![CDATA[An introduction to the cross-scale generalization via the Traveling Salesperson Problem.]]></summary></entry><entry><title type="html">Introduction to Neural Combinatorial Optimization (NCO)</title><link href="https://theleprechaun25.github.io/blog/2026/introduction-nco/" rel="alternate" type="text/html" title="Introduction to Neural Combinatorial Optimization (NCO)" /><published>2026-01-27T00:00:00+00:00</published><updated>2026-01-27T00:00:00+00:00</updated><id>https://theleprechaun25.github.io/blog/2026/introduction-nco</id><content type="html" xml:base="https://theleprechaun25.github.io/blog/2026/introduction-nco/"><![CDATA[<h2 id="combinatorial-optimization">Combinatorial Optimization</h2>

<p>Combinatorial optimization (CO) sits underneath a huge range of real systems: routing and logistics, scheduling, allocation, packing, and graph problems such as partitioning or cuts. What makes these problems challenging is not that the objective is mysterious, but that the number of feasible solutions typically grows <strong>combinatorially</strong> with instance size.</p>

<p>Classical solvers and heuristics are the result of decades of human algorithmic effort. They are not “generic black boxes”: they encode substantial <strong>problem structure</strong> and <strong>domain knowledge</strong>—which neighborhoods to search, which relaxations to solve, which cuts to add, which branching rules work, which invariances matter, which parameters to tune. This accumulated craft is a major reason why mature optimization toolchains remain extremely strong in practice.</p>

<p>Surveys provide a broad picture of how machine learning has entered this landscape, and how CO problems are used as testbeds for learning-based decision-making <d-cite key="bengio2021machine,mazyavkina2021reinforcement"></d-cite>.</p>

<hr />

<h2 id="the-traveling-salesperson-problem">The Traveling Salesperson Problem</h2>

<p>The Traveling Salesperson Problem (TSP) is a classical benchmark in combinatorial optimization. The name comes from its canonical story: a salesperson wants to plan a trip that visits a set of cities exactly once and returns home, while minimizing total travel distance.</p>

<p>Formally, given $N$ cities represented by coordinates $x_1,\dots,x_N \in \mathbb{R}^2$, a tour is an ordering in which each city is visited exactly once and the route returns to the start. The symmetric Euclidean TSP asks for the tour of minimum total length, where the cost between two cities is their Euclidean distance (and is symmetric). Writing a tour as a permutation of cities $\pi$, its length is the sum of distances between consecutive cities in that order, including the closing edge back to the first city:</p>

\[\min_{\pi \in S_N}\; C(\pi)
\;=\;
\sum_{t=1}^{N} \left\|x_{\pi_t} - x_{\pi_{t+1}}\right\|_2,
\qquad \pi_{N+1} := \pi_1.\]

<p>It is often useful to view a TSP instance as a complete weighted graph: each city is a node, and the cost of traveling between cities $i$ and $j$ is an edge weight $d_{ij}$. The animation below shows a random TSP instance. Hover over a city to highlight its distances to the others, and use the slider to control how many edges are displayed. You can also click Random tour to draw a uniformly random tour (a random permutation of the cities), which typically yields a poor solution. In the next section, we will present some strategies to obtain much better solutions.</p>

<div class="l-page">
  <div class="anim-wrap" id="tspAnimWrap">
    <div class="anim-toolbar">
      <button id="tspNew">New instance</button>
      <button id="tspRandTour">Random tour</button>

      <span class="pill">Edge threshold</span>
      <input id="tspThresh" type="range" min="0" max="100" value="38" />

      <span class="pill" id="tspInfo">Ready.</span>
      <span class="pill" id="tspTourInfo">Tour length: –</span>
    </div>
    <canvas class="anim-canvas" id="tspCanvas"></canvas>
  </div>
</div>

<script>
(function(){
  // -----------------------
  // RNG + geometry helpers
  // -----------------------
  function mulberry32(a){
    return function(){
      var t = a += 0x6D2B79F5;
      t = Math.imul(t ^ (t >>> 15), t | 1);
      t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
      return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    }
  }
  function dist(a,b){
    const dx=a[0]-b[0], dy=a[1]-b[1];
    return Math.hypot(dx,dy);
  }
  function tourCost(pts, tour){
    if(!tour || tour.length < 2) return NaN;
    let c=0;
    for(let k=0;k<tour.length;k++){
      const i=tour[k], j=tour[(k+1)%tour.length];
      c += dist(pts[i], pts[j]);
    }
    return c;
  }
  function shuffleInPlace(arr, rand){
    for(let i=arr.length-1;i>0;i--){
      const j = Math.floor(rand()*(i+1));
      const tmp = arr[i]; arr[i]=arr[j]; arr[j]=tmp;
    }
    return arr;
  }
  function randomTour(n, rand){
    const tour = Array.from({length:n}, (_,i)=>i);
    // Keep city 0 fixed as start for nicer visuals (rotation-invariant anyway)
    const tail = tour.slice(1);
    shuffleInPlace(tail, rand);
    return [0].concat(tail);
  }

  // -----------------------
  // DOM
  // -----------------------
  const canvas = document.getElementById("tspCanvas");
  const ctx = canvas.getContext("2d");

  const btnNew = document.getElementById("tspNew");
  const btnRandTour = document.getElementById("tspRandTour");
  const slider = document.getElementById("tspThresh");

  const info = document.getElementById("tspInfo");
  const tourInfo = document.getElementById("tspTourInfo");

  // -----------------------
  // State
  // -----------------------
  let rng = mulberry32(20260127);
  let pts = [];
  let hoverIdx = -1;

  // optional tour (null means "no tour drawn")
  let tour = null;

  // -----------------------
  // Canvas resize
  // -----------------------
  function resize(){
    const dpr = window.devicePixelRatio || 1;
    const rect = canvas.getBoundingClientRect();
    canvas.width = Math.floor(rect.width * dpr);
    canvas.height = Math.floor(rect.height * dpr);
    ctx.setTransform(dpr,0,0,dpr,0,0);
    draw();
  }
  window.addEventListener("resize", resize);

  // -----------------------
  // Instance creation
  // -----------------------
  function newInstance(n=36){
    pts=[];
    for(let i=0;i<n;i++){
      const x = 0.08 + 0.84 * rng();
      const y = 0.08 + 0.84 * rng();
      pts.push([x,y]);
    }
    hoverIdx = -1;
    tour = null; // reset tour when instance changes
    info.textContent = "New instance created.";
    tourInfo.textContent = "Tour length: –";
    draw();
  }

  // -----------------------
  // Hover logic
  // -----------------------
  function pickHover(mx, my){
    const rect = canvas.getBoundingClientRect();
    const W = rect.width, H = rect.height;
    const pad = 26;
    function X(x){ return pad + x*(W-2*pad); }
    function Y(y){ return pad + y*(H-2*pad); }

    let best=-1, bestd=1e18;
    for(let i=0;i<pts.length;i++){
      const dx = X(pts[i][0]) - mx;
      const dy = Y(pts[i][1]) - my;
      const d = Math.hypot(dx,dy);
      if(d < bestd){ bestd = d; best = i; }
    }
    return (bestd <= 14) ? best : -1;
  }

  canvas.addEventListener("mousemove", (e)=>{
    const rect = canvas.getBoundingClientRect();
    const mx = e.clientX - rect.left;
    const my = e.clientY - rect.top;
    const idx = pickHover(mx, my);
    if(idx !== hoverIdx){
      hoverIdx = idx;
      if(hoverIdx>=0) info.textContent = `Selected city ${hoverIdx+1}/${pts.length}.`;
      else info.textContent = "Hover cities to see distances.";
      draw();
    }
  });
  canvas.addEventListener("mouseleave", ()=>{
    hoverIdx = -1;
    info.textContent = "Hover cities to see distances.";
    draw();
  });

  // -----------------------
  // Draw
  // -----------------------
  function draw(){
    const rect = canvas.getBoundingClientRect();
    const W = rect.width, H = rect.height;
    ctx.clearRect(0,0,W,H);

    const dark = document.documentElement.getAttribute("data-theme")==="dark";
    const edge = dark ? "rgba(240,240,240,.20)" : "rgba(0,0,0,.10)";
    const edgeStrong = dark ? "rgba(240,240,240,.55)" : "rgba(0,0,0,.28)";
    const nodeFill = dark ? "rgba(230,230,230,.92)" : "rgba(30,30,30,.88)";
    const nodeStroke = dark ? "rgba(255,255,255,.25)" : "rgba(0,0,0,.12)";
    const label = dark ? "rgba(255,255,255,.70)" : "rgba(0,0,0,.60)";
    const accent = getComputedStyle(document.documentElement).getPropertyValue("--global-theme-color").trim() || (dark ? "#00C060" : "#00A550");

    const pad = 26;
    function X(x){ return pad + x*(W-2*pad); }
    function Y(y){ return pad + y*(H-2*pad); }

    // threshold from slider percentile of pairwise distances
    let thr = 0;
    if(pts.length >= 2){
      const ds = [];
      for(let i=0;i<pts.length;i++){
        for(let j=i+1;j<pts.length;j++){
          ds.push(dist(pts[i], pts[j]));
        }
      }
      ds.sort((a,b)=>a-b);
      const q = Math.max(0, Math.min(100, +slider.value)) / 100.0;
      thr = ds[Math.floor(q * (ds.length-1))];
    }

    // draw edges under threshold
    ctx.lineWidth = 1.2;
    for(let i=0;i<pts.length;i++){
      for(let j=i+1;j<pts.length;j++){
        const d = dist(pts[i], pts[j]);
        if(d > thr) continue;
        ctx.strokeStyle = edge;
        ctx.beginPath();
        ctx.moveTo(X(pts[i][0]), Y(pts[i][1]));
        ctx.lineTo(X(pts[j][0]), Y(pts[j][1]));
        ctx.stroke();
      }
    }

    // if hovering, draw distances from selected city
    if(hoverIdx >= 0){
      ctx.lineWidth = 2.6;
      for(let j=0;j<pts.length;j++){
        if(j===hoverIdx) continue;
        ctx.strokeStyle = edgeStrong;
        ctx.beginPath();
        ctx.moveTo(X(pts[hoverIdx][0]), Y(pts[hoverIdx][1]));
        ctx.lineTo(X(pts[j][0]), Y(pts[j][1]));
        ctx.stroke();
      }
    }

    // draw tour if present
    if(tour && tour.length === pts.length){
      ctx.strokeStyle = accent;
      ctx.lineWidth = 3.2;
      ctx.beginPath();
      ctx.moveTo(X(pts[tour[0]][0]), Y(pts[tour[0]][1]));
      for(let k=1;k<tour.length;k++){
        ctx.lineTo(X(pts[tour[k]][0]), Y(pts[tour[k]][1]));
      }
      ctx.lineTo(X(pts[tour[0]][0]), Y(pts[tour[0]][1]));
      ctx.stroke();
      ctx.lineWidth = 1.2;
    }

    // nodes
    for(let i=0;i<pts.length;i++){
      const px = X(pts[i][0]), py = Y(pts[i][1]);
      const r = (i===hoverIdx) ? 7.5 : 5.2;

      ctx.beginPath();
      ctx.arc(px,py,r,0,Math.PI*2);
      ctx.fillStyle = nodeFill;
      ctx.fill();
      ctx.lineWidth = 1.2;
      ctx.strokeStyle = nodeStroke;
      ctx.stroke();

      if(i===0){
        // mark start node
        ctx.beginPath();
        ctx.arc(px,py,r+4.5,0,Math.PI*2);
        ctx.strokeStyle = accent;
        ctx.lineWidth = 2.6;
        ctx.stroke();
      }
    }

    // legend
    ctx.fillStyle = label;
    ctx.font = "12px system-ui, -apple-system, Segoe UI, Roboto, sans-serif";
    if(pts.length >= 2){
      const q = Math.round((+slider.value));
      ctx.fillText(`Edge threshold quantile: ${q}%`, 18, H-14);
    }
  }

  // -----------------------
  // UI actions
  // -----------------------
  slider.addEventListener("input", draw);

  btnNew.addEventListener("click", ()=>{
    rng = mulberry32((Math.random()*1e9)>>>0);
    newInstance(36);
  });

  btnRandTour.addEventListener("click", ()=>{
    if(!pts.length){
      info.textContent = "Create an instance first.";
      return;
    }
    // use a fresh seed so repeated clicks give different tours
    const localRng = mulberry32((Math.random()*1e9)>>>0);
    tour = randomTour(pts.length, localRng);
    const c = tourCost(pts, tour);
    tourInfo.textContent = `Tour length: ${c.toFixed(2)}`;
    info.textContent = "Random tour generated.";
    draw();
  });

  // -----------------------
  // init
  // -----------------------
  newInstance(36);
  resize();
})();
</script>

<hr />

<h2 id="classical-algorithms-for-the-tsp">Classical Algorithms for the TSP</h2>

<p>Even for moderate $N$, the search space is enormous: fixing a start city still leaves $(N-1)!$ possible tours ($(N-1)!/2$ in the symmetric TSP, since reversing a tour gives the same length). For example, with $N=20$ this is $19! \approx 1.2\times 10^{17}$ candidate tours. Exhaustively evaluating every tour and selecting the best is therefore computationally infeasible.</p>
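<p>The count above is easy to reproduce exactly. The snippet below is a small sketch using <code>BigInt</code>, since these numbers exceed the range in which ordinary JavaScript numbers are exact; <code>numTours</code> is a name chosen for this example.</p>

```javascript
// n! as a BigInt.
function factorial(n) {
  let out = 1n;
  for (let k = 2n; k <= BigInt(n); k++) out *= k;
  return out;
}

// Tours with a fixed start city: (N-1)! if direction matters,
// (N-1)!/2 if a tour and its reversal count as the same tour (N >= 3).
function numTours(n, symmetric = true) {
  const directed = factorial(n - 1);
  return symmetric ? directed / 2n : directed;
}

console.log(numTours(20, false).toString()); // 121645100408832000
console.log(numTours(20, true).toString());  // 60822550204416000
```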

<p>There are <strong>exact</strong> solvers that can find optimal tours by combining relaxations, bounds, and systematic search (e.g., branch-and-cut), but they are not the focus of this post. Instead, we will look at two classical strategies for obtaining high-quality (though not necessarily optimal) tours efficiently: <strong>constructive heuristics</strong> and <strong>local search</strong>.</p>

<h3 id="constructive-heuristics">Constructive heuristics</h3>

<p>A heuristic is a “strategy or rule” used when finding the optimal solution is computationally impractical. Constructive heuristics build a solution step by step, making each choice according to a fixed rule.</p>

<p>There are many constructive heuristics for the TSP; here are two of the best known:</p>

<ul>
  <li><strong>Nearest Neighbor</strong>: Starting from a city, repeatedly go to the nearest unvisited city, until all cities are visited, then go back to the initial city.</li>
  <li><strong>Nearest Insertion</strong>: Maintain a partial tour; at each step, take the unvisited city closest to the tour and insert it where it increases the tour length the least.</li>
</ul>
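<p>Nearest Neighbor is implemented in the demo further down, so here is a sketch of Nearest Insertion only. It follows the usual two-phase rule (select the unvisited city closest to the current tour, then insert it at the cheapest position) and is written for clarity rather than speed; the function name <code>nearestInsertionTour</code> is illustrative.</p>

```javascript
function dist(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Nearest insertion sketch: roughly O(N^3) as written, fine for small demos.
function nearestInsertionTour(pts, start = 0) {
  const n = pts.length;
  const tour = [start];
  const inTour = new Array(n).fill(false);
  inTour[start] = true;

  while (tour.length < n) {
    // 1) Selection: the unvisited city closest to any city already in the tour.
    let city = -1, cityDist = Infinity;
    for (let c = 0; c < n; c++) {
      if (inTour[c]) continue;
      for (const t of tour) {
        const d = dist(pts[c], pts[t]);
        if (d < cityDist) { cityDist = d; city = c; }
      }
    }

    // 2) Insertion: the edge (a, b) whose replacement by (a, city), (city, b)
    //    increases the tour length the least.
    let bestPos = 1, bestInc = Infinity;
    for (let k = 0; k < tour.length; k++) {
      const a = pts[tour[k]], b = pts[tour[(k + 1) % tour.length]];
      const inc = dist(a, pts[city]) + dist(pts[city], b) - dist(a, b);
      if (inc < bestInc) { bestInc = inc; bestPos = k + 1; }
    }
    tour.splice(bestPos, 0, city);
    inTour[city] = true;
  }
  return tour;
}

// Usage: four corners of the unit square.
const square = [[0, 0], [1, 0], [1, 1], [0, 1]];
const tour = nearestInsertionTour(square);
```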

<p>The primary drawback of these methods is that they are myopic: they never “look ahead.” Nearest Neighbor, for instance, has a worst-case approximation factor that grows logarithmically with $N$ on metric instances (Nearest Insertion fares better in the worst case, staying within a factor of 2 of the optimum). In plain English: as the map gets bigger, the gap between the heuristic’s tour and the actual shortest tour can grow significantly. These methods are also prone to “the lighthouse effect,” where they travel efficiently for most of the trip but are forced to take a massive, inefficient leap at the end to close the loop.</p>

<h3 id="local-search">Local search</h3>

<p>Once a constructive heuristic provides an initial feasible solution, <strong>Local Search (LS)</strong> takes over to optimize it. It operates on the principle of neighborhoods: it takes the current route and looks at “neighboring” routes that are only slightly different.</p>

<p>To move through the neighborhood of a solution, we define operators (which we will later call actions). One of the most widely used operators for the TSP is <strong>2-opt</strong>: it selects two non-adjacent edges, (A,B) and (C,D), deletes them, and replaces them with (A,C) and (B,D), which amounts to reversing the segment of the tour between B and C.</p>

<p>LS therefore takes a solution and tests several 2-opt moves; if one of them shortens the tour, it moves there, and it keeps improving the solution in this way until no improving 2-opt move remains in the neighborhood (we have reached a local optimum).</p>

<h3 id="demo">Demo</h3>
<p>The demo below shows how these two families work on the TSP:</p>

<div class="l-page">
  <div class="anim-wrap" id="ncoAnimWrap">
    <div class="anim-toolbar">
      <button id="modeConstruct">Constructive</button>
      <button id="modeImprove">Improvement (2-opt)</button>
      <button id="btnReset">Reset</button>
      <button id="btnStep">Step</button>
      <button id="btnPlay">Play</button>
      <span class="pill" id="pillInfo">Ready.</span>
      <span class="pill" id="pillCost">Cost: –</span>
      <span class="pill" id="pillStep">Step: –</span>
    </div>
    <canvas class="anim-canvas" id="ncoCanvas"></canvas>
  </div>
</div>

<script>
(function(){
  // ---- helpers ----
  function mulberry32(a){
    return function(){
      var t = a += 0x6D2B79F5;
      t = Math.imul(t ^ (t >>> 15), t | 1);
      t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
      return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    }
  }
  function dist(a,b){
    const dx=a[0]-b[0], dy=a[1]-b[1];
    return Math.hypot(dx,dy);
  }
  function tourCost(pts, tour){
    let c=0;
    for(let k=0;k<tour.length;k++){
      const i=tour[k], j=tour[(k+1)%tour.length];
      c += dist(pts[i], pts[j]);
    }
    return c;
  }
  function nearestNeighborTour(pts, start=0){
    const n=pts.length;
    const used=new Array(n).fill(false);
    const tour=[start];
    used[start]=true;
    for(let t=1;t<n;t++){
      const last=tour[tour.length-1];
      let best=-1, bestd=1e18;
      for(let j=0;j<n;j++){
        if(used[j]) continue;
        const d=dist(pts[last], pts[j]);
        if(d<bestd){ bestd=d; best=j; }
      }
      tour.push(best);
      used[best]=true;
    }
    return tour;
  }
  function twoOptBestMove(pts, tour){
    // returns best (i,j,delta) for 2-opt (reverse segment between i+1..j)
    const n=tour.length;
    let bestDelta=0, bestI=-1, bestJ=-1;
    for(let i=0;i<n;i++){
      const a=tour[i], b=tour[(i+1)%n];
      for(let j=i+2;j<n;j++){
        if(i===0 && j===n-1) continue; // avoid full reversal
        const c=tour[j], d=tour[(j+1)%n];
        const before = dist(pts[a], pts[b]) + dist(pts[c], pts[d]);
        const after  = dist(pts[a], pts[c]) + dist(pts[b], pts[d]);
        const delta = after - before; // negative is improvement
        if(delta < bestDelta){
          bestDelta=delta; bestI=i; bestJ=j;
        }
      }
    }
    return {i:bestI, j:bestJ, delta:bestDelta};
  }
  function applyTwoOpt(tour, i, j){
    // reverse segment (i+1..j)
    const out=tour.slice();
    let l=i+1, r=j;
    while(l<r){
      const tmp=out[l]; out[l]=out[r]; out[r]=tmp;
      l++; r--;
    }
    return out;
  }

  // ---- DOM ----
  const canvas = document.getElementById("ncoCanvas");
  const ctx = canvas.getContext("2d");
  const pillInfo = document.getElementById("pillInfo");
  const pillCost = document.getElementById("pillCost");
  const pillStep = document.getElementById("pillStep");

  const btnReset = document.getElementById("btnReset");
  const btnStep = document.getElementById("btnStep");
  const btnPlay = document.getElementById("btnPlay");
  const modeConstruct = document.getElementById("modeConstruct");
  const modeImprove = document.getElementById("modeImprove");

  // ---- state ----
  let mode = "construct";
  let rng = mulberry32(1337);
  let pts = [];
  let tour = [];
  let partial = [];
  let visited = [];
  let step = 0;
  let playing = false;
  let timer = null;
  let lastMove = null;

  function resize(){
    const dpr = window.devicePixelRatio || 1;
    const rect = canvas.getBoundingClientRect();
    canvas.width = Math.floor(rect.width * dpr);
    canvas.height = Math.floor(rect.height * dpr);
    ctx.setTransform(dpr,0,0,dpr,0,0);
    draw();
  }
  window.addEventListener("resize", resize);

  function makeInstance(n=34){
    pts=[];
    for(let i=0;i<n;i++){
      const x = 0.08 + 0.84 * rng();
      const y = 0.08 + 0.84 * rng();
      pts.push([x,y]);
    }
  }

  function initConstruct(){
    partial = [0];
    visited = new Array(pts.length).fill(false);
    visited[0]=true;
    step = 0;
    lastMove = null;
    pillInfo.textContent = "Constructive: sequential selection (illustrative).";
    updatePills();
  }

  function initImprove(){
    tour = nearestNeighborTour(pts, 0);
    step = 0;
    lastMove = null;
    pillInfo.textContent = "Improvement: greedy 2-opt steps.";
    updatePills();
  }

  function resetAll(){
    stop();
    rng = mulberry32(1337);
    makeInstance(34);
    if(mode==="construct") initConstruct(); else initImprove();
    draw();
  }

  function stop(){
    playing=false;
    if(timer){ clearInterval(timer); timer=null; }
    btnPlay.textContent = "Play";
  }

  function play(){
    if(playing){ stop(); return; }
    playing=true;
    btnPlay.textContent = "Pause";
    timer = setInterval(()=>{
      const progressed = stepOnce();
      if(!progressed) stop();
    }, 220);
  }

  function updatePills(){
    if(mode==="construct"){
      pillCost.textContent = partial.length>=2 ? ("Partial cost: " + tourCost(pts, partial).toFixed(2)) : "Partial cost: –";
      pillStep.textContent = "Step: " + step + "/" + (pts.length-1);
    }else{
      pillCost.textContent = "Cost: " + tourCost(pts, tour).toFixed(2);
      pillStep.textContent = "Step: " + step;
    }
  }

  function stepOnce(){
    if(mode==="construct"){
      if(partial.length === pts.length){
        pillInfo.textContent = "Constructive: tour completed.";
        updatePills(); draw();
        return false;
      }
      const last = partial[partial.length-1];
      let best=-1, bestd=1e18;
      for(let j=0;j<pts.length;j++){
        if(visited[j]) continue;
        const d=dist(pts[last], pts[j]);
        if(d<bestd){ bestd=d; best=j; }
      }
      partial.push(best);
      visited[best]=true;
      step++;
      lastMove = {type:"pick", a:last, b:best};
      pillInfo.textContent = "Constructive: picked next city.";
      updatePills(); draw();
      return true;
    }else{
      const mv = twoOptBestMove(pts, tour);
      if(mv.i<0){
        pillInfo.textContent = "Improvement: reached a 2-opt local optimum.";
        updatePills(); draw();
        return false;
      }
      lastMove = {type:"2opt", i:mv.i, j:mv.j};
      tour = applyTwoOpt(tour, mv.i, mv.j);
      step++;
      pillInfo.textContent = "Improvement: applied a 2-opt swap.";
      updatePills(); draw();
      return true;
    }
  }

  function draw(){
    const rect = canvas.getBoundingClientRect();
    const W = rect.width, H = rect.height;
    ctx.clearRect(0,0,W,H);

    const dark = document.documentElement.getAttribute("data-theme")==="dark";
    const edge = dark ? "rgba(240,240,240,.35)" : "rgba(0,0,0,.18)";
    const edgeStrong = dark ? "rgba(240,240,240,.70)" : "rgba(0,0,0,.40)";
    const nodeFill = dark ? "rgba(230,230,230,.90)" : "rgba(30,30,30,.85)";
    const nodeStroke = dark ? "rgba(255,255,255,.25)" : "rgba(0,0,0,.15)";
    const accent = getComputedStyle(document.documentElement).getPropertyValue("--global-theme-color").trim() || (dark ? "#00C060" : "#00A550");

    const pad = 26;
    function X(x){ return pad + x*(W-2*pad); }
    function Y(y){ return pad + y*(H-2*pad); }

    function drawPath(path, close=false){
      if(path.length<2) return;
      ctx.beginPath();
      ctx.moveTo(X(pts[path[0]][0]), Y(pts[path[0]][1]));
      for(let k=1;k<path.length;k++){
        ctx.lineTo(X(pts[path[k]][0]), Y(pts[path[k]][1]));
      }
      if(close) ctx.lineTo(X(pts[path[0]][0]), Y(pts[path[0]][1]));
      ctx.strokeStyle = edgeStrong;
      ctx.lineWidth = 2;
      ctx.stroke();
    }

    if(mode==="construct"){
      drawPath(partial, false);
      if(partial.length===pts.length) drawPath(partial, true);
    }else{
      drawPath(tour, true);
    }

    // highlight last move
    if(lastMove && lastMove.type==="pick"){
      ctx.strokeStyle = accent;
      ctx.lineWidth = 4;
      ctx.beginPath();
      ctx.moveTo(X(pts[lastMove.a][0]), Y(pts[lastMove.a][1]));
      ctx.lineTo(X(pts[lastMove.b][0]), Y(pts[lastMove.b][1]));
      ctx.stroke();
    }
    if(lastMove && lastMove.type==="2opt"){
      const n=tour.length;
      const i=lastMove.i, j=lastMove.j;
      const a = tour[i], b = tour[(i+1)%n];
      const c = tour[j], d = tour[(j+1)%n];
      ctx.strokeStyle = accent;
      ctx.lineWidth = 4;
      ctx.beginPath();
      ctx.moveTo(X(pts[a][0]), Y(pts[a][1])); ctx.lineTo(X(pts[b][0]), Y(pts[b][1]));
      ctx.moveTo(X(pts[c][0]), Y(pts[c][1])); ctx.lineTo(X(pts[d][0]), Y(pts[d][1]));
      ctx.stroke();
    }

    // nodes
    for(let i=0;i<pts.length;i++){
      const px = X(pts[i][0]), py = Y(pts[i][1]);
      ctx.beginPath();
      ctx.arc(px,py,5.2,0,Math.PI*2);
      ctx.fillStyle = nodeFill;
      ctx.fill();
      ctx.lineWidth = 1.2;
      ctx.strokeStyle = nodeStroke;
      ctx.stroke();
    }

    // start marker
    ctx.beginPath();
    ctx.arc(X(pts[0][0]), Y(pts[0][1]), 8, 0, Math.PI*2);
    ctx.strokeStyle = accent;
    ctx.lineWidth = 3;
    ctx.stroke();

    // footer
    ctx.fillStyle = dark ? "rgba(255,255,255,.60)" : "rgba(0,0,0,.55)";
    ctx.font = "12px system-ui, -apple-system, Segoe UI, Roboto, sans-serif";
    const tag = (mode==="construct")
      ? "Constructive: one-pass tour construction (illustrative policy)"
      : "Improvement: iterative local search (greedy 2-opt)";
    ctx.fillText(tag, 18, H-14);
  }

  modeConstruct.addEventListener("click", ()=>{
    stop(); mode="construct"; initConstruct(); draw();
  });
  modeImprove.addEventListener("click", ()=>{
    stop(); mode="improve"; initImprove(); draw();
  });
  btnReset.addEventListener("click", resetAll);
  btnStep.addEventListener("click", ()=>{ stop(); stepOnce(); });
  btnPlay.addEventListener("click", play);

  resetAll();
  resize();
})();
</script>

<hr />

<h2 id="why-neural-combinatorial-optimization">Why Neural Combinatorial Optimization</h2>

<p>How can we improve over simple heuristics without paying the computational cost of exact solvers? This is the motivation behind <strong>Neural Combinatorial Optimization (NCO)</strong>. The core idea is to <strong>learn strong heuristics from data</strong>: rather than hand-crafting simple decision rules, we train a neural network to make the key choices that a solver repeatedly faces.</p>

<p>To make it concrete, consider a constructive TSP heuristic such as Nearest Neighbor. At each step it applies a fixed rule: “go to the closest unvisited city.” NCO asks: what if, instead, we could query a model: <em>given this instance and the current city, where should I go next?</em> A learned policy can condition on richer context than a single distance comparison, potentially capturing patterns that are hard to encode with simple rules.</p>

<p>This learning perspective is attractive because neural networks have shown a strong ability to learn complex input–output mappings from large datasets. In NCO, we leverage that ability by training on a dataset of problem instances, and then deploying the trained model at inference time to produce solutions for <strong>unseen instances</strong>.</p>

<hr />

<h2 id="neural-combinatorial-optimization-methods">Neural Combinatorial Optimization Methods</h2>

<p>The idea of using neural networks for combinatorial optimization predates modern deep learning: early work explored Hopfield networks for problems like TSP. <d-cite key="hopfield1985tsp"></d-cite> What changed around 2015 is that new neural architectures and training recipes, together with increased computational power, made it practical to learn <em>decision-making policies</em> that operate on variable-size sets and graphs. In routing, a key step was learning to map <strong>sets/graphs → permutations/structures</strong>, first with pointer-style decoders <d-cite key="vinyals2015pointer"></d-cite> and then with attention-based encoder–decoders <d-cite key="bello2016neural,kool2019attention"></d-cite>.</p>

<p>Since then, mirroring the classical algorithms above, NCO methods have diversified into two main families:</p>
<ol>
  <li><strong>Neural Constructive (NC) models</strong>, which produce a solution from scratch.</li>
  <li><strong>Neural Improvement (NI) models</strong>, which, starting from a complete tour, iteratively apply local modifications to it.</li>
</ol>

<h3 id="a-unifying-view">A unifying view</h3>

<p>Both families can be described by the same template. Let $x$ denote the <strong>static instance information</strong> (e.g., city coordinates or a distance matrix). Let $s_t$ denote the <strong>dynamic state</strong> at decision step $t$ (information that changes from step to step). A neural solver is then a policy</p>

\[a_t \sim \pi_\theta(\cdot \mid x, s_t),\]

<p>where the action $a_t$ is either “pick the next element” (constructive) or “apply a modification” (improvement).</p>

<ul>
  <li>In constructive TSP, $s_t$ typically includes the set of visited cities and the current partial tour.</li>
  <li>In improvement TSP, $s_t$ is typically a complete tour (and sometimes a short history of recent moves).</li>
</ul>

<p>This framing makes clear that both are <strong>sequential decision problems</strong>, differing mainly in <em>what the action space is</em> and <em>how the dynamic state is represented</em>.</p>

<h3 id="neural-constructive-methods">Neural constructive methods</h3>

<p>We can distinguish two types of constructive methods: autoregressive and non-autoregressive.</p>

<p><strong>Autoregressive (AR) construction</strong> builds a tour one city at a time:</p>

\[\pi = (\pi_1,\ldots,\pi_N), \qquad \pi_t \sim \pi_\theta(\cdot \mid x, \pi_{&lt;t}),\]

<p>often implemented with an encoder that embeds the cities and a decoder that attends over remaining (unvisited) nodes. <d-cite key="vinyals2015pointer,kool2019attention"></d-cite></p>
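<p>The decoding loop itself is simple, and a sketch may help separate it from the network. In the snippet below the “model” is a stand-in: <code>scoreFn</code> just returns negative distances to the current city, where a trained decoder would return learned logits. The parts that carry over to a real AR model are the mask over visited cities and the step-by-step sampling. All names here are assumptions made for the example.</p>

```javascript
// Softmax over cities with visited ones masked out (probability exactly 0).
function maskedSoftmax(logits, visited) {
  const masked = logits.map((l, i) => (visited[i] ? -Infinity : l));
  const m = Math.max(...masked);
  const w = masked.map((l) => Math.exp(l - m));
  const z = w.reduce((s, x) => s + x, 0);
  return w.map((x) => x / z);
}

// Sample an index from a discrete distribution; zero-probability entries are never returned.
function sampleIndex(probs, rand) {
  const u = rand();
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u < acc) return i;
  }
  // Rounding fallback: last index with positive probability.
  for (let i = probs.length - 1; i >= 0; i--) {
    if (probs[i] > 0) return i;
  }
  return -1;
}

// Stand-in for the decoder: logit = -distance to the current city.
function scoreFn(pts, current) {
  const c = pts[current];
  return pts.map((p) => -Math.hypot(p[0] - c[0], p[1] - c[1]));
}

// Autoregressive construction: pi_t ~ pi_theta(. | x, pi_{<t}).
function constructTour(pts, rand = Math.random, start = 0) {
  const visited = new Array(pts.length).fill(false);
  const tour = [start];
  visited[start] = true;
  while (tour.length < pts.length) {
    const logits = scoreFn(pts, tour[tour.length - 1]);
    const probs = maskedSoftmax(logits, visited);
    const next = sampleIndex(probs, rand);
    tour.push(next);
    visited[next] = true;
  }
  return tour;
}

// Usage on a small hypothetical instance.
const cities = [[0.1, 0.2], [0.8, 0.3], [0.5, 0.9], [0.2, 0.7], [0.9, 0.8]];
const sampled = constructTour(cities);
```

<p>Replacing the sampling step with an argmax gives greedy decoding; with this particular stand-in scorer, greedy decoding reduces to Nearest Neighbor.</p>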

<p><strong>Non-autoregressive (NAR) construction</strong> predicts a <em>global object</em> in one shot, commonly an edge score matrix (a “heatmap”) $H_{ij}$ that indicates how compatible it is to connect cities $i$ and $j$. A separate decoding procedure is then used to turn $H$ into a valid tour. This includes approaches based on graph prediction and diffusion-style generation. <d-cite key="joshi2019efficient,sun2023difusco"></d-cite></p>
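<p>The simplest decoding procedure is greedy: start somewhere and repeatedly follow the highest-scoring edge to an unvisited city. The sketch below shows this on a hand-written heatmap. The matrix values are invented for the example, and practical systems often use stronger decoders (for example beam search or sampling followed by local search), so treat this as the baseline case only.</p>

```javascript
// Greedy decoding of a heatmap H (n x n, higher = better edge) into a tour.
function greedyDecode(H, start = 0) {
  const n = H.length;
  const visited = new Array(n).fill(false);
  const tour = [start];
  visited[start] = true;
  while (tour.length < n) {
    const cur = tour[tour.length - 1];
    let best = -1, bestScore = -Infinity;
    for (let j = 0; j < n; j++) {
      if (visited[j]) continue;               // feasibility: each city exactly once
      if (H[cur][j] > bestScore) { bestScore = H[cur][j]; best = j; }
    }
    tour.push(best);
    visited[best] = true;
  }
  return tour;
}

// Hypothetical symmetric heatmap for 4 cities.
const H = [
  [0.0, 0.9, 0.1, 0.8],
  [0.9, 0.0, 0.7, 0.2],
  [0.1, 0.7, 0.0, 0.6],
  [0.8, 0.2, 0.6, 0.0],
];
console.log(greedyDecode(H)); // [ 0, 1, 2, 3 ]
```

<p>Note that the heatmap alone does not guarantee a valid tour; feasibility comes from the decoder, here through the <code>visited</code> mask.</p>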

<p><strong>Typical strength.</strong> Constructive models are excellent at amortizing: after training, a single run can produce a good solution quickly. Extra compute is often spent on <em>restarts</em> (sampling multiple candidates and keeping the best) rather than deeper reasoning within one construction.</p>

<hr />

<h3 id="neural-improvement-methods">Neural improvement methods</h3>

<p>Improvement methods start from a complete tour $\pi^{(0)}$ and apply a sequence of local edits:</p>

\[\pi^{(0)} \to \pi^{(1)} \to \cdots \to \pi^{(T)}, \qquad a_t \sim \pi_\theta(\cdot \mid x, \pi^{(t-1)}).\]

<p>This viewpoint connects naturally to “learning to search”: the model is not predicting a tour directly, but learning a strategy for navigating the solution space under a step budget. <d-cite key="chen2019learning"></d-cite></p>

<p><strong>Typical strength.</strong> Improvement methods naturally define an anytime procedure: you can stop at any step and return the best tour found so far, trading compute for quality in a direct way.</p>
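
<p>A minimal sketch of this anytime behavior, under simplifying assumptions: the learned policy is replaced by uniformly random 2-opt proposals (a 2-opt move reverses a segment of the tour), and a proposal is kept only if it shortens the tour. The names <code>two_opt_move</code> and <code>improve</code> are illustrative. A trained improvement model would instead choose which move to apply, and might accept temporarily worse tours.</p>

```python
import math
import random

def tour_length(coords, tour):
    """Total length of the closed tour visiting coords in the given order."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n))

def two_opt_move(tour, i, j):
    """Reverse the segment tour[i..j], which replaces two edges of the tour."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def improve(coords, tour, steps, rng):
    """Apply random 2-opt proposals for a fixed step budget, tracking the best tour."""
    def cost(t):
        return tour_length(coords, t)

    best = list(tour)
    history = [cost(best)]                      # best length after each step
    for _ in range(steps):
        i, j = sorted(rng.sample(range(1, len(best)), 2))
        candidate = two_opt_move(best, i, j)
        best = min(best, candidate, key=cost)   # on ties, min() keeps the current tour
        history.append(cost(best))
    return best, history

rng = random.Random(0)
coords = [(rng.random(), rng.random()) for _ in range(12)]
start = list(range(12))
best, history = improve(coords, start, steps=200, rng=rng)
print(round(history[0], 3), round(history[-1], 3))
```

<p>The list <code>history</code> is the anytime curve: stopping after any number of steps returns the best tour found up to that point, and its length never increases with more steps.</p>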

<hr />

<h2 id="learning-in-nco">Learning in NCO</h2>

<p>After choosing an NCO method, we still need to decide <strong>how the model learns</strong>. In practice there are three common routes:</p>

<ul>
  <li>
    <p><strong>Supervised / imitation learning.</strong><br />
Train the model to imitate a strong “teacher” (an exact solver, or a good heuristic). This is usually stable and sample-efficient, but producing good labels can be expensive.</p>
  </li>
  <li>
    <p><strong>Reinforcement learning.</strong><br />
Let the model sample its own decisions and learn from feedback derived from the objective: how good is the constructed tour, or how much does a proposed move improve it? RL avoids the need for optimal labels, but it is typically more data-hungry and can be harder to stabilize.</p>
  </li>
  <li>
    <p><strong>Unsupervised learning.</strong><br />
Learn a scoring (energy) function that assigns higher scores (lower energy) to better solutions, and then generate solutions by optimizing or sampling from that score. Many score-based and diffusion-style approaches fit this view, where learning focuses on modeling structure and inference performs the actual search.</p>
  </li>
</ul>
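
<p>To make the three routes concrete, here is one common way to write each training signal for an autoregressive TSP model. This is a sketch under assumptions rather than the formulation of any specific paper. The symbols are introduced here for illustration: $\pi^\star$ is a teacher tour, $L(\pi)$ is the length of tour $\pi$, $b(x)$ is a baseline that does not depend on the sampled tour, and $E_\theta$ is a learned energy function.</p>

\[\mathcal{L}_{\text{SL}}(\theta) = -\sum_{t=1}^{N} \log \pi_\theta\big(\pi^\star_t \mid x, \pi^\star_{&lt;t}\big)\]

\[\nabla_\theta J(\theta) = \mathbb{E}_{\pi \sim \pi_\theta(\cdot \mid x)}\Big[\big(L(\pi) - b(x)\big)\, \nabla_\theta \log \pi_\theta(\pi \mid x)\Big], \qquad J(\theta) = \mathbb{E}_{\pi \sim \pi_\theta(\cdot \mid x)}\big[L(\pi)\big]\]

\[p_\theta(\pi \mid x) \propto \exp\big(-E_\theta(x, \pi)\big)\]

<p>The first line is the usual cross-entropy loss for imitating the teacher step by step. The second is the REINFORCE policy gradient for minimizing expected tour length, where subtracting $b(x)$ reduces variance without changing the expectation. The third is the standard way to turn an energy into a distribution, so that lower-energy tours are more likely under sampling.</p>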

<hr />

<h2 id="wrapping-up">Wrapping up</h2>

<p>We have now met the main ingredients that motivate Neural Combinatorial Optimization. If there is one takeaway from this introduction, it is that NCO is best viewed as <em>learning solver behavior for a problem family</em>: the model is trained on many instances, and then reused at inference time to make fast, informed choices on unseen instances.</p>

<p>In the next posts, we will go deeper into (i) how different NCO method families spend compute, (ii) how learning signals shape solver behavior, and (iii) how to evaluate these methods fairly under realistic budgets.</p>

<d-bibliography></d-bibliography>]]></content><author><name>Andoni Irazusta Garmendia</name></author><category term="nco" /><summary type="html"><![CDATA[A visual introduction to NCO via the Traveling Salesperson Problem.]]></summary></entry></feed>