A visual introduction to NCO via the Traveling Salesperson Problem.
Combinatorial optimization (CO) sits underneath a huge range of real systems: routing and logistics, scheduling, allocation, packing, and graph problems such as partitioning or cuts. What makes these problems challenging is not that the objective is mysterious, but that the number of feasible solutions typically grows combinatorially with instance size.
Classical solvers and heuristics are the result of decades of human algorithmic effort. They are not “generic black boxes”: they encode substantial problem structure and domain knowledge—which neighborhoods to search, which relaxations to solve, which cuts to add, which branching rules work, which invariances matter, which parameters to tune. This accumulated craft is a major reason why mature optimization toolchains remain extremely strong in practice.
Surveys provide a broad picture of how machine learning has entered this landscape, and how CO problems are used as testbeds for learning-based decision-making
The Traveling Salesperson Problem (TSP) is a classical benchmark in combinatorial optimization. The name comes from its canonical story: a salesperson wants to plan a trip that visits a set of cities exactly once and returns home, while minimizing total travel distance.
Formally, given $N$ cities represented by coordinates $x_1,\dots,x_N \in \mathbb{R}^2$, a tour is an ordering in which each city is visited exactly once and the route returns to the start. The symmetric Euclidean TSP asks for the tour of minimum total length, where the cost between two cities is their Euclidean distance (and is symmetric). Writing a tour as a permutation of cities $\pi$, its length is the sum of distances between consecutive cities in that order, including the closing edge back to the first city:
\[\min_{\pi \in S_N}\; C(\pi) \;=\; \sum_{t=1}^{N} \left\|x_{\pi_t} - x_{\pi_{t+1}}\right\|_2, \qquad \pi_{N+1} := \pi_1.\]It is often useful to view a TSP instance as a complete weighted graph: each city is a node, and the cost of traveling between cities $i$ and $j$ is an edge weight $d_{ij}$. The animation below shows a random TSP instance. Hover over a city to highlight its distances to the others, and use the slider to control how many edges are displayed. You can also click Random tour to draw a uniformly random tour (a random permutation of the cities), which typically yields a poor solution. In the next section, we will present some strategies to obtain much better solutions.
Even for moderate $N$, the search space is enormous: fixing a start city still leaves $(N-1)!$ possible tours ($(N-1)!/2$ in the symmetric TSP, since reversing a tour gives the same length). For example, with $N=20$ this is $19! \approx 1.2\times 10^{17}$ candidate tours. Exhaustively evaluating every tour and selecting the best is therefore computationally infeasible.
There are exact solvers that can find optimal tours by combining relaxations, bounds, and systematic search (e.g., branch-and-cut). But that is not the focus of this blog. Instead, we will look at two classical strategies used to obtain (not optimal but high-quality) tours efficiently: constructive heuristics and local search.
A heuristic is a “strategy or rule” used when finding the optimal solution is computationally impossible. Constructive heuristics build a solution step-by-step by making choices following specific rules.
There are many constructive heuristics for the TSP, here two of the most well known:
The primary drawback of these methods is that they are myopic. Because these algorithms never “look ahead,” they often result in a logarithmic approximation factor. In plain English: as the map gets bigger, the gap between the heuristic’s guess and the actual shortest path grows significantly. They are prone to “the lighthouse effect,” where they travel efficiently for 90% of the trip but are forced to take a massive, inefficient leap at the end to close the loop.
Once a constructive heuristic provides an initial feasible solution, Local Search (LS) takes over to optimize it. It operates on the principle of neighborhoods: it takes the current route and looks at “neighboring” routes that are only slightly different.
To move in the neighborhood of solutions, we define operators (or actions later). One of the most used operator in the TSP is the 2-opt. The 2-opt selects two non-adjacent edges, (A,B) and (C,D), deletes them, and replaces them with (A,C) and (B,D).
Therefore, the LS takes a solution and test several 2-opt moves, if any of those moves improves the quality of the tour it moves there, and continues repeatedly improving the solution until there is no possible improving 2-opt move in the neighborhood (we reached a local optima).
In the demo below we show how these two families work in TSP:
How can we improve over simple heuristics, without paying the computational cost of exact solvers? This is the motivation behind Neural Combinatorial Optimization (NCO). The core idea is to learn strong heuristics from data: rather than simple decision rules, we train a neural network model to make the key choices that a solver repeatedly faces.
To make it concrete, consider a constructive TSP heuristic such as Nearest Neighbor. At each step it applies a fixed rule: “go to the closest unvisited city.” NCO asks: what if, instead, we could query a model: given this instance and the current city, where should I go next? A learned policy can condition on richer context than a single distance comparison, potentially capturing patterns that are hard to encode with simple rules.
This learning perspective is attractive because neural networks have shown a strong ability to learn complex input–output mappings from large datasets. In NCO, we leverage that ability by training on a dataset of problem instances, and then deploying the trained model at inference time to produce solutions for unseen instances.
The idea of using neural networks for combinatorial optimization predates modern deep learning: early work explored Hopfield networks for problems like TSP.
Since then, and imitating the classical optimization algorithms, the space of NCO methods has diversified along two main families:
Both families can be described by the same template. Let $x$ denote the static instance information (e.g., city coordinates or a distance matrix). Let $s_t$ denote the dynamic state at decision step $t$ (information that changes from step to step). A neural solver is then a policy
\[a_t \sim \pi_\theta(\cdot \mid x, s_t),\]where the action $a_t$ is either “pick the next element” (constructive) or “apply a modification” (improvement).
This framing makes clear that both are sequence problems, differing mainly in what the action space is and how the dynamic state is represented.
We can distinguish two types of constructive methods: Autoregressive and Non-autoregressive methods.
Autoregressive (AR) construction builds a tour one city at a time:
\[\pi = (\pi_1,\ldots,\pi_N), \qquad \pi_t \sim \pi_\theta(\cdot \mid x, \pi_{<t}),\]often implemented with an encoder that embeds the cities and a decoder that attends over remaining (unvisited) nodes.
Non-autoregressive (NAR) construction predicts a global object in one shot, commonly an edge score matrix (a “heatmap”) $H_{ij}$ that indicates how compatible it is to connect cities $i$ and $j$. A separate decoding procedure is then used to turn $H$ into a valid tour. This includes approaches based on graph prediction and diffusion-style generation.
Typical strength. Constructive models are excellent at amortizing: after training, a single run can produce a good solution quickly. Extra compute is often spent on restarts (sampling multiple candidates and keeping the best) rather than deeper reasoning within one construction.
Improvement methods start from a complete tour $\pi^{(0)}$ and apply a sequence of local edits:
\[\pi^{(0)} \to \pi^{(1)} \to \cdots \to \pi^{(T)}, \qquad a_t \sim \pi_\theta(\cdot \mid x, \pi^{(t-1)}).\]This viewpoint connects naturally to “learning to search”: the model is not predicting a tour directly, but learning a strategy for navigating the solution space under a step budget.
Typical strength. Improvement methods naturally define an anytime procedure: you can stop at any step and return the best tour found so far, trading compute for quality in a direct way.
After choosing an NCO method, we still need to decide how the model learns. In practice there are three common routes:
Supervised / imitation learning.
Train the model to imitate a strong “teacher” (an exact solver, or a good heuristic). This is usually stable and sample-efficient, but producing good labels can be expensive.
Reinforcement learning.
Let the model sample its own decisions and learn from feedback derived from the objective: how good is the constructed tour, or how much does a proposed move improve it? RL avoids the need for optimal labels, but it is typically more data-hungry and can be harder to stabilize.
Unsupervised learning.
Learn a scoring (energy) function that assigns higher scores (lower energy) to better solutions, and then generate solutions by optimizing or sampling from that score. Many score-based and diffusion-style approaches fit this view, where learning focuses on modeling structure and inference performs the actual search.
We have now met the main ingredients that motivate Neural Combinatorial Optimization. If there is one takeaway from this introduction, it is that NCO is best viewed as learning solver behavior for a problem family: the model is trained on many instances, and then reused at inference time to make fast, informed choices on unseen instances.
In the next posts, we will go deeper into (i) how different NCO method families spend compute, (ii) how learning signals shape solver behavior, and (iii) how to evaluate these methods fairly under realistic budgets.