## Sunday, December 10, 2017

### SciPy 1.0 linear programming

The simplex solver in SciPy has been somewhat of a problem. It is a text-book LP solver (tablaux based), so only suited for very small problems. Even on those small problems it is just not very reliable.

Here is a small example [1]:

Min -x2 -x3
x0 + 0.33*x2 + 0.67*x3 = 0.5
x1 + 0.67*x2 + 0.33*x3 = 0.5
x0 + x1 + x2 + x3 = 1.0


We can encode this problem as follows:

from scipy.optimize import linprog
a_eq = [[1.0, 0.0, 0.33, 0.67],
[0.0, 1.0, 0.67, 0.33],
[1, 1, 1, 1]]

b_eq = [0.5, 0.5, 1.0]

c = [0, 0, -1.0, -1.0]

res1 = linprog(c=c, A_eq=a_eq, b_eq=b_eq, method='simplex')
print(res1)


The simplex solver has troubles with this.  It returns a sub-optimal solution with objective function value 0:

     fun: -0.0
message: 'Optimization terminated successfully.'
nit: 4
slack: array([], dtype=float64)
status: 0
success: True
x: array([ 0.5,  0.5,  0. ,  0. ])
con: array([ -9.83213511e-13,  -9.83102488e-13,  -1.96620498e-12])


SciPy 1,0 has a new interior point solver [3] which seems a bit more robust:

res2 = linprog(c=c, A_eq=a_eq, b_eq=b_eq, method='interior-point')
print(res2)

fun: -1.000000000000884
message: 'Optimization terminated successfully.'
nit: 4
slack: array([], dtype=float64)
status: 0
success: True
x: array([  5.41163802e-13,   5.41163802e-13,   5.00000000e-01,
5.00000000e-01])


We see an optimal objective of -1. It also gives a warning:

D:\Anaconda3\lib\site-packages\scipy\optimize\_linprog_ip.py:623: OptimizeWarning: A_eq does not appear to be of full row rank. To improve performance, check the problem formulation for redundant equality constraints.
warn(redundancy_warning, OptimizeWarning)

There is no good reason a Simplex solver should give incorrect results in case the A matrix is not full-rank. Usually solvers add a slack (often not physically) to each constraint, so A will never be rank deficient. The basis matrix can always be chosen such that it is both optimal and non-singular. Here is how an optimal basis (with objective -1) looks like for this problem provided by a different simplex solver:

                    LOWER          LEVEL          UPPER         MARGINAL

---- EQU e1          0.5000         0.5000         0.5000         EPS       'Non-basic'
---- EQU e2          0.5000         0.5000         0.5000          .        'Basic'
---- EQU e3          1.0000         1.0000         1.0000        -1.0000    'Non-basic'

LOWER          LEVEL          UPPER         MARGINAL

---- VAR x0           .              .            +INF            1.0000    'Non-basic'
---- VAR x1           .              .            +INF            1.0000    'Non-basic'
---- VAR x2           .             0.5000        +INF             .        'Basic'
---- VAR x3           .             0.5000        +INF             .        'Basic'


We see equation 2 (or its slack) is made basic although the value of the slack is zero. This solution is actually both primal and dual degenerate. [4] A different solver gives:

                   LOWER          LEVEL          UPPER         MARGINAL

---- EQU e1         0.5000         0.5000         0.5000        -1.0000    'Non-basic'
---- EQU e2         0.5000         0.5000         0.5000        -1.0000    'Non-basic'
---- EQU e3         1.0000         1.0000         1.0000          .        'Basic'

LOWER          LEVEL          UPPER         MARGINAL

---- VAR x0          .              .            +INF            1.0000    'Non-basic'
---- VAR x1          .              .            +INF            1.0000    'Non-basic'
---- VAR x2          .             0.5000        +INF             .        'Basic'
---- VAR x3          .             0.5000        +INF             .        'Basic'


Same primal solution, but different basis. No hint of dual degeneracy however. Bonus question: Which of these two simplex solutions is more correct?

Actually both are correct. From linear programming 101 we know  $$y = c_B^T B^{-1}$$ where $$y$$ are the duals.  We can calculate this quickly with some R. Suppose we have:

A
##    x0 x1        x2        x3 s1 s2 s3
## e1  1  0 0.3333333 0.6666667  1  0  0
## e2  0  1 0.6666667 0.3333333  0  1  0
## e3  1  1 1.0000000 1.0000000  0  0  1
c
## x0 x1 x2 x3 s1 s2 s3
##  0  0 -1 -1  0  0  0 

Then we can do:

bs ← c('x2','x3','s2')
t(c[bs]) %*% solve(A[,bs])
##      e1 e2 e3
## [1,]  0  0 -1
bs ← c('x2','x3','s3')
t(c[bs]) %*% solve(A[,bs])
##      e1 e2 e3
## [1,] -1 -1  0

There is even a third optimal basis. It has the following duals

bs ← c('x2','x3','s1')
t(c[bs]) %*% solve(A[,bs])
##      e1 e2 e3
## [1,]  0  0 -1

In a sense we have three optimal solutions and SciPy's simplex solver is not able to find a single one of them.

#### Conclusion

In general the Simplex solver from scipy.optimize.linprog should be avoided. It fails on quite a few small models. The new interior-point solver available in SciPy 1.0 seems to behave much better.

## Friday, December 8, 2017

### Sparse Matrix export from GAMS to R

#### The problem

A colleague posed the question to me: how we can export a GAMS parameter to an R sparse matrix?. A sparse matrix is a matrix where most of the elements are zero. In large scale optimization, sparse matrices play an important role. Software (solvers, modeling systems) are exploiting sparsity to reduce memory requirements and to speed up calculations. In fact very large problems can only be solved by using sparse data structures. I actually don't think the user wants a sparse matrix, but let's assume we really want one. The R Matrix package [3] has facilities to create and manipulate sparse matrices.

We assume we have the following GAMS data:
 set    i /a,b,c,d,e/    j /e,f,g,h,i,j/ ; table d(i,j)     e f g h i j  a  1   2   4  b  c      3  d  e          5 ; execute_unload "gamsdata";

Notice that will export three items: sets i and j and the parameter d.

The big issue here is that GAMS uses a sparse storage scheme throughout. This means only the nonzero elements of matrix d are stored. Unfortunately, this also means that empty columns and empty rows are discarded: they are simply not visible. We can see this when we display the matrix:

----     17 PARAMETER d

e           g           i

a       1.000       2.000       4.000
c                   3.000
e                               5.000

The trick is to export also the domains i and j. With the aid of these domain sets we can reconstruct the sparse matrix with named rows and columns and with all empty rows and columns preserved.

Although this example is a small matrix, we assume the matrix is very larger and sparse. This means we don't want to use in R a pivot operation on d to create a 2D dense matrix. In all operations in this write-up we store and operate on just the non-zero elements.

Below I show three different approaches to populate a sparse matrix in R with GAMS data. There are some interesting issues in each approach.

#### Method 1: GDXRRW and R

library(gdxrrw)
# read d,i,j into data frames
D <- rgdx.param("gamsdata","d")
I <- rgdx.set("gamsdata","i")
J <- rgdx.set("gamsdata","j")


When we print things we see they arrived as follows:
D
##   i j d
## 1 a e 1
## 2 a g 2
## 3 a i 4
## 4 c g 3
## 5 e i 5
I
##   i
## 1 a
## 2 b
## 3 c
## 4 d
## 5 e
J
##   i
## 1 e
## 2 f
## 3 g
## 4 h
## 5 i
## 6 j
Note that all these three identifiers represent data frames. Warning: note that the columns for the tables I and J are always called i. This is a quirk of gdxrrw.

We need to create mappings from name to index position for both the rows and columns. So "a" → 1, "b" → 2, etc. R has an interesting feature: named vectors allow us to index a vector by strings. We created these named vectors as follows:

# create named vectors for index positions
# this allows us to map from name -> index position
ordi <- seq(nrow(I))
names(ordi) <- I$i ordj <- seq(nrow(J)) names(ordj) <- J$i


When we print these mapping vectors we see:
ordi
## a b c d e
## 1 2 3 4 5
ordj
## e f g h i j
## 1 2 3 4 5 6

Using these vectors we can easily convert columns i and j in the D data frame from strings to numbers.

ordi[ D$i ] # row indices ## a a a c e ## 1 1 1 3 5 ordj[ D$j ] # column indices
## e g i g i
## 1 3 5 3 5

With this we are ready to construct the sparse matrix in R:

library(Matrix)
# create sparse matrix
M <- sparseMatrix(
i = ordi[ D$i ], # row indices j = ordj[ D$j ],      # column indices
x = D$d, # nonzero values dims = c(nrow(I),nrow(J)), # size of matrix dimnames = list(I$i,J$i) # names of rows and columns )  When we print this we see: M ## 5 x 6 sparse Matrix of class "dgCMatrix" ## e f g h i j ## a 1 . 2 . 4 . ## b . . . . . . ## c . . 3 . . . ## d . . . . . . ## e . . . . 5 . In this method we used named vectors to be able to index vectors by strings. This trick helped to translate the strings forming the sets i and j into their ordinal values or index positions. #### Method 2: Add some GAMS code We can simplify the R code a bit by letting GAMS do part of the work. We can augment the matrix d with extra information about the index positions for non-zero values. set v /value,ordi,ordj/; parameter d2(i,j,v); d2(i,j, 'value') = d(i,j); d2(i,j, 'ordi')$d(i,j) = ord(i);
d2(i,j,
'ordj')$d(i,j) = ord(j); display d2; The displayed output looks like: ---- 22 PARAMETER d2 value ordi ordj a.e 1.000 1.000 1.000 a.g 2.000 1.000 3.000 a.i 4.000 1.000 5.000 c.g 3.000 3.000 3.000 e.i 5.000 5.000 5.000 Reading this data is basically the same as before: # read d,i,j into data frames D2 <- rgdx.param("gamsdata","d2") I <- rgdx.set("gamsdata","i") J <- rgdx.set("gamsdata","j")  The D2 data frame looks like: D2 ## i j v d2 ## 1 a e value 1 ## 2 a e ordi 1 ## 3 a e ordj 1 ## 4 a g value 2 ## 5 a g ordi 1 ## 6 a g ordj 3 ## 7 a i value 4 ## 8 a i ordi 1 ## 9 a i ordj 5 ## 10 c g value 3 ## 11 c g ordi 3 ## 12 c g ordj 3 ## 13 e i value 5 ## 14 e i ordi 5 ## 15 e i ordj 5 The row and columns indices and the values can be extracted by standard subsetting: # subset D2 to extract ordi, ordj and value data ordi <- D2[D2$v=="ordi","d2"]
ordj <- D2[D2$v=="ordj","d2"] v <- D2[D2$v=="value","d2"]


The sparse matrix is now easy to create:

# create sparse matrix
m <- sparseMatrix(
i = ordi,        # row indices
j = ordj,        # column indices
x = v,           # nonzero values
dims = c(nrow(I),nrow(J)),    # size of matrix
dimnames = list(I$i,J$i)      # names of rows and columns
)


Output is as expected:
m
## 5 x 6 sparse Matrix of class "dgCMatrix"
##   e f g h i j
## a 1 . 2 . 4 .
## b . . . . . .
## c . . 3 . . .
## d . . . . . .
## e . . . . 5 .

#### Method 3: SQLite and R

Here we use SQLite as an intermediate data store. This will allow us to do some SQL joins while importing the data. First we convert the GAMS data into SQLite database. This is very easy:

system("gdx2sqlite -i gamsdata.gdx -o gamsdata.db")


The reading of the data is straightforward:

library(RSQLite)
sqlite<-dbDriver("SQLite")
db <- dbConnect(sqlite,"gamsdata.db")


The data arrives as:

D
##   i j value
## 1 a e     1
## 2 a g     2
## 3 a i     4
## 4 c g     3
## 5 e i     5
I
##   i
## 1 a
## 2 b
## 3 c
## 4 d
## 5 e
J
##   j
## 1 e
## 2 f
## 3 g
## 4 h
## 5 i
## 6 j

The row numbers in an SQLite table can be retrieved by the rowid. For example:
dbGetQuery(db,"select rowid,* from i")
##   rowid i
## 1     1 a
## 2     2 b
## 3     3 c
## 4     4 d
## 5     5 e

Note: the rowid is automatically incremented by SQLite when adding rows. However, after deleting rows the rowid has no longer a direct relation to the row number. Our data base is newly created, so in our case we can assume the rowid is the same as the row number.

We use this to map from index names for the sets i and j to their ordinal numbers. We do this by using a join on tables d and i (or d and j).

The row and column indices can be retrieved as:

ordi <- dbGetQuery(db,"select i.rowid from d inner join i on i.i=d.i")[[1]]
ordj <- dbGetQuery(db,"select j.rowid from d inner join j on j.j=d.j")[[1]]


These vectors look like:

ordi
## [1] 1 1 1 3 5
ordj
## [1] 1 3 5 3 5

By now creating the sparse matrix is simple:

m <- sparseMatrix(
i = ordi,     # row indices
j = ordj,     # column indices
x = D$value, # nonzero values dims = c(nrow(I),nrow(J)), # size of matrix dimnames = list(I$i,Jj) # names of rows and columns )  The output is no surprise: m ## 5 x 6 sparse Matrix of class "dgCMatrix" ## e f g h i j ## a 1 . 2 . 4 . ## b . . . . . . ## c . . 3 . . . ## d . . . . . . ## e . . . . 5 . #### Method 4: Pivoting data frames It is my assumption the user just wanted to pivot from "long" format to "wide" format [1]. This is efficiently done with the function spread [2]: D <- rgdx.param("gamsdata","d") library(tidyr) P <- spread(D,j,d,fill=0)  This yields: D ## i j d ## 1 a e 1 ## 2 a g 2 ## 3 a i 4 ## 4 c g 3 ## 5 e i 5 P ## i e g i ## 1 a 1 2 4 ## 2 c 0 3 0 ## 3 e 0 0 5 If you don't specify the fill parameter, spread will impute NA values. Note this operation is dense: all zeros are explicitly stored. The duplication in column names can be easily fixed by something like: colnames(P)][1]="" P ## e g i ## 1 a 1 2 4 ## 2 c 0 3 0 ## 3 e 0 0 5 If we want to keep the empty rows and columns from the original matrix, it is easiest to fix this at the GAMS level: parameter d3(i,j) 'dense version'; d3(i,j) = d(i,j) + eps; display d3; The addition of the special value eps to a number is somewhat special in GAMS. We have: $x+\text{eps} = \begin{cases}x&\text{if }x\ne 0\\ \text{eps}&\text{if }x=0\end{cases}$ This trick can be used to preserve all zero (and non-zero) values. Indeed, the values of d3 are: ---- 26 PARAMETER d3 dense version e f g h i j a 1.000 EPS 2.000 EPS 4.000 EPS b EPS EPS EPS EPS EPS EPS c EPS EPS 3.000 EPS EPS EPS d EPS EPS EPS EPS EPS EPS e EPS EPS EPS EPS 5.000 EPS We can import this into R as follows: D3 <- rgdx.param("gamsdata","d3",squeeze=F) P <- spread(D3,j,d)  The squeeze option helps to deal with the eps values: we have to make sure that we map eps back to a physical zero. The results look like: P ## i e f g h i j ## 1 a 1 0 2 0 4 0 ## 2 b 0 0 0 0 0 0 ## 3 c 0 0 3 0 0 0 ## 4 d 0 0 0 0 0 0 ## 5 e 0 0 0 0 5 0 #### Conclusion It is not always so easy to move data from GAMS to R. You basically have to bridge two different worlds. In some cases you need to know much (too much?) about GAMS internals and about R tricks. #### References 1. Pivoting a table: GAMS, SQLite, R and Python, http://yetanothermathprogrammingconsultant.blogspot.com/2016/01/pivoting-table-gams-sqlite-r-and-python.html 2. Brad Boehmke, Data Processing with dplyr & tyidr, https://rpubs.com/bradleyboehmke/data_wrangling 3. Martin Maechler and Douglas Bates, 2nd Introduction to the Matrix package, https://cran.r-project.org/web/packages/Matrix/vignettes/Intro2Matrix.pdf ## Monday, December 4, 2017 ### Using Nonlinear Mixed-integer Programming In [1] we explored some linear mixed-integer programming formulations for the Least Median of Squares regression problem. Now let’s look at some nonlinear formulations. ##### Quadratic model A mixed-integer quadratically constrained formulation can look like [1]:  \begin{align}\min\>&z\\&z\ge r^2_i-M\delta_i\\ &\sum_i \delta_i = n-h\\&r_i = y_i-\sum_j X_{i,j} \beta_j\\&\delta_i \in \{0,1\}\end{align} This MIQCP model is convex and can be solved with top-of-the-line commercial solvers like Cplex and Gurobi. Some timings I see are:  ---- 145 PARAMETER report obj time nodes big-M MIP 0.262 30.594 160058.000 MIQCP 0.069 712.156 1783661.000 The objectives look different but they are actually correct: in the MIP formulation the objective reflects $$|r|_{(h)}$$ while the quadratic formulation uses $$r^2_{(h)}$$. This performance differential is not unusual. If there are good MIP formulations available, in most cases quadratic formulations are not competitive. ##### MINLP model The MINLP model from [1]:  \begin{align}\min\>&z\\&z\ge \delta_i \cdot r^2_i\\ &\sum_i \delta_i = h\\&r_i = y_i-\sum_j X_{i,j} \beta_j\\&\delta_i \in \{0,1\}\end{align} is more problematic. This is no longer convex, so we need a global solver. When we try to solve this model as is we are in a heap of problems:  ---- 193 PARAMETER report obj time nodes modelstat MINLP/Couenne 0.452 5.562 261.000 1.000 (optimal) MINLP/Baron 0.225 1000.000 71811.000 8.000 (integer solution, exceeded time limit) Obviously, the Couenne solution is not correct, and Baron is not doing so great either. Global solvers have troubles when variables do not have proper bounds. It would be good it all kind of alarm bells start ringing, but they don’t. Let’s add the following bounds:  \begin{align}&-100\le \beta_j \le 100\\&-100\le r_i \le 100\end{align} Of course in practice you may not have good bounds available. These bounds yield much better results:  ---- 193 PARAMETER report obj time nodes modelstat MINLP/Couenne 0.069 108.875 10827.000 1.000 MINLP/Baron 0.069 170.890 77.000 1.000 This is actually quite good compared to the MIQCP model. ##### Initial solutions Recently I was asked about the effect of an initial point on the behavior of Couenne. Let’s try the experiment where we do two solves in sequence. The result is:  ---- 181 PARAMETER report obj time nodes modelstat Couenne.1 0.069 108.469 10827.000 1.000 Couenne.2 0.069 113.359 11782.000 1.000 This is surprising: the second solve is actually more expensive! I have no good explanation for this. May be the initial point only affected the first sub-problem and in a bad way. Now try Baron:  ---- 181 PARAMETER report obj time nodes modelstat Baron.1 0.069 185.600 77.000 1.000 Baron.2 0.069 94.980 29.000 1.000 This looks much better. In addition the Baron log gives some clear messages about the initial point being used as an incumbent:  Starting solution is feasible with a value of 0.686748259896E-001 Doing local search Solving bounding LP Starting multi-start local search Done with local search =========================================================================== Iteration Open nodes Time (s) Lower bound Upper bound 1 1 8.25 0.00000 0.686748E-01 3+ 3 36.98 0.00000 0.686748E-01 7+ 3 66.42 0.00000 0.686748E-01 29 0 94.98 0.686747E-01 0.686747E-01 Cleaning up We can see the effect of the initial point: the upper bound is immediately equal to the objective value of the initial point. ##### Setting a bound What about instead of providing an initial point, just set a bound on $$z$$:  $z\le 0.069$ This looks like a good strategy to help the solver.  ---- 193 PARAMETER report obj time nodes modelstat Couenne 0.069 130.125 10104.000 1.000 Baron NA 115.390 15.000 19.000 (infeasible, no solution) Bummer: it does not help at all. Solving global MINLP models is not yet as “automatic” as solving an LP. You need to be aware: solvers for these type of models are a little bit more fragile. A little bit of a sobering conclusion. ##### References 1. Integer Programming and Least Median of Squares Regression, http://yetanothermathprogrammingconsultant.blogspot.com/2017/11/integer-programming-and-least-median-of.html ## Tuesday, November 21, 2017 ### Fun with indicator constraints In [1] a simple model is presented for the Least Median of Squares regression problem:  \begin{align}\min\>&z\\&\delta_i=1 \Rightarrow \> –z \le r_i \le z\\&\sum_i \delta_i = h\\&r_i = y_i - \sum_j X_{i,j} \beta_j\\&\delta_i \in \{0,1\} \end{align} Here $$X$$ and $$y$$ are data. Modern solvers like Cplex and Gurobi support indicator constraints. This allows us to directly pass on the above model to a solver without resorting to other formulations (such as binary variables with big-M constraints or SOS1 structures). This particular model is interesting: it has an unbounded LP relaxation. E.g. the start of the Cplex log shows:  Nodes Cuts/ Node Left Objective IInf Best Integer Best Bound ItCnt Gap 0 0 unbounded 0 0 2 unbounded 0 Elapsed time = 0.03 sec. (1.27 ticks, tree = 0.01 MB, solutions = 0) * 44 44 integral 0 0.4886 -0.0000 63 100.00% Found incumbent of value 0.488611 after 0.05 sec. (3.02 ticks) * 45 43 integral 0 0.4779 -0.0000 66 100.00%. . . Gurobi has some real problems on this model:  Optimize a model with 48 rows, 97 columns and 188 nonzeros Model has 94 general constraints Variable types: 50 continuous, 47 integer (47 binary) Coefficient statistics: Matrix range [1e+00, 5e+00] Objective range [1e+00, 1e+00] Bounds range [1e+00, 1e+00] RHS range [4e+00, 2e+01] Presolve added 47 rows and 47 columns Presolve time: 0.00s Presolved: 95 rows, 144 columns, 423 nonzeros Presolved model has 94 SOS constraint(s) Variable types: 97 continuous, 47 integer (47 binary) Root relaxation: unbounded, 77 iterations, 0.00 seconds Nodes | Current Node | Objective Bounds | Work Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time 0 0 postponed 0 - - - - 0s 0 0 postponed 0 - - - - 0s 0 2 postponed 0 - - - - 0s * 136 136 68 0.9658407 - - 6.1 0s * 138 136 69 0.9135526 - - 6.4 0s H 363 353 0.8212338 - - 5.1 0s H 997 786 0.6097619 - - 5.5 0s H 1414 964 0.5006481 - - 6.9 0s H 1564 892 0.3806481 - - 8.7 0s H 1713 931 0.3611765 - - 8.8 0s...139309129 6259738 0.02538 65 23 0.26206 - - 28.3 28775s 139337799 6262622 cutoff 67 0.26206 - - 28.3 28780s 139367156 6266141 cutoff 68 0.26206 - - 28.3 28785s 139395590 6268935 cutoff 67 0.26206 - - 28.3 28790s 139424342 6272187 postponed 64 0.26206 - - 28.3 28795s This model does not seem to finish (the model has only 47 discrete variables so this log indicates something is seriously wrong: this model should solve in a manner of seconds). Gurobi was never able to establish a lower bound (column BestBd). Notice that Gurobi translates the implications to SOS1 variables (see [1] for the SOS1 version of this model). There is a simple work around: add the bound $$z \ge 0$$. Now things solve very fast. We have a normal bound and no more any of these “postponed” nodes.  Root relaxation: objective 0.000000e+00, 52 iterations, 0.00 seconds Nodes | Current Node | Objective Bounds | Work Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time 0 0 0.00000 0 24 - 0.00000 - - 0s H 0 0 0.7556195 0.00000 100% - 0s 0 2 0.00000 0 24 0.75562 0.00000 100% - 0s H 54 30 0.5603846 0.00000 100% 19.3 0s. . . The bound $$z\ge 0$$ can be deduced from the constraints $$–z \le r_i \le z$$ (we know $$h$$ of them have to hold). Presolvers are not able to find this out automatically. See [2] for some other interesting observations on indicator constraints. Update: Gurobi has identified the problem and it has been fixed (available in next version). ##### References 1. Integer Programming and Least Median of Squares Regression, http://yetanothermathprogrammingconsultant.blogspot.com/2017/11/integer-programming-and-least-median-of.html 2. Belotti, P., Bonami, P., Fischetti, M. et al., On handling indicator constraints in mixed integer programming, Comput Optim Appl (2016) 65: 545. ## Tuesday, November 14, 2017 ### Integer Programming and Least Median of Squares Regression Least Median Regression [1] is another attempt to make regression more robust, i.e. less sensitive to outliers.  \begin{align}\min_{\beta}\>&\operatorname*{median}_i r_i^2 \\ &y-X\beta = r\end{align} From the summary of [1]: Classical least squares regression consists of minimizing the sum of the squared residuals. Many authors have produced more robust versions of this estimator by replacing the square by something else, such as the absolute value. In this article a different approach is introduced in which the sum is replaced by the median of the squared residuals. The resulting estimator can resist the effect of nearly 50% of contamination in the data. In the special case of simple regression, it corresponds to finding the narrowest strip covering half of the observations. Generalizations are possible to multivariate location, orthogonal regression, and hypothesis testing in linear models. For $$n$$ even, the median is often defined as the average of the two middle observations. For this regression model, usually a slightly simplifying definition is used: the median is the $$h$$-th ordered residual with $$h=\lfloor (n+1)/2\rfloor$$ (here $$\lfloor . \rfloor$$ means rounding down). For technical reasons, sometimes another value is suggested [2]:  $h=\left\lfloor \frac{n}{2} \right\rfloor + \left\lfloor \frac{p+1}{2}\right\rfloor$ where $$p$$ is the number of coefficients to estimate (number of $$\beta_j$$’s). Some algorithms use $$\lfloor (n+p+1)/2\rfloor$$. The objective function can therefore be described succinctly as:  $\min\>r^2_{(h)}$ We used here the ordering $$|r|_{(1)}\le |r|_{(2)}\le \dots\le|r|_{(n)}$$. The problem for general $$h$$’s is called least quantile of squares regression or shorter least quantile regression. For $$h=n$$ the model can be reformulated as a Chebyshev regression problem [4]. ##### Intermezzo: Minimizing k-th smallest Minimizing the $$k$$-th smallest value $$x_{(k)}$$ of a set of variables $$x_i$$, with $$i=1,\dots n$$ is not obvious. Minimizing the largest value, $$x_{(n)}$$ is easy: we can do  \begin{align}\min\>&z\\&z\ge x_i\end{align} The trick to minimize $$x_{(k)}$$ is to do:  \begin{align}\min\>&z\\&z\ge x_i- M \delta_i\\&\sum_i\delta_i = n-k\\&\delta_i \in \{0,1\}\end{align} In effect we dropped the largest $$n-k$$ values from consideration (in statistical terms this is called “trimming”). To be precise: we dropped $$n-k$$ values, and because we are minimizing, these will be the largest values automatically. Note that we do not sort the values here: sorting values in a MIP model is very expensive. ##### A quadratic model The LMS model can now be stated as:  \begin{align}\min\>&z\\&z\ge r^2_i-M\delta_i\\ &\sum_i \delta_i = n-h\\&r_i = y_i-\sum_j X_{i,j} \beta_j\\&\delta_i \in \{0,1\}\end{align} This is a MIQCP (Mixed Integer Quadratically Constrained Programming) model. It is convex so we can solve this with solvers like Cplex and Gurobi. ##### An MINLP model Redefining our $$\delta_i$$ variables, we can formulate a MINLP model:  \begin{align}\min\>&z\\&z\ge \delta_i \cdot r^2_i\\ &\sum_i \delta_i = h\\&r_i = y_i-\sum_j X_{i,j} \beta_j\\&\delta_i \in \{0,1\}\end{align} This model is very compact but requires an MINLP solver. Finding globally optimal solutions for this problem is not very easy. ##### Absolute values Minimizing the median of the squared errors is the same as minimizing the median of the absolute values of the errors. This is because the ordering of squared errors is the same as the ordering of the absolute values. Hence the objective of our problem can be stated as:  $\min\>|r|_{(h)}$ i.e., minimize the $$h$$-th smallest absolute value. Now we know how to minimize the $$h$$-th smallest value, the only hurdle left is combining this with an absolute value formulation. We follow here the sparse bounding formulation from [3,4].  \begin{align}\min\>&z\\&-z - M\delta_i \le r_i \le z + M\delta_i\\&\sum_i \delta_i = n-h\\&r_i = y_i-\sum_j X_{i,j} \beta_j\\&\delta_i \in \{0,1\}\end{align} It is not so easy to find good values for these big-$$M$$’s (see [2] for an attempt). One way around this is to use indicator constraints:  \begin{align}\min\>&z\\&\delta_i=1 \Rightarrow \> –z \le r_i \le z\\&\sum_i \delta_i = h\\&r_i = y_i-\sum_j X_{i,j} \beta_j\\&\delta_i \in \{0,1\} \end{align} Note that we have flipped the meaning of $$\delta_i$$. If your solver does not allow for indicator constraints, we can also achieve the same thing by SOS1 sets:  \begin{align}\min\>&z\\&-z-s_i\le r_i \le z + s_i\\&s_i\cdot \delta_i = 0\\&\sum_i \delta_i = h\\&r_i = y_i-\sum_j X_{i,j} \beta_j\\&\delta_i \in \{0,1\}\end{align} where $$s_i$$ are additional slack variables. They can be free or non-negative variables. The complementarity condition $$s_i\cdot \delta_i = 0$$ can be implemented by forming $$n$$ SOS1 sets with two members: $$(\delta_i,s_i)$$. Actually some solvers may reformulate indicator constraints into SOS1 sets when there are no good bounds (in which case a direct translation to a big-M binary variable reformulation is not possible). A similar but more complicated approach using variable splitting has been proposed by [5]: Besides being unnecessarily complex, I believe this formulation is not correct. An obvious optimal solution for this model would be $$\gamma=0$$ and $$\mu_i=0$$ which is not a reasonable solution for our original problem. One way to repair this model is by saying the pairs $$(z_i,\bar{\mu}_i)$$ form SOS1 sets instead of $$(z_i,\mu_i)$$. I assume this is what the authors meant. This change makes the model correct, but not any simpler. If we really want to use a variable splitting approach with SOS1 sets, I propose a much simpler model:  \begin{align}\min\>&z\\&z \ge r_i^+ + r_i^- - s_i\\&r_i^+ - r_i^- = y_i – \sum_j X_{i,j}\beta_j\\&s_i\cdot \delta_i = 0\\&\sum_i \delta_i = h\\&\delta_i \in \{0,1\}\\&r_i^+, r_i^-, s_i \ge 0\end{align} The variable splitting in this model is not immediately interpretable: we only guarantee $$|r_i|=r_i^+ + r_i^-$$ for the case where $$s_i=0$$ and $$z \ge r_i^+ + r_i^-$$ is binding. ##### Example In [6,7] a dramatic example is shown on a small data set. Another interesting plot is to compare the ordered absolute values of the residuals: Here we plotted $$|r|_{(i)}$$. The model was solved with $$h=25$$ (this is where the vertical line is drawn). As you can see, LMS (Least Median of Squares) tries to keep the absolute value of the $$h$$-th residual as small as possible, and does not care about what happens afterward. We can see the residual increase in magnitude after this point. OLS (Ordinary Least Squares) will try to minimize all squared residuals and thus is heavily influenced by the bigger ones at the end. ##### Some timings The “k out of n” constraint is always difficult for MIP solvers. We see that also here using Cplex: Our fixed version of the model from [5] is doing somewhat poorly: it has too many discrete structures. (In [5] a really good approximation is found and then passed to the MIP using MIPSTART). Data set 2 has a few interesting results. The binary variable approach with big-M’s leads to a better than optimal solution due to “trickle flow”. One of the binary variables has value: 3.5193967E-6 which is within the integer feasibility tolerance but is large enough to cause the objective to improve a bit. The Bertsimas model found the optimal solution within an hour but we were not able to prove global optimality (it still has a large relative gap). ##### A note on minimizing the true median Going back to the standard definition of the median, the “true” median is usually defined as the average of the two middle observations in case $$n$$ is even. In the above discussion we use a simpler definition and just minimized the $$h$$-th ordered residual. Here I’ll try to show that it is possible to handle the traditional definition of the median. This requires a little bit of effort however. The trick to minimize the median in general can look like [8]: If $$n$$ is odd:  \begin{align}\min\>&z\\&z\ge x_i-\delta_i M\\&\sum_i \delta_i = \frac{n-1}{2}\\&\delta_i \in \{0,1\}\end{align} If $$n$$ is even:  \begin{align}\min\>&\frac{z_1+z_2}{2}\\&z_1\ge x_i-\delta_i M\\&\sum_i \delta_i = \frac{n}{2}\\&z_2\ge x_i-\eta_i M\\&\sum_i \eta_i = \frac{n}{2}-1\\&\delta_i,\eta_i \in \{0,1\}\end{align} The “$$n$$ is even” case is the new and interesting part. Note that I expect we can improve this “$$n$$ is even” version a little bit by adding the condition $$\delta_i \ge \eta_i$$. This technique allows us to calculate the minimum median of a variable $$x_i$$. However, if we want to apply this to our Least Median of Squares model we are in big trouble. In our formulations, we used a trick that minimizing $$r^2_{(h)}$$ is the same as minimizing $$|r|_{(h)}$$. This is no longer the case with this “true” median objective for $$n$$ even. We can however use a quadratic formulation: If $$n$$ is even:  \begin{align}\min\>&\frac{z_1+z_2}{2}\\&z_1\ge r^2_i-\delta_i M\\&\sum_i \delta_i = \frac{n}{2}\\&z_2\ge r^2_i-\eta_i M\\&\sum_i \eta_i = \frac{n}{2}-1\\&r_i = y_i-\sum_j X_{i,j} \beta_j\\&\delta_i,\eta_i \in \{0,1\}\end{align} Although we can solve this with standard solvers, it turns out to be quite difficult. The linear models discussed before were much easier to solve. This regression problem turns out to be interesting from a modeling perspective. ##### References 1. Peter Rousseeuw, Least Median of Squares Regression, Journal of the American Statistical Association, 79 (1984), pp. 871-880. 2. A. Giloni, M. Padberg, Least Trimmed Squares Regression, Least Median Squares Regression, and Mathematical Programming, Mathematical and Computer Modelling 35 (2002), pp. 1043-1060 3. Linear programming and LAD regression, http://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-lad-regression.html 4. Linear programming and Chebyshev regression, http://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-chebyshev-regression.html 5. Dimitris Bertsimas and Rahul Mazumder, Least Quantile Regression via Modern Optimization, The Annals of Statistics, 2014, Vol. 42, No. 6, 2494-2525, https://arxiv.org/pdf/1310.8625.pdf. 6. Peter Rousseeuw, Annick Leroy, Robust Regression and Outlier Detection, Wiley, 2003. 7. Peter Rousseeuw, Tutorial to Robust Statistics, Journal of Chemometrics, Vol. 5, pp. 1-20, 1991. 8. Paul Rubin, Minimizing a Median, https://orinanobworld.blogspot.com/2017/09/minimizing-median.html ## Friday, November 10, 2017 ### Linear Programming and Chebyshev Regression The LAD (Least Absolute Deviation) or $$\ell_1$$ regression problem (minimize the sum of the absolute values of the residuals) is often discussed in Linear Programming textbooks: it has a few interesting LP formulations [1]. Chebyshev or $$\ell_{\infty}$$ regression is a little bit less well-known. Here we minimize the maximum (absolute) residual.  \begin{align}\min_{\beta}\>&\max_i \> |r_i|\\ &y-X\beta = r\end{align} As in [1], $$\beta$$ are coefficients to estimate and $$X,y$$ are data. $$r$$ are the residuals. Some obvious and less obvious formulations are: • Variable splitting:\begin{align}\min\>&z\\& z \ge r^+_i + r^-_i\\&r^+_i - r^-_i =y_i –\sum_j X_{i,j}\beta_j\\&r^+_i, r^-_i\ge 0\end{align} With variable splitting we use two non-negative variables $$r^+_i - r^-_i$$ to describe a value $$r_i$$ that can be positive or negative. We need to make sure that one of them is zero in order for $$r^+_i + r^-_i$$ to be equal to the absolute value $$|r_i|$$. Here we have an interesting case, as we are only sure of this for the particular index $$i$$ that has the largest absolute deviation (i.e. where $$|r_i|=z$$). In cases where $$|r_i|<z$$ we actually do not enforce $$r^+_i \cdot r^-_i = 0$$. Indeed, when looking at the solution I see lots of cases where $$r^+_i > 0, r^-_i > 0$$. Effectively those $$r^+_i, r^-_i$$ have no clear physical interpretation. This is very different from the LAD regression formulation [1] where we require all $$r^+_i \cdot r^-_i = 0$$. • Bounding:\begin{align}\min\>&z\\ & –z \le y_i –\sum_j X_{i,j}\beta_j \le z\end{align}Here $$z$$ can be left free or you can make it a non-negative variable (it will be non-negative automatically). Note that there are actually two constraints here: $$–z \le y_i –\sum_j X_{i,j}\beta_j$$ and $$y_i –\sum_j X_{i,j}\beta_j \le z$$. This model contains the data twice, leading to a large number of nonzero elements in the LP matrix. • Sparse bounding: This model tries to remedy the disadvantage of the standard bounding model by introducing extra free variables $$d$$ and extra equations:\begin{align}\min\>&z\\ & –z \le d_i \le z\\& d_i = y_i –\sum_j X_{i,j}\beta_j \end{align} Note again that $$–z \le d_i \le z$$ is actually two constraints: $$–z \le d_i$$ and $$d_i \le z$$. • Dual:\begin{align}\max\>&\sum_i y_i(d_i+e_i)\\&\sum_i X_{i,j}(d_i+e_i) = 0 \perp \beta_j\\&\sum_i (d_i-e_i)=1\\&d_i\ge 0, e_i\le 0\end{align}The estimates $$\beta$$ can be found to be the duals of equation $$X^T(d+e)=0$$. We use the same synthetic data as in [1] with $$m=5,000$$ cases, and $$n=100$$ coefficients. Some timings with Cplex (default LP method) yield the following results (times are in seconds): Opposed to the LAD regression example in [1], the bounding formulation is very fast here. The dual formulation remains doing very good. ##### Historical Note The use of this minimax principle goes back to Euler (1749) [3,4]. Leonhard Euler (1707-1783) ##### References 1. Linear Programming and LAD Regression, http://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-lad-regression.html 2. A.Giloni, M.Padberg, Alternative Methods of Linear Regression, Mathematical and Computer Modeling 35 (2002), pp.361-374. 3. H.L. Harter, The method of least squares and some alternatives, Technical Report, Aerospace Research Laboratories, 1972. 4. L. Euler, Pièce qui a Remporté le Prix de l'Academie Royale des Sciences en 1748, sur les Inegalités de Mouvement de Saturn et de Jupiter. Paris (1749). ## Thursday, November 9, 2017 ### Linear Programming and LAD Regression I believe any book on linear programming will mention LAD (Least Absolute Deviation) or $$\ell_1$$ regression: minimize the sum of the absolute values of the residuals.  \begin{align}\min_{\beta}\>&\sum_i |r_i|\\ &y-X\beta = r\end{align} Here $$\beta$$ are coefficients to estimate. So they are the decision variables in the optimization model. $$X,y$$ are data (watch out here: in many optimization models we denote decision variables by $$x$$; here they are constants). $$r$$ are the residuals; it is an auxiliary variable that can be substituted out. The standard LP models suggested for this problem are not very complicated, but still interesting enough to have them cataloged. There are at least three common LP formulations for this: variable splitting, bounding and a dual formulation. Here is a summary: • Variable splitting:\begin{align}\min\>&\sum_i r^+_i + r^-_i\\&r^+_i - r^-_i =y_i –\sum_j X_{i,j}\beta_j\\&r^+_i, r^-_i\ge 0\end{align}In this model, automatically one of the pair $$(r^+_i,r^-_i)$$ will be zero. We don’t have to add an explicit complementarity condition $$r^+_i \cdot r^-_i = 0$$. This is fortunate: we can keep the model linear. • Bounding:\begin{align}\min\>&\sum_i r_i\\&-r_i \le y_i –\sum_j X_{i,j}\beta_j \le r_i\end{align}Here $$r_i$$ can be left free or you can make it a non-negative variable. It will be non-negative automatically. Note that there are actually two constraints here: $$-r_i \le y_i –\sum_j X_{i,j}\beta_j$$ and $$y_i –\sum_j X_{i,j}\beta_j\le r_i$$. This formulation is mentioned in [1]. • Sparse bounding: In the standard bounding formulation we have all the data $$X_{i,j}$$ twice in the model, leading to a large number of non-zero elements in the LP matrix. We can alleviate this by introducing an extra free variable $$d$$: \begin{align}\min\>&\sum_i r_i\\&-r_i \le d_i \le r_i\\&d_i = y_i –\sum_j X_{i,j}\beta_j\end{align} Effectively we reduced the number of non-zeros by a factor of two compared to the standard bounding formulation. Note that the first constraint $$-r_i\le d_i \le r_i$$ is actually two constraints: $$-r_i\le d_i$$ and $$d_i\le r_i$$. The sparse version will have fewer non-zero elements, but many more constraints and variables. Advanced LP solvers are based on sparse matrix technology and the effect of more nonzeros is often underestimated by novice users of LP software. The same arguments and reformulations can be used in $$\ell_1$$ portfolio models [3]. • Dual:\begin{align}\max\>&\sum_i y_i d_i\\&\sum_i X_{i,j} d_i=0 \perp \beta_j \\&-1\le d_i \le 1\end{align} The optimal values for $$\beta$$ can be recovered from the duals for the constraint $$X^Td=0$$ (this is what the notation $$\perp$$ means). One could make the argument the last formulation is the best: it has fewer variables than variable splitting, and fewer equations than the bounding approach. In addition, as mentioned before, the standard bounding formulation has the data twice in the model, resulting in a model with many nonzero elements. Modern LP solvers are not very well suited for these type of models. They like very sparse LP models, while these models are very dense. Let’s try anyway with a large, artificial data set with $$m=5,000$$ cases, and $$n=100$$ coefficients. The data matrix $$X$$ has 500,000 elements. Some timings with Cplex (default LP method) yield the following results: Times are in seconds. The dual formulation seems indeed quite fast. It is interesting that the bounding model (this formulation is used a lot) is actually the slowest. Note that these results were obtained using default settings. This effectively means that Cplex selected the dual simplex solver for all these instances. These timings will change when the primal simplex method or the barrier algorithm is used. ##### L1fit The R package L1pack [5] contains a dense, simple (compared to say Cplex), specialized version of the Simplex method based on an algorithm from [6,7]. It is actually quite fast on the above data: The number of Simplex iterations is 561. This confirms that sparse, general purpose LP solvers have a big disadvantage on this problem (usually Cplex would beat any LP algorithm you can come up yourself, by a large margin). See the comment from Robert Fourer for some notes on the Barrodale-Roberts algorithm. So I think I need to refine my previous statement a bit: there are two reasons why l1fit is doing so well: (1) dense data using a dense code and (2) a specialized Simplex method handling the absolute values. The number of iterations is close to our dual formulation (which is also very compact), so the time per iteration is about the same as Cplex. ##### Historical note The concept of minimizing the sum of absolute deviations goes back to [4]. The term quaesita refers to “unknown quantities”. F.Y. Edgeworth, 1845-1926. I noticed that Edgeworth was from Edgeworthstown, Ireland (population: 2,335 in 2016). ##### References 1. Least absolute deviations, https://en.wikipedia.org/wiki/Least_absolute_deviations 2. A.Giloni, M.Padberg, Alternative Methods of Linear Regression, Mathematical and Computer Modeling 35 (2002), pp.361-374. 3. L1 portfolio formulation, http://yetanothermathprogrammingconsultant.blogspot.com/2010/04/l1-portfolio-formulation.html 4. Francis Ysidro Edgeworth, On observations relating to several quantities, Hermathena 6 (1887), pp. 279-285 5. Osorio, F., and Wołodźko, T. (2017). L1pack: Routines for L1 estimation, https://cran.r-project.org/web/packages/L1pack/index.html 6. Barrodale, I., and Roberts, F.D.K. (1973). An improved algorithm for discrete L1 linear approximations. SIAM Journal of Numerical Analysis 10, 839-848 7. Barrodale, I., and Roberts, F.D.K. (1974). Solution of an overdetermined system of equations in the L1 norm. Communications of the ACM 17, 319-320. ## Tuesday, November 7, 2017 ### Suguru: GAMS/MIP vs Python/Z3  For many logical puzzles a constraint solver like Z3 may be more suitable than a Mathematical Programming approach using a MIP solver. In [1] a MIP formulation was presented for Suguru puzzle. Now let’s see how much difference we observe when we use Z3. There are two major implications for the model itself: 1. We can use integers directly, i.e. $$x_{i,j} \in \{1,\dots,u_{i,j}\}$$ instead of $$x_{i,j,k} \in \{0,1\}$$. 2. We have access to constructs like Distinct and != (unequal). However most of the code is devoted to data manipulation as we will see below. Although the approaches are different (different languages, different solver technologies), the actual code is remarkably similar. ##### Data entry: GAMS First we need to get the data into GAMS. Unfortunately we need to enter two matrices: one for the blocks and one for the initial values.  set i 'rows' /i1*i7/ j 'columns' /j1*j9/ k 'values' /1*9/ b 'blocks' /b1*b15/;table blockno(i,j) j1 j2 j3 j4 j5 j6 j7 j8 j9 i1 1 1 2 2 2 3 3 4 6 i2 1 1 2 2 3 3 4 4 6 i3 1 7 8 8 3 4 4 5 6 i4 7 7 7 8 8 9 9 6 6 i5 7 10 10 8 12 9 9 14 15 i6 11 11 10 10 12 12 9 15 15 i7 11 11 10 12 12 13 13 15 15;table initval(i,j) j1 j2 j3 j4 j5 j6 j7 j8 j9 i1 1 1 i2 2 3 i3 4 4 1 4 i4 1 i5 4 5 4 5 1 i6 3 i7 2 5; The set k is a bit of an over-estimate: we actually need only values 1 through 5 for this problem (the maximum size of a block is 5). ##### Data entry: Python Entering the same boards in Python can look like:  from z3 import * blockno = [ [ 1, 1, 2, 2, 2, 3, 3, 4, 6], [ 1, 1, 2, 2, 3, 3, 4, 4, 6], [ 1, 7, 8, 8, 3, 4, 4, 5, 6], [ 7, 7, 7, 8, 8, 9, 9, 6, 6], [ 7,10,10, 8,12, 9, 9,14,15], [11,11,10,10,12,12, 9,15,15], [11,11,10,12,12,13,13,15,15] ] initval = [ [ 1, 0, 0, 1, 0, 0, 0, 0, 0], [ 0, 2, 0, 0, 0, 0, 0, 3, 0], [ 4, 0, 4, 0, 0, 0, 0, 1, 4], [ 0, 0, 0, 1, 0, 0, 0, 0, 0], [ 4, 0, 5, 0, 4, 0, 5, 1, 0], [ 0, 0, 0, 0, 0, 0, 0, 3, 0], [ 0, 0, 0, 0, 2, 0, 0, 0, 5] ] m = len(blockno) # number of rows n = len(blockno[0]) # number of columns R = range(m) C = range(n) bmax = max([blockno[i][j] for i in R for j in C]) B = range(bmax) The order is a bit reversed here: first enter the boards with the block numbers and the initial values and then calculate the dimensions. ##### Derived data: GAMS We need to calculate a number of things before we can actually write the equations.  alias (i,i2),(j,j2); sets bmap(b,i,j) 'maps between b <-> (i,j)' bk(b,k) 'allowed (b,k) combinations' ijk(i,j,k) 'allowed values' n(i,j,i2,j2) 'neighbors' ; parameters len(b) 'number of cells in block' ; bmap(b,i,j) = blockno(i,j)=ord(b); len(b) = sum(bmap(b,i,j),1); bk(b,k) = ord(k)<=len(b); ijk(i,j,k) = sum((bmap(b,i,j),bk(b,k)),1); n(i,j,i,j+1) = blockno(i,j)<>blockno(i,j+1); n(i,j,i+1,j-1) = blockno(i,j)<>blockno(i+1,j-1); n(i,j,i+1,j) = blockno(i,j)<>blockno(i+1,j); n(i,j,i+1,j+1) = blockno(i,j)<>blockno(i+1,j+1); Note that the set n (indicating neighbors we need to inspect for different values) only looks “forward” from cell $$(i,j)$$. This is to prevent double-counting: we don’t want to compare neighboring cells twice in the equations. ##### Derived data: Python The Python/Z3 model needs to calculate similar derived data:  # derived data maxb = [0]*bmax # number of cells in each block bij = [[] for b in B] # list of cells in each block nij = [ [ [] for j in C ] for i in R ] # neighbors for i in R: for j in C: bn = blockno[i][j]-1 maxb[bn] += 1 bij[bn] += [(i,j)] if j+1and blockno[i][j]!=blockno[i][j+1]: nij[i][j] += [(i,j+1)] if i+1and j-1>=0 and blockno[i][j]!=blockno[i+1][j-1]: nij[i][j] += [(i+1,j-1)] if i+1and blockno[i][j]!=blockno[i+1][j]: nij[i][j] += [(i+1,j)] if i+1and j+1and blockno[i][j]!=blockno[i+1][j+1]: nij[i][j] += [(i+1,j+1)] # upperbound on X[i,j] maxij = [ [ maxb[blockno[i][j]-1] for j in C ] for i in R ] I use here nested lists. bij has a list of cells for each block. nij contains a list of neighboring cells for each cell. ##### Model constraints: GAMS The decision variables and linear constraints are described in [1]. The GAMS representation is very close the mathematical formulation.  variable dummy 'objective variable'; binary variable x(i,j,k); * fix initial values x.fx(i,j,k)(initval(i,j)=ord(k)) = 1; equation    oneval(i,j)   'pick one value per cell'    unique(b,k)   'unique in a block'    neighbors(i,j,i2,j2,k) 'neighboring cells must be different'    edummy        'dummy objective' ; oneval(i,j).. sum(ijk(i,j,k), x(i,j,k)) =e= 1; unique(bk(b,k)).. sum(bmap(b,i,j), x(i,j,k)) =e= 1; neighbors(i,j,i2,j2,k)$(n(i,j,i2,j2) and ijk(i,j,k) and ijk(i2,j2,k)).. x(i,j,k)+x(i2,j2,k) =l= 1; edummy.. dummy =e= 0; Notice we added a dummy objective to the model. ##### Model constraints: Z3  # Model X = [ [ Int("x%d.%d" % (i+1, j+1)) for j in C ] for i in R ] Xlo = And([X[i][j] >= 1 for i in R for j in C]) Xup = And([X[i][j] <= maxij[i][j] for i in R for j in C]) Uniq = And([ Distinct([X[i][j] for (i,j) in bij[b]]) for b in B]) Nbor = And([ And([X[i][j] != X[i2][j2] for (i2,j2) in nij[i][j] ]) for i in R for j in C if len(nij[i][j])>0 ]) Xfix = And([X[i][j] == initval[i][j] for i in R for j in C if initval[i][j]>0]) Cons = [Xlo,Xup,Uniq,Nbor,Xfix] Most of this is boilerplate. We use Distinct to make the contents of the cells unique within a block. The != operator is used to make sure neighbors of different blocks do not have the same value. Both these constructs are not so easy to implement in a MIP model, but there we were rescued by using binary variables instead. ##### Solving and printing results: GAMS We only look for a feasible solution (any feasible integer solution will be globally optimal immediately). This means we don’t have to worry about setting the OPTCR option in GAMS. The default gap of 10% is irrelevant here as the solver will stop after finding the first integer solution.  model m /all/; solve m minimizing dummy using mip; parameter v(i,j) 'value of each cell'; v(i,j) = sum(ijk(i,j,k), x.l(i,j,k)*ord(k)); option v:0; display v; To report a solution we recover the integer cell values. The results look like:  ---- 81 PARAMETER v value of each cell j1 j2 j3 j4 j5 j6 j7 j8 j9 i1 1 5 4 1 5 1 5 2 1 i2 3 2 3 2 3 2 4 3 5 i3 4 1 4 5 4 1 5 1 4 i4 5 2 3 1 3 2 3 2 3 i5 4 1 5 2 4 1 5 1 4 i6 2 3 4 3 5 3 4 3 2 i7 4 1 2 1 2 1 2 1 5 ##### Solving and printing results: Python/Z3  # solve and print s = Solver() s.add(Cons) if s.check() == sat: m = s.model() sol = [ [ m.evaluate(X[i][j]) for j in C] for i in R] print(sol) The solution looks like:  [[1, 5, 4, 1, 5, 1, 5, 2, 1], [3, 2, 3, 2, 3, 2, 4, 3, 5], [4, 1, 4, 5, 4, 1, 5, 1, 4], [5, 2, 3, 1, 3, 2, 3, 2, 3], [4, 1, 5, 2, 4, 1, 5, 1, 4], [2, 3, 4, 3, 5, 3, 4, 3, 2], [4, 1, 2, 1, 2, 1, 2, 1, 5]] ##### Check for uniqueness: GAMS To check if the solution is unique we add a cut and resolve. If this is infeasible we know there was only one solution.  * * test if solution is unique * equation cut 'forbid current solution'; cut.. sum(ijk(i,j,k), x.l(i,j,k)*x(i,j,k)) =l= card(i)*card(j)-1; model m2 /all/; solve m2 minimizing dummy using mip; abort$(m2.modelstat=1 or m2.modelstat=8) "Unexpected solution found",x.l;

We abort if the model status is not "optimal" (1) or "integer solution found" (8). (It should be enough to check on optimal only, we err on the safe side here).

##### Check for uniqueness: Z3
 # Check for uniqueness Cut = Or([X[i][j] != sol[i][j] for i in R for j in C]) s.add(Cut) if s.check() == sat:     print("There is an alternative solution.") else:     print("Solution is unique.")
##### References
1. Suguru, http://yetanothermathprogrammingconsultant.blogspot.com/2017/11/suguru.html