## Wednesday, June 21, 2017

### Minimizing the k-th largest x

Minimizing the largest $$x_i$$ is an easy exercise in LP modeling:
 \bbox[lightcyan,10px,border:3px solid darkblue] { \begin{align} \min\>& z\\ & z \ge x_i \end{align}}
This is sometimes called MINIMAX.

What about minimizing the $$k^{\text{th}}$$ largest value? Here is one possible MIP formulation:

 \bbox[lightcyan,10px,border:3px solid darkblue] { \begin{align} \min\>& z\\ & z \ge x_i - \delta_i M\\ & \sum_i \delta_i = k-1\\ & \delta_i \in \{0,1\} \end{align}}

I.e., we have $$k-1$$ exceptions to the constraint $$z \ge x_i$$. The objective will make sure the exceptions are the $$k-1$$ largest $$x_i$$. We should think long and hard about making the big-M’s as small as possible. If we have no clue about proper bounds on $$x_i$$, we could use indicator constraints instead.
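For fixed data $$x_i$$ we can sanity-check this formulation by brute force: enumerate every choice of the $$k-1$$ exceptions and take the best resulting $$z$$; this should reproduce the $$k$$-th largest value. A small sketch in Python (the data is made up):

```python
from itertools import combinations

def kth_largest_via_mip(x, k):
    """Mimic the MIP: pick k-1 exceptions (delta_i = 1); z becomes the
    max of the remaining x_i; minimize over all such choices."""
    n = len(x)
    best = float("inf")
    for S in combinations(range(n), k - 1):
        z = max(x[i] for i in range(n) if i not in S)
        best = min(best, z)
    return best

x = [7, 3, 9, 1, 5]
print(kth_largest_via_mip(x, 1))  # 9 (the largest)
print(kth_largest_via_mip(x, 2))  # 7 (the 2nd largest)
print(kth_largest_via_mip(x, 3))  # 5 (the 3rd largest)
```

Of course in a real model the $$x_i$$ are variables, so the solver does this search implicitly via the binary $$\delta_i$$.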

Interestingly, minimizing the sum of the $$k$$ largest values can be modeled as a pure LP (1):

 \bbox[lightcyan,10px,border:3px solid darkblue] { \begin{align} \min\>& \sum_i v_i + k\cdot q\\ & v_i \ge x_i - q\\ &v_i \ge 0 \\ &q \text{ free } \end{align}}
You would almost think there is an LP formulation for minimizing the $$k^{\text{th}}$$ largest value. I don't see it, however.
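For fixed data the LP above indeed evaluates to the sum of the $$k$$ largest values. A quick check with scipy (the solve is only for illustration; in a real model the $$x_i$$ would be variables):

```python
import numpy as np
from scipy.optimize import linprog

x = np.array([5.0, 1.0, 4.0, 2.0, 3.0])
k = 2
n = len(x)

# decision variables: v_1..v_n and q; objective sum(v) + k*q
c = np.r_[np.ones(n), k]
# v_i >= x_i - q  rewritten as  -v_i - q <= -x_i
A_ub = np.hstack([-np.eye(n), -np.ones((n, 1))])
b_ub = -x
bounds = [(0, None)] * n + [(None, None)]  # v_i >= 0, q free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.fun)  # 9.0 = 5 + 4, the two largest values
```

At the optimum $$q$$ takes the value of the $$k$$-th largest $$x_i$$ and $$v_i=\max(0,x_i-q)$$.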
##### References
1. Comment by Michael Grant in http://orinanobworld.blogspot.se/2015/08/optimizingpartoftheobjectivefunction-ii.html.

## Wednesday, June 14, 2017

### Modeling production runs with length exactly three

In (1) the question was posed how to model production runs that have an exact length. Although the question was asked in terms of integer variables, this is much easier to deal with when we have binary variables. Let’s introduce a binary variable:

 $x_{t} = \begin{cases}1 & \text{if the unit is on in period } t\\0 &\text{otherwise}\end{cases}$

The question was formulated as an implication:

 $x_t=1 \implies x_{t+1}=x_{t+2}=1$

This implication is not 100% correct: it would turn $$x$$ on forever. But we understand what the poster meant: if a unit is turned on, it should stay on for exactly three periods.

This condition can be more correctly stated as the implication:

 $x_{t-1}=0 \text{ and } x_{t}=1 \implies x_{t+1}=x_{t+2}=1 \text{ and } x_{t+3}=0$

We can make linear inequalities out of this as follows:

 \bbox[lightcyan,10px,border:3px solid darkblue]{\begin{align}&x_{t+1}\ge x_{t}-x_{t-1}\\&x_{t+2}\ge x_{t}-x_{t-1}\\&1-x_{t+3}\ge x_{t}-x_{t-1}\end{align}}

If we want to model the condition “a unit should stay on for at least three periods”, we can drop the last constraint and just keep:

 \begin{align}&x_{t+1}\ge x_{t}-x_{t-1}\\&x_{t+2}\ge x_{t}-x_{t-1}\end{align}
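Both variants are easy to sanity-check by brute force. The helper below verifies the inequalities for a given 0/1 sequence; it assumes the unit is off outside the planning horizon (that boundary handling is my assumption, not part of the original question):

```python
def valid(seq, exact=True):
    # pad: assume the unit is off before and after the horizon
    x = [0] + list(seq) + [0, 0, 0]
    for t in range(1, len(x) - 3):
        d = x[t] - x[t - 1]  # equals 1 exactly when a run starts at t
        if x[t + 1] < d or x[t + 2] < d:
            return False
        if exact and 1 - x[t + 3] < d:
            return False
    return True

print(valid([0, 1, 1, 1, 0, 0]))               # True:  run of exactly 3
print(valid([0, 1, 1, 0, 0, 0]))               # False: run of 2
print(valid([0, 1, 1, 1, 1, 0]))               # False: run of 4 violates "exactly 3"
print(valid([0, 1, 1, 1, 1, 0], exact=False))  # True:  fine for "at least 3"
```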

##### References
1. https://stackoverflow.com/questions/44496473/block-of-consecutive-variables-to-have-same-value-in-mixed-integer-linear-progra

## Friday, June 9, 2017

### A Staffing Problem

In (1) a "simple problem" is stated (problems are rarely as simple as they seem):

For the next 18 weeks there is some data about demand for staffing resources:

Actually we don’t need the data for the individual projects: just the totals. It is noted that the maximum number of “bodies” we need is 52 (in week 3).

We start with 48 temps available in the first period. We can let a temp staffer go, and we can hire new ones. However, a newly hired staffer needs 10 days (2 weeks) of training, during which this person is not productive. We can also keep a staffer idling for a few periods.

To make a difference between idling and training, I will assume that hiring + training is slightly more expensive than keeping someone idle. I think it makes sense to assign some cost to the hiring and training process. This assumption was not in the original post, but I have convinced myself it is actually reasonable.

To model this I introduce two sets of binary variables:

 \begin{align} & r_{i,t} = \begin{cases} 1 & \text{if a staffer i is available for training or work during period t}\\0 & \text{otherwise} \end{cases} \\& h_{i,t} = \begin{cases} 1 & \text{if a staffer i is hired at the beginning of period t}\\ 0 & \text{otherwise} \end{cases}\end{align}

We can link $$h_{i,t}$$ and $$r_{i,t}$$ as follows:

 $h_{i,t}\ge r_{i,t}-r_{i,t-1}$

This implements the implication:

 $r_{i,t-1}=0 \text{ and } r_{i,t}=1 \implies h_{i,t}=1$

that is: if we change $$r$$ from 0 to 1, we have a new hire. We will add a (small) cost per hire to the objective, so we don’t need a constraint to enforce the reverse implication:

 $r_{i,t-1}=1 \text{ or } r_{i,t}=0 \implies h_{i,t}=0$

There is one wrinkle here: we need to make sure that the 48 staffers already hired before can work in the first period without being considered to be hired. We can explicitly model this:

 $h_{i,t}\ge \begin{cases} r_{i,t}-r_{i,t-1} &\text{if } t>1\\r_{i,t}&\text{if } t=1 \text{ and } i>48\end{cases}$

Actually in the GAMS model below I approached this slightly differently, but with the same net result.

The main equation in the model is to make sure we have enough staffing in each period $$t$$. This can be modeled as:

 $\sum_i \left( r_{i,t} - h_{i,t} - h_{i,t-1}\right) \ge \mathit{demand}_t$
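The term $$r_{i,t} - h_{i,t} - h_{i,t-1}$$ counts a staffer as productive only when available and not in one of the two training periods following a hire. A tiny illustration in Python for a single staffer (the data is made up; periods before the horizon are treated as zero):

```python
def productive(r, h):
    """Per-period productivity of one staffer: available (r=1) minus
    the two training periods right after a hire (h=1 in t or t-1)."""
    return [r[t] - h[t] - (h[t - 1] if t > 0 else 0) for t in range(len(r))]

r = [0, 0, 1, 1, 1, 1]  # available from period 2 on
h = [0, 0, 1, 0, 0, 0]  # hired at the start of period 2
print(productive(r, h))  # [0, 0, 0, 0, 1, 1]: productive from period 4
```

Summing this expression over all staffers $$i$$ gives the workforce available to meet demand in period $$t$$.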

The complete model can look like:

Notes:

• $$\mathit{rinit}_{i,t}$$ is used to indicate initial staffing when we start period 1. It is sparse: it assumes the value 1 only if $$t=1$$ and $$i \le 48$$.
• This allows us to write equation EHire as one single, clean equation.
• Note that GAMS will assume a variable is zero outside its domain:  $$r_{i,t-1}$$ is zero when $$t=1$$.
• I used the maximum demand (52 in week 3) to dimension the problem: set $$i$$ has 52 members.
• The model does not decide which workers are idle. We can see in each period how many workers we have and how many are in training. The remaining ones are working or sitting idle.
• We reuse resource numbers $$i$$. In the results below worker $$i=21$$ is likely to be two different persons.

The results look like:

Here a green cell with code=2 means hired and working (or idle, but not in training). A red cell with code=1 means in training.  The row “total” is the number of staffers (either working, idling or training). The row “Work” indicates the number of workers available (not in training). We are overshooting demand a bit: some workers are idling.

A picture is always a good idea, so here we see demand vs. staffing available for work.

We see two places where we increase the workforce: we hire in week 1 to handle demand in week 3, and we hire again in week 7 to deal with demand in week 9.

## Wednesday, June 7, 2017

### Minimum down- and up-time

In machine scheduling models we sometimes want to impose minimum up-time and minimum down-time restrictions. E.g., from (1):

My question is how do I add in additional constraints that if the factory switches off then its needs to stay off for 3 months, and if it switches back on then it needs to stay on for 4 months?

One possible solution is the following. Let us define our binary decision variable by

 $x_{i,t} = \begin{cases}1 & \text{if factory i is operating in month t} \\ 0&\text{otherwise} \end{cases}$
##### Method 1

We really want to forbid short run patterns such as 010 (i.e. off-on-off), 0110, 01110 and 101, 1001. Forbidding patterns 010, 0110, 01110 will ensure a factory is up at least 4 consecutive periods. By not allowing 101, 1001 we make sure that a down time period is at least three months. We can model these restrictions in a linear fashion as follows (2):

• Forbid 010: $$-x_{i,t}+x_{i,t+1}-x_{i,t+2}\le 0$$
• Forbid 0110: $$-x_{i,t}+x_{i,t+1}+x_{i,t+2}-x_{i,t+3}\le 1$$
• Forbid 01110: $$-x_{i,t}+x_{i,t+1}+x_{i,t+2}+x_{i,t+3}-x_{i,t+4}\le 2$$
• Forbid 101: $$x_{i,t}-x_{i,t+1}+x_{i,t+2}\le 1$$
• Forbid 1001: $$x_{i,t}-x_{i,t+1}-x_{i,t+2}+x_{i,t+3}\le 1$$
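We can check these five cuts on a few sequences by brute force (dropping the factory index; a cut is only generated where its window fits inside the horizon, so runs touching the boundary are not restricted):

```python
def cuts_ok(x):
    """True if the 0/1 sequence x violates none of the five pattern cuts."""
    n = len(x)
    for t in range(n):
        if t + 2 < n and -x[t] + x[t+1] - x[t+2] > 0:                    # 010
            return False
        if t + 3 < n and -x[t] + x[t+1] + x[t+2] - x[t+3] > 1:           # 0110
            return False
        if t + 4 < n and -x[t] + x[t+1] + x[t+2] + x[t+3] - x[t+4] > 2:  # 01110
            return False
        if t + 2 < n and x[t] - x[t+1] + x[t+2] > 1:                     # 101
            return False
        if t + 3 < n and x[t] - x[t+1] - x[t+2] + x[t+3] > 1:            # 1001
            return False
    return True

print(cuts_ok([0, 1, 1, 1, 1, 0]))  # True:  up-run of 4 is allowed
print(cuts_ok([0, 1, 1, 1, 0]))     # False: up-run of 3 matches 01110
print(cuts_ok([1, 0, 0, 0, 1]))     # True:  down-run of 3 is allowed
print(cuts_ok([1, 0, 0, 1]))        # False: down-run of 2 matches 1001
```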
##### Method 2

A different approach is as follows. First define binary variables:

 \begin{align} &\delta^{\mathit{on}}_{i,t} = \begin{cases}1 & \text{if factory i is turned on in month t} \\ 0&\text{otherwise} \end{cases}\\ &\delta^{\mathit{off}}_{i,t} = \begin{cases}1 & \text{if factory i is turned off in month t} \\ 0&\text{otherwise} \end{cases} \end{align}

Mathematically we write this as:

 \begin{align} &\delta^{\mathit{on}}_{i,t} = (1-x_{i,t-1})\cdot x_{i,t}\\ &\delta^{\mathit{off}}_{i,t} = x_{i,t-1}\cdot (1-x_{i,t})\\ \end{align}

We can linearize these non-linear equations by:

 \begin{align} &\delta^{\mathit{on}}_{i,t} \le 1-x_{i,t-1}\\ &\delta^{\mathit{on}}_{i,t} \le x_{i,t}\\ &\delta^{\mathit{on}}_{i,t} \ge x_{i,t}-x_{i,t-1}\\ &\delta^{\mathit{off}}_{i,t} \le x_{i,t-1}\\ &\delta^{\mathit{off}}_{i,t} \le 1-x_{i,t}\\ &\delta^{\mathit{off}}_{i,t} \ge x_{i,t-1}-x_{i,t}\\ \end{align}
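Since everything is binary, we can verify the linearization by simply enumerating all cases: for each value of $$(x_{i,t-1}, x_{i,t})$$ the three inequalities leave exactly one feasible value for each $$\delta$$, and it equals the product in the nonlinear definition. In Python:

```python
def feasible_deltas(xprev, xcur, kind):
    """Feasible binary values of delta under the linearized constraints."""
    if kind == "on":
        return [d for d in (0, 1)
                if d <= 1 - xprev and d <= xcur and d >= xcur - xprev]
    return [d for d in (0, 1)
            if d <= xprev and d <= 1 - xcur and d >= xprev - xcur]

for xprev in (0, 1):
    for xcur in (0, 1):
        # the inequalities pin each delta down to the product definition
        assert feasible_deltas(xprev, xcur, "on") == [(1 - xprev) * xcur]
        assert feasible_deltas(xprev, xcur, "off") == [xprev * (1 - xcur)]
print("linearization matches the product definitions for all binary inputs")
```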

With these variables we can implement the implications:

 \begin{align} &\delta^{\mathit{on}}_{i,t} = 1 \implies x_{i,t} + x_{i,t+1} + x_{i,t+2} + x_{i,t+3} = 4 \\ &\delta^{\mathit{off}}_{i,t} = 1 \implies x_{i,t} + x_{i,t+1} + x_{i,t+2} = 0 \end{align}

This can be linearized as:

 \begin{align} & x_{i,t+1} + x_{i,t+2} + x_{i,t+3} \ge 3 \delta^{\mathit{on}}_{i,t} \\ & x_{i,t+1} + x_{i,t+2} \le 2 (1-\delta^{\mathit{off}}_{i,t}) \end{align}

Note that I dropped $$x_{i,t}$$ in both inequalities: its value is already fixed by the definitions of $$\delta^{\mathit{on}}_{i,t}$$ and $$\delta^{\mathit{off}}_{i,t}$$ (e.g. $$\delta^{\mathit{on}}_{i,t}=1$$ implies $$x_{i,t}=1$$).

In the comments it is mentioned by Rob Pratt that we can strengthen this a bit (the math does not look very good in a comment, so I repeat it here):

 \begin{align} & x_{i,t+1} \ge \delta^{\mathit{on}}_{i,t}\\& x_{i,t+2} \ge \delta^{\mathit{on}}_{i,t}\\& x_{i,t+3} \ge \delta^{\mathit{on}}_{i,t}\\& x_{i,t+1} \le 1-\delta^{\mathit{off}}_{i,t} \\& x_{i,t+2} \le 1-\delta^{\mathit{off}}_{i,t} \end{align}
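The disaggregated version is tighter in the LP relaxation: a fractional point can satisfy the aggregated constraint while violating the per-period ones. A small numeric illustration (the numbers are made up):

```python
delta_on = 0.5          # a fractional value in the LP relaxation
x = [1.0, 0.5, 0.0]     # candidate values for x[t+1], x[t+2], x[t+3]

aggregated = sum(x) >= 3 * delta_on              # 1.5 >= 1.5: satisfied
disaggregated = all(xi >= delta_on for xi in x)  # 0.0 < 0.5: violated
print(aggregated, disaggregated)  # True False
```

So the disaggregated inequalities cut off fractional points that the single aggregated constraint allows, at the cost of more rows.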
##### References
1. How do I add a constraint to keep a factory switched on or off for a certain period of time in PuLP? https://stackoverflow.com/questions/44281389/how-do-i-add-a-constraint-to-keep-a-factory-switched-on-or-off-for-a-certain-per/44293592
2. Integer cuts, http://yetanothermathprogrammingconsultant.blogspot.com/2011/10/integer-cuts.html

## Friday, May 26, 2017

### Working on an R package….

The C++ code (parsing Excel formulas so we can for instance execute them) is not working 100% correctly as of now…

Time to make it work under a debugger (amazingly I did not need a debugger until now).

Update: found it (without a debugger). Dereferencing a null pointer. Is it unreasonable to expect a (descriptive) exception to be raised when this happens?

## Sunday, May 21, 2017

### Indexing economic time series in R

When we want to compare different (economic) data, an often used approach is indexing. We choose one year (often the beginning of the time series) as the base year. We then normalize each time series such that the value at the base year is 100. Or:

 $\hat{x}_t = 100 \frac{x_t}{x_0}$

When doing this in R, it is interesting to see how much the implementation depends on the chosen data structure.

Below I consider three different data structures: the time series are stored (1) row-wise in a matrix, (2) column-wise in a matrix, and (3) unordered in a long-format data frame.

##### Matrix: row wise

Below is some (artificial) data organized as a matrix with the rows being the series. We have three series: a,b and c. The columns represent the years. Row and column names are used to make this visible:

str(A)

##  num [1:3, 1:20] 67.235 4743.289 0.871 64.69 5006.955 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "a" "b" "c"
##   ..$ : chr [1:20] "2011" "2012" "2013" "2014" ...

A

##           2011         2012         2013         2014         2015
## a   67.2353402   64.6901151   63.6518902   69.1449240   71.8494470
## b 4743.2887884 5006.9547226 5623.0654046 5912.6498320 5736.7239960
## c    0.8710604    0.9193449    0.8711697    0.8451556    0.8440797
##           2016         2017         2018         2019         2020
## a   77.2782169   75.5709375   83.1405320   82.0417121   85.6433929
## b 6302.6091905 6241.5527226 6140.4721640 5542.9254530 5627.5502851
## c    0.8496936    0.8739366    0.8955205    0.8681451    0.8202612
##           2021         2022         2023         2024         2025
## a   89.1381161   89.8803998   90.5211813    91.411429   94.4488615
## b 6006.0691966 5998.6549344 6121.5326208  6378.963851 6628.1064720
## c    0.8639629    0.8477383    0.8235255     0.887063    0.8998234
##           2026         2027         2028         2029         2030
## a   94.2000739   89.9766167   90.4627291   87.2632337   86.2154671
## b 7115.7111566 6410.3639302 7038.3387976 6765.7867813 7076.7998102
## c    0.8547652    0.9033014    0.9224435    0.9127362    0.9541839

To index this we can use the following R code:

Aindx <- 100*A/A[,1]
Aindx

##   2011      2012      2013      2014      2015      2016     2017     2018
## a  100  96.21445  94.67029 102.84015 106.86262 114.93690 112.3976 123.6560
## b  100 105.55872 118.54782 124.65296 120.94402 132.87425 131.5870 129.4560
## c  100 105.54318 100.01254  97.02606  96.90254  97.54703 100.3302 102.8081
##        2019      2020      2021      2022      2023     2024     2025
## a 122.02171 127.37854 132.57628 133.68029 134.63334 135.9574 140.4750
## b 116.85827 118.64237 126.62247 126.46615 129.05671 134.4840 139.7365
## c  99.66531  94.16812  99.18518  97.32255  94.54286 101.8371 103.3021
##        2026     2027     2028     2029     2030
## a 140.10500 133.8234 134.5464 129.7877 128.2294
## b 150.01640 135.1460 148.3852 142.6391 149.1961
## c  98.12926 103.7013 105.8989 104.7845 109.5428

Note that the expression 100*A/A[,1] is not as trivial as it seems. We divide a $$3 \times 20$$ matrix by a vector of length 3. The division is done element-wise and column-by-column. We sometimes say the elements of A[,1] are recycled.

The recycling mechanism can be illustrated with a small example:

c(1,2,3,4)+c(1,2)

## [1] 2 4 4 6

I try to have a picture in each post, so here we go:

##### Matrix: column wise

If the matrix is organized column-wise (e.g. by taking the transpose), we have:

A

##             a        b         c
## 2011 67.23534 4743.289 0.8710604
## 2012 64.69012 5006.955 0.9193449
## 2013 63.65189 5623.065 0.8711697
## 2014 69.14492 5912.650 0.8451556
## 2015 71.84945 5736.724 0.8440797
## 2016 77.27822 6302.609 0.8496936
## 2017 75.57094 6241.553 0.8739366
## 2018 83.14053 6140.472 0.8955205
## 2019 82.04171 5542.925 0.8681451
## 2020 85.64339 5627.550 0.8202612
## 2021 89.13812 6006.069 0.8639629
## 2022 89.88040 5998.655 0.8477383
## 2023 90.52118 6121.533 0.8235255
## 2024 91.41143 6378.964 0.8870630
## 2025 94.44886 6628.106 0.8998234
## 2026 94.20007 7115.711 0.8547652
## 2027 89.97662 6410.364 0.9033014
## 2028 90.46273 7038.339 0.9224435
## 2029 87.26323 6765.787 0.9127362
## 2030 86.21547 7076.800 0.9541839

The expression to index the series now becomes much more complicated:

Aindx <- 100*A/rep(A[1,],each=nrow(A))
Aindx

##              a        b         c
## 2011 100.00000 100.0000 100.00000
## 2012  96.21445 105.5587 105.54318
## 2013  94.67029 118.5478 100.01254
## 2014 102.84015 124.6530  97.02606
## 2015 106.86262 120.9440  96.90254
## 2016 114.93690 132.8742  97.54703
## 2017 112.39764 131.5870 100.33019
## 2018 123.65600 129.4560 102.80808
## 2019 122.02171 116.8583  99.66531
## 2020 127.37854 118.6424  94.16812
## 2021 132.57628 126.6225  99.18518
## 2022 133.68029 126.4662  97.32255
## 2023 134.63334 129.0567  94.54286
## 2024 135.95741 134.4840 101.83714
## 2025 140.47503 139.7365 103.30206
## 2026 140.10500 150.0164  98.12926
## 2027 133.82340 135.1460 103.70135
## 2028 134.54640 148.3852 105.89891
## 2029 129.78775 142.6391 104.78448
## 2030 128.22939 149.1961 109.54279

In this case the automatic recycling does not work the way we want, and we have to do it by hand. Basically, in terms of our little example: before, we were happy with c(1,2) being extended automatically to c(1,2,1,2), while now we need something like c(1,1,2,2).

##### Data frame long format

Often data comes in a “long” format. Here is a picture to illustrate the difference between a “wide” and a “long” format:

Often wide format data comes from spreadsheets while long format is often used in databases. Sometimes the operation to convert from long to wide is called “pivot” (and the reverse “unpivot”).

A long format data frame with the above data can look like (I show the first part only):

##   series year        value
## 1      a 2011   67.2353402
## 2      b 2011 4743.2887884
## 3      c 2011    0.8710604
## 4      a 2012   64.6901151
## 5      b 2012 5006.9547226
## 6      c 2012    0.9193449

How can we index this?

Here is my solution:

# get first year
y0 <- min(df$year)
y0

## [1] 2011

# get values at first year
x0 <- df[df$year==y0,"value"]
x0

## [1]   67.2353402 4743.2887884    0.8710604

# allow x0 to be indexed by series name
names(x0) <- df[df$year==y0,"series"]
x0

##            a            b            c
##   67.2353402 4743.2887884    0.8710604

# indexing of the series
df$indexedvalue <- 100*df$value/x0[df$series]
head(df)

##   series year        value indexedvalue
## 1      a 2011   67.2353402    100.00000
## 2      b 2011 4743.2887884    100.00000
## 3      c 2011    0.8710604    100.00000
## 4      a 2012   64.6901151     96.21445
## 5      b 2012 5006.9547226    105.55872
## 6      c 2012    0.9193449    105.54318

The trick I used was to make the vector of values of the first year addressable by the series name. E.g.:

x0["a"]

##        a
## 67.23534

This allows us to calculate the column with indexed values in one vectorized operation.

##### dplyr

In the comments below, Ricardo Sanchez offered another, rather clean, approach for the last operation:

library(dplyr)
df <- df %>%
group_by(series) %>%
arrange(year) %>%
mutate(indexedvalue = 100 * value / first(value))
df

## Source: local data frame [60 x 4]
## Groups: series [3]
##
##    series  year        value indexedvalue
##    <fctr> <int>        <dbl>        <dbl>
## 1       a  2011   67.2353402    100.00000
## 2       b  2011 4743.2887884    100.00000
## 3       c  2011    0.8710604    100.00000
## 4       a  2012   64.6901151     96.21445
## 5       b  2012 5006.9547226    105.55872
## 6       c  2012    0.9193449    105.54318
## 7       a  2013   63.6518902     94.67029
## 8       b  2013 5623.0654046    118.54782
## 9       c  2013    0.8711697    100.01254
## 10      a  2014   69.1449240    102.84015
## # ... with 50 more rows

##### sqldf

Of course if you are familiar with SQL we can also use that:

library(sqldf)
df <-  sqldf("
select df.series,year,value,100*value/v0 as indexedvalue
from df
join (select min(year),value as v0, series
from df
group by series) df0
on df.series = df0.series
"
)
head(df)

##   series year        value indexedvalue
## 1      a 2011   67.2353402    100.00000
## 2      b 2011 4743.2887884    100.00000
## 3      c 2011    0.8710604    100.00000
## 4      a 2012   64.6901151     96.21445
## 5      b 2012 5006.9547226    105.55872
## 6      c 2012    0.9193449    105.54318

##### References
1. Federal Reserve Bank of Dallas, Indexing to a Common Starting Point, https://www.dallasfed.org/research/basics/indexing.aspx

## Saturday, May 20, 2017

### Journalist explaining statistics

I would call this explanation, well, below average:

MATH, HORRIBLE MATH
[…]
Take that bit about the bell curve of IQ. It’s an unpleasant fact that half of all people are of below average IQ. It’s also true that half of all people are below average height, weight, and everything else. And the other half are above average. You know why? Because that’s what “average” means.
The italics are in the original text (they are not mine). Maybe alternative facts include alternative definitions of statistical terms. It is ironic that the title of the section is about bad math.
##### References
1. Jonah Goldberg, Can Trump be contained?, http://www.nationalreview.com/g-file/447797/donald-trump-robert-mueller-special-counsel-investigation-social-justice-math

## Thursday, May 18, 2017

### Simple piecewise linear problem, not so easy with binary variables

The following picture illustrates the problem:

The blue line is what we want to model:

 $\bbox[lightcyan,10px,border:3px solid darkblue]{ y = \begin{cases} 0 & \text{if } 0 \le x \le a\\ (x-a)\displaystyle\frac{H}{b-a} & \text{if } a < x < b\\ H & \text{if } x\ge b\end{cases}}$

Is there much we can exploit here from this simple structure? I don’t believe so; this is what I came up with:

 \begin{align}&x_1 \le a \delta_1\\&a \delta_2 \le x_2 \le b \delta_2\\&b \delta_3 \le x_3 \le U \delta_3\\&\delta_1+\delta_2+\delta_3 = 1\\&x = x_1+x_2+x_3\\&y = (x_2 - a \delta_2)\displaystyle\frac{H}{b-a} + H \delta_3 \\&\delta_k \in \{0,1\}\\&x_i \ge 0\\&0 \le x \le U\end{align}
##### Update

Added missing $$\sum_k \delta_k=1$$, see comments below.
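We can check the formulation against the piecewise definition by enumerating, for a fixed $$x$$, which $$\delta_k$$ can be active and what $$y$$ results. A sketch in Python with made-up parameter values (note that at the breakpoints $$a$$ and $$b$$ two branches are feasible, but they give the same $$y$$ by continuity):

```python
def y_from_formulation(x, a, b, H, U):
    """All y values reachable in the MIP for fixed x, one per feasible delta_k."""
    ys = []
    if x <= a:          # delta_1 = 1: x1 = x <= a, so y = 0
        ys.append(0.0)
    if a <= x <= b:     # delta_2 = 1: x2 = x, y = (x2 - a*delta_2) * H / (b - a)
        ys.append((x - a) * H / (b - a))
    if b <= x <= U:     # delta_3 = 1: x3 = x, y = H
        ys.append(float(H))
    return ys

def y_definition(x, a, b, H):
    if x <= a:
        return 0.0
    if x < b:
        return (x - a) * H / (b - a)
    return float(H)

a, b, H, U = 2.0, 5.0, 10.0, 8.0
for x in [0, 1, 2, 3.5, 5, 6, 8]:
    assert y_definition(x, a, b, H) in y_from_formulation(x, a, b, H, U)
print("formulation matches the piecewise definition on all test points")
```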

## Wednesday, May 17, 2017

### RStudio Tips and Tricks

Why R is Bad for You

Summary: Someone had to say it.  In my opinion R is not the best way to learn data science and not the best way to practice it either.  More and more large employers agree.

##### References
1. Sean Lopp, RStudio Tips and Tricks, https://www.youtube.com/watch?v=kuSQgswZdr8&t=85s. Delivered by Sean Lopp (RStudio) at the 2017 New York R Conference on April 21st and 22nd at Work-Bench.
### Small GAMS trick: $eval

In GAMS we typically first declare sets and then deduce things like the number of elements in a set:

set i /i1*i20/;
scalar n;
n = card(i);

Often the question comes up: can we do it the other way around? First declare the scalar n, and then create a set of that size? We can use a funny trick for that, using a variant of the $set construct:

scalar n /20/;
$eval n n
set i /i1 * i%n%/;

In general I prefer to write this as:

$set n 20
scalar n /%n%/;
set i /i1 * i%n%/;

• The construct $eval n n is like a $set: it evaluates the rightmost n (the GAMS scalar) and uses the result to populate the preprocessor identifier (the leftmost n).