Lines and intersections extracted from the iCub's eyes. |
- What percentage of the expected horizontal and vertical grid lines are matched?
- What percentage of the grid intersection points are matched?
- How uniform is the color is the fringe just outside of the warped image?
Black-box Optimization
A black-box optimization method is given a fitness function with no extra information. The fitness function is just a black-box, an engineering term for a component in a design that reliably does some work without the designer needing to know how it works. The fitness function will give no other information other than the value of search point. A black-box optimizer searches through the inputs to the fitness function trying to find points with high scores.
Gradient-based (Newton) Methods
Black-box optimization was originally named to contrast with the gradient-based or Newton methods that everyone learns in high-school calculus. A gradient-based method method not only sees the fitness value \(f(x)\), it also sees the gradient \(\nabla f(x)\), the direction in which the fitness increases most rapidly. Newton methods can find the high and low points of a function very quickly because they are given good information about where to look next. Conjugate gradient Ascent reduces error with the square of time. That is, if at time \(t = 1\) the different in current fitness from the highest nearby fitness value is \(100\), then at time \(t=2\) will be \(10\), at least in theory.
But there are caveats. First, Newton methods need a gradient. In a case like ours, the fitness function is a program, and you can't take the derivative of a program. You could extract a math formula from the program with a lot of hard work. And in this case, that formula would even be differentiable (it's piecewise linear by design). But it would take many symbols to write it down. It would be much easier to estimate the gradient by \(\frac{1}{2} \left(f (x + h) - f(x-h)\right) \) using small increments \(h\) in each direction. However, estimated gradients slow down gradient ascent so that it only reduces error at a rate proportional to the time -- estimating the gradient costs a linear factor. Not only that, but even with a gradient, some functions don't work very well with gradients. Like ours.
Our scoring function is \(multi-modal\). If you could plot the fitness values over a plane, you would see lots of bumps distributed at the grid points. This is because if you slide the correct quadrilateral for the chessboard over one square, you would still have \(\frac{7}{8}\) of the grid points matched. So in ideal conditions the landscape of our fitness function has a large pyramid over the quadrilateral where the best chessboard is. Then it at each grid points in a box around the correct quadrilateral it has a slightly smaller pyramid, followed by a another row of pyramids at the next grid point, and so on out to the edges of the board. There is also noise created by extraneous and missing lines and intersections.
If you run a gradient method on this type of fitness function, you will at best get the tip top of one of the small pyramids. Although Newton methods are very accurate, they are local. They don't look for the highest point anywhere in the fitness function, just for the one nearest to the starting point. There are hacks like momentum factors that can avoid small local optima. But if all you have is a regular series of local optima, they won't work.
Quasi-Newton and Local Methods
The need for non-gradient methods became apparent from the earliest days of computing, and some were even used by some scientists in the Manhattan Project. By the 1960's a number of such methods had been developed -- Compass Search, Powell's method, and the Nelder-Mead (or amoeba) method are three well-known ones. Researchers within the mathematical community call such methods Direct Search. But these methods were not always reliable, and mathematicians disliked the fact that such methods can fail to find even a local optimum in some cases. Instead, they developed quasi-Newton methods such as the BFGS method that provide local convergence guarantees using sophisticated estimation of the gradient.
Still, estimating gradients is not enough to solve multi-modal and noisy problems, and so direct search methods were researched further. Starting from the Nelder-Mead algorithm, Virginia Torczon and her colleagues have developed a much improved method called pattern search. One nice thing about pattern search is that there are (local) convergence proofs and the convergence rates are known. A general pattern search method performs comparably to the best evolutionary algorithms mentioned below (e.g. CMA-ES, Differential Evolution). However, none of these work as well as the simple method that we will ultimately develop by using the characteristics of the chessboard problem, so no more needs to be said about pattern search here.
Evolutionary Algorithms
In Europe during the 1960s, Hans-Paul Schwefel and Ingo Rechenberg developed a method called evolution strategies inspired by Darwinian selection to search for the optima of real functions in Euclidean spaces. In their method, several search points (a population) are placed randomly in the space. A small selection of the best points are then varied in many ways by adding a Gaussian random variable to create a new population. The process is repeated as long as desired. There are no convergence guarantees, but the method can work well, and numerous improvements and variations have been developed over the years.
Around the same time in America, John Holland developed a similar method for binary codes inspired by Mendelian genetics. He called his method a genetic algorithm and published a widely read book named Adaptation in Natural and Artifical Systems in 1975. His student, David Goldberg, popularized these methods in the 1980s, and together with neural networks the terms became widely known in popular science and sci-fi circles.
A genetic algorithm is characterized by a gene encoding, in which solutions to a problem are expressed typically in a binary form as a list of ones and zeros. So to search for solutions to a problem in the real numbers, one might use a fixed decimal encoding. For the integers, one could use two's complement binary numbers. Since any representation inside a computer requires a binary code, genetic algorithms can be easily applied to any object in a digital computer (not necessarily with success).
The other prominent feature of genetic algorithms is crossover. In crossover, two gene encodings are mixed according to some scheme to create a third new gene encoding. Overall, genetic algorithms, like evolution strategies, propose a population of search points, then use the scores to generate a new population using selection, crossover, and mutation. Selection chooses the best points by fitness to generate the next population. Crossover mixes and matches these points, and mutation randomly flips a small number of bits. Populations are generated in this way as long as needed, usually until the population consists of several copies of the same point.
These days, standard genetic algorithms and evolution strategies have been amplified, extended an modified in numerous ways to produce an entire academic field of evolutionary computation, or natural computation. There are optimization methods inspired by the foraging behavior of ants, the migration of geese, and the adaptive defense mechanisms of the immune system.
Performance of Evolutionary Algorithms
A key characteristic of both evolution strategies and genetic algorithms is that neither provides any guarantees at all with respect to performance. They can and often fail badly, especially when applied to problem naïvely. For this reason, and because of overstated claims by some supporters, these methods have acquired an unwarranted bad reputation in some circles. Nonetheless, they can work quite well if applied with care. Evolutionary algorithms have been used to design a control system for a finless rocket, to evolve a better radio antenna, and to schedule courses satisfying many constraints.
The study of when and why evolutionary methods work is relatively young. The earliest attempt at explaining behavior was John Holland's schema theory. However, schema theory is not taken seriously anymore, for good reasons. Kenneth de Jong did early experimental work examining when and why genetic algorithms fail.
Since the mid 1990s, several researchers have begun to study the performance characteristic of evolutionary methods from a formal mathematical point of view. Some success and failure cases are now known. For example, needle in a haystack problems cannot be solved without exponential time. And a simple evolution strategy performs better than line search on on the problem of finding an exact match to some unknown string of bits. But theory that is known lags quite far behind practice. One important fact is that the No Free Lunch theorems imply that there must be a class of problems for which evolutionary methods perform better than gradient methods. It is a fallacy to say that because gradient methods are understood they must perform better! But it is also a fallacy to assume that the class of problems exists where evolutionary methods perform better is likely to occur in real-world problems. Still, evolutionary methods are used because they can work well.
In practice, the best evolutionary methods at present for searching real functions seem to be Differential Evolution and CMA-ES (or its generalized version, NES). Both of these methods tend to perform well out of the box, comparably to the pattern search mentioned above. And any of those three are far superior to a Newton or quasi-Newton method on certain multi-modal problems (like the chessboard problem). Differential evolution has even been shown to converge to the global optimum under certain conditions.
But to really get a good evolutionary algorithm, one has to take problem characteristics into account. In a way, this defies the original plan for a black-box method. The idea was that we would get good performance on any problem. But the reality is that one gets good performance on some problems by ignoring others. There is a general tradeoff between generality and quality/speed. In a recent paper, I even showed what the optimal solutions look like when you have a model of fitness problems. The catch is that the optimal solution averages over problems (and includes a second internal optimization). So a general model has to average over problems that aren't be solved at the moment, for which it sacrifices some performance on problems that are being solved.
Thus if you really want to solve a problem, you have to tinker with the method. One interesting thing about the evolutionary framework is its flexibility. One has two very flexible phases, selection and mutation. The selection phase uses fitness values in order to decide which search points should receive focus, and the mutation phase then tries to combine or alter the selected points to potentially make them better. After choosing a good general selection method, such as tournament selection or even evolutionary annealing, one can adapt the mutation process to match the search problem at hand. One way to manage such adaptations is by using fancy indirect encodings, or one can work directly in the search space using complex mutators.
Our chessboard recognition algorithm will do exactly this: it will directly search quadrilaterals, using the known features of the problem to choose good mutations. In this way, we will be able to find chessboards much faster than any general-purpose method does while still using an evolutionary algorithm. But that is the topic for the next post.
No comments:
Post a Comment