Image Analysis and Pattern Recognition for Remote Sensing with Algorithms in ENVI/IDL

Morton John Canty
Forschungszentrum Jülich GmbH
m.canty@fz-juelich.de

March 21, 2005
Contents

1 Images, Arrays and Vectors
  1.1 Multispectral satellite images
  1.2 Algebra of vectors and matrices
  1.3 Eigenvalues and eigenvectors
  1.4 Finding minima and maxima

2 Image Statistics
  2.1 Random variables
  2.2 The normal distribution
  2.3 A special function
  2.4 Conditional probabilities and Bayes Theorem
  2.5 Linear regression

3 Transformations
  3.1 Fourier transforms
    3.1.1 Discrete Fourier transform
    3.1.2 Discrete Fourier transform of an image
  3.2 Wavelets
  3.3 Principal components
  3.4 Minimum noise fraction
  3.5 Maximum autocorrelation factor (MAF)

4 Radiometric enhancement
  4.1 Lookup tables
    4.1.1 Histogram equalization
    4.1.2 Histogram matching
  4.2 Convolutions
    4.2.1 Laplacian of Gaussian filter

5 Topographic modelling
  5.1 RST transformation
  5.2 Imaging transformations
  5.3 Camera models and RFM approximations
  5.4 Stereo imaging, elevation models and orthorectification
  5.5 Slope and aspect
  5.6 Illumination correction

6 Image Registration
  6.1 Frequency domain registration
  6.2 Feature matching
    6.2.1 Contour detection
    6.2.2 Closed contours
    6.2.3 Chain codes
    6.2.4 Invariant moments
    6.2.5 Contour matching
    6.2.6 Consistency check
  6.3 Re-sampling and warping

7 Image Sharpening
  7.1 HSV fusion
  7.2 Brovey fusion
  7.3 PCA fusion
  7.4 Wavelet fusion
    7.4.1 Discrete wavelet transform
    7.4.2 À trous filtering
  7.5 Quality indices

8 Change Detection
  8.1 Algebraic methods
  8.2 Principal components
  8.3 Post-classification comparison
  8.4 Multivariate alteration detection
    8.4.1 Canonical correlation analysis
    8.4.2 Solution by Cholesky factorization
    8.4.3 Properties of the MAD components
    8.4.4 Covariance of MAD variates with original observations
    8.4.5 Scale invariance
    8.4.6 Improving signal to noise
    8.4.7 Decision thresholds
  8.5 Radiometric normalization

9 Unsupervised Classification
  9.1 A simple cost function
  9.2 Algorithms that minimize the simple cost function
    9.2.1 K-means
    9.2.2 Extended K-means
    9.2.3 Agglomerative hierarchical clustering
    9.2.4 Fuzzy K-means
  9.3 EM Clustering
    9.3.1 Simulated annealing
    9.3.2 Partition density
    9.3.3 Including spatial information
  9.4 The Kohonen Self Organizing Map
  9.5 Unsupervised classification of changes

10 Supervised Classification
  10.1 Bayes decision rule
  10.2 Training data
  10.3 Bayes Maximum likelihood classification
  10.4 Non-parametric methods
  10.5 Neural networks
    10.5.1 The feed-forward network
    10.5.2 Cost functions
    10.5.3 Training
    10.5.4 Backpropagation
  10.6 Evaluation
    10.6.1 Standard deviation of misclassification
    10.6.2 Model comparison
    10.6.3 Confusion matrices

11 Hyperspectral analysis
  11.1 Mixture modelling
    11.1.1 Full linear unmixing
    11.1.2 Unconstrained linear unmixing
    11.1.3 Intrinsic end-members and pixel purity
  11.2 Orthogonal subspace projection

A Least Squares Procedures
  A.1 Generalized least squares
  A.2 Recursive least squares
  A.3 Orthogonal regression

B The Discrete Wavelet Transformation
  B.1 Inner product space
  B.2 Haar wavelets
  B.3 Multi-resolution analysis
  B.4 Fixpoint wavelet approximation
  B.5 The mother wavelet
  B.6 The Daubechies wavelet
  B.7 Wavelets and filter banks

C Advanced Neural Network Training Algorithms
  C.1 The Hessian matrix
    C.1.1 The R-operator
    C.1.2 Calculating the Hessian
  C.2 Scaled conjugate gradient training
    C.2.1 Conjugate directions
    C.2.2 Minimizing a quadratic function
    C.2.3 The algorithm
  C.3 Kalman filter training
    C.3.1 Linearization
    C.3.2 The algorithm

D ENVI Extensions
  D.1 Installation
  D.2 Topographic modelling
    D.2.1 Calculating building heights
    D.2.2 Illumination correction
  D.3 Image registration
  D.4 Image fusion
    D.4.1 DWT fusion
    D.4.2 ATWT fusion
    D.4.3 Quality index
  D.5 Change detection
    D.5.1 Multivariate Alteration Detection
    D.5.2 Maximum autocorrelation factor
    D.5.3 Radiometric normalization
  D.6 Unsupervised classification
    D.6.1 Hierarchical clustering
    D.6.2 Fuzzy K-means clustering
    D.6.3 EM clustering
    D.6.4 Probabilistic label relaxation
    D.6.5 Kohonen self organizing map
    D.6.6 A GUI for change clustering
  D.7 Neural network: Scaled conjugate gradient
  D.8 Neural network: Kalman filter
  D.9 Neural network: Hybrid

Bibliography
Chapter 1

Images, Arrays and Vectors

1.1 Multispectral satellite images

There are a number of multispectral satellite-based sensors currently in orbit which are used for earth observation. Representative of these we mention here the Landsat ETM+ system. The ETM+ instrument on the Landsat 7 spacecraft contains sensors to measure radiance in three spectral intervals:

• visible and near infrared (VNIR) bands - bands 1, 2, 3, 4 and 8 (PAN) with a spectral range between 0.4 and 1.0 micrometer,

• short wavelength infrared (SWIR) bands - bands 5 and 7 with a spectral range between 1.0 and 3.0 micrometer,

• thermal long wavelength infrared (LWIR) band - band 6 with a spectral range between 8.0 and 12.0 micrometer.

In addition a panchromatic (PAN) image (band 8) covering the visible spectrum is provided. Ground resolutions are 15 m (PAN), 30 m (VNIR, SWIR) and 60 m (LWIR). Figure 1.1 shows a color composite image of a Landsat 7 scene over Morocco acquired in 1999.

A single multispectral image can be represented as an array of gray-scale values or digital numbers $g_k(i,j)$, $1 \le i \le c$, $1 \le j \le r$, where $c$ is the number of pixel columns and $r$ is the number of pixel rows. If we are dealing with an $N$-band multispectral image, then the index $k$, $1 \le k \le N$, denotes the spectral band. Often a pixel intensity is stored in a single byte, so that $0 \le g_k \le 255$.

The gray-scale values are the result of sampling, along an array of sensors, the at-sensor radiance $f_\lambda(x,y)$ at wavelength $\lambda$ due to sunlight reflected from some point $(x,y)$ on the Earth's surface and focussed by the satellite's optical system at the sensors. Ignoring atmospheric effects this radiance is given roughly by

   $f_\lambda(x,y) \sim i_\lambda(x,y)\, r_\lambda(x,y)$,

where $i_\lambda(x,y)$ is the sun's irradiance at the surface in units of watt/(m$^2\,\mu$m), and $r_\lambda(x,y)$ is the surface reflectance, a number between 0 and 1.
Figure 1.1: Color composite of bands 4 (red), 5 (green) and 7 (blue) for a Landsat ETM+ image over Morocco.
The conversion between gray-scale or digital number $g$ and at-sensor radiance $f$ is determined by the sensor calibration as measured (and maintained) by the satellite image provider:

   $f = C\,g(i,j) + f_{min}$,   where   $C = (f_{max} - f_{min})/255$,

in which $f_{max}$ and $f_{min}$ are the maximum and minimum measurable radiances at the sensor. Atmospheric scattering and absorption models are used to calculate surface reflectance from the observed at-sensor radiance, as it is the reflectance which is directly related to the physical properties of the surface being examined.

Various conventions can be used for storing the image array $g(i,j)$ in computer memory or on storage media. In band interleaved by pixel (BIP) format, for example, a two-channel, 3 × 3 pixel image would be stored as

   g1(1,1) g2(1,1) g1(2,1) g2(2,1) g1(3,1) g2(3,1)
   g1(1,2) g2(1,2) g1(2,2) g2(2,2) g1(3,2) g2(3,2)
   g1(1,3) g2(1,3) g1(2,3) g2(2,3) g1(3,3) g2(3,3),

whereas in band interleaved by line (BIL) it would be stored as

   g1(1,1) g1(2,1) g1(3,1) g2(1,1) g2(2,1) g2(3,1)
   g1(1,2) g1(2,2) g1(3,2) g2(1,2) g2(2,2) g2(3,2)
   g1(1,3) g1(2,3) g1(3,3) g2(1,3) g2(2,3) g2(3,3),

and in band sequential (BSQ) format it is stored as

   g1(1,1) g1(2,1) g1(3,1) g1(1,2) g1(2,2) g1(3,2) g1(1,3) g1(2,3) g1(3,3)
   g2(1,1) g2(2,1) g2(3,1) g2(1,2) g2(2,2) g2(3,2) g2(1,3) g2(2,3) g2(3,3).

In the computer language IDL, so-called row major indexing is used for arrays and the elements in an array are numbered from zero. This means that, if a gray-scale image $g$ is stored in an IDL array variable G, then the intensity value $g(i,j)$ is addressed as G[i-1,j-1]. An $N$-band multispectral image is stored in BIP format as an $N \times c \times r$ array in IDL, in BIL format as a $c \times N \times r$ array, and in BSQ format as a $c \times r \times N$ array.
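As a minimal illustration of the relationship between these orderings in IDL (a sketch only, using synthetic values and arbitrary variable names), a BIP-ordered array can be rearranged into BSQ order with a permuting transpose:

; sketch: a 2-band, 3 x 3 image in BIP order (N x c x r), rearranged to BSQ (c x r x N)
num_bands = 2 & num_cols = 3 & num_rows = 3
image_bip = indgen(num_bands, num_cols, num_rows)
image_bsq = transpose(image_bip, [1, 2, 0])
print, size(image_bsq, /dimensions)        ; prints: 3 3 2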
Auxiliary information, such as image acquisition parameters and georeferencing, is normally included with the image data in the same file, and the format may or may not make use of compression algorithms. Examples are the geoTIFF¹ file format used for example by Space Imaging Inc. for distributing Carterra© imagery and which includes lossless compression, the HDF (Hierarchical Data Format) in which for example ASTER images are distributed, and the cross-platform PCIDSK format employed by PCI Geomatics with its image processing software, which is in plain ASCII code and not compressed. ENVI uses a simple "flat binary" file structure with an additional ASCII header file.

¹ geoTIFF refers to TIFF files which have geographic (or cartographic) data embedded as tags within the TIFF file. The geographic data can then be used to position the image in the correct location and geometry on the screen of a geographic information display.

1.2 Algebra of vectors and matrices

It is very convenient to use a vector representation for multispectral images, namely

   $g(i,j) = \begin{pmatrix} g_1(i,j) \\ \vdots \\ g_N(i,j) \end{pmatrix}$,   (1.1)

which is a column vector of multispectral gray-scale values at the position $(i,j)$. Since we will be making extensive use of the vector notation of Eq. (1.1) we review here some of the basic properties of vectors and matrices. We can illustrate most of these properties in just two dimensions.

Figure 1.2: A vector in two dimensions.

The transpose of the two-dimensional column vector shown in Fig. 1.2,

   $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$,

is the row vector $x^\top = (x_1, x_2)$. The sum of two vectors is given by

   $x + y = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} x_1 + y_1 \\ x_2 + y_2 \end{pmatrix}$,

and the inner product by

   $x^\top y = (x_1, x_2)\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = x_1 y_1 + x_2 y_2$.

The length or norm of the vector $x$ is

   $\|x\| = |x| = \sqrt{x_1^2 + x_2^2} = \sqrt{x^\top x}$.

The programming language IDL is especially good at manipulating vectors and matrices:
IDL> x = [[1],[2]]
IDL> print, x
       1
       2
IDL> print, transpose(x)
       1       2

Figure 1.3: The inner product.

The inner product can be written in terms of the vector lengths and the angle $\theta$ between the two vectors as

   $x^\top y = |x|\,|y|\cos\theta$,

see Fig. 1.3. If $\theta = 90^\circ$ the vectors are orthogonal, so that $x^\top y = 0$. Any vector can be decomposed into orthogonal unit vectors:

   $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1\begin{pmatrix} 1 \\ 0 \end{pmatrix} + x_2\begin{pmatrix} 0 \\ 1 \end{pmatrix}$.

A two-by-two matrix is written

   $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$.

When a matrix is multiplied with a vector the result is another vector, e.g.

   $Ax = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a_{11}x_1 + a_{12}x_2 \\ a_{21}x_1 + a_{22}x_2 \end{pmatrix}$.

The IDL operator for matrix and vector multiplication is ##.

IDL> a = [[1,2],[3,4]]
IDL> print, a
       1       2
       3       4
IDL> print, a##x
       5
      11
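Using the same operator, the inner product relation above can be checked directly (a small sketch with arbitrary example vectors):

x = [[1],[2]]                        ; column vectors
y = [[3],[4]]
ip = transpose(x)##y                 ; inner product x'y
nx = sqrt(total(x*x))                ; |x|
ny = sqrt(total(y*y))                ; |y|
print, ip
print, acos(ip[0]/(nx*ny))*!radeg    ; the angle theta in degrees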
Matrices also have a transposed form, obtained by interchanging their rows and columns:

   $A^\top = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}$.

The product of two matrices is given by

   $AB = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$

and is another matrix. The determinant of a two-dimensional matrix is

   $|A| = \det A = a_{11}a_{22} - a_{12}a_{21}$.

The outer product of two vectors is a matrix:

   $xy^\top = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}(y_1, y_2) = \begin{pmatrix} x_1y_1 & x_1y_2 \\ x_2y_1 & x_2y_2 \end{pmatrix}$.

The identity matrix is given by

   $I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$,   $IA = AI = A$.

The matrix inverse $A^{-1}$ is defined in terms of the identity matrix according to

   $A^{-1}A = AA^{-1} = I$.

In two dimensions it is easy to verify that

   $A^{-1} = \frac{1}{|A|}\begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}$.

IDL> print, determ(float(a))
      -2.00000
IDL> print, invert(a)
     -2.00000      1.00000
      1.50000    -0.500000
IDL> print, a##invert(a)
      1.00000     0.000000
     0.000000      1.00000

If $|A| = 0$, then $A$ has no inverse and is said to be a singular matrix. The trace of a square matrix is the sum of its diagonal elements:

   $\mathrm{Tr}\,A = a_{11} + a_{22}$.

1.3 Eigenvalues and eigenvectors

The statistical properties of ensembles of pixel intensities (for example entire images or specific land-cover classes) are often approximated by their mean values and covariance
matrices. As we will see later, covariance matrices are always symmetric. A matrix $A$ is symmetric if it doesn't change when it is transposed, i.e. if $A = A^\top$. Very often we have to solve the so-called eigenvalue problem, which is to find eigenvectors $x$ and eigenvalues $\lambda$ that satisfy the equation

   $Ax = \lambda x$

or, equivalently,

   $\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \lambda\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$.

This is the same as the two equations

   $(a_{11} - \lambda)x_1 + a_{12}x_2 = 0$
   $a_{21}x_1 + (a_{22} - \lambda)x_2 = 0$.   (1.2)

If we eliminate $x_1$ and make use of the symmetry $a_{12} = a_{21}$, we obtain

   $[(a_{11} - \lambda)(a_{22} - \lambda) - a_{12}^2]\,x_2 = 0$.

In general $x_2 \ne 0$, so we must have

   $(a_{11} - \lambda)(a_{22} - \lambda) - a_{12}^2 = 0$,

which is known as the characteristic equation for the eigenvalue problem. It is a quadratic equation in $\lambda$ with solutions

   $\lambda^{(1)} = \frac{1}{2}\left[a_{11} + a_{22} + \sqrt{(a_{11} + a_{22})^2 - 4(a_{11}a_{22} - a_{12}^2)}\right]$
   $\lambda^{(2)} = \frac{1}{2}\left[a_{11} + a_{22} - \sqrt{(a_{11} + a_{22})^2 - 4(a_{11}a_{22} - a_{12}^2)}\right]$.   (1.3)

Thus there are two eigenvalues and, correspondingly, two eigenvectors $x^{(1)}$ and $x^{(2)}$, which can be obtained by substituting $\lambda^{(1)}$ and $\lambda^{(2)}$ into (1.2) and solving for $x_1$ and $x_2$. It is easy to show that the eigenvectors are orthogonal,

   $(x^{(1)})^\top x^{(2)} = 0$.

The matrix formed by the two eigenvectors,

   $u = (x^{(1)}, x^{(2)}) = \begin{pmatrix} x_1^{(1)} & x_1^{(2)} \\ x_2^{(1)} & x_2^{(2)} \end{pmatrix}$,

is said to diagonalize the matrix $A$. That is,

   $u^\top A u = \begin{pmatrix} \lambda^{(1)} & 0 \\ 0 & \lambda^{(2)} \end{pmatrix}$.   (1.4)

We can illustrate the whole procedure in IDL as follows:
IDL> a = float([[1,2],[2,3]])
IDL> print, a
      1.00000      2.00000
      2.00000      3.00000
IDL> print, eigenql(a, eigenvectors=u, /double)
       4.2360680     -0.23606798
IDL> print, transpose(u)##a##u
       4.2360680  -2.2204460e-016
  -1.6653345e-016     -0.23606798

Note that, after diagonalization, the off-diagonal elements are not precisely zero due to rounding errors in the computation. All of the above properties generalize easily to N dimensions.

1.4 Finding minima and maxima

In order to maximize some desirable property of a multispectral image, such as signal to noise or spread in intensity, we often need to take derivatives of vectors. A vector (partial) derivative in two dimensions is written $\frac{\partial}{\partial x}$ and is defined as the vector

   $\frac{\partial}{\partial x} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}\frac{\partial}{\partial x_1} + \begin{pmatrix} 0 \\ 1 \end{pmatrix}\frac{\partial}{\partial x_2}$.

Many of the operations with vector derivatives correspond exactly to operations with ordinary scalar derivatives (they can all be verified easily by writing out the expressions component by component):

   $\frac{\partial}{\partial x}(x^\top y) = y$   analogous to   $\frac{\partial}{\partial x}xy = y$

   $\frac{\partial}{\partial x}(x^\top x) = 2x$   analogous to   $\frac{\partial}{\partial x}x^2 = 2x$.

The scalar expression $x^\top A y$, where $A$ is a matrix, is called a quadratic form. We have

   $\frac{\partial}{\partial x}(x^\top A y) = Ay$,   $\frac{\partial}{\partial y}(x^\top A y) = A^\top x$

and

   $\frac{\partial}{\partial x}(x^\top A x) = Ax + A^\top x$.

Note that, if $A$ is a symmetric matrix, this last equation can be written

   $\frac{\partial}{\partial x}(x^\top A x) = 2Ax$.

Suppose $x^*$ is a critical point of the function $f(x)$, i.e.

   $\frac{d}{dx}f(x^*) = \left.\frac{d}{dx}f(x)\right|_{x=x^*} = 0$,   (1.5)
see Fig. 1.4.

Figure 1.4: A function of one variable.

Then $f(x^*)$ is a local minimum if $\frac{d^2}{dx^2}f(x^*) > 0$. This becomes obvious if we express $f(x)$ as a Taylor series about $x^*$,

   $f(x) = f(x^*) + (x - x^*)\frac{d}{dx}f(x^*) + \frac{1}{2}(x - x^*)^2\frac{d^2}{dx^2}f(x^*) + \ldots$

For $|x - x^*|$ sufficiently small this is equivalent to

   $f(x) \approx f(x^*) + \frac{1}{2}(x - x^*)^2\frac{d^2}{dx^2}f(x^*)$.

The situation is similar for scalar functions of a vector:

   $f(x) \approx f(x^*) + (x - x^*)^\top\frac{\partial f(x^*)}{\partial x} + \frac{1}{2}(x - x^*)^\top H\,(x - x^*)$,   (1.6)

where $H$ is called the Hessian matrix:

   $(H)_{ij} = \frac{\partial^2}{\partial x_i\partial x_j}f(x^*)$.   (1.7)

In the neighborhood of the critical point, since $\frac{\partial f(x^*)}{\partial x} = 0$, we get the approximation

   $f(x) \approx f(x^*) + \frac{1}{2}(x - x^*)^\top H\,(x - x^*)$.

Now the condition for a local minimum is that the Hessian matrix be positive definite at the point $x^*$. Positive definiteness means that

   $x^\top H x > 0$ for all $x \ne 0$.   (1.8)

Suppose we want to find a minimum (or maximum) of a scalar function $f(x)$ of the vector $x$. If there are no constraints, then we solve the set of equations

   $\frac{\partial f(x)}{\partial x_i} = 0$,   $i = 1, 2$,

or, in terms of our notation for vector derivatives,

   $\frac{\partial f(x)}{\partial x} = 0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$.
However suppose that $x$ is constrained by the equation $g(x) = 0$. For example, we might have

   $g(x) = x_1^2 + x_2^2 - 1 = 0$,

which constrains $x$ to lie on a circle of radius 1. Finding a minimum of $f$ subject to $g = 0$ is equivalent to finding an unconstrained minimum of

   $f(x) + \lambda g(x)$,   (1.9)

where $\lambda$ is called a Lagrange multiplier and is treated like an additional variable, see [Mil99]. That is, we solve the set of equations

   $\frac{\partial}{\partial x_i}\bigl(f(x) + \lambda g(x)\bigr) = 0$,   $i = 1, 2$
   $\frac{\partial}{\partial \lambda}\bigl(f(x) + \lambda g(x)\bigr) = 0$.   (1.10)

The latter equation is just $g(x) = 0$. For example, let $f(x) = a x_1^2 + b x_2^2$ and $g(x) = x_1 + x_2 - 1$. Then we get the three equations

   $\frac{\partial}{\partial x_1}\bigl(f(x) + \lambda g(x)\bigr) = 2a x_1 + \lambda = 0$
   $\frac{\partial}{\partial x_2}\bigl(f(x) + \lambda g(x)\bigr) = 2b x_2 + \lambda = 0$
   $\frac{\partial}{\partial \lambda}\bigl(f(x) + \lambda g(x)\bigr) = x_1 + x_2 - 1 = 0$.

The solution is

   $x_1 = \frac{b}{a+b}$,   $x_2 = \frac{a}{a+b}$.
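A quick numerical check of this example in IDL (a minimal sketch, with arbitrary values a = 1, b = 2): the function is sampled along the constraint line $x_1 + x_2 = 1$ and the smallest value located.

; sketch: numerical check of the constrained minimum for a=1, b=2
a = 1.0 & b = 2.0
x1 = findgen(1001)/1000              ; points on the constraint x1 + x2 = 1
x2 = 1 - x1
f = a*x1^2 + b*x2^2
fmin = min(f, pos)
print, x1[pos], x2[pos]              ; approximately b/(a+b) = 0.667 and a/(a+b) = 0.333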
Exercises

1. Show that the outer product of two 2-dimensional vectors is a singular matrix.

2. Prove that the eigenvectors of a 2 × 2 symmetric matrix are orthogonal.

3. Differentiate the function $\frac{1}{x^\top A y}$ with respect to $y$.

4. Verify the following matrix identity in IDL: $(A \cdot B)^\top = B^\top \cdot A^\top$.

5. Calculate the eigenvalues and eigenvectors of a non-symmetric matrix with IDL.

6. Plot the function $f(x) = x_1^2 - x_2^2$ with IDL. Find its minima and maxima subject to the constraint $g(x) = x_1^2 + x_2^2 - 1 = 0$.
Chapter 2

Image Statistics

It is useful to think of image pixel intensities $g(x)$ as realizations of a random vector $G(x)$ drawn independently from some probability distribution.

2.1 Random variables

A random variable can be used to represent some quantity which changes in an unpredictable way each time it is observed. If there is a discrete set of $M$ possible events $\{E_i\}$, $i = 1 \ldots M$, associated with some random process, let $p_i$ be the probability that the $i$th event $E_i$ will occur. If $n_i$ represents the number of times $E_i$ occurs in $n$ trials, we expect that $p_i \to n_i/n$ in the limit $n \to \infty$ and that

   $\sum_{i=1}^{M} p_i = 1$.

For example, on the throw of a pair of dice,

   $\{E_i\} = (1,1), (1,2), (2,1) \ldots (6,6)$

and each event is equally probable, $p_i = 1/36$, $i = 1 \ldots 36$. Formally, a random variable $X$ is a real function on the set of possible events: $X = f(E_i)$. If, for example, $X$ is the sum of the points on the dice,

   $X = f(E_1) = 2$,   $X = f(E_2) = 3$,   $X = f(E_3) = 3$,   $\ldots$   $X = f(E_{36}) = 12$.

On the basis of the probabilities of the individual events, we can associate a distribution function $P(x)$ with the random variable $X$, defined by

   $P(x) = \Pr(X \le x)$.

For the dice example, $P(1) = 0$, $P(2) = 1/36$, $P(3) = 1/12$, $\ldots$ $P(12) = 1$.
For continuous random variables, such as the measured radiance at a satellite sensor, the distribution function is not expressed in terms of discrete probabilities, but rather in terms of a probability density function $p(x)$, where $p(x)dx$ is the probability that the value of the random variable $X$ lies in the interval $[x, x+dx]$. Then

   $P(x) = \Pr(X \le x) = \int_{-\infty}^{x} p(t)\,dt$

and, of course, $P(-\infty) = 0$, $P(\infty) = 1$. Two random variables $X$ and $Y$ are said to be independent when

   $\Pr(X \le x \text{ and } Y \le y) = \Pr(X \le x,\, Y \le y) = P(x)P(y)$.

The mean or expected value of a random variable $X$ is written $\langle X \rangle$ and is defined in terms of the probability density function:

   $\langle X \rangle = \int_{-\infty}^{\infty} x\,p(x)\,dx$.

The variance of $X$, written $\mathrm{var}(X)$, is defined as the expected value of the random variable $(X - \langle X \rangle)^2$, i.e.

   $\mathrm{var}(X) = \langle (X - \langle X \rangle)^2 \rangle$.

In terms of the probability density function, it is given by

   $\mathrm{var}(X) = \int_{-\infty}^{\infty} (x - \langle X \rangle)^2\,p(x)\,dx$.

Two simple but very useful identities follow from the definition of variance:

   $\mathrm{var}(X) = \langle X^2 \rangle - \langle X \rangle^2$
   $\mathrm{var}(aX) = a^2\,\mathrm{var}(X)$.   (2.1)

2.2 The normal distribution

It is often the case that random variables are well-described by the normal or Gaussian probability density function

   $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$.

In that case $\langle X \rangle = \mu$, $\mathrm{var}(X) = \sigma^2$. The expected value of pixel intensities

   $G(x) = \begin{pmatrix} G_1(x) \\ G_2(x) \\ \vdots \\ G_N(x) \end{pmatrix}$,
where $x$ denotes the pixel coordinates, i.e. $x = (i,j)$, is estimated by averaging over all of the pixels in the image,

   $\langle G(x) \rangle \approx \frac{1}{cr}\sum_{i,j=1}^{c,r} g(i,j)$,

referred to as the sample mean vector. It is usually assumed to be independent of $x$, i.e. $\langle G(x) \rangle = \langle G \rangle$. The covariance between bands $k$ and $\ell$ is defined according to

   $\mathrm{cov}(G_k, G_\ell) = \langle (G_k - \langle G_k \rangle)(G_\ell - \langle G_\ell \rangle) \rangle$

and is estimated again by averaging over the pixels:

   $\mathrm{cov}(G_k, G_\ell) \approx \frac{1}{cr}\sum_{i,j=1}^{c,r} (g_k(i,j) - \langle G_k \rangle)(g_\ell(i,j) - \langle G_\ell \rangle)$,

which is called the sample covariance. The covariance is also usually assumed to be independent of $x$. The variance for band $k$ is given by

   $\mathrm{var}(G_k) = \mathrm{cov}(G_k, G_k) = \langle (G_k - \langle G_k \rangle)^2 \rangle$.

The random vector $G$ is often assumed to be described by a multivariate normal probability density function $p(g)$, given by

   $p(g) = \frac{1}{(2\pi)^{N/2}\sqrt{|\Sigma|}}\exp\left(-\frac{1}{2}(g - \mu)^\top\Sigma^{-1}(g - \mu)\right)$.

We indicate this by writing $G \sim N(\mu, \Sigma)$. The distribution function of the multi-spectral pixels is then completely determined by the expected value $\langle G \rangle = \mu$ and by the covariance matrix $\Sigma$. In two dimensions, for example,

   $\Sigma = \begin{pmatrix} \mathrm{var}(G_1) & \mathrm{cov}(G_1, G_2) \\ \mathrm{cov}(G_2, G_1) & \mathrm{var}(G_2) \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$.

Note that, since $\mathrm{cov}(G_k, G_\ell) = \mathrm{cov}(G_\ell, G_k)$, the covariance matrix is symmetric, $\Sigma = \Sigma^\top$. The covariance matrix can also be written as an outer product:

   $\Sigma = \langle (G - \langle G \rangle)(G - \langle G \rangle)^\top \rangle$,

as can its estimated value:

   $\Sigma \approx \frac{1}{cr}\sum_{i,j=1}^{c,r} (g(i,j) - \langle G \rangle)(g(i,j) - \langle G \rangle)^\top$.

If $\langle G \rangle = 0$, we can write simply $\Sigma = \langle GG^\top \rangle$. Another useful identity applies to any linear combination $a^\top G$ of the random vector $G$, namely

   $\mathrm{var}(a^\top G) = a^\top\Sigma a$.   (2.2)
This is obvious in two dimensions, since we have

   $\mathrm{var}(a^\top G) = \mathrm{cov}(a_1G_1 + a_2G_2,\; a_1G_1 + a_2G_2)$
   $= a_1^2\,\mathrm{var}(G_1) + a_1a_2\,\mathrm{cov}(G_1, G_2) + a_1a_2\,\mathrm{cov}(G_2, G_1) + a_2^2\,\mathrm{var}(G_2)$
   $= (a_1, a_2)\begin{pmatrix} \mathrm{var}(G_1) & \mathrm{cov}(G_1, G_2) \\ \mathrm{cov}(G_2, G_1) & \mathrm{var}(G_2) \end{pmatrix}\begin{pmatrix} a_1 \\ a_2 \end{pmatrix}$.

Variance is always nonnegative and the vector $a$ in (2.2) is arbitrary, so we have $a^\top\Sigma a \ge 0$ for all $a$. The covariance matrix is therefore said to be positive semi-definite.

The correlation matrix $C$ is similar to the covariance matrix, except that each matrix element $(i,j)$ is normalized to $\sqrt{\mathrm{var}(G_i)\,\mathrm{var}(G_j)}$. In two dimensions

   $C = \begin{pmatrix} 1 & \rho_{12} \\ \rho_{21} & 1 \end{pmatrix} = \begin{pmatrix} 1 & \frac{\mathrm{cov}(G_1,G_2)}{\sqrt{\mathrm{var}(G_1)\mathrm{var}(G_2)}} \\ \frac{\mathrm{cov}(G_2,G_1)}{\sqrt{\mathrm{var}(G_1)\mathrm{var}(G_2)}} & 1 \end{pmatrix} = \begin{pmatrix} 1 & \frac{\sigma_{12}}{\sigma_1\sigma_2} \\ \frac{\sigma_{21}}{\sigma_1\sigma_2} & 1 \end{pmatrix}$.

The following ENVI/IDL program calculates and prints out the covariance matrix of a multispectral image:

envi_select, title='Choose multispectral image', fid=fid, dims=dims, pos=pos
if (fid eq -1) then return
num_cols = dims[2]-dims[1]+1
num_rows = dims[4]-dims[3]+1
num_pixels = (num_cols*num_rows)
num_bands = n_elements(pos)
samples = intarr(num_bands, num_pixels)
for i=0,num_bands-1 do samples[i,*] = envi_get_data(fid=fid, dims=dims, pos=pos[i])
print, correlate(samples, /covariance, /double)
end

ENVI> .GO
      111.46663       82.123236       159.58377       133.80637
      82.123236       64.532431       124.84815       104.45298
      159.58377       124.84815       246.18004       205.63420
      133.80637       104.45298       205.63420       192.70367

2.3 A special function

If $n$ is an integer, the factorial of $n$ is defined by

   $n! = n(n-1)\cdots 1$,   $1! = 0! = 1$.

The generalization of this to non-integers $z$ is the gamma function

   $\Gamma(z) = \int_0^\infty t^{z-1}e^{-t}\,dt$.

It has the property

   $\Gamma(z+1) = z\,\Gamma(z)$.
The factorial is a special case, i.e. for integer $n$,

   $\Gamma(n+1) = n!$

A further generalization is the incomplete gamma function

   $\Gamma_P(a, x) = \frac{1}{\Gamma(a)}\int_0^x t^{a-1}e^{-t}\,dt$.

It has the properties

   $\Gamma_P(a, 0) = 0$,   $\Gamma_P(a, \infty) = 1$.

Here is a plot of $\Gamma_P$ for $a = 3$ in IDL:

x = findgen(100)/10
envi_plot_data, x, igamma(3, x)

Figure 2.1: The incomplete gamma function.

We are interested in this function for the following reason. Suppose that the random variables $X_i$, $i = 1 \ldots n$, are independent normally distributed with zero mean and variance $\sigma_i^2$. Then the random variable

   $Z = \sum_{i=1}^{n}\left(\frac{X_i}{\sigma_i}\right)^2$

has the distribution function

   $P(z) = \Pr(Z \le z) = \Gamma_P(n/2,\, z/2)$,

and is said to be chi-square distributed with $n$ degrees of freedom.

2.4 Conditional probabilities and Bayes Theorem

If $A$ and $B$ are two events such that the probability of $A$ and $B$ occurring simultaneously is $P(A, B)$, then the conditional probability of $A$ occurring given that $B$ has occurred is

   $P(A \mid B) = \frac{P(A, B)}{P(B)}$.
Bayes' Theorem (named after Rev. Thomas Bayes, an 18th century mathematician who derived a special case) is the basic starting point for inference problems using probability theory as logic. We will use it in the following form. Let $X$ be a random variable describing a pixel intensity, and let $\{C_k \mid k = 1 \ldots M\}$ be a set of possible classes for the pixels. Then the a posteriori conditional probability for class $C_k$, given the measured pixel intensity $x$, is

   $P(C_k \mid x) = \frac{P(x \mid C_k)\,P(C_k)}{P(x)}$,   (2.3)

where $P(C_k)$ is the prior probability for class $C_k$, $P(x \mid C_k)$ is the conditional probability of observing the value $x$ if it belongs to class $C_k$, and

   $P(x) = \sum_{k=1}^{M} P(x \mid C_k)\,P(C_k)$

is the total probability for $x$.

2.5 Linear regression

Applying radiometric corrections to digital images often involves fitting a set of $m$ data points $(x_i, y_i)$ to a straight line:

   $y(x) = a + bx + \epsilon$.

Suppose that the measurements $y_i$ include a random error $\epsilon$ with variance $\sigma^2$ and that the measurements $x_i$ are exact. Define a "goodness of fit" function

   $\chi^2(a, b) = \sum_{i=1}^{m}\left(\frac{y_i - a - bx_i}{\sigma}\right)^2$.   (2.4)

If the random error is normally distributed, then we obtain the most likely (i.e. best) values for $a$ and $b$ by minimizing this function, that is, by solving the equations

   $\frac{\partial\chi^2}{\partial a} = \frac{\partial\chi^2}{\partial b} = 0$.

The solution is

   $\hat b = \frac{s_{xy}}{s_{xx}^2}$,   $\hat a = \bar y - \hat b\,\bar x$,   (2.5)

where

   $s_{xy} = \frac{1}{m}\sum_{i=1}^{m}(x_i - \bar x)(y_i - \bar y)$,   $s_{xx}^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \bar x)^2$,
   $\bar x = \frac{1}{m}\sum_{i=1}^{m}x_i$,   $\bar y = \frac{1}{m}\sum_{i=1}^{m}y_i$.

The uncertainties in the estimates $\hat a$ and $\hat b$ are given by

   $\sigma_a^2 = \frac{\sigma^2\sum x_i^2}{m\sum x_i^2 - \left(\sum x_i\right)^2}$,   $\sigma_b^2 = \frac{\sigma^2\, m}{m\sum x_i^2 - \left(\sum x_i\right)^2}$.   (2.6)
If $\sigma^2$ is not known a priori, then it can be estimated by

   $\hat\sigma^2 = \frac{1}{m-2}\sum_{i=1}^{m}(y_i - \hat a - \hat b x_i)^2$.

Generalized and orthogonal least squares methods are described in Appendix A. A recursive procedure is described in Appendix C.
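As an illustration of Eq. (2.5), the following sketch fits simulated data (arbitrary assumed values a = 1.5, b = 0.5 and noise level 0.2) and compares the result with IDL's built-in linfit function:

; sketch: least squares estimates of a and b from Eq. (2.5), synthetic data
m = 100
x = findgen(m)/10
y = 1.5 + 0.5*x + 0.2*randomn(seed, m)   ; simulated measurements
xm = mean(x) & ym = mean(y)
sxy = total((x - xm)*(y - ym))/m
sxx = total((x - xm)^2)/m
bhat = sxy/sxx
ahat = ym - bhat*xm
print, ahat, bhat
print, linfit(x, y)                      ; built-in fit for comparison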
Exercises

1. Write the multivariate normal probability density function $p(g)$ for the case $\Sigma = \sigma^2 I$. Show that the probability density function for a one-dimensional random variable $G$ is a special case. Prove that $\langle G \rangle = \mu$.

2. In the Monty Hall game a contestant is asked to choose between one of three doors. Behind one of the doors is an automobile as prize for choosing the correct door. After the contestant has chosen, Monty Hall opens one of the other two doors to show that the automobile is not there. He then asks the contestant if she wishes to change her mind and choose the other unopened door. Use Bayes' theorem to prove that her correct answer is "yes".

3. Derive the uncertainty for $a$ in (2.6) from the formula for error propagation

   $\sigma_a^2 = \sum_{i=1}^{N}\sigma^2\left(\frac{\partial f}{\partial y_i}\right)^2$.
Chapter 3

Transformations

Up until now we have thought of multispectral images as $(r \times c \times N)$-dimensional arrays of measured pixel intensities. In the present chapter we consider other representations of images which are often useful in image analysis.

3.1 Fourier transforms

Figure 3.1: Fourier series approximation of a sawtooth function. The series was truncated at $k = \pm 4$. The left hand side shows the intensities $|\hat x(k)|^2$.

A periodic function $x(t)$ with period $T$, $x(t) = x(t+T)$, can always be expressed as the infinite Fourier series

   $x(t) = \sum_{k=-\infty}^{\infty}\hat x(k)\,e^{i2\pi(kf)t}$,   (3.1)

where $f = 1/T = \omega/2\pi$ and $e^{ix} = \cos x + i\sin x$. From the orthogonality of the e-functions, the coefficients $\hat x(k)$ in the expansion are given by

   $\hat x(k) = f\int_{-1/2f}^{1/2f}x(t)\,e^{-i2\pi(kf)t}\,dt$.   (3.2)
Figure 3.1 shows an example for the sawtooth function with period $T = 1$:

   $x(t) = t$,   $-1/2 \le t < 1/2$.

Parseval's formula follows directly from (3.2):

   $\sum_k |\hat x(k)|^2 = f\int_{-1/2f}^{1/2f}(x(t))^2\,dt$.

3.1.1 Discrete Fourier transform

Let $g(j)$ be a discrete sample of the real function $g(x)$ (a row of pixels), sampled $c$ times at the sampling interval $\Delta$ over a complete period $T$, i.e.

   $g(j) = g(x = j\Delta)$,   $j = 0 \ldots c-1$.

The corresponding discrete Fourier series is written

   $g(j) = \frac{1}{c}\sum_{k=-c/2}^{c/2}\hat g(k)\,e^{i2\pi(kf)(j\Delta)}$,   $j = 0 \ldots c-1$,   (3.3)

where the truncation frequency $\pm\frac{c}{2}f$ is the highest frequency component that can be determined by the sampling. This frequency is called the Nyquist critical frequency and is given by $1/2\Delta$, so that $f$ is determined by

   $\frac{cf}{2} = \frac{1}{2\Delta}$   or   $f = \frac{1}{c\Delta}$.

(This corresponds to sampling over one complete period: $c\Delta = T$.) Thus (3.3) becomes

   $g(j) = \frac{1}{c}\sum_{k=-c/2}^{c/2}\hat g(k)\,e^{i2\pi kj/c}$,   $j = 0 \ldots c-1$.

With the observation

   $e^{i2\pi(-c/2)j/c} = e^{-i\pi j} = (-1)^j = e^{i\pi j} = e^{i2\pi(c/2)j/c}$,

we can write this as

   $g(j) = \frac{1}{c}\sum_{k=-c/2}^{c/2-1}\hat g(k)\,e^{i2\pi kj/c}$,   $j = 0 \ldots c-1$,

a set of $c$ equations in the $c$ unknown frequency components $\hat g(k)$. Equivalently,

   $g(j) = \frac{1}{c}\sum_{k=0}^{c/2-1}\hat g(k)\,e^{i2\pi kj/c} + \frac{1}{c}\sum_{k=-c/2}^{-1}\hat g(k)\,e^{i2\pi kj/c}$
   $\quad\;\; = \frac{1}{c}\sum_{k=0}^{c/2-1}\hat g(k)\,e^{i2\pi kj/c} + \frac{1}{c}\sum_{k'=c/2}^{c-1}\hat g(k'-c)\,e^{i2\pi(k'-c)j/c}$
   $\quad\;\; = \frac{1}{c}\sum_{k=0}^{c/2-1}\hat g(k)\,e^{i2\pi kj/c} + \frac{1}{c}\sum_{k=c/2}^{c-1}\hat g(k-c)\,e^{i2\pi kj/c}$.
Thus we can write

   $g(j) = \frac{1}{c}\sum_{k=0}^{c-1}\hat g(k)\,e^{i2\pi kj/c}$,   $j = 0 \ldots c-1$,   (3.4)

if we interpret $\hat g(k) \to \hat g(k-c)$ when $k \ge c/2$. The solution to (3.4) for the complex frequency components $\hat g(k)$ is called the discrete Fourier transform and is given by

   $\hat g(k) = \sum_{j=0}^{c-1}g(j)\,e^{-i2\pi kj/c}$,   $k = 0 \ldots c-1$.   (3.5)

This follows from the following orthogonality property:

   $\sum_{j=0}^{c-1}e^{i2\pi(k-k')j/c} = c\,\delta_{k,k'}$.   (3.6)

Eq. (3.4) itself is the discrete inverse Fourier transform. The discrete analog of Parseval's formula is

   $\frac{1}{c}\sum_{k=0}^{c-1}|\hat g(k)|^2 = \sum_{j=0}^{c-1}g(j)^2$.   (3.7)

Determining the frequency components in (3.5) would appear to involve, in all, $c^2$ floating point multiplication operations. The fast Fourier transform (FFT) exploits the structure of the complex e-functions to reduce this to order $c\log c$, see for example [PFTV86].

3.1.2 Discrete Fourier transform of an image

The discrete Fourier transform is easily generalized to two dimensions for the purpose of image analysis. Let $g(i,j)$, $i, j = 0 \ldots c-1$, represent a (quadratic) gray scale image. Its discrete Fourier transform is

   $\hat g(k, \ell) = \sum_{i=0}^{c-1}\sum_{j=0}^{c-1}g(i,j)\,e^{-i2\pi(ik + j\ell)/c}$   (3.8)

and the corresponding inverse transform is

   $g(i,j) = \frac{1}{c^2}\sum_{k=0}^{c-1}\sum_{\ell=0}^{c-1}\hat g(k,\ell)\,e^{i2\pi(ik + j\ell)/c}$.   (3.9)
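A short IDL sketch of (3.8) using the built-in FFT function on a synthetic test image (note that IDL's forward FFT includes the normalization factor $1/c^2$, which Eq. (3.8) places in the inverse transform):

; sketch: 2D discrete Fourier transform of a synthetic 128 x 128 image
c = 128
g = dist(c)                                ; synthetic test image
ghat = fft(g)                              ; forward transform (IDL normalizes by 1/c^2)
power = shift(abs(ghat)^2, c/2, c/2)       ; power spectrum with the origin centered
envi_plot_data, findgen(c), power[*, c/2]  ; profile through the center row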
3.2 Wavelets

Unlike the Fourier transform, which represents a signal (array of pixel intensities) in terms of pure frequency functions, the wavelet transform expresses the signal in terms of functions which are restricted both in terms of frequency and spatial extent. In many applications, this turns out to be particularly efficient and useful. We'll see an example of this in Chapter 7, where we discuss image fusion in more detail. The wavelet transform is discussed in Appendix B.

3.3 Principal components

The principal components transformation forms linear combinations of multispectral pixel intensities which are mutually uncorrelated and which have maximum variance. We assume without loss of generality that $\langle G \rangle = 0$, so that the covariance matrix of a multispectral image is $\Sigma = \langle GG^\top \rangle$, and look for a linear combination $Y = a^\top G$ with maximum variance, subject to the normalization condition $a^\top a = 1$. Since the variance of $Y$ is $a^\top\Sigma a$, this is equivalent to maximizing an unconstrained Lagrange function, see Section 1.4,

   $L = a^\top\Sigma a - 2\lambda(a^\top a - 1)$.

The maximum of $L$ occurs at that value of $a$ for which $\partial L/\partial a = 0$. Recalling the rules for vector differentiation,

   $\frac{\partial L}{\partial a} = 2\Sigma a - 2\lambda a = 0$,

which is the eigenvalue problem

   $\Sigma a = \lambda a$.

Since $\Sigma$ is real and symmetric, the eigenvectors are orthogonal (and normalized). Denote them $a_1 \ldots a_N$ for eigenvalues $\lambda_1 \ge \ldots \ge \lambda_N$. Define the matrix $A = (a_1 \ldots a_N)$, $AA^\top = I$, and let the transformed principal component vector be $Y = A^\top G$ with covariance matrix $\Sigma'$. Then we have

   $\Sigma' = \langle YY^\top \rangle = \langle A^\top GG^\top A \rangle = A^\top\Sigma A = \mathrm{Diag}(\lambda_1 \ldots \lambda_N) = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_N \end{pmatrix} =: \Lambda$.

The fraction of the total variance in the original multispectral image which is described by the first $i$ principal components is

   $\frac{\lambda_1 + \ldots + \lambda_i}{\lambda_1 + \ldots + \lambda_i + \ldots + \lambda_N}$.

If the original multispectral channels are highly correlated, as is usually the case, the first few principal components will account for a very high percentage of the variance of the image. For example, a color composite of the first 3 principal components of a LANDSAT TM scene displays essentially all of the information contained in the 6 spectral components in one single image. Nevertheless, because of the approximation involved in the assumption of a normal distribution, higher order principal components may also contain significant information [JRR99].

The principal components transformation can be performed directly from the ENVI main menu. However the following IDL program illustrates the procedure in detail:

; Principal components analysis
envi_select, title='Choose multispectral image', $
   fid=fid, dims=dims, pos=pos
if (fid eq -1) then return
num_cols = dims[2]+1
num_lines = dims[4]+1
num_pixels = (num_cols*num_lines)
num_channels = n_elements(pos)
image = intarr(num_channels, num_pixels)
for i=0,num_channels-1 do begin
   temp = envi_get_data(fid=fid, dims=dims, pos=pos[i])
   m = mean(temp)
   image[i,*] = temp - m
endfor
; calculate the transformation matrix A
sigma = correlate(image, /covariance, /double)
lambda = eigenql(sigma, eigenvectors=A, /double)
print, 'Covariance matrix'
print, sigma
print, 'Eigenvalues'
print, lambda
print, 'Eigenvectors'
print, A
; transform the image
image = image##transpose(A)
; reform to BSQ format
PC_array = bytarr(num_cols, num_lines, num_channels)
for i = 0,num_channels-1 do PC_array[*,*,i] = $
   reform(image[i,*], num_cols, num_lines, /overwrite)
; output the result to memory
envi_enter_data, PC_array
end

3.4 Minimum noise fraction

Principal components analysis maximizes variance. This doesn't always lead to a sequence of component images of decreasing image quality (i.e. of increasing noise). The MNF transformation minimizes the noise content rather than maximizing variance, so, if this is the desired criterion, it is to be preferred over PCA.

Suppose we can represent a gray scale image $G$ with covariance matrix $\Sigma$ and zero mean as a sum of uncorrelated signal and noise components

   $G = S + N$,
both normally distributed, with covariance matrices $\Sigma_S$ and $\Sigma_N$ and zero mean. Then we have

   $\Sigma = \langle GG^\top \rangle = \langle (S+N)(S+N)^\top \rangle = \langle SS^\top \rangle + \langle NN^\top \rangle$,

since noise and signal are uncorrelated, i.e. $\langle SN^\top \rangle = \langle NS^\top \rangle = 0$. Thus

   $\Sigma = \Sigma_S + \Sigma_N$.   (3.10)

Now let us seek a linear combination $a^\top G$ for which the signal to noise ratio

   $\mathrm{SNR} = \frac{\mathrm{var}(a^\top S)}{\mathrm{var}(a^\top N)} = \frac{a^\top\Sigma_S a}{a^\top\Sigma_N a}$

is maximized. From (3.10) we can write this in the form

   $\mathrm{SNR} = \frac{a^\top\Sigma a}{a^\top\Sigma_N a} - 1$.   (3.11)

Differentiating we get

   $\frac{\partial}{\partial a}\mathrm{SNR} = \frac{2\Sigma a}{a^\top\Sigma_N a} - \frac{a^\top\Sigma a}{(a^\top\Sigma_N a)^2}\,2\Sigma_N a = 0$,

or, equivalently,

   $(a^\top\Sigma_N a)\,\Sigma a = (a^\top\Sigma a)\,\Sigma_N a$.

This condition is met when $a$ solves the generalized eigenvalue problem

   $\Sigma_N a = \lambda\Sigma a$.   (3.12)

Both $\Sigma_N$ and $\Sigma$ are symmetric and the latter is also positive definite. Its Cholesky factorization is $\Sigma = LL^\top$, where $L$ is a lower triangular matrix, and can be thought of as the "square root" of $\Sigma$. Such an $L$ always exists if $\Sigma$ is positive definite. With this, we can write (3.12) as

   $\Sigma_N a = \lambda LL^\top a$

or, equivalently,

   $L^{-1}\Sigma_N(L^\top)^{-1}L^\top a = \lambda L^\top a$

or, with $b = L^\top a$ and commutativity of inverse and transpose,

   $[L^{-1}\Sigma_N(L^{-1})^\top]\,b = \lambda b$,

a standard eigenproblem for a real, symmetric matrix $L^{-1}\Sigma_N(L^{-1})^\top$. From (3.11) we see that the SNR for eigenvalue $\lambda_i$ is just

   $\mathrm{SNR}_i = \frac{a_i^\top\Sigma a_i}{a_i^\top(\lambda_i\Sigma a_i)} - 1 = \frac{1}{\lambda_i} - 1$.

Thus the eigenvector $a_i$ corresponding to the smallest eigenvalue $\lambda_i$ will maximize the signal to noise ratio. Note that (3.12) can be written in the form

   $\Sigma_N A = \Sigma A\Lambda$,   (3.13)
where $A = (a_1 \ldots a_N)$ and $\Lambda = \mathrm{Diag}(\lambda_1 \ldots \lambda_N)$.

The MNF transformation is available in the ENVI environment. It is carried out in two steps which are equivalent to the above. First of all the noise contribution to $G$ is "whitened", i.e. the random vector $N$ has covariance matrix $I$, the identity matrix. Since $\Sigma_N$ can be assumed to be diagonal anyway (the noise in any band is uncorrelated with the noise in any other band), we accomplish this by doing a transformation which divides the components of $G$ by the standard deviations of the noise,

   $X = \Sigma_N^{-1/2}G$,   where   $\Sigma_N^{-1/2}\Sigma_N\Sigma_N^{-1/2} = I$.

The transformed random vector $X$ thus has covariance matrix

   $\Sigma_X = \Sigma_N^{-1/2}\Sigma\,\Sigma_N^{-1/2}$.   (3.14)

Next we do an ordinary principal components transformation on $X$, i.e. $Y = B^\top X$ where

   $B^\top\Sigma_X B = \Lambda_X$,   $B^\top B = I$.   (3.15)

The overall transformation is thus

   $Y = B^\top\Sigma_N^{-1/2}G = A^\top G$,

where $A = \Sigma_N^{-1/2}B$ is not an orthogonal transformation. To see that this transformation is equivalent to solving the generalized eigenvalue problem, consider

   $\Sigma_N A = \Sigma_N\Sigma_N^{-1/2}B = \Sigma_N^{1/2}\Sigma_X B\Lambda_X^{-1} = \Sigma_N^{1/2}\Sigma_N^{-1/2}\Sigma\,\Sigma_N^{-1/2}B\Lambda_X^{-1} = \Sigma A\Lambda_X^{-1}$.

This is equivalent to (3.13) with

   $\lambda_{Xi} = \frac{1}{\lambda_i} = \mathrm{SNR}_i + 1$.

Thus an eigenvalue in the second transformation equal to one corresponds to "pure noise".

Before the transformation can be performed, it is of course necessary to estimate the noise covariance matrix $\Sigma_N$. This can be done for example by differencing with respect to the local mean:

   $(\Sigma_N)_{k\ell} \approx \frac{1}{cr}\sum_{i,j}^{c,r}(g_k(i,j) - m_k(i,j))(g_\ell(i,j) - m_\ell(i,j))$,

where $m_k(i,j)$ is the local mean of pixels in some neighborhood of $(i,j)$.
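A minimal sketch of this noise estimate in IDL, assuming the image is held in a $c \times r \times N$ BSQ array named image (a hypothetical variable) and taking the local mean over a 3 × 3 neighborhood:

; sketch: noise covariance estimated by differencing each band with its local mean
sz = size(image, /dimensions)
num_cols = sz[0] & num_rows = sz[1] & num_bands = sz[2]
noise = fltarr(num_bands, num_cols*num_rows)
for k = 0, num_bands-1 do begin
   band = float(image[*,*,k])
   noise[k,*] = band - smooth(band, 3, /edge_truncate)   ; g_k - m_k
endfor
sigma_n = correlate(noise, /covariance, /double)
print, sigma_n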
3.5 Maximum autocorrelation factor (MAF)

Let $x$ represent the coordinates of a pixel within image $G$, i.e. $x = (i,j)$. We consider the covariance matrix $\Gamma$ between the original image, represented by $G(x)$, and the same image $G(x + \Delta)$ shifted by an amount $\Delta = (\Delta x, \Delta y)$:

   $\Gamma(\Delta) = \langle G(x)G(x+\Delta)^\top \rangle$,

assumed to be independent of $x$. Then $\Gamma(0) = \Sigma$, and furthermore

   $\Gamma(-\Delta) = \langle G(x)G(x-\Delta)^\top \rangle = \langle G(x+\Delta)G(x)^\top \rangle = \left(\langle G(x)G(x+\Delta)^\top \rangle\right)^\top = \Gamma(\Delta)^\top$.

Now we consider the covariance of projections of the original and shifted images:

   $\mathrm{cov}(a^\top G(x),\, a^\top G(x+\Delta)) = a^\top\langle G(x)G(x+\Delta)^\top \rangle a = a^\top\Gamma(\Delta)a = a^\top\Gamma(-\Delta)a = \frac{1}{2}a^\top(\Gamma(\Delta) + \Gamma(-\Delta))a$.   (3.16)

Define $\Sigma_\Delta$ as the covariance matrix of the difference image $G(x) - G(x+\Delta)$, i.e.

   $\Sigma_\Delta = \langle (G(x) - G(x+\Delta))(G(x) - G(x+\Delta))^\top \rangle$
   $\quad\;\; = \langle G(x)G(x)^\top \rangle + \langle G(x+\Delta)G(x+\Delta)^\top \rangle - \langle G(x)G(x+\Delta)^\top \rangle - \langle G(x+\Delta)G(x)^\top \rangle$
   $\quad\;\; = 2\Sigma - \Gamma(\Delta) - \Gamma(-\Delta)$.

Hence $\Gamma(\Delta) + \Gamma(-\Delta) = 2\Sigma - \Sigma_\Delta$ and we can write (3.16) in the form

   $\mathrm{cov}(a^\top G(x),\, a^\top G(x+\Delta)) = a^\top\Sigma a - \frac{1}{2}a^\top\Sigma_\Delta a$.

The correlation of the projections is therefore given by

   $\mathrm{corr}(a^\top G(x),\, a^\top G(x+\Delta)) = \frac{a^\top\Sigma a - \frac{1}{2}a^\top\Sigma_\Delta a}{\sqrt{\mathrm{var}(a^\top G(x))\,\mathrm{var}(a^\top G(x+\Delta))}} = \frac{a^\top\Sigma a - \frac{1}{2}a^\top\Sigma_\Delta a}{\sqrt{(a^\top\Sigma a)(a^\top\Sigma a)}} = 1 - \frac{1}{2}\frac{a^\top\Sigma_\Delta a}{a^\top\Sigma a}$.   (3.17)

We want to determine that vector $a$ which extremalizes this correlation, so we wish to extremalize the function

   $R(a) = \frac{a^\top\Sigma_\Delta a}{a^\top\Sigma a}$.
Differentiating,

   $\frac{\partial R}{\partial a} = \frac{2\Sigma_\Delta a}{a^\top\Sigma a} - \frac{a^\top\Sigma_\Delta a}{(a^\top\Sigma a)^2}\,2\Sigma a = 0$

or

   $(a^\top\Sigma a)\,\Sigma_\Delta a = (a^\top\Sigma_\Delta a)\,\Sigma a$.

This condition is met when $a$ solves the generalized eigenvalue problem

   $\Sigma_\Delta a = \lambda\Sigma a$,   (3.18)

which is seen to have the same form as (3.12). Again both $\Sigma_\Delta$ and $\Sigma$ are symmetric and the latter is also positive definite, and we obtain the standard eigenproblem

   $[L^{-1}\Sigma_\Delta(L^{-1})^\top]\,b = \lambda b$

for the real, symmetric matrix $L^{-1}\Sigma_\Delta(L^{-1})^\top$. Let the eigenvalues be $\lambda_1 \ge \ldots \ge \lambda_N$ and the corresponding (orthogonal) eigenvectors be $b_i$. We have

   $0 = b_i^\top b_j = a_i^\top LL^\top a_j = a_i^\top\Sigma a_j$,   $i \ne j$,

and therefore

   $\mathrm{cov}(a_i^\top G(x),\, a_j^\top G(x)) = a_i^\top\Sigma a_j = 0$,   $i \ne j$,

so that the MAF components are orthogonal (uncorrelated). Moreover, with equation (3.17) we have

   $\mathrm{corr}(a_i^\top G(x),\, a_i^\top G(x+\Delta)) = 1 - \frac{1}{2}\lambda_i$,

and the first MAF component has minimum autocorrelation. An ENVI plug-in for performing the MAF transformation is given in Appendix D.5.2.
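The reduction of (3.18) to a standard symmetric eigenproblem can be sketched in IDL as follows, assuming sigma_d ($\Sigma_\Delta$) and sigma ($\Sigma$) are given $N \times N$ double-precision arrays (hypothetical variables). For brevity the whitening matrix playing the role of $L^{-1}$ is built here from the eigendecomposition of $\Sigma$ rather than from an explicit Cholesky routine:

; sketch: solve Sigma_Delta a = lambda Sigma a via a symmetric eigenproblem
ev = eigenql(sigma, eigenvectors=u, /double)
li = diag_matrix(1/sqrt(ev)) ## transpose(u)   ; whitening: li ## sigma ## transpose(li) = I
cm = li ## sigma_d ## transpose(li)
cm = (cm + transpose(cm))/2                    ; enforce symmetry numerically
lambda = eigenql(cm, eigenvectors=b, /double)  ; lambda_1 >= ... >= lambda_N
a = transpose(li) ## b                         ; back-transformed MAF eigenvectors
print, lambda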
Exercises

1. Show that, for $x(t) = \sin(2\pi t)$ in Eq. (3.2),

   $\hat x(-1) = -\frac{1}{2i}$,   $\hat x(1) = \frac{1}{2i}$,

and $\hat x(k) = 0$ otherwise.

2. Calculate the discrete Fourier transform of the sequence 2, 4, 6, 8 from (3.4). You have to solve four simultaneous equations, the first of which is

   $2 = \frac{1}{4}\bigl(\hat g(0) + \hat g(1) + \hat g(2) + \hat g(3)\bigr)$.

Verify your result in IDL with the command

   print, FFT([2,4,6,8])
Chapter 4

Radiometric enhancement

4.1 Lookup tables

Figure 4.1: Contrast enhancement with a lookup table represented as the continuous function $f(x)$ [JRR99].

Intensity enhancement of an image is easily accomplished by means of lookup tables. For byte-encoded data, the pixel intensities $g$ are used to index an array LUT[k], $k = 0 \ldots 255$, the entries of which also lie between 0 and 255. These entries can be chosen to implement linear stretching, saturation, histogram equalization, etc. according to

   $\hat g_k(i,j) = \mathrm{LUT}[g_k(i,j)]$,   $0 \le i \le r-1$,   $0 \le j \le c-1$.
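As a small illustration (a sketch only, assuming a byte-valued gray-scale array named image), a linear-stretch lookup table saturating roughly the lowest and highest 2% of the pixels can be built and applied like this:

; sketch: build and apply a 2% linear-stretch lookup table
h = histogram(image, min=0, max=255)
cdf = total(h, /cumulative)/n_elements(image)
lower = min(where(cdf gt 0.02))
upper = min(where(cdf gt 0.98))
lut = bytscl(indgen(256), min=lower, max=upper)
stretched = lut[image]                   ; the LUT is applied by array indexing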
It is also useful to think of the lookup table as an approximately continuous function $y = f(x)$. If $h_{in}(x)$ is the histogram of the original image and $h_{out}(y)$ is the histogram of the image after transformation through the lookup table, then, since the number of pixels is constant,

   $h_{out}(y)\,dy = h_{in}(x)\,dx$,

see Fig. 4.1.

4.1.1 Histogram equalization

For histogram equalization, we want $h_{out}(y)$ to be constant, independent of $y$. Hence $dy \sim h_{in}(x)\,dx$ and

   $y = f(x) \sim \int_0^x h_{in}(t)\,dt$.

The lookup table $y$ for histogram equalization is thus proportional to the cumulative sum of the original histogram.

4.1.2 Histogram matching

Figure 4.2: Steps required for histogram matching [JRR99].

It is often desirable to match the histogram of one image to that of another so as to make their apparent brightnesses as similar as possible, for example when the two images
are combined in a mosaic. We can do this by first equalizing both the input histogram $h_{in}(x)$ and the reference histogram $h_{ref}(y)$ with the cumulative lookup tables $z = f(x)$ and $z = g(y)$, respectively. The required lookup table is then

   $y = g^{-1}(z) = g^{-1}(f(x))$.

The necessary steps for implementing this function are illustrated in Fig. 4.2, taken from [JRR99].

4.2 Convolutions

With the convention $\omega = 2\pi k/c$ we can write (3.5) in the form

   $\hat g(\omega) = \sum_{j=0}^{c-1}g(j)\,e^{-i\omega j}$.   (4.1)

The convolution of $g$ with a filter $h = (h(0), h(1), \ldots)$ is defined by

   $f(j) = \sum_k h(k)\,g(j-k) =: h * g$,   (4.2)

where the sum is over all nonzero elements of the filter $h$. If the number of nonzero elements is finite, we speak of a finite impulse response filter (FIR).

Theorem 1 (Convolution theorem) In the frequency domain, convolution is replaced by multiplication: $\hat f(\omega) = \hat h(\omega)\hat g(\omega)$.

Proof:

   $\hat f(\omega) = \sum_j f(j)\,e^{-i\omega j} = \sum_{j,k}h(k)\,g(j-k)\,e^{-i\omega j}$

   $\hat h(\omega)\hat g(\omega) = \sum_k h(k)\,e^{-i\omega k}\sum_\ell g(\ell)\,e^{-i\omega\ell} = \sum_{k,\ell}h(k)\,g(\ell)\,e^{-i\omega(k+\ell)} = \sum_{k,j}h(k)\,g(j-k)\,e^{-i\omega j} = \hat f(\omega)$.

This can of course be generalized to two-dimensional images, so that there are three basic steps involved in image filtering:

1. The image and the convolution filter are transformed from the spatial domain to the frequency domain using the FFT.

2. The transformed image is multiplied with the frequency filter.

3. The filtered image is transformed back to the spatial domain.
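These three steps can be sketched in IDL as follows (a minimal example on a synthetic image, with a 3 × 3 moving-average kernel padded into an array of the image size, in the same manner as the Laplacian of Gaussian program later in this chapter):

; sketch: FFT-based convolution of a synthetic image with a low-pass kernel
c = 128
g = dist(c)                               ; synthetic test image
h = fltarr(c, c)
h[0:2, 0:2] = 1.0/9                       ; 3x3 moving-average filter, zero-padded
f = float(fft(fft(g)*fft(h), 1))          ; transform, multiply, transform back

Apart from wrap-around at the image edges and a shift corresponding to the position of the kernel origin, the result agrees with a spatial-domain convolution (e.g. with IDL's convol function).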
We often distinguish between low-pass and high-pass filters. Low-pass filters perform some sort of averaging. The simplest example is $h = (1/2, 1/2, 0 \ldots)$, which computes the average of two consecutive pixels. A high-pass filter computes differences of nearby pixels, e.g. $h = (1/2, -1/2, 0 \ldots)$. Figure 4.3 shows the Fourier transforms of these two simple filters generated by the IDL program

; Hi-Lo pass filters
x = fltarr(64)
x[0] = 0.5
x[1] = -0.5
p1 = abs(FFT(x))
x[1] = 0.5
p2 = abs(FFT(x))
envi_plot_data, lindgen(64), [[p1],[p2]]
end

Figure 4.3: Low-pass (red) and high-pass (white) filters in the frequency domain. The quantity $|\hat h(k)|^2$ is plotted as a function of $k$. The highest frequency is at the center of the plots, $k = c/2 = 32$.

4.2.1 Laplacian of Gaussian filter

We shall illustrate image filtering with the so-called Laplacian of Gaussian (LoG) filter, which will be used in Chapter 6 to implement contour matching for automatic determination of ground control points. To begin with, consider the gradient operator for a two-dimensional image:

   $\nabla = \frac{\partial}{\partial x} = i\frac{\partial}{\partial x_1} + j\frac{\partial}{\partial x_2}$,
where $i$ and $j$ are unit vectors in the vertical and horizontal directions, respectively. $\nabla g(x)$ is a vector in the direction of the maximum rate of change of gray scale intensity. Since the intensity values are discrete, the partial derivatives must be approximated. For example we can use the Sobel operators:

   $\frac{\partial g(x)}{\partial x_1} \approx [g(i-1,j-1) + 2g(i,j-1) + g(i+1,j-1)] - [g(i-1,j+1) + 2g(i,j+1) + g(i+1,j+1)] = \nabla_2(i,j)$

   $\frac{\partial g(x)}{\partial x_2} \approx [g(i-1,j-1) + 2g(i-1,j) + g(i-1,j+1)] - [g(i+1,j-1) + 2g(i+1,j) + g(i+1,j+1)] = \nabla_1(i,j)$,

which are equivalent to the two-dimensional FIR filters

   $h_1 = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}$   and   $h_2 = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}$,

respectively. The magnitude of the gradient is

   $|\nabla| = \sqrt{\nabla_1^2 + \nabla_2^2}$.

Edge detection can be achieved by calculating the filtered image $f(i,j) = |\nabla|(i,j)$ and setting an appropriate threshold.
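A brief sketch of this gradient-magnitude edge detector in IDL (assuming a gray-scale array g; the threshold of half the maximum gradient is an arbitrary choice). IDL's built-in sobel function performs a similar computation:

; sketch: Sobel gradient magnitude and a simple edge threshold
h1 = float([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
h2 = float([[ 1, 2, 1], [ 0, 0, 0], [-1,-2,-1]])
d1 = convol(float(g), h1, /center, /edge_truncate)
d2 = convol(float(g), h2, /center, /edge_truncate)
mag = sqrt(d1^2 + d2^2)
edges = mag gt 0.5*max(mag)              ; crude threshold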
Figure 4.4: Laplacian of Gaussian filter.

Now consider the second derivatives of the image intensities, which can be represented formally by the Laplacian

   $\nabla^2 = \nabla\cdot\nabla = \frac{\partial^2}{\partial x_1^2} + \frac{\partial^2}{\partial x_2^2}$.

$\nabla^2 g(x)$ is a scalar quantity which is zero whenever the gradient is maximum. Therefore changes in intensity from dark to light or vice versa correspond to sign changes in the Laplacian and these can also be used for edge detection. The Laplacian can also be approximated by a FIR filter, however such filters tend to be very sensitive to image noise. Usually a low-pass Gauss filter is first used to smooth the image before the Laplacian filter is applied. It is more efficient, however, to calculate the Laplacian of the Gauss function itself and then use the resulting function to derive a high-pass filter. The Gauss function in two dimensions is given by

   $\frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2}(x_1^2 + x_2^2)\right)$,

where the parameter $\sigma$ determines its extent. Its Laplacian is

   $\frac{1}{2\pi\sigma^6}(x_1^2 + x_2^2 - 2\sigma^2)\exp\left(-\frac{1}{2\sigma^2}(x_1^2 + x_2^2)\right)$,

a plot of which is shown in Fig. 4.4. The following program illustrates the application of the filter to a gray scale image, see Fig. 4.5:

pro LoG
sigma = 2.0
filter = fltarr(17,17)
for i=0L,16 do for j=0L,16 do $
   filter[i,j] = (1/(2*!pi*sigma^6))*((i-8)^2+(j-8)^2-2*sigma^2) $
                 *exp(-((i-8)^2+(j-8)^2)/(2*sigma^2))
; output as EPS file
thisDevice = !D.Name
set_plot, 'PS'
device, filename='c:\temp\LoG.eps', xsize=4, ysize=4, /inches, /encapsulated
shade_surf, filter
device, /close_file
set_plot, thisDevice
; read a jpg image
filename = Dialog_Pickfile(Filter='*.jpg', /Read)
OK = Query_JPEG(filename, fileinfo)
if not OK then return
xsize = fileinfo.dimensions[0]
ysize = fileinfo.dimensions[1]
window, 11, xsize=xsize, ysize=ysize
Read_JPEG, filename, image1
image = bytarr(xsize, ysize)
image[*,*] = image1[0,*,*]
tvscl, image
; run the filter
filt = image*0.0
filt[0:16,0:16] = filter[*,*]
image1 = float(fft(fft(image)*fft(filt), 1))
; get zero-crossings and display
image2 = bytarr(xsize, ysize)
indices = where( (image1*shift(image1,1,0) lt 0) or (image1*shift(image1,0,1) lt 0) )
image2[indices] = 255
wset, 11
tv, image2
end

Figure 4.5: Image filtered with the Laplacian of Gaussian filter.
Chapter 5

Topographic modelling

Satellite images are two-dimensional representations of the three-dimensional earth surface. The correct treatment of the third dimension, the elevation, is essential for terrain modelling and accurate georeferencing.

5.1 RST transformation

Transformations of spatial coordinates¹ in 3 dimensions which involve only rotations, scaling and translations can be represented by a 4 × 4 transformation matrix $A$,

   $v^* = Av$,   (5.1)

where $v$ is the column vector containing the original coordinates, $v = (X, Y, Z, 1)^\top$, and $v^*$ contains the transformed coordinates, $v^* = (X^*, Y^*, Z^*, 1)^\top$.

¹ The following treatment closely follows Chapter 2 of Gonzalez and Woods [GW02].

For example the translation

   $X^* = X + X_0$
   $Y^* = Y + Y_0$
   $Z^* = Z + Z_0$

corresponds to the transformation matrix

   $T = \begin{pmatrix} 1 & 0 & 0 & X_0 \\ 0 & 1 & 0 & Y_0 \\ 0 & 0 & 1 & Z_0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$,

a uniform scaling by 50% to

   $S = \begin{pmatrix} 1/2 & 0 & 0 & 0 \\ 0 & 1/2 & 0 & 0 \\ 0 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$,
and a simple rotation $\theta$ about the $Z$-axis to

   $R_\theta = \begin{pmatrix} \cos\theta & \sin\theta & 0 & 0 \\ -\sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$,

etc. The complete RST transformation is then

   $v^* = RSTv = Av$.   (5.2)

The inverse transformation is of course represented by $A^{-1}$.

5.2 Imaging transformations

An imaging (or perspective) transformation projects 3D points onto a plane. It is used to describe the formation of a camera image and, unlike the RST transformation, is non-linear since it involves division by coordinate values.

Figure 5.1: Basic imaging process, from [GW02].

In Figure 5.1, the camera coordinate system $(x, y, z)$ is aligned with the world coordinate system, describing the terrain to be imaged. The camera focal length is $\lambda$. From simple geometry we obtain expressions for the image plane coordinates in terms of the world coordinates:

   $x = \frac{\lambda X}{\lambda - Z}$,   $y = \frac{\lambda Y}{\lambda - Z}$.   (5.3)

Solving for the $X$ and $Y$ world coordinates:

   $X = \frac{x}{\lambda}(\lambda - Z)$,   $Y = \frac{y}{\lambda}(\lambda - Z)$.   (5.4)

Thus, in order to extract the geographical coordinates $(X, Y)$ of a point on the earth's surface from its image coordinates, we require knowledge of the elevation $Z$. Correcting for the elevation in this way constitutes the process of orthorectification.
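A small numerical check of (5.3) and (5.4) in IDL (arbitrary assumed values for the focal length and the world point):

; sketch: perspective projection and its inversion for one world point
lambda = 0.5                              ; focal length, arbitrary units
Xw = 1000.0 & Yw = 2000.0 & Zw = 50.0     ; world coordinates
xi = lambda*Xw/(lambda - Zw)              ; image plane coordinates, Eq. (5.3)
yi = lambda*Yw/(lambda - Zw)
print, xi*(lambda - Zw)/lambda, yi*(lambda - Zw)/lambda   ; recovers Xw and Yw, Eq. (5.4)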
5.3 Camera models and RFM approximations

Equation (5.3) is overly simplified, as it assumes that the origins of world and image coordinates coincide. In order to apply it, one has first to transform the image coordinate system from the satellite to the world coordinate system. This is done in a straightforward way with the rotation and translation transformations introduced in Section 5.1. However it requires accurate knowledge of the height and orientation of the satellite imaging system at the time of the image acquisition (or, more exactly, during the acquisition, since the latter is normally not instantaneous). The resulting non-linear equations that relate image and world coordinates are what constitute the camera or sensor model for that particular image.

Direct use of the camera model for image processing is complicated as it requires extremely exact, sometimes proprietary information about the sensor system and its orbit. An alternative exists if the image provider also supplies a so-called rational function model (RFM) which approximates the camera model for each acquisition as a ratio of rational polynomials, see e.g. [TH01]. Such RFMs have the form

   $r' = f(X', Y', Z') = \frac{a(X', Y', Z')}{b(X', Y', Z')}$,   $c' = g(X', Y', Z') = \frac{c(X', Y', Z')}{d(X', Y', Z')}$,   (5.5)

where $c'$ and $r'$ are the column and row (XY) coordinates in the image plane relative to an origin $(c_0, r_0)$ and scaled by factors $c_s$ and $r_s$:

   $c' = \frac{c - c_0}{c_s}$,   $r' = \frac{r - r_0}{r_s}$.

Similarly $X'$, $Y'$ and $Z'$ are relative, scaled world coordinates:

   $X' = \frac{X - X_0}{X_s}$,   $Y' = \frac{Y - Y_0}{Y_s}$,   $Z' = \frac{Z - Z_0}{Z_s}$.

The polynomials $a$, $b$, $c$ and $d$ are typically to third order in the world coordinates, e.g.

   $a(X, Y, Z) = a_0 + a_1X + a_2Y + a_3Z + a_4XY + a_5XZ + a_6YZ + a_7X^2 + a_8Y^2 + a_9Z^2 + a_{10}XYZ + a_{11}X^3 + a_{12}XY^2 + a_{13}XZ^2 + a_{14}X^2Y + a_{15}Y^3 + a_{16}YZ^2 + a_{17}X^2Z + a_{18}Y^2Z + a_{19}Z^3$.

The advantage of using ratios of polynomials is that these are less subject to interpolation error. For a given acquisition the provider fits the RFM to his camera model using a three-dimensional grid of points covering the image and world spaces with a least squares fitting procedure. The RFM is capable of representing the camera model extremely well and can be used as a replacement for it. Both Space Imaging and Digital Globe provide RFMs with their high resolution IKONOS and QuickBird imagery. Below is a sample QuickBird RFM file giving the origins, scaling factors and polynomial coefficients needed in Eq. (5.5).
42 CHAPTER 5. TOPOGRAPHIC MODELLING satId = QB02; bandId = P; SpecId = RPC00B; BEGIN_GROUP = IMAGE errBias = 56.01; errRand = 0.12; lineOffset = 4683; sampOffset = 4154; latOffset = 32.5709; longOffset = 51.8391; heightOffset = 1582; lineScale = 4733; sampScale = 4399; latScale = 0.0256; longScale = 0.0269; heightScale = 500; lineNumCoef = ( +1.162844E-03, -7.011681E-03, -9.993482E-01, -1.119999E-02, -6.682911E-06, +7.591306E-05, +3.632740E-04, -1.111298E-04, -5.842086E-04, +2.212466E-06, -1.275349E-06, +1.279061E-06, +1.918762E-08, -6.957548E-07, -1.240783E-06, -7.644403E-07, +3.479752E-07, +1.259300E-05, +1.085128E-06, -1.571375E-06); lineDenCoef = ( +1.000000E+00, +1.801541E-06, +5.822024E-04, +3.774278E-04, -2.141015E-08, -6.984359E-07, -1.344888E-06, -9.669251E-07, -4.726988E-08, +1.329814E-06, +2.113403E-08, -2.914653E-06,
5.3. CAMERA MODELS AND RFM APPROXIMATIONS 43 -4.367422E-07, +6.988065E-07, +4.035593E-07, +3.275453E-07, -2.740827E-07, -4.147675E-06, -1.074015E-06, +2.218804E-06); sampNumCoef = ( -9.783496E-04, +9.566915E-01, -8.477919E-03, -5.393803E-02, -1.590864E-04, +5.477412E-04, -3.968308E-04, +4.819512E-04, -3.965558E-06, -3.442885E-05, +5.821180E-08, +2.952683E-08, -1.363146E-07, +2.454422E-07, +1.372698E-07, +1.987710E-07, -3.167074E-07, -1.038018E-06, +1.376092E-07, -2.352636E-07); sampDenCoef = ( +1.000000E+00, +5.029785E-04, +1.225257E-04, -5.780883E-04, -1.543054E-07, +1.240426E-06, -1.830526E-07, +3.264812E-07, -1.255831E-08, -5.177631E-07, -5.868514E-07, -9.029287E-07, +7.692317E-08, +1.289335E-07, -3.649242E-07, +0.000000E+00, +1.229000E-07, -1.290467E-05, +4.318970E-08, -8.391348E-08);
44 CHAPTER 5. TOPOGRAPHIC MODELLING END_GROUP = IMAGE END; To illustrate a simple use of the RFM data, consider a vertical structure in a high- resolution image, such as a chimney or building fassade. Suppose we determine the image coordinates of the bottom and top of the structure to be (rb, cb) and (rt, ct), respectively. Then from 5.5 rb = f(X, Y, Zb) cb = g(X, Y, Zb) rt = f(X, Y, Zt) ct = g(X, Y, Zt), (5.6) since the (X, Y ) coordinates must be the same. This would appear to constitute a set of four equations in four unknowns X, Y , Zb and Zt, however the solution is unstable because of the close similarity of Zt to Zb. Nevertheless the object height Zt − Zb can be obtained by the following procedure: 1. Get (rb, cb) and (rt, ct) from the image. 2. Solve first two equations in (5.6) (e.g. with Newton’s method) for X and Y with Zb set equal to the average elevation in the scene if no DEM is available, otherwise to the true elevation. 3. For a spanning range of Zt values, calculate (rt, ct) from the second two equations in (5.6) and choose for Zt the value of Zt which gives closest agreement to the values read in. Quite generally, the RFM can approximate the camera model very well and can be used as an alternative for providing end users with the necessary information to perform their own photogrammetric processing. An ENVI plug-in for object height determination from RFM data is given in Appendix D.2.1. 5.4 Stereo imaging, elevation models and orthorectification The missing elevation information Z in (5.3) or in (5.5) can be obtained with stereoscopic imaging techniques. Figure 5.2 shows two cameras viewing the same world point w from two positions. The separation of the lens centers is the baseline. The objective is to find the coordinates (X, Y, Z) of w if its image points have coordinates (x1, y1) and (x2, y2). We assume that the cameras are identical and that their image coordinate systems are perfectly aligned, differing only in the location of their origins. The Z coordinate of w is the same for both coordinate systems. In Figure 5.3 the first camera is brought into coincidence with the world coordinate system. Then from (5.4), X1 = x1 λ (λ − Z). Alternatively, if the second camera is brought to the origin of the world coordinate system, X2 = x2 λ (λ − Z).
5.4. STEREO IMAGING, ELEVATION MODELS AND ORTHORECTIFICATION 45 Figure 5.2: The stereo imaging process, from [GW02]. Figure 5.3: Top view of Figure 5.2, from [GW02].
46 CHAPTER 5. TOPOGRAPHIC MODELLING But, from the figures, X2 = X1 + B, where B is the baseline. We have from the above three equations: Z = λ − λB x2 − x1 . (5.7) Thus if the displacement of the image coordinates of the point w, namely x2 − x1 can be determined, the Z coordinate can be calculated. The task is then to find two correspond- ing points in different images of the same scene. This is usually accomplished by spatial correlation techniques and is closely related to the problem of image-to-image registration discussed in the next chapter. Figure 5.4: ASTER stereo acquisition geometry. Because the stereo image must be correlated, best results are obtained if they are acquired within a very short time of each other, preferably “along track” if a single platform is used, see Figure 5.4. This figure shows the orientation and imaging geometry of the VNIR 3N and 3B cameras on the ASTER platform for acquiring a stereo full scene. The satellite travels at
5.4. STEREO IMAGING, ELEVATION MODELS AND ORTHORECTIFICATION 47 a speed of 6.7 km/sec at a height of 705 km. A 60 × 60 km2 full scene is scanned in 9 seconds. 55 seconds later the same scene is scanned by the back-looking camera, corresponding to a baseline of 370 km. The along-track geometry means that the stereo pair is unipolar, that is, the displacements due to viewing angle are only along the y axis in the imaging plane. Therefore the spatial correlation algorithm used to match points can be one dimensional. If carried out on a pixel for pixel basis, one obtains a digital elevation model (DEM). Figure 5.5: ASTER 3N nadir camera image. Figure 5.6: ASTER 3B back-looking camera image. As an example, Figures 5.5 and 5.6 show an ASTER stereo pair. Both images have been rotated so as to make them unipolar.
48 CHAPTER 5. TOPOGRAPHIC MODELLING

The following IDL program calculates a very rudimentary DEM:

pro test_correl_images
   height = 705.0
   base = 370.0
   pixel_size = 15.0
   envi_select, title='Choose 1st image', fid=fid1, dims=dims1, pos=pos1, /band_only
   envi_select, title='Choose 2nd image', fid=fid2, dims=dims2, pos=pos2, /band_only
   im1 = envi_get_data(fid=fid1,dims=dims1,pos=pos1)
   im2 = envi_get_data(fid=fid2,dims=dims2,pos=pos2)
   n_cols = dims1[2]-dims1[1]+1
   n_rows = dims1[4]-dims1[3]+1
   parallax = fltarr(n_cols,n_rows)
   progressbar = Obj_New('progressbar', Color='blue', Text='0', $
                title='Cross correlation, column ...',xsize=250,ysize=20)
   progressbar->Start
   for i=7L,n_cols-8 do begin
      if progressbar->CheckCancel() then begin
         envi_enter_data, pixel_size*parallax*(height/base)
         progressbar->Destroy
         return
      endif
      progressbar->Update, (i*100)/n_cols, text=strtrim(i,2)
      for j=25L,n_rows-26 do begin
; cross-correlate a window in the nadir image with a search window in the back-looking image
         cim = correl_images(im1[i-5:i+5,j-5:j+5], im2[i-7:i+7,j-25:j+25], $
                xoffset_b=0, yoffset_b=-20, xshift=0, yshift=20)
         corrmat_analyze, cim, xoff, yoff, m, e, p
; clamp the parallax from below (the operator here is assumed; it was lost in extraction)
         parallax[i,j] = yoff > (-5.0)
      endfor
   endfor
   progressbar->Destroy
   envi_enter_data, pixel_size*parallax*(height/base)
end

This program makes use of the routines correl_images and corrmat_analyze from the IDL Astronomy User's Library2 to calculate the cross-correlation of the two images. For each pixel in the nadir image an 11 × 11 window is moved along an 11 × 51 window in the back-looking image centered at the same position. The point of maximum correlation defines the parallax or displacement p. This is related to the relative elevation e of the pixel according to
$$e = \frac{h}{b}\, p \times 15\,\mathrm{m},$$
where h is the height of the sensor and b is the baseline, see Figure 5.7. Figure 5.8 shows the result. Clearly there are many problems due to the correlation errors, however the relative elevations are approximately correct when compared to the DEM determined with the ENVI commercial add-on AsterDTM, see Figure 5.9.

2www.astro.washington.edu/deutsch/idl/htmlhelp/index.html
5.4. STEREO IMAGING, ELEVATION MODELS AND ORTHORECTIFICATION 49

Figure 5.7: Relating parallax p to elevation e by similar triangles: e/p = (h − e)/b ≈ h/b.

Figure 5.8: A rudimentary DEM.
50 CHAPTER 5. TOPOGRAPHIC MODELLING

Figure 5.9: DEM generated with the commercial product AsterDTM.

Either the complete camera model or an RFM can be used, but usually neither is sufficient for an absolute DEM relative to mean sea level. Most often additional ground reference points within the image, whose elevations are known, are also required for absolute calibration. The orthorectification of the image is carried out on the basis of a suitable DEM and consists of projecting the (X, Y, Z) coordinates of each pixel onto the (X, Y) coordinates of a given map projection.

5.5 Slope and aspect

Terrain analysis involves the processing of elevation data. Specifically we consider here the generation of slope images, which give the steepness of the terrain at each pixel, and aspect images, which give the prevailing direction relative to north of a vector normal to the landscape at each pixel. A 3×3 pixel window can be used to determine both slope and aspect, see Figure 5.10. Define
$$\Delta x_1 = c - a, \quad \Delta y_1 = a - g, \quad \Delta x_2 = f - d, \quad \Delta y_2 = b - h, \quad \Delta x_3 = i - g, \quad \Delta y_3 = c - i$$
and
$$\Delta x = (\Delta x_1 + \Delta x_2 + \Delta x_3)/(3 x_s), \qquad \Delta y = (\Delta y_1 + \Delta y_2 + \Delta y_3)/(3 y_s),$$
where $x_s$, $y_s$ give the pixel dimensions in meters. Then the slope in % at the central pixel position is given by
$$s = \sqrt{\frac{(\Delta x)^2 + (\Delta y)^2}{2}} \times 100,$$
whereas the aspect in radians measured clockwise from north is
$$\theta = \tan^{-1}\frac{\Delta x}{\Delta y}.$$
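The following IDL sketch applies the window formulas above to a DEM held in memory as a floating point array; the routine name and keywords are illustrative assumptions, border pixels are simply left at zero, and in practice the equivalent ENVI menu function would normally be used.

pro slope_aspect, dem, xs, ys, slope=slope, aspect=aspect
   sz = size(dem)
   n_cols = sz[1] & n_rows = sz[2]
   slope  = fltarr(n_cols, n_rows)
   aspect = fltarr(n_cols, n_rows)
   for j = 1L, n_rows-2 do for i = 1L, n_cols-2 do begin
; elevations in the 3x3 window, labeled as in Figure 5.10
      za = dem[i-1,j-1] & zb = dem[i,j-1] & zc = dem[i+1,j-1]
      zd = dem[i-1,j]                     & zf = dem[i+1,j]
      zg = dem[i-1,j+1] & zh = dem[i,j+1] & zi = dem[i+1,j+1]
      dx = ((zc-za) + (zf-zd) + (zi-zg))/(3*xs)
      dy = ((za-zg) + (zb-zh) + (zc-zi))/(3*ys)
      slope[i,j]  = sqrt((dx^2 + dy^2)/2)*100   ; slope in percent, as in the text
      aspect[i,j] = atan(dx, dy)                ; radians clockwise from north
   endfor
end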
5.6. ILLUMINATION CORRECTION 51 a b c d e f g h i Figure 5.10: Pixel elevations in an 8-neighborhood. The letters represent elevations. Slope/aspect determinations from a DEM are available in the ENVI main menu under Topographic/Topographic Modelling. 5.6 Illumination correction Figure 5.11: Angles involved in computation of local solar elevation, taken from [RCSA03]. Topographic modelling can be used to correct images for the effects of local solar illu- mination, which depends not only upon the sun’s position (elevation and azimuth) but also upon the local slope and aspect of the terrain being illuminated. Figure 5.11 shows the angles involved [RCSA03]. Solar elevation is θi, solar azimuth is φa, θp is the slope and φ0 is the aspect. The quantity to be calculated is the local solar elevation γi which determines
52 CHAPTER 5. TOPOGRAPHIC MODELLING

the local irradiance. From trigonometry we have
$$\cos\gamma_i = \cos\theta_p \cos\theta_i + \sin\theta_p \sin\theta_i \cos(\phi_a - \phi_0). \tag{5.8}$$
An example of a $\cos\gamma_i$ image in hilly terrain is shown in Figure 5.12.

Figure 5.12: Cosine of local solar illumination angle stretched across a DEM.

Let $\rho_T$ represent the reflectance of the inclined surface in Figure 5.11. Then for a Lambertian surface, i.e. a surface which scatters reflected radiation uniformly in all directions, the reflectance of the corresponding horizontal surface $\rho_H$ would be
$$\rho_H = \rho_T \frac{\cos\theta_i}{\cos\gamma_i}. \tag{5.9}$$
The Lambertian assumption is in general not correct, the actual reflectance being described by a complicated bidirectional reflectance distribution function (BRDF). An empirical approach which gives a better approximation to the BRDF is the C-correction [TGG82]. Let m and b be the slope and intercept of a regression line of reflectance vs. $\cos\gamma_i$ for a particular image band. Then instead of (5.9) one uses
$$\rho_H = \rho_T \frac{\cos\theta_i + b/m}{\cos\gamma_i + b/m}. \tag{5.10}$$
An ENVI plug-in for illumination correction with the C-correction approximation is given in Appendix D.2.2.
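A minimal sketch of the C-correction is given below, assuming that the reflectance band and the slope and aspect bands (in radians) are already available as IDL arrays and that the solar angles are scalars; the routine names and keywords are illustrative and this is not the plug-in of Appendix D.2.2.

function c_correction, rho_t, cos_gamma, theta_i, slope_m, intercept_b
; regression-based correction factor, Eq. (5.10)
   c = intercept_b/slope_m
   return, rho_t*(cos(theta_i) + c)/(cos_gamma + c)
end

pro illumination_example, rho_t, slope, aspect, theta_i, phi_a
; local solar illumination, Eq. (5.8)
   cos_gamma = cos(slope)*cos(theta_i) + sin(slope)*sin(theta_i)*cos(phi_a - aspect)
; fit reflectance against cos(gamma) to obtain intercept b and slope m
   coeffs = linfit(cos_gamma[*], rho_t[*])          ; coeffs = [b, m]
   rho_h = c_correction(rho_t, cos_gamma, theta_i, coeffs[1], coeffs[0])
   envi_enter_data, rho_h                           ; return corrected band to ENVI
end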
Chapter 6 Image Registration Image registration, either to another image or to a map, is a fundamental task in image processing. It is required for georeferencing, stereo imaging, accurate change detection, or any kind of multitemporal image analysis. Image-to-image registration methods can be divided into roughly four classes [RC96]: 1. algorithms that use pixel values directly, i.e. correlation methods 2. frequency- or wavelet-domain methods that use e.g. the fast fourier transform(FFT) 3. feature-based methods that use low-level features such as edges and corners 4. algorithms that use high level features and the relations between them, e.g. object- oriented methods We consider examples of frequency-domain and feature-based methods here. 6.1 Frequency domain registration Consider two N × N gray scale images g1(i , j ) and g2(i, j), where g2 is offset relative to g1 by an integer number of pixels: g2(i, j) = g1(i , j ) = g1(i − i0, j − j0), i0, j0 N. Taking the Fourier transform we have ˆg2(k, l) = ij g1(i − i0, j − j0)e−i2π(ik+jl)/N , or with a change of indices to i j , ˆg2(k, l) = i j g1(i , j )e−i2π(i k+j l)/N e−i2π(i0k+j0l)/N = ˆg1(k, l)e−i2π(i0k+j0l)/N . (This is referred to as the Fourier translation property.) Therefore we can write ˆg2(k, l)ˆg∗ 1(k, l) |ˆg2(k, l)ˆg∗ 1(k, l)| = e−i2π(i0k+j0l)/N , (6.1) 53
54 CHAPTER 6. IMAGE REGISTRATION

Figure 6.1: Phase correlation of two identical images shifted by 10 pixels.

where $\hat g_1^*$ is the complex conjugate of $\hat g_1$. The inverse transform of the right hand side exhibits a Dirac delta function (spike) at the coordinates $(i_0, j_0)$. Thus if two otherwise identical images are offset by an integer number of pixels, the offset can be found by taking their Fourier transforms, computing the ratio on the left hand side of (6.1) (the so-called cross-power spectrum) and then taking the inverse transform of the result. The position of the maximum value in the inverse transform gives the values of $i_0$ and $j_0$. The following IDL program illustrates the procedure, see Fig. 6.1.

; Image matching by phase correlation
; read a bitmap image and cut out two 512x512 pixel arrays
filename = Dialog_Pickfile(Filter='*.jpg',/Read)
if filename eq '' then print, 'cancelled' else begin
   Read_JPEG, filename, image
   g1 = image[0,10:521,10:521]
   g2 = image[0,0:511,0:511]
; perform Fourier transforms
   f1 = fft(g1, /double)
   f2 = fft(g2, /double)
; determine the offset
   g = fft( f2*conj(f1)/abs(f1*conj(f1)), /inverse, /double )
6.2. FEATURE MATCHING 55

   pos = where(g eq max(g))
   print, 'Offset = ' + strtrim(pos mod 512,2) + ' ' + strtrim(pos/512,2)
; output as EPS file
   thisDevice = !D.Name
   set_plot, 'PS'
   Device, Filename='c:\temp\phasecorr.eps',xsize=4,ysize=4,/inches,/Encapsulated
; plot the magnitude of the correlation peak (the sub-array is complex with a leading unit dimension)
   shade_surf, abs(reform(g[0,0:50,0:50]))
   device, /close_file
   set_plot, thisDevice
endelse
end

Images which differ not only by an offset but also by a rigid rotation and change of scale can in principle be registered similarly, see [RC96].

6.2 Feature matching

A tedious task associated with image-image registration using low level image features is the setting of ground control points (GCPs), since, in general, it is necessary to resort to manual entry. However, various techniques for the automatic determination of GCPs have been suggested in the literature. We will discuss one such method, namely contour matching [LMM95]. This technique has been found to function reliably in bitemporal scenes in which vegetation changes do not dominate. It can of course be augmented (or replaced) by other automatic methods or by manual determination. The procedures involved in image-image registration using contour matching are shown in Fig. 6.2 [LMM95]: LoG zero crossing, edge strength, contour finder, chain code encoder, closed contour matching, consistency check and warping, applied to Image 1 and Image 2 to produce the registered Image 2.

Figure 6.2: Image-image registration with contour matching.
56 CHAPTER 6. IMAGE REGISTRATION 6.2.1 Contour detection The first step involves the application of a Laplacian of Gaussian filter to both images. After determining the contours by examining zero-crossings of the LoG-filtered image, the contour strengths are encoded in the pixel intensities. Strengths are taken to be proportional to the magnitude of the gradient at the zero-crossing. 6.2.2 Closed contours In the next step, all closed contours with strengths above some given threshold are deter- mined by tracing the contours. Pixels which have been visited during tracing are set to zero so that they will not be visited again. 6.2.3 Chain codes For subsequent matching purposes, all significant closed contours found in the preceding step are chain encoded. Any digital curve can be represented by an integer sequence {a1, a2 . . . ai . . .}, ai ∈ {0, 1, 2, 3, 4, 5, 6, 7}, depending on the relative position of the current pixel with respect to the previous pixel in the curve. This simple code has the drawback that some contours produce wrap around. For example the line in the direction −22.5o has the chain code {707070 . . .}. Li et al. [LMM95] suggest the smoothing operation: {a1a2 . . . an} → {b1b2 . . . bn}, where b1 = a1 and bi = qi, qi is an integer satisfying (qi−ai) mod 8 = 0 and |qi−bi−1| → min, i = 2, 3 . . . n. They also suggest the applying the Gaussian smoothing filter {0.1, 0.2, 0.4, 0.2, 0.1} to the result. Two chain codes can be compared by “sliding” one over the other and determining the maximum correlation between them. 6.2.4 Invariant moments The closed contours are first matched according to their invariant moments. These are defined as follows, see [Hab95, GW02]. Let the set C denote the set of pixels defining a contour, with |C| = n, that is, n is the number of pixels on the contour. The moment of order p, q of the contour is defined as mpq = i,j∈C jp iq . (6.2) Note that n = m00. The center of gravity xc, yc of the contour is thus xc = m10 m00 , yc = m01 m00 . The centralized moments are then given by µpq = i,j∈C (j − xc)p (i − yc)q , (6.3)
6.2. FEATURE MATCHING 57 and the normalized centralized moments by ηpq = 1 µ (p+q)/2+1 00 µpq. (6.4) For example, η20 = 1 µ2 00 µ20 = 1 n2 i,j∈C (j − yc)2 . The normalized centralized moments are, apart from effects of digital quantization, invariant under scale changes and translations of the contours. Finally, we can define moments which are also invariant under rotations, see [Hu62]. The first two such invariant moments are h1 = η20 + η02 h2 = (η20 − η02)2 + 4η2 11. (6.5) For example, consider a general rotation of the coordinate axes with origin at the center of gravity of a contour: j i = cos θ sin θ − sin θ cos θ j i = A j i . The first invariant moment in the rotated coordinate system is h1 = 1 n2 i ,j ∈C (j 2 + i 2 ) = 1 n2 i ,j ∈C (j , i ) j i = 1 n2 i,j∈C (j, i)A A j i = 1 n2 i,j∈C (j2 + i2 ), since A A = I. 6.2.5 Contour matching Each significant contour in one image is first matched with contours in the second image according to their invariant moments h1, h2. This is done by setting a threshold on the allowed differences, for instance 1 standard deviation. If one or more matches is found, the best candidate for a GCP pair is then chosen to be that matched contour in the second image for which the chain code correlation with the contour in the first image is maximum. If the maximum correlation is less that some threshold, e.g. 0.9, then no match is found. The actual GCP coordinates are taken to be the centers of gravity of the matched contours. 6.2.6 Consistency check The contour matching procedure invariably generates false GCP pairs, so a further process- ing step is required. In [LMM95] use is made of the fact that distances are preserved under a rigid transformation. Let A1A2 represent the distance between two points A1 and A2 in
58 CHAPTER 6. IMAGE REGISTRATION an image. For two sets of m matched contour centers {Ai} and {Bi} in image 1 and 2, the ratios AiAj/BiBj, i = 1 . . . m, j = i + 1 . . . m, are calculated. These should form a cluster, so that pairs scattered away from the cluster center can be rejected as false matches. An ENVI plug-in for GCP determination via contour matching is given in Appendix D.3. 6.3 Re-sampling and warping We represent with (x, y) the coordinates of a point in image 1 and the corresponding point in image 2 with (u, v). A second order polynomial map of image 2 to image 1, for example, is given by u = a0 + a1x + a2y + a3xy + a4x2 + a5y2 v = b0 + b1x + b2y + b3xy + b4x2 + b5y2 . Since there are 12 unknown coefficients, we require at least 6 GCP pairs to determine the map (each pair generates 2 equations). If more than 6 pairs are available, the coefficients can be found by least squares fitting. This has the advantage that an RMS error for the mapping can be estimated. Similar considerations apply for lower or higher order polynomial maps. Having determined the map coefficients, image 2 can be registered to image 1 by re- sampling. Nearest neighbor resampling simply chooses the actual pixel in image 2 that has its center nearest the calculated coordinates (u, v) and transfers it to location (x, y). This is the preferred technique for classification or change detection, since the registered image consists of the original pixel brightnesses, simply rearranged in position to give a correct image geometry. Other commonly used resampling methods are bilinear interpolation and cubic convolution interpolation, see [JRR99] for details. These methods mix the spectral intensities of neighboring pixels.
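As a sketch of the least squares determination of the map coefficients from GCP pairs, the following IDL fragment fits the second order polynomial above with the library routine REGRESS; the routine name, the keyword names and the assumption that at least six GCP pairs are supplied in the vectors x, y, u, v are illustrative. (IDL's built-in POLYWARP procedure performs a closely related fit.)

pro warp_coefficients, x, y, u, v, a=a, b=b
; design matrix of the five non-constant terms, dimensions [5, n_points]
   XX = transpose([[x], [y], [x*y], [x^2], [y^2]])
   a1 = regress(XX, u, const=a0)
   b1 = regress(XX, v, const=b0)
   a = [a0, reform(a1)]    ; coefficients a0 ... a5
   b = [b0, reform(b1)]    ; coefficients b0 ... b5
; RMS error of the fitted map in the u coordinate
   u_fit = a[0] + a[1]*x + a[2]*y + a[3]*x*y + a[4]*x^2 + a[5]*y^2
   print, 'RMS error (u): ', sqrt(mean((u - u_fit)^2))
end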
6.3. RE-SAMPLING AND WARPING 59

Exercises

1. We can approximate the centralized moments (6.3) of a contour by the integral
$$\mu_{pq} = \int\!\!\int (x - x_c)^p (y - y_c)^q f(x, y)\, dx\, dy,$$
where the integration is over the whole image and where f(x, y) = 1 if the point (x, y) lies on the contour and f(x, y) = 0 otherwise. Use this approximation to prove that the normalized centralized moments $\eta_{pq}$ given in (6.4) are invariant under scaling transformations of the form
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \alpha & 0 \\ 0 & \alpha \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}.$$
Chapter 7 Image Sharpening The change detection and classification algorithms that we will meet in the next chapters exploit of course not only the spatial but also the spectral information of satellite imagery. Many common platforms (Landsat 7 TM, IKONOS, SPOT, QuickBird) offer panchromatic images with higher ground resolution than that of the spectral channels. Application of mul- tispectral change detection or classification methods is therefore restricted to the lower res- olution. Conventional image fusion techniques, such as the well-known HSV-transformation can be used to sharpen the spectral components, however the effect of mixing-in of the panchromatic image is often to “dilute” the spectral resolution. Another disadvantage of the HSV transformation is that one is restricted to using three of the available spectral channels. In the following we will outline the HSV method and then consider alternative fusion techniques. 7.1 HSV fusion In computers with 24-bit graphics (true color), any three channels of a multispectral image can be displayed with 8 bits for each of the additive primary colors red, green and blue. The monitor displays this as an RGB color composite image which, depending on the choice of image channels and their relative intensities, may or may not appear to be natural. There are 224 ≈ 16 million colors possible. Another means of color definition is in terms of hue, saturation and value (HSV). Value (or intensity) can be thought of as an axis equidistant from the three orthogonal primary color axes. Hue refers to the actual color and is defined as an angle on a circle perpendicular to the value axis. Saturation is the “amount” of color present and is represented by the radius of the circle described by the hue, A commonly used method for fusion of two images (for example a lower resolution multi- spectral image with a higher resolution panchromatic image) is to transform the first image from RGB to HSV space, replace the V component with the grayscale values of the second image after performing a radiometric normalization, and then transform back to RGB space. The forward transformation begins by rotating the RGB coordinate axes into the diagonal 61
62 CHAPTER 7. IMAGE SHARPENING axis of the RGB color cube. The coordinates in the new reference system are given by   m1 m2 i1   =   2/ √ 6 −1/ √ 6 −1/ √ 6 0 1/ √ 2 −1/ √ 2 1/ √ 3 1/ √ 3 1/ √ 3   ·   R G B   . Then the the rectangular coordinates (m1, m2, i1) are transformed into the cylindrical HSV coordinates: H = arctan(m1/m2), S = m2 1 + m2 2, I = √ 3 i1. The following IDL code illustrates the necessary steps for HSV fusion making use of ENVI batch procedures. These are also invoked directly from the ENVI main menu. pro HSVFusion, event ; get MS image envi_select, title=’Select low resolution three-band input file’, $ fid=fid1, dims=dims1, pos=pos1 if (fid1 eq -1) or (n_elements(pos1) ne 3) then return ; get PAN image envi_select, title=’Select panchromatic image’, $ fid=fid2, pos=pos2, dims=dims2, /band_only if (fid2 eq -1) then return envi_check_save, /transform ; linear stretch the images and convert to byte format envi_doit,’stretch_doit’, fid=fid1, dims=dims1, pos=pos1, method=1, $ r_fid=r_fid1, out_min=0, out_max=255, $ range_by=0, i_min=0, i_max=100, out_dt=1, out_name=’c:temphsv_temp’ envi_doit,’stretch_doit’, fid=fid2, dims=dims2, pos=pos2, method=1, $ r_fid=r_fid2, out_min=0, out_max=255, $ range_by=0, i_min=0, i_max=100, out_dt=1, /in_memory envi_file_query, r_fid2, ns=f_ns, nl=f_nl f_dims = [-1l, 0, f_ns-1, 0, f_nl-1] ; HSV sharpening envi_doit, ’sharpen_doit’, $ fid=[r_fid1,r_fid1,r_fid1], pos=[0,1,2], f_fid=r_fid2, $ f_dims=f_dims, f_pos=[0], method=0, interp=0, /in_memory ; remove temporary files from ENVI envi_file_mng, id=r_fid1, /remove, /delete envi_file_mng, id=r_fid2, /remove end
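The rotation into (m1, m2, i1) coordinates and the cylindrical hue, saturation and value coordinates described above can also be written out directly; the following sketch is illustrative only and is not the transformation used internally by ENVI's sharpen_doit.

pro rgb_to_hsv_sketch, r, g, b, h=h, s=s, v=v
; rotation of the RGB axes onto the diagonal of the color cube
   m1 = (2*float(r) - g - b)/sqrt(6)
   m2 = (float(g) - b)/sqrt(2)
   i1 = (float(r) + g + b)/sqrt(3)
; cylindrical coordinates
   h = atan(m1, m2)           ; hue
   s = sqrt(m1^2 + m2^2)      ; saturation
   v = sqrt(3)*i1             ; value (intensity)
end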
7.2. BROVEY FUSION 63 7.2 Brovey fusion In its simplest form this method multiplies each re-sampled multispectral pixel by the ratio of the corresponding panchromatic pixel intensity to the sum of all of the multispectral intensities. The corrected pixel intensities ¯gk(i, j) in the kth fused multispectral channel are given by ¯gk(i, j) = gk(i, j) · gp(i, j) k gk (i, j) , (7.1) where gk(i, j) is the (re-sampled) pixel intensity in the kth channel and gp(i, j) is the corre- sponding pixel intensity in the panchromatic image. (The ENVI-environment offers Brovey fusion in its main menu.) This technique assumes that the spectral range spanned by the panchromatic image is essentially the same as that covered by the multispectral channels. This is seldom the case. Moreover, to avoid bias, the intensities used should be the radiances at the satellite sensors, implying use of the sensors’ calibration. 7.3 PCA fusion Panchromatic sharpening using principal components analysis (PCA) is similar to the HSV method. After the PCA transformation, the first principal component is replaced by the panchromatic image, again after radiometric normalization, see Figure 7.1. Figure 7.1: Panchromatic fusion with the principal components transformation. Image sharpening using PCA and the closely related Gram-Schmidt transformation is available from the ENVI main menu.
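A minimal sketch of the Brovey fusion formula (7.1) of Section 7.2 is given below, assuming three multispectral bands already re-sampled to the panchromatic pixel size; the function name and the guard against division by zero are illustrative.

function brovey_fusion, ms1, ms2, ms3, pan
; sum of the multispectral intensities, guarded against division by zero
   denom = (float(ms1) + ms2 + ms3) > 1e-6
; Eq. (7.1) applied band by band, returned as a three-band cube
   return, [[[ms1*pan/denom]], [[ms2*pan/denom]], [[ms3*pan/denom]]]
end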
64 CHAPTER 7. IMAGE SHARPENING

7.4 Wavelet fusion

Wavelets provide an efficient means of representing high and low frequency components of multispectral images and can be used to perform image sharpening. Two examples are given here.

7.4.1 Discrete wavelet transform

The discrete wavelet transform (DWT) of a two-dimensional image is shown in Appendix B to be equivalent to an iterative application of the high-/low-pass filter bank illustrated in Figure 7.2.

Figure 7.2: Wavelet filter bank. H is a low-pass and G a high-pass filter derived from the coefficients of the wavelet transformation. The symbol ↓ indicates downsampling by a factor of 2. The filters are applied successively to the columns and rows of the input image $g_k(i,j)$, producing the low-pass image $g_{k+1}(i,j)$ and the wavelet coefficients $C^H_{k+1}(i,j)$, $C^V_{k+1}(i,j)$ and $C^D_{k+1}(i,j)$.

The original image $g_k(i, j)$ can be reconstructed by inverting the filter. A single application of the filter corresponding to the Daubechies D4 wavelet to a satellite image $g_1(i, j)$ (1 m resolution) is shown in Figure B.12. The high frequency information (wavelet coefficients) is stored in the arrays $C^H_2$, $C^V_2$ and $C^D_2$ and displayed in the upper right, lower left and lower right quadrants, respectively. The original image with its resolution degraded by a factor of two, $g_2(i, j)$, is in the upper left quadrant. Applying the filter bank iteratively to the upper left quadrant yields a further reduction by a factor of 2. The fusion procedure for IKONOS or QuickBird imagery, for instance, in which the resolutions of the panchromatic and the 4 multispectral components differ exactly by a factor of 4, is then as follows: Both the degraded panchromatic image and the four multispectral images are compressed once again (e.g. to 8 m resolution in the case of IKONOS) and the high frequency components $C^z_4$, $z = H, V, D$, are sampled to estimate the correction coefficients
$$a^z = \sigma^z_{ms}/\sigma^z_{pan}, \qquad b^z = m^z_{ms} - a^z\, m^z_{pan}, \tag{7.2}$$
where $m^z$ and $\sigma^z$ denote mean and standard deviation, respectively. These coefficients are then used to normalize the wavelet coefficients for the panchromatic image to those of the multispectral image:
$$C^z_i(i, j) \to a^z\, C^z_i(i, j) + b^z, \qquad z = H, V, D, \quad i = 2, 3. \tag{7.3}$$
7.4. WAVELET FUSION 65 The degraded panchromatic image g3(i, j) is then replaced by the each of the four multispec- tral images and the normalized wavelet coefficients are used to reconstruct the original 1m resolution. We thus obtain what would be seen if the multispectral sensors had the resolution of the panchromatic sensor [RW00]. An ENVI plug-in for panchromatic sharpening with the DWT is given in Appendix D.4.1. 7.4.2 `A trous filtering The radiometric fidelity obtained with the discrete wavelet transform is excellent, as will be shown in the next section. However the lack of translational invariance of the DWT often leads to spatial artifacts (blurring, shadowing, staircase effect) in the sharpened product. This is illustrated in the following program, in which an image is transformed once with the DWT and the low-pass quadrant shifted by one pixel relative to the high-pass quadrants (i.e. the wavelet coefficients). After inverting the transformation, serious degradation is apparent, see Figure 7.3. pro translate_wavelet ; get an image band envi_select, title=’Select input file’, $ fid=fid, dims=dims, pos=pos, /band_only if fid eq -1 then return ; create a DWT object aDWT = Obj_New(’DWT’,envi_get_data(fid=fid,dims=dims,pos=pos)) ; compress aDWT-compress ; shift the compressed portion supressing phase correlation match aDWT-inject,shift(aDWT-Get_Quadrant(0),[1,1]),pc=0 ; restore aDWT-expand ; return result to ENVI envi_enter_data, aDWT-get_image() end As an alternative to the DWT, the `a trous wavelet transform (ATWT) has been proposed for image sharpening [AABG02]. The ATWT is a multiresolution decomposition defined formally by a low-pass filter H = {h(0), h(1), . . .} and a high-pass filter G = δ − H, where δ denotes an all-pass filter. Thus the high frequency part is just the difference between the original image and low-pass filtered image. Not surprisingly, this transformation does not allow perfect reconstruction if the output is downsampled. Therefore downsampling is not performed at all. Rather, at the kth iteration of the low-pass filter, 2k−1 zeroes are inserted between the elements of H. This means that every other pixel is interpolated on the first iteration: H = {h(0), 0, h(1), 0, . . .}, while on the second iteration H = {h(0), 0, 0, h(1), 0, 0, . . .} etc. (hence the name `a trous = with holes). The low-pass filter is usually chosen to be symmetric (unlike the Daubechies wavelet filters for example). The prototype filter chosen
66 CHAPTER 7. IMAGE SHARPENING here is the cubic B-spline filter H = {1/16, 1/4, 3/8, 1/4, 1/16}. The transformation is highly redundant and requires considerably more computer storage to implement. However when used for image sharpening it is much less sensitive to mis- alignment between the multispectral and panchromatic images. Figure 7.3: Artifacts due to lack of translational invariance of the DWT. Figure 7.4 outlines the scheme implemented in the ENVI plug-in for ATWT panchromatic sharpening. The MS band is nearest-neighbor upsampled by a factor of 2 to match the dimensions of the high resolution band. The `a trous transformation is applied to both bands (columns and rows are filtered with the upsampled cubic spline filter, with the difference determining the high-pass result). The high frequency component of the pan image is normalized to that of the MS image in the same way as for DWT sharpening, equations (7.2) and (7.3). Then the low frequency pan component is replaced by the filtered MS image and the transformation inverted. An ENVI plug-in for ATWT sharpening is described in Appendix D.4.2. 7.5 Quality indices Wang and Bovik [WB02] suggest the following measure of radiometric fidelity between two image bands f and g:
7.5. QUALITY INDICES 67 E E E E E T + G G ↑H ↑H T insert E T c normalize MS Pan MS(sharpened)
↑ Figure 7.4: `A trous image sharpening scheme for an MS to panchromatic resolution ratio of two. The symbol ↑H denotes the upsampled low-pass filter. Figure 7.5: Comparison of three image sharpening methods with the Wang-Bovik quality index. Left to right: Gram-Schmidt, ATWT, DTW.
68 CHAPTER 7. IMAGE SHARPENING

$$Q = \frac{\sigma_{fg}}{\sigma_f \sigma_g} \cdot \frac{2\bar f \bar g}{\bar f^2 + \bar g^2} \cdot \frac{2\sigma_f \sigma_g}{\sigma_f^2 + \sigma_g^2} = \frac{4\sigma_{fg}\,\bar f \bar g}{(\bar f^2 + \bar g^2)(\sigma_f^2 + \sigma_g^2)}, \tag{7.4}$$
where $\bar f$ and $\sigma_f^2$ are the mean and variance of band f and $\sigma_{fg}$ is the covariance of the two bands. The first term in (7.4) is seen to be the correlation coefficient between the two images, with values in [−1, 1], the second term compares their average brightness, with values in [0, 1], and the third term compares their contrasts, also in [0, 1]. Thus perfect radiometric correspondence would give a value Q = 1. Since image quality is usually not spatially invariant, it is usual to compute Q in, say, M sliding windows and then average over all such windows:
$$Q = \frac{1}{M}\sum_{j=1}^{M} Q_j.$$
An ENVI plug-in for determining the quality index for pansharpened images is given in Appendix D.4.3. Figure 7.5 shows a comparison of three image sharpening methods applied to a QuickBird image, namely the Gram-Schmidt, ATWT and DWT transformations. The latter is by far the best, but spatial artifacts are apparent.
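A minimal sketch of the quality index (7.4) for two image bands, computed globally, is given below; the windowed average described above would apply the same function to corresponding sub-arrays and average the results. The function name is illustrative and this is not the plug-in of Appendix D.4.3.

function quality_index, f, g
   ff = float(f) & gg = float(g)
   mf = mean(ff) & mg = mean(gg)
   vf = variance(ff) & vg = variance(gg)
   cfg = mean((ff - mf)*(gg - mg))          ; covariance of the two bands
   return, 4*cfg*mf*mg/((mf^2 + mg^2)*(vf + vg))
end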
Chapter 8 Change Detection To quote Singh’s review article on change detection [Sin89], “The basic premise in using remote sensing data for change detection is that changes in land cover must result in changes in radiance values ... [which] must be large with respect to radiance changes from other factors.” In the present chapter we will mention briefly the most commonly used digital techniques for enhancing this “change signal” in bitemporal satellite images, and then focus our attention on the so-called multivariate alteration detection algorithm of Nielsen et al. [NCS98]. 8.1 Algebraic methods In order to see changes in the two multispectral images represented by N-dimensional ran- dom vectors F and G, a simple procedure is to subtract them from each other component- by-component, examining the N differenced images characterized by F − G = (F1 − G1, F2 − G2 . . . FN − GN ) (8.1) for significant changes. Pixel intensity differences near zero indicate no change, large positive or negative values indicate change, and decision thresholds can be set to define significant changes. If the difference signatures in the spectral channels are used to classify the kind of change that has taken place, one speaks of change vector analysis. Thresholds are usually expressed in standard deviations from the mean difference value, which is taken to correspond to no change. Alternatively, ratios of intensities of the form Fk Gk , k = 1 . . . N (8.2) can be built between successive images. Ratios near unity correspond to no-change, while small and large values indicate change. A disadvantage of this method is that random variables of the form (8.2) are not normally distributed, so simple threshold values defined in terms of standard deviations are not valid. Other algebraic combinations, such as differences in vegetation indices (Section 2.1) are also in use. All of these “band math” operations can of course be performed conveniently within the ENVI/IDL environment. 69
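The simple differencing and thresholding just described can be written in a few lines of IDL; the procedure name and the choice of ±2 standard deviations are illustrative assumptions, and the two bands are assumed to be co-registered.

pro difference_change, f, g
   d = float(f) - float(g)
   m = mean(d) & s = stddev(d)
; pixels more than two standard deviations from the mean difference are flagged as change
   change_mask = (d gt m + 2*s) or (d lt m - 2*s)
   envi_enter_data, byte(change_mask)
end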
70 CHAPTER 8. CHANGE DETECTION 8.2 Principal components Figure 8.1: Change detection with principal components. Consider the bitemporal feature space for a single spectral band m in which each pixel is denoted by a point (fm, gm), a realization of the random vector (Fm, Gm). Since the unchanged pixels are highly correlated, they will lie in a narrow, elongated cluster along the principal axis, whereas change pixels will lie some distance away from it, see Fig. 8.1. The second principal component will thus quantify the degree of change associated with a given pixel. Since the principal axes are determined by diagonalization of the covariance matrix for all of the pixels, the no-change axis may be poorly determined. To avoid this problem, the principal components can be determined iteratively using weights for each pixel according to the magnitude of the second principal component. This method can be generalized to treat all multispectral bands simultaneously [Wie97]. 8.3 Post-classification comparison If two co-registered satellite images have been classified, then the class labels can be com- pared to determine land cover changes. If classification is carried out at the pixel level (as opposed to segments or objects), then classification errors (typically 5%) may dominate the true changes, depending on the magnitude of the latter. ENVI offers functions for statistical analysis of post-classification change detection.
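As a sketch of the principal components method of Section 8.2 for a single spectral band, the second principal component of the bitemporal pair serves as the change image; the iterative re-weighting of no-change pixels mentioned above is omitted, and the procedure name is illustrative.

pro pc2_change, f, g
; 2 x n data matrix of the bitemporal band pair
   x = transpose([[f[*]], [g[*]]])
   c = correlate(x, /covariance, /double)
   evals = eigenql(c, eigenvectors=evecs, /double)
   print, 'eigenvalues: ', evals
; project onto the eigenvector of the smaller eigenvalue (second principal component)
   pc2 = evecs[0,1]*(float(f) - mean(f)) + evecs[1,1]*(float(g) - mean(g))
   envi_enter_data, float(pc2)
end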
8.4. MULTIVARIATE ALTERATION DETECTION 71 8.4 Multivariate alteration detection Suppose we make a linear combination of the intensities for all N channels in the first image acquired at time t2, represented by the random vector F. That is, we create a single image whose pixel intensities are U = a F = a1F1 + a2F2 + . . . aN FN , where the vector of coefficients a is as yet unspecified. We do the same for t2, i.e. we make the linear combination V = b G, and then look at the scalar difference image U − V . This procedure combines all the information into a single image, whereby one still hast to choose the coefficients a and b in some suitable way. Nielsen et al. [NCS98] suggest determining the coefficients so that the positive correlation between U and V is minimized. This means that the resulting difference image U − V will show maximum spread in its pixel intensities. If we assume that the spread is primarily due to actual changes that have taken place in the scene over the interval t2 − t1, then this procedure will enhance those changes as much as possible. Specifically we seek linear combinations such that var(U − V ) = var(U) + var(V ) − 2cov(U, V ) → maximum, (8.3) subject to the constraints var(U) = var(V ) = 1. (8.4) Note that under these constraints var(U − V ) = 2(1 − ρ), (8.5) where ρ is the correlation of the transformed vectors U and V , ρ = corr(U, V ) = cov(U, V ) var(U)var(V ) . Since we are dealing with change detection, we require that the random variables U and V be positively correlated, that is, cov(U, V ) 0. We thus seek vectors a and b which minimize the positive correlation ρ. 8.4.1 Canonical correlation analysis Canonical correlation analysis leads to a transformation of each set of variables F and G such that their mutual correlation is displayed unambiguously, see [And84], Chapter 12. We can derive the transformation as follows: For multivariate normally distributed data the combined random vector is distributed as F G ∼ N 0 0 , Σff Σfg Σgf Σgg . Recalling the property (1.6) we have var(U) = a Σff a, var(V ) = b Σggb, cov(U, V ) = a Σfgb.
72 CHAPTER 8. CHANGE DETECTION If we introduce the Lagrange multipliers ν/2 and µ/2, extremalizing the covariance cov(U, V ) under the constraints (8.4) is equivalent to extremalizing the unconstrained Lagrange func- tion L = a Σfgb − ν 2 (a Σff a − 1) − µ 2 (b Σggb − 1). Differentiating, we obtain ∂L ∂a = Σfgb − ν 2 2Σff a = 0, ∂L ∂b = Σfga − µ 2 2Σggb = 0 or a = 1 ν Σ−1 ff Σfgb, b = 1 µ Σ−1 gg Σfga. The correlation between the random variables U and V is ρ = cov(U, V ) var(U)var(V ) = a Σfgb a Σff a b Σggb . Substituting for a and b in this equation gives (with Σfg = Σgf ) ρ2 = a ΣfgΣ−1 gg Σgf a a Σff a , ρ2 = b Σgf Σ−1 ff Σfgb b Σggb , which are equivalent to the two generalized eigenvalue problems ΣfgΣ−1 gg Σgf a = ρ2 Σff a Σgf Σ−1 ff Σfgb = ρ2 Σggb. (8.6) Thus the desired projections U = a F are given by the eigenvectors a1 . . . aN corresponding to the generalized eigenvalues ρ2 ∼ λ1 ≥ . . . ≥ λN of ΣfgΣ−1 gg Σgf with respect to Σff . Similarly the desired projections V = b G are given by the eigenvectors b1 . . . bN of Σgf Σ−1 ff Σfg with respect to Σgg corresponding to the same eigenvalues. Nielsen et al. [NCS98] refer to the N difference components Mi = Ui − Vi = ai F − bi G, i = 1 . . . N, (8.7) as the multivariate alteration detection (MAD) components of the combined bitemporal image. 8.4.2 Solution by Cholesky factorization Equations (8.6) are of the form Σ1a = λΣa, where both Σ1 and Σ are symmetric and Σ is positive definite. The Cholesky factorization of Σ is Σ = LL , where L is a lower triangular matrix, and can be thought of as the “square root” of Σ. Such an L always exists is Σ is positive definite. Therefore we can write Σ1a = LL a
8.4. MULTIVARIATE ALTERATION DETECTION 73 or, equivalently, L−1 Σ1(L )−1 L a = λL a or, with d = L a and commutivity of inverse and transpose, [L−1 Σ1(L−1 ) ]d = λd, a standard eigenproblem for a real, symmetric matrix L−1 Σ1(L−1 ) . Let the (orthogonal) eigenvectors be di. We have 0 = di dj = ai LL aj = ai Σaj, i = j. (8.8) 8.4.3 Properties of the MAD components We have, from (8.4) and (8.8), for the eigenvectors ai and bi, ai Σff aj = bi Σggbj = δij. Furthermore bi = 1 √ λi Σ−1 gg Σgf ai, i.e. substituting this into the LHS of the second equation in (8.6): Σgf Σ−1 ff Σfg 1 √ λi Σ−1 gg Σgf ai = Σgf Σ−1 ff 1 √ λi λiΣff ai = Σgf λiai = λiΣggbi, as required. It follows that ai Σfgbj = ai 1 λj ΣfgΣ−1 gg Σgf aj = λj ai Σff ai = λj δij, and similarly for bi Σgf aj. Thus the covariances of the MAD components are given by cov(Ui − Vi, Uj − Vj) = cov(ai F − bi G, aj F − bj G) = 2δij(1 − λj). The MAD components are therefore orthogonal (uncorrelated) with variances var(Ui − Vi) = σ2 MADi = 2(1 − λi). (8.9) The transformation corresponding to the smallest eigenvalue, namely (aN , bN ), will thus give maximal variance for the difference U − V . We can derive change probabilities from a MAD image as follows. The sum of the squares of the standardized MAD components for no-change pixels, given by Z = MAD1 σMAD1 2 + . . . + MADN σMADN 2 , is approximately chi-square distributed with N degrees of freedom, i.e., Pr(Z ≤ z) = ΓP (N/2, z/2). For a given measured value z for some pixel, the probability that Z could be that large or larger, given that the pixel is no-change, is 1 − ΓP (N/2, z/2).
74 CHAPTER 8. CHANGE DETECTION The probability that the pixel is a change pixel is therefore the complement of this, Pchange(z) = 1 − (1 − ΓP (N/2, z/2)) = ΓP (N/2, z/2). (8.10) This quantity can be plotted for example as a gray scale image to show the regions of change. The last MAD component has maximum spread in its pixel intensities and, ideally, maximum change information. However, depending on the type of change one is looking for, the other components may also be extremely useful. The second-to-last image has maximum spread subject to the condition that the pixel intensities are statistically uncorrelated with those in the first image, and so on. Since interesting anthropomorphic changes will generally be uncorrelated with dominating seasonal vegetation changes or stochastic image noise, it is quite common that such changes will be concentrated in higher order components. This in fact is one of the nicest aspects of the method – it sorts different categories of change into different image components. Therefore we can also perform change vector analysis on the MAD change vector. An ENVI plug-in for MAD is given in Appendix D.5.1. 8.4.4 Covariance of MAD variates with original observations With (8.7) and A = (ai . . . aN ), B = (bi . . . bN ) FM = F(A F − B G) = Σff A − ΣfgB GM = G(A F − B G) = Σgf A − ΣggB. 8.4.5 Scale invariance An additional advantage of the MAD procedure stems from the fact that the calculations involved are invariant under linear transformations of the original image intensities. This implies that the method is insensitive to differences in atmospheric conditions or sensor calibrations at the two acquisition times. We can see this as follows. Suppose the second image G is transformed according to some linear transformation T, H = TG. The relevant covariance matrices for (8.6) are then Σfg = FH = ΣfgT Σgf = HF = TΣgf Σff = Σff Σgg = HH = TΣggT . The eigenproblems are therefore ΣfgT (TΣggT )−1 TΣgf a = ρ2 Σff a TΣgf Σ−1 ff ΣfgT c = ρ2 TΣggT c, where c is the desired projection for H. These are equivalent to ΣfgΣ−1 gg Σgf a = ρ2 Σff a Σgf Σ−1 ff Σfg(T c) = ρ2 Σgg(T c),
8.4. MULTIVARIATE ALTERATION DETECTION 75 which are identical to (8.6) with b = T c. Therefore the MAD components in the trans- formed situation are ai F − ci H = ai F − ci TG = ai F − (T ci) G = ai F − bi G as before. 8.4.6 Improving signal to noise The MAD transformation can be augmented by subsequent application of the maximum autocorrelation factor (MAF) transformation, in order to improve the spatial coherence of the difference components, see [NCS98]. When image noise is estimated as the difference between intensities of neighboring pixels, the MAF transformation is equivalent to the MNF transformation. The MAF/MAD variates thus generated are also orthogonal and invariant under affine transformations. An ENVI plug-in for performing the MAF transfor- mation is given in Appendix D.5.2. 8.4.7 Decision thresholds Since the MAD components are approximately normally distributed about zero and uncor- related, see Figure 8.2, decision thresholds for change or no change pixels can be set in terms of standard deviations about the mean for each component separately. This can be done arbitrarily, for example by saying that all pixels in a MAD component whose intensities are within ±2σMAD are no-change pixels. Figure 8.2: Scatter plot of two MAD components. We can do better than this, however, using a Bayesean technique. Let us consider the following mixture model for a random variable X representing one of the MAD components: p(x) = p(x | NC)p(NC) + p(x | C−)p(C−) + p(x | C+)p(C+), (8.11)
76 CHAPTER 8. CHANGE DETECTION Figure 8.3: Probability mixture model for MAD components. where C+, C− and NC denote positive change, negative change and no change, respectively, see Fig. 8.3. The set of measurements S = {xi} may be grouped into four disjoint sets: SNC, SC−, SC+, SU = SSNC ∪ SC− ∪ SC+, with SU denoting the set of ambiguous pixels.1 From the sample mean and sample variance, we estimate initially the moments for the distribution of no-change pixels: µNC = 1 |SNC| · i∈SNC xi, (σNC)2 = 1 |SNC| · i∈SNC (xi − µNC)2 (|S| denotes set cardinality) and similarly for C− and C+. Bruzzone and Prieto [BP00] suggest improving these estimates by using the pixels in SU and applying the so-called EM algorithm (see [Bis95] for a good explanation): µNC = i∈S p(NC | xi)xi / i∈S p(NC | xi) (σNC)2 = i∈S p(NC | xi)(xi − µNC)2 / i∈S p(NC | xi) p (NC) = 1 |S| · i∈S p(NC | xi) , (8.12) where p(NC | xi) is the a posteriori probability for a no-change pixel conditional on mea- surement xi. We have the following rules for determining p(NC | xi): 1. i ∈ SNC : p(NC | xi) = 1 2. i ∈ SC± : p(NC | xi) = 0 1The symbols ∪ and denote set union and set difference, respectively. These sets can be determined in practice by setting generous, scene-independent thresholds for change and no-change pixel intensities, see [BP00].
8.5. RADIOMETRIC NORMALIZATION 77 3. i ∈ SU : p(NC | xi) = p(xi | NC)p(NC)/p(xi) (Bayes’ rule), where p(xi | NC) = 1 √ 2πσNC · exp −(xi − µNC)2 2σ2 NC and p(xi) = p(xi | NC)p(NC) + p(xi | C−)p(C−) + p(xi | C+)p(C+). Substituting into (8.12) we obtain the set of equations µNC = i∈SU p(NC | xi)xi + i∈SNC xi i∈SU p(NC | xi) + |SNC| (σNC)2 = i∈SU p(NC | xi)(xi − µNC)2 + i∈SNC (xi − µNC)2 i∈SU p(NC | xi) + |SNC| p (NC) = 1 |S| i∈SU p(NC | xi) + |SNC| , which can be iterated numerically to improve the initial estimates of the distributions. One can then determine e.g. the upper change threshold as the appropriate solution of p(x | NC)p(NC) = p(x | C+)p(C+). Taking logarithms, 1 2σ2 C+ (x − µC+)2 − 1 2σ2 NC (x − µNC)2 = log σNC σC+ · p(C+) P(NC) =: A with solutions x = µC+σ2 NC − µNCσ2 C+ ± σNCσC+ (µNC − µC+)2 + 2A(σ2 NC − σ2 C+) σ2 NC − σ2 C+ . A corresponding expression obtains for the lower threshold. In the next chapter we will extend this method to discriminate clusters of change and no change pixels. An ENVI GUI for determining change thresholds is given in Appendix D.6.6. 8.5 Radiometric normalization Radiometric normalization of satellite imagery requires, among other things, an atmospheric correction algorithm and the associated atmospheric properties at the times of image acqui- sition. For most historical satellite scenes such data are not available and even for planned acquisitions they may be difficult to obtain. A relative normalization based on the radiomet- ric information intrinsic to the images themselves is an alternative whenever absolute surface radiances are not required, for example in change detection applications or for supervised land cover classification. One usually proceeds under the assumption that the relationship between the at-sensor radiances recorded at two different times from regions of constant reflectance is spatially
78 CHAPTER 8. CHANGE DETECTION homogeneous and can be approximated by linear functions. The critical aspect is the deter- mination of suitable time-invariant features upon which to base the normalization. As we have seen, the MAD transformation invariant to linear and affine scaling. Thus, if one uses MAD for change detection applications, preprocessing by linear radiometric normal- ization is superfluous. However radiometric normalization of imagery is important for many other applications, such as mosaicing, tracking vegetation indices over time, supervised and unsupervised land cover classification, etc. Furthermore, if some other, non-invariant change detection procedure is preferred, it must generally be preceded by radiometric normaliza- tion [CNS04]. Taking advantage of this invariance, one can apply the MAD transformation to select the no-change pixels in bitemporal images, and then used them for radiometric normalization. The procedure is simple, fast and completely automatic and compares very favorably with normalization using hand-selected, time-invariant features. An ENVI plug-in for radiometric normalization with the MAD transforma- tion is given in Appendix D.5.3.
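A minimal sketch of the normalization step is given below, assuming that an index array of invariant (no-change) pixels has already been obtained, for example by thresholding the MAD-based change probability; the routine name and keyword are illustrative and this is not the plug-in of Appendix D.5.3.

pro radcal_band, reference, target, idx, normalized=normalized
; ordinary least squares fit of the target band to the reference band
; over the no-change pixels, applied to the whole band
   coeffs = linfit(target[idx], reference[idx])     ; [intercept, slope]
   normalized = coeffs[0] + coeffs[1]*float(target)
   print, 'offset and gain: ', coeffs
end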
Chapter 9 Unsupervised Classification Supervised classification of multispectral remote sensing imagery is commonly used for land- cover determination, see Chapter 10. For supervised classification it is very important to define training areas which adequately represent the spectral characteristics of each class in the image to be classified, as the quality of the training set has a significant effect on the classification process and its accuracy. Finding and verifying training areas can be rather laborious since the analyst must select representative pixels for each of the classes. This must be done by visual examination of the image data and by information extraction from additional sources such as ground reference data (ground truth) or existing maps. Unlike supervised classification, clustering methods (or unsupervised methods) require no training sets at all. Instead, they attempt to find the underlying structure automatically by organizing the data into classes sharing similar, e.g. spectrally homogeneous, character- istics. The analyst simply needs to specify the number of clusters present. Clustering plays an especially important role when very little a priori information about the data is avail- able and provides a useful method for organizing a large set of data so that the retrieval of information may be made more efficiently. A primary objective of using clustering algo- rithms for pre-classification of multispectral remote sensing data in particular is to obtain optimum information for the selection of training regions for subsequent supervised land-use segmentation of the imagery. 9.1 A simple cost function We begin with the assumption that the measured features (pixel intensities) x = {xi | i = 1 . . . n} are chosen independently from K multivariate normally distributed populations correspond- ing the K principal land cover categories present in the image. The xi are thus realization of random vectors Xk ∼ N(µk, Σk), k = 1 . . . K. (9.1) Here µk and Σk are the expected value and covariance matrix of Xk, respectively. We denote a given clustering by C = {C1, . . . Ck, . . . CK} where Ck denotes the index set for the kth cluster.1 We wish to maximize the posteriori probability p(C | x) for observing the 1The set of indices {i | i = 1 . . . n, xi is in class k}. 79
80 CHAPTER 9. UNSUPERVISED CLASSIFICATION clustering given the data. From Bayes’ rule, p(C | x) = p(x | C)p(C) p(x) . (9.2) The quantity p(x|C) is the joint probability density function for clustering C, also referred to as the likelihood of observing the clustering C given the data x, P(C) is the prior probability for C and p(x) is a normalization independent of C. The joint probability density for the data is the product of the individual probability densities, i.e., p(x | C) = K k=1 i∈Ck p(xi | Ck) = K k=1 i∈Ck (2π)−N/2 |Σk|−1/2 exp − 1 2 (xi − µk) Σ−1 k (xi − µk) . Forming the product in this way is justified by the independence of the samples. The log-likelihood is given by [Fra96] L = log p(x | C) = K k=1 i∈Ck − N 2 log(2π) − 1 2 log |Σk| − 1 2 (xi − µk) Σ−1 k (xi − µk) . With (9.2) we can therefore write log p(C | x) ∝ L + log p(C). (9.3) If all K classes exhibit identical covariance matrices according to Σk = σ2 I, k = 1 . . . K, (9.4) where I is the identity matrix, then L is maximized when the expression K k=1 i∈Ck (xi) − µk) ( 1 2σ2 I)(xi − µk) = K k=1 i∈Ck (xi − µk) (xi − µk) 2σ2 is minimized. We are thus led to the cost function E(C) = K k=1 i∈Ck (xi − µk) (xi − µk) 2σ2 − log p(C). (9.5) An optimal clustering C under these assumptions is achieved for E(C) → min . Now we introduce a “hard” class dependency in the form of a matrix u with elements given by uki = 1 if i ∈ Ck 0 otherwise. (9.6)
9.2. ALGORITHMS THAT MINIMIZE THE SIMPLE COST FUNCTION 81 The matrix u satisfies the conditions K k=1 uki = 1, i = 1 . . . n, (9.7) meaning that each sampled pixel xi, i = 1 . . . n, belongs to precisely one class, and n i=1 uki 0, k = 1 . . . K, (9.8) meaning that no class Ck, k = 1 . . . K, is empty. The sum in (9.8) is the number nk of pixels in the kth class. An unbiased estimate mk of the expected value µk for the kth cluster is therefore given by µk ≈ mk = 1 nk i∈Ck xi = n i=1 ukixi n i=1 uki , k = 1 . . . K, (9.9) and an estimate Fk of the covariance matrix Σk by Σk ≈ Fk = n i=1 uki(xi − mk)(xi − mk) n i=1 uki , k = 1 . . . K. (9.10) We can now write (9.5) in the form E(C) = K k=1 n i=1 uki (xi − mk) (xi − mk) 2σ2 − log p(C). (9.11) Finally, if we do not wish to include prior probabilities, we can simply say that all clustering configurations C are a priori equally likely. Then the last term in (refe911) is independent of C and we have, dropping the multiplicative constant 1/2σ2 , the well-known sum-of-squares cost function E(C) = K k=1 n i=1 uki(xi − mk) (xi − mk). (9.12) 9.2 Algorithms that minimize the simple cost function We begin with the popular K-means method and then consider an algorithm due to (Palu- binskas 1998) [Pal98], which uses cost function (9.11) and for which the number of clusters is determined automatically. Then we discuss a common version of bottom-up or agglomer- ative hierarchical clustering, and finally a “fuzzy” version of the K-means algorithm. 9.2.1 K-means The K-means clustering algorithm (KM) (sometimes referred to as basic Isodata [DH73] or migrating means [JRR99]) is based on the cost function (9.12). After initialization of the cluster centers, the distance measure corresponding to a minimization of (9.12), namely d(i, k) = (xi − mk) (xi − mk) is used to re-cluster the pixel vectors. Then (9.9) is used to recalculate the cluster centers. This procedure is iterated until the centers cease to change significantly. K-means clustering may be performed within the ENVI environment from the main menu.
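For illustration, a bare-bones IDL version of the K-means iteration is sketched below for a data matrix holding one pixel vector per column; the procedure name, the random initialization and the convergence tolerance are assumptions, and the ENVI menu function is of course preferable in practice.

pro kmeans_sketch, data, K, labels=labels, means=means
   sz = size(data, /dimensions)
   nbands = sz[0] & n = sz[1]
; initialize the centers with K randomly chosen pixel vectors
   seed = 12345L
   means = float(data[*, long(randomu(seed, K)*n)])
   labels = lonarr(n)
   for iter = 1, 50 do begin
; assignment step: nearest center in Euclidean distance
      for i = 0L, n-1 do begin
         d = total((means - rebin(data[*,i], nbands, K))^2, 1)
         m = min(d, kmin)
         labels[i] = kmin
      endfor
; update step: recompute the cluster means, Eq. (9.9)
      oldmeans = means
      for k = 0, K-1 do begin
         idx = where(labels eq k, count)
         if count gt 0 then means[*,k] = total(data[*,idx], 2)/count
      endfor
      if max(abs(means - oldmeans)) lt 1e-6 then break
   endfor
end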
82 CHAPTER 9. UNSUPERVISED CLASSIFICATION 9.2.2 Extended K-means Denote by pk = p(Ck) the prior probability for cluster k. The entropy S associated with this prior distribution is S = − K k=1 pk log pk. (9.13) Distributions with high entropy are those for which the pi are all similar, that is, the pixels are distributed evenly over all available clusters, see [Bis95]. Low entropy means that most of the data are concentrated in very few clusters. We choose a prior distribution p(C) in (9.11) for which few clusters are more probable than many clusters, namely p(C) ∝ exp(−αES) = exp αE K k=1 pk log pk , where αE is a parameter. The cost function (9.11) can then be written as E(C) = K k=1 n i=1 uki (xi − mk) (xi − mk) 2σ2 − αE K k=1 pk log pk. (9.14) With pk ≈ nk n = 1 n n i=1 uki (9.15) this becomes E(C) = K k=1 n i=1 uki (xi − mk) (xi − mk) 2σ2 − αE n log pk . (9.16) An estimate for the parameter αE may be obtained as follows [Pal98]: From (9.14) and (9.15) E(C) ≈ K k=1 nσ2 kpk 2σ2 − αEpk log pk . Equating the likelihood and prior terms in this expression and taking σ2 k ≈ σ2 and pk ≈ 1/ ˜K, where ˜K is some a priori expected number of clusters, gives αE ≈ − n 2 log(1/ ˜K) . (9.17) The parameter σ2 can be estimated from the data. The extended K-means (EKM) algorithm is as follows: First an initial configuration with a very large number of clusters K is chosen (for one-dimensional data this might conveniently be the 256 gray values that a pixel with 8-bit resolution can have) and initial values mk = 1 nk n i=1 ukixi, pk = nk n (9.18) are determined. Then the data are re-clustered according to the distance measure corre- sponding to a minimization of (9.16): d(i, k) = (xi − mk) (xi − mk) 2σ2 − αE n log pk. (9.19)
9.2. ALGORITHMS THAT MINIMIZE THE SIMPLE COST FUNCTION 83 The prior term tends to reduce the number of clusters and any class which has in the course of the algorithm nk = 0 is simply dropped from the calculation. (Condition (9.8) is thus relaxed.) Iteration of (9.18) and (9.19) continues until no significant changes in the mk occur. The explicit choice of the number of clusters K is replaced by the necessity of choosing a value for the “meta-parameter” αE. This has the advantage that we can use one parameter for a wide variety of images and let the algorithm itself decide on the actual value of K in any given instance. 9.2.3 Agglomerative hierarchical clustering The agglomerative hierarchical clustering algorithm that we consider here is, as for K-means, based on the cost function (9.12), see [DH73]. It begins by assigning each pixel in the dataset to its own class or cluster. At this stage of course, the cost function E(C), Eq. (9.12), is zero. We write E(C) in the form E(C) = K k=1 Ek (9.20) where Ek is given by Ek = i∈Ck (xi − mk) (xi − mk). Every agglomeration of clusters to form a smaller number of clusters will increase E(C). We therefore seek a prescription for choosing two clusters for combination that will increase E(C) by the smallest amount possible. Suppose clusters k with nk members and with n members are merged, k , and the new cluster is labeled k. Then mk → nkmk + n m nk + n =: ¯m. Thus after the agglomeration, Ek changes to Ek = i∈Ck∪C (xi − ¯m) (xi − ¯m) and E disappears. The net change in E(C) is therefore, after some algebra, ∆(k, ) = i∈Ck∪C (xi − ¯m) (xi − ¯m) − i∈Ck (xi − mk) (xi − mk) − i∈C (xi − m ) (xi − m ) = nkn nk + n (mk − m ) (mk − m ). (9.21) The minimum increase in E(C) is achieved by combining those two clusters k and which minimize the above expression. Given two alternative candidate cluster pairs with simi- lar combined memberships nk + n and whose means have similar euclidean separations mk − m , this prescription obviously favors combining that pair with the larger discrep- ancy between nk and n . Thus similar-sized clusters are preserved and smaller clusters are absorbed by larger ones.
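The merge criterion (9.21) is easily evaluated once the cluster sizes and mean vectors are known. The following IDL sketch (argument names hypothetical) computes the cost increase for one candidate pair and scans all pairs for the cheapest merge; it illustrates the formula only and does not use the efficient recursion described next, on which the plug-in of Appendix D.6.1 relies.

; Increase in the sum-of-squares cost, Eq. (9.21), incurred by merging
; clusters k and l with nk, nl members and mean vectors mk, ml.
function merge_cost, nk, nl, mk, ml
  return, (float(nk)*nl/(nk + nl))*total((mk - ml)^2)
end

; Scan all cluster pairs for the cheapest merge. ns: K-element array of
; cluster sizes, means: [K,NN] array of cluster centers.
pro cheapest_merge, ns, means, kbest, lbest
  K = n_elements(ns)
  best = !values.f_infinity
  for k = 0, K-2 do for l = k+1, K-1 do begin
    d = merge_cost(ns[k], ns[l], means[k,*], means[l,*])
    if d lt best then begin
      best = d & kbest = k & lbest = l
    endif
  endfor
end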
84 CHAPTER 9. UNSUPERVISED CLASSIFICATION Let k, represent the cluster formed by combination of the clusters k and . Then the increase in cost incurred by combining this cluster with cluster r can be determined from (9.21) as ∆( k, , r) = (nk + nr)∆(k, r) + (n + nr)∆( , r) − nr∆(k, ) nk + n + nr . (9.22) Once ∆(i, j) = 1 2 (xi − xj) (xi − xj) for i, j = 1 . . . n has been initialized from (9.21) for all possible combinations of pixels, the recursive formula (9.22) can be used to calculate efficiently the cost function for any further combinations without reference to the original data. The algorithm terminates when the desired number of clusters has been reached or continues until a single cluster has been formed. Assuming that the data consist of ˜K compact and well separated clusters, the slope of E(C) vs. the number of clusters K should decrease (become more negative) for K ≤ ˜K. An ENVI plug-in for agglomerative hierarchic clustering is given in Appendix D.6.1. 9.2.4 Fuzzy K-means For q 1 we write (9.9) and (9.12) in the equivalent forms [Dun73] mk = n i=1 uq kixi n i=1 uq ki , k = 1 . . . K, (9.23) E(C) = K k=1 n i=1 uq ki(xi − mk) (xi − mk), (9.24) and make the transition from “hard” to “fuzzy” clustering by replacing (9.6) by continuous variables 0 uki 1, k = 1 . . . K, i = 1 . . . n, (9.25) but retaining requirements (9.7) and (9.8). The matrix u is now a fuzzy class membership matrix. With i fixed, we seek values for the uki which solve the minimization problem Ei = K k=1 uq ki(xi − mk) (xi − mk) → min, i = 1 . . . n, under conditions (9.7). By introducing the Lagrange function Li = Ei − λ K k=1 uki − 1 we can equivalently solve the unconstrained problem Li → min. Differentiating with respect to uki, ∂Li ∂uki = q(uki)q−1 (xi − mk) (xi − mk) − λ = 0, k = 1 . . . K,
9.3. EM CLUSTERING 85 from which we have uki = q−1 λ q q−1 1 (xi − mk) (xi − mk) . (9.26) The Lagrange multiplier λ is determined by 1 = K k=1 uki = q−1 λ q K k=1 q−1 1 (xi − mk) (xi − mk) , Substituting this into (9.26), we obtain finally uki = q−1 1 (xi−mk) (xi−mk) K k =1 q−1 1 (xi−mk ) (xi−mk ) , k = 1 . . . K, i = 1 . . . n. (9.27) The parameter q determines the “degree of fuzziness” and is usually chosen as q = 2. The fuzzy K-means (FKM) algorithm consists of a simple iteration of equations (9.23) and (9.27). The algorithm terminates when the cluster centers mk – or alternatively when the matrix elements uki – cease to change significantly. This algorithm should gives similar results to the K-means algorithm, but one expects it to be less likely to become trapped in local minima of the cost function. An ENVI plug-in for fuzzy k-means clustering is given in Appendix D.6.2. 9.3 EM Clustering The EM (= expectation maximization) algorithm, (see e.g. [Bis95]) replaces uki in (9.27) by the posterior probability p(Ck | xi) of class Ck given the observation xi. That is, using Bayes’ theorem, uki → p(Ck | xi) ∼ p(xi | Ck)p(Ck). Here p(xi | Ck) is taken to be a multivariate normal distribution function with estimated mean mk and estimated covariance matrix Fk given by (9.9) and (9.10), respectively. Thus uki ∼ p(Ck) 1 |Fk| exp − 1 2 (xi − mk) F−1 k (xi − mk) . (9.28) One can use the current class membership to estimate P(Ck) as pk according to (9.15). The EM algorithm is then an iteration of equations (9.9), (9.10), (9.15) and (9.28) with the same termination condition as for the fuzzy K-means algorithm, see also Eqs. (8.12). After each iteration the columns of u are normalized according to (9.7). Because of the exponential distance dependence of the membership probabilities in (9.28), the algorithm is very sensitive to initialization conditions, and can even become unstable. To avoid this problem, one can first obtain initial values for the mk and for u by preceding the calculation with the fuzzy K-means algorithm. Explicitly: Algorithm (EM clustering) 1. Determine starting values for cluster centers mk and initial memberships uki from the FKM algorithm.
2. Determine the cluster centers $m_k$ with (9.9) and the prior probabilities $P(C_k)$ with (9.15).

3. Calculate the weighted covariance matrices $F_k$ with (9.10) and, with (9.28), the class membership probabilities $u_{ki}$. Normalize the columns of $u$.

4. If $u$ has not changed significantly, stop, else go to 2.

9.3.1 Simulated annealing

Even with initialization using the fuzzy K-means algorithm, the EM algorithm may become trapped in a local optimum. An alternative scheme is to apply so-called simulated annealing. Essentially, the initial memberships are random and only gradually are the calculated class memberships allowed to influence the estimation of the class centers [Hil01]. The rate of reduction of randomness is determined by a temperature parameter. For example, the class memberships in (9.28) may be replaced on each iteration by
$$ u_{ki} \to u_{ki}\,(1 - r^{1/T}), $$
where $T$ is initialized to $T_0$ and reduced at each iteration by a factor $c < 1$, $T \to cT$, and where $r \in (0,1)$ is a uniformly distributed random number. As $T$ approaches zero, $u_{ki}$ is determined more and more by the probability distribution parameters in (9.28) alone.

9.3.2 Partition density

Since the simple cost function $E(C)$ of (9.12) is no longer relevant, we choose with [GG89] the partition density as a criterion for selecting the best number of clusters. The fuzzy hypervolume, defined as
$$ \mathrm{FHV} = \sum_{k=1}^K \sqrt{|F_k|}, $$
is proportional to the volume in feature space occupied by the ellipsoidal clusters generated by the algorithm. For instance, for a two-dimensional cluster with an elliptical probability density we have, in its principal axis coordinate system,
$$ \sqrt{|\Sigma|} = \sqrt{\left|\begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}\right|} = \sigma_1\sigma_2 \approx \text{area (volume) of the ellipse}. $$
Summing the memberships of the observations lying within one standard deviation of each cluster center,
$$ S = \sum_{k=1}^K \;\sum_{i\,:\,(x_i - m_k)^\top F_k^{-1} (x_i - m_k)\,\le\,1} u_{ki}, $$
the partition density is defined as
$$ \mathrm{PD} = S/\mathrm{FHV}. \qquad(9.29) $$
Assuming that the data consist of $\tilde K$ well-separated clusters of approximately multivariate normally distributed pixels, the partition density should exhibit a maximum at $K = \tilde K$. An ENVI plug-in for EM clustering is given in Appendix D.6.3.
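A direct, if slow, evaluation of the partition density (9.29) might look as follows in IDL. This is only a sketch with hypothetical argument names (membership matrix U, data array Xs, cluster centers means, covariance stack covs), and the Mahalanobis distances are computed with explicit loops for clarity rather than speed.

; Partition density, Eq. (9.29). U: [K,n] membership matrix, Xs: [n,NN]
; data array, means: [K,NN] cluster centers, covs: [NN,NN,K] array of
; cluster covariance matrices Fk.
function partition_density, U, Xs, means, covs
  K = n_elements(means[*,0])
  n = n_elements(Xs[*,0])
  fhv = 0.0d  &  s = 0.0d
  for k = 0, K-1 do begin
    Fk = reform(covs[*,*,k])
    Fki = invert(Fk)
    fhv = fhv + sqrt(determ(Fk))            ; fuzzy hypervolume contribution
    for i = 0L, n-1 do begin
      d = reform(Xs[i,*] - means[k,*])
      ; quadratic form (xi - mk)^T Fk^(-1) (xi - mk)
      quad = 0.0
      for a = 0, n_elements(d)-1 do for b = 0, n_elements(d)-1 do $
        quad = quad + d[a]*Fki[a,b]*d[b]
      if quad le 1.0 then s = s + U[k,i]    ; membership within one std dev
    endfor
  endfor
  return, s/fhv
end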
9.3. EM CLUSTERING 87 9.3.3 Including spatial information The algorithms described thus far make exclusive use of the spectral properties of the in- dividual observations (pixels). Spatial relationships within an image such as large scale, coherent regions, textures etc. are ignored entirely. The EM algorithm determines the a posteriori class membership probabilities of each observation for the classes in question. In this section we describe a post-processing technique to take account of some of the spatial information implicit in the classified image in order to improve the original classification. This technique makes use of the vectors of a posteriori probabilities associated with each classified pixel. Figure 9.1 shows schematically a single pixel m together with its immediate neighborhood n, which we take to consist of the four pixels above, below, to the left and to the right of m. Let its a posteriori probabilities be Pm(Ck), k = 1 . . . M, M k=1 Pm(Ck) = 1, or, more simply, the vector Pm. 1 2 3 4 C Pixel mC Neighborhood n Figure 9.1: A pixel neighborhood. A possible misclassification of the pixel m could in principle be corrected by examining its neighborhood. The neighboring pixels would have in some way to modify Pm such that the maximal probability corresponds to the true class. We now describe a purely heuristic but nevertheless intuitively satisfying procedure to do just that, the so-called probabilistic label relaxation method [JRR99]. Let Qm(Ck) be a neighborhood function for the mth Pixel, which is supposed to correct Pm(Ck) in the above sense, according to the prescription Pm(Ck) = Pm(Ck)Qm(Ck) k Pm(Ck )Qm(Ck ) , k = 1 . . . M, or, as a vector equation, according to Pm = Pm ⊗ Qm PmQm , (9.30) where ⊗ signifies the Hadamard product, which simply means component-by-component multiplication. The denominator ensures that the result is also a probability, in other words
that $\sum_{k=1}^M P_m(C_k) = 1$.

The neighborhood function must somehow reflect the spatial structure of the image. In order to define it, we first postulate a compatibility measure $P_{mi}(C_k|C_l)$, $i = 1\ldots4$, namely the conditional probability that pixel $m$ belongs to class $C_k$, given that the neighboring pixel $i$, $i = 1\ldots4$, belongs to $C_l$. A 'small piece of evidence' that $m$ should be classified to $C_k$ would then be
$$ P_{mi}(C_k|C_l)\,P_i(C_l), \quad i = 1\ldots4, $$
that is, the conditional probability that pixel $m$ is in class $C_k$ if neighboring pixel $i$ is in class $C_l$, weighted by the probability that pixel $i$ is in fact in class $C_l$. We obtain a neighborhood function $Q_m(C_k)$ by summing over all pieces of evidence:
$$ Q_m(C_k) = \frac{1}{4}\sum_{i=1}^4\sum_{l=1}^M P_{mi}(C_k|C_l)\,P_i(C_l) = \sum_{l=1}^M P_{mn}(C_k|C_l)\,P_n(C_l), \qquad(9.31) $$
where $P_n(C_l)$ is the average over all four neighborhood pixels,
$$ P_n(C_l) = \frac{1}{4}\sum_{i=1}^4 P_i(C_l), $$
and where $P_{mn}(C_k|C_l)$ correspondingly is the average compatibility of pixel $m$ with its entire neighborhood. We can write (9.31) again as a vector equation, $Q_m = P_{mn} P_n$, and (9.30) finally as
$$ P_m' = \frac{P_m \otimes (P_{mn} P_n)}{P_m^\top P_{mn} P_n}, \qquad(9.32) $$
where $P_m'$ denotes the corrected probability vector.

The matrix of average compatibilities $P_{mn}$ can be estimated directly from the original classified image. A random central pixel $m$ is chosen and its class $C_i$ is determined. Then, again randomly, a pixel $j$ out of its neighborhood is chosen and its class $C_j$ is also determined. Thereupon the matrix element $P_{mn}(C_i|C_j)$ (which was initialized to 0) is incremented by 1. This is repeated many times and finally the rows of the matrix are normalized. Equation (9.32) is well-suited for a simple algorithm:

Algorithm (Probabilistic label relaxation)

1. Carry out a supervised classification, e.g. with a FFN, and determine the compatibility matrix $P_{mn}$.
2. Determine the average neighborhood vector $P_n$ of all pixels $m$ and replace $P_m$ with $P_m'$ according to (9.32). Re-classify pixel $m$ according to the largest membership probability in $P_m'$.

3. If only a few re-classifications took place, stop, otherwise go to step 2.

The stopping condition in the algorithm is obviously rather arbitrary. Experience shows that the best results are obtained after 3–4 iterations, see [JRR99]. Too many iterations lead to a widening of the effective neighborhood of a pixel to such an extent that fully irrelevant spatial information falsifies the final product. The PLR method can be applied similarly to class probabilities generated by supervised classification algorithms. ENVI plug-ins for probabilistic label relaxation are given in Appendix D.6.4.

9.4 The Kohonen Self Organizing Map

The Kohonen self organizing map, a simple example of which is sketched in Fig. 9.2, belongs to a class of neural networks which are trained by competitive learning [HKP91, Koh89]. The single layer of neurons can have any geometry, usually one-, two- or three-dimensional. The input signal is represented by the vector $x = (x_1, x_2 \ldots x_N)^\top$. Each input to a neuron is associated with a synaptic weight, so that for $M$ neurons the synaptic weights can be represented as an $(M \times N)$ matrix
$$ w = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ w_{21} & w_{22} & \cdots & w_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ w_{M1} & w_{M2} & \cdots & w_{MN} \end{pmatrix}. $$
The components of the vector $w_k = (w_{k1}, w_{k2} \ldots w_{kN})^\top$ are thus the synaptic weights of the $k$th neuron. We interpret the vectors $\{x(i) \mid i = 1 \ldots p\}$ as training data for the neural network. The synaptic weight vectors are to be adjusted so as to reflect in some way the clustering of the training data in the $N$-dimensional feature space.

When a training vector $x$ is presented to the input of the network, the neuron whose weight vector $w_k$ lies nearest to $x$ is designated the "winner". Distances are given by $(x - w_k)^\top(x - w_k)$. Call the winner $k^*$. Its weight vector is then shifted a small amount in the direction of the training vector,
$$ w_{k^*}(i+1) = w_{k^*}(i) + \eta\,(x(i) - w_{k^*}(i)), $$
where $w_{k^*}(i+1)$ is the weight vector after presentation of the $i$th training vector, see Fig. 9.3. The parameter $\eta$ is called the learning rate of the network.
90 CHAPTER 9. UNSUPERVISED CLASSIFICATION y Qs u T T T x1 x2 4 0 w 1 2 3 k 16 k∗ Figure 9.2: The Kohonen feature map in two dimensions with a two-dimensional input. The intention is to repeat this learning procedure until the synaptic weight vectors reflect the class structure of the training data, thus achieving a vector quantization of the feature space. In order for this method to function, it is necessary to allow the learning rate to decrease gradually during the training process. A convenient function for this is η(i) = ηmax ηmin ηmax i/p . However the Kohonen feature map goes a step further and tries to map the topology of the feature space onto the network. This is achieved by defining a neighborhood function for the winner neuron on the network of neurons. Usually a Gauss function of the form λ(k∗ , k) = exp(−d2 (k∗ , k)/2σ2 ) is used, where d2 (k∗ , k) is the square of the distance between neurons k∗ and k. For example, for a two-dimensional array of m × m neurons d2 (k∗ , k) =[(k∗ − 1) mod m − (k − 1) mod m]2 + [(k∗ − 1) div m − (k − 1) div m]2 , whereas for a cubic m × m × m array. d2 (k∗ , k) = [(k∗ − 1) mod m − (k − 1) mod m]2 + [((k∗ − 1) div m − (k − 1) div m) mod m]2 + [(k∗ − 1) div m2 − (k − 1) div m2 ]2 . During the learning phase not only the winner neuron, but also the neurons in its “neigh- borhood” are moved in the direction of the training vectors: wk(i + 1) = wk(i) + η(i)λ(k∗ , k)(x(i) − wk(i)), k = 1 . . . M.
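One training step of the self organizing map, combining the winner search with the Gaussian neighborhood update just described, could be sketched in IDL as follows. The procedure and its argument names are hypothetical; it assumes a square m × m grid of neurons with 0-based numbering, so the (k*−1) terms of the distance formula above are not needed, and eta and sigma are the current values of the learning rate and neighborhood width (their shrinking schedules are given in the surrounding text).

; One Kohonen training step. W: [M,NN] array of synaptic weight vectors
; (one neuron per row), x: NN-element training vector, eta: learning rate,
; sigma: neighborhood width, m: side length of the m x m neuron grid.
pro som_step, W, x, eta, sigma, m
  M = n_elements(W[*,0])
  ; find the winner k* (smallest Euclidean distance to x)
  d = fltarr(M)
  for k = 0, M-1 do d[k] = total((W[k,*] - transpose(x))^2)
  void = min(d, kstar)
  ; move the winner and its neighbors toward x
  for k = 0, M-1 do begin
    d2 = (kstar mod m - k mod m)^2 + (kstar/m - k/m)^2
    lam = exp(-d2/(2.0*sigma^2))
    W[k,*] = W[k,*] + eta*lam*(transpose(x) - W[k,*])
  endfor
end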
Figure 9.3: Movement of the synaptic weight vector in the direction of the training vector.

Finally, the extent of the neighborhood is allowed to shrink steadily,
$$ \sigma(i) = \sigma_{\max}\left(\frac{\sigma_{\min}}{\sigma_{\max}}\right)^{i/p}. $$
Typically, $\sigma_{\max} \approx m/2$ and $\sigma_{\min} \approx 1/2$. Thus the neighborhood is initially the entire network and toward the end of training it is very localized.

For visualization or clustering of multispectral satellite imagery, a cubic network geometry is useful. After training, the image is classified by associating each pixel vector with the neuron having the closest synaptic weight vector. The pixel is then colored by mapping the position of the neuron in the cube to coordinates in RGB color space. Thus, pixels that are close together in feature space are represented by similar colors. An ENVI plug-in for the Kohonen self organizing map is given in Appendix D.6.5.

9.5 Unsupervised classification of changes

We mention finally an extension of the procedure used to determine change/no-change decision thresholds discussed in Section 8.4.7. Rather than clustering the MAD change components individually as was done there, we can use any of the algorithms introduced in this chapter (except the Kohonen SOM) to classify the changes. Because of its ability to accommodate correlated clusters, we prefer the EM algorithm.

Clustering of the change pixels can of course be applied in the full MAD or MNF/MAD feature space, where the number of clusters chosen determines the number of change categories. The approximate chi-square distribution of the sum of squares of the standardized variates allows the labelling of pixels with high no-change probability. These can be excluded from the clustering process, e.g. by "freezing" their a posteriori probabilities to 1 for the no-change class, thereby speeding up the calculation considerably. Routines for change classification using the EM algorithm are included in the ENVI GUI for viewing change detection images given in Appendix D.6.6.
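As a small illustration of the chi-square argument, the following IDL function (with hypothetical inputs) returns the indices of probable no-change pixels from an array of standardized MAD variates; it assumes that the per-pixel sum of squares is approximately chi-square distributed with one degree of freedom per component, and that IDL's CHISQR_PDF accepts the array argument and returns the cumulative distribution.

; Indices of probable no-change pixels. Z: [n,nbands] array of
; standardized MAD variates (hypothetical input), thresh: no-change
; probability threshold, e.g. 0.95.
function nochange_pixels, Z, thresh
  nbands = n_elements(Z[0,*])
  chi2 = total(Z^2, 2)                      ; sum of squares per pixel
  pr_nc = 1.0 - chisqr_pdf(chi2, nbands)    ; approximate no-change probability
  return, where(pr_nc gt thresh)
end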
Chapter 10 Supervised Classification The pixel-oriented, supervised classification of multispectral images is a problem of prob- ability density estimation. On the basis of representative training data for each class, the probability distributions for all of the classes are estimated and then used to classify all of the pixels in the image. We will consider three methods or models for supervised classifica- tion: a parametric model (Bayes maximum likelihood), a non-parametric model (Gaussian kernel) and a mixture model (the feed-forward neural network). The basis for all of these classifiers is Bayes’ decision rule, which we consider first. 10.1 Bayes decision rule The a posteriori probabilities for class Ck, Eq. (2.3), can be written for N-diminsional training data and M classes in the form P(Ck|x), k = 1 . . . M, x = (x1 . . . xN ) . (10.1) Let us define a loss function L(Ci, x) which measures the cost of associating the pixel with feature vector x with the class Ci. Let λik be the loss incurred if x in fact belongs to class Ci, but is classified as belonging to class Ck. We can reasonably assume λik = 0 if i = k 0 otherwise, i, k = 1 . . . M, (10.2) that is, a correct classification incurs no loss. We can now express the loss function as a sum over the individual losses, weighted according to (10.1): L(Ci, x) = M k=1 λikP(Ck|x). (10.3) Without further specifying λik, we can define a loss-minimizing decision rule for our classi- fication as x ∈ Ci provided L(Ci, x) L(Cj, x) for all j = 1 . . . M, j = i. (10.4) Up till now we’ve been completely general. Now suppose the losses are independent of the kind of misclassification that occurs (for instance, the classification of a ‘forest’ pixel into 93
94 CHAPTER 10. SUPERVISED CLASSIFICATION the the class ‘meadow’ is just as bad as classifying it as ‘urban area’, etc). The we can write λik = 1 − δik, . Thus any given misclassification (i = k) costs unity, and a correct classification (i = k) costs nothing. We then obtain from (10.3) L(Ci, x) = M k=1 P(Ck|x) − P(Ci|x) = 1 − P(Ci|x), i = 1 . . . M, (10.5) and from (10.4) the Bayes’ decision rule x ∈ Ci if P(Ci|x) P(Cj|x) for all j = 1 . . . M, j = i. (10.6) 10.2 Training data The selection of representative training data is the most difficult and critical part of the classification process. The standard procedure is to select training areas within the image which are representative of each class of interest. In the ENVI environment, these are entered as regions of interest (ROI’s), from which the training pixel vectors are generated. Note that some fraction of the representative data must be withheld for later accuracy assessment. These are the so-called test data, which are not used for training purposes in order not to bias the accuracy assessment. We’ll discuss their use in detail in later in this chapter. Suppose there are just two classes, that is M = 2. If we apply decision rule (10.6) to some measured pixel vector x, the probability of incorrectly classifying the pixel is r(x) = min[P(C1|x), P(C2|x)]. The Bayes error is defined to be the average of r(x) over all pixels, = r(x)p(x)dx = min[P(C1|x), P(C2|x)]p(x)dx = min[P(x|C1)P(C1), P(x|C2)P(C2)]dx, where we used Bayes rule in the last step. We can use the Bayes error as a measure of the separability of the two classes, the smaller the error, the better the separability. Calculating the Bayes error is difficult, but we can at least get an approximate upper bound as follows. First note that, for any a, b ≥ 0, min[a, b] ≤ aS b1−S , 0 ≤ S ≤ 1. For example, if a b, then the inequality can be written a ≤ a b a 1−S which is clearly true. Applying the inequality to the expression for the Bayes error, we get the so-called Chernoff bound ≤ u = P(C1)S P(C2)1−S P(x|C1)S P(x|C2)1−S dx.
10.3. BAYES MAXIMUM LIKELIHOOD CLASSIFICATION 95 The best upper bound is then determined by minimizing u with respect to S. If we assume that P(x|C1) and P(x|C2) are normal distributions with Σ1 = Σ2, then the minimum occurs at S = 1/2. We get the Bhattacharyya bound B by using S = 1/2 also for the case where Σ1 = Σ2: B = P(C1)P(C2) P(x|C1)P(x|C2) dx. This integral can be evaluated explicitly. The result is B = P(C1)P(C2)e−B , where B is the Bhattacharyya distance given by B = 1 8 (µ2 − µ1) Σ1 + Σ2 2 −1 (µ2 − µ1) + 1 2 log Σ1 + Σ2 2 |Σ1||Σ2| . The first term is an “average Mahalinobis distance” (see below), the second term depends on the difference between the covariance matrices of the two classes. It vanishes when Σ1 = Σ2. Thus the first term gives the class separability due due the “distance” between the class means, while the second term gives the separability due to the difference in the covariance matrices. Finally, the Jeffries-Matusita distance measures separability of two classes on a scale [0 − 2] in terms of B: J = 2(1 − e−B ). (10.7) The ENVI menu command Basic Tools/Region of Interest/Compute ROI Separability calculates Jeffries-Matusita distances between all pairs of classes defined by a given set of ROIs. 10.3 Bayes Maximum likelihood classification Consider again Bayes’ rule: P(Ci|x) = P(x|Ci)P(Ci) P(x) where P(Ci), i = 1 . . . M, are a priori probabilities and where P(x) is given by P(x) = M j=1 P(x|Cj)P(Cj). Since P(x) is independent of i, we can write the decision rule (10.6) as x ∈ Ci if P(x|Ci)P(Ci) P(x|Cj)P(Cj) for all j = 1 . . . M, j = i. Now we make two simplifying assumptions: first, that all the a priori probabilities are equal, second, that the measured feature vectors from class Ci have been sampled from a multivariate normal distribution, that is, that they satisfy P(x|Ci) = 1 (2π)N/2|Σi|1/2 exp − 1 2 (x − µi) Σ−1 i (x − µi) . (10.8)
96 CHAPTER 10. SUPERVISED CLASSIFICATION According to the first assumption, we only need to associate x to that class Ci which maximizes P(x|Ci): x ∈ Ci if P(x|Ci) P(x|Cj) for all j = 1 . . . M, j = i. (10.9) Taking the logarithm of (10.8) gives log P(x|Ci) = − N 2 log(2π) − 1 2 log |Σi| − 1 2 (x − µi) Σ−1 i (x − µi) and we can ignore the first term, as it is independent of i. With the definition of a discrim- inant function di(x), di(x) = − log |Σi| − (x − µi) Σ−1 i (x − µi), (10.10) we obtain finally the Bayes maximum-likelihood classifier: x ∈ Ci if di(x) dj(x) for all j = 1 . . . M, j = i. (10.11) The expression (x−µi) Σ−1 i (x−µi) in (6.10) is referred to as the Mahalanobis distance. The moments of the distributions for the M classes, µi and Σi, which appear in the discriminant function (10.10), may be estimated from the training data using the maximum likelihood estimates: µi ≈ mi = 1 ni x∈Ci x Σi ≈ Fi = 1 ni x∈Ci (x − µi)(x − µi) , where ni is the number of training pixels in class Ci. The maximum likelihood classification algorithm can be called from the ENVI main menu with Classification/Supervised/Maximum Likelihood. 10.4 Non-parametric methods In non-parametric density estimation we wish to model the probability distribution gener- ated by a given set of training data, without making any prior assumption about the form of the distribution function. An example is the class of kernel based methods. Here each data point is used as the center of a simple local probability density and the overall distribution is taken to be the sum of the local distributions. In N dimensions, we can model the class probability distribution as P(y|Ci) ≈ 1 ni x∈Ci 1 (2πσ2)N/2 e(y−x) (y−x)/2σ2 . The quantity σ is a smoothing parameter which we can choose for example by minimizing the misclassifications on the training data themselves with respect to σ. The kernel based method suffers from the drawback of requiring all training data points to be stored. This makes the evaluation of the density very slow if the number of training pixels is large. In general, the complexity grows with the amount of data, not with the difficulty of the estimation problem itself.
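A direct IDL implementation of the Gaussian kernel estimate might look as follows. The function and argument names are hypothetical, the exponent is of course negative, and the loop over all training pixels of the class makes the storage and speed drawbacks just mentioned obvious.

; Gaussian kernel (Parzen) estimate of P(y|Ci). y: NN-element vector,
; Xi: [ni,NN] array of training vectors of class Ci, sigma: smoothing
; parameter.
function kernel_density, y, Xi, sigma
  ni = n_elements(Xi[*,0])
  nn = n_elements(Xi[0,*])
  p = 0.0d
  for j = 0L, ni-1 do begin
    d2 = total((transpose(y) - Xi[j,*])^2)
    p = p + exp(-d2/(2.0*sigma^2))
  endfor
  return, p/(ni*(2.0*!pi*sigma^2)^(nn/2.0))
end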
10.5. NEURAL NETWORKS 97 10.5 Neural networks Neural networks belong to the category of mixture models for probability density estimation, which lie somewhere between the parametric and non-parametric extremes. They make no assumption about the functional form of the probabilities and can be adjusted flexibly to the complexity of the system that they are being used to model. To motivate their use for classification, consider two classes C1 and C2 in a two-dimensional feature sspace. We could write (10.11) in the form of a discriminant m(x) = d1(x) − d2(x) and say that x is C1 if m(x) 0 C2 if m(x) 0. A much simpler discriminant is the linear function: m(x) = w0w1x1 + w2x2 = w0 + w x, (10.12) where w = (w1, w2) and w0 are parameters. The decision boundary occurs for m(x) = 0, i.e. for x2 = − w1 w2 x1 − w0 w2 , see Figure 10.1 u u u e u u u ee e e e e e e u e e u u u m(x) = 0 −w0 w2 −w1 w2 Figure 10.1: A linear discriminant for two classes. In N dimensions m(x) = w0 + w1x1 + . . . + wN xN = w x + w0 and we can represent this equation schematically as an artificial neuron as in shown Figure 10.2. The parameters wi are referred to as synaptic weights. Usually, the output m(x) is modified by a so-called sigmoid activation function, for example the logistic function g(x) = 1 1 + e−I(x) ,
98 CHAPTER 10. SUPERVISED CLASSIFICATION 1 i N ... ... ~ q X b E E E E E 01 x1 xi xN m(x) w0 w1 wi wN Figure 10.2: An artificial neuron. The first input is always unity and is called the bias. where I(x) = w x + w0. This is sometimes justified by the analogy to biological neurons. In IDL (see Figure 10.3): thisDevice =!D.Name set_plot, ’PS’ Device, Filename=’c:templogistic.eps’,xsize=15,ysize=10,/Encapsulated x=(findgen(100)-50)/10 plot, x,1/(1+exp(-x)) device,/close_file set_plot,thisDevice Figure 10.3: The logistic activation function.
10.5. NEURAL NETWORKS 99 There is also a statistical justification, however [Bis95]. Suppose two classes in two- dimensional feature space are normally distributed with Σ1 = Σ2 = I, P(x|Ck) ∼ 1 2π exp( −|x − µk|2 2 ), k = 1, 2. Then we have P(C1|x) = P(x|C1)P(C1) P(x|C1)P(C1) + P(x|C2)P(C2) = 1 1 + P(x|C2)P(C2)/P(x|C1)P(C1) = 1 1 + exp(−1 2 [(x − µ2)2 − (x − µ1)2])(P(C2)/P(C1)) . With the substitution e−a = (P(C2)/P(C1)) we get P(C1|x) = 1 1 + exp(−1 2 [|x − µ2|2 − |x − µ1|2] − a) = 1 1 + exp(−w x − w0) = 1 1 + e−I(x) = m(x). Here we made the additional substitutions w = µ1 − µ2 w0 = − 1 2 |µ1|2 + 1 2 |µ2|2 + a. Thus we expect that the output of the neuron will not only discriminate between the two classes, but also that it will approximate the posterior class membership probability P(C1|x). 10.5.1 The feed-forward network In order to discriminate any number of classes, multilayer feed-forward networks are often used, see Figure 10.4. In this figure, the input signal is the N + 1-component vector x( ) = (1, x1( ) . . . xN ( )) for training sample , which is fed simultaneously to the so-called hidden layer consisting of L neurons. These in turn determine the L + 1-component vector n(x) = (1, n1(x) . . . nL(x)) according to nj(x) = g(Ih j (x)), j = 1 . . . L, with Ih j (x) = wh j x, where wh is the hidden weight vector for the jth neuron wh = (wh 0 , wh 1 . . . wh L) .
100 CHAPTER 10. SUPERVISED CLASSIFICATION ! # ! # ! # ! # ! # 1 i N j ... ... ~ q X b E E E E 01 x1( ) xi( ) xN ( ) ! # ! # 1 L B w s 0 E ... ... E ! # k ! # ! # 1 M ... ... ! # … w q E ‚ … b E ‚ 0 b E 0 m1( ) mk( ) mM ( ) Wh WoE 1 n1 nj nL E E E Figure 10.4: A two-layer, feed-forward neural network with L hidden neurons for classifica- tion of N-dimensional data into M classes. In terms of the weight matrices Wh = (wh 1 , wh 2 , . . . wh L), Wo = (wo 1, wo 2, . . . wo M ), we can write this compactly as n = 1 g(Wh x) . The vector n is then fed to the output layer. If we interpret the outputs as probabilities, then we must ensure that 0 ≤ mk ≤ 1, k = 1 . . . M, and, furthermore, that M k=1 mk = 1. This can be done by using a modified logistic activation function for the output neurons, called softmax: mk(n) = eIo k (n) eIo 1 (n) + eIo 2 (n) + . . . + eIo M (n) , where Io k(n) = wo k n.
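The softmax activation is easily coded. The following IDL function is a minimal sketch; the maximum activation is subtracted before exponentiating to avoid floating-point overflow, as is also done in the FFN object class listed later in this chapter.

; Softmax activation: Io is an M-element vector of output activations
; Io_k(n); the returned probabilities sum to one.
function softmax, Io
  A = exp(Io - max(Io))
  return, A/total(A)
end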
10.5. NEURAL NETWORKS 101 To quote Bishop [Bis95], “ ... such networks can approximate arbitrarily well any functional continu- ous mapping from one finite dimensional space to another, provided the number L of hidden units is sufficiently large. An important corollary of this result is, that in the context of a classification problem, networks with sigmoidal non- linearities and two layers of weights can approximate any decision boundary to arbitrary accuracy. More generally, the capability of such networks to approx- imate general smooth functions allows them to model posterior probabilities of class membership.” 10.5.2 Cost functions We haven’t yet considered the correct choice of synaptic weights. This procedure is called training the network. The training data can be represented as the set of labelled pairs {(x( ), y( )) | = 1 . . . p}, where y( ) = (0, 0 . . . 0, 1, 0 . . . 0) is an M-dimensional vector of zeroes, with a “1” at the kth position to indicate that x( ) belongs to class Ck. An intuitive training criterion is then the quadratic cost function E(Wh , Wo ) = 1 2 p =1 y( ) − m( ) 2 . (10.13) We must adjust the network weights so as to minimize E. Equivalently we can minimize the local cost functions E( ) := 1 2 y( ) − m( ) 2 , = 1 . . . p. (10.14) An alternative cost function can be obtained with the following argument: Choose the synaptic weights so as to maximize the probability of observing the training data: P(x( ), y( )) = P(y( ) | x( ))P(x( )) → max . The neural network predicts the posterior class membership probability, which we can write as P(y( ) | x( )) = M k=1 [ mk(x( )) ]yk( ) . For example: P((1, 0 . . . 0) |x) = m1(x)1 m2(x)0 smM (x)0 = m1(x). Therefore we wish to maximize M k=1 [ mk(x(i)) ]yk(i) P(x( ))
102 CHAPTER 10. SUPERVISED CLASSIFICATION Taking logarithms, dropping terms which are independent of the synaptic weights and sum- ming over all of the training data, we see that this is equivalent to minimizing the cross entropy cost function E(Wh , Wo ) = − p =1 M k=1 yk( ) log mk(x( )) (10.15) with respect to the synaptic weight parameters. 10.5.3 Training Let w be the vector of all synaptic weights, i.e. E(Wh , Wo ) =: E(w) In one dimension, expanding in a Taylor series about a local minimum w∗ , E(w) = E(w∗ ) + (w − w∗ ) dE(w∗ ) dw + 1 2 (w − w∗ )2 d2 E(w∗ ) dw2 + . . . = E0 + 1 2 (w − w∗ )2 H + . . . , where H = d2 E(w∗ ) dw2 and we must have H 0 for a minimum, see Figure 10.5 w∗ w E(w) dE(w∗ ) dw = 0 Figure 10.5: Minimization of the cost function in one dimension. In many dimensions, we get the analogous expression E(w) = E0 + (w − w∗ ) H(w − w∗ ) + . . . where the matrix H is called the Hessian, Hij = ∂2 E(w∗ ) ∂wi∂wj . (10.16)
10.5. NEURAL NETWORKS 103 It is symmetric and it must be positive definite for a local minimum. It is positive definite if all of its eigenvalues are positive, see Appendix C. A local minimum can be found with various search algorithms. Backpropagation is the most well-known and extensively used method and is described below. It is used in the stan- dard ENVI neural network for supervised classification. However much better algorithms exist, such as scaled conjugate gradient or Kalman filter. These are discussed in detail in Appendix C. ENVI plug-ins for supervised classification with a feed forward neu- ral network trained with conjugate gradient and a fast Kalman filter algorithm are given in Appendices D.7 and D.8. 10.5.4 Backpropagation We will develop a training algorithm for the two-layer, feed-forward neural network of Figure 10.4. Our starting point is the local version of the cost function (10.15), E( ) = − M k=1 yk( ) log mk( ), = 1 . . . p, (10.17) or, in vector form E( ) = −y ( ) log m( ), which we wish to minimize with respect to the synaptic weights represented by the (N+1)×L matrix Wh = (wh 1 , wh 2 , . . . wh L) and the (L + 1) × M matrix Wo = (wo 1, wo 2, . . . wo M ). The following IDL object class FFN mirrors the network architecture of Figure 10.4 and will form the basis for the implementation of the training algorithms developed here and in Appendix C: ;+ ; NAME: ; FFN__DEFINE ; PURPOSE: ; Object class for implementation of a two-layer, feed-forward ; neural network for classification of multi-spectral images. ; This is a generic class with no training methods. ; Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999 ; AUTHOR ; Mort Canty (2005) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; ffn = Obj_New(FFN,Xs,Ys,L) ; ARGUMENTS: ; Xs: array of observation column vectors ; Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T ; L: number of hidden neurons ; KEYWORDS ; None ; METHODS (external): ; FORWARDPASS: propagate a biased input column vector through the network
104 CHAPTER 10. SUPERVISED CLASSIFICATION ; returns the softmax probabilities vector ; m = ffn - ForwardPass() ; CLASS: return the class for an for an array of observation column vectors X ; return the class probabilities in array variable PROBS ; c = ffn - Class(X,Probs) ; COST: return the current cross entropy ; c = ffn - Cost() ; DEPENDENCIES: ; None ;-------------------------------------------------------------- Function FFN::Init, Xs, Ys, L catch, theError if theError ne 0 then begin catch, /cancel ok = dialog_message(!Error_State.Msg + ’ Returning...’, /error) return, 0 endif ; network architecture self.LL = L self.p = n_elements(Xs[*,0]) self.NN = n_elements(Xs[0,*]) self.MM = n_elements(Ys[0,*]) ; biased output vector from hidden layer (column vector) self.N= ptr_new(fltarr(L+1)) ; biased exemplars (column vectors) self.Xs = ptr_new([[fltarr(self.p)+1],[Xs]]) self.Ys = ptr_new(Ys) ; weight matrices (each column is a neuron weight vector) self.Wh = ptr_new(randomu(seed,L,self.NN+1)-0.5) self.Wo = ptr_new(randomu(seed,self.MM,L+1)-0.5) return,1 End Pro FFN::Cleanup ptr_free, self.Xs ptr_free, self.Ys ptr_free, self.Wh ptr_free, self.Wo ptr_free, self.N End Function FFN::forwardPass, x ; logistic activation for hidden neurons, N set as side effect *self.N = [[1],[1/(1+exp(-transpose(*self.Wh)##x))]] ; softmax activation for output neurons I = transpose(*self.Wo)##*self.N A = exp(I-max(I)) return, A/total(A)
10.5. NEURAL NETWORKS 105 End Function FFN:: class, X, Probs ; vectorized class membership probabilities nx = n_elements(X[*,0]) Ones = fltarr(nx) + 1.0 N = [[Ones],[1/(1+exp(-transpose(*self.Wh)##[[Ones],[X]]))]] Io = transpose(*self.Wo)##N maxIo = max(Io,dimension=2) for k=0,self.MM-1 do Io[*,k]=Io[*,k]-maxIo A = exp(Io) sum = total(A,2) Probs = fltarr(nx,self.MM) for k=0,self.MM-1 do Probs[*,k] = A[*,k]/sum ; vectorized class memberships maxM = max(Probs,dimension=2) M=fltarr(self.MM,nx) for i=0,self.MM-1 do M[i,*]=Probs[*,i]-maxM return, byte((where(M eq 0.0) mod self.MM)+1) End Function FFN:: cost Ones = fltarr(self.p) + 1.0 N = [[Ones],[1/(1+exp(-transpose(*self.Wh)##[*self.Xs]))]] Io = transpose(*self.Wo)##N maxIo = max(Io,dimension=2) for k=0,self.MM-1 do Io[*,k]=Io[*,k]-maxIo A = exp(Io) sum = total(A,2) Ms = fltarr(self.p,self.MM) for k=0,self.MM-1 do Ms[*,k] = A[*,k]/sum return, -total((*self.Ys)*alog(Ms)) End Pro FFN__Define struct = { FFN, $ NN: 0L, $ ;input dimension LL: 0L, $ ;number of hidden units MM: 0L, $ ;output dimension Wh:ptr_new(), $ ;hidden weights Wo:ptr_new(), $ ;output weights Xs:ptr_new(), $ ;training pairs Ys:ptr_new(), $ N:ptr_new(), $ ;output vector from hidden layer p: 0L $ ;number of training pairs } End
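Based on the calling sequence documented in the header above, the class might be exercised as in the following hypothetical sketch (Xs, Ys and X are assumed to be arrays of training vectors, label vectors and observations to be classified). Since FFN itself has no training method, the synaptic weights here are simply the random initial values; training is added by the subclasses discussed below.

; Hypothetical usage of the generic FFN interface
ffn = Obj_New('FFN', Xs, Ys, 8)            ; 8 hidden neurons
print, 'initial cross entropy: ', ffn->Cost()
classes = ffn->Class(X, probs)             ; classes and membership probabilities
Obj_Destroy, ffn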
106 CHAPTER 10. SUPERVISED CLASSIFICATION Consider the following algorithm: Algorithm (Backpropagation or Generalized Least Mean Square) 1. Initialize the synaptic weights with random numbers and set = 1. 2. Choose training pair (x( ), y( )) and determine the output response m( ) of the net- work. 3. For k = 1 . . . M and j = 0 . . . L replace wo jk with wo jk − η ∂E( ) ∂wo jk . 4. For j = 1 . . . L and i = 0 . . . N replace wh ij with wh ij − η ∂E( ) ∂wh ij . 5. If E(Wh , Wo ) is sufficiently small, stop, otherwise set = mod p + 1 and go to 2. Thus we keep cycling through the training data, reducing the local cost function at each step by changing each synaptic weight by an amount proportional to the negative slope of the cost function with respect to that weight parameter, stopping when the overall cost function (10.15) is small enough. The constant of proportionality η is referred to as the learning rate for the network. This algorithm makes use only of the first derivatives of the cost function with respect to the synaptic weight parameters and is referred to as the backpropagation method. In order to implement this procedure, we require the partial derivatives of E( ) with respect to the synaptic weights. Let us begin with the output neurons, for which we have the softmax output signals mk( ) = eIo k ( ) eIo 1 ( ) + eIo 2 ( ) + . . . + eIo M ( ) , (10.18) where the activation of the kth neuron is Io k( ) = wo k n( ). We wish to determine ∂E( ) ∂wo jk , j = 0 . . . L, k = 1 . . . M. Applying the chain rule ∂E( ) ∂wo k = ∂E( ) ∂Io k( ) ∂Io k( ) ∂wo k = −δo k( )n( ), k = 1 . . . M, (10.19) where the quantity δo k( ) is defined as δo k( ) = − ∂E( ) ∂Io k( ) and is the negative rate of change of the local cost function with respect to the activation of the kth output neuron. Again applying the chain rule and with (10.16) and (10.18), ∂E( ) ∂Io k( ) = M k =1 ∂E( ) ∂mk ( ) ∂mk ( ) ∂Io k( ) = M k =1 − yk ( ) mk ( ) eIo k ( ) δkk M k =1 eIo k ( ) − eIo k ( ) eIo k ( ) ( M k =1 eIo k ( ) )2 .
10.5. NEURAL NETWORKS 107 Here, δkk is the Kronecker symbol δkk = 0 if k = k 1 if k = k . Continuing, ∂E( ) ∂Io k( ) = M k =1 − yk ( ) mk ( ) mk( )(δkk − mk ( )) = −yk( ) + mk( ) M k =1 yk ( ). But the sum over M is just one, and we have ∂E( ) ∂Io k( ) = −(yk( ) − mk( ) = −δo k( ), and therefore, with δo ( ) = (δo 1( ), . . . δo M ( )) , δo ( ) = y( ) − m( ). (10.20) Thus from (10.19) the third step in the backpropagation algorithm can be written in the form Wo ( + 1) → Wo ( ) + η n( )δo ( ). (10.21) Note that the second term on the right hand side of (10.21) is an outer product, giving a matrix of dimension (L + 1) × M matching that of Wo . For the hidden weights, step 4 of the algorithm, we proceed similarly: ∂E( ) ∂Wh j = ∂E( ) ∂Ih j ( ) ∂Ih j ( ) ∂Wh j = −δh j ( )x( ), j = 1 . . . L, (10.22) where δh j ( ) is the negative rate of change of the local cost function with respect to the activation of the jth hidden neuron: δh j ( ) = − ∂E( ) ∂Ih j ( ) . Applying again the chain rule: δh j ( ) = − M k=1 ∂E( ) ∂I0 k( ) ∂Io k( ) ∂Ih j ( ) = M k=1 δo k( ) ∂Io k( ) ∂Ih j ( ) = M k=1 δo k( ) ∂wo k n( ) ∂Ih j ( ) . Since only the output of the jth hidden neuron is a function of Ih j ( ) = wh j x( ), we have δh j ( ) = M k=1 δo k( )wo jk ∂nj( ) ∂Ih j ( ) . The hidden units use the logistic activation function nj(Ih j ) = 1 1 + e−Ih j
108 CHAPTER 10. SUPERVISED CLASSIFICATION for which dnj dx = n(x)(1 − n(x)). Therefore we can write δh j ( ) = M k=1 δo k( )wo jknj( )(1 − nj( )), or, more compactly, δh j ( ) = (wo j δo ( )) nj( )(1 − nj( )). More compactly still, we can write 0 δh ( ) = n( ) ⊗ (1 − n( )) ⊗ Wo δo ( ) . (10.23) Note that the fact that 1 − n0( ) = 0 is made explicit in the above expression. Equation (10.23) is the origin of the term “backpropagation”, since it propagates the output error δo backwards through the network to determine the hidden unit error δh . Finally, with (10.22) we obtain the update rule for step 4 of the backpropagation algo- rithm, Wh ( + 1) → Wh ( ) + η x( )δh ( ). (10.24) The choice of an appropriate learning rate η is problematic: small values imply slow convergence and large values produce oscillation. Some improvement can be achieved with an additional parameter called momentum. We replace (10.21) with Wo ( + 1) := Wo ( ) + ∆o ( ) + α∆o ( − 1), (10.25) where ∆o ( ) = η n( )δo ( ), and α is the momentum parameter. A similar expression replaces (10.24). Typical choices for the backpropagation parameters are η = 0.01 and α = 0.5. Here is an object class extending FFN which implements backpropagation: ;+ ; NAME: ; FFNBP__DEFINE ; PURPOSE: ; Object class for implementation of a two-layer, feed-forward ; neural network for classification of multi-spectral images. ; Implements ordinary backpropagation training. ; Extends the class FFN ; Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999 ; AUTHOR ; Mort Canty (2005) ; Juelich Research Center ; m.canty@fz-juelich.de
10.5. NEURAL NETWORKS 109 ; CALLING SEQUENCE: ; ffn = Obj_New(FFNBP,Xs,Ys,L) ; ARGUMENTS: ; Xs: array of observation column vectors ; Xs: array of class label column vectors of form (0,0,1,0,0,...0)^T ; L: number of hidden neurons ; KEYWORDS ; None ; METHODS: ; TRAIN: train the network ; ffn - train ; DEPENDENCIES: ; FFN__DEFINE ; PROGRESSBAR (FSC_COLOR) ;-------------------------------------------------------------- Function FFNBP::Init, Xs, Ys, L catch, theError if theError ne 0 then begin catch, /cancel ok = dialog_message(!Error_State.Msg + ’ Returning...’, /error) return, 0 endif ; initialize the superclass if not self-FFN::Init(Xs, Ys, L) then return, 0 self.iterations = 10*self.p self.cost_array = ptr_new(fltarr((self.iterations+100)/100)) return, 1 End Pro FFNBP::Cleanup ptr_free, self.cost_array self-FFN::Cleanup End Pro FFNBP::Train iter = 0L iter100 = 0L eta = 0.01 ; learn rate alpha = 0.5 ; momentum progressbar = Obj_New(’progressbar’, Color=’blue’, Text=’0’,$ title=’Training: exemplar number...’,xsize=250,ysize=20) progressbar-start window,12,xsize=400,ysize=400,title=’Cost Function’ wset,12 inc_o1 = 0 inc_h1 = 0 repeat begin if progressbar-CheckCancel() then begin print,’Training interrupted’
110 CHAPTER 10. SUPERVISED CLASSIFICATION progressbar-Destroy return endif ; select exemplar pair at random ell = long(self.p*randomu(seed)) x=(*self.Xs)[ell,*] y=(*self.Ys)[ell,*] ; send it through the network m=self-forwardPass(x) ; determine the deltas d_o = y - m d_h = (*self.N*(1-*self.N)*(*self.Wo##d_o))[1:self.LL] ; d_h is now a row vector ; update the synaptic weights inc_o = eta*(*self.N##transpose(d_o)) inc_h = eta*(x##d_h) *self.Wo = *self.Wo + inc_o + alpha*inc_o1 *self.Wh = *self.Wh + inc_h + alpha*inc_h1 inc_o1 = inc_o inc_h1 = inc_h ; record cost history if iter mod 100 eq 0 then begin (*self.cost_array)[iter100]=alog10(self-cost()) iter100 = iter100+1 progressbar-Update,iter*100/self.iterations,text=strtrim(iter,2) plot,*self.cost_array,xrange=[0,iter100],color=0,background=’FFFFFF’XL,$ xtitle=’Iterations/100)’,ytitle=’log(cross entropy)’ end iter=iter+1 endrep until iter eq self.iterations progressbar-destroy End Pro FFNBP__Define struct = { FFNBP, $ cost_array: ptr_new(), $ iterations: 0L, $ Inherits FFN $ } End In the Train method, the training pairs are chosen at random, rather than cyclically as indicated in the backpropagation Algorithm. 10.6 Evaluation The rate of misclassification offers us a reasonable and obvious basis not only for evaluating the quality of classifiers, but also for their comparison, for example to compare the feed- forward network with Bayes maximum-likelihood. We shall characterize this rate in the following with the parameter θ. Through classification of test data which have not been
10.6. EVALUATION 111 used for training, we can obtain unbiased estimates of θ. If, for n test data, y are found to have been misclassified, then an intuitive value for this estimate is θ ≈ y n =: ˆθ. (10.26) However the estimated misclassification rates alone are insufficient for model comparison. We require their uncertainties as well. 10.6.1 Standard deviation of misclassification The classification of a single test datum is a random experiment, whose possible result we can characterize as the set { ¯A, A}: ¯A= misclassified, A = correctly classified. We define a real-valued function on this set, i.e. a random variable X( ¯A) = 1, X(A) = 0, (10.27) with probabilities P(X = 1) = θ = 1 − P(X = 0). The expectation value of this random variable is X = 1θ + 0(1 − θ) = θ (10.28) and its variance is var(X) = X2 − X 2 = 12 θ + 02 (1 − θ) − θ2 = θ(1 − θ). (10.29) For the classification of n test data, denoted by random variables X1 . . . Xn, the random variable Y = X1 + X2 + . . . Xn (10.30) is clearly the associated number of misclassifications. Since Y = X1 + . . . + Xn = nθ we obtain ˆθ = 1 n ˆY = y n (10.31) as an unbiased estimate of the rate θ of misclassifications. From the independence of the Xi, i = 1 . . . n, the variance of Y is given by var(Y ) = var(X1) + . . . + var(Xn) = nθ(1 − θ), and the variance of the misclassification rate is var Y n = Y 2 n2 − Y n 2 = 1 n2 ( Y 2 − Y 2 ) = 1 n2 var(Y ), or var Y n = θ(1 − θ) n . (10.32) For y observed misclassifications we estimate θ with (10.31). Then the estimated variance is given by ˆvar Y n ≈ ˆθ(1 − ˆθ) n = y n 1 − y n n = y(n − y) n3 ,
and the estimated standard deviation by
$$ \hat\sigma = \sqrt{\frac{y(n-y)}{n^3}}. \qquad(10.33) $$
The random variable $Y$ is binomially distributed. However, for a sufficiently large number $n$ of test data, the binomial distribution is well approximated by the normal distribution. Mean and standard deviation are then sufficient to characterize the distribution function completely.

10.6.2 Model comparison

A typical value for a misclassification rate is around $\theta \approx 0.05$. In order to claim that two values differ from one another significantly, they should lie at least about two standard deviations apart. If we wish to discriminate values separated by, say, 0.01, then $\hat\sigma$ should be no greater than 0.005. From (10.32) this means
$$ 0.005^2 \approx \frac{0.05\,(1 - 0.05)}{n}, $$
or $n \approx 2000$. That's quite a few. However, since we are dealing with pixel data, such a number of test pixels – assuming sufficient training areas are available – is quite realistic. If training and test data are in fact at a premium, there exist efficient alternatives¹ to the simple train-and-test philosophy presented here. However, since they are generally quite computer-intensive, we won't consider them further.

In order to express the claim that classifier A is better than classifier B more precisely, we can formulate a hypothesis test. The individual misclassification rates are approximately normally distributed. If they are also independent, we can construct a test statistic $S$ given by
$$ S = \frac{(Y_A/n - Y_B/n) - (\theta_A - \theta_B)}{\sqrt{\mathrm{var}(Y_A/n - Y_B/n)}} = \frac{(Y_A/n - Y_B/n) - (\theta_A - \theta_B)}{\sqrt{\mathrm{var}(Y_A/n) + \mathrm{var}(Y_B/n)}}. $$
We can then use $S$ to decide between the null hypothesis $H_0:\ \theta_A = \theta_B$, i.e. the two classifiers are equivalent, and the alternative hypothesis $H_1:\ \theta_A < \theta_B$ or $\theta_A > \theta_B$, i.e. one of the two methods is better.

Thus under $H_0$ we have $S \sim N(0,1)$. We choose a decision threshold $\pm Z_{\alpha/2}$ which corresponds to a probability $\alpha$ of an error of the first kind. With this probability the null hypothesis will be rejected although it is in fact true, see Figure 10.6.

In fact the strict independence of the misclassification rates $\theta_A$ and $\theta_B$ is not given, since they are determined with the same set of test data. The above hypothesis test with the statistic $S$ is therefore too conservative. For dependence we have namely
$$ \mathrm{var}(Y_A/n - Y_B/n) = \mathrm{var}(Y_A/n) + \mathrm{var}(Y_B/n) - 2\,\mathrm{cov}(Y_A/n, Y_B/n), $$

¹The buzz-words here are Cross-Validation and Bootstrapping; see [WK91], Chapter 2, for an excellent introduction.
10.6. EVALUATION 113 φ( ˆS) ˆS Zα/2 acceptance region © α/2 E' w −Zα/2 α/2 w Figure 10.6: Acceptance region for the first hypothesis test. If −Zα/2 ≤ ˆS ≤ Zα/2, the null hypothesis is accepted, otherwise it is rejected. where the covariance term cov(YA/n, YB/n) is positive. The test statistic S is correspond- ingly underestimated. We can formulate a non-parametric hypothesis test which avoids this problem of depen- dence. We distinguish the following events for classification of the test data: ¯AB, A ¯B, ¯A ¯B und AB. The variable ¯AB is the event test observation is misclassified by A and correctly classified by B, while A ¯B is the event test observation is correctly classified by A and misclassified by B and so on. As before we define random variables: X ¯AB, XA ¯B, X ¯A ¯B and XAB where X ¯AB( ¯AB) = 1, X ¯AB(A ¯B) = X ¯AB( ¯A ¯B) = X ¯AB(AB) = 0, with probabilities P(X ¯AB = 1) = θ ¯AB = 1 − P(X ¯AB = 0). Corresponding definitions are made for XA ¯B, X ¯A ¯B and XAB. Now, in comparing the two classifiers we are interested in the events ¯AB and A ¯B. If the number of former is significantly smaller than the number of the latter, then A is better than B and vice versa. Events ¯A ¯B in which both methods perform poorly are excluded. For n test observations the random variables Y ¯AB = X ¯AB1 + . . . X ¯ABn and YA ¯B = XA ¯B1 + . . . XA ¯Bn are the frequencies of the respective events. We then have Y ¯AB = nθ ¯AB, var(Y ¯AB) = nθ ¯AB(1 − θ ¯AB) YA ¯B = nθA ¯B, var(YA ¯B) = nθA ¯B(1 − θA ¯B).
114 CHAPTER 10. SUPERVISED CLASSIFICATION We expect that θ ¯AB 1, that is, var(Y ¯AB) ≈ nθ ¯AB = Y ¯AB . The same goes for YA ¯B. For a sufficiently large number of test observationss, the random variables Y ¯AB − Y ¯AB Y ¯AB and YA¯B − YA¯B YA¯B are thus approximately standard normally distributed. Under the null hypothesis (equivalence of the two classifiers), the expectation values of Y ¯AB and YA ¯B satisfy Y ¯AB = YA ¯B =: Y . Therefore we form the test statistic S = (Y ¯AB − Y )2 Y + (YA ¯B − Y )2 Y . This statistic, being the sum squares of approximately normally distributed random vari- ables, is chi-square distributed, see Chapter 2. Let y ¯AB and yA ¯B be the number of events actually measured. Then we estimate Y as ˆY = y ¯AB + yA ¯B 2 and determine our test statistic as ˆS = (y ¯AB − y ¯AB+yA ¯B 2 )2 y ¯AB+yA ¯B 2 + (yA ¯B − y ¯AB+yA ¯B 2 )2 y ¯AB+yA ¯B 2 . With a little algebra we get ˆS = (y ¯AB − yA ¯B)2 y ¯AB + yA ¯B , (10.34) the so-called McNemar statistic. It is chi-square distributed with one degree of freedom, see for example [Sie65]. A so-called continuity correction is usually made to (10.34) and S written as ˆS = (|y ¯AB − yA ¯B| − 1)2 y ¯AB + yA ¯B . But there are still reservations! We can only conclude that one classifier is or is not superior, relative to the common set of training data. We haven’t taken into account the variability of the training data, which were sampled just once from their underlying distri- butions, only that of the test data. If one or both of the classifiers is a neural network, we have also not considered the variability of the neural network training procedure with re- spect to the random initialization of the synaptic weights. All this constitutes an extremely computation-intensive task [Rip96]. 10.6.3 Confusion matrices The confusion matrix for M classes is defined as C =     c11 c12 s c1M c21 c22 s c2M ... ... ... ... cM1 cM2 s cMM    
10.6. EVALUATION 115 where cij is the number of test pixels from class Ci which are classified as Cj. The misclas- sification rate is ˆθ = y n = n − M i=1 cii n = n − Tr C n and only takes into account of the diagonal elements of the confusion matrix. The Kappa-coefficient κ make use of all the matrix elements. It is defined as follows: κ = correct classifications − chance correct classifications 1 − chance correct classifications For a purely randomly labeled test pixel, the proportion of correct classifications is approx- imately M i=1 ci ci n2 , where ci = M j=1 cij, ci = M j=1 cji. Hence an estimate of the Kappa coefficient is ˆκ = i cii n − i cici n2 1 − i cici n2 . (10.35) Again, the Kappa coefficient alone tells us little about the quality of the classifier. We require its standard deviation. This can be calculated in the large sample limit n → ∞ to be [BFH75] ˆσˆκ = 1 n θ1(1 − θ1) (1 − θ2)2 + 2(1 − θ1)(2θ1θ2 − θ3) (1 − θ3)3 + (1 − θ1)2 (θ4 − 4θ2 2) (1 − θ2)4 , (10.36) where θ1 = M i=1 cii θ2 = M i=1 cici θ3 = M i=1 cii(ci + ci) θ4 = M i,j=1 cij(cj + ci)2 .
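Both accuracy measures are straightforward to compute. The following IDL functions are minimal sketches with hypothetical names: C is an M × M confusion matrix, and yAB, yBA are the numbers of test pixels misclassified by classifier A but not by B and vice versa.

; Kappa coefficient, Eq. (10.35), from an [M,M] confusion matrix C,
; where C[i,j] is the number of test pixels of class i classified as j.
function kappa_coefficient, C
  n = total(C)
  M = n_elements(C[*,0])
  diag = 0.0  &  chance = 0.0
  for i = 0, M-1 do begin
    diag = diag + C[i,i]
    chance = chance + total(C[i,*])*total(C[*,i])
  endfor
  return, (diag/n - chance/n^2)/(1.0 - chance/n^2)
end

; McNemar statistic, Eq. (10.34), with continuity correction.
function mcnemar, yAB, yBA
  return, (abs(yAB - yBA) - 1.0)^2/(yAB + yBA)
end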
Chapter 11 Hyperspectral analysis Hyperspectral – as opposed to multispectral – images combine both high or moderate spatial resolution with high spectral resolution. Typical sensors (imaging spectrometers) generate in excess of two hundred spectral channels. Figure 11.1 shows part of a so-called image cube for the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) sensor taken over a region of the Californian coast. Sensors of this kind produce much more complex data and provide correspondingly much more information about the reflecting surfaces examined. Figure 11.2 displays the spectrum of a single pixel in the image. Figure 11.1: AVIRIS hyperspectral image cube, Santa Monica Mountains. 117
Figure 11.2: AVIRIS spectrum of one pixel location.

11.1 Mixture modelling

In working with multispectral images, the fact that at the scale of observation a pixel contains a mixture of materials is generally treated as a second order effect and more or less ignored. With the availability of high spectral resolution sensors it has become possible to treat the problem of the "mixed pixel" quantitatively. The basic premise of mixture modelling is that, within a given scene, the surface is dominated by a small number of common materials that have relatively constant spectral properties. These are referred to as the end-members. It is assumed that the spectral variability captured by the remote sensing system can be modelled by mixtures of these components.

11.1.1 Full linear unmixing

Suppose that there are $p$ end-members and $\ell$ spectral bands. Then we can denote the spectrum of the $i$th end-member by the vector
$$ m^i = \begin{pmatrix} m^i_1 \\ \vdots \\ m^i_\ell \end{pmatrix}. $$
Now define the matrix of end-members $M$ according to
$$ M = (m^1 \ldots m^p) = \begin{pmatrix} m^1_1 & \cdots & m^p_1 \\ \vdots & \ddots & \vdots \\ m^1_\ell & \cdots & m^p_\ell \end{pmatrix}, $$
with one column for each end-member. For hyperspectral imagery we always have $p \ll \ell$.

The measured signal $g$ is modelled as a linear combination of end-members plus a residual noise term:
$$ g = \alpha_1 m^1 + \ldots + \alpha_p m^p + n = M\alpha + n. $$
The residual $n$ is assumed to be normally distributed with covariance matrix
$$ \Sigma_n = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_\ell^2 \end{pmatrix}. $$
The standardized residual is $\Sigma_n^{-1/2} n$ and the square of the standardized residual is
$$ (\Sigma_n^{-1/2} n)^\top (\Sigma_n^{-1/2} n) = n^\top \Sigma_n^{-1} n. $$
The mixing coefficients are determined by minimizing this quantity with respect to $\alpha$ under the condition that they sum to unity. The corresponding Lagrange function is
$$ L = n^\top \Sigma_n^{-1} n - 2\lambda\Big(\sum_{i=1}^p \alpha_i - 1\Big) = (g - M\alpha)^\top \Sigma_n^{-1} (g - M\alpha) - 2\lambda\Big(\sum_{i=1}^p \alpha_i - 1\Big). $$
Solving the set of equations
$$ \frac{\partial L}{\partial \alpha} = 0, \qquad \frac{\partial L}{\partial \lambda} = 0, $$
we obtain the solution
$$ \alpha = (M^\top \Sigma_n^{-1} M)^{-1} (M^\top \Sigma_n^{-1} g - \lambda 1_p), \qquad \alpha^\top 1_p = 1, \qquad(11.1) $$
where $1_p = (1, 1 \ldots 1)^\top$. The first equation determines the mixing coefficients in terms of known quantities and $\lambda$. The second equation can be used to eliminate $\lambda$.

11.1.2 Unconstrained linear unmixing

If we work with MNF-projected data (see next section), then we can assume that $\Sigma_n = \sigma^2 I$. If, furthermore, we ignore the constraint on $\alpha$ (i.e. $\lambda = 0$), then (11.1) reduces to
$$ \alpha = \left[(M^\top M)^{-1} M^\top\right] g. $$
The expression in square brackets is the pseudoinverse of the matrix $M$, see Chapter 1.

11.1.3 Intrinsic end-members and pixel purity

If a spectral library for all of the $p$ end-members in $M$ is available, the mixture coefficients can be calculated directly. The primary result of the spectral mixture analysis is the fraction
images which show the spatial distribution and abundance of the end-member components in the scene. If such external data are unavailable, there are various strategies for determining end-members from the hyperspectral imagery itself. We describe briefly the method recommended in ENVI and implemented in the so-called "Spectral Hourglass Wizard". The first step is to reduce the dimensionality of the data. This is done with the MNF transformation described in Chapter 3. By examining the eigenvalues of the transformation and retaining only the components with eigenvalues exceeding one (non-noise components), the number of dimensions can be reduced substantially, see Figure 11.3.

Figure 11.3: Eigenvalues of the MNF transformation of the image in Figure 11.1.

The so-called pixel purity index (PPI) is then used to find the most spectrally pure, or extreme, pixels in the remaining data. The most spectrally pure pixels typically correspond to mixing end-members. The PPI is computed by repeatedly projecting n-dimensional scatter plots onto a random unit vector. The extreme pixels in each projection are noted and the number of times each pixel is marked as extreme is recorded. The purest pixels must be on the corners, edges or faces of the data cloud. A threshold value is used to define how many pixels are marked as extreme at the ends of the projected vector. This value should be 2-3 times the noise level in the data, which is 1 when using the MNF transformed channels. A minimum of about 5000 iterations is usually required to produce useful results. When the iterations are completed, a PPI image is created in which the value of each pixel corresponds to the number of times that pixel was recorded as extreme. Bright pixels are therefore generally end-members. This image hints at locations and sites that could be visited for ground truth measurements. The n-dimensional visualizer (Figure 11.4) can then be used interactively to define classes of pixels corresponding to end-members and to plot their spectra. These can be saved along with their pixel locations as ROIs (regions of interest) for later use in spectral unmixing. This method is repeatable and has the advantage of objectivity in the analysis of a data set to assess dimensionality and define end-members. The primary disadvantage is that it is a statistical approach dependent upon the specific spectral variance of the image. Thus the resulting end-members are mathematical constructs which may not be physically interpretable.
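Once a set of end-member spectra is in hand, whether from a spectral library or from the PPI analysis just described, the unconstrained unmixing of Section 11.1.2 reduces to a single application of the pseudoinverse. Here is a minimal IDL sketch; it is not part of the ENVI extensions in this book, and the end-member matrix and pixel spectrum are made-up numbers:

; Unconstrained linear unmixing, alpha = [(M^T M)^-1 M^T] g.
; Three end-members in five (MNF-transformed) bands; values are arbitrary.
Ms = [[0.8, 0.6, 0.4, 0.3, 0.2], $    ; end-member 1
      [0.1, 0.3, 0.7, 0.8, 0.6], $    ; end-member 2
      [0.4, 0.4, 0.4, 0.4, 0.4]]      ; end-member 3
Ms = transpose(Ms)                     ; one column per end-member
g  = [0.45, 0.43, 0.48, 0.47, 0.38]    ; observed pixel spectrum
pseudo = invert(transpose(Ms)##Ms) ## transpose(Ms)   ; pseudoinverse of M
alpha  = pseudo ## g
print, 'mixing coefficients:', reform(alpha)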
Figure 11.4: The n-D visualizer.

11.2 Orthogonal subspace projection

Orthogonal subspace projection is a transformation which is closely related to linear unmixing. Suppose that a multispectral image pixel g consists of a mixture of "desirable" and "undesirable" spectra,
$$\mathbf{g} = \mathbf{D}\boldsymbol\alpha + \mathbf{U}\boldsymbol\beta + \mathbf{n}.$$
The $N \times N$ matrix
$$\mathbf{P} = \mathbf{I} - \mathbf{U}(\mathbf{U}^\top\mathbf{U})^{-1}\mathbf{U}^\top,$$
where N is the number of spectral bands, "projects out" the undesirable components, since $\mathbf{PU} = \mathbf{U} - \mathbf{U}(\mathbf{U}^\top\mathbf{U})^{-1}\mathbf{U}^\top\mathbf{U} = \mathbf{0}$ and therefore
$$\mathbf{Pg} = \mathbf{PD}\boldsymbol\alpha + \mathbf{PU}\boldsymbol\beta + \mathbf{Pn} = \mathbf{PD}\boldsymbol\alpha + \mathbf{Pn}. \qquad (11.2)$$
An example of the use of this transformation is the suppression of cloud cover from a multispectral image. First an unsupervised classification is carried out (see Chapter 9) and the clusters containing the undesired features (clouds) are identified. The mean vectors of these clusters can then be used as the undesired spectra and combined to form the matrix U. The projection (11.2) can be applied to the entire image. Here is an ENVI/IDL program to implement this idea:

; Orthogonal subspace projection
pro osp, event
   print, '---------------------------------'
   print, 'Orthogonal Subspace Projection'
   print, systime(0)
   print, '---------------------------------'
   infile=dialog_pickfile(filter='*.dat',/read)
; read in cluster centers
   openr,lun,infile,/get_lun
   readf,lun,num_channels         ; number of spectral channels
   readf,lun,K                    ; number of cluster centers
   Ms=fltarr(num_channels,K)
   readf,lun,Ms
   Us=transpose(Ms)
   print,'Cluster centers (in the columns)'
   print,Us
   centers=indgen(K)
   print,'enter undesired centers as 1 (e.g. 0 1 1 0 0 ...)'
   read,centers
   U = Us[where(centers),*]
   print,'Subspace U'
   print,U
   Identity = fltarr(num_channels,num_channels)
   for i=0,num_channels-1 do Identity[i,i]=1.0
   P = Identity - U##invert(transpose(U)##U,/double)##transpose(U)
   print,'projection matrix:'
   print, P
   envi_select, title='Choose multispectral image for projection', $
                fid=fid, dims=dims, pos=pos
   if (fid eq -1) then goto, done
   num_cols = dims[2]+1
   num_lines = dims[4]+1
   num_pixels = (num_cols*num_lines)
   if (num_channels ne n_elements(pos)) then begin
      print,'image dimensions are incorrect, aborting ...'
      goto, done
   end
   image=fltarr(num_pixels,num_channels)
   for i=0,num_channels-1 do $
      image[*,i]=envi_get_data(fid=fid,dims=dims,pos=pos[i])+0.0
   print,'projecting ...'
; do the projection
   image = P ## image
   out_array = bytarr(num_cols,num_lines,num_channels)
   for i = 0,num_channels-1 do out_array[*,*,i] = $
      bytscl(reform(image[*,i],num_cols,num_lines,/overwrite))
   base = widget_auto_base(title='OSP Output')
   sb = widget_base(base, /row, /frame)
   wp = widget_outfm(sb, uvalue='outf', /auto)
   result = auto_wid_mng(base)
   if (result.accept eq 0) then begin
      print, 'Output cancelled'
      goto, done
   endif
   if (result.outf.in_memory eq 1) then begin
      envi_enter_data, out_array
      print, 'Result written to memory'
   endif else begin
      openw, unit, result.outf.name, /get_lun
      band_names=strarr(num_channels)
      for i=0,num_channels-1 do begin
         writeu, unit, out_array[*,*,i]
         band_names[i]='OSP component ' + string(i+1)
      endfor
      envi_setup_head, fname=result.outf.name, ns=num_cols, $
         nl=num_lines, nb=num_channels $
         ,data_type=1, interleave=0, /write $
         ,bnames=band_names $
         ,descrip='OSP'
      print, 'File created ', result.outf.name
      close, unit
   endelse
done: print,'done.'
end
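The behaviour of the projection matrix is easy to check on a toy example. The following minimal sketch mirrors the matrix construction in the program above; the two "undesired" spectra and the pixel vector are made-up four-band values:

; Toy numerical check of (11.2): P annihilates the undesired spectra.
U = transpose([[0.2, 0.5, 0.8, 0.3], $
               [0.9, 0.4, 0.1, 0.2]])    ; undesired spectra, one per column
Identity = fltarr(4,4)
for i=0,3 do Identity[i,i] = 1.0
P = Identity - U##invert(transpose(U)##U,/double)##transpose(U)
print, 'P ## U (should be zero):'
print, P ## U
g = [0.7, 0.3, 0.9, 0.5]                 ; made-up pixel spectrum
print, 'projected pixel:', reform(P ## g)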
Appendix A

Least Squares Procedures

A.1 Generalized least squares

Consider the following data model:
$$y = \mathbf{a}^\top\mathbf{x} + \epsilon$$
relating n independent variables $\mathbf{x} = (x_1 \ldots x_n)^\top$ to a measured quantity y via the parameters $\mathbf{a} = (a_1 \ldots a_n)^\top$. The random variable $\epsilon$ represents measurement uncertainty, and we assume $\mathrm{var}(\epsilon) = \sigma^2$. We wish to determine the "best values" for the parameters a. If we perform $m \ge n$ measurements, we can write
$$y_1 = \sum_{j=1}^n a_j (x_j)_1 + \epsilon_1, \quad\ldots,\quad y_m = \sum_{j=1}^n a_j (x_j)_m + \epsilon_m. \qquad (A.1)$$
Defining the $m \times n$ matrix A by $(\mathbf{A})_{ij} = (x_j)_i$ we can write (A.1) as
$$\mathbf{y} = \mathbf{A}\mathbf{a} + \boldsymbol\epsilon \qquad (A.2)$$
where $\mathbf{y} = (y_1 \ldots y_m)^\top$, $\boldsymbol\epsilon = (\epsilon_1 \ldots \epsilon_m)^\top$ and $\boldsymbol\Sigma_\epsilon = \langle\boldsymbol\epsilon\boldsymbol\epsilon^\top\rangle = \sigma^2\mathbf{I}$. The "goodness of fit" function is
$$\chi^2 = \sum_{i=1}^m \left(\frac{y_i - \sum_{j=1}^n A_{ij}a_j}{\sigma}\right)^2.$$
This is minimized by solving the equations
$$\frac{\partial\chi^2}{\partial a_k} = 0, \qquad k = 1 \ldots n.$$
We obtain
$$\sum_{i=1}^m \Big(y_i - \sum_{j=1}^n A_{ij}a_j\Big) A_{ik} = 0, \qquad k = 1 \ldots n,$$
which we can write in matrix form as
$$\mathbf{A}^\top\mathbf{y} = (\mathbf{A}^\top\mathbf{A})\,\mathbf{a}. \qquad (A.3)$$
Eq. (A.3) is referred to as the normal equation. The fitted parameters of the model are thus estimated by
$$\hat{\mathbf{a}} = (\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{y} =: \mathbf{L}\mathbf{y}. \qquad (A.4)$$
The matrix $\mathbf{L} = (\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top$ is called the pseudoinverse of A. Thinking now of a as a random variable with expectation value $\hat{\mathbf{a}}$, the uncertainties in the fitted parameters can be obtained as follows:
$$\boldsymbol\Sigma_a = \langle(\mathbf{a}-\hat{\mathbf{a}})(\mathbf{a}-\hat{\mathbf{a}})^\top\rangle = \langle(\mathbf{a}-\mathbf{L}\mathbf{y})(\mathbf{a}-\mathbf{L}\mathbf{y})^\top\rangle = \langle(\mathbf{a}-\mathbf{L}(\mathbf{A}\mathbf{a}+\boldsymbol\epsilon))(\mathbf{a}-\mathbf{L}(\mathbf{A}\mathbf{a}+\boldsymbol\epsilon))^\top\rangle.$$
But $\mathbf{LA} = \mathbf{I}$, so we have
$$\boldsymbol\Sigma_a = \langle(-\mathbf{L}\boldsymbol\epsilon)(-\mathbf{L}\boldsymbol\epsilon)^\top\rangle = \mathbf{L}\langle\boldsymbol\epsilon\boldsymbol\epsilon^\top\rangle\mathbf{L}^\top = \sigma^2\mathbf{L}\mathbf{L}^\top = \sigma^2(\mathbf{A}^\top\mathbf{A})^{-1}. \qquad (A.5)$$
To check that this is indeed a generalization of the simple linear regression, identify the parameter vector a with the straight line parameters a and b, i.e.
$$\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} = \begin{pmatrix} a \\ b \end{pmatrix}.$$
The matrix A and vector y are similarly
$$\mathbf{A} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}.$$
Thus the best estimates for the parameters are
$$\hat{\mathbf{a}} = \begin{pmatrix} \hat a \\ \hat b \end{pmatrix} = (\mathbf{A}^\top\mathbf{A})^{-1}(\mathbf{A}^\top\mathbf{y}).$$
A.2. RECURSIVE LEAST SQUARES 127 Evaluating: (A A)−1 = m xi xi x2 i −1 = m m¯x m¯x x2 i −1 . Recalling the expression for the inverse of a 2 × 2 matrix, (A A)−1 = 1 m x2 i + m2 ¯x2 x2 i −m¯x −m¯x m . Furthermore, we have A y = m¯y xiyi . Therefore the estimate for b is ˆb = 1 m x2 i + m2 ¯x2 (−m2 ¯x¯y + m xiyi) = −m¯x¯y + xiyi m x2 i + m2 ¯x2 . (A.6) From (A.3) the uncertainty in b is given by σ2 times the (2,2) element of (A A)−1 , σ2 b = σ2 m m x2 i + m2 ¯x2 . (A.7) Equations (A.6) and (A.7) correspond to those for ordinary least squares. A.2 Recursive least squares Suppose that the measurement data in (A.1) are presented sequentially and we wish to determine the best solution for the parameters a as the new data become available. We can write Eq. (A.2) in the form y = A a + (A.8) indicating that measurements have been made up till now (we assume n), where as before n is the number of parameters (the length of a). The least squares solution is, with (A.4), ˆa = (A A )−1 A y =: a( ) and, from (A.5), the covariance matrix of a( ) is Σ = (A A )−1 . (A.9) We have assumed for convenience that σ2 = 1. Therefore we can write a( ) = Σ A y . (A.10) Suppose a new observation becomes available. (We’ll call it (x( + 1), y( + 1)) rather than (x +1, y +1), as this simplifies the notation considerably.) Now we must solve the least squares problem y y( + 1) = A A( + 1) a + , where A( + 1) = x( + 1) . According to (A.10) the solution is a( + 1) = Σ +1 A A( + 1) y y( + 1) . (A.11)
128 APPENDIX A. LEAST SQUARES PROCEDURES From (A.9) we can obtain a recursive formula for the covariance matrix Σ +1: Σ−1 +1 = A A( + 1) A A( + 1) = A A + A +1A +1 or Σ−1 +1 = Σ−1 + A( + 1) A( + 1). (A.12) Next we multiply Eq. (A.11) out, a( + 1) = Σ +1(A y + A( + 1) y( + 1)), and replace y with A a( ) to obtain a( + 1) = Σ +1(A A a( ) + A( + 1) y( + 1)). Using (A.9) and (A.12), a( + 1) = Σ +1(Σ−1 a( ) + A( + 1) y( + 1)) = Σ +1 Σ−1 +1a( ) − A( + 1) A( + 1)a( ) + A( + 1) y( + 1) . This simplifies to a( + 1) = a( ) + Σ +1A( + 1) y( + 1) − A( + 1)a( ) . Finally, with the definition of the Kalman gain K +1 := Σ +1A( + 1) , (A.13) we also obtain a recursive equation for the parameter vector a, namely a( + 1) = a( ) + K +1 y( + 1) − A( + 1)a( ) . (A.14) Equations (A.12–A.14) define a so-called Kalman filter for the least squares problem (A.8). For input x( + 1) = A( + 1) the system response A( + 1)a( ) is calculated in (A.14) and compared with the measurement y( + 1). Then the innovation, that is to say the difference between the measurement and system response, is multiplied by the Kalman gain determined by (A.13) and (A.12) and the old value a( ) is corrected accordingly. Relation (A.12) is inconvenient as it calculates the inverse of the covariance matrix Σ +1 whereas we require the non-inverted form in order to determine the Kalman gain (A.13). Fortunately (A.12) and (A.13) can be reformed as follows: Σ +1 = I − K +1A( + 1) Σ K +1 = Σ A( + 1) A( + 1)Σ A( + 1) + 1 −1 . (A.15)
A.3. ORTHOGONAL REGRESSION 129 To see this, first of all note that the second equation in (A.15) is a consequence of the first equation and (A.13). Therefore it suffices to show that the first equation is indeed the inverse of (A.12): Σ +1Σ−1 +1 = I − K +1A( + 1) Σ Σ−1 +1 = I − K +1A( + 1) + I − K +1A( + 1) Σ A( + 1) A( + 1) = I − K +1A( + 1) + Σ A( + 1) A( + 1) − K +1A( + 1)Σ A( + 1) A( + 1). The second equality above follows from (A.12). But from the second equation in (A.15) we have K +1A( + 1)Σ A( + 1) = Σ A( + 1) − K +1 and therefore Σ +1Σ−1 +1 = I − K +1A( + 1) + Σ A( + 1) A( + 1) − (Σ A( + 1) − K +1)A( + 1) = I as required. A.3 Orthogonal regression In the model for ordinary least squares regression the xs are assumed to be error-free. In the calibration case where it is arbitrary what we call the reference variable and what we call the uncalibrated variable to be normalized, we should allow for error in both x and y. If we impose the model1 yi − i = a + b(xi − δi), i = 1 . . . m (A.16) with and δ as uncorrelated, white, Gaussian noise terms with mean zero and equal variances σ2 , we get for the estimator of b, [KS79], ˆb = (s2 yy − s2 xx) + (s2 yy − s2 xx)2 + 4s2 xy 2sxy (A.17) with s2 yy = 1 m n i=1 (yi − ¯y)2 (A.18) and the remaining quantities defined in the section immediately above. The estimator for a is ˆa = ¯y − ˆb¯x. (A.19) According to [Pat77, Bil89] we get for the dispersion matrix of the vector (ˆa,ˆb) σ2ˆb(1 + ˆb2 ) msxy ¯x2 (1 + ˆτ) + sxy/ˆb −¯x(1 + ˆτ) −¯x(1 + ˆτ) 1 + ˆτ (A.20) with ˆτ = σ2ˆb (1 + ˆb2)sxy (A.21) 1The model in equation (A.16) is often referred to as a linear functional relationship in the literature.
130 APPENDIX A. LEAST SQUARES PROCEDURES and where σ2 can be replaced by ˆσ2 = m (n − 2)(1 + ˆb2) (s2 yy − 2ˆbsxy + ˆb2 s2 xx), (A.22) see [KS79]. It can be shown that estimators of a and b can be calculated by means of the elements in the eigenvector corresponding to the smallest eigenvalue of the dispersion matrix of the m by 2 data matrix with a vector of the xs in the first column and a vector of the ys in the second column, [KS79]. This can be used to perform orthogonal regression in higher dimensions, i.e., when we have, for example, more x variables than the one variable we have here.
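The eigenvector formulation just described is easy to program. Here is a minimal IDL sketch with simulated calibration data; the true line (y = 2 + 0.75 x), the equal noise variances in x and y, and the sample size are arbitrary choices made for the illustration:

; Orthogonal regression: slope and intercept from the eigenvector belonging
; to the smallest eigenvalue of the 2 x 2 dispersion matrix of (x, y).
m  = 200
seed = 42L
x0 = 10*randomu(seed, m)                   ; noise-free abscissae
y  = 2.0 + 0.75*x0 + 0.3*randomn(seed, m)
x  = x0 + 0.3*randomn(seed, m)             ; equal noise in x and y, cf. (A.16)
xm = mean(x) & ym = mean(y)
sxx = total((x-xm)^2)/m
syy = total((y-ym)^2)/m
sxy = total((x-xm)*(y-ym))/m
S = [[sxx, sxy], [sxy, syy]]               ; dispersion matrix
evals = eigenql(S, eigenvectors=evecs)     ; symmetric eigenproblem
idx = where(evals eq min(evals))
u = evecs[*, idx[0]]                       ; eigenvector of smallest eigenvalue
b_hat = -u[0]/u[1]                         ; slope estimate
a_hat = ym - b_hat*xm                      ; intercept, cf. (A.19)
print, 'orthogonal regression a, b:', a_hat, b_hat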
Appendix B

The Discrete Wavelet Transformation

The following discussion follows [AS99] closely.

B.1 Inner product space

Let f and g be two functions on the real numbers IR and define their inner product as
$$\langle f, g\rangle = \int_{-\infty}^{\infty} f(t)g(t)\,dt.$$
The inner product space $L^2(\mathrm{IR})$ is the collection of all functions $f: \mathrm{IR} \to \mathrm{IR}$ such that
$$\|f\| = \langle f, f\rangle^{1/2} = \left(\int_{-\infty}^{\infty} f(t)^2\,dt\right)^{1/2} < \infty.$$
The distance between two functions f(t) and g(t) in $L^2(\mathrm{IR})$ is
$$d(f,g) = \|f - g\| = \left(\int_{-\infty}^{\infty} (f(t)-g(t))^2\,dt\right)^{1/2}.$$

B.2 Haar wavelets

Let $V_n$ be the collection of all piecewise constant functions of finite extent¹ that have possible discontinuities at the rational points $m \times 2^{-n}$, where m and n are integers, $m, n \in Z$. Then all members of $V_n$ belong to the inner product space $L^2(\mathrm{IR})$, $V_n \subseteq L^2(\mathrm{IR})$. Define the Haar scaling function according to
$$\phi(t) = \begin{cases} 1 & \text{if } 0 \le t \le 1 \\ 0 & \text{otherwise}. \end{cases} \qquad (B.1)$$
132 APPENDIX B. THE DISCRETE WAVELET TRANSFORMATION 1 1 t Figure B.1: The Haar scaling function. It is shown in Figure B.1. Any function in Vn in [0, 1] can be expressed as a linear combination of the standard Haar basis functions of the form Cn = {φn,k(t) = φ(2n t − k) | k = 0, 1 . . . 2n − 1}. These basis functions have compact support and are orthogonal in the following sense: φn,k, φn,k = 1 2n · δk,k . Note that φ0,0(t) = φ(t). Consider the function spaces V0 and V1 with orthogonal bases {φ0,0} and {φ1,0, φ1,1}, respectively. According to the orthogonal decomposition theorem [?], any function in V1 can be projected onto basis functions φ0,0(t) for V0 plus a residual in the space V ⊥ 0 which is orthogonal to V0. Formally, V1 = V0 ⊕ V ⊥ 0 . For example φ1,0(t) = φ1,0, φ0,0 φ0,0, φ0,0 φ0,0(t) + r(t). The residual function r(t) is in the residual space V ⊥ 0 . We see that r(t) = φ1,0(t) − 1 2 φ0,0(t) = φ(2t) − 1 2 φ(t) = φ(2t) − 1 2 [φ(2t) + φ(2t − 1)] = 1 2 ψ(t) where ψ(t) is the Haar wavelet derived from the scaling function φ according to ψ(t) = φ(2t) − φ(2t − 1). (B.2) 1Such functions are said to have compact support.
B.2. HAAR WAVELETS 133 Thus an alternative basis for V1 is B1 = {φ(t), ψ(t)} = {φ0,0(t), ψ0,0(t)}. The wavelet ψ0,0(t) is shown in Figure B.2 1 1 Figure B.2: The Haar wavelet. We can repeat this argument for V2 = V1 ⊕ V ⊥ 1 to obtain the basis B2 = {φ0,0(t), ψ0,0(t), ψ1,0(t), ψ1,1(t)}, where now {ψ1,0(t), ψ1,1(t)} is an orthogonal basis for V ⊥ i . In general, the Haar wavelet basis for Vn is Bn = {φ0,0(t), ψ0,0(t), ψ1,0(t), ψ1,1(t) . . . ψn−1,0(t), ψn−1,1(t) . . . ψn−1,2n−1(t)}, where {ψm,k(t) = ψ(2m t − k) | k = 0 . . . 2n − 1} is an orthogonal basis for V ⊥ m , and Vn = Vn−1 ⊕ V ⊥ n−1 = V0 ⊕ V ⊥ 0 ⊕ . . . ⊕ V ⊥ n−2 ⊕ V ⊥ n−1. In the case of the Haar wavelets, there is a simple correspondence between the basis functions (φ, ψ) and the vector space IR2n , i.e. the space of 2n -dimensional vectors. Consider for instance n = 2. Then the correspondence is φ0,0 =    1 1 1 1    , φ2,0 =    1 0 0 0    , φ2,1 =    0 1 0 0    , φ2,2 =    0 0 1 0    , φ2,3 =    0 0 0 1    ,
134 APPENDIX B. THE DISCRETE WAVELET TRANSFORMATION and ψ0,0 =    1 1 −1 −1    , ψ1,0 =    1 −1 0 0    , ψ1,1 =    0 0 1 −1    . Thus the orthogonal basis B2 can be represented by the mutually orthogonal vectors B2 =       1 1 1 1    ,    1 1 −1 −1    ,    1 −1 0 0    ,    0 0 1 −1       . Example: signal compression We consider the continuous function f(t) = sin(20t)(log t)2 sampled at 64 evenly spaced points on the interval [0, 1]. The 64 samples comprise a signal vector f = (f0, f1 . . . f63) = (f(0/63), f(1/63) . . . f(63/63)) and can also be thought of as a piecewise constant function ¯f(t) belonging to the function space V6. The function is shown in Figure B.3. Figure B.3: The function sin(20t)(log x)2 sampled at 64 points on [0, 1]. We can express the function ¯f(t) in the basis C6 as follows: ¯f = f0φ6,0(t) + f1φ6,1(t) + . . . f63φ6,63, where we think of the basis functions as vectors and where fi is the value of the function sampled at point i, i = 0 . . . 63. Alternatively the signal can be written in the vector basis B6, ¯f = w0φ0,0(t) + w1ψ0,0(t) + . . . + w63ψ5,31(t). We can write this equivalently as the matrix equation f = B6 · w. (B.3)
B.2. HAAR WAVELETS 135 where B6 is a (64 × 64)-dimensional matrix of ones and zeroes. This is too large to show here, but B3 =            1 1 1 0 1 0 0 0 1 1 1 0 −1 0 0 0 1 1 −1 0 0 1 0 0 1 1 −1 0 0 −1 0 0 1 −1 0 1 0 0 1 0 1 −1 0 1 0 0 −1 0 1 −1 0 −1 0 0 0 1 1 −1 0 −1 0 0 0 −1            , for example. The elements of the vector w comprise the wavelet coefficients. They are given by the wavelet transform w = B−1 6 · f. The wavelet coefficients are thus an alternative way of representing the original signal ¯f(t). They are plotted in Figure B.4 Figure B.4: The wavelet coefficients w for the signal in Figure B.3. Notice that many of the coefficients are close to zero. We can define a threshold below which all coefficients are set exactly to zero. This generally leads to long series of zeroes in w, so that it can be compressed efficiently, w → wcompr. Figure B.5 shows the result of reconstructing the signal according to f = B6 · wcompr after setting a threshold of 0.1. In all, 33 of the 64 wavelet coefficients are zero after thresholding.
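The same construction can be carried out by hand in the 4-point case with the basis B2 written out above. Here is a minimal IDL sketch (the signal values are arbitrary); the full 64-point program follows below:

; Haar wavelet transform of a 4-point signal using the basis B2.
B2 = float([[1, 1, 1, 1], $      ; phi_00
            [1, 1,-1,-1], $      ; psi_00
            [1,-1, 0, 0], $      ; psi_10
            [0, 0, 1,-1]])       ; psi_11
B2 = transpose(B2)               ; one basis vector per column, as in the text
f  = [4.0, 2.0, 5.0, 5.0]        ; arbitrary 4-point signal
w  = invert(B2) ## f             ; wavelet coefficients
print, 'coefficients:  ', reform(w)
print, 'reconstruction:', reform(B2 ## w)   ; recovers f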
136 APPENDIX B. THE DISCRETE WAVELET TRANSFORMATION Figure B.5: The reconstructed signal after thresholding at 0.1. The following program illustrates the above steps in IDL: ; generate a signal vector t=findgen(64)/63 f=sin(20*t)*alog(t)*alog(t) f[0]=0 ; output as EPS file thisDevice =!D.Name set_plot, ’PS’ Device, Filename=’c:tempsignal.eps’,xsize=15,ysize=10,/Encapsulated plot,t,f,color=1,background=’FFFFFF’XL device,/close_file ; read in basis B6 filename = Dialog_Pickfile(Filter=’*.dat’,/Read) openR,lun,filename,/get_lun B6 = fltarr(64,64) ReadF, lun, B6 ; do the wavelet transform and display w=invert(B6)##f Device, Filename=’c:tempwavelet_coeff.eps’,xsize=15,ysize=10,/Encapsulated plot,t,w,color=1,background=’FFFFFF’XL ; display compressed signal w( where(abs(w) lt 0.1) )=0.0 Device, Filename=’c:temprecon_signal.eps’,xsize=15,ysize=10,/Encapsulated plot,t,w,color=1,background=’FFFFFF’XL device,/close_file set_plot, thisDevice end
B.3. MULTI-RESOLUTION ANALYSIS 137 B.3 Multi-resolution analysis So far we have considered only functions on the interval [0, 1] with basis functions φn,k(t) = φ(2n t − k), k = 1 . . . 2n − 1. We can extend this to functions defined on all real numbers IR in a straightforward way. For example {φ(t − k) | k ∈ Z} is a basis for the space V0 of all piecewise constant functions with compact support (finite extent) having possible breaks at integer values. More generally, a basis for the set Vn of piecewise constant functions with possible breaks at m × 2−n and compact support is {φ(2n t − k) | k ∈ Z}. We can even allow n 0. For example n = −1 means that the possible breaks are at even integer values. We can think of the collection of nested subspaces of piecewise constant functions . . . ⊆ V−1 ⊆ V0 ⊆ V1 ⊆ V2 ⊆ . . . ⊆ L2 (IR), as being generated by the Haar scaling function φ. This collection is called a multiresolution analysis (MRA). A general MRA must have the following properties: 1. V = n∈Z Vn is dense in L2 (IR), that is, for any function f ∈ L2 (IR) there exists a series of functions, one in each Vn, which converges to f. This is true of the Haar MRA, see Figure 2.7 for example. 2. The separation property: I = n∈Z Vn = {0}. For the Haar MRA, this means that any function in I must be piecewise constant on all intervals. The only function in L2 (IR) with this property and compact support is f(t) = 0, so the separation property is satisfied. 3. The function f(t) ∈ Vn if and only if f(2−n t) ∈ V0. In the Haar MRA, if f(t) ∈ V1 then it is piecewise constant on intervals of length 1/2. Therefore the function f(2−1 t) is piecewise constant on intervals of length 1, that is f(2−1 t) ∈ V0, etc. 4. The scaling function φ is an orthonormal basis for the function space V0, i.e. φ(t − k), φ(t − k ) = δkk . This is of course the case for the Haar scaling function. In the following, we will think of φ(t) as any scaling function which generates an MRA in the above sense. Since {φ(t − k) | k ∈ Z} is an orthonormal basis for V0, it follows that {φ(2t − k) | k ∈ Z} is an orthogonal basis for V1. That is, let f(t) ∈ V1. Then by property 3, f(t/2) ∈ V0 and f(t/2) = k akφ(t − k) ⇒ f(t) = k akφ(2t − k). In particular, since φ(t) ∈ V0 ⊆ V1, we have the dilation equation φ(t) = k ckφ(2t − k). (B.4)
138 APPENDIX B. THE DISCRETE WAVELET TRANSFORMATION The constants ck are called the refinement coefficients. For example, the dilation equation for the Haar wavelets is φ(t) = φ(2t) + φ(2t − 1) so that the refinement coefficients are c0 = c1 = 1, ck = 0 otherwise. Note that c2 0 +c2 1 = 2. It is easy to show that this is a general property of the refinement coefficients: 1 = φ(t), φ(t) = k ckφ(2t − k), k ck φ(2t − k ) = 1 2 k c2 k. Therefore, ∞ k=−∞ c2 k = 2, (B.5) which is also called Parseval’s formula. In a similar way it is easy to show that ∞ k=−∞ ckck−2j = 0, j = 0. (B.6) B.4 Fixpoint wavelet approximation There are many other possible scaling functions that define or generate a MRA. Some of these cannot be expressed as simple, analytical functions. But once we have the refinement coefficients for a scaling function, we can approximate that scaling function to any desired degree of accuracy using the dilation equation. (In fact we can work with a MRA even when there is no simple analytical representation for the scaling function which generates it.) The idea is to iterate the refinement equation with a so-called fixpoint algorithm until it converges to a sequence of points which approximates φ(t). Let F be the function that assigns the expression F(γ)(t) = n cnγ(2t − n) to any function γ(t), where cn are refinement coefficients. Applying F to the Haar scaling function: F(φ)(t) = n cnφ(2t − n) = φ(t) where the second equality follows from the dilation equation. Thus φ is a fixpoint of F. The following recursive scheme can be used to estimate a scaling function with up to four refinement coefficients: f0(t) = δt,0 fi(t) = c0fi−1(2t) + c1fi−1(2t − 1) + c2fi−1(2t − 2) + c3fi−1(2t − 3). In this scheme, t takes on values of the form m × 2n , m, n ∈ Z, only. The first definition is the termination condition for the recursion and approximates the scaling function to zeroth order as the Dirac delta function. The second relation defines the ith approximation to the scaling function in terms of the (i−1)th approximation using the dilation equation. We can calculate the set φ ≈ fn j 2n j = 0 . . . 3(2n ) , n 1,
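Relations (B.5) and (B.6) are easy to verify numerically for any proposed set of refinement coefficients. Here is a minimal IDL sketch, checking them for the Haar coefficients and for the four Daubechies coefficients that appear later in this appendix:

; Check sum c_k^2 = 2, Eq. (B.5), and sum c_k c_(k-2j) = 0 for j = 1, Eq. (B.6).
c_haar = [1.0, 1.0]
c_daub = [1+sqrt(3), 3+sqrt(3), 3-sqrt(3), 1-sqrt(3)]/4.0
for icase=0,1 do begin
   if icase eq 0 then c = c_haar else c = c_daub
   cpad = [c, 0.0, 0.0]                           ; pad so the shift is explicit
   print, 'sum c_k^2       =', total(c^2)         ; should equal 2
   print, 'sum c_k c_(k-2) =', total(cpad[2:*]*c) ; should vanish
endfor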
B.4. FIXPOINT WAVELET APPROXIMATION 139 as a pointwise approximation of φ. Here is a recursive IDL program to approximate any scaling function with 4 refinement coefficients: ffunction f, t, i common refinement, c0,c1,c2,c3 if (i eq 0) then if (t eq 0) then return, 1.0 else return, 0.0 $ else return, c0*f(2*t,i-1)+c1*f(2*t-1,i-1)+c2*f(2*t-2,i-1)+c3*f(2*t-3,i-1) end common refinement, c0,c1,c2,c3 ; refinement coefficients for Haar scaling function ; c0=1 c1=1 c2=0 c3=0 ; refinement coefficients for Daubechies scaling function c0=(1+sqrt(3))/4 c1=(3+sqrt(3))/4 c2=(3-sqrt(3))/4 c3=(1-sqrt(3))/4 ; fourth order approximation n=4 t = findgen(3*2^n) ff=fltarr(3*2^n) for i=0,3*2^n-1 do ff[i]=f(t[i]/2^n,n) ; output as EPS file thisDevice =!D.Name set_plot, ’PS’ Device, Filename=’c:tempdaubechies_approx.eps’,xsize=15,ysize=10,/Encapsulated plot,t/2^n,ff,yrange=[-1,2],color=1,background=’FFFFFF’XL device,/close_file set_plot,thisDevice end Figure B.6: The fixpoint approximation of the Haar scaling function to order n = 4. Figure B.6 shows the result of n=4 iterations using the refinement coefficients c0 = c1 = 1, c2 = c3 = 0 for the Haar scaling function.
140 APPENDIX B. THE DISCRETE WAVELET TRANSFORMATION B.5 The mother wavelet Let f be a signal or function, f ∈ L2 (IR), and let Pn(f) denote its projection onto the space Vn. We saw in the case of the Haar MRA that we can always write Pn+1(f) = Pn(f) + k f, ψn,k ψn,k, ψn,k ψn,k. The Haar wavelet ψn,k(t) = ψ(2n t − k), k, n ∈ Z, was seen to be an orthogonal basis for V ⊥ n and ψ(t) = φ(2t) − φ(2t − 1), (B.7) where φ is the scaling function. It can in fact be shown that this is always the case for any MRA, except that the last expression relating the mother wavelet ψ to the scaling function φ is generalized. Consider now some MRA with a normalized scaling function φ defined (in the sense of the preceding section) by the dilation equation (B.4). Since φ(2t − k), φ(2t − k) = 1 2 · φ(t), φ(t) = 1 2 , the functions √ 2φ(2t − k) are normalized and orthogonal. We write (B.4) in the form φ(t) = k hk √ 2φ(2t − k), (B.8) where hk = ck √ 2 . It follows from (B.8) that k h2 k = 1. (B.9) Now we assume, in analogy to (B.8), that ψ can be expressed in terms of the scaling function as ψ(t) = k gk √ 2φ(2t − k). (B.10) Since φ ∈ V0 and ψ ∈ V ⊥ 0 we have φ, ψ = k hkgk = 0. (B.11) Similarly, ψ(t − k), ψ(t − m) = i gigi−2(k−m) = δk,m. (B.12) A set of coefficients that satisfies (B.11) and (B.12) is given by gk = (−1)k h1−k, so we obtain, finally, the relationship between the wavelets and the scaling function: ψ(t) = k (−1)k h1−k √ 2φ(2t − k) = k (−1)k c1−kφ(2t − k). (B.13)
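The coefficient relations of this section can also be checked numerically. Here is a minimal IDL sketch, using the D4 coefficients of the following section with $h_k = c_k/\sqrt{2}$ (the choice of coefficients is the only assumption):

; Numerical check of (B.9), (B.11) and (B.12) for g_k = (-1)^k h_(1-k).
h = [1+sqrt(3), 3+sqrt(3), 3-sqrt(3), 1-sqrt(3)]/(4*sqrt(2))   ; h_0 .. h_3
g = [h[3], -h[2], h[1], -h[0]]                                 ; g_(-2) .. g_1
print, 'sum h_k^2       =', total(h^2)           ; should be 1, Eq. (B.9)
print, 'sum g_k^2       =', total(g^2)           ; should be 1
; <phi,psi> = sum_k h_k g_k, Eq. (B.11): the supports overlap only at k = 0, 1
print, 'sum h_k g_k     =', h[0]*g[2] + h[1]*g[3]
; shift-by-two orthogonality of the wavelets, Eq. (B.12) with k - m = 1
print, 'sum g_i g_(i-2) =', g[2]*g[0] + g[3]*g[1]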
B.6. THE DAUBECHIES WAVELET 141 B.6 The Daubechies wavelet The Daubechies scaling function is derived according to the following two requirements on an MRA: 1. Compact support: The scaling function φ(t) is required to be zero outside the interval 0 t 3. This means that the refinement coefficients ck vanish for k 0, k 3. To see this, note that c−3 = 2 φ(t), φ(2t + 3) = 3 0 φ(t)φ(2t + 3)dt = 0 and similarly for k = −4, −5 . . . and for k = 6, 7 . . .. Therefore, from the dilation equation, φ(−1/2) = 0 = c−2φ(−1 + 2) + c−1φ(−1 + 1) + . . . ⇒ c−2 = 0 and similarly for k = −1, 4, 5. Thus from (B.5), we can conclude that c2 0 + c2 1 + c2 2 + c2 3 = 2 (B.14) and from (B.6), with j = 1, that c0c2 + c1c3 = 0. (B.15) In addition, again from the dilation equation, we can write ∞ −∞ φ(t)dt = 3 k=0 ck ∞ −∞ φ(2t − k)dt = 3 k=0 ck 2 ∞ −∞ φ(u)du. But one can show that an MRA implies ∞ −∞ φ(t)dt = 0 so we have c0 + c1 + c2 + c3 = 2. (B.16) 2. Regularity: All constant and linear polynomials can be written as a linear combination of the basis {φ(t − k) | k ∈ Z} for V0. This implies that there is no residual in the orthogonal decomposition of f(t) = 1 and f(t) = t onto the basis, that is, ∞ −∞ ψ(t)dt = ∞ −∞ tψ(t)dt = 0. (B.17) With (B.13) the mother wavelet is ψ(t) = −c0φ(2t − 1) + c1φ(2t) − c2φ(2t + 1) + c3φ(2t + 2). (B.18) The first requirement in (B.17) gives immediately −c0 + c1 − c2 + c3 = 0. (B.19) From the second requirement we have ∞ −∞ tψ(t)dt = 0 = 3 k=0 (−1)k+1 ck ∞ −∞ tφ(2t − 1 + k)dt = 3 k=0 (−1)k+1 ck ∞ −∞ u + 1 − k 4 φ(u)du = 0 4 · ∞ −∞ uφ(u)du + −c0 + c2 − 2c3 4 ∞ −∞ φ(u)du,
142 APPENDIX B. THE DISCRETE WAVELET TRANSFORMATION using (B.19). Thus −c0 + c2 − 2c3 = 0. (B.20) Equations (B.14), (B.15), (B.16), (B.19) and (B.20) comprise a system of five equations in four unknowns. A solution is given by c0 = 1 + √ 3 4 , c1 = 3 + √ 3 4 , c2 = 3 − √ 3 4 , c3 = 1 − √ 3 4 , which are known as the D4 refinement coefficients. Figure B.7 shows the corresponding scaling function, determined with the fixpoint method described earlier. Figure B.7: The fixpoint approximation of the Daubechies scaling function to order n = 4. Example: image compression The following program, adapted from the IDL Reference Guide, uses the Daubechies D4 wavelet to compress a gray-scale satellite image. It displays the original and compressed images and determines the file size of the compressed image. The next section discusses the implementation of the wavelet transformation in terms of a filter bank. ; Image compression with Daubechies wavelet ; read a bitmap image and cut out a 512x512 pixel array cd, ’c:idlprojectsimage analysis’ filename = Dialog_Pickfile(Filter=’*.bmp’,/Read) image1 = Read_BMP(filename) image = bytarr(512,512) image[*,*] = image1[1,0:511,0:511] ; display cutout and size window,0,xsize=512,ysize=512 window,1,xsize=512,ysize=512
B.7. WAVELETS AND FILTER BANKS 143 wset, 0 tv, bytscl(image) print, ’Size of original image is’, 512*512L, ’ bytes’ ; perform wavelet transform with D4 wavlet wtn_image = wtn(image, 4) ; convert to sparse array with threshold 20 and write to disk sparse_image = sprsin(wtn_image,thresh=20) write_spr, sparse_image, ’sparse.dat’ openr, 1, ’sparse.dat’ status = fstat(1) close, 1 print, ’Size of compressed image is’, status.size, ’ bytes’ ; reconstruct full array, do inverse wavelet transform and display wset,1 tv, bytscl(wtn(fulstr(sparse_image), 4, /inverse)) end B.7 Wavelets and filter banks In the case of the Haar wavelets we were able to carry out the wavelet transformation with vectors and matrices. In general, we can’t represent scaling functions in this way. In fact usually all that we have to work with are the refinement coefficients. So how can we perform the wavelet transformation? To answer this question, consider a row of pixels (s(0), s(1) . . . s(m − 1)) in a satellite image, where m = 2n , and the associated vector signal on [0, 1] given by s = (s0, s1 . . . sm−1) = (s(0/(m − 1)), s(1/(m − 1)) . . . s(1)) . In the MRA generated by a scaling function φ, such as D4, this signal defines a function fn(t) ∈ Vn on the interval [0, 1] according to fn(t) = m−1 j=0 sjφn,j = m−1 j=0 sjφ(2n t − j). (B.21) Assume that the basis functions are appropriately normalized. The projection of fn(t) onto Vn−1 is then fn−1(t) = m/2−1 k=0 fn, φ(2n−1 t − k) φ(2n−1 t − k) = m/2−1 k=0 (Hs)kφ(2n−1 t − k), where Hs = fn, φ(2n−1 t) , fn, φ(2n−1 t − 1) . . . fn, φ(2n−1 t − m/2 − 1)
144 APPENDIX B. THE DISCRETE WAVELET TRANSFORMATION is the signal vector in Vn−1. The operator H is interpreted as a low-pass filter. It averages the original signal s and reduces its length by a factor of two. We have, using (B.21), (Hs)k = m−1 j=0 sj φ(2n t − j), φ(2n−1 t − k) . From the dilation equation with normalized basis functions, φ(2n−1 t − k) = k hk φ(2n t − 2k − k ), so we can write (Hs)k = m−1 j=1 sj k hk φ(2n t − j), φ(2n t − 2k − k ) = m−1 j=1 sj k hk δj,k +2k. Therefore (Hs)k = m−1 j=0 hj−2ksj, k = 0 . . . m 2 − 1 = 2n−1 − 1. (B.22) For the Daubechies scaling function, h0 = 1 + √ 3 4 √ 2 , h1 = 3 + √ 3 4 √ 2 , h2 = 3 − √ 3 4 √ 2 , h3 = 1 − √ 3 4 √ 2 , h4 = 0, . . . . Thus the elements of the filtered signal are (Hs)0 = h0s0 + h1s1 + h2s2 + h3s3 (Hs)1 = h0s2 + h1s3 + h2s4 + h3s5 (Hs)3 = h0s4 + h1s5 + h2s6 + h3s7 ... This is just the convolution of the filter H = (h3, h2, h1, h0) with the signal s, Hs = H ∗ s, see Eq. (2.12), except that only every second term is retained. This is referred to as downsampling and is illustrated in Figure B.8. In the same way, the high-pass filter G projects fn(t) onto the orthogonal subspace V ⊥ n−1 according to (Gs)k = m−1 j=0 gj−2ksj, k = 0 . . . m 2 − 1 = 2n−1 − 1. (B.23) Recall that gk = (−1)k h1−k
B.7. WAVELETS AND FILTER BANKS 145 s HsH ↓ 2 Figure B.8: Schematic representation of the filter H. The symbol ↓ 2 indicates downsampling by a factor of two. so that the nonzero high-pass filter coefficients are actually g−2 = h3, g−1 = −h2, g0 = h1, g1 = −h0. The concatenated signal (Hs, Gs) = (s1 , d1 ) is the projection of fn onto Vn−1 ⊕ V ⊥ n−1. It has the same length as the original signal s and is an alternative representation of that signal. Its generation is illustrated in Figure B.9 as a filter bank. s H ↓ 2 G ↓ 2 d1 s1 Figure B.9: Schematic representation of the filter bank H, G. The projections can be repeated on s1 = Hs to obtain the projection (Hs1 , Gs1 , Gs) = (s2 , d2 , d1 ) onto Vn−2 ⊕ V ⊥ n−2 ⊕ V ⊥ n−1 and so on. The original signal can be reconstructed at any stage by applying the inverse operators H∗ and G∗ . For the first stage these are defined by (H∗ s1 )k = m/2−1 j=0 hk−2js1 j , k = 0 . . . m − 1 = 2n − 1, (B.24) (G∗ d1 )k = m/2−1 j=0 gk−2jd1 j , k = 0 . . . m − 1 = 2n − 1, (B.25) with analagous definitions for the other stages. To understand what’s happening, consider
146 APPENDIX B. THE DISCRETE WAVELET TRANSFORMATION the elements of the filtered signal (B.24). These are (H∗ s1 )0 = h0s1 0 (H∗ s1 )1 = h1s1 0 (H∗ s1 )2 = h2s1 0 + h0s1 1 (H∗ s1 )3 = h3s1 0 + h1s1 1 (H∗ s1 )4 = h2s1 1 + h0s1 2 (H∗ s1 )5 = h3s1 1 + h1s1 2 ... This is just the convolution of the filter H∗ = (h0, h1, h2, h3) with the signal s1 0, 0, s1 1, 0, s1 2, 0 . . . s1 m/2−1, 0. This is called the upsampled signal. The filter (B.24) can be represented schematically as in Figure B.10. s1 H∗ s1↑ 2 H∗ Figure B.10: Schematic representation of the filter H∗ . The symbol ↑ 2 indicates upsampling by a factor of two. Equation (B.25) is interpreted in a similar way. Finally we add the two results to get the original signal: H∗ s1 + G∗ d1 = s. To see this, write the equation out for a particular value of k: (H∗ s1 )k + (G∗ d1 )k = m/2−1 j=0 hk−2j   m−1 j =0 hj −2jsj + gk−2j m−1 j =0 gj −2jsj   Combining terms and interchanging the summations, we get (H∗ s1 )k + (G∗ d1 )k = m−1 j =0 sj m/2−1 j=0 [hk−2jhj −2j + gk−2jgj −2j]. Now, using gk = (−1)k h1−k, (H∗ s1 )k + (G∗ d1 )k = m−1 j =0 sj m/2−1 j=0 [hk−2jhj −2j + (−1)k+j h1−k+2jh1−j +2j].
B.7. WAVELETS AND FILTER BANKS 147 With the help of (B.5) and (B.6) it is easy to show that the second summation above is just δj k. For example, suppose k is even. Then m/2−1 j=0 [hk−2jhj −2j + (−1)k+j h1−k+2jh1−j +2j] = h0hj −k + h2hj −k+2 + (−1)j [h1h1−j +k + h3h3−j +k]. If j = k, the right hand side reduces to h2 0 + h2 1 + h2 2 + h2 3 = 1, from (B.5) and hk = ck/ √ 2. For any other value of j , the expression is zero. Therefore we can write (H∗ s1 )k + (G∗ d1 )k = m−1 j =0 sj δj k = sk, as claimed. The reconstruction of the original signal from s1 and d1 is shown in Figure B.11 as a synthesis bank. s1 ↑ 2 H∗ ↑ 2 G∗ s d1 + Figure B.11: Schematic representation of the synthesis bank H∗ , G∗ . The extension of the procedure to two-dimensional signals (e.g. satellite imagery) is straightforward, see [Mal89]. Figure B.12 shows a single application of the filters H and G to the rows and columns of a satellite image. The image is a signal which defines a two- dimensional function f in V10 ⊗ V10. The Daubechies D4 refinement coefficients are used to generate the filters. The result of the low-pass filter is in the upper left hand quadrant. This is the projection of f onto V9 ⊗ V9. The other three quadrants represent the projections onto the orthogonal subspaces V ⊥ 9 ⊗V9, V9 ⊗V ⊥ 9 and V ⊥ 9 ⊗V ⊥ 9 . The following IDL program illustrates the procedure. ; the Daubechies kernels H = [1-sqrt(3),3-sqrt(3),3+sqrt(3),1+sqrt(3)]/(4*sqrt(2)) G = [-H[0],H[1],-H[2],H[3]] ; arrays for wavelet coefficients f0 = fltarr(512,512) f1 = fltarr(512,256) g1 = fltarr(512,256) ff1 = fltarr(256,256)
148 APPENDIX B. THE DISCRETE WAVELET TRANSFORMATION Figure B.12: Wavelet projection of a satellite image with (2.36) and(2.37). fg1 = fltarr(256,256) gf1 = fltarr(256,256) gg1 = fltarr(256,256) ; read a bitmap image and cut out a 512x512 pixel array filename = Dialog_Pickfile(Filter=’*.bmp’,/Read) image = Read_BMP(filename) ; 24 bit image, so get first layer f0[*,*] = image[1,0:511,0:511] ; display cutout window,0,xsize=512,ysize=512 wset, 0 tv, bytscl(f0) ; filter columns and downsample ds = findgen(256)*2 for i=0,511 do begin temp = convol(transpose(f0[i,*]),H,center=0,/edge_wrap) f1[i,*] = temp[ds] temp = convol(transpose(f0[i,*]),G,center=0,/edge_wrap)
B.7. WAVELETS AND FILTER BANKS 149 g1[i,*] = temp[ds] endfor ; filter rows and downsample for i=0,255 do begin temp = convol(f1[*,i],H,center=0,/edge_wrap) ff1[*,i] = temp[ds] temp = convol(f1[*,i],G,center=0,/edge_wrap) fg1[*,i] = temp[ds] temp = convol(g1[*,i],H,center=0,/edge_wrap) gf1[*,i] = temp[ds] temp = convol(g1[*,i],G,center=0,/edge_wrap) gg1[*,i] = temp[ds] endfor f0[0:255,256:511]=bytscl(ff1[*,*]) f0[0:255,0:255]=bytscl(gf1[*,*]) f0[256:511,0:255]=bytscl(gg1[*,*]) f0[256:511,256:511]=bytscl(fg1[*,*]) ; output as EPS file thisDevice =!D.Name set_plot, ’PS’ Device, Filename=’c:temppyramid.eps’,xsize=10,ysize=10,/Encapsulated tv,f0 device,/close_file set_plot,thisDevice end
Appendix C

Advanced Neural Network Training Algorithms

The standard backpropagation algorithm introduced in Chapter 10 is notoriously slow to converge. In this Appendix we will develop two additional training algorithms for the two-layer, feed-forward neural network of Figure 10.4. The first of these, scaled conjugate gradient, makes use of the second derivatives of the cost function with respect to the synaptic weights, i.e. the Hessian matrix. The second, the Kalman filter method, takes advantage of the statistical properties of the weight parameters. Both techniques are considerably more efficient than backpropagation.

C.1 The Hessian matrix

The Hessian matrix was introduced in Chapter 10 as
$$H_{ij} = \frac{\partial^2 E(\mathbf{w})}{\partial w_i \partial w_j}. \qquad (C.1)$$
It is the (symmetric) matrix of second order partial derivatives of the cost function E(w) with respect to the synaptic weights, the latter thought of as a single column vector
$$\mathbf{w} = (\mathbf{w}^h_1 \ldots \mathbf{w}^h_L,\; \mathbf{w}^o_1 \ldots \mathbf{w}^o_M)^\top$$
of length $n_w = L(N+1) + M(L+1)$. Since H is symmetric, its eigenvectors $\mathbf{u}_i$, $i = 1 \ldots n_w$, are orthogonal and any vector v in the space of the synaptic weights can be expressed as a linear combination of them, e.g.
$$\mathbf{v} = \sum_{i=1}^{n_w} \beta_i \mathbf{u}_i.$$
152 APPENDIX C. ADVANCED NEURAL NETWORK TRAINING ALGORITHMS But then we have v Hv = v i βiλiui = i β2 i λi, and we conclude that H is positive definite if and only of all of its eigenvalues λi are positive. Thus a good way to check if one is at or near a local minimum in the cost function is to examine the eigenvalues of the Hessian. The scaled conjugate gradient algorithm makes explicit use of the Hessian matrix for more efficient convergence to a minimum in the cost function. The disadvantage of using H is that it is difficult to compute efficiently. For example, for a typical classification problem with N = 3-dimensional input data, L = 8 hidden neurons and M = 12 land use categories, there are [L(N + 1) + M(L + 1)]2 = 19, 600 matrix elements to determine at each iteration. We develop in the following an efficient method to calculate not H directly, but rather the product v H for any vector v having nw components. Our approach follows Bishop [Bis95] closely. C.1.1 The R-operator Let us begin by summarizing some results of Chapter 10 for the two-layer, feed forward network: x = (x1 . . . xN ) input observation y = (0 . . . 1 . . . 0) class label x = (1, x ) biased input observation Ih = Wh x activation vector for the hidden layer n = gh (Ih ) output signal vector from the hidden layer n = (1, n ) ditto with bias Io = Wo n activation vector for the output layer m = go (Io ) output vector from the network. (C.2) The corresponding activation functions are, for the hidden neurons, gh (Ih j ) = 1 1 + e−Ih j , j = 1 . . . L, (C.3) and for the output neurons, go (Io k) = eIo k M k =1 eIo k , k = 1 . . . M. (C.4) The first derivatives of the local cost function with respect to the output and hidden weights, (10.19) and (10.22), can be written concisely as ∂E ∂Wo = −nδo ∂E ∂Wh = −xδh , (C.5) where δo = y − m (C.6)
C.1. THE HESSIAN MATRIX 153 and 0 δh = n ⊗ (1 − n) ⊗ Wo δo . (C.7) With Bishop [Bis95] we introduce the R-operator according to Rv{·} := v ∂ ∂w , v = (v1 . . . vnw ). Obviously we have Rv{w} = j vj ∂w ∂wj = v. We adopt the convention that the result of applying the R-operator has the same structure as the argument to which it is applied. Thus for example Rv{Wh } = Vh , where Vh , like Wh , is an (N + 1) × L matrix consisting of the first (N + 1) × L components of the vector v. Next we derive an expression for v H in terms of the R-operator. (v H)j = nw i=1 viHij = nw i=1 vi ∂2 E ∂wi∂wj = nw i=1 vi ∂ ∂wi ∂E ∂wj or (v H)j = v ∂ ∂w ∂E ∂wj = Rv ∂E ∂wj , j = 1 . . . nw. Since v H is a row vector, this can be written v H = Rv ∂E ∂w ∼= Rv ∂E ∂Wh , Rv ∂E ∂Wo . (C.8) Note the reorganization of the structure in the argument of Rv, namely w → (Wh , Wo ). This is merely for convenience. Once the expressions on the right have been evaluated, the result must be “flattened” back to a row vector. Equation (C.1.1) is understood to involve the local cost function. In order to complete the calculation we must sum over all training pairs. Applying the chain rule to (C.5), Rv ∂E ∂Wo = −nRv{δo } − RV {n}δo Rv ∂E ∂Wh = −xRv{δh }, (C.9) so that, in order to evaluate (C.1.1), we only need expressions for Rv{n}, Rv{δo } and Rv{δh }.
154 APPENDIX C. ADVANCED NEURAL NETWORK TRAINING ALGORITHMS Determination of Rv{n} From (C.2) we can write Rv{n} = 0 Rv{n } (C.10) and, from the chain rule, Rv{n } = n ⊗ (1 − n ) ⊗ Rv{Ih } (C.11) and Rv{Ih } = Vh x. (C.12) Note that, according to our convention, Vh is interpreted as an L × (N + 1)-dimensional matrix, since the argument Ih is a vector of length L. Determination of Rv{δo } With (C.6) and (C.2) we get Rv{δo } = −Rv{m} = −v ∂m ∂w = −go (Io ) ⊗ Rv{Io }, where the prime denotes differentiation, or Rv{δo } = −m ⊗ (1 − m) ⊗ Rv{Io }. (C.13) Again with (C.2) we determine Rv{Io } = Wo Rv{n} + Vo n, (C.14) where Rv{n} is determined by (C.10–12). Determination of Rv{δh } We begin by writing (C.7) in the form 0 δh = 0 gh (Ih ) ⊗ Wo δo . Operating with Rv{·}, 0 Rv{δh } = 0 gh (Ih ) ⊗ 0 Rv{Ih } ⊗ Wo δo + 0 gh (Ih ) ⊗ Vo δo + 0 gh (Ih ) ⊗ Wo Rv{δo }. Now we use the derivatives of the activation function gh (Ih ) = n (1 − n ) gh (Ih ) = n (1 − n )(1 − 2n ) to obtain 0 Rv{δh } = n ⊗ (1 − n) ⊗ (1 − 2n) ⊗ 0 Rv{Ih } ⊗ Wo δo + Vo δo + Wo Rv{δo } (C.15) in which all of the terms on the right have now been determined. As already mentioned, equation () has been evaluated in terms of the local cost function. The final step in the calculation involves summing over all of the training pairs.
C.1. THE HESSIAN MATRIX 155 C.1.2 Calculating the Hessian To calculate the Hessian matrix for the neural network, we evaluate (C.1.1) successively for the vectors v1 = (1, 0, 0 . . . 0) . . . vnw = (0, 0, 0 . . . 1) and build up H row for row: H =    v1 H ... vnw H    . The following excerpt from the IDL program FFNCG DEFINE (see Appendix D) imple- ments a vectorized version of the preceding calculation of v H and H: Function FFNCG::Rop, V nw = self.LL*(self.NN+1)+self.MM*(self.LL+1) ; reform V to dimensions of Wh and Wo and transpose VhT = transpose(reform(V[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1)) Vo = reform(V[self.LL*(self.NN+1):*],self.MM,self.LL+1) VoT = transpose(Vo) ; transpose the weights WhT = transpose(*self.Wh) Wo = *self.Wo WoT = transpose(Wo) ; vectorized forward pass X = *self.Xs Zeroes = fltarr(self.p) Ones = Zeroes + 1.0 N = [[Ones],[1/(1+exp(-WhT##X))]] Io = WoT##N maxIo = max(Io,dimension=2) for k=0,self.MM-1 do Io[*,k]=Io[*,k]-maxIo A = exp(Io) sum = total(A,2) M = fltarr(self.p,self.MM) for k=0,self.MM-1 do M[*,k] = A[*,k]/sum ; evaluation of v^T.H D_o = *self.Ys-M ; d^o RIh = VhT##X ; Rv{I^h} RN = N*(1-N)*[[Zeroes],[RIh]] ; Rv{n} RIo = WoT##RN + VoT##N ; Rv{I^o} Rd_o = -M*(1-M)*RIo ; Rv{d^o} Rd_h = N*(1-N)*((1-2*N)*[[Zeroes],[RIh]]*(Wo##D_o) + Vo##D_o + Wo##Rd_o) Rd_h = Rd_h[*,1:*] ; Rv{d^h} REo = -N##transpose(Rd_o)-RN##transpose(D_o) ; Rv{dE/dWo} REh = -X##transpose(Rd_h) ; Rv{dE/dWh} return, [REh[*],REo[*]] ; v^T.H End
156 APPENDIX C. ADVANCED NEURAL NETWORK TRAINING ALGORITHMS Function FFNCG::Hessian nw = self.LL*(self.NN+1)+self.MM*(self.LL+1) v = diag_matrix(fltarr(nw)+1.0) H = fltarr(nw,nw) for i=0,nw-1 do H[*,i] = self - Rop(v[*,i]) return, H End C.2 Scaled conjugate gradient training The backpropagation algorithm of Chapter 10 attempts to minimize the cost function locally, that is, weight updates are made immediately after presentation of a single training pair to the network. We will now consider a global approach aimed at minimization of the full cost function (10.15), which we denote in the following E(w). The symbol w is, as before, the nw-component vector of synaptic weights. Now let the gradient of the cost function at the point w be g(w), i.e. g(w) i = ∂ ∂wi E(w), i = 1 . . . nw. The Hessian matrix (H)ij = ∂2 E(w) ∂wi∂wj i, j = 1 . . . nw can then be expressed conveniently as the outer product H = ∂ ∂w g(w) . (C.16) C.2.1 Conjugate directions The search for a minimum in the cost function can be visualized as a series of points in the space of synaptic weight parameters, w1 , w2 . . . wk−1 , wk , wk+1 . . . , whereby the point wk is determined by minimizing E(w) along some search direction dk−1 which originated at the preceding point wk−1 . This is illustrated in Figure C.1 and corre- sponds to the vector equation wk = wk−1 + αk−1dk−1 . (C.17) Here dk−1 is a unit vector along the chosen search direction and the scalar αk−1 minimizes the cost function along that direction: αk−1 = arg min α E wk−1 + α · dk−1 . (C.18) If, starting from wk , we now wish to take the next minimizing step in the weight space, it is not efficient simply to choose, as in backpropagation, the direction of the local gradient g(wk) at the new starting point wk . It follows namely from (C.18) that ∂ ∂α E wk−1 + α · dk−1 α=αk−1 = 0
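Once the Hessian is available, the eigenvalue criterion mentioned at the beginning of this section can be applied directly. Here is a minimal sketch; it assumes that an instance ffn of the FFNCG class has already been created and trained (object creation is not shown, see Appendix D for the class definition):

; Check for a local minimum: all eigenvalues of the Hessian should be positive.
H = ffn->Hessian()
lambda = eigenql(H)                      ; eigenvalues of the symmetric Hessian
print, 'min/max eigenvalue:', min(lambda), max(lambda)
if min(lambda) gt 0 then print, 'Hessian is positive definite (local minimum)' $
                    else print, 'Hessian is not positive definite'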
C.2. SCALED CONJUGATE GRADIENT TRAINING 157 0 BE c ‚wk−1 dk−1 wk g(wk ) αk−1 dk ? Figure C.1: Search directions in weight space. or ∂ ∂w E wk−1 + αk−1dk−1 dk−1 = g(wk ) dk−1 = 0. (C.19) The gradient g(wk ) at the new point wk is thus always orthogonal to the preceding search direction dk−1 . This is indicated in Figure C.1. Since the algorithms has just succeeded in reducing the gradient of the cost function along dk−1 to zero, we would prefer to choose the search direction dk so that the component of the gradient along the old search direction remains as small as possible. Otherwise we are undoing what we have just accomplished. Therefore we choose dk according to the condition g wk + α · dk dk−1 = 0. But to first order in α we have, with (C.16), g wk + α · dk = g(wk ) + α · dk ∂ ∂w g(wk ) = g(wk ) + α · dk H and the above condition is, with (C.19), equivalent to dk Hdk−1 = 0. (C.20) Directions which satisfy Equation (C.20) are referred to as conjugate directions. C.2.2 Minimizing a quadratic function Of course the neural network cost function is not quadratic in the synaptic weights. However within a sufficiently small region of weight space it can be approximated as a quadratic function. We describe in the following an efficient procedure to find the global minimum of a quadratic function of w having the general form E(w) = E0 + b w + 1 2 w Hw, (C.21)
158 APPENDIX C. ADVANCED NEURAL NETWORK TRAINING ALGORITHMS where b and H are constant and the matrix H is positive definite. The local gradient at the point w is given by g(w) = ∂ ∂w E(w) = b + Hw, and at the global minimum w∗ , b + Hw∗ = 0. (C.22) Now let {dk | k = 1 . . . nw} be a set of conjugate directions satisfying (C.20),1 dk Hd = 0 for k = , k, = 1 . . . nw. (C.23) The search directions dk are linearly independent. In order to demonstrate this let us assume the contrary, that is, that there exists an index k and constants αk , k = k, not all of which are zero, such that dk = nw k =1 k =k αk dk . But from (C.23) we have at once αk dk Hdk = 0 for k = k and, since H is positive definite, αk = 0 for k = k. The assumption thus leads to a contradiction and the dk are indeed linearly independent. The conjugate directions thus constitute a (non-orthogonal) vector basis for the entire weight space. In the search for the global minimum suppose we begin at an arbitrary point w1 and express the vector w∗ − w1 spanning the distance to the global minimum as a linear com- bination of the basis vectors dk : w∗ − w1 = nw k=1 αkdk . (C.24) Further, define wk = w1 + k−1 =1 α d (C.25) and split (C.24) up into nw steps wk+1 = wk + αkdk , k = 1 . . . nw. (C.26) At the kth step the search starts at the point wk and proceeds a distance αk along the conjugate direction dk . After nw such steps the global minimum w∗ is reached, since from (C.24–C.26) it follows that w∗ = w1 + nw k=1 αkdk = wnw + αnw dnw = wnw+1 . 1It can be shown that such a set always exists, see e.g. [Bis95].
C.2. SCALED CONJUGATE GRADIENT TRAINING 159 We get the necessary step sizes αk from (C.24) by multiplying from the left with d H, d Hw∗ − d Hw1 = nw k=1 αkd Hdk . From (C.22) and (C.23) we can write this as −d (b + Hw1 ) = α d Hd , and an explicit formula for the step sizes is given by α = − d (b + Hw1 ) d Hd , = 1 . . . nw. But with (C.24) and (C.25), dk Hwk = dk Hw1 + 0, and therefore, replacing index k by , d Hw = d Hw1 . The step lengths are thus α = − d (b + Hw ) d Hd , = 1 . . . nw. Finally, using the notation g = g(w ) = b + Hw and substituting → k, αk = − dk gk dk Hdk , k = 1 . . . nw. (C.27) For want of a better alternative we can choose the first search direction along the negative local gradient d1 = −g1 = − ∂ ∂w E(w1 ). (Note that d1 is not a unit vector.) We move according to (C.27) a distance α1 = d1 d1 d1 Hd1 along this direction to the point w2 , at which the local gradient g2 is orthogonal to d1 . We then choose the new conjugate search direction d2 as a linear combination of the two: d2 = −g2 + β1d1 or, at the kth step, dk+1 = −gk+1 + βkdk . (C.28)
160 APPENDIX C. ADVANCED NEURAL NETWORK TRAINING ALGORITHMS We get the coefficient βk from (C.28) and (C.20) by multiplication on the left with dk H: 0 = −dk Hgk+1 + βkdk Hdk , from which follows βk = gk+1 Hdk dk Hdk . (C.29) Equations (C.26–C.29) constitute a recipe with which, starting at an arbitrary point w1 in weight space, the global minimum of the quadratic function (C.21) is found in precisely nw steps. C.2.3 The algorithm Returning now to the non-quadratic neural net cost function E(w) we will apply the above method to minimize it. We must take two things into consideration. First of all, the Hessian matrix H is neither constant nor everywhere positive definite. We will denote its local value at the point wl as Hk . When Hk is not positive definite it can happen that (C.27) leads to a step along the wrong direction – the numerator might turn out to be negative. Therefore we replace (C.27) with2 αk = − dk gk dk Hdk + λk|dk|2 , k = 1 . . . nw. (C.30) The constant λk is supposed to ensure that the denominator in (C.30) is always positive. It is initialized for k = 1 with a small numerical value. If, at the kth iteration, it is determined that δk := dk Hdk + λk(dk )2 0, then λk is replaced by the larger value ¯λk given by ¯λk = 2 λk − δk |dk|2 . (C.31) This ensures that the denominator in (C.30) becomes positive again. Note that this increase in λk has the effect of decreasing the step size αk, as is apparent from (C.30). Second, we must take into account any deviation of the cost function from its local quadratic approximation. Such deviations are to be expected for large step sizes αk. As a measure of the quadricity of E(w) along the chosen step length we can use the ratio ∆k = − 2 E(wk ) − E(wk + αkdk ) αkdk gk . (C.32) 2This corresponds to the substitution Hk → Hk + λkI, where I is the identity matrix.
C.2. SCALED CONJUGATE GRADIENT TRAINING 161 This quantity is precisely 1 for a strictly quadratic function like (C.21). Therefore we can use the following heuristic: For the k + 1st iteration if ∆k 3/4, λk+1 := λk/2 if ∆k 1/4, λk+1 := 4λk else, λk+1 := λk. In other words, if the local quadratic approximation looks good according to criterion (C.32), then the step size can be increased (λk+1 is reduced relative to λk). If this is not the case then the step size is decreased (λk+1 is made larger). All of which leads us finally to the following algorithm (see e.g. [Moe93]) Algorithm (Scaled Conjugate Gradient) 1. Initialize the synaptic weights w with random numbers, set k = 0, λ = 0.001 and d = −g = −∂E(w)/∂w. 2. Set δ = d Hd + λ|d|2 . If δ 0, set λ = 2(λ − δ/d2 ) and δ = −d Hd. Save the current cost function E1 = E(w). 3. Determine the step size α = −d g/δ and new synaptic weights w = w + αd. 4. Calculate the quadricity ∆ = −(E1 − E(w))/(α · d g). If ∆ 1/4, restore the old weights: w = w − α · d, set λ = 4λ, d = −g and go to 2. 5. Set k = k + 1. If ∆ 3/4 set λ = λ/2. 6. Determine the new local gradient g = ∂E(w)/∂w and the new search direction d = −g + βd, whereby, if k mod nw = 0 then β = g Hd/(d Hd) else β = 0. 7. If E(w) is small enough stop, else go to 2. A few remarks on this algorithm: • The integer k counts the total number of iterations. Whenever k mod nw = 0 exactly nw weight updates have been carried out and the minimum of a truly quadratic func- tion would have been reached. This is taken as a good stage at which to restart the search along the negative local gradient −g rather than continuing along the current conjugate direction d. One expects that approximation errors will gradually corrupt the determination of the conjugate directions and the “fresh start” is intended to counter this. • Whenever the quadricity condition is not filled, i.e. whenever ∆ 1/4, the last weight update is cancelled and the search again restarted along −g. • Since the Hessian only occurs in the forms d H, and g H, it can be determined efficiently with the R-operator method. Here is an excerpt from the object FFNCG class extending FFN, showing the training method which implements scaled conjugate gradient algorithm:
162 APPENDIX C. ADVANCED NEURAL NETWORK TRAINING ALGORITHMS Pro FFNCG::Train w = [(*self.Wh)[*],(*self.Wo)[*]] nw = n_elements(w) g = self-gradient() d = -g ; search direction, row vector k = 0L lambda = 0.001 window,12,xsize=600,ysize=400,title=’FFN(scaled conjugate gradient)’ wset,12 progressbar = Obj_New(’progressbar’, Color=’blue’, Text=’0’,$ title=’Training: epoch number...’,xsize=250,ysize=20) progressbar-start eivminmax = ’?’ repeat begin if progressbar-CheckCancel() then begin print,’Training interrupted’ progressbar-Destroy return endif d2 = total(d*d) ; d^2 dTHd = total(self-Rop(d)*d) ; d^T.H.d delta = dTHd+lambda*d2 if delta lt 0 then begin lambda = 2*(lambda-delta/d2) delta = -dTHd endif E1 = self-cost() ; E(w) (*self.cost_array)[k] = E1 dTg = total(d*g) ; d^T.g alpha = -dTg/delta dw = alpha*d w = w+dw *self.Wh = reform(w[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1) *self.Wo = reform(w[self.LL*(self.NN+1):*],self.MM,self.LL+1) E2 = self-cost() ; E(w+dw) Ddelta = -(E1-E2)/(alpha*dTg) ; quadricity if Ddelta lt 0.25 then begin w = w - dw ; undo change in the weights *self.Wh = reform(w[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1) *self.Wo = reform(w[self.LL*(self.NN+1):*],self.MM,self.LL+1) lambda = 4*lambda ; decrease step size d = -g ; restart along gradient end else begin k++ if Ddelta gt 0.75 then lambda = lambda/2 g = self-gradient() if k mod nw eq 0 then begin beta = 0 eivs = self-eigenvalues() eivminmax = string(min(eivs)/max(eivs),format=’(F10.6)’)
C.3. KALMAN FILTER TRAINING 163 end else beta = total(self-Rop(g)*d)/dTHd d = beta*d-g plot,*self.cost_array,xrange=[0,k100],color=0,background=’FFFFFF’XL,$ ytitle=’cross entropy’,xtitle= $ ’Epoch [’+textoidl(’min(lambda)/max(lambda)=’)+eivminmax+’]’ endelse progressbar-Update,k*100/self.iterations,text=strtrim(k,2) endrep until k gt self.iterations End C.3 Kalman filter training In this Section we apply the recursive least squares method described in Appendix A to train the feed forward neural network of Figure 10.4. The appropriate cost function is the quadratic function (10.13) or, more specifically, its local version (10.14). 1 j L k ... ... ~q X b E 0 m( + 1) wo k 1 n1( + 1) nj( + 1) nL( + 1) n( + 1) Figure C.2: An isolated output neuron. We begin with consideration of the training process of an isolated neuron. Figure C.2 depicts an output neuron in the network during presentation of the + 1st training pair (x( + 1), y( + 1)). The neuron receives its input from the hidden layer (input vector n( + 1)) and generates the signal mk( + 1) = g(wo k n( + 1)) = ewo k n( +1) M k =1 ewo k n( +1 , k = 1 . . . M, which is compared to the desired output y( +1). It is easy to show that differentiation with respect to wo k yields ∂ ∂wo k g(wo k n( + 1)) = mk( + 1)(1 − mk( + 1))n( + 1). (C.33) and with respect to n, ∂ ∂n g(wo k n( + 1)) = mk( + 1)(1 − mk( + 1))wo k( + 1). (C.34)
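The recursion (C.26)-(C.29) underlying this method can be checked on a small, exactly quadratic cost function, for which the minimum must be reached in nw steps. Here is a minimal stand-alone IDL sketch; the matrix H and vector b are made-up values, with H symmetric and positive definite:

; Conjugate gradient minimization of E(w) = b^T w + (1/2) w^T H w.
; The exact minimum is w* = -H^(-1) b; CG reaches it in nw = 4 steps.
nw = 4
H = float([[4, 1, 0, 0], $
           [1, 3, 1, 0], $
           [0, 1, 2, 1], $
           [0, 0, 1, 2]])           ; symmetric, positive definite (made up)
b = [1.0, -2.0, 0.5, 1.0]
w = fltarr(nw)                      ; starting point w^1 = 0
g = b + reform(H ## w)              ; local gradient
d = -g                              ; first search direction
for k=1,nw do begin
   Hd    = reform(H ## d)
   dHd   = total(d*Hd)
   alpha = -total(d*g)/dHd          ; step size, Eq. (C.27)
   w     = w + alpha*d              ; Eq. (C.26)
   g     = b + reform(H ## w)
   beta  = total(g*Hd)/dHd          ; Eq. (C.29)
   d     = -g + beta*d              ; new conjugate direction, Eq. (C.28)
endfor
print, 'CG solution:  ', w
print, 'exact minimum:', -reform(invert(H) ## b)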
164 APPENDIX C. ADVANCED NEURAL NETWORK TRAINING ALGORITHMS

C.3.1 Linearization

We shall drop for the time being the indices on w^o_k, writing it simply as w. Let us call w(ℓ) an approximation to the desired synaptic weight vector for our isolated output neuron, one which has been achieved so far in the training process, i.e. after presentation of the first ℓ training pairs. Then a linear approximation to m(ℓ+1) can be obtained by expanding in a first order Taylor series about the point w(ℓ),

   m(ℓ+1) ≈ g( w(ℓ)^T n(ℓ+1) ) + [ ∂/∂w g( w(ℓ)^T n(ℓ+1) ) ]^T ( w − w(ℓ) ).

With (C.33) we can then write

   m(ℓ+1) ≈ m̂(ℓ+1) + m̂(ℓ+1)( 1 − m̂(ℓ+1) ) n(ℓ+1)^T ( w − w(ℓ) ),        (C.35)

where m̂(ℓ+1) is given by m̂(ℓ+1) = g( w(ℓ)^T n(ℓ+1) ). With the definition of the linearized input

   A(ℓ+1) = m̂(ℓ+1)( 1 − m̂(ℓ+1) ) n(ℓ+1)^T        (C.36)

we can write (C.35) in the form

   m(ℓ+1) ≈ A(ℓ+1) w + [ m̂(ℓ+1) − A(ℓ+1) w(ℓ) ].

The term in square brackets is, to first order, the error that arises from the fact that the neuron's output signal is not simply proportional to w. If we neglect it altogether, then we get the linearized neuron output signal m(ℓ+1) = A(ℓ+1) w. In order to calculate the synaptic weight vector w, we can now apply the theory of recursive least squares developed in Appendix A. We simply identify the parameter vector a with the synaptic weight vector w. We then have the least squares problem

   ( y ; y(ℓ+1) ) = ( A ; A(ℓ+1) ) w + noise,

where the semicolon indicates that the new observation y(ℓ+1) and the linearized input A(ℓ+1) are appended as additional rows to the previous observation vector y and system matrix A. The Kalman filter equations for the recursive solution of this problem, Eq. (A.15), are unchanged:

   Σ_{ℓ+1} = [ I − K_{ℓ+1} A(ℓ+1) ] Σ_ℓ
   K_{ℓ+1} = Σ_ℓ A(ℓ+1)^T [ A(ℓ+1) Σ_ℓ A(ℓ+1)^T + 1 ]^{−1},        (C.37)

while the recursive expression (A.14) for the parameter vector becomes

   w(ℓ+1) = w(ℓ) + K_{ℓ+1} [ y(ℓ+1) − A(ℓ+1) w(ℓ) ].
C.3. KALMAN FILTER TRAINING 165

This can be improved somewhat by replacing the linear approximation to the system output, A(ℓ+1) w(ℓ), by the actual output for the ℓ + 1st training observation, namely m̂(ℓ+1), so we have

   w(ℓ+1) = w(ℓ) + K_{ℓ+1} [ y(ℓ+1) − m̂(ℓ+1) ].        (C.38)

C.3.2 The algorithm

The recursive calculation of w is depicted in Figure C.3. The input is the current weight vector w(ℓ), its covariance matrix Σ_ℓ and the output vector of the hidden layer n(ℓ+1) obtained by propagating the next input observation x(ℓ+1) through the network. After determining the linearized input A(ℓ+1), Eq. (C.36), the Kalman gain K_{ℓ+1} and the new covariance matrix Σ_{ℓ+1} are calculated with (C.37). Finally, the weights are updated in (C.38) to give w(ℓ+1) and the procedure is repeated.

Figure C.3: Determination of the synaptic weights for an isolated neuron with the Kalman filter.

To make our notation explicit for the output neurons, we substitute

   y(ℓ)     →  y_k(ℓ)
   w(ℓ)     →  w^o_k(ℓ)
   m̂(ℓ+1)  →  m̂_k(ℓ+1) = g( w^o_k(ℓ)^T n(ℓ+1) )
   A(ℓ+1)  →  A^o_k(ℓ+1) = m̂_k(ℓ+1)( 1 − m̂_k(ℓ+1) ) n(ℓ+1)^T
   K_ℓ      →  K^o_k(ℓ)
   Σ_ℓ      →  Σ^o_k(ℓ),

for k = 1 ... M. Then (C.38) becomes

   w^o_k(ℓ+1) = w^o_k(ℓ) + K^o_k(ℓ+1) [ y_k(ℓ+1) − m̂_k(ℓ+1) ],   k = 1 ... M.        (C.39)

Recalling that we wish to minimize the local quadratic cost function E given by Eq. (10.14), note that the expression in square brackets above is in fact the negative derivative
166 APPENDIX C. ADVANCED NEURAL NETWORK TRAINING ALGORITHMS

of E(ℓ+1) with respect to the output signal of the neuron, i.e.

   y_k(ℓ+1) − m̂_k(ℓ+1) = − ∂E(ℓ+1)/∂m_k(ℓ+1),

so that

   w^o_k(ℓ+1) = w^o_k(ℓ) − K^o_k(ℓ+1) [ ∂E(ℓ+1)/∂m_k(ℓ+1) ]_{m̂_k(ℓ+1)}.        (C.40)

With this result, we can turn consideration to the hidden neurons, making the substitutions

   w(ℓ)     →  w^h_j(ℓ)
   m̂(ℓ+1)  →  n̂_j(ℓ+1) = g( w^h_j(ℓ)^T x(ℓ+1) )
   A(ℓ+1)  →  A^h_j(ℓ+1) = n̂_j(ℓ+1)( 1 − n̂_j(ℓ+1) ) x(ℓ+1)^T
   K_ℓ      →  K^h_j(ℓ)
   Σ_ℓ      →  Σ^h_j(ℓ),

for j = 1 ... L. Then, analogously to (C.40), the update equation for the weight vector of the jth hidden neuron is

   w^h_j(ℓ+1) = w^h_j(ℓ) − K^h_j(ℓ+1) [ ∂E(ℓ+1)/∂n_j(ℓ+1) ]_{n̂_j(ℓ+1)}.        (C.41)

To obtain the partial derivative in (C.41), we differentiate the cost function (10.14):

   ∂E(ℓ+1)/∂n_j(ℓ+1) = − Σ_{k=1}^{M} ( y_k(ℓ+1) − m_k(ℓ+1) ) ∂m_k(ℓ+1)/∂n_j(ℓ+1).

From (C.34), noting that (w^o_k)_j = W^o_{jk}, we have

   ∂m_k(ℓ+1)/∂n_j(ℓ+1) = m_k(ℓ+1)( 1 − m_k(ℓ+1) ) W^o_{jk}(ℓ+1).

Combining the last two equations,

   ∂E(ℓ+1)/∂n_j(ℓ+1) = − Σ_{k=1}^{M} ( y_k(ℓ+1) − m_k(ℓ+1) ) m_k(ℓ+1)( 1 − m_k(ℓ+1) ) W^o_{jk}(ℓ+1),

which we can write more compactly as

   ∂E(ℓ+1)/∂n_j(ℓ+1) = − W^o_{j·}(ℓ+1) β^o(ℓ+1),        (C.42)

where W^o_{j·} is the jth row of the output layer weight matrix, and where

   β^o(ℓ+1) = ( y(ℓ+1) − m(ℓ+1) ) ⊗ m(ℓ+1) ⊗ ( 1 − m(ℓ+1) ).

The correct update relation for the weights of the jth hidden neuron is therefore

   w^h_j(ℓ+1) = w^h_j(ℓ) + K^h_j(ℓ+1) [ W^o_{j·}(ℓ+1) β^o(ℓ+1) ].        (C.43)

Apart from initialization of the covariance matrices Σ^h_j(ℓ=0) and Σ^o_k(ℓ=0), the Kalman training procedure has no adjustable parameters whatsoever. The covariance matrices are simply taken to be proportional to the corresponding identity matrices:

   Σ^h_j(0) = Z I^h,   Σ^o_k(0) = Z I^o,   Z ≫ 1,   j = 1 ... L,   k = 1 ... M,

where I^h is the (N+1) × (N+1) and I^o the (L+1) × (L+1) identity matrix. We choose Z = 100 and obtain
C.3. KALMAN FILTER TRAINING 167

Algorithm (Kalman filter training)

1. Set ℓ = 0, Σ^h_j(0) = 100 · I^h, j = 1 ... L, Σ^o_k(0) = 100 · I^o, k = 1 ... M, and initialize the synaptic weight matrices W^h(0) and W^o(0) with random numbers.

2. Choose a training pair (x(ℓ+1), y(ℓ+1)) and determine the hidden layer output vector

      n̂(ℓ+1) = ( 1, g( W^h(ℓ) x(ℓ+1) ) )^T

   (the bias component 1 followed by the hidden neuron outputs) and with it the quantities

      A^h_j(ℓ+1) = n̂_j(ℓ+1)( 1 − n̂_j(ℓ+1) ) x(ℓ+1)^T,   j = 1 ... L,
      m̂_k(ℓ+1) = g( w^o_k(ℓ)^T n̂(ℓ+1) ),
      A^o_k(ℓ+1) = m̂_k(ℓ+1)( 1 − m̂_k(ℓ+1) ) n̂(ℓ+1)^T,   k = 1 ... M,

   and

      β^o(ℓ+1) = ( y(ℓ+1) − m̂(ℓ+1) ) ⊗ m̂(ℓ+1) ⊗ ( 1 − m̂(ℓ+1) ).

3. Determine the Kalman gains for all of the neurons according to

      K^o_k(ℓ+1) = Σ^o_k(ℓ) A^o_k(ℓ+1)^T [ A^o_k(ℓ+1) Σ^o_k(ℓ) A^o_k(ℓ+1)^T + 1 ]^{−1},   k = 1 ... M,
      K^h_j(ℓ+1) = Σ^h_j(ℓ) A^h_j(ℓ+1)^T [ A^h_j(ℓ+1) Σ^h_j(ℓ) A^h_j(ℓ+1)^T + 1 ]^{−1},   j = 1 ... L.

4. Update the synaptic weight matrices:

      w^o_k(ℓ+1) = w^o_k(ℓ) + K^o_k(ℓ+1) [ y_k(ℓ+1) − m̂_k(ℓ+1) ],   k = 1 ... M,
      w^h_j(ℓ+1) = w^h_j(ℓ) + K^h_j(ℓ+1) [ W^o_{j·}(ℓ+1) β^o(ℓ+1) ],   j = 1 ... L.

5. Determine the new covariance matrices:

      Σ^o_k(ℓ+1) = [ I^o − K^o_k(ℓ+1) A^o_k(ℓ+1) ] Σ^o_k(ℓ),   k = 1 ... M,
      Σ^h_j(ℓ+1) = [ I^h − K^h_j(ℓ+1) A^h_j(ℓ+1) ] Σ^h_j(ℓ),   j = 1 ... L.

6. If the overall cost function (10.13) is sufficiently small, stop; else set ℓ = ℓ + 1 and go to 2.

This method was originally suggested by Shah and Palmieri [SP90], who called it the multiple extended Kalman algorithm (MEKA). Here is an excerpt from the object class FFNKAL extending FFN, showing the class method which implements the Kalman filter algorithm:

Pro FFNKAL::Train
; define update matrices for Wh and Wo
   dWh = fltarr(self.LL,self.NN+1)
   dWo = fltarr(self.MM,self.LL+1)
   iter = 0L
   iter100 = 0L
   progressbar = Obj_New('progressbar', Color='blue', Text='0',$
                 title='Training: exemplar number...',xsize=250,ysize=20)
168 APPENDIX C. ADVANCED NEURAL NETWORK TRAINING ALGORITHMS

   progressbar->start
   window,12,xsize=600,ysize=400,title='FFN(Kalman filter)'
   wset,12
   repeat begin
      if progressbar->CheckCancel() then begin
         print,'Training interrupted'
         progressbar->Destroy
         return
      endif
; select exemplar pair at random
      ell = long(self.p*randomu(seed))
      x = (*self.Xs)[ell,*]
      y = (*self.Ys)[ell,*]
; send it through the network
      m = self->forwardPass(x)
; error at output
      e = y-m
; loop over the output neurons
      for k=0,self.MM-1 do begin
; linearized input (column vector)
         Ao = m[k]*(1-m[k])*(*self.N)
; Kalman gain
         So = (*self.So)[*,*,k]
         SA = So##Ao
         Ko = SA/((transpose(Ao)##SA)[0]+1)
; determine delta for this neuron
         dWo[k,*] = Ko*e[k]
; update its covariance matrix
         So = So - Ko##transpose(Ao)##So
         (*self.So)[*,*,k] = So
      endfor
; update the output weights
      *self.Wo = *self.Wo + dWo
; backpropagated error
      beta_o = e*m*(1-m)
; loop over the hidden neurons
      for j=0,self.LL-1 do begin
; linearized input (column vector)
         Ah = x*(*self.N)[j+1]*(1-(*self.N)[j+1])
; Kalman gain
         Sh = (*self.Sh)[*,*,j]
         SA = Sh##Ah
         Kh = SA/((transpose(Ah)##SA)[0]+1)
; determine delta for this neuron
         dWh[j,*] = Kh*((*self.Wo)[*,j+1]##beta_o)[0]
; update its covariance matrix
         Sh = Sh - Kh##transpose(Ah)##Sh
         (*self.Sh)[*,*,j] = Sh
      endfor
; update the hidden weights
C.3. KALMAN FILTER TRAINING 169

      *self.Wh = *self.Wh + dWh
; record cost history
      if iter mod 100 eq 0 then begin
         (*self.cost_array)[iter100] = alog10(self->cost())
         iter100 = iter100+1
         progressbar->Update,iter*100/self.iterations,text=strtrim(iter,2)
         plot,*self.cost_array,xrange=[0,iter100],color=0,background='FFFFFF'XL,$
              xtitle='Iterations/100',ytitle='log(cross entropy)'
      end
      iter = iter+1
   endrep until iter eq self.iterations
   progressbar->Destroy
End
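For orientation, the following is a minimal sketch of how such a trainer object might be exercised on a toy two-class data set, following the calling sequence ffn = Obj_New(FFNCG, Xs, Ys, L) documented in the FFNCG__DEFINE header of Appendix D.7. The data shapes, and the assumption that the class Init method supplies sensible defaults for the number of exemplars and training epochs, are illustrative only and not part of the extension package:

; hedged usage sketch, not part of the extension code
pro ffn_train_demo
   seed = 12345L
; 200 two-dimensional observations as column vectors, two well separated classes
   Xs = [[randomn(seed,2,100)], [randomn(seed,2,100)+3.0]]
   Ys = fltarr(2,200)
   Ys[0,0:99]    = 1.0                  ; class 1 labels (1,0)^T
   Ys[1,100:199] = 1.0                  ; class 2 labels (0,1)^T
   ffn = Obj_New('FFNCG', Xs, Ys, 4)    ; 4 hidden neurons
   ffn->Train                           ; scaled conjugate gradient training
   Obj_Destroy, ffn
end

The Kalman filter trainer would be exercised in the same way with the class name FFNKAL substituted for FFNCG.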
Appendix D

ENVI Extensions

D.1 Installation

To install the complete extension package:

1. Place the file cursor_motion.pro in your save_add directory. In File→Preferences→User Defined Motion Routine in the ENVI main menu enter: cursor_motion.

2. Place the remaining .PRO files anywhere in your IDL !PATH.

3. Place the files madviewhelp.pdf and aboutmadview.pdf in your IDL !HELP_PATH.

4. Under Preferences in the ENVI main menu select the tab User Defined Files. Under Envi Menu File and Display Menu File enter the paths to the two menu files envi.men and display.men provided in the package.

171
172 APPENDIX D. ENVI EXTENSIONS

D.2 Topographic modelling

D.2.1 Calculating building heights

CALCHEIGHT is an ENVI extension to determine the height of vertical buildings in QuickBird/Ikonos images using rational function models (RFMs) provided with ortho-ready imagery. It is invoked as Tools/Building Height from the ENVI display menu.

Usage

Load an RFM file in the CalcHeight window with File/Load RPC File (extension RPC or RPB). If a DEM is available for the scene, this can also be loaded with File/Load DEM File. A DEM is not required, however. Click on the bottom of a vertical structure to set the base height and then shift-click on the top of the structure. Press the CALC button to display the structure's height, latitude, longitude and base elevation. The number in brackets next to the height is the minimum distance (in pixels) between the top pixel and a vertical line through the bottom pixel. It should be of the order of 1 or less. If no DEM is loaded, the base elevation is the average value for the whole scene. If a DEM is used, the base elevation is taken from it. The latitude and longitude are then orthorectified values.

Source headers

;+
; NAME:
;   CALCHEIGHT
; PURPOSE:
;   Determine height (and lat, long, elevation) of vertical buildings
;   in QuickBird/Ikonos images using RPCs
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   CalcHeight
; ARGUMENTS:
;   Event (if used as a plug-in menu item)
; KEYWORDS:
;   None
; COMMON BLOCKS:
;   Shared, RPC, Cb, Rb, Ct, Rt, elev
;   Cursor_Motion_C, dn, Cbtext, Rbtext, Cttext, Rttext
;   RPC: structure with RPC camera model
;   Cb, Rb: coordinates of building base
;   Ct, Rt: coordinates of building top
;   elev: elevation of base
;   dn: display number
D.2. TOPOGRAPHIC MODELLING 173

;   Cbtext ... : Edit widgets
; DEPENDENCIES:
;   ENVI
;   CURSOR_MOTION
; -------------------------------------------------------------
;+
; NAME:
;   CURSOR_MOTION
; PURPOSE:
;   Cursor communication with ENVI image windows
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   Cursor_Motion, dn, xloc, yloc, xstart=xstart, ystart=ystart, event=event
; ARGUMENTS:
;   dn: display number
;   xloc,yloc: mouse position
; KEYWORDS:
;   xstart, ystart: display origin
;   event: mouse event
; COMMON BLOCKS:
;   Cursor_Motion_C, dn, Cbtext, Rbtext, Cttext, Rttext
; DEPENDENCIES:
;   None
;--------------------------------------------------------------------------

D.2.2 Illumination correction

C_CORRECTION is an ENVI extension for local illumination correction for multispectral images. It is invoked from the ENVI main menu as Topographic/Illumination Correction.

Usage

From the Choose image for correction menu select the (spectral/spatial subset of the) image to be corrected. Then in the C-correction parameters box enter the solar elevation and azimuth in degrees and, if desired, a new size for the kernel used for slope/aspect determination (default 9×9). In the Choose digital elevation file window select the corresponding DEM file. Finally in the Output corrected image box choose an output file name or select memory.
174 APPENDIX D. ENVI EXTENSIONS

Source headers

;+
; NAME:
;   C_CORRECTION
; PURPOSE:
;   ENVI extension for c-correction for solar illumination in rough terrain
;   Ref: D. Riano et al. IEEE Transactions on
;   Geoscience and Remote Sensing, 41(5) 2003, 1056-1061
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   C_Correction
; ARGUMENTS:
;   Event (if used as a plug-in menu item)
; KEYWORDS:
;   None
; DEPENDENCIES:
;   ENVI
;------------------------------------------------------------------------
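The correction applied by C_CORRECTION can be summarized in a short sketch. The function below is purely illustrative and is not part of the extension: it assumes that slope and aspect rasters in radians have already been derived from the DEM, and that the empirical parameter c (the ratio of intercept to slope of the regression of the band on the local illumination cosine) has already been estimated as described in the reference cited above:

function c_correction_band, band, slope, aspect, solar_elev, solar_azim, c
; band: image band to be corrected
; slope, aspect: terrain slope and aspect in radians (from the DEM)
; solar_elev, solar_azim: solar elevation and azimuth in degrees
; c: empirical c-parameter from the regression of the band on cos(gamma_i)
   theta_z = (90.0 - solar_elev)*!dtor            ; solar zenith angle
   phi_a   = solar_azim*!dtor
; cosine of the local solar incidence angle gamma_i
   cos_gamma = cos(theta_z)*cos(slope) + $
               sin(theta_z)*sin(slope)*cos(phi_a - aspect)
; c-correction (Riano et al., 2003)
   return, band*(cos(theta_z) + c)/(cos_gamma + c)
end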
D.3. IMAGE REGISTRATION 175

D.3 Image registration

CONTOUR_MATCH is an ENVI extension for determination of ground control points (GCPs) for image-image registration. It is invoked from the ENVI main menu as Map/Registration/Contour Matching.

Usage

In the Choose base image band window enter a (spatial subset) of the base image. Then in the Choose warp image band window select the image to be warped. In the LoG sigma box choose the size of the Laplacian of Gaussian filter kernel. The default is 25 (σ = 2.5). Finally in the Save GCPs to ASCII menu enter a file name (extension .pts) for the GCPs. After the calculation, these can then be loaded and inspected in the usual ENVI image-image registration dialog.

Source headers

;+
; NAME:
;   CONTOUR_MATCH
; PURPOSE:
;   ENVI extension for extraction of ground control points for image-image registration
;   Images may be already georeferenced, in which case GCPs are for fine adjustment
;   Uses Laplacian of Gaussian filter and contour tracing to match closed contours
;   Ref: Li et al, IEEE Transactions on Image Processing, 4(3) (1995) 320-334
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   Contour_Match
; ARGUMENTS:
;   Event (if used as a plug-in menu item)
; KEYWORDS:
;   None
; DEPENDENCIES:
;   ENVI
;   CI_DEFINE
;   PROGRESSBAR_DEFINE (FSC_COLOR)
;------------------------------------------------------------------------
;+
; NAME:
;   CI__DEFINE
; PURPOSE:
;   Find thin closed contours in an image band with combined Sobel-LoG filtering
;   Ref: Li et al, IEEE Transactions on Image Processing, 4(3) (1995) 320-334
; AUTHOR:
;   Mort Canty (2004)
176 APPENDIX D. ENVI EXTENSIONS

;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   ci = Obj_New(CI,image,sigma)
; ARGUMENTS:
;   image: grayscale image band to be searched for contours
;   sigma: Gaussian radius for LoG filter
; KEYWORDS:
;   None
; METHODS:
;   GET_MAX_CONTOURS     return maximum number of closed contours (8000)
;   GET_MAX_LENGTH       return maximum contour length (200)
;   GET_CONTOUR_IMAGE    return contour (filtered) image
;   CLEAR_CONTOUR_IMAGE  erase contour image
;   TO_PIXELS            read contour structure and return its pixel array
;   TO_FILTERED_CC       read contour structure and return filtered chain code
;   TO_MOMENTS           read contour structure and return Hu invariant moments
;   WRITE_CONTOUR        read contour structure and display on contour_image
;   TRACE_CONTOUR        search contour image for next closed contour
; DEPENDENCIES:
;   None
; Contour structure
;   c = { sp:     intarr(2), $              ; starting point coordinates
;         length: 0L, $                     ; number of pixels
;         closed: -1L, $                    ; set to zero while tracing, 1 when
;                                           ; closed, -1 when no more contours
;         code:   bytarr(self.max_length), $ ; chain code
;         icode:  bytarr(self.max_length) } ; rotationally invariant chain code
; ---------------------------------------------------------------------
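The contour extraction in CI__DEFINE rests on Laplacian of Gaussian (LoG) filtering of the image band. Purely as a point of reference, a LoG kernel of the size and radius quoted in the Usage section above (25 pixels, σ = 2.5) could be generated and applied as in the following sketch; the helper function name is hypothetical, and the CI class itself additionally applies Sobel filtering and contour tracing:

function log_kernel, n, sigma
; build an n x n Laplacian-of-Gaussian kernel with Gaussian radius sigma
   x  = findgen(n) - (n-1)/2.0
   xx = x # replicate(1.0,n)
   yy = transpose(xx)
   r2 = xx^2 + yy^2
   k  = ((r2 - 2*sigma^2)/sigma^4)*exp(-r2/(2*sigma^2))
   return, k - mean(k)          ; force the kernel to sum to zero
end

; filtered = convol(float(band), log_kernel(25, 2.5), /edge_truncate)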
D.4. IMAGE FUSION 177

D.4 Image fusion

D.4.1 DWT fusion

ARSIS_DWT is an ENVI extension for panchromatic sharpening with the discrete wavelet transform (DWT). It is invoked from the ENVI main menu as

Transform/Image Sharpening/Wavelet(ARSIS Model)/DWT

Usage

In the Select low resolution multi-band input file window choose the (spatial/spectral subset of the) image to be sharpened. In the Select hi res input band window choose the corresponding panchromatic or high resolution image. Then in the ARSIS Fusion Output box select an output file name or memory.

Source headers

;+
; NAME:
;   ARSIS_DWT
; PURPOSE:
;   ENVI extension for panchromatic sharpening under ARSIS model
;   with Mallat's discrete wavelet transform and Daubechies wavelets
;   Ref: Ranchin and Wald, Photogramm. Eng. Remote Sens.
;   66(1), 2000, 49-61
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   ARSIS_DWT
; ARGUMENTS:
;   Event (if used as a plug-in menu item)
; KEYWORDS:
;   None
; DEPENDENCIES:
;   ENVI
;   DWT__DEFINE(PHASE_CORR)
;   ORTHO_REGRESS
;------------------------------------------------------------------------
;+
; NAME:
;   DWT__DEFINE
; PURPOSE:
;   Discrete wavelet transform class using Daubechies wavelets
;   for construction of pyramid representations of images, fusion etc.
;   Ref: T. Ranchin, L. Wald, Photogrammetric Engineering and
;   Remote Sensing 66(1) (2000) 49-61.
178 APPENDIX D. ENVI EXTENSIONS

; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   dwt = Obj_New(DWT,image)
; ARGUMENTS:
;   image: grayscale image to be compressed
; KEYWORDS:
;   None
; METHODS:
;   SET_COEFF: choose the Daubechies wavelet
;     dwt -> Set_Coeff, n    ; n = 4,6,8,12
;   SHOW_IMAGE: display the image pyramid in a window
;     dwt -> Show_Image, wn
;   INJECT: overwrite upper left quadrant
;     after phase correlation match if keyword pc is set (default)
;     dwt -> Inject, array, pc = pc
;   SET_COMPRESSIONS: set the number of compressions
;     dwt -> Set_Compressions, nc
;   GET_COMPRESSIONS: get the number of compressions
;     nc = dwt -> Get_Compressions()
;   GET_NUM_COLS: get the number of columns in the compressed image
;     cols = dwt -> Get_Num_Cols()
;   GET_NUM_ROWS: get the number of rows in the compressed image
;     rows = dwt -> Get_Num_Rows()
;   GET_IMAGE: return the pyramid image
;     im = dwt -> Get_Image()
;   GET_QUADRANT: get compressed image (as 2D array) or innermost
;     wavelet coefficients as vector
;     wc = dwt -> Get_Quadrant(n)    ; n = 0,1,2,3
;   NORMALIZE_WC: normalize wavelet coefficients at all levels
;     dwt -> Normalize, a, b    ; a, b are normalization parameters
;   COMPRESS: perform a single compression
;     dwt -> Compress
;     dwt -> Inject, array, pc=pc
;   EXPAND: perform a single expansion
;     dwt -> Expand
; DEPENDENCIES:
;   PHASE_CORR
; ---------------------------------------------------------------------
;+
; NAME:
;   PHASE_CORR
; PURPOSE:
;   Returns relative offset [xoff,yoff] of two images using phase correlation
D.4. IMAGE FUSION 179

;   Maximum offset should not exceed +- 5 pixels in each dimension
;   Returns -1 if dimensions are not equal
;   Ref: H. Shekarforoush et al. INRIA 2707
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   shft = Phase_Corr(im1,im2,display=display,subpixel=subpixel)
; ARGUMENTS:
;   im1, im2: the images to be correlated
; KEYWORDS:
;   Display: (optional) show a surface plot of the correlation
;     in the window with display number display
;   Subpixel: returns result to subpixel accuracy if set,
;     otherwise nearest integer (default)
; DEPENDENCIES:
;   None
;---------------------------------------------------------------------------
; NAME:
;   ORTHO_REGRESS
; PURPOSE:
;   Orthogonal regression between two vectors
;   Ref: M. Canty et al. Remote Sensing of Environment 91(3,4) (2004) 441-451
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   Ortho_Regress, X, Y, a, Xm, Ym, sigma_a, sigma_b
;   regression line is Y = Ym + a(X-Xm) = (Ym-aXm) + aX = b + aX
; ARGUMENTS:
;   input column vectors X and Y
;   returns a, Xm, Ym, sigma_a, sigma_b
; KEYWORDS:
;   None
; DEPENDENCIES:
;   None
;-------------------------------------------------------------------

D.4.2 ATWT fusion

ARSIS_ATWT is an ENVI extension for panchromatic sharpening with the à trous wavelet transform (ATWT). It is invoked from the ENVI main menu as

Transform/Image Sharpening/Wavelet(ARSIS Model)/ATWT
180 APPENDIX D. ENVI EXTENSIONS Usage In the Select low resolution multi-band input file window choose the (spatial/spectral subset of the) image to be sharpened. In the Select hi res input band window choose the corresponding panchromatic or high resolution image. Then in the ARSIS Fusion Output box select an output file name or memory. Source headers ;+ ; NAME: ; ARSIS_ATWT ; PURPOSE: ; ENVI extension for panchromatic sharpening under ARSIS model ; with A trous wavelet transform. ; Ref: Aiazzi et al, IEEE Transactions on Geoscience and ; Remote Sensing, 40(10) 2300-2312, 2002 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; ARSIS_ATWT ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; ATWT__DEFINE(WARP_SHIFT, PHASE_CORR) ; ORTHO_REGRESS ;------------------------------------------------------------------------ ;+ ; NAME: ; ATWT__DEFINE ; PURPOSE: ; A Trous wavelet transform class using Daubechies wavelets. ; Used for shift invariant image fusion ; Ref: Aiazzi et al. IEEE Transactions on Geoscience and ; Remote Sensing 40(10) (2002) 2300-2312 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; atwt = Obj_New(ATWT,image) ; ARGUMENTS: ; image: grayscale image to be processed ; KEYWORDS
D.4. IMAGE FUSION 181

;   None
; METHODS:
;   SHOW_IMAGE: display the image pyramid in a window
;     dwt -> Show_Image, wn
;   INJECT: overwrite the filtered image
;     dwt -> Inject, im
;   SET_TRANSFORMS: set the number of transformations
;     dwt -> Set_Transforms, nc
;   GET_TRANSFORMS: get the number of transformations
;     nc = dwt -> Get_Transforms()
;   GET_NUM_COLS: get the number of columns in the compressed image
;     cols = dwt -> Get_Num_Cols()
;   GET_NUM_ROWS: get the number of rows in the compressed image
;     rows = dwt -> Get_Num_Rows()
;   GET_IMAGE: return filtered image or details
;     im = dwt -> Get_Image(i)    ; i = 0 for filtered image, i > 0 for details
;   NORMALIZE_WC: normalize details at all levels
;     dwt -> Normalize, a, b    ; a, b are normalization parameters
;   COMPRESS: perform a single transformation
;     dwt -> Compress
;   EXPAND: perform a single reverse transformation
;     dwt -> Expand
; DEPENDENCIES:
;   WARP_SHIFT
;   PHASE_CORR
; ---------------------------------------------------------------------
;+
; NAME:
;   WARP_SHIFT
; PURPOSE:
;   Use RST with bilinear interpolation to shift band to sub-pixel accuracy
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   sband = Warp_Shift(band,shft)
; ARGUMENTS:
;   band: the image band to be shifted
; KEYWORDS:
;   None
; DEPENDENCIES:
;   ENVI
;---------------------------------------------------------------------------
182 APPENDIX D. ENVI EXTENSIONS

D.4.3 Quality index

RUN_QUALITY_INDEX is an ENVI extension to determine the Wang-Bovik quality index of a pan-sharpened image. It is invoked from the ENVI main menu as

Transform/Image Sharpening/Quality Index

Usage

From the Choose reference image menu select the multispectral image to which the sharpened image is to be compared. In the Choose pan-sharpened image menu, select the image whose quality is to be determined.

Source headers

;+
; NAME:
;   RUN_QUALITY_INDEX
; PURPOSE:
;   ENVI extension for radiometric comparison of two multispectral images
;   Ref: Wang and Bovik, IEEE Signal Processing Letters 9(3) 2002, 81-84
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   Run_Quality_Index
; ARGUMENTS:
;   Run_Quality_Index, Event (if used as a plug-in menu item)
; KEYWORDS:
;   None
; DEPENDENCIES:
;   ENVI
;   QI
;--------------------------------------------------------------------------
;+
; NAME:
;   QI
; PURPOSE:
;   Determine the Wang-Bovik quality index for a pan-sharpened image band
;   Ref: Wang and Bovik, IEEE Signal Processing Letters 9(3) 2002, 81-84
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   index = QI(band1,band2)
; ARGUMENTS:
;   band1: reference band
D.4. IMAGE FUSION 183 ; band2: degraded pan-sharpened band ; KEYWORDS: ; None ; DEPENDENCIES: ; None ;----------------------------------------------------------------------
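The index computed by QI is the Wang-Bovik universal image quality index, which combines correlation, mean luminance and contrast comparisons between the reference and the pan-sharpened band. As a rough illustration, a global (whole-band) version can be written as follows; the function name is hypothetical, and the QI routine itself may evaluate the index over sliding windows rather than globally:

function qi_global, band1, band2
; global Wang-Bovik quality index of band2 relative to reference band1
   x  = float(band1[*])
   y  = float(band2[*])
   mx = mean(x)  &  my = mean(y)
   vx = variance(x)  &  vy = variance(y)
   sxy = mean((x - mx)*(y - my))          ; covariance of the two bands
   return, 4*sxy*mx*my/((vx + vy)*(mx^2 + my^2))
end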
184 APPENDIX D. ENVI EXTENSIONS

D.5 Change detection

D.5.1 Multivariate Alteration Detection

MAD_RUN is an ENVI extension for change detection with the MAD transformation. It is invoked from the ENVI main menu as

Basic Tools/Change Detection/MAD

Usage

From the Choose first image window enter the first (spatial/spectral subset) of the two image files. In the Choose second image window enter the second image file name. The spatial and spectral subsets must be identical. If an input image is in BSQ format, it is converted in place, after a warning, to BIP. In the MAD Output box choose a file name or memory. The calculation begins and can be interrupted at any time with the Cancel button. Before output, the spatial subset for the final MAD transformation can be changed, e.g. extended to a full scene, if desired.

Source headers

;+
; NAME:
;   MAD_RUN
; PURPOSE:
;   ENVI extension for Multivariate Alteration Detection.
;   Ref: A. A. Nielsen et al. Remote Sensing of Environment 64 (1998), 1-19
;   Uses spectral tiling and therefore suitable for large datasets.
;   Reads in two registered multispectral images (spectral/spatial subsets
;   must have the same dimensions, spectral subset size must be at least 2).
;   If an input image is in BSQ format, it is converted in place to BIP.
;   Writes the MAD variates to disk.
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   Mad_Run
; ARGUMENTS:
;   Event (if used as a plug-in menu item)
; KEYWORDS:
;   None
; DEPENDENCIES:
;   ENVI
;   MAD_TILED (COVPM_DEFINE, GEN_EIGENPROBLEM)
;   PROGRESSBAR_DEFINE (FSC_COLOR)
;---------------------------------------------------------------------
;+
; NAME:
D.5. CHANGE DETECTION 185

;   MAD_TILED
; PURPOSE:
;   Function for Multivariate Alteration Detection.
;   Ref: A. A. Nielsen et al. Remote Sensing of Environment 64 (1998), 1-19
;   Uses spectral tiling and therefore suitable for large datasets.
;   Input files must be BIL or BIP format.
;   On error or if interrupted during the first iteration, returns -1, else 0
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   result = Mad_Tiled(fid1,fid2,dims1,dims2,pos1,pos2)
; ARGUMENTS:
;   fid1, fid2   input file specifications
;   dims1, dims2
;   pos1, pos2
; KEYWORDS:
;   A, B          output: transformation eigenvectors
;   means1, means2  weighted mean values for transformation, row-replicated
;   cp            change probability image from chi-square distribution
; DEPENDENCIES:
;   ENVI
;   COVPM_DEFINE
;   GEN_EIGENPROBLEM
;   PROGRESSBAR_DEFINE (FSC_COLOR)
;---------------------------------------------------------------------
;+
; NAME:
;   COVPM__DEFINE
; PURPOSE:
;   Object class for iterative covariance matrix calculation
;   using the method of provisional means.
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   covpm = Obj_New(COVPM)
; ARGUMENTS:
;   None
; KEYWORDS:
;   None
; METHODS:
;   UPDATE: update the covariance matrix with an observation
;     covpm -> Update, v, weight = w
;     v is an observation vector (array)
;     w is an optional weight for that observation
;   COVARIANCE: read out the covariance matrix
186 APPENDIX D. ENVI EXTENSIONS ; cov = covpm - Covariance() ; MEANS: read out the observation means ; mns = covpm - Means() ; DEPENDENCIES: ; None ;-------------------------------------------------------------- ;+ ; NAME: ; GEN_EIGENPROBLEM ; PURPOSE: ; Solve the generalized eigenproblem ; C##a = lambda*B##a ; using Cholesky factorization ; AUTHOR: ; Mort Canty (2001) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Gen_Eigenproblem, C, B, A, lambda ; ARGUMENTS: ; C and B are real, square, symmetric matrices ; returns the eigenvalues in the row vector lambda ; returns the eigenvectors a as the columns of A ; KEYWORDS: ; None ; DEPENDENCIES ; None ;--------------------------------------------------------------------- D.5.2 Maximum autocorrelation factor MAF is an ENVI extension for performing the MAF transformation, usually on previously calculated MAD variates. It is invoked from the ENVI main menu as Basic Tools/Change Detection/MAF (of MAD) Usage In the Choose multispectral image box select the file to be transformed. In the MAF Output box select an output file name or memory. Source headers ;+ ; NAME: ; MAF ; PURPOSE: ; ENVI extension for Maximum Autocorrelation Fraction transformation. ; Ref: Green et al, IEEE Transaction on Geoscience and Remote Sensing,
D.5. CHANGE DETECTION 187

;   26(1):65-74,1988
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   Maf
; ARGUMENTS:
;   Event (if used as a plug-in menu item)
; KEYWORDS:
;   None
; DEPENDENCIES:
;   ENVI
;   GEN_EIGENPROBLEM
;---------------------------------------------------------------------

D.5.3 Radiometric normalization

RADCAL is an ENVI extension for radiometric normalization with the MAD transformation. It is invoked from the ENVI main menu as

Basic Tools/Change Detection/MAD Radiometric Normalization

Usage

From the Choose reference image window enter the first (spatial/spectral subset) of the two image files. In the Choose target image window enter the second image file name. The spatial and spectral subsets must be identical. If an input image is in BSQ format, it is converted in place, after a warning, to BIP. In the MAD Output box choose a file name or memory. The calculation begins and can be interrupted at any time with the Cancel button. In a series of plot windows the regression lines used for the normalization are plotted. The results can then be used to calibrate another file, e.g. a full scene.

Source headers

;+
; NAME:
;   RADCAL
; PURPOSE:
;   Radiometric calibration using MAD
;   Ref: M. Canty et al. Remote Sensing of Environment 91(3,4) (2004) 441-451
;   Reference and target images must have equal spatial and spectral dimensions,
;   at least 2 spectral components, and be registered to one another.
;   Once the regression coefficients have been determined, they can be used to
;   calibrate another file, for example a full scene, which need not be registered
;   to the reference image.
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
188 APPENDIX D. ENVI EXTENSIONS ; CALLING SEQUENCE: ; Radcal ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; ORTHO_REGRESS ; MAD_TILED (COVPM_DEFINE, GEN_EIGENPROBLEM) ; PROGRESSBAR_DEFINE (FSC_COLOR) ;-----------------------------------------------------------------
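The change probability image produced via the cp keyword of MAD_TILED (see its header in Section D.5.1) rests on the fact that, for no-change pixels, the sum of the squared standardized MAD variates is approximately chi-square distributed with N degrees of freedom. A hedged sketch of that calculation is given below; the function name is illustrative only, and it assumes that the MAD variates and their no-change standard deviations have already been computed:

function mad_change_probability, mads, sigma
; mads:  N x c x r array of MAD variates (BIP ordering assumed)
; sigma: N-element vector of no-change standard deviations of the MAD variates
   sz = size(mads, /dimensions)
   N  = sz[0]
   z  = fltarr(sz[1], sz[2])
   for i = 0, N-1 do z = z + reform(mads[i,*,*])^2/sigma[i]^2
   return, chisqr_pdf(z, N)    ; probability of change under the no-change hypothesis
end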
D.6. UNSUPERVISED CLASSIFICATION 189 D.6 Unsupervised classification D.6.1 Hierarchical clustering HCLRUN is an ENVI extension for agglomerative hierarchical clustering. It is invoked from the ENVI main menu as Classification/Unsupervised/Hierarchic It is intended as a demonstration, and writes to memory only. Usage In the Choose multispectral image for clustering window select the (spatial/spectral subset of the) desired image. In the Number of Samples box choose the size of the repre- sentative random sample (default 1000). In the Number of Classes box select the desired number of clusters. Source headers ;+ ; NAME: ; HCLRUN ; PURPOSE: ; ENVI extension for hierarchical agglomerative clustering ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; HCLrun ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; HCL (PROGRESSBAR__DEFINE (FSC_COLOR)) ; CLASS_LOOKUP_TABLE ;--------------------------------------------------------------------- ;+ ; NAME: ; HCL ; PURPOSE: ; Agglomerative hierarchic clustering with sum of squares cost function. ; Takes data array Xs (column vectors) and number of clusters K as input. ; Returns cluster memberships Cs. ; Ref. Fraley Technical Report 311, Dept. of Statistics, ; University of Washington, Seattle (1996). ; AUTHOR
190 APPENDIX D. ENVI EXTENSIONS

;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   HCL, Xs, K, Cs
; ARGUMENTS:
;   Xs: input observations array (column vectors)
;   K: number of clusters
;   Cs: Cluster labels of observations
; KEYWORDS:
;   None
; DEPENDENCIES:
;   PROGRESSBAR__DEFINE (FSC_COLOR)
;--------------------------------------------------------------------
;+
; NAME:
;   CLASS_LOOKUP_TABLE
; PURPOSE:
;   Provide 16 class colors for supervised and unsupervised classification programs
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   colors = Class_Lookup_Table(Ptr)
; ARGUMENTS:
;   Ptr: a vector of pointers into the table
; KEYWORDS:
;   None
; DEPENDENCIES:
;   None
;---------------------------------------------------------------------

D.6.2 Fuzzy K-means clustering

SAMPLE_FKMRUN is an ENVI extension for fuzzy k-means clustering. It is invoked from the ENVI main menu as

Classification/Unsupervised/Fuzzy-K-Means

Usage

In the Choose multispectral image window select the (spatial/spectral subset of the) desired image. In the Number of Classes box select the desired number of clusters. In the FKM Output box select the output file name or memory.

Source headers

;+
D.6. UNSUPERVISED CLASSIFICATION 191 ; NAME: ; SAMPLE_FKMRUN ; PURPOSE: ; ENVI extension for fuzzy K-means clustering with sampled data ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Sample_FKMrun ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; FKM (PROGRESSBAR_DEFINE (FSC_COLOR)) ; CLUSTER_FKM ; CLASS_LOOKUP_TABLE ;--------------------------------------------------------------------- ;+ ; NAME: ; FKM ; PURPOSE: ; Fuzzy Kmeans clustering algorithm. ; Takes data array Xs (data as column vectors), number of clusters K. ; Returns fuzzy membership matrix U and the class centers Ms. ; Ref: J. C. Dunn, Journal of Cybernetics, PAM1:32-57, 1973 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; FKM, Xs, K, U, Ms, niter=niter, seed=seed ; ARGUMENTS: ; Xs: input observations array (column vectors) ; K: number of clusters ; U: final class probability membership matrix (output) ; Ms: cluster means (output) ; KEYWORDS: ; niter: number of iterations (optional) ; seed: initial random number seed (optional) ; DEPENDENCIES: ; PROGRESSBAR__DEFINE (FSC_COLOR) ;-------------------------------------------------------------------- ;+ ; NAME: ; CLUSTER_FKM
192 APPENDIX D. ENVI EXTENSIONS

; PURPOSE:
;   Modified distance clusterer from IDL library
; CALLING SEQUENCE:
;   labels = Cluster_fkm(Array,Weights,Double=Double,N_clusters=N_clusters)
;-------------------------------------------------------------------------

D.6.3 EM clustering

SAMPLE_EMRUN is an ENVI extension for EM clustering. It is invoked from the ENVI main menu as

Classification/Unsupervised/EM(Sampled)

TILED_EMRUN can be used to cluster large data sets. It is invoked from the ENVI main menu as

Classification/Unsupervised/EM(Tiled)

Usage

In the Choose multispectral image for clustering window select the (spatial/spectral subset of the) desired image. In the Number of Samples box choose the size of the representative random sample (default 1000). In the Number of Classes box select the desired number of clusters. In the FKM Output box select the output file name or memory. In the Output class membership probs box select the output file name for the probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte coded (0 = probability 0, 255 = probability 1). In the tiled version, output to memory is not possible. During calculation a log likelihood plot is shown. Calculation can be interrupted at any time.

Source headers

;+
D.6. UNSUPERVISED CLASSIFICATION 193 ;--------------------------------------------------------------------- ;+ ; NAME: ; TILED_EMRUN ; PURPOSE: ; ENVI extension for EM clustering on sampled data, large data sets ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Tiled_EMrun ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; EM (PROGRESSBAR__DEFINE (FSC_COLOR)) ; FKM ; CLUSTER_EM ; CLASS_LOOKUP_TABLE ;--------------------------------------------------------------------- ;+ ; NAME: ; EM ; PURPOSE: ; Expectation maximization clustering algorithm for Gaussian mixtures. ; Takes data array Xs (data as column vectors) and initial ; class membership probability matrix U as input. ; Returns U, the class centers Ms, Priors Ps and final ; class covariances Fs. ; Allows for simulated annealing ; Ref: Gath and Geva, IEEE Trans. Pattern Anal. and Mach. ; Intell. 3(3):773-781, 1989 ; Hilger, Exploratory Analysis of Multivariate Data, ; PhD Thesis, IMM, Technical University of Denmark, 2001 ; AUTHOR ; Mort Canty (2005) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Pro EM, Xs, U, Ms, Ps, Fs, unfrozen=unfrozen, wnd=wnd, $ ; maxiter=maxiter, miniter=miniter, verbos=verbose, $ ; pdens=pdens, pd_exclude=pdens_exclude, fhv=fhv, T0=T0 ; ARGUMENTS: ; Xs: input observations array (column vectors) ; U: initial class probability membership matrix (column vectors)
194 APPENDIX D. ENVI EXTENSIONS

;   Ms: cluster means (output)
;   Ps: cluster priors (output)
;   Fs: cluster covariance matrices (output)
; KEYWORDS:
;   unfrozen: Indices of the observations which
;     take part in the iteration (default all)
;   wnd: window for displaying the log likelihood (optional)
;   maxiter: maximum iterations (optional)
;   miniter: minimum iterations (optional)
;   pdens: partition density (output, optional)
;   pd_exclude: array of classes to be excluded from pdens and fhv (optional)
;   fhv: fuzzy hypervolume (output, optional)
;   T0: initial annealing temperature (default 1.0)
;   verbose: set to print output info to IDL log
; DEPENDENCIES:
;   PROGRESSBAR__DEFINE (FSC_COLOR)
;--------------------------------------------------------------------
;+
; NAME:
;   CLUSTER_EM
; PURPOSE:
;   Cluster data after running the EM algorithm
;   Takes data array (as row vectors), means Ms (as row vectors), priors Ps
;   and covariance matrices Fs and returns the class labels.
;   Class membership probabilities are returned in class_probs
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   labels = Cluster_EM(Xs,Ms,Ps,Fs,class_probs=class_probs,progress_bar=progress_bar)
; ARGUMENTS:
;   Xs: data array
;   Ms: cluster means
;   Ps: cluster priors
;   Fs: cluster covariance matrices
; KEYWORDS:
;   class_probs (optional): contains cluster membership probability image
;   progress_bar: set to 0 if no progressbar is desired
; DEPENDENCIES:
;   PROGRESSBAR__DEFINE (FSC_COLOR)
;--------------------------------------------------------------------

D.6.4 Probabilistic label relaxation

PLR is an ENVI extension for performing probabilistic relaxation on rule (class membership probability) images generated by supervised and unsupervised classification algorithms. It is invoked from the ENVI main menu as
D.6. UNSUPERVISED CLASSIFICATION 195

Classification/Post Classification/Probabilistic Label Relaxation/Run PLR

Usage

In the Choose probabilities image window select the rule image to be processed. Select the number of iterations (default 3) in the Number of iterations box. Finally choose a file name for the output rule image in the PLR output box.

PLR_RECLASS is an ENVI extension for (re)classifying rule (class membership probability) images generated by supervised and unsupervised classification algorithms and probabilistic label relaxation. It is invoked from the ENVI main menu as

Classification/Post Classification/Probabilistic Label Relaxation/Reclassify

Usage

In the Choose classification file optionally specify a previous classification image. This determines the color coding of the reclassification output. If cancelled, a default color code is used. In the Choose probabilities window, select the file to be processed. Answer the Include dummy unclassified class query with “Yes”. In the Output PLR reclassification box choose an output file name or memory.

Source headers

;+
; NAME:
;   PLR
; PURPOSE:
;   ENVI extension for postclassification with
;   Probabilistic Label Relaxation
;   Ref. Richards and Jia, Remote Sensing Digital Image Analysis (1999) Springer
;   Processes a rule image (class membership probabilities), outputs a
;   new rule image
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   Plr
; ARGUMENTS:
;   Event (if used as a plug-in menu item)
; KEYWORDS:
;   None
; DEPENDENCIES:
;   ENVI
;   PROGRESSBAR_DEFINE (FSC_COLOR)
;----------------------------------------------------------------------
;+
; NAME:
;   PLR_RECLASS
196 APPENDIX D. ENVI EXTENSIONS

; PURPOSE:
;   ENVI extension for postclassification with
;   Probabilistic Label Relaxation
;   Ref. Richards and Jia, Remote Sensing Digital Image Analysis (1999) Springer
;   Processes a rule image (class membership probabilities), outputs a
;   new classification file
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   Plr_Reclass
; ARGUMENTS:
;   Event (if used as a plug-in menu item)
; KEYWORDS:
;   None
; DEPENDENCIES:
;   ENVI
;   PROGRESSBAR_DEFINE (FSC_COLOR)
;----------------------------------------------------------------------

D.6.5 Kohonen self organizing map

SAMPLE_SOMRUN is an ENVI extension for clustering with the Kohonen self-organizing map. It is invoked from the ENVI main menu as

Classification/Unsupervised/Kohonen SOM

Usage

In the Choose multispectral image window select the (spatial/spectral subset of the) desired image. In the Cube side dimension box select the desired dimension of the cubic neural network (default 6). In the SOM Output box select the output file name or memory.

Source headers

;+
; NAME:
;   SAMPLE_SOMRUN
; PURPOSE:
;   ENVI extension for Kohonen Self Organizing Map with sampled data
;   Ref. T. Kohonen, Self Organization and Associative Memory, Springer 1989.
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   Sample_KFrun
; ARGUMENTS:
;   Event (if used as a plug-in menu item)
D.6. UNSUPERVISED CLASSIFICATION 197 ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; PROGRESSBAR__DEFINE (FSC_COLOR) ;--------------------------------------------------------------------- D.6.6 A GUI for change clustering MAD VIEW is an IDL GUI (graphical user interface) for viewing and processing MAD and MNF/MAD change images. It is invoked from the ENVI main menu as Basic Tools/Change Detection/MAD View Usage This extension is provided with an on-line help. Source headers ;+ ; NAME: ; MAD_VIEW ; PURPOSE: ; GUI for viewing, thresholding and clustering MAD/MNF images ; Ref: A. A. Nielsen et al. Remote Sensing of Environment 64 (1998), 1-19 ; A. A. Nielsen private communication (2004) ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Mad_View ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; EM ; CLUSTER_EM ; PROGRESSBAR_DEFINE (FSC_COLOR) ;---------------------------------------------------------------------
198 APPENDIX D. ENVI EXTENSIONS

D.7 Neural network: Scaled conjugate gradient

FFNCG_RUN is an ENVI extension for supervised classification with a two-layer feed-forward neural network. It uses the scaled conjugate gradient training algorithm and can be used as a replacement for the much slower backpropagation neural network implemented in ENVI. It is invoked from the ENVI main menu as

Classification/Supervised/Neural Net/Conjugate Gradient

Usage

In the Enter file for classification window select the (spatial/spectral subset of the) desired image. This must be in BIP format. In the ROI selection box choose the training regions desired. In the Output FFN classification to file box select the output file name. In the Output FFN probabilities to file box select the output file name for the probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte coded (0 = probability 0, 255 = probability 1). In the Number of hidden units box select the number of neurons in the first layer (default 4). As the calculation proceeds, the cost function is displayed in a plot window. The calculation can be interrupted with Cancel.

Source headers

;+
; NAME:
;   FFNCG_RUN
; PURPOSE:
;   ENVI extension for classification of a multispectral image
;   with a feed forward neural network using scaled conjugate gradient training
; AUTHOR:
;   Mort Canty (2004)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   FfnCG_Run
; ARGUMENTS:
;   Event (if used as a plug-in menu item)
; KEYWORDS:
;   None
; DEPENDENCIES:
;   ENVI
;   PROGRESSBAR_DEFINE (FSC_COLOR)
;   FFNCG__DEFINE (FFN__DEFINE)
;----------------------------------------------------------------------
;+
; NAME:
;   FFNCG__DEFINE
; PURPOSE:
;   Object class for implementation of a two-layer, feed-forward
;   neural network for classification of multi-spectral images.
D.7. NEURAL NETWORK: SCALED CONJUGATE GRADIENT 199

;   Implements scaled conjugate gradient training.
;   Ref: C. Bishop, Neural Networks for Pattern Recognition, Oxford 1995
;   M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR:
;   Mort Canty (2005)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   ffn = Obj_New(FFNCG,Xs,Ys,L)
; ARGUMENTS:
;   Xs: array of observation column vectors
;   Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
;   L: number of hidden neurons
; KEYWORDS:
;   None
; METHODS:
;   ROP: determine the matrix product v^t.H, where H is the Hessian of
;     the cost function wrt the weights, using the R-operator
;     r = ffn -> Rop(v)
;   HESSIAN: calculate the Hessian
;     h = ffn -> Hessian()
;   EIGENVALUES: calculate the eigenvalues of the Hessian
;     e = ffn -> Eigenvalues()
;   GRADIENT: calculate the gradient of the global cost function
;     g = ffn -> Gradient()
;   TRAIN: train the network
;     ffn -> train
; DEPENDENCIES:
;   FFN__DEFINE
;   PROGRESSBAR (FSC_COLOR)
;--------------------------------------------------------------
;+
; NAME:
;   FFN__DEFINE
; PURPOSE:
;   Object class for implementation of a two-layer, feed-forward
;   neural network for classification of multi-spectral images.
;   This is a generic class with no training methods.
;   Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR:
;   Mort Canty (2005)
;   Juelich Research Center
;   m.canty@fz-juelich.de
; CALLING SEQUENCE:
;   ffn = Obj_New(FFN,Xs,Ys,L)
; ARGUMENTS:
;   Xs: array of observation column vectors
;   Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
;   L: number of hidden neurons

Morton john canty image analysis and pattern recognition for remote sensing with algorithms in envi-idl

  • 1.
    Image Analysis andPattern Recognition for Remote Sensing with Algorithms in ENVI/IDL Morton John Canty Forschungszentrum J¨ulich GmbH m.canty@fz-juelich.de March 21, 2005
  • 2.
  • 3.
    Contents 1 Images, Arraysand Vectors 1 1.1 Multispectral satellite images . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Algebra of vectors and matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Eigenvalues and eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Finding minima and maxima . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Image Statistics 13 2.1 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 A special function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 Conditional probabilities and Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.5 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3 Transformations 21 3.1 Fourier transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.1 Discrete Fourier transform . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1.2 Discrete Fourier transform of an image . . . . . . . . . . . . . . . . . . 23 3.2 Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3 Principal components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.4 Minimum noise fraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.5 Maximum autocorrelation factor (MAF) . . . . . . . . . . . . . . . . . . . . . 28 4 Radiometric enhancement 31 4.1 Lookup tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1.1 Histogram equalization . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.1.2 Histogram matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.2 Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2.1 Laplacian of Gaussian filter . . . . . . . . . . . . . . . . . . . . . . . . 34 5 Topographic modelling 39 5.1 RST transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 i
  • 4.
    ii CONTENTS 5.2 Imagingtransformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.3 Camera models and RFM approximations . . . . . . . . . . . . . . . . . . . . 41 5.4 Stereo imaging, elevation models and orthorectification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.5 Slope and aspect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.6 Illumination correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 6 Image Registration 53 6.1 Frequency domain registration . . . . . . . . . . . . . . . . . . . . . . . . . . 53 6.2 Feature matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.2.1 Contour detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 6.2.2 Closed contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 6.2.3 Chain codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 6.2.4 Invariant moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 6.2.5 Contour matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.2.6 Consistency check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.3 Re-sampling and warping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 7 Image Sharpening 61 7.1 HSV fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 7.2 Brovey fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 7.3 PCA fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 7.4 Wavelet fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 7.4.1 Discrete wavelet transform . . . . . . . . . . . . . . . . . . . . . . . . 64 7.4.2 `A trous filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 7.5 Quality indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 8 Change Detection 69 8.1 Algebraic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 8.2 Principal components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 8.3 Post-classification comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 8.4 Multivariate alteration detection . . . . . . . . . . . . . . . . . . . . . . . . . 71 8.4.1 Canonical correlation analysis . . . . . . . . . . . . . . . . . . . . . . . 71 8.4.2 Solution by Cholesky factorization . . . . . . . . . . . . . . . . . . . . 72 8.4.3 Properties of the MAD components . . . . . . . . . . . . . . . . . . . 73 8.4.4 Covariance of MAD variates with original observations . . . . . . . . . 74 8.4.5 Scale invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 8.4.6 Improving signal to noise . . . . . . . . . . . . . . . . . . . . . . . . . 75 8.4.7 Decision thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 8.5 Radiometric normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 9 Unsupervised Classification 79
  • 5.
    CONTENTS iii 9.1 Asimple cost function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 9.2 Algorithms that minimize the simple cost function . . . . . . . . . . . . . . . 81 9.2.1 K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 9.2.2 Extended K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 9.2.3 Agglomerative hierarchical clustering . . . . . . . . . . . . . . . . . . . 83 9.2.4 Fuzzy K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 9.3 EM Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 9.3.1 Simulated annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 9.3.2 Partition density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 9.3.3 Including spatial information . . . . . . . . . . . . . . . . . . . . . . . 87 9.4 The Kohonen Self Organizing Map . . . . . . . . . . . . . . . . . . . . . . . . 89 9.5 Unsupervised classification of changes . . . . . . . . . . . . . . . . . . . . . . 91 10 Supervised Classification 93 10.1 Bayes decision rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 10.2 Training data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 10.3 Bayes Maximum likelihood classification . . . . . . . . . . . . . . . . . . . . . 95 10.4 Non-parametric methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 10.5 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 10.5.1 The feed-forward network . . . . . . . . . . . . . . . . . . . . . . . . . 99 10.5.2 Cost functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 10.5.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 10.5.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 10.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 10.6.1 Standard deviation of misclassification . . . . . . . . . . . . . . . . . . 111 10.6.2 Model comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 10.6.3 Confusion matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 11 Hyperspectral analysis 117 11.1 Mixture modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 11.1.1 Full linear unmixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 11.1.2 Unconstrained linear unmixing . . . . . . . . . . . . . . . . . . . . . . 119 11.1.3 Intrinsic end-members and pixel purity . . . . . . . . . . . . . . . . . . 119 11.2 Orthogonal subspace projection . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A Least Squares Procedures 125 A.1 Generalized least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 A.2 Recursive least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 A.3 Orthogonal regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 B The Discrete Wavelet Transformation 131
  • 6.
    iv CONTENTS B.1 Innerproduct space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 B.2 Haar wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 B.3 Multi-resolution analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 B.4 Fixpoint wavelet approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 138 B.5 The mother wavelet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 B.6 The Daubechies wavelet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 B.7 Wavelets and filter banks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 C Advanced Neural Network Training Algorithms 151 C.1 The Hessian matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 C.1.1 The R-operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 C.1.2 Calculating the Hessian . . . . . . . . . . . . . . . . . . . . . . . . . . 155 C.2 Scaled conjugate gradient training . . . . . . . . . . . . . . . . . . . . . . . . 156 C.2.1 Conjugate directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 C.2.2 Minimizing a quadratic function . . . . . . . . . . . . . . . . . . . . . 157 C.2.3 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 C.3 Kalman filter training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 C.3.1 Linearization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 C.3.2 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 D ENVI Extensions 171 D.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 D.2 Topographic modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 D.2.1 Calculating building heights . . . . . . . . . . . . . . . . . . . . . . . . 172 D.2.2 Illumination correction . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 D.3 Image registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 D.4 Image fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 D.4.1 DWT fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 D.4.2 ATWT fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 D.4.3 Quality index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 D.5 Change detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 D.5.1 Multivariate Alteration Detecton . . . . . . . . . . . . . . . . . . . . . 184 D.5.2 Maximum autocorrelation factor . . . . . . . . . . . . . . . . . . . . . 186 D.5.3 Radiometric normalization . . . . . . . . . . . . . . . . . . . . . . . . 187 D.6 Unsupervised classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 D.6.1 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 D.6.2 Fuzzy K-means clustering . . . . . . . . . . . . . . . . . . . . . . . . . 190 D.6.3 EM clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 D.6.4 Probabilistic label relaxation . . . . . . . . . . . . . . . . . . . . . . . 194 D.6.5 Kohonen self organizing map . . . . . . . . . . . . . . . . . . . . . . . 196 D.6.6 A GUI for change clustering . . . . . . . . . . . . . . . . . . . . . . . . 197
    CONTENTS v D.7 Neuralnetwork: Scaled conjugate gradient . . . . . . . . . . . . . . . . . . . . 198 D.8 Neural network: Kalman filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 D.9 Neural network: Hybrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 Bibliography 203
    Chapter 1 Images, Arraysand Vectors 1.1 Multispectral satellite images There are a number of multispectral satellite-based sensors currently in orbit which are used for earth observation. Representative of these we mention here the Landsat ETM+ system. The ETM+ instrument on the Landsat 7 spacecraft contains sensors to measure radiance in three spectral intervals: • visible and near infrared (VNIR) bands - bands 1,2,3,4, and 8 (PAN) with a spectral range between 0.4 and 1.0 micrometer. • short wavelength infrared (SWIR) bands - bands 5 and 7 with a spectral range between 1.0 and 3.0 micrometer. • thermal long wavelength infrared (LWIR) band - band 6 with a spectral range between 8.0 and 12.0 micrometer. In addition a panchromatic (PAN) image (band 8) covering the visible spectrum is provided. Ground resolutions are 15m (PAN), 30m (VNIR,SWIR) and 60m (LWIR). Figure 1.1 shows a color composite image of a Landsat 7 scene over Morocco acquired in 1999. A single multispectral image can be represented as an array of gray-scale values or digital numbers gk(i, j), 1 ≤ i ≤ c, 1 ≤ j ≤ r, where c is the number of pixel columns and r is the number of pixel rows. If we are dealing with an N-band multispectral image, then the index k, 1 ≤ k ≤ N, denotes the spectral band. Often a pixel intensity is stored in a single byte, so that 0 ≤ gk ≤ 255. The gray-scale values are the result of sampling along an array of sensors the at-sensor radiance fλ(x, y) at wavelength λ due to sunlight reflected from some point (x, y) on the Earth’s surface and focussed by the satellite’s optical system at the sensors. Ignoring atmo- spheric effects this radiance is given roughly by fλ(x, y) ∼ iλ(x, y)rλ(x, y), where iλ(x, y) is the sun’s irradiance at the surface in units of watt/m2 µm, and rλ(x, y) is the surface reflectance, a number between 0 and 1. The conversion between gray-scale 1
    2 CHAPTER 1.IMAGES, ARRAYS AND VECTORS Figure 1.1: Color composite of bands 4 (red), 5 (green) and 7 (blue) for a Landsat ETM+ image over Morocco.
or digital number g and at-sensor radiance f is determined by the sensor calibration as measured (and maintained) by the satellite image provider:

f = C g(i, j) + f_min,  where C = (f_max − f_min)/255,

in which f_max and f_min are the maximum and minimum measurable radiances at the sensor. Atmospheric scattering and absorption models are used to calculate surface reflectance from the observed at-sensor radiance, as it is the reflectance which is directly related to the physical properties of the surface being examined.

Various conventions can be used for storing the image array g(i, j) in computer memory or on storage media. In band interleaved by pixel (BIP) format, for example, a two-channel, 3 × 3 pixel image would be stored as

g1(1,1) g2(1,1) g1(2,1) g2(2,1) g1(3,1) g2(3,1)
g1(1,2) g2(1,2) g1(2,2) g2(2,2) g1(3,2) g2(3,2)
g1(1,3) g2(1,3) g1(2,3) g2(2,3) g1(3,3) g2(3,3),

whereas in band interleaved by line (BIL) format it would be stored as

g1(1,1) g1(2,1) g1(3,1) g2(1,1) g2(2,1) g2(3,1)
g1(1,2) g1(2,2) g1(3,2) g2(1,2) g2(2,2) g2(3,2)
g1(1,3) g1(2,3) g1(3,3) g2(1,3) g2(2,3) g2(3,3),

and in band sequential (BSQ) format as

g1(1,1) g1(2,1) g1(3,1) g1(1,2) g1(2,2) g1(3,2) g1(1,3) g1(2,3) g1(3,3)
g2(1,1) g2(2,1) g2(3,1) g2(1,2) g2(2,2) g2(3,2) g2(1,3) g2(2,3) g2(3,3).

In the computer language IDL, so-called row major indexing is used for arrays and the elements in an array are numbered from zero. This means that, if a gray-scale image g is stored in an IDL array variable G, then the intensity value g(i, j) is addressed as G[i-1,j-1]. An N-band multispectral image is stored in BIP format as an N × c × r array in IDL, in BIL format as a c × N × r array, and in BSQ format as a c × r × N array.

Auxiliary information, such as image acquisition parameters and georeferencing, is normally included with the image data in the same file, and the format may or may not make use of compression algorithms. Examples are the geoTIFF¹ file format used for example by Space Imaging Inc. for distributing Carterra(c) imagery, which includes lossless compression, the HDF (Hierarchical Data Format) in which for example ASTER images are distributed, and the cross-platform PCIDSK format employed by PCI Geomatics with its image processing software, which is in plain ASCII code and not compressed. ENVI uses a simple "flat binary" file structure with an additional ASCII header file.

¹geoTIFF refers to TIFF files which have geographic (or cartographic) data embedded as tags within the TIFF file. The geographic data can then be used to position the image in the correct location and geometry on the screen of a geographic information display.
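To make these storage conventions concrete, the following short IDL fragment converts a byte-encoded band to at-sensor radiance with the linear calibration above and reorders a BSQ image cube into BIP and BIL order. This is only a sketch: the calibration constants and array contents are arbitrary stand-ins, not values for any particular sensor.

; hypothetical calibration constants
f_min = 0.0  &  f_max = 191.6
C = (f_max - f_min)/255.0
g = byte(randomu(seed,100,100)*255)     ; stand-in for one byte-encoded band
f = C*float(g) + f_min                  ; at-sensor radiance
; reorder a BSQ cube (c x r x N) into BIP (N x c x r) and BIL (c x N x r)
bsq = findgen(4,3,2)                    ; 4 columns, 3 rows, 2 bands
bip = transpose(bsq, [2,0,1])
bil = transpose(bsq, [0,2,1])
print, size(bip, /dimensions), size(bil, /dimensions)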
    4 CHAPTER 1.IMAGES, ARRAYS AND VECTORS 1.2 Algebra of vectors and matrices It is very convenient to use a vector representation for multispectral images, namely g(i, j) =    g1(i, j) ... gN (i, j)    , (1.1) which is a column vector of multispectral gray-scale values at the position (i, j). Since we will be making extensive use of the vector notation of Eq. (1.1) we review here some of the basic properties of vectors and matrices. We can illustrate most of these properties in just two dimensions. x x2 x1 Figure 1.2: A vector in two dimensions. The transpose of the two-dimensional column vector shown in Fig. 1.2, x = x1 x2 , is the row vector x = (x1, x2). The sum of two vectors is given by x + y = x1 x2 + y1 y2 = x1 + y1 x2 + y2 , and the inner product by x y = (x1, x2) y1 y2 = x1y1 + x2y2. The length or norm of the vector x is x = |x| = x2 1 + x2 2 = √ x x . The programming language IDL is especially good at manipulating vectors and matrices:
    1.2. ALGEBRA OFVECTORS AND MATRICES 5 IDL x=[[1],[2]] IDL print,x 1 2 IDL print,transpose(x) 1 2 b X x yθ x cos θ Figure 1.3: The inner product. The inner product can be written in terms of the vector lengths and the angle θ between the two vectors as x y = |x||y| cos θ = xy cos θ, see Fig. 1.3. If θ = 90o the vectors are orthogonal so that x y = 0. Any vector can be decomposed into orthogonal unit vectors: x = x1 x2 = x1 1 0 + x2 0 1 . A two-by-two matrix is written A = a11 a12 a21 a22 . When a matrix is multiplied with a vector the result is another vector, e.g. Ax = a11 a12 a21 a22 x1 x2 = a11x1 + a12x2 a21x1 + a22x2 . The IDL operator for matrix and vector multiplication is ##. IDL a=[[1,2],[3,4]] IDL print,a 1 2 3 4 IDL print,a##x 5 11
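As a further small illustration (with arbitrarily chosen vectors), the norm and the angle θ can be computed directly from the inner product:

x = [[3.0],[4.0]]  &  y = [[4.0],[3.0]]
print, sqrt(transpose(x)##x)                        ; |x| = 5
print, acos((transpose(x)##y)/(5.0*5.0))*!radeg     ; angle between x and y in degrees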
    6 CHAPTER 1.IMAGES, ARRAYS AND VECTORS Matrices also have a transposed form, obtained by interchanging their rows and columns: A = a11 a21 a12 a22 . The product of two matrices is given by AB = a11 a12 a21 a22 b11 b12 b21 b22 = a11b11 + a12b21 · · · · · · · · · and is another matrix. The determinant of a two-dimensional matrix is |A| = det A = a11a22 − a12a21. The outer product of two vectors is a matrix: xy = x1 x2 (y1, y2) = x1 0 x2 0 y1 y2 0 0 = x1y1 x1y2 x2y1 x2y2 The identity matrix is given by I = 1 0 0 1 , IA = AI = A. The matrix inverse A−1 is defined in terms of the identity matrix according to A−1 A = AA−1 = I. In two dimensions it is easy to verify that A−1 = 1 |A| a22 −a12 −a21 a11 . IDL print, determ(float(a)) -2.00000 IDL print, invert(a) -2.00000 1.00000 1.50000 -0.500000 IDL print, a##invert(a) 1.00000 0.000000 0.000000 1.00000 If |A| = 0, then A has no inverse and is said to be a singular matrix. The trace of a square matrix is the sum of its diagonal elements: Tr A = a11 + a22. 1.3 Eigenvalues and eigenvectors The statistical properties of ensembles of pixel intensities (for example entire images or specific land-cover classes) are often approximated by their mean values and covariance
matrices. As we will see later, covariance matrices are always symmetric. A matrix A is symmetric if it doesn't change when it is transposed, i.e. if A = A^T. Very often we have to solve the so-called eigenvalue problem, which is to find eigenvectors x and eigenvalues λ that satisfy the equation Ax = λx or, equivalently,

\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \lambda \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}.

This is the same as the two equations

(a_{11} - \lambda)x_1 + a_{12}x_2 = 0
a_{21}x_1 + (a_{22} - \lambda)x_2 = 0.    (1.2)

If we eliminate x_1 and make use of the symmetry a_{12} = a_{21}, we obtain

[(a_{11} - \lambda)(a_{22} - \lambda) - a_{12}^2]\, x_2 = 0.

In general x_2 \ne 0, so we must have

(a_{11} - \lambda)(a_{22} - \lambda) - a_{12}^2 = 0,

which is known as the characteristic equation for the eigenvalue problem. It is a quadratic equation in λ with solutions

\lambda^{(1)} = \frac{1}{2}\left[ a_{11} + a_{22} + \sqrt{(a_{11} + a_{22})^2 - 4(a_{11}a_{22} - a_{12}^2)} \right]
\lambda^{(2)} = \frac{1}{2}\left[ a_{11} + a_{22} - \sqrt{(a_{11} + a_{22})^2 - 4(a_{11}a_{22} - a_{12}^2)} \right].    (1.3)

Thus there are two eigenvalues and, correspondingly, two eigenvectors x^{(1)} and x^{(2)}, which can be obtained by substituting λ^{(1)} and λ^{(2)} into (1.2) and solving for x_1 and x_2. It is easy to show that the eigenvectors are orthogonal,

(x^{(1)})^T x^{(2)} = 0.

The matrix formed by the two eigenvectors,

u = (x^{(1)}, x^{(2)}) = \begin{pmatrix} x^{(1)}_1 & x^{(2)}_1 \\ x^{(1)}_2 & x^{(2)}_2 \end{pmatrix},

is said to diagonalize the matrix A. That is,

u^T A u = \begin{pmatrix} \lambda^{(1)} & 0 \\ 0 & \lambda^{(2)} \end{pmatrix}.    (1.4)

We can illustrate the whole procedure in IDL as follows:
    8 CHAPTER 1.IMAGES, ARRAYS AND VECTORS IDL a=float([[1,2],[2,3]]) IDL print,a 1.00000 2.00000 2.00000 3.00000 IDL print,eigenql(a,eigenvectors=u,/double) 4.2360680 -0.23606798 IDL print,transpose(u)##a##u 4.2360680 -2.2204460e-016 -1.6653345e-016 -0.23606798 Note that, after diagonalization, the off-diagonal elements are not precisely zero due to rounding errors in the computation. All of the above properties generalize easily to N dimensions. 1.4 Finding minima and maxima In order to maximize some desirable property of a multispectral image, such as signal to noise or spread in intensity, we often need to take derivatives of vectors. A vector (partial) derivative in two dimensions is written ∂ ∂x and is defined as the vector ∂ ∂x = 1 0 ∂ ∂x1 + 0 1 ∂ ∂x2 . Many of the operations with vector derivatives correspond exactly to operations with or- dinary scalar derivatives (They can all be verified easily by writing out the expressions component-by component): ∂ ∂x (x y) = y analogous to ∂ ∂x xy = y ∂ ∂x (x x) = 2x analogous to ∂ ∂x x2 = 2x The scalar expression x Ay, where A is a matrix, is called a quadratic form. We have ∂ ∂x (x Ay) = Ay ∂ ∂y (x Ay) = A x and ∂ ∂x (x Ax) = Ax + A x. Note that, if A is a symmetrix matrix, this last equation can be written ∂ ∂x (x Ax) = 2Ax. Suppose x∗ is a critical point of the function f(x), i.e. d dx f(x∗ ) = d d f(x) x=x∗ = 0, (1.5)
    1.4. FINDING MINIMAAND MAXIMA 9 x∗ x f(x) d dx f(x∗ ) = 0 Figure 1.4: A function of one variable. see Fig. 1.4. Then f(x∗ ) is a local minimum if d2 dx2 f(x∗ ) 0. This becomes obvious if we express f(x) as a Taylor series about x∗ f(x) = f(x∗ ) + (x − x∗ ) d dx f(x∗ ) + (x − x∗ )2 d2 dx2 f(x∗ ) + . . . . For |x − x∗ | sufficiently small this is equivalent to f(x) ≈ f(x∗ ) + (x − x∗ )2 d2 dx2 f(x∗ ). The situation is similar for scalar functions of a vector: f(x) ≈ f(x∗ ) + (x − x∗ ) ∂f(x∗ ) ∂x + 1 2 (x − x∗ ) H(x − x∗ ). (1.6) where H is called the Hessian matrix: (H)ij = ∂2 ∂xi∂xj f(x∗ ). (1.7) In the neighborhood of the critical point, since ∂f(x∗ ) ∂x = 0, we get the approximation f(x) ≈ f(x∗ ) + (x − x∗ ) H(x − x∗ ). Now the condition for a local minimum is that the Hessian matrix be positive definite at the point x∗ . Positive definiteness means that x Hx 0 for all x = 0. (1.8) Suppose we want to find a minimum (or maximum) of a scalar function f(x) of the vector x. If there are no constraints, then we solve the set of equations ∂f(x) ∂xi = 0, i = 1, 2, or, in terms of our notation for vector derivatives, ∂f(x) ∂x = 0 = 0 0 .
    10 CHAPTER 1.IMAGES, ARRAYS AND VECTORS However suppose that x is constrained by the equation g(x) = 0. For example, we might have g(x) = x2 1 + x2 2 − 1 = 0 which constrains x to lie on a circle of radius 1. Finding an minimum of f subject to g = 0 is equivalent to finding an unconstrained minimum of f(x) + λg(x), (1.9) where λ is called a Lagrange multiplier and is treated like an additional variable, see [Mil99]. That is, we solve the set of equations ∂ ∂xi (f(x) + λg(x)) = 0, i = 1, 2 ∂ ∂λ (f(x) + λg(x)) = 0. (1.10) The latter equation is just g(x) = 0. For example, let f(x) = ax2 1 + bx2 2 and g(x) = x1 + x2 − 1. Then we get the three equations ∂ ∂x1 (f(x) + λg(x)) = 2ax1 + λ = 0 ∂ ∂x2 (f(x) + λg(x)) = 2bx2 + λ = 0 ∂ ∂λ (f(x) + λg(x)) = x1 + x2 − 1 = 0 The solution is x1 = b a + b , x2 = a a + b .
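This result is easily checked numerically in IDL. The following sketch (with the arbitrary choices a = 2 and b = 3) simply evaluates f along the constraint line x1 + x2 = 1 and locates the minimum:

a = 2.0  &  b = 3.0
x1 = findgen(1001)/1000.0          ; points satisfying the constraint x1 + x2 = 1
x2 = 1.0 - x1
f = a*x1^2 + b*x2^2
fmin = min(f, imin)
print, x1[imin], x2[imin]          ; compare with b/(a+b) = 0.6 and a/(a+b) = 0.4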
    1.4. FINDING MINIMAAND MAXIMA 11 Exercises 1. Show that the outer product of two 2-dimensional vectors is a singular matrix. 2. Prove that the eigenvectors or a 2 × 2 symmetric matrix are orthogonal. 3. Differentiate the function 1 (x · a · y) with respect to y. 4. Verify the following matrix identity in IDL: (A · B) = B · A . 5. Calculate the eigenvalues and eigenvectors of a non-symmetric matrix with IDL. 6. Plot the function f(x) = x2 1 − x2 2 with IDL. Find its minima and maxima subject to the constraint g(x) = x2 1 + x2 2 − 1 = 0.
    Chapter 2 Image Statistics Itis useful to think of image pixel intensities g(x) as realizations of a random vector G(x) drawn independently from some probability distribution. 2.1 Random variables A random variable can be used to represent some quantity which changes in an unpredictable way each time it is observed. If there is a discrete set of M possible events {Ei}, i = 1 . . . M, associated with some random process, let pi be the probability that the ith event Ei will occur. If ni represents the number of times Ei occurs in n trials, we expect that pi → ni/n in the limit n → ∞ and that M i=1 pi = 1. For example, on the throw of a pair of dice, {Ei} = (1, 1), (1, 2), (2, 1) . . . (6, 6) and each event is equally probable pi = 1/36, i = 1 . . . 36. Formally, a random variable X is a real function on the set of possible events: X = f(Ei). If, for example, X is the sum of the points on the dice, X = f(E1) = 2, X = f(E2) = 3, X = f(E3) = 3, . . . X = f(E36) = 12. On the basis of the probabilities of the individual events, we can associate a distribution function P(x) with the random variable X, defined by P(x) = Pr(X ≤ x). For the dice example, P(1) = 0, P(2) = 1/36, P(3) = 1/12, . . . P(12) = 1. 13
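The distribution function for the dice example can be checked with a short Monte Carlo simulation in IDL (a sketch; the number of trials is arbitrary):

n = 100000L
d1 = fix(6*randomu(seed,n)) + 1     ; first die
d2 = fix(6*randomu(seed,n)) + 1     ; second die
x = d1 + d2
print, total(x le 3)/n              ; estimate of P(3), should approach 1/12 = 0.0833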
    14 CHAPTER 2.IMAGE STATISTICS For continuous random variables, such as the measured radiance at a satellite sensor, the distribution function is not expressed in terms of discrete probabilities, but rather in terms of a probability density function p(x), where p(x)dx is the probability that the value of the random variable X lies in the interval [x, x + dx]. Then P(x) = Pr(X ≤ x) = x −∞ p(t)dt and, of course, P(−∞) = 0, P(∞) = 1. Two random variables X and Y are said to be independent when Pr(X ≤ x and Y ≤ y) = Pr(X ≤ x, Y ≤ y) = P(x)P(y). The mean or expected value of a random variable X is written X and is defined in terms of the probability density function: X = ∞ −∞ xp(x)dx. The variance of X, written var(X) is defined as the expected value of the random variable (X − X )2 , i.e. var(X) = (X − X )2 . In terms of the probability density function, it is given by var(X) = ∞ −∞ (x − X )2 p(x)dx. Two simple but very useful identities follow from the definition of variance: var(X) = X2 − X 2 var(aX) = a2 var(X). (2.1) 2.2 The normal distribution It is often the case that random variables are well-described by the normal or Gaussian probability density function p(x) = 1 √ 2πσ exp(− 1 2σ2 (x − µ)2 ). In that case X = µ, var(X) = σ2 . The expected value of pixel intensities G(x) =     G1(x) G2(x) ... GN (x)     ,
    2.2. THE NORMALDISTRIBUTION 15 where x denotes the pixel coordinates, i.e. x = (i, j), is estimated by averaging over all of the pixels in the image, G(x) ≈ 1 cr c,r i,j=1 g(i, j), referred to as the sample mean vector. It is usually assumed to be independent of x, i.e. G(x) = G . The covariance between bands k and is defined according to cov(Gk, G ) = (Gk − Gk )(G − G ) and is estimated again by averaging over the pixels: cov(Gk, G ) ≈ 1 cr c,r i,j=1 (gk(i, j) − Gk )(g (i, j) − G ), which is called the sample covariance. The covariance is also usually assumed to be inde- pendent of x. The variance for bands k is given by var(Gk) = cov(Gk, Gk) = (Gk − Gk )2 . The random vector G is often assumed to be described by a multivariate normal proba- bility density function p(g), given by p(g) = 1 (2π)N/2 |Σ| exp − 1 2 (g − µ) Σ−1 (g − µ) . We indicate this by writing G ∼ N(µ, Σ). The distribution function of the multi-spectral pixels is then completely determined by the expected value G = µ and by the covariance matrix Σ. In two dimensions, for example, Σ = var(G1) cov(G1, G2) cov(G2, G1) var(G2) = σ2 1 σ12 σ21 σ2 2 . Note that, since cov(Gk, G ) = cov(G , Gk), the covariance matrix is symmetric, Σ = Σ . The covariance matrix can also be written as an outer product: Σ = (G − G )(G − G ) . as can its estimated value: Σ ≈ 1 cr c,r i,j=1 (g(i, j) − G )(g(i, j) − G ) . If G = 0, we can write simply Σ = GG . Another useful identity applies to any linear combination a G of the random vector G, namely var(a G) = a Σa. (2.2)
    16 CHAPTER 2.IMAGE STATISTICS This is obvious in two dimensions, since we have var(a G) = cov(a1G1 + a2G2, a1G1 + a2G2) = a2 1var(G1) + a1a2cov(G1, G2) + a1a2cov(G2, G1) + a2 2var(G2) = (a1, a2) var(G1) cov(G1, G2) cov(G2, G1) var(G2) a1 a2 . Variance is always nonnegative and the vector a in (2.2) is arbitrary, so we have a Σa ≥ 0 for all a. The covariance matrix is therefore said to be positive semi-definite. The correlation matrix C is similar to the covariance matrix, except that each matrix element (i, j) is normalized to var(Gi)var(Gj). In two dimensions C = 1 ρ12 ρ21 1 =   1 cov(G1,G2) √ var(G1)var(G2) cov(G2,G1) √ var(G1)var(G2) 1   = 1 σ12 σ1σ2 σ21 σ1σ2 1 . The following ENVI/IDL program calculates and prints out the covariance matrix of a multispectral image: envi_select, title=’Choose multispectral image’,fid=fid,dims=dims,pos=pos if (fid eq -1) then return num_cols = dims[2]-dims[1]+1 num_rows = dims[4]-dims[3]+1 num_pixels = (num_cols*num_rows) num_bands = n_elements(pos) samples=intarr(num_bands,n_elements(num_pixels)) for i=0,num_bands-1 do samples[i,*]=envi_get_data(fid=fid,dims=dims,pos=pos[i]) print, correlate(samples,/covariance,/double) end ENVI .GO 111.46663 82.123236 159.58377 133.80637 82.123236 64.532431 124.84815 104.45298 159.58377 124.84815 246.18004 205.63420 133.80637 104.45298 205.63420 192.70367 2.3 A special function If n is an integer, the factorial of n is defined by n! = n(n − 1) · · · 1, 1! = 0! = 1. The generalization of this to non-integers z is the gamma function Γ(z) = ∞ 0 tz−1 e−t dt. It has the property Γ(z + 1) = zΓ(z).
The factorial is a special case, i.e. for integer n,

\Gamma(n + 1) = n!.

A further generalization is the incomplete gamma function

\Gamma_P(a, x) = \frac{1}{\Gamma(a)} \int_0^x t^{a-1} e^{-t}\, dt.

It has the properties Γ_P(a, 0) = 0 and Γ_P(a, ∞) = 1. Here is a plot of Γ_P for a = 3 in IDL:

x = findgen(100)/10
envi_plot_data, x, igamma(3,x)

Figure 2.1: The incomplete gamma function.

We are interested in this function for the following reason. Suppose that the random variables X_i, i = 1 . . . n, are independent and normally distributed with zero mean and variance σ_i². Then the random variable

Z = \sum_{i=1}^{n} \left( \frac{X_i}{\sigma_i} \right)^2

has the distribution function

P(z) = \Pr(Z \le z) = \Gamma_P(n/2, z/2),

and is said to be chi-square distributed with n degrees of freedom.

2.4 Conditional probabilities and Bayes Theorem

If A and B are two events such that the probability of A and B occurring simultaneously is P(A, B), then the conditional probability of A occurring given that B has occurred is

P(A | B) = \frac{P(A, B)}{P(B)}.
    18 CHAPTER 2.IMAGE STATISTICS Bayes’ Theorem (named after Rev. Thomas Bayes, an 18th century mathematician who derived a special case) is the basic starting point for inference problems using probability theory as logic. We will use it in the following form. Let X be a random variable describing a pixel intensity, and let {Ck | k = 1 . . . M} be a set of possible classes for the pixels. Then the a posteriori conditional probability for class Ck, given the measured pixel intensity x is P(Ck|x) = P(x|Ck)P(Ck) P(x) , (2.3) where P(Ck) is the prior-probability for class Ck, P(x|Ck) is the conditional probability of observing the value x, if it belongs to class Ck, P(x) = M k=1 p(x|Ck)p(Ck) is the total probability for x. 2.5 Linear regression Applying radiometric corrections to digital images often involves fitting a set of m data points (xi, yi) to a straight line: y(x) = a + bx + . Suppose that the measurements yi include a random error with variance σ2 and that the measurements xi are exact. Define a “goodness of fit” function χ2 (a, b) = m i=1 yi − a − bxi σ 2 . (2.4) If the random variable is normally distributed, then we obtain the most likely (i.e. best) values for a and b by minimizing this function, that is, by solving the equations ∂χ2 ∂a = ∂χ2 ∂b = 0. The solution is ˆb = sxy s2 xx , ˆa = ¯y − ˆb¯x, (2.5) where sxy = 1 m m i=1 (xi − ¯x)(yi − ¯y) s2 xx = 1 m m i=1 (xi − ¯x)2 ¯x = 1 m m i=1 xi, ¯y = 1 m m i=1 yi. The uncertainties in the estimates ˆa and ˆb are given by σ2 a = σ2 x2 i m x2 i − ( xi)2 σ2 b = σ2 m m x2 i − ( xi)2 . (2.6)
If σ² is not known a priori, then it can be estimated by

\hat\sigma^2 = \frac{1}{m-2} \sum_{i=1}^{m} (y_i - \hat a - \hat b x_i)^2.

Generalized and orthogonal least squares methods are described in Appendix A. A recursive procedure is described in Appendix C.
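The estimates (2.5) are easily reproduced on synthetic data in IDL, with the built-in LINFIT function serving as a cross-check (a sketch; the chosen intercept, slope and noise level are arbitrary):

m = 50
x = findgen(m)
y = 1.0 + 0.5*x + randomn(seed,m)            ; a = 1, b = 0.5, unit-variance noise
sxy = total((x - mean(x))*(y - mean(y)))/m
sxx = total((x - mean(x))^2)/m
b_hat = sxy/sxx
a_hat = mean(y) - b_hat*mean(x)
print, a_hat, b_hat
print, linfit(x,y)                           ; should agree with the estimates above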
    20 CHAPTER 2.IMAGE STATISTICS Exercises 1. Write the multivariate normal probability density function p(g) for the case Σ = σ2 I. Show that probability density function for a one-dimensional random variable G is a special case. Prove that G = µ. 2. In the Monty Hall game a contestant is asked to choose between one of three doors. Behind one of the doors is an automobile as prize for choosing the correct door. After the contestant has chosen, Monty Hall opens one of the other two doors to show that the automobile is not there. He then asks the contestant if she wishes to change her mind and choose the other unopened door. Use Bayes’ theorem to prove that her correct answer is “yes”. 3. Derive the uncertainty for a in (2.6) from the formula for error propagation σ2 a = N i=1 σ2 ∂f ∂yi 2 .
    Chapter 3 Transformations Up untilnow we have thought of multispectral images as (r × c × N)-dimensional arrays of measured pixel intensities. In the present chapter we consider other representations of images which are often useful in image analysis. 3.1 Fourier transforms Figure 3.1: Fourier series approximation of a sawtooth function. The series was truncated at k = ±4. The left hand side shows the intensities |ˆx(k)|2 . A periodic function x(t) with period T, x(t) = x(t + T) can always be expressed as the infinite Fourier series x(t) = ∞ k=−∞ ˆx(k)ei2π(kf)t , (3.1) where f = 1/T = ω/2π and eix = cos x + i sin x. From the orthogonality of the e-functions, the coefficients ˆx(k) in the expansion are given by ˆx(k) = f 1/2f −1/2f x(t)e−i2π(kf)t dt. (3.2) 21
    22 CHAPTER 3.TRANSFORMATIONS Figure 3.1 shows an example for the sawtooth function with period T = 1: x(t) = t, −1/2 ≤ t 1/2. Parseval’s formula follows directly from (3.2) k |ˆx(k)|2 = f 1/2f −1/2f (x(t))2 dt. 3.1.1 Discrete Fourier transform Let g(j) be a discrete sample of the real function g(x) (a row of pixels), sampled c times at the sampling interval ∆ over a complete period T, i.e. g(j) = g(x = j∆), j = 0 . . . c − 1. The corresponding discrete Fourier series is written g(j) = 1 c c/2 k=−c/2 ˆg(k)ei2π(kf)(j∆) , j = 0 . . . c − 1, (3.3) where the truncation frequency ±c 2 f is the highest frequency component that can be deter- mined by the sampling. This frequency is called the Nyquist critical frequency and is given by 1/2∆, so that f is determined by cf 2 = 1 2∆ or f = 1 c∆ . (This corresponds to sampling over one complete period: c∆ = T.) Thus (3.3) becomes g(j) = 1 c c/2 k=−c/2 ˆg(k)ei2πkj/c , j = 0 . . . c − 1. With the observation ei2π(−c/2)j/c = e−iπj = (−1)c = eiπj = ei2π(c/2)j/c , we can write this as g(j) = 1 c c/2−1 k=−c/2 ˆg(k)ei2πkj/c , j = 0 . . . c − 1, a set of c equations in the c unknown frequency components ˆg(k). Equivalently, g(j) = 1 c c/2−1 k=0 ˆg(k)eπ2πkj/c + 1 c −1 k=−c/2 ˆg(k)ei2πkj/c = 1 c c/2−1 k=0 ˆg(k)ei2πkj/c + 1 c c−1 k =c/2 X(k − c)ei2π(k −c)j/c = 1 c c/2−1 k=0 ˆg(k)ei2πkj/c + 1 c c−1 k=c/2 ˆg(k − c)ei2πkj/c .
    3.2. WAVELETS 23 Thuswe can write g(j) = 1 c c−1 k=0 ˆg(k)ei2πkj/c , j = 0 . . . c − 1, (3.4) if we interpret ˆg(k) → ˆg(k − c) when k ≥ c/2. The solution to (3.4) for the complex frequency components ˆg(k) is called the discrete Fourier transform and is given by ˆg(k) = c−1 j=0 g(j)e−i2πkj/c , k = 0 . . . c − 1. (3.5) This follows from the following orthogonality property: c−1 j=0 ei2π(k−k )j/c = cδk,k . (3.6) Eq. (3.4) itself is the discrete inverse Fourier transform. The discrete analog of Parsival’s formula is c−1 k=0 |ˆg(k)|2 = 1 c c−1 j=0 g(j)2 . (3.7) Determining the frequency components in (3.5) would appear to involve, in all, c2 floating point multiplication operations. The fast Fourier transform (FFT) exploits the structure of the complex e-functions to reduce this to order c log c, see for example [PFTV86]. 3.1.2 Discrete Fourier transform of an image The discrete Fourier transform is easily generalized to two dimensions for the purpose of image analysis. Let g(i, j), i, j = 0 . . . c − 1, represent a (quadratic) gray scale image. Its discrete Fourier transform is ˆg(k, ) = c−1 i=0 c−1 j=0 g(i, j)e−i2π(ik+j )/c (3.8) and the corresponding inverse transform is g(i, j) = 1 c2 c−1 k=0 c−1 =0 ˆg(k, )ei2π(ik+j )/c . (3.9) 3.2 Wavelets Unlike the Fourier transform, which represents a signal (array of pixel intensities) in terms of pure frequency functions, the wavelet transform expresses the signal in terms of functions which are restricted both in terms of frequency and spatial extent. In many applications, this turns out to be particularly efficient and useful. We’ll see an example of this in Chapter 7, where we discuss image fusion in more detail. The wavelet transform is discussed in Appendix B.
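Returning to the discrete Fourier transform of Section 3.1.1, the transform and the Parseval relation can be checked numerically with IDL's FFT function. Note that IDL's forward FFT includes a factor 1/c, which is compensated below; the test sequence is an arbitrary choice:

g = [3.0, 1.0, 4.0, 1.0]
ghat = 4*fft(g)                           ; the frequency components of (3.5)
print, ghat
print, total(abs(ghat)^2), 4*total(g^2)   ; the two sides of the discrete Parseval relation agree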
    24 CHAPTER 3.TRANSFORMATIONS 3.3 Principal components The principal components transformation forms linear combinations of multispectral pixel intensities which are mutually uncorrelated and which have maximum variance. We assume without loss of generality that G = 0, so that the covariance matrix of a multispectral image is is Σ = GG , and look for a linear combination Y = a G with maximum variance, subject to the normalization condition a a = 1. Since the covariance of Y is a Σa, this is equivalent to maximizing an unconstrained Lagrange function, see Section 1.4, L = a Σa − 2λ(a a − 1). The maximum of L occurs at that value of a for which ∂L ∂a = 0. Recalling the rules for vector differentiation, ∂L ∂a = 2Σa − 2λa = 0 which is the eigenvalue problem Σa = λa. Since Σ is real and symmetric, the eigenvectors are orthogonal (and normalized). Denote them a1 . . . aN for eigenvalues λ1 ≥ . . . ≥ λN . Define the matrix A = (a1 . . . aN ), AA = I, and let the the transformed principal component vector be Y = A G with covariance matrix Σ . Then we have Σ = YY = A GG A = A ΣA = Diag(λ1 . . . λN ) =     λ1 0 · · · 0 0 λ2 · · · 0 ... ... ... ... 0 0 · · · λN     =: Λ. The fraction of the total variance in the original multispectral image which is described by the first i principal components is λ1 + . . . + λi λ1 + . . . + λi + . . . + λN . If the original multispectral channels are highly correlated, as is usually the case, the first few principal components will account for a very high percentage of the variance the image. For example, a color composite of the first 3 principal components of a LANDSAT TM scene displays essentially all of the information contained in the 6 spectral components in one single image. Nevertheless, because of the approximation involved in the assumption of a normal distribution, higher order principal components may also contain significant information [JRR99]. The principal components transformation can be performed directly from the ENVI main menu. However the following IDL program illustrates the procedure in detail: ; Principal components analysis envi_select, title=’Choose multispectral image’, $
    3.4. MINIMUM NOISEFRACTION 25 fid=fid, dims=dims,pos=pos if (fid eq -1) then return num_cols = dims[2]+1 num_lines = dims[4]+1 num_pixels = (num_cols*num_lines) num_channels = n_elements(pos) image=intarr(num_channels,num_pixels) for i=0,num_channels-1 do begin temp=envi_get_data(fid=fid,dims=dims,pos=pos[i]) m = mean(temp) image[i,*]=temp-m endfor ; calculate the transformation matrix A sigma = correlate(image,/covariance,/double) lambda = eigenql(sigma,eigenvectors=A,/double) print,’Covariance matrix’ print, sigma print,’Eigenvalues’ print, lambda print,’Eigenvectors’ print, A ; transform the image image = image##transpose(A) ; reform to BSQ format PC_array = bytarr(num_cols,num_lines,num_channels) for i = 0,num_channels-1 do PC_array[*,*,i] = $ reform(image[i,*],num_cols,num_lines,/overwrite) ; output the result to memory envi_enter_data, PC_array end 3.4 Minimum noise fraction Principal components analysis maximizes variance. This doesn’t always lead to images of decreasing image quality (i.e. of increasing noise). The MNF transformation minimizes the noise content rather than maximizing variance, so, if this is the desired criterion, it is to be preferred over PCA. Suppose we can represent a gray scale image G with covariance matrix Σ and zero mean as a sum of uncorrelated signal and noise noise components G = S + N,
    26 CHAPTER 3.TRANSFORMATIONS both normally distributed, with covariance matrices ΣS and ΣN and zero mean. Then we have Σ = GG = (S + N)(S + N) = SS + NN , since noise and signal are uncorrelated, i.e. SN = NS = 0. Thus Σ = ΣS + ΣN . (3.10) Now let us seek a linear combination a G for which the signal to noise ratio SNR = var(a S) var(a N) = a ΣSa a ΣN a is maximized. From (3.10) we can write this in the form SNR = a Σa a ΣN a − 1. (3.11) Differentiating we get ∂ ∂a SNR = 1 a ΣN a 1 2 Σa − a Σa (a ΣN a)2 1 2 ΣN a = 0, or, equivalently, (a ΣN a)Σa = (a Σa)ΣN a . This condition is met when a solves the generalized eigenvalue problem ΣN a = λΣa. (3.12) Both ΣN and Σ are symmetric and the latter is also positive definite. Its Cholesky factor- ization is Σ = LL , where L is a lower triangular matrix, and can be thought of as the “square root” of Σ. Such an L always exists is Σ is positive definite. With this, we can write (3.12) as ΣN a = λLL a or, equivalently, L−1 ΣN (L )−1 L a = λL a or, with b = L a and commutivity of inverse and transpose, [L−1 ΣN (L−1 ) ]b = λb, a standard eigenproblem for a real, symmetric matrix L−1 ΣN (L−1 ) . From (3.11) we see that the SNR for eigenvalue λi is just SNRi = ai Σai ai (λiΣai) − 1 = 1 λi − 1. Thus the eigenvector ai corresponding to the smallest eigenvalue λi will maximize the signal to noise ratio. Note that (3.12) can be written in the form ΣN A = ΣAΛ, (3.13)
    3.4. MINIMUM NOISEFRACTION 27 where A = (a1 . . . aN ) and Λ = Diag(λ1 . . . λN ). The MNF transformation is available in the ENVI environment. It is carried out in two steps which are equivalent to the above. First of all the noise contribution to G is “whitened”, i.e. the random vector N has covariance matrix I, the identity matrix. Since ΣN can be assumed to be diagonal anyway (the noise in any band is uncorrelated with the noise in any other band), we accomplish this by doing a transformation which divides the components of G by the standard deviations of the noise, X = Σ −1/2 N G, where Σ −1/2 N ΣN Σ −1/2 N = I. The transformed random vector X thus has covariance matrix ΣX = Σ −1/2 N ΣΣ −1/2 N . (3.14) Next we do an ordinary principal components transformation on X, i.e. Y = B X where B ΣX B = ΛX, B B = I. (3.15) The overall transformation is thus Y = B Σ −1/2 N G = A G where A = Σ −1/2 N B is not an orthogonal transformation. To see that this transformation is equivalent to solving the generalized eigenvalue problem, consider ΣN A = ΣN Σ −1/2 N B = Σ 1/2 N ΣX BΛ−1 X = Σ 1/2 N Σ −1/2 N ΣΣ −1/2 N BΛ−1 X = ΣAΛ−1 X . This is equivalent to (3.13) with λXi = 1 λi = SNRi + 1. Thus an eigenvalue in the second transformation equal to one corresponds to “pure noise”. Before the transformation can be performed, it is of course necessary to estimate the noise covariance matrix ΣN . This can be done for example by differencing with respect to the local mean: (ΣN )k ≈ 1 cr c,r i,j (gk(i, j) − mk(i, j))(g (i, j) − m (i, j)) where mk(i, j) is the local mean of pixels in some neighborhood of (i, j).
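A rough sketch of this noise estimate in IDL, differencing each band against a 3 × 3 local mean, might look as follows (here cube stands for an assumed num_cols × num_rows × num_bands BSQ array already in memory; this is an illustration, not the ENVI implementation):

sz = size(cube, /dimensions)
num_bands = sz[2]
noise = fltarr(num_bands, sz[0]*sz[1])
for k = 0, num_bands-1 do begin
   band = float(cube[*,*,k])
   noise[k,*] = reform(band - smooth(band,3,/edge_truncate), sz[0]*sz[1])
endfor
sigma_n = correlate(noise, /covariance, /double)
print, sigma_n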
    28 CHAPTER 3.TRANSFORMATIONS 3.5 Maximum autocorrelation factor (MAF) Let x represent the coordinates of a pixel within image G, i.e. x = (i, j). We consider the covariance matrix Γ between the original image, represented by G(x), and the same image G(x + ∆) shifted by an amount ∆ = (∆x, ∆y) : Γ(∆) = G(x)G(x + ∆) , assumed to be independent of x. Then Γ(0) = Σ, and furthermore Γ(−∆) = G(x)G(x − ∆) = G(x + ∆)G(x) = (G(x)G(x + ∆) ) = Γ(∆) . Now we consider the covariance of projections of the original and shifted images: cov(a G(x), a G(x + ∆)) = a G(x)G(x + ∆) a = a Γ(∆)a = a Γ(−∆)a = 1 2 a (Γ(∆) + Γ(−∆))a. (3.16) Define Σ∆ as the covariance matrix of the difference image G(x) − G(x + ∆), i.e. Σ∆ = (G(x) − G(x + ∆))(G(x) − G(x + ∆) = G(x)G(x) + G(x + ∆)G(x + ∆) − G(x)G(x + ∆) − G(x + ∆)G(x) = 2Σ − Γ(∆) − Γ(−∆). Hence Γ(∆) + Γ(−∆) = 2Σ − Σ∆ and we can write (3.16) in the form cov(a G(x), a G(x + ∆)) = a Σa − 1 2 a Σ∆a. The correlation of the projections is therefore given by corr(a G(x), a G(x + ∆)) = a Σa − 1 2 a Σ∆a var(a G(x))var(a G(x + ∆)) = a Σa − 1 2 a Σ∆a (a Σa)(a Σa) = 1 − 1 2 a Σ∆a a Σa . (3.17) We want to determine that vector a which extremalizes this correlation, so we wish to extremalize the function R(a) = a Σ∆a a Σa .
    3.5. MAXIMUM AUTOCORRELATIONFACTOR (MAF) 29 Differentiating, ∂R ∂a = 1 a Σa 1 2 Σ∆a − a Σ∆a (a Σa)2 1 2 Σa = 0 or (a Σa)Σ∆a = (a Σ∆a)Σa. This condition is met when a solves the generalized eigenvalue problem Σ∆a = λΣa, (3.18) which is seen to have the same form as (3.12). Again both Σ∆ and Σ are symmetric and the latter is also positive definite and we obtain the standard eigenproblem [L−1 Σ∆(L−1 ) ]b = λb, for the real, symmetric matrix L−1 Σ∆(L−1 ) . Let the eigenvalues be λ1 ≥ . . . λN and the corresponding (orthogonal) eigenvectors be bi. We have 0 = bi bj = ai LL aj = ai Σaj, i = j, and therefore cov(ai G(x), aj G(x)) = ai Σaj = 0, i = j, so that the MAF components are orthogonal (uncorrelated). Moreover with equation (2.14) we have corr(ai G(x), ai G(x + ∆)) = 1 − 1 2 λi, and the first MAF component has minimum autocorrelation. An ENVI plug-in for performing the MAF transformation is given in Ap- pendix D.5.2.
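The generalized eigenvalue problem (3.18) can also be solved in a few lines of IDL. The following rough sketch assumes that the covariance matrices sigma (for Σ) and sigma_d (for Σ∆) have already been estimated, and it uses a symmetric eigen-decomposition of Σ for the whitening step rather than the Cholesky factorization described above:

; whitening matrix Sigma^(-1/2) from the eigen-decomposition of Sigma
lam  = eigenql(sigma, eigenvectors=u, /double)
isqr = u ## diag_matrix(1d/sqrt(lam)) ## transpose(u)
; standard symmetric eigenproblem for the whitened difference covariance
m    = isqr ## sigma_d ## isqr
lam_d = eigenql(m, eigenvectors=b, /double)
a = isqr ## b          ; generalized eigenvectors, arranged as EIGENQL returns them
print, lam_d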
    30 CHAPTER 3.TRANSFORMATIONS Exercises 1. Show that, for x(t) = sin(2πt) in Eq. (2.2), ˆx(−1) = − 1 2i , ˆx(1) = 1 2i , and ˆx(k) = 0 otherwise. 2. Calculate the discrete Fourier transform of the sequence 2, 4, 6, 8 from (3.4). You have to solve four simultaneous equations, the first of which is 2 = 1 4 ˆg(0) + ˆg(1) + ˆg(2) + ˆg(3) . Verify your result in IDL with the command print, FFT([2,4,6,8])
    Chapter 4 Radiometric enhancement 4.1Lookup tables Figure 4.1: Contrast enhancement with a lookup table represented as the continuous function f(x) [JRR99]. Intensity enhancement of an image is easily accomplished by means of lookup tables. For byte-encoded data, the pixel intensities g are used to index an array LUT[k], k = 0 . . . 255, the entries of which also lie between 0 and 255. These entries can be chosen to implement linear stretching, saturation, histogram equalization, etc. according to ˆgk(i, j) = LUT[gk(i, j)], 0 ≤ i ≤ r − 1, 0 ≤ j ≤ c − 1. 31
    32 CHAPTER 4.RADIOMETRIC ENHANCEMENT It is also useful to think of the the lookup table as an approximately continuous function y = f(x). If hin(x) is the histogram of the original image and hout(y) is the histogram of the image after transformation through the lookup table, then, since the number of pixels is constant, hout(y) dy = hin(x) dx, see Fig.4.1 4.1.1 Histogram equalization For histogram equalization, we want hout(y) to be constant independent of y. Hence dy ∼ hin(x) dx and y = f(x) ∼ x 0 hin(t)dt. The lookup table y for histogram equalization is thus proportional to the cumulative sum of the original histogram. 4.1.2 Histogram matching Figure 4.2: Steps required for histogram matching [JRR99]. It is often desirable to match the histogram of one image to that of another so as to make their apparent brightnesses as similar as possible, for example when the two images
    4.2. CONVOLUTIONS 33 arecombined in a mosaic. We can do this by first equalizing both the input histogram hin(x) and the reference histogram href (y) with the cumulative lookup tables z = f(x) and z = g(y), respectively. The required lookup table is then y = g−1 (z) = g−1 (f(x)). The necessary steps for implementing this function are illustrated in Fig. 1.5 taken from [JRR99]. 4.2 Convolutions With the convention ω = 2πk/c we can write (3.5) in the form ˆg(ω) = c−1 j=0 g(j)e−iωj . (4.1) The convolution of g with a filter h = (h(0), h(1), . . .) is defined by f(j) = k h(k)g(j − k) =: h ∗ g, (4.2) where the sum is over all nonzero elements of the filter h. If the number of nonzero elements is finite, we speak of a finite impulse response filter (FIR). Theorem 1 (Convolution theorem) In the frequency domain, convolution is replaced by multiplication: ˆf(ω) = ˆh(ω)ˆg(ω). Proof: ˆf(ω) = j f(j)e−iωj = j,k h(k)g(j − k)e−iωj ˆh(ω)ˆg(ω) = k h(k)e−iωk g( )e−iω = k, h(k)g( )e−iω(k+ ) = k,j h(k)g(j − k)e−iωj = ˆf(ω). This can of course be generalized to two dimensional images, so that there are three basic steps involved in image filtering: 1. The image and the convolution filter are transformed from the spatial domain to the frequency domain using the FFT. 2. The transformed image is multiplied with the frequency filter. 3. The filtered image is transformed back to the spatial domain.
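The convolution theorem can be verified directly for the simple low-pass filter h = (1/2, 1/2, 0, . . .) discussed below. The following one-dimensional IDL sketch uses a synthetic signal; since IDL's forward FFT includes a factor 1/c, that factor is compensated when the two transforms are multiplied:

c = 64
g = randomn(seed, c)                      ; a synthetic row of pixels
h = fltarr(c)  &  h[0:1] = 0.5            ; zero-padded filter kernel
f_spatial = 0.5*g + 0.5*shift(g,1)        ; direct (circular) convolution
f_fourier = float(fft( c*fft(g)*fft(h), /inverse ))
print, max(abs(f_spatial - f_fourier))    ; should be near zero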
    34 CHAPTER 4.RADIOMETRIC ENHANCEMENT We often distinguish between low-pass and high-pass filters. Low pass filters perform some sort of averaging. The simplest example is h = (1/2, 1/2, 0 . . .), which computes the average of two consecutive pixels. A high-pass filter computes differences of nearby pixels, e.g. h = (1/2, −1/2, 0 . . .). Figure 4.3 shows the Fourier transforms of these two simple filters generated by the the IDL program ; Hi-Lo pass filters x = fltarr(64) x[0]=0.5 x[1]=-0.5 p1 =abs(FFT(x)) x[1]=0.5 p2 =abs(FFT(x)) envi_plot_data,lindgen(64),[[p1],[p2]] end Figure 4.3: Low-pass(red) and high-pass (white) filters in the frequency domain. The quan- tity |ˆh(k)|2 is plotted as a function of k. The highest frequency is at the center of the plots, k = c/2 = 32 . 4.2.1 Laplacian of Gaussian filter We shall illustrate image filtering with the so-called Laplacian of Gaussian (LoG) filter, which will be used in Chapter 6 to implement contour matching for automatic determination of ground control points. To begin with, consider the gradient operator for a two-dimensional image: = ∂ ∂x = i ∂ ∂x1 + j ∂ ∂x2 ,
    4.2. CONVOLUTIONS 35 wherei and j are unit vectors in the vertical and horizontal directions, respectively. g(x) is a vector in the direction of the maximum rate of change of gray scale intensity. Since the intensity values are discrete, the partial derivatives must be approximated. For example we can use the Sobel operators: ∂g(x) ∂x1 ≈ [g(i − 1, j − 1) + 2g(i, j − 1) + g(i + 1, j − 1)] − [g(i − 1, j + 1) + 2g(i, j + 1) + g(i + 1, j + 1)] = 2(i, j) ∂g(x) ∂x2 ≈ [g(i − 1, j − 1) + 2g(i − 1, j) + g(i − 1, j + 1)] − [g(i + 1, j − 1) + 2g(i + 1, j) + g(i + 1, j + 1)] = 1(i, j) which are equivalent to the two-dimensional FIR filters h1 = −1 0 1 −2 0 2 −1 0 1 and h2 = 1 2 1 0 0 0 −1 −2 −1 , respectively. The magnitude of the gradient is | | = 2 1 + 2 2. Edge detection can be achieved by calculating the filtered image f(i, j) = | |(i, j) and setting an appropriate threshold. Figure 4.4: Laplacian of Gaussian filter.
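A minimal IDL sketch of this edge detection procedure using CONVOL is given below; the test image generated with DIST and the threshold are arbitrary choices, and IDL also provides a built-in SOBEL function for the same purpose:

image = dist(256)                              ; a synthetic test image
h1 = [[-1,0,1],[-2,0,2],[-1,0,1]]              ; approximates the derivative in one direction
h2 = transpose(h1)                             ; and in the other
g1 = convol(float(image), h1, /edge_truncate)
g2 = convol(float(image), h2, /edge_truncate)
mag = sqrt(g1^2 + g2^2)                        ; gradient magnitude
edges = mag gt 50.0                            ; arbitrary threshold
tvscl, edges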
    36 CHAPTER 4.RADIOMETRIC ENHANCEMENT Now consider the second derivatives of the image intensities, which can be represented formally by the Laplacian 2 = · = ∂2 ∂x2 1 + ∂2 ∂x2 2 . 2 g(x) is a scalar quantity which is zero whenever the gradient is maximum. Therefore changes in intensity from dark to light or vice versa correspond to sign changes in the Laplacian and these can also be used for edge detection. The Laplacian can also be ap- proximated by a FIR filter, however such filters tend to be very sensitive to image noise. Usually a low-pass Gauss filter is first used to smooth the image before the Laplacian filter is applied. It is more efficient, however, to calculate the Laplacian of the Gauss function itself and then use the resulting function to derive a high-pass filter. The Gauss function in two dimensions is given by 1 2πσ2 exp − 1 2σ2 (x2 1 + x2 2), where the parameter σ determines its extent. Its Laplacian is 1 2πσ6 (x2 1 + x2 2 − 2σ2 ) exp − 1 2σ2 (x2 1 + x2 2) a plot of which is shown in Fig. 4.4. The following program illustrates the application of the filter to a gray scale image, see Fig. 4.5: pro LoG sigma = 2.0 filter = fltarr(17,17) for i=0L,16 do for j=0L,16 do $ filter[i,j] = (1/(2*!pi*sigma^6))*((i-8)^2+(j-8)^2-2*sigma^2) $ *exp(-((i-8)^2+(j-8)^2)/(2*sigma^2)) ; output as EPS file thisDevice =!D.Name set_plot, ’PS’ Device, Filename=’c:tempLoG.eps’,xsize=4,ysize=4,/inches,/Encapsulated shade_surf,filter device,/close_file set_plot, thisDevice ; read a jpg image filename = Dialog_Pickfile(Filter=’*.jpg’,/Read) OK = Query_JPEG(filename,fileinfo) if not OK then return xsize = fileinfo.dimensions[0] ysize = fileinfo.dimensions[1] window,11,xsize=xsize,ysize=ysize Read_JPEG,filename,image1 image = bytarr(xsize,ysize)
    4.2. CONVOLUTIONS 37 image[*,*]= image1[0,*,*] tvscl,image ; run the filter filt = image*0.0 filt[0:16,0:16]=filter[*,*] image1= float(fft(fft(image)*fft(filt),1)) ; get zero-crossings and display image2 = bytarr(xsize,ysize) indices = where( (image1*shift(image1,1,0) lt 0) or (image1*shift(image1,0,1) lt 0) ) image2[indices]=255 wset, 11 tv, image2 end Figure 4.5: Image filtered with the Laplacian of Gaussian filter.
    Chapter 5 Topographic modelling Satelliteimages are two-dimensional representations of the three-dimensional earth surface. The correct treatment of the third dimension – the elevation – is essential for terrain mod- elling and accurate georeferencing. 5.1 RST transformation Transformations of spatial coordinates1 in 3 dimensions which involve only rotations, scaling and translations can be represented by a 4 × 4 transformation matrix A v∗ = Av (5.1) where v is the column vector containing the original coordinates v = (X, Y, Z, 1) and v∗ contains the transformed coordinates v∗ = (X∗ , Y ∗ , Z∗ , 1) . For example the translation X∗ = X + X0 Y ∗ = Y + Y0 Z∗ = Z + Z0 corresponds to the transformation matrix T =    1 0 0 X0 0 1 0 Y0 0 0 1 Z0 0 0 0 1    , a uniform scaling by 50% to S =    1/2 0 0 0 0 1/2 0 0 0 0 1/2 0 0 0 0 1    , 1The following treatment closely follows Chapter 2 of Gonzales and Woods [GW02]. 39
    40 CHAPTER 5.TOPOGRAPHIC MODELLING and a simple rotation θ about the Z-axis to Rθ =    cos θ sin θ 0 0 −sinθ cosθ 0 0 0 0 1 0 0 0 0 1    , etc. The complete RST transformation is then v∗ = RSTv = Av. (5.2) The inverse transformation is of course represented by A−1 . 5.2 Imaging transformations An imaging (or perspective) transformation projects 3D points onto a plane. It is used to describe the formation of a camera image and, unlike the RST transformation, is non-linear since it involves division by coordinate values. Figure 5.1: Basic imaging process, from [GW02]. In Figure 5.1, the camera coordinate system (x, y, x) is aligned with the world coordinate system, describing the terrain to be imaged. The camera focal length is λ. From sim- ple geometry we obtain expressions for the image plane coordinates in terms of the world coordinates: x = λX λ − Z y = λY λ − Z . (5.3) Solving for the X and Y world coordinates: X = x λ (λ − Z) Y = y λ (λ − Z). (5.4)
    5.3. CAMERA MODELSAND RFM APPROXIMATIONS 41 Thus, in order to extract the geographical coordinates (X, Y ) of a point on the earth’s surface from its image coordinates, we require knowledge of the elevation Z. Correcting for the elevation in this way constitutes the process of orthorectification. 5.3 Camera models and RFM approximations Equation (5.3) is overly simplified, as it assumes that the origin of world and image coordi- nates coincide. In order to apply it, one has first to transform the image coordinate system from the satellite to the world coordinate system. This is done in a straightforward way with the rotation and translation transformations introduced in Section 5.1. However it requires accurate knowledge of the height and orientation of the satellite imaging system at the time of the image acquisition (or, more exactly, during the acquisition, since the latter is normally not instantaneous). The resulting non-linear equations that relate image and world coordinates are what constitute the camera or sensor model for that particular image. Direct use of the camera model for image processing is complicated as it requires ex- tremely exact, sometimes proprietary information about the sensor system and its orbit. An alternative exists if the image provider also supplies a so-called rational function model (RFM) which approximates the camera model for each acquisition as a ratio of rational polynomials, see e.g. [TH01]. Such RFMs have the form r = f(X , Y , Z ) = a(X , Y , Z ) b(X , Y , Z ) c = g(X , Y , Z ) = c(X , Y , Z ) d(X , Y , Z ) (5.5) where c and r are the column and row (XY) coordinates in the image plane relative to an origin (c0, r0) and scaled by a factor cs resp. rs: c = c − c0 cs , r = r − r0 rs . Similarly X , Y and Z are relative, scaled world coordinates: X = X − X0 Xs , Y = Y − Y0 Ys , Z = Z − Z0 Zs . The polynomials a, b, c and d are typically to third order in the world coordinates, e.g. a(X, Y, Z) = a0 + a1X + a2Y + a3Z + a4XY + a5XZ + a6Y Z + a7X2 + a8Y 2 + a9Z2 + a10XY Z + a11X3 + a12XY 2 + a13XZ2 + a14X2 Y + a15Y 3 + a16Y Z2 + a17X2 Z + a18Y 2 Z + a19Z3 The advantage of using ratios of polynomials is that these are less subject to interpolation error. For a given acquisition the provider fits the RFM to his camera model using a three- dimensional grid of points covering the image and world spaces with a least squares fitting procedure. The RFM is capable of representing the camera model extremely well and can be used as a replacement for it. Both Space Imaging and Digital Globe provide RFMs with their high resolution IKONOS and QuickBird imagery. Below is a sample Quickbird RFM file giving the origins, scaling factors and polynomial coefficients needed in Eq. (5.5).
    42 CHAPTER 5.TOPOGRAPHIC MODELLING satId = QB02; bandId = P; SpecId = RPC00B; BEGIN_GROUP = IMAGE errBias = 56.01; errRand = 0.12; lineOffset = 4683; sampOffset = 4154; latOffset = 32.5709; longOffset = 51.8391; heightOffset = 1582; lineScale = 4733; sampScale = 4399; latScale = 0.0256; longScale = 0.0269; heightScale = 500; lineNumCoef = ( +1.162844E-03, -7.011681E-03, -9.993482E-01, -1.119999E-02, -6.682911E-06, +7.591306E-05, +3.632740E-04, -1.111298E-04, -5.842086E-04, +2.212466E-06, -1.275349E-06, +1.279061E-06, +1.918762E-08, -6.957548E-07, -1.240783E-06, -7.644403E-07, +3.479752E-07, +1.259300E-05, +1.085128E-06, -1.571375E-06); lineDenCoef = ( +1.000000E+00, +1.801541E-06, +5.822024E-04, +3.774278E-04, -2.141015E-08, -6.984359E-07, -1.344888E-06, -9.669251E-07, -4.726988E-08, +1.329814E-06, +2.113403E-08, -2.914653E-06,
    5.3. CAMERA MODELSAND RFM APPROXIMATIONS 43 -4.367422E-07, +6.988065E-07, +4.035593E-07, +3.275453E-07, -2.740827E-07, -4.147675E-06, -1.074015E-06, +2.218804E-06); sampNumCoef = ( -9.783496E-04, +9.566915E-01, -8.477919E-03, -5.393803E-02, -1.590864E-04, +5.477412E-04, -3.968308E-04, +4.819512E-04, -3.965558E-06, -3.442885E-05, +5.821180E-08, +2.952683E-08, -1.363146E-07, +2.454422E-07, +1.372698E-07, +1.987710E-07, -3.167074E-07, -1.038018E-06, +1.376092E-07, -2.352636E-07); sampDenCoef = ( +1.000000E+00, +5.029785E-04, +1.225257E-04, -5.780883E-04, -1.543054E-07, +1.240426E-06, -1.830526E-07, +3.264812E-07, -1.255831E-08, -5.177631E-07, -5.868514E-07, -9.029287E-07, +7.692317E-08, +1.289335E-07, -3.649242E-07, +0.000000E+00, +1.229000E-07, -1.290467E-05, +4.318970E-08, -8.391348E-08);
    44 CHAPTER 5.TOPOGRAPHIC MODELLING END_GROUP = IMAGE END; To illustrate a simple use of the RFM data, consider a vertical structure in a high- resolution image, such as a chimney or building fassade. Suppose we determine the image coordinates of the bottom and top of the structure to be (rb, cb) and (rt, ct), respectively. Then from 5.5 rb = f(X, Y, Zb) cb = g(X, Y, Zb) rt = f(X, Y, Zt) ct = g(X, Y, Zt), (5.6) since the (X, Y ) coordinates must be the same. This would appear to constitute a set of four equations in four unknowns X, Y , Zb and Zt, however the solution is unstable because of the close similarity of Zt to Zb. Nevertheless the object height Zt − Zb can be obtained by the following procedure: 1. Get (rb, cb) and (rt, ct) from the image. 2. Solve first two equations in (5.6) (e.g. with Newton’s method) for X and Y with Zb set equal to the average elevation in the scene if no DEM is available, otherwise to the true elevation. 3. For a spanning range of Zt values, calculate (rt, ct) from the second two equations in (5.6) and choose for Zt the value of Zt which gives closest agreement to the values read in. Quite generally, the RFM can approximate the camera model very well and can be used as an alternative for providing end users with the necessary information to perform their own photogrammetric processing. An ENVI plug-in for object height determination from RFM data is given in Appendix D.2.1. 5.4 Stereo imaging, elevation models and orthorectification The missing elevation information Z in (5.3) or in (5.5) can be obtained with stereoscopic imaging techniques. Figure 5.2 shows two cameras viewing the same world point w from two positions. The separation of the lens centers is the baseline. The objective is to find the coordinates (X, Y, Z) of w if its image points have coordinates (x1, y1) and (x2, y2). We assume that the cameras are identical and that their image coordinate systems are perfectly aligned, differing only in the location of their origins. The Z coordinate of w is the same for both coordinate systems. In Figure 5.3 the first camera is brought into coincidence with the world coordinate system. Then from (5.4), X1 = x1 λ (λ − Z). Alternatively, if the second camera is brought to the origin of the world coordinate system, X2 = x2 λ (λ − Z).
    5.4. STEREO IMAGING,ELEVATION MODELS AND ORTHORECTIFICATION 45 Figure 5.2: The stereo imaging process, from [GW02]. Figure 5.3: Top view of Figure 5.2, from [GW02].
    46 CHAPTER 5.TOPOGRAPHIC MODELLING But, from the figures, X2 = X1 + B, where B is the baseline. We have from the above three equations: Z = λ − λB x2 − x1 . (5.7) Thus if the displacement of the image coordinates of the point w, namely x2 − x1 can be determined, the Z coordinate can be calculated. The task is then to find two correspond- ing points in different images of the same scene. This is usually accomplished by spatial correlation techniques and is closely related to the problem of image-to-image registration discussed in the next chapter. Figure 5.4: ASTER stereo acquisition geometry. Because the stereo image must be correlated, best results are obtained if they are acquired within a very short time of each other, preferably “along track” if a single platform is used, see Figure 5.4. This figure shows the orientation and imaging geometry of the VNIR 3N and 3B cameras on the ASTER platform for acquiring a stereo full scene. The satellite travels at
    5.4. STEREO IMAGING,ELEVATION MODELS AND ORTHORECTIFICATION 47 a speed of 6.7 km/sec at a height of 705 km. A 60 × 60 km2 full scene is scanned in 9 seconds. 55 seconds later the same scene is scanned by the back-looking camera, corresponding to a baseline of 370 km. The along-track geometry means that the stereo pair is unipolar, that is, the displacements due to viewing angle are only along the y axis in the imaging plane. Therefore the spatial correlation algorithm used to match points can be one dimensional. If carried out on a pixel for pixel basis, one obtains a digital elevation model (DEM). Figure 5.5: ASTER 3N nadir camera image. Figure 5.6: ASTER 3B back-looking camera image. As an example, Figures 5.5 and 5.6 show an ASTER stereo pair. Both images have been rotated so as to make them unipolar.
    48 CHAPTER 5.TOPOGRAPHIC MODELLING The following IDL program calculates a very rudimentary DEM: pro test_correl_images height = 705.0 base = 370.0 pixel_size = 15.0 envi_select, title=’Choose 1st image’, fid=fid1, dims=dims1, pos=pos1, /band_only envi_select, title=’Choose 2nd image’, fid=fid2, dims=dims2, pos=pos2, /band_only im1 = envi_get_data(fid=fid1,dims=dims1,pos=pos1) im2 = envi_get_data(fid=fid2,dims=dims2,pos=pos2) n_cols = dims1[2]-dims1[1]+1 n_rows = dims1[4]-dims1[3]+1 parallax = fltarr(n_cols,n_rows) progressbar = Obj_New(’progressbar’, Color=’blue’, Text=’0’,$ title=’Cross correlation, column ...’,xsize=250,ysize=20) progressbar-start for i=7L,n_cols-8 do begin if progressbar-CheckCancel() then begin envi_enter_data,pixel_size*parallax*(height/base) progressbar-Destroy return endif progressbar-Update,(i*100)/n_cols,text=strtrim(i,2) for j=25L,n_rows-26 do begin cim = correl_images(im1[i-5:i+5,j-5:j+5],im2[i-7:i+7,j-25:j+25], $ xoffset_b=0,yoffset_b=-20,xshift=0,yshift=20) corrmat_analyze,cim,xoff,yoff,m,e,p parallax[i,j] = yoff (-5.0) endfor endfor progressbar-destroy envi_enter_data,pixel_size*parallax*(height/base) end This program makes use of the routines correl images and corrmat analyze from the IDL Astronomy User’s Library2 to calculate the cross-correlation of the two images. For each pixel in the nadir image an 11 × 11 window is moved along an 11 × 51 window in the back- looking image centered at the same position. The point of maximum correlation defines the parallax or displacement p. This is related to the relative elevation e of the pixel according to e = h b p × 15m, where h is the height of the sensor and b is the baseline, see Figure 5.7. Figure 5.8 shows the result. Clearly there are many problems due to the correlation errors, however the relative elevations are approximately correct when compared to the DEM determined with the ENVI commercial add-on AsterDTM, see Figure 5.9. 2www.astro.washington.edu/deutsch/idl/htmlhelp/index.html
    5.4. STEREO IMAGING,ELEVATION MODELS AND ORTHORECTIFICATION 49 ' b h e p satellite motion ground nadir cameraback camera Figure 5.7: Relating parallax p to elevation e by similar triangles: e/p = (h − e)/b ≈ h/b. Figure 5.8: A rudimentary DEM.
Figure 5.9: DEM generated with the commercial product AsterDTM.

Either the complete camera model or an RFM can be used, but usually neither is sufficient for an absolute DEM relative to mean sea level. Most often additional ground reference points within the image whose elevations are known are also required for absolute calibration. The orthorectification of the image is carried out on the basis of a suitable DEM and consists of projecting the (X, Y, Z) coordinates of each pixel onto the (X, Y) coordinates of a given map projection.

5.5 Slope and aspect

Terrain analysis involves the processing of elevation data. Specifically we consider here the generation of slope images, which give the steepness of the terrain at each pixel, and aspect images, which give the prevailing direction relative to north of a vector normal to the landscape at each pixel. A 3 × 3 pixel window can be used to determine both slope and aspect, see Figure 5.10. Define

\Delta x_1 = c - a    \Delta y_1 = a - g
\Delta x_2 = f - d    \Delta y_2 = b - h
\Delta x_3 = i - g    \Delta y_3 = c - i

and

\Delta x = (\Delta x_1 + \Delta x_2 + \Delta x_3)/(3x_s),    \Delta y = (\Delta y_1 + \Delta y_2 + \Delta y_3)/(3y_s),

where x_s, y_s give the pixel dimensions in meters. Then the slope in % at the central pixel position is given by

s = \sqrt{\frac{(\Delta x)^2 + (\Delta y)^2}{2}} \times 100,

whereas the aspect in radians measured clockwise from north is

\theta = \tan^{-1}\left(\frac{\Delta x}{\Delta y}\right).
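For a single window the computation looks as follows in IDL (a sketch; the window is a synthetic tilted plane and the 30 m pixel size is an arbitrary choice):

; window elements as in Figure 5.10:  row 1 = a b c,  row 2 = d e f,  row 3 = g h i
w = [[10.,11.,12.], [10.,11.,12.], [10.,11.,12.]]   ; elevations rising 1 m per column
xs = 30.0  &  ys = 30.0                             ; pixel dimensions in meters
dx = ((w[2,0]-w[0,0]) + (w[2,1]-w[0,1]) + (w[2,2]-w[0,2]))/(3*xs)
dy = ((w[0,0]-w[0,2]) + (w[1,0]-w[1,2]) + (w[2,0]-w[2,2]))/(3*ys)
slope  = sqrt((dx^2 + dy^2)/2)*100                  ; percent slope as defined above
aspect = atan(dx, dy)                               ; radians clockwise from north
print, slope, aspect*!radeg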
    5.6. ILLUMINATION CORRECTION51 a b c d e f g h i Figure 5.10: Pixel elevations in an 8-neighborhood. The letters represent elevations. Slope/aspect determinations from a DEM are available in the ENVI main menu under Topographic/Topographic Modelling. 5.6 Illumination correction Figure 5.11: Angles involved in computation of local solar elevation, taken from [RCSA03]. Topographic modelling can be used to correct images for the effects of local solar illu- mination, which depends not only upon the sun’s position (elevation and azimuth) but also upon the local slope and aspect of the terrain being illuminated. Figure 5.11 shows the angles involved [RCSA03]. Solar elevation is θi, solar azimuth is φa, θp is the slope and φ0 is the aspect. The quantity to be calculated is the local solar elevation γi which determines
the local irradiance. From trigonometry we have

\cos\gamma_i = \cos\theta_p \cos\theta_i + \sin\theta_p \sin\theta_i \cos(\phi_a - \phi_0).   (5.8)

An example of a cos γ_i image in hilly terrain is shown in Figure 5.12.

Figure 5.12: Cosine of local solar illumination angle stretched across a DEM.

Let ρ_T represent the reflectance of the inclined surface in Figure 5.11. Then for a Lambertian surface, i.e. a surface which scatters reflected radiation uniformly in all directions, the reflectance of the corresponding horizontal surface ρ_H would be

\rho_H = \rho_T \frac{\cos\theta_i}{\cos\gamma_i}.   (5.9)

The Lambertian assumption is in general not correct, the actual reflectance being described by a complicated bidirectional reflectance distribution function (BRDF). An empirical approach which gives a better approximation to the BRDF is the C-correction [TGG82]. Let m and b be the slope and intercept of a regression line for reflectance vs. cos γ_i for a particular image band. Then instead of (5.9) one uses

\rho_H = \rho_T \frac{\cos\theta_i + b/m}{\cos\gamma_i + b/m}.   (5.10)

An ENVI plug-in for illumination correction with the C-correction approximation is given in Appendix D.2.2.
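A minimal IDL sketch of the C-correction follows. It assumes that images of the slope θ_p, aspect φ_0, solar elevation θ_i and solar azimuth φ_a (all in radians, named theta_p, phi_0, theta_i, phi_a here) and a reflectance band rho_t are already available as arrays of equal size; these names are illustrative and not part of the ENVI plug-in.

pro c_correction_sketch, theta_p, phi_0, theta_i, phi_a, rho_t, rho_h
   ; local solar illumination, Eq. (5.8)
   cos_gamma = cos(theta_p)*cos(theta_i) + $
               sin(theta_p)*sin(theta_i)*cos(phi_a - phi_0)
   ; regression of reflectance on cos(gamma_i): fit = [b, m]
   fit = linfit(cos_gamma[*], rho_t[*])
   c = fit[0]/fit[1]                            ; c = b/m
   ; C-correction, Eq. (5.10)
   rho_h = rho_t * (cos(theta_i) + c) / (cos_gamma + c)
end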
Chapter 6

Image Registration

Image registration, either to another image or to a map, is a fundamental task in image processing. It is required for georeferencing, stereo imaging, accurate change detection, or any kind of multitemporal image analysis. Image-to-image registration methods can be divided roughly into four classes [RC96]:

1. algorithms that use pixel values directly, i.e. correlation methods
2. frequency- or wavelet-domain methods that use e.g. the fast Fourier transform (FFT)
3. feature-based methods that use low-level features such as edges and corners
4. algorithms that use high-level features and the relations between them, e.g. object-oriented methods

We consider examples of frequency-domain and feature-based methods here.

6.1 Frequency domain registration

Consider two N × N gray scale images g_1(i', j') and g_2(i, j), where g_2 is offset relative to g_1 by an integer number of pixels:

g_2(i, j) = g_1(i', j') = g_1(i - i_0, j - j_0), \qquad i_0, j_0 \in \mathbb{N}.

Taking the Fourier transform we have

\hat g_2(k, l) = \sum_{i,j} g_1(i - i_0, j - j_0)\, e^{-i 2\pi(ik + jl)/N},

or, with a change of indices to i', j',

\hat g_2(k, l) = \sum_{i',j'} g_1(i', j')\, e^{-i 2\pi(i'k + j'l)/N}\, e^{-i 2\pi(i_0 k + j_0 l)/N} = \hat g_1(k, l)\, e^{-i 2\pi(i_0 k + j_0 l)/N}.

(This is referred to as the Fourier translation property.) Therefore we can write

\frac{\hat g_2(k, l)\,\hat g_1^*(k, l)}{|\hat g_2(k, l)\,\hat g_1^*(k, l)|} = e^{-i 2\pi(i_0 k + j_0 l)/N},   (6.1)
    54 CHAPTER 6.IMAGE REGISTRATION Figure 6.1: Phase correlation of two identical images shifted by 10 pixels. where ˆg∗ 1 is the complex conjugate of ˆg1. The inverse transform of the right hand side exhibits a Dirac delta function (spike) at the coordinates (i0, j0). Thus if two otherwise identical images are offset by an integer number of pixels, the offset can be found by taking their Fourier transforms, computing the ratio on the left hand side of (6.1) (the so-called cross-power spectrum) and then taking the inverse transform of the result. The position of the maximum value in the inverse transform gives the values of i0 and j0. The following IDL program illustrates the procedure, see Fig. 6.1 ; Image matching by phase correlation ; read a bitmap image and cut out two 512x512 pixel arrays filename = Dialog_Pickfile(Filter=’*.jpg’,/Read) if filename eq ’’ then print, ’cancelled’ else begin Read_JPeG,filename,image g1 = image[0,10:521,10:521] g2 = image[0,0:511,0:511] ; perform Fourier transforms f1 = fft(g1, /double) f2 = fft(g2, /double) ; Determine the offset g = fft( f2*conj(f1)/abs(f1*conj(f1)), /inverse, /double )
pos = where(g eq max(g))
print, 'Offset = ' + strtrim(pos mod 512) + strtrim(pos/512)
; output as EPS file
thisDevice = !D.Name
set_plot, 'PS'
Device, Filename='c:\temp\phasecorr.eps', xsize=4, ysize=4, /inches, /Encapsulated
shade_surf, g[0,0:50,0:50]
device, /close_file
set_plot, thisDevice
endelse
end

Images which differ not only by an offset but also by a rigid rotation and change of scale can in principle be registered similarly, see [RC96].

6.2 Feature matching

A tedious task associated with image-image registration using low-level image features is the setting of ground control points (GCPs) since, in general, it is necessary to resort to manual entry. However, various techniques for automatic determination of GCPs have been suggested in the literature. We will discuss one such method, namely contour matching [LMM95]. This technique has been found to function reliably in bitemporal scenes in which vegetation changes do not dominate. It can of course be augmented (or replaced) by other automatic methods or by manual determination. The procedures involved in image-image registration using contour matching are shown in Fig. 6.2 [LMM95]; the processing stages are: LoG filter, zero crossing, edge strength, contour finder, chain code encoder, closed contour matching, consistency check, warping.

Figure 6.2: Image-image registration with contour matching.
    56 CHAPTER 6.IMAGE REGISTRATION 6.2.1 Contour detection The first step involves the application of a Laplacian of Gaussian filter to both images. After determining the contours by examining zero-crossings of the LoG-filtered image, the contour strengths are encoded in the pixel intensities. Strengths are taken to be proportional to the magnitude of the gradient at the zero-crossing. 6.2.2 Closed contours In the next step, all closed contours with strengths above some given threshold are deter- mined by tracing the contours. Pixels which have been visited during tracing are set to zero so that they will not be visited again. 6.2.3 Chain codes For subsequent matching purposes, all significant closed contours found in the preceding step are chain encoded. Any digital curve can be represented by an integer sequence {a1, a2 . . . ai . . .}, ai ∈ {0, 1, 2, 3, 4, 5, 6, 7}, depending on the relative position of the current pixel with respect to the previous pixel in the curve. This simple code has the drawback that some contours produce wrap around. For example the line in the direction −22.5o has the chain code {707070 . . .}. Li et al. [LMM95] suggest the smoothing operation: {a1a2 . . . an} → {b1b2 . . . bn}, where b1 = a1 and bi = qi, qi is an integer satisfying (qi−ai) mod 8 = 0 and |qi−bi−1| → min, i = 2, 3 . . . n. They also suggest the applying the Gaussian smoothing filter {0.1, 0.2, 0.4, 0.2, 0.1} to the result. Two chain codes can be compared by “sliding” one over the other and determining the maximum correlation between them. 6.2.4 Invariant moments The closed contours are first matched according to their invariant moments. These are defined as follows, see [Hab95, GW02]. Let the set C denote the set of pixels defining a contour, with |C| = n, that is, n is the number of pixels on the contour. The moment of order p, q of the contour is defined as mpq = i,j∈C jp iq . (6.2) Note that n = m00. The center of gravity xc, yc of the contour is thus xc = m10 m00 , yc = m01 m00 . The centralized moments are then given by µpq = i,j∈C (j − xc)p (i − yc)q , (6.3)
    6.2. FEATURE MATCHING57 and the normalized centralized moments by ηpq = 1 µ (p+q)/2+1 00 µpq. (6.4) For example, η20 = 1 µ2 00 µ20 = 1 n2 i,j∈C (j − yc)2 . The normalized centralized moments are, apart from effects of digital quantization, invariant under scale changes and translations of the contours. Finally, we can define moments which are also invariant under rotations, see [Hu62]. The first two such invariant moments are h1 = η20 + η02 h2 = (η20 − η02)2 + 4η2 11. (6.5) For example, consider a general rotation of the coordinate axes with origin at the center of gravity of a contour: j i = cos θ sin θ − sin θ cos θ j i = A j i . The first invariant moment in the rotated coordinate system is h1 = 1 n2 i ,j ∈C (j 2 + i 2 ) = 1 n2 i ,j ∈C (j , i ) j i = 1 n2 i,j∈C (j, i)A A j i = 1 n2 i,j∈C (j2 + i2 ), since A A = I. 6.2.5 Contour matching Each significant contour in one image is first matched with contours in the second image according to their invariant moments h1, h2. This is done by setting a threshold on the allowed differences, for instance 1 standard deviation. If one or more matches is found, the best candidate for a GCP pair is then chosen to be that matched contour in the second image for which the chain code correlation with the contour in the first image is maximum. If the maximum correlation is less that some threshold, e.g. 0.9, then no match is found. The actual GCP coordinates are taken to be the centers of gravity of the matched contours. 6.2.6 Consistency check The contour matching procedure invariably generates false GCP pairs, so a further process- ing step is required. In [LMM95] use is made of the fact that distances are preserved under a rigid transformation. Let A1A2 represent the distance between two points A1 and A2 in
an image. For two sets of m matched contour centers {A_i} and {B_i} in images 1 and 2, the ratios \overline{A_iA_j}/\overline{B_iB_j}, i = 1 … m, j = i+1 … m, are calculated. These should form a cluster, so that pairs scattered away from the cluster center can be rejected as false matches. An ENVI plug-in for GCP determination via contour matching is given in Appendix D.3.

6.3 Re-sampling and warping

We represent with (x, y) the coordinates of a point in image 1 and the corresponding point in image 2 with (u, v). A second order polynomial map of image 2 to image 1, for example, is given by

u = a_0 + a_1 x + a_2 y + a_3 xy + a_4 x^2 + a_5 y^2
v = b_0 + b_1 x + b_2 y + b_3 xy + b_4 x^2 + b_5 y^2.

Since there are 12 unknown coefficients, we require at least 6 GCP pairs to determine the map (each pair generates 2 equations). If more than 6 pairs are available, the coefficients can be found by least squares fitting. This has the advantage that an RMS error for the mapping can be estimated. Similar considerations apply for lower or higher order polynomial maps.

Having determined the map coefficients, image 2 can be registered to image 1 by re-sampling. Nearest neighbor resampling simply chooses the actual pixel in image 2 that has its center nearest the calculated coordinates (u, v) and transfers it to location (x, y). This is the preferred technique for classification or change detection, since the registered image consists of the original pixel brightnesses, simply rearranged in position to give a correct image geometry. Other commonly used resampling methods are bilinear interpolation and cubic convolution interpolation, see [JRR99] for details. These methods mix the spectral intensities of neighboring pixels.
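In IDL itself, the fitting and resampling steps can be sketched with the built-in routines polywarp and poly_2d. Note that polywarp fits the full (degree+1)²-term polynomial rather than the six-term model above, so a degree-2 fit needs at least nine GCP pairs; the array names below are assumptions for illustration only.

pro gcp_warp_sketch, im2, x1, y1, x2, y2, im2_registered
   ; x1, y1: GCP coordinates in image 1 (reference geometry)
   ; x2, y2: corresponding GCP coordinates in image 2
   ; fit (u,v) in image 2 as degree-2 polynomials of (x,y) in image 1
   polywarp, x2, y2, x1, y1, 2, kx, ky
   ; resample image 2 onto the geometry of image 1, nearest neighbor (interp=0)
   sz = size(im2, /dimensions)
   im2_registered = poly_2d(im2, kx, ky, 0, sz[0], sz[1])
end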
Exercises

1. We can approximate the centralized moments (6.3) of a contour by the integral

\mu_{pq} = \int (x - x_c)^p (y - y_c)^q f(x, y)\, dx\, dy,

where the integration is over the whole image and where f(x, y) = 1 if the point (x, y) lies on the contour and f(x, y) = 0 otherwise. Use this approximation to prove that the normalized centralized moments η_{pq} given in (6.4) are invariant under scaling transformations of the form

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \alpha & 0 \\ 0 & \alpha \end{pmatrix} \cdot \begin{pmatrix} x \\ y \end{pmatrix}.
Chapter 7

Image Sharpening

The change detection and classification algorithms that we will meet in the next chapters of course exploit not only the spatial but also the spectral information of satellite imagery. Many common platforms (Landsat 7 TM, IKONOS, SPOT, QuickBird) offer panchromatic images with higher ground resolution than that of the spectral channels. Application of multispectral change detection or classification methods is therefore restricted to the lower resolution. Conventional image fusion techniques, such as the well-known HSV transformation, can be used to sharpen the spectral components; however, the effect of mixing in the panchromatic image is often to "dilute" the spectral resolution. Another disadvantage of the HSV transformation is that one is restricted to using three of the available spectral channels. In the following we will outline the HSV method and then consider alternative fusion techniques.

7.1 HSV fusion

In computers with 24-bit graphics (true color), any three channels of a multispectral image can be displayed with 8 bits for each of the additive primary colors red, green and blue. The monitor displays this as an RGB color composite image which, depending on the choice of image channels and their relative intensities, may or may not appear natural. There are 2^24 ≈ 16 million colors possible. Another means of color definition is in terms of hue, saturation and value (HSV). Value (or intensity) can be thought of as an axis equidistant from the three orthogonal primary color axes. Hue refers to the actual color and is defined as an angle on a circle perpendicular to the value axis. Saturation is the "amount" of color present and is represented by the radius of the circle described by the hue.

A commonly used method for fusion of two images (for example a lower resolution multispectral image with a higher resolution panchromatic image) is to transform the first image from RGB to HSV space, replace the V component with the grayscale values of the second image after performing a radiometric normalization, and then transform back to RGB space. The forward transformation begins by rotating the RGB coordinate axes into the diagonal
axis of the RGB color cube. The coordinates in the new reference system are given by

\begin{pmatrix} m_1 \\ m_2 \\ i_1 \end{pmatrix} =
\begin{pmatrix} 2/\sqrt{6} & -1/\sqrt{6} & -1/\sqrt{6} \\ 0 & 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3} \end{pmatrix}
\cdot \begin{pmatrix} R \\ G \\ B \end{pmatrix}.

Then the rectangular coordinates (m_1, m_2, i_1) are transformed into the cylindrical HSV coordinates:

H = \arctan(m_1/m_2), \qquad S = \sqrt{m_1^2 + m_2^2}, \qquad I = \sqrt{3}\, i_1.

The following IDL code illustrates the necessary steps for HSV fusion, making use of ENVI batch procedures. These can also be invoked directly from the ENVI main menu.

pro HSVFusion, event
; get MS image
   envi_select, title='Select low resolution three-band input file', $
      fid=fid1, dims=dims1, pos=pos1
   if (fid1 eq -1) or (n_elements(pos1) ne 3) then return
; get PAN image
   envi_select, title='Select panchromatic image', $
      fid=fid2, pos=pos2, dims=dims2, /band_only
   if (fid2 eq -1) then return
   envi_check_save, /transform
; linear stretch the images and convert to byte format
   envi_doit, 'stretch_doit', fid=fid1, dims=dims1, pos=pos1, method=1, $
      r_fid=r_fid1, out_min=0, out_max=255, $
      range_by=0, i_min=0, i_max=100, out_dt=1, out_name='c:\temp\hsv_temp'
   envi_doit, 'stretch_doit', fid=fid2, dims=dims2, pos=pos2, method=1, $
      r_fid=r_fid2, out_min=0, out_max=255, $
      range_by=0, i_min=0, i_max=100, out_dt=1, /in_memory
   envi_file_query, r_fid2, ns=f_ns, nl=f_nl
   f_dims = [-1l, 0, f_ns-1, 0, f_nl-1]
; HSV sharpening
   envi_doit, 'sharpen_doit', $
      fid=[r_fid1,r_fid1,r_fid1], pos=[0,1,2], f_fid=r_fid2, $
      f_dims=f_dims, f_pos=[0], method=0, interp=0, /in_memory
; remove temporary files from ENVI
   envi_file_mng, id=r_fid1, /remove, /delete
   envi_file_mng, id=r_fid2, /remove
end
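For illustration only, the forward rotation and conversion to cylindrical coordinates given above can be written directly in IDL as follows. This is just the forward transformation; the ENVI batch code above performs the complete fusion. Here r, g, b are assumed to be byte or float arrays of equal size.

pro rgb_to_hsv_sketch, r, g, b, h, s, v
   rf = float(r) & gf = float(g) & bf = float(b)
   m1 = (2*rf - gf - bf)/sqrt(6.0)
   m2 = (gf - bf)/sqrt(2.0)
   i1 = (rf + gf + bf)/sqrt(3.0)
   h = atan(m1, m2)               ; hue
   s = sqrt(m1^2 + m2^2)          ; saturation
   v = sqrt(3.0)*i1               ; value (intensity)
end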
7.2 Brovey fusion

In its simplest form this method multiplies each re-sampled multispectral pixel by the ratio of the corresponding panchromatic pixel intensity to the sum of all of the multispectral intensities. The corrected pixel intensities \bar g_k(i, j) in the kth fused multispectral channel are given by

\bar g_k(i, j) = g_k(i, j) \cdot \frac{g_p(i, j)}{\sum_{k'} g_{k'}(i, j)},   (7.1)

where g_k(i, j) is the (re-sampled) pixel intensity in the kth channel and g_p(i, j) is the corresponding pixel intensity in the panchromatic image. (The ENVI environment offers Brovey fusion in its main menu.) This technique assumes that the spectral range spanned by the panchromatic image is essentially the same as that covered by the multispectral channels. This is seldom the case. Moreover, to avoid bias, the intensities used should be the radiances at the satellite sensors, implying use of the sensors' calibration.

7.3 PCA fusion

Panchromatic sharpening using principal components analysis (PCA) is similar to the HSV method. After the PCA transformation, the first principal component is replaced by the panchromatic image, again after radiometric normalization, see Figure 7.1.

Figure 7.1: Panchromatic fusion with the principal components transformation.

Image sharpening using PCA and the closely related Gram-Schmidt transformation is available from the ENVI main menu.
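Since (7.1) is simple band arithmetic, it can be sketched directly in IDL. The function below assumes ms is an [ncols, nrows, nbands] multispectral array already re-sampled to the panchromatic geometry and pan is the co-registered panchromatic band; the names are illustrative.

function brovey_fuse, ms, pan
   sz = size(ms, /dimensions)
   fused = fltarr(sz[0], sz[1], sz[2])
   denom = total(float(ms), 3) > 1e-6            ; sum over bands, avoid division by zero
   for k = 0, sz[2]-1 do fused[*,*,k] = float(ms[*,*,k]) * float(pan) / denom
   return, fused
end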
    64 CHAPTER 7.IMAGE SHARPENING 7.4 Wavelet fusion Wavelets provide an efficient means of representing high and low frequency components of multispectral images and can be used to perform image sharpening. Two examples are given here. 7.4.1 Discrete wavelet transform The discrete wavelet transform (DWT) of a two-dimensional image is shown in Appendix B to be equivalent to an iterative application of the high-low-pass filter bank illustrated in Figure 7.2 H G H H G G
Figure 7.2: Wavelet filter bank. H is a low-pass and G a high-pass filter derived from the coefficients of the wavelet transformation. The symbol ↓ indicates downsampling by a factor of 2.

The original image g_k(i, j) can be reconstructed by inverting the filter. A single application of the filter corresponding to the Daubechies D4 wavelet to a satellite image g_1(i, j) (1 m resolution) is shown in Figure B.12. The high frequency information (wavelet coefficients) is stored in the arrays C^H_2, C^V_2 and C^D_2 and displayed in the upper right, lower left and lower right quadrants, respectively. The original image with its resolution degraded by a factor of two, g_2(i, j), is in the upper left quadrant. Applying the filter bank iteratively to the upper left quadrant yields a further reduction by a factor of 2.

The fusion procedure for IKONOS or QuickBird imagery for instance, in which the resolutions of the panchromatic and the 4 multispectral components differ exactly by a factor of 4, is then as follows: Both the degraded panchromatic image and the four multispectral images are compressed once again (e.g. to 8 m resolution in the case of IKONOS) and the high frequency components C^z_4, z = H, V, D, are sampled to estimate the correction coefficients

a^z = \sigma^z_{ms} / \sigma^z_{pan}, \qquad b^z = m^z_{ms} - a^z m^z_{pan},   (7.2)

where m^z and \sigma^z denote mean and standard deviation, respectively. These coefficients are then used to normalize the wavelet coefficients of the panchromatic image to those of the multispectral image:

C^z_i(i, j) \to a^z C^z_i(i, j) + b^z, \qquad z = H, V, D, \quad i = 2, 3.   (7.3)
    7.4. WAVELET FUSION65 The degraded panchromatic image g3(i, j) is then replaced by the each of the four multispec- tral images and the normalized wavelet coefficients are used to reconstruct the original 1m resolution. We thus obtain what would be seen if the multispectral sensors had the resolution of the panchromatic sensor [RW00]. An ENVI plug-in for panchromatic sharpening with the DWT is given in Appendix D.4.1. 7.4.2 `A trous filtering The radiometric fidelity obtained with the discrete wavelet transform is excellent, as will be shown in the next section. However the lack of translational invariance of the DWT often leads to spatial artifacts (blurring, shadowing, staircase effect) in the sharpened product. This is illustrated in the following program, in which an image is transformed once with the DWT and the low-pass quadrant shifted by one pixel relative to the high-pass quadrants (i.e. the wavelet coefficients). After inverting the transformation, serious degradation is apparent, see Figure 7.3. pro translate_wavelet ; get an image band envi_select, title=’Select input file’, $ fid=fid, dims=dims, pos=pos, /band_only if fid eq -1 then return ; create a DWT object aDWT = Obj_New(’DWT’,envi_get_data(fid=fid,dims=dims,pos=pos)) ; compress aDWT-compress ; shift the compressed portion supressing phase correlation match aDWT-inject,shift(aDWT-Get_Quadrant(0),[1,1]),pc=0 ; restore aDWT-expand ; return result to ENVI envi_enter_data, aDWT-get_image() end As an alternative to the DWT, the `a trous wavelet transform (ATWT) has been proposed for image sharpening [AABG02]. The ATWT is a multiresolution decomposition defined formally by a low-pass filter H = {h(0), h(1), . . .} and a high-pass filter G = δ − H, where δ denotes an all-pass filter. Thus the high frequency part is just the difference between the original image and low-pass filtered image. Not surprisingly, this transformation does not allow perfect reconstruction if the output is downsampled. Therefore downsampling is not performed at all. Rather, at the kth iteration of the low-pass filter, 2k−1 zeroes are inserted between the elements of H. This means that every other pixel is interpolated on the first iteration: H = {h(0), 0, h(1), 0, . . .}, while on the second iteration H = {h(0), 0, 0, h(1), 0, 0, . . .} etc. (hence the name `a trous = with holes). The low-pass filter is usually chosen to be symmetric (unlike the Daubechies wavelet filters for example). The prototype filter chosen
    66 CHAPTER 7.IMAGE SHARPENING here is the cubic B-spline filter H = {1/16, 1/4, 3/8, 1/4, 1/16}. The transformation is highly redundant and requires considerably more computer storage to implement. However when used for image sharpening it is much less sensitive to mis- alignment between the multispectral and panchromatic images. Figure 7.3: Artifacts due to lack of translational invariance of the DWT. Figure 7.4 outlines the scheme implemented in the ENVI plug-in for ATWT panchromatic sharpening. The MS band is nearest-neighbor upsampled by a factor of 2 to match the dimensions of the high resolution band. The `a trous transformation is applied to both bands (columns and rows are filtered with the upsampled cubic spline filter, with the difference determining the high-pass result). The high frequency component of the pan image is normalized to that of the MS image in the same way as for DWT sharpening, equations (7.2) and (7.3). Then the low frequency pan component is replaced by the filtered MS image and the transformation inverted. An ENVI plug-in for ATWT sharpening is described in Appendix D.4.2. 7.5 Quality indices Wang and Bovik [WB02] suggest the following measure of radiometric fidelity between two image bands f and g:
Figure 7.4: À trous image sharpening scheme for an MS to panchromatic resolution ratio of two. The symbol ↑H denotes the upsampled low-pass filter.

Figure 7.5: Comparison of three image sharpening methods with the Wang-Bovik quality index. Left to right: Gram-Schmidt, ATWT, DWT.
Q = \frac{\sigma_{fg}}{\sigma_f \sigma_g} \cdot \frac{2\bar f \bar g}{\bar f^2 + \bar g^2} \cdot \frac{2\sigma_f \sigma_g}{\sigma_f^2 + \sigma_g^2} = \frac{4\sigma_{fg}\,\bar f \bar g}{(\bar f^2 + \bar g^2)(\sigma_f^2 + \sigma_g^2)},   (7.4)

where \bar f and \sigma_f^2 are the mean and variance of band f and \sigma_{fg} is the covariance of the two bands. The first term in (7.4) is the correlation coefficient between the two images, with values in [−1, 1]; the second term compares their average brightness, with values in [0, 1]; and the third term compares their contrasts, also in [0, 1]. Thus perfect radiometric correspondence would give a value Q = 1. Since image quality is usually not spatially invariant, it is usual to compute Q in, say, M sliding windows and then average over all such windows:

Q = \frac{1}{M}\sum_{j=1}^{M} Q_j.

An ENVI plug-in for determining the quality index for pan-sharpened images is given in Appendix D.4.3. Figure 7.5 shows a comparison of three image sharpening methods applied to a QuickBird image, namely the Gram-Schmidt, ATWT and DWT transformations. The latter is by far the best, but spatial artifacts are apparent.
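A minimal global version of (7.4) in IDL is sketched below; in practice Q would be evaluated in sliding windows and averaged as described above. Here f and g are any two co-registered bands.

function quality_index, f, g
   ff = float(f[*]) & gg = float(g[*])
   mf = mean(ff) & mg = mean(gg)
   vf = variance(ff) & vg = variance(gg)
   cfg = mean((ff - mf)*(gg - mg))               ; covariance of the two bands
   return, 4.0*cfg*mf*mg / ((mf^2 + mg^2)*(vf + vg))
end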
Chapter 8

Change Detection

To quote Singh's review article on change detection [Sin89], "The basic premise in using remote sensing data for change detection is that changes in land cover must result in changes in radiance values ... [which] must be large with respect to radiance changes from other factors." In the present chapter we will mention briefly the most commonly used digital techniques for enhancing this "change signal" in bitemporal satellite images, and then focus our attention on the so-called multivariate alteration detection algorithm of Nielsen et al. [NCS98].

8.1 Algebraic methods

In order to see changes in the two multispectral images represented by N-dimensional random vectors F and G, a simple procedure is to subtract them from each other component-by-component, examining the N differenced images characterized by

F - G = (F_1 - G_1, F_2 - G_2, \ldots, F_N - G_N)   (8.1)

for significant changes. Pixel intensity differences near zero indicate no change, while large positive or negative values indicate change, and decision thresholds can be set to define significant changes. If the difference signatures in the spectral channels are used to classify the kind of change that has taken place, one speaks of change vector analysis. Thresholds are usually expressed in standard deviations from the mean difference value, which is taken to correspond to no change. Alternatively, ratios of intensities of the form

F_k / G_k, \qquad k = 1 \ldots N,   (8.2)

can be built between successive images. Ratios near unity correspond to no change, while small and large values indicate change. A disadvantage of this method is that random variables of the form (8.2) are not normally distributed, so simple threshold values defined in terms of standard deviations are not valid. Other algebraic combinations, such as differences in vegetation indices (Section 2.1), are also in use. All of these "band math" operations can of course be performed conveniently within the ENVI/IDL environment.
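A minimal sketch of simple differencing with a ±2σ decision threshold in IDL follows, assuming f and g are co-registered single-band arrays; the threshold of two standard deviations is only an example.

pro simple_change_sketch, f, g, change_mask
   d = float(f) - float(g)
   m = mean(d) & s = stddev(d)
   change_mask = abs(d - m) gt 2.0*s             ; 1 = change, 0 = no change
end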
    70 CHAPTER 8.CHANGE DETECTION 8.2 Principal components Figure 8.1: Change detection with principal components. Consider the bitemporal feature space for a single spectral band m in which each pixel is denoted by a point (fm, gm), a realization of the random vector (Fm, Gm). Since the unchanged pixels are highly correlated, they will lie in a narrow, elongated cluster along the principal axis, whereas change pixels will lie some distance away from it, see Fig. 8.1. The second principal component will thus quantify the degree of change associated with a given pixel. Since the principal axes are determined by diagonalization of the covariance matrix for all of the pixels, the no-change axis may be poorly determined. To avoid this problem, the principal components can be determined iteratively using weights for each pixel according to the magnitude of the second principal component. This method can be generalized to treat all multispectral bands simultaneously [Wie97]. 8.3 Post-classification comparison If two co-registered satellite images have been classified, then the class labels can be com- pared to determine land cover changes. If classification is carried out at the pixel level (as opposed to segments or objects), then classification errors (typically 5%) may dominate the true changes, depending on the magnitude of the latter. ENVI offers functions for statistical analysis of post-classification change detection.
8.4 Multivariate alteration detection

Suppose we make a linear combination of the intensities for all N channels in the first image, acquired at time t_1 and represented by the random vector F. That is, we create a single image whose pixel intensities are

U = a^\top F = a_1 F_1 + a_2 F_2 + \ldots + a_N F_N,

where the vector of coefficients a is as yet unspecified. We do the same for the second image G, acquired at time t_2, i.e. we make the linear combination V = b^\top G, and then look at the scalar difference image U - V. This procedure combines all the information into a single image, whereby one still has to choose the coefficients a and b in some suitable way. Nielsen et al. [NCS98] suggest determining the coefficients so that the positive correlation between U and V is minimized. This means that the resulting difference image U - V will show maximum spread in its pixel intensities. If we assume that the spread is primarily due to actual changes that have taken place in the scene over the interval t_2 - t_1, then this procedure will enhance those changes as much as possible. Specifically we seek linear combinations such that

var(U - V) = var(U) + var(V) - 2\,cov(U, V) \to \text{maximum},   (8.3)

subject to the constraints

var(U) = var(V) = 1.   (8.4)

Note that under these constraints

var(U - V) = 2(1 - \rho),   (8.5)

where ρ is the correlation of the transformed variables U and V,

\rho = corr(U, V) = \frac{cov(U, V)}{\sqrt{var(U)\,var(V)}}.

Since we are dealing with change detection, we require that the random variables U and V be positively correlated, that is, cov(U, V) > 0. We thus seek vectors a and b which minimize the positive correlation ρ.

8.4.1 Canonical correlation analysis

Canonical correlation analysis leads to a transformation of each set of variables F and G such that their mutual correlation is displayed unambiguously, see [And84], Chapter 12. We can derive the transformation as follows. For multivariate normally distributed data the combined random vector is distributed as

\begin{pmatrix} F \\ G \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma_{ff} & \Sigma_{fg} \\ \Sigma_{gf} & \Sigma_{gg} \end{pmatrix} \right).

Recalling the property (1.6) we have

var(U) = a^\top \Sigma_{ff}\, a, \qquad var(V) = b^\top \Sigma_{gg}\, b, \qquad cov(U, V) = a^\top \Sigma_{fg}\, b.
    72 CHAPTER 8.CHANGE DETECTION If we introduce the Lagrange multipliers ν/2 and µ/2, extremalizing the covariance cov(U, V ) under the constraints (8.4) is equivalent to extremalizing the unconstrained Lagrange func- tion L = a Σfgb − ν 2 (a Σff a − 1) − µ 2 (b Σggb − 1). Differentiating, we obtain ∂L ∂a = Σfgb − ν 2 2Σff a = 0, ∂L ∂b = Σfga − µ 2 2Σggb = 0 or a = 1 ν Σ−1 ff Σfgb, b = 1 µ Σ−1 gg Σfga. The correlation between the random variables U and V is ρ = cov(U, V ) var(U)var(V ) = a Σfgb a Σff a b Σggb . Substituting for a and b in this equation gives (with Σfg = Σgf ) ρ2 = a ΣfgΣ−1 gg Σgf a a Σff a , ρ2 = b Σgf Σ−1 ff Σfgb b Σggb , which are equivalent to the two generalized eigenvalue problems ΣfgΣ−1 gg Σgf a = ρ2 Σff a Σgf Σ−1 ff Σfgb = ρ2 Σggb. (8.6) Thus the desired projections U = a F are given by the eigenvectors a1 . . . aN corresponding to the generalized eigenvalues ρ2 ∼ λ1 ≥ . . . ≥ λN of ΣfgΣ−1 gg Σgf with respect to Σff . Similarly the desired projections V = b G are given by the eigenvectors b1 . . . bN of Σgf Σ−1 ff Σfg with respect to Σgg corresponding to the same eigenvalues. Nielsen et al. [NCS98] refer to the N difference components Mi = Ui − Vi = ai F − bi G, i = 1 . . . N, (8.7) as the multivariate alteration detection (MAD) components of the combined bitemporal image. 8.4.2 Solution by Cholesky factorization Equations (8.6) are of the form Σ1a = λΣa, where both Σ1 and Σ are symmetric and Σ is positive definite. The Cholesky factorization of Σ is Σ = LL , where L is a lower triangular matrix, and can be thought of as the “square root” of Σ. Such an L always exists is Σ is positive definite. Therefore we can write Σ1a = LL a
    8.4. MULTIVARIATE ALTERATIONDETECTION 73 or, equivalently, L−1 Σ1(L )−1 L a = λL a or, with d = L a and commutivity of inverse and transpose, [L−1 Σ1(L−1 ) ]d = λd, a standard eigenproblem for a real, symmetric matrix L−1 Σ1(L−1 ) . Let the (orthogonal) eigenvectors be di. We have 0 = di dj = ai LL aj = ai Σaj, i = j. (8.8) 8.4.3 Properties of the MAD components We have, from (8.4) and (8.8), for the eigenvectors ai and bi, ai Σff aj = bi Σggbj = δij. Furthermore bi = 1 √ λi Σ−1 gg Σgf ai, i.e. substituting this into the LHS of the second equation in (8.6): Σgf Σ−1 ff Σfg 1 √ λi Σ−1 gg Σgf ai = Σgf Σ−1 ff 1 √ λi λiΣff ai = Σgf λiai = λiΣggbi, as required. It follows that ai Σfgbj = ai 1 λj ΣfgΣ−1 gg Σgf aj = λj ai Σff ai = λj δij, and similarly for bi Σgf aj. Thus the covariances of the MAD components are given by cov(Ui − Vi, Uj − Vj) = cov(ai F − bi G, aj F − bj G) = 2δij(1 − λj). The MAD components are therefore orthogonal (uncorrelated) with variances var(Ui − Vi) = σ2 MADi = 2(1 − λi). (8.9) The transformation corresponding to the smallest eigenvalue, namely (aN , bN ), will thus give maximal variance for the difference U − V . We can derive change probabilities from a MAD image as follows. The sum of the squares of the standardized MAD components for no-change pixels, given by Z = MAD1 σMAD1 2 + . . . + MADN σMADN 2 , is approximately chi-square distributed with N degrees of freedom, i.e., Pr(Z ≤ z) = ΓP (N/2, z/2). For a given measured value z for some pixel, the probability that Z could be that large or larger, given that the pixel is no-change, is 1 − ΓP (N/2, z/2).
    74 CHAPTER 8.CHANGE DETECTION The probability that the pixel is a change pixel is therefore the complement of this, Pchange(z) = 1 − (1 − ΓP (N/2, z/2)) = ΓP (N/2, z/2). (8.10) This quantity can be plotted for example as a gray scale image to show the regions of change. The last MAD component has maximum spread in its pixel intensities and, ideally, maximum change information. However, depending on the type of change one is looking for, the other components may also be extremely useful. The second-to-last image has maximum spread subject to the condition that the pixel intensities are statistically uncorrelated with those in the first image, and so on. Since interesting anthropomorphic changes will generally be uncorrelated with dominating seasonal vegetation changes or stochastic image noise, it is quite common that such changes will be concentrated in higher order components. This in fact is one of the nicest aspects of the method – it sorts different categories of change into different image components. Therefore we can also perform change vector analysis on the MAD change vector. An ENVI plug-in for MAD is given in Appendix D.5.1. 8.4.4 Covariance of MAD variates with original observations With (8.7) and A = (ai . . . aN ), B = (bi . . . bN ) FM = F(A F − B G) = Σff A − ΣfgB GM = G(A F − B G) = Σgf A − ΣggB. 8.4.5 Scale invariance An additional advantage of the MAD procedure stems from the fact that the calculations involved are invariant under linear transformations of the original image intensities. This implies that the method is insensitive to differences in atmospheric conditions or sensor calibrations at the two acquisition times. We can see this as follows. Suppose the second image G is transformed according to some linear transformation T, H = TG. The relevant covariance matrices for (8.6) are then Σfg = FH = ΣfgT Σgf = HF = TΣgf Σff = Σff Σgg = HH = TΣggT . The eigenproblems are therefore ΣfgT (TΣggT )−1 TΣgf a = ρ2 Σff a TΣgf Σ−1 ff ΣfgT c = ρ2 TΣggT c, where c is the desired projection for H. These are equivalent to ΣfgΣ−1 gg Σgf a = ρ2 Σff a Σgf Σ−1 ff Σfg(T c) = ρ2 Σgg(T c),
    8.4. MULTIVARIATE ALTERATIONDETECTION 75 which are identical to (8.6) with b = T c. Therefore the MAD components in the trans- formed situation are ai F − ci H = ai F − ci TG = ai F − (T ci) G = ai F − bi G as before. 8.4.6 Improving signal to noise The MAD transformation can be augmented by subsequent application of the maximum autocorrelation factor (MAF) transformation, in order to improve the spatial coherence of the difference components, see [NCS98]. When image noise is estimated as the difference between intensities of neighboring pixels, the MAF transformation is equivalent to the MNF transformation. The MAF/MAD variates thus generated are also orthogonal and invariant under affine transformations. An ENVI plug-in for performing the MAF transfor- mation is given in Appendix D.5.2. 8.4.7 Decision thresholds Since the MAD components are approximately normally distributed about zero and uncor- related, see Figure 8.2, decision thresholds for change or no change pixels can be set in terms of standard deviations about the mean for each component separately. This can be done arbitrarily, for example by saying that all pixels in a MAD component whose intensities are within ±2σMAD are no-change pixels. Figure 8.2: Scatter plot of two MAD components. We can do better than this, however, using a Bayesean technique. Let us consider the following mixture model for a random variable X representing one of the MAD components: p(x) = p(x | NC)p(NC) + p(x | C−)p(C−) + p(x | C+)p(C+), (8.11)
    76 CHAPTER 8.CHANGE DETECTION Figure 8.3: Probability mixture model for MAD components. where C+, C− and NC denote positive change, negative change and no change, respectively, see Fig. 8.3. The set of measurements S = {xi} may be grouped into four disjoint sets: SNC, SC−, SC+, SU = SSNC ∪ SC− ∪ SC+, with SU denoting the set of ambiguous pixels.1 From the sample mean and sample variance, we estimate initially the moments for the distribution of no-change pixels: µNC = 1 |SNC| · i∈SNC xi, (σNC)2 = 1 |SNC| · i∈SNC (xi − µNC)2 (|S| denotes set cardinality) and similarly for C− and C+. Bruzzone and Prieto [BP00] suggest improving these estimates by using the pixels in SU and applying the so-called EM algorithm (see [Bis95] for a good explanation): µNC = i∈S p(NC | xi)xi / i∈S p(NC | xi) (σNC)2 = i∈S p(NC | xi)(xi − µNC)2 / i∈S p(NC | xi) p (NC) = 1 |S| · i∈S p(NC | xi) , (8.12) where p(NC | xi) is the a posteriori probability for a no-change pixel conditional on mea- surement xi. We have the following rules for determining p(NC | xi): 1. i ∈ SNC : p(NC | xi) = 1 2. i ∈ SC± : p(NC | xi) = 0 1The symbols ∪ and denote set union and set difference, respectively. These sets can be determined in practice by setting generous, scene-independent thresholds for change and no-change pixel intensities, see [BP00].
    8.5. RADIOMETRIC NORMALIZATION77 3. i ∈ SU : p(NC | xi) = p(xi | NC)p(NC)/p(xi) (Bayes’ rule), where p(xi | NC) = 1 √ 2πσNC · exp −(xi − µNC)2 2σ2 NC and p(xi) = p(xi | NC)p(NC) + p(xi | C−)p(C−) + p(xi | C+)p(C+). Substituting into (8.12) we obtain the set of equations µNC = i∈SU p(NC | xi)xi + i∈SNC xi i∈SU p(NC | xi) + |SNC| (σNC)2 = i∈SU p(NC | xi)(xi − µNC)2 + i∈SNC (xi − µNC)2 i∈SU p(NC | xi) + |SNC| p (NC) = 1 |S| i∈SU p(NC | xi) + |SNC| , which can be iterated numerically to improve the initial estimates of the distributions. One can then determine e.g. the upper change threshold as the appropriate solution of p(x | NC)p(NC) = p(x | C+)p(C+). Taking logarithms, 1 2σ2 C+ (x − µC+)2 − 1 2σ2 NC (x − µNC)2 = log σNC σC+ · p(C+) P(NC) =: A with solutions x = µC+σ2 NC − µNCσ2 C+ ± σNCσC+ (µNC − µC+)2 + 2A(σ2 NC − σ2 C+) σ2 NC − σ2 C+ . A corresponding expression obtains for the lower threshold. In the next chapter we will extend this method to discriminate clusters of change and no change pixels. An ENVI GUI for determining change thresholds is given in Appendix D.6.6. 8.5 Radiometric normalization Radiometric normalization of satellite imagery requires, among other things, an atmospheric correction algorithm and the associated atmospheric properties at the times of image acqui- sition. For most historical satellite scenes such data are not available and even for planned acquisitions they may be difficult to obtain. A relative normalization based on the radiomet- ric information intrinsic to the images themselves is an alternative whenever absolute surface radiances are not required, for example in change detection applications or for supervised land cover classification. One usually proceeds under the assumption that the relationship between the at-sensor radiances recorded at two different times from regions of constant reflectance is spatially
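A minimal IDL sketch evaluating the upper threshold expression above follows; the mixture parameters are purely illustrative numbers, standing in for the EM estimates of the no-change and positive-change distributions.

pro mad_threshold_demo
   ; illustrative estimates for the no-change and positive-change distributions
   mu_nc = 0.0 & sig_nc = 1.0 & p_nc = 0.70
   mu_cp = 4.0 & sig_cp = 2.0 & p_cp = 0.15
   a = alog( (sig_nc/sig_cp) * (p_cp/p_nc) )
   disc = (mu_nc - mu_cp)^2 + 2.0*a*(sig_nc^2 - sig_cp^2)
   ; the two roots of the quadratic; the appropriate one lies between the means
   x = ( mu_cp*sig_nc^2 - mu_nc*sig_cp^2 $
         + [-1.0, 1.0]*sig_nc*sig_cp*sqrt(disc) ) / (sig_nc^2 - sig_cp^2)
   print, 'candidate upper thresholds: ', x
end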
homogeneous and can be approximated by linear functions. The critical aspect is the determination of suitable time-invariant features upon which to base the normalization.

As we have seen, the MAD transformation is invariant to linear and affine scaling. Thus, if one uses MAD for change detection applications, preprocessing by linear radiometric normalization is superfluous. However radiometric normalization of imagery is important for many other applications, such as mosaicing, tracking vegetation indices over time, supervised and unsupervised land cover classification, etc. Furthermore, if some other, non-invariant change detection procedure is preferred, it must generally be preceded by radiometric normalization [CNS04]. Taking advantage of this invariance, one can apply the MAD transformation to select the no-change pixels in bitemporal images, and then use them for radiometric normalization. The procedure is simple, fast and completely automatic, and compares very favorably with normalization using hand-selected, time-invariant features.

An ENVI plug-in for radiometric normalization with the MAD transformation is given in Appendix D.5.3.
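A minimal sketch of the regression step for one band follows, assuming f is the reference band, g the band to be normalized, and nochange_mask a mask of no-change pixels obtained from the MAD change probabilities (e.g. Pchange < 0.05); all names are illustrative.

pro mad_normalize_band, f, g, nochange_mask, g_norm
   idx = where(nochange_mask, count)
   if count lt 2 then return
   ; regress the reference band on the target band over the no-change pixels
   fit = linfit(float(g[idx]), float(f[idx]))    ; fit = [intercept, slope]
   g_norm = fit[0] + fit[1]*float(g)
end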
Chapter 9

Unsupervised Classification

Supervised classification of multispectral remote sensing imagery is commonly used for land-cover determination, see Chapter 10. For supervised classification it is very important to define training areas which adequately represent the spectral characteristics of each class in the image to be classified, as the quality of the training set has a significant effect on the classification process and its accuracy. Finding and verifying training areas can be rather laborious, since the analyst must select representative pixels for each of the classes. This must be done by visual examination of the image data and by information extraction from additional sources such as ground reference data (ground truth) or existing maps.

Unlike supervised classification, clustering methods (or unsupervised methods) require no training sets at all. Instead, they attempt to find the underlying structure automatically by organizing the data into classes sharing similar, e.g. spectrally homogeneous, characteristics. The analyst simply needs to specify the number of clusters present. Clustering plays an especially important role when very little a priori information about the data is available, and it provides a useful method for organizing a large set of data so that the retrieval of information may be made more efficiently. A primary objective of using clustering algorithms for pre-classification of multispectral remote sensing data in particular is to obtain optimum information for the selection of training regions for subsequent supervised land-use segmentation of the imagery.

9.1 A simple cost function

We begin with the assumption that the measured features (pixel intensities)

x = \{x_i \mid i = 1 \ldots n\}

are chosen independently from K multivariate normally distributed populations corresponding to the K principal land cover categories present in the image. The x_i are thus realizations of random vectors

X_k \sim N(\mu_k, \Sigma_k), \qquad k = 1 \ldots K.   (9.1)

Here \mu_k and \Sigma_k are the expected value and covariance matrix of X_k, respectively. We denote a given clustering by C = \{C_1, \ldots, C_k, \ldots, C_K\}, where C_k denotes the index set for the kth cluster.¹ We wish to maximize the a posteriori probability p(C | x) for observing the

¹The set of indices \{i \mid i = 1 \ldots n, x_i \text{ is in class } k\}.
    80 CHAPTER 9.UNSUPERVISED CLASSIFICATION clustering given the data. From Bayes’ rule, p(C | x) = p(x | C)p(C) p(x) . (9.2) The quantity p(x|C) is the joint probability density function for clustering C, also referred to as the likelihood of observing the clustering C given the data x, P(C) is the prior probability for C and p(x) is a normalization independent of C. The joint probability density for the data is the product of the individual probability densities, i.e., p(x | C) = K k=1 i∈Ck p(xi | Ck) = K k=1 i∈Ck (2π)−N/2 |Σk|−1/2 exp − 1 2 (xi − µk) Σ−1 k (xi − µk) . Forming the product in this way is justified by the independence of the samples. The log-likelihood is given by [Fra96] L = log p(x | C) = K k=1 i∈Ck − N 2 log(2π) − 1 2 log |Σk| − 1 2 (xi − µk) Σ−1 k (xi − µk) . With (9.2) we can therefore write log p(C | x) ∝ L + log p(C). (9.3) If all K classes exhibit identical covariance matrices according to Σk = σ2 I, k = 1 . . . K, (9.4) where I is the identity matrix, then L is maximized when the expression K k=1 i∈Ck (xi) − µk) ( 1 2σ2 I)(xi − µk) = K k=1 i∈Ck (xi − µk) (xi − µk) 2σ2 is minimized. We are thus led to the cost function E(C) = K k=1 i∈Ck (xi − µk) (xi − µk) 2σ2 − log p(C). (9.5) An optimal clustering C under these assumptions is achieved for E(C) → min . Now we introduce a “hard” class dependency in the form of a matrix u with elements given by uki = 1 if i ∈ Ck 0 otherwise. (9.6)
The matrix u satisfies the conditions

\sum_{k=1}^{K} u_{ki} = 1, \qquad i = 1 \ldots n,   (9.7)

meaning that each sampled pixel x_i, i = 1 … n, belongs to precisely one class, and

\sum_{i=1}^{n} u_{ki} > 0, \qquad k = 1 \ldots K,   (9.8)

meaning that no class C_k, k = 1 … K, is empty. The sum in (9.8) is the number n_k of pixels in the kth class. An unbiased estimate m_k of the expected value \mu_k for the kth cluster is therefore given by

\mu_k \approx m_k = \frac{1}{n_k}\sum_{i \in C_k} x_i = \frac{\sum_{i=1}^{n} u_{ki}\, x_i}{\sum_{i=1}^{n} u_{ki}}, \qquad k = 1 \ldots K,   (9.9)

and an estimate F_k of the covariance matrix \Sigma_k by

\Sigma_k \approx F_k = \frac{\sum_{i=1}^{n} u_{ki}\,(x_i - m_k)(x_i - m_k)^\top}{\sum_{i=1}^{n} u_{ki}}, \qquad k = 1 \ldots K.   (9.10)

We can now write (9.5) in the form

E(C) = \sum_{k=1}^{K}\sum_{i=1}^{n} u_{ki}\, \frac{(x_i - m_k)^\top (x_i - m_k)}{2\sigma^2} - \log p(C).   (9.11)

Finally, if we do not wish to include prior probabilities, we can simply say that all clustering configurations C are a priori equally likely. Then the last term in (9.11) is independent of C and we have, dropping the multiplicative constant 1/2\sigma^2, the well-known sum-of-squares cost function

E(C) = \sum_{k=1}^{K}\sum_{i=1}^{n} u_{ki}\,(x_i - m_k)^\top (x_i - m_k).   (9.12)

9.2 Algorithms that minimize the simple cost function

We begin with the popular K-means method and then consider an algorithm due to Palubinskas [Pal98], which uses cost function (9.11) and for which the number of clusters is determined automatically. Then we discuss a common version of bottom-up or agglomerative hierarchical clustering, and finally a "fuzzy" version of the K-means algorithm.

9.2.1 K-means

The K-means clustering algorithm (KM) (sometimes referred to as basic Isodata [DH73] or migrating means [JRR99]) is based on the cost function (9.12). After initialization of the cluster centers, the distance measure corresponding to a minimization of (9.12), namely

d(i, k) = (x_i - m_k)^\top (x_i - m_k),

is used to re-cluster the pixel vectors. Then (9.9) is used to recalculate the cluster centers. This procedure is iterated until the centers cease to change significantly. K-means clustering may be performed within the ENVI environment from the main menu.
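The following bare-bones IDL sketch iterates the two K-means steps a fixed number of times on an [nbands, npixels] data matrix. It is not the ENVI implementation; the random initialization and the fixed iteration count are simplifications.

pro kmeans_sketch, x, k, labels, means
   sz = size(x, /dimensions) & nb = sz[0] & n = sz[1]
   seed = 1L
   means = float(x[*, long(randomu(seed, k)*n)])   ; k randomly chosen pixels as centers
   labels = lonarr(n)
   for iter = 0, 49 do begin
      ; assignment step: nearest center in Euclidean distance
      for i = 0L, n-1 do begin
         d = total((means - rebin(float(x[*,i]), nb, k))^2, 1)
         labels[i] = (where(d eq min(d)))[0]
      endfor
      ; update step: recompute the cluster means, Eq. (9.9)
      for j = 0, k-1 do begin
         idx = where(labels eq j, cnt)
         if cnt gt 0 then means[*,j] = total(reform(float(x[*,idx]), nb, cnt), 2)/cnt
      endfor
   endfor
end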
    82 CHAPTER 9.UNSUPERVISED CLASSIFICATION 9.2.2 Extended K-means Denote by pk = p(Ck) the prior probability for cluster k. The entropy S associated with this prior distribution is S = − K k=1 pk log pk. (9.13) Distributions with high entropy are those for which the pi are all similar, that is, the pixels are distributed evenly over all available clusters, see [Bis95]. Low entropy means that most of the data are concentrated in very few clusters. We choose a prior distribution p(C) in (9.11) for which few clusters are more probable than many clusters, namely p(C) ∝ exp(−αES) = exp αE K k=1 pk log pk , where αE is a parameter. The cost function (9.11) can then be written as E(C) = K k=1 n i=1 uki (xi − mk) (xi − mk) 2σ2 − αE K k=1 pk log pk. (9.14) With pk ≈ nk n = 1 n n i=1 uki (9.15) this becomes E(C) = K k=1 n i=1 uki (xi − mk) (xi − mk) 2σ2 − αE n log pk . (9.16) An estimate for the parameter αE may be obtained as follows [Pal98]: From (9.14) and (9.15) E(C) ≈ K k=1 nσ2 kpk 2σ2 − αEpk log pk . Equating the likelihood and prior terms in this expression and taking σ2 k ≈ σ2 and pk ≈ 1/ ˜K, where ˜K is some a priori expected number of clusters, gives αE ≈ − n 2 log(1/ ˜K) . (9.17) The parameter σ2 can be estimated from the data. The extended K-means (EKM) algorithm is as follows: First an initial configuration with a very large number of clusters K is chosen (for one-dimensional data this might conveniently be the 256 gray values that a pixel with 8-bit resolution can have) and initial values mk = 1 nk n i=1 ukixi, pk = nk n (9.18) are determined. Then the data are re-clustered according to the distance measure corre- sponding to a minimization of (9.16): d(i, k) = (xi − mk) (xi − mk) 2σ2 − αE n log pk. (9.19)
    9.2. ALGORITHMS THATMINIMIZE THE SIMPLE COST FUNCTION 83 The prior term tends to reduce the number of clusters and any class which has in the course of the algorithm nk = 0 is simply dropped from the calculation. (Condition (9.8) is thus relaxed.) Iteration of (9.18) and (9.19) continues until no significant changes in the mk occur. The explicit choice of the number of clusters K is replaced by the necessity of choosing a value for the “meta-parameter” αE. This has the advantage that we can use one parameter for a wide variety of images and let the algorithm itself decide on the actual value of K in any given instance. 9.2.3 Agglomerative hierarchical clustering The agglomerative hierarchical clustering algorithm that we consider here is, as for K-means, based on the cost function (9.12), see [DH73]. It begins by assigning each pixel in the dataset to its own class or cluster. At this stage of course, the cost function E(C), Eq. (9.12), is zero. We write E(C) in the form E(C) = K k=1 Ek (9.20) where Ek is given by Ek = i∈Ck (xi − mk) (xi − mk). Every agglomeration of clusters to form a smaller number of clusters will increase E(C). We therefore seek a prescription for choosing two clusters for combination that will increase E(C) by the smallest amount possible. Suppose clusters k with nk members and with n members are merged, k , and the new cluster is labeled k. Then mk → nkmk + n m nk + n =: ¯m. Thus after the agglomeration, Ek changes to Ek = i∈Ck∪C (xi − ¯m) (xi − ¯m) and E disappears. The net change in E(C) is therefore, after some algebra, ∆(k, ) = i∈Ck∪C (xi − ¯m) (xi − ¯m) − i∈Ck (xi − mk) (xi − mk) − i∈C (xi − m ) (xi − m ) = nkn nk + n (mk − m ) (mk − m ). (9.21) The minimum increase in E(C) is achieved by combining those two clusters k and which minimize the above expression. Given two alternative candidate cluster pairs with simi- lar combined memberships nk + n and whose means have similar euclidean separations mk − m , this prescription obviously favors combining that pair with the larger discrep- ancy between nk and n . Thus similar-sized clusters are preserved and smaller clusters are absorbed by larger ones.
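The pairwise merge cost of (9.21) is easily expressed in IDL. A minimal sketch, assuming means is an [nbands, nclusters] array of current cluster means and counts holds the current cluster sizes:

function merge_cost, means, counts, k, l
   ; Delta(k,l) = n_k n_l / (n_k + n_l) * ||m_k - m_l||^2
   diff = float(means[*,k] - means[*,l])
   return, (counts[k]*counts[l]/float(counts[k]+counts[l])) * total(diff^2)
end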
    84 CHAPTER 9.UNSUPERVISED CLASSIFICATION Let k, represent the cluster formed by combination of the clusters k and . Then the increase in cost incurred by combining this cluster with cluster r can be determined from (9.21) as ∆( k, , r) = (nk + nr)∆(k, r) + (n + nr)∆( , r) − nr∆(k, ) nk + n + nr . (9.22) Once ∆(i, j) = 1 2 (xi − xj) (xi − xj) for i, j = 1 . . . n has been initialized from (9.21) for all possible combinations of pixels, the recursive formula (9.22) can be used to calculate efficiently the cost function for any further combinations without reference to the original data. The algorithm terminates when the desired number of clusters has been reached or continues until a single cluster has been formed. Assuming that the data consist of ˜K compact and well separated clusters, the slope of E(C) vs. the number of clusters K should decrease (become more negative) for K ≤ ˜K. An ENVI plug-in for agglomerative hierarchic clustering is given in Appendix D.6.1. 9.2.4 Fuzzy K-means For q 1 we write (9.9) and (9.12) in the equivalent forms [Dun73] mk = n i=1 uq kixi n i=1 uq ki , k = 1 . . . K, (9.23) E(C) = K k=1 n i=1 uq ki(xi − mk) (xi − mk), (9.24) and make the transition from “hard” to “fuzzy” clustering by replacing (9.6) by continuous variables 0 uki 1, k = 1 . . . K, i = 1 . . . n, (9.25) but retaining requirements (9.7) and (9.8). The matrix u is now a fuzzy class membership matrix. With i fixed, we seek values for the uki which solve the minimization problem Ei = K k=1 uq ki(xi − mk) (xi − mk) → min, i = 1 . . . n, under conditions (9.7). By introducing the Lagrange function Li = Ei − λ K k=1 uki − 1 we can equivalently solve the unconstrained problem Li → min. Differentiating with respect to uki, ∂Li ∂uki = q(uki)q−1 (xi − mk) (xi − mk) − λ = 0, k = 1 . . . K,
    9.3. EM CLUSTERING85 from which we have uki = q−1 λ q q−1 1 (xi − mk) (xi − mk) . (9.26) The Lagrange multiplier λ is determined by 1 = K k=1 uki = q−1 λ q K k=1 q−1 1 (xi − mk) (xi − mk) , Substituting this into (9.26), we obtain finally uki = q−1 1 (xi−mk) (xi−mk) K k =1 q−1 1 (xi−mk ) (xi−mk ) , k = 1 . . . K, i = 1 . . . n. (9.27) The parameter q determines the “degree of fuzziness” and is usually chosen as q = 2. The fuzzy K-means (FKM) algorithm consists of a simple iteration of equations (9.23) and (9.27). The algorithm terminates when the cluster centers mk – or alternatively when the matrix elements uki – cease to change significantly. This algorithm should gives similar results to the K-means algorithm, but one expects it to be less likely to become trapped in local minima of the cost function. An ENVI plug-in for fuzzy k-means clustering is given in Appendix D.6.2. 9.3 EM Clustering The EM (= expectation maximization) algorithm, (see e.g. [Bis95]) replaces uki in (9.27) by the posterior probability p(Ck | xi) of class Ck given the observation xi. That is, using Bayes’ theorem, uki → p(Ck | xi) ∼ p(xi | Ck)p(Ck). Here p(xi | Ck) is taken to be a multivariate normal distribution function with estimated mean mk and estimated covariance matrix Fk given by (9.9) and (9.10), respectively. Thus uki ∼ p(Ck) 1 |Fk| exp − 1 2 (xi − mk) F−1 k (xi − mk) . (9.28) One can use the current class membership to estimate P(Ck) as pk according to (9.15). The EM algorithm is then an iteration of equations (9.9), (9.10), (9.15) and (9.28) with the same termination condition as for the fuzzy K-means algorithm, see also Eqs. (8.12). After each iteration the columns of u are normalized according to (9.7). Because of the exponential distance dependence of the membership probabilities in (9.28), the algorithm is very sensitive to initialization conditions, and can even become unstable. To avoid this problem, one can first obtain initial values for the mk and for u by preceding the calculation with the fuzzy K-means algorithm. Explicitly: Algorithm (EM clustering) 1. Determine starting values for cluster centers mk and initial memberships uki from the FKM algorithm.
    86 CHAPTER 9.UNSUPERVISED CLASSIFICATION 2. Determine the cluster centers mk with (9.9) and the prior probabilities P(Ck) with (9.15). 3. Calculate the weighted covariance matrices Fk with (9.10) and with (9.28) the class membership probabilities uki. Normalize the columns of u. 4. If u has not changed significantly, stop, else go to 2. 9.3.1 Simulated annealing Even with initialization using the fuzzy K-means algorithm the EM algorithm may be trapped in a local optimum. An alternative scheme is to apply so-called simulated annealing. Essentially the initial memberships are random and only gradually are the calculated class memberships allowed to influence the estimation of the class centers [Hil01]. The rate of reduction of randomness is determined by a temperature parameter. For example, the class memberships in (9.28) may replaced by uki → uki(1 − r1/T ) on each iteration, where T is initialized to T0 and reduced at each iteration by a factor c 1: T → cT and where r ∈ (0, 1) is a uniformly distributed random number. As T approaches zero, uki will be determined more and more by the probability distribution parameters alone in (9.28). 9.3.2 Partition density Since the simple cost function E(C) of (9.12) is no longer relevant, we choose with [GG89] the partition density as a criterion for choosing the best number of clusters. The fuzzy hypervolume, defined as FHV = K k=1 |Fk|, is proportional to the volume in feature space occupied by the ellipsoidal clusters generated by the algorithm. For instance, for a two dimensional cluster with an elliptical probability density we have, in its principal axis coordinate system, |Σ| = σ2 1 0 0 σ2 2 = σ1σ2 ≈ area (volume) of the ellipse. Summing the memberships of the observations within one standard deviation of each cluster center, S = n i=1 K k=1 uik, ∀i ∈ {i | (xi − mk) · F−1 k · (xi − mk) 1}, the partition density is defined as PD = S/FHV. (9.29) Assuming that the data consist of ˜K well-separated clusters of approximately multivari- ate normally distributed pixels, the partition density should exhibit a maximum at K = ˜K. An ENVI plug-in for EM clustering is given in Appendix D.6.3.
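A minimal sketch of the membership update (9.28) for a single cluster, assuming x is an [nbands, npixels] data matrix, mk and fk the current mean vector and covariance matrix of the cluster, and pk its prior probability; the constant (2π)^(-N/2) is omitted since the columns of u are normalized afterwards in any case.

function em_membership_sketch, x, mk, fk, pk
   sz = size(x, /dimensions) & nb = sz[0] & n = sz[1]
   fki = invert(double(fk))
   dt = determ(double(fk))
   u = dblarr(n)
   for i = 0L, n-1 do begin
      d = double(x[*,i]) - double(mk)
      ; Mahalanobis distance d^T F^-1 d
      q = 0.0d
      for a = 0, nb-1 do for b = 0, nb-1 do q = q + d[a]*fki[a,b]*d[b]
      u[i] = pk/sqrt(dt) * exp(-0.5d*q)
   endfor
   return, u
end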
    9.3. EM CLUSTERING87 9.3.3 Including spatial information The algorithms described thus far make exclusive use of the spectral properties of the in- dividual observations (pixels). Spatial relationships within an image such as large scale, coherent regions, textures etc. are ignored entirely. The EM algorithm determines the a posteriori class membership probabilities of each observation for the classes in question. In this section we describe a post-processing technique to take account of some of the spatial information implicit in the classified image in order to improve the original classification. This technique makes use of the vectors of a posteriori probabilities associated with each classified pixel. Figure 9.1 shows schematically a single pixel m together with its immediate neighborhood n, which we take to consist of the four pixels above, below, to the left and to the right of m. Let its a posteriori probabilities be Pm(Ck), k = 1 . . . M, M k=1 Pm(Ck) = 1, or, more simply, the vector Pm. 1 2 3 4 C Pixel mC Neighborhood n Figure 9.1: A pixel neighborhood. A possible misclassification of the pixel m could in principle be corrected by examining its neighborhood. The neighboring pixels would have in some way to modify Pm such that the maximal probability corresponds to the true class. We now describe a purely heuristic but nevertheless intuitively satisfying procedure to do just that, the so-called probabilistic label relaxation method [JRR99]. Let Qm(Ck) be a neighborhood function for the mth Pixel, which is supposed to correct Pm(Ck) in the above sense, according to the prescription Pm(Ck) = Pm(Ck)Qm(Ck) k Pm(Ck )Qm(Ck ) , k = 1 . . . M, or, as a vector equation, according to Pm = Pm ⊗ Qm PmQm , (9.30) where ⊗ signifies the Hadamard product, which simply means component-by-component multiplication. The denominator ensures that the result is also a probability, in other words
    88 CHAPTER 9.UNSUPERVISED CLASSIFICATION that M k=1 Pm(Ck) = 1. The neighborhood function must somehow reflect the spatial structure of the image. In order to define it we first postulate a compatibility measure Pmi(Ck|Cl), i = 1 . . . 4, namely, the conditional probability that pixel m belongs to class Ck, given that the neigh- boring pixel i, i = 1 . . . 4, belongs to Cl. A ‘small piece of evidence’ that m should be classified to Ck would then be Pmi(Ck|Cl)Pi(Cl), i = 1 . . . 4, that is, the conditional probability that pixel m is in class Ck if neighboring pixel i is in class Cl, i = 1 . . . 4. We obtain a Neighborhood function Qm(Ck) by summing over all pieces of evidence: Qm(Ck) = 1 4 4 i=1 M l=1 Pmi(Ck|Cl)Pi(Cl) = M l=1 Pmn(Ck|Cl)Pn(Cl), (9.31) where Pn(Cl) is the average over all four neighborhood pixels: Pn(Cl) = 1 4 4 i=1 Pi(Cl), and where Pmn(Ck|Cl) also corresponds to the average compatibility of pixel m with its entire neighborhood. We can write (9.31) again as a vector equation, Qm = Pmn · Pn and (9.30) finally as Pm = Pm ⊗ (PmnPn) PmPmnPn . (9.32) The matrix of average compatibilities Pmn can be estimated directly from the original classified image. A random central pixel m is chosen and its calss Ci determined. Then, again randomly, a pixel j out of its neighborhood its chosen and its class Cj is also determined. Thereupon the matrix element Pmn(Ci|Cj) (which was initialized to 0) is incremented by 1. This is repeated many times and finally the rows of the matrix are normalized. Equation (10.15) is well-suited for a simple algorithm: Algorithm (Probabilistic label relaxation) 1. Carry out a supervised classification, e.g. with a FFN, and determine the com- patibility matrix Pmn.
    9.4. THE KOHONENSELF ORGANIZING MAP 89 2. Determine the average neighborhood vector Pn of all pixels m and replace Pm with Pm according to (9.32). Re-classify pixel m according to the largest mem- bership probability in Pm. 3. If only a few re-classifications took place, stop otherwise go to step 2. The stopping condition in the algorithm is obviously rather arbitrary. Experience shows that the best results are obtained after 3–4 iterations, see [JRR99]. Too many iterations lead to a widening of the effective neighborhood of a pixel to such an extent that fully irrelevant spatial information falsifies the final product. The PLR method can be applied similarly to class probabilities generated by supervised classification algorithms. ENVI plug-ins for probabilistic label relaxation are given in Appendix D.6.4. 9.4 The Kohonen Self Organizing Map The Kohonen self organizing map, a simple example of which is sketched in Fig. 9.2 , belongs to a class of neural networks which are trained by competitive learning, [HKP91, Koh89]. The single layer of neurons can have any geometry, usually one- two- or three-dimensional. The input signal is represented by the vector x = (x1, x2 . . . xN ) . Each input to a neuron is associated with a synaptic weight, so that for M neurons, the synaptic weights can represented as a (M × N) matrix w =     w11 w12 · · · w1N w21 w22 · · · w2N ... ... ... ... wM1 wM2 · · · wMN     . The components of the vector wk = (wk1, wk2 . . . wkN ) are thus the synaptic weights of the kth neuron. We interpret the vectors {x(i)|i = 1 . . . p}. as training data for the neural network. The synaptic weight vectors are to be adjusted so as to reflect in some way the clustering of the training data in the N-dimensional feature space. When a training vector x is presented to the input of the network, the neuron whose weight vector wk lies nearest to x is designated to be the “winner”. Distances are given by (x − wk) (x − wk). Call the winner k∗ . Then its weight vector is shifted a small amount in the direction of the training vector: wk∗ (i + 1) = wk∗ (i) + η(x(i) − wk∗ (i)), where wk∗ (i + 1) is the weight vector after presentation of the ith training vector, see Fig. 9.3. The parameter η is called the learning rate of the network.
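In IDL, one competitive learning step as just described might look like the following sketch (a hypothetical procedure, not an ENVI routine): w holds the synaptic weight vectors with one neuron per row, x is a training vector and eta the current learning rate. The neighborhood-weighted version introduced below simply multiplies the correction for each neuron k by the neighborhood function λ(k∗, k).

pro som_winner_update, w, x, eta
; w:   (M,N) array of synaptic weight vectors, one neuron per row w[k,*]
; x:   (1,N) training vector
; eta: learning rate
   M = n_elements(w[*,0])
; squared Euclidean distances from x to all weight vectors
   d2 = fltarr(M)
   for k=0,M-1 do d2[k] = total((w[k,*] - x)^2)
; the winner is the neuron with the smallest distance
   dmin = min(d2, kstar)
; shift the winner's weight vector a small step toward x
   w[kstar,*] = w[kstar,*] + eta*(x - w[kstar,*])
end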
    90 CHAPTER 9.UNSUPERVISED CLASSIFICATION y Qs u T T T x1 x2 4 0 w 1 2 3 k 16 k∗ Figure 9.2: The Kohonen feature map in two dimensions with a two-dimensional input. The intention is to repeat this learning procedure until the synaptic weight vectors reflect the class structure of the training data, thus achieving a vector quantization of the feature space. In order for this method to function, it is necessary to allow the learning rate to decrease gradually during the training process. A convenient function for this is η(i) = ηmax ηmin ηmax i/p . However the Kohonen feature map goes a step further and tries to map the topology of the feature space onto the network. This is achieved by defining a neighborhood function for the winner neuron on the network of neurons. Usually a Gauss function of the form λ(k∗ , k) = exp(−d2 (k∗ , k)/2σ2 ) is used, where d2 (k∗ , k) is the square of the distance between neurons k∗ and k. For example, for a two-dimensional array of m × m neurons d2 (k∗ , k) =[(k∗ − 1) mod m − (k − 1) mod m]2 + [(k∗ − 1) div m − (k − 1) div m]2 , whereas for a cubic m × m × m array. d2 (k∗ , k) = [(k∗ − 1) mod m − (k − 1) mod m]2 + [((k∗ − 1) div m − (k − 1) div m) mod m]2 + [(k∗ − 1) div m2 − (k − 1) div m2 ]2 . During the learning phase not only the winner neuron, but also the neurons in its “neigh- borhood” are moved in the direction of the training vectors: wk(i + 1) = wk(i) + η(i)λ(k∗ , k)(x(i) − wk(i)), k = 1 . . . M.
    9.5. UNSUPERVISED CLASSIFICATIONOF CHANGES 91 U X x wk∗ (i) x(i) b wk∗ (i + 1) Figure 9.3: Movement of synaptic weight vector in the direction of training vector. Finally, the extent of the neighborhood is allowed to shrink steadily σ(µ) = σmax σmin σmax i/p . Typically, σmax ≈ m/2 and σmin ≈ 1/2. Thus the neighborhood is initially the entire network and toward the end of training it is very localized. For visualization or clustering of multispectral satellite imagery, a cubic network geome- try is useful. After training, the image is classified by associating each pixel vector with the neuron having the closest synaptic weight vector. The pixel is then colored by mapping the position of the neuron in the cube to coordinates in RGB color space. Thus, pixels that are close together in feature space are represented by similar colors. An ENVI plug-in for the Kohonen self organizing map is given in Appendix D.6.5. 9.5 Unsupervised classification of changes We mention finally an extension of the procedure used to determine change/no-change de- cision thresholds discussed in Section 8.4.7. Rather than clustering the MAD change com- ponents individually as was done there, we can use any of the algorithms introduced in this chapter (except the Kohonen SOM) to classify the changes. Because of its ability to accommodate correlated clusters, we prefer the EM algorithm. Clustering of the change pixels can of course be applied in the full MAD or MNF/MAD feature space, where the number of clusters chosen determines the number of change cate- gories. The approximate chi-square distribution of the sum of squares of the standardized variates allows the labelling of pixels with high no-change probability. These can be ex- cluded from the clustering process e.g. by “freezing” their a posteriori probabilities to 1 for the no-change class, thereby speeding up the calculation considerably. Routines for change classification using the EM algorithm are included in the ENVI GUI for viewing change detection images given in Appendix D.6.6.
    Chapter 10 Supervised Classification Thepixel-oriented, supervised classification of multispectral images is a problem of prob- ability density estimation. On the basis of representative training data for each class, the probability distributions for all of the classes are estimated and then used to classify all of the pixels in the image. We will consider three methods or models for supervised classifica- tion: a parametric model (Bayes maximum likelihood), a non-parametric model (Gaussian kernel) and a mixture model (the feed-forward neural network). The basis for all of these classifiers is Bayes’ decision rule, which we consider first. 10.1 Bayes decision rule The a posteriori probabilities for class Ck, Eq. (2.3), can be written for N-diminsional training data and M classes in the form P(Ck|x), k = 1 . . . M, x = (x1 . . . xN ) . (10.1) Let us define a loss function L(Ci, x) which measures the cost of associating the pixel with feature vector x with the class Ci. Let λik be the loss incurred if x in fact belongs to class Ci, but is classified as belonging to class Ck. We can reasonably assume λik = 0 if i = k 0 otherwise, i, k = 1 . . . M, (10.2) that is, a correct classification incurs no loss. We can now express the loss function as a sum over the individual losses, weighted according to (10.1): L(Ci, x) = M k=1 λikP(Ck|x). (10.3) Without further specifying λik, we can define a loss-minimizing decision rule for our classi- fication as x ∈ Ci provided L(Ci, x) L(Cj, x) for all j = 1 . . . M, j = i. (10.4) Up till now we’ve been completely general. Now suppose the losses are independent of the kind of misclassification that occurs (for instance, the classification of a ‘forest’ pixel into 93
    94 CHAPTER 10.SUPERVISED CLASSIFICATION the the class ‘meadow’ is just as bad as classifying it as ‘urban area’, etc). The we can write λik = 1 − δik, . Thus any given misclassification (i = k) costs unity, and a correct classification (i = k) costs nothing. We then obtain from (10.3) L(Ci, x) = M k=1 P(Ck|x) − P(Ci|x) = 1 − P(Ci|x), i = 1 . . . M, (10.5) and from (10.4) the Bayes’ decision rule x ∈ Ci if P(Ci|x) P(Cj|x) for all j = 1 . . . M, j = i. (10.6) 10.2 Training data The selection of representative training data is the most difficult and critical part of the classification process. The standard procedure is to select training areas within the image which are representative of each class of interest. In the ENVI environment, these are entered as regions of interest (ROI’s), from which the training pixel vectors are generated. Note that some fraction of the representative data must be withheld for later accuracy assessment. These are the so-called test data, which are not used for training purposes in order not to bias the accuracy assessment. We’ll discuss their use in detail in later in this chapter. Suppose there are just two classes, that is M = 2. If we apply decision rule (10.6) to some measured pixel vector x, the probability of incorrectly classifying the pixel is r(x) = min[P(C1|x), P(C2|x)]. The Bayes error is defined to be the average of r(x) over all pixels, = r(x)p(x)dx = min[P(C1|x), P(C2|x)]p(x)dx = min[P(x|C1)P(C1), P(x|C2)P(C2)]dx, where we used Bayes rule in the last step. We can use the Bayes error as a measure of the separability of the two classes, the smaller the error, the better the separability. Calculating the Bayes error is difficult, but we can at least get an approximate upper bound as follows. First note that, for any a, b ≥ 0, min[a, b] ≤ aS b1−S , 0 ≤ S ≤ 1. For example, if a b, then the inequality can be written a ≤ a b a 1−S which is clearly true. Applying the inequality to the expression for the Bayes error, we get the so-called Chernoff bound ≤ u = P(C1)S P(C2)1−S P(x|C1)S P(x|C2)1−S dx.
    10.3. BAYES MAXIMUMLIKELIHOOD CLASSIFICATION 95 The best upper bound is then determined by minimizing u with respect to S. If we assume that P(x|C1) and P(x|C2) are normal distributions with Σ1 = Σ2, then the minimum occurs at S = 1/2. We get the Bhattacharyya bound B by using S = 1/2 also for the case where Σ1 = Σ2: B = P(C1)P(C2) P(x|C1)P(x|C2) dx. This integral can be evaluated explicitly. The result is B = P(C1)P(C2)e−B , where B is the Bhattacharyya distance given by B = 1 8 (µ2 − µ1) Σ1 + Σ2 2 −1 (µ2 − µ1) + 1 2 log Σ1 + Σ2 2 |Σ1||Σ2| . The first term is an “average Mahalinobis distance” (see below), the second term depends on the difference between the covariance matrices of the two classes. It vanishes when Σ1 = Σ2. Thus the first term gives the class separability due due the “distance” between the class means, while the second term gives the separability due to the difference in the covariance matrices. Finally, the Jeffries-Matusita distance measures separability of two classes on a scale [0 − 2] in terms of B: J = 2(1 − e−B ). (10.7) The ENVI menu command Basic Tools/Region of Interest/Compute ROI Separability calculates Jeffries-Matusita distances between all pairs of classes defined by a given set of ROIs. 10.3 Bayes Maximum likelihood classification Consider again Bayes’ rule: P(Ci|x) = P(x|Ci)P(Ci) P(x) where P(Ci), i = 1 . . . M, are a priori probabilities and where P(x) is given by P(x) = M j=1 P(x|Cj)P(Cj). Since P(x) is independent of i, we can write the decision rule (10.6) as x ∈ Ci if P(x|Ci)P(Ci) P(x|Cj)P(Cj) for all j = 1 . . . M, j = i. Now we make two simplifying assumptions: first, that all the a priori probabilities are equal, second, that the measured feature vectors from class Ci have been sampled from a multivariate normal distribution, that is, that they satisfy P(x|Ci) = 1 (2π)N/2|Σi|1/2 exp − 1 2 (x − µi) Σ−1 i (x − µi) . (10.8)
    96 CHAPTER 10.SUPERVISED CLASSIFICATION According to the first assumption, we only need to associate x to that class Ci which maximizes P(x|Ci): x ∈ Ci if P(x|Ci) P(x|Cj) for all j = 1 . . . M, j = i. (10.9) Taking the logarithm of (10.8) gives log P(x|Ci) = − N 2 log(2π) − 1 2 log |Σi| − 1 2 (x − µi) Σ−1 i (x − µi) and we can ignore the first term, as it is independent of i. With the definition of a discrim- inant function di(x), di(x) = − log |Σi| − (x − µi) Σ−1 i (x − µi), (10.10) we obtain finally the Bayes maximum-likelihood classifier: x ∈ Ci if di(x) dj(x) for all j = 1 . . . M, j = i. (10.11) The expression (x−µi) Σ−1 i (x−µi) in (6.10) is referred to as the Mahalanobis distance. The moments of the distributions for the M classes, µi and Σi, which appear in the discriminant function (10.10), may be estimated from the training data using the maximum likelihood estimates: µi ≈ mi = 1 ni x∈Ci x Σi ≈ Fi = 1 ni x∈Ci (x − µi)(x − µi) , where ni is the number of training pixels in class Ci. The maximum likelihood classification algorithm can be called from the ENVI main menu with Classification/Supervised/Maximum Likelihood. 10.4 Non-parametric methods In non-parametric density estimation we wish to model the probability distribution gener- ated by a given set of training data, without making any prior assumption about the form of the distribution function. An example is the class of kernel based methods. Here each data point is used as the center of a simple local probability density and the overall distribution is taken to be the sum of the local distributions. In N dimensions, we can model the class probability distribution as P(y|Ci) ≈ 1 ni x∈Ci 1 (2πσ2)N/2 e(y−x) (y−x)/2σ2 . The quantity σ is a smoothing parameter which we can choose for example by minimizing the misclassifications on the training data themselves with respect to σ. The kernel based method suffers from the drawback of requiring all training data points to be stored. This makes the evaluation of the density very slow if the number of training pixels is large. In general, the complexity grows with the amount of data, not with the difficulty of the estimation problem itself.
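For illustration, the Gaussian kernel estimate of a class-conditional density can be coded directly. The following IDL function is only a sketch (a hypothetical helper, not an ENVI routine); each training vector x contributes a term exp(−(y−x)⊤(y−x)/2σ²), and classification then proceeds by assigning y to the class which maximizes the estimated P(y|Ci), or P(y|Ci)P(Ci) if the priors differ.

function kernel_density, y, Xi, sigma
; y:     observation vector with N components
; Xi:    (ni,N) array of training vectors for class Ci, one per row
; sigma: smoothing parameter
; returns the kernel estimate of P(y|Ci)
   ni = n_elements(Xi[*,0])
   N  = n_elements(Xi[0,*])
   p  = 0.0
   for j=0,ni-1 do p = p + exp(-total((y - Xi[j,*])^2)/(2*sigma^2))
   return, p/(ni*(2*!pi*sigma^2)^(N/2.0))
end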
    10.5. NEURAL NETWORKS97 10.5 Neural networks Neural networks belong to the category of mixture models for probability density estimation, which lie somewhere between the parametric and non-parametric extremes. They make no assumption about the functional form of the probabilities and can be adjusted flexibly to the complexity of the system that they are being used to model. To motivate their use for classification, consider two classes C1 and C2 in a two-dimensional feature sspace. We could write (10.11) in the form of a discriminant m(x) = d1(x) − d2(x) and say that x is C1 if m(x) 0 C2 if m(x) 0. A much simpler discriminant is the linear function: m(x) = w0w1x1 + w2x2 = w0 + w x, (10.12) where w = (w1, w2) and w0 are parameters. The decision boundary occurs for m(x) = 0, i.e. for x2 = − w1 w2 x1 − w0 w2 , see Figure 10.1 u u u e u u u ee e e e e e e u e e u u u m(x) = 0 −w0 w2 −w1 w2 Figure 10.1: A linear discriminant for two classes. In N dimensions m(x) = w0 + w1x1 + . . . + wN xN = w x + w0 and we can represent this equation schematically as an artificial neuron as in shown Figure 10.2. The parameters wi are referred to as synaptic weights. Usually, the output m(x) is modified by a so-called sigmoid activation function, for example the logistic function g(x) = 1 1 + e−I(x) ,
    98 CHAPTER 10.SUPERVISED CLASSIFICATION 1 i N ... ... ~ q X b E E E E E 01 x1 xi xN m(x) w0 w1 wi wN Figure 10.2: An artificial neuron. The first input is always unity and is called the bias. where I(x) = w x + w0. This is sometimes justified by the analogy to biological neurons. In IDL (see Figure 10.3): thisDevice =!D.Name set_plot, ’PS’ Device, Filename=’c:templogistic.eps’,xsize=15,ysize=10,/Encapsulated x=(findgen(100)-50)/10 plot, x,1/(1+exp(-x)) device,/close_file set_plot,thisDevice Figure 10.3: The logistic activation function.
    10.5. NEURAL NETWORKS99 There is also a statistical justification, however [Bis95]. Suppose two classes in two- dimensional feature space are normally distributed with Σ1 = Σ2 = I, P(x|Ck) ∼ 1 2π exp( −|x − µk|2 2 ), k = 1, 2. Then we have P(C1|x) = P(x|C1)P(C1) P(x|C1)P(C1) + P(x|C2)P(C2) = 1 1 + P(x|C2)P(C2)/P(x|C1)P(C1) = 1 1 + exp(−1 2 [(x − µ2)2 − (x − µ1)2])(P(C2)/P(C1)) . With the substitution e−a = (P(C2)/P(C1)) we get P(C1|x) = 1 1 + exp(−1 2 [|x − µ2|2 − |x − µ1|2] − a) = 1 1 + exp(−w x − w0) = 1 1 + e−I(x) = m(x). Here we made the additional substitutions w = µ1 − µ2 w0 = − 1 2 |µ1|2 + 1 2 |µ2|2 + a. Thus we expect that the output of the neuron will not only discriminate between the two classes, but also that it will approximate the posterior class membership probability P(C1|x). 10.5.1 The feed-forward network In order to discriminate any number of classes, multilayer feed-forward networks are often used, see Figure 10.4. In this figure, the input signal is the N + 1-component vector x( ) = (1, x1( ) . . . xN ( )) for training sample , which is fed simultaneously to the so-called hidden layer consisting of L neurons. These in turn determine the L + 1-component vector n(x) = (1, n1(x) . . . nL(x)) according to nj(x) = g(Ih j (x)), j = 1 . . . L, with Ih j (x) = wh j x, where wh is the hidden weight vector for the jth neuron wh = (wh 0 , wh 1 . . . wh L) .
    100 CHAPTER 10.SUPERVISED CLASSIFICATION ! # ! # ! # ! # ! # 1 i N j ... ... ~ q X b E E E E 01 x1( ) xi( ) xN ( ) ! # ! # 1 L B w s 0 E ... ... E ! # k ! # ! # 1 M ... ... ! # … w q E ‚ … b E ‚ 0 b E 0 m1( ) mk( ) mM ( ) Wh WoE 1 n1 nj nL E E E Figure 10.4: A two-layer, feed-forward neural network with L hidden neurons for classifica- tion of N-dimensional data into M classes. In terms of the weight matrices Wh = (wh 1 , wh 2 , . . . wh L), Wo = (wo 1, wo 2, . . . wo M ), we can write this compactly as n = 1 g(Wh x) . The vector n is then fed to the output layer. If we interpret the outputs as probabilities, then we must ensure that 0 ≤ mk ≤ 1, k = 1 . . . M, and, furthermore, that M k=1 mk = 1. This can be done by using a modified logistic activation function for the output neurons, called softmax: mk(n) = eIo k (n) eIo 1 (n) + eIo 2 (n) + . . . + eIo M (n) , where Io k(n) = wo k n.
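As a small sketch, the softmax outputs can be computed with a two-line IDL function (the same construction appears in the forwardPass method of the FFN class listed later); subtracting the largest activation before exponentiating avoids floating point overflow and does not change the result.

function softmax, Io
; Io: vector of output neuron activations Io_k(n)
; returns the vector of softmax probabilities m_k(n), which sum to one
   A = exp(Io - max(Io))
   return, A/total(A)
end

For example, print, softmax([2.0, 1.0, 0.1]) prints three probabilities summing to one.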
    10.5. NEURAL NETWORKS101 To quote Bishop [Bis95], “ ... such networks can approximate arbitrarily well any functional continu- ous mapping from one finite dimensional space to another, provided the number L of hidden units is sufficiently large. An important corollary of this result is, that in the context of a classification problem, networks with sigmoidal non- linearities and two layers of weights can approximate any decision boundary to arbitrary accuracy. More generally, the capability of such networks to approx- imate general smooth functions allows them to model posterior probabilities of class membership.” 10.5.2 Cost functions We haven’t yet considered the correct choice of synaptic weights. This procedure is called training the network. The training data can be represented as the set of labelled pairs {(x( ), y( )) | = 1 . . . p}, where y( ) = (0, 0 . . . 0, 1, 0 . . . 0) is an M-dimensional vector of zeroes, with a “1” at the kth position to indicate that x( ) belongs to class Ck. An intuitive training criterion is then the quadratic cost function E(Wh , Wo ) = 1 2 p =1 y( ) − m( ) 2 . (10.13) We must adjust the network weights so as to minimize E. Equivalently we can minimize the local cost functions E( ) := 1 2 y( ) − m( ) 2 , = 1 . . . p. (10.14) An alternative cost function can be obtained with the following argument: Choose the synaptic weights so as to maximize the probability of observing the training data: P(x( ), y( )) = P(y( ) | x( ))P(x( )) → max . The neural network predicts the posterior class membership probability, which we can write as P(y( ) | x( )) = M k=1 [ mk(x( )) ]yk( ) . For example: P((1, 0 . . . 0) |x) = m1(x)1 m2(x)0 smM (x)0 = m1(x). Therefore we wish to maximize M k=1 [ mk(x(i)) ]yk(i) P(x( ))
    102 CHAPTER 10.SUPERVISED CLASSIFICATION Taking logarithms, dropping terms which are independent of the synaptic weights and sum- ming over all of the training data, we see that this is equivalent to minimizing the cross entropy cost function E(Wh , Wo ) = − p =1 M k=1 yk( ) log mk(x( )) (10.15) with respect to the synaptic weight parameters. 10.5.3 Training Let w be the vector of all synaptic weights, i.e. E(Wh , Wo ) =: E(w) In one dimension, expanding in a Taylor series about a local minimum w∗ , E(w) = E(w∗ ) + (w − w∗ ) dE(w∗ ) dw + 1 2 (w − w∗ )2 d2 E(w∗ ) dw2 + . . . = E0 + 1 2 (w − w∗ )2 H + . . . , where H = d2 E(w∗ ) dw2 and we must have H 0 for a minimum, see Figure 10.5 w∗ w E(w) dE(w∗ ) dw = 0 Figure 10.5: Minimization of the cost function in one dimension. In many dimensions, we get the analogous expression E(w) = E0 + (w − w∗ ) H(w − w∗ ) + . . . where the matrix H is called the Hessian, Hij = ∂2 E(w∗ ) ∂wi∂wj . (10.16)
    10.5. NEURAL NETWORKS103 It is symmetric and it must be positive definite for a local minimum. It is positive definite if all of its eigenvalues are positive, see Appendix C. A local minimum can be found with various search algorithms. Backpropagation is the most well-known and extensively used method and is described below. It is used in the stan- dard ENVI neural network for supervised classification. However much better algorithms exist, such as scaled conjugate gradient or Kalman filter. These are discussed in detail in Appendix C. ENVI plug-ins for supervised classification with a feed forward neu- ral network trained with conjugate gradient and a fast Kalman filter algorithm are given in Appendices D.7 and D.8. 10.5.4 Backpropagation We will develop a training algorithm for the two-layer, feed-forward neural network of Figure 10.4. Our starting point is the local version of the cost function (10.15), E( ) = − M k=1 yk( ) log mk( ), = 1 . . . p, (10.17) or, in vector form E( ) = −y ( ) log m( ), which we wish to minimize with respect to the synaptic weights represented by the (N+1)×L matrix Wh = (wh 1 , wh 2 , . . . wh L) and the (L + 1) × M matrix Wo = (wo 1, wo 2, . . . wo M ). The following IDL object class FFN mirrors the network architecture of Figure 10.4 and will form the basis for the implementation of the training algorithms developed here and in Appendix C: ;+ ; NAME: ; FFN__DEFINE ; PURPOSE: ; Object class for implementation of a two-layer, feed-forward ; neural network for classification of multi-spectral images. ; This is a generic class with no training methods. ; Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999 ; AUTHOR ; Mort Canty (2005) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; ffn = Obj_New(FFN,Xs,Ys,L) ; ARGUMENTS: ; Xs: array of observation column vectors ; Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T ; L: number of hidden neurons ; KEYWORDS ; None ; METHODS (external): ; FORWARDPASS: propagate a biased input column vector through the network
    104 CHAPTER 10.SUPERVISED CLASSIFICATION ; returns the softmax probabilities vector ; m = ffn - ForwardPass() ; CLASS: return the class for an for an array of observation column vectors X ; return the class probabilities in array variable PROBS ; c = ffn - Class(X,Probs) ; COST: return the current cross entropy ; c = ffn - Cost() ; DEPENDENCIES: ; None ;-------------------------------------------------------------- Function FFN::Init, Xs, Ys, L catch, theError if theError ne 0 then begin catch, /cancel ok = dialog_message(!Error_State.Msg + ’ Returning...’, /error) return, 0 endif ; network architecture self.LL = L self.p = n_elements(Xs[*,0]) self.NN = n_elements(Xs[0,*]) self.MM = n_elements(Ys[0,*]) ; biased output vector from hidden layer (column vector) self.N= ptr_new(fltarr(L+1)) ; biased exemplars (column vectors) self.Xs = ptr_new([[fltarr(self.p)+1],[Xs]]) self.Ys = ptr_new(Ys) ; weight matrices (each column is a neuron weight vector) self.Wh = ptr_new(randomu(seed,L,self.NN+1)-0.5) self.Wo = ptr_new(randomu(seed,self.MM,L+1)-0.5) return,1 End Pro FFN::Cleanup ptr_free, self.Xs ptr_free, self.Ys ptr_free, self.Wh ptr_free, self.Wo ptr_free, self.N End Function FFN::forwardPass, x ; logistic activation for hidden neurons, N set as side effect *self.N = [[1],[1/(1+exp(-transpose(*self.Wh)##x))]] ; softmax activation for output neurons I = transpose(*self.Wo)##*self.N A = exp(I-max(I)) return, A/total(A)
    10.5. NEURAL NETWORKS105 End Function FFN:: class, X, Probs ; vectorized class membership probabilities nx = n_elements(X[*,0]) Ones = fltarr(nx) + 1.0 N = [[Ones],[1/(1+exp(-transpose(*self.Wh)##[[Ones],[X]]))]] Io = transpose(*self.Wo)##N maxIo = max(Io,dimension=2) for k=0,self.MM-1 do Io[*,k]=Io[*,k]-maxIo A = exp(Io) sum = total(A,2) Probs = fltarr(nx,self.MM) for k=0,self.MM-1 do Probs[*,k] = A[*,k]/sum ; vectorized class memberships maxM = max(Probs,dimension=2) M=fltarr(self.MM,nx) for i=0,self.MM-1 do M[i,*]=Probs[*,i]-maxM return, byte((where(M eq 0.0) mod self.MM)+1) End Function FFN:: cost Ones = fltarr(self.p) + 1.0 N = [[Ones],[1/(1+exp(-transpose(*self.Wh)##[*self.Xs]))]] Io = transpose(*self.Wo)##N maxIo = max(Io,dimension=2) for k=0,self.MM-1 do Io[*,k]=Io[*,k]-maxIo A = exp(Io) sum = total(A,2) Ms = fltarr(self.p,self.MM) for k=0,self.MM-1 do Ms[*,k] = A[*,k]/sum return, -total((*self.Ys)*alog(Ms)) End Pro FFN__Define struct = { FFN, $ NN: 0L, $ ;input dimension LL: 0L, $ ;number of hidden units MM: 0L, $ ;output dimension Wh:ptr_new(), $ ;hidden weights Wo:ptr_new(), $ ;output weights Xs:ptr_new(), $ ;training pairs Ys:ptr_new(), $ N:ptr_new(), $ ;output vector from hidden layer p: 0L $ ;number of training pairs } End
    106 CHAPTER 10.SUPERVISED CLASSIFICATION Consider the following algorithm: Algorithm (Backpropagation or Generalized Least Mean Square) 1. Initialize the synaptic weights with random numbers and set = 1. 2. Choose training pair (x( ), y( )) and determine the output response m( ) of the net- work. 3. For k = 1 . . . M and j = 0 . . . L replace wo jk with wo jk − η ∂E( ) ∂wo jk . 4. For j = 1 . . . L and i = 0 . . . N replace wh ij with wh ij − η ∂E( ) ∂wh ij . 5. If E(Wh , Wo ) is sufficiently small, stop, otherwise set = mod p + 1 and go to 2. Thus we keep cycling through the training data, reducing the local cost function at each step by changing each synaptic weight by an amount proportional to the negative slope of the cost function with respect to that weight parameter, stopping when the overall cost function (10.15) is small enough. The constant of proportionality η is referred to as the learning rate for the network. This algorithm makes use only of the first derivatives of the cost function with respect to the synaptic weight parameters and is referred to as the backpropagation method. In order to implement this procedure, we require the partial derivatives of E( ) with respect to the synaptic weights. Let us begin with the output neurons, for which we have the softmax output signals mk( ) = eIo k ( ) eIo 1 ( ) + eIo 2 ( ) + . . . + eIo M ( ) , (10.18) where the activation of the kth neuron is Io k( ) = wo k n( ). We wish to determine ∂E( ) ∂wo jk , j = 0 . . . L, k = 1 . . . M. Applying the chain rule ∂E( ) ∂wo k = ∂E( ) ∂Io k( ) ∂Io k( ) ∂wo k = −δo k( )n( ), k = 1 . . . M, (10.19) where the quantity δo k( ) is defined as δo k( ) = − ∂E( ) ∂Io k( ) and is the negative rate of change of the local cost function with respect to the activation of the kth output neuron. Again applying the chain rule and with (10.16) and (10.18), ∂E( ) ∂Io k( ) = M k =1 ∂E( ) ∂mk ( ) ∂mk ( ) ∂Io k( ) = M k =1 − yk ( ) mk ( ) eIo k ( ) δkk M k =1 eIo k ( ) − eIo k ( ) eIo k ( ) ( M k =1 eIo k ( ) )2 .
    10.5. NEURAL NETWORKS107 Here, δkk is the Kronecker symbol δkk = 0 if k = k 1 if k = k . Continuing, ∂E( ) ∂Io k( ) = M k =1 − yk ( ) mk ( ) mk( )(δkk − mk ( )) = −yk( ) + mk( ) M k =1 yk ( ). But the sum over M is just one, and we have ∂E( ) ∂Io k( ) = −(yk( ) − mk( ) = −δo k( ), and therefore, with δo ( ) = (δo 1( ), . . . δo M ( )) , δo ( ) = y( ) − m( ). (10.20) Thus from (10.19) the third step in the backpropagation algorithm can be written in the form Wo ( + 1) → Wo ( ) + η n( )δo ( ). (10.21) Note that the second term on the right hand side of (10.21) is an outer product, giving a matrix of dimension (L + 1) × M matching that of Wo . For the hidden weights, step 4 of the algorithm, we proceed similarly: ∂E( ) ∂Wh j = ∂E( ) ∂Ih j ( ) ∂Ih j ( ) ∂Wh j = −δh j ( )x( ), j = 1 . . . L, (10.22) where δh j ( ) is the negative rate of change of the local cost function with respect to the activation of the jth hidden neuron: δh j ( ) = − ∂E( ) ∂Ih j ( ) . Applying again the chain rule: δh j ( ) = − M k=1 ∂E( ) ∂I0 k( ) ∂Io k( ) ∂Ih j ( ) = M k=1 δo k( ) ∂Io k( ) ∂Ih j ( ) = M k=1 δo k( ) ∂wo k n( ) ∂Ih j ( ) . Since only the output of the jth hidden neuron is a function of Ih j ( ) = wh j x( ), we have δh j ( ) = M k=1 δo k( )wo jk ∂nj( ) ∂Ih j ( ) . The hidden units use the logistic activation function nj(Ih j ) = 1 1 + e−Ih j
    108 CHAPTER 10.SUPERVISED CLASSIFICATION for which dnj dx = n(x)(1 − n(x)). Therefore we can write δh j ( ) = M k=1 δo k( )wo jknj( )(1 − nj( )), or, more compactly, δh j ( ) = (wo j δo ( )) nj( )(1 − nj( )). More compactly still, we can write 0 δh ( ) = n( ) ⊗ (1 − n( )) ⊗ Wo δo ( ) . (10.23) Note that the fact that 1 − n0( ) = 0 is made explicit in the above expression. Equation (10.23) is the origin of the term “backpropagation”, since it propagates the output error δo backwards through the network to determine the hidden unit error δh . Finally, with (10.22) we obtain the update rule for step 4 of the backpropagation algo- rithm, Wh ( + 1) → Wh ( ) + η x( )δh ( ). (10.24) The choice of an appropriate learning rate η is problematic: small values imply slow convergence and large values produce oscillation. Some improvement can be achieved with an additional parameter called momentum. We replace (10.21) with Wo ( + 1) := Wo ( ) + ∆o ( ) + α∆o ( − 1), (10.25) where ∆o ( ) = η n( )δo ( ), and α is the momentum parameter. A similar expression replaces (10.24). Typical choices for the backpropagation parameters are η = 0.01 and α = 0.5. Here is an object class extending FFN which implements backpropagation: ;+ ; NAME: ; FFNBP__DEFINE ; PURPOSE: ; Object class for implementation of a two-layer, feed-forward ; neural network for classification of multi-spectral images. ; Implements ordinary backpropagation training. ; Extends the class FFN ; Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999 ; AUTHOR ; Mort Canty (2005) ; Juelich Research Center ; m.canty@fz-juelich.de
    10.5. NEURAL NETWORKS109 ; CALLING SEQUENCE: ; ffn = Obj_New(FFNBP,Xs,Ys,L) ; ARGUMENTS: ; Xs: array of observation column vectors ; Xs: array of class label column vectors of form (0,0,1,0,0,...0)^T ; L: number of hidden neurons ; KEYWORDS ; None ; METHODS: ; TRAIN: train the network ; ffn - train ; DEPENDENCIES: ; FFN__DEFINE ; PROGRESSBAR (FSC_COLOR) ;-------------------------------------------------------------- Function FFNBP::Init, Xs, Ys, L catch, theError if theError ne 0 then begin catch, /cancel ok = dialog_message(!Error_State.Msg + ’ Returning...’, /error) return, 0 endif ; initialize the superclass if not self-FFN::Init(Xs, Ys, L) then return, 0 self.iterations = 10*self.p self.cost_array = ptr_new(fltarr((self.iterations+100)/100)) return, 1 End Pro FFNBP::Cleanup ptr_free, self.cost_array self-FFN::Cleanup End Pro FFNBP::Train iter = 0L iter100 = 0L eta = 0.01 ; learn rate alpha = 0.5 ; momentum progressbar = Obj_New(’progressbar’, Color=’blue’, Text=’0’,$ title=’Training: exemplar number...’,xsize=250,ysize=20) progressbar-start window,12,xsize=400,ysize=400,title=’Cost Function’ wset,12 inc_o1 = 0 inc_h1 = 0 repeat begin if progressbar-CheckCancel() then begin print,’Training interrupted’
    110 CHAPTER 10.SUPERVISED CLASSIFICATION progressbar-Destroy return endif ; select exemplar pair at random ell = long(self.p*randomu(seed)) x=(*self.Xs)[ell,*] y=(*self.Ys)[ell,*] ; send it through the network m=self-forwardPass(x) ; determine the deltas d_o = y - m d_h = (*self.N*(1-*self.N)*(*self.Wo##d_o))[1:self.LL] ; d_h is now a row vector ; update the synaptic weights inc_o = eta*(*self.N##transpose(d_o)) inc_h = eta*(x##d_h) *self.Wo = *self.Wo + inc_o + alpha*inc_o1 *self.Wh = *self.Wh + inc_h + alpha*inc_h1 inc_o1 = inc_o inc_h1 = inc_h ; record cost history if iter mod 100 eq 0 then begin (*self.cost_array)[iter100]=alog10(self-cost()) iter100 = iter100+1 progressbar-Update,iter*100/self.iterations,text=strtrim(iter,2) plot,*self.cost_array,xrange=[0,iter100],color=0,background=’FFFFFF’XL,$ xtitle=’Iterations/100)’,ytitle=’log(cross entropy)’ end iter=iter+1 endrep until iter eq self.iterations progressbar-destroy End Pro FFNBP__Define struct = { FFNBP, $ cost_array: ptr_new(), $ iterations: 0L, $ Inherits FFN $ } End In the Train method, the training pairs are chosen at random, rather than cyclically as indicated in the backpropagation Algorithm. 10.6 Evaluation The rate of misclassification offers us a reasonable and obvious basis not only for evaluating the quality of classifiers, but also for their comparison, for example to compare the feed- forward network with Bayes maximum-likelihood. We shall characterize this rate in the following with the parameter θ. Through classification of test data which have not been
    10.6. EVALUATION 111 usedfor training, we can obtain unbiased estimates of θ. If, for n test data, y are found to have been misclassified, then an intuitive value for this estimate is θ ≈ y n =: ˆθ. (10.26) However the estimated misclassification rates alone are insufficient for model comparison. We require their uncertainties as well. 10.6.1 Standard deviation of misclassification The classification of a single test datum is a random experiment, whose possible result we can characterize as the set { ¯A, A}: ¯A= misclassified, A = correctly classified. We define a real-valued function on this set, i.e. a random variable X( ¯A) = 1, X(A) = 0, (10.27) with probabilities P(X = 1) = θ = 1 − P(X = 0). The expectation value of this random variable is X = 1θ + 0(1 − θ) = θ (10.28) and its variance is var(X) = X2 − X 2 = 12 θ + 02 (1 − θ) − θ2 = θ(1 − θ). (10.29) For the classification of n test data, denoted by random variables X1 . . . Xn, the random variable Y = X1 + X2 + . . . Xn (10.30) is clearly the associated number of misclassifications. Since Y = X1 + . . . + Xn = nθ we obtain ˆθ = 1 n ˆY = y n (10.31) as an unbiased estimate of the rate θ of misclassifications. From the independence of the Xi, i = 1 . . . n, the variance of Y is given by var(Y ) = var(X1) + . . . + var(Xn) = nθ(1 − θ), and the variance of the misclassification rate is var Y n = Y 2 n2 − Y n 2 = 1 n2 ( Y 2 − Y 2 ) = 1 n2 var(Y ), or var Y n = θ(1 − θ) n . (10.32) For y observed misclassifications we estimate θ with (10.31). Then the estimated variance is given by ˆvar Y n ≈ ˆθ(1 − ˆθ) n = y n 1 − y n n = y(n − y) n3 ,
    112 CHAPTER 10.SUPERVISED CLASSIFICATION and the estimated standard deviation by ˆσ = y(n − y) n3 . (10.33) The random variable Y is binomially distributed. However for a sufficiently large number n of test data, the binomial distribution is well-approximated by the normal distribution. Mean and standard deviation are then sufficient to characterize the distribution function completely. 10.6.2 Model comparison A typical value for a misclassification rate is around θ ≈ 0.5. In order to claim that two values differ from one another significantly, they should lie at least about two standard deviations apart. If we wish to discriminate values separated by say 0.01, then ˆσ should be no greater than 0.005. From (10.32) this means 0.0052 ≈ 0.05(1 − 0.05) n , or n ≈ 2000. That’s quite a few. However since we are dealing with pixel data, such a number of test pixels – assuming sufficient training areas are available – is quite realistic. If training and test data are in fact at a premium, there exist efficient alternatives1 to the simple train-and-test philosophy presented here. However, since they are generally quite computer-intensive, we won’t consider them further. In order to express the claim that classifier A is better than classifier B more precisely, we can formulate an hypothesis test. The individual misclassification rates are approximately normally distributed. If they are also independent we can construct a test statistic S given by S = YA/n − YB/n + θA − θB var(YA/n − YB/n) = YA/n − YB/n + θA − θB var(YA/n) + var(YB/n) . We can then use S to decide between the null hypothesis H0 : θA = θB, i.e., the two classifiers are equivalent, and the alternative hypothesis H1 : θA θB or θA θB, i.e. one of the two methods is better. Thus under H0 we have S ∼ N(0, 1). We choose a decision threshold ±Zα/2 which cor- responds to a probability α of an error of the first kind. With this probability the null hypothesis will be rejected although it is in fact true, see Figure 10.6. In fact the strict independence of the misclassification rates θA and θB is not given, since they are determined with the same set of test data. The above hypothesis test with the statistic S is therefore too conservative. For dependence we have namely var(YA/n − YB/n) = var(YA/n) + var(YB/n) − 2cov(YA/n, YB/n), 1The buzz-words here are Cross-Validation and Bootstrapping, see [WK91], Chapter 2, for an excellent introduction.
    10.6. EVALUATION 113 φ(ˆS) ˆS Zα/2 acceptance region © α/2 E' w −Zα/2 α/2 w Figure 10.6: Acceptance region for the first hypothesis test. If −Zα/2 ≤ ˆS ≤ Zα/2, the null hypothesis is accepted, otherwise it is rejected. where the covariance term cov(YA/n, YB/n) is positive. The test statistic S is correspond- ingly underestimated. We can formulate a non-parametric hypothesis test which avoids this problem of depen- dence. We distinguish the following events for classification of the test data: ¯AB, A ¯B, ¯A ¯B und AB. The variable ¯AB is the event test observation is misclassified by A and correctly classified by B, while A ¯B is the event test observation is correctly classified by A and misclassified by B and so on. As before we define random variables: X ¯AB, XA ¯B, X ¯A ¯B and XAB where X ¯AB( ¯AB) = 1, X ¯AB(A ¯B) = X ¯AB( ¯A ¯B) = X ¯AB(AB) = 0, with probabilities P(X ¯AB = 1) = θ ¯AB = 1 − P(X ¯AB = 0). Corresponding definitions are made for XA ¯B, X ¯A ¯B and XAB. Now, in comparing the two classifiers we are interested in the events ¯AB and A ¯B. If the number of former is significantly smaller than the number of the latter, then A is better than B and vice versa. Events ¯A ¯B in which both methods perform poorly are excluded. For n test observations the random variables Y ¯AB = X ¯AB1 + . . . X ¯ABn and YA ¯B = XA ¯B1 + . . . XA ¯Bn are the frequencies of the respective events. We then have Y ¯AB = nθ ¯AB, var(Y ¯AB) = nθ ¯AB(1 − θ ¯AB) YA ¯B = nθA ¯B, var(YA ¯B) = nθA ¯B(1 − θA ¯B).
    114 CHAPTER 10.SUPERVISED CLASSIFICATION We expect that θ ¯AB 1, that is, var(Y ¯AB) ≈ nθ ¯AB = Y ¯AB . The same goes for YA ¯B. For a sufficiently large number of test observationss, the random variables Y ¯AB − Y ¯AB Y ¯AB and YA¯B − YA¯B YA¯B are thus approximately standard normally distributed. Under the null hypothesis (equivalence of the two classifiers), the expectation values of Y ¯AB and YA ¯B satisfy Y ¯AB = YA ¯B =: Y . Therefore we form the test statistic S = (Y ¯AB − Y )2 Y + (YA ¯B − Y )2 Y . This statistic, being the sum squares of approximately normally distributed random vari- ables, is chi-square distributed, see Chapter 2. Let y ¯AB and yA ¯B be the number of events actually measured. Then we estimate Y as ˆY = y ¯AB + yA ¯B 2 and determine our test statistic as ˆS = (y ¯AB − y ¯AB+yA ¯B 2 )2 y ¯AB+yA ¯B 2 + (yA ¯B − y ¯AB+yA ¯B 2 )2 y ¯AB+yA ¯B 2 . With a little algebra we get ˆS = (y ¯AB − yA ¯B)2 y ¯AB + yA ¯B , (10.34) the so-called McNemar statistic. It is chi-square distributed with one degree of freedom, see for example [Sie65]. A so-called continuity correction is usually made to (10.34) and S written as ˆS = (|y ¯AB − yA ¯B| − 1)2 y ¯AB + yA ¯B . But there are still reservations! We can only conclude that one classifier is or is not superior, relative to the common set of training data. We haven’t taken into account the variability of the training data, which were sampled just once from their underlying distri- butions, only that of the test data. If one or both of the classifiers is a neural network, we have also not considered the variability of the neural network training procedure with re- spect to the random initialization of the synaptic weights. All this constitutes an extremely computation-intensive task [Rip96]. 10.6.3 Confusion matrices The confusion matrix for M classes is defined as C =     c11 c12 s c1M c21 c22 s c2M ... ... ... ... cM1 cM2 s cMM    
    10.6. EVALUATION 115 wherecij is the number of test pixels from class Ci which are classified as Cj. The misclas- sification rate is ˆθ = y n = n − M i=1 cii n = n − Tr C n and only takes into account of the diagonal elements of the confusion matrix. The Kappa-coefficient κ make use of all the matrix elements. It is defined as follows: κ = correct classifications − chance correct classifications 1 − chance correct classifications For a purely randomly labeled test pixel, the proportion of correct classifications is approx- imately M i=1 ci ci n2 , where ci = M j=1 cij, ci = M j=1 cji. Hence an estimate of the Kappa coefficient is ˆκ = i cii n − i cici n2 1 − i cici n2 . (10.35) Again, the Kappa coefficient alone tells us little about the quality of the classifier. We require its standard deviation. This can be calculated in the large sample limit n → ∞ to be [BFH75] ˆσˆκ = 1 n θ1(1 − θ1) (1 − θ2)2 + 2(1 − θ1)(2θ1θ2 − θ3) (1 − θ3)3 + (1 − θ1)2 (θ4 − 4θ2 2) (1 − θ2)4 , (10.36) where θ1 = M i=1 cii θ2 = M i=1 cici θ3 = M i=1 cii(ci + ci) θ4 = M i,j=1 cij(cj + ci)2 .
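These quantities are straightforward to compute. The following IDL functions are sketches (hypothetical helpers, not ENVI routines): the first returns the estimated Kappa coefficient (10.35) from a confusion matrix, without its standard deviation (10.36); the second returns the continuity-corrected McNemar statistic.

function kappa_coefficient, C
; C: (M,M) confusion matrix with element C[i,j] = c_ij, the number of
;    test pixels from class Ci classified as Cj
; returns the estimated Kappa coefficient of (10.35)
   n = total(C)
   M = n_elements(C[*,0])
   tr = 0.0
   for i=0,M-1 do tr = tr + C[i,i]
   rowsums = total(C, 2)              ; c_i.
   colsums = total(C, 1)              ; c_.i
   chance  = total(rowsums*colsums)/n^2
   return, (tr/n - chance)/(1 - chance)
end

function mcnemar, yAB, yBA
; yAB: test observations misclassified by A but correctly classified by B
; yBA: test observations misclassified by B but correctly classified by A
; returns the continuity-corrected McNemar statistic, chi-square with 1 d.f.
   return, (abs(yAB - yBA) - 1.0)^2/(yAB + yBA)
end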
Chapter 11 Hyperspectral Analysis Hyperspectral – as opposed to multispectral – images combine high or moderate spatial resolution with high spectral resolution. Typical sensors (imaging spectrometers) generate in excess of two hundred spectral channels. Figure 11.1 shows part of a so-called image cube for the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) sensor taken over a region of the Californian coast. Sensors of this kind produce much more complex data and provide correspondingly much more information about the reflecting surfaces examined. Figure 11.2 displays the spectrum of a single pixel in the image. Figure 11.1: AVIRIS hyperspectral image cube, Santa Monica Mountains.
    118 CHAPTER 11.HYPERSPECTRAL ANALYSIS Figure 11.2: AVIRIS spectrum of one pixel location. 11.1 Mixture modelling In working with multispectral images, the fact that at the scale of observation a pixel contains a mixture of materials is generally treated as a second order effect and more or less ignored. With the availability of high spectral resolution sensors it has become possible to treat the problem of the “mixed pixel” quantitatively. The basic premise of mixture modelling is that within a given scene, the surface is dominated by a small number of common materials that have relatively constant spectral properties. These are referred to as the end-members. It is assumed that the spectral variability captured by the remote sensing system can be modelled by mixtures of these components. 11.1.1 Full linear unmixing Suppose that there are p end-members and spectral bands. Then we can denote the spectrum of the ith end-member by the vector mi =    mi 1 ... mi    . Now define the matrix of end-members M according to M = (m1 . . . mp ) =    m1 1 s mp 1 ... ... ... m1 s mp    , with one column for each end-member. For hyperspectral imagery we always have p .
    11.1. MIXTURE MODELLING119 The measured signal g is modelled as a linear combination of end-members plus a residual noise term: g = α1m1 + . . . + αpmp + n = Mα + n. The residual n is assumed to be normally distributed with covariance matrix Σn =     σ2 1 0 s 0 0 σ2 2 s 0 ... ... ... ... 0 0 s σ2     . The standardized residual is Σ−1/2 n n and the square of the standardized residual is (Σ−1/2 n n) (Σ−1/2 n n) = n Σ−1 n n. The mixing coefficients are determined my minimizing this quantity with respect to α under the condition that they sum to unity. The corresponding Lagrange function is L = n Σ−1 n n − 2λ( p i=1 αi − 1) = (g − Mα) Σ−1 n (g − Mα) − 2λ( p i=1 αi − 1) . Solving the set of equations ∂L ∂α = 0 ∂L ∂λ = 0 we obtain the solution α = (M Σ−1 n M)−1 (M Σ−1 n g − λ1p) α1p = 1, (11.1) where 1p = (1, 1 . . . 1) . The first equation determines the mixing coefficients in terms of known quantities and λ. The second equation can be used to eliminate λ. 11.1.2 Unconstrained linear unmixing If we work with MNF-projected data (see next section) then we can assume that Σn = σ2 I. If furthermore we ignore the constraint on α (i.e. λ = 0), then (11.1) reduces to α = [(M M)−1 M ]g. The expression in square brackets is the pseudoinverse of the matrix M, see Chapter 1. 11.1.3 Intrinsic end-members and pixel purity If a spectral library for all of the p end-members in M is available, the mixture coefficients can be calculated directly. The primary result of the spectral mixture analysis is the fraction
    120 CHAPTER 11.HYPERSPECTRAL ANALYSIS images which show the spatial distribution and abundance of the end-member components in the scene. If such external data are unavailable, there are various strategies for determining end- members from the hyperspectral imagery itself. We describe briefly the method recom- mended in ENVI and implemented in the so-called “Spectral Hourglass Wizard”. The first step is to reduce the dimensionality of the data. This is done with the MNF transformation described in Chapter 3. By examining the eigenvalues of the transformation and retaining only the components with eigenvalues exceeding one (non-noise components), the number of dimensions can be reduced substantially, see Figure 11.3. Figure 11.3: Eigenvalues of the MNF transformation of the image in Figure 11.1. The so-called pixel purity index (PPI) is then used to find the most spectrally pure, or extreme, pixels in the remaining data. The most spectrally pure pixels typically correspond to mixing end-members. The PPI is computed by repeatedly projecting n-dimensional scatter plots onto a random unit vector. The extreme pixels in each projection are noted and the number of times each pixel is marked as extreme is recorded. The purest pixels must must be on the corners, edges or faces of the data cloud. A threshold value is used to define how many pixels are marked as extreme at the ends of the projected vector. This value should be 2-3 times the noise level in the data, which is 1 when using the MNF transformed channels. A minimum of about 5000 iterations is usually required to produce useful results. When the iterations are completed, a PPI image is created in which the value of each pixel corresponds to the number of times that pixel was recorded as extreme. So bright pixels are generally end-members. This image hints at locations and sites that could be visited for ground truth measurements. The n-dimensional visualizer, Figure 11.4 can then be used interactively to define classes of pixels corresponding to end-members and to plot their spectra. These can be saved along with their pixel locations as ROIs (regions of interest) for later use in spectral unmixing. This method is repeatable and has the advantage of objectivity in analysis of a data set to assess dimensionality and define end-members. The primary disadvantage is that it is a statistical approach dependent upon the specific spectral variance of the image. Thus the resulting end-members are mathematical constructs which may not be physically inter- pretable.
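Once the end-member matrix M has been assembled, whether from a spectral library or from ROIs defined with the PPI and the n-D visualizer, the unconstrained mixing coefficients of Section 11.1.2 follow directly from the pseudoinverse. A minimal IDL sketch (a hypothetical function, applied here to a single pixel vector):

function unmix, g, Ms
; g:  (1,L) observed pixel spectrum as a column vector, L = number of bands
; Ms: (p,L) matrix of end-member spectra, one end-member per column
; returns the (1,p) vector of unconstrained mixing coefficients
;    alpha = (M^T M)^(-1) M^T g
   pseudo = invert(transpose(Ms)##Ms, /double)##transpose(Ms)
   return, pseudo##g
end

Looping over all pixels and collecting the ith coefficient yields the fraction image for end-member i; coefficients far outside the interval [0,1], or sums far from unity, are a hint that the chosen end-member set is inadequate.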
    11.2. ORTHOGONAL SUBSPACEPROJECTION 121 Figure 11.4: The n-D visualizer. 11.2 Orthogonal subspace projection Orthogonal subspace projection is a transformation which is closely related to linear unmix- ing. Suppose that a multispectral image pixel g consists of a mixture of “desirable” and “undesirable” spectra, g = Dα + Uβ + n. The × matrix P = I − U(U U)−1 U “projects out” the undesirable components, since Pg = PDα + IUβ − Uβ + Pn = PDα + Pn. (11.2) An example of the use of this transformation is the suppression of cloud cover from a multispectral image. First an unsupervised classification is carried out (see Chapter 9) and the clusters containing the undesired features (clouds) are identified. The mean vectors of these clusters can then be used as the undesired spectra and combined to form the matrix U. The the projection (11.2) can be applied to the entire image. Here is an ENVI/IDL program to implement this idea: ; Orthogonal subspace projection pro osp, event print, ’---------------------------------’ print, ’Orthogonal Subspace Projection’ print, systime(0)
    122 CHAPTER 11.HYPERSPECTRAL ANALYSIS print, ’---------------------------------’ infile=dialog_pickfile(filter=’*.dat’,/read) ; read in cluster centers openr,lun,infile,/get_lun readf,lun,num_channels ; number of spectral channels readf,lun,K ; number of cluster centers Ms=fltarr(num_channels,K) readf,lun,Ms Us=transpose(Ms) print,’Cluster centers (in the columns)’ print,Us centers=indgen(K) print,’enter undesired centers as 1 (e.g. 0 1 1 0 0 ...)’ read,centers U = Us[where(centers),*] print,’Subspace U’ print,U Identity = fltarr(num_channels,num_channels) for i=0,num_channels-1 do Identity[i,i]=1.0 P = Identity - U##invert(transpose(U)##U,/double)##transpose(U) print,’projection matrix:’ print, P envi_select, title=’Choose multispectral image for projection’, $ fid=fid, dims=dims,pos=pos if (fid eq -1) then goto, done num_cols = dims[2]+1 num_lines = dims[4]+1 num_pixels = (num_cols*num_lines) if (num_channels ne n_elements(pos)) then begin print,’image dimensions are incorrect, aborting ...’ goto, done end image=fltarr(num_pixels,num_channels) for i=0,num_channels-1 do $ image[*,i]=envi_get_data(fid=fid,dims=dims,pos=pos[i])+0.0 print,’projecting ...’ ; do the projection image = P ## image out_array = bytarr(num_cols,num_lines,num_channels) for i = 0,num_channels-1 do out_array[*,*,i] = $ bytscl(reform(image[*,i],num_cols,num_lines,/overwrite)) base = widget_auto_base(title=’OSP Output’)
sb = widget_base(base, /row, /frame)
wp = widget_outfm(sb, uvalue='outf', /auto)
result = auto_wid_mng(base)
if (result.accept eq 0) then begin
   print, 'Output cancelled'
   goto, done
endif
if (result.outf.in_memory eq 1) then begin
   envi_enter_data, out_array
   print, 'Result written to memory'
endif else begin
   openw, unit, result.outf.name, /get_lun
   band_names=strarr(num_channels)
   for i=0,num_channels-1 do begin
      writeu, unit, out_array[*,*,i]
      band_names[i]='OSP component ' + string(i+1)
   endfor
; ns (samples) and nl (lines) are the column and row counts determined above
   envi_setup_head, fname=result.outf.name, ns=num_cols, $
      nl=num_lines, nb=num_channels, $
      data_type=1, interleave=0, /write, $
      bnames=band_names, $
      descrip='OSP'
   print, 'File created ', result.outf.name
   close, unit
endelse
done: print,'done.'
end
    Appendix A Least SquaresProcedures A.1 Generalized least squares Consider the following data model: y = a x + relating n independent variables x = (x1 . . . xn) to a measured quantity y via the param- eters a = (a1 . . . an) . The random variable represents measurement uncertainty, and we assume var( ) = σ2 . We wish to determine the “best values” for parameters a. If we perform m n measure- ments, we can write y1 = n j=1 aj(xj)1 + ... ym = n j=1 aj(xj)m + . (A.1) Defining the m × n matrix A by (A)ij = (xj)i we can write (A.1) as y = Aa + (A.2) where y = (y1 . . . ym) and = ( . . . ) and Σ = = σ2 I. The “goodness of fit” function is χ2 = m i=1 yi − n j=1 Aijaj σ 2 . 125
    126 APPENDIX A.LEAST SQUARES PROCEDURES This is minimized by solving the equations ∂χ2 ∂ak = 0, k = 1 . . . n. We obtain m i=1  yi − n j=1 Aijaj   Aik = 0, k = 1 . . . n, which we can write in matrix form as A y = (A A)a . (A.3) Eq. (A.3) is referred to as the normal equation. The fitted parameters of the model are thus estimated by ˆa = (A A)−1 A y =: Ly. (A.4) The matrix L = (A A)−1 A is called the pseudoinverse of A. Thinking now of a as a random variable with expectation value ˆa, the uncertainties in the fitted parameters can be obtained as follows: Σ = (a − ˆa)(a − ˆa) = (a − Ly)(a − Ly) = (a − L(Aa + ))(a − L(Aa + )) But LA = I, so we have Σ = (−L )(−L ) = L L = σ2 LL = σ2 (A A)−1 . (A.5) To check that this is indeed a generalization of the simple linear regression, identify the parameter vector a with the straight line parameters a and b, i.e. a = a1 a2 = a b . The matrix A and vector y are similarly A =     1 x1 1 x2 ... ... 1 xm     , y =     y1 y2 ... ym     . Thus the best estimates for the parameters are ˆa = ˆa ˆb = (A A)−1 (A y).
Evaluating the matrix products explicitly,

$$A^TA = \begin{pmatrix} m & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} = \begin{pmatrix} m & m\bar x \\ m\bar x & \sum_i x_i^2 \end{pmatrix}.$$

Recalling the expression for the inverse of a $2\times 2$ matrix,

$$(A^TA)^{-1} = \frac{1}{m\sum_i x_i^2 - m^2\bar x^2}\begin{pmatrix} \sum_i x_i^2 & -m\bar x \\ -m\bar x & m \end{pmatrix}.$$

Furthermore, we have

$$A^T\mathbf{y} = \begin{pmatrix} m\bar y \\ \sum_i x_i y_i \end{pmatrix}.$$

Therefore the estimate for $b$ is

$$\hat b = \frac{-m^2\bar x\bar y + m\sum_i x_i y_i}{m\sum_i x_i^2 - m^2\bar x^2} = \frac{\sum_i x_i y_i - m\bar x\bar y}{\sum_i x_i^2 - m\bar x^2}. \tag{A.6}$$

From (A.5) the uncertainty in $b$ is given by $\sigma^2$ times the (2,2) element of $(A^TA)^{-1}$,

$$\sigma_b^2 = \sigma^2\,\frac{m}{m\sum_i x_i^2 - m^2\bar x^2}. \tag{A.7}$$

Equations (A.6) and (A.7) correspond to those for ordinary least squares.

A.2 Recursive least squares

Suppose that the measurement data in (A.1) are presented sequentially and we wish to determine the best solution for the parameters $\mathbf{a}$ as the new data become available. We can write Eq. (A.2) in the form

$$\mathbf{y}_\ell = A_\ell\,\mathbf{a} + \boldsymbol{\epsilon}_\ell, \tag{A.8}$$

indicating that $\ell$ measurements have been made up till now (we assume $\ell \ge n$), where as before $n$ is the number of parameters (the length of $\mathbf{a}$). The least squares solution is, with (A.4),

$$\hat{\mathbf{a}} = (A_\ell^TA_\ell)^{-1}A_\ell^T\mathbf{y}_\ell =: \mathbf{a}(\ell)$$

and, from (A.5), the covariance matrix of $\mathbf{a}(\ell)$ is

$$\Sigma_\ell = (A_\ell^TA_\ell)^{-1}. \tag{A.9}$$

We have assumed for convenience that $\sigma^2 = 1$. Therefore we can write

$$\mathbf{a}(\ell) = \Sigma_\ell A_\ell^T\mathbf{y}_\ell. \tag{A.10}$$

Suppose a new observation becomes available. (We'll call it $(\mathbf{x}(\ell+1), y(\ell+1))$ rather than $(\mathbf{x}_{\ell+1}, y_{\ell+1})$, as this simplifies the notation considerably.) Now we must solve the least squares problem

$$\begin{pmatrix} \mathbf{y}_\ell \\ y(\ell+1) \end{pmatrix} = \begin{pmatrix} A_\ell \\ A(\ell+1) \end{pmatrix}\mathbf{a} + \boldsymbol{\epsilon},$$

where $A(\ell+1) = \mathbf{x}(\ell+1)^T$. According to (A.10) the solution is

$$\mathbf{a}(\ell+1) = \Sigma_{\ell+1}\begin{pmatrix} A_\ell \\ A(\ell+1) \end{pmatrix}^T\begin{pmatrix} \mathbf{y}_\ell \\ y(\ell+1) \end{pmatrix}. \tag{A.11}$$
From (A.9) we can obtain a recursive formula for the covariance matrix $\Sigma_{\ell+1}$:

$$\Sigma_{\ell+1}^{-1} = \begin{pmatrix} A_\ell \\ A(\ell+1) \end{pmatrix}^T\begin{pmatrix} A_\ell \\ A(\ell+1) \end{pmatrix} = A_\ell^TA_\ell + A(\ell+1)^TA(\ell+1)$$

or

$$\Sigma_{\ell+1}^{-1} = \Sigma_\ell^{-1} + A(\ell+1)^TA(\ell+1). \tag{A.12}$$

Next we multiply Eq. (A.11) out,

$$\mathbf{a}(\ell+1) = \Sigma_{\ell+1}\bigl(A_\ell^T\mathbf{y}_\ell + A(\ell+1)^T y(\ell+1)\bigr),$$

and replace $\mathbf{y}_\ell$ with $A_\ell\mathbf{a}(\ell)$ to obtain

$$\mathbf{a}(\ell+1) = \Sigma_{\ell+1}\bigl(A_\ell^TA_\ell\,\mathbf{a}(\ell) + A(\ell+1)^T y(\ell+1)\bigr).$$

Using (A.9) and (A.12),

$$\mathbf{a}(\ell+1) = \Sigma_{\ell+1}\bigl(\Sigma_\ell^{-1}\mathbf{a}(\ell) + A(\ell+1)^T y(\ell+1)\bigr)
= \Sigma_{\ell+1}\Bigl(\Sigma_{\ell+1}^{-1}\mathbf{a}(\ell) - A(\ell+1)^TA(\ell+1)\mathbf{a}(\ell) + A(\ell+1)^T y(\ell+1)\Bigr).$$

This simplifies to

$$\mathbf{a}(\ell+1) = \mathbf{a}(\ell) + \Sigma_{\ell+1}A(\ell+1)^T\bigl(y(\ell+1) - A(\ell+1)\mathbf{a}(\ell)\bigr).$$

Finally, with the definition of the Kalman gain

$$K_{\ell+1} := \Sigma_{\ell+1}A(\ell+1)^T, \tag{A.13}$$

we also obtain a recursive equation for the parameter vector $\mathbf{a}$, namely

$$\mathbf{a}(\ell+1) = \mathbf{a}(\ell) + K_{\ell+1}\bigl(y(\ell+1) - A(\ell+1)\mathbf{a}(\ell)\bigr). \tag{A.14}$$

Equations (A.12–A.14) define a so-called Kalman filter for the least squares problem (A.8). For input $\mathbf{x}(\ell+1) = A(\ell+1)^T$ the system response $A(\ell+1)\mathbf{a}(\ell)$ is calculated in (A.14) and compared with the measurement $y(\ell+1)$. Then the innovation, that is to say the difference between the measurement and the system response, is multiplied by the Kalman gain determined by (A.13) and (A.12), and the old value $\mathbf{a}(\ell)$ is corrected accordingly.

Relation (A.12) is inconvenient, as it calculates the inverse of the covariance matrix $\Sigma_{\ell+1}$, whereas we require the non-inverted form in order to determine the Kalman gain (A.13). Fortunately (A.12) and (A.13) can be reformed as follows:

$$\begin{aligned}
\Sigma_{\ell+1} &= \bigl(I - K_{\ell+1}A(\ell+1)\bigr)\Sigma_\ell \\
K_{\ell+1} &= \Sigma_\ell A(\ell+1)^T\bigl(A(\ell+1)\Sigma_\ell A(\ell+1)^T + 1\bigr)^{-1}.
\end{aligned} \tag{A.15}$$
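As a quick sanity check of (A.12)–(A.15), consider the simplest possible case (an added example, not part of the original derivation): a single parameter ($n = 1$) estimated from repeated measurements of a constant, so that $A(\ell+1) = 1$ for every $\ell$. Then (A.12) gives $\Sigma_{\ell+1}^{-1} = \Sigma_\ell^{-1} + 1$, hence $\Sigma_\ell = 1/\ell$, the Kalman gain from (A.15) becomes

$$K_{\ell+1} = \frac{1/\ell}{1/\ell + 1} = \frac{1}{\ell + 1},$$

and (A.14) reads

$$a(\ell+1) = a(\ell) + \frac{1}{\ell+1}\bigl(y(\ell+1) - a(\ell)\bigr),$$

which is just the recursive update of the arithmetic mean of the measurements, i.e. the ordinary least squares estimate of a constant.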
To see that (A.15) is indeed equivalent to (A.12) and (A.13), note first of all that the second equation in (A.15) is a consequence of the first equation and (A.13). Therefore it suffices to show that the first equation is indeed the inverse of (A.12):

$$\begin{aligned}
\Sigma_{\ell+1}\Sigma_{\ell+1}^{-1} &= \bigl(I - K_{\ell+1}A(\ell+1)\bigr)\Sigma_\ell\,\Sigma_{\ell+1}^{-1} \\
&= I - K_{\ell+1}A(\ell+1) + \bigl(I - K_{\ell+1}A(\ell+1)\bigr)\Sigma_\ell A(\ell+1)^TA(\ell+1) \\
&= I - K_{\ell+1}A(\ell+1) + \Sigma_\ell A(\ell+1)^TA(\ell+1) - K_{\ell+1}A(\ell+1)\Sigma_\ell A(\ell+1)^TA(\ell+1).
\end{aligned}$$

The second equality above follows from (A.12). But from the second equation in (A.15) we have

$$K_{\ell+1}A(\ell+1)\Sigma_\ell A(\ell+1)^T = \Sigma_\ell A(\ell+1)^T - K_{\ell+1}$$

and therefore

$$\Sigma_{\ell+1}\Sigma_{\ell+1}^{-1} = I - K_{\ell+1}A(\ell+1) + \Sigma_\ell A(\ell+1)^TA(\ell+1) - \bigl(\Sigma_\ell A(\ell+1)^T - K_{\ell+1}\bigr)A(\ell+1) = I,$$

as required.

A.3 Orthogonal regression

In the model for ordinary least squares regression the $x$'s are assumed to be error-free. In the calibration case, where it is arbitrary which variable we call the reference and which the uncalibrated variable to be normalized, we should allow for error in both $x$ and $y$. If we impose the model¹

$$y_i - \epsilon_i = a + b(x_i - \delta_i), \quad i = 1 \ldots m, \tag{A.16}$$

with $\epsilon$ and $\delta$ uncorrelated, white, Gaussian noise terms with mean zero and equal variances $\sigma^2$, we get for the estimator of $b$ [KS79]

$$\hat b = \frac{(s^2_{yy} - s^2_{xx}) + \sqrt{(s^2_{yy} - s^2_{xx})^2 + 4s^2_{xy}}}{2s_{xy}} \tag{A.17}$$

with

$$s^2_{yy} = \frac{1}{m}\sum_{i=1}^m (y_i - \bar y)^2 \tag{A.18}$$

and the remaining quantities defined in the section immediately above. The estimator for $a$ is

$$\hat a = \bar y - \hat b\bar x. \tag{A.19}$$

According to [Pat77, Bil89] we get for the dispersion matrix of the vector $(\hat a, \hat b)^T$

$$\frac{\sigma^2\hat b(1 + \hat b^2)}{m\,s_{xy}}\begin{pmatrix} \bar x^2(1 + \hat\tau) + s_{xy}/\hat b & -\bar x(1 + \hat\tau) \\ -\bar x(1 + \hat\tau) & 1 + \hat\tau \end{pmatrix} \tag{A.20}$$

with

$$\hat\tau = \frac{\sigma^2\hat b}{(1 + \hat b^2)s_{xy}} \tag{A.21}$$

¹The model in equation (A.16) is often referred to as a linear functional relationship in the literature.
and where $\sigma^2$ can be replaced by

$$\hat\sigma^2 = \frac{m}{(m - 2)(1 + \hat b^2)}\bigl(s^2_{yy} - 2\hat b s_{xy} + \hat b^2 s^2_{xx}\bigr), \tag{A.22}$$

see [KS79]. It can be shown that the estimators of $a$ and $b$ can be calculated by means of the elements of the eigenvector corresponding to the smallest eigenvalue of the dispersion matrix of the $m \times 2$ data matrix with a vector of the $x$'s in the first column and a vector of the $y$'s in the second column [KS79]. This can be used to perform orthogonal regression in higher dimensions, i.e., when we have, for example, more $x$ variables than the one variable we have here.
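An added consistency check of (A.17)–(A.19), not from the original text: if the two variables happen to have equal sample variances, $s^2_{xx} = s^2_{yy}$, then (A.17) reduces to

$$\hat b = \frac{\sqrt{4s^2_{xy}}}{2s_{xy}} = \pm 1,$$

with the sign of $s_{xy}$, and (A.19) gives $\hat a = \bar y \mp \bar x$; the fitted line treats $x$ and $y$ symmetrically, as the model (A.16) demands. Ordinary least squares would instead give $\hat b = s_{xy}/s^2_{xx}$, whose magnitude in this situation never exceeds 1 and is attenuated by the noise in $x$.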
Appendix B

The Discrete Wavelet Transformation

The following discussion follows [AS99] closely.

B.1 Inner product space

Let $f$ and $g$ be two functions of the real numbers IR and define their inner product as

$$\langle f, g\rangle = \int_{-\infty}^{\infty} f(t)g(t)\,dt.$$

The inner product space $L^2(\mathrm{IR})$ is the collection of all functions $f: \mathrm{IR} \to \mathrm{IR}$ such that

$$\|f\| = \langle f, f\rangle^{1/2} = \left(\int_{-\infty}^{\infty} f(t)^2\,dt\right)^{1/2} < \infty.$$

The distance between two functions $f(t)$ and $g(t)$ in $L^2(\mathrm{IR})$ is

$$d(f, g) = \|f - g\| = \left(\int_{-\infty}^{\infty}\bigl(f(t) - g(t)\bigr)^2\,dt\right)^{1/2}.$$

B.2 Haar wavelets

Let $V_n$ be the collection of all piecewise constant functions of finite extent¹ that have possible discontinuities at the rational points $m \times 2^{-n}$, where $m$ and $n$ are integers, $m, n \in \mathbb{Z}$. Then all members of $V_n$ belong to the inner product space $L^2(\mathrm{IR})$, $V_n \subseteq L^2(\mathrm{IR})$. Define the Haar scaling function according to

$$\phi(t) = \begin{cases} 1 & \text{if } 0 \le t \le 1 \\ 0 & \text{otherwise.} \end{cases} \tag{B.1}$$

¹Such functions are said to have compact support.
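As an added illustration of these definitions: take $f(t) = \phi(t)$ and $g(t) = \phi(2t)$, with $\phi$ the Haar scaling function (B.1). Then

$$\langle f, g\rangle = \int_0^{1/2} dt = \tfrac12, \qquad \|f\|^2 = 1, \qquad \|g\|^2 = \tfrac12,$$

so both functions belong to $L^2(\mathrm{IR})$, and their distance is

$$d(f,g) = \bigl(\|f\|^2 - 2\langle f,g\rangle + \|g\|^2\bigr)^{1/2} = \bigl(1 - 1 + \tfrac12\bigr)^{1/2} = \frac{1}{\sqrt2}.$$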
[Figure B.1: The Haar scaling function.]

The scaling function is shown in Figure B.1. Any function in $V_n$ on $[0, 1]$ can be expressed as a linear combination of the standard Haar basis functions of the form

$$C_n = \{\phi_{n,k}(t) = \phi(2^n t - k) \mid k = 0, 1 \ldots 2^n - 1\}.$$

These basis functions have compact support and are orthogonal in the following sense:

$$\langle\phi_{n,k}, \phi_{n,k'}\rangle = \frac{1}{2^n}\,\delta_{k,k'}.$$

Note that $\phi_{0,0}(t) = \phi(t)$. Consider the function spaces $V_0$ and $V_1$ with orthogonal bases $\{\phi_{0,0}\}$ and $\{\phi_{1,0}, \phi_{1,1}\}$, respectively. According to the orthogonal decomposition theorem [?], any function in $V_1$ can be projected onto the basis function $\phi_{0,0}(t)$ for $V_0$ plus a residual in the space $V_0^\perp$ which is orthogonal to $V_0$. Formally,

$$V_1 = V_0 \oplus V_0^\perp.$$

For example

$$\phi_{1,0}(t) = \frac{\langle\phi_{1,0}, \phi_{0,0}\rangle}{\langle\phi_{0,0}, \phi_{0,0}\rangle}\,\phi_{0,0}(t) + r(t).$$

The residual function $r(t)$ is in the residual space $V_0^\perp$. We see that

$$r(t) = \phi_{1,0}(t) - \tfrac12\phi_{0,0}(t) = \phi(2t) - \tfrac12\phi(t) = \phi(2t) - \tfrac12\bigl[\phi(2t) + \phi(2t-1)\bigr] = \tfrac12\psi(t),$$

where $\psi(t)$ is the Haar wavelet derived from the scaling function $\phi$ according to

$$\psi(t) = \phi(2t) - \phi(2t - 1). \tag{B.2}$$
Thus an alternative basis for $V_1$ is

$$B_1 = \{\phi(t), \psi(t)\} = \{\phi_{0,0}(t), \psi_{0,0}(t)\}.$$

The wavelet $\psi_{0,0}(t)$ is shown in Figure B.2.

[Figure B.2: The Haar wavelet.]

We can repeat this argument for $V_2 = V_1 \oplus V_1^\perp$ to obtain the basis

$$B_2 = \{\phi_{0,0}(t), \psi_{0,0}(t), \psi_{1,0}(t), \psi_{1,1}(t)\},$$

where now $\{\psi_{1,0}(t), \psi_{1,1}(t)\}$ is an orthogonal basis for $V_1^\perp$. In general, the Haar wavelet basis for $V_n$ is

$$B_n = \{\phi_{0,0}(t), \psi_{0,0}(t), \psi_{1,0}(t), \psi_{1,1}(t), \ldots, \psi_{n-1,0}(t), \psi_{n-1,1}(t), \ldots, \psi_{n-1,2^{n-1}-1}(t)\},$$

where $\{\psi_{m,k}(t) = \psi(2^m t - k) \mid k = 0 \ldots 2^m - 1\}$ is an orthogonal basis for $V_m^\perp$, and

$$V_n = V_{n-1} \oplus V_{n-1}^\perp = V_0 \oplus V_0^\perp \oplus \ldots \oplus V_{n-2}^\perp \oplus V_{n-1}^\perp.$$

In the case of the Haar wavelets, there is a simple correspondence between the basis functions $(\phi, \psi)$ and the vector space $\mathrm{IR}^{2^n}$, i.e. the space of $2^n$-dimensional vectors. Consider for instance $n = 2$. Then the correspondence is

$$\phi_{0,0} = \begin{pmatrix}1\\1\\1\\1\end{pmatrix}, \quad
\phi_{2,0} = \begin{pmatrix}1\\0\\0\\0\end{pmatrix}, \quad
\phi_{2,1} = \begin{pmatrix}0\\1\\0\\0\end{pmatrix}, \quad
\phi_{2,2} = \begin{pmatrix}0\\0\\1\\0\end{pmatrix}, \quad
\phi_{2,3} = \begin{pmatrix}0\\0\\0\\1\end{pmatrix},$$
and

$$\psi_{0,0} = \begin{pmatrix}1\\1\\-1\\-1\end{pmatrix}, \quad
\psi_{1,0} = \begin{pmatrix}1\\-1\\0\\0\end{pmatrix}, \quad
\psi_{1,1} = \begin{pmatrix}0\\0\\1\\-1\end{pmatrix}.$$

Thus the orthogonal basis $B_2$ can be represented by the mutually orthogonal vectors

$$B_2 = \left\{\begin{pmatrix}1\\1\\1\\1\end{pmatrix}, \begin{pmatrix}1\\1\\-1\\-1\end{pmatrix}, \begin{pmatrix}1\\-1\\0\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\\-1\end{pmatrix}\right\}.$$

Example: signal compression

We consider the continuous function $f(t) = \sin(20t)(\log t)^2$ sampled at 64 evenly spaced points on the interval $[0, 1]$. The 64 samples comprise a signal vector

$$\mathbf{f} = (f_0, f_1 \ldots f_{63})^T = \bigl(f(0/63), f(1/63) \ldots f(63/63)\bigr)^T$$

and can also be thought of as a piecewise constant function $\bar f(t)$ belonging to the function space $V_6$. The function is shown in Figure B.3.

[Figure B.3: The function $\sin(20t)(\log t)^2$ sampled at 64 points on $[0, 1]$.]

We can express the function $\bar f(t)$ in the basis $C_6$ as follows:

$$\bar f = f_0\phi_{6,0}(t) + f_1\phi_{6,1}(t) + \ldots + f_{63}\phi_{6,63}(t),$$

where we think of the basis functions as vectors and where $f_i$ is the value of the function sampled at point $i$, $i = 0 \ldots 63$. Alternatively the signal can be written in the vector basis $B_6$,

$$\bar f = w_0\phi_{0,0}(t) + w_1\psi_{0,0}(t) + \ldots + w_{63}\psi_{5,31}(t).$$

We can write this equivalently as the matrix equation

$$\mathbf{f} = B_6 \cdot \mathbf{w}, \tag{B.3}$$
where $B_6$ is a $(64 \times 64)$-dimensional matrix of zeroes and $\pm$ones. This is too large to show here, but, for example,

$$B_3 = \begin{pmatrix}
1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & -1 & 0 & 0 & 0 \\
1 & 1 & -1 & 0 & 0 & 1 & 0 & 0 \\
1 & 1 & -1 & 0 & 0 & -1 & 0 & 0 \\
1 & -1 & 0 & 1 & 0 & 0 & 1 & 0 \\
1 & -1 & 0 & 1 & 0 & 0 & -1 & 0 \\
1 & -1 & 0 & -1 & 0 & 0 & 0 & 1 \\
1 & -1 & 0 & -1 & 0 & 0 & 0 & -1
\end{pmatrix}.$$

The elements of the vector $\mathbf{w}$ comprise the wavelet coefficients. They are given by the wavelet transform

$$\mathbf{w} = B_6^{-1} \cdot \mathbf{f}.$$

The wavelet coefficients are thus an alternative way of representing the original signal $\bar f(t)$. They are plotted in Figure B.4.

[Figure B.4: The wavelet coefficients $\mathbf{w}$ for the signal in Figure B.3.]

Notice that many of the coefficients are close to zero. We can define a threshold below which all coefficients are set exactly to zero. This generally leads to long series of zeroes in $\mathbf{w}$, so that it can be compressed efficiently, $\mathbf{w} \to \mathbf{w}_{\rm compr}$. Figure B.5 shows the result of reconstructing the signal according to

$$\mathbf{f} = B_6 \cdot \mathbf{w}_{\rm compr}$$

after setting a threshold of 0.1. In all, 33 of the 64 wavelet coefficients are zero after thresholding.
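Before turning to the program, here is a miniature hand calculation of the same transform (an added example; the sample values are hypothetical). For $n = 2$, the basis matrix whose columns are $\phi_{0,0}, \psi_{0,0}, \psi_{1,0}, \psi_{1,1}$ is

$$B_2 = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 1 & -1 & 0 \\ 1 & -1 & 0 & 1 \\ 1 & -1 & 0 & -1 \end{pmatrix},$$

and for the four-sample signal $\mathbf{f} = (4, 2, 5, 5)^T$ the wavelet transform gives

$$\mathbf{w} = B_2^{-1}\mathbf{f} = (4, -1, 1, 0)^T,$$

as is easily checked by multiplying back: $4\,\phi_{0,0} - \psi_{0,0} + \psi_{1,0} = (4, 2, 5, 5)^T$. The coefficient of $\psi_{1,1}$ vanishes because the last two samples are equal; this is exactly the kind of small detail coefficient that thresholding discards.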
[Figure B.5: The reconstructed signal after thresholding at 0.1.]

The following program illustrates the above steps in IDL:

; generate a signal vector
t = findgen(64)/63
f = sin(20*t)*alog(t)*alog(t)
f[0] = 0
; output as EPS file
thisDevice = !D.Name
set_plot, 'PS'
Device, Filename='c:\temp\signal.eps', xsize=15, ysize=10, /Encapsulated
plot, t, f, color=1, background='FFFFFF'XL
device, /close_file
; read in basis B6
filename = Dialog_Pickfile(Filter='*.dat', /Read)
openR, lun, filename, /get_lun
B6 = fltarr(64,64)
ReadF, lun, B6
; do the wavelet transform and display
w = invert(B6)##f
Device, Filename='c:\temp\wavelet_coeff.eps', xsize=15, ysize=10, /Encapsulated
plot, t, w, color=1, background='FFFFFF'XL
; display compressed signal
w( where(abs(w) lt 0.1) ) = 0.0
Device, Filename='c:\temp\recon_signal.eps', xsize=15, ysize=10, /Encapsulated
plot, t, w, color=1, background='FFFFFF'XL
device, /close_file
set_plot, thisDevice
end
B.3 Multi-resolution analysis

So far we have considered only functions on the interval $[0, 1]$ with basis functions $\phi_{n,k}(t) = \phi(2^n t - k)$, $k = 0 \ldots 2^n - 1$. We can extend this to functions defined on all real numbers IR in a straightforward way. For example $\{\phi(t - k) \mid k \in \mathbb{Z}\}$ is a basis for the space $V_0$ of all piecewise constant functions with compact support (finite extent) having possible breaks at integer values. More generally, a basis for the set $V_n$ of piecewise constant functions with possible breaks at $m \times 2^{-n}$ and compact support is $\{\phi(2^n t - k) \mid k \in \mathbb{Z}\}$. We can even allow $n < 0$. For example $n = -1$ means that the possible breaks are at even integer values. We can think of the collection of nested subspaces of piecewise constant functions

$$\ldots \subseteq V_{-1} \subseteq V_0 \subseteq V_1 \subseteq V_2 \subseteq \ldots \subseteq L^2(\mathrm{IR})$$

as being generated by the Haar scaling function $\phi$. This collection is called a multiresolution analysis (MRA). A general MRA must have the following properties:

1. $V = \bigcup_{n\in\mathbb{Z}} V_n$ is dense in $L^2(\mathrm{IR})$, that is, for any function $f \in L^2(\mathrm{IR})$ there exists a series of functions, one in each $V_n$, which converges to $f$. This is true of the Haar MRA, see Figure 2.7 for example.

2. The separation property: $I = \bigcap_{n\in\mathbb{Z}} V_n = \{0\}$. For the Haar MRA, this means that any function in $I$ must be piecewise constant on all intervals. The only function in $L^2(\mathrm{IR})$ with this property and compact support is $f(t) = 0$, so the separation property is satisfied.

3. The function $f(t) \in V_n$ if and only if $f(2^{-n}t) \in V_0$. In the Haar MRA, if $f(t) \in V_1$ then it is piecewise constant on intervals of length 1/2. Therefore the function $f(2^{-1}t)$ is piecewise constant on intervals of length 1, that is $f(2^{-1}t) \in V_0$, etc.

4. The translates of the scaling function $\phi$ form an orthonormal basis for the function space $V_0$, i.e.

$$\langle\phi(t - k), \phi(t - k')\rangle = \delta_{kk'}.$$

This is of course the case for the Haar scaling function.

In the following, we will think of $\phi(t)$ as any scaling function which generates an MRA in the above sense. Since $\{\phi(t - k) \mid k \in \mathbb{Z}\}$ is an orthonormal basis for $V_0$, it follows that $\{\phi(2t - k) \mid k \in \mathbb{Z}\}$ is an orthogonal basis for $V_1$. That is, let $f(t) \in V_1$. Then by property 3, $f(t/2) \in V_0$ and

$$f(t/2) = \sum_k a_k\phi(t - k) \;\Longrightarrow\; f(t) = \sum_k a_k\phi(2t - k).$$

In particular, since $\phi(t) \in V_0 \subseteq V_1$, we have the dilation equation

$$\phi(t) = \sum_k c_k\phi(2t - k). \tag{B.4}$$
The constants $c_k$ are called the refinement coefficients. For example, the dilation equation for the Haar wavelets is

$$\phi(t) = \phi(2t) + \phi(2t - 1),$$

so that the refinement coefficients are $c_0 = c_1 = 1$, $c_k = 0$ otherwise. Note that $c_0^2 + c_1^2 = 2$. It is easy to show that this is a general property of the refinement coefficients:

$$1 = \langle\phi(t), \phi(t)\rangle = \Bigl\langle\sum_k c_k\phi(2t - k), \sum_{k'} c_{k'}\phi(2t - k')\Bigr\rangle = \frac12\sum_k c_k^2.$$

Therefore,

$$\sum_{k=-\infty}^{\infty} c_k^2 = 2, \tag{B.5}$$

which is also called Parseval's formula. In a similar way it is easy to show that

$$\sum_{k=-\infty}^{\infty} c_k c_{k-2j} = 0, \quad j \ne 0. \tag{B.6}$$

B.4 Fixpoint wavelet approximation

There are many other possible scaling functions that define or generate an MRA. Some of these cannot be expressed as simple, analytical functions. But once we have the refinement coefficients for a scaling function, we can approximate that scaling function to any desired degree of accuracy using the dilation equation. (In fact we can work with an MRA even when there is no simple analytical representation for the scaling function which generates it.) The idea is to iterate the refinement equation with a so-called fixpoint algorithm until it converges to a sequence of points which approximates $\phi(t)$.

Let $F$ be the function that assigns the expression

$$F(\gamma)(t) = \sum_n c_n\gamma(2t - n)$$

to any function $\gamma(t)$, where the $c_n$ are refinement coefficients. Applying $F$ to the Haar scaling function:

$$F(\phi)(t) = \sum_n c_n\phi(2t - n) = \phi(t),$$

where the second equality follows from the dilation equation. Thus $\phi$ is a fixpoint of $F$. The following recursive scheme can be used to estimate a scaling function with up to four refinement coefficients:

$$\begin{aligned}
f_0(t) &= \delta_{t,0} \\
f_i(t) &= c_0 f_{i-1}(2t) + c_1 f_{i-1}(2t - 1) + c_2 f_{i-1}(2t - 2) + c_3 f_{i-1}(2t - 3).
\end{aligned}$$

In this scheme, $t$ takes on values of the form $m \times 2^n$, $m, n \in \mathbb{Z}$, only. The first definition is the termination condition for the recursion and approximates the scaling function to zeroth order as the Dirac delta function. The second relation defines the $i$th approximation to the scaling function in terms of the $(i-1)$th approximation using the dilation equation. We can calculate the set

$$\phi \approx \Bigl\{ f_n\Bigl(\frac{j}{2^n}\Bigr) \;\Big|\; j = 0 \ldots 3\cdot 2^n \Bigr\}, \quad n \ge 1,$$
as a pointwise approximation of $\phi$. Here is a recursive IDL program to approximate any scaling function with 4 refinement coefficients:

function f, t, i
   common refinement, c0,c1,c2,c3
   if (i eq 0) then if (t eq 0) then return, 1.0 else return, 0.0 $
      else return, c0*f(2*t,i-1)+c1*f(2*t-1,i-1)+c2*f(2*t-2,i-1)+c3*f(2*t-3,i-1)
end

common refinement, c0,c1,c2,c3
; refinement coefficients for Haar scaling function
; c0=1  c1=1  c2=0  c3=0
; refinement coefficients for Daubechies scaling function
c0=(1+sqrt(3))/4
c1=(3+sqrt(3))/4
c2=(3-sqrt(3))/4
c3=(1-sqrt(3))/4
; fourth order approximation
n=4
t = findgen(3*2^n)
ff=fltarr(3*2^n)
for i=0,3*2^n-1 do ff[i]=f(t[i]/2^n,n)
; output as EPS file
thisDevice =!D.Name
set_plot, 'PS'
Device, Filename='c:\temp\daubechies_approx.eps',xsize=15,ysize=10,/Encapsulated
plot,t/2^n,ff,yrange=[-1,2],color=1,background='FFFFFF'XL
device,/close_file
set_plot,thisDevice
end

[Figure B.6: The fixpoint approximation of the Haar scaling function to order n = 4.]

Figure B.6 shows the result of n = 4 iterations using the refinement coefficients $c_0 = c_1 = 1$, $c_2 = c_3 = 0$ for the Haar scaling function.
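As an added verification (not in the original text), the Daubechies coefficients used in this program satisfy the general refinement coefficient properties (B.5) and (B.6):

$$c_0^2 + c_1^2 + c_2^2 + c_3^2 = \frac{(4 + 2\sqrt3) + (12 + 6\sqrt3) + (12 - 6\sqrt3) + (4 - 2\sqrt3)}{16} = \frac{32}{16} = 2,$$

and, taking $j = 1$ in (B.6),

$$c_0 c_2 + c_1 c_3 = \frac{(1+\sqrt3)(3-\sqrt3) + (3+\sqrt3)(1-\sqrt3)}{16} = \frac{2\sqrt3 - 2\sqrt3}{16} = 0.$$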
B.5 The mother wavelet

Let $f$ be a signal or function, $f \in L^2(\mathrm{IR})$, and let $P_n(f)$ denote its projection onto the space $V_n$. We saw in the case of the Haar MRA that we can always write

$$P_{n+1}(f) = P_n(f) + \sum_k\frac{\langle f, \psi_{n,k}\rangle}{\langle\psi_{n,k}, \psi_{n,k}\rangle}\,\psi_{n,k}.$$

The Haar wavelet $\psi_{n,k}(t) = \psi(2^n t - k)$, $k, n \in \mathbb{Z}$, was seen to be an orthogonal basis for $V_n^\perp$, and

$$\psi(t) = \phi(2t) - \phi(2t - 1), \tag{B.7}$$

where $\phi$ is the scaling function. It can in fact be shown that this is always the case for any MRA, except that the last expression relating the mother wavelet $\psi$ to the scaling function $\phi$ is generalized.

Consider now some MRA with a normalized scaling function $\phi$ defined (in the sense of the preceding section) by the dilation equation (B.4). Since

$$\langle\phi(2t - k), \phi(2t - k)\rangle = \tfrac12\langle\phi(t), \phi(t)\rangle = \tfrac12,$$

the functions $\sqrt2\,\phi(2t - k)$ are normalized and orthogonal. We write (B.4) in the form

$$\phi(t) = \sum_k h_k\sqrt2\,\phi(2t - k), \tag{B.8}$$

where $h_k = c_k/\sqrt2$. It follows from (B.8) that

$$\sum_k h_k^2 = 1. \tag{B.9}$$

Now we assume, in analogy to (B.8), that $\psi$ can be expressed in terms of the scaling function as

$$\psi(t) = \sum_k g_k\sqrt2\,\phi(2t - k). \tag{B.10}$$

Since $\phi \in V_0$ and $\psi \in V_0^\perp$ we have

$$\langle\phi, \psi\rangle = \sum_k h_k g_k = 0. \tag{B.11}$$

Similarly,

$$\langle\psi(t - k), \psi(t - m)\rangle = \sum_i g_i g_{i-2(k-m)} = \delta_{k,m}. \tag{B.12}$$

A set of coefficients that satisfies (B.11) and (B.12) is given by $g_k = (-1)^k h_{1-k}$, so we obtain, finally, the relationship between the wavelets and the scaling function:

$$\psi(t) = \sum_k(-1)^k h_{1-k}\sqrt2\,\phi(2t - k) = \sum_k(-1)^k c_{1-k}\phi(2t - k). \tag{B.13}$$
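A quick consistency check (added here, not part of the original derivation): for the Haar coefficients $c_0 = c_1 = 1$ and $c_k = 0$ otherwise, the only nonvanishing terms in (B.13) are $k = 0$ and $k = 1$, giving

$$\psi(t) = c_1\phi(2t) - c_0\phi(2t - 1) = \phi(2t) - \phi(2t - 1),$$

which reproduces the Haar wavelet (B.7).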
B.6 The Daubechies wavelet

The Daubechies scaling function is derived according to the following two requirements on an MRA:

1. Compact support: The scaling function $\phi(t)$ is required to be zero outside the interval $0 \le t \le 3$. This means that the refinement coefficients $c_k$ vanish for $k < 0$ and $k > 3$. To see this, note that

$$c_{-3} = 2\langle\phi(t), \phi(2t + 3)\rangle = 2\int_0^3\phi(t)\phi(2t + 3)\,dt = 0,$$

and similarly for $k = -4, -5 \ldots$ and for $k = 6, 7 \ldots$. Therefore, from the dilation equation,

$$\phi(-1/2) = 0 = c_{-2}\phi(-1 + 2) + c_{-1}\phi(-1 + 1) + \ldots \;\Longrightarrow\; c_{-2} = 0,$$

and similarly for $k = -1, 4, 5$. Thus from (B.5) we can conclude that

$$c_0^2 + c_1^2 + c_2^2 + c_3^2 = 2 \tag{B.14}$$

and from (B.6), with $j = 1$, that

$$c_0 c_2 + c_1 c_3 = 0. \tag{B.15}$$

In addition, again from the dilation equation, we can write

$$\int_{-\infty}^{\infty}\phi(t)\,dt = \sum_{k=0}^3 c_k\int_{-\infty}^{\infty}\phi(2t - k)\,dt = \sum_{k=0}^3\frac{c_k}{2}\int_{-\infty}^{\infty}\phi(u)\,du.$$

But one can show that an MRA implies $\int_{-\infty}^{\infty}\phi(t)\,dt \ne 0$, so we have

$$c_0 + c_1 + c_2 + c_3 = 2. \tag{B.16}$$

2. Regularity: All constant and linear polynomials can be written as a linear combination of the basis $\{\phi(t - k) \mid k \in \mathbb{Z}\}$ for $V_0$. This implies that there is no residual in the orthogonal decomposition of $f(t) = 1$ and $f(t) = t$ onto the basis, that is,

$$\int_{-\infty}^{\infty}\psi(t)\,dt = \int_{-\infty}^{\infty}t\psi(t)\,dt = 0. \tag{B.17}$$

With (B.13) the mother wavelet is

$$\psi(t) = -c_0\phi(2t - 1) + c_1\phi(2t) - c_2\phi(2t + 1) + c_3\phi(2t + 2). \tag{B.18}$$

The first requirement in (B.17) gives immediately

$$-c_0 + c_1 - c_2 + c_3 = 0. \tag{B.19}$$

From the second requirement we have

$$0 = \int_{-\infty}^{\infty}t\psi(t)\,dt = \sum_{k=0}^3(-1)^{k+1}c_k\int_{-\infty}^{\infty}t\phi(2t - 1 + k)\,dt = \sum_{k=0}^3(-1)^{k+1}c_k\int_{-\infty}^{\infty}\frac{u + 1 - k}{4}\phi(u)\,du$$
$$= \frac{0}{4}\int_{-\infty}^{\infty}u\phi(u)\,du + \frac{-c_0 + c_2 - 2c_3}{4}\int_{-\infty}^{\infty}\phi(u)\,du,$$
using (B.19). Thus

$$-c_0 + c_2 - 2c_3 = 0. \tag{B.20}$$

Equations (B.14), (B.15), (B.16), (B.19) and (B.20) comprise a system of five equations in four unknowns. A solution is given by

$$c_0 = \frac{1 + \sqrt3}{4}, \quad c_1 = \frac{3 + \sqrt3}{4}, \quad c_2 = \frac{3 - \sqrt3}{4}, \quad c_3 = \frac{1 - \sqrt3}{4},$$

which are known as the D4 refinement coefficients. Figure B.7 shows the corresponding scaling function, determined with the fixpoint method described earlier.

[Figure B.7: The fixpoint approximation of the Daubechies scaling function to order n = 4.]

Example: image compression

The following program, adapted from the IDL Reference Guide, uses the Daubechies D4 wavelet to compress a gray-scale satellite image. It displays the original and compressed images and determines the file size of the compressed image. The next section discusses the implementation of the wavelet transformation in terms of a filter bank.

; Image compression with Daubechies wavelet
; read a bitmap image and cut out a 512x512 pixel array
cd, 'c:idlprojectsimage analysis'
filename = Dialog_Pickfile(Filter='*.bmp',/Read)
image1 = Read_BMP(filename)
image = bytarr(512,512)
image[*,*] = image1[1,0:511,0:511]
; display cutout and size
window,0,xsize=512,ysize=512
window,1,xsize=512,ysize=512
    B.7. WAVELETS ANDFILTER BANKS 143 wset, 0 tv, bytscl(image) print, ’Size of original image is’, 512*512L, ’ bytes’ ; perform wavelet transform with D4 wavlet wtn_image = wtn(image, 4) ; convert to sparse array with threshold 20 and write to disk sparse_image = sprsin(wtn_image,thresh=20) write_spr, sparse_image, ’sparse.dat’ openr, 1, ’sparse.dat’ status = fstat(1) close, 1 print, ’Size of compressed image is’, status.size, ’ bytes’ ; reconstruct full array, do inverse wavelet transform and display wset,1 tv, bytscl(wtn(fulstr(sparse_image), 4, /inverse)) end B.7 Wavelets and filter banks In the case of the Haar wavelets we were able to carry out the wavelet transformation with vectors and matrices. In general, we can’t represent scaling functions in this way. In fact usually all that we have to work with are the refinement coefficients. So how can we perform the wavelet transformation? To answer this question, consider a row of pixels (s(0), s(1) . . . s(m − 1)) in a satellite image, where m = 2n , and the associated vector signal on [0, 1] given by s = (s0, s1 . . . sm−1) = (s(0/(m − 1)), s(1/(m − 1)) . . . s(1)) . In the MRA generated by a scaling function φ, such as D4, this signal defines a function fn(t) ∈ Vn on the interval [0, 1] according to fn(t) = m−1 j=0 sjφn,j = m−1 j=0 sjφ(2n t − j). (B.21) Assume that the basis functions are appropriately normalized. The projection of fn(t) onto Vn−1 is then fn−1(t) = m/2−1 k=0 fn, φ(2n−1 t − k) φ(2n−1 t − k) = m/2−1 k=0 (Hs)kφ(2n−1 t − k), where Hs = fn, φ(2n−1 t) , fn, φ(2n−1 t − 1) . . . fn, φ(2n−1 t − m/2 − 1)
    144 APPENDIX B.THE DISCRETE WAVELET TRANSFORMATION is the signal vector in Vn−1. The operator H is interpreted as a low-pass filter. It averages the original signal s and reduces its length by a factor of two. We have, using (B.21), (Hs)k = m−1 j=0 sj φ(2n t − j), φ(2n−1 t − k) . From the dilation equation with normalized basis functions, φ(2n−1 t − k) = k hk φ(2n t − 2k − k ), so we can write (Hs)k = m−1 j=1 sj k hk φ(2n t − j), φ(2n t − 2k − k ) = m−1 j=1 sj k hk δj,k +2k. Therefore (Hs)k = m−1 j=0 hj−2ksj, k = 0 . . . m 2 − 1 = 2n−1 − 1. (B.22) For the Daubechies scaling function, h0 = 1 + √ 3 4 √ 2 , h1 = 3 + √ 3 4 √ 2 , h2 = 3 − √ 3 4 √ 2 , h3 = 1 − √ 3 4 √ 2 , h4 = 0, . . . . Thus the elements of the filtered signal are (Hs)0 = h0s0 + h1s1 + h2s2 + h3s3 (Hs)1 = h0s2 + h1s3 + h2s4 + h3s5 (Hs)3 = h0s4 + h1s5 + h2s6 + h3s7 ... This is just the convolution of the filter H = (h3, h2, h1, h0) with the signal s, Hs = H ∗ s, see Eq. (2.12), except that only every second term is retained. This is referred to as downsampling and is illustrated in Figure B.8. In the same way, the high-pass filter G projects fn(t) onto the orthogonal subspace V ⊥ n−1 according to (Gs)k = m−1 j=0 gj−2ksj, k = 0 . . . m 2 − 1 = 2n−1 − 1. (B.23) Recall that gk = (−1)k h1−k
    B.7. WAVELETS ANDFILTER BANKS 145 s HsH ↓ 2 Figure B.8: Schematic representation of the filter H. The symbol ↓ 2 indicates downsampling by a factor of two. so that the nonzero high-pass filter coefficients are actually g−2 = h3, g−1 = −h2, g0 = h1, g1 = −h0. The concatenated signal (Hs, Gs) = (s1 , d1 ) is the projection of fn onto Vn−1 ⊕ V ⊥ n−1. It has the same length as the original signal s and is an alternative representation of that signal. Its generation is illustrated in Figure B.9 as a filter bank. s H ↓ 2 G ↓ 2 d1 s1 Figure B.9: Schematic representation of the filter bank H, G. The projections can be repeated on s1 = Hs to obtain the projection (Hs1 , Gs1 , Gs) = (s2 , d2 , d1 ) onto Vn−2 ⊕ V ⊥ n−2 ⊕ V ⊥ n−1 and so on. The original signal can be reconstructed at any stage by applying the inverse operators H∗ and G∗ . For the first stage these are defined by (H∗ s1 )k = m/2−1 j=0 hk−2js1 j , k = 0 . . . m − 1 = 2n − 1, (B.24) (G∗ d1 )k = m/2−1 j=0 gk−2jd1 j , k = 0 . . . m − 1 = 2n − 1, (B.25) with analagous definitions for the other stages. To understand what’s happening, consider
    146 APPENDIX B.THE DISCRETE WAVELET TRANSFORMATION the elements of the filtered signal (B.24). These are (H∗ s1 )0 = h0s1 0 (H∗ s1 )1 = h1s1 0 (H∗ s1 )2 = h2s1 0 + h0s1 1 (H∗ s1 )3 = h3s1 0 + h1s1 1 (H∗ s1 )4 = h2s1 1 + h0s1 2 (H∗ s1 )5 = h3s1 1 + h1s1 2 ... This is just the convolution of the filter H∗ = (h0, h1, h2, h3) with the signal s1 0, 0, s1 1, 0, s1 2, 0 . . . s1 m/2−1, 0. This is called the upsampled signal. The filter (B.24) can be represented schematically as in Figure B.10. s1 H∗ s1↑ 2 H∗ Figure B.10: Schematic representation of the filter H∗ . The symbol ↑ 2 indicates upsampling by a factor of two. Equation (B.25) is interpreted in a similar way. Finally we add the two results to get the original signal: H∗ s1 + G∗ d1 = s. To see this, write the equation out for a particular value of k: (H∗ s1 )k + (G∗ d1 )k = m/2−1 j=0 hk−2j   m−1 j =0 hj −2jsj + gk−2j m−1 j =0 gj −2jsj   Combining terms and interchanging the summations, we get (H∗ s1 )k + (G∗ d1 )k = m−1 j =0 sj m/2−1 j=0 [hk−2jhj −2j + gk−2jgj −2j]. Now, using gk = (−1)k h1−k, (H∗ s1 )k + (G∗ d1 )k = m−1 j =0 sj m/2−1 j=0 [hk−2jhj −2j + (−1)k+j h1−k+2jh1−j +2j].
    B.7. WAVELETS ANDFILTER BANKS 147 With the help of (B.5) and (B.6) it is easy to show that the second summation above is just δj k. For example, suppose k is even. Then m/2−1 j=0 [hk−2jhj −2j + (−1)k+j h1−k+2jh1−j +2j] = h0hj −k + h2hj −k+2 + (−1)j [h1h1−j +k + h3h3−j +k]. If j = k, the right hand side reduces to h2 0 + h2 1 + h2 2 + h2 3 = 1, from (B.5) and hk = ck/ √ 2. For any other value of j , the expression is zero. Therefore we can write (H∗ s1 )k + (G∗ d1 )k = m−1 j =0 sj δj k = sk, as claimed. The reconstruction of the original signal from s1 and d1 is shown in Figure B.11 as a synthesis bank. s1 ↑ 2 H∗ ↑ 2 G∗ s d1 + Figure B.11: Schematic representation of the synthesis bank H∗ , G∗ . The extension of the procedure to two-dimensional signals (e.g. satellite imagery) is straightforward, see [Mal89]. Figure B.12 shows a single application of the filters H and G to the rows and columns of a satellite image. The image is a signal which defines a two- dimensional function f in V10 ⊗ V10. The Daubechies D4 refinement coefficients are used to generate the filters. The result of the low-pass filter is in the upper left hand quadrant. This is the projection of f onto V9 ⊗ V9. The other three quadrants represent the projections onto the orthogonal subspaces V ⊥ 9 ⊗V9, V9 ⊗V ⊥ 9 and V ⊥ 9 ⊗V ⊥ 9 . The following IDL program illustrates the procedure. ; the Daubechies kernels H = [1-sqrt(3),3-sqrt(3),3+sqrt(3),1+sqrt(3)]/(4*sqrt(2)) G = [-H[0],H[1],-H[2],H[3]] ; arrays for wavelet coefficients f0 = fltarr(512,512) f1 = fltarr(512,256) g1 = fltarr(512,256) ff1 = fltarr(256,256)
    148 APPENDIX B.THE DISCRETE WAVELET TRANSFORMATION Figure B.12: Wavelet projection of a satellite image with (2.36) and(2.37). fg1 = fltarr(256,256) gf1 = fltarr(256,256) gg1 = fltarr(256,256) ; read a bitmap image and cut out a 512x512 pixel array filename = Dialog_Pickfile(Filter=’*.bmp’,/Read) image = Read_BMP(filename) ; 24 bit image, so get first layer f0[*,*] = image[1,0:511,0:511] ; display cutout window,0,xsize=512,ysize=512 wset, 0 tv, bytscl(f0) ; filter columns and downsample ds = findgen(256)*2 for i=0,511 do begin temp = convol(transpose(f0[i,*]),H,center=0,/edge_wrap) f1[i,*] = temp[ds] temp = convol(transpose(f0[i,*]),G,center=0,/edge_wrap)
    B.7. WAVELETS ANDFILTER BANKS 149 g1[i,*] = temp[ds] endfor ; filter rows and downsample for i=0,255 do begin temp = convol(f1[*,i],H,center=0,/edge_wrap) ff1[*,i] = temp[ds] temp = convol(f1[*,i],G,center=0,/edge_wrap) fg1[*,i] = temp[ds] temp = convol(g1[*,i],H,center=0,/edge_wrap) gf1[*,i] = temp[ds] temp = convol(g1[*,i],G,center=0,/edge_wrap) gg1[*,i] = temp[ds] endfor f0[0:255,256:511]=bytscl(ff1[*,*]) f0[0:255,0:255]=bytscl(gf1[*,*]) f0[256:511,0:255]=bytscl(gg1[*,*]) f0[256:511,256:511]=bytscl(fg1[*,*]) ; output as EPS file thisDevice =!D.Name set_plot, ’PS’ Device, Filename=’c:temppyramid.eps’,xsize=10,ysize=10,/Encapsulated tv,f0 device,/close_file set_plot,thisDevice end
Appendix C

Advanced Neural Network Training Algorithms

The standard backpropagation algorithm introduced in Chapter 10 is notoriously slow to converge. In this Appendix we will develop two additional training algorithms for the two-layer, feed-forward neural network of Figure 10.4. The first of these, scaled conjugate gradient, makes use of the second derivatives of the cost function with respect to the synaptic weights, i.e. the Hessian matrix. The second, the Kalman filter method, takes advantage of the statistical properties of the weight parameters. Both techniques are considerably more efficient than backpropagation.

C.1 The Hessian matrix

The Hessian matrix was introduced in Chapter 10 as

$$H_{ij} = \frac{\partial^2 E(\mathbf{w})}{\partial w_i\partial w_j}. \tag{C.1}$$

It is the (symmetric) matrix of second order partial derivatives of the cost function $E(\mathbf{w})$ with respect to the synaptic weights, the latter thought of as a single column vector

$$\mathbf{w} = \begin{pmatrix} \mathbf{w}^h_1 \\ \vdots \\ \mathbf{w}^h_L \\ \mathbf{w}^o_1 \\ \vdots \\ \mathbf{w}^o_M \end{pmatrix}$$

of length $n_w = L(N + 1) + M(L + 1)$. Since $H$ is symmetric, its eigenvectors $\mathbf{u}_i$, $i = 1 \ldots n_w$, are orthogonal and any vector $\mathbf{v}$ in the space of the synaptic weights can be expressed as a linear combination of them, e.g.

$$\mathbf{v} = \sum_{i=1}^{n_w}\beta_i\mathbf{u}_i.$$
    152 APPENDIX C.ADVANCED NEURAL NETWORK TRAINING ALGORITHMS But then we have v Hv = v i βiλiui = i β2 i λi, and we conclude that H is positive definite if and only of all of its eigenvalues λi are positive. Thus a good way to check if one is at or near a local minimum in the cost function is to examine the eigenvalues of the Hessian. The scaled conjugate gradient algorithm makes explicit use of the Hessian matrix for more efficient convergence to a minimum in the cost function. The disadvantage of using H is that it is difficult to compute efficiently. For example, for a typical classification problem with N = 3-dimensional input data, L = 8 hidden neurons and M = 12 land use categories, there are [L(N + 1) + M(L + 1)]2 = 19, 600 matrix elements to determine at each iteration. We develop in the following an efficient method to calculate not H directly, but rather the product v H for any vector v having nw components. Our approach follows Bishop [Bis95] closely. C.1.1 The R-operator Let us begin by summarizing some results of Chapter 10 for the two-layer, feed forward network: x = (x1 . . . xN ) input observation y = (0 . . . 1 . . . 0) class label x = (1, x ) biased input observation Ih = Wh x activation vector for the hidden layer n = gh (Ih ) output signal vector from the hidden layer n = (1, n ) ditto with bias Io = Wo n activation vector for the output layer m = go (Io ) output vector from the network. (C.2) The corresponding activation functions are, for the hidden neurons, gh (Ih j ) = 1 1 + e−Ih j , j = 1 . . . L, (C.3) and for the output neurons, go (Io k) = eIo k M k =1 eIo k , k = 1 . . . M. (C.4) The first derivatives of the local cost function with respect to the output and hidden weights, (10.19) and (10.22), can be written concisely as ∂E ∂Wo = −nδo ∂E ∂Wh = −xδh , (C.5) where δo = y − m (C.6)
    C.1. THE HESSIANMATRIX 153 and 0 δh = n ⊗ (1 − n) ⊗ Wo δo . (C.7) With Bishop [Bis95] we introduce the R-operator according to Rv{·} := v ∂ ∂w , v = (v1 . . . vnw ). Obviously we have Rv{w} = j vj ∂w ∂wj = v. We adopt the convention that the result of applying the R-operator has the same structure as the argument to which it is applied. Thus for example Rv{Wh } = Vh , where Vh , like Wh , is an (N + 1) × L matrix consisting of the first (N + 1) × L components of the vector v. Next we derive an expression for v H in terms of the R-operator. (v H)j = nw i=1 viHij = nw i=1 vi ∂2 E ∂wi∂wj = nw i=1 vi ∂ ∂wi ∂E ∂wj or (v H)j = v ∂ ∂w ∂E ∂wj = Rv ∂E ∂wj , j = 1 . . . nw. Since v H is a row vector, this can be written v H = Rv ∂E ∂w ∼= Rv ∂E ∂Wh , Rv ∂E ∂Wo . (C.8) Note the reorganization of the structure in the argument of Rv, namely w → (Wh , Wo ). This is merely for convenience. Once the expressions on the right have been evaluated, the result must be “flattened” back to a row vector. Equation (C.1.1) is understood to involve the local cost function. In order to complete the calculation we must sum over all training pairs. Applying the chain rule to (C.5), Rv ∂E ∂Wo = −nRv{δo } − RV {n}δo Rv ∂E ∂Wh = −xRv{δh }, (C.9) so that, in order to evaluate (C.1.1), we only need expressions for Rv{n}, Rv{δo } and Rv{δh }.
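An added sanity check of (C.8), not in the original text: for a cost function that is exactly quadratic in the weights, $E(\mathbf{w}) = \frac12\mathbf{w}^TH\mathbf{w}$ with constant symmetric $H$, the gradient is $\partial E/\partial\mathbf{w} = H\mathbf{w}$ and

$$R_{\mathbf{v}}\Bigl\{\frac{\partial E}{\partial\mathbf{w}}\Bigr\} = \sum_j v_j\frac{\partial}{\partial w_j}(H\mathbf{w}) = H\mathbf{v},$$

whose transpose is $\mathbf{v}^TH$, exactly as (C.8) asserts; the product is obtained without ever forming $H$ itself. An implementation can also be spot-checked with the finite difference $\mathbf{v}^TH \approx [\mathbf{g}(\mathbf{w} + \varepsilon\mathbf{v}) - \mathbf{g}(\mathbf{w})]^T/\varepsilon$ for small $\varepsilon$, where $\mathbf{g}$ denotes the gradient of the cost function (again an added remark, not from the text).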
    154 APPENDIX C.ADVANCED NEURAL NETWORK TRAINING ALGORITHMS Determination of Rv{n} From (C.2) we can write Rv{n} = 0 Rv{n } (C.10) and, from the chain rule, Rv{n } = n ⊗ (1 − n ) ⊗ Rv{Ih } (C.11) and Rv{Ih } = Vh x. (C.12) Note that, according to our convention, Vh is interpreted as an L × (N + 1)-dimensional matrix, since the argument Ih is a vector of length L. Determination of Rv{δo } With (C.6) and (C.2) we get Rv{δo } = −Rv{m} = −v ∂m ∂w = −go (Io ) ⊗ Rv{Io }, where the prime denotes differentiation, or Rv{δo } = −m ⊗ (1 − m) ⊗ Rv{Io }. (C.13) Again with (C.2) we determine Rv{Io } = Wo Rv{n} + Vo n, (C.14) where Rv{n} is determined by (C.10–12). Determination of Rv{δh } We begin by writing (C.7) in the form 0 δh = 0 gh (Ih ) ⊗ Wo δo . Operating with Rv{·}, 0 Rv{δh } = 0 gh (Ih ) ⊗ 0 Rv{Ih } ⊗ Wo δo + 0 gh (Ih ) ⊗ Vo δo + 0 gh (Ih ) ⊗ Wo Rv{δo }. Now we use the derivatives of the activation function gh (Ih ) = n (1 − n ) gh (Ih ) = n (1 − n )(1 − 2n ) to obtain 0 Rv{δh } = n ⊗ (1 − n) ⊗ (1 − 2n) ⊗ 0 Rv{Ih } ⊗ Wo δo + Vo δo + Wo Rv{δo } (C.15) in which all of the terms on the right have now been determined. As already mentioned, equation () has been evaluated in terms of the local cost function. The final step in the calculation involves summing over all of the training pairs.
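To keep the bookkeeping concrete (an added example): with the network dimensions quoted earlier, $N = 3$, $L = 8$ and $M = 12$, the flattened weight vector has $n_w = L(N+1) + M(L+1) = 32 + 108 = 140$ components. Correspondingly, $R_{\mathbf{v}}\{\partial E/\partial W^h\}$ has the shape of $W^h$, i.e. $(N+1)\times L = 4\times 8 = 32$ entries, and $R_{\mathbf{v}}\{\partial E/\partial W^o\}$ the shape of $W^o$, i.e. $(L+1)\times M = 9\times 12 = 108$ entries; flattening and concatenating the two pieces returns the 140-component row vector $\mathbf{v}^TH$ of (C.8).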
    C.1. THE HESSIANMATRIX 155 C.1.2 Calculating the Hessian To calculate the Hessian matrix for the neural network, we evaluate (C.1.1) successively for the vectors v1 = (1, 0, 0 . . . 0) . . . vnw = (0, 0, 0 . . . 1) and build up H row for row: H =    v1 H ... vnw H    . The following excerpt from the IDL program FFNCG DEFINE (see Appendix D) imple- ments a vectorized version of the preceding calculation of v H and H: Function FFNCG::Rop, V nw = self.LL*(self.NN+1)+self.MM*(self.LL+1) ; reform V to dimensions of Wh and Wo and transpose VhT = transpose(reform(V[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1)) Vo = reform(V[self.LL*(self.NN+1):*],self.MM,self.LL+1) VoT = transpose(Vo) ; transpose the weights WhT = transpose(*self.Wh) Wo = *self.Wo WoT = transpose(Wo) ; vectorized forward pass X = *self.Xs Zeroes = fltarr(self.p) Ones = Zeroes + 1.0 N = [[Ones],[1/(1+exp(-WhT##X))]] Io = WoT##N maxIo = max(Io,dimension=2) for k=0,self.MM-1 do Io[*,k]=Io[*,k]-maxIo A = exp(Io) sum = total(A,2) M = fltarr(self.p,self.MM) for k=0,self.MM-1 do M[*,k] = A[*,k]/sum ; evaluation of v^T.H D_o = *self.Ys-M ; d^o RIh = VhT##X ; Rv{I^h} RN = N*(1-N)*[[Zeroes],[RIh]] ; Rv{n} RIo = WoT##RN + VoT##N ; Rv{I^o} Rd_o = -M*(1-M)*RIo ; Rv{d^o} Rd_h = N*(1-N)*((1-2*N)*[[Zeroes],[RIh]]*(Wo##D_o) + Vo##D_o + Wo##Rd_o) Rd_h = Rd_h[*,1:*] ; Rv{d^h} REo = -N##transpose(Rd_o)-RN##transpose(D_o) ; Rv{dE/dWo} REh = -X##transpose(Rd_h) ; Rv{dE/dWh} return, [REh[*],REo[*]] ; v^T.H End
    156 APPENDIX C.ADVANCED NEURAL NETWORK TRAINING ALGORITHMS Function FFNCG::Hessian nw = self.LL*(self.NN+1)+self.MM*(self.LL+1) v = diag_matrix(fltarr(nw)+1.0) H = fltarr(nw,nw) for i=0,nw-1 do H[*,i] = self - Rop(v[*,i]) return, H End C.2 Scaled conjugate gradient training The backpropagation algorithm of Chapter 10 attempts to minimize the cost function locally, that is, weight updates are made immediately after presentation of a single training pair to the network. We will now consider a global approach aimed at minimization of the full cost function (10.15), which we denote in the following E(w). The symbol w is, as before, the nw-component vector of synaptic weights. Now let the gradient of the cost function at the point w be g(w), i.e. g(w) i = ∂ ∂wi E(w), i = 1 . . . nw. The Hessian matrix (H)ij = ∂2 E(w) ∂wi∂wj i, j = 1 . . . nw can then be expressed conveniently as the outer product H = ∂ ∂w g(w) . (C.16) C.2.1 Conjugate directions The search for a minimum in the cost function can be visualized as a series of points in the space of synaptic weight parameters, w1 , w2 . . . wk−1 , wk , wk+1 . . . , whereby the point wk is determined by minimizing E(w) along some search direction dk−1 which originated at the preceding point wk−1 . This is illustrated in Figure C.1 and corre- sponds to the vector equation wk = wk−1 + αk−1dk−1 . (C.17) Here dk−1 is a unit vector along the chosen search direction and the scalar αk−1 minimizes the cost function along that direction: αk−1 = arg min α E wk−1 + α · dk−1 . (C.18) If, starting from wk , we now wish to take the next minimizing step in the weight space, it is not efficient simply to choose, as in backpropagation, the direction of the local gradient g(wk) at the new starting point wk . It follows namely from (C.18) that ∂ ∂α E wk−1 + α · dk−1 α=αk−1 = 0
    C.2. SCALED CONJUGATEGRADIENT TRAINING 157 0 BE c ‚wk−1 dk−1 wk g(wk ) αk−1 dk ? Figure C.1: Search directions in weight space. or ∂ ∂w E wk−1 + αk−1dk−1 dk−1 = g(wk ) dk−1 = 0. (C.19) The gradient g(wk ) at the new point wk is thus always orthogonal to the preceding search direction dk−1 . This is indicated in Figure C.1. Since the algorithms has just succeeded in reducing the gradient of the cost function along dk−1 to zero, we would prefer to choose the search direction dk so that the component of the gradient along the old search direction remains as small as possible. Otherwise we are undoing what we have just accomplished. Therefore we choose dk according to the condition g wk + α · dk dk−1 = 0. But to first order in α we have, with (C.16), g wk + α · dk = g(wk ) + α · dk ∂ ∂w g(wk ) = g(wk ) + α · dk H and the above condition is, with (C.19), equivalent to dk Hdk−1 = 0. (C.20) Directions which satisfy Equation (C.20) are referred to as conjugate directions. C.2.2 Minimizing a quadratic function Of course the neural network cost function is not quadratic in the synaptic weights. However within a sufficiently small region of weight space it can be approximated as a quadratic function. We describe in the following an efficient procedure to find the global minimum of a quadratic function of w having the general form E(w) = E0 + b w + 1 2 w Hw, (C.21)
    158 APPENDIX C.ADVANCED NEURAL NETWORK TRAINING ALGORITHMS where b and H are constant and the matrix H is positive definite. The local gradient at the point w is given by g(w) = ∂ ∂w E(w) = b + Hw, and at the global minimum w∗ , b + Hw∗ = 0. (C.22) Now let {dk | k = 1 . . . nw} be a set of conjugate directions satisfying (C.20),1 dk Hd = 0 for k = , k, = 1 . . . nw. (C.23) The search directions dk are linearly independent. In order to demonstrate this let us assume the contrary, that is, that there exists an index k and constants αk , k = k, not all of which are zero, such that dk = nw k =1 k =k αk dk . But from (C.23) we have at once αk dk Hdk = 0 for k = k and, since H is positive definite, αk = 0 for k = k. The assumption thus leads to a contradiction and the dk are indeed linearly independent. The conjugate directions thus constitute a (non-orthogonal) vector basis for the entire weight space. In the search for the global minimum suppose we begin at an arbitrary point w1 and express the vector w∗ − w1 spanning the distance to the global minimum as a linear com- bination of the basis vectors dk : w∗ − w1 = nw k=1 αkdk . (C.24) Further, define wk = w1 + k−1 =1 α d (C.25) and split (C.24) up into nw steps wk+1 = wk + αkdk , k = 1 . . . nw. (C.26) At the kth step the search starts at the point wk and proceeds a distance αk along the conjugate direction dk . After nw such steps the global minimum w∗ is reached, since from (C.24–C.26) it follows that w∗ = w1 + nw k=1 αkdk = wnw + αnw dnw = wnw+1 . 1It can be shown that such a set always exists, see e.g. [Bis95].
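A small added illustration of conjugate directions in two dimensions: let $H = \begin{pmatrix}2 & 1\\ 1 & 2\end{pmatrix}$ and $\mathbf{d}^1 = (1, 0)^T$. Condition (C.20) requires $\mathbf{d}^{2T}H\mathbf{d}^1 = 2d^2_1 + d^2_2 = 0$, so for example $\mathbf{d}^2 = (1, -2)^T$. The two directions are conjugate with respect to $H$ but not orthogonal to one another ($\mathbf{d}^{2T}\mathbf{d}^1 = 1 \ne 0$), which is why the conjugate directions constitute a non-orthogonal basis of the weight space.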
    C.2. SCALED CONJUGATEGRADIENT TRAINING 159 We get the necessary step sizes αk from (C.24) by multiplying from the left with d H, d Hw∗ − d Hw1 = nw k=1 αkd Hdk . From (C.22) and (C.23) we can write this as −d (b + Hw1 ) = α d Hd , and an explicit formula for the step sizes is given by α = − d (b + Hw1 ) d Hd , = 1 . . . nw. But with (C.24) and (C.25), dk Hwk = dk Hw1 + 0, and therefore, replacing index k by , d Hw = d Hw1 . The step lengths are thus α = − d (b + Hw ) d Hd , = 1 . . . nw. Finally, using the notation g = g(w ) = b + Hw and substituting → k, αk = − dk gk dk Hdk , k = 1 . . . nw. (C.27) For want of a better alternative we can choose the first search direction along the negative local gradient d1 = −g1 = − ∂ ∂w E(w1 ). (Note that d1 is not a unit vector.) We move according to (C.27) a distance α1 = d1 d1 d1 Hd1 along this direction to the point w2 , at which the local gradient g2 is orthogonal to d1 . We then choose the new conjugate search direction d2 as a linear combination of the two: d2 = −g2 + β1d1 or, at the kth step, dk+1 = −gk+1 + βkdk . (C.28)
    160 APPENDIX C.ADVANCED NEURAL NETWORK TRAINING ALGORITHMS We get the coefficient βk from (C.28) and (C.20) by multiplication on the left with dk H: 0 = −dk Hgk+1 + βkdk Hdk , from which follows βk = gk+1 Hdk dk Hdk . (C.29) Equations (C.26–C.29) constitute a recipe with which, starting at an arbitrary point w1 in weight space, the global minimum of the quadratic function (C.21) is found in precisely nw steps. C.2.3 The algorithm Returning now to the non-quadratic neural net cost function E(w) we will apply the above method to minimize it. We must take two things into consideration. First of all, the Hessian matrix H is neither constant nor everywhere positive definite. We will denote its local value at the point wl as Hk . When Hk is not positive definite it can happen that (C.27) leads to a step along the wrong direction – the numerator might turn out to be negative. Therefore we replace (C.27) with2 αk = − dk gk dk Hdk + λk|dk|2 , k = 1 . . . nw. (C.30) The constant λk is supposed to ensure that the denominator in (C.30) is always positive. It is initialized for k = 1 with a small numerical value. If, at the kth iteration, it is determined that δk := dk Hdk + λk(dk )2 0, then λk is replaced by the larger value ¯λk given by ¯λk = 2 λk − δk |dk|2 . (C.31) This ensures that the denominator in (C.30) becomes positive again. Note that this increase in λk has the effect of decreasing the step size αk, as is apparent from (C.30). Second, we must take into account any deviation of the cost function from its local quadratic approximation. Such deviations are to be expected for large step sizes αk. As a measure of the quadricity of E(w) along the chosen step length we can use the ratio ∆k = − 2 E(wk ) − E(wk + αkdk ) αkdk gk . (C.32) 2This corresponds to the substitution Hk → Hk + λkI, where I is the identity matrix.
This quantity is precisely 1 for a strictly quadratic function like (C.21). Therefore we can use the following heuristic for the $(k+1)$st iteration:

if $\Delta_k > 3/4$, then $\lambda_{k+1} := \lambda_k/2$;
if $\Delta_k < 1/4$, then $\lambda_{k+1} := 4\lambda_k$;
else $\lambda_{k+1} := \lambda_k$.

In other words, if the local quadratic approximation looks good according to criterion (C.32), then the step size can be increased ($\lambda_{k+1}$ is reduced relative to $\lambda_k$). If this is not the case then the step size is decreased ($\lambda_{k+1}$ is made larger). All of which leads us finally to the following algorithm (see e.g. [Moe93]):

Algorithm (Scaled Conjugate Gradient)

1. Initialize the synaptic weights $\mathbf{w}$ with random numbers, set $k = 0$, $\lambda = 0.001$ and $\mathbf{d} = -\mathbf{g} = -\partial E(\mathbf{w})/\partial\mathbf{w}$.

2. Set $\delta = \mathbf{d}^TH\mathbf{d} + \lambda|\mathbf{d}|^2$. If $\delta < 0$, set $\lambda = 2(\lambda - \delta/|\mathbf{d}|^2)$ and $\delta = -\mathbf{d}^TH\mathbf{d}$. Save the current cost function $E_1 = E(\mathbf{w})$.

3. Determine the step size $\alpha = -\mathbf{d}^T\mathbf{g}/\delta$ and the new synaptic weights $\mathbf{w} = \mathbf{w} + \alpha\mathbf{d}$.

4. Calculate the quadricity $\Delta = -(E_1 - E(\mathbf{w}))/(\alpha\cdot\mathbf{d}^T\mathbf{g})$. If $\Delta < 1/4$, restore the old weights $\mathbf{w} = \mathbf{w} - \alpha\cdot\mathbf{d}$, set $\lambda = 4\lambda$, $\mathbf{d} = -\mathbf{g}$ and go to 2.

5. Set $k = k + 1$. If $\Delta > 3/4$ set $\lambda = \lambda/2$.

6. Determine the new local gradient $\mathbf{g} = \partial E(\mathbf{w})/\partial\mathbf{w}$ and the new search direction $\mathbf{d} = -\mathbf{g} + \beta\mathbf{d}$, whereby, if $k \bmod n_w \ne 0$, then $\beta = \mathbf{g}^TH\mathbf{d}/(\mathbf{d}^TH\mathbf{d})$, else $\beta = 0$.

7. If $E(\mathbf{w})$ is small enough, stop; else go to 2.

A few remarks on this algorithm:

• The integer $k$ counts the total number of iterations. Whenever $k \bmod n_w = 0$, exactly $n_w$ weight updates have been carried out and the minimum of a truly quadratic function would have been reached. This is taken as a good stage at which to restart the search along the negative local gradient $-\mathbf{g}$ rather than continuing along the current conjugate direction $\mathbf{d}$. One expects that approximation errors will gradually corrupt the determination of the conjugate directions, and the "fresh start" is intended to counter this.

• Whenever the quadricity condition is not fulfilled, i.e. whenever $\Delta < 1/4$, the last weight update is cancelled and the search is again restarted along $-\mathbf{g}$.

• Since the Hessian only occurs in the forms $\mathbf{d}^TH$ and $\mathbf{g}^TH$, it can be determined efficiently with the R-operator method.

Here is an excerpt from the object class FFNCG extending FFN, showing the training method which implements the scaled conjugate gradient algorithm:
    162 APPENDIX C.ADVANCED NEURAL NETWORK TRAINING ALGORITHMS Pro FFNCG::Train w = [(*self.Wh)[*],(*self.Wo)[*]] nw = n_elements(w) g = self-gradient() d = -g ; search direction, row vector k = 0L lambda = 0.001 window,12,xsize=600,ysize=400,title=’FFN(scaled conjugate gradient)’ wset,12 progressbar = Obj_New(’progressbar’, Color=’blue’, Text=’0’,$ title=’Training: epoch number...’,xsize=250,ysize=20) progressbar-start eivminmax = ’?’ repeat begin if progressbar-CheckCancel() then begin print,’Training interrupted’ progressbar-Destroy return endif d2 = total(d*d) ; d^2 dTHd = total(self-Rop(d)*d) ; d^T.H.d delta = dTHd+lambda*d2 if delta lt 0 then begin lambda = 2*(lambda-delta/d2) delta = -dTHd endif E1 = self-cost() ; E(w) (*self.cost_array)[k] = E1 dTg = total(d*g) ; d^T.g alpha = -dTg/delta dw = alpha*d w = w+dw *self.Wh = reform(w[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1) *self.Wo = reform(w[self.LL*(self.NN+1):*],self.MM,self.LL+1) E2 = self-cost() ; E(w+dw) Ddelta = -(E1-E2)/(alpha*dTg) ; quadricity if Ddelta lt 0.25 then begin w = w - dw ; undo change in the weights *self.Wh = reform(w[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1) *self.Wo = reform(w[self.LL*(self.NN+1):*],self.MM,self.LL+1) lambda = 4*lambda ; decrease step size d = -g ; restart along gradient end else begin k++ if Ddelta gt 0.75 then lambda = lambda/2 g = self-gradient() if k mod nw eq 0 then begin beta = 0 eivs = self-eigenvalues() eivminmax = string(min(eivs)/max(eivs),format=’(F10.6)’)
    C.3. KALMAN FILTERTRAINING 163 end else beta = total(self-Rop(g)*d)/dTHd d = beta*d-g plot,*self.cost_array,xrange=[0,k100],color=0,background=’FFFFFF’XL,$ ytitle=’cross entropy’,xtitle= $ ’Epoch [’+textoidl(’min(lambda)/max(lambda)=’)+eivminmax+’]’ endelse progressbar-Update,k*100/self.iterations,text=strtrim(k,2) endrep until k gt self.iterations End C.3 Kalman filter training In this Section we apply the recursive least squares method described in Appendix A to train the feed forward neural network of Figure 10.4. The appropriate cost function is the quadratic function (10.13) or, more specifically, its local version (10.14). 1 j L k ... ... ~q X b E 0 m( + 1) wo k 1 n1( + 1) nj( + 1) nL( + 1) n( + 1) Figure C.2: An isolated output neuron. We begin with consideration of the training process of an isolated neuron. Figure C.2 depicts an output neuron in the network during presentation of the + 1st training pair (x( + 1), y( + 1)). The neuron receives its input from the hidden layer (input vector n( + 1)) and generates the signal mk( + 1) = g(wo k n( + 1)) = ewo k n( +1) M k =1 ewo k n( +1 , k = 1 . . . M, which is compared to the desired output y( +1). It is easy to show that differentiation with respect to wo k yields ∂ ∂wo k g(wo k n( + 1)) = mk( + 1)(1 − mk( + 1))n( + 1). (C.33) and with respect to n, ∂ ∂n g(wo k n( + 1)) = mk( + 1)(1 − mk( + 1))wo k( + 1). (C.34)
    164 APPENDIX C.ADVANCED NEURAL NETWORK TRAINING ALGORITHMS C.3.1 Linearization We shall drop for the time being the indices on wo k, writing it simply as w. Let us call w( ) an approximation to the desired synaptic weight vector for our isolated output neuron, one which has been achieved so far in the training process, i.e. after presentation of the first training pairs. Then a linear approximation to mk( + 1) can be obtained by expanding in a first order Taylor series about the point w( ), m( + 1) ≈ g(w( ) n( + 1)) + ∂ ∂w g(w( ) n( + 1)) (w − w( )). With (C.34) we can then write m( + 1) ≈ ˆm( + 1) + ˆm( + 1)(1 − ˆm( + 1))n( + 1) (w − w( )), (C.35) where ˆm( + 1) is given by ˆm( + 1) = g(w( ) n( + 1)). With the definition of the linearized input A( + 1) = ˆm( + 1)(1 − ˆm( + 1))n( + 1) (C.36) we can write (C.35) in the form m( + 1) ≈ A( + 1)w + [ ˆm( + 1) − A( + 1)w( )]. The term in square brackets is - to first order - the error that arises from the fact that the neuron’s output signal is not simply proportional to w. If we neglect it altogether, then we get the linearized neuron output signal m( + 1) = A( )w. In order to calculate the synaptic weight vector w, we can now apply the theory of recursive least squares developed in Appendix A. We simply identify the parameter vector a with the synaptic weight vector w. We then have the least squares problem y y( + 1) = A A( + 1) w + . The Kalman filter equations for the recursive solution of this problem, Eq. (A.15), are unchanged: Σ +1 = I − K +1A( + 1) Σ K +1 = Σ A( + 1) A( + 1)Σ A( + 1) + 1 −1 . (C.37) while the recursive expression (A.14) for the parameter vector becomes w( + 1) = w( ) + K +1 y( + 1) − A( + 1)w( ) .
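A brief added observation on the linearized input (C.36): the factor $\hat m(\ell+1)\bigl(1 - \hat m(\ell+1)\bigr)$ is at most $1/4$, attained at $\hat m = 1/2$, and falls to about $0.01$ at $\hat m = 0.99$. A nearly saturated output neuron therefore has $A(\ell+1) \approx 0$, hence a small Kalman gain in (C.37) and an almost vanishing correction in (C.38), regardless of the size of the innovation $y(\ell+1) - \hat m(\ell+1)$.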
    C.3. KALMAN FILTERTRAINING 165 This can be improved somewhat by replacing the linear approximation to the system output A( + 1)w( ) by the actual output for the + 1st training observation, namely ˆm( + 1), so we have w( + 1) = w( ) + K +1 y( + 1) − ˆm( + 1) . (C.38) C.3.2 The algorithm The recursive calculation of w is depicted in Figure C.3. The input is the current weight vector w( ), its covariance matrix Σ and the output vector of the hidden layer n( + 1) obtained by propagating the next input observation x( + 1) through the network. After determining the linearized input A( + 1), Eq. (C.36), the Kalman gain K +1 and the new covariance matrix Σ +1 are calculated with (C.37). Finally, the weights are updated in (C.38) to give w( + 1) and the procedure is repeated. C.36 c E Q b T c c C.37 C.38 C.36 n( + 1) y( + 1) n( + 2) w( ) Σ A( + 1) K +1 w( + 1) A( + 2) Σ +1 E Figure C.3: Determination of the synaptic weights for an isolated neuron with the Kalman filter. To make our notation explicit for the output neurons, we substitute y( ) → yk( ) w( ) → wo k( ) ˆm( + 1) → ˆmk( + 1) = g wo k ( )n( + 1) A( + 1) → Ao k( + 1) = ˆmk( + 1)(1 − ˆmk( + 1))n( + 1) K → Ko k( ) Σ → Σo k( ), for k = 1 . . . M. Then (C.38) becomes wo k( + 1) = wo k( ) + Ko k( + 1) y( + 1) − ˆmk( + 1) , k = 1 . . . M. (C.39) Recalling that we wish to minimize the local quadratic cost function E( ) given by Eq. (10.14), note that the expression in square brackets above is in fact the negative derivative
    166 APPENDIX C.ADVANCED NEURAL NETWORK TRAINING ALGORITHMS of E( ) with respect to the output signal of the neuron, i.e. y( + 1) − ˆm( + 1) = − ∂E( ) ∂mk( ) so that wo k( + 1) = wo k( ) − Ko k( + 1) ∂E( ) ∂mk( ) ˆmk( +1) . (C.40) With this result, we can turn consideration to the hidden neurons, making the substitutions w( ) → wh j ( ) ˆm( + 1) → ˆnj( + 1) = g wh j ( )x( + 1) A( + 1) → Ah j ( + 1) = ˆnj( + 1)(1 − ˆnj( + 1))x( + 1) K → Kh j ( ) Σ → Σh j ( ), for j = 1 . . . L. Then, analogously to (C.40), the update equation for the weight vector of the jth hidden neuron is wh j ( + 1) = wh j ( ) − Kh j ( + 1) ∂E( + 1) ∂nj( + 1) ˆnj ( +1) . (C.41) To obtain the partial derivative in (C.41), we differentiate the cost function (10.14) ∂E( + 1) ∂nj( + 1) = − M k=1 (yk( + 1) − mk( + 1)) ∂mk(µ + 1) ∂nj(µ + 1) . From (C.34), noting that (wo k)j = Wo jk, we have ∂mk( + 1) ∂nj( + 1) = mk( + 1)(1 − mk( + 1))Wo jk( + 1) Combining the last two equations, ∂E( + 1) ∂nj( + 1) = − M k=1 (yk( + 1) − mk( + 1))mk( + 1)(1 − mk( + 1))Wo jk( + 1) which we can write more compactly as ∂E( + 1) ∂nj( + 1) = −Wo j·( + 1)βo ( + 1), (C.42) where Wo j· is the jth row of the output layer weight matrix, and where βo ( + 1) = (y( + 1) − m( + 1)) ⊗ m( + 1) ⊗ (1 − m( + 1)). The correct update relation for the weights of the jth hidden neuron is therefore wh j ( + 1) = wh j ( ) + Kh j ( + 1) Wo j·( + 1)βo ( + 1) . (C.43) Apart from initialization of the covariance matrices Σh j ( = 0) and Σo ( = 1), the Kalman training procedure has no adjustable parameters whatsoever. The covariance ma- trices are simply taken to be proportional to the corresponding identity matrices: Σh j (0) = ZIh , Σo k(0) = ZIo , Z 1, j = 1 . . . L, k = 1 . . . M, where Ih is the (N + 1) × (N + 1) and Io the (L + 1) × (L + 1) identity matrix. We choose Z = 100 and obtain
Algorithm (Kalman filter training)

1. Set $\ell = 0$, $\Sigma^h_j(0) = 100\cdot I^h$, $j = 1 \ldots L$, $\Sigma^o_k(0) = 100\cdot I^o$, $k = 1 \ldots M$, and initialize the synaptic weight matrices $W^h(0)$ and $W^o(0)$ with random numbers.

2. Choose a training pair $(\mathbf{x}(\ell+1), \mathbf{y}(\ell+1))$ and determine the hidden layer output vector

$$\hat{\mathbf{n}}(\ell+1) = \begin{pmatrix} 1 \\ g^h\bigl(W^h(\ell)^T\mathbf{x}(\ell+1)\bigr)\end{pmatrix}$$

and with it the quantities

$$\begin{aligned}
A^h_j(\ell+1) &= \hat n_j(\ell+1)\bigl(1 - \hat n_j(\ell+1)\bigr)\mathbf{x}(\ell+1)^T, \quad j = 1 \ldots L, \\
\hat m_k(\ell+1) &= g^o\bigl(\mathbf{w}^o_k(\ell)^T\hat{\mathbf{n}}(\ell+1)\bigr), \\
A^o_k(\ell+1) &= \hat m_k(\ell+1)\bigl(1 - \hat m_k(\ell+1)\bigr)\hat{\mathbf{n}}(\ell+1)^T, \quad k = 1 \ldots M,
\end{aligned}$$

and

$$\boldsymbol{\beta}^o(\ell+1) = \bigl(\mathbf{y}(\ell+1) - \hat{\mathbf{m}}(\ell+1)\bigr)\otimes\hat{\mathbf{m}}(\ell+1)\otimes\bigl(1 - \hat{\mathbf{m}}(\ell+1)\bigr).$$

3. Determine the Kalman gains for all of the neurons according to

$$\begin{aligned}
K^o_k(\ell+1) &= \Sigma^o_k(\ell)A^o_k(\ell+1)^T\bigl(A^o_k(\ell+1)\Sigma^o_k(\ell)A^o_k(\ell+1)^T + 1\bigr)^{-1}, \quad k = 1 \ldots M, \\
K^h_j(\ell+1) &= \Sigma^h_j(\ell)A^h_j(\ell+1)^T\bigl(A^h_j(\ell+1)\Sigma^h_j(\ell)A^h_j(\ell+1)^T + 1\bigr)^{-1}, \quad j = 1 \ldots L.
\end{aligned}$$

4. Update the synaptic weight matrices:

$$\begin{aligned}
\mathbf{w}^o_k(\ell+1) &= \mathbf{w}^o_k(\ell) + K^o_k(\ell+1)\bigl[y_k(\ell+1) - \hat m_k(\ell+1)\bigr], \quad k = 1 \ldots M, \\
\mathbf{w}^h_j(\ell+1) &= \mathbf{w}^h_j(\ell) + K^h_j(\ell+1)\bigl[W^o_{j\cdot}(\ell+1)\boldsymbol{\beta}^o(\ell+1)\bigr], \quad j = 1 \ldots L.
\end{aligned}$$

5. Determine the new covariance matrices:

$$\begin{aligned}
\Sigma^o_k(\ell+1) &= \bigl(I^o - K^o_k(\ell+1)A^o_k(\ell+1)\bigr)\Sigma^o_k(\ell), \quad k = 1 \ldots M, \\
\Sigma^h_j(\ell+1) &= \bigl(I^h - K^h_j(\ell+1)A^h_j(\ell+1)\bigr)\Sigma^h_j(\ell), \quad j = 1 \ldots L.
\end{aligned}$$

6. If the overall cost function (10.13) is sufficiently small, stop; else set $\ell = \ell + 1$ and go to 2.

This method was originally suggested by Shah and Palmieri [SP90], who called it the multiple extended Kalman algorithm (MEKA). Here is an excerpt from the object class FFNKAL extending FFN, showing the class method which implements the Kalman filter algorithm:

Pro FFNKAL::Train
; define update matrices for Wh and Wo
dWh = fltarr(self.LL,self.NN+1)
dWo = fltarr(self.MM,self.LL+1)
iter = 0L
iter100 = 0L
progressbar = Obj_New('progressbar', Color='blue', Text='0',$
   title='Training: exemplar number...',xsize=250,ysize=20)
    168 APPENDIX C.ADVANCED NEURAL NETWORK TRAINING ALGORITHMS progressbar-start window,12,xsize=600,ysize=400,title=’FFF(Kalman filter)’ wset,12 repeat begin if progressbar-CheckCancel() then begin print,’Training interrupted’ progressbar-Destroy return endif ; select exemplar pair at random ell = long(self.p*randomu(seed)) x=(*self.Xs)[ell,*] y=(*self.Ys)[ell,*] ; send it through the network m=self-forwardPass(x) ; error at output e=y-m ; loop over the output neurons for k=0,self.MM-1 do begin ; linearized input (column vector) Ao = m[k]*(1-m[k])*(*self.N) ; Kalman gain So = (*self.So)[*,*,k] SA = So##Ao Ko = SA/((transpose(Ao)##SA)[0]+1) ; determine delta for this neuron dWo[k,*] = Ko*e[k] ; update its covariance matrix So = So - Ko##transpose(Ao)##So (*self.So)[*,*,k] = So endfor ; update the output weights *self.Wo = *self.Wo + dWo ; backpropagated error beta_o =e*m*(1-m) ; loop over the hidden neurons for j=0,self.LL-1 do begin ; linearized input (column vector) Ah = X*(*self.N)[j+1]*(1-(*self.N)[j+1]) ; Kalman gain Sh = (*self.Sh)[*,*,j] SA = Sh##Ah Kh = SA/((transpose(Ah)##SA)[0]+1) ; determine delta for this neuron dWh[j,*] = Kh*((*self.Wo)[*,j+1]##beta_o)[0] ; update its covariance matrix Sh = Sh - Kh##transpose(Ah)##Sh (*self.Sh)[*,*,j] = Sh endfor ; update the hidden weights
    C.3. KALMAN FILTERTRAINING 169 *self.Wh = *self.Wh + dWh ; record cost history if iter mod 100 eq 0 then begin (*self.cost_array)[iter100]=alog10(self-cost()) iter100 = iter100+1 progressbar-Update,iter*100/self.iterations,text=strtrim(iter,2) plot,*self.cost_array,xrange=[0,iter100],color=0,background=’FFFFFF’XL,$ xtitle=’Iterations/100)’,ytitle=’log(cross entropy)’ end iter=iter+1 endrep until iter eq self.iterations progressbar-Destroy End
Appendix D

ENVI Extensions

D.1 Installation

To install the complete extension package:

1. Place the file cursor_motion.pro in your save_add directory. In File→Preferences→User Defined Motion Routine in the ENVI main menu enter: cursor_motion.
2. Place the remaining .PRO files anywhere in your IDL !PATH.
3. Place the files madviewhelp.pdf and aboutmadview.pdf in your IDL !HELP_PATH.
4. Under Preferences in the ENVI main menu select the tab User Defined Files. Under Envi Menu File and Display Menu File enter the paths to the two menu files envi.men and display.men provided in the package.
    172 APPENDIX D.ENVI EXTENSIONS D.2 Topographic modelling D.2.1 Calculating building heights CALC HEIGHT is an ENVI extension to determine height of vertical buildings in Quick- Bird/Ikonos images using rational function models (RFMs) provided with ortho-ready im- agery. It is invoked as Tools/Building Height from the ENVI display menu. Usage Load an RFM file in the CalcHeight window with File/Load RPC File (extension RPC or RPB). If a DEM is available for the scene, this can also be loaded with File/Load DEM File. A DEM is not required, however. Click on the bottom of a vertical structure to set the base height and then shift-click on the top of the structure. Press the CALC button to display the structure’s height, latitude, longitude and base elevation. The number in brackets next to the height is the minimum distance (in pixels) between the top pixel and a vertical line through the bottom pixel. It should be of the order of 1 or less. If no DEM is loaded, the base elevation is the average value for the whole scene. If a DEM is used, the base elevation is taken from it. The latitude and longitude are then orthorectified values. Source headers ;+ ; NAME: ; CALCHEIGHT ; PURPOSE: ; Determine height (and lat, long, elevation) of vertical buildings ; in QuickBird/Ikonos images using RPCs ; AUTHOR; ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; CalcHeight ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; COMMON BLOCKS: ; Shared, RPC, Cb, Rb, Ct, Rt, elev ; Cursor_Motion_C, dn, Cbtext, Rbtext, Cttext, Rttext ; RPC: structure with RPC camera model ; Cb, Rb: coordinates of building base ; Ct, Rt: coordinates of building top ; elev: elevation of base ; dn: display number
    D.2. TOPOGRAPHIC MODELLING173 ; Cbtext ... : Edit widgets ; DEPENDENCIES: ; ENVI ; CURSOR_MOTION ; ------------------------------------------------------------- ;+ ; NAME: ; CURSOR_MOTION ; PURPOSE: ; Cursor communication with ENVI image windows ; AUTHOR; ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Cursor_Motion, dn, xloc, yloc, xstart=xstart, ystart=ystart, event=event ; ARGUMENTS: ; dn: display number ; xloc,yloc: mouse position ; KEYWORDS ; xstart, ystart: display origin ; event: mouse event ; COMMON BLOCKS: ; Cursor_Motion_C, dn, Cbtext, Rbtext, Cttext, Rttext ; DEPENDENCIES: ; None ;-------------------------------------------------------------------------- D.2.2 Illumination correction C CORRECTION is an ENVI extension for local illumination correction for multispectral im- ages. It is invoked from the ENVI main menu as Topographic/Illumination Correction. Usage From the Choose image for correction menu select the (spectral/spatial subset of the) image to be corrected. Then in the C-correction parameters box enter the solar elevation and azimuth in degrees and, if desired, a new size for the kernel used for slope/aspect determination (default 9×9). In the Choose digital elevation file window select the corresponding DEM file. Finally in the Output corrected image box choose an output file name or select memory.
    174 APPENDIX D.ENVI EXTENSIONS Source headers ;+ ; NAME: ; C_CORRECTION ; PURPOSE: ; ENVI extension for c-correction for solar illumination in rough terrain ; Ref: D. Riano et al. IEEE Transactions on ; Geoscience and Remote Sensing, 41(5) 2003, 1056-1061 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; C_Correction ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ;------------------------------------------------------------------------
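The underlying c-correction multiplies each band by (cos θ_z + c)/(cos γ_i + c), where θ_z is the solar zenith angle, γ_i the local solar incidence angle computed from slope and aspect, and c the intercept-to-slope ratio of a regression of the band on cos γ_i (Teillet et al. [TGG82], Riano et al. [RCSA03]). The following fragment is only a per-pixel sketch of that idea, not the extension's code; it assumes the slope and aspect arrays and the solar angles are already available in radians.

; hedged sketch of the c-correction (not the C_CORRECTION extension itself)
function c_correction_sketch, band, slope, aspect, solar_zenith, solar_azimuth
; local solar incidence angle gamma_i
   cos_gamma = cos(slope)*cos(solar_zenith) + $
               sin(slope)*sin(solar_zenith)*cos(solar_azimuth - aspect)
; linear regression band = b + a*cos_gamma, so that c = b/a
   coeff = linfit(cos_gamma[*], float(band[*]))
   c = coeff[0]/coeff[1]
; apply the correction
   return, band*(cos(solar_zenith) + c)/(cos_gamma + c)
end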
D.3 Image registration

CONTOUR_MATCH is an ENVI extension for determination of ground control points (GCPs) for image-image registration. It is invoked from the ENVI main menu as

Map/Registration/Contour Matching

Usage

In the Choose base image band window enter a (spatial subset) of the base image. Then in the Choose warp image band window select the image to be warped. In the LoG sigma box choose the size of the Laplacian of Gaussian filter kernel. The default is 25 (σ = 2.5). Finally in the Save GCPs to ASCII menu enter a file name (extension .pts) for the GCPs. After the calculation, these can then be loaded and inspected in the usual ENVI image-image registration dialog.

Source headers

;+
; NAME:
; CONTOUR_MATCH
; PURPOSE:
; ENVI extension for extraction of ground control points for image-image registration
; Images may be already georeferenced, in which case GCPs are for fine adjustment
; Uses Laplacian of Gaussian filter and contour tracing to match closed contours
; Ref: Li et al, IEEE Transactions on Image Processing, 4(3) (1995) 320-334
; AUTHOR
; Mort Canty (2004)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; Contour_Match
; ARGUMENTS:
; Event (if used as a plug-in menu item)
; KEYWORDS:
; None
; DEPENDENCIES:
; ENVI
; CI_DEFINE
; PROGRESSBAR_DEFINE (FSC_COLOR)
;------------------------------------------------------------------------
;+
; NAME:
; CI__DEFINE
; PURPOSE:
; Find thin closed contours in an image band with combined Sobel-LoG filtering
; Ref: Li et al, IEEE Transactions on Image Processing, 4(3) (1995) 320-334
; AUTHOR
; Mort Canty (2004)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; ci = Obj_New(CI,image,sigma)
; ARGUMENTS:
; image: grayscale image band to be searched for contours
; sigma: Gaussian radius for LoG filter
; KEYWORDS
; None
; METHODS:
; GET_MAX_CONTOURS return maximum number of closed contours (8000)
; GET_MAX_LENGTH return maximum contour length (200)
; GET_CONTOUR_IMAGE return contour (filtered) image
; CLEAR_CONTOUR_IMAGE erase contour image
; TO_PIXELS read contour structure and return its pixel array
; TO_FILTERED_CC read contour structure and return filtered chain code
; TO_MOMENTS read contour structure and return Hu invariant moments
; WRITE_CONTOUR read contour structure and display on contour_image
; TRACE_CONTOUR search contour image for next closed contour
; DEPENDENCIES:
; None
; Contour structure
; c = { sp: intarr(2), $                   ; starting point coordinates
;       length: 0L, $                      ; number of pixels
;       closed: -1L, $                     ; set to zero while tracing, 1 when
;                                          ; closed, -1 when no more contours
;       code: bytarr(self.max_length), $   ; chain code
;       icode: bytarr(self.max_length) }   ; rotationally invariant chain code
; ---------------------------------------------------------------------
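The CI class can also be driven directly; the sketch below traces the closed contours of a single band and prints their Hu moments, which is the information CONTOUR_MATCH compares between the base and warp images. The return conventions of TRACE_CONTOUR and TO_MOMENTS are assumed from the header above, and the band variable is a placeholder.

; hedged sketch: list the closed contours of a band together with their
; Hu invariant moments (matching would compare these vectors between images)
sigma = 2.5
ci = Obj_New('CI', band, sigma)
c = ci->Trace_Contour()                 ; first closed contour, if any
while c.closed eq 1 do begin
   hu = ci->To_Moments(c)               ; rotation and scale invariant moments
   print, 'contour at ', c.sp, ' length ', c.length
   print, hu
   c = ci->Trace_Contour()              ; next closed contour
endwhile
Obj_Destroy, ci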
    D.4. IMAGE FUSION177 D.4 Image fusion D.4.1 DWT fusion ARSIS DWT is an ENVI extension for panchromatic sharpening with the discrete wavelet transform (DWT). It is invoked from the ENVI main menu as Transform/Image Sharpening/Wavelet(ARSIS Model)/DWT Usage In the Select low resolution multi-band input file window choose the (spatial/spectral subset of the) image to be sharpened. In the Select hi res input band window choose the corresponding panchromatic or high resolution image. Then in the ARSIS Fusion Output box select an output file name or memory. Source headers ;+ ; NAME: ; ARSIS_DWT ; PURPOSE: ; ENVI extension for panchromatic sharpening under ARSIS model ; with Mallat’s discrete wavelet transform and Daubechies wavelets ; Ref: Ranchin and Wald, Photogramm. Eng. Remote. Sens. ; 66(1), 2000, 49-61 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; ARSIS_DWT ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; DWT__DEFINE(PHASE_CORR) ; ORTHO_REGRESS ;------------------------------------------------------------------------ ;+ ; NAME: ; DWT__DEFINE ; PURPOSE: ; Discrete wavelet transform class using Daubechies wavelets ; for construction of pyramid representations of images, fusion etc. ; Ref: T. Ranchin, L. Wald, Photogammetric Engineering and ; Remote Sensing 66(1) (2000) 49-61.
; AUTHOR
; Mort Canty (2004)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; dwt = Obj_New(DWT,image)
; ARGUMENTS:
; image: grayscale image to be compressed
; KEYWORDS
; None
; METHODS:
; SET_COEFF: choose the Daubechies wavelet
; dwt->Set_Coeff, n ; n = 4,6,8,12
; SHOW_IMAGE: display the image pyramid in a window
; dwt->Show_Image, wn
; INJECT: overwrite upper left quadrant
; after phase correlation match if keyword pc is set (default)
; dwt->Inject, array, pc = pc
; SET_COMPRESSIONS: set the number of compressions
; dwt->Set_Compressions, nc
; GET_COMPRESSIONS: get the number of compressions
; nc = dwt->Get_Compressions()
; GET_NUM_COLS: get the number of columns in the compressed image
; cols = dwt->Get_Num_Cols()
; GET_NUM_ROWS: get the number of rows in the compressed image
; rows = dwt->Get_Num_Rows()
; GET_IMAGE: return the pyramid image
; im = dwt->Get_Image()
; GET_QUADRANT: get compressed image (as 2D array) or innermost
; wavelet coefficients as vector
; wc = dwt->Get_Quadrant(n) ; n = 0,1,2,3
; NORMALIZE_WC: normalize wavelet coefficients at all levels
; dwt->Normalize, a, b ; a, b are normalization parameters
; COMPRESS: perform a single compression
; dwt->Compress
; dwt->Inject, array, pc=pc
; EXPAND: perform a single expansion
; dwt->Expand
; DEPENDENCIES:
; PHASE_CORR
; ---------------------------------------------------------------------
;+
; NAME:
; PHASE_CORR
; PURPOSE:
; Returns relative offset [xoff,yoff] of two images using phase correlation
    D.4. IMAGE FUSION179 ; Maximum offset should not exceed +- 5 pixels in each dimension ; Returns -1 if dimensions are not equal ; Ref: H, Shekarforoush et al. INRIA 2707 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; shft = Phase_Corr(im1,im2,display=display,subpixel=subpixel) ; ARGUMENTS: ; im1, im2: the images to be correlated ; KEYWORDS: ; Display: (optional) show a surface plot if the correlation ; in window with display number display ; Subpixel: returns result to subpixel accuracy if set, ; otherwise nearest integer (default) ; DEPENDENCIES: ; None ;--------------------------------------------------------------------------- ; NAME: ; ORTHO_REGRESS ; PURPOSE: ; Orthogonal regression between two vectors ; Ref: M. Canty et al. Remote Sensing of Environment 91(3,4) (2004) 441-451 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Ortho_Regress, X, Y, a, Xm, Ym, sigma_a, sigma_b ; regression line is Y = Ym + a(X-Xm) = (Ym-aXm) + aX = b + aX ; ARGUMENTS: ; input column vectors X and Y ; returns a, Xm, Ym, sigma_a, sigma_b ; KEYWORDS: ; None ; DEPENDENCIES: ; None ;------------------------------------------------------------------- D.4.2 ATWT fusion ARSIS ATWT is an ENVI extension for panchromatic sharpening with the `a trous wavelet transform (ATWT). It is invoked from the ENVI main menu as Transform/Image Sharpening/Wavelet(ARSIS Model)/ATWT
    180 APPENDIX D.ENVI EXTENSIONS Usage In the Select low resolution multi-band input file window choose the (spatial/spectral subset of the) image to be sharpened. In the Select hi res input band window choose the corresponding panchromatic or high resolution image. Then in the ARSIS Fusion Output box select an output file name or memory. Source headers ;+ ; NAME: ; ARSIS_ATWT ; PURPOSE: ; ENVI extension for panchromatic sharpening under ARSIS model ; with A trous wavelet transform. ; Ref: Aiazzi et al, IEEE Transactions on Geoscience and ; Remote Sensing, 40(10) 2300-2312, 2002 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; ARSIS_ATWT ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; ATWT__DEFINE(WARP_SHIFT, PHASE_CORR) ; ORTHO_REGRESS ;------------------------------------------------------------------------ ;+ ; NAME: ; ATWT__DEFINE ; PURPOSE: ; A Trous wavelet transform class using Daubechies wavelets. ; Used for shift invariant image fusion ; Ref: Aiazzi et al. IEEE Transactions on Geoscience and ; Remote Sensing 40(10) (2002) 2300-2312 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; atwt = Obj_New(ATWT,image) ; ARGUMENTS: ; image: grayscale image to be processed ; KEYWORDS
; None
; METHODS:
; SHOW_IMAGE: display the image pyramid in a window
; dwt->Show_Image, wn
; INJECT: overwrite the filtered image
; dwt->Inject, im
; SET_TRANSFORMS: set the number of transformations
; dwt->Set_Transforms, nc
; GET_TRANSFORMS: get the number of transformations
; nc = dwt->Get_Transforms()
; GET_NUM_COLS: get the number of columns in the compressed image
; cols = dwt->Get_Num_Cols()
; GET_NUM_ROWS: get the number of rows in the compressed image
; rows = dwt->Get_Num_Rows()
; GET_IMAGE: return filtered image or details
; im = dwt->Get_Image(i) ; i = 0 for filtered image, i > 0 for details
; NORMALIZE_WC: normalize details at all levels
; dwt->Normalize, a, b ; a, b are normalization parameters
; COMPRESS: perform a single transformation
; dwt->Compress
; EXPAND: perform a single reverse transformation
; dwt->Expand
; DEPENDENCIES:
; WARP_SHIFT
; PHASE_CORR
; ---------------------------------------------------------------------
;+
; NAME:
; WARP_SHIFT
; PURPOSE:
; Use RST with bilinear interpolation to shift band to sub-pixel accuracy
; AUTHOR
; Mort Canty (2004)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; sband = Warp_Shift(band,shft)
; ARGUMENTS:
; band: the image band to be shifted
; KEYWORDS:
; None
; DEPENDENCIES:
; ENVI
;---------------------------------------------------------------------------
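To illustrate how these classes support ARSIS-style sharpening, the sketch below uses the DWT class of Section D.4.1 for a 4:1 resolution ratio: the panchromatic band is compressed twice, the multispectral band replaces the low resolution approximation, and two expansions restore full resolution. It is a sketch only; the radiometric normalization of the pan band to each multispectral band (ORTHO_REGRESS) and the handling of image dimensions are omitted, and the variable names are placeholders.

; hedged sketch of DWT (Mallat) pan-sharpening with the methods listed above;
; pan = high resolution band, msband = co-registered multispectral band at
; one quarter of the pan resolution
dwt = Obj_New('DWT', pan)
dwt->Set_Coeff, 8                ; Daubechies wavelet with 8 coefficients
dwt->Compress                    ; two compressions for a 4:1 resolution ratio
dwt->Compress
dwt->Inject, float(msband), pc=1 ; replace approximation, phase correlation match
dwt->Expand                      ; two expansions restore full resolution
dwt->Expand
fused = dwt->Get_Image()         ; pan-sharpened band
Obj_Destroy, dwt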
    182 APPENDIX D.ENVI EXTENSIONS D.4.3 Quality index RUN QUALITY INDEX is an ENVI extension to determine the Wang-Bovik quality index of a pan-sharpened image. It is invoked from the ENVI main menu as Transform/Image Sharpening/Quality Index Usage From the Choose reference image menu select the multispectral image to which the sharp- ened image is to be compared. In the Choose pan-sharpened image menu, select the image whose quality is to be determined. Source headers ;+ ; NAME: ; RUN_QUALITY_INDEX ; PURPOSE: ; ENVI extension for radiometric comparison of two multispectral images ; Ref: Wang and Bovik, IEEE Signal Processing Letters 9(3) 2002, 81-84 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Run_Quality_Index ; ARGUMENTS: ; Run_Quality_Index, Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; QI ;-------------------------------------------------------------------------- ;+ ; NAME: ; QI ; PURPOSE: ; Determine the Wang-Bovik quality index for a pan-sharpened image band ; Ref: Wang and Bovik, IEEE Signal Processing Letters 9(3) 2002, 81-84 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; index = QI(band1,band2) ; ARGUMENTS: ; band1: reference band
    D.4. IMAGE FUSION183 ; band2: degraded pan-sharpened band ; KEYWORDS: ; None ; DEPENDENCIES: ; None ;----------------------------------------------------------------------
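The index computed here is the universal image quality index of Wang and Bovik [WB02], Q = 4 σ_xy x̄ ȳ / [(σ_x^2 + σ_y^2)(x̄^2 + ȳ^2)], normally averaged over small sliding windows. A single-window (global) version is easily sketched; the sliding-window averaging actually performed by QI is omitted here.

; hedged sketch: global Wang-Bovik quality index for two bands
function qi_global, band1, band2
   x = float(band1[*])  &  y = float(band2[*])
   n = n_elements(x)
   mx = mean(x)  &  my = mean(y)
   vx = variance(x)  &  vy = variance(y)
   sxy = total((x - mx)*(y - my))/(n - 1)
   return, 4*sxy*mx*my/((vx + vy)*(mx^2 + my^2))
end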
D.5 Change detection

D.5.1 Multivariate Alteration Detection

MAD_RUN is an ENVI extension for change detection with the MAD transformation. It is invoked from the ENVI main menu as

Basic Tools/Change Detection/MAD

Usage

From the Choose first image window enter the first (spatial/spectral subset) of the two image files. In the Choose second image window enter the second image file name. The spatial and spectral subsets must be identical. If an input image is in BSQ format, it is converted in place, after a warning, to BIP. In the MAD Output box choose a file name or memory. The calculation begins and can be interrupted at any time with the Cancel button. Before output, the spatial subset for the final MAD transformation can be changed, e.g. extended to a full scene, if desired.

Source headers

;+
; NAME:
; MAD_RUN
; PURPOSE:
; ENVI extension for Multivariate Alteration Detection.
; Ref: A. A. Nielsen et al. Remote Sensing of Environment 64 (1998), 1-19
; Uses spectral tiling and therefore suitable for large datasets.
; Reads in two registered multispectral images (spectral/spatial subsets
; must have the same dimensions, spectral subset size must be at least 2).
; If an input image is in BSQ format, it is converted in place to BIP.
; Writes the MAD variates to disk.
; AUTHOR
; Mort Canty (2004)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; Mad_Run
; ARGUMENTS:
; Event (if used as a plug-in menu item)
; KEYWORDS:
; None
; DEPENDENCIES:
; ENVI
; MAD_TILED (COVPM_DEFINE, GEN_EIGENPROBLEM)
; PROGRESSBAR_DEFINE (FSC_COLOR)
;---------------------------------------------------------------------
;+
; NAME:
; MAD_TILED
; PURPOSE:
; Function for Multivariate Alteration Detection.
; Ref: A. A. Nielsen et al. Remote Sensing of Environment 64 (1998), 1-19
; Uses spectral tiling and therefore suitable for large datasets.
; Input files must be BIL or BIP format.
; On error or if interrupted during the first iteration, returns -1, else 0
; AUTHOR
; Mort Canty (2004)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; result = Mad_Tiled(fid1,fid2,dims1,dims2,pos1,pos2)
; ARGUMENTS:
; fid1, fid2 input file specifications
; dims1, dims2
; pos1, pos2
; KEYWORDS:
; A, B output: transformation eigenvectors
; means1, means2 weighted mean values for transformation, row-replicated
; cp change probability image from chi-square distribution
; DEPENDENCIES:
; ENVI
; COVPM_DEFINE
; GEN_EIGENPROBLEM
; PROGRESSBAR_DEFINE (FSC_COLOR)
;---------------------------------------------------------------------
;+
; NAME:
; COVPM__DEFINE
; PURPOSE:
; Object class for iterative covariance matrix calculation
; using the method of provisional means.
; AUTHOR
; Mort Canty (2004)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; covpm = Obj_New(COVPM)
; ARGUMENTS:
; None
; KEYWORDS
; None
; METHODS:
; UPDATE: update the covariance matrix with an observation
; covpm->Update, v, weight = w
; v is an observation vector (array)
; w is an optional weight for that observation
; COVARIANCE: read out the covariance matrix
    186 APPENDIX D.ENVI EXTENSIONS ; cov = covpm - Covariance() ; MEANS: read out the observation means ; mns = covpm - Means() ; DEPENDENCIES: ; None ;-------------------------------------------------------------- ;+ ; NAME: ; GEN_EIGENPROBLEM ; PURPOSE: ; Solve the generalized eigenproblem ; C##a = lambda*B##a ; using Cholesky factorization ; AUTHOR: ; Mort Canty (2001) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Gen_Eigenproblem, C, B, A, lambda ; ARGUMENTS: ; C and B are real, square, symmetric matrices ; returns the eigenvalues in the row vector lambda ; returns the eigenvectors a as the columns of A ; KEYWORDS: ; None ; DEPENDENCIES ; None ;--------------------------------------------------------------------- D.5.2 Maximum autocorrelation factor MAF is an ENVI extension for performing the MAF transformation, usually on previously calculated MAD variates. It is invoked from the ENVI main menu as Basic Tools/Change Detection/MAF (of MAD) Usage In the Choose multispectral image box select the file to be transformed. In the MAF Output box select an output file name or memory. Source headers ;+ ; NAME: ; MAF ; PURPOSE: ; ENVI extension for Maximum Autocorrelation Fraction transformation. ; Ref: Green et al, IEEE Transaction on Geoscience and Remote Sensing,
    D.5. CHANGE DETECTION187 ; 26(1):65-74,1988 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Maf ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; GEN_EIGENPROBLEM ;--------------------------------------------------------------------- D.5.3 Radiometric normalization RADCAL is an ENVI extension for radiometric normalization with the MAD transformation. It is invoked from the ENVI main menu as Basic Tools/Change Detection/MAD Radiometric Normalization Usage From the Choose reference image window enter the first (spatial/spectral subset) of the two image files. In the Choose target image window enter the second image file name. The spatial and spectral subsets must be identical. If an input image is in BSQ format, it is converted in place, after a warning, to BIP. In the MAD Output box choose a file name or memory. The calculation begins and can be interrupted at any time with the Cancel button. In a series of plot windows the regression lines used for the normalization are plotted. The results can be then used to calibrate another file, e.g. a full scene. Source headers ;+ ; NAME: ; RADCAL ; PURPOSE: ; Radiometric calibration using MAD ; Ref: M. Canty et al. Remote Sensing of Environment 91(3,4) (2004) 441-451 ; Reference and target images must have equal spatial and spectral dimensions, ; at least 2 spectral components, and be registered to one another. ; Once the regression coefficients have been determined, they can be used to ; calibrate another file, for example a full scene, which need not be registered ; to the reference image. ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de
    188 APPENDIX D.ENVI EXTENSIONS ; CALLING SEQUENCE: ; Radcal ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; ORTHO_REGRESS ; MAD_TILED (COVPM_DEFINE, GEN_EIGENPROBLEM) ; PROGRESSBAR_DEFINE (FSC_COLOR) ;-----------------------------------------------------------------
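After the MAD-based selection of invariant (no-change) pixels, the normalization itself is just the orthogonal regression documented in Section D.4.1 applied band by band. A sketch of that final step follows, assuming the no-change pixel intensities have already been extracted into vectors Xnc (target) and Ync (reference); the variable names are placeholders.

; hedged sketch: normalize one target band to the reference image using the
; orthogonal regression coefficients determined from the no-change pixels
Ortho_Regress, Xnc, Ync, a, Xm, Ym, sigma_a, sigma_b
b = Ym - a*Xm                      ; regression line Y = b + a*X (see header)
normalized_band = a*target_band + b
print, 'slope ', a, ' +/- ', sigma_a, '  intercept ', b, ' +/- ', sigma_b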
    D.6. UNSUPERVISED CLASSIFICATION189 D.6 Unsupervised classification D.6.1 Hierarchical clustering HCLRUN is an ENVI extension for agglomerative hierarchical clustering. It is invoked from the ENVI main menu as Classification/Unsupervised/Hierarchic It is intended as a demonstration, and writes to memory only. Usage In the Choose multispectral image for clustering window select the (spatial/spectral subset of the) desired image. In the Number of Samples box choose the size of the repre- sentative random sample (default 1000). In the Number of Classes box select the desired number of clusters. Source headers ;+ ; NAME: ; HCLRUN ; PURPOSE: ; ENVI extension for hierarchical agglomerative clustering ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; HCLrun ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; HCL (PROGRESSBAR__DEFINE (FSC_COLOR)) ; CLASS_LOOKUP_TABLE ;--------------------------------------------------------------------- ;+ ; NAME: ; HCL ; PURPOSE: ; Agglomerative hierarchic clustering with sum of squares cost function. ; Takes data array Xs (column vectors) and number of clusters K as input. ; Returns cluster memberships Cs. ; Ref. Fraley Technical Report 311, Dept. of Statistics, ; University of Washington, Seattle (1996). ; AUTHOR
    190 APPENDIX D.ENVI EXTENSIONS ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; HCL, Xs, K, Cs ; ARGUMENTS: ; Xs: input observations array (column vectors) ; K: number of clusters ; Cs: Cluster labels of observations ; KEYWORDS: ; None ; DEPENDENCIES: ; PROGRESSBAR__DEFINE (FSC_COLOR) ;-------------------------------------------------------------------- ;+ ; NAME: ; CLASS_LOOKUP_TABLE ; PURPOSE: ; Provide 16 class colors for supervised and unsupervised classification programs ; AUTHOR; ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; colors = Class_Lookup_Table(Ptr) ; ARGUMENTS: ; Ptr: a vector of pointers into the table ; KEYWORDS: ; None ; DEPENDENCIES: ; None ;--------------------------------------------------------------------- D.6.2 Fuzzy K-means clustering SAMPLE FKMRUN is an ENVI extension for fuzzy k-means clustering. It is invoked from the ENVI main menu as Classification/Unsupervised/Fuzzy-K-Means Usage In the Choose multispectral image window select the (spatial/spectral subset of the) desired image. In the Number of Classes box select the desired number of clusters. In the FKM Output box select the output file name or memory. Source headers ;+
    D.6. UNSUPERVISED CLASSIFICATION191 ; NAME: ; SAMPLE_FKMRUN ; PURPOSE: ; ENVI extension for fuzzy K-means clustering with sampled data ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Sample_FKMrun ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; FKM (PROGRESSBAR_DEFINE (FSC_COLOR)) ; CLUSTER_FKM ; CLASS_LOOKUP_TABLE ;--------------------------------------------------------------------- ;+ ; NAME: ; FKM ; PURPOSE: ; Fuzzy Kmeans clustering algorithm. ; Takes data array Xs (data as column vectors), number of clusters K. ; Returns fuzzy membership matrix U and the class centers Ms. ; Ref: J. C. Dunn, Journal of Cybernetics, PAM1:32-57, 1973 ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; FKM, Xs, K, U, Ms, niter=niter, seed=seed ; ARGUMENTS: ; Xs: input observations array (column vectors) ; K: number of clusters ; U: final class probability membership matrix (output) ; Ms: cluster means (output) ; KEYWORDS: ; niter: number of iterations (optional) ; seed: initial random number seed (optional) ; DEPENDENCIES: ; PROGRESSBAR__DEFINE (FSC_COLOR) ;-------------------------------------------------------------------- ;+ ; NAME: ; CLUSTER_FKM
    192 APPENDIX D.ENVI EXTENSIONS ; PURPOSE: ; Modified distance clusterer from IDL library ; CALLING SEQUENCE: ; labels = Cluster_fkm(Array,Weights,Double=Double,N_clusters=N_clusters) ;------------------------------------------------------------------------- D.6.3 EM clustering SAMPLE EMRUN is an ENVI extension for EM clustering. It is invoked from the ENVI main menu as Classification/Unsupervised/EM(Sampled) TILED EMRUN can be used to cluster large data sets. It is invoked from the ENVI main menu as Classification/Unsupervised/EM(Tiled) Usage In the Choose multispectral image for clustering window select the (spatial/spectral subset of the) desired image. In the Number of Samples box choose the size of the repre- sentative random sample (default 1000). In the Number of Classes box select the desired number of clusters. In the FKM Output box select the output file name or memory. In the Output class membership probs box select the output file name for the probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte coded (0 = proba- bility 0, 255 = probability 1). In the tiled version, output to memory is not possible. During calculation a log likelihood plot is shown. Calculation can be interrupted at any time. Source headers ;+ ; NAME: ; SAMPLE_EMRUN ; PURPOSE: ; ENVI extension for EM clustering with sampled data ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Sample_EMrun ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; EM (PROGRESSBAR__DEFINE (FSC_COLOR)) ; CLUSTER_EM ; CLASS_LOOKUP_TABLE
    D.6. UNSUPERVISED CLASSIFICATION193 ;--------------------------------------------------------------------- ;+ ; NAME: ; TILED_EMRUN ; PURPOSE: ; ENVI extension for EM clustering on sampled data, large data sets ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Tiled_EMrun ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; EM (PROGRESSBAR__DEFINE (FSC_COLOR)) ; FKM ; CLUSTER_EM ; CLASS_LOOKUP_TABLE ;--------------------------------------------------------------------- ;+ ; NAME: ; EM ; PURPOSE: ; Expectation maximization clustering algorithm for Gaussian mixtures. ; Takes data array Xs (data as column vectors) and initial ; class membership probability matrix U as input. ; Returns U, the class centers Ms, Priors Ps and final ; class covariances Fs. ; Allows for simulated annealing ; Ref: Gath and Geva, IEEE Trans. Pattern Anal. and Mach. ; Intell. 3(3):773-781, 1989 ; Hilger, Exploratory Analysis of Multivariate Data, ; PhD Thesis, IMM, Technical University of Denmark, 2001 ; AUTHOR ; Mort Canty (2005) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Pro EM, Xs, U, Ms, Ps, Fs, unfrozen=unfrozen, wnd=wnd, $ ; maxiter=maxiter, miniter=miniter, verbos=verbose, $ ; pdens=pdens, pd_exclude=pdens_exclude, fhv=fhv, T0=T0 ; ARGUMENTS: ; Xs: input observations array (column vectors) ; U: initial class probability membership matrix (column vectors)
; Ms: cluster means (output)
; Ps: cluster priors (output)
; Fs: cluster covariance matrices (output)
; KEYWORDS:
; unfrozen: Indices of the observations which
; take part in the iteration (default all)
; wnd: window for displaying the log likelihood (optional)
; maxiter: maximum iterations (optional)
; miniter: minimum iterations (optional)
; pdens: partition density (output, optional)
; pd_exclude: array of classes to be excluded from pdens and fhv (optional)
; fhv: fuzzy hypervolume (output, optional)
; T0: initial annealing temperature (default 1.0)
; verbose: set to print output info to IDL log
; DEPENDENCIES:
; PROGRESSBAR__DEFINE (FSC_COLOR)
;--------------------------------------------------------------------
;+
; NAME:
; CLUSTER_EM
; PURPOSE:
; Cluster data after running the EM algorithm
; Takes data array (as row vectors), means Ms (as row vectors), priors Ps
; and covariance matrices Fs and returns the class labels.
; Class membership probabilities are returned in class_probs
; AUTHOR
; Mort Canty (2004)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; labels = Cluster_EM(Xs,Ms,Ps,Fs,class_probs=class_probs,progress_bar=progress_bar)
; ARGUMENTS:
; Xs: data array
; Ms: cluster means
; Ps: cluster priors
; Fs: cluster covariance matrices
; KEYWORDS:
; class_probs (optional): contains cluster membership probability image
; progress_bar: set to 0 if no progressbar is desired
; DEPENDENCIES:
; PROGRESSBAR__DEFINE (FSC_COLOR)
;--------------------------------------------------------------------

D.6.4 Probabilistic label relaxation

PLR is an ENVI extension for performing probabilistic relaxation on rule (class membership probability) images generated by supervised and unsupervised classification algorithms. It is invoked from the ENVI main menu as
    D.6. UNSUPERVISED CLASSIFICATION195 Classification/Post Classification/Probabilistic Label Relaxation/Run PLR Usage In the Choose probabilities image window select the rule image to be processed. Select the number of iterations (default 3) in the Number of iterations box. Finally choose a file name for the output rule image in the PLR output box. PLR RECLASS is an ENVI extension for (re)classifying on rule (class membership proba- bility) images generated by supervised and unsupervised classification algorithms and prob- abilistic label relaxation. It is invoked from the ENVI main menu as Classification/Post Classification/Probabilistic Label Relaxation/Reclassify Usage In the Choose classification file optionally specify a previous classification image. This determines the color coding of the reclassification output. If cancelled, a default color code is used. In the Choose probabilities window, select the file to be processed. Answer the Include dummy unclassified class query with “Yes”. In the Output PLR reclassification box choose an output file name or memory. Source headers ;+ ; NAME: ; PLR ; PURPOSE: ; ENVI extension for postclassification with ; Probabilistic Label Relaxation ; Ref. Richards and Jia, Remote Sensing Digital Image Analysis (1999) Springer ; Processes a rule image (class membership probabilities), outputs a ; new rule image ; AUTHOR; ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Plr ; ARGUMENTS ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; PROGRESSBAR_DEFINE (FSC_COLOR) ;---------------------------------------------------------------------- ;+ ; NAME: ; PLR_RECLASS
    196 APPENDIX D.ENVI EXTENSIONS ; PURPOSE: ; ENVI extension for postclassification with ; Probabilistic Label Relaxation ; Ref. Richards and Jia, Remote Sensing Digital Image Analysis (1999) Springer ; Processes a rule image (class membership probabilities), outputs a ; new classification file ; AUTHOR; ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Plr_Reclass ; ARGUMENTS ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; PROGRESSBAR_DEFINE (FSC_COLOR) ;---------------------------------------------------------------------- D.6.5 Kohonen self organizing map SAMPLE SOMRUN is an ENVI extension for clustering with the Kohonen self-organizing map. It is invoked from the ENVI main menu as Classification/Unsupervised/Kohonen SOM Usage In the Choose multispectral image window select the (spatial/spectral subset of the) desired image. In the Cube side dimension box select the desired dimension of the cubic neural network (default 6). In the SOM Output box select the output file name or memory. Source headers ;+ ; NAME: ; SAMPLE_SOMRUN ; PURPOSE: ; ENVI extension for Kohonen Self Organizing Map with sampled data ; Ref. T. Kohonen, Self Organization and Associative Memory, Springer 1989. ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Sample_KFrun ; ARGUMENTS: ; Event (if used as a plug-in menu item)
    D.6. UNSUPERVISED CLASSIFICATION197 ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; PROGRESSBAR__DEFINE (FSC_COLOR) ;--------------------------------------------------------------------- D.6.6 A GUI for change clustering MAD VIEW is an IDL GUI (graphical user interface) for viewing and processing MAD and MNF/MAD change images. It is invoked from the ENVI main menu as Basic Tools/Change Detection/MAD View Usage This extension is provided with an on-line help. Source headers ;+ ; NAME: ; MAD_VIEW ; PURPOSE: ; GUI for viewing, thresholding and clustering MAD/MNF images ; Ref: A. A. Nielsen et al. Remote Sensing of Environment 64 (1998), 1-19 ; A. A. Nielsen private communication (2004) ; AUTHOR ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Mad_View ; ARGUMENTS: ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; EM ; CLUSTER_EM ; PROGRESSBAR_DEFINE (FSC_COLOR) ;---------------------------------------------------------------------
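Outside the ENVI menus, the clustering routines documented in this section can also be chained directly: a random sample of pixel vectors is clustered with FKM to obtain an initial membership matrix, which EM then refines. The sketch below follows the calling sequences given in the headers; the image array G, its assumed band ordering (columns x rows x bands) and the sample size are illustrative assumptions only.

; hedged sketch: sample-based clustering with FKM initialization and EM refinement
K = 6                                        ; number of clusters
n_samples = 1000L
sz = size(G, /dimensions)                    ; G assumed cols x rows x bands
npix = sz[0]*sz[1]
pixels = reform(G, npix, sz[2])              ; one spectral vector per pixel
idx = long(randomu(seed, n_samples)*npix) < (npix-1)
Xs = float(pixels[idx, *])                   ; random sample of observations
FKM, Xs, K, U, Ms, niter=20, seed=seed       ; fuzzy K-means initialization
EM, Xs, U, Ms, Ps, Fs, maxiter=100           ; Gaussian mixture refinement
print, 'cluster means:'
print, Ms

Cluster_EM can then be used, with the conventions given in its header (data and means as row vectors), to assign class labels and membership probabilities to every pixel of the scene.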
    198 APPENDIX D.ENVI EXTENSIONS D.7 Neural network: Scaled conjugate gradient FFNCG RUN is an ENVI extension for supervised classification with a two-layer feed forward neural network. It uses the scaled conjugate gradient training algorithm and can be used as a replacement for the much slower backpropagation neural network implemented in ENVI. It is invoked from the ENVI main menu as Classification/Supervised/Neural Net/Conjugate Gradient Usage In the Enter file for classification window select the (spatial/spectral subset of the) desired image. This must be in BIP format. In the ROI selection box choose the training regions desired. In the Output FFN classification to file box select the output file name. In the Output FFN probabilities to file box select the output file name for the probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte coded (0 = probability 0, 255 = probability 1). In the Number of hidden units box select the number of neurons in the first layer (default 4). As the calculation proceeds, the cost function is displayed in a plot window. The calculation can be interrupted with Cancel. Source headers ;+ ; NAME: ; FFNCG_RUN ; PURPOSE: ; ENVI extension for classification of a multispectral image ; with a feed forward neural network using scaled conjugate gradient training ; AUTHOR; ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; FfnCG_Run ; ARGUMENTS ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; PROGRESSBAR_DEFINE (FSC_COLOR) ; FFNCG__DEFINE (FFN__DEFINE) ;---------------------------------------------------------------------- ;+ ; NAME: ; FFNCG__DEFINE ; PURPOSE: ; Object class for implementation of a two-layer, feed-forward ; neural network for classification of multi-spectral images.
; Implements scaled conjugate gradient training.
; Ref: C. Bishop, Neural Networks for Pattern Recognition, Oxford 1995
; M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR
; Mort Canty (2005)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; ffn = Obj_New(FFNCG,Xs,Ys,L)
; ARGUMENTS:
; Xs: array of observation column vectors
; Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
; L: number of hidden neurons
; KEYWORDS
; None
; METHODS:
; ROP: determine the matrix product v^t.H, where H is the Hessian of
; the cost function wrt the weights, using the R-operator
; r = ffn->Rop(v)
; HESSIAN: calculate the Hessian
; h = ffn->Hessian()
; EIGENVALUES: calculate the eigenvalues of the Hessian
; e = ffn->Eigenvalues()
; GRADIENT: calculate the gradient of the global cost function
; g = ffn->Gradient()
; TRAIN: train the network
; ffn->Train
; DEPENDENCIES:
; FFN__DEFINE
; PROGRESSBAR (FSC_COLOR)
;--------------------------------------------------------------
;+
; NAME:
; FFN__DEFINE
; PURPOSE:
; Object class for implementation of a two-layer, feed-forward
; neural network for classification of multi-spectral images.
; This is a generic class with no training methods.
; Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR
; Mort Canty (2005)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; ffn = Obj_New(FFN,Xs,Ys,L)
; ARGUMENTS:
; Xs: array of observation column vectors
; Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
; L: number of hidden neurons
    200 APPENDIX D.ENVI EXTENSIONS ; KEYWORDS ; None ; METHODS (external): ; OUTPUT: return a class membership probability vector for an observation ; row vector x ; p = ffn - Output(x) ; CLASS: return the class for an observation row vector x ; p = ffn - Class(x) ; DEPENDENCIES: ; None ;-------------------------------------------------------------- D.8 Neural network: Kalman filter FFNKAL RUN is an ENVI extension for supervised classification with a two-layer feed forward neural network. It uses a fast Kalman Filter training algorithm and can be used as a replacement for the much slower backpropagation neural network implemented in ENVI. It is invoked from the ENVI main menu as Classification/Supervised/Neural Net/Kalman Filter Usage In the Enter file for classification window select the (spatial/spectral subset of the) desired image. This must be in BIP format. In the ROI selection box choose the training regions desired. In the Output FFN classification to file box select the output file name. In the Output FFN probabilities to file box select the output file name for the probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte coded (0 = probability 0, 255 = probability 1). In the Number of hidden units box select the number of neurons in the first layer (default 4). As the calculation proceeds, the logarithm of the cost function is displayed in a plot window. The calculation can be interrupted with Cancel. Source headers ;+ ; NAME: ; FFNKAL_RUN ; PURPOSE: ; Classification of a multispectral image with feed forward neural network ; using Kalman filter training ; AUTHOR; ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; FfnKal_Run ; ARGUMENTS ; Event (if used as a plug-in menu item)
; KEYWORDS:
; None
; DEPENDENCIES:
; ENVI
; PROGRESSBAR_DEFINE (FSC_COLOR)
; FFNKAL__DEFINE (FFN__DEFINE)
;----------------------------------------------------------------------
;+
; NAME:
; FFNKAL__DEFINE
; PURPOSE:
; Object class for implementation of a two-layer, feed-forward
; neural network for classification of multi-spectral images.
; Implements Kalman filter training.
; Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR
; Mort Canty (2005)
; Juelich Research Center
; m.canty@fz-juelich.de
; CALLING SEQUENCE:
; ffnkal = Obj_New(FFNKAL,Xs,Ys,L)
; ARGUMENTS:
; Xs: array of observation column vectors
; Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
; L: number of hidden neurons
; KEYWORDS
; None
; METHODS:
; OUTPUT (inherited): return a class membership probability vector for an observation
; row vector x
; p = ffnkal->Output(x)
; CLASS (inherited): return the class for an observation row vector x
; p = ffnkal->Class(x)
; TRAIN: train the network
; ffnkal->Train
; DEPENDENCIES:
; FFN__DEFINE
; PROGRESSBAR(FSC_COLOR)
;--------------------------------------------------------------

D.9 Neural network: Hybrid

FFN_RUN is an ENVI extension for supervised classification with a two-layer feed forward neural network. It uses both the Kalman filter and the scaled conjugate gradient training algorithm and can be used as a replacement for the much slower backpropagation neural network implemented in ENVI. It is invoked from the ENVI main menu as

Classification/Supervised/Neural Net/Hybrid
    202 APPENDIX D.ENVI EXTENSIONS Usage In the Enter file for classification window select the (spatial/spectral subset of the) desired image. This must be in BIP format. In the ROI selection box choose the training regions desired. In the Output FFN classification to file box select the output file name. In the Output FFN probabilities to file box select the output file name for the probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte coded (0 = probability 0, 255 = probability 1). In the Number of hidden units box select the number of neurons in the first layer (default 4). As the Kalman filter calculation proceeds, the log of the cost function is displayed in a plot window. The calculation can be interrupted with Cancel. Then calculation continues where the Kalman filter left off with the scaled conjugate gradient training method. The calculation can again be interrupted with Cancel. Source headers ;+ ; NAME: ; FFN_RUN ; PURPOSE: ; ENVI extension for classification of a multispectral image ; with a feed forward neural network using Kalman filter ; plus scaled conjugate gradient training ; AUTHOR; ; Mort Canty (2004) ; Juelich Research Center ; m.canty@fz-juelich.de ; CALLING SEQUENCE: ; Ffn_Run ; ARGUMENTS ; Event (if used as a plug-in menu item) ; KEYWORDS: ; None ; DEPENDENCIES: ; ENVI ; PROGRESSBAR_DEFINE (FSC_COLOR) ; FFNKAL__DEFINE (FFN__DEFINE) ; FFNCG__DEFINE ;----------------------------------------------------------------------
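All three neural network extensions write their class membership probabilities as byte-coded rule images (0 for probability 0, 255 for probability 1). Recovering floating point probabilities, for example before feeding them to the probabilistic label relaxation extension of Section D.6.4, is straightforward; the sketch below assumes the rule image has been read into an array of dimensions columns x rows x classes, which is an assumption about the storage order rather than a documented fact.

; hedged sketch: convert a byte-coded rule image back to probabilities and
; renormalize so that the class probabilities sum to one at each pixel
probs = rule_image/255.0
sums = total(probs, 3) > 1e-6            ; per-pixel sum over classes, guarded against zero
sz = size(probs, /dimensions)
probs = probs/rebin(sums, sz[0], sz[1], sz[2])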
Bibliography

[AABG02] B. Aiazzi, L. Alparone, S. Baronti, and A. Garzelli. Context-driven fusion of high spatial and spectral resolution images based on oversampled multiresolution analysis. IEEE Transactions on Geoscience and Remote Sensing, 40(10):2300–2312, 2002.

[And84] T. W. Anderson. An Introduction to Multivariate Statistical Analysis, 2nd Edition. Wiley Series in Probability and Mathematical Statistics, 1984.

[AS99] E. Aboufadel and S. Schlicker. Discovering Wavelets. Wiley, 1999.

[BFH75] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis, Theory and Practice. Cambridge Press, 1975.

[Bil89] C. M. Bilbo. Statistisk analyse af relationer mellem alternative antistoftracere. Master's thesis, Informatics and Mathematical Modeling, Technical University of Denmark, Lyngby, 1989. In Danish.

[Bis95] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[BP00] L. Bruzzone and D. F. Prieto. Automatic analysis of the difference image for unsupervised change detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(4):1171–1182, 2000.

[CNS04] M. J. Canty, A. A. Nielsen, and M. Schmidt. Automatic radiometric normalization of multitemporal satellite imagery. Remote Sensing of Environment, 91(3,4):441–451, 2004.

[DH73] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[Dun73] J. C. Dunn. A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. Journal of Cybernetics, PAM1-1:32–57, 1973.

[Fra96] C. Fraley. Algorithms for model-based Gaussian hierarchical clustering. Technical report 311, Department of Statistics, University of Washington, Seattle, 1996.

[GG89] I. Gath and A. B. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3(3):773–781, 1989.

[GW02] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Addison Wesley, 2002.
[Hab95] P. Haberäcker. Praxis der Digitalen Bildverarbeitung und Mustererkennung. Carl Hanser Verlag, 1995.

[Hil01] K. B. Hilger. Exploratory Analysis of Multivariate Data. PhD Thesis, IMM-PHD-2001-89, Technical University of Denmark, 2001.

[HKP91] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.

[Hu62] M. K. Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, IT-8:179–187, 1962.

[JRR99] X. Jia, J. A. Richards, and D. E. Ricken. Remote Sensing Digital Image Analysis. Springer-Verlag, 1999.

[Koh89] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, 1989.

[KS79] M. Kendall and A. Stuart. The Advanced Theory of Statistics, volume 2. Charles Griffin Company Limited, fourth edition, 1979.

[LMM95] H. Li, B. S. Manjunath, and S. K. Mitra. A contour-based approach to multisensor image registration. IEEE Transactions on Image Processing, 4(3):320–334, 1995.

[Mal89] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.

[Mil99] A. S. Milman. Mathematical Principles of Remote Sensing. Sleeping Bear Press, 1999.

[Moe93] M. F. Moeller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6:525–533, 1993.

[NCS98] A. A. Nielsen, K. Conradsen, and J. J. Simpson. Multivariate alteration detection (MAD) and MAF processing in multispectral, bitemporal image data: New approaches to change detection studies. Remote Sensing of Environment, 64:1–19, 1998.

[Pal98] G. Palubinskas. K-means clustering algorithm using the entropy. SPIE (European Symposium on Remote Sensing, Conference on Image and Signal Processing for Remote Sensing), September, Barcelona, Vol 3500:63–71, 1998.

[Pat77] W. M. Patefield. On the information matrix in the linear functional problem. Journal of the Royal Statistical Society, Series C, 26:69–70, 1977.

[PFTV86] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes. Cambridge University Press, 1986.

[RC96] B. S. Reddy and B. N. Chatterji. An FFT-based technique for translation, rotation and scale-invariant image registration. IEEE Transactions on Image Processing, 5(8):1266–1271, 1996.

[RCSA03] D. Riano, E. Chuvieco, J. Salas, and I. Aguado. Assessment of different topographic corrections in Landsat-TM data for mapping vegetation types. IEEE Transactions on Geoscience and Remote Sensing, 41(5):1056–1061, 2003.
[Rip96] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

[RW00] T. Ranchin and L. Wald. Fusion of high spatial and spectral resolution images: the ARSIS concept and its implementation. Photogrammetric Engineering and Remote Sensing, 66(1):49–61, 2000.

[Sie65] S. S. Siegel. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, 1965.

[Sin89] A. Singh. Digital change detection techniques using remotely-sensed data. Int. J. Remote Sensing, 10(6):989–1002, 1989.

[SP90] S. Shah and F. Palmieri. MEKA - a fast, local algorithm for training feed forward neural networks. Proceedings of the International Joint Conference on Neural Networks, San Diego, I(3):41–46, 1990.

[TGG82] P. M. Teillet, B. Guindon, and D. G. Goodenough. On the slope-aspect correction of multispectral scanner data. Canadian Journal of Remote Sensing, 8(2):84–106, 1982.

[TH01] C. V. Tao and Y. Hu. A comprehensive study of the rational function model for photogrammetric processing. Photogrammetric Engineering and Remote Sensing, 67(12):1347–1357, 2001.

[WB02] Z. Wang and A. C. Bovik. A universal image quality index. IEEE Signal Processing Letters, 9(3):81–84, 2002.

[Wie97] R. Wiemker. An iterative spectral-spatial Bayesian labelling approach for unsupervised robust change detection on remotely sensed multispectral imagery. Proceedings of the 7th International Conference on Computer Analysis of Images and Patterns, Springer LNCS Vol 1296:263–370, 1997.

[WK91] S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn. Morgan Kaufmann, 1991.