From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference

Randall Balestriero; Richard G. Baraniuk

Research output: Contribution to conference › Paper › peer-review

Abstract

Nonlinearity is crucial to the performance of a deep (neural) network (DN). To date there has been little progress understanding the menagerie of available nonlinearities, but recently progress has been made on understanding the rôle played by piecewise affine and convex nonlinearities like the ReLU and absolute value activation functions and max-pooling. In particular, DN layers constructed from these operations can be interpreted as max-affine spline operators (MASOs) that have an elegant link to vector quantization (VQ) and K-means. While this is good theoretical progress, the entire MASO approach is predicated on the requirement that the nonlinearities be piecewise affine and convex, which precludes important activation functions like the sigmoid, hyperbolic tangent, and softmax. This paper extends the MASO framework to these and an infinitely large class of new nonlinearities by linking deterministic MASOs with probabilistic Gaussian Mixture Models (GMMs). We show that, under a GMM, piecewise affine, convex nonlinearities like ReLU, absolute value, and max-pooling can be interpreted as solutions to certain natural “hard” VQ inference problems, while sigmoid, hyperbolic tangent, and softmax can be interpreted as solutions to corresponding “soft” VQ inference problems. We further extend the framework by hybridizing the hard and soft VQ optimizations to create a β-VQ inference that interpolates between hard, soft, and linear VQ inference. A prime example of a β-VQ DN nonlinearity is the swish nonlinearity, which offers state-of-the-art performance in a range of computer vision tasks but was developed ad hoc by experimentation. Finally, we validate with experiments an important assertion of our theory, namely that DN performance can be significantly improved by enforcing orthogonality in its linear filters.
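The hard, soft, and β-VQ interpolation described in the abstract can be illustrated for the ReLU case with a small sketch. It is one reading of the abstract, not the paper's code. It assumes that the VQ "selection" is a scalar t in [0, 1], that the β-VQ objective is β·t·u + (1 − β)·H(t) with H the binary entropy, and that the layer output is t·u. The function names `beta_vq_gate` and `beta_vq_activation` are hypothetical. Under these assumptions the maximiser is t = sigmoid(βu / (1 − β)), so β = 1 gives ReLU (hard VQ), β = 0.5 gives u·sigmoid(u) (soft VQ, a swish-type unit), and β = 0 gives the linear map u/2.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def beta_vq_gate(u, beta):
    """Hypothetical helper: the t in [0, 1] maximising
    beta * t * u + (1 - beta) * H(t), with H the binary entropy.

    Setting the derivative to zero gives
    log(t / (1 - t)) = beta * u / (1 - beta).
    This objective is an assumption made for illustration.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if beta == 1.0:
        # Hard VQ limit: no entropy term, so pick the better of t=0 and t=1.
        return 1.0 if u > 0 else 0.0
    return sigmoid(beta * u / (1.0 - beta))


def beta_vq_activation(u, beta):
    """Gate the pre-activation u by its inferred selection t."""
    return beta_vq_gate(u, beta) * u


if __name__ == "__main__":
    for beta in (0.0, 0.5, 0.9, 1.0):
        outputs = [round(beta_vq_activation(u, beta), 4) for u in (-2.0, 0.0, 2.0)]
        print(f"beta={beta}: {outputs}")
```

Raising β toward 1 sharpens the gate toward a step function, which is how a swish-like unit with a large slope parameter approaches ReLU in this sketch.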

Original language	English (US)
State	Published - 2019
Event	7th International Conference on Learning Representations, ICLR 2019 - New Orleans, United States
Duration	May 6 2019 → May 9 2019

ASJC Scopus subject areas

Education
Computer Science Applications
Linguistics and Language
Language and Linguistics

Cite this

@conference{c10bea344f174d449e0c544c3cbe314b,

title = "From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference",

abstract = "Nonlinearity is crucial to the performance of a deep (neural) network (DN). To date there has been little progress understanding the menagerie of available nonlinearities, but recently progress has been made on understanding the r{\^o}le played by piecewise affine and convex nonlinearities like the ReLU and absolute value activation functions and max-pooling. In particular, DN layers constructed from these operations can be interpreted as max-affine spline operators (MASOs) that have an elegant link to vector quantization (VQ) and K-means. While this is good theoretical progress, the entire MASO approach is predicated on the requirement that the nonlinearities be piecewise affine and convex, which precludes important activation functions like the sigmoid, hyperbolic tangent, and softmax. This paper extends the MASO framework to these and an infinitely large class of new nonlinearities by linking deterministic MASOs with probabilistic Gaussian Mixture Models (GMMs). We show that, under a GMM, piecewise affine, convex nonlinearities like ReLU, absolute value, and max-pooling can be interpreted as solutions to certain natural “hard” VQ inference problems, while sigmoid, hyperbolic tangent, and softmax can be interpreted as solutions to corresponding “soft” VQ inference problems. We further extend the framework by hybridizing the hard and soft VQ optimizations to create a β-VQ inference that interpolates between hard, soft, and linear VQ inference. A prime example of a β-VQ DN nonlinearity is the swish nonlinearity, which offers state-of-the-art performance in a range of computer vision tasks but was developed ad hoc by experimentation. Finally, we validate with experiments an important assertion of our theory, namely that DN performance can be significantly improved by enforcing orthogonality in its linear filters.",

author = "Balestriero, Randall and Baraniuk, {Richard G.}",

note = "Funding Information: Our development of the SMASO model opens the door to several new research questions. First, we have merely scratched the surface in the exploration of new nonlinear activation functions and pooling operators based on the SVQ and β-VQ. For example, the soft- or β-VQ versions of leaky-ReLU, absolute value, and other piecewise affine and convex nonlinearities could outperform the new swish nonlinearity. Second, replacing the entropy penalty in (7) and (8) with a different penalty will create entirely new classes of nonlinearities that inherit the rich analytical properties of MASO DNs. Third, orthogonal DN filters will enable new analysis techniques and DN probing methods, since from a signal processing point of view problems such as denoising, reconstruction, and compression have been extensively studied in terms of orthogonal filters. This work was partially supported by NSF grants IIS-17-30574 and IIS-18-38177, AFOSR grant FA9550-18-1-0478, ARO grant W911NF-15-1-0316, ONR grants N00014-17-1-2551 and N00014-18-12571, DARPA grant G001534-7500, and a DOD Vannevar Bush Faculty Fellowship (NSSEFF) grant N00014-18-1-2047. Publisher Copyright: {\textcopyright} 7th International Conference on Learning Representations, ICLR 2019. All Rights Reserved.; 7th International Conference on Learning Representations, ICLR 2019 ; Conference date: 06-05-2019 Through 09-05-2019",

year = "2019",

language = "English (US)",

}

TY - CONF

T1 - From hard to soft

T2 - 7th International Conference on Learning Representations, ICLR 2019

AU - Balestriero, Randall

AU - Baraniuk, Richard G.

N1 - Funding Information: Our development of the SMASO model opens the door to several new research questions. First, we have merely scratched the surface in the exploration of new nonlinear activation functions and pooling operators based on the SVQ and β-VQ. For example, the soft- or β-VQ versions of leaky-ReLU, absolute value, and other piecewise affine and convex nonlinearities could outperform the new swish nonlinearity. Second, replacing the entropy penalty in (7) and (8) with a different penalty will create entirely new classes of nonlinearities that inherit the rich analytical properties of MASO DNs. Third, orthogonal DN filters will enable new analysis techniques and DN probing methods, since from a signal processing point of view problems such as denoising, reconstruction, and compression have been extensively studied in terms of orthogonal filters. This work was partially supported by NSF grants IIS-17-30574 and IIS-18-38177, AFOSR grant FA9550-18-1-0478, ARO grant W911NF-15-1-0316, ONR grants N00014-17-1-2551 and N00014-18-12571, DARPA grant G001534-7500, and a DOD Vannevar Bush Faculty Fellowship (NSSEFF) grant N00014-18-1-2047. Publisher Copyright: © 7th International Conference on Learning Representations, ICLR 2019. All Rights Reserved.

PY - 2019

Y1 - 2019

N2 - Nonlinearity is crucial to the performance of a deep (neural) network (DN). To date there has been little progress understanding the menagerie of available nonlinearities, but recently progress has been made on understanding the rôle played by piecewise affine and convex nonlinearities like the ReLU and absolute value activation functions and max-pooling. In particular, DN layers constructed from these operations can be interpreted as max-affine spline operators (MASOs) that have an elegant link to vector quantization (VQ) and K-means. While this is good theoretical progress, the entire MASO approach is predicated on the requirement that the nonlinearities be piecewise affine and convex, which precludes important activation functions like the sigmoid, hyperbolic tangent, and softmax. This paper extends the MASO framework to these and an infinitely large class of new nonlinearities by linking deterministic MASOs with probabilistic Gaussian Mixture Models (GMMs). We show that, under a GMM, piecewise affine, convex nonlinearities like ReLU, absolute value, and max-pooling can be interpreted as solutions to certain natural “hard” VQ inference problems, while sigmoid, hyperbolic tangent, and softmax can be interpreted as solutions to corresponding “soft” VQ inference problems. We further extend the framework by hybridizing the hard and soft VQ optimizations to create a β-VQ inference that interpolates between hard, soft, and linear VQ inference. A prime example of a β-VQ DN nonlinearity is the swish nonlinearity, which offers state-of-the-art performance in a range of computer vision tasks but was developed ad hoc by experimentation. Finally, we validate with experiments an important assertion of our theory, namely that DN performance can be significantly improved by enforcing orthogonality in its linear filters.

UR - http://www.scopus.com/inward/record.url?scp=85083950229&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85083950229&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85083950229

Y2 - 6 May 2019 through 9 May 2019

ER -
