Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Research output: Contribution to journal · Conference article · peer-review

Abstract

Transformers have achieved remarkable success in sequence modeling and beyond, but they suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques including sparse and linear attention and hashing tricks, efficient transformers have been proposed to reduce the quadratic complexity of transformers, but they significantly degrade the accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the momentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrate that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.
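
To make the idea concrete, the sketch below adds a heavy-ball momentum term to a causal linear-attention recurrence. It is a minimal illustration, not the authors' exact formulation: the function name momentum_linear_attention, the elu(x)+1 feature map, the fixed coefficient beta, and the update M_i = beta*M_{i-1} + phi(k_i) v_i^T, S_i = S_{i-1} + M_i are all assumptions made for this example. The adaptive strategy mentioned in the abstract is motivated by the classical optimal heavy-ball momentum for a quadratic problem with condition number kappa, beta* = ((sqrt(kappa) - 1)/(sqrt(kappa) + 1))^2, which is not computed here.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a common positive feature map used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def momentum_linear_attention(Q, K, V, beta=0.9, eps=1e-6):
    # Causal linear attention whose running key-value state is updated through a
    # heavy-ball momentum buffer M instead of a plain cumulative sum (illustrative only).
    # Q, K: (L, d) arrays; V: (L, d_v) array; beta: momentum coefficient, assumed fixed.
    L, d = Q.shape
    d_v = V.shape[1]
    phi_Q, phi_K = elu_feature_map(Q), elu_feature_map(K)

    S = np.zeros((d, d_v))   # running key-value state
    M = np.zeros((d, d_v))   # momentum of the state increment
    z = np.zeros(d)          # running normalizer, sum of phi(k_i)
    out = np.zeros((L, d_v))
    for i in range(L):
        M = beta * M + np.outer(phi_K[i], V[i])   # momentum step on the state increment
        S = S + M
        z = z + phi_K[i]
        out[i] = (phi_Q[i] @ S) / (phi_Q[i] @ z + eps)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 3))
    print(momentum_linear_attention(Q, K, V).shape)  # (8, 3)
```

The loop keeps only the fixed-size state S, M, and z, so memory and per-step cost stay linear in the sequence length, which is the property the momentum transformer preserves while the momentum buffer is meant to recover accuracy lost by plain linear attention.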

Original language: English (US)
Pages (from-to): 189-204
Number of pages: 16
Journal: Proceedings of Machine Learning Research
Volume: 190
State: Published - 2022
Event: 3rd Annual Conference on Mathematical and Scientific Machine Learning, MSML 2022 - Beijing, China
Duration: Aug 15 2022 - Aug 17 2022

Keywords

  • adaptive momentum
  • efficient attention
  • transformer

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability
