Improving Transformers with Probabilistic Attention Keys

Tam Nguyen; Tan M. Nguyen; Dung D. Le; Duy Khuong Nguyen; Viet Anh Tran; Richard G. Baraniuk; Nhat Ho; Stanley J. Osher

Research output: Contribution to journal › Conference article › peer-review

Abstract

Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embeddings, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended for use with linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain performance comparable to or better than that of the baseline transformers with 8 heads.
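
The idea of giving each head a mixture of Gaussian keys can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `mgk_attention`, the argument layout, the shared isotropic variance `sigma`, and the soft weighting by mixing coefficients `pi` are all assumptions made for illustration. Each key position holds M candidate keys, and the attention score for a query sums the Gaussian likelihoods of that query under the M keys before normalizing over positions.

```python
import numpy as np


def mgk_attention(q, keys, pi, v, sigma=1.0):
    """Single-head attention with a mixture of M Gaussian keys per position.

    Illustrative sketch only; names and shapes are assumptions.
    q:    (N, D)    queries
    keys: (N, M, D) M candidate keys for each of the N positions
    pi:   (N, M)    mixing weights per position (each row sums to 1)
    v:    (N, Dv)   values
    """
    # Squared distance between every query i and every key (j, r): (N, N, M).
    diff = q[:, None, None, :] - keys[None, :, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    # Mixture likelihood of query i under the keys at position j,
    # up to a constant factor shared by all positions: (N, N).
    scores = np.sum(pi[None, :, :] * np.exp(-sq_dist / (2.0 * sigma ** 2)), axis=-1)

    # Normalize over key positions so each row is a distribution.
    # Note: this naive form can underflow for large distances.
    attn = scores / scores.sum(axis=-1, keepdims=True)
    return attn @ v, attn


rng = np.random.default_rng(0)
N, M, D = 5, 2, 4
q = rng.normal(size=(N, D))
keys = rng.normal(size=(N, M, D))
pi = np.full((N, M), 1.0 / M)
v = rng.normal(size=(N, 3))
out, attn = mgk_attention(q, keys, pi, v)
```

With a single unit-norm key per position (M = 1) and `sigma=1.0`, the normalized scores in this sketch reduce to ordinary softmax attention over the dot products, which is one way to see the mixture as a generalization of the standard key.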

Original language	English (US)
Pages (from-to)	16595-16621
Number of pages	27
Journal	Proceedings of Machine Learning Research
Volume	162
State	Published - 2022
Event	39th International Conference on Machine Learning, ICML 2022 - Baltimore, United States
Duration	Jul 17 2022 → Jul 23 2022

ASJC Scopus subject areas

Artificial Intelligence
Software
Control and Systems Engineering
Statistics and Probability

Cite this

@article{f3289f1981b9495db0e43e476f5b8c81,

title = "Improving Transformers with Probabilistic Attention Keys",

abstract = "Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use with linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmark, Transformer-MGKs with 4 heads attain comparable or better performance to the baseline transformers with 8 heads.",

author = "Tam Nguyen and Nguyen, {Tan M.} and Le, {Dung D.} and Nguyen, {Duy Khuong} and Tran, {Viet Anh} and Baraniuk, {Richard G.} and Nhat Ho and Osher, {Stanley J.}",

note = "Funding Information: This material is based on research sponsored by the AFOSR MURI FA9550-18-1-0502, the ONR grant N00014-20-1-2093, the MURI N00014-20-1-2787, and the NSF under Grant# 2030859 to the Computing Research Association for the CIFellows Project (CIF2020-UCLA-38). NH acknowledges support from the NSF IFML 2019844 and the NSF AI Institute for Foundations of Machine Learning. RGB was supported by NSF grants CCF-1911094, IIS-1838177, IIS-1730574, and a Vannevar Bush Faculty Fellowship. Publisher Copyright: Copyright {\textcopyright} 2022 by the author(s); 39th International Conference on Machine Learning, ICML 2022 ; Conference date: 17-07-2022 Through 23-07-2022",

year = "2022",

language = "English (US)",

volume = "162",

pages = "16595--16621",

journal = "Proceedings of Machine Learning Research",

issn = "2640-3498",

}

TY - JOUR

T1 - Improving Transformers with Probabilistic Attention Keys

AU - Nguyen, Tam

AU - Nguyen, Tan M.

AU - Le, Dung D.

AU - Nguyen, Duy Khuong

AU - Tran, Viet Anh

AU - Baraniuk, Richard G.

AU - Ho, Nhat

AU - Osher, Stanley J.

N1 - Funding Information: This material is based on research sponsored by the AFOSR MURI FA9550-18-1-0502, the ONR grant N00014-20-1-2093, the MURI N00014-20-1-2787, and the NSF under Grant# 2030859 to the Computing Research Association for the CIFellows Project (CIF2020-UCLA-38). NH acknowledges support from the NSF IFML 2019844 and the NSF AI Institute for Foundations of Machine Learning. RGB was supported by NSF grants CCF-1911094, IIS-1838177, IIS-1730574, and a Vannevar Bush Faculty Fellowship. Publisher Copyright: Copyright © 2022 by the author(s)

PY - 2022

Y1 - 2022

N2 - Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use with linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmark, Transformer-MGKs with 4 heads attain comparable or better performance to the baseline transformers with 8 heads.

UR - http://www.scopus.com/inward/record.url?scp=85163137584&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85163137584&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85163137584

SN - 2640-3498

VL - 162

SP - 16595

EP - 16621

JO - Proceedings of Machine Learning Research

JF - Proceedings of Machine Learning Research

T2 - 39th International Conference on Machine Learning, ICML 2022

Y2 - 17 July 2022 through 23 July 2022

ER -
