Positional encodings are essential for two reasons:
1. They guarantee that Transformers/graph NNs are universal approximators of functions invariant under index permutation. Most real-world graphs have natural symmetries, like the line graph seen by Transformers. https://twitter.com/francoisfleuret/status/1333727738696519682
These symmetries produce, e.g., isomorphic nodes, which introduce ambiguities and reduce the network's ability to disentangle node information. Unique PEs, like the cos/sin PEs used in Transformers, remove these ambiguities and restore universal approximation (see the sketch after this list).
See https://arxiv.org/pdf/1903.02541.pdf, https://arxiv.org/pdf/1907.03199.pdf, https://arxiv.org/pdf/2006.07846.pdf
2. Distance-sensitive PEs, like the cos/sin PEs, provide relevant distance information between nodes: they encode past/future positions in NLP and act as node coordinates for graphs/manifolds.
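
A minimal NumPy sketch of the cos/sin (sinusoidal) PEs illustrating both points: each position gets a unique vector, and the similarity between two encodings depends only on their relative offset. The function name `sinusoidal_pe` and the parameters `max_len`, `d_model` are illustrative choices, not tied to any particular library.

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=50, d_model=64)

# Point 1: every position gets a unique vector, removing the ambiguity
# between the (isomorphic) nodes of the line graph.
assert len({tuple(np.round(row, 6)) for row in pe}) == 50

# Point 2: the inner product of two encodings depends only on their relative
# offset, not on the absolute positions, so it carries distance information.
print(np.allclose(pe[10] @ pe[12], pe[20] @ pe[22]))   # True (same offset of 2)
```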
For standard convolutions in e.g. CV, the notion of PEs is intrinsic because the nodes of the convolutional filter are ordered (we know where the top, bottom, right, left are). For Transformers/graphs, the order of the nodes is arbitrary, so explicit PEs are required.
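
For graphs, where no canonical node order exists, one common choice in the graph-NN literature is to use the eigenvectors of the graph Laplacian as node coordinates. The sketch below assumes that choice; the helper `laplacian_pe`, the 6-node cycle graph, and the number of eigenvectors `k` are purely illustrative.

```python
import numpy as np

def laplacian_pe(adj: np.ndarray, k: int) -> np.ndarray:
    """Return the k smallest non-trivial Laplacian eigenvectors as node PEs."""
    deg = adj.sum(axis=1)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    # Skip the trivial (near-constant) eigenvector with eigenvalue ~0.
    return eigvecs[:, 1:k + 1]

# 6-node cycle graph: all nodes are isomorphic, so without PEs a
# message-passing network cannot tell them apart; the eigenvectors give each
# node a coordinate on the cycle (up to sign/rotation ambiguity).
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0

pe = laplacian_pe(adj, k=2)
print(pe)   # roughly cos/sin coordinates around the cycle
```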