The role of attention in neuroscience, deep learning, and everyday life
The Large Hadron Collider (LHC) is one of the most intricate machines humankind has ever built. While it is running, approximately one billion particle collisions occur every second, with protons smashing into each other at velocities close to the speed of light, probing physics beyond the edges of the current Standard Model of particle physics.
A lot can happen in these one billion particle collisions, and immense detectors are built around the ring of the LHC to not miss out on anything important. But such large quantities of collisions, coupled with complex detectors, generate a lot of data. Seriously, it’s a lot of data. About one petabyte, or one thousand terabytes, or one million gigabytes of collision data per second if you sum it all up.
Hearing these numbers, it's pretty obvious that such an insane amount of data is impossible to analyze, or even to record, with current computing devices. Therefore, a lot of data has to be thrown out in real time before any analysis even becomes feasible.
This is why the detectors at the LHC have fast, automated triggering and filtering systems built into them, which decide which events are worth recording and which don't tell us anything meaningful and can be safely discarded.
But even after the drastic data reduction achieved through these means, the data center at the LHC still ends up with a petabyte of data each day, making up only 0.001 percent of the original incoming data.
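That reduction factor is easy to sanity-check with the article's own round numbers. The figures below are back-of-the-envelope values from this text, not official CERN statistics:

```python
# Back-of-the-envelope check of the data-reduction factor quoted above.
# These are the article's round numbers, not official CERN figures.
incoming_pb_per_second = 1.0            # ~1 petabyte of raw collision data per second
seconds_per_day = 24 * 60 * 60          # 86,400 seconds
incoming_pb_per_day = incoming_pb_per_second * seconds_per_day
recorded_pb_per_day = 1.0               # ~1 petabyte actually kept per day
fraction = recorded_pb_per_day / incoming_pb_per_day
print(f"Kept fraction: {fraction:.6%}")  # roughly 0.001 percent, as stated
```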
Our brains face a similar challenge on a daily basis: cognitive processing and storage, much like computation time at the LHC, are the brain's most precious resources, and for cognitive systems that came about through evolution, spending resources parsimoniously is one of the keys to survival.
All cognition can be viewed as a trade-off between information gain and metabolic expenditure. The LHC can be viewed as its own kind of superhuman cognitive system, probing its environment and extracting relevant information from it at a minimal cost.
One of the keys to extracting information efficiently is to have highly optimized sensors that do their own kind of quick thinking before any deep processing takes place. These low-level filters decide which events are recorded and which are discarded before too much energy is spent on them.
They are combined with the high-level scientific ambitions of the scientists who form the other part of this superhuman cognitive system that is the LHC: from the enormous current of information about the environment flowing in through the sensors, which events seem interesting enough to investigate further? Which events should grab our attention and demand further investigation? And how do our goals (e.g. finding the Higgs boson or supersymmetric particles) inform where we look in the first place and how we construct the filters?
Attention in the Brain
“Everyone knows what attention is. It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought.” — William James, The Principles of Psychology (1890)
While William James claimed that everyone knows what attention is, it is important to emphasize at this point that attention is not one homogeneous thing controlled by a homunculus sitting at the steering wheel inside the pineal gland, but a complex, multi-faceted phenomenon best thought of on several levels and as composed of several individual moving parts. In my previous article on why we might be looking at the brain in the wrong way, I discussed the frequent issue of applying centuries-old terminology to describe novel neuroscientific phenomena (the heritage of James plays its own part here), and relying on ancient terminology and ancient intuitions can frequently stand in the way of properly grasping what is going on.
This holds especially true for attention, something so familiar from everyday life yet at the same time meaning so many things at once, and tied up with other wobbly concepts like free will and consciousness.
So in the rest of this article, I don’t want to be exhaustive (for a more detailed overview on all kinds of facets of attention, this review article might be a good place to start), but rather illustrate some useful key components and functions of attention both in the brain and in machine learning.
The LHC lays out nicely how attention determines where limited resources should best be spent in a complex environment. And I believe it also introduces the key components that compose the attention mechanism in the brain well: our attention system can be thought of as consisting of both bottom-up and top-down control, as Adam Gazzaley and Larry Rosen describe in The Distracted Mind: Ancient Brains in a High Tech World.
Top-down mechanisms try to implement our high-level goals by guiding our attention. Say your New Year's resolution is to lose weight: your cortex will then try to convince your eyes not to notice that delicious-looking bar of chocolate lying over there next to the couch. Top-down goals might be thought of as one of the pinnacles of evolution: as I noted in my article on the Bayesian brain, discovering patterns and predicting the future offered an enormous evolutionary advantage, with our cities and technologies built by the dead giving impressive testament to that fact.
Bottom-up mechanisms, on the other hand, automatically pull your attention towards something that was, over our long evolutionary history, worth noticing: a loud explosion down the street, a jaguar-like shape in the dark, or someone saying your name at the table next to you.
Top-down attention is closely linked to our brain’s executive functions, defined by the ability of the prefrontal cortex to exert top-down control on the rest of your brain, with the rest of the brain pushing back with bottom-up attention grabs. The orbitofrontal cortex, for instance, tries to work out how your emotions are regulated by your goals, and through the limbic system translates abstract goals into the language of body and action.
An interesting thing to note is that much of this is done through inhibition rather than activation. What counts in life is often what we do not do, the impulses we manage not to follow, be it when we decide against staying in our comfortable bed in the morning and go to work, or when we stop ourselves from gambling our retirement pension away in a night in Las Vegas. Studies of patients with lesions in brain areas responsible for attention control (or, in the case of Phineas Gage, who had an iron rod blasted through these relevant brain regions) show how detrimental the inability to control impulses and pursue long-term goals was for their quality of life. A similar notion was popularized by Mischel's famous marshmallow experiment, which indicated that the ability to delay gratification at an early age is a good predictor of long-term success.
Thus, controlling attention is closely intertwined with controlling our behavior. Our brains might be viewed as supercomputers built on monkey brains, and with the supercomputers coming to the party relatively late, exerting cognitive control is tricky, and top-down and bottom-up mechanisms are in constant competition for the brain's most precious resource.
Often, they are in direct conflict with each other.
You might be strong-willed, but after a long day of work, when the sight of the chocolate bar triggers saliva building up in your mouth with tsunami-like intensity, your executive functions are easily overcome and you are shown that you are not master in your own house.
More generally, getting distracted and losing focus can be seen from the perspective of goal interference. Attention, like most things in the brain, is regulated dynamically, and as such is prone to interference. Staying focused, filtering information, and ignoring the irrelevant are active processes that take time and energy. Goals compete, and attentional resources are always in direct competition with each other.
This is another important aspect of attention that makes sense evolutionarily, and it brings us back to the LHC metaphor: global attention guides our sensors to maxima in the information-gathering landscape. These maxima can be sought within individual sensory modalities (e.g. looking in a certain direction), by switching between different networks in the brain (e.g. listening carefully vs. looking carefully), or by switching between tasks (reading a newspaper instead of watching YouTube). After enough information has been gathered, attention also guides our memory in deciding what is worth storing and remembering for future situations.
But our ability to frequently initiate network switching through attention might be seen as both a feature and a bug: as there are usually many things to pay attention to, it is crucial to have an automated infrastructure in place that shines the ray of attention on inputs and switches our action to the activities that promise the most information gain.
In modern high-tech society, we are constantly bombarded by attention grabs. As laid out in The Distracted Mind, too many opportunities for task switching become a huge problem when ancient brains continually try to maximize information gain in an exponentially growing information landscape, leading to constant distraction, failed attempts at multitasking, feelings of insatiability, sleep deprivation, and much more. As complex tasks are usually distributed across several brain areas, the process of switching is also relatively slow, which is why it can take a lot of time to move between different tasks, making multitasking, which we all engage in constantly, highly inefficient.
This phenomenon is especially pronounced among younger generations, and countering these highly detrimental effects on our mental life should warrant our collective society’s undivided attention (pun intended).
Attention in Deep Learning (is all you need)
To sum things up, we can conceptualize attention as something like an overarching organizing principle of a multi-purpose agent that has to perform many tasks at the same time in the real world and navigate meaningfully between them. This includes gathering information across several sensory modalities and implementing high-level goals via cognitive control.
It’s not entirely clear how to get from this perspective on attention to useful ideas in artificial intelligence. But if we view attention as a general tool to reduce and guide computational resources, it is already used quite successfully in several machine learning architectures.
Most recently, attention has grabbed the attention of the deep learning community in the context of transformers, with the paper “Attention Is All You Need” becoming one of the most influential in the field, and being cited over 16000 times.
Transformers have revolutionized natural-language processing, allowing transformer-based language models to generate eerily human-like texts, and architectures like BERT (Bidirectional Encoder Representations from Transformers) to reach new levels of language understanding.
Without getting too technical, text generation is a sequential task, composed of an encoder (processing the text that goes in) and a decoder (producing the text that goes out). The input and output of the model are therefore sequences. Sequential models can be very tricky to train because the input can become quite long, requiring the model to learn long-range dependencies (see the vanishing- and exploding-gradient problems encountered when training recurrent neural networks, something partly solved by long short-term memory (LSTM) networks, which come with their own set of problems).
As this blog post describes in more detail, attention mechanisms, as used by transformers, try to circumvent this problem by introducing the so-called self-attention operation. This operation is computed between the input vectors of the sequence, and can be used when generating output sequences.
Self-attention is thus a way of figuring out global dependencies, that is, of identifying which parts of an input sequence belong together and are, in turn, going to be relevant for output generation. One often-cited application is translation between languages (e.g. between French and English, using the wonderfully named CamemBERT model), where words that belong together in meaning can pop up in different parts of the sentence in different languages.
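As a rough sketch, the scaled dot-product self-attention operation at the heart of transformers can be written in a few lines of NumPy. The shapes, random weights, and function names here are illustrative toy values, not a faithful reproduction of any particular implementation:

```python
# Minimal sketch of scaled dot-product self-attention, in the spirit of
# "Attention Is All You Need". Toy shapes and random weights for illustration.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token embeddings; W_*: learned projection matrices."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Every position attends to every other position: this score matrix
    # is where the global dependencies between tokens are computed.
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # context-aware representation per token

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # a toy "sentence" of 5 tokens
W = [rng.normal(size=(8, 8)) for _ in range(3)]
out = self_attention(X, *W)
print(out.shape)  # (5, 8): one context-mixed vector per input token
```

The key point is that the attention weights are computed from the input itself, so each output vector is a learned, input-dependent mixture of the whole sequence rather than a fixed-window summary.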
Attention also helps circumvent the problem of having very long input sequences because the model does not need to remember the whole input sequence (say when you have to translate a very long sentence…looking at you, Marcel Proust), but can prioritize and batch the input more flexibly, much as a human translator would (see this video by Andrew Ng for a more detailed explanation).
This helps in effectively reducing the dimensionality of the input vector, as the model implicitly selects which part of the sequence is going to be relevant, and thus figures out what to pay attention to. This also nicely introduces context-dependence into the model, something absolutely crucial in our human understanding of text (and the world in general).
While there are comparisons to be drawn between this and the brain, some of the connections might admittedly also seem a little far-fetched here.
Our human intelligence is so impressive because it works across such a wide array of different tasks, and neural network architectures, being for the most part still highly specialized, have a hard time with it. So something more akin to our human-like attention could very well become increasingly important as a global organizational principle in the context of multi-task learning agents, for example in robotics.
There is still a lot of room for fresh and interesting ideas (e.g. comparing attention in transformers to how the brain processes natural language), and this nicely underscores why, further down the road, concepts from neuroscience might infuse machine learning with a set of useful new ideas, and why the two disciplines should stay in close contact with each other.