cybersalience / write-up.txt
tldr: I render transformer self-attention explicit as a pulsing saliency map overlaid on a piece of text, in order to help guide the user's attention across it.
This time, we'll try something a bit different. This project's demo and write-up are one and the same, as they explore a new mechanic for digesting content. What better place, then, for playing around with this way of navigating information than the write-up itself? The way I suggest going about it is to first read the content through, before trying out the interactive bits of this page. This way, even if reading the whole text beforehand defeats the purpose of the mechanic, you'll manage to get a sense of what you could use it for in the future. Some technical quirks lead to this write-up lacking formatting and links, so please find related resources in the sidebar, if you're interested.
Over the past few months, I explored a couple of different approaches for navigating large amounts of knowledge. The lexiscore attempted to rank content items by a Shannon-esque estimate of how much information they provide you with, while the decontextualizer attempted to extract self-contained snippets from those documents. My understanding of how best to build tools for navigating large corpora evolved quite a bit during those months, and I now feel it's particularly fruitful to frame those tools as tiny building blocks of broader and more integrated perceptual engines. In contrast to search engines, perceptual engines aim to be more abstractive (than extractive), more interactive (than a static dump of results), and richer in top-down influences (than a single text query to rule them all). In this context, cybersalience is a building block meant to be placed relatively late in the processing pathway — only after mountains of data have been compressed so as to pass through the bottleneck of the user's actual sensorium.
The motivation behind cybersalience specifically is that text as we know it — written symbols strung together on screen — is far from being a humane representation of thought. Just think of how long it takes to perceive a natural image compared to reading a thousand words describing it, perhaps a second versus a few minutes. Assuming the old adage is decently accurate in comparing the amount of information contained in the depiction and the description, it's quite obvious that text is not particularly ergonomic for people as a medium for encoding information. We invented paragraphs and outlines to help us hierarchically navigate it to a first approximation, and various formatting tools to make certain bits stand out — both of which feel like rudimentary, even forced, approaches to coercing text into being more brain-friendly. On the other hand, NLP models trained on large bodies of text have it "in their blood" — they're designed from the ground up to read, attend to, form memories of, and write text. A lot of text, but just text.
Given this discrepancy, what if we borrowed the ability to perceive text effectively from NLP models? I'm not referring to asking them to write new and more brain-friendly (i.e. shorter) text (e.g. a summary, answers to questions). That would surely have its place somewhere in a perceptual engine, but I'm wondering here whether we could specifically borrow the very ability to perceive a certain text as a whole. It could be the original text, or the result of a previous processing step.
A promising place where this deeper human-machine integration could happen is at the level of attention. Interestingly enough, many recent NLP models incorporate what are called self-attention and cross-attention layers. Both flavors help the model as a whole figure out what it should attend to, what it should allocate its representational resources to. For instance, in machine translation — arguably the birthplace of attention layers — a model learns to pay attention to different parts of an English sentence as it produces the French translation. It doesn't encode the whole English sentence before regurgitating a French translation, but "keeps an eye on" relevant parts of the input as it writes different parts of the output. This led to better results, especially in situations where the input was way too complex to represent at once.
That's all great: the ability to attend to the right things helped NLP models (and soon after, ML models in general) get better at what they were doing before: machine translation, language modeling, etc. However, an underrated side-effect of those improvements is the fact that the trained model knows what is relevant to attend to! You can reverse engineer what specific parts of the input the model is attending to as it's doing its job. That's super interesting! You get a glimpse into its idealized perception mechanism, and get a feel for what it's looking at. This is of course useful as an interpretability tool, as an "input features" type of local explanation: if the image classifier looks at the snow around the dog, rather than at the dog itself, when trying to determine whether there's a Husky in the picture, then it's time to head back to the drawing board. There are dozens of other interpretability techniques, all useful debugging tools by day and juicy human-machine synergies by night (I'm working on a novel one as part of my bachelor's project!). However, I chose to focus on attention here due to the obvious analogy to cognitive psychology.
Coming back to cybersalience as a tiny building block of perceptual engines, how can those attention layers actually help us perceive a long text? One possible approach is to use the artificial attention deployed by the NLP model to guide human attention. We can let the model do its job (e.g. compute semantic embeddings), then simply disregard the main outputs, and instead look under the hood for the attention matrices derived as a side-product. After mean-pooling across the attention heads, filtering for the attention used to specifically inform the representation of a custom query appended to the input content, and cleaning up the resulting values a bit more based on their overall distribution, we get a compact human-readable explanation of what the model is attending to when trying to understand the query. If we then take those values and represent them using cognitively ergonomic pre-attentive features like color and movement, we can finally guide the user to pay attention to what the machine is paying attention to, weaving together parts of their perceptual systems.
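The pooling-and-filtering step above can be sketched in plain Python. Everything here is a toy stand-in: the attention tensor, the query token indices, and the min-max normalization are illustrative assumptions, not the demo's actual code.

```python
def query_salience(attn, query_idx):
    """Turn raw attention into a per-token salience map.

    attn: toy attention tensor as nested lists, where attn[h][i][j]
          is how much attention head h has token i pay to token j
          (hypothetical values, not real BERT output).
    query_idx: positions of the query tokens appended to the input.
    """
    n_heads = len(attn)
    n_tokens = len(attn[0])

    # Mean-pool across attention heads.
    pooled = [[sum(attn[h][i][j] for h in range(n_heads)) / n_heads
               for j in range(n_tokens)]
              for i in range(n_tokens)]

    # Keep only the rows belonging to the query tokens: how much each
    # content token informs the query's representation.
    salience = [sum(pooled[q][j] for q in query_idx) / len(query_idx)
                for j in range(n_tokens)]

    # Min-max normalize to [0, 1] so values can drive visual intensity.
    lo, hi = min(salience), max(salience)
    span = (hi - lo) or 1.0
    return [(s - lo) / span for s in salience]
```

In the real pipeline, the attention tensor would come from the model's forward pass, and further distribution-based clean-up would follow, but the pooling-then-filtering order is the core of the idea.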
A few more technical details. I opted for the bert-base-cased model, working with all its attention heads from its second-to-last layer. Each paragraph is processed separately, after appending the user's custom query to it behind the scenes. I went with processing individual paragraphs instead of the whole document at once because the model appeared to have a bias towards the last part of the document. It makes a lot of sense β€” the last couple paragraphs are particularly relevant for understanding what comes next. The tokens which are particularly salient for the NLP model get marked with custom CSS styling in order to grab the user's attention. This includes color and a pulsing animation as a means to cover those pre-attentive features. I made the formatting quite customizable in order for people to play around with different visual representations of the model's attention, see how it feels.
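To make the styling step concrete, here is one way salience values might be turned into marked-up tokens. The class name, the CSS custom property, and the threshold are all hypothetical choices for illustration; the demo's actual markup may differ, and the pulsing keyframe animation is assumed to live elsewhere in the stylesheet.

```python
def render_tokens(tokens, salience, threshold=0.6):
    """Wrap sufficiently salient tokens in a styled <span>.

    The span's --strength variable could drive color intensity and
    animation speed from CSS (assumed convention, not the demo's).
    """
    out = []
    for tok, s in zip(tokens, salience):
        if s >= threshold:
            out.append(
                f'<span class="salient" style="--strength:{s:.2f}">{tok}</span>'
            )
        else:
            out.append(tok)
    return " ".join(out)
```

Routing the raw value into a CSS variable rather than hard-coding styles keeps the visual representation customizable, in the spirit of letting people play around with different renderings.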
The two most important settings to be configured are the driving query and the amount of focus. On one hand, the driving query forces the model to attend to the parts of the text which help inform its understanding of what the query means. It could roughly be seen as a top-down influence on the way the model's attention gets deployed. The query could be a short noun phrase, a complete question, or whatever fits in a text box. On the other hand, the amount of focus determines how diffuse or sharp the saliency map should be. Low focus leads to the model's attention being spread out across the whole text, while high focus sharpens the saliency map on the few most relevant tokens.
While this project focused on using artificial attention to help people perceive text, the same ability could be repurposed in a few different ways. For instance, what if you used a similar saliency map to get a better feel for how two written notes relate to each other? Treat one as the query and the other as the main document, and you can guide user attention towards the potential connections. Also, if you have a small set of notes, each color-coded using a subtle color palette, you might learn how to perceive the way they all relate to each other, by color-coding the saliency maps accordingly. "Ah, so this note might relate to the light blue one through those blue highlights, and to the light green one through those green ones, now I see it..." Perhaps with a directed graph in the background depicting how information flows among those, how they inform each other's meaning. Think ad-hoc continuous links. Selecting a bit of text might get you to other notes with a probability based on the strength of the color-coded saliency map at that location. But what if the user selection, too, would be a continuously-updated continuous distribution over tokens instead of a discrete snippet? Maybe via eye tracking? Or perhaps via intent recognition based on interaction history and a user model? I sometimes feel that every month I put together a toy project I could easily spend a few years working on, looking into its ramifications.
Before I end this, I wanted to highlight the fact that the analogy between organic and artificial attention goes way deeper than the surface feature of "focusing on specific things." For instance, it has been hypothesized that endogenous attention (i.e. intentionally paying attention to something, rather than something grabbing your attention) is partially realized in the brain by means of firing synchrony: higher-level neuron clusters (allegedly coding more abstract concepts) encouraging particular signals in lower-level ones (allegedly coding raw sensory information) by means of synchronized firing. Similarly, artificial attention is roughly implemented by selecting for raw signals which are aligned with higher-level "queries" by means of a dot product. It would be really interesting if top-down human queries (e.g. What are perceptual engines?) could directly guide bottom-up artificial processing without the user having to write out the query in text, picking it up neurally or predicting it instead.
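The dot-product selection mentioned above can be made concrete with a tiny scaled dot-product attention sketch over toy vectors. The vectors and dimensionality are made up for illustration; this is the textbook mechanism, not code lifted from the demo.

```python
import math

def dot_product_attention(query, keys):
    """Scaled dot-product attention weights over a set of key vectors.

    Signals aligned with the query (large dot product) receive more
    attention mass, echoing the top-down selection described above.
    """
    d = len(query)
    # Alignment scores, scaled by sqrt(dimension) as in transformers.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax turns scores into a distribution of attention mass.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The key with the largest dot product against the query ends up with the most attention mass, which is the whole "selecting for aligned raw signals" story in three lines of arithmetic.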
To wrap up, borrowing from an NLP model's native ability to attend to different parts of a text might, in turn, help us deploy our attention effectively. Now, this is only a "last-mile" solution to perceiving large corpora — it helps you find stuff that's already on screen, and doesn't focus on deciding what makes it there. Still, this project has been a useful exercise in thinking more about my wishlist of features and the architecture of perceptual engines. Give it a shot! Try setting a driving query in the sidebar, play around with the focus, or even add your own text by resetting the content.