import streamlit as st

st.set_page_config(page_title="Memory and Mirroring", page_icon="🧠", layout="wide")

# Hiding the main menu and footer using CSS (the standard Streamlit snippet)
hide_streamlit_style = """
<style>
#MainMenu {visibility: hidden;}
footer {visibility: hidden;}
</style>
"""
st.markdown(hide_streamlit_style, unsafe_allow_html=True)

st.title("🧠 Memory and Mirroring in AI - Simulated and Personalized Semantic and Episodic Memory")

# Using expanders for the different sections
with st.expander("📝 Semantic and Episodic Memory as Cognitive AI Tools"):
    st.subheader("1️⃣ Semantic Memory")
    st.markdown("""
**Semantic memory** is a type of long-term memory that houses our knowledge of facts, concepts, and the broader world. Unlike episodic memory, which is personal and subjective, semantic memory concerns objective truths and shared knowledge that help us navigate everyday life. It covers everything from understanding the laws of physics to recognizing the names of colors or the shapes of letters.

This memory system is essential for language, reasoning, and the application of knowledge in new contexts. It allows us to form a framework of the external world, enabling systematic and informed decision-making and interaction. As we accumulate experiences, our semantic memory continuously expands and refines, solidifying our grasp on reality and enhancing our cognitive processes.
""")

    st.subheader("2️⃣ Episodic Memory")
    st.markdown("""
**Episodic memory** is a form of long-term memory that captures personal experiences and events, deeply intertwined with sensory details and emotional undercurrents. This type of memory is not just about the when and where of events, but also about the feelings and senses involved: the visual and auditory impressions, the scents, and the tactile experiences.

For example, even if language skills were not yet fully developed, one could vividly recall the emotions, sights, and sounds of a fifth birthday party. This vividness is largely due to the interaction between the neocortex, which processes the details of these memories, and the amygdala, the part of the brain crucial for emotional tagging. This emotional connection often makes episodic memories particularly strong and enduring.
""")

# An illustrative code sketch of these two memory stores follows below.
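# ---------------------------------------------------------------------------
# Illustrative sketch (an assumption, not part of the original app): one minimal
# way to simulate the semantic vs. episodic stores described in the expander
# above. All names here (SemanticFact, EpisodicEvent, SimulatedMemory and their
# methods) are hypothetical and exist only for illustration.
# ---------------------------------------------------------------------------
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SemanticFact:
    """Objective, context-free knowledge as a subject-predicate-object triple."""
    subject: str
    predicate: str
    obj: str


@dataclass
class EpisodicEvent:
    """A personal, time-stamped experience carrying an emotional tag."""
    description: str
    emotion: str
    timestamp: datetime = field(default_factory=datetime.now)


class SimulatedMemory:
    """Toy memory that keeps semantic facts and episodic events in separate stores."""

    def __init__(self):
        self.semantic_facts = []
        self.episodic_events = []

    def learn_fact(self, subject: str, predicate: str, obj: str) -> None:
        self.semantic_facts.append(SemanticFact(subject, predicate, obj))

    def record_event(self, description: str, emotion: str) -> None:
        self.episodic_events.append(EpisodicEvent(description, emotion))

    def recall_by_emotion(self, emotion: str):
        # Episodic recall keyed on the emotional tag, loosely echoing the
        # amygdala-driven emotional tagging described in the expander above.
        return [e for e in self.episodic_events if e.emotion == emotion]


# Minimal usage of the sketch (has no effect on the rendered page):
_memory = SimulatedMemory()
_memory.learn_fact("Paris", "is the capital of", "France")    # semantic knowledge
_memory.record_event("Fifth birthday party", emotion="joy")   # episodic experience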
with st.expander("🤖 Mirroring in Humans and Applying it to AI"):
    st.subheader("1️⃣ What is Mirroring?")
    st.markdown("""
**Mirroring** is a sophisticated social technique in which individuals subtly replicate the gestures, speech patterns, and attitudes of others. This behavior is not mere mimicry but a strategic approach to fostering rapport and enhancing understanding among individuals. By reflecting someone else's behavior, people can create a sense of empathy and connection, which facilitates smoother and more effective communication.

In the context of artificial intelligence, applying mirroring involves programming AI systems to recognize and adapt to these human nuances, allowing them to interact more naturally with users. This capability can transform AI from a simple tool into a more engaging and empathetic companion, capable of supporting more complex and sensitive human interactions.
""")

    st.subheader("2️⃣ Benefits of Mirroring")
    st.markdown("""
**Mirroring** enhances communication by creating a supportive and empathetic environment that is crucial for effective interaction. The technique goes beyond mere replication of actions; it involves understanding and responding to the underlying emotions and intentions, which helps to build trust and rapport. In therapeutic settings, mirroring is a powerful tool that allows therapists to connect with their clients more deeply, facilitating greater understanding and faster healing.

In the realm of AI, integrating mirroring techniques can significantly improve the interaction between humans and machines. By enabling AI systems to respond to human emotions and behaviors in a contextually appropriate manner, these systems become more than tools: they evolve into empathetic partners that can anticipate needs and react sensitively. This capability is particularly beneficial in domains such as healthcare, customer service, and education, where understanding and trust are paramount.
""")

    st.subheader("3️⃣ Leveraging Mirroring to Enhance AI Learning and Perception")
    st.markdown("""
**Mirroring**, when applied effectively in AI design, is more than copying human behaviors: it enhances the AI's learning process through action-based communication. This method taps into the psychology of learning and perception, enabling AI systems not only to replicate human actions but also to understand the intentions and emotions behind those actions.

By observing and reflecting human behaviors, AI can develop a richer context for its interactions, improving its decision-making processes and making its interactions more natural and intuitive. This approach helps bridge the gap between human and machine, facilitating a more seamless integration of AI into everyday human activities. The goal is not just to mirror but to adapt and evolve in response to human cues, thereby enriching the AI's experiential learning and enhancing its cognitive capabilities.
""")

with st.expander("🤖 Mirroring as a Cognitive Tool in AI and Neuroscience"):
    st.subheader("1️⃣ The Concept of Mirroring in Cognitive Science")
    st.markdown("""
**Mirroring** is a pivotal communication mechanism prevalent across many life forms. It entails the nuanced imitation and adjustment of behaviors, ranging from physical movements to complex emotional expressions, to foster empathy and deepen understanding. In human interactions, mirroring includes matching gestures such as eye contact and nods, which creates a shared cognitive space and enhances interpersonal connectivity.

In artificial intelligence, this concept is applied by equipping AI systems with the ability to detect and emulate human emotional and physical cues. This capability helps not only in building a connection but also in understanding user intent, thereby improving interaction quality.
""")

    st.subheader("2️⃣ Cognitive Benefits of Mirroring")
    st.markdown("""
The utility of mirroring goes beyond the simple replication of actions seen in natural contexts, such as a human reassuring an animal by gesturing to indicate a safe environment. These non-verbal cues, effective even across species, highlight the potent impact of adaptive communication without reliance on language. For AI, this suggests that systems can be made more responsive and attuned to the emotional dynamics of users, enhancing the user experience by providing a secure and engaging environment.
""")

    st.subheader("3️⃣ Enhancing AI's Cognitive Models Through Mirroring")
    st.markdown("""
Implementing mirroring in AI involves more than the straightforward imitation of human actions; it is about creating systems that can interpret and adapt to the complex web of human interactions. This requires AI not only to replicate but also to understand the context and significance behind human behaviors. Such systems need advanced cognitive models that can process and mimic the subtleties of human gestures and emotions, thereby making AI interactions more intuitive and meaningful.
""")

# An illustrative code sketch of a simple mirroring heuristic follows below.
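# ---------------------------------------------------------------------------
# Illustrative sketch (an assumption, not part of the original app): one very
# simple way an assistant could "mirror" a user by matching tone and brevity.
# The heuristics and function names below (detect_style, mirrored_reply) are
# hypothetical and are not a real library API.
# ---------------------------------------------------------------------------
def detect_style(user_message: str) -> dict:
    """Extract rough style cues from the user's message: energy and brevity."""
    return {
        "excited": "!" in user_message,
        "brief": len(user_message.split()) < 8,
    }


def mirrored_reply(user_message: str, core_answer: str) -> str:
    """Wrap a core answer so that its tone loosely mirrors the user's style."""
    style = detect_style(user_message)
    reply = core_answer
    if style["brief"]:
        # Brief users get the first sentence only.
        reply = reply.split(". ")[0].rstrip(".") + "."
    if style["excited"]:
        # Reflect the user's energy back with matching punctuation.
        reply = reply.rstrip(".") + "!"
    return reply


# Minimal usage of the sketch (has no effect on the rendered page):
_example = mirrored_reply(
    "Quick question - what is mirroring?!",
    "Mirroring is the subtle matching of another person's gestures, tone, and pacing. "
    "It builds rapport by signalling attention and empathy.",
)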
# Paper references shown at the bottom of the page. The raw reference text is
# sanitized before rendering in case it contains characters that do not encode
# cleanly as UTF-8.
unsafe_string = """
# 🩺🔍 AI and Neuroscience Paper References - 🤖 Mirroring, 📝 Semantic and Episodic Memory as Cognitive AI Tools

18 Jan 2023 | Joint Representation Learning for Text and 3D Point Cloud | ⬇️
Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang

Recent advancements in vision-language pre-training (e.g. CLIP) have shown that vision models can benefit from language supervision. While many models using language modality have achieved great success on 2D vision tasks, the joint representation learning of 3D point cloud with text remains under-explored due to the difficulty of 3D-Text data pair acquisition and the irregularity of the 3D data structure. In this paper, we propose a novel Text4Point framework to construct language-guided 3D point cloud models. The key idea is utilizing 2D images as a bridge to connect the point cloud and the language modalities. The proposed Text4Point follows the pre-training and fine-tuning paradigm. During the pre-training stage, we establish the correspondence of images and point clouds based on the readily available RGB-D data and use contrastive learning to align the image and point cloud representations. Together with the well-aligned image and text features achieved by CLIP, the point cloud features are implicitly aligned with the text embeddings. Further, we propose a Text Querying Module to integrate language information into 3D representation learning by querying text embeddings with point cloud features. For fine-tuning, the model learns task-specific 3D representations under informative language guidance from the label set without 2D images. Extensive experiments demonstrate that our model shows consistent improvement on various downstream tasks, such as point cloud semantic segmentation, instance segmentation, and object detection. The code will be available here: https://github.com/LeapLabTHU/Text4Point

14 Jul 2017 | A Semantics-Based Measure of Emoji Similarity | ⬇️
Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran

Emoji have grown to become one of the most important forms of communication on the web. With its widespread use, measuring the similarity of emoji has become an important problem for contemporary text processing, since it lies at the heart of sentiment analysis, search, and interface design tasks. This paper presents a comprehensive analysis of the semantic similarity of emoji through embedding models that are learned over machine-readable emoji meanings in the EmojiNet knowledge base. Using emoji descriptions, emoji sense labels, and emoji sense definitions, and with different training corpora obtained from Twitter and Google News, we develop and test multiple embedding models to measure emoji similarity. To evaluate our work, we create a new dataset called EmoSim508, which assigns human-annotated semantic similarity scores to a set of 508 carefully selected emoji pairs. After validation with EmoSim508, we present a real-world use case of our emoji embedding models on a sentiment analysis task and show that our models outperform the previous best-performing emoji embedding model on this task. The EmoSim508 dataset and our emoji embedding models are publicly released with this paper and can be downloaded from http://emojinet.knoesis.org/.
11 Mar 2021 | Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views | ⬇️
Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra

We study the task of semantic mapping - specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map ("what is where?") from egocentric observations of an RGB-D camera with known pose (via localization sensors). Towards this goal, we present SemanticMapNet (SMNet), which consists of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length x width x feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the neural episodic memories and spatio-semantic allocentric representations built by SMNet for subsequent tasks in the same space - navigating to objects seen during the tour ("Find chair") or answering questions about the space ("How many chairs did you see in the house?"). Project page: https://vincentcartillier.github.io/smnet.html.

06 Mar 2020 | Distributional semantic modeling: a revised technique to train term/word vector space models applying the ontology-related approach | ⬇️
Oleksandr Palagin, Vitalii Velychko, Kyrylo Malakhov and Oleksandr Shchurov

We design a new technique for distributional semantic modeling with a neural network-based approach to learn distributed term representations (or term embeddings) - term vector space models as a result - inspired by the recent ontology-related approach (using different types of contextual knowledge such as syntactic knowledge, terminological knowledge, semantic knowledge, etc.) to the identification of terms (term extraction) and relations between them (relation extraction), called semantic pre-processing technology (SPT). Our method relies on automatic term extraction from natural language texts and the subsequent formation of problem-oriented or application-oriented (also deeply annotated) text corpora where the fundamental entity is the term (including non-compositional and compositional terms). This gives us an opportunity to change over from distributed word representations (word embeddings) to distributed term representations (term embeddings). This transition will allow us to generate more accurate semantic maps of different subject domains (and of relations between input terms - useful for exploring clusters and oppositions, or for testing hypotheses about them). The semantic map can be represented as a graph using Vec2graph - a Python library for visualizing word embeddings (term embeddings in our case) as dynamic and interactive graphs. The Vec2graph library coupled with term embeddings will not only improve accuracy in solving standard NLP tasks, but also update the conventional concept of automated ontology development.
The main practical result of our work is a development kit (a set of toolkits provided as web-service APIs and a web application), which provides all the necessary routines for the basic linguistic pre-processing and the semantic pre-processing of natural language texts in Ukrainian for future training of term vector space models.

19 Jan 2024 | PoseScript: Linking 3D Human Poses and Natural Language | ⬇️
Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, Grégory Rogez

Natural language plays a critical role in many computer vision applications, such as image captioning, visual question answering, and cross-modal retrieval, to provide fine-grained semantic information. Unfortunately, while human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. To address this issue, we have introduced the PoseScript dataset. This dataset pairs more than six thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. Additionally, to increase the size of the dataset to a scale compatible with data-hungry learning algorithms, we have proposed an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information, known as "posecodes", using a set of simple but generic rules on the 3D keypoints. These posecodes are then combined into higher-level textual descriptions using syntactic rules. With automatic annotations, the amount of available data significantly scales up (100k), making it possible to effectively pretrain deep models for finetuning on human captions. To showcase the potential of annotated poses, we present three multi-modal learning tasks that utilize the PoseScript dataset. Firstly, we develop a pipeline that maps 3D poses and textual descriptions into a joint embedding space, allowing for cross-modal retrieval of relevant poses from large-scale datasets. Secondly, we establish a baseline for a text-conditioned model generating 3D poses. Thirdly, we present a learned process for generating pose descriptions. These applications demonstrate the versatility and usefulness of annotated poses in various tasks and pave the way for future research in the field.

11 Sep 2023 | Tell me what you see: A zero-shot action recognition method based on natural language descriptions | ⬇️
Valter Estevam and Rayson Laroca and David Menotti and Helio Pedrini

This paper presents a novel approach to Zero-Shot Action Recognition. Recent works have explored the detection and classification of objects to obtain semantic information from videos with remarkable performance. Inspired by them, we propose using video captioning methods to extract semantic information about objects, scenes, humans, and their relationships. To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences. More specifically, we represent videos using sentences generated via video captioning methods and classes using sentences extracted from documents acquired through search engines on the Internet. Using these representations, we build a shared semantic space employing BERT-based embedders pre-trained in the paraphrasing task on multiple text datasets. The projection of both visual and semantic information onto this space is straightforward, as they are sentences, enabling classification using the nearest-neighbor rule.
We demonstrate that representing videos and labels with sentences alleviates the domain adaptation problem. Additionally, we show that word vectors are unsuitable for building the semantic embedding space of our descriptions. Our method outperforms the state-of-the-art performance on the UCF101 dataset by 3.3 p.p. in accuracy under the TruZe protocol and achieves competitive results on both the UCF101 and HMDB51 datasets under the conventional protocol (0/50% training/testing split). Our code is available at https://github.com/valterlej/zsarcap.

16 Jun 2023 | M3PT: A Multi-Modal Model for POI Tagging | ⬇️
Jingsong Yang, Guanzhou Han, Deqing Yang, Jingping Liu, Yanghua Xiao, Xiang Xu, Baohua Wu, Shenghua Ni

POI tagging aims to annotate a point of interest (POI) with informative tags, which facilitates many POI-related services, including search, recommendation, and so on. Most existing solutions neglect the significance of POI images and seldom fuse the textual and visual features of POIs, resulting in suboptimal tagging performance. In this paper, we propose a novel Multi-Modal Model for POI Tagging, namely M3PT, which achieves enhanced POI tagging by fusing the target POI's textual and visual features and precisely matching the multi-modal representations. Specifically, we first devise a domain-adaptive image encoder (DIE) to obtain image embeddings aligned to their gold tags' semantics. Then, in M3PT's text-image fusion module (TIF), the textual and visual representations are fully fused into the POIs' content embeddings for the subsequent matching. In addition, we adopt a contrastive learning strategy to further bridge the gap between the representations of different modalities. To evaluate the tagging models' performance, we have constructed two high-quality POI tagging datasets from the real-world business scenario of Ali Fliggy. On these datasets, we conducted extensive experiments to demonstrate our model's advantage over uni-modal and multi-modal baselines, and to verify the effectiveness of important components in M3PT, including DIE, TIF, and the contrastive learning strategy.

21 Jun 2023 | EmTract: Extracting Emotions from Social Media | ⬇️
Domonkos F. Vamossy and Rolf Skog

We develop an open-source tool (EmTract) that extracts emotions from social media text, tailored to a financial context. To do so, we annotate ten thousand short messages from a financial social media platform (StockTwits) and combine them with open-source emotion data. We then use a pre-tuned NLP model, DistilBERT, augment its embedding space by including 4,861 tokens (emojis and emoticons), and then fit it first on the open-source emotion data, then transfer it to our annotated financial social media data. Our model outperforms competing open-source state-of-the-art emotion classifiers, such as Emotion English DistilRoBERTa-base, on both human- and chatGPT-annotated data. Compared to dictionary-based methods, our methodology has three main advantages for research in finance. First, our model is tailored to financial social media text; second, it incorporates key aspects of social media data, such as non-standard phrases, emojis, and emoticons; and third, it operates by sequentially learning a latent representation that includes features such as word order, word usage, and local context. Using EmTract, we explore the relationship between investor emotions expressed on social media and asset prices.
We show that firm-specific investor emotions are predictive of daily price movements. Our findings show that emotions and market dynamics are closely related, and we provide a tool to help study the role emotions play in financial markets.

29 Oct 2022 | Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding | ⬇️
Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, Xiaodan Liang

To bridge the gap between supervised semantic segmentation and real-world applications that require a single model to recognize arbitrary new concepts, recent zero-shot segmentation work has attracted a lot of attention by exploring the relationships between unseen and seen object categories, yet it still requires large amounts of densely annotated data with diverse base classes. In this paper, we propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any effort on dense annotations, by purely exploiting the image-caption data that naturally exists on the Internet. Our method, Vision-language-driven Semantic Segmentation (ViL-Seg), employs an image and a text encoder to generate visual and text embeddings for the image-caption data, with two core components that endow its segmentation ability: First, the image encoder is jointly trained with a vision-based contrasting and a cross-modal contrasting, which encourage the visual embeddings to preserve both the fine-grained semantics and the high-level category information that are crucial for the segmentation task. Furthermore, an online clustering head is devised over the image encoder, which allows the visual embeddings to be dynamically segmented into distinct semantic groups such that they can be classified by comparing them with various text embeddings, completing our segmentation pipeline. Experiments show that without using any data with dense annotations, our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling, on three benchmark datasets.

10 Feb 2020 | The Tensor Brain: Semantic Decoding for Perception and Memory | ⬇️
Volker Tresp and Sahand Sharifzadeh and Dario Konopatzki and Yunpu Ma

We analyse perception and memory, using mathematical models for knowledge graphs and tensors, to gain insights into the corresponding functionalities of the human mind. Our discussion is based on the concept of propositional sentences consisting of *subject-predicate-object* (SPO) triples for expressing elementary facts. SPO sentences are the basis for most natural languages but might also be important for explicit perception and declarative memories, as well as intra-brain communication and the ability to argue and reason. A set of SPO sentences can be described as a knowledge graph, which can be transformed into an adjacency tensor. We introduce tensor models, where concepts have dual representations as indices and associated embeddings, two constructs we believe are essential for understanding implicit and explicit perception and memory in the brain. We argue that a biological realization of perception and memory imposes constraints on information processing.
In particular, we propose that explicit perception and declarative memories require a semantic decoder, which, in a simple realization, is based on four layers: first, a sensory memory layer, as a buffer for sensory input; second, an index layer representing concepts; third, a memoryless representation layer for the broadcasting of information - the "blackboard" or "canvas" of the brain; and fourth, a working memory layer as a processing center and data buffer. We discuss the operations of the four layers and relate them to the global workspace theory. In a Bayesian brain interpretation, semantic memory defines the prior for observable triple statements. We propose that - in evolution and during development - semantic memory, episodic memory, and natural language evolved as emergent properties in agents' process of gaining a deeper understanding of sensory information.

09 Sep 2021 | Talk-to-Edit: Fine-Grained Facial Editing via Dialog | ⬇️
Yuming Jiang, Ziqi Huang, Xingang Pan, Chen Change Loy, Ziwei Liu

Facial editing is an important task in vision and graphics with numerous applications. However, existing works are incapable of delivering a continuous and fine-grained editing mode (e.g., editing a slightly smiling face to a big laughing one) with natural interactions with users. In this work, we propose Talk-to-Edit, an interactive facial editing framework that performs fine-grained attribute manipulation through dialog between the user and the system. Our key insight is to model a continual "semantic field" in the GAN latent space. 1) Unlike previous works that regard editing as traversing straight lines in the latent space, here the fine-grained editing is formulated as finding a curving trajectory that respects the fine-grained attribute landscape on the semantic field. 2) The curvature at each step is location-specific and determined by the input image as well as the users' language requests. 3) To engage the users in a meaningful dialog, our system generates language feedback by considering both the user request and the current state of the semantic field. We also contribute CelebA-Dialog, a visual-language facial editing dataset, to facilitate large-scale study. Specifically, each image has manually annotated fine-grained attribute annotations as well as template-based textual descriptions in natural language. Extensive quantitative and qualitative experiments demonstrate the superiority of our framework in terms of 1) the smoothness of fine-grained editing, 2) the identity/attribute preservation, and 3) the visual photorealism and dialog fluency. Notably, a user study validates that our overall system is consistently favored by around 80% of the participants. Our project page is https://www.mmlab-ntu.com/project/talkedit/.

---------------
"""

# Replace any characters that cannot be encoded as UTF-8 before rendering, so a
# stray byte in the reference text cannot break the page.
safe_text = unsafe_string.encode("utf-8", "replace").decode("utf-8")
st.markdown(safe_text)