import streamlit as st
st.set_page_config(page_title="Memory and Mirroring", page_icon="🧠", layout="wide")
# Hiding the main menu and footer using CSS
hide_streamlit_style = """
<style>
#MainMenu {visibility: hidden;}
footer {visibility: hidden;}
</style>
"""
st.markdown(hide_streamlit_style, unsafe_allow_html=True)
st.title("🧠 Memory and Mirroring in AI - Simulated and Personalized Semantic and Episodic Memory")
# Using expanders for different sections
with st.expander("📝 Semantic and Episodic Memory as Cognitive AI Tools"):
    st.subheader("1️⃣ Semantic Memory")
    st.markdown("""
**Semantic memory** is a crucial type of long-term memory that houses our knowledge of facts, concepts, and the broader world. Unlike episodic memory, which is personal and subjective, semantic memory is about objective truths and shared knowledge that help us navigate everyday life. It includes everything from understanding the laws of physics to recognizing the names of colors or the shapes of letters. This memory system is essential for language, reasoning, and the application of knowledge in new contexts. It allows us to form a framework of the external world, enabling systematic and informed decision-making and interaction. As we accumulate experiences, our semantic memory continuously expands and refines, solidifying our grasp on reality and enhancing our cognitive processes.
""")
    st.subheader("2️⃣ Episodic Memory")
    st.markdown("""
**Episodic memory** is a form of long-term memory that captures personal experiences and events, deeply intertwined with sensory details and emotional undercurrents. This type of memory is not just about the when and where of events, but also about the feelings and senses involved—such as the visual and auditory impressions, the scents, and the tactile experiences. For example, even if language skills were not fully developed, one could vividly recall the emotions, sights, and sounds of a fifth birthday party. This vividness is largely due to the interaction between the neocortex, which processes the details of these memories, and the amygdala, the part of the brain crucial for emotional tagging. This emotional connection often makes episodic memories particularly strong and enduring.
""")
with st.expander("🤖 Mirroring in Humans and Applying it to AI"):
    st.subheader("1️⃣ What is Mirroring?")
    st.markdown("""
**Mirroring** is a sophisticated social technique in which individuals subtly replicate the gestures, speech patterns, and attitudes of others. This behavior is not just mimicry but a strategic approach to fostering rapport and enhancing understanding among individuals. By reflecting someone else’s behavior, people can create a sense of empathy and connection, which facilitates smoother and more effective communication.
In the context of artificial intelligence, applying mirroring involves programming AI systems to recognize and adapt to these human nuances, allowing them to interact more naturally with users. This capability can transform AI from a simple tool into a more engaging and empathetic companion, capable of supporting more complex and sensitive human interactions.
""")
    st.subheader("2️⃣ Benefits of Mirroring")
    st.markdown("""
**Mirroring** enhances communication by creating a supportive and empathetic environment, crucial for effective interaction. This technique goes beyond mere replication of actions; it involves understanding and responding to the underlying emotions and intentions, which helps to build trust and rapport. In therapeutic settings, mirroring is a powerful tool that allows therapists to connect with their clients more deeply, facilitating a greater understanding and faster healing.
In the realm of AI, integrating mirroring techniques can significantly improve the interaction between humans and machines. By enabling AI systems to respond to human emotions and behaviors in a contextually appropriate manner, these systems become more than tools—they evolve into empathetic partners that can anticipate needs and react sensitively. This capability is particularly beneficial in domains such as healthcare, customer service, and education, where understanding and trust are paramount.
""")
st.subheader("3️⃣ Leveraging Mirroring to Enhance AI Learning and Perception")
st.markdown("""
**Mirroring**, when applied effectively in AI design, is more than just copying human behaviors—it’s about enhancing the AI's learning process through action-based communication. This method taps into the psychology of learning and perception, enabling AI systems to not only replicate human actions but also understand the intentions and emotions behind those actions.
By observing and reflecting human behaviors, AI can develop a richer context for its interactions, improving its decision-making processes and making its interactions more natural and intuitive. This approach helps bridge the gap between human and machine, facilitating a more seamless integration of AI into everyday human activities. The goal is not just to mirror but to adapt and evolve in response to human cues, thereby enriching the AI's experiential learning and enhancing its cognitive capabilities.
""")
with st.expander("🤖 Mirroring as a Cognitive Tool in AI and Neuroscience"):
    st.subheader("1️⃣ The Concept of Mirroring in Cognitive Science")
    st.markdown("""
**Mirroring** is a pivotal communication mechanism prevalent across various life forms. It entails the nuanced imitation and adjustment of behaviors—ranging from physical movements to complex emotional expressions—to foster empathy and deepen understanding. In human interactions, mirroring includes matching gestures like eye contact and nods, which facilitates a shared cognitive space, enhancing interpersonal connectivity.
In artificial intelligence, this concept is mirrored by equipping AI systems with the ability to detect and emulate human emotional and physical cues. This capability not only helps in building a connection but also in understanding user intent, thereby improving interaction quality.
""")
    st.subheader("2️⃣ Cognitive Benefits of Mirroring")
    st.markdown("""
The utility of mirroring extends beyond the simple replication of actions seen in natural contexts, such as a human calming an animal with gestures that signal a safe environment. These non-verbal cues, effective even across species, highlight the potent impact of adaptive communication that does not rely on language. For AI, this suggests that systems can be made more responsive and attuned to the emotional dynamics of users, enhancing the user experience by providing a secure and engaging environment.
""")
st.subheader("3️⃣ Enhancing AI's Cognitive Models Through Mirroring")
st.markdown("""
Implementing mirroring in AI involves more than the straightforward imitation of human actions; it's about creating systems that can interpret and adapt to the complex web of human interactions. This requires AI to not only replicate but also to understand the context and significance behind human behaviors. Such systems need advanced cognitive models that can process and mimic the subtleties of human gestures and emotions, thereby making AI interactions more intuitive and meaningful.
""")
# Raw markdown block of AI and neuroscience paper references, rendered at the end of the page.
unsafestring = """
# 🩺🔍 AI and Neuroscience Paper References - 🤖 Mirroring, 📝 Semantic and Episodic Memory as Cognitive AI Tools
18 Jan 2023 | Joint Representation Learning for Text and 3D Point Cloud | ⬇️
Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang
Recent advancements in vision-language pre-training (e.g. CLIP) have shown
that vision models can benefit from language supervision. While many models
using language modality have achieved great success on 2D vision tasks, the
joint representation learning of 3D point cloud with text remains
under-explored due to the difficulty of 3D-Text data pair acquisition and the
irregularity of 3D data structure. In this paper, we propose a novel Text4Point
framework to construct language-guided 3D point cloud models. The key idea is
utilizing 2D images as a bridge to connect the point cloud and the language
modalities. The proposed Text4Point follows the pre-training and fine-tuning
paradigm. During the pre-training stage, we establish the correspondence of
images and point clouds based on the readily available RGB-D data and use
contrastive learning to align the image and point cloud representations.
Together with the well-aligned image and text features achieved by CLIP, the
point cloud features are implicitly aligned with the text embeddings. Further,
we propose a Text Querying Module to integrate language information into 3D
representation learning by querying text embeddings with point cloud features.
For fine-tuning, the model learns task-specific 3D representations under
informative language guidance from the label set without 2D images. Extensive
experiments demonstrate that our model shows consistent improvement on various
downstream tasks, such as point cloud semantic segmentation, instance
segmentation, and object detection. The code will be available here:
https://github.com/LeapLabTHU/Text4Point
14 Jul 2017 | A Semantics-Based Measure of Emoji Similarity | ⬇️
Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran
Emoji have grown to become one of the most important forms of communication
on the web. With its widespread use, measuring the similarity of emoji has
become an important problem for contemporary text processing since it lies at
the heart of sentiment analysis, search, and interface design tasks. This paper
presents a comprehensive analysis of the semantic similarity of emoji through
embedding models that are learned over machine-readable emoji meanings in the
EmojiNet knowledge base. Using emoji descriptions, emoji sense labels and emoji
sense definitions, and with different training corpora obtained from Twitter
and Google News, we develop and test multiple embedding models to measure emoji
similarity. To evaluate our work, we create a new dataset called EmoSim508,
which assigns human-annotated semantic similarity scores to a set of 508
carefully selected emoji pairs. After validation with EmoSim508, we present a
real-world use-case of our emoji embedding models using a sentiment analysis
task and show that our models outperform the previous best-performing emoji
embedding model on this task. The EmoSim508 dataset and our emoji embedding
models are publicly released with this paper and can be downloaded from
http://emojinet.knoesis.org/.
11 Mar 2021 | Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views | ⬇️
Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra
We study the task of semantic mapping - specifically, an embodied agent (a
robot or an egocentric AI assistant) is given a tour of a new environment and
asked to build an allocentric top-down semantic map ("what is where?") from
egocentric observations of an RGB-D camera with known pose (via localization
sensors). Towards this goal, we present SemanticMapNet (SMNet), which consists
of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame,
(2) a Feature Projector that projects egocentric features to appropriate
locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan
length x width x feature-dims that learns to accumulate projected egocentric
features, and (4) a Map Decoder that uses the memory tensor to produce semantic
top-down maps. SMNet combines the strengths of (known) projective camera
geometry and neural representation learning. On the task of semantic mapping in
the Matterport3D dataset, SMNet significantly outperforms competitive baselines
by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1
metrics. Moreover, we show how to use the neural episodic memories and
spatio-semantic allocentric representations built by SMNet for subsequent tasks
in the same space - navigating to objects seen during the tour ("Find chair") or
answering questions about the space ("How many chairs did you see in the
house?"). Project page: https://vincentcartillier.github.io/smnet.html.
06 Mar 2020 | Distributional semantic modeling: a revised technique to train term/word vector space models applying the ontology-related approach | ⬇️
Oleksandr Palagin, Vitalii Velychko, Kyrylo Malakhov and Oleksandr Shchurov
We design a new technique for the distributional semantic modeling with a
neural network-based approach to learn distributed term representations (or
term embeddings) - term vector space models as a result, inspired by the recent
ontology-related approach (using different types of contextual knowledge such
as syntactic knowledge, terminological knowledge, semantic knowledge, etc.) to
the identification of terms (term extraction) and relations between them
(relation extraction) called semantic pre-processing technology - SPT. Our
method relies on automatic term extraction from the natural language texts and
subsequent formation of the problem-oriented or application-oriented (also
deeply annotated) text corpora where the fundamental entity is the term
(includes non-compositional and compositional terms). This gives us an
opportunity to changeover from distributed word representations (or word
embeddings) to distributed term representations (or term embeddings). This
transition will allow us to generate more accurate semantic maps of different
subject domains (also, of relations between input terms - it is useful to
explore clusters and oppositions, or to test your hypotheses about them). The
semantic map can be represented as a graph using Vec2graph - a Python library
for visualizing word embeddings (term embeddings in our case) as dynamic and
interactive graphs. The Vec2graph library coupled with term embeddings will not
only improve accuracy in solving standard NLP tasks, but also update the
conventional concept of automated ontology development. The main practical
result of our work is the development kit (set of toolkits represented as web
service APIs and web application), which provides all necessary routines for
the basic linguistic pre-processing and the semantic pre-processing of the
natural language texts in Ukrainian for future training of term vector space
models.
19 Jan 2024 | PoseScript: Linking 3D Human Poses and Natural Language | ⬇️
Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, Grégory Rogez
Natural language plays a critical role in many computer vision applications,
such as image captioning, visual question answering, and cross-modal retrieval,
to provide fine-grained semantic information. Unfortunately, while human pose
is key to human understanding, current 3D human pose datasets lack detailed
language descriptions. To address this issue, we have introduced the PoseScript
dataset. This dataset pairs more than six thousand 3D human poses from AMASS
with rich human-annotated descriptions of the body parts and their spatial
relationships. Additionally, to increase the size of the dataset to a scale
that is compatible with data-hungry learning algorithms, we have proposed an
elaborate captioning process that generates automatic synthetic descriptions in
natural language from given 3D keypoints. This process extracts low-level pose
information, known as "posecodes", using a set of simple but generic rules on
the 3D keypoints. These posecodes are then combined into higher level textual
descriptions using syntactic rules. With automatic annotations, the amount of
available data significantly scales up (100k), making it possible to
effectively pretrain deep models for finetuning on human captions. To showcase
the potential of annotated poses, we present three multi-modal learning tasks
that utilize the PoseScript dataset. Firstly, we develop a pipeline that maps
3D poses and textual descriptions into a joint embedding space, allowing for
cross-modal retrieval of relevant poses from large-scale datasets. Secondly, we
establish a baseline for a text-conditioned model generating 3D poses. Thirdly,
we present a learned process for generating pose descriptions. These
applications demonstrate the versatility and usefulness of annotated poses in
various tasks and pave the way for future research in the field.
11 Sep 2023 | Tell me what you see: A zero-shot action recognition method based on natural language descriptions | ⬇️
Valter Estevam and Rayson Laroca and David Menotti and Helio Pedrini
This paper presents a novel approach to Zero-Shot Action Recognition. Recent
works have explored the detection and classification of objects to obtain
semantic information from videos with remarkable performance. Inspired by them,
we propose using video captioning methods to extract semantic information about
objects, scenes, humans, and their relationships. To the best of our knowledge,
this is the first work to represent both videos and labels with descriptive
sentences. More specifically, we represent videos using sentences generated via
video captioning methods and classes using sentences extracted from documents
acquired through search engines on the Internet. Using these representations,
we build a shared semantic space employing BERT-based embedders pre-trained in
the paraphrasing task on multiple text datasets. The projection of both visual
and semantic information onto this space is straightforward, as they are
sentences, enabling classification using the nearest neighbor rule. We
demonstrate that representing videos and labels with sentences alleviates the
domain adaptation problem. Additionally, we show that word vectors are
unsuitable for building the semantic embedding space of our descriptions. Our
method outperforms the state-of-the-art performance on the UCF101 dataset by
3.3 p.p. in accuracy under the TruZe protocol and achieves competitive results
on both the UCF101 and HMDB51 datasets under the conventional protocol (0/50%
training/testing split). Our code is available at https://github.com/valterlej/zsarcap.
16 Jun 2023 | M3PT: A Multi-Modal Model for POI Tagging | ⬇️
Jingsong Yang, Guanzhou Han, Deqing Yang, Jingping Liu, Yanghua Xiao, Xiang Xu, Baohua Wu, Shenghua Ni
POI tagging aims to annotate a point of interest (POI) with some informative
tags, which facilitates many services related to POIs, including search,
recommendation, and so on. Most of the existing solutions neglect the
significance of POI images and seldom fuse the textual and visual features of
POIs, resulting in suboptimal tagging performance. In this paper, we propose a
novel Multi-Modal Model for POI Tagging, namely M3PT, which achieves enhanced
POI tagging through fusing the target POI's textual and visual features, and
the precise matching between the multi-modal representations. Specifically, we
first devise a domain-adaptive image encoder (DIE) to obtain the image
embeddings aligned to their gold tags' semantics. Then, in M3PT's text-image
fusion module (TIF), the textual and visual representations are fully fused
into the POIs' content embeddings for the subsequent matching. In addition, we
adopt a contrastive learning strategy to further bridge the gap between the
representations of different modalities. To evaluate the tagging models'
performance, we have constructed two high-quality POI tagging datasets from the
real-world business scenario of Ali Fliggy. On these datasets, we conducted
extensive experiments to demonstrate our model's advantage over the baselines
of uni-modality and multi-modality, and verify the effectiveness of important
components in M3PT, including DIE, TIF and the contrastive learning strategy.
21 Jun 2023 | EmTract: Extracting Emotions from Social Media | ⬇️
Domonkos F. Vamossy and Rolf Skog
We develop an open-source tool (EmTract) that extracts emotions from social
media text tailored for financial context. To do so, we annotate ten thousand
short messages from a financial social media platform (StockTwits) and combine
it with open-source emotion data. We then use a pre-tuned NLP model,
DistilBERT, augment its embedding space by including 4,861 tokens (emojis and
emoticons), and then fit it first on the open-source emotion data, then
transfer it to our annotated financial social media data. Our model outperforms
competing open-source state-of-the-art emotion classifiers, such as Emotion
English DistilRoBERTa-base, on both human- and ChatGPT-annotated data. Compared
to dictionary based methods, our methodology has three main advantages for
research in finance. First, our model is tailored to financial social media
text; second, it incorporates key aspects of social media data, such as
non-standard phrases, emojis, and emoticons; and third, it operates by
sequentially learning a latent representation that includes features such as
word order, word usage, and local context. Using EmTract, we explore the
relationship between investor emotions expressed on social media and asset
prices. We show that firm-specific investor emotions are predictive of daily
price movements. Our findings show that emotions and market dynamics are
closely related, and we provide a tool to help study the role emotions play in
financial markets.
29 Oct 2022 | Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding | ⬇️
Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, Xiaodan Liang
To bridge the gap between supervised semantic segmentation and real-world
applications that require one model to recognize arbitrary new concepts,
recent zero-shot segmentation attracts a lot of attention by exploring the
relationships between unseen and seen object categories, yet requiring large
amounts of densely-annotated data with diverse base classes. In this paper, we
propose a new open-world semantic segmentation pipeline that makes the first
attempt to learn to segment semantic objects of various open-world categories
without any efforts on dense annotations, by purely exploiting the
image-caption data that naturally exist on the Internet. Our method,
Vision-language-driven Semantic Segmentation (ViL-Seg), employs an image and a
text encoder to generate visual and text embeddings for the image-caption data,
with two core components that endow its segmentation ability: First, the image
encoder is jointly trained with a vision-based contrasting and a cross-modal
contrasting, which encourage the visual embeddings to preserve both
fine-grained semantics and high-level category information that are crucial for
the segmentation task. Furthermore, an online clustering head is devised over
the image encoder, which allows to dynamically segment the visual embeddings
into distinct semantic groups such that they can be classified by comparing
with various text embeddings to complete our segmentation pipeline. Experiments
show that without using any data with dense annotations, our method can
directly segment objects of arbitrary categories, outperforming zero-shot
segmentation methods that require data labeling on three benchmark datasets.
10 Feb 2020 | The Tensor Brain: Semantic Decoding for Perception and Memory | ⬇️
Volker Tresp and Sahand Sharifzadeh and Dario Konopatzki and Yunpu Ma
We analyse perception and memory, using mathematical models for knowledge
graphs and tensors, to gain insights into the corresponding functionalities of
the human mind. Our discussion is based on the concept of propositional
sentences consisting of \textit{subject-predicate-object} (SPO) triples for
expressing elementary facts. SPO sentences are the basis for most natural
languages but might also be important for explicit perception and declarative
memories, as well as intra-brain communication and the ability to argue and
reason. A set of SPO sentences can be described as a knowledge graph, which can
be transformed into an adjacency tensor. We introduce tensor models, where
concepts have dual representations as indices and associated embeddings, two
constructs we believe are essential for the understanding of implicit and
explicit perception and memory in the brain. We argue that a biological
realization of perception and memory imposes constraints on information
processing. In particular, we propose that explicit perception and declarative
memories require a semantic decoder, which, in a simple realization, is based
on four layers: First, a sensory memory layer, as a buffer for sensory input,
second, an index layer representing concepts, third, a memoryless
representation layer for the broadcasting of information ---the "blackboard",
or the "canvas" of the brain--- and fourth, a working memory layer as a
processing center and data buffer. We discuss the operations of the four layers
and relate them to the global workspace theory. In a Bayesian brain
interpretation, semantic memory defines the prior for observable triple
statements. We propose that ---in evolution and during development--- semantic
memory, episodic memory, and natural language evolved as emergent properties in
agents' process to gain a deeper understanding of sensory information.
09 Sep 2021 | Talk-to-Edit: Fine-Grained Facial Editing via Dialog | ⬇️
Yuming Jiang, Ziqi Huang, Xingang Pan, Chen Change Loy, Ziwei Liu
Facial editing is an important task in vision and graphics with numerous
applications. However, existing works are incapable of delivering a continuous and
fine-grained editing mode (e.g., editing a slightly smiling face to a big
laughing one) with natural interactions with users. In this work, we propose
Talk-to-Edit, an interactive facial editing framework that performs
fine-grained attribute manipulation through dialog between the user and the
system. Our key insight is to model a continual "semantic field" in the GAN
latent space. 1) Unlike previous works that regard the editing as traversing
straight lines in the latent space, here the fine-grained editing is formulated
as finding a curving trajectory that respects fine-grained attribute landscape
on the semantic field. 2) The curvature at each step is location-specific and
determined by the input image as well as the users' language requests. 3) To
engage the users in a meaningful dialog, our system generates language feedback
by considering both the user request and the current state of the semantic
field.
We also contribute CelebA-Dialog, a visual-language facial editing dataset to
facilitate large-scale study. Specifically, each image has manually annotated
fine-grained attribute annotations as well as template-based textual
descriptions in natural language. Extensive quantitative and qualitative
experiments demonstrate the superiority of our framework in terms of 1) the
smoothness of fine-grained editing, 2) the identity/attribute preservation, and
3) the visual photorealism and dialog fluency. Notably, user study validates
that our overall system is consistently favored by around 80% of the
participants. Our project page is https://www.mmlab-ntu.com/project/talkedit/.
---------------"""
# Round-trip through UTF-8 with 'replace' so any characters that cannot be encoded
# (e.g., stray surrogates) are substituted before the text is passed to st.markdown.
safe_text = unsafestring.encode('utf-8', 'replace').decode('utf-8')
st.markdown(safe_text)