import streamlit as st
st.set_page_config(page_title="Memory and Mirroring", page_icon="🧠", layout="wide")
# Hiding the main menu and footer using CSS
hide_streamlit_style = """
<style>
#MainMenu {visibility: hidden;}
footer {visibility: hidden;}
</style>
"""
st.markdown(hide_streamlit_style, unsafe_allow_html=True)
st.title("🧠 Memory and Mirroring in AI - Simulated and Personalized Semantic and Episodic Memory")
# Using expanders for different sections
with st.expander("📝 Semantic and Episodic Memory as Cognitive AI Tools"):
    st.subheader("1️⃣ Semantic Memory")
    st.markdown("""
**Semantic memory** is a crucial type of long-term memory that houses our knowledge of facts, concepts, and the broader world. Unlike episodic memory, which is personal and subjective, semantic memory is about objective truths and shared knowledge that help us navigate everyday life. It includes everything from understanding the laws of physics to recognizing the names of colors or the shapes of letters. This memory system is essential for language, reasoning, and the application of knowledge in new contexts. It allows us to form a framework of the external world, enabling systematic and informed decision-making and interaction. As we accumulate experiences, our semantic memory continuously expands and refines, solidifying our grasp on reality and enhancing our cognitive processes.
""")
    st.subheader("2️⃣ Episodic Memory")
    st.markdown("""
**Episodic memory** is a form of long-term memory that captures personal experiences and events, deeply intertwined with sensory details and emotional undercurrents. This type of memory is not just about the when and where of events, but also about the feelings and senses involved—such as the visual and auditory impressions, the scents, and the tactile experiences. For example, even if language skills were not fully developed, one could vividly recall the emotions, sights, and sounds of a fifth birthday party. This vividness is largely due to the interaction between the neocortex, which processes the details of these memories, and the amygdala, the part of the brain crucial for emotional tagging. This emotional connection often makes episodic memories particularly strong and enduring.
""")
with st.expander("🤖 Mirroring in Humans and Applying it to AI"):
    st.subheader("1️⃣ What is Mirroring?")
    st.markdown("""
**Mirroring** is a sophisticated social technique in which individuals subtly replicate the gestures, speech patterns, and attitudes of others. This behavior is not just mimicry but a strategic approach to fostering rapport and enhancing understanding among individuals. By reflecting someone else’s behavior, people can create a sense of empathy and connection, which facilitates smoother and more effective communication.
In the context of artificial intelligence, applying mirroring involves programming AI systems to recognize and adapt to these human nuances, allowing them to interact more naturally with users. This capability can transform AI from a simple tool into a more engaging and empathetic companion, capable of supporting more complex and sensitive human interactions.
""")
    st.subheader("2️⃣ Benefits of Mirroring")
    st.markdown("""
**Mirroring** enhances communication by creating a supportive and empathetic environment, crucial for effective interaction. This technique goes beyond mere replication of actions; it involves understanding and responding to the underlying emotions and intentions, which helps to build trust and rapport. In therapeutic settings, mirroring is a powerful tool that allows therapists to connect with their clients more deeply, facilitating a greater understanding and faster healing.
In the realm of AI, integrating mirroring techniques can significantly improve the interaction between humans and machines. By enabling AI systems to respond to human emotions and behaviors in a contextually appropriate manner, these systems become more than tools—they evolve into empathetic partners that can anticipate needs and react sensitively. This capability is particularly beneficial in domains such as healthcare, customer service, and education, where understanding and trust are paramount.
""")
st.subheader("3️⃣ Leveraging Mirroring to Enhance AI Learning and Perception")
st.markdown("""
**Mirroring**, when applied effectively in AI design, is more than just copying human behaviors—it’s about enhancing the AI's learning process through action-based communication. This method taps into the psychology of learning and perception, enabling AI systems to not only replicate human actions but also understand the intentions and emotions behind those actions.
By observing and reflecting human behaviors, AI can develop a richer context for its interactions, improving its decision-making processes and making its interactions more natural and intuitive. This approach helps bridge the gap between human and machine, facilitating a more seamless integration of AI into everyday human activities. The goal is not just to mirror but to adapt and evolve in response to human cues, thereby enriching the AI's experiential learning and enhancing its cognitive capabilities.
""")
with st.expander("🤖 Mirroring as a Cognitive Tool in AI and Neuroscience"):
    st.subheader("1️⃣ The Concept of Mirroring in Cognitive Science")
    st.markdown("""
**Mirroring** is a pivotal communication mechanism prevalent across various life forms. It entails the nuanced imitation and adjustment of behaviors—ranging from physical movements to complex emotional expressions—to foster empathy and deepen understanding. In human interactions, mirroring includes matching gestures like eye contact and nods, which facilitates a shared cognitive space, enhancing interpersonal connectivity.
In artificial intelligence, this concept is mirrored by equipping AI systems with the ability to detect and emulate human emotional and physical cues. This capability not only helps in building a connection but also in understanding user intent, thereby improving interaction quality.
""")
    st.subheader("2️⃣ Cognitive Benefits of Mirroring")
    st.markdown("""
The utility of mirroring extends beyond the simple replication of actions seen in natural contexts, such as a human calming an animal with gestures that signal a safe environment. These non-verbal cues, effective even across species, highlight the potent impact of adaptive communication that does not rely on language. For AI, this suggests that systems can be made more responsive and attuned to the emotional dynamics of users, enhancing the user experience by providing a secure and engaging environment.
""")
st.subheader("3️⃣ Enhancing AI's Cognitive Models Through Mirroring")
st.markdown("""
Implementing mirroring in AI involves more than the straightforward imitation of human actions; it's about creating systems that can interpret and adapt to the complex web of human interactions. This requires AI to not only replicate but also to understand the context and significance behind human behaviors. Such systems need advanced cognitive models that can process and mimic the subtleties of human gestures and emotions, thereby making AI interactions more intuitive and meaningful.
""")
# Raw markdown block of AI and neuroscience paper references, rendered at the end of the page.
unsafestring = """
# 🩺🔍 AI and Neuroscience Paper References - 🤖 Mirroring, 📝 Semantic and Episodic Memory as Cognitive AI Tools
18 Jan 2023 | Joint Representation Learning for Text and 3D Point Cloud | ⬇️
Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang
Recent advancements in vision-language pre-training (e.g. CLIP) have shown
that vision models can benefit from language supervision. While many models
using language modality have achieved great success on 2D vision tasks, the
joint representation learning of 3D point cloud with text remains
under-explored due to the difficulty of 3D-Text data pair acquisition and the
irregularity of 3D data structure. In this paper, we propose a novel Text4Point
framework to construct language-guided 3D point cloud models. The key idea is
utilizing 2D images as a bridge to connect the point cloud and the language
modalities. The proposed Text4Point follows the pre-training and fine-tuning
paradigm. During the pre-training stage, we establish the correspondence of
images and point clouds based on the readily available RGB-D data and use
contrastive learning to align the image and point cloud representations.
Together with the well-aligned image and text features achieved by CLIP, the
point cloud features are implicitly aligned with the text embeddings. Further,
we propose a Text Querying Module to integrate language information into 3D
representation learning by querying text embeddings with point cloud features.
For fine-tuning, the model learns task-specific 3D representations under
informative language guidance from the label set without 2D images. Extensive
experiments demonstrate that our model shows consistent improvement on various
downstream tasks, such as point cloud semantic segmentation, instance
segmentation, and object detection. The code will be available here:
https://github.com/LeapLabTHU/Text4Point
14 Jul 2017 | A Semantics-Based Measure of Emoji Similarity | ⬇️
Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran
Emoji have grown to become one of the most important forms of communication
on the web. With its widespread use, measuring the similarity of emoji has
become an important problem for contemporary text processing since it lies at
the heart of sentiment analysis, search, and interface design tasks. This paper
presents a comprehensive analysis of the semantic similarity of emoji through
embedding models that are learned over machine-readable emoji meanings in the
EmojiNet knowledge base. Using emoji descriptions, emoji sense labels and emoji
sense definitions, and with different training corpora obtained from Twitter
and Google News, we develop and test multiple embedding models to measure emoji
similarity. To evaluate our work, we create a new dataset called EmoSim508,
which assigns human-annotated semantic similarity scores to a set of 508
carefully selected emoji pairs. After validation with EmoSim508, we present a
real-world use-case of our emoji embedding models using a sentiment analysis
task and show that our models outperform the previous best-performing emoji
embedding model on this task. The EmoSim508 dataset and our emoji embedding
models are publicly released with this paper and can be downloaded from
http://emojinet.knoesis.org/.
11 Mar 2021 | Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views | ⬇️
Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra
We study the task of semantic mapping - specifically, an embodied agent (a
robot or an egocentric AI assistant) is given a tour of a new environment and
asked to build an allocentric top-down semantic map ("what is where?") from
egocentric observations of an RGB-D camera with known pose (via localization
sensors). Towards this goal, we present SemanticMapNet (SMNet), which consists
of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame,
(2) a Feature Projector that projects egocentric features to appropriate
locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan
length x width x feature-dims that learns to accumulate projected egocentric
features, and (4) a Map Decoder that uses the memory tensor to produce semantic
top-down maps. SMNet combines the strengths of (known) projective camera
geometry and neural representation learning. On the task of semantic mapping in
the Matterport3D dataset, SMNet significantly outperforms competitive baselines
by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1
metrics. Moreover, we show how to use the neural episodic memories and
spatio-semantic allocentric representations built by SMNet for subsequent tasks
in the same space - navigating to objects seen during the tour ("Find chair") or
answering questions about the space ("How many chairs did you see in the
house?"). Project page: https://vincentcartillier.github.io/smnet.html.
06 Mar 2020 | Distributional semantic modeling: a revised technique to train term/word vector space models applying the ontology-related approach | ⬇️
Oleksandr Palagin, Vitalii Velychko, Kyrylo Malakhov and Oleksandr Shchurov
We design a new technique for the distributional semantic modeling with a
neural network-based approach to learn distributed term representations (or
term embeddings) - term vector space models as a result, inspired by the recent
ontology-related approach (using different types of contextual knowledge such
as syntactic knowledge, terminological knowledge, semantic knowledge, etc.) to
the identification of terms (term extraction) and relations between them
(relation extraction) called semantic pre-processing technology - SPT. Our
method relies on automatic term extraction from the natural language texts and
subsequent formation of the problem-oriented or application-oriented (also
deeply annotated) text corpora where the fundamental entity is the term
(includes non-compositional and compositional terms). This gives us an
opportunity to changeover from distributed word representations (or word
embeddings) to distributed term representations (or term embeddings). This
transition will allow us to generate more accurate semantic maps of different
subject domains (also, of relations between input terms - it is useful to
explore clusters and oppositions, or to test your hypotheses about them). The
semantic map can be represented as a graph using Vec2graph - a Python library
for visualizing word embeddings (term embeddings in our case) as dynamic and
interactive graphs. The Vec2graph library coupled with term embeddings will not
only improve accuracy in solving standard NLP tasks, but also update the
conventional concept of automated ontology development. The main practical
result of our work is the development kit (set of toolkits represented as web
service APIs and web application), which provides all necessary routines for
the basic linguistic pre-processing and the semantic pre-processing of the
natural language texts in Ukrainian for future training of term vector space
models.
19 Jan 2024 | PoseScript: Linking 3D Human Poses and Natural Language | ⬇️
Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, Grégory Rogez
Natural language plays a critical role in many computer vision applications,
such as image captioning, visual question answering, and cross-modal retrieval,
to provide fine-grained semantic information. Unfortunately, while human pose
is key to human understanding, current 3D human pose datasets lack detailed
language descriptions. To address this issue, we have introduced the PoseScript
dataset. This dataset pairs more than six thousand 3D human poses from AMASS
with rich human-annotated descriptions of the body parts and their spatial
relationships. Additionally, to increase the size of the dataset to a scale
that is compatible with data-hungry learning algorithms, we have proposed an
elaborate captioning process that generates automatic synthetic descriptions in
natural language from given 3D keypoints. This process extracts low-level pose
information, known as "posecodes", using a set of simple but generic rules on
the 3D keypoints. These posecodes are then combined into higher level textual
descriptions using syntactic rules. With automatic annotations, the amount of
available data significantly scales up (100k), making it possible to
effectively pretrain deep models for finetuning on human captions. To showcase
the potential of annotated poses, we present three multi-modal learning tasks
that utilize the PoseScript dataset. Firstly, we develop a pipeline that maps
3D poses and textual descriptions into a joint embedding space, allowing for
cross-modal retrieval of relevant poses from large-scale datasets. Secondly, we
establish a baseline for a text-conditioned model generating 3D poses. Thirdly,
we present a learned process for generating pose descriptions. These
applications demonstrate the versatility and usefulness of annotated poses in
various tasks and pave the way for future research in the field.
11 Sep 2023 | Tell me what you see: A zero-shot action recognition method based on natural language descriptions | ⬇️
Valter Estevam and Rayson Laroca and David Menotti and Helio Pedrini
This paper presents a novel approach to Zero-Shot Action Recognition. Recent
works have explored the detection and classification of objects to obtain
semantic information from videos with remarkable performance. Inspired by them,
we propose using video captioning methods to extract semantic information about
objects, scenes, humans, and their relationships. To the best of our knowledge,
this is the first work to represent both videos and labels with descriptive
sentences. More specifically, we represent videos using sentences generated via
video captioning methods and classes using sentences extracted from documents
acquired through search engines on the Internet. Using these representations,
we build a shared semantic space employing BERT-based embedders pre-trained in
the paraphrasing task on multiple text datasets. The projection of both visual
and semantic information onto this space is straightforward, as they are
sentences, enabling classification using the nearest neighbor rule. We
demonstrate that representing videos and labels with sentences alleviates the
domain adaptation problem. Additionally, we show that word vectors are
unsuitable for building the semantic embedding space of our descriptions. Our
method outperforms the state-of-the-art performance on the UCF101 dataset by
3.3 p.p. in accuracy under the TruZe protocol and achieves competitive results
on both the UCF101 and HMDB51 datasets under the conventional protocol (0/50%
training/testing split). Our code is available at https://github.com/valterlej/zsarcap.
16 Jun 2023 | M3PT: A Multi-Modal Model for POI Tagging | ⬇️
Jingsong Yang, Guanzhou Han, Deqing Yang, Jingping Liu, Yanghua Xiao, Xiang Xu, Baohua Wu, Shenghua Ni
POI tagging aims to annotate a point of interest (POI) with some informative
tags, which facilitates many services related to POIs, including search,
recommendation, and so on. Most of the existing solutions neglect the
significance of POI images and seldom fuse the textual and visual features of
POIs, resulting in suboptimal tagging performance. In this paper, we propose a
novel Multi-Modal Model for POI Tagging, namely M3PT, which achieves enhanced
POI tagging through fusing the target POI's textual and visual features, and
the precise matching between the multi-modal representations. Specifically, we
first devise a domain-adaptive image encoder (DIE) to obtain the image
embeddings aligned to their gold tags' semantics. Then, in M3PT's text-image
fusion module (TIF), the textual and visual representations are fully fused
into the POIs' content embeddings for the subsequent matching. In addition, we
adopt a contrastive learning strategy to further bridge the gap between the
representations of different modalities. To evaluate the tagging models'
performance, we have constructed two high-quality POI tagging datasets from the
real-world business scenario of Ali Fliggy. On these datasets, we conducted
extensive experiments to demonstrate our model's advantage over the baselines
of uni-modality and multi-modality, and verify the effectiveness of important
components in M3PT, including DIE, TIF and the contrastive learning strategy.
21 Jun 2023 | EmTract: Extracting Emotions from Social Media | ⬇️
Domonkos F. Vamossy and Rolf Skog
We develop an open-source tool (EmTract) that extracts emotions from social
media text tailored for financial context. To do so, we annotate ten thousand
short messages from a financial social media platform (StockTwits) and combine
it with open-source emotion data. We then use a pre-tuned NLP model,
DistilBERT, augment its embedding space by including 4,861 tokens (emojis and
emoticons), and then fit it first on the open-source emotion data, then
transfer it to our annotated financial social media data. Our model outperforms
competing open-source state-of-the-art emotion classifiers, such as Emotion
English DistilRoBERTa-base, on both human- and ChatGPT-annotated data. Compared
to dictionary based methods, our methodology has three main advantages for
research in finance. First, our model is tailored to financial social media
text; second, it incorporates key aspects of social media data, such as
non-standard phrases, emojis, and emoticons; and third, it operates by
sequentially learning a latent representation that includes features such as
word order, word usage, and local context. Using EmTract, we explore the
relationship between investor emotions expressed on social media and asset
prices. We show that firm-specific investor emotions are predictive of daily
price movements. Our findings show that emotions and market dynamics are
closely related, and we provide a tool to help study the role emotions play in
financial markets.
29 Oct 2022 | Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding | ⬇️
Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, Xiaodan Liang
To bridge the gap between supervised semantic segmentation and real-world
applications that require one model to recognize arbitrary new concepts,
recent zero-shot segmentation attracts a lot of attention by exploring the
relationships between unseen and seen object categories, yet requiring large
amounts of densely-annotated data with diverse base classes. In this paper, we
propose a new open-world semantic segmentation pipeline that makes the first
attempt to learn to segment semantic objects of various open-world categories
without any efforts on dense annotations, by purely exploiting the
image-caption data that naturally exist on the Internet. Our method,
Vision-language-driven Semantic Segmentation (ViL-Seg), employs an image and a
text encoder to generate visual and text embeddings for the image-caption data,
with two core components that endow its segmentation ability: First, the image
encoder is jointly trained with a vision-based contrasting and a cross-modal
contrasting, which encourage the visual embeddings to preserve both
fine-grained semantics and high-level category information that are crucial for
the segmentation task. Furthermore, an online clustering head is devised over
the image encoder, which allows to dynamically segment the visual embeddings
into distinct semantic groups such that they can be classified by comparing
with various text embeddings to complete our segmentation pipeline. Experiments
show that without using any data with dense annotations, our method can
directly segment objects of arbitrary categories, outperforming zero-shot
segmentation methods that require data labeling on three benchmark datasets.
10 Feb 2020 | The Tensor Brain: Semantic Decoding for Perception and Memory | ⬇️
Volker Tresp and Sahand Sharifzadeh and Dario Konopatzki and Yunpu Ma
We analyse perception and memory, using mathematical models for knowledge
graphs and tensors, to gain insights into the corresponding functionalities of
the human mind. Our discussion is based on the concept of propositional
sentences consisting of \textit{subject-predicate-object} (SPO) triples for
expressing elementary facts. SPO sentences are the basis for most natural
languages but might also be important for explicit perception and declarative
memories, as well as intra-brain communication and the ability to argue and
reason. A set of SPO sentences can be described as a knowledge graph, which can
be transformed into an adjacency tensor. We introduce tensor models, where
concepts have dual representations as indices and associated embeddings, two
constructs we believe are essential for the understanding of implicit and
explicit perception and memory in the brain. We argue that a biological
realization of perception and memory imposes constraints on information
processing. In particular, we propose that explicit perception and declarative
memories require a semantic decoder, which, in a simple realization, is based
on four layers: First, a sensory memory layer, as a buffer for sensory input,
second, an index layer representing concepts, third, a memoryless
representation layer for the broadcasting of information ---the "blackboard",
or the "canvas" of the brain--- and fourth, a working memory layer as a
processing center and data buffer. We discuss the operations of the four layers
and relate them to the global workspace theory. In a Bayesian brain
interpretation, semantic memory defines the prior for observable triple
statements. We propose that ---in evolution and during development--- semantic
memory, episodic memory, and natural language evolved as emergent properties in
agents' process to gain a deeper understanding of sensory information.
09 Sep 2021 | Talk-to-Edit: Fine-Grained Facial Editing via Dialog | ⬇️
Yuming Jiang, Ziqi Huang, Xingang Pan, Chen Change Loy, Ziwei Liu
Facial editing is an important task in vision and graphics with numerous
applications. However, existing works are incapable of delivering a continuous and
fine-grained editing mode (e.g., editing a slightly smiling face to a big
laughing one) with natural interactions with users. In this work, we propose
Talk-to-Edit, an interactive facial editing framework that performs
fine-grained attribute manipulation through dialog between the user and the
system. Our key insight is to model a continual "semantic field" in the GAN
latent space. 1) Unlike previous works that regard the editing as traversing
straight lines in the latent space, here the fine-grained editing is formulated
as finding a curving trajectory that respects fine-grained attribute landscape
on the semantic field. 2) The curvature at each step is location-specific and
determined by the input image as well as the users' language requests. 3) To
engage the users in a meaningful dialog, our system generates language feedback
by considering both the user request and the current state of the semantic
field.
We also contribute CelebA-Dialog, a visual-language facial editing dataset to
facilitate large-scale study. Specifically, each image has manually annotated
fine-grained attribute annotations as well as template-based textual
descriptions in natural language. Extensive quantitative and qualitative
experiments demonstrate the superiority of our framework in terms of 1) the
smoothness of fine-grained editing, 2) the identity/attribute preservation, and
3) the visual photorealism and dialog fluency. Notably, user study validates
that our overall system is consistently favored by around 80% of the
participants. Our project page is https://www.mmlab-ntu.com/project/talkedit/.
---------------"""
# Round-trip through UTF-8 with 'replace' so any characters that cannot be encoded
# (e.g., stray surrogates) are substituted before the text is passed to st.markdown.
safe_text = unsafestring.encode('utf-8', 'replace').decode('utf-8')
st.markdown(safe_text)