awacke1 committed
Commit 10857e6
1 Parent(s): dd9950d

Create app.py

Files changed (1)
  1. app.py +764 -0
app.py ADDED
@@ -0,0 +1,764 @@
1
+ import streamlit as st
2
+ st.set_page_config(page_title="Memory and Mirroring", page_icon=":brain:", layout="wide")
3
+
4
+ hide_streamlit_style = """
5
+ <style>
6
+ #MainMenu {visibility: hidden;}
7
+ footer {visibility: hidden;}
8
+ </style>
9
+ """
10
+ st.markdown(hide_streamlit_style, unsafe_allow_html=True)
11
+
12
+ st.title(":brain: Memory and Mirroring")
13
+
14
+ with st.expander(":memo: Semantic and Episodic Memory"):
15
+ st.subheader(":one: Semantic Memory")
16
+ st.markdown("**Semantic memory** is a type of long-term memory that stores facts, concepts, and knowledge about the world. It's responsible for our general knowledge and understanding.")
17
+
18
+ st.subheader(":two: Episodic Memory")
19
+ st.markdown("**Episodic memory** is another type of long-term memory that stores personal experiences and events, including their temporal and spatial contexts.")
20
+
21
+ with st.expander(":robot: Mirroring in Behavioral Health"):
22
+ st.subheader(":one: What is Mirroring?")
23
+ st.markdown("**Mirroring** is a technique used in behavioral health where a person subtly imitates the gestures, speech patterns, or attitudes of another to build rapport and understanding.")
24
+
25
+ st.subheader(":two: Benefits of Mirroring")
26
+ st.markdown("Mirroring can help improve communication, empathy, and trust between individuals, making it a valuable tool in therapy and coaching.")
27
+
28
+ st.subheader(":three: Mirroring vs. Mimicry")
29
+ st.markdown("While mirroring is a subtle and respectful way of connecting with someone, **mimicry** is an exaggerated form of imitation that can come off as mocking or insincere.")
30
+
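The two expander blocks above repeat the same `st.subheader` / `st.markdown` pattern. A minimal sketch, not part of this commit, of how such sections could be driven from a single dict so each block is defined once; it assumes only the Streamlit calls already used in this file, and the `SECTIONS` name is hypothetical:

```python
import streamlit as st

# Hypothetical data-driven layout: each expander title maps to (subheader, body) pairs.
SECTIONS = {
    ":memo: Semantic and Episodic Memory": [
        (":one: Semantic Memory", "**Semantic memory** stores facts, concepts, and general knowledge."),
        (":two: Episodic Memory", "**Episodic memory** stores personal experiences with their temporal and spatial context."),
    ],
    ":robot: Mirroring in Behavioral Health": [
        (":one: What is Mirroring?", "**Mirroring** subtly imitates gestures, speech patterns, or attitudes to build rapport."),
        (":two: Benefits of Mirroring", "Mirroring can improve communication, empathy, and trust."),
        (":three: Mirroring vs. Mimicry", "**Mimicry** is exaggerated imitation that can come off as mocking."),
    ],
}

# Render every expander and its subsections from the data above.
for title, items in SECTIONS.items():
    with st.expander(title):
        for subheader, body in items:
            st.subheader(subheader)
            st.markdown(body)
```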
59
+ st.sidebar.title(":guardsman: Rules")
60
+
61
+ st.sidebar.markdown("""
62
+
63
+ 1. **Respect** the other person's personal
+
+ # 🩺🔍 Search Results
64
+ ### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584)
65
+ *Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang*
66
+
67
+ Recent advancements in vision-language pre-training (e.g. CLIP) have shown
68
+ that vision models can benefit from language supervision. While many models
69
+ using language modality have achieved great success on 2D vision tasks, the
70
+ joint representation learning of 3D point cloud with text remains
71
+ under-explored due to the difficulty of 3D-Text data pair acquisition and the
72
+ irregularity of 3D data structure. In this paper, we propose a novel Text4Point
73
+ framework to construct language-guided 3D point cloud models. The key idea is
74
+ utilizing 2D images as a bridge to connect the point cloud and the language
75
+ modalities. The proposed Text4Point follows the pre-training and fine-tuning
76
+ paradigm. During the pre-training stage, we establish the correspondence of
77
+ images and point clouds based on the readily available RGB-D data and use
78
+ contrastive learning to align the image and point cloud representations.
79
+ Together with the well-aligned image and text features achieved by CLIP, the
80
+ point cloud features are implicitly aligned with the text embeddings. Further,
81
+ we propose a Text Querying Module to integrate language information into 3D
82
+ representation learning by querying text embeddings with point cloud features.
83
+ For fine-tuning, the model learns task-specific 3D representations under
84
+ informative language guidance from the label set without 2D images. Extensive
85
+ experiments demonstrate that our model shows consistent improvement on various
86
+ downstream tasks, such as point cloud semantic segmentation, instance
87
+ segmentation, and object detection. The code will be available here:
88
+ https://github.com/LeapLabTHU/Text4Point
89
+
90
+ ---------------
91
+
92
+ ### 14 Jul 2017 | [A Semantics-Based Measure of Emoji Similarity](https://arxiv.org/abs/1707.04653) | [⬇️](https://arxiv.org/pdf/1707.04653)
93
+ *Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran*
94
+
95
+ Emoji have grown to become one of the most important forms of communication
96
+ on the web. With its widespread use, measuring the similarity of emoji has
97
+ become an important problem for contemporary text processing since it lies at
98
+ the heart of sentiment analysis, search, and interface design tasks. This paper
99
+ presents a comprehensive analysis of the semantic similarity of emoji through
100
+ embedding models that are learned over machine-readable emoji meanings in the
101
+ EmojiNet knowledge base. Using emoji descriptions, emoji sense labels and emoji
102
+ sense definitions, and with different training corpora obtained from Twitter
103
+ and Google News, we develop and test multiple embedding models to measure emoji
104
+ similarity. To evaluate our work, we create a new dataset called EmoSim508,
105
+ which assigns human-annotated semantic similarity scores to a set of 508
106
+ carefully selected emoji pairs. After validation with EmoSim508, we present a
107
+ real-world use-case of our emoji embedding models using a sentiment analysis
108
+ task and show that our models outperform the previous best-performing emoji
109
+ embedding model on this task. The EmoSim508 dataset and our emoji embedding
110
+ models are publicly released with this paper and can be downloaded from
111
+ http://emojinet.knoesis.org/.
112
+
113
+ ---------------
114
+
115
+ ### 11 Mar 2021 | [Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views](https://arxiv.org/abs/2010.01191) | [⬇️](https://arxiv.org/pdf/2010.01191)
116
+ *Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra*
117
+
118
+ We study the task of semantic mapping - specifically, an embodied agent (a
119
+ robot or an egocentric AI assistant) is given a tour of a new environment and
120
+ asked to build an allocentric top-down semantic map ("what is where?") from
121
+ egocentric observations of an RGB-D camera with known pose (via localization
122
+ sensors). Towards this goal, we present SemanticMapNet (SMNet), which consists
123
+ of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame,
124
+ (2) a Feature Projector that projects egocentric features to appropriate
125
+ locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan
126
+ length x width x feature-dims that learns to accumulate projected egocentric
127
+ features, and (4) a Map Decoder that uses the memory tensor to produce semantic
128
+ top-down maps. SMNet combines the strengths of (known) projective camera
129
+ geometry and neural representation learning. On the task of semantic mapping in
130
+ the Matterport3D dataset, SMNet significantly outperforms competitive baselines
131
+ by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1
132
+ metrics. Moreover, we show how to use the neural episodic memories and
133
+ spatio-semantic allocentric representations build by SMNet for subsequent tasks
134
+ in the same space - navigating to objects seen during the tour ("Find chair") or
135
+ answering questions about the space ("How many chairs did you see in the
136
+ house?"). Project page: https://vincentcartillier.github.io/smnet.html.
137
+
138
+ ---------------
139
+
140
+ ### 06 Mar 2020 | [Distributional semantic modeling: a revised technique to train term/word vector space models applying the ontology-related approach](https://arxiv.org/abs/2003.03350) | [⬇️](https://arxiv.org/pdf/2003.03350)
141
+ *Oleksandr Palagin, Vitalii Velychko, Kyrylo Malakhov and Oleksandr Shchurov*
142
+
143
+ We design a new technique for the distributional semantic modeling with a
144
+ neural network-based approach to learn distributed term representations (or
145
+ term embeddings) - term vector space models as a result, inspired by the recent
146
+ ontology-related approach (using different types of contextual knowledge such
147
+ as syntactic knowledge, terminological knowledge, semantic knowledge, etc.) to
148
+ the identification of terms (term extraction) and relations between them
149
+ (relation extraction) called semantic pre-processing technology - SPT. Our
150
+ method relies on automatic term extraction from the natural language texts and
151
+ subsequent formation of the problem-oriented or application-oriented (also
152
+ deeply annotated) text corpora where the fundamental entity is the term
153
+ (includes non-compositional and compositional terms). This gives us an
154
+ opportunity to changeover from distributed word representations (or word
155
+ embeddings) to distributed term representations (or term embeddings). This
156
+ transition will allow to generate more accurate semantic maps of different
157
+ subject domains (also, of relations between input terms - it is useful to
158
+ explore clusters and oppositions, or to test your hypotheses about them). The
159
+ semantic map can be represented as a graph using Vec2graph - a Python library
160
+ for visualizing word embeddings (term embeddings in our case) as dynamic and
161
+ interactive graphs. The Vec2graph library coupled with term embeddings will not
162
+ only improve accuracy in solving standard NLP tasks, but also update the
163
+ conventional concept of automated ontology development. The main practical
164
+ result of our work is the development kit (set of toolkits represented as web
165
+ service APIs and web application), which provides all necessary routines for
166
+ the basic linguistic pre-processing and the semantic pre-processing of the
167
+ natural language texts in Ukrainian for future training of term vector space
168
+ models.
169
+
170
+ ---------------
171
+
172
+ ### 23 Jan 2023 | [Lexi: Self-Supervised Learning of the UI Language](https://arxiv.org/abs/2301.10165) | [⬇️](https://arxiv.org/pdf/2301.10165)
173
+ *Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva*
174
+
175
+ Humans can learn to operate the user interface (UI) of an application by
176
+ reading an instruction manual or how-to guide. Along with text, these resources
177
+ include visual content such as UI screenshots and images of application icons
178
+ referenced in the text. We explore how to leverage this data to learn generic
179
+ visio-linguistic representations of UI screens and their components. These
180
+ representations are useful in many real applications, such as accessibility,
181
+ voice navigation, and task automation. Prior UI representation models rely on
182
+ UI metadata (UI trees and accessibility labels), which is often missing,
183
+ incompletely defined, or not accessible. We avoid such a dependency, and
184
+ propose Lexi, a pre-trained vision and language model designed to handle the
185
+ unique features of UI screens, including their text richness and context
186
+ sensitivity. To train Lexi we curate the UICaption dataset consisting of 114k
187
+ UI images paired with descriptions of their functionality. We evaluate Lexi on
188
+ four tasks: UI action entailment, instruction-based UI image retrieval,
189
+ grounding referring expressions, and UI entity recognition.
190
+
191
+ ---------------
192
+
193
+ ### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
194
+ *Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
195
+
196
+ Screen user interfaces (UIs) and infographics, sharing similar visual
197
+ language and design principles, play important roles in human communication and
198
+ human-machine interaction. We introduce ScreenAI, a vision-language model that
199
+ specializes in UI and infographics understanding. Our model improves upon the
200
+ PaLI architecture with the flexible patching strategy of pix2struct and is
201
+ trained on a unique mixture of datasets. At the heart of this mixture is a
202
+ novel screen annotation task in which the model has to identify the type and
203
+ location of UI elements. We use these text annotations to describe screens to
204
+ Large Language Models and automatically generate question-answering (QA), UI
205
+ navigation, and summarization training datasets at scale. We run ablation
206
+ studies to demonstrate the impact of these design choices. At only 5B
207
+ parameters, ScreenAI achieves new state-of-the-art results on UI- and
208
+ infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
209
+ Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
210
+ InfographicVQA) compared to models of similar size. Finally, we release three
211
+ new datasets: one focused on the screen annotation task and two others focused
212
+ on question answering.
213
+
214
+ ---------------
215
+
216
+ ### 14 Jul 2017 | [EmojiNet: An Open Service and API for Emoji Sense Discovery](https://arxiv.org/abs/1707.04652) | [⬇️](https://arxiv.org/pdf/1707.04652)
217
+ *Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran*
218
+
219
+ This paper presents the release of EmojiNet, the largest machine-readable
220
+ emoji sense inventory that links Unicode emoji representations to their English
221
+ meanings extracted from the Web. EmojiNet is a dataset consisting of: (i)
222
+ 12,904 sense labels over 2,389 emoji, which were extracted from the web and
223
+ linked to machine-readable sense definitions seen in BabelNet, (ii) context
224
+ words associated with each emoji sense, which are inferred through word
225
+ embedding models trained over Google News corpus and a Twitter message corpus
226
+ for each emoji sense definition, and (iii) recognizing discrepancies in the
227
+ presentation of emoji on different platforms, specification of the most likely
228
+ platform-based emoji sense for a selected set of emoji. The dataset is hosted
229
+ as an open service with a REST API and is available at
230
+ http://emojinet.knoesis.org/. The development of this dataset, evaluation of
231
+ its quality, and its applications including emoji sense disambiguation and
232
+ emoji sense similarity are discussed.
233
+
234
+ ---------------
235
+
236
+ ### 22 Dec 2021 | [VoiceMoji: A Novel On-Device Pipeline for Seamless Emoji Insertion in Dictation](https://arxiv.org/abs/2112.12028) | [⬇️](https://arxiv.org/pdf/2112.12028)
237
+ *Sumit Kumar, Harichandana B S S, and Himanshu Arora*
238
+
239
+ Most of the speech recognition systems recover only words in the speech and
240
+ fail to capture emotions. Users have to manually add emoji(s) in text for
241
+ adding tone and making communication fun. Though there is much work done on
242
+ punctuation addition on transcribed speech, the area of emotion addition is
243
+ untouched. In this paper, we propose a novel on-device pipeline to enrich the
244
+ voice input experience. It involves, given a blob of transcribed text,
245
+ intelligently processing and identifying structure where emoji insertion makes
246
+ sense. Moreover, it includes semantic text analysis to predict emoji for each
247
+ of the sub-parts for which we propose a novel architecture Attention-based Char
248
+ Aware (ACA) LSTM which handles Out-Of-Vocabulary (OOV) words as well. All these
249
+ tasks are executed completely on-device and hence can aid on-device dictation
250
+ systems. To the best of our knowledge, this is the first work that shows how to
251
+ add emoji(s) in the transcribed text. We demonstrate that our components
252
+ achieve comparable results to previous neural approaches for punctuation
253
+ addition and emoji prediction with 80% fewer parameters. Overall, our proposed
254
+ model has a very small memory footprint of a mere 4MB to suit on-device
255
+ deployment.
256
+
257
+ ---------------
258
+
259
+ ### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677)
260
+ *Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao*
261
+
262
+ Controllable image captioning is an emerging multimodal topic that aims to
263
+ describe the image with natural language following human purpose,
264
+ $\textit{e.g.}$, looking at the specified regions or telling in a particular
265
+ text style. State-of-the-art methods are trained on annotated pairs of input
266
+ controls and output captions. However, the scarcity of such well-annotated
267
+ multimodal data largely limits their usability and scalability for interactive
268
+ AI systems. Leveraging unimodal instruction-following foundation models is a
269
+ promising alternative that benefits from broader sources of data. In this
270
+ paper, we present Caption AnyThing (CAT), a foundation model augmented image
271
+ captioning framework supporting a wide range of multimodal controls: 1) visual
272
+ controls, including points, boxes, and trajectories; 2) language controls, such
273
+ as sentiment, length, language, and factuality. Powered by Segment Anything
274
+ Model (SAM) and ChatGPT, we unify the visual and language prompts into a
275
+ modularized framework, enabling the flexible combination between different
276
+ controls. Extensive case studies demonstrate the user intention alignment
277
+ capabilities of our framework, shedding light on effective user interaction
278
+ modeling in vision-language applications. Our code is publicly available at
279
+ https://github.com/ttengwang/Caption-Anything.
280
+
281
+ ---------------
282
+
283
+ ### 12 Sep 2022 | [emojiSpace: Spatial Representation of Emojis](https://arxiv.org/abs/2209.09871) | [⬇️](https://arxiv.org/pdf/2209.09871)
284
+ *Moeen Mostafavi, Mahsa Pahlavikhah Varnosfaderani, Fateme Nikseresht, Seyed Ahmad Mansouri*
285
+
286
+ In the absence of nonverbal cues during messaging communication, users
287
+ express part of their emotions using emojis. Thus, having emojis in the
288
+ vocabulary of text messaging language models can significantly improve many
289
+ natural language processing (NLP) applications such as online communication
290
+ analysis. On the other hand, word embedding models are usually trained on a
291
+ very large corpus of text such as Wikipedia or Google News datasets that
292
+ include very few samples with emojis. In this study, we create emojiSpace,
293
+ which is a combined word-emoji embedding using the word2vec model from the
294
+ Gensim library in Python. We trained emojiSpace on a corpus of more than 4
295
+ billion tweets and evaluated it by implementing sentiment analysis on a Twitter
296
+ dataset containing more than 67 million tweets as an extrinsic task. For this
297
+ task, we compared the performance of two different classifiers of random forest
298
+ (RF) and linear support vector machine (SVM). For evaluation, we compared
299
+ emojiSpace performance with two other pre-trained embeddings and demonstrated
300
+ that emojiSpace outperforms both.
301
+
302
+ ---------------
303
+
304
+ ### 18 May 2022 | [Graph Adaptive Semantic Transfer for Cross-domain Sentiment Classification](https://arxiv.org/abs/2205.08772) | [⬇️](https://arxiv.org/pdf/2205.08772)
305
+ *Kai Zhang, Qi Liu, Zhenya Huang, Mingyue Cheng, Kun Zhang, Mengdi Zhang, Wei Wu, Enhong Chen*
306
+
307
+ Cross-domain sentiment classification (CDSC) aims to use the transferable
308
+ semantics learned from the source domain to predict the sentiment of reviews in
309
+ the unlabeled target domain. Existing studies in this task attach more
310
+ attention to the sequence modeling of sentences while largely ignoring the rich
311
+ domain-invariant semantics embedded in graph structures (i.e., the
312
+ part-of-speech tags and dependency relations). As an important aspect of
313
+ exploring characteristics of language comprehension, adaptive graph
314
+ representations have played an essential role in recent years. To this end, in
315
+ the paper, we aim to explore the possibility of learning invariant semantic
316
+ features from graph-like structures in CDSC. Specifically, we present Graph
317
+ Adaptive Semantic Transfer (GAST) model, an adaptive syntactic graph embedding
318
+ method that is able to learn domain-invariant semantics from both word
319
+ sequences and syntactic graphs. More specifically, we first raise a
320
+ POS-Transformer module to extract sequential semantic features from the word
321
+ sequences as well as the part-of-speech tags. Then, we design a Hybrid Graph
322
+ Attention (HGAT) module to generate syntax-based semantic features by
323
+ considering the transferable dependency relations. Finally, we devise an
324
+ Integrated aDaptive Strategy (IDS) to guide the joint learning process of both
325
+ modules. Extensive experiments on four public datasets indicate that GAST
326
+ achieves comparable effectiveness to a range of state-of-the-art models.
327
+
328
+ ---------------
329
+
330
+ ### 03 Apr 2018 | [Contrastive Learning of Emoji-based Representations for Resource-Poor Languages](https://arxiv.org/abs/1804.01855) | [⬇️](https://arxiv.org/pdf/1804.01855)
331
+ *Nurendra Choudhary, Rajat Singh, Ishita Bindlish and Manish Shrivastava*
332
+
333
+ The introduction of emojis (or emoticons) in social media platforms has given
334
+ the users an increased potential for expression. We propose a novel method
335
+ called Classification of Emojis using Siamese Network Architecture (CESNA) to
336
+ learn emoji-based representations of resource-poor languages by jointly
337
+ training them with resource-rich languages using a siamese network.
338
+ CESNA model consists of twin Bi-directional Long Short-Term Memory Recurrent
339
+ Neural Networks (Bi-LSTM RNN) with shared parameters joined by a contrastive
340
+ loss function based on a similarity metric. The model learns the
341
+ representations of resource-poor and resource-rich language in a common emoji
342
+ space by using a similarity metric based on the emojis present in sentences
343
+ from both languages. The model, hence, projects sentences with similar emojis
344
+ closer to each other and the sentences with different emojis farther from one
345
+ another. Experiments on large-scale Twitter datasets of resource-rich languages
346
+ - English and Spanish and resource-poor languages - Hindi and Telugu reveal
347
+ that CESNA outperforms the state-of-the-art emoji prediction approaches based
348
+ on distributional semantics, semantic rules, lexicon lists and deep neural
349
+ network representations without shared parameters.
350
+
351
+ ---------------
352
+
353
+ ### 16 Feb 2023 | [VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning](https://arxiv.org/abs/2211.15103) | [⬇️](https://arxiv.org/pdf/2211.15103)
354
+ *Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le*
355
+
356
+ Video paragraph captioning aims to generate a multi-sentence description of
357
+ an untrimmed video with several temporal event locations in coherent
358
+ storytelling. Following the human perception process, where the scene is
359
+ effectively understood by decomposing it into visual (e.g. human, animal) and
360
+ non-visual components (e.g. action, relations) under the mutual influence of
361
+ vision and language, we first propose a visual-linguistic (VL) feature. In the
362
+ proposed VL feature, the scene is modeled by three modalities including (i) a
363
+ global visual environment; (ii) local visual main agents; (iii) linguistic
364
+ scene elements. We then introduce an autoregressive Transformer-in-Transformer
365
+ (TinT) to simultaneously capture the semantic coherence of intra- and
366
+ inter-event contents within a video. Finally, we present a new VL contrastive
367
+ loss function to guarantee learnt embedding features are matched with the
368
+ captions semantics. Comprehensive experiments and extensive ablation studies on
369
+ ActivityNet Captions and YouCookII datasets show that the proposed
370
+ Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior
371
+ state-of-the-art methods on accuracy and diversity. Source code is made
372
+ publicly available at: https://github.com/UARK-AICV/VLTinT.
373
+
374
+ ---------------
375
+
376
+ ### 21 Jun 2023 | [EmTract: Extracting Emotions from Social Media](https://arxiv.org/abs/2112.03868) | [⬇️](https://arxiv.org/pdf/2112.03868)
377
+ *Domonkos F. Vamossy and Rolf Skog*
378
+
379
+ We develop an open-source tool (EmTract) that extracts emotions from social
380
+ media text tailored for financial context. To do so, we annotate ten thousand
381
+ short messages from a financial social media platform (StockTwits) and combine
382
+ it with open-source emotion data. We then use a pre-tuned NLP model,
383
+ DistilBERT, augment its embedding space by including 4,861 tokens (emojis and
384
+ emoticons), and then fit it first on the open-source emotion data, then
385
+ transfer it to our annotated financial social media data. Our model outperforms
386
+ competing open-source state-of-the-art emotion classifiers, such as Emotion
387
+ English DistilRoBERTa-base on both human and chatGPT annotated data. Compared
388
+ to dictionary based methods, our methodology has three main advantages for
389
+ research in finance. First, our model is tailored to financial social media
390
+ text; second, it incorporates key aspects of social media data, such as
391
+ non-standard phrases, emojis, and emoticons; and third, it operates by
392
+ sequentially learning a latent representation that includes features such as
393
+ word order, word usage, and local context. Using EmTract, we explore the
394
+ relationship between investor emotions expressed on social media and asset
395
+ prices. We show that firm-specific investor emotions are predictive of daily
396
+ price movements. Our findings show that emotions and market dynamics are
397
+ closely related, and we provide a tool to help study the role emotions play in
398
+ financial markets.
399
+
400
+ ---------------
401
+
402
+ ### 29 Oct 2022 | [Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding](https://arxiv.org/abs/2207.08455) | [⬇️](https://arxiv.org/pdf/2207.08455)
403
+ *Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, Xiaodan Liang*
404
+
405
+ To bridge the gap between supervised semantic segmentation and real-world
406
+ applications that acquires one model to recognize arbitrary new concepts,
407
+ recent zero-shot segmentation attracts a lot of attention by exploring the
408
+ relationships between unseen and seen object categories, yet requiring large
409
+ amounts of densely-annotated data with diverse base classes. In this paper, we
410
+ propose a new open-world semantic segmentation pipeline that makes the first
411
+ attempt to learn to segment semantic objects of various open-world categories
412
+ without any efforts on dense annotations, by purely exploiting the
413
+ image-caption data that naturally exist on the Internet. Our method,
414
+ Vision-language-driven Semantic Segmentation (ViL-Seg), employs an image and a
415
+ text encoder to generate visual and text embeddings for the image-caption data,
416
+ with two core components that endow its segmentation ability: First, the image
417
+ encoder is jointly trained with a vision-based contrasting and a cross-modal
418
+ contrasting, which encourage the visual embeddings to preserve both
419
+ fine-grained semantics and high-level category information that are crucial for
420
+ the segmentation task. Furthermore, an online clustering head is devised over
421
+ the image encoder, which allows to dynamically segment the visual embeddings
422
+ into distinct semantic groups such that they can be classified by comparing
423
+ with various text embeddings to complete our segmentation pipeline. Experiments
424
+ show that without using any data with dense annotations, our method can
425
+ directly segment objects of arbitrary categories, outperforming zero-shot
426
+ segmentation methods that require data labeling on three benchmark datasets.
427
+
428
+ ---------------
429
+
430
+ ### 19 Jan 2024 | [PoseScript: Linking 3D Human Poses and Natural Language](https://arxiv.org/abs/2210.11795) | [⬇️](https://arxiv.org/pdf/2210.11795)
431
+ *Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, Grégory Rogez*
432
+
433
+ Natural language plays a critical role in many computer vision applications,
434
+ such as image captioning, visual question answering, and cross-modal retrieval,
435
+ to provide fine-grained semantic information. Unfortunately, while human pose
436
+ is key to human understanding, current 3D human pose datasets lack detailed
437
+ language descriptions. To address this issue, we have introduced the PoseScript
438
+ dataset. This dataset pairs more than six thousand 3D human poses from AMASS
439
+ with rich human-annotated descriptions of the body parts and their spatial
440
+ relationships. Additionally, to increase the size of the dataset to a scale
441
+ that is compatible with data-hungry learning algorithms, we have proposed an
442
+ elaborate captioning process that generates automatic synthetic descriptions in
443
+ natural language from given 3D keypoints. This process extracts low-level pose
444
+ information, known as "posecodes", using a set of simple but generic rules on
445
+ the 3D keypoints. These posecodes are then combined into higher level textual
446
+ descriptions using syntactic rules. With automatic annotations, the amount of
447
+ available data significantly scales up (100k), making it possible to
448
+ effectively pretrain deep models for finetuning on human captions. To showcase
449
+ the potential of annotated poses, we present three multi-modal learning tasks
450
+ that utilize the PoseScript dataset. Firstly, we develop a pipeline that maps
451
+ 3D poses and textual descriptions into a joint embedding space, allowing for
452
+ cross-modal retrieval of relevant poses from large-scale datasets. Secondly, we
453
+ establish a baseline for a text-conditioned model generating 3D poses. Thirdly,
454
+ we present a learned process for generating pose descriptions. These
455
+ applications demonstrate the versatility and usefulness of annotated poses in
456
+ various tasks and pave the way for future research in the field.
457
+
458
+ ---------------
459
+
460
+ ### 11 Sep 2023 | [Tell me what you see: A zero-shot action recognition method based on natural language descriptions](https://arxiv.org/abs/2112.09976) | [⬇️](https://arxiv.org/pdf/2112.09976)
461
+ *Valter Estevam and Rayson Laroca and David Menotti and Helio Pedrini*
462
+
463
+ This paper presents a novel approach to Zero-Shot Action Recognition. Recent
464
+ works have explored the detection and classification of objects to obtain
465
+ semantic information from videos with remarkable performance. Inspired by them,
466
+ we propose using video captioning methods to extract semantic information about
467
+ objects, scenes, humans, and their relationships. To the best of our knowledge,
468
+ this is the first work to represent both videos and labels with descriptive
469
+ sentences. More specifically, we represent videos using sentences generated via
470
+ video captioning methods and classes using sentences extracted from documents
471
+ acquired through search engines on the Internet. Using these representations,
472
+ we build a shared semantic space employing BERT-based embedders pre-trained in
473
+ the paraphrasing task on multiple text datasets. The projection of both visual
474
+ and semantic information onto this space is straightforward, as they are
475
+ sentences, enabling classification using the nearest neighbor rule. We
476
+ demonstrate that representing videos and labels with sentences alleviates the
477
+ domain adaptation problem. Additionally, we show that word vectors are
478
+ unsuitable for building the semantic embedding space of our descriptions. Our
479
+ method outperforms the state-of-the-art performance on the UCF101 dataset by
480
+ 3.3 p.p. in accuracy under the TruZe protocol and achieves competitive results
481
+ on both the UCF101 and HMDB51 datasets under the conventional protocol (0/50%
482
+ - training/testing split). Our code is available at
483
+ https://github.com/valterlej/zsarcap.
484
+
485
+ ---------------
486
+
487
+ ### 16 Jun 2023 | [M3PT: A Multi-Modal Model for POI Tagging](https://arxiv.org/abs/2306.10079) | [⬇️](https://arxiv.org/pdf/2306.10079)
488
+ *Jingsong Yang, Guanzhou Han, Deqing Yang, Jingping Liu, Yanghua Xiao, Xiang Xu, Baohua Wu, Shenghua Ni*
489
+
490
+ POI tagging aims to annotate a point of interest (POI) with some informative
491
+ tags, which facilitates many services related to POIs, including search,
492
+ recommendation, and so on. Most of the existing solutions neglect the
493
+ significance of POI images and seldom fuse the textual and visual features of
494
+ POIs, resulting in suboptimal tagging performance. In this paper, we propose a
495
+ novel Multi-Modal Model for POI Tagging, namely M3PT, which achieves enhanced
496
+ POI tagging through fusing the target POI's textual and visual features, and
497
+ the precise matching between the multi-modal representations. Specifically, we
498
+ first devise a domain-adaptive image encoder (DIE) to obtain the image
499
+ embeddings aligned to their gold tags' semantics. Then, in M3PT's text-image
500
+ fusion module (TIF), the textual and visual representations are fully fused
501
+ into the POIs' content embeddings for the subsequent matching. In addition, we
502
+ adopt a contrastive learning strategy to further bridge the gap between the
503
+ representations of different modalities. To evaluate the tagging models'
504
+ performance, we have constructed two high-quality POI tagging datasets from the
505
+ real-world business scenario of Ali Fliggy. Upon the datasets, we conducted the
506
+ extensive experiments to demonstrate our model's advantage over the baselines
507
+ of uni-modality and multi-modality, and verify the effectiveness of important
508
+ components in M3PT, including DIE, TIF and the contrastive learning strategy.
509
+
510
+ ---------------
511
+
512
+ ### 10 Feb 2020 | [The Tensor Brain: Semantic Decoding for Perception and Memory](https://arxiv.org/abs/2001.11027) | [⬇️](https://arxiv.org/pdf/2001.11027)
513
+ *Volker Tresp and Sahand Sharifzadeh and Dario Konopatzki and Yunpu Ma*
514
+
515
+ We analyse perception and memory, using mathematical models for knowledge
516
+ graphs and tensors, to gain insights into the corresponding functionalities of
517
+ the human mind. Our discussion is based on the concept of propositional
518
+ sentences consisting of \textit{subject-predicate-object} (SPO) triples for
519
+ expressing elementary facts. SPO sentences are the basis for most natural
520
+ languages but might also be important for explicit perception and declarative
521
+ memories, as well as intra-brain communication and the ability to argue and
522
+ reason. A set of SPO sentences can be described as a knowledge graph, which can
523
+ be transformed into an adjacency tensor. We introduce tensor models, where
524
+ concepts have dual representations as indices and associated embeddings, two
525
+ constructs we believe are essential for the understanding of implicit and
526
+ explicit perception and memory in the brain. We argue that a biological
527
+ realization of perception and memory imposes constraints on information
528
+ processing. In particular, we propose that explicit perception and declarative
529
+ memories require a semantic decoder, which, in a simple realization, is based
530
+ on four layers: First, a sensory memory layer, as a buffer for sensory input,
531
+ second, an index layer representing concepts, third, a memoryless
532
+ representation layer for the broadcasting of information ---the "blackboard",
533
+ or the "canvas" of the brain--- and fourth, a working memory layer as a
534
+ processing center and data buffer. We discuss the operations of the four layers
535
+ and relate them to the global workspace theory. In a Bayesian brain
536
+ interpretation, semantic memory defines the prior for observable triple
537
+ statements. We propose that ---in evolution and during development--- semantic
538
+ memory, episodic memory, and natural language evolved as emergent properties in
539
+ agents' process to gain a deeper understanding of sensory information.
540
+
541
+ ---------------
542
+
543
+ ### 09 Sep 2021 | [Talk-to-Edit: Fine-Grained Facial Editing via Dialog](https://arxiv.org/abs/2109.04425) | [⬇️](https://arxiv.org/pdf/2109.04425)
544
+ *Yuming Jiang, Ziqi Huang, Xingang Pan, Chen Change Loy, Ziwei Liu*
545
+
546
+ Facial editing is an important task in vision and graphics with numerous
547
+ applications. However, existing works are incapable to deliver a continuous and
548
+ fine-grained editing mode (e.g., editing a slightly smiling face to a big
549
+ laughing one) with natural interactions with users. In this work, we propose
550
+ Talk-to-Edit, an interactive facial editing framework that performs
551
+ fine-grained attribute manipulation through dialog between the user and the
552
+ system. Our key insight is to model a continual "semantic field" in the GAN
553
+ latent space. 1) Unlike previous works that regard the editing as traversing
554
+ straight lines in the latent space, here the fine-grained editing is formulated
555
+ as finding a curving trajectory that respects fine-grained attribute landscape
556
+ on the semantic field. 2) The curvature at each step is location-specific and
557
+ determined by the input image as well as the users' language requests. 3) To
558
+ engage the users in a meaningful dialog, our system generates language feedback
559
+ by considering both the user request and the current state of the semantic
560
+ field.
561
+ We also contribute CelebA-Dialog, a visual-language facial editing dataset to
562
+ facilitate large-scale study. Specifically, each image has manually annotated
563
+ fine-grained attribute annotations as well as template-based textual
564
+ descriptions in natural language. Extensive quantitative and qualitative
565
+ experiments demonstrate the superiority of our framework in terms of 1) the
566
+ smoothness of fine-grained editing, 2) the identity/attribute preservation, and
567
+ 3) the visual photorealism and dialog fluency. Notably, user study validates
568
+ that our overall system is consistently favored by around 80% of the
569
+ participants. Our project page is https://www.mmlab-ntu.com/project/talkedit/.
570
+
571
+ ---------------
+ """)
763
+
764
+
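Once app.py is added, `streamlit run app.py` starts the page locally. A minimal smoke-test sketch, not part of this commit, that renders the script headlessly and checks it builds without raising; it assumes a Streamlit version that ships the `AppTest` API (1.28 or later):

```python
from streamlit.testing.v1 import AppTest

# Load and execute app.py in a headless test session.
at = AppTest.from_file("app.py", default_timeout=10)
at.run()

# Fail if any exception element was rendered while building the page.
assert not at.exception, "app.py raised an exception while rendering"
```

The rendered title, expanders, and sidebar content can also be inspected on the returned `AppTest` object (for example `at.title` and `at.sidebar`), which keeps the check independent of a browser.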