# TWIGMA / home.py
# Last commit: "fix typos" (c665342), Yiqun Chen
from pathlib import Path
import streamlit as st
import streamlit.components.v1 as components
from PIL import Image
import base64
import pandas as pd
from matplotlib import pyplot as plt


def read_markdown_file(markdown_file):
    """Return the contents of a markdown file as a string."""
    return Path(markdown_file).read_text()


def render_svg(svg_filename):
    """Render the SVG file at the given path as an inline base64 image."""
    with open(svg_filename, "r") as f:
        svg = f.read()
    # Embed the SVG as a base64 data URI so it can be shown via raw HTML.
    b64 = base64.b64encode(svg.encode("utf-8")).decode("utf-8")
    html = r'<img src="data:image/svg+xml;base64,%s"/>' % b64
    st.write(html, unsafe_allow_html=True)
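

# Illustrative sketch, not part of the original app: the two helpers below show
# one way the `possibly_sensitive` and `nsfw_score` columns described in app()
# could be used to summarize and filter NSFW rows. The function names and the
# 0.5 cutoff are assumptions made for illustration, not values from TWIGMA.
import pandas as pd


def mean_nsfw_by_sensitivity(df):
    """Return the mean `nsfw_score` within each `possibly_sensitive` group."""
    return df.groupby("possibly_sensitive")["nsfw_score"].mean()


def filter_sfw(df, nsfw_threshold=0.5):
    """Keep rows that Twitter did not flag and that score below the cutoff.

    Rows with a missing `possibly_sensitive` value are treated as not flagged;
    rows with a missing `nsfw_score` are dropped, since NaN < threshold is False.
    """
    flagged = df["possibly_sensitive"].eq(True)
    below_cutoff = df["nsfw_score"] < nsfw_threshold
    return df[~flagged & below_cutoff]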


def app():
    st.markdown("## TWIGMA (TWItter Generative-ai images with MetadatA)")
    st.markdown("### A dataset to understand content, variation, and longitudinal theme shifts of AI-generated images on Twitter.")
    st.markdown("Recent progress in generative artificial intelligence (gen-AI) has enabled the generation of photo-realistic and artistically inspiring photos at a single click, catering to millions of users online. To explore how people use gen-AI models such as DALLE and StableDiffusion, it is critical to understand the themes, contents, and variations present in the AI-generated photos. In this work, we introduce TWIGMA (TWItter Generative-ai images with MetadatA), a comprehensive dataset encompassing 800,000 gen-AI images collected from Jan 2021 to March 2023 on Twitter, with associated metadata (e.g., tweet text, creation date, number of likes).", unsafe_allow_html=True)
    # st.markdown("This is a website for TWIGMA (TWItter Generative-ai images with MetadatA), a comprehensive dataset encompassing 800,000 gen-AI images collected from Jan 2021 to March 2023 on Twitter, with associated metadata (e.g., tweet text, creation date, number of likes). Through a comparative analysis of TWIGMA with natural images and human artwork, we find that gen-AI images possess distinctive characteristics and exhibit, on average, lower variability when compared to their non-gen-AI counterparts. Additionally, we find that the similarity between a gen-AI image and natural images (i) is inversely correlated with the number of likes; and (ii) can be used to identify human images that served as inspiration for the gen-AI creations. Finally, we observe a longitudinal shift in the themes of AI-generated images on Twitter, with users increasingly sharing artistically sophisticated content such as intricate human portraits, whereas their interest in simple subjects such as natural scenes and animals has decreased. Our analyses and findings underscore the significance of TWIGMA as a unique data resource for studying AI-generated images.", unsafe_allow_html=True)
    fig1e = Image.open('./resources/4x/test_figure_1_web.png')
    st.image(fig1e, caption='Creation process and content overview of TWIGMA.', output_format='png')
    # render_svg("./resources/SVG/Asset 49.svg")
    st.markdown("### Preview of TWIGMA: ")
    st.markdown("We display 10% of the TWIGMA data below; the full data can be downloaded at [insert url]. Note that in accordance with the privacy and control policy of Twitter, *no raw content* from Twitter is included in this dataset; users need to retrieve the original Twitter content used for analysis using the Twitter id.")
    twigma_df = pd.read_csv("./resources/csv_files/twigma_release_sampled.csv")  # path to the sampled data file
    st.write(twigma_df)  # display the data table
    st.markdown("""
TWIGMA contains the following fields:
- id: The Twitter id uniquely identifying each tweet used in this dataset and our analysis;
- image_name: The media id (see details at the Twitter [page](https://developer.twitter.com/en/docs/twitter-ads-api/creatives/guides/identifying-media)) used to uniquely identify each photo. This field is necessary because a tweet can contain multiple images;
- created_at: The time of creation corresponding to the Twitter id;
- like_count: The number of likes collected from the official Twitter API (snapshot: the week of May 29th). Note that some likes are not available because the corresponding tweets have been deleted since we first downloaded the photos;
- quote_count: Same as like_count, but for quotes;
- reply_count: Same as like_count, but for replies;
- all_captions: The BLIP-generated (Li et al. 2022) captions for the corresponding image;
- label_10_cluster: The assigned k-means cluster (k=10, so this number varies from 1 to 10);
- possibly_sensitive: Binary variable indicating whether the media content has been marked as sensitive/NSFW by Twitter;
- nsfw_score: The predicted NSFW score from a pre-trained CLIP-based NSFW detector (ranges from 0 to 1; closer to 1 means more likely to be NSFW);
- UMAP_dim_1: The first dimension of a two-dimensional UMAP projection of the CLIP-ViT-L-14 embeddings of the images in TWIGMA;
- UMAP_dim_2: The second dimension of a two-dimensional UMAP projection of the CLIP-ViT-L-14 embeddings of the images in TWIGMA.
""")
    # Group the DataFrame by 'possibly_sensitive' and calculate the mean 'nsfw_score' for each category
    st.markdown("### Safety note: ")
    st.markdown("""
It is important to note that *a substantial number* of images have been classified as NSFW (not-safe-for-work, e.g., violent, pornographic, nude content) by both Twitter and a CLIP-based NSFW model.
Therefore, we included two fields in the final TWIGMA dataset (`possibly_sensitive` and `nsfw_score`) to help filter out these images.
Additionally, we note that the share of NSFW content varies considerably across cluster labels, and two of the clusters contain primarily NSFW content.
""")
    st.markdown("""
### References:
- CLIP-based NSFW Detector. 2022. https://github.com/LAION-AI/CLIP-based-NSFW-Detector. Accessed: 2023-06-06.
- Li, J.; Li, D.; Xiong, C.; and Hoi, S. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 12888–12900. PMLR.
- Chen, Y.; and Zou, J. 2023+. TWIGMA: A dataset of AI-Generated Images with Metadata From Twitter.
""")
    st.markdown("""---""")
    st.markdown('Acknowledgement:')
    st.caption('Huge thanks to Zhi Huang for sharing their code to query Twitter data, and to Federico for sharing their webplip webapp, which greatly inspired this website.')
    st.markdown('Disclaimer:')
    st.caption('Please be advised that this function has been developed in compliance with the Twitter policy of data usage and sharing. The use of this function is solely at your own risk and should be consistent with applicable laws, regulations, and ethical considerations. If you wish to review the original Twitter post, you should access the source page directly on Twitter.')
    st.markdown('Privacy statement:')
    st.caption('In accordance with the privacy and control policy of Twitter, we hereby declare that the data redistributed by us comprise only Tweet IDs and non-identified metadata derived from the contents (e.g., clustering assignment based on images, image media id, predicted NSFW scores) but not the contents themselves (no text, hashtag, or image redistributed). The Tweet IDs can be used to retrieve the original Twitter post, as long as the original post is still accessible. The hyperlink will cease to function if the user deletes the original post. It is important to note that a substantial number of tweets in our dataset have been classified as sensitive/NSFW by Twitter. Any distribution carried out must adhere to the regulations set by Twitter, as well as laws and regulations applicable in your jurisdiction, including export control laws and embargoes.')
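

# Illustrative sketch, not part of the original app: the privacy statement says
# a Tweet ID can be used to retrieve the original post. The hypothetical helper
# below builds a link from an `id` value, assuming the
# "https://twitter.com/i/web/status/<id>" pattern still resolves and that the
# post has not been deleted.
def tweet_url(tweet_id):
    """Return a link to the tweet with the given id.

    Assumes the id is an int or a digit string; reading the `id` column as a
    float would round large ids and produce a wrong link.
    """
    return f"https://twitter.com/i/web/status/{int(tweet_id)}"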