Florent Daudens


AI & ML interests

AI & Journalism


fdaudens's activity

posted an update 4 days ago
How can AI help us write better headlines and reach more people?

I experimented with a new approach that is both useful and fun. It can help you overcome writer’s block, find better headlines, and make your blog posts and news articles climb in search engine results. Plus, we will learn new concepts along the way!

1️⃣ First, I scraped all the blog posts written on Hugging Face to create a dataset with the headlines, texts, dates, and authors' names.

2️⃣ I filtered the dataset to remove posts that were too long and would require a model with a longer context window. This was done to keep the project simple and cost-effective (actually, free).

3️⃣ Then, I used a dataset generation workflow built by @davanstrien to generate a DPO dataset.

4️⃣ As a last step, you can collectively rate these evaluations to improve the quality of the dataset using an easy-to-use interface with Argilla. Take a look at it and rate some of them! This way, you can contribute to making this dataset useful for different newsrooms that could use it as a starting point.
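To make step 2️⃣ concrete, here is a minimal sketch of the length filter in plain Python. It is an illustration only, not the code I actually ran: the field names (`headline`, `text`), the word-count proxy for tokens, and the `MAX_WORDS` cutoff are all assumptions.

```python
# Hypothetical cutoff: the post does not state the threshold that was used.
MAX_WORDS = 1500


def approx_length(text: str) -> int:
    """Rough proxy for token count: number of whitespace-separated words."""
    return len(text.split())


def keep_short_posts(posts: list[dict], max_words: int = MAX_WORDS) -> list[dict]:
    """Keep only posts whose text fits within the word budget."""
    return [post for post in posts if approx_length(post["text"]) <= max_words]


# Toy records standing in for scraped blog posts (field names are assumed).
posts = [
    {"headline": "Short post", "text": "word " * 200},
    {"headline": "Long post", "text": "word " * 5000},
]

short_posts = keep_short_posts(posts)
print([post["headline"] for post in short_posts])  # ['Short post']
```

A real pipeline would more likely count tokens with the tokenizer of the model being used, since word counts only approximate the context window budget.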

𝐖𝐡𝐲 𝐢𝐭 𝐦𝐚𝐭𝐭𝐞𝐫𝐬. This example is compelling because, if you look at the dataset, you can see some examples where the headlines are enhanced by the addition of an important keyword or an action verb.
These tweaks can have a big impact on your position in search engines and, therefore, on your traffic. It’s also good leverage for our creativity since you can compare the initial idea with another one from an outside perspective.

Imagine you’re a large news organization: you could run this experiment with thousands of news articles.

With a dataset of several hundred to thousands of entries, you could fine-tune a model to suggest headlines better tailored to your needs and writing style.

👉 Take a look at it and rate the headlines: fdaudens/journalism-argilla-space
👉 Daniel's code https://github.com/huggingface/data-is-better-together/blob/main/dpo/README.md
posted an update 5 days ago
If you're part of the Journalists on Hugging Face community, did you know you can receive notifications on ongoing discussions?

- "Repo discussions" for repo discussions you're participating in or mentioned in
- "New activity on watched orgs/users" for repo discussions & posts from users & orgs you're watching

Activate them here: https://huggingface.co/settings/notifications

Join the community: https://huggingface.co/JournalistsonHF
posted an update 9 days ago
Switching from French to German to Chinese in the same discussion 😅

Impressive to see the multilingual capabilities of Cohere for AI's new Aya model.

- C4AI Aya 23 is an open-weights research release
- 8 and 35 billion parameter models
- 23 languages supported

You can try it out here: CohereForAI/aya-23
posted an update 11 days ago
Excited to share a new project to make journalists’ lives easier when gathering information!

Collecting data like lists, URLs, etc., from websites is not always easy (and sometimes painful). Web scraping requires technical skills that only a handful of people in each newsroom have.

I recently stumbled upon @scrapegraphai, a scraper that uses AI to do the heavy lifting from a simple natural-language prompt. I asked them if they could integrate the Hugging Face Hub to use open-source models, and I created a no-code, easy-to-use interface on Gradio.

You can then save time and focus on storytelling!

🔧 How It Works
1. Input Your Prompt and Source URL
2. Click ‘Scrape and Summarize’
3. Receive Summarized Results

👩‍💻 Get Involved!
This is just the first version of the tool, and it’s pretty basic. I’ve uploaded it to the Journalists on Hugging Face community so we can work together on it. Whether you’re a developer, a data scientist, or a journalist with ideas, you can contribute to this project.

You can also copy this app to your own account or organization to customize it to your needs.

👉 Test the scraper here: JournalistsonHF/ai-scraper

🤝 Join the Journalists on 🤗 community: https://huggingface.co/JournalistsonHF
posted an update 11 days ago
80% of fact-checked misinformation claims involve media, with a rise in AI-generated content in 2023, according to a new study, “A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild.” Worth a read for journalists, especially fact-checkers.

• 📊 135,838 fact checks analyzed
• 📸 80% of these claims involve media
• 🎥 Videos became more common starting in 2022 and now account for more than 60% of fact-checked claims that include media
• 🤖 AI-generated content was rare until Spring of 2023, and then dramatically increased
• 🖼️ Image manipulations don’t require complex operations. Most of the time, they are context manipulations

• Read the paper here: AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild (2405.11697)
• Take a look at the dataset: academic-datasets/AMMeBa

Thanks @davanstrien for spotting it!
posted an update 12 days ago
Do you want to improve AI in your language? Here's how you can help.

I'm exploring different AI techniques for an upcoming project in journalism, and I wanted to test a cool idea by @davanstrien, Data is better together, which aims to foster a community of people to create DPO datasets in different languages.

This project gives the opportunity to explore various concepts:
- Direct Preference Optimization (DPO)
- Synthetic data
- Data annotation
- LLM as a judge

1️⃣ Take the Aya dataset of human-annotated prompt-completion pairs across 71 languages and filter it to include only those in the language you’re interested in.

2️⃣ Use distilabel from Argilla to generate a second response for each prompt and evaluate which response is best.

Basically, DPO datasets have a chosen and a rejected response to each question, which helps align models on specific tasks. To quote Daniel: "Currently, there are only a few DPO datasets available for a limited number of languages. By generating more DPO datasets for different languages, we can help to improve the quality of generative models in a wider range of languages."

3️⃣ Send this dataset and evaluations to the easy-to-use interface to evaluate the evaluations.

This is where you can help. :) You can rate the LLM evaluation of the prompt-response pairs. For my example, I built a dataset in French. And without wanting to start a debate about homeopathy, the second result is clearly better in the example below! fdaudens/demo-aya-dpo-french

The final dataset can be found here: fdaudens/aya_french_dpo

To contribute to other languages and learn more about synthetic data, you can also produce datasets in the language of your choice! Read more about the project: https://github.com/huggingface/data-is-better-together/blob/main/dpo/README.md
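To illustrate what a "chosen" and a "rejected" response look like in practice, here is a small sketch that turns two rated answers to the same prompt into one preference record. It is not the distilabel pipeline itself: the `to_dpo_record` helper, the numeric ratings, and the toy prompt are hypothetical, and the `prompt` / `chosen` / `rejected` field names follow a common convention that a given dataset may or may not use.

```python
from typing import Optional


def to_dpo_record(
    prompt: str,
    response_a: str,
    response_b: str,
    rating_a: float,
    rating_b: float,
) -> Optional[dict]:
    """Turn two rated responses into one preference record (None on a tie)."""
    if rating_a == rating_b:
        # No preference can be expressed, so the pair is skipped.
        return None
    if rating_a > rating_b:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Toy example: the ratings stand in for scores given by an LLM judge.
record = to_dpo_record(
    prompt="What is the capital of France?",
    response_a="Paris.",
    response_b="The capital of France is Paris, which sits on the Seine.",
    rating_a=3,
    rating_b=5,
)
print(record["chosen"])
```

Human ratings collected in the annotation interface can then be used to check whether the judge's preference, and therefore the chosen/rejected split, was right.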
posted an update 12 days ago
A useful tool for journalists: AutoQuizzer generates a quiz from a URL. You can play the quiz, or let the LLM play it!

posted an update 16 days ago
Access to computational resources is key for democratizing AI, in all domains.

We cooked up something we're proud of: Hugging Face is committing $10 million in free GPUs to help developers create new AI technologies.

“AI should not be held in the hands of the few. With this commitment to open-source developers, we’re excited to see what everyone will cook up next in the spirit of collaboration and transparency.” — @clem

Read the exclusive by Kylie Robison: https://www.theverge.com/2024/5/16/24156755/hugging-face-celement-delangue-free-shared-gpus-ai
posted an update 18 days ago
"This Journalism Professor Made a NYC Chatbot in Minutes. It Actually Worked"

A lot of interesting quotes in this interview in The Markup with Jonathan Soma, a professor in data journalism at Columbia University: https://themarkup.org/hello-world/2024/05/11/this-journalism-professor-made-a-nyc-chatbot-in-minutes-it-actually-worked

When the New York City government released its chatbot, journalists found, "again and again, the bot messing up on city laws and regulations."

Enter Jonathan Soma, who tried to build his own version of the chatbot. And guess what? He got accurate responses.

💬 Still, he remains cautious: "Chatbots are great for low-stakes things. They are great when something is fun, they are great for a task where you do not need 100 percent accuracy, when you just want a little bit of guidance."

"I think that AI in general is absolutely useful for journalism, and I’ve been teaching machine learning and AI to journalists long before ChatGPT hit the scene. I think it is explicitly chatbots that are probably the most problematic part, because they are so confident in everything that they say."

🤗 I have a particular soft spot for this project, as it uses many Hugging Face tools under the hood. This is precisely the kind of work we want to build with the Journalists on HF community. Join us: https://huggingface.co/JournalistsonHF

📺 I can't recommend his video series "Practical AI for Investigative Journalism" enough: https://www.youtube.com/watch?v=N5wvtYRYbfA&list=PLewNEVDy7gq1_GPUaL0OQ31QsiHP5ncAQ

— Thanks @BrigitteTousi for the link!
posted an update 24 days ago
What tools do you need to deconstruct bias in algorithms? (You know, this thing that is becoming increasingly prevalent in our lives)

Participate in the new discussion in the Journalists on Hugging Face community: JournalistsonHF/README#4
posted an update 29 days ago
A new dataset for anyone interested in satellite imagery: 3 million @Satellogic images of unique locations (6 million images, including location revisits) from around the world under a Creative Commons CC-BY 4.0 license.

Interesting potential in journalism.

posted an update 30 days ago
I've added new collections to the Journalists on 🤗 community, focusing on Data Visualization, Optical Character Recognition, and Multimodal Models:

- TinyChart-3B: This model interprets data visualizations based on your prompts. It can generate the underlying data table from a chart or recreate the chart with Python code.
- PDF to OCR: Convert your PDFs to text—ideal for FOI records sent as images.
- Idefics-8b: A multimodal model that allows you to ask questions about images.

Explore these tools here: 👉 https://huggingface.co/JournalistsonHF
posted an update about 1 month ago
New conversation in our Journalists on Hugging Face community: Exploring auto-tagging articles for taxonomy.

I've shared insights from my previous experience with fine-tuning a model for a classification task. But has anyone built a similar use case? Or are you seeking a solution for this task too? Join the discussion here: JournalistsonHF/README#2
posted an update about 1 month ago
Should media organizations strike deals with big tech companies? Here are two colliding news stories about licensing:

1. The Financial Times has secured a licensing agreement with OpenAI to license its material both for training and queries on ChatGPT. It is the fifth such deal, following similar agreements with Associated Press, Axel Springer, Le Monde and Prisa Media. "Financial terms were not disclosed."

"Apart from the benefits to the FT, there are broader implications for the industry. It’s right, of course, that AI platforms pay publishers for the use of their material. OpenAI understands the importance of transparency, attribution, and compensation – all essential for us."

2. Meanwhile, French media outlet Mediapart is refusing to collect the money it is entitled to from Google under so-called "neighbouring rights", which cover the right to display its news content online.

Why? Due to issues with disclosing financial terms: "The confidentiality clauses imposed by Google today prevent us from publicizing to our readers not only the total amount paid, but also the amount Mediapart is entitled to receive."

"In our view, financial dependence on platforms is incompatible with our public service mission, which is to make the powerful face up to their responsibilities. It also seems extremely dangerous economically."

Two positions at opposite sides of the spectrum.

- The Financial Times and OpenAI strike content licensing deal

- Droits voisins : Mediapart lance la bataille de la transparence contre Google (in French) https
posted an update about 1 month ago
How do Microsoft and Alphabet (Google) results compare?

Microsoft Reports Rising Revenues as A.I. Investments Bear Fruit
- 17% jump in revenue and a 20% increase in profit for the first three months of the year.
- Revenue was $61.9 billion, up from $52.9 billion a year earlier.
- Profit hit $21.9 billion, up from $18.3 billion.
- More than a fifth of that growth came from its generative A.I. services.

Alphabet’s Revenue Jumps 15% to $80.5 Billion
- $80.5 billion in quarterly sales, up 15% from a year earlier. Profit climbed 36% to $23.7 billion.
- For the first time, a dividend of 20 cents per share.
- It spent $12 billion on capital expenditures in the first quarter, soaring 91% from a year earlier.

Meta’s Open Source Llama 3 Is Already Nipping at OpenAI’s Heels - Wired
- "if open source models prove competitive, developers and entrepreneurs may decide to stop paying to access the latest model from OpenAI or Google and use Llama 3 or one of the other increasingly powerful open source models that are popping up."
- "Open models appear to be dropping at an impressive clip."
posted an update about 1 month ago
5 interesting news stories today:

An AI startup made a hyperrealistic deepfake of me that’s so good it’s scary
- "I think we might just have to say goodbye to finding out about the truth in a quick way," says Sandra Wachter, a professor at the Oxford Internet Institute
- "Synthesia uses both large language models and diffusion models to do this. Sees itself as a platform for businesses. Its bet is this: As people spend more time watching videos on YouTube and TikTok, there will be more demand for video content."
- "Synthesia’s policy is to not create avatars of people without their explicit consent. But it hasn’t been immune from abuse."

WIRED found thousands of ads running on Meta's social platforms promoting sexually explicit "AI girlfriend" apps.
- "Some human sex workers say the platform unfairly polices their own posts more harshly."
- "Many of the virtual women seen in ads reviewed by WIRED are lifelike—if somewhat uncanny—young, and stereotypically pornographic."

Wall Street’s Patience for a Costly A.I. Arms Race Is Waning
- "A sell-off in Meta’s stock after the company disclosed huge investments in the technology may be a sign of investor fears about tech giants’ spending."
- "The company plans to spend $35 billion to $40 billion this year — much of that on the technology."

Saudi Arabia Spends Big to Become an A.I. Superpower

UK competition watchdog steps up scrutiny of big tech’s role in AI startups
posted an update about 1 month ago
It's been only a week since I joined 🤗 and the community has released a constant flow of content!

Notable models:
- Apple OpenELM apple/openelm-instruct-models-6619ad295d7ae9f868b759ca + apple/openelm-pretrained-models-6619ac6ca12a10bd0d0df89e
- HuggingFaceM4 Idefics2 HuggingFaceM4/idefics2-8b
- Meta Llama 3 meta-llama/meta-llama-3-66214712577ca38149ebb2b6
- Microsoft Phi-3 microsoft/phi-3-6626e15e9585a200d2d761e3
- Snowflake Arctic Snowflake/arctic-66290090abe542894a5ac520

Great datasets:
- HuggingFaceFW FineWeb HuggingFaceFW/fineweb
- HuggingFaceM4/the_cauldron HuggingFaceM4/the_cauldron
- PleIAs/YouTube-Commons PleIAs/YouTube-Commons

Fascinating Spaces:
- InstantMesh TencentARC/InstantMesh
- Chat with Llama 3 8B ysharma/Chat_with_Meta_llama3_8b
- Parler-TTS parler-tts/parler_tts_mini
- AI Jukebox enzostvs/ai-jukebox
- CosXL multimodalart/cosxl
- Singing songstarter nateraw/singing-songstarter
- Play with Idefics2 8B https://huggingface.co/spaces/HuggingFaceM4/idefics-8b
- CodeQwen1.5-7B-Chat Bot👾

I expected to be at the center of AI development. I'm not disappointed!
posted an update about 1 month ago
Testing the Phi-3-mini 4k on HuggingChat. How well can it craft a tweet?

Not bad at all:
Excited to unveil phi-3-mini, a compact yet powerful 3.8B parameter model, outperforming giants like Mixtral & GPT-3.5 on benchmarks & safe for phones! *
#AI #Phi3 #LanguageModel #Techinnovation #Phi3Miniml

The models are here:
- Phi-3-Mini-4K-Instruct: microsoft/Phi-3-mini-4k-instruct
- Phi-3-Mini-128K-Instruct: microsoft/Phi-3-mini-128k-instruct

Try it out in HuggingChat: https://huggingface.co/chat/models/microsoft/Phi-3-mini-4k-instruct