fdaudens (Florent Daudens)

posted an update about 19 hours ago

Excited to share a new project to make journalists’ lives easier when gathering information!

Collecting data like lists, URLs, etc., from websites is not always easy (and sometimes painful). Web scraping requires technical skills that only a handful of people in each newsroom have.

I recently stumbled upon @scrapegraphai, a scraper that uses AI to do the heavy lifting: the user only writes a simple prompt in natural language. I asked them if they could integrate the Hugging Face Hub to use open-source models, and I created a no-code, easy-to-use interface on Gradio.

You can then save time and focus on storytelling!

🔧 How It Works
1. Input Your Prompt and Source URL
2. Click ‘Scrape and Summarize’
3. Receive Summarized Results
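
For comparison, here is a minimal sketch of the kind of hand-written scraper that the tool is meant to spare you from writing. It is not how ScrapeGraphAI or the Gradio app works internally. It only collects links from a page using Python's standard library, and the sample HTML, the `LinkCollector` class, and the `extract_links` helper are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect an absolute URL for every <a href="..."> tag (illustrative helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href; resolve relative paths against the page URL.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all links found in an HTML string, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample page; a real script would first download the HTML.
SAMPLE_HTML = """
<ul>
  <li><a href="/news/budget-2024">Budget</a></li>
  <li><a href="https://example.org/report.pdf">Report</a></li>
  <li><a>No link here</a></li>
</ul>
"""

links = extract_links(SAMPLE_HTML, "https://example.com")
print(links)
```

Even this small case needs a parser class and URL handling, and every new site layout means new code. A natural-language prompt avoids that work.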

👩‍💻 Get Involved!
This is just the first version of the tool, and it’s pretty basic. I’ve uploaded it to the Journalists on Hugging Face community so we can work together on it. Whether you’re a developer, a data scientist, or a journalist with ideas, you can contribute to this project.

You can also copy this app to your own account or organization to customize it to your needs.

👉 Test the scraper here: JournalistsonHF/ai-scraper

🤝 Join the Journalists on 🤗 community: https://huggingface.co/JournalistsonHF

posted an update 1 day ago

80% of fact-checked misinformation claims involve media, with a rise in AI-generated content in 2023, according to a new study, “A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild.” Worth a read for journalists, especially fact-checkers.

TL;DR:
• 📊 135,838 fact checks analyzed
• 📸 80% of these claims involve media
• 🎥 Videos became more common starting in 2022 and now account for more than 60% of fact-checked claims that include media
• 🤖 AI-generated content was rare until spring 2023, and then increased dramatically
• 🖼️ Image manipulations don’t require complex operations. Most of the time, they are context manipulations

• Read the paper here: AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild (2405.11697)
• Take a look at the dataset: academic-datasets/AMMeBa

Thanks @davanstrien for spotting it!

posted an update 3 days ago

Do you want to improve AI in your language? Here's how you can help.

I'm exploring different AI techniques for an upcoming project in journalism, and I wanted to test a cool idea by @davanstrien, Data is Better Together, which aims to foster a community of people to create DPO datasets in different languages.

This project gives the opportunity to explore various concepts:
- Direct Preference Optimization (DPO)
- Synthetic data
- Data annotation
- LLM as a judge

1️⃣ Take the Aya dataset of human-annotated prompt-completion pairs across 71 languages and filter it to include only those in the language you’re interested in.

2️⃣ Use distilabel from Argilla to generate a second response for each prompt and evaluate which response is best.

Basically, DPO datasets have a chosen and a rejected response to each question, which helps align models on specific tasks. To quote Daniel: "Currently, there are only a few DPO datasets available for a limited number of languages. By generating more DPO datasets for different languages, we can help to improve the quality of generative models in a wider range of languages."

3️⃣ Send this dataset and evaluations to the easy-to-use interface to evaluate the evaluations.
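
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use the distilabel or Argilla APIs, the field names (`language`, `prompt`, `completion`) are hypothetical, and the judge scores are hard-coded stand-ins for what an LLM judge would return.

```python
def filter_by_language(rows, language):
    """Step 1: keep only the prompt-completion pairs in one language."""
    return [row for row in rows if row["language"] == language]


def to_dpo_record(prompt, original, generated, score_original, score_generated):
    """Steps 2-3: turn two scored responses into a chosen/rejected pair.

    The higher-scored response becomes "chosen"; ties keep the original.
    """
    if score_generated > score_original:
        chosen, rejected = generated, original
    else:
        chosen, rejected = original, generated
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Made-up rows standing in for a multilingual prompt-completion dataset.
rows = [
    {"language": "French", "prompt": "Qu'est-ce que le DPO ?", "completion": "Réponse A"},
    {"language": "English", "prompt": "What is DPO?", "completion": "Answer A"},
]

french_rows = filter_by_language(rows, "French")

# In the real workflow a second model writes "Réponse B" and an LLM judge
# produces the scores; here both are placeholder values.
record = to_dpo_record(
    prompt=french_rows[0]["prompt"],
    original=french_rows[0]["completion"],
    generated="Réponse B",
    score_original=3.0,
    score_generated=4.5,
)
print(record)
```

Human raters then check whether the judge picked the right `chosen` response, which is the part volunteers can help with.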

This is where you can help. :) You can rate the LLM evaluation of the prompt-response pairs. For my example, I built a dataset in French. And without wanting to start a debate about homeopathy, the second result is clearly better in the example below! fdaudens/demo-aya-dpo-french

The final dataset can be found here: fdaudens/aya_french_dpo

To contribute to other languages and learn more about synthetic data, you can also produce datasets in the language of your choice! Read more about the project: https://github.com/huggingface/data-is-better-together/blob/main/dpo/README.md

posted an update 3 days ago

A useful tool for journalists: AutoQuizzer generates a quiz from a URL. You can play the quiz, or let the LLM play it!

deepset/autoquizzer

posted an update 7 days ago

Access to computational resources is key for democratizing AI, in all domains.

We cooked up something we're proud of: Hugging Face is committing $10 million in free GPUs to help developers create new AI technologies.

“AI should not be held in the hands of the few. With this commitment to open-source developers, we’re excited to see what everyone will cook up next in the spirit of collaboration and transparency.” — @clem

Read the exclusive by Kylie Robison: https://www.theverge.com/2024/5/16/24156755/hugging-face-celement-delangue-free-shared-gpus-ai

posted an update 9 days ago

"This Journalism Professor Made a NYC Chatbot in Minutes. It Actually Worked"

A lot of interesting quotes in this interview in The Markup with Jonathan Soma, a professor in data journalism at Columbia University: https://themarkup.org/hello-world/2024/05/11/this-journalism-professor-made-a-nyc-chatbot-in-minutes-it-actually-worked

When the New York City government released its chatbot, journalists found that "Again and again, the bot messing up on city laws and regulations."

Enter Jonathan Soma, who tried to build his own version of the chatbot. And guess what? He got accurate responses.

💬 Still, he remains cautious: "Chatbots are great for low-stakes things. They are great when something is fun, they are great for a task where you do not need 100 percent accuracy, when you just want a little bit of guidance."

"I think that AI in general is absolutely useful for journalism, and I’ve been teaching machine learning and AI to journalists long before ChatGPT hit the scene. I think it is explicitly chatbots that are probably the most problematic part, because they are so confident in everything that they say."

🤗 I have a particular soft spot for this project, as it uses many Hugging Face tools under the hood. This is precisely the kind of work we want to build with the Journalists on HF community. Join us: https://huggingface.co/JournalistsonHF

📺 I can't recommend his video series "Practical AI for Investigative Journalism" enough: https://www.youtube.com/watch?v=N5wvtYRYbfA&list=PLewNEVDy7gq1_GPUaL0OQ31QsiHP5ncAQ

— Thanks @BrigitteTousi for the link!

posted an update 14 days ago

What tools do you need to deconstruct bias in algorithms? (You know, this thing that is becoming increasingly prevalent in our lives)

Participate in the new discussion in the Journalists on Hugging Face community: JournalistsonHF/README#4

posted an update 16 days ago

Audio transcription is one of the most useful applications of AI for journalists (and many other professions!). @sergeipetrov, @reach-vb, @pcuenq, and @philschmid have created an optimized Whisper with Speaker Diarization for @huggingface Inference Endpoints. Definitely worth a read!

Check out their blog post here: https://huggingface.co/blog/asr-diarization

You can find the notebook here: sergeipetrov/asrdiarization-handler

posted an update 20 days ago

A new dataset for anyone interested in satellite imagery: 3 million @Satellogic images of unique locations (6 million images, including location revisits) from around the world under a Creative Commons CC-BY 4.0 license.

Interesting potential in journalism.

satellogic/EarthView

posted an update 20 days ago

I've added new collections to the Journalists on 🤗 community, focusing on Data Visualization, Optical Character Recognition, and Multimodal Models:

- TinyChart-3B: This model interprets data visualizations based on your prompts. It can generate the underlying data table from a chart or recreate the chart with Python code.
- PDF to OCR: Convert your PDFs to text—ideal for FOI records sent as images.
- Idefics-8b: A multimodal model that allows you to ask questions about images.

Explore these tools here: 👉 https://huggingface.co/JournalistsonHF

posted an update 22 days ago

New conversation in our Journalists on Hugging Face community: Exploring auto-tagging articles for taxonomy.

I've shared insights from my previous experience with fine-tuning a model for a classification task. But has anyone built a similar use case? Or are you seeking a solution for this task too? Join the discussion here: JournalistsonHF/README#2

posted an update 24 days ago

Should media organizations strike deals with big tech companies? Here are two colliding news stories about licensing:

1. The Financial Times has secured a licensing agreement with OpenAI to license its material both for training and queries on ChatGPT. It is the fifth such deal, following similar agreements with Associated Press, Axel Springer, Le Monde and Prisa Media. "Financial terms were not disclosed."

"Apart from the benefits to the FT, there are broader implications for the industry. It’s right, of course, that AI platforms pay publishers for the use of their material. OpenAI understands the importance of transparency, attribution, and compensation – all essential for us."

2. Meanwhile, French media outlet Mediapart is refusing to cash in money from Google, which it is entitled to under so-called "neighbouring rights" for the right to display their news content online.

Why? Due to issues with disclosing financial terms: "The confidentiality clauses imposed by Google today prevent us from publicizing to our readers not only the total amount paid, but also the amount Mediapart is entitled to receive."

"In our view, financial dependence on platforms is incompatible with our public service mission, which is to make the powerful face up to their responsibilities. It also seems extremely dangerous economically."

Two positions at opposite sides of the spectrum.

- The Financial Times and OpenAI strike content licensing deal
https://www.ft.com/content/33328743-ba3b-470f-a2e3-f41c3a366613

- Droits voisins : Mediapart lance la bataille de la transparence contre Google (in French) https

posted an update 27 days ago

How do Microsoft and Alphabet (Google) results compare?

Microsoft Reports Rising Revenues as A.I. Investments Bear Fruit
- 17% jump in revenue and a 20% increase in profit for the first three months of the year.
- Revenue was $61.9 billion, up from $52.9 billion a year earlier.
- Profit hit $21.9 billion, up from $18.3 billion.
- More than a fifth of that growth came from its generative A.I. services
https://www.nytimes.com/2024/04/25/technology/microsoft-earnings.html

Alphabet’s Revenue Jumps 15% to $80.5 Billion
- $80.5 billion in quarterly sales, up 15% from a year earlier. Profit climbed 36% to $23.7 billion.
- For the first time, a dividend of 20 cents per share
- It spent $12 billion on capital expenditures in the first quarter, soaring 91% from a year earlier.
https://www.nytimes.com/2024/04/25/technology/alphabet-earnings.html

Meta’s Open Source Llama 3 Is Already Nipping at OpenAI’s Heels - Wired
- "if open source models prove competitive, developers and entrepreneurs may decide to stop paying to access the latest model from OpenAI or Google and use Llama 3 or one of the other increasingly powerful open source models that are popping up."
- "Open models appear to be dropping at an impressive clip."
https://www.wired.com/story/metas-open-source-llama-3-nipping-at-openais-heels/

posted an update 28 days ago

5 interesting news stories today:

An AI startup made a hyperrealistic deepfake of me that’s so good it’s scary
- "'I think we might just have to say goodbye to finding out about the truth in a quick way,' says Sandra Wachter, a professor at the Oxford Internet Institute"
- "Synthesia uses both large language models and diffusion models to do this. Sees itself as a platform for businesses. Its bet is this: As people spend more time watching videos on YouTube and TikTok, there will be more demand for video content."
- "Synthesia’s policy is to not create avatars of people without their explicit consent. But it hasn’t been immune from abuse."
https://www.technologyreview.com/2024/04/25/1091772/new-generative-ai-avatar-deepfake-synthesia/

WIRED found thousands of ads running on Meta's social platforms promoting sexually explicit "AI girlfriend" apps.
- "Some human sex workers say the platform unfairly polices their own posts more harshly."
- "Many of the virtual women seen in ads reviewed by WIRED are lifelike—if somewhat uncanny—young, and stereotypically pornographic."
https://www.wired.com/story/ads-for-explicit-ai-girlfriends-swarming-facebook-and-instagram/

Wall Street’s Patience for a Costly A.I. Arms Race Is Waning
- "A sell-off in Meta’s stock after the company disclosed huge investments in the technology may be a sign of investor fears about tech giants’ spending."
- "The company plans to spend $35 billion to $40 billion this year — much of that on the technology."
https://www.nytimes.com/2024/04/25/business/dealbook/meta-artificial-intelligence-spending.html

Saudi Arabia Spends Big to Become an A.I. Superpower
https://www.nytimes.com/2024/04/25/technology/to-the-future-saudi-arabia-spends-big-to-become-an-ai-superpower.html

UK competition watchdog steps up scrutiny of big tech’s role in AI startups
https://www.theguardian.com/technology/2024/apr/24/uk-competition-watchdog-steps-up-scrutiny-of-big-techs-role-in-ai-startups-cma-microsoft-amazon

replied to their post 28 days ago

Teamwork ;)

posted an update 29 days ago

It's been only a week since I joined 🤗 and the community has released a constant flow of content!

Notable models:
- Apple OpenELM apple/openelm-instruct-models-6619ad295d7ae9f868b759ca + apple/openelm-pretrained-models-6619ac6ca12a10bd0d0df89e
- HuggingFaceM4 Idefics2 HuggingFaceM4/idefics2-8b
- Meta Llama 3 meta-llama/meta-llama-3-66214712577ca38149ebb2b6
- Microsoft Phi-3 microsoft/phi-3-6626e15e9585a200d2d761e3
- Snowflake Arctic Snowflake/arctic-66290090abe542894a5ac520

Great datasets:
- HuggingFaceFW FineWeb HuggingFaceFW/fineweb
- HuggingFaceM4/the_cauldron HuggingFaceM4/the_cauldron
- PleIAs/YouTube-Commons PleIAs/YouTube-Commons

Fascinating Spaces
- InstantMesh TencentARC/InstantMesh
- Chat with Llama 3 8B ysharma/Chat_with_Meta_llama3_8b
- Parler-TTS parler-tts/parler_tts_mini
- AI Jukebox enzostvs/ai-jukebox
- CosXL multimodalart/cosxl
- Singing songstarter nateraw/singing-songstarter
- Play with Idefics2 8B https://huggingface.co/spaces/HuggingFaceM4/idefics-8b
- CodeQwen1.5-7B-Chat Bot👾 https://huggingface.co/spaces/Qwen/CodeQwen1.5-7b-Chat-demo

I expected to be at the center of AI development. I'm not disappointed!

posted an update 30 days ago

Testing the Phi-3-mini 4k on HuggingChat. How well can it craft a tweet?

Not bad at all:

Excited to unveil phi-3-mini, a compact yet powerful 3.8B parameter model, outperforming giants like Mixtral & GPT-3.5 on benchmarks & safe for phones! *
#AI #Phi3 #LanguageModel #Techinnovation #Phi3Miniml

The models are here:
- Phi-3-Mini-4K-Instruct: microsoft/Phi-3-mini-4k-instruct
- Phi-3-Mini-128K-Instruct: microsoft/Phi-3-mini-128k-instruct

Try it out in Hugging Chat: https://huggingface.co/chat/models/microsoft/Phi-3-mini-4k-instruct

posted an update about 1 month ago

Meta Llama 3 70B landed on the Leaderboard at the 11th position: HuggingFaceH4/open_llm_leaderboard

posted an update about 1 month ago

Love this new Space built by @enzostvs + @Xenova for Transformers.js: Generate your own AI music (In-browser generation) with AI Jukebox

enzostvs/ai-jukebox

posted an update about 1 month ago

Open-source AI on your phone? The HuggingChat app is out for iOS, with the best models: Command R, Zephyr Orpo, Mixtral, Gemma... https://apps.apple.com/ca/app/huggingchat/id6476778843?l=fr-CA
