
Margaux Ammour

mammour

AI & ML interests

Instead of looking for something that is potentially harmful, better to grasp what we can already make happen. Artificially Augmented Intelligence advocate.

Recent Activity

Organizations

Ammour family

mammour's activity

New activity in DebateLabKIT/syncialo-raw 22 days ago

Bad practices correction

#1 opened 22 days ago by
mammour

Following your example:
Your FoF-2 = FoF-1; as it stands, it biases the dataset by over-weighting/over-saturating the same argument as two different ones.

https://argdown.org/syntax/#equivalence-classes

They should look like this:

<Focus on Fundamentals>: Restricting access to fan fiction and social media in schools allows students to prioritize core academic subjects and develop a solid foundation in STEM fields, literature, and critical thinking.
<Focus on Fundamentals>: By limiting access to non-academic online content, schools can redirect students' attention to foundational subjects, fostering a stronger understanding of complex concepts and better retention of critical information.

leading to the following:

[Learning Over Leisure]: Schools should restrict students' access to fan fiction and social media to protect the integrity of education. 
    <- <Restriction Infringes on Freedom of Expression>: Restricting access to fan fiction and social media unconstitutionally limits students' right to freedom of expression and stifles their creativity.
        <+ <Lifelong Learning>: By exercising their freedom of expression, students develop essential skills in critical thinking, problem-solving, and effective communication, preparing them for success in their future careers and personal lives.
        <- <Echo Chamber Effect>: Exercising freedom of expression in an unstructured environment can create an echo chamber where students only communicate with like-minded individuals, failing to develop the skills to engage with diverse perspectives and opposing views.
            <- <Silent Observer>: Developing skills to engage with diverse perspectives and opposing views is not essential for effective communication in situations where listening and observing, rather than actively engaging, is the most effective strategy.
        <- <Fan Fiction Distortion>: Fan fiction and social media often distort students' creativity by promoting unoriginal and copyrighted content, rather than fostering genuine artistic expression.
            <- <Artistic Evolution>: The value of artistic expression lies in its ability to evoke emotions and spark new ideas, regardless of whether it is original or builds upon existing works, making the distinction between original and unoriginal content irrelevant.
        <+ <Innovation Incubator>: Unrestricted freedom of expression enables students to develop critical thinking, problem-solving, and communication skills, essential for academic and professional success.
    <+ <Focus on Fundamentals>: Restricting access to fan fiction and social media in schools allows students to prioritize core academic subjects and develop a solid foundation in STEM fields, literature, and critical thinking.
    <+ <Focus on Fundamentals>: By limiting access to non-academic online content, schools can redirect students' attention to foundational subjects, fostering a stronger understanding of complex concepts and better retention of critical information.
        <+ <Knowledge Pyramid>: A strong grasp of foundational subjects allows students to recognize relationships between different ideas and concepts, creating a hierarchical structure of knowledge that enhances retention and recall of critical information.

Problem solved; now we need to fix the dataset:

Pass all JSONs through:

#!/usr/bin/env python3
"""
Script to fix "almost duplicated" labels in a debate JSON.
It reads an input JSON file (with a "nodes" array where each node has a "label"),
finds labels that are very similar (according to a fuzzy-match threshold),
and then updates all such nodes to share a canonical label.
"""

import json
import sys
import logging
import argparse
from difflib import SequenceMatcher
from typing import List, Dict, Any

# Set up logging configuration
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between two strings (0 to 1)."""
    return SequenceMatcher(None, a, b).ratio()

def cluster_labels(labels: List[str], threshold: float = 0.90) -> Dict[str, str]:
    """
    Given a list of labels, return a dictionary mapping each label to a canonical label.
    Two labels that are at least 'threshold' similar will be treated as duplicates.
    (The first label encountered becomes the canonical version.)
    """
    canonical: Dict[str, str] = {}
    unique_labels = list(set(labels))  # unique labels in no particular order
    unique_labels.sort()  # sort for consistency

    # Build clusters by iterating over the unique labels.
    for i, label in enumerate(unique_labels):
        if label in canonical:
            continue
        canonical[label] = label  # label becomes its own canonical version
        for other_label in unique_labels[i + 1:]:
            if other_label in canonical:
                continue
            if similarity(label, other_label) >= threshold:
                canonical[other_label] = label
    return canonical

def fix_labels(data: Dict[str, Any], threshold: float = 0.90) -> Dict[str, Any]:
    """
    Given a debate JSON object (with a "nodes" key), fix labels by unifying similar ones.
    Returns the modified JSON object.
    """
    if "nodes" not in data:
        logging.error("No 'nodes' key found in JSON data.")
        return data

    nodes = data["nodes"]
    if not isinstance(nodes, list):
        logging.error("'nodes' should be a list.")
        return data

    # Extract all labels; if a node doesn't have a "label", default to an empty string.
    labels = [node.get("label", "") for node in nodes if isinstance(node, dict)]
    
    # Build mapping from each label to its canonical version.
    mapping = cluster_labels(labels, threshold=threshold)
    logging.info("Found %d unique labels; mapping to canonical labels:", len(mapping))
    for key, canonical_label in mapping.items():
        if key != canonical_label:
            logging.info("  %r --> %r", key, canonical_label)

    # Update each node's label using the mapping.
    for node in nodes:
        if isinstance(node, dict):
            original_label = node.get("label", "")
            if original_label in mapping:
                node["label"] = mapping[original_label]
    return data

def parse_args() -> argparse.Namespace:
    """Parse command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Fix almost duplicated labels in a debate JSON file."
    )
    parser.add_argument("input_file", help="Path to the input JSON file.")
    parser.add_argument("output_file", help="Path where the fixed JSON will be saved.")
    parser.add_argument(
        "--threshold", type=float, default=0.90,
        help="Fuzzy matching threshold (default: 0.90)."
    )
    return parser.parse_args()

def main() -> None:
    args = parse_args()

    # Load JSON data from file with error handling.
    try:
        with open(args.input_file, "r", encoding="utf-8") as infile:
            data = json.load(infile)
    except FileNotFoundError:
        logging.error("Input file '%s' not found.", args.input_file)
        sys.exit(1)
    except json.JSONDecodeError as e:
        logging.error("Error decoding JSON from '%s': %s", args.input_file, e)
        sys.exit(1)
    except Exception as e:
        logging.error("An unexpected error occurred while reading '%s': %s", args.input_file, e)
        sys.exit(1)

    # Fix labels in the data.
    fixed_data = fix_labels(data, threshold=args.threshold)

    # Write the fixed data to the output file with error handling.
    try:
        with open(args.output_file, "w", encoding="utf-8") as outfile:
            json.dump(fixed_data, outfile, indent=2, ensure_ascii=False)
    except Exception as e:
        logging.error("An error occurred while writing to '%s': %s", args.output_file, e)
        sys.exit(1)

    logging.info("Fixed JSON written to '%s'", args.output_file)

if __name__ == "__main__":
    main()

with https://huggingface.co/datasets/DebateLabKIT/syncialo-raw/raw/main/data/synthetic_corpus-001/train/debate-train-0444/node_link_data-debate-train-0444.json

we get this stdout:

λ python fix_labels.py input.json output.json
INFO: Found 638 unique labels; mapping to canonical labels:
INFO:   'Algorithmic Bias Amplification' --> 'Algorithmic Amplification'
INFO:   'Biased Benchmarks' --> 'Biased Benchmark'
INFO:   'Crime Deterrent' --> 'Crime Deterrence'
INFO:   'Dataset Augmentation' --> 'Data Augmentation'
INFO:   'Data Deserts' --> 'Data Desert'
INFO:   'Diverse Datasets' --> 'Diverse Data Sets'
INFO:   'Surveillance Slippery Slope' --> 'Mass Surveillance Slippery Slope'
INFO:   'National Security Exemption' --> 'National Security Exception'
INFO:   'Protecting the Vulnerable:' --> 'Protecting the Vulnerable'
INFO:   'Redundant Safeguards' --> 'Redundancy Safeguard'
INFO: Fixed JSON written to 'output.json'

All you need to do is adapt main and run a pass through every file; a minimal sketch follows. At the moment, your dataset is bad practice.
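For instance, a pass-through sketch reusing fix_labels() from the script above (the glob pattern mirrors the file path linked above, and the in-place rewrite is an assumption; adapt both to the actual repository layout):

import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

def fix_all(root: str, threshold: float = 0.90) -> None:
    """Apply fix_labels() to every node_link_data JSON under 'root',
    rewriting each file in place."""
    for path in sorted(Path(root).glob("**/node_link_data-*.json")):
        with path.open("r", encoding="utf-8") as infile:
            data = json.load(infile)
        fixed = fix_labels(data, threshold=threshold)  # defined in the script above
        with path.open("w", encoding="utf-8") as outfile:
            json.dump(fixed, outfile, indent=2, ensure_ascii=False)
        logging.info("Rewrote '%s' in place", path)

fix_all("data")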

Credits: me, Argdown docs, AI for [code review] and [error handling].

New activity in mistralai/Mistral-Small-24B-Instruct-2501 23 days ago

Mistral Small 24 B

4
#19 opened 26 days ago by
HandsomeMagyar

Why increase censorship?

21
#20 opened 26 days ago by
notafraud
New activity in Qwen/QwQ-32B-Preview 3 months ago

multi GPU inferencing

2
#18 opened 3 months ago by
cjj2003
New activity in Jacoby746/Casual-Magnum-34B-exl2-4.0bpw 5 months ago

Error during inference

7
#1 opened 5 months ago by
Jellon
reacted to singhsidhukuldeep's post with 👍 5 months ago
Researchers have developed a novel approach called Logic-of-Thought (LoT) that significantly enhances the logical reasoning capabilities of large language models (LLMs).

Here are the steps for how Logic-of-Thought (LoT) is implemented:

-- 1. Logic Extraction

1. Use Large Language Models (LLMs) to identify sentences containing conditional reasoning relationships from the input context.
2. Generate a collection of sentences with logical relationships.
3. Use LLMs to extract the set of propositional symbols and logical expressions from the collection.
4. Identify propositions with similar meanings and represent them using identical propositional symbols.
5. Analyze the logical relationships between propositions based on their natural language descriptions.
6. Add negation (¬) for propositions that express opposite meanings.
7. Use implication (→) to connect propositional symbols when a conditional relationship exists.

-- 2. Logic Extension

1. Apply logical reasoning laws to the collection of logical expressions from the Logic Extraction phase.
2. Use a Python program to implement logical deduction and expand the expressions.
3. Apply logical laws such as Double Negation, Contraposition, and Transitivity to derive new logical expressions.
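A minimal, hypothetical sketch of this expansion in Python, assuming implications from Logic Extraction are encoded as (antecedent, consequent) tuples and negation as a "¬" prefix (this encoding and the helper names are illustrative, not the paper's code):

def negate(p: str) -> str:
    """Negate a symbol, applying Double Negation (¬¬A becomes A)."""
    return p[1:] if p.startswith("¬") else "¬" + p

def extend(implications: set) -> set:
    """Expand a set of implications with Contraposition and
    Transitivity until no new expressions appear."""
    expanded = set(implications)
    changed = True
    while changed:
        changed = False
        for a, b in list(expanded):
            # Contraposition: A → B entails ¬B → ¬A.
            contra = (negate(b), negate(a))
            if contra not in expanded:
                expanded.add(contra)
                changed = True
        for a, b in list(expanded):
            for c, d in list(expanded):
                # Transitivity: A → B and B → C entail A → C.
                if b == c and a != d and (a, d) not in expanded:
                    expanded.add((a, d))
                    changed = True
    return expanded

print(extend({("A", "B"), ("B", "C")}))
# derives ("A", "C"), ("¬B", "¬A"), ("¬C", "¬B"), ("¬C", "¬A"), ...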

-- 3. Logic Translation

1. Use LLMs to translate the newly generated logical expressions into natural language descriptions.
2. Combine the natural language descriptions of propositional symbols according to the extended logical expressions.
3. Incorporate the translated logical information as a new part of the original input prompt.
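Under the same illustrative encoding, Logic Translation could render the derived implications back into natural language before appending them to the prompt (the glosses dict and helper names are assumptions for the sketch):

def describe(p: str, glosses: dict) -> str:
    """Render a (possibly negated) symbol as natural language."""
    if p.startswith("¬"):
        return "it is not the case that " + glosses[p[1:]]
    return glosses[p]

def translate(expr: tuple, glosses: dict) -> str:
    """Render an implication A → B as an if-then sentence that can be
    appended to the original prompt as extra logical context."""
    a, b = expr
    return f"If {describe(a, glosses)}, then {describe(b, glosses)}."

glosses = {"A": "it rains", "B": "the ground gets wet"}
print(translate(("¬B", "¬A"), glosses))
# If it is not the case that the ground gets wet, then it is not the case that it rains.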

-- 4. Integration with Existing Prompting Methods

1. Combine the LoT-generated logical information with the original prompt.
2. Use this enhanced prompt with existing prompting methods like Chain-of-Thought (CoT), Self-Consistency (SC), or Tree-of-Thoughts (ToT).
3. Feed the augmented prompt to the LLM to generate the final answer.

What do you think about LoT?
updated a Space 5 months ago
reacted to bartowski's post with 👍 6 months ago
@victor (is this the only way to "DM" on HF?)

Had a funny thought, would it be at all possible to rework what shows up on our personal HF page?

Picture this: I upload a model to an organization; someone who follows me now has no idea that I've uploaded a model or where, unless they also watch those repos (which also floods them with other notifications).

What if our main Huggingface page was a collection of both models that we've uploaded specifically to our profile, as well as models we've uploaded to organizations? That way it would all be contained in one central followable location, and I wouldn't have concerns about losing followership if I wanted to upload to an organization all of a sudden.
reacted to m-ric's post with 🧠 7 months ago
๐—ฆ๐—”๐—  ๐Ÿฎ ๐—ฟ๐—ฒ๐—น๐—ฒ๐—ฎ๐˜€๐—ฒ๐—ฑ: ๐—ก๐—ฒ๐˜„ ๐—ฆ๐—ข๐—ง๐—” ๐—ผ๐—ป ๐˜€๐—ฒ๐—ด๐—บ๐—ฒ๐—ป๐˜๐—ฎ๐˜๐—ถ๐—ผ๐—ป, ๐—ฏ๐˜† ๐—ฐ๐—ผ๐—บ๐—ฏ๐—ถ๐—ป๐—ถ๐—ป๐—ด ๐˜€๐˜†๐—ป๐˜๐—ต๐—ฒ๐˜๐—ถ๐—ฐ ๐—ฑ๐—ฎ๐˜๐—ฎ ๐˜„๐—ถ๐˜๐—ต ๐—ต๐˜‚๐—บ๐—ฎ๐—ป ๐—ณ๐—ฒ๐—ฒ๐—ฑ๐—ฏ๐—ฎ๐—ฐ๐—ธ ๐Ÿš€

It's a model for object segmentation, for both image and video:
👉 input = a text prompt, or a click on a specific object
👉 output = the model draws a mask around the object. In video segmentation, the mask should follow the object's movements (it is then called a masklet)

💪 SAM 2 is 6x faster than the previous version, it now also works on video, and it beats SOTA by far on both image and video segmentation tasks.

How did they pull that?

The main blocker for video segmentation was that data is really hard to collect: to build your training dataset, should you manually draw masks on every frame? That would be way too costly! ➡️ As a result, existing video segmentation datasets have a real lack of coverage: few examples, few masklets drawn.

💡 Key idea: the researchers decided to use a segmentation model to help them collect the dataset.

But then it's a chicken-and-egg problem: you need the model to create the dataset, and vice versa 🤔

⇒ To solve this, they built a data generation system that they scaled up progressively over 3 successive manual annotation phases:

๐—ฆ๐˜๐—ฒ๐—ฝ ๐Ÿญ: Annotators use only SAM + manual editing tools on each frame โ‡’ Create 16k masklets across 1.4k videos

๐—ฆ๐˜๐—ฒ๐—ฝ ๐Ÿฎ: Then train a first SAM 2, add it in the loop to temporally propagate frames, and correct by re-doing a mask manually when an error has occured โ‡’ This gets a 5.1x speedup over data collection in phase 1! ๐Ÿƒ Collect 60k masklets

๐—ฆ๐˜๐—ฒ๐—ฝ ๐Ÿฏ: Now SAM 2 is more powerful, it has the โ€œsingle clickโ€ prompting option, thus annotators can use it with simple clicks to re-annotate data.

They even add a completely automatic step to generate 350k more masklets!
And in turn, the model's performance gradually increases.

I find this a great example of combining synthetic data generation with human annotation 👍