
Following your example:
Your FoF-2 = FoF-1; as it stands, this biases the dataset by over-weighting/over-saturating the same argument as if it were two different ones.
https://argdown.org/syntax/#equivalence-classes
They should look like this:
<Focus on Fundamentals>: Restricting access to fan fiction and social media in schools allows students to prioritize core academic subjects and develop a solid foundation in STEM fields, literature, and critical thinking.
<Focus on Fundamentals>: By limiting access to non-academic online content, schools can redirect students' attention to foundational subjects, fostering a stronger understanding of complex concepts and better retention of critical information.
leading to the following:
[Learning Over Leisure]: Schools should restrict students' access to fan fiction and social media to protect the integrity of education.
  <- <Restriction Infringes on Freedom of Expression>: Restricting access to fan fiction and social media unconstitutionally limits students' right to freedom of expression and stifles their creativity.
    <+ <Lifelong Learning>: By exercising their freedom of expression, students develop essential skills in critical thinking, problem-solving, and effective communication, preparing them for success in their future careers and personal lives.
      <- <Echo Chamber Effect>: Exercising freedom of expression in an unstructured environment can create an echo chamber where students only communicate with like-minded individuals, failing to develop the skills to engage with diverse perspectives and opposing views.
        <- <Silent Observer>: Developing skills to engage with diverse perspectives and opposing views is not essential for effective communication in situations where listening and observing, rather than actively engaging, is the most effective strategy.
    <- <Fan Fiction Distortion>: Fan fiction and social media often distort students' creativity by promoting unoriginal and copyrighted content, rather than fostering genuine artistic expression.
      <- <Artistic Evolution>: The value of artistic expression lies in its ability to evoke emotions and spark new ideas, regardless of whether it is original or builds upon existing works, making the distinction between original and unoriginal content irrelevant.
    <+ <Innovation Incubator>: Unrestricted freedom of expression enables students to develop critical thinking, problem-solving, and communication skills, essential for academic and professional success.
  <+ <Focus on Fundamentals>: Restricting access to fan fiction and social media in schools allows students to prioritize core academic subjects and develop a solid foundation in STEM fields, literature, and critical thinking.
  <+ <Focus on Fundamentals>: By limiting access to non-academic online content, schools can redirect students' attention to foundational subjects, fostering a stronger understanding of complex concepts and better retention of critical information.
    <+ <Knowledge Pyramid>: A strong grasp of foundational subjects allows students to recognize relationships between different ideas and concepts, creating a hierarchical structure of knowledge that enhances retention and recall of critical information.
Problem solved: both formulations now share one title, so Argdown treats them as a single equivalence class. Now we need to fix the dataset.
Pass all JSON files through the script below; a sketch of the input shape it expects comes first.
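A minimal sketch of the expected input, inferred from the script itself (a top-level "nodes" array whose entries carry a "label"; the example labels are taken from the log further down, and any extra keys on a node pass through untouched):
{
  "nodes": [
    { "label": "Biased Benchmarks" },
    { "label": "Biased Benchmark" }
  ]
}
The script: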
#!/usr/bin/env python3
"""
Script to fix "almost duplicated" labels in a debate JSON.
It reads an input JSON file (with a "nodes" array where each node has a "label"),
finds labels that are very similar (according to a fuzzy-match threshold),
and then updates all such nodes to share a canonical label.
"""
import json
import sys
import logging
import argparse
from difflib import SequenceMatcher
from typing import List, Dict, Any

# Set up logging configuration
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between two strings (0 to 1)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_labels(labels: List[str], threshold: float = 0.90) -> Dict[str, str]:
    """
    Given a list of labels, return a dictionary mapping each label to a canonical label.
    Two labels that are at least 'threshold' similar will be treated as duplicates.
    (The first label encountered becomes the canonical version.)
    """
    canonical: Dict[str, str] = {}
    unique_labels = list(set(labels))  # unique labels in no particular order
    unique_labels.sort()  # sort for consistency

    # Build clusters by iterating over the unique labels.
    for i, label in enumerate(unique_labels):
        if label in canonical:
            continue
        canonical[label] = label  # label becomes its own canonical version
        for other_label in unique_labels[i + 1:]:
            if other_label in canonical:
                continue
            if similarity(label, other_label) >= threshold:
                canonical[other_label] = label
    return canonical


def fix_labels(data: Dict[str, Any], threshold: float = 0.90) -> Dict[str, Any]:
    """
    Given a debate JSON object (with a "nodes" key), fix labels by unifying similar ones.
    Returns the modified JSON object.
    """
    if "nodes" not in data:
        logging.error("No 'nodes' key found in JSON data.")
        return data
    nodes = data["nodes"]
    if not isinstance(nodes, list):
        logging.error("'nodes' should be a list.")
        return data

    # Extract all labels; if a node doesn't have a "label", default to an empty string.
    labels = [node.get("label", "") for node in nodes if isinstance(node, dict)]

    # Build mapping from each label to its canonical version.
    mapping = cluster_labels(labels, threshold=threshold)
    logging.info("Found %d unique labels; mapping to canonical labels:", len(mapping))
    for key, canonical_label in mapping.items():
        if key != canonical_label:
            logging.info(" %r --> %r", key, canonical_label)

    # Update each node's label using the mapping.
    for node in nodes:
        if isinstance(node, dict):
            original_label = node.get("label", "")
            if original_label in mapping:
                node["label"] = mapping[original_label]
    return data


def parse_args() -> argparse.Namespace:
    """Parse command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Fix almost duplicated labels in a debate JSON file."
    )
    parser.add_argument("input_file", help="Path to the input JSON file.")
    parser.add_argument("output_file", help="Path where the fixed JSON will be saved.")
    parser.add_argument(
        "--threshold", type=float, default=0.90,
        help="Fuzzy matching threshold (default: 0.90)."
    )
    return parser.parse_args()


def main() -> None:
    args = parse_args()

    # Load JSON data from file with error handling.
    try:
        with open(args.input_file, "r", encoding="utf-8") as infile:
            data = json.load(infile)
    except FileNotFoundError:
        logging.error("Input file '%s' not found.", args.input_file)
        sys.exit(1)
    except json.JSONDecodeError as e:
        logging.error("Error decoding JSON from '%s': %s", args.input_file, e)
        sys.exit(1)
    except Exception as e:
        logging.error("An unexpected error occurred while reading '%s': %s", args.input_file, e)
        sys.exit(1)

    # Fix labels in the data.
    fixed_data = fix_labels(data, threshold=args.threshold)

    # Write the fixed data to the output file with error handling.
    try:
        with open(args.output_file, "w", encoding="utf-8") as outfile:
            json.dump(fixed_data, outfile, indent=2, ensure_ascii=False)
    except Exception as e:
        logging.error("An error occurred while writing to '%s': %s", args.output_file, e)
        sys.exit(1)

    logging.info("Fixed JSON written to '%s'", args.output_file)


if __name__ == "__main__":
    main()
we get this stdout:
λ python fix_labels.py input.json output.json
INFO: Found 638 unique labels; mapping to canonical labels:
INFO: 'Algorithmic Bias Amplification' --> 'Algorithmic Amplification'
INFO: 'Biased Benchmarks' --> 'Biased Benchmark'
INFO: 'Crime Deterrent' --> 'Crime Deterrence'
INFO: 'Dataset Augmentation' --> 'Data Augmentation'
INFO: 'Data Deserts' --> 'Data Desert'
INFO: 'Diverse Datasets' --> 'Diverse Data Sets'
INFO: 'Surveillance Slippery Slope' --> 'Mass Surveillance Slippery Slope'
INFO: 'National Security Exemption' --> 'National Security Exception'
INFO: 'Protecting the Vulnerable:' --> 'Protecting the Vulnerable'
INFO: 'Redundant Safeguards' --> 'Redundancy Safeguard'
INFO: Fixed JSON written to 'output.json'
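As a sanity check on the 0.90 default, here is what SequenceMatcher scores two of the merged pairs (a minimal sketch; the pairs are taken from the log above):
from difflib import SequenceMatcher

# Pairs from the merge log above; both score at or above the 0.90 default,
# which is why cluster_labels() unified them.
for a, b in [("Biased Benchmark", "Biased Benchmarks"),
             ("Crime Deterrence", "Crime Deterrent")]:
    print(f"{a!r} vs {b!r}: {SequenceMatcher(None, a, b).ratio():.3f}")
# 'Biased Benchmark' vs 'Biased Benchmarks': 0.970
# 'Crime Deterrence' vs 'Crime Deterrent': 0.903
Genuinely different titles score well below that, so the default stays conservative.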
All you need to do is adapt main and make a pass through the whole dataset (see the sketch below). As it stands, your dataset is bad practice.
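A minimal sketch of that pass-through, assuming the dataset lives as a directory of .json files (the directory names are hypothetical; it reuses fix_labels from the script above):
#!/usr/bin/env python3
"""Batch variant: run fix_labels over every JSON file in a directory."""
import json
import logging
from pathlib import Path

from fix_labels import fix_labels  # the script above, imported as a module

logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')


def fix_dataset(input_dir: str, output_dir: str, threshold: float = 0.90) -> None:
    """Apply fix_labels to every *.json file in input_dir, writing results to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(input_dir).glob("*.json")):
        with path.open("r", encoding="utf-8") as infile:
            data = json.load(infile)
        fixed = fix_labels(data, threshold=threshold)
        with (out / path.name).open("w", encoding="utf-8") as outfile:
            json.dump(fixed, outfile, indent=2, ensure_ascii=False)
        logging.info("Fixed %s", path.name)


if __name__ == "__main__":
    fix_dataset("dataset/", "dataset_fixed/")  # hypothetical paths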
Credits: me, the Argdown docs, and AI for [code review] and [error handling].