∞🧙🏼‍♂️AnyClassifier - Generating Synthetic Data For Text Classification

Community Article Published August 19, 2024

GitHub

I am thrilled to share AnyClassifier 0.2.0 with everyone. In this version, I introduced synthetic data generation for text classification.

1. Introduction


AnyClassifier is a framework that empowers you to create high-performance classifiers without any labels or data, using minimal code. It's designed to revolutionize the machine learning development process by eliminating the need for extensive data curation and labeling.
Key features are:

  • Zero Data Required: Build classifiers from scratch, even without a dataset
  • Competitive Results: Classifiers trained on synthetic data achieve results comparable to those trained on real data
  • Multilingual Synthetic Data Generation: Generate data in any language your LLM supports
  • One-Line Implementation: Democratizing AI for everyone, and for agents (as a plugin for agentic flows)
  • LLM-Powered Synthetic Data and Annotations: Leverage SOTA LLMs for high-quality synthetic data generation and labeling
  • Designed for Real-World Use: Created by ML engineers, for ML and software engineers

2. Motivation

As machine learning engineers, we understand the challenges of building and curating high-quality labeled datasets.
The challenge grows when the dataset has to be multilingual.
AnyClassifier eliminates this bottleneck, allowing you to focus on what matters most - solving real-world problems.

3. Why Synthetic Data?

There are several motivations:

  • Model performance - the most common reason a model underperforms is a lack of data for a certain class or domain
  • Cost savings on data collection and labeling
  • Rapid prototyping - time is often the bigger cost
  • Overcoming data scarcity - data for minority classes is inherently hard to collect. For example, safety moderation requires unsafe data, which is rare compared to safe data, and real-world data might not cover all the cases one wants to detect.
  • Data privacy - although there is still a chance that a language model will reproduce its training data verbatim, this poses a much lower risk than using real data directly.

4. Synthetic Data Generation Design

We adopt a hierarchical generation approach: at the first level, the LLM is asked to suggest topics that expand on a particular label, and at the second level, it is asked to suggest subtopics of each topic.
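The two-level expansion can be sketched roughly as follows. This is a minimal illustration, not AnyClassifier's implementation: `build_topic_tree`, the `suggest` callable, the prompt wording and `fake_suggest` are all hypothetical names introduced here, and `fake_suggest` stands in for a real LLM call so the sketch runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical interface: (prompt, n) -> n short suggestions from an LLM.
SuggestFn = Callable[[str, int], List[str]]


def build_topic_tree(
    label_desc: str, suggest: SuggestFn, n_topic: int, n_subtopic: int
) -> Dict[str, List[str]]:
    """Expand a label into topics (level 1), then each topic into subtopics (level 2)."""
    tree: Dict[str, List[str]] = {}
    topics = suggest(f"Suggest {n_topic} topics for texts labelled '{label_desc}'.", n_topic)
    for topic in topics:
        tree[topic] = suggest(
            f"Suggest {n_subtopic} subtopics of '{topic}' for label '{label_desc}'.",
            n_subtopic,
        )
    return tree


def fake_suggest(prompt: str, n: int) -> List[str]:
    # Offline stand-in for a real LLM call (assumption for illustration only).
    return [f"suggestion {i} for: {prompt}" for i in range(n)]


tree = build_topic_tree("positive sentiment", fake_suggest, n_topic=2, n_subtopic=3)
for topic, subtopics in tree.items():
    print(topic, "->", len(subtopics), "subtopics")
```

In practice the `suggest` callable would wrap whichever LLM is used for generation; the tree shape (label, then topic, then subtopic) is the only part taken from the description above.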


The design is also inspired by [1], where multiple personas are simulated to cover diverse scenarios. We adopted a source-type-driven strategy instead of a persona-driven strategy because a source type usually reflects a distinct group of target audiences and is less fine-grained, which suits text classification models, as they are less data hungry in training.
Each synthesized example is grounded on a unique combination of (instruction, label, subtopic, source_type). Enumerating all of these combinations ensures the diversity of the synthetic data by design.
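A minimal sketch of enumerating the (instruction, label, subtopic, source_type) combinations is shown below. It assumes, for illustration only, that subtopics are stored per label and that one list of source types is shared by all labels; `enumerate_groundings` and most of the example values are hypothetical, not taken from the library.

```python
from itertools import product
from typing import Dict, List, Tuple

Grounding = Tuple[str, str, str, str]  # (instruction, label, subtopic, source_type)


def enumerate_groundings(
    instruction: str,
    subtopics_by_label: Dict[str, List[str]],
    source_types: List[str],
) -> List[Grounding]:
    """List every unique combination that one generation request is grounded on."""
    return [
        (instruction, label, subtopic, source_type)
        for label, subtopics in subtopics_by_label.items()
        for subtopic, source_type in product(subtopics, source_types)
    ]


groundings = enumerate_groundings(
    "Classify a text's sentiment.",
    {
        # Hypothetical subtopics, for illustration only.
        "negative sentiment": ["Predictable Plot", "Slow Pacing"],
        "positive sentiment": ["Surprising Twist", "Strong Cast"],
    },
    # Assumption: the same source types apply to every label.
    ["IMDB Movie Reviews", "Social Media Posts"],
)
print(len(groundings))  # 2 labels * 2 subtopics * 2 source types
```

Because every tuple is distinct, each generation request asks the LLM for a different kind of text, which is where the diversity comes from.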


Example (Using Llama 3.1 8B Instruct)

Synthetic Data for imdb

Label Examples
negative sentiment
  • 'I was really looking forward to this movie, but unfortunately, it was a complete disappointment. The plot was predictable and lacked any real tension. The characters were underdeveloped and their motivations were unclear. The pacing was slow and dragged on for far too long. Overall, I would not recommend this movie to anyone.'
  • "I'm extremely disappointed with the service I received at this restaurant. The hostess was unfriendly and unhelpful, and our server seemed completely overwhelmed. We had to ask multiple times for basic things like water and utensils. The food was overpriced and not even that good. Definitely will not be returning."
  • "I'm extremely disappointed with my recent purchase from this store. The quality of the product is subpar and the price is way too high. I paid $200 for a cheap-looking item that broke after just a week of use. Not worth the money at all. 1/10 would not recommend."
positive sentiment
  • "I just got tickets to see my favorite artist in concert and I'm beyond thrilled! The energy in the crowd is going to be electric! #concertseason #musiclover"
  • "I just had the most amazing experience at this restaurant! The service was lightning fast, and the food was prepared to perfection. Our server, Alex, was attentive and friendly, making sure we had everything we needed. The bill was reasonable, and we left feeling satisfied and eager to come back. 5 stars isn't enough, I'd give it 10 if I could!"
  • 'The action scenes in this movie are absolutely mind-blowing! The stunts are incredibly well-choreographed and the special effects are top-notch. I was on the edge of my seat the entire time, cheering on the heroes as they fought to save the world. The cast is also excellent, with standout performances from the lead actors. Overall, I would highly recommend this movie to anyone who loves action-packed thrill rides.'


Synthetic Data for zeroshot/twitter-financial-news-sentiment

Label Examples
Bearish
  • "I'm getting out of the market before it's too late. The Dow is plummeting and I don't see any signs of recovery. The economic indicators are all pointing to a recession and I'm not willing to risk my retirement savings. I've been in this market for years, but I think it's time to cut my losses and move to cash. Has anyone else seen this coming?"
  • "Bloomberg: Tesla's Earnings Miss Estimates as Revenue Declines 20% - The electric vehicle maker's quarterly earnings fell short of analysts' expectations, with revenue plummeting 20% to $24.58 billion. The company's gross margin also declined, sparking concerns about its ability to maintain profitability. Tesla's stock price tumbled 8% in after-hours trading, as investors reacted to the disappointing results. The miss is a setback for CEO Elon Musk, who has been under pressure to deliver consistent growth. Analysts had expected Tesla to report earnings of $0.55 per share, but the company came in at $0.35 per share. The revenue decline was driven by a 15% drop in automotive sales, as well as a 10% decline in energy generation and storage sales. The company's guidance for the current quarter also fell short of expectations, with Tesla forecasting revenue of $24 billion, below the consensus estimate of $25.5 billion. The earnings miss is a reminder that the electric vehicle market remains highly competitive, and that Tesla faces significant challenges in maintaining its market share."
  • "I'm getting out of the market before it's too late. The Dow just plummeted 500 points and I'm not willing to risk losing my shirt. The economic indicators are all pointing to a recession and I'm not convinced that the Fed can stem the tide. Anyone else bailing out before the ship sinks?"
Neutral
  • "Earnings Report: Tech Giant's Revenue Beats Expectations, But Margins Narrow. Despite a 10% increase in revenue, the company's net income fell short of forecasts due to higher operating expenses. The stock price remained relatively stable, with investors taking a cautious approach ahead of the company's upcoming product launch."
  • 'Just got back from a meeting with my financial advisor and we discussed the latest quarterly earnings reports. No major surprises, just a steady performance from the market. Nothing to get too excited about, but nothing to worry about either. #stockmarket #investing'
  • 'The proposed merger between Coca-Cola and Coca-Cola European Partners is expected to create a beverage giant with a combined market value of over $100 billion. While the deal is still pending regulatory approval, analysts are cautiously optimistic about its potential to drive growth and increase efficiency. The merged entity is likely to benefit from economies of scale and a stronger presence in key markets, but some investors are concerned about the potential impact on jobs and supply chains. As the deal moves forward, investors will be closely watching for any signs of regulatory hurdles or other issues that could delay or derail the merger.'
Bullish
  • "Just got my hands on the new #Robinhood app update! They've finally added a feature to track my portfolio's performance in real-time. Huge shoutout to the @RobinhoodApp team for listening to user feedback and delivering on their promises! #investing #stockmarket #bullish"
  • "Just got out of the @AAPL earnings call and I'm feeling super bullish about the company's future. The guidance on EPS is looking strong and I think we're on the verge of a major breakout. #AAPL #StockMarket #Earnings"
  • "Just heard that the Federal Reserve is considering cutting interest rates to stimulate the economy. This is a huge positive for the stock market and I'm expecting a strong rally in the coming weeks. Anyone else feeling bullish about the market right now?"


Synthetic Data for ccdv/arxiv-classification

Label Examples
Neural and Evolutionary
  • 'Evolutionary algorithms have been widely used in various optimization problems due to their ability to efficiently search for optimal solutions. In this paper, we propose a novel hybrid approach that combines the strengths of genetic algorithms and differential evolution to solve complex optimization problems. The proposed method, called GEDE, integrates the exploration capabilities of genetic algorithms with the exploitation capabilities of differential evolution. We evaluate the performance of GEDE on several benchmark problems and compare it with other state-of-the-art algorithms. The results show that GEDE outperforms the other algorithms in terms of convergence speed and solution quality. We also analyze the convergence behavior of GEDE and provide insights into its performance. The proposed approach has the potential to be applied to a wide range of optimization problems in various fields, including engineering, economics, and computer science.'
  • 'Title: A Deep Learning Approach for Sentiment Analysis of Text Data\nAbstract: This paper proposes a novel deep learning model for sentiment analysis of text data. The proposed model combines the strengths of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to effectively capture the spatial and temporal dependencies in text data. Experimental results on several benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art methods. The proposed model achieves an accuracy of 92.5% on the IMDB dataset, outperforming the best existing method by 2.5%. The results also show that the proposed model is robust to noise and can handle out-of-vocabulary words. The proposed model is a significant contribution to the field of natural language processing and has the potential to be applied to various real-world applications.'
  • 'Title: A Novel Hybrid Approach for Deep Learning-based Optimization of Evolutionary Algorithms\n\nAbstract: This paper proposes a novel hybrid approach that combines the strengths of deep learning and evolutionary algorithms to optimize complex optimization problems. We introduce a new neural network architecture that learns to adapt the parameters of evolutionary algorithms in real-time, leading to improved convergence rates and better solution quality. Our approach is evaluated on a range of benchmark problems and compared to state-of-the-art methods. The results show that our hybrid approach outperforms existing methods in terms of convergence speed and solution quality. We also provide a comprehensive analysis of the proposed approach and discuss its potential applications in various fields.\n\nKeywords: Evolutionary algorithms, Deep learning, Optimization, Hybrid approach, Neural networks.'
Statistics Theory
  • 'Title: A New Perspective on the Generalization Error of Support Vector Machines\nAbstract: We provide a new bound on the generalization error of support vector machines (SVMs) in terms of the Rademacher complexity of the reproducing kernel Hilbert space (RKHS) of the kernel. Our bound is tighter than existing bounds and has a simpler form. We also provide a new algorithm for learning the kernel, which is based on the idea of minimizing the empirical risk with respect to the RKHS norm. We demonstrate the effectiveness of our approach on several benchmark datasets.'
  • 'Title: A Bayesian Approach to Hypothesis Testing for High-Dimensional Data\n\nAbstract: Hypothesis testing is a fundamental problem in statistics, and its applications are widespread in various fields. However, the traditional methods of hypothesis testing often fail to perform well in high-dimensional data settings. In this paper, we propose a novel Bayesian approach to hypothesis testing for high-dimensional data. Our method combines the strengths of Bayesian inference and dimensionality reduction techniques to provide a robust and efficient solution to the hypothesis testing problem. We demonstrate the effectiveness of our approach through extensive simulations and real-world experiments on high-dimensional data sets. The results show that our method outperforms existing methods in terms of accuracy and computational efficiency. Furthermore, we provide a theoretical analysis of our approach, which provides insights into its performance and limitations. Our method has the potential to be applied to a wide range of applications, including image analysis, genomics, and finance. The code and data used in this paper are available online for reproducibility purposes.'
  • 'Title: Bayesian Network Learning with Gaussian Process Priors for Uncertainty Quantification in High-Dimensional Systems\n\nAbstract: Bayesian networks are a powerful tool for modeling complex systems with uncertainty. However, in high-dimensional systems, the computational cost of learning Bayesian networks can be prohibitively expensive. In this paper, we propose a novel approach to Bayesian network learning using Gaussian process priors. Our approach, which we call Bayesian network learning with Gaussian process priors (BN-GP), leverages the flexibility of Gaussian processes to model the uncertainty in the network structure. We demonstrate the effectiveness of BN-GP on several high-dimensional systems, including a synthetic dataset and a real-world dataset from the field of systems biology. Our results show that BN-GP can learn accurate Bayesian networks with significantly reduced computational cost compared to traditional methods. Furthermore, we provide a theoretical analysis of the convergence properties of BN-GP, which shows that it can learn consistent estimates of the network structure even in the presence of high-dimensional data. Our approach has the potential to enable the widespread adoption of Bayesian networks in high-dimensional systems, where traditional methods are often infeasible.\n\nKeywords: Bayesian networks, Gaussian process priors, uncertainty quantification, high-dimensional systems, systems biology.'
Artificial Intelligence
  • 'Title: Learning Hierarchical Representations for Robust Visual Perception in Autonomous Systems\nAbstract: We propose a novel deep learning approach for visual perception in autonomous systems, which leverages hierarchical representations to improve robustness and accuracy. Our method combines a convolutional neural network (CNN) with a recurrent neural network (RNN) to learn a hierarchical representation of visual data. We evaluate our approach on several benchmark datasets and demonstrate significant improvements in performance compared to state-of-the-art methods. Our results show that the proposed approach can learn robust and accurate representations of visual data, even in the presence of significant occlusions and variations in lighting conditions. We also provide a detailed analysis of the learned representations and demonstrate their applicability to various tasks in autonomous systems. This work makes significant contributions to the field of computer vision and robotics, and has the potential to enable more robust and accurate visual perception in autonomous systems.'
  • 'Title: A Deep Learning Approach for Text Classification: A Comparative Study\n\nAbstract: Text classification is a fundamental task in natural language processing (NLP) that has numerous applications in various domains. In this paper, we propose a deep learning approach for text classification using convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We compare the performance of our proposed approach with state-of-the-art methods on several benchmark datasets. Our results show that our approach outperforms the existing methods in terms of accuracy and F1-score. We also analyze the effect of different hyperparameters on the performance of our approach and provide insights into the importance of feature extraction in text classification. This study contributes to the development of efficient and accurate text classification models using deep learning techniques.\n\nKeywords: text classification, deep learning, convolutional neural networks, recurrent neural networks, natural language processing.'
  • 'Title: Investigating the Impact of Attention Mechanisms on Deep Learning Models for Sentiment Analysis.\n\nAbstract: This paper explores the effects of incorporating attention mechanisms into deep learning models for sentiment analysis. We propose a novel architecture that combines the strengths of recurrent neural networks (RNNs) and attention mechanisms to improve the performance of sentiment analysis tasks. Our experimental results demonstrate that the proposed model outperforms state-of-the-art models in terms of accuracy and F1-score. Furthermore, we conduct an ablation study to investigate the impact of different attention mechanisms on the performance of the model. Our findings suggest that the proposed attention mechanism is more effective than other attention mechanisms in improving the performance of sentiment analysis tasks.\n\nKeywords: deep learning, attention mechanisms, sentiment analysis, natural language processing, neural networks.\n\nSource: Google Scholar.'
Computer Vision
  • 'A Novel Approach to Object Detection using Convolutional Neural Networks\n\nAbstract: Object detection is a fundamental task in computer vision, and its applications are vast in various fields. In this paper, we propose a novel approach to object detection using convolutional neural networks (CNNs). Our method, called Object Detection using CNNs (ODCNN), is based on a combination of region proposal networks (RPNs) and CNNs. We train the ODCNN model on the PASCAL VOC 2007 dataset and evaluate its performance on the PASCAL VOC 2012 dataset. The results show that our approach outperforms the state-of-the-art methods in terms of accuracy and speed. We also provide a detailed analysis of the ODCNN model and its components. The code for the ODCNN model is available at https://github.com/odcnn/odcnn.\n\nKeywords: Object detection, Convolutional neural networks, Region proposal networks, PASCAL VOC 2007, PASCAL VOC 2012.'
  • 'A Novel Object Recognition Framework for Autonomous Robots using Deep Learning and Computer Vision Techniques\n\nAbstract: This paper proposes a novel object recognition framework for autonomous robots that leverages the power of deep learning and computer vision techniques. The proposed framework consists of two stages: a detection stage and a recognition stage. In the detection stage, a convolutional neural network (CNN) is used to detect objects in the scene, while in the recognition stage, a recurrent neural network (RNN) is employed to recognize the detected objects. The proposed framework is evaluated on a dataset of images collected from a robotic platform, and the results show that it outperforms state-of-the-art methods in terms of accuracy and speed. The proposed framework has the potential to be used in various applications, including robotics, autonomous vehicles, and surveillance systems.\n\nKeywords: Object recognition, autonomous robots, deep learning, computer vision, convolutional neural networks, recurrent neural networks.\n\n'
  • 'A novel approach to image classification using convolutional neural networks (CNNs) is proposed in this paper. The proposed method, dubbed "Deep Image Classifier", leverages the power of CNNs to learn hierarchical features from images. Experimental results on several benchmark datasets, including CIFAR-10 and ImageNet, demonstrate the efficacy of the proposed method in achieving state-of-the-art performance. The code for the proposed method is made available on GitHub, allowing for easy reproduction and extension of the results. The contributions of this paper can be summarized as follows: (1) a novel CNN architecture is proposed, which consists of multiple convolutional and pooling layers, followed by fully connected layers; (2) a novel training strategy is proposed, which involves data augmentation and batch normalization; (3) the proposed method is evaluated on several benchmark datasets, and the results are compared with state-of-the-art methods. The results of this paper demonstrate the potential of CNNs in image classification tasks, and provide a new benchmark for future research in this area.'


Synthetic Data for fancyzhx/ag_news

Label Examples
Sports
  • 'Hamburg hampered by Lauth knock Hamburg SV striker Benjamin Lauth will be sidelined for up to four weeks because of complications to a fractured foot and perhaps longer if surgery is required, coach Klaus Toppmoeller said on Wednesday.'
  • 'Keane Pleads Not Guilty to Assault Charges (AP) AP - Manchester United captain Roy Keane pleaded not guilty to all three charges Thursday over an alleged confrontation with a 16-year-old boy.'
  • 'NBA Game Summary - San Antonio at Chicago Chicago, IL (Sports Network) - Tony Parker scored 17 points and had five assists to lead a balanced San Antonio attack that handed the Spurs a 91-75 victory over the Chicago Bulls at the United Center.'
Business
  • 'Forex: Dollar Falls After Fed Rate Hike NEW YORK (Reuters) - The dollar extended its losses on Tuesday after the Federal Reserve raised interest rates as expected but signaled that both inflation and inflation expectations were easing.'
  • 'Ameritrade Posts November Client Trades Ameritrade Holding Corp., a provider of brokerage services for individual investors, said Friday that daily average client trades in November reached 183,000, with 29,000 new accounts opened during the month.'
  • 'Firefox browser sees surge in use A sudden, measurable decline in market share in any product over the course of a few months says something, even if that product is one whose producer still holds about 90 of the market in question.'
World
  • 'Leaders Attend UAE President #39;s Funeral The United Arab Emirates appointed Sheik Khalifa bin Zayed Al Nahyan as its president Wednesday, hours after burying his father in a funeral that attracted thousands of mourners and nine heads of state to this desert nation on the Arabian Peninsula.'
  • 'Report: Tobacco Industry Hid Smoking Dangers NEW YORK (Reuters Health) - The tobacco industry for many years claimed that it was unaware of biological evidence that smoking is harmful to health, but that was untrue according to a medical journal report.'
  • 'Telenor urges fair regulatory system in Thailand (FT.com) FT.com - Telenor, the Norwegian telecommunications company, on Thursday called for "a level-playing field" in Thailand's mobile industry, urging a newly-established Thai telecoms regulator swiftly to create a fair new interconnection regime.'
Sci/Tech
  • 'Microsoft Takes Lead in Software For Handhelds Microsoft has unseated the Palm system with worldwide sales of more than 1.3 million units over the third quarter of the year, compared with slightly more than 850,000 for the Palm, according to a new report. <FONT face="verdana,MS Sans Serif,arial,helvetica" size="-2" color="#666666"><B>-The Washington Post</B></FONT>'
  • 'Telstra launches international Wi-fi roaming Telstra has launched Wi-fi roaming with five international wireless broadband operators giving Telstra customers travelling abroad access to WiFi hotspots in the UK (BT Group), USA (T-Mobile USA), Japan (NTT DoCoMo), Singapore (StarHub) and Malaysia (Maxis '
  • 'Passwords Fail To Defend Enterprises (TechWeb) TechWeb - Passwords, the dominant form of securing enterprise assets, are a failure, a research firm says.'


Synthetic Data for Languages Other Than English

kenhktsui/chinese_sentiment_syn

5. Benchmarking

We tested on multiple datasets:

  • stanfordnlp/imdb
  • zeroshot/twitter-financial-news-sentiment
  • ccdv/arxiv-classification
  • lmsys/toxic-chat
  • fancyzhx/ag_news

The objective is to see whether a model trained on synthetic data performs as well as one trained on real data (the annotation column). The full training dataset column indicates the upper limit of performance, as more data is available. Model performance with synthetic data is on par with or close to that with real data, which is encouraging because test data is usually, by design, more similar to the real training data than to synthetic data. We also note that synthetic data is advantageous when classes are highly imbalanced, as in the toxic chat problem.
Our benchmark suggests that the generated synthetic data is close to the distribution of the test data, showing the effectiveness of this synthetic data generation approach without using any real data.
All models are fine-tuned from sentence-transformers/paraphrase-mpnet-base-v2 (109M). Performance can be boosted by using a larger base model and generating more data.

| dataset | metric | synthetic data generation | annotation | full training dataset | full training reference |
|---|---|---|---|---|---|
| stanfordnlp/imdb | accuracy | 0.878 | 0.908 | 0.928 | lvwerra/distilbert-imdb |
| zeroshot/twitter-financial-news-sentiment | f1 (weighted) | 0.631 | 0.676 | 0.866 | nickmuchi/finbert-tone-finetuned-fintwitter-classification |
| ccdv/arxiv-classification | accuracy | 0.618 | 0.566 | 0.805 | paper |
| lmsys/toxic-chat, toxicchat0124 | f1 (binary) | 0.362 | 0.00 | 0.822 | lmsys/toxicchat-t5-large-v1.0 |
| fancyzhx/ag_news | accuracy | 0.768 | 0.765 | 0.938 | fabriceyhc/bert-base-uncased-ag_news |
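As a reminder of what the three reported metrics measure, here is a small pure-Python sketch. It is not the evaluation code used for the benchmark; the function names and the toy labels are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], positive: Hashable) -> float:
    """Binary F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * n for c, n in support.items()) / len(y_true)


# Toy imbalanced example: 9 negatives, 1 positive, and a model that always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(accuracy(y_true, y_pred))
print(f1_for_class(y_true, y_pred, positive=1))
print(weighted_f1(y_true, y_pred))
```

In the toy example the always-negative model scores high accuracy but a binary F1 of zero on the rare class, which is why F1 is the more informative metric on imbalanced datasets.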

6. Usage

To generate synthetic data for classification, instantiate the class SyntheticDataGeneratorForSequenceClassification, then pass a description of what your classifier does and the label definitions to the generate method. The resulting dataset is a Hugging Face datasets.Dataset.

from anyclassifier.schema import Label
from anyclassifier.synthetic_data_generation import SyntheticDataGeneratorForSequenceClassification

tree_constructor = SyntheticDataGeneratorForSequenceClassification()
# Describe the task in one sentence and define each label.
dataset = tree_constructor.generate(
    "Classify a text's sentiment.",
    [
        Label(id=0, desc='negative sentiment'),
        Label(id=1, desc='positive sentiment')
    ]
)
# Upload the resulting datasets.Dataset to the Hub (replace with your own repo id).
dataset.push_to_hub('user_id/any_data')

The meta column of each row records how the synthetic example was generated.

{ 
  "source_type": "IMDB Movie Reviews",
  "subtopic": "Surprising Twist",
  "topic": "Engaging Plot"
}
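The meta column makes it easy to check how evenly the generated rows cover source types, topics and subtopics. The sketch below uses plain dictionaries in place of a real dataset and assumes that meta is available as a dict per row; the `text` and `label` field names, the sample rows and the `coverage` helper are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical rows; only the keys inside "meta" come from the example above.
rows: List[Dict] = [
    {"text": "sample a", "label": 1,
     "meta": {"source_type": "IMDB Movie Reviews", "subtopic": "Surprising Twist", "topic": "Engaging Plot"}},
    {"text": "sample b", "label": 1,
     "meta": {"source_type": "IMDB Movie Reviews", "subtopic": "Strong Cast", "topic": "Acting"}},
    {"text": "sample c", "label": 0,
     "meta": {"source_type": "Social Media Posts", "subtopic": "Slow Pacing", "topic": "Weak Plot"}},
]


def coverage(rows: List[Dict], key: str) -> Counter:
    """Count rows per value of one meta field, e.g. 'source_type'."""
    return Counter(row["meta"][key] for row in rows)


print(coverage(rows, "source_type"))
print(coverage(rows, "topic"))
```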

If you want fine-grained control to construct a larger dataset, you can specify keyword arguments. You can specify either n_record_to_generate or all of the parameters n_source_type, n_topic, n_subtopic and sample_per_subtopic. If the former is used, the implementation aims to generate that number of records, but the count might not be exact because of the unpredictability of the LLM. If the latter is used, the resulting number of records = number of labels * n_source_type * n_topic * n_subtopic * sample_per_subtopic. If there are many labels in a classification problem, it might take a while to generate the synthetic data.
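The record-count formula is easy to evaluate before launching a long generation run. The helper below is a hypothetical convenience function, not part of AnyClassifier, and the example values are arbitrary rather than library defaults.

```python
def expected_record_count(
    n_label: int,
    n_source_type: int,
    n_topic: int,
    n_subtopic: int,
    sample_per_subtopic: int,
) -> int:
    """Number of records implied by the product formula in the text."""
    return n_label * n_source_type * n_topic * n_subtopic * sample_per_subtopic


# Arbitrary example: a binary task with 3 source types, 4 topics,
# 5 subtopics per topic and 2 samples per subtopic.
n_records = expected_record_count(
    n_label=2, n_source_type=3, n_topic=4, n_subtopic=5, sample_per_subtopic=2
)
print(n_records)  # 2 * 3 * 4 * 5 * 2 = 240
```

Since the count grows multiplicatively, doubling any single parameter doubles the number of LLM calls, which is worth keeping in mind for problems with many labels.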

7. Conclusion

We believe AnyClassifier can revolutionize your ML workflow. It will be interesting to see 1) whether synthetic data generation leads to an explosion of new datasets and better models, which would benefit the AI community; and 2) what the limit is if we combine it with agentic flows.

There are many open research and implementation directions for the future:

  • research on synthetic data algorithms that result in higher performance
  • agentic workflows for model evaluation, error analysis and model improvement
  • multilingual support

If you are interested, please give it a star, try it out, and contribute.

Credits

Without these open source models and libraries, this project would not have been possible.

Reference

  1. Chan, X., Wang, X., Yu, D., Mi, H., Yu, D., 2024. Scaling synthetic data creation with 1,000,000,000 personas. URL: https://arxiv.org/abs/2406.20094, arXiv:2406.20094.

Citation

If you are interested in this work, please cite:

@software{Tsui_AnyClassifier_2024,
author = {Tsui, Ken},
month = {8},
title = {{AnyClassifier}},
url = {https://github.com/kenhktsui/anyclassifier},
year = {2024}
}