sander-wood committed on
Commit
1ba57f4
1 Parent(s): dc1f408

Update README.md

Files changed (1)
  1. README.md +25 -5
README.md CHANGED
@@ -91,12 +91,32 @@ The `config.py` file contains critical settings for training and inference, allo
91
  Generative modelling with bGPT is a flexible and powerful approach to learning and generating new data across various formats. bGPT segments byte sequences into patches, predicts next patch features with a patch-level decoder, and reconstructs bytes within patches using these features with a byte-level decoder. Here's how to get started:
92
 
93
  1. **Prepare Your Data**: bGPT can handle any computer file type, including text, images, audio, executables, and encrypted or proprietary formats, without needing specific adjustments for each modality. This capability allows for straightforward and versatile training on a wide array of data. All you need to do here is split your data into training and evaluation sets.
94
-
95
- 3. **Adjust Configuration Settings**: Modify the `config.py` file to tailor the training process to your needs. At a minimum, you should update the `TRAIN_FOLDERS` and `EVAL_FOLDERS` to point to your actual data directories. Also, specify where to save the trained model weights and logs by setting `WEIGHTS_PATH` and `LOGS_PATH`. You may adjust other parameters based on your specific requirements. For instance, with the default `PATCH_SIZE=16` and `PATCH_LENGTH=512`, bGPT can model byte sequences up to 8KB. If your training files are larger, and you have sufficient computational resources, consider increasing these parameters to accommodate the larger file sizes.
96
 
97
- 4. **Leverage Pre-trained Weights (Optional)**: If you wish to fine-tune a pre-trained bGPT model, set `PRE_WEIGHTS_PATH` to the location of the pre-trained weights and ensure `LOAD_FROM_PRE_CHECKPOINT=True`. To train a model from scratch, simply set `LOAD_FROM_PRE_CHECKPOINT=False`.
98
 
99
- 5. **Start Training**: Run `train-gen.py` to begin the training process. The script will use the configurations set in `config.py` and apply the training data to learn generative models capable of producing new, unseen outputs in the format of your training data.
100
 
101
  ### Classification
102
 
@@ -104,7 +124,7 @@ Classification with bGPT leverages the model's ability to understand and differe
104
 
105
  1. **Prepare Labelled Data**: Ensure your dataset consists of labelled data, which can be a mix of different formats. The model reads each file's label from its filename, taking the part before the first underscore: `filename.split('_')[0]`. A file named "Business_1.txt" therefore gets the label "Business". Name your files accordingly so that every file carries the correct label.
106
 
107
- 2. **Generative Modelling Before Classification (Strongly Recommended)**: Before embarking on classification tasks, it is highly recommended to perform generative modelling on the same dataset. Starting with weights trained through generative modelling provides a solid foundation for further fine-tuning in classification tasks. To do this, set `PRE_WEIGHTS_PATH` to your generative model weights and ensure `LOAD_FROM_PRE_CHECKPOINT=True`. Directly training a classification model from scratch without this pre-training step has been observed to result in significantly poorer performance. When fine-tuning for classification, ensure that `WEIGHTS_PATH` and `LOGS_PATH` are set to different locations to prevent overwriting previous models. Note that the classification model will inherit the bGPT's patch-level decoder and discard the byte-level decoder, so it's essential to keep the model parameters unchanged during this phase.
108
 
109
  3. **Start Training for Classification**: Run `train-cls.py` to begin the classification training process. The script will utilize the previously set configurations and apply them to your labelled dataset. The model will learn to classify the input data into the defined categories based on the labels extracted from the filenames.
110
 
 
91
  Generative modelling with bGPT is a flexible and powerful approach to learning and generating new data across various formats. bGPT segments byte sequences into patches, predicts next patch features with a patch-level decoder, and reconstructs bytes within patches using these features with a byte-level decoder. Here's how to get started:
92
 
93
  1. **Prepare Your Data**: bGPT can handle any computer file type, including text, images, audio, executables, and encrypted or proprietary formats, without needing specific adjustments for each modality. This capability allows for straightforward and versatile training on a wide array of data. All you need to do here is split your data into training and evaluation sets.
 
 
94
 
95
+ 2. **Adjust Configuration Settings**: Modify the `config.py` file to tailor the training process to your needs. At a minimum, update `TRAIN_FOLDERS` and `EVAL_FOLDERS` to point to your actual data directories. Also specify where to save the trained model weights and logs by setting `WEIGHTS_PATH` and `LOGS_PATH`. You may adjust other parameters based on your specific requirements. For instance, with the default `PATCH_SIZE=16` and `PATCH_LENGTH=512`, bGPT can model byte sequences of up to 8 KB (16 × 512 = 8192 bytes). If your training files are larger and you have sufficient computational resources, consider increasing these parameters to accommodate the larger file sizes. By default, `CONVERSION_MODE` is set to `None`. If you wish to perform data conversion tasks, see the Data Conversion section below for instructions on setting up the conversion mode.
96
 
97
+ 3. **Leverage Pre-trained Weights (Optional)**: If you wish to fine-tune a pre-trained bGPT model, set `PRE_WEIGHTS_PATH` to the location of the pre-trained weights and ensure `LOAD_FROM_PRETRAINED=True`. To train a model from scratch, simply set `LOAD_FROM_PRETRAINED=False`.
98
+
99
+ 4. **Start Training**: Run `train-gen.py` to begin the training process. The script will use the configurations set in `config.py` and apply the training data to learn generative models capable of producing new, unseen outputs in the format of your training data.
100
+
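The size limit mentioned in step 2 can be checked before training. The following is a minimal sketch, not the repository's actual `config.py`: the folder and file paths are placeholders, the list form of `TRAIN_FOLDERS` and `EVAL_FOLDERS` is an assumption, and the helper functions are hypothetical. It also ignores any special start or end patches the real implementation may reserve, so treat the result as an upper bound.

```python
import math

# Hypothetical values mirroring the settings named in step 2; all paths are placeholders.
TRAIN_FOLDERS = ["data/train"]   # assumed to be a list of directories
EVAL_FOLDERS = ["data/eval"]
WEIGHTS_PATH = "weights-gen.pth"
LOGS_PATH = "logs-gen.txt"
PATCH_SIZE = 16      # bytes per patch
PATCH_LENGTH = 512   # patches per sequence


def max_sequence_bytes(patch_size: int, patch_length: int) -> int:
    """Upper bound on the number of bytes one sequence can hold."""
    return patch_size * patch_length


def patches_needed(file_size: int, patch_size: int) -> int:
    """Number of patches required to cover a file of file_size bytes."""
    return math.ceil(file_size / patch_size)


def fits(file_size: int, patch_size: int, patch_length: int) -> bool:
    """True if the whole file fits into a single sequence of patches."""
    return patches_needed(file_size, patch_size) <= patch_length


if __name__ == "__main__":
    limit = max_sequence_bytes(PATCH_SIZE, PATCH_LENGTH)
    print(f"Sequence capacity: {limit} bytes")
    print("8192-byte file fits:", fits(8192, PATCH_SIZE, PATCH_LENGTH))
    print("8193-byte file fits:", fits(8193, PATCH_SIZE, PATCH_LENGTH))
```

With the defaults, the capacity is 16 × 512 = 8192 bytes, which matches the 8 KB figure above. Doubling either parameter doubles the capacity, at a higher computational cost.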
101
+ ### Data Conversion
102
+
103
+ The conversion mode in bGPT adds a specialized functionality for transforming data from one format to another, leveraging the model's understanding of byte sequences across different file types. This mode supports both unidirectional and bidirectional conversions, enabling a wide range of data transformation tasks. Here's how to utilize the conversion mode effectively:
104
+
105
+ 1. **Define Conversion Mode**: In your `config.py` file, you'll define the `CONVERSION_MODE` setting, which governs how files are transformed. This setting offers two options: unidirectional and bidirectional conversion, denoted by `"->"` and `"&"` respectively.
106
+
107
+ - Unidirectional Conversion: Denoted by `"->"`, this mode signifies a one-way transformation from one format to another. For instance, if you want to convert text files to HTML, you'd set `CONVERSION_MODE = "txt->html"`. This means the model will learn to convert text files specifically into HTML format, but not vice versa.
108
+
109
+ - Bidirectional Conversion: Denoted by `"&"`, this mode implies a two-way transformation between formats. For example, setting `CONVERSION_MODE = "wav&mp3"` instructs the model to learn bidirectional conversion between WAV and MP3 audio formats. In this mode, the model learns to convert files from WAV to MP3 and vice versa, allowing flexibility in both directions of conversion.
110
+
111
+ 2. **Prepare Your Data**: Store both files of each pair in the same directory within `TRAIN_FOLDERS` or `EVAL_FOLDERS`. The two files must share an identical path and filename and differ only in their file extensions. For instance, when converting between WAV and MP3, "path/audio.wav" pairs with "path/audio.mp3". This strict alignment lets the script associate the correct files for conversion based on the specified mode.
112
+
113
+ 3. **Adjust Training Parameters**: Although the conversion mode operates under the same training principles as generative modelling, you might want to adjust certain parameters in `config.py` to optimize the conversion process. This could include tuning the `PATCH_SIZE` and `PATCH_LENGTH` settings to better accommodate the file sizes commonly encountered in your conversion tasks.
114
+
115
+ 4. **Leverage Pre-trained Weights (Optional)**: As with regular generative modelling, if you wish to fine-tune a pre-trained bGPT model, set `PRE_WEIGHTS_PATH` to the location of the pre-trained weights and ensure `LOAD_FROM_PRETRAINED=True`. To train a model from scratch, simply set `LOAD_FROM_PRETRAINED=False`.
116
+
117
+ 5. **Start Training for Conversion**: When training bGPT in conversion mode, the model learns to map byte sequences from the source format to the target format (or vice versa in bidirectional mode). Execute `train-gen.py` to start the training process, ensuring that the `CONVERSION_MODE` is correctly set in your configuration file.
118
+
119
+ By leveraging the conversion mode, bGPT enables simulating and reverse engineering the behaviors of algorithms or hardware through paired inputs and outputs, opening up new possibilities for data processing and content generation tasks.
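To make the `CONVERSION_MODE` syntax and the pairing rule concrete, here is a minimal sketch of how such a string could be parsed and how files could be matched by path. It is an illustration only: `parse_conversion_mode` and `pair_files` are hypothetical helpers, not functions from the bGPT code base, and the training script may handle these steps differently.

```python
from pathlib import Path


def parse_conversion_mode(mode):
    """Return (source_ext, target_ext, bidirectional), or None when mode is None."""
    if mode is None:
        return None
    if "->" in mode:
        # Unidirectional: "txt->html" means txt is the source, html the target.
        src, tgt = mode.split("->", 1)
        return src.strip(), tgt.strip(), False
    if "&" in mode:
        # Bidirectional: "wav&mp3" means both directions are learned.
        first, second = mode.split("&", 1)
        return first.strip(), second.strip(), True
    raise ValueError(f"Unrecognised conversion mode: {mode!r}")


def pair_files(paths, src_ext, tgt_ext):
    """Pair files whose paths are identical apart from the extension."""
    by_stem = {}
    for p in map(Path, paths):
        # Key on the path without its extension, e.g. "path/audio".
        ext = p.suffix.lstrip(".").lower()
        by_stem.setdefault(p.with_suffix(""), {})[ext] = p
    pairs = []
    for stem in sorted(by_stem):
        exts = by_stem[stem]
        if src_ext in exts and tgt_ext in exts:
            pairs.append((exts[src_ext], exts[tgt_ext]))
    return pairs


if __name__ == "__main__":
    src, tgt, both_ways = parse_conversion_mode("wav&mp3")
    files = ["path/audio.wav", "path/audio.mp3", "path/solo.wav"]
    print(pair_files(files, src, tgt), "bidirectional:", both_ways)
```

In this sketch, a file without a counterpart (such as "path/solo.wav") is simply left out of the pairs, which shows why the strict naming alignment in step 2 matters.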
120
 
121
  ### Classification
122
 
 
124
 
125
  1. **Prepare Labelled Data**: Ensure your dataset consists of labelled data, which can be a mix of different formats. The model reads each file's label from its filename, taking the part before the first underscore: `filename.split('_')[0]`. A file named "Business_1.txt" therefore gets the label "Business". Name your files accordingly so that every file carries the correct label.
126
 
127
+ 2. **Generative Modelling Before Classification (Strongly Recommended)**: Before embarking on classification tasks, it is highly recommended to perform generative modelling on the same dataset. Starting with weights trained through generative modelling provides a solid foundation for further fine-tuning in classification tasks. To do this, set `PRE_WEIGHTS_PATH` to your generative model weights and ensure `LOAD_FROM_PRETRAINED=True`. Directly training a classification model from scratch without this pre-training step has been observed to result in significantly poorer performance. When fine-tuning for classification, ensure that `WEIGHTS_PATH` and `LOGS_PATH` are set to different locations to prevent overwriting previous models. Note that the classification model will inherit the bGPT's patch-level decoder and discard the byte-level decoder, so it's essential to keep the model parameters unchanged during this phase.
128
 
129
  3. **Start Training for Classification**: Run `train-cls.py` to begin the classification training process. The script will utilize the previously set configurations and apply them to your labelled dataset. The model will learn to classify the input data into the defined categories based on the labels extracted from the filenames.
130
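The filename-based labelling from step 1 can be previewed before training. This is a minimal sketch under stated assumptions: `label_from_path` and `build_label_index` are hypothetical helpers, and applying the split to the base name rather than the full path is an assumption about how `train-cls.py` reads labels.

```python
import os


def label_from_path(path):
    """Take the part of the file's base name before the first underscore."""
    return os.path.basename(path).split("_")[0]


def build_label_index(paths):
    """Map each distinct label to an integer class id, in sorted order."""
    labels = sorted({label_from_path(p) for p in paths})
    return {label: idx for idx, label in enumerate(labels)}


if __name__ == "__main__":
    files = ["data/Business_1.txt", "data/Sports_2.txt", "data/Business_3.txt"]
    print(build_label_index(files))
```

A filename without an underscore, such as "readme.txt", would yield the whole name as its label under this rule, so checking the label index for unexpected entries is a quick way to catch misnamed files.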