
Model Description

Machine learning models such as tensorflow-compress use an LSTM to compress text, achieving remarkable compression ratios with little code to maintain. This model takes a lighter approach: it is a SentencePiece unigram model trained with dynamic sapient technology on the go_emotions dataset, and it compresses 384-bit strings much better than run-length encoding (RLE).
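For reference, the sketch below is one naive run-length cost model (illustrative only, not part of this repository); it can be applied to the demo bit string in the next section to compare against the compressed size reported there.

from itertools import groupby

def rle_size_bits(bit_text, length_bits=7):
    # Naive RLE cost: each run is stored as 1 value bit + `length_bits`
    # run-length bits; runs longer than 2**length_bits - 1 are split.
    max_run = 2 ** length_bits - 1
    chunks = 0
    for _, group in groupby(bit_text):
        run_length = len(list(group))
        chunks += (run_length + max_run - 1) // max_run  # ceil division
    return chunks * (1 + length_bits)

print(rle_size_bits("0001100000"), "bits")  # 3 runs -> 24 bits with the defaults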

  • Developed by: Ziv Arin
  • Model type: Sentence similarity lossless compression
  • License: CC0-1.0

Demo

Example bitarray (384-bit): 000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000
Compressed (208-bit): 1ab2ed09d7a9617206894e0608 (45.83% space-saving efficiency)
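The 208-bit figure appears to count the 26-character hex string at 8 bits per ASCII character; stored as 13 raw bytes it would be 104 bits. A quick check of the arithmetic:

compressed_hex = "1ab2ed09d7a9617206894e0608"
original_bits = 384
hex_text_bits = len(compressed_hex) * 8       # 26 ASCII characters -> 208 bits
raw_byte_bits = len(compressed_hex) // 2 * 8  # 13 raw bytes -> 104 bits
print(f"{1 - hex_text_bits / original_bits:.2%} space saving as hex text")   # 45.83%
print(f"{1 - raw_byte_bits / original_bits:.2%} space saving as raw bytes")  # 72.92%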

The notebook:

import sentencepiece as spm
import numpy as np
from collections import Counter

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

def encode_id(bit_text):
    # Tokenize the bit string, then shift each piece id down by 3 to skip the
    # reserved special tokens so every id fits in a single byte.
    encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
    encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
    assert all(0 <= id_ <= 255 for id_ in encoded_ids)  # every id must fit in one byte
    string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])
    return string_ids

def decode_id(hex_string):
    # Read the hex string back as unsigned bytes, undo the -3 offset,
    # and map each id back to its SentencePiece piece.
    u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='<u1') + 3
    encoded_tokens = [bpe_processor.id_to_piece(int(id_)) for id_ in u8_array]
    return encoded_tokens

# Encode text
new_sentence = "000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000"
encoded_tokens = bpe_processor.encode_as_pieces(new_sentence)
encoded_ids = encode_id(new_sentence)
decoded_tokens = decode_id(encoded_ids)

print("length:", len(encoded_tokens))
print("encoded_tokens:", encoded_tokens)
print("encoded_ids:", encoded_ids)
print("same?:", encoded_tokens == decoded_tokens)

count = Counter(encoded_tokens)
print("count:", count)

Output:

length: 13
encoded_tokens: ['▁0000000', '0000000000000001000000000000000000000', '00000000001000100', '1000000', '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000', '00000000000000000001000000000000000000000000000000000', '0000000000000000000000000000000001000', '00000000000000000000000100000000000000000', '00000000010', '0000000000000000000000000000000000000100', '00000000000100000000000000000', '00000000010', '00001000']
encoded_ids: 1ab2ed09d7a9617206894e0608
same?: True
count: Counter({'00000000010': 2, '▁0000000': 1, '0000000000000001000000000000000000000': 1, '00000000001000100': 1, '1000000': 1, '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000': 1, '00000000000000000001000000000000000000000000000000000': 1, '0000000000000000000000000000000001000': 1, '00000000000000000000000100000000000000000': 1, '0000000000000000000000000000000000000100': 1, '00000000000100000000000000000': 1, '00001000': 1})
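To confirm the compression is lossless end to end, the decoded pieces can be rejoined into the original bit string (a small check; it assumes the '▁' meta symbol only marks the word boundary and carries no bit data):

reconstructed = "".join(decoded_tokens).replace("▁", "")
print("lossless?:", reconstructed == new_sentence)  # expected: True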

Bias, Risks, and Limitations

The model has no sentient bias, only algorithmic bias; don't worry about it, it's not a living thing.
The model does not compress strings with fewer zeros (denser bit strings) as well, as sketched below.
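One way to see this limitation (a sketch reusing the session above; the random string and seed are only illustrative):

import random

random.seed(0)
dense_bits = "".join(random.choice("01") for _ in range(384))  # roughly half ones
dense_tokens = bpe_processor.encode_as_pieces(dense_bits)
dense_hex_bits = len(dense_tokens) * 2 * 8  # 2 hex characters per token, 8 bits each
print("tokens:", len(dense_tokens))
print("space saving:", f"{1 - dense_hex_bits / len(dense_bits):.2%}")  # expected to fall well below the 45.83% above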

Environmental Impact

  • Hardware Type: Intel Core i5-9300H
  • Hours used: 3