jeduardogruiz committed on
Commit
e4c51fe
1 Parent(s): 33a68e3

Upload 5 files

Files changed (5)
  1. Packagefloat.md +0 -0
  2. README.md +130 -2
  3. mapa_token_especial.json +128 -0
  4. pyproject.toml +45 -0
  5. setup.py +18 -0
Packagefloat.md ADDED
File without changes
README.md CHANGED
@@ -1,2 +1,130 @@
1
- This directory is modified based on default_8bit, which allows you to manually
2
- change the number of bits of weight and activation in QAT.
1
+ # tiktoken
2
+
3
+ tiktoken is a fast [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokeniser for use with
4
+ OpenAI's models.
5
+
6
+ ```python
7
+ import tiktoken
8
+ enc = tiktoken.get_encoding("cl100k_base")
9
+ assert enc.decode(enc.encode("hello world")) == "hello world"
10
+
11
+ # To get the tokeniser corresponding to a specific model in the OpenAI API:
12
+ enc = tiktoken.encoding_for_model("gpt-4")
13
+ ```
14
+
15
+ The open source version of `tiktoken` can be installed from PyPI:
16
+ ```
17
+ pip install tiktoken
18
+ ```
19
+
20
+ The tokeniser API is documented in `tiktoken/core.py`.
21
+
22
+ Example code using `tiktoken` can be found in the
23
+ [OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
24
+
25
+
26
+ ## Performance
27
+
28
+ `tiktoken` is 3-6x faster than a comparable open source tokeniser:
29
+
30
+ ![image](https://raw.githubusercontent.com/openai/tiktoken/main/perf.svg)
31
+
32
+ Performance measured on 1GB of text using the GPT-2 tokeniser, using `GPT2TokenizerFast` from
33
+ `tokenizers==0.13.2`, `transformers==4.24.0` and `tiktoken==0.2.0`.
34
+
35
+
36
+ ## Getting help
37
+
38
+ Please post questions in the [issue tracker](https://github.com/openai/tiktoken/issues).
39
+
40
+ If you work at OpenAI, make sure to check the internal documentation or feel free to contact
41
+ @shantanu.
42
+
43
+ ## What is BPE anyway?
44
+
45
+ Language models don't see text the way you and I do; instead, they see a sequence of numbers (known as tokens).
46
+ Byte pair encoding (BPE) is a way of converting text into tokens. It has several desirable
47
+ properties:
48
+ 1) It's reversible and lossless, so you can convert tokens back into the original text
49
+ 2) It works on arbitrary text, even text that is not in the tokeniser's training data
50
+ 3) It compresses the text: the token sequence is shorter than the bytes corresponding to the
51
+ original text. On average, in practice, each token corresponds to about 4 bytes.
52
+ 4) It attempts to let the model see common subwords. For instance, "ing" is a common subword in
53
+ English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing"
54
+ (instead of e.g. "enc" and "oding"). Because the model will then see the "ing" token again and
55
+ again in different contexts, it helps models generalise and better understand grammar.
56
+
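The properties listed above can be illustrated with a toy byte-level BPE. This is a minimal sketch under stated assumptions, not `tiktoken`'s implementation: the helper names (`train_bpe`, `merge`, `encode`, `decode`) are hypothetical, there is no regex pre-splitting, and merges are applied in the order they were learned.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; merge number i creates token id 256 + i."""
    ids = list(text.encode("utf-8"))
    merges = []
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        ids = merge(ids, best, 256 + i)
    return merges


def encode(text, merges):
    """Start from raw UTF-8 bytes, then apply the learned merges in order."""
    ids = list(text.encode("utf-8"))
    for i, pair in enumerate(merges):
        ids = merge(ids, pair, 256 + i)
    return ids


def decode(ids, merges):
    """Expand every token back into the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for i, (a, b) in enumerate(merges):
        vocab[256 + i] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in ids).decode("utf-8")


corpus = "encoding decoding recoding " * 4
merges = train_bpe(corpus, 20)

unseen = "encoding héllo 🙂"           # contains text absent from the corpus
roundtrip = decode(encode(unseen, merges), merges)
compressed_len = len(encode(corpus, merges))
raw_len = len(corpus.encode("utf-8"))
```

Because the base vocabulary covers all 256 byte values, any input can be encoded (property 2), decoding is an exact inverse (property 1), and each merge that fires shortens the sequence (property 3).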
57
+ `tiktoken` contains an educational submodule that is friendlier if you want to learn more about
58
+ the details of BPE, including code that helps visualise the BPE procedure:
59
+ ```python
60
+ from tiktoken._educational import *
61
+
62
+ # Train a BPE tokeniser on a small amount of text
63
+ enc = train_simple_encoding()
64
+
65
+ # Visualise how the GPT-4 encoder encodes text
66
+ enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
67
+ enc.encode("hello world e")
68
+ ```
69
+
70
+
71
+ ## Extending tiktoken
72
+
73
+ You may wish to extend `tiktoken` to support new encodings. There are two ways to do this.
74
+
75
+
76
+ **Create your `Encoding` object exactly the way you want and simply pass it around.**
77
+
78
+ ```python
79
+ cl100k_base = tiktoken.get_encoding("cl100k_base")
80
+
81
+ # In production, load the arguments directly instead of accessing private attributes
82
+ # See openai_public.py for examples of arguments for specific encodings
83
+ enc = tiktoken.Encoding(
84
+ # If you're changing the set of special tokens, make sure to use a different name
85
+ # It should be clear from the name what behaviour to expect.
86
+ name="cl100k_im",
87
+ pat_str=cl100k_base._pat_str,
88
+ mergeable_ranks=cl100k_base._mergeable_ranks,
89
+ special_tokens={
90
+ **cl100k_base._special_tokens,
91
+ "<|im_start|>": 100264,
92
+ "<|im_end|>": 100265,
93
+ }
94
+ )
95
+ ```
96
+
97
+ **Use the `tiktoken_ext` plugin mechanism to register your `Encoding` objects with `tiktoken`.**
98
+
99
+ This is only useful if you need `tiktoken.get_encoding` to find your encoding; otherwise, prefer
100
+ option 1.
101
+
102
+ To do this, you'll need to create a namespace package under `tiktoken_ext`.
103
+
104
+ Lay out your project like this, making sure to omit the `tiktoken_ext/__init__.py` file:
105
+ ```
106
+ my_tiktoken_extension
107
+ ├── tiktoken_ext
108
+ │   └── my_encodings.py
109
+ └── setup.py
110
+ ```
111
+
112
+ `my_encodings.py` should be a module that contains a variable named `ENCODING_CONSTRUCTORS`.
113
+ This is a dictionary from an encoding name to a function that takes no arguments and returns
114
+ arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see
115
+ `tiktoken_ext/openai_public.py`. For precise details, see `tiktoken/registry.py`.
116
+
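As an illustration of the shape described above, here is a hedged sketch of what `my_encodings.py` might contain. The encoding name `my_byte_level`, the constructor function, the split pattern, and the toy vocabulary are all assumptions made for this example; a real encoding would load learned merge ranks instead.

```python
# tiktoken_ext/my_encodings.py (hypothetical example module)


def my_byte_level_encoding():
    """Return keyword arguments for tiktoken.Encoding (toy vocabulary, no merges)."""
    # One token per possible byte value, ranked 0..255.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "my_byte_level",
        # Assumed, deliberately simple split pattern: runs of whitespace or non-whitespace.
        "pat_str": r"\s+|\S+",
        "mergeable_ranks": mergeable_ranks,
        # Special tokens take ids after the ordinary vocabulary.
        "special_tokens": {"<|endoftext|>": 256},
    }


# Maps each encoding name to a zero-argument function returning the arguments above.
ENCODING_CONSTRUCTORS = {"my_byte_level": my_byte_level_encoding}
```

The constructor is a function rather than a ready-made dictionary so that the (potentially expensive) vocabulary is only built when the encoding is actually requested.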
117
+ Your `setup.py` should look something like this:
118
+ ```python
119
+ from setuptools import setup, find_namespace_packages
120
+
121
+ setup(
122
+ name="my_tiktoken_extension",
123
+ packages=find_namespace_packages(include=['tiktoken_ext*']),
124
+ install_requires=["tiktoken"],
125
+ ...
126
+ )
127
+ ```
128
+
129
+ Then simply `pip install ./my_tiktoken_extension` and you should be able to use your
130
+ custom encodings! Make sure **not** to use an editable install.
mapa_token_especial.json ADDED
@@ -0,0 +1,128 @@
1
+ #!/bin/bash
2
+ # coinbash.sh
3
+ # - Bash Script
4
+ # - CLI
5
+ # - A bash script (CLI) for displaying crypto currencies market data in a terminal
6
+ # - Tested on Debian and Ubuntu
7
+ # - Dependencies: bash, curl, jq, coinmarketcap-API-key
8
+ # - Uses cloud API of https://pro-api.coinmarketcap.com/v1
9
+ # - YOU MUST HAVE YOUR OWN coinmarketcap-API-key, as of Oct 2020 you can get one for free at coinmarketcap.com
10
+ # - set the global environment variable COINMARKETCAP_API_KEY to your personal coinmarketcap-API-key,
11
+ # - e.g. export COINMARKETCAP_API_KEY="your-coinmarketcap-API-key-here"
12
+ # - keywords: CLI, command-line, terminal, bash, market-data, ticker, price-tracker, marketcap, crypto, crypto currencies, cryptocurrency, bitcoin, btc, ethereum
13
+ #
14
+ # License: CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0/
15
+ #
16
+
17
+ ########## GENERAL INFO ##########
18
+
19
+ #
20
+ # API: https://coinmarketcap.com/api/
22
+ # https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest?convert=USD&limit=2&start=1
23
+ # Returns something like:
24
+ # {"status":{"timestamp":"2020-10-02T12:10:29.629Z","error_code":0,"error_message":null,"elapsed":9,"credit_count":1,"notice":null,"total_count":3560},"data":[{"id":1,"name":"Bitcoin","symbol":"BTC","slug":"bitcoin","num_market_pairs":9315,"date_added":"2013-04-28T00:00:00.000Z","tags":["mineable","pow","sha-256","store-of-value","state-channels"],"max_supply":21000000,"circulating_supply":18505718,"total_supply":18505718,"platform":null,"cmc_rank":1,"last_updated":"2020-10-02T12:09:30.000Z","quote":{"USD":{"price":10471.2855252,"volume_24h":26623814611.304,"percent_change_1h":-0.0184301,"percent_change_24h":-3.8861,"percent_change_7d":-1.68497,"market_cap":193778657026.8331,"last_updated":"2020-10-02T12:09:30.000Z"}}},{"id":1027,"name":"Ethereum","symbol":"ETH","slug":"ethereum","num_market_pairs":6043,"date_added":"2015-08-07T00:00:00.000Z","tags":["mineable","pow","smart-contracts","binance-chain"],"max_supply":null,"circulating_supply":112840913.124,"total_supply":112840913.124,"platform":null,"cmc_rank":2,"last_updated":"2020-10-02T12:09:23.000Z","quote":{"USD":{"price":339.400890152,"volume_24h":15156595436.1756,"percent_change_1h":-0.00356475,"percent_change_24h":-7.8024,"percent_change_7d":-1.45411,"market_cap":38298306359.8501,"last_updated":"2020-10-02T12:09:23.000Z"}}}]}
25
+ #
26
+ # cat /tmp/coinbash.sh.tmp.json | jq [.data[0]] gives something like
27
+ : '[
28
+ {
29
+ "id": 1,
30
+ "name": "Bitcoin",
31
+ "symbol": "BTC",
32
+ "slug": "bitcoin",
33
+ "num_market_pairs": 9315,
34
+ "date_added": "2013-04-28T00:00:00.000Z",
35
+ "tags": [
36
+ "mineable",
37
+ "pow",
38
+ "sha-256",
39
+ "store-of-value",
40
+ "state-channels"
41
+ ],
42
+ "max_supply": 21000000,
43
+ "circulating_supply": 18505718,
44
+ "total_supply": 18505718,
45
+ "platform": null,
46
+ "cmc_rank": 1,
47
+ "last_updated": "2020-10-02T12:09:30.000Z",
48
+ "quote": {
49
+ "USD": {
50
+ "price": 10471.2855252,
51
+ "volume_24h": 26623814611.304,
52
+ "percent_change_1h": -0.0184301,
53
+ "percent_change_24h": -3.8861,
54
+ "percent_change_7d": -1.68497,
55
+ "market_cap": 193778657026.8331,
56
+ "last_updated": "2020-10-02T12:09:30.000Z"
57
+ }
58
+ }
59
+ }
60
+ ]'
61
+ #
62
+ # cat /tmp/coinbash.sh.tmp.json | jq [.data[1]][].name gives something like "Ethereum"
63
+ # cat /tmp/coinbash.sh.tmp.json | jq [.data[1]][].quote.USD.price gives something like 339.400890152
64
+
65
+ # https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest?convert=USD&slug=bitcoin
66
+ # Returns something like:
67
+ # {"status":{"timestamp":"2020-10-02T12:39:21.288Z","error_code":0,"error_message":null,"elapsed":30,"credit_count":1,"notice":null},"data":{"1":{"id":1,"name":"Bitcoin","symbol":"BTC","slug":"bitcoin","num_market_pairs":9315,"date_added":"2013-04-28T00:00:00.000Z","tags":["mineable","pow","sha-256","store-of-value","state-channels"],"max_supply":21000000,"circulating_supply":18505743,"total_supply":18505743,"is_active":1,"platform":null,"cmc_rank":1,"is_fiat":0,"last_updated":"2020-10-02T12:38:21.000Z","quote":{"USD":{"price":10491.9489757,"volume_24h":26838808649.2375,"percent_change_1h":0.12782,"percent_change_24h":-3.70075,"percent_change_7d":-1.5155,"market_cap":194161311313.41742,"last_updated":"2020-10-02T12:38:21.000Z"}}}}}
68
+ #
69
+ # cat "/tmp/coinbash.sh.tmp.json.part" | jq [.data]
70
+ # shellcheck disable=SC2016
71
+ : '[
72
+ {
73
+ "1": {
74
+ "id": 1,
75
+ "name": "Bitcoin",
76
+ "symbol": "BTC",
77
+ "slug": "bitcoin",
78
+ "num_market_pairs": 9315,
79
+ "date_added": "2013-04-28T00:00:00.000Z",
80
+ "tags": [
81
+ "mineable",
82
+ "pow",
83
+ "sha-256",
84
+ "store-of-value",
85
+ "state-channels"
86
+ ],
87
+ "max_supply": 21000000,
88
+ "circulating_supply": 18505743,
89
+ "total_supply": 18505743,
90
+ "is_active": 1,
91
+ "platform": null,
92
+ "cmc_rank": 1,
93
+ "is_fiat": 0,
94
+ "last_updated": "2020-10-02T12:38:21.000Z",
95
+ "quote": {
96
+ "USD": {
97
+ "price": 10491.9489757,
98
+ "volume_24h": 26838808649.2375,
99
+ "percent_change_1h": 0.12782,
100
+ "percent_change_24h": -3.70075,
101
+ "percent_change_7d": -1.5155,
102
+ "market_cap": 194161311313.41742,
103
+ "last_updated": "2020-10-02T12:38:21.000Z"
104
+ }
105
+ }
106
+ }
107
+ }
108
+ ]
109
+
110
+ cat "/tmp/coinbash.sh.tmp.json.part" | jq "[.data][] | keys"| jq .[] # gets the id, name
111
+ "1"
112
+
115
+ key=$(cat "/tmp/coinbash.sh.tmp.json.part" | jq "[.data][] | keys"| jq .[]) # assign the id, name
116
+ echo $key
117
+ "1"
118
+ cat "/tmp/coinbash.sh.tmp.json.part" | jq [.data][].$key
119
+ {
120
+ "id": 1,
121
+ "name": "Bitcoin",
122
+ "symbol": "BTC",
123
+ "slug": "bitcoin",
124
+
pyproject.toml ADDED
@@ -0,0 +1,45 @@
1
+ [project]
2
+ name = "tiktoken"
3
+ version = "0.6.0"
4
+ description = "tiktoken is a fast BPE tokeniser for use with OpenAI's models"
5
+ readme = "README.md"
6
+ license = {file = "LICENSE"}
7
+ authors = [{name = "Shantanu Jain"}, {email = "shantanu@openai.com"}]
8
+ dependencies = ["regex>=2022.1.18", "requests>=2.26.0"]
9
+ optional-dependencies = {blobfile = ["blobfile>=2"]}
10
+ requires-python = ">=3.8"
11
+
12
+ [project.urls]
13
+ homepage = "https://github.com/openai/tiktoken"
14
+ repository = "https://github.com/openai/tiktoken"
15
+ changelog = "https://github.com/openai/tiktoken/blob/main/CHANGELOG.md"
16
+
17
+ [build-system]
18
+ build-backend = "setuptools.build_meta"
19
+ requires = ["setuptools>=62.4", "wheel", "setuptools-rust>=1.5.2"]
20
+
21
+ [tool.cibuildwheel]
22
+ build-frontend = "build"
23
+ build-verbosity = 1
24
+
25
+ linux.before-all = "curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y"
26
+ linux.environment = { PATH = "$PATH:$HOME/.cargo/bin" }
27
+ macos.before-all = "rustup target add aarch64-apple-darwin"
28
+
29
+ skip = [
30
+ "*-manylinux_i686",
31
+ "*-musllinux_i686",
32
+ "*-win32",
33
+ ]
34
+ macos.archs = ["x86_64", "arm64"]
35
+ # When cross-compiling on Intel, it is not possible to test arm64 wheels.
36
+ # Warnings will be silenced with following CIBW_TEST_SKIP
37
+ test-skip = "*-macosx_arm64"
38
+
39
+ before-test = "pip install pytest hypothesis"
40
+ test-command = "pytest {project}/tests --import-mode=append"
41
+
42
+ [[tool.cibuildwheel.overrides]]
43
+ select = "*linux_aarch64"
44
+ test-command = """python -c 'import tiktoken; enc = tiktoken.get_encoding("gpt2"); assert enc.encode("hello world") == [31373, 995]'"""
45
+
setup.py ADDED
@@ -0,0 +1,18 @@
1
+ from setuptools import setup
2
+ from setuptools_rust import Binding, RustExtension
3
+
4
+ setup(
5
+ name="tiktoken",
6
+ rust_extensions=[
7
+ RustExtension(
8
+ "tiktoken._tiktoken",
9
+ binding=Binding.PyO3,
10
+ # Between our use of editable installs and wanting to use Rust for performance sensitive
11
+ # code, it makes sense to just always use --release
12
+ debug=False,
13
+ )
14
+ ],
15
+ package_data={"tiktoken": ["py.typed"]},
16
+ packages=["tiktoken", "tiktoken_ext"],
17
+ zip_safe=False,
18
+ )