jeduardogruiz
committed on
Commit
•
e4c51fe
1
Parent(s):
33a68e3
Upload 5 files
- Packagefloat.md +0 -0
- README.md +130 -2
- mapa_token_especial.json +128 -0
- pyproject.toml +45 -0
- setup.py +18 -0
Packagefloat.md
ADDED
File without changes
|
README.md
CHANGED
@@ -1,2 +1,130 @@
|
|
1 |
-
|
2 |
-
|
1 |
+
# ⏳ tiktoken
|
2 |
+
|
3 |
+
tiktoken is a fast [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokeniser for use with
|
4 |
+
OpenAI's models.
|
5 |
+
|
6 |
+
```python
|
7 |
+
import tiktoken
|
8 |
+
enc = tiktoken.get_encoding("cl100k_base")
|
9 |
+
assert enc.decode(enc.encode("hello world")) == "hello world"
|
10 |
+
|
11 |
+
# To get the tokeniser corresponding to a specific model in the OpenAI API:
|
12 |
+
enc = tiktoken.encoding_for_model("gpt-4")
|
13 |
+
```
|
14 |
+
|
15 |
+
The open source version of `tiktoken` can be installed from PyPI:
|
16 |
+
```
|
17 |
+
pip install tiktoken
|
18 |
+
```
|
19 |
+
|
20 |
+
The tokeniser API is documented in `tiktoken/core.py`.
|
21 |
+
|
22 |
+
Example code using `tiktoken` can be found in the
|
23 |
+
[OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
|
24 |
+
|
25 |
+
|
26 |
+
## Performance
|
27 |
+
|
28 |
+
`tiktoken` is between 3x and 6x faster than a comparable open source tokeniser:
|
29 |
+
|
30 |
+
![image](https://raw.githubusercontent.com/openai/tiktoken/main/perf.svg)
|
31 |
+
|
32 |
+
Performance measured on 1GB of text using the GPT-2 tokeniser, using `GPT2TokenizerFast` from
|
33 |
+
`tokenizers==0.13.2`, `transformers==4.24.0` and `tiktoken==0.2.0`.
|
34 |
+
|
35 |
+
|
36 |
+
## Getting help
|
37 |
+
|
38 |
+
Please post questions in the [issue tracker](https://github.com/openai/tiktoken/issues).
|
39 |
+
|
40 |
+
If you work at OpenAI, make sure to check the internal documentation or feel free to contact
|
41 |
+
@shantanu.
|
42 |
+
|
43 |
+
## What is BPE anyway?
|
44 |
+
|
45 |
+
Language models don't see text the way you and I do; instead, they see a sequence of numbers (known as tokens).
|
46 |
+
Byte pair encoding (BPE) is a way of converting text into tokens. It has several desirable
|
47 |
+
properties:
|
48 |
+
1) It's reversible and lossless, so you can convert tokens back into the original text
|
49 |
+
2) It works on arbitrary text, even text that is not in the tokeniser's training data
|
50 |
+
3) It compresses the text: the token sequence is shorter than the bytes corresponding to the
|
51 |
+
original text. On average, in practice, each token corresponds to about 4 bytes.
|
52 |
+
4) It attempts to let the model see common subwords. For instance, "ing" is a common subword in
|
53 |
+
English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing"
|
54 |
+
(instead of e.g. "enc" and "oding"). Because the model will then see the "ing" token again and
|
55 |
+
again in different contexts, it helps models generalise and better understand grammar.
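
The properties above can be illustrated with a toy byte-level BPE. This is a minimal sketch and not `tiktoken`'s implementation: the function names `train_bpe`, `apply_merge`, `encode` and `decode` are hypothetical, tokens are kept as `bytes` chunks rather than integer ids, and merges are learned greedily from the most frequent adjacent pair.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[bytes, bytes]


def apply_merge(tokens: List[bytes], pair: Pair) -> List[bytes]:
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:  # nothing left that repeats
            break
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges


def encode(text: str, merges: List[Pair]) -> List[bytes]:
    """Start from single bytes, then replay the learned merges in order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


def decode(tokens: List[bytes]) -> str:
    """Concatenating the chunks recovers the original bytes exactly."""
    return b"".join(tokens).decode("utf-8")
```

Because encoding starts from individual bytes, any UTF-8 text can be encoded even if it never appeared in the training text, and `decode(encode(text, merges))` always returns `text`. A production tokeniser would additionally map each chunk to an integer rank.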
|
56 |
+
|
57 |
+
`tiktoken` contains an educational submodule that is friendlier if you want to learn more about
|
58 |
+
the details of BPE, including code that helps visualise the BPE procedure:
|
59 |
+
```python
|
60 |
+
from tiktoken._educational import *
|
61 |
+
|
62 |
+
# Train a BPE tokeniser on a small amount of text
|
63 |
+
enc = train_simple_encoding()
|
64 |
+
|
65 |
+
# Visualise how the GPT-4 encoder encodes text
|
66 |
+
enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
|
67 |
+
enc.encode("hello world e")
|
68 |
+
```
|
69 |
+
|
70 |
+
|
71 |
+
## Extending tiktoken
|
72 |
+
|
73 |
+
You may wish to extend `tiktoken` to support new encodings. There are two ways to do this.
|
74 |
+
|
75 |
+
|
76 |
+
**Create your `Encoding` object exactly the way you want and simply pass it around.**
|
77 |
+
|
78 |
+
```python
|
79 |
+
cl100k_base = tiktoken.get_encoding("cl100k_base")
|
80 |
+
|
81 |
+
# In production, load the arguments directly instead of accessing private attributes
|
82 |
+
# See openai_public.py for examples of arguments for specific encodings
|
83 |
+
enc = tiktoken.Encoding(
|
84 |
+
# If you're changing the set of special tokens, make sure to use a different name
|
85 |
+
# It should be clear from the name what behaviour to expect.
|
86 |
+
name="cl100k_im",
|
87 |
+
pat_str=cl100k_base._pat_str,
|
88 |
+
mergeable_ranks=cl100k_base._mergeable_ranks,
|
89 |
+
special_tokens={
|
90 |
+
**cl100k_base._special_tokens,
|
91 |
+
"<|im_start|>": 100264,
|
92 |
+
"<|im_end|>": 100265,
|
93 |
+
}
|
94 |
+
)
|
95 |
+
```
|
96 |
+
|
97 |
+
**Use the `tiktoken_ext` plugin mechanism to register your `Encoding` objects with `tiktoken`.**
|
98 |
+
|
99 |
+
This is only useful if you need `tiktoken.get_encoding` to find your encoding; otherwise, prefer
|
100 |
+
option 1.
|
101 |
+
|
102 |
+
To do this, you'll need to create a namespace package under `tiktoken_ext`.
|
103 |
+
|
104 |
+
Lay out your project like this, making sure to omit the `tiktoken_ext/__init__.py` file:
|
105 |
+
```
|
106 |
+
my_tiktoken_extension
|
107 |
+
├── tiktoken_ext
|
108 |
+
│ └── my_encodings.py
|
109 |
+
└── setup.py
|
110 |
+
```
|
111 |
+
|
112 |
+
`my_encodings.py` should be a module that contains a variable named `ENCODING_CONSTRUCTORS`.
|
113 |
+
This is a dictionary from an encoding name to a function that takes no arguments and returns
|
114 |
+
arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see
|
115 |
+
`tiktoken_ext/openai_public.py`. For precise details, see `tiktoken/registry.py`.
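
As a rough illustration, a `my_encodings.py` could look like the sketch below. The encoding name `my_byte_level`, the toy 256-entry vocabulary and the split pattern are all assumptions made for this example; the dictionary keys mirror the keyword arguments passed to `tiktoken.Encoding` in the example earlier in this README.

```python
# tiktoken_ext/my_encodings.py (hypothetical example module)


def my_byte_level():
    """Return keyword arguments for tiktoken.Encoding (toy vocabulary)."""
    # One token per possible byte value, so any input can be encoded.
    # A real encoding would load trained merge ranks here instead.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "my_byte_level",
        # Assumed split pattern: runs of whitespace or runs of non-whitespace.
        "pat_str": r"\s+|\S+",
        "mergeable_ranks": mergeable_ranks,
        # Special token ids must not collide with the mergeable ranks.
        "special_tokens": {"<|endoftext|>": 256},
    }


# Maps an encoding name to a zero-argument constructor function.
ENCODING_CONSTRUCTORS = {
    "my_byte_level": my_byte_level,
}
```

The constructor is a function rather than a ready-made dictionary, so the vocabulary only has to be built when the encoding is actually requested.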
|
116 |
+
|
117 |
+
Your `setup.py` should look something like this:
|
118 |
+
```python
|
119 |
+
from setuptools import setup, find_namespace_packages
|
120 |
+
|
121 |
+
setup(
|
122 |
+
name="my_tiktoken_extension",
|
123 |
+
packages=find_namespace_packages(include=['tiktoken_ext*']),
|
124 |
+
install_requires=["tiktoken"],
|
125 |
+
...
|
126 |
+
)
|
127 |
+
```
|
128 |
+
|
129 |
+
Then simply `pip install ./my_tiktoken_extension` and you should be able to use your
|
130 |
+
custom encodings! Make sure **not** to use an editable install.
|
mapa_token_especial.json
ADDED
@@ -0,0 +1,128 @@
|
1 |
+
#!/bin/bash
|
2 |
+
# run coinbash.sh
|
3 |
+
# - Bash Script
|
4 |
+
# - CLI
|
5 |
+
# - A bash script (CLI) for displaying crypto currencies market data in a terminal
|
6 |
+
# - Tested on Debian and Ubuntu
|
7 |
+
# - Dependencies: bash, curl, jq, coinmarketcap-API-key
|
8 |
+
# - Uses cloud API of https://pro-api.coinmarketcap.com/v1
|
9 |
+
# - YOU MUST HAVE YOUR OWN coinmarketcap-API-key, as of Oct 2020 you can get one for free at coinmarketcap.com
|
10 |
+
# - set the global environment variable COINMARKETCAP_API_KEY to your personal coinmarketcap-API-key,
|
11 |
+
# - e.g export COINMARKETCAP_API_KEY="your-coinmarketcap-API-key-here"
|
12 |
+
# - keywords: CLI, command-line, terminal, bash, market-data, ticker, price-tracker, marketcap, crypto, crypto currencies, cryptocurrency, bitcoin, btc, ethereum
|
13 |
+
#
|
14 |
+
# License: CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0/
|
15 |
+
#
|
16 |
+
|
17 |
+
########## GENERAL INFO ##########
|
18 |
+
|
19 |
+
#
|
20 |
+
# API: https://coinmarketcap.com/api/
|
21 |
+
# (API key removed; set COINMARKETCAP_API_KEY in your environment as described above)
|
22 |
+
# https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest?convert=USD&limit=2&start=1
|
23 |
+
# Returns something like:
|
24 |
+
# {"status":{"timestamp":"2020-10-02T12:10:29.629Z","error_code":0,"error_message":null,"elapsed":9,"credit_count":1,"notice":null,"total_count":3560},"data":[{"id":1,"name":"Bitcoin","symbol":"BTC","slug":"bitcoin","num_market_pairs":9315,"date_added":"2013-04-28T00:00:00.000Z","tags":["mineable","pow","sha-256","store-of-value","state-channels"],"max_supply":21000000,"circulating_supply":18505718,"total_supply":18505718,"platform":null,"cmc_rank":1,"last_updated":"2020-10-02T12:09:30.000Z","quote":{"USD":{"price":10471.2855252,"volume_24h":26623814611.304,"percent_change_1h":-0.0184301,"percent_change_24h":-3.8861,"percent_change_7d":-1.68497,"market_cap":193778657026.8331,"last_updated":"2020-10-02T12:09:30.000Z"}}},{"id":1027,"name":"Ethereum","symbol":"ETH","slug":"ethereum","num_market_pairs":6043,"date_added":"2015-08-07T00:00:00.000Z","tags":["mineable","pow","smart-contracts","binance-chain"],"max_supply":null,"circulating_supply":112840913.124,"total_supply":112840913.124,"platform":null,"cmc_rank":2,"last_updated":"2020-10-02T12:09:23.000Z","quote":{"USD":{"price":339.400890152,"volume_24h":15156595436.1756,"percent_change_1h":-0.00356475,"percent_change_24h":-7.8024,"percent_change_7d":-1.45411,"market_cap":38298306359.8501,"last_updated":"2020-10-02T12:09:23.000Z"}}}]}
|
25 |
+
#
|
26 |
+
# cat /tmp/coinbash.sh.tmp.json | jq [.data[0]] gives something like
|
27 |
+
: '[
|
28 |
+
{
|
29 |
+
"id": 1,
|
30 |
+
"name": "Bitcoin",
|
31 |
+
"symbol": "BTC",
|
32 |
+
"slug": "bitcoin",
|
33 |
+
"num_market_pairs": 9315,
|
34 |
+
"date_added": "2013-04-28T00:00:00.000Z",
|
35 |
+
"tags": [
|
36 |
+
"mineable",
|
37 |
+
"pow",
|
38 |
+
"sha-256",
|
39 |
+
"store-of-value",
|
40 |
+
"state-channels"
|
41 |
+
],
|
42 |
+
"max_supply": 21000000,
|
43 |
+
"circulating_supply": 18505718,
|
44 |
+
"total_supply": 18505718,
|
45 |
+
"platform": null,
|
46 |
+
"cmc_rank": 1,
|
47 |
+
"last_updated": "2020-10-02T12:09:30.000Z",
|
48 |
+
"quote": {
|
49 |
+
"USD": {
|
50 |
+
"price": 10471.2855252,
|
51 |
+
"volume_24h": 26623814611.304,
|
52 |
+
"percent_change_1h": -0.0184301,
|
53 |
+
"percent_change_24h": -3.8861,
|
54 |
+
"percent_change_7d": -1.68497,
|
55 |
+
"market_cap": 193778657026.8331,
|
56 |
+
"last_updated": "2020-10-02T12:09:30.000Z"
|
57 |
+
}
|
58 |
+
}
|
59 |
+
}
|
60 |
+
]'
|
61 |
+
#
|
62 |
+
# cat /tmp/coinbash.sh.tmp.json | jq [.data[1]][].name gives something like "Ethereum"
|
63 |
+
# cat /tmp/coinbash.sh.tmp.json | jq [.data[1]][].quote.USD.price gives something like 339.400890152
|
64 |
+
|
65 |
+
# https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest?convert=USD&slug=bitcoin
|
66 |
+
# Returns something like:
|
67 |
+
# {"status":{"timestamp":"2020-10-02T12:39:21.288Z","error_code":0,"error_message":null,"elapsed":30,"credit_count":1,"notice":null},"data":{"1":{"id":1,"name":"Bitcoin","symbol":"BTC","slug":"bitcoin","num_market_pairs":9315,"date_added":"2013-04-28T00:00:00.000Z","tags":["mineable","pow","sha-256","store-of-value","state-channels"],"max_supply":21000000,"circulating_supply":18505743,"total_supply":18505743,"is_active":1,"platform":null,"cmc_rank":1,"is_fiat":0,"last_updated":"2020-10-02T12:38:21.000Z","quote":{"USD":{"price":10491.9489757,"volume_24h":26838808649.2375,"percent_change_1h":0.12782,"percent_change_24h":-3.70075,"percent_change_7d":-1.5155,"market_cap":194161311313.41742,"last_updated":"2020-10-02T12:38:21.000Z"}}}}}
|
68 |
+
#
|
69 |
+
# cat "/tmp/coinbash.sh.tmp.json.part" | jq [.data]
|
70 |
+
# shellcheck disable=SC2016
|
71 |
+
: '[
|
72 |
+
{
|
73 |
+
"1": {
|
74 |
+
"id": 1,
|
75 |
+
"name": "Bitcoin",
|
76 |
+
"symbol": "BTC",
|
77 |
+
"slug": "bitcoin",
|
78 |
+
"num_market_pairs": 9315,
|
79 |
+
"date_added": "2013-04-28T00:00:00.000Z",
|
80 |
+
"tags": [
|
81 |
+
"mineable",
|
82 |
+
"pow",
|
83 |
+
"sha-256",
|
84 |
+
"store-of-value",
|
85 |
+
"state-channels"
|
86 |
+
],
|
87 |
+
"max_supply": 21000000,
|
88 |
+
"circulating_supply": 18505743,
|
89 |
+
"total_supply": 18505743,
|
90 |
+
"is_active": 1,
|
91 |
+
"platform": null,
|
92 |
+
"cmc_rank": 1,
|
93 |
+
"is_fiat": 0,
|
94 |
+
"last_updated": "2020-10-02T12:38:21.000Z",
|
95 |
+
"quote": {
|
96 |
+
"USD": {
|
97 |
+
"price": 10491.9489757,
|
98 |
+
"volume_24h": 26838808649.2375,
|
99 |
+
"percent_change_1h": 0.12782,
|
100 |
+
"percent_change_24h": -3.70075,
|
101 |
+
"percent_change_7d": -1.5155,
|
102 |
+
"market_cap": 194161311313.41742,
|
103 |
+
"last_updated": "2020-10-02T12:38:21.000Z"
|
104 |
+
}
|
105 |
+
}
|
106 |
+
}
|
107 |
+
}
|
108 |
+
]'
|
109 |
+
|
110 |
+
cat "/tmp/coinbash.sh.tmp.json.part" | jq "[.data][] | keys"| jq .[] # gets the id, name
|
111 |
+
"1"
|
112 |
+
|
113 |
+
cat "/tmp/coinbash.sh.tmp.json.part" | jq "[.data][] | keys"| jq .[] # gets the id, name
|
114 |
+
"1"
|
115 |
+
key=$(cat "/tmp/coinbash.sh.tmp.json.part" | jq "[.data][] | keys" | jq .[]) # assign the id, name
|
116 |
+
echo $key
|
117 |
+
"1"
|
118 |
+
cat "/tmp/coinbash.sh.tmp.json.part" | jq [.data][].$key
|
119 |
+
{
|
120 |
+
"id": 1,
|
121 |
+
"name": "Bitcoin",
|
122 |
+
"symbol": "BTC",
|
123 |
+
"cripto_type": "bitcoin
|
124 |
+
"address_added": wallet
|
125 |
+
"0x84671C70fE41Ef5C16BC4F225bFAe2fD362aC65c"
|
126 |
+
Key priv:
|
127 |
+
"<redacted>"
|
128 |
+
|
pyproject.toml
ADDED
@@ -0,0 +1,45 @@
|
1 |
+
[project]
|
2 |
+
name = "tiktoken"
|
3 |
+
version = "0.6.0"
|
4 |
+
description = "tiktoken is a fast BPE tokeniser for use with OpenAI's models"
|
5 |
+
readme = "README.md"
|
6 |
+
license = {file = "LICENSE"}
|
7 |
+
authors = [{name = "Shantanu Jain"}, {email = "shantanu@openai.com"}]
|
8 |
+
dependencies = ["regex>=2022.1.18", "requests>=2.26.0"]
|
9 |
+
optional-dependencies = {blobfile = ["blobfile>=2"]}
|
10 |
+
requires-python = ">=3.8"
|
11 |
+
|
12 |
+
[project.urls]
|
13 |
+
homepage = "https://github.com/openai/tiktoken"
|
14 |
+
repository = "https://github.com/openai/tiktoken"
|
15 |
+
changelog = "https://github.com/openai/tiktoken/blob/main/CHANGELOG.md"
|
16 |
+
|
17 |
+
[build-system]
|
18 |
+
build-backend = "setuptools.build_meta"
|
19 |
+
requires = ["setuptools>=62.4", "wheel", "setuptools-rust>=1.5.2"]
|
20 |
+
|
21 |
+
[tool.cibuildwheel]
|
22 |
+
build-frontend = "build"
|
23 |
+
build-verbosity = 1
|
24 |
+
|
25 |
+
linux.before-all = "curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y"
|
26 |
+
linux.environment = { PATH = "$PATH:$HOME/.cargo/bin" }
|
27 |
+
macos.before-all = "rustup target add aarch64-apple-darwin"
|
28 |
+
|
29 |
+
skip = [
|
30 |
+
"*-manylinux_i686",
|
31 |
+
"*-musllinux_i686",
|
32 |
+
"*-win32",
|
33 |
+
]
|
34 |
+
macos.archs = ["x86_64", "arm64"]
|
35 |
+
# When cross-compiling on Intel, it is not possible to test arm64 wheels.
|
36 |
+
# Warnings will be silenced with following CIBW_TEST_SKIP
|
37 |
+
test-skip = "*-macosx_arm64"
|
38 |
+
|
39 |
+
before-test = "pip install pytest hypothesis"
|
40 |
+
test-command = "pytest {project}/tests --import-mode=append"
|
41 |
+
|
42 |
+
[[tool.cibuildwheel.overrides]]
|
43 |
+
select = "*linux_aarch64"
|
44 |
+
test-command = """python -c 'import tiktoken; enc = tiktoken.get_encoding("gpt2"); assert enc.encode("hello world") == [31373, 995]'"""
|
45 |
+
|
setup.py
ADDED
@@ -0,0 +1,18 @@
|
1 |
+
from setuptools import setup
|
2 |
+
from setuptools_rust import Binding, RustExtension
|
3 |
+
|
4 |
+
setup(
|
5 |
+
name="tiktoken",
|
6 |
+
rust_extensions=[
|
7 |
+
RustExtension(
|
8 |
+
"tiktoken._tiktoken",
|
9 |
+
binding=Binding.PyO3,
|
10 |
+
# Between our use of editable installs and wanting to use Rust for performance sensitive
|
11 |
+
# code, it makes sense to just always use --release
|
12 |
+
debug=False,
|
13 |
+
)
|
14 |
+
],
|
15 |
+
package_data={"tiktoken": ["py.typed"]},
|
16 |
+
packages=["tiktoken", "tiktoken_ext"],
|
17 |
+
zip_safe=False,
|
18 |
+
)
|