Files changed (4)
  1. .gitattributes +35 -0
  2. LICENSE +0 -125
  3. NOTICE +0 -1
  4. README.md +28 -61
.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
LICENSE DELETED
@@ -1,125 +0,0 @@
1
- LLAMA 2 COMMUNITY LICENSE AGREEMENT
2
- Llama 2 Version Release Date: July 18, 2023
3
-
4
- "Agreement" means the terms and conditions for use, reproduction, distribution and
5
- modification of the Llama Materials set forth herein.
6
-
7
- "Documentation" means the specifications, manuals and documentation
8
- accompanying Llama 2 distributed by Meta at ai.meta.com/resources/models-and-
9
- libraries/llama-downloads/.
10
-
11
- "Licensee" or "you" means you, or your employer or any other person or entity (if
12
- you are entering into this Agreement on such person or entity's behalf), of the age
13
- required under applicable laws, rules or regulations to provide legal consent and that
14
- has legal authority to bind your employer or such other person or entity if you are
15
- entering in this Agreement on their behalf.
16
-
17
- "Llama 2" means the foundational large language models and software and
18
- algorithms, including machine-learning model code, trained model weights,
19
- inference-enabling code, training-enabling code, fine-tuning enabling code and other
20
- elements of the foregoing distributed by Meta at ai.meta.com/resources/models-and-
21
- libraries/llama-downloads/.
22
-
23
- "Llama Materials" means, collectively, Meta's proprietary Llama 2 and
24
- Documentation (and any portion thereof) made available under this Agreement.
25
-
26
- "Meta" or "we" means Meta Platforms Ireland Limited (if you are located in or, if you
27
- are an entity, your principal place of business is in the EEA or Switzerland) and Meta
28
- Platforms, Inc. (if you are located outside of the EEA or Switzerland).
29
-
30
- By clicking "I Accept" below or by using or distributing any portion or element of the
31
- Llama Materials, you agree to be bound by this Agreement.
32
-
33
- 1. License Rights and Redistribution.
34
-
35
- a. Grant of Rights. You are granted a non-exclusive, worldwide, non-
36
- transferable and royalty-free limited license under Meta's intellectual property or
37
- other rights owned by Meta embodied in the Llama Materials to use, reproduce,
38
- distribute, copy, create derivative works of, and make modifications to the Llama
39
- Materials.
40
-
41
- b. Redistribution and Use.
42
-
43
- i. If you distribute or make the Llama Materials, or any derivative works
44
- thereof, available to a third party, you shall provide a copy of this Agreement to such
45
- third party.
46
- ii. If you receive Llama Materials, or any derivative works thereof, from
47
- a Licensee as part of an integrated end user product, then Section 2 of this
48
- Agreement will not apply to you.
49
-
50
- iii. You must retain in all copies of the Llama Materials that you
51
- distribute the following attribution notice within a "Notice" text file distributed as a
52
- part of such copies: "Llama 2 is licensed under the LLAMA 2 Community License,
53
- Copyright (c) Meta Platforms, Inc. All Rights Reserved."
54
-
55
- iv. Your use of the Llama Materials must comply with applicable laws
56
- and regulations (including trade compliance laws and regulations) and adhere to the
57
- Acceptable Use Policy for the Llama Materials (available at
58
- https://ai.meta.com/llama/use-policy), which is hereby incorporated by reference into
59
- this Agreement.
60
-
61
- v. You will not use the Llama Materials or any output or results of the
62
- Llama Materials to improve any other large language model (excluding Llama 2 or
63
- derivative works thereof).
64
-
65
- 2. Additional Commercial Terms. If, on the Llama 2 version release date, the
66
- monthly active users of the products or services made available by or for Licensee,
67
- or Licensee's affiliates, is greater than 700 million monthly active users in the
68
- preceding calendar month, you must request a license from Meta, which Meta may
69
- grant to you in its sole discretion, and you are not authorized to exercise any of the
70
- rights under this Agreement unless or until Meta otherwise expressly grants you
71
- such rights.
72
-
73
- 3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE
74
- LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE
75
- PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
76
- EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY
77
- WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR
78
- FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE
79
- FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING
80
- THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR
81
- USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.
82
-
83
- 4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE
84
- LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT,
85
- NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS
86
- AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL,
87
- CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN
88
- IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF
89
- ANY OF THE FOREGOING.
90
-
91
- 5. Intellectual Property.
92
-
93
- a. No trademark licenses are granted under this Agreement, and in
94
- connection with the Llama Materials, neither Meta nor Licensee may use any name
95
- or mark owned by or associated with the other or any of its affiliates, except as
96
- required for reasonable and customary use in describing and redistributing the
97
- Llama Materials.
98
-
99
- b. Subject to Meta's ownership of Llama Materials and derivatives made by or
100
- for Meta, with respect to any derivative works and modifications of the Llama
101
- Materials that are made by you, as between you and Meta, you are and will be the
102
- owner of such derivative works and modifications.
103
-
104
- c. If you institute litigation or other proceedings against Meta or any entity
105
- (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama
106
- Materials or Llama 2 outputs or results, or any portion of any of the foregoing,
107
- constitutes infringement of intellectual property or other rights owned or licensable
108
- by you, then any licenses granted to you under this Agreement shall terminate as of
109
- the date such litigation or claim is filed or instituted. You will indemnify and hold
110
- harmless Meta from and against any claim by any third party arising out of or related
111
- to your use or distribution of the Llama Materials.
112
-
113
- 6. Term and Termination. The term of this Agreement will commence upon your
114
- acceptance of this Agreement or access to the Llama Materials and will continue in
115
- full force and effect until terminated in accordance with the terms and conditions
116
- herein. Meta may terminate this Agreement if you are in breach of any term or
117
- condition of this Agreement. Upon termination of this Agreement, you shall delete
118
- and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the
119
- termination of this Agreement.
120
-
121
- 7. Governing Law and Jurisdiction. This Agreement will be governed and
122
- construed under the laws of the State of California without regard to choice of law
123
- principles, and the UN Convention on Contracts for the International Sale of Goods
124
- does not apply to this Agreement. The courts of California shall have exclusive
125
- jurisdiction of any dispute arising out of this Agreement.
NOTICE DELETED
@@ -1 +0,0 @@
1
- Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved
README.md CHANGED
@@ -1,65 +1,38 @@
1
  ---
2
  license: apache-2.0
3
- datasets:
4
- - JetBrains/KExercises
5
- base_model: meta-llama/CodeLlama-7b-hf
6
- results:
7
- - task:
8
- type: text-generation
9
- dataset:
10
- name: MultiPL-HumanEval (Kotlin)
11
- type: openai_humaneval
12
- metrics:
13
- - name: pass@1
14
- type: pass@1
15
- value: 42.24
16
- tags:
17
- - code
18
  ---
19
 
20
  # Kexer models
21
 
22
- Kexer models are a collection of open-source generative text models fine-tuned on the [Kotlin Exercices](https://huggingface.co/datasets/JetBrains/KExercises) dataset.
23
- This is a repository for the fine-tuned **CodeLlama-7b** model in the *Hugging Face Transformers* format.
24
 
25
- # How to use
26
 
27
  ```python
28
- from transformers import AutoModelForCausalLM, AutoTokenizer
29
-
30
- # Load pre-trained model and tokenizer
31
- model_name = 'JetBrains/CodeLlama-7B-Kexer'
32
- tokenizer = AutoTokenizer.from_pretrained(model_name)
33
- model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda')
34
-
35
- # Create and encode input
36
- input_text = """\
37
- This function takes an integer n and returns factorial of a number:
38
- fun factorial(n: Int): Int {\
39
- """
40
- input_ids = tokenizer.encode(
41
- input_text, return_tensors='pt'
42
- ).to('cuda')
43
-
44
- # Generate
45
- output = model.generate(
46
- input_ids, max_length=60, num_return_sequences=1,
47
- early_stopping=True, pad_token_id=tokenizer.eos_token_id,
48
- )
49
-
50
- # Decode output
51
- generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
52
- print(generated_text)
53
- ```
54
 
55
- As with the base model, we can use FIM. To do this, the following format must be used:
56
- ```
57
- '<PRE> ' + prefix + ' <SUF> ' + suffix + ' <MID>'
58
  ```
59
 
60
  # Training setup
61
 
62
- The model was trained on one A100 GPU with the following hyperparameters:
63
 
64
  | **Hyperparameter** | **Value** |
65
  |:---------------------------:|:----------------------------------------:|
@@ -67,25 +40,19 @@ The model was trained on one A100 GPU with the following hyperparameters:
67
  | `max_lr` | 1e-4 |
68
  | `scheduler` | linear |
69
  | `total_batch_size` | 256 (~130K tokens per step) |
70
- | `num_epochs` | 4 |
71
 
72
- More details about fine-tuning can be found in the technical report (coming soon!).
73
 
74
  # Fine-tuning data
75
 
76
- For tuning this model, we used 15K exmaples from the synthetically generated [Kotlin Exercices](https://huggingface.co/datasets/JetBrains/KExercises) dataset. Every example follows the HumanEval format. In total, the dataset contains about 3.5M tokens.
77
 
78
  # Evaluation
79
 
80
- For evaluation, we used the [Kotlin HumanEval](https://huggingface.co/datasets/JetBrains/Kotlin_HumanEval) dataset, which contains all 161 tasks from HumanEval translated into Kotlin by human experts. You can find more details about the pre-processing necessary to obtain our results, including the code for running, on the [datasets's page](https://huggingface.co/datasets/JetBrains/Kotlin_HumanEval).
81
-
82
- Here are the results of our evaluation:
83
-
84
- | **Model name** | **Kotlin HumanEval Pass Rate** |
85
- |:---------------------------:|:----------------------------------------:|
86
- | `CodeLlama-7B` | 26.89 |
87
- | `CodeLlama-7B-Kexer` | **42.24** |
88
 
89
- # Ethical considerations and limitations
90
 
91
- CodeLlama-7B-Kexer is a new technology that carries risks with use. The testing conducted to date has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, CodeLlama-7B-Kexer's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. The model was fine-tuned on a specific data format (Kotlin tasks), and deviation from this format can also lead to inaccurate or undesirable responses to user queries. Therefore, before deploying any applications of CodeLlama-7B-Kexer, developers should perform safety testing and tuning tailored to their specific applications of the model.
1
  ---
2
  license: apache-2.0
3
  ---
4
 
5
  # Kexer models
6
 
7
+ Kexer models is a collection of fine-tuned open-source generative text models fine-tuned on Kotlin Exercices dataset.
8
+ This is a repository for fine-tuned CodeLlama-7b model in the Hugging Face Transformers format.
9
 
10
+ # Model use
11
 
12
  ```python
13
+ from transformers import AutoModelForCausalLM, AutoTokenizer
14
 
15
+ # Load pre-trained model and tokenizer
16
+ model_name = 'JetBrains/CodeLlama-7B-Kexer' # Replace with the desired model name
17
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
18
+ model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
19
+
20
+ # Encode input text
21
+ input_text = """This function takes an integer n and returns factorial of a number:
22
+ fun factorial(n: Int): Int {"""
23
+ input_ids = tokenizer.encode(input_text, return_tensors='pt').to('cuda')
24
+
25
+ # Generate text
26
+ output = model.generate(input_ids, max_length=150, num_return_sequences=1, no_repeat_ngram_size=2, early_stopping=True)
27
+
28
+ # Decode and print the generated text
29
+ generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
30
+ print(generated_text)
31
  ```
32
 
33
  # Training setup
34
 
35
+ The model was trained on one A100 GPU with following hyperparameters:
36
 
37
  | **Hyperparameter** | **Value** |
38
  |:---------------------------:|:----------------------------------------:|
 
40
  | `max_lr` | 1e-4 |
41
  | `scheduler` | linear |
42
  | `total_batch_size` | 256 (~130K tokens per step) |
 
43
 
 
44
 
45
  # Fine-tuning data
46
 
47
+ For this model we used 15K exmaples of Kotlin Exercices dataset {TODO: link!}. For more information about the dataset follow th link.
48
 
49
  # Evaluation
50
 
51
+ To evaluate we used Kotlin Humaneval (more infromation here)
52
 
53
+ Fine-tuned model:
54
 
55
+ | **Model name** | **Kotlin HumanEval Pass Rate** | **Kotlin Completion** |
56
+ |:---------------------------:|:----------------------------------------:|:----------------------------------------:|
57
+ | `base model` | 26.89 | 0.388 |
58
+ | `fine-tuned model` | 42.24 | 0.344 |