Financbase committed on
Commit 445a672 · verified · 1 Parent(s): 2d7f3bb

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +221 -0
README.md ADDED
@@ -0,0 +1,221 @@
1
+ ---
2
+ license: mit
3
+ task_categories:
4
+ - text-generation
5
+ - text-classification
6
+ - summarization
7
+ language:
8
+ - en
9
+ tags:
10
+ - finance
11
+ - financial-qa
12
+ - sentiment-analysis
13
+ - summarization
14
+ - instruction-tuning
15
+ - sec-filings
16
+ - 10-k
17
+ size_categories:
18
+ - 1K<n<10K
19
+ configs:
20
+ - config_name: default
21
+   data_files:
22
+   - split: train
23
+     path: financial_qa.jsonl
24
+ ---
25
+
26
+ # Financbase Financial QA Dataset
27
+
28
+ ## Dataset Description
29
+
30
+ The Financbase Financial QA Dataset is a curated collection of financial question-answering examples for training large language models on financial-domain tasks. It supports three tasks: question answering, sentiment analysis, and document summarization.
31
+
32
+ ### Dataset Summary
33
+
34
+ - **Total Examples**: 1,000+ financial Q&A pairs
35
+ - **Format**: JSONL (JSON Lines)
36
+ - **Language**: English
37
+ - **Domain**: Financial services, SEC filings, investment analysis
38
+ - **Tasks**: Question answering, sentiment classification, summarization
39
+
40
+ ### Dataset Structure
41
+
42
+ Each example follows the instruction-tuning format with three fields:
43
+
44
+ ```json
45
+ {
46
+   "instruction": "Answer the question clearly for a retail investor.",
47
+   "input": "What is EBITDA?",
48
+   "output": "EBITDA stands for Earnings Before Interest, Taxes, Depreciation, and Amortization. It's a measure of a company's operating performance that excludes non-operating expenses..."
49
+ }
50
+ ```
51
+
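A minimal sketch of how one line of the JSONL file could be checked against the three-field structure shown above. The helper name `parse_record` and the shortened sample line are illustrative assumptions, not part of the dataset tooling.

```python
import json

# Field names taken from the example record above.
REQUIRED_FIELDS = ("instruction", "input", "output")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and verify the three fields are present strings.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each JSONL line must be a JSON object")
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return record


# Shortened version of the example record, for demonstration only.
sample = json.dumps({
    "instruction": "Answer the question clearly for a retail investor.",
    "input": "What is EBITDA?",
    "output": "EBITDA stands for Earnings Before Interest, Taxes, "
              "Depreciation, and Amortization.",
})

record = parse_record(sample)
print(record["input"])
```

Running a check like this over every line before training surfaces malformed records early instead of partway through a fine-tuning job.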
52
+ ### Supported Tasks
53
+
54
+ 1. **Financial Question Answering**
55
+    - Basic financial concepts (EBITDA, P/E ratio, etc.)
56
+    - Investment terminology
57
+    - Market analysis questions
58
+
59
+ 2. **Sentiment Analysis**
60
+    - Financial news sentiment classification
61
+    - Earnings report sentiment
62
+    - Market outlook analysis
63
+
64
+ 3. **Document Summarization**
65
+    - SEC filing summaries
66
+    - Earnings call summaries
67
+    - Financial report abstracts
68
+
69
+ ## Usage
70
+
71
+ ### Loading the Dataset
72
+
73
+ ```python
74
+ from datasets import load_dataset
75
+
76
+ # Load the dataset
77
+ dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")
78
+
79
+ # Access examples
80
+ for example in dataset:
81
+     print(f"Instruction: {example['instruction']}")
82
+     print(f"Input: {example['input']}")
83
+     print(f"Output: {example['output']}")
84
+ ```
85
+
86
+ ### Training with Transformers
87
+
88
+ ```python
89
+ from transformers import AutoTokenizer, AutoModelForCausalLM
90
+ from datasets import load_dataset
91
+
92
+ # Load dataset
93
+ dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")
94
+
95
+ # Format for training
96
+ def format_example(example):
97
+     return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
98
+
99
+ # Apply formatting
100
+ formatted_dataset = dataset.map(lambda x: {"text": format_example(x)})
101
+ ```
102
+
103
+ ### Using with PEFT/LoRA
104
+
105
+ ```python
106
+ from peft import LoraConfig, get_peft_model
107
+ from transformers import AutoModelForCausalLM
108
+
109
+ # Load base model
110
+ model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
111
+
112
+ # Configure LoRA
113
+ lora_config = LoraConfig(
114
+     r=16,
115
+     lora_alpha=32,
116
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
117
+     lora_dropout=0.05,
118
+     bias="none",
119
+     task_type="CAUSAL_LM"
120
+ )
121
+
122
+ # Apply LoRA
123
+ model = get_peft_model(model, lora_config)
124
+ ```
125
+
126
+ ## Data Fields
127
+
128
+ | Field | Type | Description |
129
+ |-------|------|-------------|
130
+ | `instruction` | string | The task instruction or prompt |
131
+ | `input` | string | The input context or question |
132
+ | `output` | string | The expected response or answer |
133
+
134
+ ## Data Splits
135
+
136
+ - **train**: 1,000+ examples for training
137
+ - **validation**: 100+ examples for validation (planned for a future release; not yet available)
138
+ - **test**: 100+ examples for testing (planned for a future release; not yet available)
139
+
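Since only the `train` split is published so far, held-out data has to be carved out locally for now. The sketch below shows one way to do that with a seeded shuffle. The function name `split_examples`, the 10% fractions, and the seed are assumptions chosen for illustration, not the splits the maintainers plan to release.

```python
import random


def split_examples(examples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle deterministically, then slice into train/validation/test.

    Hypothetical helper; the fractions and seed are illustrative defaults.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_val = int(len(items) * val_fraction)
    n_test = int(len(items) * test_fraction)
    return {
        "validation": items[:n_val],
        "test": items[n_val:n_val + n_test],
        "train": items[n_val + n_test:],
    }


# Stand-in for 1,000 loaded records; real usage would pass the dataset rows.
splits = split_examples(range(1000))
print({name: len(rows) for name, rows in splits.items()})
```

Fixing the seed matters here: metrics are only comparable across runs if every run evaluates on the same held-out examples.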
140
+ ## Data Collection
141
+
142
+ ### Sources
143
+
144
+ - SEC 10-K filings (processed and chunked)
145
+ - Financial news articles
146
+ - Investment research reports
147
+ - Financial education materials
148
+ - Curated financial Q&A pairs
149
+
150
+ ### Preprocessing
151
+
152
+ 1. **Document Chunking**: Long documents are split into chunks of at most 1,800 tokens
153
+ 2. **Section Preservation**: Maintains document structure and headings
154
+ 3. **Quality Filtering**: Removes low-quality or irrelevant examples
155
+ 4. **Format Standardization**: Ensures consistent instruction/input/output format
156
+
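The chunking step can be sketched as follows. This is an illustration only: the card does not say which tokenizer defines a token, so whitespace splitting stands in for it here, and the names `chunk_tokens` and `chunk_text` are hypothetical. Section preservation and quality filtering are not modeled.

```python
def chunk_tokens(tokens, max_tokens=1800):
    """Split a token list into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def chunk_text(text, max_tokens=1800):
    """Chunk text using whitespace tokens.

    Assumption: whitespace splitting is a stand-in for the real tokenizer,
    which the dataset card does not name.
    """
    return [" ".join(chunk) for chunk in chunk_tokens(text.split(), max_tokens)]


# A synthetic 4,000-word document, for demonstration only.
document = " ".join(f"word{i}" for i in range(4000))
chunks = chunk_text(document)
print([len(chunk.split()) for chunk in chunks])
```

With a subword tokenizer the same slicing would apply to token IDs rather than words, so the resulting chunk boundaries would differ from this word-based approximation.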
157
+ ## Compliance and Safety
158
+
159
+ ### Financial Compliance
160
+
161
+ - **No Investment Advice**: Dataset does not contain personalized investment recommendations
162
+ - **Educational Purpose**: Designed for educational and research use
163
+ - **Source Attribution**: All examples traceable to original sources
164
+ - **Regulatory Compliance**: Follows financial data handling best practices
165
+
166
+ ### Content Filtering
167
+
168
+ - Removed personally identifiable information (PII)
169
+ - Filtered out actionable trading directives
170
+ - Excluded copyrighted material
171
+ - Sanitized sensitive financial data
172
+
173
+ ## Evaluation
174
+
175
+ ### Metrics
176
+
177
+ - **Perplexity**: How well a model predicts held-out financial text (lower is better)
178
+ - **BLEU Score**: N-gram overlap between generated and reference responses
179
+ - **Accuracy**: Classification accuracy for sentiment analysis
180
+ - **ROUGE Score**: Recall-oriented n-gram overlap for summarization quality
181
+
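Two of the metrics above are simple enough to sketch directly. The example below computes classification accuracy and perplexity from per-token natural-log probabilities. The function names and toy inputs are assumptions for illustration. BLEU and ROUGE are omitted because they are normally computed with a dedicated library.

```python
import math


def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy sentiment predictions (3 of 4 correct), for demonstration only.
preds = ["positive", "neutral", "negative", "neutral"]
gold = ["positive", "negative", "negative", "neutral"]
print(accuracy(preds, gold))

# A model assigning probability 0.25 to every token has perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))
```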
182
+ ### Benchmark Tasks
183
+
184
+ 1. **Financial QA**: Answer financial questions accurately
185
+ 2. **Sentiment Analysis**: Classify financial sentiment (positive/negative/neutral)
186
+ 3. **Summarization**: Summarize financial documents concisely
187
+
188
+ ## Limitations
189
+
190
+ - **Language**: English only
191
+ - **Domain**: Primarily US financial markets
192
+ - **Temporal**: Data from 2020-2024 (may become outdated)
193
+ - **Bias**: Reflects the biases and limitations of its source material
194
+
195
+ ## Citation
196
+
197
+ ```bibtex
198
+ @dataset{financbase_financial_qa_2024,
199
+ title={Financbase Financial QA Dataset},
200
+ author={Financbase Team},
201
+ year={2024},
202
+ url={https://huggingface.co/datasets/Financbase/financbase-10k-jsonl},
203
+ license={MIT}
204
+ }
205
+ ```
206
+
207
+ ## License
208
+
209
+ This dataset is released under the MIT License. See the LICENSE file for details.
210
+
211
+ ## Contact
212
+
213
+ - **Organization**: Financbase
214
+ - **Repository**: https://huggingface.co/datasets/Financbase/financbase-10k-jsonl
215
+ - **Issues**: Report issues via the Hugging Face Hub
216
+
217
+ ## Changelog
218
+
219
+ - **v0.1** (2024-12-19): Initial release with 1,000+ financial Q&A examples
220
+ - **v0.2** (Planned): Add validation and test splits
221
+ - **v0.3** (Planned): Expand to 10,000+ examples with more diverse sources