---
base_model: BAAI/bge-small-en-v1.5
datasets: []
language: []
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:664
- loss:DenoisingAutoEncoderLoss
widget:
- source_sentence: of fresh for in for that,, stream_id
sentences:
- >-
Number of functional/operational toilets for boys with disabilities or
CWSN(Children with special needs)
- >-
Indicates grant for sports and physical education expenditure (in Rs)
spent by the school during the financial year 2022-2023 under Samagra
Shiksha, corresponding to the udise_sch_code.
- >-
Number of fresh enrollments for transgenders in class 11 for that
school. corresponding to udise_sch_code, caste_id, stream_id.
- source_sentence: Unique each associated . This in and.
sentences:
- >-
classes in which language 3 i.e ('lang3' column) is taught as a subject.
Its a comma seperated value.
- >-
Unique identifier code each school, associated with school_name in
sch_master table. This can be joined with udise_sch_code in sch_profile
and sch_facility tables.
- 'Number of assessments happened for primary section/school '
- source_sentence: urinals
sentences:
- >-
Unique identifier code for the schools providing vocational courses
under nsqf and where sectors are available, associated with school name
in sch_master table. This can be joined with udise_sch_code in
sch_profile and sch_facility tables.
- >-
Indicates whether there is a reading corner/space/room in school. Can
only be ['Yes','No']
- 'Number of functional/operational urinals for boys '
- source_sentence: >-
total of in-service training by of that from district and training) the
tch_code_state
sentences:
- >-
Indicates total days of in-service training received by the teacher of
that school from district institute of education and training(diet),
corresponding to the udise_sch_code, tch_name, tch_code_state.
- >-
Unique identifier code for each school. This column is crucial for
aggregating or analyzing data at the school level, such as school-wise
attendance, performance metrics, or demographic information.
- >-
Indicates whether it is a special school, specifically for disabled
students. Is school CWSN ( Children with Special Needs ). This can only
be one of 2 values:['Yes','No']
- source_sentence: >-
The teacher_id column . This essential related teacher absenteeism or will
column
sentences:
- >-
Indicates Urban local body ID as per LGD - Local Government Directory
where the school is present, related to 'lgd_urban_local_body_name'
- 'Number of pucca classrooms in good condition in school '
- >-
The teacher_id column is a unique identifier used to represent
individual teachers. This column is essential for retrieving
teacher-specific information.Queries related to teacher attendance,
absenteeism, or any teacher-level analysis will likely require this
column.
---
SentenceTransformer based on BAAI/bge-small-en-v1.5
This is a sentence-transformers model finetuned from BAAI/bge-small-en-v1.5. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-small-en-v1.5
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 384 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: [Sentence Transformers Documentation](https://sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
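As a rough illustration of what these three modules do, the sketch below reproduces the same pipeline with the plain transformers library: tokenize (lowercasing per the config), encode with the BERT backbone, take the [CLS] token vector (CLS pooling), and L2-normalize it. The repo id is the one from the Usage section; the input sentence is illustrative.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

repo_id = "ravch/fine_tuned_bge_small_en_v1.5_another_data_formate"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)

batch = tokenizer(
    ["Number of functional/operational urinals for boys"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (1, seq_len, 384)

cls_embedding = token_embeddings[:, 0]         # module (1): CLS-token pooling
embedding = F.normalize(cls_embedding, dim=1)  # module (2): Normalize()
print(embedding.shape)                         # torch.Size([1, 384])
```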
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("ravch/fine_tuned_bge_small_en_v1.5_another_data_formate")
# Run inference
sentences = [
    'The teacher_id column . This essential related teacher absenteeism or will column',
    'The teacher_id column is a unique identifier used to represent individual teachers. This column is essential for retrieving teacher-specific information.Queries related to teacher attendance, absenteeism, or any teacher-level analysis will likely require this column. ',
    "Indicates Urban local body ID as per LGD - Local Government Directory where the school is present, related to 'lgd_urban_local_body_name' ",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
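The same embeddings can also drive a small semantic search index over column descriptions, which is the scenario the widget examples above suggest. A minimal sketch, reusing the model loaded in the snippet above; the corpus strings are illustrative:

```python
from sentence_transformers import util

# Illustrative corpus of column descriptions; the query mirrors the widget style.
corpus = [
    "Number of functional/operational urinals for boys",
    "Number of pucca classrooms in good condition in school",
]
query = "urinals"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode(query)

# util.semantic_search returns the top-k corpus hits per query,
# ranked by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
print(corpus[hits[0][0]["corpus_id"]])
```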
Training Details
Training Dataset
Unnamed Dataset
- Size: 664 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 1000 samples:
|         | sentence_0                                           | sentence_1                                           |
|:--------|:-----------------------------------------------------|:-----------------------------------------------------|
| type    | string                                                | string                                                |
| details | min: 3 tokens, mean: 15.88 tokens, max: 127 tokens    | min: 7 tokens, mean: 36.37 tokens, max: 311 tokens    |
- Samples:
| sentence_0 | sentence_1 |
|:-----------|:-----------|
| Number of Girls Defense | Number of Girls Student provided Self Defense training |
| whether is While filtering, must 0 (int active. | Indicate whether school is active or inactive. While filtering only consider active schools, but When asked for total schools must consider active and inactive schools. 0(int) indicates active schools. |
| classes in which language i.e 'lang2 as a subject a comma seperated | classes in which language 2 i.e ('lang2' column) is taught as a subject. Its a comma seperated value. |
- Loss: DenoisingAutoEncoderLoss
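The (sentence_0, sentence_1) pairs above are corrupted/original pairs, which is exactly what DenoisingAutoEncoderLoss (the TSDAE objective) consumes: the encoder embeds the corrupted text and a tied decoder must reconstruct the original. A minimal sketch of how the loss and dataset could be wired up in Sentence Transformers 3.x follows; the example row is taken from the samples above, and the trainer call itself is sketched after the hyperparameter list below.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import DenoisingAutoEncoderLoss

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Column order matters: the first column is the corrupted input the encoder
# sees, the second is the original text the decoder must reconstruct.
train_dataset = Dataset.from_dict({
    "sentence_0": ["Number of Girls Defense"],
    "sentence_1": ["Number of Girls Student provided Self Defense training"],
})

# tie_encoder_decoder=True ties the decoder weights to the BERT encoder,
# as in the TSDAE paper.
loss = DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)
```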
Training Hyperparameters
Non-Default Hyperparameters
- num_train_epochs: 50
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 50
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
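The non-default values (50 epochs, round-robin multi-dataset sampling) plus the batch size of 8 map directly onto SentenceTransformerTrainingArguments. A hedged sketch that completes the loss/dataset sketch from the Training Dataset section; the output directory is illustrative:

```python
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.training_args import MultiDatasetBatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output/tsdae-bge-small",  # illustrative path, not from the original run
    num_train_epochs=50,
    per_device_train_batch_size=8,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)

trainer = SentenceTransformerTrainer(
    model=model,                # from the Training Dataset sketch
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```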
Training Logs
Epoch | Step | Training Loss |
---|---|---|
6.0241 | 500 | 2.0771 |
12.0482 | 1000 | 0.4663 |
18.0723 | 1500 | 0.2979 |
24.0964 | 2000 | 0.2476 |
30.1205 | 2500 | 0.2341 |
36.1446 | 3000 | 0.2321 |
42.1687 | 3500 | 0.2116 |
48.1928 | 4000 | 0.2012 |
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.0.1
- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Accelerate: 0.32.1
- Datasets: 2.21.0
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
DenoisingAutoEncoderLoss
```bibtex
@inproceedings{wang-2021-TSDAE,
    title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning",
    author = "Wang, Kexin and Reimers, Nils and Gurevych, Iryna",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    pages = "671--688",
    url = "https://arxiv.org/abs/2104.06979",
}
```