---
language:
- en
library_name: fasttext
pipeline_tag: text-classification
tags:
- text
- semantic-similarity
- earnings-call-transcripts
- word2vec
- fasttext
widget:
- text: "transformation"
  example_title: "transformation"
- text: "sustainability"
  example_title: "sustainability"
- text: "turnaround"
  example_title: "turnaround"
- text: "disruption"
  example_title: "disruption"
---

# EarningsCall2Vec

**EarningsCall2Vec** is a [`fastText`](https://fasttext.cc/) word embedding model trained via [`Gensim`](https://radimrehurek.com/gensim/). It maps each token in the vocabulary to a dense, 300-dimensional vector space and is designed for **semantic search**. More details about the training procedure can be found [below](#model-training).

## Background

Context on the project.

## Usage

The model is intended to be used for semantic search: it encodes the search query in a dense vector space and finds its semantic neighbours, i.e., tokens that frequently occur within similar contexts in the underlying training data. A query should consist of a single word. For bi-, tri-, or even four-grams, the quality of the output depends on the presence of the query token in the model's vocabulary. Multi-word queries are concatenated by an underscore (e.g., "machine_learning" or "artificial_intelligence").
## Usage (API)

```python
import json
import requests

API_TOKEN = "<api_token>"  # your Hugging Face API token
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/simonschoe/call2vec"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

query({"inputs": "transformation"})
```

## Model Training

```python
import logging

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.fasttext import save_facebook_model
from gensim.models.word2vec import LineSentence

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class MyCallback(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        if (self.epoch % 10) == 0:
            # save in gensim format
            model.save("<path/to/model>")

    def on_train_end(self, model):
        # save in binary format for upload to huggingface
        save_facebook_model(model, "<path/to/model>.bin")

# train model
model.train(
    corpus_iterable=LineSentence("<path/to/corpus>"),
    total_words=model.corpus_total_words,
    total_examples=model.corpus_count,
    epochs=50,
    callbacks=[MyCallback()],
)
```

**Model statistics:**
- Vocabulary size: 64,891
- Min. token frequency: 10
- Embedding dimensions: 300
- Number of epochs: 50