# FraleyLabAttachmentBot / ObtainDataEmbedding.py
# (HuggingFace file-page header: author AjithKSenthil, commit 4247f5a,
#  "This will probably be the way we do it", raw/history/blame, 1.49 kB)
# imports
import pandas as pd
import tiktoken
from openai.embeddings_utils import get_embedding
# Embedding model parameters.
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the true maximum for text-embedding-ada-002 is 8191; 8000 leaves headroom

# Load & inspect the dataset.
# Replace "data/chat_transcripts.csv" with the path to your actual data file,
# and 'ChatTranscript', 'Attachment', 'Avoidance' with the actual column names
# of your chat transcripts and attachment scores.
input_datapath = "data/chat_transcripts.csv"
df = pd.read_csv(input_datapath, index_col=0)
df = df[["ChatTranscript", "Attachment", "Avoidance"]]
df = df.dropna()
print(df.head(2))  # quick sanity check of the loaded data (was a bare no-op expression)

# Filter out chat transcripts that are too long to embed; the estimate for the
# maximum number of words would be around 1638 (8191 tokens / 5 tokens per word).
encoding = tiktoken.get_encoding(embedding_encoding)
df["n_tokens"] = df.ChatTranscript.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens]
print(f"{len(df)} transcripts remain after token-length filtering")  # was a bare no-op expression

# Compute one embedding per transcript via the OpenAI API.
# Ensure you have your API key set in your environment per the README:
# https://github.com/openai/openai-python#usage
# NOTE: this issues one API call per row and may take a few minutes.
df["embedding"] = df.ChatTranscript.apply(lambda x: get_embedding(x, engine=embedding_model))
df.to_csv("data/chat_transcripts_with_embeddings.csv")