luisrodriguesphd committed on
Commit
2e113bf
1 Parent(s): a3c1bf2

add a text generation pipeline

README.md CHANGED
@@ -34,7 +34,7 @@ ResumeWorth utilizes a step-by-step process to analyze your professional backgro
34
  ## 3. Getting Started
35
  [Back to ToC](#toc)
36
 
37
- To begin using ResumeWorth, visit the [application](https://huggingface.co/spaces/luisrodriguesphd/resume-worth) deployed on **Hugging Face** and follow the on-screen instructions to enter your job title and upload your resume. The system will guide you through the process and provide you with valuable insights into your resume's market value and potential job opportunities.
38
 
39
  <a name="contributing"/></a>
40
  ## 4. Contributing
 
34
  ## 3. Getting Started
35
  [Back to ToC](#toc)
36
 
37
+ To begin using ResumeWorth, visit the [application](https://huggingface.co/spaces/luisrodriguesphd/resume-worth) hosted on **Hugging Face Spaces** and follow the on-screen instructions to enter your job title and upload your resume. The system will guide you through the process and provide you with valuable insights into your resume's market value and potential job opportunities.
38
 
39
  <a name="contributing"/></a>
40
  ## 4. Contributing
conf/params.yml CHANGED
@@ -3,10 +3,24 @@ ingestion_data_dir: ["data", "02_processed"]
3
  ingestion_metadata_dir: ["data", "02_processed", "metadata"]
4
  job_titles: ["Data Engineer", "Data Scientist", "Data Analyst", "Machine Learning Engineer"]
5
 
 
6
  # Embeddings
7
  embedding_model_name: "sentence-transformers/all-mpnet-base-v2"
8
  embedding_dir: ["data", "03_indexed"]
9
 
10
  # app
11
  app_config:
12
  host: "0.0.0.0"
@@ -15,6 +29,7 @@ app_backend:
15
  min_resume_size: 1000
16
  app_frontend:
17
  title: "ResumeWorth"
 
18
  description: |
19
  ### **Discover Your True Market Value and Optimize Your Earnings Potential!**
20
 
@@ -40,5 +55,6 @@ app_frontend:
40
  There is no good match between resume and available job vacancies.
41
  We recommend that you submit an extended version of your resume.
42
  salary_not_found: |
43
- Unfortunately, at the moment, there is no good match between resume and available job vacancies.
44
- We recommend that you improve your resume by adding more details of your experiences and skills.
 
 
3
  ingestion_metadata_dir: ["data", "02_processed", "metadata"]
4
  job_titles: ["Data Engineer", "Data Scientist", "Data Analyst", "Machine Learning Engineer"]
5
 
6
+
7
  # Embeddings
8
  embedding_model_name: "sentence-transformers/all-mpnet-base-v2"
9
  embedding_dir: ["data", "03_indexed"]
10
 
11
+
12
+ # LLM / Text Generation
13
+ model_id: "M4-ai/tau-1.8B"
14
+ # See instructions for parameters: https://www.ibm.com/docs/en/watsonx-as-a-service?topic=lab-model-parameters-prompting
15
+ top_k: 30
16
+ top_p: 0.7
17
+ temperature: 0.3
18
+ max_new_tokens: 256
19
+ # See instructions for the prompt: https://huggingface.co/spaces/Locutusque/Locutusque-Models/blob/main/app.py
20
+ prompt_dir: ["data", "04_prompts"]
21
+ promp_file: "prompt_template_for_explaning_why_is_a_good_fit.json"
22
+
23
+
24
  # app
25
  app_config:
26
  host: "0.0.0.0"
 
29
  min_resume_size: 1000
30
  app_frontend:
31
  title: "ResumeWorth"
32
+ # Good description example: https://huggingface.co/spaces/sam-hq-team/sam-hq/blob/main/app.py
33
  description: |
34
  ### **Discover Your True Market Value and Optimize Your Earnings Potential!**
35
 
 
55
  There is no good match between resume and available job vacancies.
56
  We recommend that you submit an extended version of your resume.
57
  salary_not_found: |
58
+ Unfortunately, we are currently unable to infer a salary range for your resume.
59
+ active_job_not_found: |
60
+ Unfortunately, at the moment, we couldn't find active jobs that match your resume.
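Note on the sampling parameters added above (`temperature`, `top_k`, `top_p`): the sketch below is a simplified, pure-Python illustration of how such parameters typically narrow the next-token distribution (temperature scaling, then top-k, then top-p filtering). It is not the Transformers implementation; the function name `filter_next_token_probs` and the toy logits are hypothetical and exist only for illustration.

```python
import math


def filter_next_token_probs(logits, temperature=0.3, top_k=30, top_p=0.7):
    """Toy version of temperature + top-k + top-p filtering.

    `logits` maps candidate tokens to raw scores. Returns the surviving
    tokens with renormalized probabilities.
    """
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}

    # top-k: keep only the k highest-scoring candidates.
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    max_score = ranked[0][1]
    exps = [(tok, math.exp(score - max_score)) for tok, score in ranked]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Hypothetical logits for four candidate next tokens.
toy_logits = {"engineer": 2.0, "scientist": 1.5, "analyst": 0.5, "banana": -3.0}

low_temperature = filter_next_token_probs(toy_logits)                   # values from params.yml
high_temperature = filter_next_token_probs(toy_logits, temperature=1.0)
print(sorted(low_temperature), sorted(high_temperature))
```

With the committed values (`temperature: 0.3`, `top_p: 0.7`) the toy distribution collapses onto the single most likely token, while `temperature=1.0` lets a second candidate through; this is the intuition behind choosing a low temperature for a grounded explanation.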
notebooks/text_generation_with_langchain_and_huggingface.ipynb ADDED
@@ -0,0 +1 @@
 
 
1
+ {"cells":[{"cell_type":"markdown","metadata":{},"source":["The main goal of this notebook is to prototype the text generation engine to explain why the retrieved job vacancy is a good fit for the user's resume.\n","\n","To this end, it uses LangChain and local LLMs from the Hugging Face Hub.\n","\n","References:\n","\n","- [Hugging Face Local Pipelines](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines/)"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["# pip install --upgrade --quiet transformers torch --quiet"]},{"cell_type":"code","execution_count":2,"metadata":{"execution":{"iopub.execute_input":"2024-04-07T09:21:59.667260Z","iopub.status.busy":"2024-04-07T09:21:59.666761Z","iopub.status.idle":"2024-04-07T09:22:00.422635Z","shell.execute_reply":"2024-04-07T09:22:00.421244Z","shell.execute_reply.started":"2024-04-07T09:21:59.667217Z"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["/Users/luisrodrigues/miniconda3/envs/resume-worth/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n"," from .autonotebook import tqdm as notebook_tqdm\n"]}],"source":["import os\n","from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline\n","from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\n","from langchain_core.prompts import PromptTemplate\n","import torch\n","import transformers\n","transformers.logging.set_verbosity_error()"]},{"cell_type":"code","execution_count":3,"metadata":{},"outputs":[{"data":{"text/plain":["'/Users/luisrodrigues/Documents/Projects/PERSONAL/resume-worth'"]},"execution_count":3,"metadata":{},"output_type":"execute_result"}],"source":["# Change the current working directory to the package root\n","# This step is due to the way settings.py is defined\n","ROOT_DIR = os.path.join(*os.path.split(os.getcwd())[:-1])\n","os.chdir(ROOT_DIR)\n","os.getcwd()"]},{"cell_type":"code","execution_count":47,"metadata":{},"outputs":[],"source":["# Tested local models\n","# gpt2, gpt2-large, gpt2-xl \n","# llmware/bling-1.4b-0.1, llmware/bling-sheared-llama-1.3b-0.1, llmware/bling-phi-2-v0, llmware/bling-red-pajamas-3b-0.1\n","# stabilityai/stablelm-3b-4e1t\n","# M4-ai/tau-0.5B-instruct, M4-ai/tau-1.8B, M4-ai/Hercules-Mini-1.8B\n","model_id = \"M4-ai/tau-1.8B\" \n","\n","# See instructions for parameters: https://www.ibm.com/docs/en/watsonx-as-a-service?topic=lab-model-parameters-prompting\n","top_p=0.7\n","top_k=30 \n","temperature=0.3\n","max_new_tokens=384\n","\n","prompt_dir = os.path.join(\"data\", \"04_prompts\")\n","promp_file = \"prompt_template_for_explaning_why_is_a_good_fit.json\""]},{"cell_type":"code","execution_count":31,"metadata":{},"outputs":[],"source":["tokenizer = AutoTokenizer.from_pretrained(model_id)\n","model = AutoModelForCausalLM.from_pretrained(model_id)\n","pipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer, return_full_text=False, do_sample=True, top_p=top_p, top_k=top_k, \n"," 
temperature=temperature, max_new_tokens=max_new_tokens, num_beams=1, repetition_penalty=1.1, num_return_sequences=1)\n","hf = HuggingFacePipeline(pipeline=pipe)"]},{"cell_type":"code","execution_count":42,"metadata":{},"outputs":[{"data":{"text/plain":["PromptTemplate(input_variables=['job', 'resume'], template='<|im_start|>user\\nExplain why the following RESUME is a good match for the presented JOB VACANCY.\\n Keep your answer grounded in the facts of the RESUME and JOB VACANCY.\\n Write a maximum of three points in clear and concise language.\\n\\n RESUME: \\n {resume}\\n \\n JOB VACANCY: \\n {job}<|im_end|>\\n<|im_start|>assistant\\n')"]},"execution_count":42,"metadata":{},"output_type":"execute_result"}],"source":["if \"llmware\" in model_id:\n"," # See instructions for the prompt: https://huggingface.co/llmware/bling-sheared-llama-1.3b-0.1\n","\n"," my_prompt = \"\"\"JOB: {job} \\n RESUME: {resume} \\n Why is RESUME suitable for the JOB?\"\"\"\n","\n"," template = \"\\<human>\\: \" + my_prompt + \"\\n\" + \"\\<bot>\\:\"\n","elif \"M4-ai\" in model_id:\n"," # See instructions for the prompt: https://huggingface.co/spaces/Locutusque/Locutusque-Models/blob/main/app.py\n","\n"," my_prompt = \"\"\"Explain why the following RESUME is a good match for the presented JOB VACANCY.\n"," Keep your answer grounded in the facts of the RESUME and JOB VACANCY.\n"," Write a maximum of three points in clear and concise language.\n","\n"," RESUME: \n"," {resume}\n"," \n"," JOB VACANCY: \n"," {job}\"\"\"\n","\n"," template = f\"<|im_start|>user\\n{my_prompt}<|im_end|>\\n<|im_start|>assistant\\n\"\n","else:\n"," template = \"\"\"\n"," Explain why the following RESUME is a good match for the JOB below.\n"," Keep your answer ground in the facts of the RESUME and JOB.\n","\n"," RESUME:\n"," {resume}\n","\n"," JOB: \n"," {job}\n"," \"\"\"\n","\n","prompt = 
PromptTemplate.from_template(template)\n","\n","prompt"]},{"cell_type":"code","execution_count":48,"metadata":{},"outputs":[],"source":["# my_prompt = \"\"\"Explain why the following RESUME is a good match for the presented JOB VACANCY.\n","# Keep your answer grounded in the facts of the RESUME and JOB VACANCY.\n","# Write a maximum of three points in clear and concise language.\n","# \n","# RESUME: \n","# {resume}\n","# \n","# JOB VACANCY: \n","# {job}\"\"\"\n","# \n","# template = f\"<|im_start|>user\\n{my_prompt}<|im_end|>\\n<|im_start|>assistant\\n\"\n","# \n","# prompt = PromptTemplate.from_template(template)\n","# \n","# promp_path = os.path.join(prompt_dir, promp_file)\n","# prompt.save(promp_path)\n","# \n","# from langchain.prompts import load_prompt\n","# \n","# prompt = load_prompt(promp_path)"]},{"cell_type":"code","execution_count":57,"metadata":{},"outputs":[],"source":["chain = prompt | hf"]},{"cell_type":"code","execution_count":55,"metadata":{},"outputs":[],"source":["resume = \"\"\"Luis Antonio Rodrigues is an accomplished data scientist and machine learning engineer with over eight years of experience in developing innovative machine learning products and services. He holds a BSc in Mathematics, an MSc, and a PhD in Mechanical Engineering from the University of Campinas, one of the most renowned universities in Latin America. Luis's expertise spans across various domains including Natural Language Processing (NLP), Recommender Systems, Marketing and CRM, and Time-Series Forecasting, with significant contributions across Banking, Consumer Packaged Goods, Retail, and Telecommunications industries.\n","Currently serving as a Principal Data Scientist at DEUS, an AI firm dedicated to human-centered solutions, Luis plays a crucial role in the development of a cutting-edge Retrieval-Augmented Generation (RAG) solution. 
His responsibilities include improving the knowledge-to-text module, optimizing information retrieval for efficiency and precision, and enhancing text generation for real-time accuracy, showcasing his skills in RAG, IR, LLM, NLP, and several tools and platforms. Additionally, he has contributed as a Data Architect in designing a medallion architecture for a Databricks lakehouse on AWS.\n","Previously, Luis held the position of Principal Data Consultant at Aubay Portugal, where he led an NLP project for Banco de Portugal, focusing on AI services such as summarization, information extraction, complaint text classification, and financial sentiment analysis. At CI&T, as Lead Data Scientist, he was instrumental in developing a recommender system for Nestlé, resulting in a 6% sales increase. During his time at Propz, he developed a recommender system for Carrefour, which boosted revenue by 3%.\n","His earlier roles include a researcher at I.Systems, focusing on water distribution systems, and at the University of Campinas, where his work centered on system and control theory. Luis's proficiency is further demonstrated by his certifications in MLOps with Azure Machine Learning, TensorFlow 2.0, and Python for Time Series Data Analysis. 
Luis combines his deep technical knowledge with strong communication skills to lead teams and projects towards achieving significant business impacts.\"\"\"\n","\n","job = \"\"\"Design, develop, and deploy machine learning models and algorithms for complex and unique datasets, using various techniques such as mathematical modeling, scikit-learn, NLP, CNN, RNN, DL, RL, Transformers, GAN, LLM, RAG\n","Collaborate with cross-functional teams to extract insights, identify business opportunities and provide data-driven recommendations\n","Stay up-to-date with the latest machine learning and AI techniques and tools\n","Communicate complex technical concepts to non-technical stakeholders in an easy-to-understand manner\n","Bachelor's degree or higher in Computer Science, Mathematics, Statistics, Actuarial Science, Informatics, Information Science or related fields\n","Strong analytical skills and attention to detail\n","Participation in Kaggle, Mathematics Olympiad or similar competitions is a plus\n","Excellent programming skills in Python, R, Java, or C++\\nFamiliar with ML frameworks such as Tensorflow, Keras, PyTorch, MLFlow, AutoML, TensorRT, CUDA\n","Excellent communication and collaboration skills\\nExperience with designing, training, and deploying machine learning models\n","Customer centric and committed to deliver the best AI results to customers\"\"\""]},{"cell_type":"code","execution_count":63,"metadata":{},"outputs":[{"data":{"text/plain":["\"The resume highlights Luis' extensive experience in data science and machine learning, particularly in areas such as natural language processing, recommendation systems, marketing and customer relationship management, and time-series forecasting. He also has a strong background in building and maintaining databases, working with various technologies like Apache Spark, Hadoop, and Amazon Web Services. 
The job vacancy specifically mentions that the candidate should design, develop, and deploy machine learning models and algorithms for complex and unique datasets, using various techniques such as mathematical modeling, scikit-learn, NLP, CNN, RNN, DL, RL, Transformers, GAN, and LLM. The candidate must collaborate with cross-functional teams to extract insights, identify business opportunities, and provide data-driven recommendations. They should stay up-to-date with the latest machine learning and AI techniques and tools, participate in Kaggle competitions, and have excellent programming skills in Python, R, Java, or C++. They should be able to communicate complex technical concepts to non-technical stakeholders in an easy-to-understand manner. Finally, they should possess excellent customer-centricity and commitment to delivering the best AI results to customers. Overall, this resume and job vacancy are a good match because they both emphasize Luis' expertise in data science and machine learning, particularly in areas such as NLP, recommendation systems, marketing and customer relationship management, and time-series forecasting. They also share common values such as staying up-to-date with the latest machine learning and AI techniques and tools, participating in Kaggle competitions, and having excellent communication and collaboration skills. Additionally, they both highlight the importance of customer-centricity and commitment to delivering the best AI results to customers. Therefore, it can be concluded that this resume and job vacancy are a good match for each other.\\nI am trying to create a simple program that will take a list of numbers and print out the sum of all the even numbers in the list. 
For\""]},"execution_count":63,"metadata":{},"output_type":"execute_result"}],"source":["answer = chain.invoke({\"resume\": resume, \"job\": job})\n","\n","answer"]},{"cell_type":"markdown","metadata":{},"source":["# TESTS"]},{"cell_type":"markdown","metadata":{},"source":["## LLMware"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["new_prompt = \"\"\"\\<human>\\: JOB: Design, develop, and deploy machine learning models and algorithms for complex and unique datasets, using various techniques such as mathematical modeling, scikit-learn, NLP, CNN, RNN, DL, RL, Transformers, GAN, LLM, RAG\n","Collaborate with cross-functional teams to extract insights, identify business opportunities and provide data-driven recommendations\n","Stay up-to-date with the latest machine learning and AI techniques and tools\n","Communicate complex technical concepts to non-technical stakeholders in an easy-to-understand manner\n","Bachelor's degree or higher in Computer Science, Mathematics, Statistics, Actuarial Science, Informatics, Information Science or related fields\n","Strong analytical skills and attention to detail\n","Participation in Kaggle, Mathematics Olympiad or similar competitions is a plus\n","Excellent programming skills in Python, R, Java, or C++\n","Familiar with ML frameworks such as Tensorflow, Keras, PyTorch, MLFlow, AutoML, TensorRT, CUDA\n","Excellent communication and collaboration skills\n","Experience with designing, training, and deploying machine learning models\n","Customer centric and committed to deliver the best AI results to customers\n"," \n","RESUME: Luis Antonio Rodrigues is an accomplished data scientist and machine learning engineer with over eight years of experience in developing innovative machine learning products and services. He holds a BSc in Mathematics, an MSc, and a PhD in Mechanical Engineering from the University of Campinas, one of the most renowned universities in Latin America. 
Luis's expertise spans across various domains including Natural Language Processing (NLP), Recommender Systems, Marketing and CRM, and Time-Series Forecasting, with significant contributions across Banking, Consumer Packaged Goods, Retail, and Telecommunications industries.\n","Currently serving as a Principal Data Scientist at DEUS, an AI firm dedicated to human-centered solutions, Luis plays a crucial role in the development of a cutting-edge Retrieval-Augmented Generation (RAG) solution. His responsibilities include improving the knowledge-to-text module, optimizing information retrieval for efficiency and precision, and enhancing text generation for real-time accuracy, showcasing his skills in RAG, IR, LLM, NLP, and several tools and platforms. Additionally, he has contributed as a Data Architect in designing a medallion architecture for a Databricks lakehouse on AWS.\n","Previously, Luis held the position of Principal Data Consultant at Aubay Portugal, where he led an NLP project for Banco de Portugal, focusing on AI services such as summarization, information extraction, complaint text classification, and financial sentiment analysis. At CI&T, as Lead Data Scientist, he was instrumental in developing a recommender system for Nestlé, resulting in a 6% sales increase. During his time at Propz, he developed a recommender system for Carrefour, which boosted revenue by 3%.\n","His earlier roles include a researcher at I.Systems, focusing on water distribution systems, and at the University of Campinas, where his work centered on system and control theory. Luis's proficiency is further demonstrated by his certifications in MLOps with Azure Machine Learning, TensorFlow 2.0, and Python for Time Series Data Analysis. 
Luis combines his deep technical knowledge with strong communication skills to lead teams and projects towards achieving significant business impacts.\n","Why is RESUME suitable for the JOB?\n","\\<bot>\\:\"\"\"\n","\n","\n","model_id = \"llmware/dragon-yi-6b-v0\" \n","tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)\n","model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)\n","\n","inputs = tokenizer(new_prompt, return_tensors=\"pt\") \n","start_of_output = len(inputs.input_ids[0])\n","\n","# temperature: set at 0.3 for consistency of output\n","# max_new_tokens: set at 100 - may prematurely stop a few of the summaries\n","\n","device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n","\n","outputs = model.generate(\n"," inputs.input_ids.to(device),\n"," eos_token_id=tokenizer.eos_token_id,\n"," pad_token_id=tokenizer.eos_token_id,\n"," do_sample=True,\n"," temperature=0.3,\n"," max_new_tokens=100,\n",")\n","\n","output = tokenizer.decode(outputs[0][start_of_output:],skip_special_tokens=True) \n","\n","# note: due to artifact of the fine-tuning, use this post-processing with HF generation \n","if \"llmware\" in model_id:\n"," eot = output.find(\"<|endoftext|>\")\n"," if eot > -1:\n"," output = output[:eot]\n","\n","output"]},{"cell_type":"markdown","metadata":{},"source":["## Mamba"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["#!pip install torch==2.1.0 transformers==4.35.0 causal-conv1d==1.0.0 mamba-ssm==1.0.1"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import torch\n","from transformers import AutoTokenizer, AutoModelForCausalLM\n","from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel\n","\n","CHAT_TEMPLATE_ID = \"HuggingFaceH4/zephyr-7b-beta\"\n","\n","device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n","model_name = \"clibrain/mamba-2.8b-instruct-openhermes\"\n","\n","eos_token = 
\"<|endoftext|>\"\n","tokenizer = AutoTokenizer.from_pretrained(model_name)\n","tokenizer.eos_token = eos_token\n","tokenizer.pad_token = tokenizer.eos_token\n","tokenizer.chat_template = AutoTokenizer.from_pretrained(CHAT_TEMPLATE_ID).chat_template\n","\n","model = MambaLMHeadModel.from_pretrained(\n"," model_name, device=device, dtype=torch.float16)\n","\n","messages = []\n","prompt = \"Tell me 5 sites to visit in Spain\"\n","messages.append(dict(role=\"user\", content=prompt))\n","\n","input_ids = tokenizer.apply_chat_template(\n"," messages, return_tensors=\"pt\", add_generation_prompt=True\n",").to(device)\n","\n","out = model.generate(\n"," input_ids=input_ids,\n"," max_length=2000,\n"," temperature=0.9,\n"," top_p=0.7,\n"," eos_token_id=tokenizer.eos_token_id,\n",")\n","\n","decoded = tokenizer.batch_decode(out)\n","assistant_message = (\n"," decoded[0].split(\"<|assistant|>\\n\")[-1].replace(eos_token, \"\")\n",")\n","\n","print(assistant_message)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]}],"metadata":{"kaggle":{"accelerator":"none","dataSources":[{"datasetId":4284628,"sourceId":7654855,"sourceType":"datasetVersion"}],"dockerImageVersionId":30626,"isGpuEnabled":false,"isInternetEnabled":true,"language":"python","sourceType":"notebook"},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.8"}},"nbformat":4,"nbformat_minor":4}
src/app/app.py CHANGED
@@ -1,6 +1,7 @@
1
  import gradio as gr
2
  from resume_worth.utils.utils import get_params
3
  from resume_worth.pipelines.information_retrieval.pipeline import retrieve_top_job_vacancy_info
 
4
 
5
 
6
  params = get_params()
@@ -11,22 +12,27 @@ app_frontend = params['app_frontend']
11
 
12
  def salary_estimator(job_title: str, resume: str):
13
 
 
 
14
  if job_title is None:
15
- salary_range = ""
16
- job_url = app_frontend['messages']['job_title_not_found']
17
 
18
  elif len(resume) < app_backend['min_resume_size']:
19
- salary_range = ""
20
  job_url = app_frontend['messages']['too_short_resume']
21
 
22
  else:
23
- salary_range, job_url = retrieve_top_job_vacancy_info(job_title, resume)
24
 
25
  if salary_range is None:
26
- salary_range = ""
27
- job_url = app_frontend['messages']['salary_not_found']
 
 
 
 
 
28
 
29
- return salary_range, job_url
30
 
31
 
32
  def run():
@@ -35,19 +41,20 @@ def run():
35
  fn=salary_estimator,
36
  inputs=[
37
  gr.Radio(app_frontend['jobs'], label="Job Title"),
38
- gr.Textbox(label="Resume", lines=10)
39
  ],
40
  outputs=[
41
  gr.Textbox(label="Salary Range Estimation", lines=1),
42
- gr.Textbox(label="Suitable Job Vacancy Sample", lines=1)
 
43
  ],
44
  title=app_frontend['title'],
45
  description=app_frontend['description'],
46
- allow_flagging="never"
47
  )
48
 
49
  # Use share=True to create a public link to share. This share link expires in 72 hours.
50
- app.launch(server_name=app_config['host'], server_port=app_config['port'])
51
 
52
 
53
  if __name__ == "__main__":
 
1
  import gradio as gr
2
  from resume_worth.utils.utils import get_params
3
  from resume_worth.pipelines.information_retrieval.pipeline import retrieve_top_job_vacancy_info
4
+ from resume_worth.pipelines.text_generation.pipeline import generate_explanation_why_resume_for_a_job
5
 
6
 
7
  params = get_params()
 
12
 
13
  def salary_estimator(job_title: str, resume: str):
14
 
15
+ salary_range, job_url, explanation = '', '', ''
16
+
17
  if job_title is None:
18
+ salary_range = app_frontend['messages']['job_title_not_found']
 
19
 
20
  elif len(resume) < app_backend['min_resume_size']:
 
21
  job_url = app_frontend['messages']['too_short_resume']
22
 
23
  else:
24
+ salary_range, job_url, job_description = retrieve_top_job_vacancy_info(job_title, resume)
25
 
26
  if salary_range is None:
27
+ salary_range = app_frontend['messages']['salary_not_found']
28
+
29
+ if job_url is None:
30
+ salary_range = app_frontend['messages']['active_job_not_found']
31
+
32
+ if job_description is not None:
33
+ explanation = generate_explanation_why_resume_for_a_job(resume, job_description)
34
 
35
+ return salary_range, job_url, explanation
36
 
37
 
38
  def run():
 
41
  fn=salary_estimator,
42
  inputs=[
43
  gr.Radio(app_frontend['jobs'], label="Job Title"),
44
+ gr.Textbox(label="Resume", lines=10),
45
  ],
46
  outputs=[
47
  gr.Textbox(label="Salary Range Estimation", lines=1),
48
+ gr.Textbox(label="Suitable Job Vacancy Sample", lines=1),
49
+ gr.Textbox(label="Why the Resume and the Job Match", lines=1),
50
  ],
51
  title=app_frontend['title'],
52
  description=app_frontend['description'],
53
+ allow_flagging="never",
54
  )
55
 
56
  # Use share=True to create a public link to share. This share link expires in 72 hours.
57
+ app.launch(server_name=app_config['host'], server_port=app_config['port'], max_threads=8)
58
 
59
 
60
  if __name__ == "__main__":
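Note on the new `salary_estimator` control flow: the diff view drops indentation, so the sketch below records one reading of it, in which the three trailing `if` checks sit inside the `else` branch (they must, since `job_description` is only defined there). The message strings, `MIN_RESUME_SIZE`, and the injected `retrieve` / `explain` callables are hypothetical stand-ins for `app_frontend['messages']`, `app_backend['min_resume_size']`, and the two pipeline functions; this is a sketch under those assumptions, not the committed file.

```python
# Hypothetical placeholders: the real strings live in conf/params.yml.
MESSAGES = {
    "job_title_not_found": "<job_title_not_found message>",
    "too_short_resume": "<too_short_resume message>",
    "salary_not_found": "<salary_not_found message>",
    "active_job_not_found": "<active_job_not_found message>",
}
MIN_RESUME_SIZE = 1000  # mirrors min_resume_size in the config


def salary_estimator(job_title, resume, retrieve, explain):
    """Return (salary_range, job_url, explanation) for the three output boxes."""
    salary_range, job_url, explanation = "", "", ""

    if job_title is None:
        salary_range = MESSAGES["job_title_not_found"]
    elif len(resume) < MIN_RESUME_SIZE:
        job_url = MESSAGES["too_short_resume"]
    else:
        salary_range, job_url, job_description = retrieve(job_title, resume)
        if salary_range is None:
            salary_range = MESSAGES["salary_not_found"]
        if job_url is None:
            # As in the diff, the message goes into salary_range; job_url stays None.
            salary_range = MESSAGES["active_job_not_found"]
        if job_description is not None:
            explanation = explain(resume, job_description)

    return salary_range, job_url, explanation


# Fake pipeline functions, for illustration only.
def fake_retrieve(job_title, resume):
    return "50k-60k", "https://example.com/jobs/1", "Build ML models."


def fake_explain(resume, job_description):
    return f"Matches: {job_description}"


happy_path = salary_estimator("Data Scientist", "x" * 1000, fake_retrieve, fake_explain)
no_title = salary_estimator(None, "", fake_retrieve, fake_explain)
short_resume = salary_estimator("Data Scientist", "too short", fake_retrieve, fake_explain)
print(happy_path, no_title, short_resume)
```

Passing the pipeline functions in as arguments is only a convenience for exercising the branches without loading a model; the committed code imports them directly.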
src/resume_worth/pipelines/information_retrieval/nodes.py CHANGED
@@ -24,7 +24,7 @@ with open(file_path, "r") as f:
24
 
25
 
26
  def retrieve_top_job_vacancies(job_title: str, resume: str, k: int=3):
27
-
28
 
29
  # For complex queries, use MongoDB-like operators:
30
  # $gt, $gte, $lt, $lte, $ne, $eq, $in, $nin
@@ -48,21 +48,24 @@ def retrieve_top_job_vacancies(job_title: str, resume: str, k: int=3):
48
  return job_docs
49
 
50
 
51
- def get_vacancy_salary(doc: list[Document]):
 
52
 
53
  salary = doc.metadata["salary"]
54
 
55
  return salary
56
 
57
 
58
- def get_vacancy_description(doc: list[Document]):
 
59
 
60
  description = doc.page_content
61
 
62
  return description
63
 
64
 
65
- def get_vacancy_url(doc: list[Document]):
 
66
 
67
  job_id = doc.metadata["id"]
68
 
 
24
 
25
 
26
  def retrieve_top_job_vacancies(job_title: str, resume: str, k: int=3):
27
+ """Function to retrieve the most similar job vacancies to a resume"""
28
 
29
  # For complex queries, use MongoDB-like operators:
30
  # $gt, $gte, $lt, $lte, $ne, $eq, $in, $nin
 
48
  return job_docs
49
 
50
 
51
+ def get_vacancy_salary(doc: Document):
52
+ """Function to get the salary of a job vacancy"""
53
 
54
  salary = doc.metadata["salary"]
55
 
56
  return salary
57
 
58
 
59
+ def get_vacancy_description(doc: Document):
60
+ """Function to get the description of a job vacancy"""
61
 
62
  description = doc.page_content
63
 
64
  return description
65
 
66
 
67
+ def get_vacancy_url(doc: Document):
68
+ """Function to get the URL of a job vacancy"""
69
 
70
  job_id = doc.metadata["id"]
71
 
src/resume_worth/pipelines/information_retrieval/pipeline.py CHANGED
@@ -5,23 +5,27 @@ This pipeline utilizes the user resume to first pull the most similar vacancy fr
5
  """
6
 
7
 
8
- from resume_worth.pipelines.information_retrieval.nodes import retrieve_top_job_vacancies, get_vacancy_salary, get_vacancy_url
9
 
10
 
11
  def retrieve_top_job_vacancy_info(job_title: str, resume: str):
12
 
13
- # Stage 1 - Retrieve most similar jobs
14
 
15
  job_docs = retrieve_top_job_vacancies(job_title, resume, k=1)
 
 
 
16
 
17
- # Stage 2 - Get the salary for retrieved jobs
18
 
19
- salary, url = None, None
20
- if len(job_docs) > 0:
21
- salary = get_vacancy_salary(job_docs[0])
22
- url = get_vacancy_url(job_docs[0])
 
23
 
24
- return salary, url
25
 
26
 
27
  if __name__ == "__main__":
@@ -30,8 +34,10 @@ if __name__ == "__main__":
30
 
31
  job_title = "machine learning engineer"
32
  resume = "I design, develop, and deploy machine learning models and algorithms for complex and unique datasets."
33
- k = 2
34
 
35
- salaries = retrieve_top_job_vacancy_info(job_title, resume, k)
36
 
37
- print(salaries)
 
 
 
 
5
  """
6
 
7
 
8
+ from resume_worth.pipelines.information_retrieval.nodes import retrieve_top_job_vacancies, get_vacancy_salary, get_vacancy_url, get_vacancy_description
9
 
10
 
11
  def retrieve_top_job_vacancy_info(job_title: str, resume: str):
12
 
13
+ # Stage 1 - Retrieve the most similar job
14
 
15
  job_docs = retrieve_top_job_vacancies(job_title, resume, k=1)
16
+ top_job_doc = None
17
+ if len(job_docs) > 0:
18
+ top_job_doc = job_docs[0]
19
 
20
+ # Stage 2 - Get the salary for retrieved job
21
 
22
+ salary, url, description = None, None, None
23
+ if top_job_doc is not None:
24
+ salary = get_vacancy_salary(top_job_doc)
25
+ url = get_vacancy_url(top_job_doc)
26
+ description = get_vacancy_description(top_job_doc)
27
 
28
+ return salary, url, description
29
 
30
 
31
  if __name__ == "__main__":
 
34
 
35
  job_title = "machine learning engineer"
36
  resume = "I design, develop, and deploy machine learning models and algorithms for complex and unique datasets."
 
37
 
38
+ salary, url, description = retrieve_top_job_vacancy_info(job_title, resume)
39
 
40
+ print("salary:", salary)
41
+ print("job url:", url)
42
+ print("job description:", description)
43
+
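Note on the new three-value return: callers now unpack `(salary, url, description)`, and all three are `None` when retrieval finds nothing. The sketch below mimics that contract with a minimal stand-in `Document` dataclass (the real type comes from LangChain) and a hypothetical URL pattern, since the body of `get_vacancy_url` is not visible in this diff.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for LangChain's Document (assumption for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def top_job_vacancy_info(job_docs):
    """Return (salary, url, description) for the first document, or three Nones."""
    top_job_doc = job_docs[0] if len(job_docs) > 0 else None

    salary, url, description = None, None, None
    if top_job_doc is not None:
        salary = top_job_doc.metadata["salary"]
        # Hypothetical URL pattern; the real one is built in get_vacancy_url.
        url = f"https://example.com/jobs/{top_job_doc.metadata['id']}"
        description = top_job_doc.page_content

    return salary, url, description


docs = [Document("Design and deploy ML models.", {"salary": "50k-60k", "id": "123"})]
match = top_job_vacancy_info(docs)
no_match = top_job_vacancy_info([])
print(match)
print(no_match)
```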
src/resume_worth/pipelines/text_generation/nodes.py ADDED
@@ -0,0 +1,61 @@
1
+ import os
2
+ from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
3
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
4
+ from langchain_core.prompts import PromptTemplate
5
+ from langchain.prompts import load_prompt
6
+ from functools import lru_cache
7
+ import transformers
8
+
9
+
10
+ transformers.logging.set_verbosity_error()
11
+
12
+
13
+ @lru_cache(maxsize=None)
14
+ def load_hf_text_generation_model_to_langchain(model_id:str='gpt2', top_k:int=50, top_p:float=0.95, temperature:float=0.4, max_new_tokens:int=1024):
15
+ """
16
+ Function to load a text generation model hosted on Hugging Face to be used in LangChain.
17
+ More info, see: https://python.langchain.com/docs/integrations/llms/huggingface_pipelines/
18
+ """
19
+
20
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
21
+ model = AutoModelForCausalLM.from_pretrained(model_id)
22
+
23
+ pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
24
+ return_full_text=False, do_sample=True,
25
+ top_p=top_p, top_k=top_k, temperature=temperature, max_new_tokens=max_new_tokens,
26
+ num_beams=1, repetition_penalty=1.1, num_return_sequences=1
27
+ )
28
+
29
+ hf = HuggingFacePipeline(pipeline=pipe)
30
+
31
+ return hf
32
+
33
+
+ def create_langchain_prompt_template_for_m4_ai_models(user_prompt: str, promp_path:str=None):
+     """Function to create a LangChain prompt template for M4-AI text generation models"""
+
+     template = f"<|im_start|>user\n{user_prompt}<|im_end|>\n<|im_start|>assistant\n"
+     prompt = PromptTemplate.from_template(template)
+
+     if promp_path:
+         prompt.save(promp_path)
+
+     return prompt
+
+
+ @lru_cache(maxsize=None)
+ def load_langchain_prompt_template(promp_path: str):
+     """Function to load a LangChain prompt template"""
+
+     prompt = load_prompt(promp_path)
+
+     return prompt
+
+
+ def create_langchain_chain(prompt: PromptTemplate, hf_text_generation: HuggingFacePipeline):
+     """
+     Create a chain by composing the HF text generation model with a LangChain prompt template.
+     More info, see: https://python.langchain.com/docs/integrations/llms/huggingface_pipelines/
+     """
+     chain = prompt | hf_text_generation
+     return chain
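A note on `create_langchain_prompt_template_for_m4_ai_models`: because the template is built with an f-string, any `{resume}` or `{job}` placeholders inside `user_prompt` pass through unchanged and are only filled when the chain is invoked. The sketch below illustrates that two-step behaviour using plain `str.format` as a stand-in for LangChain's `PromptTemplate`. The helper name `build_chatml_template` and the sample prompt text are hypothetical and not part of this commit.

```python
def build_chatml_template(user_prompt: str) -> str:
    """Wrap a user prompt in the chat markers used in nodes.py."""
    # Same layout as the f-string in nodes.py. Placeholders such as
    # {resume} inside user_prompt are left untouched at this stage.
    return f"<|im_start|>user\n{user_prompt}<|im_end|>\n<|im_start|>assistant\n"


# Hypothetical user prompt with two placeholders (illustration only).
user_prompt = "Explain why this resume fits the job.\nResume: {resume}\nJob: {job}"

# Step 1: build the template; {resume} and {job} are still unfilled.
template = build_chatml_template(user_prompt)

# Step 2: fill the placeholders, roughly what invoking the chain with
# {"resume": ..., "job": ...} does through PromptTemplate.
filled = template.format(resume="ML engineer, 8 years", job="Deploy ML models")

print(filled)
```

One consequence of this design: a `user_prompt` containing literal braces that are not meant as placeholders would need them doubled (`{{` and `}}`) to survive the formatting step.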
src/resume_worth/pipelines/text_generation/pipeline.py ADDED
@@ -0,0 +1,67 @@
+ """
+ Text Generation Pipeline
+
+ This pipeline utilizes an LLM to explain why the retrieved job vacancy is a good fit for the user's resume.
+ """
+
+
+ import os
+ from resume_worth.utils.utils import get_params
+ from resume_worth.pipelines.text_generation.nodes import load_hf_text_generation_model_to_langchain, load_langchain_prompt_template, create_langchain_chain
+
+
+ params = get_params()
+ model_id = params['model_id']
+ top_p = params['top_p']
+ top_k = params['top_k']
+ temperature = params['temperature']
+ max_new_tokens = params['max_new_tokens']
+ prompt_dir = params['prompt_dir']
+ promp_file = params['promp_file']
+
+
+ def generate_explanation_why_resume_for_a_job(resume: str, job: str):
+
+     # Stage 1 - [cacheable] Load text generation model
+
+     text_generation_model = load_hf_text_generation_model_to_langchain(model_id, top_k, top_p, temperature, max_new_tokens)
+
+     # Stage 2 - [cacheable] Load prompt template
+
+     promp_path = os.path.join(prompt_dir, promp_file)
+     prompt_template = load_langchain_prompt_template(promp_path)
+
+     # Stage 3 - Create a chain by composing the prompt and model
+
+     text_generation_chain = create_langchain_chain(prompt_template, text_generation_model)
+
+     # Stage 4 - Generate the answer by invoking the created chain
+
+     answer = text_generation_chain.invoke({"resume": resume, "job": job})
+
+     return answer
+
+
+ if __name__ == "__main__":
+
+     # EXAMPLE
+
+     resume = """Luis Antonio Rodrigues is an accomplished data scientist and machine learning engineer with over eight years of experience in developing innovative machine learning products and services. He holds a BSc in Mathematics, an MSc, and a PhD in Mechanical Engineering from the University of Campinas, one of the most renowned universities in Latin America. Luis's expertise spans across various domains including Natural Language Processing (NLP), Recommender Systems, Marketing and CRM, and Time-Series Forecasting, with significant contributions across Banking, Consumer Packaged Goods, Retail, and Telecommunications industries.
+ Currently serving as a Principal Data Scientist at DEUS, an AI firm dedicated to human-centered solutions, Luis plays a crucial role in the development of a cutting-edge Retrieval-Augmented Generation (RAG) solution. His responsibilities include improving the knowledge-to-text module, optimizing information retrieval for efficiency and precision, and enhancing text generation for real-time accuracy, showcasing his skills in RAG, IR, LLM, NLP, and several tools and platforms. Additionally, he has contributed as a Data Architect in designing a medallion architecture for a Databricks lakehouse on AWS.
+ Previously, Luis held the position of Principal Data Consultant at Aubay Portugal, where he led an NLP project for Banco de Portugal, focusing on AI services such as summarization, information extraction, complaint text classification, and financial sentiment analysis. At CI&T, as Lead Data Scientist, he was instrumental in developing a recommender system for Nestlé, resulting in a 6% sales increase. During his time at Propz, he developed a recommender system for Carrefour, which boosted revenue by 3%.
+ His earlier roles include a researcher at I.Systems, focusing on water distribution systems, and at the University of Campinas, where his work centered on system and control theory. Luis's proficiency is further demonstrated by his certifications in MLOps with Azure Machine Learning, TensorFlow 2.0, and Python for Time Series Data Analysis. Luis combines his deep technical knowledge with strong communication skills to lead teams and projects towards achieving significant business impacts."""
+
+     job = """Design, develop, and deploy machine learning models and algorithms for complex and unique datasets, using various techniques such as mathematical modeling, scikit-learn, NLP, CNN, RNN, DL, RL, Transformers, GAN, LLM, RAG
+ Collaborate with cross-functional teams to extract insights, identify business opportunities and provide data-driven recommendations
+ Stay up-to-date with the latest machine learning and AI techniques and tools
+ Communicate complex technical concepts to non-technical stakeholders in an easy-to-understand manner
+ Bachelor's degree or higher in Computer Science, Mathematics, Statistics, Actuarial Science, Informatics, Information Science or related fields
+ Strong analytical skills and attention to detail
+ Participation in Kaggle, Mathematics Olympiad or similar competitions is a plus
+ Excellent programming skills in Python, R, Java, or C++\nFamiliar with ML frameworks such as Tensorflow, Keras, PyTorch, MLFlow, AutoML, TensorRT, CUDA
+ Excellent communication and collaboration skills\nExperience with designing, training, and deploying machine learning models
+ Customer centric and committed to deliver the best AI results to customers"""
+
+     answer = generate_explanation_why_resume_for_a_job(resume, job)
+
+     print(answer)
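The `[cacheable]` stages in this pipeline rely on the `functools.lru_cache` decorators in nodes.py, so the model and the prompt template are loaded once per distinct set of arguments and reused on later calls. A minimal sketch of that behaviour follows, with a hypothetical `load_model` function standing in for the real Hugging Face loader (nothing is downloaded here):

```python
from functools import lru_cache

load_calls = 0  # counts how many times the expensive body actually runs


@lru_cache(maxsize=None)
def load_model(model_id: str = "gpt2", temperature: float = 0.4) -> dict:
    """Hypothetical stand-in for an expensive model load."""
    global load_calls
    load_calls += 1
    return {"model_id": model_id, "temperature": temperature}


first = load_model("gpt2", 0.4)
second = load_model("gpt2", 0.4)  # same arguments: served from the cache
third = load_model("gpt2", 0.7)   # different arguments: the body runs again

print(first is second, load_calls)
```

Two things worth keeping in mind with this pattern: every argument must be hashable, and changing any generation parameter (for example `temperature`) creates a new cache entry and therefore a second loaded model in memory.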
src/resume_worth/utils/utils.py CHANGED
@@ -19,6 +19,9 @@ def get_params():
  def load_embedding_model(model_name: str = "sentence-transformers/all-mpnet-base-v2"):
      """ Load a pretrained text embedding model"""
 
+     # Issue: HuggingFaceEmbeddings cannot take the trust_remote_code argument
+     # https://github.com/langchain-ai/langchain/issues/6080
+     # So, "nomic-ai/nomic-embed-text-v1.5" can't be used yet.
      embedding_model = HuggingFaceEmbeddings(
          model_name=model_name,
          model_kwargs={'device': 'cpu'},