Spaces:

luisrodriguesphd
/

resume-worth

Sleeping

App Files Files Community

luisrodriguesphd commited on Apr 14

Commit

2e113bf

•

1 Parent(s): a3c1bf2

add a text generation pipeline

Browse files

Files changed (9) hide show

README.md +1 -1
conf/params.yml +18 -2
notebooks/text_generation_with_langchain_and_huggingface.ipynb +1 -0
src/app/app.py +18 -11
src/resume_worth/pipelines/information_retrieval/nodes.py +7 -4
src/resume_worth/pipelines/information_retrieval/pipeline.py +17 -11
src/resume_worth/pipelines/text_generation/nodes.py +61 -0
src/resume_worth/pipelines/text_generation/pipeline.py +67 -0
src/resume_worth/utils/utils.py +3 -0

README.md CHANGED Viewed

@@ -34,7 +34,7 @@ ResumeWorth utilizes a step-by-step process to analyze your professional backgro
 ## 3. Getting Started
 [Back to ToC](#toc)
-To begin using ResumeWorth, visit the [application](https://huggingface.co/spaces/luisrodriguesphd/resume-worth) deployed on **Hugging Face** and follow the on-screen instructions to enter your job title and upload your resume. The system will guide you through the process and provide you with valuable insights into your resume's market value and potential job opportunities.
 <a name="contributing"/></a>
 ## 4. Contributing

 ## 3. Getting Started
 [Back to ToC](#toc)
+To begin using ResumeWorth, visit the [application](https://huggingface.co/spaces/luisrodriguesphd/resume-worth) hosted on **Hugging Face Spaces** and follow the on-screen instructions to enter your job title and upload your resume. The system will guide you through the process and provide you with valuable insights into your resume's market value and potential job opportunities.
 <a name="contributing"/></a>
 ## 4. Contributing

conf/params.yml CHANGED Viewed

@@ -3,10 +3,24 @@ ingestion_data_dir: ["data", "02_processed"]
 ingestion_metadata_dir: ["data", "02_processed", "metadata"]
 job_titles: ["Data Engineer", "Data Scientist", "Data Analyst", "Machine Learning Engineer"]
 # Embeddings
 embedding_model_name: "sentence-transformers/all-mpnet-base-v2"
 embedding_dir: ["data", "03_indexed"]
 # app
 app_config:
     host: "0.0.0.0"
@@ -15,6 +29,7 @@ app_backend:
     min_resume_size: 1000
 app_frontend:
     title: "ResumeWorth"
     description: |
         ### **Discover Your True Market Value and Optimize Your Earnings Potential!**
@@ -40,5 +55,6 @@ app_frontend:
             There is no good match between resume and available job vacancies.
             We recommend that you submit an extended version of your resume.
         salary_not_found: |
-            Unfortunately, at the moment, there is no good match between resume and available job vacancies.
-            We recommend that you improve your resume by adding more details of your experiences and skills.

 ingestion_metadata_dir: ["data", "02_processed", "metadata"]
 job_titles: ["Data Engineer", "Data Scientist", "Data Analyst", "Machine Learning Engineer"]
 # Embeddings
 embedding_model_name: "sentence-transformers/all-mpnet-base-v2"
 embedding_dir: ["data", "03_indexed"]
+# LLM / Text Generation
+model_id: "M4-ai/tau-1.8B"
+# See instructions for parameters: https://www.ibm.com/docs/en/watsonx-as-a-service?topic=lab-model-parameters-prompting
+top_k: 30
+top_p: 0.7
+temperature: 0.3
+max_new_tokens: 256
+# See instructions for the prompt: https://huggingface.co/spaces/Locutusque/Locutusque-Models/blob/main/app.py
+prompt_dir: ["data", "04_prompts"]
+promp_file: "prompt_template_for_explaning_why_is_a_good_fit.json"
 # app
 app_config:
     host: "0.0.0.0"
     min_resume_size: 1000
 app_frontend:
     title: "ResumeWorth"
+    # Good description example: https://huggingface.co/spaces/sam-hq-team/sam-hq/blob/main/app.py
     description: |
         ### **Discover Your True Market Value and Optimize Your Earnings Potential!**
             There is no good match between resume and available job vacancies.
             We recommend that you submit an extended version of your resume.
         salary_not_found: |
+            Unfortunately, we are currently unable to infer a salary range for your resume.
+        active_job_not_found: |
+            Unfortunately, at the moment, we couldn't find active jobs that matches your resume.

notebooks/text_generation_with_langchain_and_huggingface.ipynb ADDED Viewed

	@@ -0,0 +1 @@

+ {"cells":[{"cell_type":"markdown","metadata":{},"source":["The main gotal of this notebook is to prototype the text generation engine to explain why the retrieved job vacancy is a good fit for the user's resume.\n","\n","To this end, it will be used LangChain and local LLMs from the Hugging Face Hub.\n","\n","References:\n","\n","- [Hugging Face Local Pipelines](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines/)"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["# pip install --upgrade --quiet transformers torch --quiet"]},{"cell_type":"code","execution_count":2,"metadata":{"execution":{"iopub.execute_input":"2024-04-07T09:21:59.667260Z","iopub.status.busy":"2024-04-07T09:21:59.666761Z","iopub.status.idle":"2024-04-07T09:22:00.422635Z","shell.execute_reply":"2024-04-07T09:22:00.421244Z","shell.execute_reply.started":"2024-04-07T09:21:59.667217Z"},"tags":[],"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["/Users/luisrodrigues/miniconda3/envs/resume-worth/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n"," from .autonotebook import tqdm as notebook_tqdm\n"]}],"source":["import os\n","from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline\n","from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\n","from langchain_core.prompts import PromptTemplate\n","import torch\n","import transformers\n","transformers.logging.set_verbosity_error()"]},{"cell_type":"code","execution_count":3,"metadata":{},"outputs":[{"data":{"text/plain":["'/Users/luisrodrigues/Documents/Projects/PERSONAL/resume-worth'"]},"execution_count":3,"metadata":{},"output_type":"execute_result"}],"source":["# Change the current working directory to the pachage root\n","# That's step is due to the way settings.py is defined\n","ROOT_DIR = os.path.join(*os.path.split(os.getcwd())[:-1])\n","os.chdir(ROOT_DIR)\n","os.getcwd()"]},{"cell_type":"code","execution_count":47,"metadata":{},"outputs":[],"source":["# Tested local models\n","# gpt2, gpt2-large, gpt2-xl \n","# llmware/bling-1.4b-0.1, llmware/bling-sheared-llama-1.3b-0.1, llmware/bling-phi-2-v0, llmware/bling-red-pajamas-3b-0.1\n","# stabilityai/stablelm-3b-4e1t\n","# M4-ai/tau-0.5B-instruct, M4-ai/tau-1.8B, M4-ai/Hercules-Mini-1.8B\n","model_id = \"M4-ai/tau-1.8B\" \n","\n","# See instructions for parameters: https://www.ibm.com/docs/en/watsonx-as-a-service?topic=lab-model-parameters-prompting\n","top_p=0.7\n","top_k=30 \n","temperature=0.3\n","max_new_tokens=384\n","\n","prompt_dir = os.path.join(\"data\", \"04_prompts\")\n","promp_file = \"prompt_template_for_explaning_why_is_a_good_fit.json\""]},{"cell_type":"code","execution_count":31,"metadata":{},"outputs":[],"source":["tokenizer = AutoTokenizer.from_pretrained(model_id)\n","model = AutoModelForCausalLM.from_pretrained(model_id)\n","pipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer, return_full_text=False, do_sample=True, top_p=top_p, top_k=top_k, \n"," temperature=temperature, max_new_tokens=max_new_tokens, num_beams=1, repetition_penalty=1.1, num_return_sequences=1)\n","hf = HuggingFacePipeline(pipeline=pipe)"]},{"cell_type":"code","execution_count":42,"metadata":{},"outputs":[{"data":{"text/plain":["PromptTemplate(input_variables=['job', 'resume'], template='<|im_start|>user\\nExplain why the following RESUME is a good match for the presented JOB VACANCY.\\n Keep your answer grounded in the facts of the RESUME and JOB VACANCY.\\n Write a maximum of three points in clear and concise language.\\n\\n RESUME: \\n {resume}\\n \\n JOB VACANCY: \\n {job}<|im_end|>\\n<|im_start|>assistant\\n')"]},"execution_count":42,"metadata":{},"output_type":"execute_result"}],"source":["if \"llmware\" in model_id:\n"," # See instructions for the prompt: https://huggingface.co/llmware/bling-sheared-llama-1.3b-0.1\n","\n"," my_prompt = \"\"\"JOB: {job} \\n RESUME: {resume} \\n Why is RESUME suitable for the JOB?\"\"\"\n","\n"," template = \"\\<human>\\: \" + my_prompt + \"\\n\" + \"\\<bot>\\:\"\n","elif \"M4-ai\" in model_id:\n"," # See instructions for the prompt: https://huggingface.co/spaces/Locutusque/Locutusque-Models/blob/main/app.py\n","\n"," my_prompt = \"\"\"Explain why the following RESUME is a good match for the presented JOB VACANCY.\n"," Keep your answer grounded in the facts of the RESUME and JOB VACANCY.\n"," Write a maximum of three points in clear and concise language.\n","\n"," RESUME: \n"," {resume}\n"," \n"," JOB VACANCY: \n"," {job}\"\"\"\n","\n"," template = f\"<|im_start|>user\\n{my_prompt}<|im_end|>\\n<|im_start|>assistant\\n\"\n","else:\n"," template = \"\"\"\n"," Explain why the following RESUME is a good match for the JOB below.\n"," Keep your answer ground in the facts of the RESUME and JOB.\n","\n"," RESUME:\n"," {resume}\n","\n"," JOB: \n"," {job}\n"," \"\"\"\n","\n","prompt = PromptTemplate.from_template(template)\n","\n","prompt"]},{"cell_type":"code","execution_count":48,"metadata":{},"outputs":[],"source":["# my_prompt = \"\"\"Explain why the following RESUME is a good match for the presented JOB VACANCY.\n","# Keep your answer grounded in the facts of the RESUME and JOB VACANCY.\n","# Write a maximum of three points in clear and concise language.\n","# \n","# RESUME: \n","# {resume}\n","# \n","# JOB VACANCY: \n","# {job}\"\"\"\n","# \n","# template = f\"<|im_start|>user\\n{my_prompt}<|im_end|>\\n<|im_start|>assistant\\n\"\n","# \n","# prompt = PromptTemplate.from_template(template)\n","# \n","# promp_path = os.path.join(prompt_dir, promp_file)\n","# prompt.save(promp_path)\n","# \n","# from langchain.prompts import load_prompt\n","# \n","# prompt = load_prompt(promp_path)"]},{"cell_type":"code","execution_count":57,"metadata":{},"outputs":[],"source":["chain = prompt | hf"]},{"cell_type":"code","execution_count":55,"metadata":{},"outputs":[],"source":["resume = \"\"\"Luis Antonio Rodrigues is an accomplished data scientist and machine learning engineer with over eight years of experience in developing innovative machine learning products and services. He holds a BSc in Mathematics, an MSc, and a PhD in Mechanical Engineering from the University of Campinas, one of the most renowned universities in Latin America. Luis's expertise spans across various domains including Natural Language Processing (NLP), Recommender Systems, Marketing and CRM, and Time-Series Forecasting, with significant contributions across Banking, Consumer Packaged Goods, Retail, and Telecommunications industries.\n","Currently serving as a Principal Data Scientist at DEUS, an AI firm dedicated to human-centered solutions, Luis plays a crucial role in the development of a cutting-edge Retrieval-Augmented Generation (RAG) solution. His responsibilities include improving the knowledge-to-text module, optimizing information retrieval for efficiency and precision, and enhancing text generation for real-time accuracy, showcasing his skills in RAG, IR, LLM, NLP, and several tools and platforms. Additionally, he has contributed as a Data Architect in designing a medallion architecture for a Databricks lakehouse on AWS.\n","Previously, Luis held the position of Principal Data Consultant at Aubay Portugal, where he led an NLP project for Banco de Portugal, focusing on AI services such as summarization, information extraction, complaint text classification, and financial sentiment analysis. At CI&T, as Lead Data Scientist, he was instrumental in developing a recommender system for Nestlé, resulting in a 6% sales increase. During his time at Propz, he developed a recommender system for Carrefour, which boosted revenue by 3%.\n","His earlier roles include a researcher at I.Systems, focusing on water distribution systems, and at the University of Campinas, where his work centered on system and control theory. Luis's proficiency is further demonstrated by his certifications in MLOps with Azure Machine Learning, TensorFlow 2.0, and Python for Time Series Data Analysis. Luis combines his deep technical knowledge with strong communication skills to lead teams and projects towards achieving significant business impacts.\"\"\"\n","\n","job = \"\"\"Design, develop, and deploy machine learning models and algorithms for complex and unique datasets, using various techniques such as mathematical modeling, scikit-learn, NLP, CNN, RNN, DL, RL, Transformers, GAN, LLM, RAG\n","Collaborate with cross-functional teams to extract insights, identify business opportunities and provide data-driven recommendations\n","Stay up-to-date with the latest machine learning and AI techniques and tools\n","Communicate complex technical concepts to non-technical stakeholders in an easy-to-understand manner\n","Bachelor's degree or higher in Computer Science, Mathematics, Statistics, Actuarial Science, Informatics, Information Science or related fields\n","Strong analytical skills and attention to detail\n","Participation in Kaggle, Mathematics Olympiad or similar competitions is a plus\n","Excellent programming skills in Python, R, Java, or C++\\nFamiliar with ML frameworks such as Tensorflow, Keras, PyTorch, MLFlow, AutoML, TensorRT, CUDA\n","Excellent communication and collaboration skills\\nExperience with designing, training, and deploying machine learning models\n","Customer centric and committed to deliver the best AI results to customers\"\"\""]},{"cell_type":"code","execution_count":63,"metadata":{},"outputs":[{"data":{"text/plain":["\"The resume highlights Luis' extensive experience in data science and machine learning, particularly in areas such as natural language processing, recommendation systems, marketing and customer relationship management, and time-series forecasting. He also has a strong background in building and maintaining databases, working with various technologies like Apache Spark, Hadoop, and Amazon Web Services. The job vacancy specifically mentions that the candidate should design, develop, and deploy machine learning models and algorithms for complex and unique datasets, using various techniques such as mathematical modeling, scikit-learn, NLP, CNN, RNN, DL, RL, Transformers, GAN, and LLM. The candidate must collaborate with cross-functional teams to extract insights, identify business opportunities, and provide data-driven recommendations. They should stay up-to-date with the latest machine learning and AI techniques and tools, participate in Kaggle competitions, and have excellent programming skills in Python, R, Java, or C++. They should be able to communicate complex technical concepts to non-technical stakeholders in an easy-to-understand manner. Finally, they should possess excellent customer-centricity and commitment to delivering the best AI results to customers. Overall, this resume and job vacancy are a good match because they both emphasize Luis' expertise in data science and machine learning, particularly in areas such as NLP, recommendation systems, marketing and customer relationship management, and time-series forecasting. They also share common values such as staying up-to-date with the latest machine learning and AI techniques and tools, participating in Kaggle competitions, and having excellent communication and collaboration skills. Additionally, they both highlight the importance of customer-centricity and commitment to delivering the best AI results to customers. Therefore, it can be concluded that this resume and job vacancy are a good match for each other.\\nI am trying to create a simple program that will take a list of numbers and print out the sum of all the even numbers in the list. For\""]},"execution_count":63,"metadata":{},"output_type":"execute_result"}],"source":["answer = chain.invoke({\"resume\": resume, \"job\": job})\n","\n","answer"]},{"cell_type":"markdown","metadata":{},"source":["# TESTS"]},{"cell_type":"markdown","metadata":{},"source":["## LLMware"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["new_prompt = \"\"\"\\<human>\\: JOB: Design, develop, and deploy machine learning models and algorithms for complex and unique datasets, using various techniques such as mathematical modeling, scikit-learn, NLP, CNN, RNN, DL, RL, Transformers, GAN, LLM, RAG\n","Collaborate with cross-functional teams to extract insights, identify business opportunities and provide data-driven recommendations\n","Stay up-to-date with the latest machine learning and AI techniques and tools\n","Communicate complex technical concepts to non-technical stakeholders in an easy-to-understand manner\n","Bachelor's degree or higher in Computer Science, Mathematics, Statistics, Actuarial Science, Informatics, Information Science or related fields\n","Strong analytical skills and attention to detail\n","Participation in Kaggle, Mathematics Olympiad or similar competitions is a plus\n","Excellent programming skills in Python, R, Java, or C++\n","Familiar with ML frameworks such as Tensorflow, Keras, PyTorch, MLFlow, AutoML, TensorRT, CUDA\n","Excellent communication and collaboration skills\n","Experience with designing, training, and deploying machine learning models\n","Customer centric and committed to deliver the best AI results to customers\n"," \n","RESUME: Luis Antonio Rodrigues is an accomplished data scientist and machine learning engineer with over eight years of experience in developing innovative machine learning products and services. He holds a BSc in Mathematics, an MSc, and a PhD in Mechanical Engineering from the University of Campinas, one of the most renowned universities in Latin America. Luis's expertise spans across various domains including Natural Language Processing (NLP), Recommender Systems, Marketing and CRM, and Time-Series Forecasting, with significant contributions across Banking, Consumer Packaged Goods, Retail, and Telecommunications industries.\n","Currently serving as a Principal Data Scientist at DEUS, an AI firm dedicated to human-centered solutions, Luis plays a crucial role in the development of a cutting-edge Retrieval-Augmented Generation (RAG) solution. His responsibilities include improving the knowledge-to-text module, optimizing information retrieval for efficiency and precision, and enhancing text generation for real-time accuracy, showcasing his skills in RAG, IR, LLM, NLP, and several tools and platforms. Additionally, he has contributed as a Data Architect in designing a medallion architecture for a Databricks lakehouse on AWS.\n","Previously, Luis held the position of Principal Data Consultant at Aubay Portugal, where he led an NLP project for Banco de Portugal, focusing on AI services such as summarization, information extraction, complaint text classification, and financial sentiment analysis. At CI&T, as Lead Data Scientist, he was instrumental in developing a recommender system for Nestlé, resulting in a 6% sales increase. During his time at Propz, he developed a recommender system for Carrefour, which boosted revenue by 3%.\n","His earlier roles include a researcher at I.Systems, focusing on water distribution systems, and at the University of Campinas, where his work centered on system and control theory. Luis's proficiency is further demonstrated by his certifications in MLOps with Azure Machine Learning, TensorFlow 2.0, and Python for Time Series Data Analysis. Luis combines his deep technical knowledge with strong communication skills to lead teams and projects towards achieving significant business impacts.\n","Why is RESUME suitable for the JOB?\n","\\<bot>\\:\"\"\"\n","\n","\n","model_id = \"llmware/dragon-yi-6b-v0\" \n","tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)\n","model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)\n","\n","inputs = tokenizer(new_prompt, return_tensors=\"pt\") \n","start_of_output = len(inputs.input_ids[0])\n","\n","# temperature: set at 0.3 for consistency of output\n","# max_new_tokens: set at 100 - may prematurely stop a few of the summaries\n","\n","device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n","\n","outputs = model.generate(\n"," inputs.input_ids.to(device),\n"," eos_token_id=tokenizer.eos_token_id,\n"," pad_token_id=tokenizer.eos_token_id,\n"," do_sample=True,\n"," temperature=0.3,\n"," max_new_tokens=100,\n",")\n","\n","output = tokenizer.decode(outputs[0][start_of_output:],skip_special_tokens=True) \n","\n","# note: due to artifact of the fine-tuning, use this post-processing with HF generation \n","if \"llmware\" in model_id:\n"," eot = output.find(\"<|endoftext|>\")\n"," if eot > -1:\n"," output = output[:eot]\n","\n","output"]},{"cell_type":"markdown","metadata":{},"source":["## Mamba"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["#!pip install torch==2.1.0 transformers==4.35.0 causal-conv1d==1.0.0 mamba-ssm==1.0.1"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import torch\n","from transformers import AutoTokenizer, AutoModelForCausalLM\n","from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel\n","\n","CHAT_TEMPLATE_ID = \"HuggingFaceH4/zephyr-7b-beta\"\n","\n","device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n","model_name = \"clibrain/mamba-2.8b-instruct-openhermes\"\n","\n","eos_token = \"<|endoftext|>\"\n","tokenizer = AutoTokenizer.from_pretrained(model_name)\n","tokenizer.eos_token = eos_token\n","tokenizer.pad_token = tokenizer.eos_token\n","tokenizer.chat_template = AutoTokenizer.from_pretrained(CHAT_TEMPLATE_ID).chat_template\n","\n","model = MambaLMHeadModel.from_pretrained(\n"," model_name, device=device, dtype=torch.float16)\n","\n","messages = []\n","prompt = \"Tell me 5 sites to visit in Spain\"\n","messages.append(dict(role=\"user\", content=prompt))\n","\n","input_ids = tokenizer.apply_chat_template(\n"," messages, return_tensors=\"pt\", add_generation_prompt=True\n",").to(device)\n","\n","out = model.generate(\n"," input_ids=input_ids,\n"," max_length=2000,\n"," temperature=0.9,\n"," top_p=0.7,\n"," eos_token_id=tokenizer.eos_token_id,\n",")\n","\n","decoded = tokenizer.batch_decode(out)\n","assistant_message = (\n"," decoded[0].split(\"<|assistant|>\\n\")[-1].replace(eos_token, \"\")\n",")\n","\n","print(assistant_message)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]}],"metadata":{"kaggle":{"accelerator":"none","dataSources":[{"datasetId":4284628,"sourceId":7654855,"sourceType":"datasetVersion"}],"dockerImageVersionId":30626,"isGpuEnabled":false,"isInternetEnabled":true,"language":"python","sourceType":"notebook"},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.8"}},"nbformat":4,"nbformat_minor":4}

src/app/app.py CHANGED Viewed

@@ -1,6 +1,7 @@
 import gradio as gr
 from resume_worth.utils.utils import get_params
 from resume_worth.pipelines.information_retrieval.pipeline import retrieve_top_job_vacancy_info
 params = get_params()
@@ -11,22 +12,27 @@ app_frontend = params['app_frontend']
 def salary_estimator(job_title: str, resume: str):
     if job_title is None:
-        salary_range = ""
-        job_url = app_frontend['messages']['job_title_not_found']
     elif len(resume) < app_backend['min_resume_size']:
-        salary_range = ""
         job_url = app_frontend['messages']['too_short_resume']
     else:
-        salary_range, job_url = retrieve_top_job_vacancy_info(job_title, resume)
         if salary_range is None:
-            salary_range = ""
-            job_url = app_frontend['messages']['salary_not_found']
-    return salary_range, job_url
 def run():
@@ -35,19 +41,20 @@ def run():
         fn=salary_estimator,
         inputs=[
             gr.Radio(app_frontend['jobs'], label="Job Title"),
-            gr.Textbox(label="Resume", lines=10)
         ],
         outputs=[
             gr.Textbox(label="Salary Range Estimation", lines=1),
-            gr.Textbox(label="Suitable Job Vacancy Sample", lines=1)
         ],
         title=app_frontend['title'],
         description=app_frontend['description'],
-        allow_flagging="never"
     )
     # Use share=True to create a public link to share. This share link expires in 72 hours.
-    app.launch(server_name=app_config['host'], server_port=app_config['port'])
 if __name__ == "__main__":

 import gradio as gr
 from resume_worth.utils.utils import get_params
 from resume_worth.pipelines.information_retrieval.pipeline import retrieve_top_job_vacancy_info
+from resume_worth.pipelines.text_generation.pipeline import generate_explanation_why_resume_for_a_job
 params = get_params()
 def salary_estimator(job_title: str, resume: str):
+    salary_range, job_url, explanation = '', '', ''
     if job_title is None:
+        salary_range = app_frontend['messages']['job_title_not_found']
     elif len(resume) < app_backend['min_resume_size']:
         job_url = app_frontend['messages']['too_short_resume']
     else:
+        salary_range, job_url, job_description = retrieve_top_job_vacancy_info(job_title, resume)
         if salary_range is None:
+            salary_range = app_frontend['messages']['salary_not_found']
+        if job_url is None:
+            salary_range = app_frontend['messages']['active_job_not_found']
+        if job_description is not None:
+            explanation = generate_explanation_why_resume_for_a_job(resume, job_description)
+    return salary_range, job_url, explanation
 def run():
         fn=salary_estimator,
         inputs=[
             gr.Radio(app_frontend['jobs'], label="Job Title"),
+            gr.Textbox(label="Resume", lines=10),
         ],
         outputs=[
             gr.Textbox(label="Salary Range Estimation", lines=1),
+            gr.Textbox(label="Suitable Job Vacancy Sample", lines=1),
+            gr.Textbox(label="Why the Resume and the Job Match", lines=1),
         ],
         title=app_frontend['title'],
         description=app_frontend['description'],
+        allow_flagging="never",
     )
     # Use share=True to create a public link to share. This share link expires in 72 hours.
+    app.launch(server_name=app_config['host'], server_port=app_config['port'], max_threads=8)
 if __name__ == "__main__":

src/resume_worth/pipelines/information_retrieval/nodes.py CHANGED Viewed

@@ -24,7 +24,7 @@ with open(file_path, "r") as f:
 def retrieve_top_job_vacancies(job_title: str, resume: str, k: int=3):
     # For complex queries, use MongoDB-like operators:
     #   $gt, $gte, $lt, $lte, $ne, $eq, $in, $nin
@@ -48,21 +48,24 @@ def retrieve_top_job_vacancies(job_title: str, resume: str, k: int=3):
     return job_docs
-def get_vacancy_salary(doc: list[Document]):
     salary = doc.metadata["salary"]
     return salary
-def get_vacancy_description(doc: list[Document]):
     description = doc.page_content
     return description
-def get_vacancy_url(doc: list[Document]):
     job_id = doc.metadata["id"]

 def retrieve_top_job_vacancies(job_title: str, resume: str, k: int=3):
+    """Function to retrieve the most similar job vacancies to a resume"""
     # For complex queries, use MongoDB-like operators:
     #   $gt, $gte, $lt, $lte, $ne, $eq, $in, $nin
     return job_docs
+def get_vacancy_salary(doc: Document):
+    """Function to get the salary of a job vacancy"""
     salary = doc.metadata["salary"]
     return salary
+def get_vacancy_description(doc: Document):
+    """Function to get the description of a job vacancy"""
     description = doc.page_content
     return description
+def get_vacancy_url(doc: Document):
+    """Function to get the URL of a job vacancy"""
     job_id = doc.metadata["id"]

src/resume_worth/pipelines/information_retrieval/pipeline.py CHANGED Viewed

@@ -5,23 +5,27 @@ This pipeline utilizes the user resume to first pull the most similar vacancy fr
 """
-from resume_worth.pipelines.information_retrieval.nodes import retrieve_top_job_vacancies, get_vacancy_salary, get_vacancy_url
 def retrieve_top_job_vacancy_info(job_title: str, resume: str):
-    # Stage 1 - Retrieve most similar jobs
     job_docs = retrieve_top_job_vacancies(job_title, resume, k=1)
-    # Stage 2 - Get the salary for retrieved jobs
-    salary, url = None, None
-    if len(job_docs) > 0:
-        salary = get_vacancy_salary(job_docs[0])
-        url = get_vacancy_url(job_docs[0])
-    return salary, url
 if __name__ == "__main__":
@@ -30,8 +34,10 @@ if __name__ == "__main__":
     job_title = "machine learning engineer"
     resume = "I design, develop, and deploy machine learning models and algorithms for complex and unique datasets."
-    k = 2
-    salaries = retrieve_top_job_vacancy_info(job_title, resume, k)
-    print(salaries)

 """
+from resume_worth.pipelines.information_retrieval.nodes import retrieve_top_job_vacancies, get_vacancy_salary, get_vacancy_url, get_vacancy_description
 def retrieve_top_job_vacancy_info(job_title: str, resume: str):
+    # Stage 1 - Retrieve the most similar job
     job_docs = retrieve_top_job_vacancies(job_title, resume, k=1)
+    top_job_doc = None
+    if len(job_docs) > 0:
+        top_job_doc = job_docs[0]
+    # Stage 2 - Get the salary for retrieved job
+    salary, url, description = None, None, None
+    if top_job_doc is not None:
+        salary = get_vacancy_salary(top_job_doc)
+        url = get_vacancy_url(top_job_doc)
+        description = get_vacancy_description(top_job_doc)
+    return salary, url, description
 if __name__ == "__main__":
     job_title = "machine learning engineer"
     resume = "I design, develop, and deploy machine learning models and algorithms for complex and unique datasets."
+    salary, url, description = retrieve_top_job_vacancy_info(job_title, resume)
+    print("salary:", salary)
+    print("job url:", url)
+    print("job description:", description)

src/resume_worth/pipelines/text_generation/nodes.py ADDED Viewed

	@@ -0,0 +1,61 @@

+import os
+from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
+from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
+from langchain_core.prompts import PromptTemplate
+from langchain.prompts import load_prompt
+from functools import lru_cache
+import transformers
+transformers.logging.set_verbosity_error()
+@lru_cache(maxsize=None)
+def load_hf_text_generation_model_to_langchain(model_id:str='gpt2', top_k:int=50, top_p:float=0.95, temperature:float=0.4, max_new_tokens:int=1024):
+    """
+    Function to load a text generation model hosted on Hugging Face to se used in LangChain.
+    More info, see: https://python.langchain.com/docs/integrations/llms/huggingface_pipelines/
+    """
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+    model = AutoModelForCausalLM.from_pretrained(model_id)
+    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
+            return_full_text=False, do_sample=True,
+            top_p=top_p, top_k=top_k, temperature=temperature, max_new_tokens=max_new_tokens,
+            num_beams=1, repetition_penalty=1.1, num_return_sequences=1
+        )
+    hf = HuggingFacePipeline(pipeline=pipe)
+    return hf
+def create_langchain_prompt_template_for_m4_ai_models(user_prompt: str, promp_path:str=None):
+    """Function to create a LangChain prompt template for M4-AI text generation models"""
+    template = f"<|im_start|>user\n{user_prompt}<|im_end|>\n<|im_start|>assistant\n"
+    prompt = PromptTemplate.from_template(template)
+    if promp_path:
+        prompt.save(promp_path)
+    return prompt
+@lru_cache(maxsize=None)
+def load_langchain_prompt_template(promp_path: str):
+    """Function to load a LangChain prompt template"""
+    prompt = load_prompt(promp_path)
+    return prompt
+def create_langchain_chain(prompt: PromptTemplate, hf_text_generation: HuggingFacePipeline):
+    """
+    Create a chain by composing the HF text generation model with a LangChain prompt template.
+    More info, see: https://python.langchain.com/docs/integrations/llms/huggingface_pipelines/
+    """
+    chain = prompt | hf_text_generation
+    return chain

src/resume_worth/pipelines/text_generation/pipeline.py ADDED Viewed

	@@ -0,0 +1,67 @@

+"""
+Text Generation Pipeline
+This pipeline utilizes an LLM to explain why the retrieved job vacancy is a good fit for the user's resume.
+"""
+import os
+from resume_worth.utils.utils import get_params
+from resume_worth.pipelines.text_generation.nodes import load_hf_text_generation_model_to_langchain, load_langchain_prompt_template, create_langchain_chain
+params = get_params()
+model_id = params['model_id']
+top_p = params['top_p']
+top_k = params['top_k']
+temperature = params['temperature']
+max_new_tokens = params['max_new_tokens']
+prompt_dir = params['prompt_dir']
+promp_file = params['promp_file']
+def generate_explanation_why_resume_for_a_job(resume: str, job: str):
+    # Stage 1 - [cacheable] Load text generation model
+    text_generation_model = load_hf_text_generation_model_to_langchain(model_id, top_k, top_p, temperature, max_new_tokens)
+    # Stage 2 - [cacheable] Load text generation model
+    promp_path = os.path.join(prompt_dir, promp_file)
+    prompt_template = load_langchain_prompt_template(promp_path)
+    # Stage 3 - Create a chain by composing the prompt and model
+    text_generation_chain = create_langchain_chain(prompt_template, text_generation_model)
+    # Stage 4 - Generate the answer by involking the create chain
+    answer = text_generation_chain.invoke({"resume": resume, "job": job})
+    return answer
+if __name__ == "__main__":
+    # EXAMPLE
+    resume =  """Luis Antonio Rodrigues is an accomplished data scientist and machine learning engineer with over eight years of experience in developing innovative machine learning products and services. He holds a BSc in Mathematics, an MSc, and a PhD in Mechanical Engineering from the University of Campinas, one of the most renowned universities in Latin America. Luis's expertise spans across various domains including Natural Language Processing (NLP), Recommender Systems, Marketing and CRM, and Time-Series Forecasting, with significant contributions across Banking, Consumer Packaged Goods, Retail, and Telecommunications industries.
+    Currently serving as a Principal Data Scientist at DEUS, an AI firm dedicated to human-centered solutions, Luis plays a crucial role in the development of a cutting-edge Retrieval-Augmented Generation (RAG) solution. His responsibilities include improving the knowledge-to-text module, optimizing information retrieval for efficiency and precision, and enhancing text generation for real-time accuracy,  showcasing his skills in RAG, IR, LLM, NLP, and several tools and platforms. Additionally, he has contributed as a Data Architect in designing a medallion architecture for a Databricks lakehouse on AWS.
+    Previously, Luis held the position of Principal Data Consultant at Aubay Portugal, where he led an NLP project for Banco de Portugal, focusing on AI services such as summarization, information extraction, complaint text classification, and financial sentiment analysis. At CI&T, as Lead Data Scientist, he was instrumental in developing a recommender system for Nestlé, resulting in a 6% sales increase. During his time at Propz, he developed a recommender system for Carrefour, which boosted revenue by 3%.
+    His earlier roles include a researcher at I.Systems, focusing on water distribution systems, and at the University of Campinas, where his work centered on system and control theory. Luis's proficiency is further demonstrated by his certifications in MLOps with Azure Machine Learning, TensorFlow 2.0, and Python for Time Series Data Analysis. Luis combines his deep technical knowledge with strong communication skills to lead teams and projects towards achieving significant business impacts."""
+    job = """Design, develop, and deploy machine learning models and algorithms for complex and unique datasets, using various techniques such as mathematical modeling, scikit-learn, NLP, CNN, RNN, DL, RL, Transformers, GAN, LLM, RAG
+    Collaborate with cross-functional teams to extract insights, identify business opportunities and provide data-driven recommendations
+    Stay up-to-date with the latest machine learning and AI techniques and tools
+    Communicate complex technical concepts to non-technical stakeholders in an easy-to-understand manner
+    Bachelor's degree or higher in Computer Science, Mathematics, Statistics, Actuarial Science, Informatics, Information Science or related fields
+    Strong analytical skills and attention to detail
+    Participation in Kaggle, Mathematics Olympiad or similar competitions is a plus
+    Excellent programming skills in Python, R, Java, or C++\nFamiliar with ML frameworks such as Tensorflow, Keras, PyTorch, MLFlow, AutoML, TensorRT, CUDA
+    Excellent communication and collaboration skills\nExperience with designing, training, and deploying machine learning models
+    Customer centric and committed to deliver the best AI results to customers"""
+    answer = generate_explanation_why_resume_for_a_job(resume, job)
+    print(answer)

src/resume_worth/utils/utils.py CHANGED Viewed

@@ -19,6 +19,9 @@ def get_params():
 def load_embedding_model(model_name: str = "sentence-transformers/all-mpnet-base-v2"):
     """ Load a pretrained text embedding model"""
     embedding_model = HuggingFaceEmbeddings(
         model_name=model_name,
         model_kwargs={'device': 'cpu'},

 def load_embedding_model(model_name: str = "sentence-transformers/all-mpnet-base-v2"):
     """ Load a pretrained text embedding model"""
+    # Issue: HuggingFaceEmbeddings can not take trust_remote_code argument
+    # https://github.com/langchain-ai/langchain/issues/6080
+    # So, "nomic-ai/nomic-embed-text-v1.5" can't be used yet.
     embedding_model = HuggingFaceEmbeddings(
         model_name=model_name,
         model_kwargs={'device': 'cpu'},