{"metadata":{"accelerator":"GPU","colab":{"gpuType":"T4","provenance":[]},"gpuClass":"standard","kernelspec":{"name":"python3","display_name":"Python 3","language":"python"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"nvidiaTeslaT4","dataSources":[{"sourceId":7563056,"sourceType":"datasetVersion","datasetId":4403785},{"sourceId":7571253,"sourceType":"datasetVersion","datasetId":4407676},{"sourceId":7678915,"sourceType":"datasetVersion","datasetId":4479814},{"sourceId":7713636,"sourceType":"datasetVersion","datasetId":4504654},{"sourceId":7964016,"sourceType":"datasetVersion","datasetId":4685329},{"sourceId":8521457,"sourceType":"datasetVersion","datasetId":5088083}],"dockerImageVersionId":30699,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":true}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"
The main goal of this project is to utilize Large Language Models (LLMs) to extract specific information from PDF documents and organize it into a structured JSON format.
\nTo achieve this objective, we are assessing various LLMs such as Mistral and Llama 2 to identify the most suitable model. Following this, we will fine-tune the selected model.
\nOur initial step involves converting the PDFs to text while preserving their structural integrity, allowing the LLMs to comprehend and extract information accurately. Subsequently, we will process each page individually, providing it as context to the model and requesting the extraction of information in JSON format.
\nTo compare these models effectively, we will define appropriate metrics and implement them.
\nTo benchmark the LLMs, we need a dataset of contexts paired with correct responses. To build it, we extracted 59 pages from our PDF, asked Gemini to return the extracted information for each page in JSON format, and then verified Gemini's results.
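\nOne possible way to score a model response against a verified reference is a field-level F1. The sketch below is only an illustration under stated assumptions, not the metric this project settles on: the helper names `flatten` and `field_f1` are hypothetical, values are compared by exact string match, and the predicted rate of 0.35% is made up for the example.

```python
# Hypothetical field-level metric; not the notebook's actual implementation.

def flatten(obj, prefix=''):
    # Turn nested dicts and lists into a set of (path, value) pairs.
    if isinstance(obj, dict):
        pairs = set()
        for key, value in obj.items():
            pairs |= flatten(value, f'{prefix}/{key}')
        return pairs
    if isinstance(obj, list):
        pairs = set()
        for index, value in enumerate(obj):
            pairs |= flatten(value, f'{prefix}/{index}')
        return pairs
    # Leaf value: compare as a whitespace-stripped string.
    return {(prefix, str(obj).strip())}


def field_f1(predicted, reference):
    # Precision, recall and F1 over exactly matching (path, value) pairs.
    pred_pairs = flatten(predicted)
    ref_pairs = flatten(reference)
    matched = len(pred_pairs & ref_pairs)
    precision = matched / len(pred_pairs) if pred_pairs else 0.0
    recall = matched / len(ref_pairs) if ref_pairs else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative pair: the prediction gets one of three leaf fields wrong.
reference = {'Channel': 'Mastercard POS',
             'Rates': [{'FeeTier': 'Chip', 'Rate': '0.30%'}]}
predicted = {'Channel': 'Mastercard POS',
             'Rates': [{'FeeTier': 'Chip', 'Rate': '0.35%'}]}
precision, recall, f1 = field_f1(predicted, reference)
```

\nExact matching on list positions is deliberately strict: a response that lists the same rates in a different order is penalized, so an order-insensitive comparison may be a better fit if ordering carries no meaning.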
\n","metadata":{}},{"cell_type":"code","source":"# Read the text version of the PDF\nwith open(\"/kaggle/input/mc-eurpen/mc_EuropeInterchangeManual_Customer (2).txt\", 'r', encoding='utf-8') as file:\n    content = file.read()\n\n# Split on the manual's running title line so that each chunk holds one page\ntoken = 'Interchange and Service Fees Manual: Europe Region • 12 September 2023'\npaginated_doc = content.split(token)","metadata":{"execution":{"iopub.status.busy":"2024-05-30T08:04:35.416756Z","iopub.execute_input":"2024-05-30T08:04:35.417434Z","iopub.status.idle":"2024-05-30T08:04:35.584760Z","shell.execute_reply.started":"2024-05-30T08:04:35.417401Z","shell.execute_reply":"2024-05-30T08:04:35.583784Z"},"trusted":true},"execution_count":19,"outputs":[]},{"cell_type":"code","source":"import json\nimport pandas as pd","metadata":{"execution":{"iopub.status.busy":"2024-05-30T08:04:35.585998Z","iopub.execute_input":"2024-05-30T08:04:35.586375Z","iopub.status.idle":"2024-05-30T08:04:35.590869Z","shell.execute_reply.started":"2024-05-30T08:04:35.586344Z","shell.execute_reply":"2024-05-30T08:04:35.589933Z"},"trusted":true},"execution_count":20,"outputs":[]},{"cell_type":"code","source":"# Pair each verified response with the text of its page.\n# The first JSONL record corresponds to paginated_doc[40].\ndata = []\ni = 40\nwith open(\"/kaggle/input/clean-data-for-fine-t/clean_data.jsonl\", encoding='utf-8') as file:\n    for line in file:\n        response = json.loads(line)\n        context = paginated_doc[i]\n        data.append({\"context\": context, \"response\": response})\n        i += 1","metadata":{"execution":{"iopub.status.busy":"2024-05-30T08:04:35.591995Z","iopub.execute_input":"2024-05-30T08:04:35.592297Z","iopub.status.idle":"2024-05-30T08:04:35.610790Z","shell.execute_reply.started":"2024-05-30T08:04:35.592273Z","shell.execute_reply":"2024-05-30T08:04:35.610076Z"},"trusted":true},"execution_count":21,"outputs":[]},{"cell_type":"code","source":"df = 
pd.DataFrame(data)\ndf.shape","metadata":{"execution":{"iopub.status.busy":"2024-05-30T08:04:35.611678Z","iopub.execute_input":"2024-05-30T08:04:35.611900Z","iopub.status.idle":"2024-05-30T08:04:35.619946Z","shell.execute_reply.started":"2024-05-30T08:04:35.611880Z","shell.execute_reply":"2024-05-30T08:04:35.619125Z"},"trusted":true},"execution_count":22,"outputs":[{"execution_count":22,"output_type":"execute_result","data":{"text/plain":"(59, 2)"},"metadata":{}}]},{"cell_type":"code","source":"pd.set_option('display.max_colwidth', 600) # Display up to 600 characters per column","metadata":{"execution":{"iopub.status.busy":"2024-05-30T08:04:35.621105Z","iopub.execute_input":"2024-05-30T08:04:35.621551Z","iopub.status.idle":"2024-05-30T08:04:35.627728Z","shell.execute_reply.started":"2024-05-30T08:04:35.621521Z","shell.execute_reply":"2024-05-30T08:04:35.626985Z"},"trusted":true},"execution_count":23,"outputs":[]},{"cell_type":"code","source":"df.iloc[7:12]","metadata":{"execution":{"iopub.status.busy":"2024-05-30T08:04:35.628810Z","iopub.execute_input":"2024-05-30T08:04:35.629144Z","iopub.status.idle":"2024-05-30T08:04:35.658940Z","shell.execute_reply.started":"2024-05-30T08:04:35.629115Z","shell.execute_reply":"2024-05-30T08:04:35.658122Z"},"trusted":true},"execution_count":24,"outputs":[{"execution_count":24,"output_type":"execute_result","data":{"text/plain":" context \\\n7 48 \\n \\n \\n \\n ... \n8 49 \\n \\n \\n \\n ... \n9 50 \\n \\n \\n \\n ... \n10 51 \\n \\n \\n \\n ... \n11 52 \\n \\n \\n \\n ... \n\n response \n7 {'message': 'Context lacks a Payment product,FeeTier and Rate'} \n8 {'message': 'Context lacks a Payment product,FeeTier and Rate'} \n9 {'GeographicContext': 'Intercountry', 'SubGeographicContext': 'Intra-EEA', 'Channel': 'Mastercard POS', 'RateType': 'fallback interchange fee rates', 'Notes': ['6 For transactions less than or equal to EUR 25 only. 
Transactions greater than EUR 25 are processed under normal Mastercard acceptance criteria'], 'Rates': [{'PaymentProduct': 'Mastercard Corporate', 'Details': [{'FeeTier': 'Chip', 'IRD': ['50', '53'], 'Rate': '0.30%'}, {'FeeTier': 'Enhanced Electronic', 'IRD': ['84', '88'], 'Rate': '0.30%'}, {'FeeTier': 'Masterpass Wallet', 'IRD': ['PW'], 'Rate': '0.30%'}, {'FeeTier': 'Base', 'IR... \n10 {'GeographicContext': 'Intercountry', 'SubGeographicContext': 'Intra-EEA', 'Channel': 'Mastercard POS', 'RateType': 'fallback interchange fee rates', 'Notes': ['7 If the acquirer meets the requirements and provides the required additional data, Mastercard calculates the fee amount by deducting the incentive amount (= incentive rate applied to the transaction amount)from the applicable interchange fee amount (= interchange rate applied to the transaction amount). Refer to Optional Addendum Message Requirements.'], 'Rates': [{'PaymentProduct': 'Mastercard Corporate\nMastercard Electronic Corp... \n11 {'GeographicContext': 'Intercountry', 'SubGeographicContext': 'Intra-EEA', 'Channel': 'Mastercard POS', 'RateType': 'fallback interchange fee rates', 'Notes': ['8 Applicable to transactions greater than EUR 3,000. The enriched data incentive is not applicable.', '9 Applicable to transactions greater than EUR 10,000. The enriched data incentive is not applicable. '], 'Rates': [{'PaymentProduct': 'Mastercard Purchasing', 'Details': [{'FeeTier': 'Chip\\n(Incentive rate)7', 'IRD': ['50', '53'], 'Rate': '1.25%\\n(-0.50 EUR)'}, {'FeeTier': 'Enhanced Electronic\\n(Incentive rate)7', 'IRD': ['84', '8... ","text/html":"\n | context | \nresponse | \n
---|---|---|
7 | \n48 \\n \\n \\n \\n ... | \n{'message': 'Context lacks a Payment product,FeeTier and Rate'} | \n
8 | \n49 \\n \\n \\n \\n ... | \n{'message': 'Context lacks a Payment product,FeeTier and Rate'} | \n
9 | \n50 \\n \\n \\n \\n ... | \n{'GeographicContext': 'Intercountry', 'SubGeographicContext': 'Intra-EEA', 'Channel': 'Mastercard POS', 'RateType': 'fallback interchange fee rates', 'Notes': ['6 For transactions less than or equal to EUR 25 only. Transactions greater than EUR 25 are processed under normal Mastercard acceptance criteria'], 'Rates': [{'PaymentProduct': 'Mastercard Corporate', 'Details': [{'FeeTier': 'Chip', 'IRD': ['50', '53'], 'Rate': '0.30%'}, {'FeeTier': 'Enhanced Electronic', 'IRD': ['84', '88'], 'Rate': '0.30%'}, {'FeeTier': 'Masterpass Wallet', 'IRD': ['PW'], 'Rate': '0.30%'}, {'FeeTier': 'Base', 'IR... | \n
10 | \n51 \\n \\n \\n \\n ... | \n{'GeographicContext': 'Intercountry', 'SubGeographicContext': 'Intra-EEA', 'Channel': 'Mastercard POS', 'RateType': 'fallback interchange fee rates', 'Notes': ['7 If the acquirer meets the requirements and provides the required additional data, Mastercard calculates the fee amount by deducting the incentive amount (= incentive rate applied to the transaction amount)from the applicable interchange fee amount (= interchange rate applied to the transaction amount). Refer to Optional Addendum Message Requirements.'], 'Rates': [{'PaymentProduct': 'Mastercard Corporate\nMastercard Electronic Corp... | \n
11 | \n52 \\n \\n \\n \\n ... | \n{'GeographicContext': 'Intercountry', 'SubGeographicContext': 'Intra-EEA', 'Channel': 'Mastercard POS', 'RateType': 'fallback interchange fee rates', 'Notes': ['8 Applicable to transactions greater than EUR 3,000. The enriched data incentive is not applicable.', '9 Applicable to transactions greater than EUR 10,000. The enriched data incentive is not applicable. '], 'Rates': [{'PaymentProduct': 'Mastercard Purchasing', 'Details': [{'FeeTier': 'Chip\\n(Incentive rate)7', 'IRD': ['50', '53'], 'Rate': '1.25%\\n(-0.50 EUR)'}, {'FeeTier': 'Enhanced Electronic\\n(Incentive rate)7', 'IRD': ['84', '8... | \n