{"metadata":{"accelerator":"GPU","colab":{"gpuType":"T4","provenance":[]},"gpuClass":"standard","kernelspec":{"name":"python3","display_name":"Python 3","language":"python"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"nvidiaTeslaT4","dataSources":[{"sourceId":7571253,"sourceType":"datasetVersion","datasetId":4407676},{"sourceId":7601235,"sourceType":"datasetVersion","datasetId":4425013},{"sourceId":7678915,"sourceType":"datasetVersion","datasetId":4479814},{"sourceId":7713636,"sourceType":"datasetVersion","datasetId":4504654},{"sourceId":4298,"sourceType":"modelInstanceVersion","isSourceIdPinned":true,"modelInstanceId":3093},{"sourceId":5110,"sourceType":"modelInstanceVersion","isSourceIdPinned":true,"modelInstanceId":3898}],"dockerImageVersionId":30665,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":true}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"
The main goal of this project is to use Large Language Models (LLMs) to extract specific information from PDF documents and organize it into a structured JSON format.
\nTo achieve this, we evaluate several LLMs, such as Mistral and Llama 2, to identify the most suitable model, and then fine-tune the selected one.
\nThe first step is to convert the PDFs to text while preserving their layout, so that the LLMs can interpret the content and extract information accurately. We then process each page individually: the page is passed to the model as context, and the model is asked to return the extracted information in JSON format.
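\nThe per-page step could be sketched roughly as follows. This is only an illustration: the prompt wording, the `FIELDS` list (copied from the keys of the reference answers in this notebook) and the helper names `build_prompt` and `parse_json_answer` are assumptions, not the prompt or code actually used in the project.

```python
import json

# Assumed key list, taken from the reference answers in this notebook.
FIELDS = ['GeographicContext', 'SubGeographicContext', 'Channel',
          'RateType', 'Notes', 'Rates']

# Hypothetical prompt wording; the real prompt may differ.
PROMPT_TEMPLATE = '''Extract the fee information from the page below.
Answer with a single JSON object using these keys: {keys}.
If the page has no payment product, fee tier and rate, answer with a JSON object that holds only a message key.

Page:
{page}
'''


def build_prompt(page_text):
    '''Wrap one page of extracted text in the extraction instruction.'''
    return PROMPT_TEMPLATE.format(keys=', '.join(FIELDS), page=page_text.strip())


def parse_json_answer(raw_answer):
    '''Return the JSON object found in a model answer, or None if there is none.'''
    # Models often surround the JSON with extra text, so keep only the
    # span from the first opening brace to the last closing brace.
    start = raw_answer.find('{')
    end = raw_answer.rfind('}')
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw_answer[start:end + 1])
    except json.JSONDecodeError:
        return None


prompt = build_prompt(' Payment product Fee tier IRD Fee rate ')
```

Returning `None` for unparseable answers keeps invalid JSON visible as its own failure case instead of raising in the middle of a benchmark run.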
\nTo compare the models effectively, we define and implement appropriate metrics.
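\nOne metric that might fit this task is an F1 score over the leaf values of the predicted and reference JSON objects. The sketch below is an assumption about what such a metric could look like, not the metric chosen in this project; `flatten` and `leaf_f1` are hypothetical helper names.

```python
from collections import Counter


def flatten(obj, prefix=''):
    '''Yield (path, value) pairs for every leaf of a nested JSON-like object.'''
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, prefix + '/' + str(key))
    elif isinstance(obj, list):
        for item in obj:
            # List positions are ignored so that reordered entries still match.
            yield from flatten(item, prefix + '/[]')
    else:
        yield (prefix, str(obj).strip())


def leaf_f1(predicted, reference):
    '''F1 score between the leaf (path, value) pairs of two JSON objects.'''
    pred = Counter(flatten(predicted))
    ref = Counter(flatten(reference))
    if not pred and not ref:
        return 1.0  # Two empty objects are treated as a perfect match.
    # Multiset intersection counts each matching leaf at most once per side.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = {'Channel': 'X', 'Rates': [{'Rate': '1.65%', 'IRD': ['Q6', 'Q7']}]}
predicted = {'Channel': 'X', 'Rates': [{'Rate': '1.65%', 'IRD': ['Q6']}]}
score = leaf_f1(predicted, reference)
```

Because list positions are dropped, this score does not check that a rate stays attached to the right payment product; a stricter variant would first align the entries of `Rates` before comparing them.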
\nTo benchmark the LLMs, we need a dataset of contexts paired with correct responses. To create it, we extracted 59 pages from our PDF, asked Gemini to answer for each page in JSON format, and then verified Gemini's results.
\n","metadata":{}},{"cell_type":"code","source":"import json\n\nimport pandas as pd\n\n# Load the text extracted from the PDF.\nwith open(\"/kaggle/input/mc-eurpen/mc_EuropeInterchangeManual_Customer (2).txt\", 'r', encoding='utf-8') as file:\n    content = file.read()\n\n# Split the text into pages on the header line that carries the manual title and date.\ntoken = 'Interchange and Service Fees Manual: Europe Region • 12 September 2023'\npaginated_doc = content.split(token)","metadata":{"execution":{"iopub.status.busy":"2024-03-19T09:21:08.831092Z","iopub.execute_input":"2024-03-19T09:21:08.831453Z","iopub.status.idle":"2024-03-19T09:21:08.989767Z","shell.execute_reply.started":"2024-03-19T09:21:08.831427Z","shell.execute_reply":"2024-03-19T09:21:08.988943Z"},"trusted":true},"execution_count":24,"outputs":[]},{"cell_type":"code","source":"# Pair each reference answer with its page; the answers start at paginated_doc[40].\ndata = []\ni = 40\nwith open(\"/kaggle/input/clean-data-for-fine-t/clean_data.jsonl\", encoding='utf-8') as file:\n    for line in file:\n        response = json.loads(line)\n        context = paginated_doc[i]\n        data.append({\"context\": context, \"response\": response})\n        i += 1","metadata":{"execution":{"iopub.status.busy":"2024-03-19T09:21:13.290707Z","iopub.execute_input":"2024-03-19T09:21:13.291059Z","iopub.status.idle":"2024-03-19T09:21:13.304858Z","shell.execute_reply.started":"2024-03-19T09:21:13.291029Z","shell.execute_reply":"2024-03-19T09:21:13.304154Z"},"trusted":true},"execution_count":25,"outputs":[]},{"cell_type":"code","source":"df = pd.DataFrame(data)\ndf.shape","metadata":{"execution":{"iopub.status.busy":"2024-03-19T09:21:18.615523Z","iopub.execute_input":"2024-03-19T09:21:18.616279Z","iopub.status.idle":"2024-03-19T09:21:18.625824Z","shell.execute_reply.started":"2024-03-19T09:21:18.616247Z","shell.execute_reply":"2024-03-19T09:21:18.624673Z"},"trusted":true},"execution_count":26,"outputs":[{"execution_count":26,"output_type":"execute_result","data":{"text/plain":"(59, 
2)"},"metadata":{}}]},{"cell_type":"code","source":"print(df.loc[0,'context'])","metadata":{"execution":{"iopub.status.busy":"2024-03-19T09:21:20.226824Z","iopub.execute_input":"2024-03-19T09:21:20.227198Z","iopub.status.idle":"2024-03-19T09:21:20.236255Z","shell.execute_reply.started":"2024-03-19T09:21:20.227167Z","shell.execute_reply":"2024-03-19T09:21:20.235149Z"},"trusted":true},"execution_count":27,"outputs":[{"name":"stdout","text":" 41 \n \n \n \n \n \n Global program rates \n \n \n \n \n \n \n IRD and program name Product code Rate (USD) \n \n BB MBS: Mastercard B2B Product 1 2.00% + USD 0.00 \n \n Commercial Business-to-Business MBA: Mastercard B2B Product 2 1.80% + USD 0.00 \n \n MBG: Mastercard B2B Product 3 1.60% + USD 0.00 \n MBH: Mastercard B2B Product 4 1.40% + USD 0.00 \n \n MBI: Mastercard B2B Product 5 1.20% + USD 0.00 \n \n MBJ: Mastercard B2B Product 6 1.00% + USD 0.00 \n \n MTA: Mastercard B2B Product 7 2.00% + USD 0.00 \n \n MTB: Mastercard B2B Product 8 1.90% + USD 0.00 \n \n MTC: Mastercard B2B Product 9 1.80% + USD 0.00 \n \n MTD: Mastercard B2B Product 10 1.70% + USD 0.00 \n \n MTE: Mastercard B2B Product 11 1.60% + USD 0.00 \n \n MTF: Mastercard B2B Product 12 1.50% + USD 0.00 \n \n MTG: Mastercard B2B Product 13 1.40% + USD 0.00 \n MTH: Mastercard B2B Product 14 1.30% + USD 0.00 \n \n MTI: Mastercard B2B Product 15 1.20% + USD 0.00 \n \n MTJ: Mastercard B2B Product 16 1.10% + USD 0.00 \n \n MTK: Mastercard B2B Product 17 1.00% + USD 0.00 \n \n MTL: Mastercard B2B Product 18 Rate to be announced \n \n MTM: Mastercard B2B Product 19 Rate to be announced \n \n MTN: Mastercard B2B Product 20 Rate to be announced \n \n MTO: Mastercard B2B Product 21 Rate to be announced \n \n MTQ: Mastercard B2B Product 22 Rate to be announced \n \n MTR: Mastercard B2B Product 23 Rate to be announced \n MTS: Mastercard B2B Product 24 Rate to be announced \n \n MTT: Mastercard B2B Product 25 Rate to be announced \n \n MTU: Mastercard B2B Product 26 Rate to be 
announced \n \n MTV: Mastercard B2B Product 27 Rate to be announced \n \n \n \n NOTE: Product codes MTA, MTB, MTC, MTD, MTE, MTF, MTG, MTH, MTI, MTJ, MTK, MTL, MTM, MTN, MTO, \n MTG, MTR, MTS, MTT, MTU, and MTV are effective globally except for the Canada region and Brazil. These \n product codes will be effective in the Canada region and Brazil in Release 23.Q4. \n \n \n \n \n ©1999–2023 Mastercard. Proprietary. All rights reserved. \n \n \n","output_type":"stream"}]},{"cell_type":"code","source":"print(df.loc[0,'response'])","metadata":{"execution":{"iopub.status.busy":"2024-03-19T09:21:23.682480Z","iopub.execute_input":"2024-03-19T09:21:23.683253Z","iopub.status.idle":"2024-03-19T09:21:23.688298Z","shell.execute_reply.started":"2024-03-19T09:21:23.683219Z","shell.execute_reply":"2024-03-19T09:21:23.687362Z"},"trusted":true},"execution_count":28,"outputs":[{"name":"stdout","text":"{'message': 'Context lacks a Payment product,FeeTier and Rate'}\n","output_type":"stream"}]},{"cell_type":"code","source":"print(df.loc[15,'context'])","metadata":{"execution":{"iopub.status.busy":"2024-03-19T09:21:25.726831Z","iopub.execute_input":"2024-03-19T09:21:25.727737Z","iopub.status.idle":"2024-03-19T09:21:25.732887Z","shell.execute_reply.started":"2024-03-19T09:21:25.727700Z","shell.execute_reply":"2024-03-19T09:21:25.731811Z"},"trusted":true},"execution_count":29,"outputs":[{"name":"stdout","text":" 56 \n \n \n \n \n \n Intercountry fallback fee rates \n Intra-EEA Mastercard MoneySend funding transaction fallback service fee rates \n \n \n \n Payment product Fee tier IRD Fee rate \n \n Mastercard N/A Q6, Q7 1.65% \n \n BusinessCard/Mastercard \n Professional Card/ \n Mastercard Executive \n BusinessCard/Mastercard \n Corporate Executive Card \n \n Mastercard Electronic \n BusinessCard \n \n Debit Mastercard for \n Business \n \n \n Mastercard Purchasing N/A Q6, Q7 1.65% \n \n Mastercard Fleetcard N/A Q6, Q7 1.65% \n \n Mastercard Prepaid N/A Q6, Q7 1.65% \n Commercial \n \n 
\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ©1999–2023 Mastercard. Proprietary. All rights reserved. \n \n \n","output_type":"stream"}]},{"cell_type":"code","source":"print(df.loc[15,'response'])","metadata":{"execution":{"iopub.status.busy":"2024-03-19T09:21:29.720424Z","iopub.execute_input":"2024-03-19T09:21:29.721042Z","iopub.status.idle":"2024-03-19T09:21:29.725997Z","shell.execute_reply.started":"2024-03-19T09:21:29.721011Z","shell.execute_reply":"2024-03-19T09:21:29.725117Z"},"trusted":true},"execution_count":30,"outputs":[{"name":"stdout","text":"{'GeographicContext': 'Intercountry', 'SubGeographicContext': 'Intra-EEA', 'Channel': 'Mastercard MoneySend funding transaction', 'RateType': 'fallback service fee rates', 'Notes': [], 'Rates': [{'PaymentProduct': 'Mastercard BusinessCard/Mastercard Professional Card/Mastercard Executive BusinessCard/Mastercard Corporate Executive Card\\nMastercard Electronic BusinessCard\\nDebit Mastercard for Business', 'Details': [{'FeeTier': 'N/A', 'IRD': ['Q6', 'Q7'], 'Rate': '1.65%'}]}, {'PaymentProduct': 'Mastercard Purchasing', 'Details': [{'FeeTier': 'N/A', 'IRD': ['Q6', 'Q7'], 'Rate': '1.65%'}]}, {'PaymentProduct': 'Mastercard Fleetcard', 'Details': [{'FeeTier': 'N/A', 'IRD': ['Q6', 'Q7'], 'Rate': '1.65%'}]}, {'PaymentProduct': 'Mastercard Prepaid\\nCommercial', 'Details': [{'FeeTier': 'N/A', 'IRD': ['Q6', 'Q7'], 'Rate': '1.65%'}]}]}\n","output_type":"stream"}]},{"cell_type":"markdown","source":"\n#