AN AI-BASED FRAMEWORK AND DATA-DRIVEN METHODOLOGY FOR POST-PCR HIGH-RESOLUTION MELTING ANALYSIS. A MAJOR PROJECT REPORT SUBMITTED TO THE DEPARTMENT OF COMPUTER APPLICATIONS, BHARATHIAR UNIVERSITY IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF MASTER OF SCIENCE IN DATA ANALYTICS Submitted By RAJAGOPAL S (REG. NO: ******) Under the Guidance of Prof. Dr. V. BHUVANESWARI, M.C.A., M.Phil., Ph.D., Professor, Department of Computer Applications. DEPARTMENT OF COMPUTER APPLICATIONS SCHOOL OF COMPUTER SCIENCE AND ENGINEERING BHARATHIAR UNIVERSITY COIMBATORE-641046 TAMIL NADU MAY 2023

CERTIFICATE

This is to certify that the project titled "AN AI-BASED FRAMEWORK AND DATA-DRIVEN METHODOLOGY FOR POST-PCR HIGH-RESOLUTION MELTING ANALYSIS" submitted to Bharathiar University in partial fulfilment of the requirements for the award of the degree of Master of Science in Data Analytics is a record of the original work done by RAJAGOPAL S under my supervision and guidance, and this project work has not formed the basis for the award of any Degree/Diploma/Associateship/Fellowship or similar title to any candidate of any University. Place: Coimbatore Date: Project Guide Head of the Department Submitted for the University Viva-voce Examination held on _______________ Internal Examiner External Examiner

Microbiological Laboratory Research and Services India Private Limited (An ISO 13485:2016 Certified Company) 0422-2425312/8098701010 E-mail: sales@microserv.in Webpage: www.microserv.in 2nd May 2023, Coimbatore

To Whom It May Concern

This is to certify that Mr. Rajagopal S has been our internship participant from 23rd January 2023 to 2nd May 2023. During this period, he worked on developing "An AI-based framework and data-driven methodology for post-PCR High-Resolution Melting Analysis". This project aimed to develop software capable of interpreting molecular assays for diagnostic purposes, representing the first approach of its kind. Due to the interdisciplinary nature of the project, Mr. Rajagopal S, as a data science specialist, had to collaborate with clinicians and molecular biologists from different fields. As required by the project, Mr. Rajagopal S also demonstrated a strong interest in learning rtPCR and related analysis protocols, which provided the foundation for the development of the software for automated interpretation of molecular assays. In summary, we would like to thank Mr. Rajagopal S for his contributions during his internship at Microbiological Laboratory Research and Services India Private Limited. We wish him the best for his future endeavours. Regards, Dr. Rohit Radhakrishnan, Ph.D., Director (Research and Operations)

DECLARATION

I hereby declare that this project work titled "AN AI-BASED FRAMEWORK AND DATA-DRIVEN METHODOLOGY FOR POST-PCR HIGH-RESOLUTION MELTING ANALYSIS" submitted to the Department of Computer Applications, Bharathiar University is a record of original work done by RAJAGOPAL S under the supervision and guidance of Prof. Dr. V. BHUVANESWARI, M.C.A., M.Phil., Ph.D., Professor, Department of Computer Applications, Bharathiar University, and that this project work has not formed the basis for the award of any Degree/Diploma/Associateship/Fellowship or similar title to any candidate of any University. Place: Coimbatore Signature of the candidate Date: COUNTERSIGNED BY

ACKNOWLEDGEMENT

Union is Strength.
It gives me great pleasure to acknowledge with gratitude the personalities without whose help the completion of this project work would not have been possible. I express my respectful thanks to Prof. Dr. T. DEVI, M.C.A., M.Phil., Ph.D., (UK), Professor and Head, Department of Computer Applications, Bharathiar University, Coimbatore, for permitting me to carry out my project work. I deem it a special privilege to convey my prodigious and everlasting thanks to my guide Prof. Dr. V. BHUVANESWARI, M.C.A., M.Phil., Ph.D., Professor, Department of Computer Applications, Bharathiar University, for her valuable guidance and suggestions for this project work. I extend my sincere thanks to Dr. ROHIT RADHAKRISHNAN, Ph.D., Director (Research and Operations), Microbiological Laboratory Research and Services (I) PVT LTD, Coimbatore, for providing the opportunity to work on their R&D project. Finally, I express my thanks to my dear parents and my dear friends for their support and encouragement in the successful completion of this project. I am highly obliged to those who have helped me directly and indirectly in making this project a successful one.

ABSTRACT

During the onset of the COVID-19 pandemic, clinical laboratories and hospitals worldwide invested significantly in improving their molecular-based diagnostics infrastructure. In the current post-pandemic scenario, the upgraded infrastructure is underused, as there has been a significant decrease in COVID-19 testing. This is primarily due to the unavailability of commercial molecular assays to diagnose different infectious and non-infectious diseases. Another gap in implementing molecular assays in mainstream diagnosis is the unavailability of automated analysis and database management of real-time PCR results for both probe-based and High-Resolution Melt Analysis (HRMA) based assays. Microbiological Laboratory, Coimbatore has patented an HRMA-based identification of molecular targets which is critical for the treatment of fatal diseases such as septicemia. The unavailability of customized predictive analysis and reporting for this HRM increases the dependency on error-prone manual interpretation of such complex data. This thesis is a foundation for developing a first-of-its-kind framework for automated analysis of HRM data which can be customized for different molecular targets. We have used advanced computational techniques such as Machine Learning, Signal Processing and Deep Learning for predictive analysis of the HRM data of tested clinical samples. In this thesis, the team has designed and discussed the fundamental principles for processing, analysing and interpreting HRM data for a representative set of molecular targets, which can aid technicians and clinicians in reporting. We have also developed a database for structured storage and retrieval of HRM data, which could help in linking this analysis software with the existing Laboratory Information Management Software used for reporting clinical results.

TABLE OF CONTENTS

CHAPTER NO DESCRIPTION PAGE NO
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
1.1 ORGANIZATION PROFILE 21
1.2 PROBLEM DEFINITION 21
1.3 BACKGROUND AND NEED 23
1.4 PURPOSE OF THE STUDY 24
1.5 DEFINITIONS 24
1.6 ROLE OF DATA IN REAL-TIME PCR 34
1.7 PREDICTIVE ANALYSIS IN DIAGNOSIS 35
1.8 OVERALL RESEARCH AIM AND OBJECTIVES 37
2. EXISTING SYSTEM
2.1 DATA ANALYSIS SOFTWARE 38
3. LITERATURE REVIEW 50
4. PROPOSED METHODOLOGIES
4.1 KEY ASPECTS 52
4.2 PROPOSED METHODOLOGIES 53
5. APPROACH ON IMAGES OF DNA MELT SIGNALS
5.1 IMAGE PROCESSING 54
RESULT AND DISCUSSION 58
6. APPROACH ON COORDINATES OF RAW FLUORESCENCE SIGNAL
RESULT AND DISCUSSION 73
CONCLUSION 73
7. APPROACH ON COORDINATES OF DNA MELTING SIGNAL
7.1 MELT CONVERSION 74
7.2 SPLINE AND SAVGOL FILTER 76
7.3 BSPLINE 77
7.4 BASELINE SUBTRACTION 80
7.5 BACKGROUND CORRECTION 81
7.6 SIGNAL PROCESSING ON DNA MELTING SIGNAL 83
7.7 THRESHOLDING LOGIC 89
RESULT AND DISCUSSION 90
CONCLUSION 90
8. COMBINATION OF APPROACH ON IMAGES AND THE COORDINATES OF DNA MELTING SIGNAL
8.1 CONVOLUTION NEURAL NETWORK 91
8.2 GENERATING IMAGE DATASET 93
8.3 MODEL ARCHITECTURE 95
RESULT AND DISCUSSION 99
9. SYSTEM DESIGN AND DEVELOPMENTS
9.1 COMPONENTS 100
9.2 EXTRACTOR 102
9.3 PYHRM 105
9.4 MELTCURVE INTERPRETER 110
9.5 ER DIAGRAM 116
10. TESTING AND RESULTS
10.1 TEST DATA 117
CONCLUSION 120
REFERENCES 121

LIST OF FIGURES

FIGURE NO TITLE PAGE NO
1 Capabilities of Laboratory Information Management System 18
2 Amplification of DNA segments 25
3 Melting profile of a PCR product 26
4 Negative derivative plot of the Melting curve 28
5 PCR Amplification Curve 29
6 Theoretical plot of PCR, three phases 30
7 High-Resolution Melt Curve 31
8 Reference DNA melting signal – Meningitis Panel 32
9 A and B: Rotor-Gene Q-Rex Interface 39
10 Rotor-Gene Q-Rex Interface 40
11 Data analysis performed in ScreenClust HRM Software 42
12 Interface of BIO-RAD CFX Manager 43
13 Melt Peak Spreadsheet – CFX Manager 44
14 Interface of melt curve graph in Bio Molecular Systems micPCR software 45
15 Interface of Melt curve graph in ThermoFisher QuantStudio 46
16 Interface of Roche's LightCycler 47
17 uANALYZE Interface 48
18 uMELT Interface 49
19 DNA Melt data report generated by thermal cycler machine 54
20 Colour mask to track yellow colour in the input image 55
21 Melt signal images scanned and cropped from PCR reports 56
22 Image masking performed on melt signal images 56
23 Cropping the images to retain only melt signals 57
24 Raw fluorescence signal plotted using matplotlib 60
25 Properties of raw fluorescence signal 61
26 First 10 linear points of the fluorescence signal 63
27 Fitting a straight line along the fluorescence signal's linear phase 64
28 Fluorescence signals after performing normalization 65
29 Erroneous intersection spotted in the signals 66
30 Raw fluorescence signal – Temperature range (30 ºC to 90 ºC) 67
31 Normalized fluorescence signals 67
32 Extrapolating imaginary lines on both the ends 68
33 Mapping the take-off and touch-down points of the normalized fluorescence signal onto DNA melt signals 70
34 DNA melting signals with double peaks and their corresponding raw fluorescence 72
35 Comparison of a manually converted melting signal (A) to a machine-converted melting signal (B) 75
36 Before (A) and after (B) applying the smoothening filter 78
37 Machine-converted melting signal (A) and manually converted melting signal (B) after applying the smoothening filter 79
38 Features that can be extracted from a melt curve using signal processing 83
39 Detecting peaks in the DNA melting signal 84
40 Calculating the peak prominence of the DNA melting signal 85
41 Calculating the peak width of the DNA melting signal 86
42 Detecting all the features from a melting signal 86
43 Negative (noise) signal with peaks detected 88
44 Measuring the prominences 89
45 Concept of Convolution Neural Network for classifying DNA melting signals 92
46 Training images of DNA melting signals of the classes 'Single', 'Double' and 'Noise' 93
47 Model architecture with layers 95
48 Model performance with Accuracy and Loss 96
49 Confusion matrix for the results of the CNN model 97
50 Classification by the CNN model between genuine and non-genuine peaks 98
51 AI framework 101
52 User interface of Extractor 102
53 Framework of Extractor 103
54 Type of data to extract 104
55 PyHRM installation 105
56 File stack of PyHRM 105
57 Input data format for PyHRM 107
58 Import PyHRM 107
59 Output of plot() 108
60 Features of HRM data using feature_detection() 108
61 Reports of features_detection 109
62 File stack of meltcurve interpreter 110
63 MCI Interface 111
64 MCI Home page 112
65 MCI file upload 112
66 MCI Melt curve visualisation 113
67 MCI amplification curve visualisation 113
68 MCI feature detection panel 114
69 MCI Statistical measures 114
70 MCI Report Generation 115
71 MCI Final Report 115
72 ER diagram of MCI database component 116
73 Melt curve test data 117
74 Features of melt curve test data 117
75 Precision and recall for classification model 118
76 Confusion matrix 118
77 Accuracy and metrics 119

LIST OF TABLES

TABLE NO TITLE PAGE NO
1 DNA melting temperature standards for the Meningitis Panel of the positive sample [Fig. 8] 31
2 Sample DNA melting temperature standards for the Sepsis Panel 33
3 Sample data – coordinates of raw fluorescence 60
4 Model performance metrics 97

CHAPTER 1 INTRODUCTION

Laboratory Information Management Systems (LIMS) have transformed conventional laboratory operations into digitally enabled infrastructure to attain high productivity and efficiency. A LIMS helps a clinical laboratory create an ecosystem for automating workflows, integrating instruments, managing samples and data, collaborating in real time, performing data analytics, monitoring quality control and reporting to patients in a secure, user-friendly and centralized environment (Fig. 1). Thus, the software is crucial not only in clinical labs but also in a wide range of laboratories, from academic research, chemical analysis and manufacturing to agricultural testing, forensics, etc. [1, 2].

Figure 1: Capabilities of Laboratory Information Management System

Clinical laboratories are healthcare institutions, run by laboratory scientists, that offer a variety of techniques to aid physicians with patient diagnostics, care, and management [3]. With the rapid advancements in hardware and software technology, specialized clinical laboratories have been modernized with state-of-the-art machines and instruments that deliver high-quality, accurate testing and diagnosis, with quick analysis and reporting performed in the instruments' respective software. Shirts et al. [4] stated the importance of analytics in clinical laboratories: 'Clinical laboratory analytics is the systematic evaluation and communication of clinical laboratory testing data to improve healthcare operations and patient outcomes' [4, p. 9]. Looking at the system (Fig. 1), the analytics part is the most demanding and includes complicated methods, such as analyzing and interpreting test results and checking quality control data to monitor an instrument's performance and accuracy.
Along with these tasks, the data acquisition and integration phases of the analytics pipeline are limited in their ability to integrate or acquire data from the many instruments, each with the data formats of its manufacturer's software/plugins, within a general LIMS, especially in large-sized labs. A few manufacturers/vendors address this shortage by providing solutions for configuring instruments in the workflow [1, 2, 5], but such software might be expensive for mid- to small-sized laboratories; custom-built software can be an alternative. According to Baron [4], clinical laboratory analytics should focus on improving decision support (i.e., the use of tools or systems to provide clinicians with relevant information, recommendations, and guidelines at the point of care when ordering and interpreting laboratory tests) during test ordering and result interpretation. This approach requires developing a strong decision support infrastructure that embeds both rule-based and machine learning-based algorithms into the clinical workflow. In addition, the importance of using "offline" clinical laboratory analytics to analyze and enhance test utilization (e.g., identifying variations in test ordering patterns between clinicians that cannot be explained by clinical factors) was also discussed [4, p. 11]. Most third-party vendors provide generic LIMS, in which the analytics pipeline involves transferring data from the LIMS to different environments/platforms for analysis. This creates a complex situation for technicians and clinicians in intermediate-level and national reference labs, which own different kinds of instruments [3], [4, p. 11] and must deliver fast and accurate results. Moreover, mid- to large-sized intermediate-level labs face laborious and time-consuming analytics processes when working without a LIMS. With regard to the lab budget, the specific or generic LIMS developed by various PCR manufacturers or software vendors may charge additional fees for updates or require ongoing maintenance for continued support and updates. Molecular diagnostics is a high-complexity clinical laboratory discipline [3] in which the genetic material (DNA or RNA) of cells or pathogens is amplified to detect mutations, gene expression, or infectious agents at the molecular level using PCR [6, Sec. 1.1]. Quantitative polymerase chain reaction (i.e., real-time PCR or rtPCR) data analysis is a highly significant process that includes several distinct techniques, such as experiment setup, data processing, normalization, amplification analysis, assessment of the efficiency and performance of the PCR reaction, and visualization of results [7, 8]; the technique allows DNA amplification to be followed through the real-time accumulation of fluorescence in the reaction. High-Resolution Melt Analysis (HRMA) is an advanced form of conventional Melt Curve Analysis (MCA), performed on an rtPCR instrument or a specialized instrument, to identify the melting temperatures (Tm) of DNA; although melt curve analysis also gives reliable results, HRMA gives more accurate results than MCA [7]. These techniques are supported by the various commercial PCR manufacturers' data analysis software/plugins. Although the stipulated statistical analyses and mathematical algorithms can be performed easily by clinicians with this software, interpreting the information and insights gained from these analyses requires certain expertise in the field.
Wrong interpretation and analysis of PCR test data and post-PCR data (i.e., DNA melting curves, melt peaks, and HRMA curves) lead to serious consequences both for the patients affected and for the laboratory itself. At present, the interpretation of PCR test result data is done manually by clinicians/microbiologists using domain knowledge (e.g., DNA melting, amplification of DNA, high-resolution melt analysis, characteristics of molecular pathogens, and thermodynamics of PCR) and other major parameters (such as the components added in the PCR compound, primers, and so on), through visual inspection and detailed insights gained from the analysis software. This visual interpretation is an intensely laborious and time-consuming process for novice clinicians/researchers and for complex case data; hence it delays the timely delivery of results. Several clinical laboratories own instruments from different PCR manufacturers, along with their respective software/plugins, for their unique features and accurate results; obtaining and interpreting the data is therefore challenging for mid- to large-sized intermediate-level labs, which run numerous PCR experiments on a day-to-day basis. In addition, various democratized software/plugins/web-based applications are available, each with its own advantages and limitations for post-experiment analytics, and these might or might not give accurate results, depending on the types of analysis and techniques used in the software. Following the definition of Shirts et al., this project aims to improve microbiologists'/clinicians' decision support by leveraging Artificial Intelligence and Machine Learning on HRMA data within the conventional analytical process, to improve the performance of result interpretation. Additionally, it aims to address the gap in the analytics phase by developing an automated application to aid laboratorians/clinicians.

1.1 ORGANIZATION PROFILE

The Microbiological Laboratory Research and Services India Private Limited (Microserv) is an ISO 13485:2016 certified medical manufacturing facility and research institute. Microserv develops and produces diagnostic kits such as ready-to-use microbiological culture media for clinical and industrial use, molecular assay reagents and MLRS-STaTAST. MLRS-STaTAST is a novel patented antibiotic sensitivity testing technology jointly developed with Anna University. Microserv also provides training in molecular diagnosis jointly with Bharathiar University. The Microbiological Laboratory, Coimbatore is a leading NABL-accredited clinical laboratory in India. It is the first clinical laboratory to have several molecular assays under the NABL scope, since 2007. Microbiological Laboratory has developed and patented an HRMA-based molecular assay for infectious diseases which is currently being used for patient diagnosis. Microbiological Laboratory operates over 50 branches throughout India that are connected to a central server system, enabling the consistent delivery of high-quality reports across all locations.

1.2 PROBLEM DEFINITION

High-Resolution Melting Analysis (HRMA) involves monitoring the dissociation characteristics of double-stranded DNA during denaturation heating. Mutations and sequence variations in the DNA cause changes in the melting temperature and curve shape, allowing for sensitive and rapid detection of genetic variations without the need for expensive probes or post-PCR processing.
HRMA is aided by commercially available thermal-cycler (PCR) machines and their respective analysis plugins/software, which generate the melting curve graphs from the raw fluorescence data.

Visual interpretation

The interpretation of HRM data is crucial, and it requires a clear understanding of the melting temperatures of every DNA target. HRMA software provides visualization (graphs) of melting signals against temperature, comprising both perfect and imperfect (noisy) signals. Result interpretation usually involves visual observation and analysis by technicians. Experts who perform interpretation must exclude such noisy signals from their analysis by following thresholding metrics. As the number of samples scales up, the interpretation must scale as well, and doing it manually is challenging and time-consuming. Experts typically have extensive experience, and their perception in interpretation is refined and reliable. In practice, however, experts alone cannot perform interpretation at all times, and beginners and junior technicians are also often required to perform interpretation to maintain productivity.

Versatile software

Various PCR instrument manufacturers have their own proprietary data analysis software and algorithms, which calculate and process the HRMA data in stipulated steps. Hence, different software can produce different melting curves for the same sample in HRMA analysis. This can occur due to differences in the algorithms used for data analysis and curve fitting, variations in the baseline correction and normalization methods, and other factors related to data processing and interpretation. To ensure accurate and reliable results, it is important to use a standardized melting curve plotting protocol and software validated for the specific HRMA application.

Predictive analysis

Most HRMA analysis software uses the fluorescence-versus-temperature data of the tested samples to plot the melting curve. The software uses statistical techniques and mathematical algorithms to convert the raw fluorescence signals into melt signals. Such software has various features, including identifying mutations of pathogens, genotyping and detecting SNPs. However, predictive analysis techniques are unavailable in this software for studying the unique features of melt signals (e.g., melting temperature (Tm), melting peak height, curve shape, curve width, inflection point, and area under the curve), which could be used to create digital signatures unique to each target.
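To make these features concrete, the following is a minimal sketch of how they could be computed from a melt signal with standard signal-processing tools (Python with scipy; the helper name melt_signal_features and the default threshold are illustrative assumptions, not part of any manufacturer's software). Features of this kind reappear in Chapter 7, where peak detection, prominence and width are computed on DNA melting signals.

import numpy as np
from scipy.signal import find_peaks, peak_prominences, peak_widths
from scipy.integrate import trapezoid

def melt_signal_features(temps, deriv, threshold=0.4):
    """Illustrative feature extraction from a -dF/dT melt signal (hypothetical helper)."""
    step = temps[1] - temps[0]                       # assumes a uniform temperature grid
    peaks, _ = find_peaks(deriv, height=threshold)   # candidate peaks at or above the threshold
    proms = peak_prominences(deriv, peaks)[0]        # how much each peak stands out
    widths = peak_widths(deriv, peaks, rel_height=0.5)[0] * step  # peak width in ºC
    return {
        "area_under_curve": trapezoid(deriv, temps),
        "peaks": [
            {"Tm": temps[p], "height": deriv[p], "prominence": pr, "width": w}
            for p, pr, w in zip(peaks, proms, widths)
        ],
    }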
1.3 BACKGROUND AND NEED

Analysis, interpretation, and reporting of HRMA that enhance clinicians' decision support, through interoperable, customized predictive analysis and reporting, will decrease the dependency on laborious visual interpretation of such complex data. The existing software has its limitations, producing melt curves with standard statistical and mathematical methods; to a limited extent, there is commercial software that utilizes machine learning algorithms such as principal component analysis and k-means for dimensionality reduction and clustering, for applications such as genotyping and mutation scanning. However, this software is outdated, with support only for the Windows 7 platform. Currently, HRMA data analysis and interpretation require technical expertise, which can result in variability and errors in the results. An AI-based framework can standardize the interpretation and analysis of HRMA data, leading to more accurate and reliable results. Such a framework can provide automated data management, making the process more efficient and less time-consuming. With the integration of machine learning algorithms, the framework can learn from past data and adapt to new data sets, improving its accuracy and efficiency over time. The AI-based framework can also reduce human error and increase the speed of analysis, making it possible to analyze more clinical samples in a shorter time frame. The predictive analysis of HRMA data allows for rapid identification of the pathogen causing the disease. The implementation of an AI-based framework for HRMA data management, interpretation, and reporting can also facilitate the sharing of data between laboratories and clinics, improving collaboration and accelerating the development of new diagnostic tools. The framework can also be used to track the evolution of genetic variations in pathogens, enabling early detection of emerging pathogens and their drug resistance patterns. Overall, an AI-based framework for HRMA data management, interpretation, and reporting has the potential to revolutionize the clinical diagnosis of infectious diseases, making it more accurate, reliable, and efficient. It can provide a standardized approach to HRMA data analysis, allowing for better comparison of results across different laboratories, and can ultimately lead to the development of more effective diagnostic tools and treatment strategies.

1.4 PURPOSE OF THE STUDY

The purpose of the study is to develop and implement an AI framework to analyze and interpret High-Resolution DNA melt data, delivering accurate results and reports quickly to aid clinicians/laboratorians in an intermediate-level laboratory. Interpretation of HRMA data is handled by clinicians through keen visual observation and inspection backed by deep domain expertise, which is a time-consuming and laborious process; the lack of a data acquisition and extraction pipeline further entangles the situation at intermediate-level and national reference laboratories. These factors affect the speed and accuracy of delivering responsive reports to physicians and patients. The study harnesses state-of-the-art AI and ML techniques and algorithms to analyze, interpret, and report without human intervention on a web-based platform. To tackle the interpretation of HRMA data in connection with rtPCR data analysis techniques, the team explored HRMA data of various pathogens by performing pre-processing and feature engineering, with appropriate statistical analyses to gain insights. The team implemented several existing methodologies, beginning with the bare method of examining images of DNA melt signals, moving to approaches that unravel the coordinates of the raw fluorescence signal and the DNA melt signal, and finally a combination of images and coordinates of DNA melt signals. Building on the researched methodologies, the team ultimately devised a feasible and practical solution: a Python-based library with custom-trained Deep Learning models for interpretation, whose results were measured and validated with expert clinicians. Alongside these research methods, the team developed automated software for data extraction from the PCR data analysis plugin/software. The goal of the study is to implement the web-based AI framework for analysis, interpretation, and reporting to assist technicians and clinicians.
Another goal of the study is to develop automated software for data extraction.

1.5 DEFINITIONS

This section explains the major concepts in detail, such as the PCR reaction, the processes behind the reaction, DNA melting behaviour, amplification analysis, and so on.

1.5.1 POLYMERASE CHAIN REACTION

The Polymerase Chain Reaction (PCR) is a molecular technique used for DNA quantification, biomarker identification, genotyping, and mutation detection. The technique is based on the amplification of a specific segment of DNA into several copies, using a DNA polymerase enzyme (Fig. 2). PCR uses short synthetic DNA fragments called primers, designed based on sequences specific to each target. The segment of DNA complementing the primer sequences is amplified over multiple PCR cycles until it reaches the limit of detection [9, 10].

Figure 2: Amplification of DNA segments. Source: National Human Genome Research Institute

Traditional PCR demands that the product be examined following the completion of the reaction; this procedure is frequently referred to as "end point" analysis [7]. Owing to improvements in hardware and software, real-time PCR (also known as quantitative PCR, qPCR or rtPCR) emerged as a variation that continuously accumulates fluorescent signals over successive polymerase reactions and permits detection of DNA amplification at the right moment [11]. Real-time PCR uses commercially available fluorescence-detecting thermocyclers to amplify specific nucleic-acid sequences tagged with different types of fluorescent dyes (probes and SYBR™ Green dye) and measure their concentration simultaneously. Target sequences are amplified and quantified simultaneously in the same PCR machine. Hence, the PCR amplification of the target sequence can be monitored in real time, eliminating quantification steps such as agarose gel electrophoresis [11]. PCR was widely used during the recent COVID-19 outbreak to manage the epidemic across the world and remains the gold-standard method for COVID-19 diagnosis. COVID-19 diagnosis is the latest application that has popularised the qPCR technique in recent times, although the application of PCR in the diagnosis of pathogens (targets) such as bacteria, viruses, fungi, and other non-culture biomarkers has been available for several years [12, 13].
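As a brief worked illustration of the amplification principle (an idealized textbook model, not a description of any particular instrument), the copy number after n cycles grows geometrically with the reaction efficiency E as N_n = N_0 × (1 + E)^n, where E = 1 corresponds to perfect doubling:

def copies_after_cycles(n0, n_cycles, efficiency=1.0):
    """Idealized amplification: N_n = N_0 * (1 + E)^n, with E = 1 for perfect doubling."""
    return n0 * (1.0 + efficiency) ** n_cycles

# 10 starting copies, 30 cycles of perfect doubling -> about 1.07e10 copies
print(copies_after_cycles(10, 30))  # 10737418240.0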
1.5.2 DNA MELTING

The dissociation of the double-stranded DNA (dsDNA) helix into single coils is referred to as DNA melting. It can be accomplished by simply heating double-stranded DNA. The temperature at which the DNA strands dissociate into single coils depends on the number of hydrogen bonds holding the complementary strands together. The most commonly used method to determine the melting temperature of a PCR product is to subject the product to a temperature gradient in the presence of an intercalating dye. Intercalating dyes are chemicals that only emit light when bound to double-stranded DNA. In a typical melting experiment, a PCR product is mixed with an intercalating dye, and the fluorescence emitted by this mix is monitored as the sample is slowly heated (subjected to a temperature gradient). The outcome of the analysis is a curve displaying the fluorescence changes emitted by the sample over the range of temperatures to which it was subjected, commonly referred to as a melting profile (Fig. 3).

Figure 3: Melting profile of a PCR product. Source: MethylDetect

At the beginning of the experiment, the temperature is low and all PCR product in the sample is double-stranded; thus, the fluorescence level in the sample is high (Fig. 3-A). High levels of fluorescence are observed as the temperature increases, up to the point where all hydrogen bonds within the PCR fragment are broken and the amount of double-stranded PCR product drastically decreases; consequently, a sharp decrease in the detected fluorescence level follows (Fig. 3-B). At high temperature, there is no double-stranded PCR product in the sample and the fluorescence level is close to 0 (Fig. 3-C). The temperature at which the sharp drop in fluorescence occurs depends on the number of hydrogen bonds in the analyzed PCR product and hence is specific to the analyzed fragment.

1.5.3 MELT CURVE ANALYSIS

The melt curve is derived from the raw fluorescence data by taking the negative first derivative of fluorescence intensity with respect to temperature (-dF/dT) (Fig. 4). In melt curve analysis, the data resulting from HRM (when a specialized instrument is used) are analyzed further to determine the melting characteristics of several DNA products more precisely. At this stage, the derivative of the fluorescence intensity captured in real time is plotted against temperature, so the melt peaks mark the temperature at which the dsDNA begins to denature into ssDNA, i.e., the melting temperature of the DNA. A threshold is set manually by the clinicians through keen visual inspection and observation with the help of the commercial PCR manufacturers' analysis software.

Figure 4: Negative derivative plot of the Melting curve. Source: QIAGEN's Q-Rex Software

Looking at Fig. 4, the characteristics of the melt curves are observed and studied with respect to various features, such as:
• Peaks
• Shape of the curve
• Height
• Range
• Area under the curve
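A minimal sketch of this melt conversion on synthetic data is shown below (a sigmoidal fluorescence profile is assumed, and numpy.gradient is used as one of several possible numerical derivative schemes; vendor software applies its own smoothing and processing):

import numpy as np

# Synthetic melting profile: fluorescence falls sigmoidally around Tm = 78 ºC
temps = np.arange(60.0, 95.0, 0.1)
tm_true = 78.0
fluorescence = 1.0 / (1.0 + np.exp((temps - tm_true) / 0.8))

# Melt curve: negative first derivative of fluorescence with respect to temperature
melt = -np.gradient(fluorescence, temps)

# The melt peak recovers the melting temperature
tm_est = temps[np.argmax(melt)]
print(f"Estimated Tm: {tm_est:.1f} ºC")  # ~78.0 ºC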
1.5.4 AMPLIFICATION CURVE ANALYSIS

Amplification curves, also known as 'growth curves', display cycle number versus fluorescence (Fig. 5). These real-time PCR data are used to detect the presence of the PCR product (i.e., target DNA), and a threshold (Ct) line is set between the exponential and linear phases to identify which PCR product was amplified earlier in the PCR reaction cycles [7], [20, p. 212]. The expression levels of genes can be measured by either absolute or relative quantification. In absolute quantification, a calibration curve is used to relate the PCR signal to the input copy number, while relative quantification measures the relative change in mRNA expression levels. The accuracy of an absolute real-time rtPCR assay depends on identical amplification efficiencies of both the native target and the calibration curve in the RT reaction and kinetic PCR. Relative quantification is a simpler method than absolute quantification since it does not require a calibration curve. It involves comparing the expression levels of a target gene to those of a reference gene and is sufficient for most investigations into changes in gene expression. The units used for relative quantification are unimportant and can be compared across multiple real-time RT-PCR experiments [14, Sec. 3.2.6].

Figure 5: PCR Amplification Curve. Source: QIAGEN's Q-Rex Software

Yuan et al. [15, Fig. 1-A] depict the three phases of PCR:
• Exponential phase
• Linear phase
• Plateau phase

Figure 6: Theoretical plot of PCR, three phases (adapted from Yuan et al. [15])

The exponential phase is the earliest segment of the PCR, in which the product increases exponentially because the reagents are not limited. The linear phase is characterized by a linear increase in product as the PCR reagents become limited. The PCR eventually reaches the plateau phase during later cycles, where the amount of product no longer changes because some reagents become depleted [15, pp. 1 – 2].
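As an illustration of relative quantification, the widely used 2^-ΔΔCt formulation is sketched below; it assumes near-identical amplification efficiencies of about 100% and is shown as a representative method rather than the exact procedure of [14]:

def fold_change_ddct(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression by the 2^-ΔΔCt method (assumes ~100% efficiency)."""
    d_ct_test = ct_target_test - ct_ref_test   # normalize target to the reference gene
    d_ct_ctrl = ct_target_ctrl - ct_ref_ctrl
    dd_ct = d_ct_test - d_ct_ctrl              # compare the test sample to the control
    return 2.0 ** (-dd_ct)

# Target appearing 2 cycles earlier relative to the reference implies 4-fold expression
print(fold_change_ddct(24.0, 20.0, 26.0, 20.0))  # 4.0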
1.5.5 HIGH-RESOLUTION DNA MELT ANALYSIS

High-Resolution Melting (HRM/HRMA) is a novel DNA analysis technique developed in 2002 through a collaborative effort between the University of Utah, USA, and Idaho Technology Inc., USA [16, p. 219], for analyzing genetic variations such as SNPs (single nucleotide polymorphisms), mutations, and methylations in PCR amplicons. It is a homogeneous, closed-tube, post-PCR technique that, with advanced hardware, allows researchers to study the thermal denaturation of double-stranded DNA in greater detail than traditional melting curve analysis, resulting in a higher information yield [16, 17]. By analyzing the dissociation (melting) behaviour of nucleic acid samples, HRMA can differentiate between samples based on their sequence, length, guanine-cytosine (GC) content, or strand complementarity, and it can even detect single base changes such as SNPs [18]. It is a powerful tool that enables the detection of unknown variations in PCR amplicons, making it a valuable alternative to sequencing, with a range of applications including [19]:
• Mutation discovery
• Screening for loss of heterozygosity
• DNA fingerprinting
• SNP genotyping
• DNA methylation analysis

Figure 7: High-Resolution Melt Curve. Source: QIAGEN's Q-Rex Software

1.5.6 OVERVIEW OF DNA MELT SIGNAL INTERPRETATION

DNA melt signals are an important output of PCR experiments, as they provide information about the characteristics of the amplified DNA. As the temperature increases, the double-stranded DNA begins to denature into single strands, and the DNA-binding dye dissociates from the DNA, causing a decrease in fluorescence. The temperature at which half of the double-stranded DNA is denatured is called the melting temperature (Tm).

Figure 8: Reference DNA melting signal – Meningitis Panel

Each pathogen has a different DNA melting temperature, and the corresponding signals also differ in shape and size. Usually, DNA melting signals are bell-shaped curves whose peaks denote the melting point, or melting temperature, of the DNA. Interpretation is made through visual inspection of the signal's shape, peaks, and size. Fig. 8 shows the DNA melting signals of Haemophilus influenzae (HI), Streptococcus pneumoniae (SP), and Neisseria meningitidis (NM), with trace colours as in Fig. 8.

No. | Pathogen Name | Observed Tm | Expected Tm | Observed Ct | Expected Ct | Threshold
1 | Haemophilus influenzae | 78.35 ºC | 77±1 ºC | 27.25 | 26 | 0.4
2 | Streptococcus pneumoniae | 78.40 ºC | 78±1 ºC | 24.88 | 25 | 0.4
3 | Neisseria meningitidis | 79.27 ºC / 81.13 ºC | 79/81±1 ºC | 19.52 | 2 | 0.4

Table 1: DNA melting temperature standards for the Meningitis Panel of the positive sample [Fig. 8]

A. Melting Temperature

Pathogen Target | Tm
Acinetobacter baumannii | 81 ºC
Bacteroides fragilis | 80 ºC
CONS (coagulase-negative Staphylococci) | 80 ºC
Enterobacter spp. | 84 ºC
Enterococcus faecalis | 82 ºC
Enterococcus spp. | 84 ºC
Group A Streptococcus (GAS) | 82 ºC
Serratia marcescens | 85 ºC
Staphylococcus spp. (Gram-positive) | 77 ºC
Streptococcus agalactiae (GBS) | 77 ºC
Streptococcus pneumoniae (Gram-positive) | 78 ºC

Table 2: Sample DNA melting temperature standards for the Sepsis Panel

There is a set of pre-defined DNA melting temperature standards, recorded by clinicians and microbiologists, for identifying and distinguishing melt signals during the interpretation process. Peaks found in such melt signals (Fig. 8) for a proposed pathogen that satisfy the standard (within ±1 ºC) are treated as 'Positive'; otherwise, they are treated as 'Negative'.

B. Thresholding

Thresholding plays a crucial role in this process; it is set manually during the analysis to eliminate noisy signals and unwanted peaks. It is a simple numerical value on the y-axis (i.e., the derivative of fluorescence over temperature), and only peaks at or above this value are considered (Fig. 8 and Table 1). Apart from the melting temperature and the threshold, several other parameters must also be taken into consideration when interpreting signals:
• The temperature point at which the signal starts rising.
• The temperature point at which the signal falls or saturates.
• The prominence of the signal.
• The area under the curve.

Such attributes of DNA melt signals are treated as features, and relevant feature engineering techniques must be employed to bring the best out of them. The techniques and methodologies are elaborated in the upcoming sections.
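The manual call procedure described above can be summarized schematically as follows; the temperature standards and threshold are taken from Table 1, while the function itself is only an illustration of the rule, not the laboratory's software:

# Expected Tm standards (ºC) for the meningitis panel, taken from Table 1
TM_STANDARDS = {
    "Haemophilus influenzae": [77.0],
    "Streptococcus pneumoniae": [78.0],
    "Neisseria meningitidis": [79.0, 81.0],
}

def call_result(peak_tms, peak_heights, pathogen, threshold=0.4, tol=1.0):
    """'Positive' if any peak at/above the threshold lies within ±tol ºC of a standard Tm."""
    for tm, height in zip(peak_tms, peak_heights):
        if height < threshold:
            continue  # below the y-axis threshold: treated as noise
        if any(abs(tm - std) <= tol for std in TM_STANDARDS[pathogen]):
            return "Positive"
    return "Negative"

# An observed SP peak at 78.40 ºC with height 1.2 falls within 78±1 ºC -> Positive
print(call_result([78.40], [1.2], "Streptococcus pneumoniae"))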
1.6 ROLE OF DATA IN REAL-TIME PCR

Post-PCR methods such as Melt Curve Analysis, Amplification Curve Analysis and HRMA are carried out on rtPCR machines, following the PCR experiment run on thermal-cycler machines. Collectively, most post-PCR data comprise DNA melting curves, amplification curves, and HRMA data [Secs. 1.5.3 – 1.5.5].
• Amplification curves: PCR generates amplification curves that show the increase in fluorescence signal over time, indicating the amount of amplified DNA. These data can be used to determine the starting amount of target DNA and to assess the efficiency and sensitivity of the PCR reaction.
• Melting curves: PCR melting curves show the dissociation of double-stranded DNA into single strands as the temperature is increased. These curves can be used to determine the melting temperature (Tm) of the amplified DNA, which can help to identify specific DNA sequences and to detect mutations.
• HRMA data: HRMA provides information on the melting behaviour of PCR products, which can be used to identify and distinguish different PCR amplicons based on their melting temperature (Tm). HRMA data can be used to detect mutations, SNPs, and other sequence variations in DNA samples. It can also be used to evaluate PCR performance, including specificity and sensitivity, and to optimize PCR conditions.

As discussed by Vaerman et al., the interpretation of rtPCR result data includes various numerical data (beyond melting curves, amplification curves and HRMA) that permit the assessment of analytical parameters such as linearity, accuracy, precision, specificity, and so on [20], to determine the rtPCR instrument's efficiency, specificity and other results. Among the post-PCR data, the HRMA data (i.e., melting curves) are the most promising for analysis and interpretation of the specificity and identity of the amplified product (i.e., target DNA) by employing pioneering ML techniques. Overall, HRMA data play an important role in real-time PCR by providing valuable information on PCR products and helping to improve the accuracy and reliability of PCR-based assays.

1.7 PREDICTIVE ANALYSIS IN DIAGNOSIS

Diagnosis is itself an analysis of "What happened?", "Why did it happen?" and "Where did it happen?". Many advancements are being introduced in healthcare day by day, and some are already well established. Accordingly, predictive analysis of diagnostic data is not a new approach; many provisions and much support have already been introduced to aid researchers and organizations in boosting their routine work. Some examples are:
• Predicting heart disease using electronic data, medical data, and patient information.
• Predicting various health complexities using medical images like X-rays, CT scans, and MRIs.
• Predicting cancer with data on tumours (benign or malignant).

In the field of molecular diagnosis, predicting targets may involve considering several factors. Melt curve analysis is one of the steps in diagnosing with PCR, and it gives information mainly on the melting nature of a dsDNA. Clinicians/microbiologists study the melt curves and observe the distinct variations in the graphs, thus determining the presence of a specific target DNA. It is important to note that melting analysis alone would not suffice to confirm the presence of target DNA (a pathogen) in a patient sample, and additional analyses may be required, such as:
• Sequencing
• Phylogenetic analysis
• Multiplex PCR
• Specific primer/probe design
• Culture and isolation
• Serology

As a result, this project covers predictive analysis of PCR diagnosis data, specifically HRM data, which yields melt curves and peaks. With relevant feature extraction and feature engineering, predictions are ultimately made to classify pathogen classes using various predictive analysis techniques.

1.7.1 MACHINE LEARNING IN DIAGNOSIS

Machine Learning is a strong choice for predictive modelling; it is a robust technology that has been applied globally in applications ranging from classifying spam mail to predicting diseases. Owing to its high compatibility and rich algorithmic resources, many real-world problems can be solved using Machine Learning, and diagnosis is no exception.

Machine Learning over Statistical Modelling

Both tend to use similar predictive approaches like regression, classification, and clustering, but they differ in many ways. Statistical modelling is a subset of mathematical modelling that relies heavily on assumptions, relationships between random and non-random variables, and the estimation of population parameters from sample data. Choosing statistical modelling as a predictive analysis technique requires a deeper understanding of the variables involved in the data and the relationships between them. Once these perceptions are satisfied, a sensible assumption has to be made to explain the relationships between variables and the resulting prediction. This is why statistical modelling is highly preferred when proper interpretation and explanation are demanded. Machine Learning, on the other hand, is another predictive analysis technique, a branch of computer science and artificial intelligence, that mainly works on the principle of pattern analysis and often introduces challenges in interpreting the learned patterns.
Compared to statistical modelling, ML models can work with large data sets and avoid making strong assumptions about the given data, as they learn patterns through a weight-based approach. As a result, predictions by these models are powerful and more accurate. In the context of performing predictive analysis on HRM data, the data are typically biological and complicated, with high inherent variability, because every experiment is performed and influenced under clinical conditions. The Machine Learning approach is therefore preferable over statistical modelling, owing to its pattern learning process, which gives highly accurate results, whereas statistical modelling is best when the characteristics of the data do not vary, so that a hypothesis can be formulated. Concurrently, several statistical techniques were also used to find insights during the development of this project. Both techniques have their advantages and limitations [21, 22].

1.8 OVERALL RESEARCH AIM AND OBJECTIVES

The overall aim of the project is to create an AI-based framework for analyzing and interpreting HRM data without any human assistance. The scope of the project extends from the necessary data extraction/acquisition to the final report presentation. The objectives can be enumerated as:
• To set up data acquisition pipelines for extracting and acquiring all the data necessary for analysis and interpretation.
• To develop data pre-processing modules for cleaning and transforming the acquired data.
• To conduct research on the domain and the given problem to formulate the best and most suitable solution.
• To review previous research and the solutions made for problems in the same regard.
• To formulate an effective approach for applying Machine Learning algorithms to the problem.
• To evaluate and validate results with domain experts and look for further improvement.
• To set up modules and components for effective data storage and access.
• To develop supporting software components to aid the workflow.
• To integrate all the components into a single apex system.

CHAPTER 2 EXISTING SYSTEM

High-Resolution Melting Analysis is performed with the help of specific software and existing hardware components; in practice, HRM cannot be done without them, and performing it manually is more complex and prone to error. The term "High-Resolution" itself denotes the technology of capturing fluorescence at high resolution, which demands sophisticated, well-engineered technical components. There are already many commercially available instruments and plugin tools used for running PCR tests, and most of them are engineered with cutting-edge technology.

2.1 DATA ANALYSIS SOFTWARE

2.1.1 QIAGEN's ROTOR-GENE

The Rotor-Gene Q series are commercial thermal cycler instruments used in many laboratories for running PCR tests and analyzing the results using their respective versions of plugin software [24]. Rotor-Gene offers several individual components for analysis, such as:
• Melt Analysis
• HRM Analysis
• ScreenClust Analysis

These components come under various names and versions; among them, QIAGEN's 'Rotor-Gene Q-Rex' is the default software tool that comes with every Rotor-Gene instrument for analyzing the run files of completed PCR tests. Q-Rex offers both Melt and HRM analysis in a single package.
2.1.1 (a) Rotor-Gene Q-Rex Software

Melt Analysis

The "Melt Curve Analysis" function of the Rotor-Gene Q software can be used for checking the specificity of a reaction, genotyping, and measuring protein stability with differential scanning fluorimetry. The function analyses the first derivative (dF/dT) of the raw melting data and identifies peaks in the selected temperature and fluorescence range. The data can be used for genotyping by defining "Peak Bins" based on the peak characteristics of known genotypes [24].

Figure 9, A and B: Rotor-Gene Q-Rex Interface. Source: QIAGEN's Q-Rex Software

HRM Analysis

High-resolution melting (HRM) analysis using the Rotor-Gene Q software involves selecting a data set and defining normalization regions to compensate for variations between samples. Genotype names are assigned, and control samples are selected. Results are displayed in the "HRM Results" table with automatic identification results for each genotype. Confidence levels are assigned to each sample, and a threshold value for the "Confidence Percentage" can be defined. The "HRM Normalized Graph" plot displays the different curves relative to a selected genotype in a "Difference Graph" to emphasize differences between samples. Once the process is complete, tables and graphs can be exported [24].

Figure 10: Rotor-Gene Q-Rex Interface. Source: QIAGEN's Q-Rex Software

2.1.1 (b) Rotor-Gene ScreenClust HRM Software

Rotor-Gene ScreenClust HRM Software is a powerful tool for the analysis of high-resolution melting (HRM) data from the Rotor-Gene Q or Rotor-Gene 6000 cyclers. By grouping samples into clusters based on their dissociation (melting) curve characteristics, Rotor-Gene ScreenClust HRM Software enables applications such as genotyping and mutation scanning. The number of clusters can either be defined by the user, if they have known controls for each genotype (supervised mode), or the software can aid the user in determining the number of clusters in a sample set (unsupervised mode). Rotor-Gene ScreenClust HRM Software provides:
• An innovative mathematical approach to HRM analysis.
• Highly accurate identification of genotypes in supervised mode.
• Automatic detection of new mutations in unsupervised mode.
• Robust statistics for classifying and interpreting HRM data.
• Minimal effort and standardized processes for data interpretation.

HRM analysis on a Rotor-Gene cycler produces raw data that can be further analyzed using Rotor-Gene ScreenClust HRM Software, which analyses HRM data in 4 steps:
1. Normalization
2. Generation of a residual plot
3. Principal component analysis
4. Clustering

HRM curves can have different starting points, and therefore the scale of each melt is different [25, Fig. 1-A]. Rotor-Gene ScreenClust HRM Software only compares samples that are on the same scale, which is achieved by normalization. Raw data are normalized by applying curve scaling to a line of best fit so that the highest fluorescence value is equal to 100 and the lowest is equal to zero [25, Fig. 1-B]. Next, the curves are differentiated, and a composite median curve is constructed using the median fluorescence of all samples. The melt traces for each sample are subtracted from this composite median curve to draw a residual plot [25, Fig. 1-C]. Subsequently, the individual sample characteristics are extracted from the residual plot by principal component analysis, a well-established method of data analysis. Rotor-Gene ScreenClust HRM Software is, however, the first software application to apply principal component analysis to HRM data. The principal component analysis highlights similarities and differences in the data and is used to create a cluster plot [25, Fig. 1-D].

Figure 11: Data analysis performed in ScreenClust HRM Software (adapted from the ScreenClust Software user manual [25, Fig. 1])

Rotor-Gene ScreenClust HRM Software performs clustering (grouping) of data according to allele in either supervised or unsupervised mode. Supervised mode is often used for SNP genotyping, where the genotypes are known. In supervised mode, the user assigns one or more control samples for each cluster, and the software classifies (auto-calls) all unknown samples to clusters according to their characteristics. Unsupervised mode is used when there is no, or only partial, prior knowledge of the genotypes present in the samples. In unsupervised mode, the software calculates the optimal number of clusters by itself. This feature is an excellent tool for the discovery of new polymorphisms. In addition to the easy-to-interpret cluster plot, Rotor-Gene ScreenClust HRM Software provides statistical probabilities and typicalities in a results table to allow easy comparison of results from different experiments.
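The same four-step idea can be sketched with standard open-source tooling. The following rough approximation uses numpy and scikit-learn and is not QIAGEN's proprietary algorithm; in particular, the normalization here is a simple min-max scaling rather than the line-of-best-fit scaling described above:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def screenclust_like(curves, n_clusters=3):
    """curves: array of shape (n_samples, n_points), raw HRM fluorescence
    on a common temperature grid. Returns PCA scores and cluster labels."""
    # 1. Normalization: scale each curve so its minimum is 0 and maximum is 100
    lo = curves.min(axis=1, keepdims=True)
    hi = curves.max(axis=1, keepdims=True)
    norm = 100.0 * (curves - lo) / (hi - lo)
    # 2. Residual plot: differentiate, then subtract the composite median curve
    deriv = np.gradient(norm, axis=1)
    residuals = deriv - np.median(deriv, axis=0)
    # 3. Principal component analysis on the residuals
    scores = PCA(n_components=2).fit_transform(residuals)
    # 4. Clustering (unsupervised mode; here the cluster count is supplied)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(scores)
    return scores, labels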
2.1.2 BIO-RAD'S CFX SERIES

2.1.2 (a) CFX Manager

The software plots the relative fluorescence unit (RFU) data collected during a melt curve as a function of temperature. To analyze melt peak data, the software assigns a beginning and an ending temperature to each peak; the user adjusts these by moving the threshold bar. The floor of the peak area is specified by the position of the melt threshold bar. A valid peak must have a minimum height relative to the distance between the threshold bar and the height of the highest peak.
• Melt Curve: Viewing the real-time data for each fluorophore as RFUs per temperature for each well.
• Melt Peak: Viewing the negative regression of the RFU data per temperature for each well.
• Well selector: Selecting wells to show or hide the data.
• Peak spreadsheet: Viewing the data collected in the selected well as a spreadsheet.

Figure 12: Interface of BIO-RAD CFX Manager

The Melt Curve Data window shows the data from the melt curve in multiple spreadsheets, including all the melt peaks for each trace. Select one of these four options to show the melt curve data in different spreadsheets:
• Melt Peaks: Listing all the data, including all the melt peaks, for each trace
• Plate: Listing a view of the data and contents of each well in the plate
• RFU: Listing the RFU quantities at each temperature for each well
• -d(RFU)/dT: Listing the negative rate of change in RFU as the temperature (T) changes. This is the first regression plot for each well in the plate

Figure 13: Melt Peak Spreadsheet – CFX Manager
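The validity rule above can be paraphrased in a few lines. This is an illustrative reading of the manual's wording (the minimum fraction is an assumed parameter), not Bio-Rad's actual implementation:

def valid_melt_peaks(peak_heights, threshold, min_fraction=0.1):
    """Keep peaks whose height above the threshold bar is at least min_fraction
    of the tallest peak's height above that bar (min_fraction is assumed)."""
    tallest = max(peak_heights) - threshold
    return [h for h in peak_heights if (h - threshold) >= min_fraction * tallest]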
2.1.3 BIO MOLECULAR SYSTEMS – MIC

2.1.3 (a) micPCR Software

The micPCR software offers a Melt Analysis option that enables the determination of the peak dissociation temperature (Tm) of a sample from the melt data. This feature is useful in detecting non-specific amplicons like primer dimers, thereby serving as a measure of analytical specificity for an assay. Melt Analysis can also be applied for genotyping using chemistries such as dual hybridization probes. The software displays a graph of the first derivative curve plotted as dF/dT (y-axis) against temperature (ºC, x-axis) for the first target selected in the Assays list. Users can set the melt curve threshold to any value and adjust the other melting parameters available for genotyping.

Figure 14: Interface of melt curve graph in Bio Molecular Systems micPCR software

2.1.4 THERMO FISHER – QUANTSTUDIO

2.1.4 (a) QuantStudio Design & Analysis Software

The Melt Curve experiment is used in Thermo Fisher PCR reactions with SYBR Green dye to determine the melting temperature (Tm) of the amplification products. Tm is the temperature at which 50% of the DNA is double-stranded and 50% is dissociated into single-stranded DNA. Melt Curve analysis is included in the default run method for any experiment type that uses SYBR Green reagents. Multiple peaks in a melt curve indicate additional amplification products, often due to non-specific amplification or primer-dimer formation.

Figure 15: Interface of Melt curve graph in ThermoFisher QuantStudio software

2.1.5 ROCHE – LIGHTCYCLER SERIES

2.1.5 (a) LightCycler Software

The LightCycler uses fluorescence measurements to perform melting temperature analysis, which determines the melting temperature (Tm) of each sample. The analysis produces a Melting Curves chart that shows the downward curve in fluorescence as samples melt, and a Melting Peaks chart that plots the negative first derivative of the sample fluorescence curves to display the melting temperature of each sample as a peak. This allows for easier comparison between samples.

Figure 16: Interface of Roche's LightCycler

2.2 DEMOCRATIZED SOFTWARE

2.2.1 DNA-UTAH

2.2.1 (a) uANALYZE

uAnalyze is a web-based tool that analyses high-resolution melting data of PCR products. It uses recursive nearest-neighbour thermodynamic calculations to predict a melting curve. The tool accepts unprocessed melting data from LightScanner-96, LS32, or HR-1 data files, or via a generic format for other instruments. A fluorescence discriminator identifies low-intensity samples, and the background is removed either as an exponential or by linear baseline extrapolation. The precision and accuracy of experimental melting curves are quantified, and a temperature overlay is provided to focus on the curve shape.

Figure 17: uANALYZE Interface

2.2.1 (b) uMELT

uMelt is a web-based tool used for predicting the DNA melting curves and denaturation profiles of PCR products. The user inputs an amplicon sequence and defines thermodynamic and experimental parameters including nearest-neighbour stacking energies, loop entropy effects, and cation concentrations. Using an accelerated partition function algorithm, uMelt calculates and visualizes the mean helicity and dissociation probability at each sequence position within a temperature range. The predicted curves and profiles display stability and loss of helicity with increasing temperature. Results from fluorescent high-resolution melting experiments match the predicted melting domains and their relative temperatures, but the absolute melting temperatures may vary. uMelt provides a convenient platform for the simulation and design of high-resolution melting assays.

Figure 18: uMELT Interface

CHAPTER 3 LITERATURE REVIEW

Untergasser et al. (2021) showed that analyses of amplification and melting curves provide valuable information on the quality of individual reactions in quantitative PCR (qPCR) experiments and result in more reliable and reproducible quantitative results.
Their new web-based LinRegPCR application provides visualization and analysis of a single qPCR run, displaying the results of the amplification curve and melting curve analysis in tables and graphs. It also provides a stand-alone back-end RDML (Real-time PCR Data Markup Language) Python library and several companion applications for data visualization, analysis, and interactive access. The use of the RDML data standard enables machine-independent storage and exchange of qPCR data, and the RDML tools assist with importing data from the files exported by the qPCR instrument. Moniri et al. (2020) demonstrated that the large volume of raw data obtained from real-time PCR instruments can be exploited to perform data-driven multiplexing in a single channel using machine learning methods. This approach, referred to as Amplification Curve Analysis (ACA), was used to multiplex 3 carbapenem-resistance genes in the presence of single targets, resulting in an accuracy of 99.1% (N = 16188). To support the analysis, a formula was derived to estimate the occurrence of co-amplification in PCR based on multivariate Poisson statistics. Combining this method with probe-based assays will increase multiplexing capabilities. Wisittipanit et al. (2020) used a modified high-resolution DNA melting curve analysis (m-HRMa) to classify Salmonella spp. into clusters and a machine learning (dynamic time warping, DTW) algorithm to create a phylogeny tree of Salmonella strains (n = 40) collected from homes, farms, and slaughterhouses in northern Thailand. DTW and ms-HRMa clustering analyses were able to generate molecular signatures of the Salmonella isolates, resulting in 25 ms-HRM and 28 DTW clusters compared to 14 clusters from a standard HRM analysis. The new Salmonella sub-typing protocol identified five S. Weltevreden subtypes, with subtype DTW4-M1 being predominant. This suggests that transmission of salmonellosis in northern Thailand is likely to be farm-to-farm through contaminated chicken stool. Athamanolap et al. (2014) developed an automated HRM curve classification based on machine learning methods; its learned tolerance for deviations in reaction conditions enables reliable, scalable, and automated HRM genotyping analysis with broad potential clinical and epidemiological applications. Roediger et al. (2013) implemented the MBmca package in R for DNA melting curve analysis on microbead surfaces, in particular using the second-derivative melting peaks as an additional parameter to characterize the melting behaviour of DNA duplexes. Dwight et al. (2011) created a web-based tool called uMelt for predicting DNA melting curves and denaturation profiles of PCR products. It uses an accelerated partition function algorithm to calculate and visualize the mean helicity and dissociation probability at each sequence position at different temperatures. Results from fluorescent melting experiments match the number of predicted domains and their relative temperatures, but current libraries do not account for the rapid melting rates and helix-stabilizing dyes used in experiments. Smith et al. (2009) describe methylation of DNA as a common mechanism for silencing genes that is increasingly being implicated in many diseases. They describe and validate a rapid, in-tube method to quantitate DNA methylation using the melt data obtained following the amplification of bisulphite-modified DNA in a real-time thermocycler.
The parameters derived provide an objective description and quantitation of the methylation in a specimen and can be used for statistical comparisons of methylation between specimens. CHAPTER 4 PROPOSED METHODOLOGIES Developing an AI-based framework for interpreting and reporting PCR tests requires research and primary attention towards understanding key result-analysis techniques such as melt curve analysis and cycle threshold analysis. Conducting proper research is therefore crucial for formulating any robust solution to the given problem. 4.1 KEY ASPECTS The research and methodologies in this project have three main key aspects, namely Data Acquisition, Pre-processing, and Feature Engineering. The approaches were framed around the following questions concerning these aspects: • How can data acquisition be performed? • What type of data is required? • How should pre-processing be done? • Is pre-processing necessary? • How can feature engineering be done effectively? The research and methodology of this project predominantly focus on the key aspects stated above, and satisfying all of them constitutes the larger contribution. Most of the work in this project concerns how the data is collected and how it is leveraged so that the expected outcomes are achieved. Following that, several methodologies were found and applied in various scenarios, giving both deterministic and non-deterministic results. Since the project aims at introducing a machine learning-based approach for interpreting the results, proper attention has to be paid to data acquisition and feature engineering. 4.2 PROPOSED METHODOLOGIES The project comprises various methodologies, proposed and implemented in various sections, each of which provides certain observable results; these results are notable, and a few of them call for future implementation. • Approach on images of DNA melting signal graphs. • Approach on co-ordinates of the raw fluorescence signal. • Approach on co-ordinates of DNA melting signal graphs. • Combination of the approach on images and the co-ordinates of the DNA melting signal. The above methodologies are proposed and implemented in this project as part of the feature engineering process. Not all of the methodologies stated above are fully applicable and successful, as some of them are preliminary experiments and trials. Each methodology relied on data, and the respective data acquisition methods are also covered in the following sections. CHAPTER 5 APPROACH ON IMAGES OF DNA MELT SIGNALS As discussed earlier, melt curves are interpreted by visual inspection of the graphs produced by the rtPCR assay. Experts examine every feature in the graph and draft their report/interpretation accordingly. At the onset of this project, the type of data most suitable for developing the AI framework was unknown. Hence, the available images of the melt curve graphs for the different patient samples were initially used as the source data for developing the algorithm. Figure 19: DNA Melt data report generated by thermal cycler machine Reports were the only initial source of data in which information on melt data was available. These reports are created by commercially available thermal cycler machines after the interpretation has been performed by clinicians/microbiologists. Since the data is available in the form of images, image processing techniques were applied as an initial step to explore the available image source.
5.1 IMAGE PROCESSING Image processing is a standard technique, widely used in applications like computer vision, video processing, and remote sensing, to extract and restore valuable information in an image by the use of algorithms. Image processing has various applications, such as image sharpening, image restoration, image reconstruction, image reformatting, and image generation, for enhancing an existing image resource. In this context, as the data comes from physical paper reports, it is likely to be of poor quality and to contain noise; image processing is a promising way to suppress such constraints. Many libraries and packages are readily available for image-processing tasks, and among these, OpenCV is a familiar image-processing library, predominantly used in many image-based tasks and projects. Accordingly, OpenCV has been used in this methodology with Python as the implementation language. OpenCV provides many image processing methods to manipulate images in the desired manner. To separate or distinguish the melt signals from the given plot, image masking can be performed. Image masking is a simple technique where only the desired pixels are taken into consideration, ignoring the background or unwanted pixels by creating appropriate masks. Figure 20: Colour mask to track yellow colour in the input image Figure 21: Melt signal images scanned and cropped from PCR reports After collecting several report files, images of melt signals were scanned and segmented individually. The signal plots in the reports are only partially clear, and some of the signals are pale in colour. Most of the images have a white background with the RGB value (255, 255, 255). Masking has to be done in such a way as to remove those white pixels from the images, so that only the coloured pixels remain for consideration. Figure 22: Image masking performed on melt signal images Performing image masking on the DNA melt images removed all the white background pixels, keeping only the pixels of the melt signals. However, a few sections of the picture, such as the x-axis and y-axis legends, labels, and threshold line (fig. 21), were not removed. Hence, additional techniques are required to remove such unwanted information. Figure 23: Cropping the images to retain only melt signals Cropping the images helped retain the target area of the input images; as a result, all the unwanted sections like labels and the x- and y-axes were removed. However, the melt signals in the images are still noisy, owing to the poor pixel quality of the images and the inaccuracy of the edges. Additionally, the images in the reports were compressed, which is far inferior to formats like SVG and PNG. Moreover, information on the numerical values (say, Tm) is missing. Unlike problems such as object detection, which aim to identify a desired object at any position, size, or colour irrespective of numerical information (say, the exact size of the object in inches or centimetres), this task requires numerical attributes that image processing alone cannot recover. RESULT AND DISCUSSION The experiment on the images of melt signals introduced image processing methods with the aim of extracting important information from the signals. The experiment focused on creating a mask for the melt signals so that the background is separated.
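A minimal sketch of the masking step is given below, assuming a scanned report crop saved as report.png; the file name and the near-white cutoff are illustrative assumptions, not values from the project.

import cv2

img = cv2.imread("report.png")  # hypothetical scanned report crop
# build a mask of near-white background pixels; (200, 200, 200) is an illustrative cutoff
white = cv2.inRange(img, (200, 200, 200), (255, 255, 255))
signal_mask = cv2.bitwise_not(white)               # everything that is not background
signal_only = cv2.bitwise_and(img, img, mask=signal_mask)  # background pixels become black
cv2.imwrite("signal_only.png", signal_only)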
Despite this, the process did not satisfy several requirements, such as recovering the numerical attributes of the signal and sound pixel quality. The image-based approach could capture and partially enhance the melt signals from the image, but it could not assess the other important features required for analysis. Even if high-quality images were acquired, extracting or assessing the numerical attributes would remain challenging and would require methods like plot digitizing. Explorations with respect to the above constraints have to be made, and a proper substitute or applicable alternative method needs to be introduced. Instead of acquiring physical reports, the software components of the thermal cycler machines should be explored, so that new data sources or alternatives can be found. CONCLUSION The approach on images of DNA melt signals has provided critical results that demand further implementation and exploration. The prevailing approach is weak in areas like feature extraction and data acquisition: the source of data is not intact, and with such a source, formulating a robust foundation for the problem is difficult. CHAPTER 6 APPROACH ON CO-ORDINATES OF RAW FLUORESCENCE SIGNAL In the previous approach, physical paper reports were the initial source of data; image processing techniques were applied and the respective results were observed. It was found that there was no reliable data source for extracting information on melt signals, and image-based techniques are only applicable with several limitations. The approach is not feasible for the following reasons: • Unreliable data • Time consuming (involves much pre-processing) • Hard to differentiate signals from one another • Involves plot digitizing • Could not cover numerical features Concerning all these drawbacks, further exploration has been performed in this successive approach to address the above constraints. The science behind the DNA melting signal is a light reaction called fluorescence. Fluorescence is a process belonging to the ubiquitous luminescence family, in which sensitive molecules produce light from electronically excited states created by a physical (e.g., light absorption), mechanical (friction), or chemical mechanism. The DNA melt signals are in fact this fluorescence emission, emitted by fluorescent dyes during the PCR run. Such emissions are captured and recorded by thermal cycler machines in real time. The captured fluorescence signal is then plotted against the temperature, producing the signals. Like all other signals (say, sine waves or spike waves), melting signals also have (x, y) co-ordinates that can be plotted on a cartesian plane. On exploring the thermal cycler machines and their corresponding software plugins, it was discovered that such co-ordinates can be extracted and exported into various file formats. As a result, any melting signal can be plotted manually with any data visualization tool, without depending on the software plugin of the thermal cycler machines.
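For instance, once the co-ordinates have been exported (say, to a CSV file with one temperature column followed by one fluorescence column per sample), the signals can be re-plotted in a few lines of Python; the file name and column layout here are illustrative assumptions.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("raw_fluorescence.csv")  # hypothetical export from the PCR software
T = df.iloc[:, 0]                         # shared temperature column
for col in df.columns[1:]:                # one fluorescence column per sample
    plt.plot(T, df[col], label=col)
plt.xlabel("Temperature (°C)")
plt.ylabel("Fluorescence intensity")
plt.legend()
plt.show()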
X (°C)   1: 0400437684   5: 0400437685   9: 1000675295   13: 1000675298
70.0     56.25           77.52           48.9            78.69
70.2     55.43           76.74           48.2            77.87
70.4     54.61           75.95           47.5            77.06
70.6     53.84           75.30           46.9            76.38
70.8     53.07           74.69           46.1            75.71
71.0     52.31           74.07           45.4            75.14
71.2     51.47           73.45           44.6            74.48
71.6     49.75           72.19           43.0            73.15
72.2     46.57           69.88           40.0            71.03
Table 3: Sample data – co-ordinates of raw fluorescence (one shared temperature column X and one fluorescence column Y per sample) Table 3 shows the tabular representation of the fluorescence intensity captured in real time. The data specifies the 'X' and 'Y' co-ordinates of the raw fluorescence signals, where 'X' denotes the temperature and 'Y' denotes the fluorescence intensity. The 'Text' column in the export holds the name/reference ID of the sample loaded in each well. It is also clearly observable that the 'X' column is common to all the 'Y' (fluorescence intensity) columns; it is the temperature value, ranging from 70 °C to 90 °C. Figure 24: Raw fluorescence signal plotted using matplotlib The shape of the raw fluorescence signal is different from that of the DNA melting signal. In fact, the DNA melting signals are produced by applying a mathematical function to the raw fluorescence signals: the first negative derivative of the raw fluorescence intensity,

-(change in y)/(change in x) = -Δy/Δx = -dy/dx (1)

The raw fluorescence signal has several properties and also provides certain information on the melting characteristics of a DNA sample. Following the method described by Smith et al., the team extended the mathematical method to find the start and end of the melting temperature from a fluorescence signal. The raw fluorescence signal is inversely related to the temperature: as the temperature increases, the fluorescence decreases. This decreasing behaviour has some properties and can be distinguished into three different phases. Figure 25: Properties of raw fluorescence signal In fig. 25, a standard fluorescence signal has been divided into three different phases, each illustrating a property of the raw fluorescence signal. The first property of a fluorescence signal is that it follows linearity in the beginning: every raw fluorescence signal decreases linearly against the increasing temperature over an initial interval. Fluorescence-based PCR tests usually use fluorescent dyes that bind to double-stranded DNA. These dyes fluoresce strongly while they are bound to double-stranded DNA. When the DNA sample is heated with a gradual increase in temperature, the double-stranded DNA separates into single strands, the dye is released, and the fluorescence falls. The state at which half of the DNA has separated into single strands is often called the melting state.
Until this melting begins, there is a linear decrease in fluorescence intensity; this cannot be avoided, as heating proceeds continuously, so some release of fluorescence is always observed. This is termed Phase 1. Phase 2 describes the melting stage, where the release of fluorescence is aggressive as the DNA sample is converted into single-stranded DNA; the rate of release is high, causing a sudden, massive decrease in the fluorescence intensity. Phase 3 is the final stage, where the signal saturates and no further fluorescence change occurs, as the dye is exhausted. With the above information, and together with the methods described by Smith et al., the raw fluorescence signals can be processed in a more meaningful manner to find the start and end of the melting temperature from a fluorescence signal. Several mathematical algorithms are applied in this approach to extract the features stated earlier, i.e., the temperature at which melting starts and the temperature at which melting ends (saturates). Mathematical and statistical techniques such as linear regression and line fitting are also introduced and applied in this methodology. The line fitting method is applied to the raw fluorescence signal as follows. ALGORITHM 1: ALGORITHM TO APPLY THE LINE FITTING METHOD TO THE RAW FLUORESCENCE SIGNAL USING LINEAR REGRESSION Input: Data frame and the index of the required column Output: Straight line extrapolated and fitted on the features of the data 1 Import necessary libraries: sklearn, numpy, matplotlib, pandas 2 Initialize objects: create an empty list object 'List1'. 3 FUNCTION predict (data frame, index of column) 4 X ← create an array of shape (10,1), taking the first 10 elements of the temperature column from the data frame. 5 Y ← create an array of shape (10,1), taking the first 10 elements of the given column index from the data frame. 6 Fit a linear equation on the arrays X, Y using sklearn.linear_model.LinearRegression() 7 Prediction ← extrapolate over the temperature column in the data frame using the fitted linear equation. 8 RETURN Prediction 9 END FUNCTION 10 FOR j = 1 TO len(data frame.columns) DO 11 List1 ← call the predict function for each column and append the result to the list 12 END FOR 13 FOR k = 1 TO len(data frame.columns) DO 14 Plot the predictions against the actual values for each column using matplotlib.pyplot 15 Create a new plot for each column 16 Plot the actual values for the given column 17 Plot the predicted values for the given column 18 Show the plot 19 END FOR The above algorithm extrapolates a straight line over the temperature values in the data set using linear regression. The algorithm captures the initial linear phase of the signal by choosing the first n points (here n = 10, an arbitrary small choice). Figure 26: First 10 linear points of the fluorescence signal Figure 27: Fitting a straight line along the fluorescence signal's linear phase The fitted straight line represents the assumption that the fluorescence intensity decreases linearly throughout the temperature increase; it is denoted "F(T)max". F(T)max is a function of temperature, and it is what 'FUNCTION predict' in the algorithm above computes.
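A minimal runnable sketch of Algorithm 1 follows, assuming the co-ordinate data has been loaded into the DataFrame df from the earlier plotting sketch (temperature in the first column, one signal per remaining column; all names are hypothetical).

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

def predict(df, col):
    # Fit a line to the first 10 (linear-phase) points of one signal and
    # extrapolate it over the full temperature range, giving F(T)max.
    X = df.iloc[:10, 0].to_numpy().reshape(-1, 1)
    Y = df.iloc[:10, col].to_numpy().reshape(-1, 1)
    line = LinearRegression().fit(X, Y)
    return line.predict(df.iloc[:, 0].to_numpy().reshape(-1, 1)).ravel()

fits = [predict(df, j) for j in range(1, len(df.columns))]  # one fit per signal

for j in range(1, len(df.columns)):
    plt.figure()
    plt.plot(df.iloc[:, 0], df.iloc[:, j], label="observed F(T)obs")
    plt.plot(df.iloc[:, 0], fits[j - 1], "--", label="extrapolated F(T)max")
    plt.legend()
    plt.show()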
The fluorescence signals are denoted F(T)obs, which is also treated as a function of temperature. Strictly speaking this notation is loose, since the fluorescence signal is not computed by a mathematical function; it depends on various parameters and is observed in real time. For this discussion, however, F(T)obs simply denotes the observed fluorescence values. To compute the melting phase of the raw fluorescence signal, the following formula can be applied:

F(T)melt = F(T)max – F(T)obs (2)

Again, F(T)melt is considered a function of temperature, expressed here as the difference between the assumed and observed values of the fluorescence signal. Fitting an extrapolation line helps to differentiate the fluorescence release caused by DNA melting from the release caused by heating alone. On observing the values of fluorescence intensity, it is clear that the values range from 0 to 100; they can be further processed with pre-processing techniques like normalization, achieved using appropriate mathematical functions. ALGORITHM 2: ALGORITHM TO NORMALIZE THE RAW FLUORESCENCE SIGNAL Input: Data frame and List1 from Algorithm 1 (extrapolated values) Output: Normalized values of fluorescence signals 1 Import necessary libraries: numpy, pandas 2 Initialize objects: create empty list objects 'List2' and 'List3'. 3 FOR i = 1 TO len(data frame.columns) DO 4 FOR x, y IN zip(List1[i-1], data frame.iloc[:,i]) DO 5 List2 ← perform element-wise division (y/x) and append to the list 6 END FOR 7 List3 ← append the lists after computing the element-wise division for all the columns. 8 END FOR The algorithm takes the data frame (fluorescence signal co-ordinates) and the extrapolated values computed in Algorithm 1 as inputs to perform normalization. The mathematical notion implemented in the algorithm is the ratio of the observed values to the extrapolated values:

Normalization = F(T)obs / F(T)max (3)

Figure 28: Fluorescence signals after performing normalization Fig. 28 shows the normalized raw fluorescence signals, but the results are not as expected. On close observation of the plots, some noise is introduced at the end, i.e., in the 85 °C to 90 °C temperature range. This is an error; these are not properly normalized fluorescence signals. At a specific stage, the algorithm introduced an outlier value, which had to be backtracked to find the cause. On backtracking, it was observed that the extrapolated line fitted along the linear phase of the fluorescence signal intersects the observed signal at a specific point (fig. 29). Figure 29: Erroneous intersection spotted in the signals The extrapolated values fitted along the linear phase should not intersect the observed values. Normalization turns the fluorescence intensity into the ratio of the observed values to the extrapolated values; at the point of intersection, the observed and extrapolated values are equal, so the element-wise division of Algorithm 2 yields a value at or very near 1. As a result, the intersection produces an outlier, which is reflected in the resulting signal (fig. 28).
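A sketch of the normalization step of Algorithm 2 / equation (3), reusing df and fits from the previous sketch; note how an intersection shows up directly as a ratio spiking towards 1.

import pandas as pd

normalized = pd.DataFrame({df.columns[0]: df.iloc[:, 0]})  # keep the temperature column
for i in range(1, len(df.columns)):
    # element-wise ratio of observed to extrapolated fluorescence, eq. (3)
    normalized[df.columns[i]] = df.iloc[:, i].to_numpy() / fits[i - 1]
# wherever F(T)obs crosses F(T)max the ratio reaches 1, producing the outlier seen in fig. 28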
Moreover, on further analysis, it was observed that the linear phase of the observed fluorescence signal is narrow for fig. 29 A, B, C, and D as compared to fig. 29 E, and extrapolating a line along such a phase will certainly intersect the fluorescence signal. This is likely because these signals are actually missing the linear phase, as the recorded temperature starts from 70 °C. Capturing the linear phase is crucial for this approach, and lacking it may lead to erroneous results. These issues can be fixed by acquiring data from earlier temperature values (say, 50 °C), so that the linear phase of the fluorescence signals is captured. Figure 30: Raw fluorescence signal – Temperature range (30 °C to 90 °C) Once the signals are captured with their linear phase, Algorithm 2 can be applied again so that normalization is performed properly. Figure 31: Normalized fluorescence signals In fig. 31, a windowing method has been applied so that only the required range of temperature values and the corresponding fluorescence intensities are taken; the temperature range of 70 °C to 90 °C was fixed as the window. Raw fluorescence signals are considered mainly to extract the take-off and touch-down temperature values. This information is valuable, as the take-off and touch-down values can be considered characteristic features of the DNA melt signal. All the methods applied in this approach so far were stated and derived by E. Smith et al. [1] as part of fluorescence signal processing (FSP) to extract the desired information for analysis. The methods described by Smith et al. were developed in a different context, with different samples, and following them directly introduced several practical exceptions. The next section extends Smith et al.'s idea of fitting an imaginary/extrapolated line to the fluorescence signals. To capture the take-off and touch-down temperature values, the normalized fluorescence signals are used in the further experiments, where the same line fitting method is performed in a different manner. The existing Algorithm 1 can be reused with some additional steps. In Algorithm 1, extrapolation was performed on the raw fluorescence signals, specifically on the linear phase, by taking the first n observations. The same steps can also be performed in reverse, extrapolating a straight line through the last n elements at the end of the signal (the saturated part). This provides two lines extrapolated at both ends (top and bottom) of the normalized fluorescence signals (fig. 31). Figure 32: Extrapolating imaginary lines on both the ends ALGORITHM 3: ALGORITHM TO RECORD TAKE-OFF AND TOUCH-DOWN TEMPERATURE VALUES IN THE NORMALIZED RAW FLUORESCENCE SIGNAL Input: Normalized fluorescence values computed in Algorithm 2 Output: Take-off and touch-down temperature values.
1 Import necessary libraries: numpy, pandas 2 FUNCTION predict (normalized fluorescence data, index of column) 3 X1 ← create an array of shape (10,1), taking the first 10 temperature values from the normalized fluorescence data 4 X2 ← create an array of shape (10,1), taking the last 10 temperature values from the normalized fluorescence data 5 Y1 ← create an array of shape (10,1), taking the first 10 elements of the given column index from the normalized fluorescence data. 6 Y2 ← create an array of shape (10,1), taking the last 10 elements of the given column index from the normalized fluorescence data. 7 Fit linear equations on the arrays X1, Y1 and X2, Y2 using sklearn.linear_model.LinearRegression() 8 Prediction1 ← extrapolate over the temperature column using the linear equation fitted to X1 and Y1. 9 Prediction2 ← extrapolate over the temperature column using the linear equation fitted to X2 and Y2. 10 RETURN Prediction1, Prediction2 11 END FUNCTION 12 FOR i = 1 TO len(normalized fluorescence data frame.columns) DO 13 Extrapolated_values ← predict(normalized fluorescence data frame, i) 14 FOR x, y IN zip(Extrapolated_values[0], data frame.iloc[:,i]) DO 15 TakeOff ← compare the extrapolated values and observed fluorescence values to detect the point of deviation. 16 END FOR 17 FOR x, y IN zip(Extrapolated_values[1], data frame.iloc[:,i]) DO 18 TouchDown ← compare the extrapolated values and observed fluorescence values to detect the point of convergence. 19 END FOR 20 END FOR The above algorithm defines a function that computes the extrapolated values for both ends of a normalized fluorescence signal and iterates through each column (signal) of the normalized fluorescence data frame, comparing the actual signal with the extrapolated values to detect the point of deviation and the point of convergence (fig. 33). The resulting take-off and touch-down values are taken as the starting and ending temperature points, and these points are cross-validated with the corresponding DNA melting signal. Figure 33: Mapping the take-off & touch-down points of the normalized fluorescence signal onto the DNA melt signals. The results of the algorithm look relevant, as the take-off and touch-down points are cross-validated against the corresponding DNA melting signals. These points can be taken as features for analysis and applied to successive samples, especially to the raw fluorescence signals. In this way, the algorithm forms part of the feature engineering process. For calculating the melting temperature, the same raw fluorescence signal co-ordinates can be processed further to produce the DNA melting signal. As stated earlier, DNA melting signals are produced by applying a differential function to the co-ordinates of the raw fluorescence signals. Equation (1) captures the rate of change of fluorescence intensity for every change in temperature. The negative derivative is taken, respecting the order of observations (the temperature values are in ascending order). As a result, the melting phase of the raw fluorescence is reflected as a peak (fig. 33 B), which is more convenient for interpretation. The algorithm's limitations are, however, also observable: it finds it difficult to capture further additional information.
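Before examining those limitations, the deviation and convergence tests of Algorithm 3 can be sketched in Python as follows. The tolerance tol is an illustrative assumption standing in for the comparison in steps 15 and 18; it is not a value given in the report.

import numpy as np
from sklearn.linear_model import LinearRegression

def take_off_touch_down(T, signal, n=10, tol=0.002):
    # Fit lines through the first and last n points of a normalized signal;
    # return the temperature where the signal first leaves the top line
    # (take-off) and where it first settles onto the bottom line (touch-down).
    top = LinearRegression().fit(T[:n].reshape(-1, 1), signal[:n])
    bot = LinearRegression().fit(T[-n:].reshape(-1, 1), signal[-n:])
    top_line = top.predict(T.reshape(-1, 1))
    bot_line = bot.predict(T.reshape(-1, 1))
    take_off = T[np.argmax(np.abs(signal - top_line) > tol)]     # first deviation
    touch_down = T[np.argmax(np.abs(signal - bot_line) <= tol)]  # first convergence
    return take_off, touch_down

# usage on one column of the `normalized` frame from the previous sketch:
T = normalized.iloc[:, 0].to_numpy()
start_T, end_T = take_off_touch_down(T, normalized.iloc[:, 1].to_numpy())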
For signals that have double peaks and double attributes (depending on the primers), the algorithm could not pick up the additional information in the successive peak, which is prominent for several types of pathogen. Not every pathogen is expected to produce a single-peaked DNA melting signal; different pathogen markers are captured with different, dedicated sets of primers. Such variations in primers can result in different DNA melting signals, and it is important to capture all of this information. Although the algorithm generally works, there are also several scenarios where it fails to provide the expected results. As HRM data is prone to high variability, applying the same algorithm to different samples does not always work and is subject to certain exceptions. Figure 34: DNA melting signals with double peaks and their corresponding raw fluorescence. With the algorithm using the raw fluorescence curve, capturing the melting phase can be achieved in an overall sense, but no further detail is provided on the additional peaks that appear alongside the major peak. This leads to a loss of information, as the additional peaks also provide relevant information, a valuable feature that has to be considered during interpretation. Since the algorithm could not do this, further implementation to capture that additional information had to be performed. RESULT AND DISCUSSION Raw fluorescence signal data was acquired in this methodology as the primary data and processed using line fitting techniques, namely linear regression and extrapolation. As a result, the linear phase of the fluorescence intensity was captured and compared with the actual observed signal to differentiate the melting phase within the raw fluorescence signal. Further, data pre-processing techniques like normalization were introduced to normalize the fluorescence intensity into the range 0 to 1, and line fitting was then performed at both ends of the resulting normalized curve to detect the take-off and touch-down points. CONCLUSION The experiment on raw fluorescence signals provided significant results for single-peaked DNA melting signals but failed on double-peaked signals, as it could not capture any information on the additional peaks introduced alongside the main peak. It is necessary to employ an alternative processing technique to capture as much of the relevant peak information as possible. Efficient, gold-standard pre-processing techniques such as signal processing methods have to be trialled, and the corresponding results captured and evaluated; this is covered in the upcoming chapters. CHAPTER 7 APPROACH ON CO-ORDINATES OF DNA MELTING SIGNAL The processing of raw fluorescence signals in the previous chapter had several shortcomings, extracting only partial information from the DNA melting signals. In this chapter, alternative methods to processing raw fluorescence signals are introduced, and the corresponding results are observed and evaluated. As stated earlier, DNA melting signals can be produced by applying a differential function to a raw fluorescence signal so that peaks are formed, denoting the melting phase. In this way, raw fluorescence signals are converted into melting signals; the further processing steps covered in this chapter extract features from the melting signals.
7.1 MELT CONVERSION Converting a raw fluorescence signal into a melting signal involves a gradient function, which takes the first negative derivative of fluorescence with respect to the temperature values. The gradient function captures the rate of change of fluorescence intensity for each change in temperature. This produces signals with peaks, the so-called melt curve or melting signal. ALGORITHM 4: ALGORITHM TO CONVERT A RAW FLUORESCENCE SIGNAL TO A MELTING SIGNAL Input: Raw fluorescence signal values Output: Melting signal values 1 Import necessary libraries: numpy, pandas 2 Initialize objects: create an empty data frame object to store melting signal values. 3 FOR i = 1 TO len(data frame.columns) DO 4 gradient ← calculate the rate of change of fluorescence intensity over the temperature values using np.gradient() and negate it using np.negative 5 Melting Signal ← append each converted signal column to the corresponding column of the melting signal data frame. 6 END FOR Figure 35: Comparison of a manually converted melting signal (A) with a machine-converted melting signal (B). Converting raw fluorescence signals into melting signals gives similar results, as shown in fig. 35, though some differences are observed with respect to the smoothness of the signal. The machine-converted melting signals (fig. 35 B) are smooth and show no noise at either end, whereas the manually converted signals (fig. 35 A) introduce noise, creating a slight difference from the original machine-converted signals. The machine may apply noise-reducing filters that remove the excess noise from the melting signals. Harnessing the raw fluorescence signals throughout the approach paves the way for a common, generalized solution, since raw fluorescence data can be extracted from thermal cycler machines of any manufacturer. Following this route provides a universal solution that can be applied anywhere, with any machine. Commercially available thermal cycler machines use several signal smoothing algorithms, such as the Savitzky-Golay algorithm, the moving average, and the ensemble average, to smooth the DNA melting signals. Besides smoothing, the machines also perform data interpolation; familiar algorithms include spline, radial basis function, and nearest-neighbour interpolation. Interpolation and smoothing are two related concepts in signal processing that are used to enhance or modify signals. Interpolation refers to the process of estimating values of a signal at points where it has not been sampled, based on the values of the signal at nearby sampled points. This is typically done using mathematical algorithms, such as linear interpolation or spline interpolation. The purpose of interpolation is to increase the sampling rate of a signal or to fill in missing or corrupted data points. Smoothing, on the other hand, refers to the process of reducing the noise or irregularities in a signal by averaging adjacent data points or using a filtering algorithm. Smoothing can be used to remove high-frequency noise or artifacts in a signal, or to remove short-term fluctuations that are not relevant to the underlying trend of the signal. Interpolation and smoothing are related in that they both involve modifying a signal to improve its quality or make it more useful for analysis or processing.
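To make the conversion and smoothing concrete, a minimal sketch of Algorithm 4 followed by the kind of spline smoothing and resampling discussed in this chapter is given below; temps and fluor are hypothetical 1-D arrays holding one signal's temperature grid and raw fluorescence values, and the smoothing factor s is an illustrative choice (see sections 7.2 and 7.3).

import numpy as np
from scipy.interpolate import splev, splrep

melt = -np.gradient(fluor, temps)                # Algorithm 4: first negative derivative, -dF/dT
m = len(temps)
tck = splrep(temps, melt, s=m - np.sqrt(2 * m))  # smoothing B-spline representation
dense_T = np.linspace(temps.min(), temps.max(), 5 * m)  # denser, interpolated grid
melt_smooth = splev(dense_T, tck)                # smoothed, resampled melting signal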
In some cases, smoothing may be applied before interpolation to remove noise or irregularities in the signal, while in other cases interpolation may be applied before smoothing to fill in missing data points. Both techniques are widely used in signal processing and are essential for many applications, such as image processing, audio processing, and data analysis. The choice of interpolation method depends on the type of data being processed and the specific requirements of the application. Some common interpolation methods used in signal processing include: 1. Linear interpolation: connects two adjacent data points with a straight line and estimates the value of the signal at a new point based on the position of that point on the line. 2. Cubic spline interpolation: fits a cubic polynomial curve to the data points in a local region, which provides a smoother estimate of the signal at the new point. 3. Fourier interpolation: uses the Fourier transform to estimate the signal at a new point based on its frequency content and the values of the signal at neighbouring points. 4. Akima interpolation: uses a modified cubic spline method that is specifically designed to handle noisy or irregular data. 5. B-spline interpolation: fits a spline curve to the data points using a set of basis functions known as B-splines. 7.2 SPLINE AND SAVGOL FILTER Spline and the Savitzky-Golay (SavGol) filter are two popular methods used for smoothing data. A spline is a mathematical technique used to interpolate or smooth data points by fitting a piecewise polynomial curve through the data points. The curve is designed to have a smooth appearance by minimizing the curvature of the function. The resulting curve can be used to interpolate between data points or to smooth out noise or irregularities in the data. The Savitzky-Golay filter, on the other hand, is a technique used in digital signal processing. It smooths noisy data by fitting a series of polynomial functions to the data points; the polynomials are then used to calculate the smoothed values. The filter minimizes the impact of noise on the smoothed values by fitting a polynomial curve of a chosen order to the data. The key difference between the two methods is that splines are designed to interpolate between data points and smooth out noise, whereas the Savitzky-Golay filter is designed to smooth out noise while preserving the shape of the data. In other words, a spline can introduce new data points that were not present in the original data, whereas the SavGol filter only uses the existing data points. In general, if the goal is to preserve the shape of the data, the SavGol filter may be the better choice; if the goal is to interpolate between data points and smooth out noise, a spline may be better. 7.3 B-SPLINE The B-spline representation of a 1-D curve is a mathematical technique that approximates a smooth curve using a set of control points and a set of piecewise polynomial functions known as basis functions. The degree of the basis functions determines the degree of the B-spline curve, and the knot vector determines the locations of the knots that connect the basis functions. To evaluate a point on the B-spline curve, the basis functions are computed at that point and weighted sums of the control points are computed using those basis functions.
B-splines have the advantage of being able to approximate complex shapes using a small number of control points, and they can be used to interpolate or approximate a given set of data points. B-splines are widely used in computer graphics, computer-aided design (CAD), and other fields that require precise and efficient curve and surface modeling. They offer several advantages over other curve representations, including their flexibility, efficiency, and ability to approximate complex shapes with a relatively small number of control points. To smooth the melting signal converted from the raw fluorescence signal, spline functions have been used. ALGORITHM 5: ALGORITHM TO SMOOTH THE MELTING SIGNAL CONVERTED FROM THE RAW FLUORESCENCE SIGNAL Input: Converted melting signal values Output: Smoothed melting signal values 1 Import necessary libraries: numpy, pandas, scipy.interpolate 2 Initialize objects: create an empty data frame object to store smoothed melting signal values. Interpolated temperature ← interpolate the temperature values using np.linspace, with a constant multiplying the actual number of temperature values. 3 FOR i = 1 TO len(data frame.columns) DO 4 Interpolated Signals ← interpolate the signal values over the corresponding interpolated temperature values using scipy.interpolate.splrep() with an 's' value. 5 Smoothed Melting Signal ← append each smoothed signal column to the corresponding column of the melting signal data frame. 6 END FOR Figure 36: Before (A) and after (B) applying the smoothing filter. After applying the smoothing filter, the signal looks fine and no noise is observed compared to the previous one (A). The spline function smooths the signal by means of the 's' parameter, a smoothing condition. The amount of smoothing is determined by satisfying the condition:

sum((w * (y - g(x)))**2) <= s

where g(x) is the smoothed interpolation of (x, y). The parameter s can be varied to trade off closeness against smoothness of fit: larger s means more smoothing, while smaller values of s indicate less smoothing. Recommended values of s depend on the weights w. If the weights represent the inverse of the standard deviation of y, then a good s value should be found in the range (m - √(2m), m + √(2m)), where m is the number of data points in x, y, and w. The default is s = m - √(2m) if weights are supplied, and s = 0.0 (pure interpolation) if no weights are supplied. Figure 37: Machine-converted melting signal (A) and manually converted melting signal (B) after applying the smoothing filter. Selecting an appropriate value for 's' is critical, and the impact of this parameter varies as the data changes. Using a constant 's' value in the spline function will not give a generalized solution; the smoothing applied to a signal must be proportional and dynamic. This also becomes a hindrance for feature extraction, as over-smoothing flattens important curves and details, which in turn degrades performance. Finding a suitable 's' parameter for each data set is challenging, and making the 's' value dynamic requires more attention. Even with a properly chosen hyperparameter, performing manual conversion involves a certain risk of losing the authenticity of the data, since the machine-made DNA signals undergo technical processing stages that only dedicated PCR machines handle efficiently. A PCR machine converts raw fluorescence to melt data through the following steps: 1.
Raw fluorescence data acquisition: the instrument measures the fluorescence signal during the PCR reaction and generates a raw fluorescence data file. 2. Background correction: the machine subtracts the baseline fluorescence level from the raw fluorescence data to correct for any background signal. 3. Normalization: the fluorescence signal is normalized to a reference dye signal to account for any variations in fluorescence intensity caused by differences in sample concentration or dye binding. 4. Melting curve generation: the normalized fluorescence data is plotted against the temperature of the reaction to generate a melting curve. The software uses an algorithm to smooth the curve and calculate its first derivative. 5. Baseline subtraction: the instrument calculates the baseline of the melting curve by fitting a straight line to the lowest fluorescence signal values. 6. Tm determination: the machine identifies the temperature at which 50% of the double-stranded DNA has dissociated by locating the maximum of the first derivative of the melting curve. 7. Melt curve analysis: the machine can also perform a derivative analysis on the melting curve to identify the number of melting domains or different DNA sequences present in the sample. This information can be used to identify the presence of mutations or genetic variations in the sample. 8. Data output: finally, the machine outputs the melt curve data in various formats, such as a graph, a table of Tm values, and a report containing detailed analysis results. 7.4 BASELINE SUBTRACTION Baseline subtraction is a critical step in converting raw fluorescence data to melt data. Its purpose is to correct for any background signal and to ensure that the fluorescence signal represents only the DNA melting signal. Baseline subtraction can be achieved using the following method: 1. Baseline region selection: the PCR machine selects a region of the melting curve that represents the baseline fluorescence level. This region is usually chosen as the region before the start of the DNA melting transition, where the fluorescence signal is relatively stable. 2. Straight line fitting: it then fits a straight line to the baseline region of the melting curve using linear regression analysis. 3. Baseline subtraction: the machine subtracts the values of the fitted straight line from the entire melting curve to obtain the baseline-subtracted curve. By fitting a straight line to the baseline region of the melting curve and subtracting it from the entire curve, the baseline-subtracted curve is obtained. This ensures that the fluorescence signal represents only the DNA melting signal and not any background signal. The baseline subtraction method eliminates any background signal that may interfere with the accurate detection of DNA melting. The result is a more accurate and reliable measurement of the melting temperature and of any genetic variations in the sample. 7.5 BACKGROUND CORRECTION 1. Baseline fluorescence measurement: the instrument measures the fluorescence signal from a region of the reaction that does not contain any DNA, such as the negative control or a blank sample. 2. Baseline fluorescence subtraction: it then subtracts the baseline fluorescence level from the raw fluorescence data for each sample to correct for any background signal. 3.
Correction factor calculation: the machine calculates a correction factor for each sample by dividing the fluorescence signal of the sample by the fluorescence signal of the reference dye. This correction factor accounts for any variations in fluorescence intensity caused by differences in sample concentration or dye binding. • Reference dye measurement: the instrument measures the fluorescence signal of a reference dye that is added to each reaction. The reference dye has a constant fluorescence intensity and serves as a standard for the fluorescence measurement. • Raw fluorescence correction: the machine divides the raw fluorescence signal of each sample by the raw fluorescence signal of the reference dye to correct for any variations in fluorescence intensity caused by differences in sample concentration or dye binding. 4. Normalization: it then normalizes the fluorescence data for each sample by dividing the corrected fluorescence signal by the average of the correction factors for all samples. However, there are some limitations to performing these steps manually. For example, the real-time PCR instrument used for fluorescence measurement would need to be capable of precise temperature control and fluorescence detection, and the analysis would require specialized software for data processing. Additionally, manual analysis may be more time-consuming and error-prone than using dedicated software. While it is possible to perform some of the steps involved in converting raw fluorescence data to melt data manually, doing so would require specialized equipment and expertise and may not be as efficient or accurate as using PCR machines. Considering the study and the technical aspects behind converting raw fluorescence signals into melt signals, manual conversion is put on hold and kept as future research work for this project, in which the technical steps mentioned above should be implemented together with domain experts. To sum up, acquiring the machine-converted DNA melting signal data is feasible within the scope of the project, and the relevant feature extraction process can be applied to such data effectively. 7.6 SIGNAL PROCESSING ON DNA MELTING SIGNAL To process the DNA melting signal, several techniques from signal processing can be used. For example, noise reduction techniques, such as filtering and averaging, can be used to remove random fluctuations in the signal and improve its accuracy. Signal processing techniques can also be used to extract features from the melting curve, such as the melting temperature, the shape of the curve, and the width of the transition region. These features can then be used to compare different DNA samples or to detect mutations and structural variations in the DNA molecule. Another important aspect of DNA melting signal processing is the use of mathematical models to describe the melting process. Signal processing techniques play a critical role in analysing and interpreting DNA melting signals, and can provide valuable insights into the properties of the DNA molecule and its interactions with other molecules. Figure 38: Features that can be extracted from a melt curve using signal processing 7.6.1 PEAK FINDING Signal peak finding is a common problem in signal processing, where the goal is to identify and extract significant features, such as peaks or valleys, from a given signal. A peak is a local maximum or high point in the signal, while a valley is a local minimum or low point in the signal.
Peak finding algorithms are used in a variety of fields, including image processing, speech recognition, and data analysis. In many cases, peak finding is a critical step in the data analysis pipeline, as it can reveal important information about the underlying process generating the signal. There are many different methods for peak finding, ranging from simple threshold-based approaches to more sophisticated techniques like the wavelet transform and machine learning-based approaches. These methods differ in terms of their accuracy, computational complexity, and robustness to noise and other types of signal artifacts. One commonly used approach is the "local maxima" method, where the signal is scanned for local maxima using a sliding window. Another popular approach is the "derivative-based" method, where the derivative of the signal is computed and peaks are identified as the points where the derivative changes sign. The peak finding function used here takes a 1-D array and finds all local maxima by simple comparison of neighbouring values. Figure 39: Detecting peaks in the DNA melting signal 7.6.2 PEAK PROMINENCE Signal peak prominence is a measure of the relative height of a peak in a signal compared to the surrounding peaks. It is a useful feature for signal processing and analysis because it can provide information about the importance or significance of a peak in the overall signal. The prominence of a peak is defined as the vertical distance between the peak and the lowest contour line separating it from any higher peak; in other words, it measures the minimum height one must descend from the peak before reaching a higher neighbouring peak or the baseline of the signal. Signal peak prominence is commonly used in peak detection algorithms to filter out noise and identify only the most significant peaks in a signal. Peaks with high prominence are typically more important and relevant to the underlying process generating the signal than those with low prominence. Prominence can also be used to compare and quantify the differences between peaks in different signals or datasets. For example, it can be used to compare the amplitude of peaks in EEG signals recorded from different brain regions, or to compare the intensity of peaks in spectra obtained from different chemical compounds. Figure 40: Calculating the peak prominence of the DNA melting signal 7.6.3 PEAK WIDTH Signal peak width is a measure of the extent of a peak in a signal along the x-axis (usually time or frequency). It is an important feature in signal processing and analysis because it can provide information about the duration or spread of the phenomenon represented by the peak. Peak width is commonly used in peak detection algorithms to distinguish between closely spaced or overlapping peaks. In such cases, the width of a peak can help to differentiate between distinct peaks and noise or artifacts in the signal. Peak width can be quantified in a number of ways, depending on the nature of the signal and the application. One common measure is the full width at half maximum (FWHM), which is the width of the peak at half its maximum amplitude. Another common measure is the peak width at the baseline, the width of the peak at a certain fraction of its maximum amplitude. Figure 41: Calculating the peak width of the DNA melting signal Using the width of the peak, the take-off and touch-down points can be calculated from the start and end points of the width, as the sketch below shows.
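A minimal sketch of these three steps with scipy.signal, reusing the hypothetical melt_smooth and dense_T arrays from the earlier sketch; the 75% relative height matches the project's choice described next.

import numpy as np
from scipy.signal import find_peaks, peak_prominences, peak_widths

peaks, _ = find_peaks(melt_smooth)                     # indices of all local maxima
prominences = peak_prominences(melt_smooth, peaks)[0]  # height above the surrounding baseline
# widths measured at 75% relative height; left_ips/right_ips are fractional
# indices marking where each peak starts and ends
widths, heights, left_ips, right_ips = peak_widths(melt_smooth, peaks, rel_height=0.75)
take_off = np.interp(left_ips, np.arange(len(dense_T)), dense_T)     # peak start temperatures
touch_down = np.interp(right_ips, np.arange(len(dense_T)), dense_T)  # peak end temperatures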
As in the sketch above, the width of the peak is calculated at a relative height of 75% of the peak. Figure 42: Detecting all the features from a melting signal. 7.6.4 AREA UNDER THE CURVE (AUC) The area under the curve of a signal is a fundamental concept in signal processing and analysis. It represents the total energy or power of the signal over a given time period. This concept is widely used in many fields, including physics, engineering, biology, and finance. The calculation of the area under the curve can provide important information about the signal, such as its mean, variance, and distribution. It is also a useful tool for detecting anomalies and patterns in the signal, which can be used for applications such as signal classification, signal denoising, and signal compression. Various methods and techniques are used for its computation, including numerical integration, the trapezoidal rule, Simpson's rule, and Monte Carlo integration. Simpson's Rule Simpson's rule is a numerical integration technique used to approximate the area under a curve. It is based on the idea of approximating the curve with a parabolic function and then computing the area of the resulting parabola:

∫[a,b] f(x) dx ≈ ((b-a)/6) * [f(a) + 4f((a+b)/2) + f(b)]

where f(x) is the function to be integrated, [a,b] is the interval over which the integration is performed, and (a+b)/2 is the midpoint of the interval. The formula can be understood as follows: the area under the curve is approximated by dividing the interval [a,b] into subintervals of equal width and then approximating the curve over each subinterval with a parabolic function. The area under each parabola is then computed and added together to obtain an approximation of the total area under the curve. Simpson's rule is known to be more accurate than other numerical integration techniques, such as the trapezoidal rule, for functions that are smooth and have a relatively simple shape. However, it may not be as effective for functions that have sharp changes or irregularities in their shape. Peak detection and feature extraction using signal processing methods provided significant results, detecting all the relevant features of a DNA melting signal, including peak prominence, width, and area under the curve. It is also notable, however, that the peak detection was performed without any thresholding: every peak present in a signal is captured, including peaks irrelevant to the analysis. This is the scenario where thresholding plays a crucial role in removing unwanted signals. Figure 43: Negative (noise) signal with peaks detected The negative signal shown in fig. 43 should not be taken into consideration, as no DNA melting is observed; yet the signal processing methods captured the minute noise as peaks, which is inappropriate. Proper thresholding techniques must be applied to such signals so that noisy and unwanted peaks are removed before they enter the analysis. In the prevailing context, thresholding has been performed by visual inspection, which involves manual configuration of the prominences of the DNA melting signals: a suitable prominence level is fixed so that only peaks above this level are considered, the rest being discarded as noisy/unwanted signals, as in the sketch below.
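A sketch of these last two steps under the same assumptions; the prominence cutoff stands in for a visually chosen level and, like the use of scipy's Simpson implementation here, is an illustrative choice rather than the report's exact procedure.

from scipy.integrate import simpson
from scipy.signal import find_peaks, peak_widths

# keep only peaks whose prominence exceeds a manually chosen level
peaks, props = find_peaks(melt_smooth, prominence=0.05)  # 0.05 is an illustrative cutoff

# AUC of each retained peak by Simpson's rule, integrating -dF/dT over the
# temperature span of the peak at 75% relative height
_, _, left_ips, right_ips = peak_widths(melt_smooth, peaks, rel_height=0.75)
for lo, hi in zip(left_ips.astype(int), right_ips.astype(int) + 1):
    print(simpson(melt_smooth[lo:hi], x=dense_T[lo:hi]))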
Peak detection and feature extraction with these signal processing methods provided significant results, detecting all relevant features of a DNA melting signal, including peak prominence, width, and area under the curve. However, the peak detection was performed without any thresholding: every peak present in a signal is captured, regardless of whether it is relevant to the analysis. This is where thresholding plays a crucial role in removing unwanted signals.

Figure 43: Negative (noise) signal with peaks detected

The negative signal shown in figure 43 should not be taken into consideration, as no DNA melting is observed; yet the signal processing methods capture its minute noise fluctuations as peaks, which is inappropriate. Proper thresholding techniques must be applied to such signals so that noisy and unwanted peaks are removed before they enter the analysis. In the prevailing workflow, thresholding is performed by visual inspection, which involves manually configuring a prominence cut-off for the DNA melting signals: a suitable prominence level is fixed, only peaks above that level are considered, and the rest are discarded as noisy or unwanted signals.

Since this is an AI-based framework that should perform its tasks autonomously, without human intervention, an explicit rule is needed to set the thresholding level automatically, so that only the necessary peaks are taken forward for analysis.

7.7 THRESHOLDING LOGIC

As stated earlier, the thresholding logic must be precise and should detect only genuine signals. In most melting signals, peaks with substantial prominence are chosen for analysis; however, small peaks with lower prominence are sometimes also considered, depending on other parameters such as the melting temperature.

Figure 44: Measuring the prominences

A threshold (the red line) is set for the peaks shown in figure 44, which rejects the second peak in favour of the first peak, whose prominence is higher. Measuring the prominences shows that the later peaks have roughly 20% of the prominence of the first peak. Hence the logic is applied as follows: detect the highest peak in the signal, compare the prominences of the neighbouring peaks against it, and retain only those peaks whose prominence exceeds approximately 20% of the prominence of the highest peak.
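A minimal sketch of this self-thresholding rule, assuming SciPy and an already-computed melt signal (the function and variable names are illustrative, not part of the project's code):

```python
import numpy as np
from scipy.signal import find_peaks, peak_prominences

def threshold_peaks(signal: np.ndarray, rel_threshold: float = 0.2) -> np.ndarray:
    """Keep only peaks whose prominence is at least `rel_threshold` times
    the prominence of the most prominent peak in the signal."""
    peaks, _ = find_peaks(signal)
    if peaks.size == 0:
        return peaks
    prominences = peak_prominences(signal, peaks)[0]
    cutoff = rel_threshold * prominences.max()
    return peaks[prominences >= cutoff]
```

As the Result and Discussion below notes, the ~20% factor itself is data-dependent, which is what motivates the image-based approach of the next chapter.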
RESULT AND DISCUSSION

In this approach, the raw fluorescence signals were first converted into a melting signal using a gradient function, and an interpolation algorithm (spline) was applied to reduce the noise introduced in the signals. The signals are smoothed as a result, but the outcome is not authoritative, since the parameter of the spline function is highly variable. Therefore, the machine-processed DNA melting signals were adopted instead and processed further with the signal processing methods (peak finding, peak prominence, peak width, and area under the curve) to extract the necessary information from the signals. A thresholding logic was then applied to keep only the genuine, desired peaks; however, this logic did not provide a significant result, as the ~20% value is not universally meaningful and varies from dataset to dataset.

CONCLUSION

Beyond feature extraction, developing a sensible thresholding logic is mandatory. Applying a fixed mathematical rule for thresholding gives significant results for some data, but the rule is variable, and its output becomes insignificant in some scenarios. A proper alternative must therefore be employed so that unwanted peaks and noisy signals are removed from the analysis.

CHAPTER 8

COMBINATION OF APPROACH ON IMAGES AND THE CO-ORDINATES OF DNA MELTING SIGNAL

In the previous chapter, thresholding proved to be a challenging process: logic based on peak prominences was not significant and was highly variable. To tackle this hindrance, an image-based approach is adopted in this chapter, introducing image-based thresholding analogous to the conventional thresholding done by visual inspection. This can be achieved with a standard computer vision model, the Convolutional Neural Network, trained on multiple images of DNA melting signals generated randomly from a pool of samples, to classify genuine and non-genuine peaks.

8.1 CONVOLUTION NEURAL NETWORK

A Convolutional Neural Network (CNN) is a type of neural network commonly used in image and video recognition tasks. It is a deep learning algorithm that learns to automatically extract features from input images through a process called convolution. The key feature of a CNN is its ability to learn and identify spatial patterns in an image. This is done through convolutional layers, which apply filters to an input image to extract specific features. These features are then passed on to other layers in the network, such as pooling layers and fully connected layers, to further refine the information and classify the input image. CNNs have achieved state-of-the-art performance in computer vision tasks such as object detection, face recognition, and image segmentation, and they have also been used in other domains such as natural language processing and speech recognition.

Classifying genuine and non-genuine DNA melting signals is an important task in DNA analysis. CNNs can address this problem by learning to automatically extract features from images of the melting signals and classifying them as genuine or non-genuine. During training, the CNN is fed a large dataset of labelled DNA melting signals containing both genuine and non-genuine examples, and it learns to differentiate between the two classes by adjusting the weights of its filters through backpropagation. Once trained, the CNN can classify new DNA melting signals as genuine or non-genuine, making this a promising approach.

Figure 45: Concept of Convolution Neural Network for classifying DNA melting signal

To train a CNN model to classify DNA melting signals as "Single Peaked", "Double Peaked", or "Noise", the necessary training images have to be generated and labelled accordingly. ConvNets have shown promising results in DNA analysis, particularly in the classification of DNA melting signals: they automatically extract relevant features from the data and learn to classify the signals based on those features. In the context of DNA signal thresholding, an image of the signal is generated from the provided coordinates and passed through the trained network, which has learned to identify and extract specific features from the signal. The feature maps created during training enable the network to output a probability distribution over the classes, indicating, for example, that a signal is more likely to be "Single Peaked" than "Double Peaked" or "Noise". This information is used to threshold the signal and separate genuine signals from noise and other artifacts.

8.2 GENERATING IMAGE DATA SET

Images of DNA melting signals are generated from the coordinate data acquired from the PCR machines. The generated images cover all signal types: 'Single peaked', 'Double peaked', and 'Noisy' signals.

Figure 46: Training images of DNA melting signals of the classes 'Single', 'Double' and 'Noise'

8.3 MODEL ARCHITECTURE

Figure 47: Model architecture with layers

There are 569 images in total: 206 "Noisy" signals labelled 0, 197 "Single peaked" signals labelled 1, and 166 "Double peaked" signals labelled 2. The model architecture has two convolution layers with two pooling layers and two dense layers, giving 16,467 trainable parameters in total. The input image is reshaped to (30, 30) with 3 channels. The first convolution layer has a kernel size of (5,5) with a pooling size of (2,2), and the second convolution layer has a kernel size of (3,3) with a pooling size of (2,2); the activation function in these layers is ReLU. The two dense layers, with ReLU and Softmax activations, account for 12,832 trainable parameters. The model was trained for 20 epochs with a batch size of 8.
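A sketch of this architecture in Keras is shown below. The kernel sizes, pool sizes, activations, number of epochs, and batch size follow the description above; the filter counts and the width of the first dense layer are assumptions for illustration, since the report lists only the total parameter counts.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two convolution + pooling stages followed by two dense layers, as
# described in the text. Filter and unit counts are assumed values.
model = keras.Sequential([
    keras.Input(shape=(30, 30, 3)),
    layers.Conv2D(8, (5, 5), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(3, activation="softmax"),  # classes: 0 noise, 1 single, 2 double
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=20, batch_size=8,
#           validation_split=0.2)   # training data names are placeholders
```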
Figure 48: Model performance with Accuracy and Loss

The model has a validation accuracy of 83.81% and a training accuracy of 86.67%, which suggests that it does not overfit the data. Since this is a multiclass classification problem, accuracy alone is not sufficient; the true performance of the model is assessed with metrics such as precision and recall. Combining the two, the F1 score, the harmonic mean of precision and recall, gives a significant and reliable result when the model truly performs well.

              precision    recall      f1-score    support
0             0.800000     0.923077    0.857143    13
1             0.857143     0.923077    0.888889    13
2             1.000000     0.769231    0.869565    13
accuracy                               0.871795    39
macro avg     0.885714     0.871795    0.871866    39
weighted avg  0.885714     0.871795    0.871866    39

Table 4: Model performance metrics

The F1 scores are above 0.85 for all classes, indicating that both precision and recall are high. These results were obtained by evaluating the model on test images it had not seen before.

Figure 49: Confusion matrix for the results of the CNN model

Once the model has been validated and tested, it can be applied in the process of extracting information from DNA melting signals, providing image-based thresholding alongside the signal processing methods. The model outputs a class (0, 1 or 2), which determines how many peaks are considered: if the model outputs '1', feature extraction is performed on the single most prominent peak (peaks are sorted in descending order of prominence), and the remaining peaks are disregarded.

Figure 50: Classification of CNN model between genuine and non-genuine peaks

In figure 50, the green signals are identified as genuine and the red signals as non-genuine. For each signal, the number of peaks to consider is given (0, 1 or 2), and feature extraction is performed accordingly.

RESULT AND DISCUSSION

The CNN-based thresholding provided a significant result: because the model was trained on images of genuine and non-genuine peaks, it learned the patterns of such signals and can identify and classify them. Based on the model's output, the feature extraction process was performed, and the relevant features of the classified signals were captured and stored in the feature store for model building.
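For reference, per-class tables like Table 4 can be produced with scikit-learn's classification_report; this is an illustration only, as the report does not state which tooling generated the table, and the labels below are synthetic.

```python
from sklearn.metrics import classification_report

# Illustrative ground-truth and predicted labels for the three classes:
# 0 = noise, 1 = single peaked, 2 = double peaked.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]

# Prints per-class precision, recall and F1 (the harmonic mean of the two),
# plus accuracy, macro and weighted averages, in the format of Table 4.
print(classification_report(y_true, y_pred, digits=6))
```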
CHAPTER 9

SYSTEM DESIGN & DEVELOPMENTS

9.1 COMPONENTS

To develop the AI-based framework for automated analysis, interpretation, and data management of HRM data, generalized and optimized for different targets, the team developed three major components:

• EXTRACTOR (data extraction tool)
• PyHRM (feature engineering)
• Meltcurve Interpreter (prediction)

The EXTRACTOR tool extracts data from the raw HRM files; the PyHRM tool performs feature engineering, extracting the relevant features from the HRM data; and the Meltcurve Interpreter uses predictive analytics and deep learning models to interpret the extracted features and predict the presence of the intended molecular target in the clinical sample tested. Together, these components form an AI-based framework that automates the analysis and interpretation of HRM data, allowing faster and more accurate diagnosis of infectious diseases. With this framework, clinicians can make more informed decisions and plan the course of treatment for their patients.

Figure 51: AI-framework

9.2 EXTRACTOR

Extractor is a lightweight, simple GUI-based application (figure 52) that exports Qiagen Rotor-Gene Q run files ('.rex') to the required '.xls' files. It is built for users such as laboratory technicians and clinicians who handle and run PCR experiments, especially on Qiagen's Rotor-Gene Q thermal cycler.

Figure 52: User interface of Extractor

After a successful run on the Rotor-Gene Q cycler, the instrument produces raw data that can only be opened and analysed with Qiagen's Q-Rex software, which allows a specific run file to be exported manually into formats such as text (.txt), HTML table (.html), XML (.xml), or Excel (.xls). EXTRACTOR automates this manual role: the user simply specifies the directory of raw data files and the destination directory for the Excel files, which saves time and avoids burnout from this repetitive task.

9.2.1 FRAMEWORK

The framework of the extractor application is shown below.

Figure 53: Framework of extractor

Manual conversion: time-consuming, and converting bulk raw data manually often introduces frustration and inconsistency.

Automatic conversion: runs in constant time per file and handles any volume of raw data automatically, without human intervention.

Features
• User credentials can be set.
• Selection of the type of data to extract from the raw data:
  • CT (amplification curve)
  • MELT (derivative)
  • HRM (normalized fluorescence)
• Output to the desired format as an Excel file.

Types of data to extract

Select the type of data from the drop-down menu, fill in the respective fields below, and press submit (figure 54):
• CT
• MELT
• HRM

System requirements

The following are the essential requirements for this software to run:
Q-Rex Software: Rotor-Gene Q Software v2.3.5 (Build 1)
Platforms: Windows 10 / 11

Figure 54: Type of data to extract
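As an illustration of the automated role, a hedged sketch of the batch-directory handling follows. The actual export is performed through the Q-Rex software; convert_rex_to_xls here is a hypothetical placeholder for that step, and all paths are illustrative.

```python
from pathlib import Path

def convert_rex_to_xls(rex_file: Path, out_dir: Path) -> Path:
    """Hypothetical placeholder for the Q-Rex-backed export of one run file."""
    out_path = out_dir / (rex_file.stem + ".xls")
    # ... invoke the Q-Rex export for `rex_file`, writing to `out_path` ...
    return out_path

def batch_convert(raw_dir: str, out_dir: str) -> None:
    """Convert every .rex run file in raw_dir to .xls in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for rex in sorted(Path(raw_dir).glob("*.rex")):
        print("converting", rex.name)
        convert_rex_to_xls(rex, out)

# batch_convert("C:/runs/raw", "C:/runs/excel")   # illustrative paths
```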
9.3 PyHRM

We developed a Python library called PyHRM for processing High-Resolution Melting (HRM) data, in particular DNA melting signals, to extract features such as melting temperatures, the take-off and touch-down points of the melting signal (the temperature at which the peak starts rising and the temperature at which it falls back down), peak prominences, and the area under the curve. Additionally, the library offers interactive visualization of DNA melting signals and vision-based filtering that eliminates noisy signals from the data, returning only genuine peaks together with all the features mentioned above.

Installing with PyPI

The PyHRM library can be installed with pip:

python -m pip install PyHRM
or
pip3 install PyHRM

Figure 55: PyHRM installation

File stack of the library

The file stack of the PyHRM library consists of several files, including .py and .h5 files, which form the basis of feature detection for HRM data.

Figure 56: File stack of PyHRM

The deep learning model for melt curve peak classification has been saved with Keras as CNN.h5. The melt.py module contains the major functions for processing HRM data: data_read, melt_conversion, feature_detection, and report_generation.

Dependencies of PyHRM

The following are the necessary dependencies of PyHRM:
• fpdf==1.7.2
• kaleido==0.2.1
• keras==2.12.0
• matplotlib==3.6.3
• numpy==1.23.5
• openpyxl==3.1.0
• packaging==23.1
• pandas==1.5.3
• Pillow
• plotly==5.13.0
• Requests==2.28.2
• scipy==1.10.1
• tensorflow==2.12.0
• tqdm==4.64.1
• xlrd==2.0.1

Features

The PyHRM library has the following features:
• Rapid preprocessing.
• Feature extraction:
  o Tm (melting temperature, max 2)
  o Tstart (starting temperature point)
  o Tend (ending temperature point)
  o Prominence
  o Area under the curve
• Interactive visualization.
• Computer-vision-based thresholding for eliminating noisy signals.
• Report generation.

Input data format

PyHRM supports only the .xls and .xlsx formats, for seamless analysis.

Figure 57: Input data format for PyHRM

Working of the library

The library is imported in any notebook IDE with the commands shown in figure 58, and a class instance is then created for the melt curve interpreter module of PyHRM.

Figure 58: Import PyHRM

The HRM data in the supported input format is read with data_read() and the necessary parameters, and visualized with plot() (figure 59).

Figure 59: Output of plot()

The major function of the library, feature_detection(), detects the features of HRM signals using the customized deep learning model and returns the output as a dataframe. A PDF report of the features can be generated and downloaded with report() (figures 60 and 61).

Figure 60: Features of HRM data using feature_detection()

Figure 61: Reports of feature_detection()
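Putting those calls together, a minimal usage sketch follows. The function names come from the description above, but the module path, class name, and exact signatures are assumptions, since the report shows the calls only in screenshots.

```python
from PyHRM import melt  # module name taken from the library's file stack

# Class name and argument names are assumed for illustration.
interpreter = melt.MeltcurveInterpreter()

data = interpreter.data_read("melt_data.xls")   # .xls / .xlsx input only
interpreter.plot(data)                          # interactive visualization
features = interpreter.feature_detection(data)  # dataframe of Tm, Tstart,
                                                # Tend, prominence and AUC
interpreter.report()                            # generate the PDF report
```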
9.4 MELTCURVE INTERPRETER

The Meltcurve Interpreter (MCI) is a web-based application for analysing and interpreting the final results built on the outputs of the extractor and the PyHRM library. The module consists of various files and folders (.py, .html, .css, .h5) for the final classification and interpretation of the meningitidis panel (figure 62).

Figure 62: File stack of meltcurve interpreter

MCI contains two main deep learning models, melt.h5 and MEP_Model.h5, for melt curve peak classification and molecular target classification respectively. The front-end of MCI has the following files in its directory:

Templates
• Home.html
• Homepage.html
• Index.html
• Melt.html
• CT.html
• FDP.html
• Statistics.html
• Report.html
• Help.html

Static
• Stylesheets
• Images

The back-end of MCI has the following files in its directory:

LocalMeltcurveAnalysis
• melt.h5
• MEP_Model.h5
• meltcurve_interpreter

The Meltcurve Interpreter was deployed using Flask, a simple web development framework in Python. MCI is designed to be user-friendly, with features such as file upload, interactive graph visualization, feature detection, EDA, and final report generation.

Figure 63: MCI Interface

The home page of MCI describes the organization profile and the working modules, as shown in figure 64.

Figure 64: MCI Home page

The file upload page takes the mandatory Melt and Ct files in Excel format and returns a token that can be used to retrieve the data at any time. The uploaded files are stored in and retrieved from a centralized PostgreSQL database (figure 65).

Figure 65: MCI file upload

Once a file has been uploaded, it can be retrieved with the token and username for further analysis and for visualization of the melt and amplification curves (figures 66 and 67).

Figure 66: MCI melt curve visualisation

Figure 67: MCI amplification curve visualisation

The feature detection panel classifies the melt peaks for thresholding and detects the features of the melt signals, reporting peak features such as the start and end temperatures, AUC, prominence, width, and the target of the samples tested (figure 68).

Figure 68: MCI feature detection panel

The statistics panel presents statistical measures of the HRM data to gain insights through the stipulated statistical analysis (figure 69).

Figure 69: MCI Statistical measures

Finally, a report is generated in PDF format with the melt curve graphs of the classified peaks and a table of features, as shown in figure 71.

Figure 70: MCI Report Generation

Figure 71: MCI Final Report

9.5 ER DIAGRAM

The entity-relationship diagram for the back-end database component of MCI is shown below.

Figure 72: ER diagram of MCI database component

CHAPTER 10

TESTING AND RESULTS

10.1 TEST DATA

The following unknown patient sample data (figure 73) was used to test MCI's deep learning model for pathogen classification.

Figure 73: Melt curve test data

The model has a test accuracy of 85% and a training accuracy of 86.67%, which suggests that it does not overfit the data. Since this is a multiclass classification problem, accuracy alone is not sufficient.

Figure 74: Features of melt curve test data

The true performance of the model (figure 75) is assessed with metrics such as precision and recall; combining the two, the F1 score, the harmonic mean of precision and recall, gives a significant and reliable result when the model truly performs well.

Figure 75: Accuracy and loss metrics for the MEP_Model

The confusion matrix for the pathogen classification model (figure 76) shows how the model classifies the pathogens in the meningitidis panel: it classifies the SP, HI and NM pathogens with low false-positive rates.

Figure 76: Confusion Matrix for MEP_model

Figure 77: Accuracy and loss metrics for the MEP_Model
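As an illustration of how such a saved model is applied at test time, a minimal sketch follows; the model file name comes from the report, but the input shape and preprocessing are assumptions, as the report does not document them.

```python
import numpy as np
from tensorflow import keras

# Load the saved pathogen-classification model (file name per the report).
model = keras.models.load_model("MEP_Model.h5")

# Illustrative input only: one feature vector for a tested sample. The real
# input format of MEP_Model.h5 is not documented in the report.
sample = np.array([[80.1, 77.0, 83.2, 0.42, 1.8]])
probs = model.predict(sample)
print("predicted class index:", int(probs.argmax(axis=1)[0]))  # e.g. SP / HI / NM
```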
CONCLUSION

In the current scenario, the diagnosis of infectious diseases is rapidly moving towards molecular assays, with several major biotechnology companies developing ready-to-use molecular kits. However, the reporting of these assays still depends largely on visual interpretation and analysis by technicians, which has significantly affected their acceptability in commercial diagnostic setups such as clinical laboratories and hospitals. This project lays the foundation for a first-of-its-kind framework for automated analysis of molecular assays. We have successfully shown that, by applying predictive analytics and deep learning models to High-Resolution Melting data, several distinct features can be extracted and used to develop an algorithm that indicates the presence of the intended molecular target in a tested clinical sample. The project can be developed further into full-fledged software that aids clinicians in diagnosing several diseases and planning the course of treatment. Such software has the potential to revolutionize the molecular diagnosis field and improve the digital compatibility of molecular assay interpretation with existing laboratory information management systems.

REFERENCES

Bibliography

M. T. Dorak, Ed., Real-time PCR. New York: Taylor & Francis Group, 2007, pp. 1–83. Accessed: Mar. 15, 2023. [Online]. Available: https://www.gene-quantification.de/dorak-book-real-time-pcr-2006.pdf

S. F. Dobrowolski and C. T. Wittwer, “High-Resolution Melt Profiling,” in Molecular Analysis and Genome Discovery, R. Rapley and S. Harbron, Eds., 2nd ed. West Sussex, UK: Wiley-Blackwell, 2012, pp. 81–113. Accessed: Mar. 29, 2023. [Online]. doi: 10.1002/9781119977438.ch5.

Primary Literature

[1] Thermo Fisher Scientific, Inc., Laboratory Information Management Systems. [Online]. Available: https://www.thermofisher.com/in/en/home/digital-solutions/lab-informatics/lab-information-management-systems-lims.html. [Accessed: Apr. 06, 2023].
[2] Thermo Fisher Scientific, Inc., “Lab Software Integration Tools.” [Online]. Available: https://www.thermofisher.com/in/en/home/digital-solutions/lab-informatics/integration.html. [Accessed: Apr. 06, 2023].
[3] M. L. Bayot, J. E. Lopes, and P. Naidoo, “Clinical Laboratory,” StatPearls - NCBI Bookshelf, Dec. 19, 2022. [Online]. Available: https://www.ncbi.nlm.nih.gov/books/NBK535358/. [Accessed: Apr. 06, 2023].
[4] B. H. Shirts et al., “Clinical laboratory analytics: Challenges and promise for an emerging discipline,” Journal of Pathology Informatics, vol. 6, no. 1, p. 9, Feb. 2015, doi: 10.4103/2153-3539.151919.
[5] Agaram, “LIMS for speciality Diagnostic labs,” Agaram Tech, Jan. 18, 2022. [Online]. Available: https://www.agaramtech.com/lims-for-specialty-diagnostic-labs/
[6] G. P. Patrinos, P. B. Danielson, and W. J. Ansorge, “Molecular Diagnostics,” in Elsevier eBooks, Elsevier BV, 2017, pp. 1–11, doi: 10.1016/b978-0-12-802971-8.00001-8.
[7] R. G. L. Pryor and C. T. Wittwer, “Real-Time Polymerase Chain Reaction and Melting Curve Analysis,” in Humana Press eBooks, vol. 336, pp. 19–32, Jan. 2006, doi: 10.1385/1-59745-074-x:19.
[8] D. V. Rebrikov and D. Y. Trofimov, “Real-time PCR: A review of approaches to data analysis,” Applied Biochemistry and Microbiology, vol. 42, no. 5, pp. 455–463, Sep. 2006, doi: 10.1134/s0003683806050024.
[9] L. Garibyan and N. Avashia, “Polymerase Chain Reaction,” Journal of Investigative Dermatology, vol. 133, no. 3, pp. 1–4, Mar. 2013, doi: 10.1038/jid.2013.1.
[10] C. Wang et al., “Veterinary PCR Diagnostics,” Bentham Science Publishers eBooks, Mar. 2012, doi: 10.2174/97816080534831120101.
[11] G. L. Shipley, “Real-Time Quantitative PCR: Theory and Practice,” Encyclopedia of Molecular Cell Biology and Molecular Medicine, vol. 11, Sep. 2006, doi: 10.1002/3527600906.mcb.200500012.
[12] A. Tahamtan and A. Ardebili, “Real-time RT-PCR in COVID-19 detection: issues affecting the results,” Expert Review of Molecular Diagnostics, vol. 20, no. 5, pp. 453–454, Apr. 2020, doi: 10.1080/14737159.2020.1757437.
[13] I. M. Artika, Y. P. Dewi, I. M. Nainggolan, J. E. Siregar, and U. Antonjaya, “Real-Time Polymerase Chain Reaction: Current Techniques, Applications, and Role in COVID-19 Diagnosis,” Genes, vol. 13, no. 12, p. 2387, Dec. 2022, doi: 10.3390/genes13122387.
[14] M. W. Pfaffl, “Quantification strategies in real-time PCR,” in A-Z of quantitative PCR, S. A. Bustin, Ed., California, USA: International University Line (IUL), 2004, pp. 87–112. Accessed: Apr. 15, 2023. [Online]. Available: https://www.gene-quantification.de/chapter-3-pfaffl.pdf
[15] J. S. Yuan, A. M. Reed, F. Chen, and C. N. Stewart, “Statistical analysis of real-time PCR data,” BMC Bioinformatics, vol. 7, no. 1, Feb. 2006, doi: 10.1186/1471-2105-7-85.
[16] J. L. Montgomery, L. N. Sanford, and C. T. Wittwer, “High-resolution DNA melting analysis in clinical research and diagnostics,” Expert Review of Molecular Diagnostics, vol. 10, no. 2, pp. 219–240, Mar. 2010, doi: 10.1586/erm.09.84.
[17] G. H. Reed, J. Kent, and C. T. Wittwer, “High-resolution DNA melting analysis for simple and efficient molecular diagnostics,” Pharmacogenomics, vol. 8, no. 6, pp. 597–608, Jun. 2007, doi: 10.2217/14622416.8.6.597.
[18] J. S. Farrar and C. T. Wittwer, “High-Resolution Melting Curve Analysis for Molecular Diagnostics,” Elsevier eBooks, pp. 79–102, Jan. 2017, doi: 10.1016/b978-0-12-802971-8.00006-7.
[19] R. H. a. M. Vossen, E. Aten, A. Roos, and J. T. D. Dunnen, “High-Resolution Melting Analysis (HRMA) - More than just sequence variant screening,” Human Mutation, vol. 30, no. 6, pp. 860–866, Jun. 2009, doi: 10.1002/humu.21019.
[20] J. L. Vaerman, P. Saussoy, and I. Ingargiola, “Evaluation of real-time PCR data,” Journal of Biological Regulators and Homeostatic Agents, vol. 18, no. 2, pp. 212–214, Apr. 2004.
[21] L. M. Sullivan, J. Weinberg, and J. F. Keaney, “Common Statistical Pitfalls in Basic Science Research,” Journal of the American Heart Association, vol. 5, no. 10, Oct. 2016, doi: 10.1161/jaha.116.004142.
[22] S. Prakash, “Statistical approaches to make sense of data in biology and medicine,” Indian Journal of Medical Sciences, vol. 74, pp. 103–105, Aug. 2022, doi: 10.25259/ijms_197_2021.
[23] M. W. Pfaffl, J. Vandesompele, and M. Kubista, “Data Analysis Software,” in Real-time PCR: Current Technology and Applications, J. Logan, K. J. Edwards, and N. A. Saunders, Eds., Caister Academic Press, 2009, pp. 65–83. [Online]. Available: https://www.gene-quantification.de/Pfaffl-Kubista-Vandesompele-real-time-PCR-chapter-5.pdf
[24] QIAGEN GmbH, QIAGEN Strasse 1, D-40724 Hilden. Rotor-Gene Q User Manual, Version 2, 2012. Accessed: Apr. 10, 2023. [Online]. Available: https://www.qiagen.com/us/resources/resourcedetail?id=d29cab50-f102-4faa-b453-4a57463610fa&lang=en
[25] QIAGEN GmbH, QIAGEN Strasse 1, D-40724 Hilden. Rotor-Gene ScreenClust HRM Software User Guide. Accessed: Apr. 10, 2023. [Online]. Available: https://www.qiagen.com/cn/resources/download.aspx?id=af33be05-14c6-4ac3-ace2-d85aa7ad0434&lang=en
[26] Bio-Rad Laboratories, Inc., Hercules, California, USA. CFX96™ and CFX384™ Real-Time PCR Detection Systems Instruction Manual. Accessed: Apr. 10, 2023. [Online]. Available: https://www.bio-rad.com/sites/default/files/webroot/web/pdf/lsr/literature/10010424.pdf
[27] Bio Molecular Systems (BMS), Upper Coomera QLD 4209, Australia. MIC User Manual, Version 1.2. Accessed: Apr. 10, 2023. [Online]. Available: https://biomolecularsystems.com/mic-qpcr/software/
[28] Thermo Fisher Scientific, Inc., Waltham, Massachusetts, USA. QuantStudio™ 5 Real-Time PCR Instrument.