mtyrrell committed
Commit 59ac39c · 1 Parent(s): fa42253
app.py CHANGED
@@ -11,7 +11,7 @@ def is_installed(package_name, version):
11
  return False
12
 
13
  # shifted from below - this must be the first streamlit call; otherwise: problems
14
- st.set_page_config(page_title = 'Vulnerability Analysis',
15
  initial_sidebar_state='expanded', layout="wide")
16
 
17
  @st.cache_resource # cache the function so it's not called every time app.py is triggered
@@ -74,15 +74,31 @@ with st.container():
74
  with st.expander("ℹ️ - About this app", expanded=False):
75
  st.write(
76
  """
77
- The Vulnerability Analysis App is an open-source\
78
- digital tool which aims to assist policy analysts and \
79
- other users in extracting and filtering references \
80
- to different groups in vulnerable situations from public documents. \
81
- We use Natural Language Processing (NLP), specifically deep \
82
- learning-based text representations to search context-sensitively \
83
- for mentions of the special needs of groups in vulnerable situations
84
- to cluster them thematically.
85
- For more understanding on Methodology [Click Here](https://vulnerability-analysis.streamlit.app/)
86
  """)
87
 
88
  st.write("""
 
11
  return False
12
 
13
  # shifted from below - this must be the first streamlit call; otherwise: problems
14
+ st.set_page_config(page_title = 'Vulnerability Analysis - EVALUATION PIPELINE',
15
  initial_sidebar_state='expanded', layout="wide")
16
 
17
  @st.cache_resource # cache the function so it's not called every time app.py is triggered
 
74
  with st.expander("ℹ️ - About this app", expanded=False):
75
  st.write(
76
  """
77
+ Pipeline for automated evaluation of vulnerability and target classifications using GPT-4o as judge. The pipeline is integrated into a hacked version of the app, so you run it as normal (not yet pushed to HF, as there are some dependency issues). It performs the classifications and summarizations, then sends the full dataframe of classified paragraphs (i.e. not filtered) to OpenAI using crafted prompts. This happens twice for each row in the dataframe: once for vulnerabilities and once for target. You then get the option to download an Excel file containing 3 sheets:
78
+ * Meta: document name (using doc code3 as per the master Excel 'vul_africa_01')
79
+ * Summary: summarizations
80
+ * Results: shows each paragraph, classifications, and automated evals:
81
+ * VC_prob: % probability that the vulnerability classification is True (using logprobs output from GPT-4o)
82
+ * VC_keywords: fuzzy-matching index from 0 to 1 reflecting alignment with the label text (Levenshtein distance). Included as a secondary measure because GPT-4o understandably struggles with some of the vulnerability classifications.
83
+ * VC_eval: Boolean based on VC_prob > 0.5 OR VC_keywords > 0
84
+ * TMA_prob: % probability that the target classification is True (using logprobs output from GPT-4o)
85
+ * TMA_eval: Boolean based on TMA_prob > 0.5
86
+ * VC_check: used for manually noting corrections
87
+ * TMA_check: likewise, for manually noting target-label corrections
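+ The three-sheet download can be sketched roughly with pandas' `ExcelWriter` (the frame contents and output path below are placeholders, not the app's actual code):
+
+ ```python
+ import os
+ import tempfile
+ import pandas as pd
+
+ # placeholder frames; the real ones come from the classification pipeline
+ meta = pd.DataFrame({"doc_code": ["vul_africa_01"]})
+ summary = pd.DataFrame({"summary": ["..."]})
+ results = pd.DataFrame({"paragraph": ["..."], "VC_prob": [0.9], "TMA_prob": [0.8]})
+
+ sheets = {"Meta": meta, "Summary": summary, "Results": results}
+ path = os.path.join(tempfile.mkdtemp(), "evaluation.xlsx")
+ with pd.ExcelWriter(path) as writer:
+     for name, frame in sheets.items():
+         frame.to_excel(writer, sheet_name=name, index=False)
+ ```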
88
+
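+ The per-row probability and keyword scores above can be sketched like this (hypothetical helper names; `SequenceMatcher` stands in for a dedicated Levenshtein library, and the real prompts/response parsing live in the app):
+
+ ```python
+ import math
+ from difflib import SequenceMatcher  # stand-in for a Levenshtein library
+
+ def true_probability(logprob_true: float) -> float:
+     # GPT-4o's logprobs output is a log-probability; exponentiate to get VC_prob/TMA_prob
+     return math.exp(logprob_true)
+
+ def keyword_score(paragraph: str, label_text: str) -> float:
+     # fuzzy 0-1 index reflecting alignment between paragraph and label text
+     return SequenceMatcher(None, paragraph.lower(), label_text.lower()).ratio()
+
+ def vc_eval(vc_prob: float, vc_keywords: float) -> bool:
+     # Boolean flag as described above: VC_prob > 0.5 OR VC_keywords > 0
+     return vc_prob > 0.5 or vc_keywords > 0
+ ```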
89
+ Evaluation with GPT-4o-as-judge: to clarify, the automated pipeline is not 100% trustworthy, so the 'FALSE' tags were used only as a starting point.
90
+ The complete protocol is as follows:
91
+ 1. VC_eval == 'FALSE': manually check vulnerability labels that are suspect
92
+ 2. VC_eval == 'TRUE' AND VC_prob < 0.9: manually check all remaining vulnerability labels where GPT-4o was not very certain (in some cases I also used VC_keywords to filter further down when a lot of samples were returned)
93
+ 3. TMA_eval == 'FALSE': manually check target labels that are suspect
94
+ 4. TMA_eval == 'TRUE' AND TMA_prob < 0.9: manually check all remaining target labels where GPT-4o was not very certain.
95
+ 5. If incorrect classification: enter corrected value in 'VC_check' and 'TMA_check' columns.
96
+
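+ Steps 1-4 of the protocol amount to a single filter over the Results sheet; a minimal pandas sketch (column names as in the sheet, threshold from the protocol, function name hypothetical):
+
+ ```python
+ import pandas as pd
+
+ def review_queue(df: pd.DataFrame, prob_threshold: float = 0.9) -> pd.DataFrame:
+     # Suspect labels (eval == False) plus labels confirmed without much
+     # certainty (eval == True but prob below the threshold).
+     vc_flag = ~df["VC_eval"] | (df["VC_eval"] & (df["VC_prob"] < prob_threshold))
+     tma_flag = ~df["TMA_eval"] | (df["TMA_eval"] & (df["TMA_prob"] < prob_threshold))
+     return df[vc_flag | tma_flag]
+ ```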
97
+ Takeaways from evaluation:
98
+ * It appears the classifiers suffer performance degradation on French-language source documents
99
+ * In particular, the vulnerability classifier had issues
100
+ * The target classifier returns a lot of false negatives in all languages
101
+ * The GPT-4o pipeline is a useful tool for the assessment, but only in terms of increasing accuracy over random sampling. It still takes time to review each document.
102
  """)
103
 
104
  st.write("""
appStore/__pycache__/rag.cpython-310.pyc CHANGED
Binary files a/appStore/__pycache__/rag.cpython-310.pyc and b/appStore/__pycache__/rag.cpython-310.pyc differ
 
appStore/__pycache__/target.cpython-310.pyc CHANGED
Binary files a/appStore/__pycache__/target.cpython-310.pyc and b/appStore/__pycache__/target.cpython-310.pyc differ
 
utils/__pycache__/target_classifier.cpython-310.pyc CHANGED
Binary files a/utils/__pycache__/target_classifier.cpython-310.pyc and b/utils/__pycache__/target_classifier.cpython-310.pyc differ