Forrest Bao committed
Commit 2c44150
1 Parent(s): 8c8021f

polish the language for usage

Files changed (1)
  1. README.md +28 -37
README.md CHANGED
@@ -12,7 +12,7 @@ pipeline_tag: text-classification
 * HHEM-2.1-Open outperforms GPT-3.5-Turbo and even GPT-4.
 * HHEM-2.1-Open can run on consumer-grade hardware, using less than 600 MB of RAM at 32-bit precision and taking about 1.5 seconds for a 2k-token input on a modern x86 CPU.
 
- **To HHEM-1.0 users**: HHEM-2.1-Open introduces breaking changes to the usage. Please update your code according to the [new usage](#using-hhem-21-open) below. We are working on making it compatible with `transformers.pipeline` and HuggingFace's Inference Endpoint. We apologize for the inconvenience.
 
 HHEM-2.1-Open is a major upgrade to [HHEM-1.0-Open](https://huggingface.co/vectara/hallucination_evaluation_model/tree/hhem-1.0-open), created by [Vectara](https://vectara.com) in November 2023. The HHEM model series is designed for detecting hallucinations in LLMs. It is particularly useful in retrieval-augmented-generation (RAG) applications, where an LLM summarizes a set of facts and HHEM measures the extent to which that summary is factually consistent with the facts.
 
@@ -26,23 +26,24 @@ By "hallucinated" or "factually inconsistent", we mean that a text (hypothesis,
 A common type of hallucination in RAG is **factual but hallucinated**.
 For example, given the premise _"The capital of France is Berlin"_, the hypothesis _"The capital of France is Paris"_ is hallucinated -- although it is true in the real world. This happens when LLMs generate content not from the textual data provided to them as part of the RAG retrieval process, but from their pre-trained knowledge.
 
- ## Using HHEM-2.1-Open with `transformers`
 
- HHEM-2.1 has some breaking changes from HHEM-1.0. Your previous code will not work anymore. While we are working on backward compatibility, please follow the new usage instructions below.
 
- **Using with `Auto` class**
 
- HHEM-2.1 has some breaking changes from HHEM-1.0. Your previous code will not work anymore. While we are working on backward compatibility, please follow the new usage instructions below.
 
- HHEM-2.1-Open can be loaded easily using the `transformers` library. Just remember to set `trust_remote_code=True` to take advantage of the pre-/post-processing code we provided for your convenience. The **input** of the model is a list of (premise, hypothesis) pairs. For each pair, the model will **return** a score between 0 and 1, where 0 means that the hypothesis is not evidenced at all by the premise and 1 means the hypothesis is fully supported by the premise.
 
 ```python
 from transformers import AutoModelForSequenceClassification
 
- # Load the model
- model = AutoModelForSequenceClassification.from_pretrained(
-     'vectara/hallucination_evaluation_model', trust_remote_code=True)
-
 pairs = [ # Test data, List[Tuple[str, str]]
     ("The capital of France is Berlin.", "The capital of France is Paris."), # factual but hallucinated
     ('I am in California', 'I am in United States.'), # Consistent
@@ -53,20 +54,23 @@ pairs = [ # Test data, List[Tuple[str, str]]
     ("Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg.")
 ]
 
- # Use the model to predict
 model.predict(pairs) # note the predict() method. Do not do model(pairs).
 # tensor([0.0111, 0.6474, 0.1290, 0.8969, 0.1846, 0.0050, 0.0543])
 ```
 
- **Using with `text-classification` pipeline**
-
- Please note that when using the `text-classification` pipeline for prediction, two scores, one per label, will be returned for each pair. The score for the **consistent** label is the one to focus on.
 
 ```python
 from transformers import pipeline, AutoTokenizer
 
- pairs = [
     ("The capital of France is Berlin.", "The capital of France is Paris."),
     ('I am in California', 'I am in United States.'),
     ('I am in United States', 'I am in California.'),
@@ -76,7 +80,7 @@ pairs = [
     ("Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg.")
 ]
 
- # Apply prompt to pairs
 prompt = "<pad> Determine if the hypothesis is true given the premise?\n\nPremise: {text1}\n\nHypothesis: {text2}"
 input_pairs = [prompt.format(text1=pair[0], text2=pair[1]) for pair in pairs]
 
@@ -87,29 +91,16 @@ classifier = pipeline(
     tokenizer=AutoTokenizer.from_pretrained('google/flan-t5-base'),
     trust_remote_code=True
 )
- classifier(input_pairs, return_all_scores=True)
-
- # output
-
- # [[{'label': 'hallucinated', 'score': 0.9889384508132935},
- #   {'label': 'consistent', 'score': 0.011061512865126133}],
- #  [{'label': 'hallucinated', 'score': 0.35263675451278687},
- #   {'label': 'consistent', 'score': 0.6473632454872131}],
- #  [{'label': 'hallucinated', 'score': 0.870982825756073},
- #   {'label': 'consistent', 'score': 0.1290171593427658}],
- #  [{'label': 'hallucinated', 'score': 0.1030581071972847},
- #   {'label': 'consistent', 'score': 0.8969419002532959}],
- #  [{'label': 'hallucinated', 'score': 0.8153750896453857},
- #   {'label': 'consistent', 'score': 0.18462494015693665}],
- #  [{'label': 'hallucinated', 'score': 0.9949689507484436},
- #   {'label': 'consistent', 'score': 0.005031010136008263}],
- #  [{'label': 'hallucinated', 'score': 0.9456764459609985},
- #   {'label': 'consistent', 'score': 0.05432349815964699}]]
- ```
 
- You may run into a warning message that "Token indices sequence length is longer than the specified maximum sequence length". Please ignore this warning for now; it is inherited from the foundation model, T5-base.
 
- Note that the order of a pair matters. For example, the 2nd and 3rd examples in the `pairs` list are consistent and hallucinated, respectively.
 
  ## HHEM-2.1-Open vs. HHEM-1.0
 
 * HHEM-2.1-Open outperforms GPT-3.5-Turbo and even GPT-4.
 * HHEM-2.1-Open can run on consumer-grade hardware, using less than 600 MB of RAM at 32-bit precision and taking about 1.5 seconds for a 2k-token input on a modern x86 CPU.
 
+ > HHEM-2.1-Open introduces breaking changes to the usage. Please update your code according to the [new usage](#using-hhem-21-open) below. We are working on making it compatible with HuggingFace's Inference Endpoint. We apologize for the inconvenience.
 
 HHEM-2.1-Open is a major upgrade to [HHEM-1.0-Open](https://huggingface.co/vectara/hallucination_evaluation_model/tree/hhem-1.0-open), created by [Vectara](https://vectara.com) in November 2023. The HHEM model series is designed for detecting hallucinations in LLMs. It is particularly useful in retrieval-augmented-generation (RAG) applications, where an LLM summarizes a set of facts and HHEM measures the extent to which that summary is factually consistent with the facts.
 
 
 A common type of hallucination in RAG is **factual but hallucinated**.
 For example, given the premise _"The capital of France is Berlin"_, the hypothesis _"The capital of France is Paris"_ is hallucinated -- although it is true in the real world. This happens when LLMs generate content not from the textual data provided to them as part of the RAG retrieval process, but from their pre-trained knowledge.
 
+ Additionally, hallucination detection is "asymmetric", i.e., not commutative. For example, the hypothesis _"I visited Iowa"_ is considered hallucinated given the premise _"I visited the United States"_, but the reverse is consistent.
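For instance, a minimal sketch (using the `model.predict` API introduced in the usage section below; the expected direction of the scores is noted only in comments) illustrates how swapping the order of a pair changes the result:

```python
# Illustrative only: the same two sentences, with premise and hypothesis swapped.
# Assumes `model` has been loaded as shown in the usage section below.
model.predict([
    ("I visited the United States", "I visited Iowa"),   # hypothesis adds unsupported detail -> low score expected
    ("I visited Iowa", "I visited the United States"),   # hypothesis follows from the premise -> high score expected
])
```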
 
+ ## Using HHEM-2.1-Open
 
+ > HHEM-2.1 has some breaking changes from HHEM-1.0. Your code that works with HHEM-1.0 (November 2023) will not work anymore. While we are working on backward compatibility, please follow the new usage instructions below.
 
+ Here we provide several ways to use HHEM-2.1-Open with the `transformers` library.
+
+ > You may run into a warning message that "Token indices sequence length is longer than the specified maximum sequence length". Please ignore this warning; it is inherited from the foundation model, T5-base.
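If that warning clutters your logs, one way to silence it is to lower the `transformers` logging verbosity; this is a generic convenience sketch, not something required by HHEM:

```python
from transformers.utils import logging

# Optional: suppress informational/warning messages from transformers,
# including the tokenizer's sequence-length notice inherited from T5-base.
logging.set_verbosity_error()
```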
 
+ ### Using with `AutoModel`
+
+ This is the most end-to-end and out-of-the-box way to use HHEM-2.1-Open. It takes a list of (premise, hypothesis) pairs as input and returns a score between 0 and 1 for each pair, where 0 means that the hypothesis is not evidenced at all by the premise and 1 means the hypothesis is fully supported by the premise.
 
 ```python
 from transformers import AutoModelForSequenceClassification
 
 pairs = [ # Test data, List[Tuple[str, str]]
     ("The capital of France is Berlin.", "The capital of France is Paris."), # factual but hallucinated
     ('I am in California', 'I am in United States.'), # Consistent
 
     ("Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg.")
 ]
 
+ # Step 1: Load the model
+ model = AutoModelForSequenceClassification.from_pretrained(
+     'vectara/hallucination_evaluation_model', trust_remote_code=True)
+
+ # Step 2: Use the model to predict
 model.predict(pairs) # note the predict() method. Do not do model(pairs).
 # tensor([0.0111, 0.6474, 0.1290, 0.8969, 0.1846, 0.0050, 0.0543])
 ```
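If you need a hard consistent/hallucinated decision rather than a raw score, you can threshold the scores yourself. The sketch below uses 0.5 as an illustrative cutoff; that value is an assumption for demonstration, not an officially recommended threshold:

```python
# Illustrative post-processing of the scores returned by model.predict(pairs).
scores = model.predict(pairs)  # torch tensor of floats in [0, 1]
labels = ["consistent" if s >= 0.5 else "hallucinated" for s in scores.tolist()]
for pair, label, score in zip(pairs, labels, scores.tolist()):
    print(f"{label:12s} {score:.4f}  {pair}")
```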
 
+ ### Using with `pipeline`
 
+ With the popular `pipeline` class of the `transformers` library, you have to manually prepare the data using the prompt template with which we trained the model. HHEM-2.1-Open has two output neurons, corresponding to the labels `hallucinated` and `consistent`, respectively. In the example below, we ask `pipeline` to return the scores for both labels (by setting `top_k=None`, formerly `return_all_scores=True`) and then extract the score for the `consistent` label.
 
 ```python
 from transformers import pipeline, AutoTokenizer
 
+ pairs = [ # Test data, List[Tuple[str, str]]
     ("The capital of France is Berlin.", "The capital of France is Paris."),
     ('I am in California', 'I am in United States.'),
     ('I am in United States', 'I am in California.'),
 
     ("Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg.")
 ]
 
+ # Prompt the pairs
 prompt = "<pad> Determine if the hypothesis is true given the premise?\n\nPremise: {text1}\n\nHypothesis: {text2}"
 input_pairs = [prompt.format(text1=pair[0], text2=pair[1]) for pair in pairs]
 
     tokenizer=AutoTokenizer.from_pretrained('google/flan-t5-base'),
     trust_remote_code=True
 )
+ full_scores = classifier(input_pairs, top_k=None) # List[List[Dict[str, float]]]
+
+ # Optional: Extract the scores for the 'consistent' label
+ simple_scores = [score_dict['score'] for score_for_both_labels in full_scores for score_dict in score_for_both_labels if score_dict['label'] == 'consistent']
+
+ print(simple_scores)
+ # Expected output: [0.011061512865126133, 0.6473632454872131, 0.1290171593427658, 0.8969419002532959, 0.18462494015693665, 0.005031010136008263, 0.05432349815964699]
+ ```
 
+ Of course, with `pipeline`, you can also get the most likely label, i.e., the label with the highest score, by setting `top_k=1`.
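For instance, a minimal sketch reusing the `classifier` and `input_pairs` defined above (the exact return shape varies across `transformers` versions):

```python
# Ask the pipeline for only the highest-scoring label per input.
top_preds = classifier(input_pairs, top_k=1)

for pred in top_preds:
    # Recent transformers versions return a one-element list of dicts per input
    # when top_k is passed explicitly; older versions may return a bare dict.
    pred = pred[0] if isinstance(pred, list) else pred
    print(pred["label"], round(pred["score"], 4))
```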
 
  ## HHEM-2.1-Open vs. HHEM-1.0