m7n committed
Commit e5f1915 • 1 Parent(s): 824740b

Update app.py

Files changed (1)
  1. app.py +20 -13
app.py CHANGED
@@ -129,13 +129,13 @@ def simulate_applicant_judging(num_applicants=100, num_judges=10, ratings_per_ap
 
     return df
 
- df_results = simulate_applicant_judging(num_applicants=100, num_judges=10, ratings_per_applicant=5,alpha=0, loc=50, scale=15, judge_error=1, judges_attitude=0.3,
-                                          judgment_coarse_graining=10)
- df_results.head(30)
 
 
 
- df_results.sort_values(by='Rank of Evaluation').head(30)
 
 
@@ -187,10 +187,10 @@ def summarize_simulation_runs(num_runs, num_applicants, num_judges, ratings_per_
     return top_n_counts
 
 # Example usage
- num_runs = 100 # Number of simulation runs
- top_n_results = summarize_simulation_runs(num_runs=num_runs, num_applicants=100, num_judges=5, ratings_per_applicant=3,
-                                            top_n=5, alpha=0, loc=50, scale=15, judge_error=4, judges_attitude=0.3, judgment_coarse_graining=False)
- top_n_results
 
 
@@ -239,6 +239,8 @@ def plot_top_n_results(top_n_results, num_runs):
     else:
         ax.legend().set_visible(False)
     plt.tight_layout()
     return fig
 
     # plt.show()
@@ -368,6 +370,7 @@ def plot_skewed_normal_distribution(alpha, loc, scale, judge_error,judgement_var
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
     plt.text(np.mean(line_x), np.max(y_pop_dist)*.08 + np.max(small_dist_y_scaled) , 'Judgement Variability', ha='center', va='bottom', color='black',weight='bold',)
 
 
@@ -377,6 +380,7 @@ def plot_skewed_normal_distribution(alpha, loc, scale, judge_error,judgement_var
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.text(25, np.max(y_pop_dist)*.05 + np.max(small_dist_y_scaled), 'Most Harsh', ha='center', va='bottom', color='black')
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
 
     # Small Normal Distribution, generous judge
     small_dist_x = np.linspace(75 - 3*std_dev, 75 + 3*std_dev, 100) # 3 standard deviations on each side
@@ -384,6 +388,9 @@ def plot_skewed_normal_distribution(alpha, loc, scale, judge_error,judgement_var
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.text(75, np.max(y_pop_dist)*.05 + np.max(small_dist_y_scaled) , 'Most Generous', ha='center', va='bottom', color='black')
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
 
     return fig
 
@@ -404,8 +411,7 @@ _by [Max Noichl](https://homepage.univie.ac.at/maximilian.noichl/)_
 
  One of the central experiences of being an academic is the experience of being ranked. We are ranked when we submit abstracts for conferences, when we publish papers in ranked journals, when we apply for graduate school or maybe already for a master's degree, and when we apply for faculty positions, at departments, which are, of course, ranked as well. But although rankings are catnip to academics (and presumably everybody else as well), most people probably share the intuition that there's something weird or iffy about rankings and may suspect that often they are not as informative as their prevalence suggests.
 
- The simulation, which you can run yourself below, should give some further insight into that perceived weirdness. It involves setting up a population of candidates (like conference abstracts or job applicants) according to a distribution which you specify. Candidates with objective qualities are then sampled from this distribution and evaluated by judges, who have a certain error margin in their assessments, reflecting real-world inaccuracies. The simulation averages the ratings each candidate receives. The results show that in many plausible scenarios the best candidates often don't make it into the top of the rankings, highlighting the rankings limited value for accurately identifying the most qualified individuals. Below, I give some more detailed explanation – but first, the simulation.
- """
 
 
  explanation_md = """
@@ -413,7 +419,7 @@ The simulation works like this: We set up a population of candidates. These coul
 
  Applicant contributions are evaluated by a **Number of Judges**, who evenly distribute the task of rating until all applicants have received a rating. Importantly, like in the real world, judges are not infallible; they make errors in assessing applicants. This is adjustable via the **Judge Error** property, which represents the width of the 95% confidence interval of a distribution centered around the applicant's objective quality. For example, if an applicant has an objective quality score of 70 and the **Judge Error** is set to 4, it means that in 95% of cases, judges are expected to score the applicant between 68 and 72. In most cases of academic ratings, it seems implausible that people can reliably distinguish closely scored candidates (e.g., 75 vs. 77). Depending on your application, you can therefore experiment with different plausible error-ranges for your specific scenario.
 
- Judges' rating styles naturally vary. Some may be more critical, rating applicants harshly, while others may be more lenient. This variability is adjustable in the simulation through the **Judges' Attitude Range**, which defines the maximum and minimum levels of strictness or leniency. The impact of these settings is visually represented in the simulation by the black distributions within the population-distribution-graphic on the right.
 
  Once all evaluations are complete, each applicant's final score is determined by averaging all their received ratings. Additionally, the simulation allows for coarse-grained judgments, where judges have to select the most appropriate score on a predefined scale (e.g., 0 to 7, determined by the **Coarse Graining Factor**). In that case (a very common practice), we do lose quite some information about the quality of the candidates.
 
@@ -446,7 +452,7 @@ with gr.Blocks(theme=gr.themes.Monochrome()) as demo:
     top_n = gr.Slider(1, 40, step=1, value=5, label="Top N", info='How many candidates can be selected.')
 
     judge_error = gr.Slider(0, 20, step=1, value=7, label="Judge Error", info='How much error judges can plausibly commit in their ratings.')
-    judges_attitude = gr.Slider(0, 10, step=.1, value=1.7, label="Judges attitude-range", info='How harsh/generous individual judges can be. (Max. skewness of their error distributions)')
 
     judgment_coarse_graining_true_false = gr.Checkbox(value= True, label="Coarse grain judgements.", info='Whether judgements are made on a coarser scale.')
     judgment_coarse_graining = gr.Slider(0, 30, step=1, value=7, label="Coarse Graining Factor", info='Number of ratings on the judgement-scale.')
@@ -460,7 +466,8 @@ with gr.Blocks(theme=gr.themes.Monochrome()) as demo:
     population_plot = gr.Plot(label="Population",render=True)
     gr.Markdown("""**Applicants quality distribution & judgement errors** – Above you can see in red the distribution from which we draw the real qualities of our applicants.
     You can alter its **Mean, Scale and Skewness** on the left side. You can also see how large the errors are,
-    which our judges commit, and how harshly the most harsh and most generous judges judge.
     You can alter these values by playing with the **Judge Error** and the **Judge's attitude range** on the left.""")
     # with gr.Group():
     # Your existing plot output
 
 
     return df
 
+ # df_results = simulate_applicant_judging(num_applicants=100, num_judges=10, ratings_per_applicant=5,alpha=0, loc=50, scale=15, judge_error=1, judges_attitude=0.3,
+ #                                          judgment_coarse_graining=10)
+ # df_results.head(30)
 
 
 
+ # df_results.sort_values(by='Rank of Evaluation').head(30)
 
 
 
     return top_n_counts
 
 # Example usage
+ # num_runs = 100 # Number of simulation runs
+ # top_n_results = summarize_simulation_runs(num_runs=num_runs, num_applicants=100, num_judges=5, ratings_per_applicant=3,
+ #                                            top_n=5, alpha=0, loc=50, scale=15, judge_error=4, judges_attitude=0.3, judgment_coarse_graining=False)
+ # top_n_results
 
 
 
     else:
         ax.legend().set_visible(False)
     plt.tight_layout()
+    plt.close()
+
     return fig
 
     # plt.show()
 
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
     plt.text(np.mean(line_x), np.max(y_pop_dist)*.08 + np.max(small_dist_y_scaled) , 'Judgement Variability', ha='center', va='bottom', color='black',weight='bold',)
+    plt.plot([50,50], [0, np.max(small_dist_y_scaled)], color='black', linewidth=1.4, linestyle='dotted')
 
 
 
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.text(25, np.max(y_pop_dist)*.05 + np.max(small_dist_y_scaled), 'Most Harsh', ha='center', va='bottom', color='black')
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
+    plt.plot([25,25], [0, np.max(small_dist_y_scaled)], color='black', linewidth=1.4, linestyle='dotted')
 
     # Small Normal Distribution, generous judge
     small_dist_x = np.linspace(75 - 3*std_dev, 75 + 3*std_dev, 100) # 3 standard deviations on each side
 
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.text(75, np.max(y_pop_dist)*.05 + np.max(small_dist_y_scaled) , 'Most Generous', ha='center', va='bottom', color='black')
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
+    plt.plot([75,75], [0, np.max(small_dist_y_scaled)], color='black', linewidth=1.4, linestyle='dotted')
+
+    plt.close()
 
     return fig
 
 
 
  One of the central experiences of being an academic is the experience of being ranked. We are ranked when we submit abstracts for conferences, when we publish papers in ranked journals, when we apply for graduate school or maybe already for a master's degree, and when we apply for faculty positions, at departments, which are, of course, ranked as well. But although rankings are catnip to academics (and presumably everybody else as well), most people probably share the intuition that there's something weird or iffy about rankings and may suspect that often they are not as informative as their prevalence suggests.
 
+ The simulation, which you can run yourself below, should give some further insight into that perceived weirdness. It involves setting up a population of candidates (like conference abstracts or job applicants) according to a distribution which you specify. Candidates with objective qualities are then sampled from this distribution and evaluated by judges, who have a certain error margin in their assessments as well as some variability in their harshness or generosity, reflecting real-world inaccuracies. The simulation averages the ratings each candidate receives. The results show that in many plausible scenarios the best candidates often don't make it into the top of the rankings, highlighting the rankings' limited value for accurately identifying the most qualified individuals. Below, I give some more detailed explanation – but first, the simulation."""
 
 
 
 explanation_md = """
 
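The new intro paragraph above sketches the core of the simulation: draw true qualities from a (possibly skewed) population distribution, let several error-prone judges rate each applicant, average those ratings, and rank by the average. Here is a minimal, self-contained sketch of that loop, assuming plain Gaussian judge error and ignoring attitudes and coarse graining; the names (`true_quality`, `mean_rating`) are illustrative and not identifiers from app.py.

```python
# Minimal sketch of the sample -> rate -> average -> rank loop (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

num_applicants = 100
ratings_per_applicant = 5         # how many ratings each applicant receives
judge_error = 7                   # width of the 95% CI of a single rating
sigma = judge_error / (2 * 1.96)  # convert the CI width into a standard deviation

# True qualities drawn from a skew-normal population (a=0 gives an ordinary normal).
true_quality = stats.skewnorm.rvs(a=0, loc=50, scale=15, size=num_applicants, random_state=rng)

# Each rating is the true quality plus judge noise; the final score is the mean rating.
noise = rng.normal(0, sigma, size=(num_applicants, ratings_per_applicant))
mean_rating = (true_quality[:, None] + noise).mean(axis=1)

# Compare who is objectively in the top 5 with who the evaluation puts in the top 5.
top5_true = set(np.argsort(-true_quality)[:5])
top5_eval = set(np.argsort(-mean_rating)[:5])
print(f"Truly-top-5 applicants recovered by the evaluation: {len(top5_true & top5_eval)} of 5")
```

Increasing `judge_error` in this sketch shrinks the overlap between the true top 5 and the evaluated top 5 on average, which is the effect the full simulation quantifies over many runs.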
 
  Applicant contributions are evaluated by a **Number of Judges**, who evenly distribute the task of rating until all applicants have received a rating. Importantly, like in the real world, judges are not infallible; they make errors in assessing applicants. This is adjustable via the **Judge Error** property, which represents the width of the 95% confidence interval of a distribution centered around the applicant's objective quality. For example, if an applicant has an objective quality score of 70 and the **Judge Error** is set to 4, it means that in 95% of cases, judges are expected to score the applicant between 68 and 72. In most cases of academic ratings, it seems implausible that people can reliably distinguish closely scored candidates (e.g., 75 vs. 77). Depending on your application, you can therefore experiment with different plausible error-ranges for your specific scenario.
 
+ Judges' rating styles naturally vary. Some may be more critical, rating applicants harshly, while others may be more lenient. This variability is adjustable in the simulation through the **Judges' Attitude Range**. Each judge is assigned a random attitude value drawn from the range between this number and its negative, which then sets the skew of the skewed normal distribution (centered around the true quality) from which their assessments are drawn. The impact of these settings is visually represented in the simulation by the black distributions within the population-distribution graphic on the right.
 
  Once all evaluations are complete, each applicant's final score is determined by averaging all their received ratings. Additionally, the simulation allows for coarse-grained judgments, where judges have to select the most appropriate score on a predefined scale (e.g., 0 to 7, determined by the **Coarse Graining Factor**). In that case (a very common practice), we do lose quite some information about the quality of the candidates.
 
 
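The three controls described in the explanation text above (**Judge Error**, **Judges' Attitude Range**, **Coarse Graining Factor**) can be read as a recipe for producing a single rating. Below is one plausible reading, not the implementation in app.py: the helper name `rate_once`, the sign convention for harsh versus generous attitudes, and the 0–100 snapping used for coarse graining are assumptions.

```python
# Illustrative sketch of a single judgement under assumed conventions (not app.py code).
import numpy as np
from scipy import stats

def rate_once(true_quality, judge_error=4, attitude=0.0, coarse_graining=None, rng=None):
    """One judge rates one applicant.

    judge_error     -- width of the 95% CI of the rating (4 -> roughly 68..72 for a 70).
    attitude        -- skew of the judge's error distribution; negative = harsh,
                       positive = generous (assumed sign convention).
    coarse_graining -- if set, snap the rating onto an integer 0..coarse_graining scale
                       (assuming qualities live on a 0..100 scale).
    """
    rng = rng if rng is not None else np.random.default_rng()
    sigma = judge_error / (2 * 1.96)  # 95% CI width -> standard deviation
    # Shift the skew-normal so its mean sits on the applicant's true quality.
    offset = stats.skewnorm.mean(attitude, loc=0, scale=sigma)
    rating = stats.skewnorm.rvs(attitude, loc=true_quality - offset, scale=sigma, random_state=rng)
    if coarse_graining:
        rating = float(np.round(np.clip(rating, 0, 100) / 100 * coarse_graining))
    return rating

# A harsh judge (attitude -1.7) scoring a 70-quality applicant on an eight-point (0..7) scale.
print(rate_once(70, judge_error=4, attitude=-1.7, coarse_graining=7))
```

For the worked example in the text: with a Judge Error of 4, the implied standard deviation is 4 / (2 × 1.96) ≈ 1.02, so roughly 95% of a judge's ratings for a 70-quality applicant fall between about 68 and 72.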
     top_n = gr.Slider(1, 40, step=1, value=5, label="Top N", info='How many candidates can be selected.')
 
     judge_error = gr.Slider(0, 20, step=1, value=7, label="Judge Error", info='How much error judges can plausibly commit in their ratings.')
+    judges_attitude = gr.Slider(0, 10, step=.1, value=1.7, label="Judges' attitude-range", info='How harsh/generous individual judges can be at maximum. (Skewness of their error distributions)')
 
     judgment_coarse_graining_true_false = gr.Checkbox(value= True, label="Coarse grain judgements.", info='Whether judgements are made on a coarser scale.')
     judgment_coarse_graining = gr.Slider(0, 30, step=1, value=7, label="Coarse Graining Factor", info='Number of ratings on the judgement-scale.')
 
     population_plot = gr.Plot(label="Population",render=True)
     gr.Markdown("""**Applicants quality distribution & judgement errors** – Above you can see in red the distribution from which we draw the real qualities of our applicants.
     You can alter its **Mean, Scale and Skewness** on the left side. You can also see how large the errors are,
+    which our judges can potentially commit, and how strongly the harshest (left) and the most generous (right) judges can skew the assessments.
+    The example candidates' true scores are shown by vertical dotted lines.
     You can alter these values by playing with the **Judge Error** and the **Judge's attitude range** on the left.""")
     # with gr.Group():
     # Your existing plot output