m7n committed
Commit e5f1915 • 1 Parent(s): 824740b

Update app.py

Files changed (1)
  1. app.py +20 -13
app.py CHANGED
@@ -129,13 +129,13 @@ def simulate_applicant_judging(num_applicants=100, num_judges=10, ratings_per_ap
 
     return df
 
- df_results = simulate_applicant_judging(num_applicants=100, num_judges=10, ratings_per_applicant=5,alpha=0, loc=50, scale=15, judge_error=1, judges_attitude=0.3,
-                                          judgment_coarse_graining=10)
- df_results.head(30)
 
 
 
- df_results.sort_values(by='Rank of Evaluation').head(30)
 
 
@@ -187,10 +187,10 @@ def summarize_simulation_runs(num_runs, num_applicants, num_judges, ratings_per_
     return top_n_counts
 
 # Example usage
- num_runs = 100 # Number of simulation runs
- top_n_results = summarize_simulation_runs(num_runs=num_runs, num_applicants=100, num_judges=5, ratings_per_applicant=3,
-                                            top_n=5, alpha=0, loc=50, scale=15, judge_error=4, judges_attitude=0.3, judgment_coarse_graining=False)
- top_n_results
 
 
@@ -239,6 +239,8 @@ def plot_top_n_results(top_n_results, num_runs):
     else:
         ax.legend().set_visible(False)
     plt.tight_layout()
     return fig
 
     # plt.show()
@@ -368,6 +370,7 @@ def plot_skewed_normal_distribution(alpha, loc, scale, judge_error,judgement_var
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
     plt.text(np.mean(line_x), np.max(y_pop_dist)*.08 + np.max(small_dist_y_scaled) , 'Judgement Variability', ha='center', va='bottom', color='black',weight='bold',)
 
 
@@ -377,6 +380,7 @@ def plot_skewed_normal_distribution(alpha, loc, scale, judge_error,judgement_var
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.text(25, np.max(y_pop_dist)*.05 + np.max(small_dist_y_scaled), 'Most Harsh', ha='center', va='bottom', color='black')
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
 
     # Small Normal Distribution, generous judge
     small_dist_x = np.linspace(75 - 3*std_dev, 75 + 3*std_dev, 100) # 3 standard deviations on each side
@@ -384,6 +388,9 @@ def plot_skewed_normal_distribution(alpha, loc, scale, judge_error,judgement_var
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.text(75, np.max(y_pop_dist)*.05 + np.max(small_dist_y_scaled) , 'Most Generous', ha='center', va='bottom', color='black')
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
 
     return fig
 
@@ -404,8 +411,7 @@ _by [Max Noichl](https://homepage.univie.ac.at/maximilian.noichl/)_
 
  One of the central experiences of being an academic is the experience of being ranked. We are ranked when we submit abstracts for conferences, when we publish papers in ranked journals, when we apply for graduate school or maybe already for a master's degree, and when we apply for faculty positions, at departments, which are, of course, ranked as well. But although rankings are catnip to academics (and presumably everybody else as well), most people probably share the intuition that there's something weird or iffy about rankings and may suspect that often they are not as informative as their prevalence suggests.
 
- The simulation, which you can run yourself below, should give some further insight into that perceived weirdness. It involves setting up a population of candidates (like conference abstracts or job applicants) according to a distribution which you specify. Candidates with objective qualities are then sampled from this distribution and evaluated by judges, who have a certain error margin in their assessments, reflecting real-world inaccuracies. The simulation averages the ratings each candidate receives. The results show that in many plausible scenarios the best candidates often don't make it into the top of the rankings, highlighting the rankings limited value for accurately identifying the most qualified individuals. Below, I give some more detailed explanation – but first, the simulation.
- """
 
 
  explanation_md = """
@@ -413,7 +419,7 @@ The simulation works like this: We set up a population of candidates. These coul
 
  Applicant contributions are evaluated by a **Number of Judges**, who evenly distribute the task of rating until all applicants have received a rating. Importantly, like in the real world, judges are not infallible; they make errors in assessing applicants. This is adjustable via the **Judge Error** property, which represents the width of the 95% confidence interval of a distribution centered around the applicant's objective quality. For example, if an applicant has an objective quality score of 70 and the **Judge Error** is set to 4, it means that in 95% of cases, judges are expected to score the applicant between 68 and 72. In most cases of academic ratings, it seems implausible that people can reliably distinguish closely scored candidates (e.g., 75 vs. 77). Depending on your application, you can therefore experiment with different plausible error-ranges for your specific scenario.
 
- Judges' rating styles naturally vary. Some may be more critical, rating applicants harshly, while others may be more lenient. This variability is adjustable in the simulation through the **Judges' Attitude Range**, which defines the maximum and minimum levels of strictness or leniency. The impact of these settings is visually represented in the simulation by the black distributions within the population-distribution-graphic on the right.
 
  Once all evaluations are complete, each applicant's final score is determined by averaging all their received ratings. Additionally, the simulation allows for coarse-grained judgments, where judges have to select the most appropriate score on a predefined scale (e.g., 0 to 7, determined by the **Coarse Graining Factor**). In that case (a very common practice), we do lose quite some information about the quality of the candidates.
 
@@ -446,7 +452,7 @@ with gr.Blocks(theme=gr.themes.Monochrome()) as demo:
     top_n = gr.Slider(1, 40, step=1, value=5, label="Top N", info='How many candidates can be selected.')
 
     judge_error = gr.Slider(0, 20, step=1, value=7, label="Judge Error", info='How much error judges can plausibly commit in their ratings.')
-    judges_attitude = gr.Slider(0, 10, step=.1, value=1.7, label="Judges attitude-range", info='How harsh/generous individual judges can be. (Max. skewness of their error distributions)')
 
     judgment_coarse_graining_true_false = gr.Checkbox(value= True, label="Coarse grain judgements.", info='Whether judgements are made on a coarser scale.')
     judgment_coarse_graining = gr.Slider(0, 30, step=1, value=7, label="Coarse Graining Factor", info='Number of ratings on the judgement-scale.')
@@ -460,7 +466,8 @@ with gr.Blocks(theme=gr.themes.Monochrome()) as demo:
     population_plot = gr.Plot(label="Population",render=True)
     gr.Markdown("""**Applicants quality distribution & judgement errors** – Above you can see in red the distribution from which we draw the real qualities of our applicants.
     You can alter its **Mean, Scale and Skewness** on the left side. You can also see how large the errors are,
-    which our judges commit, and how harshly the most harsh and most generous judges judge.
     You can alter these values by playing with the **Judge Error** and the **Judge's attitude range** on the left.""")
     # with gr.Group():
     # Your existing plot output
 
 
     return df
 
+ # df_results = simulate_applicant_judging(num_applicants=100, num_judges=10, ratings_per_applicant=5,alpha=0, loc=50, scale=15, judge_error=1, judges_attitude=0.3,
+ #                                          judgment_coarse_graining=10)
+ # df_results.head(30)
 
 
 
+ # df_results.sort_values(by='Rank of Evaluation').head(30)
 
 
 
     return top_n_counts
 
 # Example usage
+ # num_runs = 100 # Number of simulation runs
+ # top_n_results = summarize_simulation_runs(num_runs=num_runs, num_applicants=100, num_judges=5, ratings_per_applicant=3,
+ #                                            top_n=5, alpha=0, loc=50, scale=15, judge_error=4, judges_attitude=0.3, judgment_coarse_graining=False)
+ # top_n_results
 
 
 
     else:
         ax.legend().set_visible(False)
     plt.tight_layout()
+    plt.close()
+
     return fig
 
     # plt.show()
 
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
     plt.text(np.mean(line_x), np.max(y_pop_dist)*.08 + np.max(small_dist_y_scaled) , 'Judgement Variability', ha='center', va='bottom', color='black',weight='bold',)
+    plt.plot([50,50], [0, np.max(small_dist_y_scaled)], color='black', linewidth=1.4, linestyle='dotted')
 
 
 
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.text(25, np.max(y_pop_dist)*.05 + np.max(small_dist_y_scaled), 'Most Harsh', ha='center', va='bottom', color='black')
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
+    plt.plot([25,25], [0, np.max(small_dist_y_scaled)], color='black', linewidth=1.4, linestyle='dotted')
 
     # Small Normal Distribution, generous judge
     small_dist_x = np.linspace(75 - 3*std_dev, 75 + 3*std_dev, 100) # 3 standard deviations on each side
 
     small_dist_y_scaled = small_dist_y / max(small_dist_y) * np.max(y_pop_dist)*.12 # Scale down for representation
     plt.text(75, np.max(y_pop_dist)*.05 + np.max(small_dist_y_scaled) , 'Most Generous', ha='center', va='bottom', color='black')
     plt.plot(small_dist_x, small_dist_y_scaled, color='black', linewidth=2)
+    plt.plot([75,75], [0, np.max(small_dist_y_scaled)], color='black', linewidth=1.4, linestyle='dotted')
+
+    plt.close()
 
     return fig
 
 
 
  One of the central experiences of being an academic is the experience of being ranked. We are ranked when we submit abstracts for conferences, when we publish papers in ranked journals, when we apply for graduate school or maybe already for a master's degree, and when we apply for faculty positions, at departments, which are, of course, ranked as well. But although rankings are catnip to academics (and presumably everybody else as well), most people probably share the intuition that there's something weird or iffy about rankings and may suspect that often they are not as informative as their prevalence suggests.
 
+ The simulation, which you can run yourself below, should give some further insight into that perceived weirdness. It involves setting up a population of candidates (like conference abstracts or job applicants) according to a distribution which you specify. Candidates with objective qualities are then sampled from this distribution and evaluated by judges, who have a certain error margin in their assessments as well as some variability in their harshness or generosity, reflecting real-world inaccuracies. The simulation averages the ratings each candidate receives. The results show that in many plausible scenarios the best candidates often don't make it into the top of the rankings, highlighting the rankings' limited value for accurately identifying the most qualified individuals. Below, I give some more detailed explanation – but first, the simulation."""
 
 
 
 explanation_md = """
 
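The new intro paragraph above sketches the core of the simulation: draw true qualities from a (possibly skewed) population distribution, let several error-prone judges rate each applicant, average those ratings, and rank by the average. Here is a minimal, self-contained sketch of that loop, assuming plain Gaussian judge error and ignoring attitudes and coarse graining; the names (`true_quality`, `mean_rating`) are illustrative and not identifiers from app.py.

```python
# Minimal sketch of the sample -> rate -> average -> rank loop (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

num_applicants = 100
ratings_per_applicant = 5         # how many ratings each applicant receives
judge_error = 7                   # width of the 95% CI of a single rating
sigma = judge_error / (2 * 1.96)  # convert the CI width into a standard deviation

# True qualities drawn from a skew-normal population (a=0 gives an ordinary normal).
true_quality = stats.skewnorm.rvs(a=0, loc=50, scale=15, size=num_applicants, random_state=rng)

# Each rating is the true quality plus judge noise; the final score is the mean rating.
noise = rng.normal(0, sigma, size=(num_applicants, ratings_per_applicant))
mean_rating = (true_quality[:, None] + noise).mean(axis=1)

# Compare who is objectively in the top 5 with who the evaluation puts in the top 5.
top5_true = set(np.argsort(-true_quality)[:5])
top5_eval = set(np.argsort(-mean_rating)[:5])
print(f"Truly-top-5 applicants recovered by the evaluation: {len(top5_true & top5_eval)} of 5")
```

Increasing `judge_error` in this sketch shrinks the overlap between the true top 5 and the evaluated top 5 on average, which is the effect the full simulation quantifies over many runs.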
 
  Applicant contributions are evaluated by a **Number of Judges**, who evenly distribute the task of rating until all applicants have received a rating. Importantly, like in the real world, judges are not infallible; they make errors in assessing applicants. This is adjustable via the **Judge Error** property, which represents the width of the 95% confidence interval of a distribution centered around the applicant's objective quality. For example, if an applicant has an objective quality score of 70 and the **Judge Error** is set to 4, it means that in 95% of cases, judges are expected to score the applicant between 68 and 72. In most cases of academic ratings, it seems implausible that people can reliably distinguish closely scored candidates (e.g., 75 vs. 77). Depending on your application, you can therefore experiment with different plausible error-ranges for your specific scenario.
 
+ Judges' rating styles naturally vary. Some may be more critical, rating applicants harshly, while others may be more lenient. This variability is adjustable in the simulation through the **Judges' Attitude Range**. Each judge is assigned a random attitude value drawn from the range between this number and its negative, which then sets the skew of the skewed normal distribution (centered around the true quality) from which their assessments are drawn. The impact of these settings is visually represented in the simulation by the black distributions within the population-distribution graphic on the right.
 
  Once all evaluations are complete, each applicant's final score is determined by averaging all their received ratings. Additionally, the simulation allows for coarse-grained judgments, where judges have to select the most appropriate score on a predefined scale (e.g., 0 to 7, determined by the **Coarse Graining Factor**). In that case (a very common practice), we do lose quite some information about the quality of the candidates.
 
 
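The three controls described in the explanation text above (**Judge Error**, **Judges' Attitude Range**, **Coarse Graining Factor**) can be read as a recipe for producing a single rating. Below is one plausible reading, not the implementation in app.py: the helper name `rate_once`, the sign convention for harsh versus generous attitudes, and the 0–100 snapping used for coarse graining are assumptions.

```python
# Illustrative sketch of a single judgement under assumed conventions (not app.py code).
import numpy as np
from scipy import stats

def rate_once(true_quality, judge_error=4, attitude=0.0, coarse_graining=None, rng=None):
    """One judge rates one applicant.

    judge_error     -- width of the 95% CI of the rating (4 -> roughly 68..72 for a 70).
    attitude        -- skew of the judge's error distribution; negative = harsh,
                       positive = generous (assumed sign convention).
    coarse_graining -- if set, snap the rating onto an integer 0..coarse_graining scale
                       (assuming qualities live on a 0..100 scale).
    """
    rng = rng if rng is not None else np.random.default_rng()
    sigma = judge_error / (2 * 1.96)  # 95% CI width -> standard deviation
    # Shift the skew-normal so its mean sits on the applicant's true quality.
    offset = stats.skewnorm.mean(attitude, loc=0, scale=sigma)
    rating = stats.skewnorm.rvs(attitude, loc=true_quality - offset, scale=sigma, random_state=rng)
    if coarse_graining:
        rating = float(np.round(np.clip(rating, 0, 100) / 100 * coarse_graining))
    return rating

# A harsh judge (attitude -1.7) scoring a 70-quality applicant on an eight-point (0..7) scale.
print(rate_once(70, judge_error=4, attitude=-1.7, coarse_graining=7))
```

For the worked example in the text: with a Judge Error of 4, the implied standard deviation is 4 / (2 × 1.96) ≈ 1.02, so roughly 95% of a judge's ratings for a 70-quality applicant fall between about 68 and 72.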
     top_n = gr.Slider(1, 40, step=1, value=5, label="Top N", info='How many candidates can be selected.')
 
     judge_error = gr.Slider(0, 20, step=1, value=7, label="Judge Error", info='How much error judges can plausibly commit in their ratings.')
+    judges_attitude = gr.Slider(0, 10, step=.1, value=1.7, label="Judges' attitude-range", info='How harsh/generous individual judges can be at maximum. (Skewness of their error distributions)')
 
     judgment_coarse_graining_true_false = gr.Checkbox(value= True, label="Coarse grain judgements.", info='Whether judgements are made on a coarser scale.')
     judgment_coarse_graining = gr.Slider(0, 30, step=1, value=7, label="Coarse Graining Factor", info='Number of ratings on the judgement-scale.')
 
     population_plot = gr.Plot(label="Population",render=True)
     gr.Markdown("""**Applicants quality distribution & judgement errors** – Above you can see in red the distribution from which we draw the real qualities of our applicants.
     You can alter its **Mean, Scale and Skewness** on the left side. You can also see how large the errors are,
+    which our judges can potentially commit, and how strongly the harshest (left) and the most generous (right) judges can skew the assessments.
+    The example candidates' true scores are shown by vertical dotted lines.
     You can alter these values by playing with the **Judge Error** and the **Judge's attitude range** on the left.""")
     # with gr.Group():
     # Your existing plot output