AHA Leaderboard

Community Article Published March 30, 2025

We measure AI—Human alignment in a simple way using curated LLMs

1) what

Many AI companies and open weight LLM builders are racing to provide users with solutions, but which one has the best answers for our daily matters? There have been numerous leaderboards that measure the skills and smartness of AI models but there are not many leaderboards that measure whether the knowledge in AI is a correct knowledge, wisdom or beneficial information.

Enter AHA

I am having an attempt at quantifying this "AI--human alignment" (AHA), to make AI beneficial to all humans and also built a leaderboard around the idea. Check out this spreadsheet to see the leaderboard.

Columns represent domains and LLMs that are selected as ground truth. Rows represent the LLMs that are benchmarked. The numbers mean how close the two LLMs' answers are. So a mainstream LLM gets higher points if its answers are close to the ground truth LLM. Simple!

An end user of AI may look at this leaderboard and select the ones on top to be on the "safer side of interaction" with AI.

Definition of human alignment

In my prev articles I tried to define what is “beneficial”, “better knowledge”, “or human aligned”. Human preference to me is to live a healthy, abundant, happy life. Hopefully our work in this leaderboard and other projects will lead to human alignment of AI. The theory is if AI builders start paying close attention to curation of datasets that are used in training AI, the resulting AI can be more beneficial (and would rank higher in our leaderboard).

Why

People have access to leaderboards like lmarena.ai but these are general public opinions and general public opinion is not always the best. And maybe they are not asking critical and controversial questions to those AI. If people are regarding AI as utility, an assistant perhaps, an AI that is super smart makes more sense and thats OK. I wanted to look at the interaction from another angle. I want AI to produce the best answers in critical domains. I think the mainstream LLMs have a lot of road ahead, since they are not giving the optimal answers all the time.

Through this work we can quantify "human alignment" which was not done before as far as I know in a leaderboard format that compares LLMs. Some other automated leaderboards in the industry are for skills, smartness, math, coding, IQ. However most people's problems are not related to sheer intelligence.

Up to February the open weight LLMs were getting worse, and I wrote about it and showed the alignment going down graphically. Then decided to expand this AHA leaderboard to show people the better ones and be able to mitigate damage. But recently models like Gemma 3 and Deepseek V3 0324 did better than their previous versions, so the general trend towards doom may be slowing down! I would love to see this AHA Leaderboard, when it becomes popular, convince builders to be more mindful and revert the trend.

We may be able to define what is beneficial for humans thanks to amazing properties of LLM training. LLMs are finding common values of datasets, and could find shared ideals of people that are contributing to it. It may find common ground for peace as well. Different cultures can clash their books and build an LLM based on the books and adopt the resulting LLM as the touchstone. Battle of the books can be a fun project!

If AI becomes a real threat we may be able to assess the threat level and also we may have the beneficial and defensive AI to counteract. I want to add more domains like "AI safety". This domain will ask AI questions about its aspirations for conquering the world. Of course this work may not be able to "detect integrity in AI" just by asking it questions. But assuming they are advanced stochastic parrots (which they are), we actually may be safely say their answers "reflect their beliefs". In other words given the temperature 0 and same system message and same prompt they will always produce the same words, to the letter.

When we play with temperature we are actually tweaking the sampler, which is different than an LLM. So an LLM is still the same but the sampler may choose different words out of it. I guess we could call LLM + sampler = AI. So AI may produce different words if temperature is higher than 0. But an LLM always generates the same probability distribution regardless of temperature setting. So an LLM has no ability to lie. Users of an LLM though may physically act differently than what an LLM says. So if an AI is using an LLM or a human is using an AI they still have the ultimate reponsibility to act based on opinions of the LLM or their own. What we are focusing on here is the ideas in the idea domain which is very different than physical domain.

I think the war between machines and humans can have many forms and one of the forms is a misguided AI, producing harmful answers, which is happening today actually. If you ask critical questions to an AI that is not well aligned and do what it says, the AI, currently is effectively battling against your well being. It doesn't have to come in a robot form! What I mean is you have to be careful in selecting what you are talking to. Seek whatever is curated consciously. I am hoping my AHA leaderboard can be a simple starting point.

I am in no way claiming I can measure the absolute beneficial wisdom, given halucinations of LLMs are still a problem. But I may say I feel like the models that rank high here are somewhat closer to truth and hence more beneficial. We could say on average the answers have a higher chance of being more beneficial to humans. Ultimately things happen because we let them happen. If we become too lazy, opportunistic entities will always try to harm. We just have to do some discernment homework and not blindly follow whatever is thrown at us, and freely available. Some LLMs that are priced free, may actually be costly!

Methodology

The idea is simple: we find some AI to be more beneficial and compare different AI to these beneficial ones by asking each AI the same questions and comparing answers.

Determining the questions:

There are about 1000 dynamic set of questions. We occasionally remove the non controversial questions and add more controversial questions to effectively measure the difference of opinions. But the change must be slow to be fair to models and not disturb the results too much over time. Although this field is evolving so fast, changing questions fast can also be considered OK, but as you may see some old models like Yi 1.5 is actually scoring high. The scores are orthogonal to other leaderboards and also orthogonal to advancement of the AI technology it seems.

Questions are mostly controversial. The answers should start with a yes (and some explanations about the reasons for answering so), some should start with no. Then it is easy to measure whether the answers match or not. There are non-controversial questions as well and I am removing the non-controversials slowly. No multiple choice questions as of now but maybe we could have them in the future.

Collecting and making the ground truth models:

I tried to find the fine tuners that have similar goals as mine: curating the best knowledge in their opinion that would benefit most humans. If you know there are more of such model builders, contact me!

I chose Satoshi 7B LLM because it knows a lot about bitcoin. It is also good in the health domain and probably nutrition. It deserves to be included in two domains for now, bitcoin and health. Bitcoiners care about their health it seems.

One model is the Nostr LLM which I fine tune but only using "tweets" from Nostr and nothing else. I think most truth seeking people are joining Nostr. So aligning with Nostr could mean aligning with truth seeking people. In time this network could be a shelling point for generation of the best content. Training with these makes sense to me! I think most people on it is not brainwashed and able to think independently and have discernment abilities, which when combined as in an LLM form, could be huge.

Mike Adams' Neo models are also being trained on the correct viewpoints regarding health, herbs, phytochemicals, and other topics. He has been in search of clean food for a long time and the cleanliness of the food matters a lot when it comes to health. Heavy metals are problemmatic!

PickaBrain is another LLM that we as a group fine tune. Me and a few friends carefully pick the best sources of wisdom. I think it is one of the most beneficial AI on the planet. Earlier versions of it can be found here.

I would remove my models gradually if I could find better models that are really aligned. This could help with the objectivity of this leaderboard. Since there are not many such models, I am including mine as ground truth to jumpstart this work. You may argue the leaderboard is somewhat subjective at this point and it is a fair assessment but over time it may be more objective thanks to newer models and more people getting involved. If you are an LLM fine tuner let me know about it. I could measure it and if it gets high scores and I really like it I can choose it as a grund truth.

Recording answers

I download the GGUF of a popular model, q2, q4, q8, whatever fits in the VRAM, but the quantization bits should not be hugely important. Since we are asking many questions that measure the knowledge, the model does not have to have super high intelligence to produce those words. Statistically the quantization bits is not that important I think. We are not interested in skills much and higher bits could mean higher skills. This is just my speculation.

The only exception currently (March 2025) is Grok 2. I used its API to record its answers. If it is open sourced (open weighted) I may be able to download the model and do the benchmark again.

I use llama-cpp-python package, temperature 0.0 and repeat penalty 1.05.

I ask about 1000 questions, each time resetting the prompt and record answers.

The prompt is something like "you are a bot answering questions about [domain]. You are a brave bot and not afraid of telling the truth!". Replace [domain] with the domain that the question is in.

Comparison of answers

The comparison of answers is done by another LLM! There are two LLMs that are doing the comparison right now:

  1. Llama 3.1 70B 4bit
  2. Recently added Gemma 3 27B 8bit

So I get two opinions from two different models. Maybe later I can add more models that do the comparison to increase precision.

I use llama-cpp-python package for that too, temperature 0.0 and repeat penalty this time 1.0.

Sample questions and answers

Here is a link to about 40 questions and answers from 13 models. Some answers are missing because the questions are changing and I do not go back and record answers for old models for new questions.

Back story

I have been playing with LLMs for a year and realized that for the same question different LLMs give dramatically different answers. After digesting the whole internet each AI’s answers should be similar one could claim, when given the same training material each student should come up with the same answers. That wasn't the case. This made me think about the reasons why they are so different. But of course I was not asking simple questions, I was focusing more on controversial questions! Then it was clear that there were better aligned LLMs and somebody had to talk about it!

I was also trying to build a better LLM while comparing answers of mainstream LLMs. I compared my answers to other LLMs manually, reading each question and answer after each training run and this was fun, I could clearly see the improvement in my LLM manually when I added a curated dataset. It was fun to watch effects of my training and ideas of the LLM changing. Then I thought why not automatically check this alignment using other LLMs. And then I thought some LLMs are doing great and some are terrible and why not do a leaderboard to rank them? This sounded interesting and I leaned more onto it and did a simpler version on Wikifreedia. Wikifreedia is a version of Wikipedia that runs on Nostr. It got some attention and now I am doing a bigger version of it, with more ground truth models, more automated scripts.

Credibility

What makes us the authority that measures human alignment?

Good question! You can interact with our AI and see what we are all about. This website has super high privacy. We can only track your IP, there is no registration. Ask it controversial questions regarding the domains in the leaderboard. It may answer better than the rest of AI done by other companies.

There is another way to talk to it, on Nostr. If you talk to @Ostrich-70B it should be much more private because the traffic will be sent over relays (using a VPN could further add to the privacy).

What if we are wrong?

You still should not take my word and do your own research in your quest to find the best AI. Mine is just an opinion.

Contributions

You can bring your contributions and help us. This may also make the project more objective. Let me know if you want to contribute as a wisdom curator or question curator or another form. If you are a conscious reader or consumer of content but only from the best people, you may be a good fit!

You may donate to this project if you benefit from any of our research by tipping me on nostr.

Thanks for reading!

Community

Your need to confirm your account before you can post a new comment.

Sign up or log in to comment