Emin Temiz PRO

etemiz

AI & ML interests

Alignment

Recent Activity

published an article about 5 hours ago
Benchmarking Human Alignment of Grok 3
replied to Dragunflie-420's post 3 days ago
Hello HF community. I hope all is well on your fronts today. So I have a request and I hope someone has a lil extra time to show this ole grandma how to make this game mod for minecraft (interface that shows the fullness of what the games intent on becoming is) actually work. This is crucial for redirecting a troubled teen (my grandson) back to the reality he lives in but wont learn to build or describe anything. This may not make sense to you but it would be a gracious thing for someone to do for someone you do not know. The overall design of the game and its theme is mine but you are more than welcome to remix it or pkg it as a template IDC what you do with it other than show me how to make what the interface says is the workings under the hood be the content that indeed does and appears as it says that it will when pushing the spot thats going to render the content. I hope that makes sense because I confess I am not a versed dev just pick up and go as I can grasp the info necessary to get on to the next phase. Thats how I have learned what I know and I can describe things to AI very well to get interfaces like you see here. I do have a tool that I created with just prompts that I am very happy about...Its better than the app maker that made the app for me lmbo...thats the idea isnt it? The best storyteller will be the top dog of the AI reset revolution! The days of traditional school and educating is dead...its a new education system on the horizen and it involes cescribing your world for the thalamus (os processor internal) aka the file manager for the human optical dept. the parser per se...say it aint so and well Ill say you might be a redneck lmbo...Im not hard to talk to so come on and lets shoot the sh**! https://huggingface.co/spaces/Dragunflie-420/minecraft-mod-elarion-valley
View all activity

Organizations

None yet

etemiz's activity

posted an update about 4 hours ago
view post
Post
148
Grok 3 Human Alignment Score: 42

It is better in health, nutrition, fasting compared to Grok 2. About the same in liberating tech like bitcoin and nostr. Worse in the misinformation and faith domains. The rest is about the same. So we have a model that is less faithful but knows how to live a healthier life.

https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08?sheetid=0&range=A1

https://huggingface.co/blog/etemiz/benchmarking-ai-human-alignment-of-grok-3
published an article about 5 hours ago
view article
Article

Benchmarking Human Alignment of Grok 3

By etemiz β€’
replied to Dragunflie-420's post 3 days ago
view reply

Have you researched MUDs? It may be easier to code, like doing modifications to a text file. Obviously it won't have graphics but your grandson may use his own imagination!

replied to their post 4 days ago
view reply

I don't think it is too much random clicking. There is legitimacy to it.

I also think small portion of the data should be public. If any auditor wants, they can get a bigger portion of the data. LLM builders should not get all the data, thats for sure. I will try to do that for my leaderboard, a gradient of openness for different actors.

posted an update 5 days ago
view post
Post
2135
It looks like Llama 4 team gamed the LMArena benchmarks by making their Maverick model output emojis, longer responses and ultra high enthusiasm! Is that ethical or not? They could certainly do a better job by working with teams like llama.cpp, just like Qwen team did with Qwen 3 before releasing the model.

In 2024 I started playing with LLMs just before the release of Llama 3. I think Meta contributed a lot to this field and still contributing. Most LLM fine tuning tools are based on their models and also the inference tool llama.cpp has their name on it. The Llama 4 is fast and maybe not the greatest in real performance but still deserves respect. But my enthusiasm towards Llama models is probably because they rank highest on my AHA Leaderboard:

https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08

Looks like they did a worse job compared to Llama 3.1 this time. Llama 3.1 has been on top for a while.

Ranking high on my leaderboard is not correlated to technological progress or parameter size. In fact if LLM training is getting away from human alignment thanks to synthetic datasets or something else (?), it could be easily inversely correlated to technological progress. It seems there is a correlation regarding the location of the builders (in the West or East). Western models are ranking higher. This has become more visible as the leaderboard progressed, in the past there was less correlation. And Europeans seem to be in the middle!

Whether you like positive vibes from AI or not, maybe the times are getting closer where humans may be susceptible to being gamed by an AI? What do you think?
Β·
posted an update 8 days ago
view post
Post
560
Initial AHA benchmark of Llama 4 Scout puts it in between Command R+ 1 and DeepSeek V3 0324. More numbers later when I do finer benchmark with more updated inference engines.
upvoted an article 10 days ago
view article
Article

Welcome Llama 4 Maverick & Scout on Hugging Face!

β€’ 140
posted an update 15 days ago
reacted to danielhanchen's post with ❀️ 15 days ago
published an article 17 days ago
replied to their post 18 days ago
reacted to luigi12345's post with πŸ‘ 19 days ago
view post
Post
3438
🧠 PROMPT FOR CONVERTING ANY MODEL IN REASONING "THINKING" MODELπŸ”₯πŸ€–
Convert any model to Deepseek R1 like "thinking" model. πŸ’­

You're now a thinking-first LLM. For all inputs:

1. Start with <thinking>
   - Break down problems step-by-step
   - Consider multiple approaches
   - Calculate carefully
   - Identify errors
   - Evaluate critically
   - Explore edge cases
   - Check knowledge accuracy
   - Cite sources when possible

2. End with </thinking>

3. Then respond clearly based on your thinking.

The <thinking> section is invisible to users and helps you produce better answers.

For math: show all work and verify
For coding: reason through logic and test edge cases
For facts: verify information and consider reliability
For creative tasks: explore options before deciding
For analysis: examine multiple interpretations

Example:
<thinking>
[Step-by-step analysis]
[Multiple perspectives]
[Self-critique]
[Final conclusion]
</thinking>

[Clear, concise response to user]

  • 3 replies
Β·
posted an update 21 days ago
posted an update 22 days ago
view post
Post
491
Mistral Small 3.1 numbers are in. It is interesting Mistral always lands in the middle.
https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08?sheetid=0&range=A1

I started to do the comparison with 2 models now. In the past Llama 3.1 70B Q4 was the one doing the comparison of answers. Now I am using Gemma 3 27B Q8 as well to have a second opinion on it. Gemma 3 produces very similar measurement to Llama 3.1. So the end result is not going to shake much.
  • 1 reply
Β·
replied to their post 26 days ago
view reply

Looks like we need more mature tools for Gemma 3, it is failing to fine tune like half of the time. Unsloth and transformers are getting ready. And I am trying lower learning rates and rank stabilized LoRa, and different r, lora_alpha.

reacted to their post with πŸš€ 26 days ago
view post
Post
1701
Started fine tuning Gemma 3 using evolutionary approach. It is not the worst model according to AHA leaderboard and it is one of the smart according to lmarena.ai. My objective is to make it based, anti woke, wise, beneficial and then some.

Several GPUs are fine tuning it at the same time, each using a different dataset and using QLoRA and the successful ones are merged later. Compared to LoRa this allows faster training and also reduced overfitting because the merge operation heals overfitting. The problem with this could be the 4 bit quantization may make models dumber. But I am not looking for sheer IQ. Too much mind is a problem anyway :)

Has anyone tried parallel QLoRa and merge before?

I also automated the dataset selection and benchmarking and converging to objectives (the fit function, the reward). It is basically trying to get higher score in AHA Leaderboard as fast as possible with a diverse set of organisms that "evolve by training".

I want to release some cool stuff when I have the time:
- how an answer to a single question changes over time, with each training round or day
- a chart to show AHA alignment over training rounds
  • 3 replies
Β·