TildeBench: a leaderboard for large language model (LLM) performance in Baltic, Finnic, and Slavic languages
We are interested in three aspects:
- Language support: Do the models really “speak” our languages?
- Task quality: Can they perform our tasks to the quality our users expect?
- Task reliability: Is the models' failure rate low enough for production use?
Go to TildeBench to see the leaderboard.
This leaderboard is a work in progress. If you have an interesting benchmark for our languages that you would like to suggest or contribute, let us know, and let's push the state of the art together!