Prevent word splitting when using max-len option #455
@mightymatth |
I've added |
When Because that space is now trimmed with this PR, the Before:
Now:
Note how the spaces delineate the start of the words "Nervous" and "Chaz".
I didn't know that the leading space had a practical meaning, so I proposed trimming it. If that is the case, should we revert the trimming of the output altogether, or put it behind the flag as well?
You're not the only one who was confused by the space; issue #397 also asks why the spaces are there. :) The leading space doesn't seem necessary if you only split on word boundaries, so I think it's probably fine if the trimming is also under the flag. I went ahead and created PR #476.
I support both solutions: the one proposed in your PR and having it behind the flag. However, having it behind the flag might be redundant, because you won't need that information (and there are already many flags, so I'm not sure we need another one). Let's see what others think.
…gml-org#455) * Update whisper.cpp * fix: trim function * feat: added flag to split on word * fix: arguments for main
Description
In Croatian, we often get short sub-word tokens. When we use the max-len option, words often get split, and we don't want that.
It looks like this:
To fix that, the idea is to split the segment only when we reach the limit set with max-len AND the next token starts with a delimiter (currently a blank space). We don't want to split before reaching the max length, because we could get stuck if the desired max length is shorter than a single word.
Now it looks like this:
I also trimmed the output line.
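The splitting rule and the trimming described above can be sketched roughly as follows. This is only an illustration of the idea, not the code from this PR: the names wrap_tokens and trim are hypothetical, tokens are modeled as plain strings, and a word start is assumed to be marked by a leading blank space.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: strip leading and trailing blanks from a line.
static std::string trim(const std::string & s) {
    const std::size_t first = s.find_first_not_of(' ');
    if (first == std::string::npos) {
        return "";
    }
    const std::size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Hypothetical sketch: group tokens into lines of roughly max_len characters,
// breaking only in front of a token that starts a new word.
static std::vector<std::string> wrap_tokens(const std::vector<std::string> & tokens,
                                            std::size_t max_len) {
    std::vector<std::string> lines;
    std::string cur;

    for (const std::string & tok : tokens) {
        // Assumption: a token that begins with a blank starts a new word.
        const bool starts_word = !tok.empty() && tok.front() == ' ';

        // Split only if the limit would be exceeded AND we are at a word boundary.
        // A word longer than max_len is therefore kept whole instead of being cut.
        if (!cur.empty() && cur.size() + tok.size() > max_len && starts_word) {
            lines.push_back(trim(cur));
            cur.clear();
        }
        cur += tok;
    }

    if (!cur.empty()) {
        lines.push_back(trim(cur));
    }
    return lines;
}
```

With the tokens " Ner", "vous", " Cha", "z", " is" and a max length of 1, this sketch yields the lines "Nervous", "Chaz" and "is": each line exceeds the limit, but no word is split. Note that the length check here counts the leading blank before trimming; that choice is part of the sketch and may differ from the actual implementation.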
The file is located here, and the command to run it is:
Please modify and refactor it as needed; this is just the idea.