Skip to content

Conversation

mightymatth
Copy link
Contributor

@mightymatth mightymatth commented Jan 27, 2023

Description

In the Croatian language, we often get short sub-word tokens. When we choose max-len option, words often get split, and we don't want that.

It looks like this:

[00:00:00.000 --> 00:00:01.050]   Zagreb se nalazi u
[00:00:01.050 --> 00:00:02.310]   kontinentalnoj Sred
[00:00:02.310 --> 00:00:03.740]  išnjoj Hrvatskoj,
[00:00:03.740 --> 00:00:04.640]   na južnim obronc
[00:00:04.640 --> 00:00:05.960]  ima Medvednice te na
[00:00:05.960 --> 00:00:07.500]   obalama rijeke Save
[00:00:07.500 --> 00:00:07.500]  .

To fix that, the idea is to split the segment only if we reach the limit set with max-len AND when the next token starts with a delimiter (currently a blank space). We don't want to split it before reaching the max length because we can get stuck if our wanted max length is lower than a single word.

Now it looks like this:

[00:00:00.000 --> 00:00:01.050]  Zagreb se nalazi u
[00:00:01.050 --> 00:00:02.810]  kontinentalnoj Središnjoj
[00:00:02.810 --> 00:00:04.220]  Hrvatskoj, na južnim
[00:00:04.220 --> 00:00:05.610]  obroncima Medvednice
[00:00:05.610 --> 00:00:07.130]  te na obalama rijeke
[00:00:07.130 --> 00:00:07.500]  Save.

I also trimmed the output line.

The file is located here, and the command to run it is:

 ./main -m ./models/ggml-large.bin -l hr -ml 20 -f ./croatian-sample-short.wav

Please, modify and refactor it as needed, this is just the idea.

@ggerganov
Copy link
Member

@mightymatth
Thanks for this contribution - I think this is very useful!
Although it is OK to merge like this, I will likely change it to have a bool flag in the params that specifies whether to do token or word splitting in order to keep both functionalities. If you feel like it, you can make the change yourself. If not, I will do it at some point the in the following days :)

@mightymatth
Copy link
Contributor Author

mightymatth commented Feb 4, 2023

I've added --split-on-word (--sow) flag to turn this on (it's turned off by default). You can check these commits.
You are free to modify this solution in terms of naming, positioning, and coding style.

@ggerganov ggerganov merged commit d012b5c into ggml-org:master Feb 5, 2023
@boolemancer
Copy link
Contributor

When --split-on-word is false, the space at the beginning of the text output is necessary to know whether or not the current output is a continuation of the previous word or a new word.

Because that space is now trimmed with this PR, the --split-on-word false case is now broken.

Before:

1
00:00:01,580 --> 00:00:01,650
 N

2
00:00:01,650 --> 00:00:01,670
erv

3
00:00:01,670 --> 00:00:02,110
ous

4
00:00:02,110 --> 00:00:02,390
,

5
00:00:02,390 --> 00:00:02,410
 Ch

6
00:00:02,410 --> 00:00:02,560
az

7
00:00:02,560 --> 00:00:03,120
 said

Now:

1
00:00:01,580 --> 00:00:01,650
N

2
00:00:01,650 --> 00:00:01,670
erv

3
00:00:01,670 --> 00:00:02,110
ous

4
00:00:02,110 --> 00:00:02,390
,

5
00:00:02,390 --> 00:00:02,410
Ch

6
00:00:02,410 --> 00:00:02,560
az

7
00:00:02,560 --> 00:00:03,120
said

Note how the spaces delineate the start of the words "Nervous" and "Chaz".

@mightymatth
Copy link
Contributor Author

...the space at the beginning of the text output is necessary to know whether or not the current output is a continuation of the previous word or a new word.

I didn't know that the leading space has a practical meaning, so I proposed to have it trimmed. However, if it is the case, should we revert the trimming of the output at all? or have it also under the flag?

@boolemancer
Copy link
Contributor

However, if it is the case, should we revert the trimming of the output at all? or have it also under the flag?

You're not the only one that was confused by the space. Issue #397 also questions why they're there. :)

It doesn't seem like the leading space is necessary if you actually do only split on word boundaries, so I think it's probably fine if it's also under the flag.

I went ahead and created a PR #476.

@mightymatth
Copy link
Contributor Author

I support both solutions; the one proposed in your PR and having it behind the flag. However, having it behind the flag might be redundant because you won't need that info (+ there are already many flags, not sure if we need another one). We can see how others see it.

rock3125 pushed a commit to rock3125/whisper.cpp that referenced this pull request Feb 21, 2023
…gml-org#455)

* Update whisper.cpp

* fix: trim function

* feat: added flag to split on word

* fix: arguments for main
anandijain pushed a commit to anandijain/whisper.cpp that referenced this pull request Apr 28, 2023
…gml-org#455)

* Update whisper.cpp

* fix: trim function

* feat: added flag to split on word

* fix: arguments for main
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
…gml-org#455)

* Update whisper.cpp

* fix: trim function

* feat: added flag to split on word

* fix: arguments for main
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
…gml-org#455)

* Update whisper.cpp

* fix: trim function

* feat: added flag to split on word

* fix: arguments for main
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
…gml-org#455)

* Update whisper.cpp

* fix: trim function

* feat: added flag to split on word

* fix: arguments for main
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024
…gml-org#455)

* Update whisper.cpp

* fix: trim function

* feat: added flag to split on word

* fix: arguments for main
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants