Prevent word splitting when using max-len option #455

mightymatth · 2023-01-27T01:12:45Z

Description

In the Croatian language, we often get short sub-word tokens. When we choose max-len option, words often get split, and we don't want that.

It looks like this:

[00:00:00.000 --> 00:00:01.050]   Zagreb se nalazi u
[00:00:01.050 --> 00:00:02.310]   kontinentalnoj Sred
[00:00:02.310 --> 00:00:03.740]  išnjoj Hrvatskoj,
[00:00:03.740 --> 00:00:04.640]   na južnim obronc
[00:00:04.640 --> 00:00:05.960]  ima Medvednice te na
[00:00:05.960 --> 00:00:07.500]   obalama rijeke Save
[00:00:07.500 --> 00:00:07.500]  .

To fix that, the idea is to split the segment only if we reach the limit set with max-len AND when the next token starts with a delimiter (currently a blank space). We don't want to split it before reaching the max length because we can get stuck if our wanted max length is lower than a single word.

Now it looks like this:

[00:00:00.000 --> 00:00:01.050]  Zagreb se nalazi u
[00:00:01.050 --> 00:00:02.810]  kontinentalnoj Središnjoj
[00:00:02.810 --> 00:00:04.220]  Hrvatskoj, na južnim
[00:00:04.220 --> 00:00:05.610]  obroncima Medvednice
[00:00:05.610 --> 00:00:07.130]  te na obalama rijeke
[00:00:07.130 --> 00:00:07.500]  Save.

I also trimmed the output line.

The file is located here, and the command to run it is:

 ./main -m ./models/ggml-large.bin -l hr -ml 20 -f ./croatian-sample-short.wav

Please, modify and refactor it as needed, this is just the idea.

ggerganov · 2023-02-04T07:35:33Z

@mightymatth
Thanks for this contribution - I think this is very useful!
Although it is OK to merge like this, I will likely change it to have a bool flag in the params that specifies whether to do token or word splitting in order to keep both functionalities. If you feel like it, you can make the change yourself. If not, I will do it at some point the in the following days :)

mightymatth · 2023-02-04T22:11:14Z

I've added --split-on-word (--sow) flag to turn this on (it's turned off by default). You can check these commits.
You are free to modify this solution in terms of naming, positioning, and coding style.

boolemancer · 2023-02-06T08:39:24Z

When --split-on-word is false, the space at the beginning of the text output is necessary to know whether or not the current output is a continuation of the previous word or a new word.

Because that space is now trimmed with this PR, the --split-on-word false case is now broken.

Before:

1
00:00:01,580 --> 00:00:01,650
 N

2
00:00:01,650 --> 00:00:01,670
erv

3
00:00:01,670 --> 00:00:02,110
ous

4
00:00:02,110 --> 00:00:02,390
,

5
00:00:02,390 --> 00:00:02,410
 Ch

6
00:00:02,410 --> 00:00:02,560
az

7
00:00:02,560 --> 00:00:03,120
 said

Now:

1
00:00:01,580 --> 00:00:01,650
N

2
00:00:01,650 --> 00:00:01,670
erv

3
00:00:01,670 --> 00:00:02,110
ous

4
00:00:02,110 --> 00:00:02,390
,

5
00:00:02,390 --> 00:00:02,410
Ch

6
00:00:02,410 --> 00:00:02,560
az

7
00:00:02,560 --> 00:00:03,120
said

Note how the spaces delineate the start of the words "Nervous" and "Chaz".

mightymatth · 2023-02-06T09:08:41Z

...the space at the beginning of the text output is necessary to know whether or not the current output is a continuation of the previous word or a new word.

I didn't know that the leading space has a practical meaning, so I proposed to have it trimmed. However, if it is the case, should we revert the trimming of the output at all? or have it also under the flag?

boolemancer · 2023-02-06T09:25:39Z

However, if it is the case, should we revert the trimming of the output at all? or have it also under the flag?

You're not the only one that was confused by the space. Issue #397 also questions why they're there. :)

It doesn't seem like the leading space is necessary if you actually do only split on word boundaries, so I think it's probably fine if it's also under the flag.

I went ahead and created a PR #476.

mightymatth · 2023-02-06T09:57:42Z

I support both solutions; the one proposed in your PR and having it behind the flag. However, having it behind the flag might be redundant because you won't need that info (+ there are already many flags, not sure if we need another one). We can see how others see it.

…gml-org#455) * Update whisper.cpp * fix: trim function * feat: added flag to split on word * fix: arguments for main

mightymatth added 2 commits January 27, 2023 01:44

Update whisper.cpp

8ea08d5

fix: trim function

dd6f07e

mightymatth added 3 commits February 4, 2023 22:40

feat: added flag to split on word

5c1af3d

fix: arguments for main

7436105

refactor: rebase to master

bbbe230

ggerganov merged commit d012b5c into ggml-org:master Feb 5, 2023

boolemancer mentioned this pull request Feb 6, 2023

Only trim if split_on_word is true #476

Merged

markni mentioned this pull request Mar 26, 2023

Feature Request: split on punctuation when max_len is provided #657

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Prevent word splitting when using max-len option #455

Prevent word splitting when using max-len option #455

Uh oh!

mightymatth commented Jan 27, 2023 •

edited

Loading

Uh oh!

ggerganov commented Feb 4, 2023

Uh oh!

mightymatth commented Feb 4, 2023 •

edited

Loading

Uh oh!

boolemancer commented Feb 6, 2023

Uh oh!

mightymatth commented Feb 6, 2023

Uh oh!

boolemancer commented Feb 6, 2023

Uh oh!

mightymatth commented Feb 6, 2023

Uh oh!

Uh oh!

Prevent word splitting when using max-len option #455

Prevent word splitting when using max-len option #455

Uh oh!

Conversation

mightymatth commented Jan 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

ggerganov commented Feb 4, 2023

Uh oh!

mightymatth commented Feb 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

boolemancer commented Feb 6, 2023

Uh oh!

mightymatth commented Feb 6, 2023

Uh oh!

boolemancer commented Feb 6, 2023

Uh oh!

mightymatth commented Feb 6, 2023

Uh oh!

Uh oh!

mightymatth commented Jan 27, 2023 •

edited

Loading

mightymatth commented Feb 4, 2023 •

edited

Loading