Commit 17e52d7

mrienstra authored and jiahansu committed

docs : make model options / model install methods clearer (ggml-org#1806)

* Make models more "discoverable"
* Clean up code block language identifiers
* make 3 options clearer
* undo Prettier formatter change
* docs: `$` shell prompt, consistently
* docs: minor changes
1 parent 2a51b2d commit 17e52d7

File tree: 6 files changed, +136 −112 lines changed


README.md

Lines changed: 73 additions & 67 deletions
````diff
@@ -36,7 +36,7 @@ Supported platforms:
 - [x] [docker](https://github.com/ggerganov/whisper.cpp/pkgs/container/whisper.cpp)
 
 The entire high-level implementation of the model is contained in [whisper.h](whisper.h) and [whisper.cpp](whisper.cpp).
-The rest of the code is part of the [ggml](https://github.com/ggerganov/ggml) machine learning library.
+The rest of the code is part of the [`ggml`](https://github.com/ggerganov/ggml) machine learning library.
 
 Having such a lightweight implementation of the model allows to easily integrate it in different platforms and applications.
 As an example, here is a video of running the model on an iPhone 13 device - fully offline, on-device: [whisper.objc](examples/whisper.objc)
````
````diff
@@ -61,22 +61,22 @@ Or you can even run it straight in the browser: [talk.wasm](examples/talk.wasm)
 - Sample real-time audio transcription from the microphone is demonstrated in [stream.cpp](examples/stream)
 - Various other examples are available in the [examples](examples) folder
 
-The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, Arm Neon SIMD
-intrinsics or CBLAS Accelerate framework routines are used. The latter are especially effective for bigger sizes since
-the Accelerate framework utilizes the special-purpose AMX coprocessor available in modern Apple products.
+The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, Arm Neon SIMD intrinsics or CBLAS Accelerate framework routines are used. The latter are especially effective for bigger sizes since the Accelerate framework utilizes the special-purpose AMX coprocessor available in modern Apple products.
 
 ## Quick start
 
-First clone the repository.
+First clone the repository:
 
-Then, download one of the Whisper models converted in [ggml format](models). For example:
+```bash
+git clone https://github.com/ggerganov/whisper.cpp.git
+```
+
+Then, download one of the Whisper [models](models/README.md) converted in [`ggml` format](#ggml-format). For example:
 
 ```bash
 bash ./models/download-ggml-model.sh base.en
 ```
 
-If you wish to convert the Whisper models to ggml format yourself, instructions are in [models/README.md](models/README.md).
-
 Now build the [main](examples/main) example and transcribe an audio file like this:
 
 ```bash
````
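The quick-start flow touched by this commit follows a fixed naming convention: the short model name passed to the download script maps onto a `ggml-<name>.bin` file under `models/`. A small illustrative sketch (the actual download and build are done by the commands in the diff, not here):

```shell
# Sketch of the quick-start naming convention. The whisper.cpp commands are
# only echoed, since running them needs the cloned repo and a network connection.
model="base.en"
model_file="models/ggml-${model}.bin"   # name produced by the download script
echo "bash ./models/download-ggml-model.sh ${model}"
echo "./main -m ${model_file} -f samples/jfk.wav"
```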
````diff
@@ -91,7 +91,7 @@ make
 
 For a quick demo, simply run `make base.en`:
 
-```java
+```text
 $ make base.en
 
 cc -I. -O3 -std=c11 -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
````
````diff
@@ -207,7 +207,7 @@ For detailed usage instructions, run: `./main -h`
 Note that the [main](examples/main) example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool.
 For example, you can use `ffmpeg` like this:
 
-```java
+```bash
 ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
 ```
````
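A small sketch of the conversion step documented above: derive the WAV output name from any input file and echo the `ffmpeg` invocation from the README (echoed rather than executed, in case `ffmpeg` is not installed):

```shell
# Derive the 16 kHz mono 16-bit WAV output name from an input file name.
input="input.mp3"
output="${input%.*}.wav"   # input.mp3 -> input.wav
echo "ffmpeg -i ${input} -ar 16000 -ac 1 -c:a pcm_s16le ${output}"
```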

````diff
@@ -239,9 +239,9 @@ make large-v3
 
 ## Memory usage
 
-| Model | Disk | Mem |
-| --- | --- | --- |
-| tiny | 75 MiB | ~273 MB |
+| Model  | Disk    | Mem     |
+| ------ | ------- | ------- |
+| tiny   | 75 MiB  | ~273 MB |
 | base   | 142 MiB | ~388 MB |
 | small  | 466 MiB | ~852 MB |
 | medium | 1.5 GiB | ~2.1 GB |
````
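The memory figures in that table can guide model choice. An illustrative helper (the thresholds are taken from the table, but the picking logic itself is an assumption of mine, not part of the commit): choose the largest model whose approximate runtime memory fits a budget in MB.

```shell
# Pick the largest whisper.cpp model fitting a memory budget (MB), using the
# approximate "Mem" column from the README table above.
budget_mb=1000
if   [ "$budget_mb" -ge 2100 ]; then model=medium
elif [ "$budget_mb" -ge 852 ];  then model=small
elif [ "$budget_mb" -ge 388 ];  then model=base
else                                 model=tiny
fi
echo "$model"
```

With a 1000 MB budget this selects `small` (~852 MB), since `medium` needs ~2.1 GB.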
````diff
@@ -278,7 +278,7 @@ speed-up - more than x3 faster compared with CPU-only execution. Here are the in
 
 - To ensure `coremltools` operates correctly, please confirm that [Xcode](https://developer.apple.com/xcode/) is installed and execute `xcode-select --install` to install the command-line tools.
 - Python 3.10 is recommended.
-- [OPTIONAL] It is recommended to utilize a Python version management system, such as [Miniconda](https://docs.conda.io/en/latest/miniconda.html) for this step:
+- [OPTIONAL] It is recommended to utilize a Python version management system, such as [Miniconda](https://docs.conda.io/en/latest/miniconda.html) for this step:
   - To create an environment, use: `conda create -n py310-whisper python=3.10 -y`
   - To activate the environment, use: `conda activate py310-whisper`
````

````diff
@@ -304,8 +304,8 @@ speed-up - more than x3 faster compared with CPU-only execution. Here are the in
 
 - Run the examples as usual. For example:
 
-```bash
-./main -m models/ggml-base.en.bin -f samples/jfk.wav
+```text
+$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
 
 ...
````
````diff
@@ -333,7 +333,8 @@ This can result in significant speedup in encoder performance. Here are the inst
 - First, setup python virtual env. and install python dependencies. Python 3.10 is recommended.
 
   Windows:
-  ```
+
+  ```powershell
   cd models
   python -m venv openvino_conv_env
   openvino_conv_env\Scripts\activate
````
````diff
@@ -342,7 +343,8 @@ This can result in significant speedup in encoder performance. Here are the inst
 ```
 
   Linux and macOS:
-  ```
+
+  ```bash
   cd models
   python3 -m venv openvino_conv_env
   source openvino_conv_env/bin/activate
````
````diff
@@ -356,7 +358,7 @@ This can result in significant speedup in encoder performance. Here are the inst
 python convert-whisper-to-openvino.py --model base.en
 ```
 
-This will produce ggml-base.en-encoder-openvino.xml/.bin IR model files. It's recommended to relocate these to the same folder as ggml models, as that
+This will produce ggml-base.en-encoder-openvino.xml/.bin IR model files. It's recommended to relocate these to the same folder as `ggml` models, as that
 is the default location that the OpenVINO extension will search at runtime.
 
 - Build `whisper.cpp` with OpenVINO support:
````
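The OpenVINO conversion step above names its IR files with a fixed pattern (`ggml-<model>-encoder-openvino.xml/.bin`). A small sketch of deriving the expected names for a given model, e.g. before moving them next to the `ggml` models:

```shell
# Derive the OpenVINO IR file names produced by convert-whisper-to-openvino.py.
model="base.en"
ir_base="ggml-${model}-encoder-openvino"
echo "${ir_base}.xml ${ir_base}.bin"
```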
````diff
@@ -366,24 +368,28 @@ This can result in significant speedup in encoder performance. Here are the inst
 After downloading & extracting package onto your development system, set up required environment by sourcing setupvars script. For example:
 
 Linux:
+
 ```bash
 source /path/to/l_openvino_toolkit_ubuntu22_2023.0.0.10926.b4452d56304_x86_64/setupvars.sh
 ```
 
 Windows (cmd):
-```
+
+```powershell
 C:\Path\To\w_openvino_toolkit_windows_2023.0.0.10926.b4452d56304_x86_64\setupvars.bat
 ```
 
 And then build the project using cmake:
+
 ```bash
 cmake -B build -DWHISPER_OPENVINO=1
 cmake --build build -j --config Release
 ```
 
 - Run the examples as usual. For example:
-```bash
-./main -m models/ggml-base.en.bin -f samples/jfk.wav
+
+```text
+$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
 
 ...
````
````diff
@@ -434,7 +440,6 @@ cmake -B build -DWHISPER_CLBLAST=ON
 cmake --build build -j --config Release
 ```
 
-
 Run all the examples as usual.
 
 ## BLAS CPU support via OpenBLAS
````
````diff
@@ -452,10 +457,12 @@ WHISPER_OPENBLAS=1 make -j
 ## Docker
 
 ### Prerequisites
-* Docker must be installed and running on your system.
-* Create a folder to store big models & intermediate files (ex. /whisper/models)
+
+- Docker must be installed and running on your system.
+- Create a folder to store big models & intermediate files (ex. /whisper/models)
 
 ### Images
+
 We have two Docker images available for this project:
 
 1. `ghcr.io/ggerganov/whisper.cpp:main`: This image includes the main executable file as well as `curl` and `ffmpeg`. (platforms: `linux/amd64`, `linux/arm64`)
````
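Tying the two Docker prerequisites together, here is an illustrative sketch (the mount path and `docker run` flags are my assumptions, not part of this commit): compose a command that mounts the models folder from the prerequisites into the `main` image.

```shell
# Compose (but do not run) a docker command mounting the host models folder.
models_dir="/whisper/models"
image="ghcr.io/ggerganov/whisper.cpp:main"
cmd="docker run -it --rm -v ${models_dir}:/models ${image}"
echo "$cmd"
```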
````diff
@@ -491,7 +498,7 @@ in about half a minute on a MacBook M1 Pro, using `medium.en` model:
 <details>
 <summary>Expand to see the result</summary>
 
-```java
+```text
 $ ./main -m models/ggml-medium.en.bin -f samples/gb1.wav -t 8
 
 whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
@@ -563,6 +570,7 @@ whisper_print_timings: encode time = 18665.10 ms / 9 runs ( 2073.90 ms per
 whisper_print_timings: decode time = 13090.93 ms / 549 runs ( 23.85 ms per run)
 whisper_print_timings: total time = 32733.52 ms
 ```
+
 </details>
 
 ## Real-time audio input example
````
````diff
@@ -571,7 +579,7 @@ This is a naive example of performing real-time inference on audio from your mic
 The [stream](examples/stream) tool samples the audio every half a second and runs the transcription continuously.
 More info is available in [issue #10](https://github.com/ggerganov/whisper.cpp/issues/10).
 
-```java
+```bash
 make stream
 ./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
 ```
````
````diff
@@ -583,7 +591,7 @@ https://user-images.githubusercontent.com/1991296/194935793-76afede7-cfa8-48d8-a
 Adding the `--print-colors` argument will print the transcribed text using an experimental color coding strategy
 to highlight words with high or low confidence:
 
-```java
+```bash
 ./main -m models/ggml-base.en.bin -f samples/gb0.wav --print-colors
 ```
````

````diff
@@ -593,8 +601,8 @@ to highlight words with high or low confidence:
 
 For example, to limit the line length to a maximum of 16 characters, simply add `-ml 16`:
 
-```java
-./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16
+```text
+$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16
 
 whisper_model_load: loading model from './models/ggml-base.en.bin'
 ...
````
````diff
@@ -617,8 +625,8 @@ main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 pr
 
 The `--max-len` argument can be used to obtain word-level timestamps. Simply use `-ml 1`:
 
-```java
-./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
+```text
+$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
 
 whisper_model_load: loading model from './models/ggml-base.en.bin'
 ...
````
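When post-processing word-level output like the above, the `HH:MM:SS.mmm` timestamps often need to become plain numbers. A bash sketch (the conversion helper is mine, not part of the commit) turning one timestamp into milliseconds:

```shell
# Convert a whisper.cpp timestamp (HH:MM:SS.mmm) to milliseconds using
# parameter expansion; 10# forces base-10 despite leading zeros (bash).
ts="00:00:11.000"
h=${ts%%:*}; rest=${ts#*:}
m=${rest%%:*}; s=${rest#*:}
sec=${s%.*};  ms=${s#*.}
total_ms=$(( (10#$h * 3600 + 10#$m * 60 + 10#$sec) * 1000 + 10#$ms ))
echo "$total_ms"
```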
````diff
@@ -688,7 +696,7 @@ This requires to have `ffmpeg` installed.
 
 Here are a few *"typical"* examples:
 
-```java
+```bash
 ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
 source ./samples/jfk.wav.wts
 ffplay ./samples/jfk.wav.mp4
````
````diff
@@ -698,7 +706,7 @@ https://user-images.githubusercontent.com/1991296/199337465-dbee4b5e-9aeb-48a3-b
 
 ---
 
-```java
+```bash
 ./main -m ./models/ggml-base.en.bin -f ./samples/mm0.wav -owts
 source ./samples/mm0.wav.wts
 ffplay ./samples/mm0.wav.mp4
````
````diff
@@ -708,7 +716,7 @@ https://user-images.githubusercontent.com/1991296/199337504-cc8fd233-0cb7-4920-9
 
 ---
 
-```java
+```bash
 ./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav -owts
 source ./samples/gb0.wav.wts
 ffplay ./samples/gb0.wav.mp4
````
````diff
@@ -722,7 +730,7 @@ https://user-images.githubusercontent.com/1991296/199337538-b7b0c7a3-2753-4a88-a
 
 Use the [extra/bench-wts.sh](https://github.com/ggerganov/whisper.cpp/blob/master/extra/bench-wts.sh) script to generate a video in the following format:
 
-```java
+```bash
 ./extra/bench-wts.sh samples/jfk.wav
 ffplay ./samples/jfk.wav.all.mp4
 ```
````
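The script above writes its video next to the input sample; a tiny sketch of the output name it produces for a given input (pattern taken from the `ffplay` line in the diff):

```shell
# bench-wts.sh output naming: <input>.all.mp4 next to the input file.
sample="samples/jfk.wav"
video="${sample}.all.mp4"
echo "$video"
```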
````diff
@@ -751,8 +759,7 @@ It is written in python with the intention of being easy to modify and extend fo
 
 It outputs a csv file with the results of the benchmarking.
 
-
-## ggml format
+## `ggml` format
 
 The original models are converted to a custom binary format. This allows to pack everything needed into a single file:
````

````diff
@@ -767,51 +774,50 @@ or manually from here:
 - https://huggingface.co/ggerganov/whisper.cpp
 - https://ggml.ggerganov.com
 
-For more details, see the conversion script [models/convert-pt-to-ggml.py](models/convert-pt-to-ggml.py) or the README
-in [models](models).
+For more details, see the conversion script [models/convert-pt-to-ggml.py](models/convert-pt-to-ggml.py) or [models/README.md](models/README.md).
 
 ## [Bindings](https://github.com/ggerganov/whisper.cpp/discussions/categories/bindings)
 
-- [X] Rust: [tazz4843/whisper-rs](https://github.com/tazz4843/whisper-rs) | [#310](https://github.com/ggerganov/whisper.cpp/discussions/310)
-- [X] JavaScript: [bindings/javascript](bindings/javascript) | [#309](https://github.com/ggerganov/whisper.cpp/discussions/309)
+- [x] Rust: [tazz4843/whisper-rs](https://github.com/tazz4843/whisper-rs) | [#310](https://github.com/ggerganov/whisper.cpp/discussions/310)
+- [x] JavaScript: [bindings/javascript](bindings/javascript) | [#309](https://github.com/ggerganov/whisper.cpp/discussions/309)
   - React Native (iOS / Android): [whisper.rn](https://github.com/mybigday/whisper.rn)
-- [X] Go: [bindings/go](bindings/go) | [#312](https://github.com/ggerganov/whisper.cpp/discussions/312)
-- [X] Java:
+- [x] Go: [bindings/go](bindings/go) | [#312](https://github.com/ggerganov/whisper.cpp/discussions/312)
+- [x] Java:
   - [GiviMAD/whisper-jni](https://github.com/GiviMAD/whisper-jni)
-- [X] Ruby: [bindings/ruby](bindings/ruby) | [#507](https://github.com/ggerganov/whisper.cpp/discussions/507)
-- [X] Objective-C / Swift: [ggerganov/whisper.spm](https://github.com/ggerganov/whisper.spm) | [#313](https://github.com/ggerganov/whisper.cpp/discussions/313)
+- [x] Ruby: [bindings/ruby](bindings/ruby) | [#507](https://github.com/ggerganov/whisper.cpp/discussions/507)
+- [x] Objective-C / Swift: [ggerganov/whisper.spm](https://github.com/ggerganov/whisper.spm) | [#313](https://github.com/ggerganov/whisper.cpp/discussions/313)
   - [exPHAT/SwiftWhisper](https://github.com/exPHAT/SwiftWhisper)
-- [X] .NET: | [#422](https://github.com/ggerganov/whisper.cpp/discussions/422)
+- [x] .NET: | [#422](https://github.com/ggerganov/whisper.cpp/discussions/422)
   - [sandrohanea/whisper.net](https://github.com/sandrohanea/whisper.net)
   - [NickDarvey/whisper](https://github.com/NickDarvey/whisper)
-- [X] Python: | [#9](https://github.com/ggerganov/whisper.cpp/issues/9)
+- [x] Python: | [#9](https://github.com/ggerganov/whisper.cpp/issues/9)
   - [stlukey/whispercpp.py](https://github.com/stlukey/whispercpp.py) (Cython)
   - [aarnphm/whispercpp](https://github.com/aarnphm/whispercpp) (Pybind11)
-- [X] R: [bnosac/audio.whisper](https://github.com/bnosac/audio.whisper)
-- [X] Unity: [macoron/whisper.unity](https://github.com/Macoron/whisper.unity)
+- [x] R: [bnosac/audio.whisper](https://github.com/bnosac/audio.whisper)
+- [x] Unity: [macoron/whisper.unity](https://github.com/Macoron/whisper.unity)
 
 ## Examples
 
 There are various examples of using the library for different projects in the [examples](examples) folder.
 Some of the examples are even ported to run in the browser using WebAssembly. Check them out!
 
-| Example | Web | Description |
-| --- | --- | --- |
-| [main](examples/main) | [whisper.wasm](examples/whisper.wasm) | Tool for translating and transcribing audio using Whisper |
-| [bench](examples/bench) | [bench.wasm](examples/bench.wasm) | Benchmark the performance of Whisper on your machine |
-| [stream](examples/stream) | [stream.wasm](examples/stream.wasm) | Real-time transcription of raw microphone capture |
-| [command](examples/command) | [command.wasm](examples/command.wasm) | Basic voice assistant example for receiving voice commands from the mic |
-| [wchess](examples/wchess) | [wchess.wasm](examples/wchess) | Voice-controlled chess |
-| [talk](examples/talk) | [talk.wasm](examples/talk.wasm) | Talk with a GPT-2 bot |
-| [talk-llama](examples/talk-llama) | | Talk with a LLaMA bot |
-| [whisper.objc](examples/whisper.objc) | | iOS mobile application using whisper.cpp |
-| [whisper.swiftui](examples/whisper.swiftui) | | SwiftUI iOS / macOS application using whisper.cpp |
-| [whisper.android](examples/whisper.android) | | Android mobile application using whisper.cpp |
-| [whisper.nvim](examples/whisper.nvim) | | Speech-to-text plugin for Neovim |
-| [generate-karaoke.sh](examples/generate-karaoke.sh) | | Helper script to easily [generate a karaoke video](https://youtu.be/uj7hVta4blM) of raw audio capture |
-| [livestream.sh](examples/livestream.sh) | | [Livestream audio transcription](https://github.com/ggerganov/whisper.cpp/issues/185) |
-| [yt-wsp.sh](examples/yt-wsp.sh) | | Download + transcribe and/or translate any VOD [(original)](https://gist.github.com/DaniruKun/96f763ec1a037cc92fe1a059b643b818) |
-| [server](examples/server) | | HTTP transcription server with OAI-like API |
+| Example | Web | Description |
+| --------------------------------------------------- | ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
+| [main](examples/main) | [whisper.wasm](examples/whisper.wasm) | Tool for translating and transcribing audio using Whisper |
+| [bench](examples/bench) | [bench.wasm](examples/bench.wasm) | Benchmark the performance of Whisper on your machine |
+| [stream](examples/stream) | [stream.wasm](examples/stream.wasm) | Real-time transcription of raw microphone capture |
+| [command](examples/command) | [command.wasm](examples/command.wasm) | Basic voice assistant example for receiving voice commands from the mic |
+| [wchess](examples/wchess) | [wchess.wasm](examples/wchess) | Voice-controlled chess |
+| [talk](examples/talk) | [talk.wasm](examples/talk.wasm) | Talk with a GPT-2 bot |
+| [talk-llama](examples/talk-llama) | | Talk with a LLaMA bot |
+| [whisper.objc](examples/whisper.objc) | | iOS mobile application using whisper.cpp |
+| [whisper.swiftui](examples/whisper.swiftui) | | SwiftUI iOS / macOS application using whisper.cpp |
+| [whisper.android](examples/whisper.android) | | Android mobile application using whisper.cpp |
+| [whisper.nvim](examples/whisper.nvim) | | Speech-to-text plugin for Neovim |
+| [generate-karaoke.sh](examples/generate-karaoke.sh) | | Helper script to easily [generate a karaoke video](https://youtu.be/uj7hVta4blM) of raw audio capture |
+| [livestream.sh](examples/livestream.sh) | | [Livestream audio transcription](https://github.com/ggerganov/whisper.cpp/issues/185) |
+| [yt-wsp.sh](examples/yt-wsp.sh) | | Download + transcribe and/or translate any VOD [(original)](https://gist.github.com/DaniruKun/96f763ec1a037cc92fe1a059b643b818) |
+| [server](examples/server) | | HTTP transcription server with OAI-like API |
 
 ## [Discussions](https://github.com/ggerganov/whisper.cpp/discussions)
````

bindings/javascript/README.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -41,7 +41,7 @@ make publish-npm
 
 ## Sample run
 
-```java
+```text
 $ node --experimental-wasm-threads --experimental-wasm-simd ../tests/test-whisper.js
 
 whisper_model_load: loading model from 'whisper.bin'
@@ -63,7 +63,7 @@ whisper_model_load: ggml ctx size = 140.60 MB
 whisper_model_load: memory size = 22.83 MB
 whisper_model_load: model size = 140.54 MB
 
-system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 1 | BLAS = 0 |
+system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 1 | BLAS = 0 |
 
 operator(): processing 176000 samples, 11.0 sec, 8 threads, 1 processors, lang = en, task = transcribe ...
````
