Proposal name
A web extension to inject an AI API into the window when installed, so we can quickly prototype and explore.
Short description
We’d like to explore use cases that require finer-grained control over local inference than current proposals are able to provide, based on their very reasonable design constraints.
Being able to specify and configure an exact model from a model hub would unlock a number of capabilities (LoRA, tuning parameters or prompts to a known target, features beyond text generation, providing your own model or quickly trying new releases). We think those capabilities will enable developers to build new and interesting stuff with local inference, which will help inform standards work.
Developers can do this today with WASM and WebGPU libraries, but a challenge is the lack of cross-origin storage for large model weight files (and for the libraries themselves). Browsers also have limited ability to optimize runtime performance in that setup.
A web extension could fill some of the gaps today and give us a way to quickly prototype and explore. The basic mechanism would be to inject an API into the window when installed; pages that want to use it could show a banner with an install link to the relevant browser add-on site.
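As a rough sketch, a page could feature-detect the injected API along these lines (the localInferenceExtension namespace matches the examples further down; the page helpers and the add-on URL are placeholders, not part of any proposed design):
// check for the injected API and fall back to an install banner
if (window.localInferenceExtension) {
  enableLocalInferenceFeatures(); // hypothetical page code
} else {
  // point users to the relevant browser add-on site
  showInstallBanner("https://addons.example/local-inference"); // hypothetical helper, placeholder URL
}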
The extension would manage model fetching, cross-origin storage, and user permissions/management of model weights, and it would provide a runtime API for inference itself using transformers.js / onnx, as in Mozilla’s existing web extension API.
It would also be cross-browser, bundling all necessary dependencies so that no changes are required from browser vendors to support it, though engines could provide optimizations beyond what a normal library can do if they’d like.
We expect this would be appealing to a niche set of enthusiasts, where asking users to install an extension is workable, and could help us quickly learn about real world use cases to inform API design. It would also be a chance to highlight the functionality available with local inference today by making it easy to build and share compelling demos on the web.
The API design would be left up to anyone interested in contributing. We would want to take lessons from ongoing work within the group and share back where it makes sense. Mozilla also has a web extension that provides a transformers-shaped API, with pipelines and default models for each task, which we could borrow from.
Example use cases
The examples below assume the shape of the API matches the one we've implemented in the WebExtension trial AI API, but this is completely open to discussion.
Example 1: running a chatbot on several websites
The ACME company runs a fine-tuned Llama3 chatbot on all its websites.
The user is logged in, and the prompt is built with some information from the browsing context and some provided by a server-side API.
The chatbot window is updated with a stream of tokens generated by the API.
// get notified of progress as model files are downloaded
window.localInferenceExtension.onProgress.addListener(progressData => {
  progressBar.update(progressData);
});
// create the engine; may trigger the initial download of the model, which is then shared across all websites
const chatbot = await window.localInferenceExtension.createEngine({
  taskName: "text-generation",
  modelName: "acme/llama3",
  device: "gpu"
});
// build the prompt (enriched with server-side info, like browsing activity from other websites)
const prompt = buildPrompt(pageContent, currentChatText, serverContext);
// Create a text streamer that forwards generated text to the chat window
const streamer = new window.localInferenceExtension.TextStreamer(chatbot.tokenizer, {
  callback_function: (text) => updateChatText(text)
});
// Generate the answer through the text streamer
await chatbot(prompt, { streamer });
Example 2: a named entity recognizer
The website displays a list of search results and wants to run a small Named Entity Recognition model on the client side to highlight the detected entities.
// create the engine; may trigger model downloads
const classifier = await window.localInferenceExtension.createEngine({
  taskName: "token-classification",
  modelName: "dslim/bert-base-NER",
  modelHub: "huggingface"
});
// run the model on the page content
const output = await classifier(pageContent);
// highlight the detected entities in the text
highlightNERInText(output);
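For illustration, one possible sketch of the highlighting helper above, assuming the output resembles transformers.js token-classification results (an array of objects with word, entity, and score fields); both helpers here are hypothetical page code, not part of the proposed API:
function highlightNERInText(output) {
  for (const token of output) {
    // skip tokens the model did not tag as an entity
    if (token.entity === "O") continue;
    // wrap the recognized entity in a highlight element (DOM details omitted)
    markEntityInPage(token.word, token.entity); // hypothetical helper
  }
}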
Example 3: a local TTS
The website runs a translation model, then a TTS model, for visually impaired Klingon readers -- the model is stored on the website itself.
// create the TTS engine
const synthesizer = await window.localInferenceExtension.createEngine({
  taskName: "text-to-speech",
  modelName: "star-trek-fans/klingon_tts"
});
// translate the text into Klingon (naive loop over each sentence)
for (const sentence of translateText(extractText(page))) {
  // run TTS on the sentence
  const out = await synthesizer(sentence);
  // play it
  const audioElement = new Audio(URL.createObjectURL(out));
  audioElement.play();
}
A rough idea or two about implementation
The web extension takes care of the lifecycle of the runtime(s) and deals with downloading and storing model files. It embeds transformers.js, so it will work without any intervention from the browser, but a browser can implement its own optimizations if desired (for example, Firefox could use a compatible native backend for transformers.js while exposing the same API).
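As a very rough sketch of what the injection side might look like, assuming an MV3-style content script running in the page's main world ("world": "MAIN" where supported); the file names, stubbed bodies, and message plumbing are placeholders rather than a proposed design:
// manifest.json (excerpt): run a content script in the page's main world so it can attach the API to window
// {
//   "content_scripts": [
//     { "matches": ["<all_urls>"], "js": ["inject.js"], "world": "MAIN" }
//   ]
// }

// inject.js: expose a thin proxy on window; downloads, storage and inference
// would happen in the extension's background context via message passing
window.localInferenceExtension = {
  onProgress: {
    addListener(callback) {
      // subscribe to download-progress events relayed from the background script
    }
  },
  async createEngine(options) {
    // forward the request to the background script, wait for the model to be
    // ready, and return a callable engine backed by transformers.js / onnx
  },
  TextStreamer: class {
    // wraps the transformers.js TextStreamer so pages can stream generated text
  }
};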