01mf02
Contributor

@01mf02 01mf02 commented Jul 31, 2025

To quote the regex_lite docs:

> Currently, this crate only supports searching &str. It does not have APIs for searching &[u8] haystacks, although it is planned to add these in the future if there’s demand.

This PR adds such an API in the form of regex_lite::bytes::Regex, mirroring regex::bytes::Regex.
Thanks, @BurntSushi, for leaving some breadcrumbs in the source code that made this easier, in particular your hint in interpolate::bytes. This is much appreciated, as is all the effort that went not only into the careful and beautiful API design, but also into your very detailed and user-friendly documentation. Thanks a lot for your awesome work.

I made this PR because I need this functionality as part of jaq, where I want to transition from String to (a variation of) Vec<u8>, inspired by your "UTF-8 by convention" idea.
I did not cargo fmt this PR in order to make reviewing as easy as possible.
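
To illustrate why a &[u8] search API matters for data that is UTF-8 only "by convention", here is a minimal std-only sketch. It does not use regex_lite at all; the helper name `as_str_haystack` is hypothetical and only shows the validation step that a &str-only API forces on the caller.

```rust
use std::str;

/// Hypothetical helper: returns the haystack as `&str` if it is valid UTF-8,
/// which a `&str`-only search API requires before it can run at all.
/// On failure, reports how many leading bytes were valid UTF-8.
fn as_str_haystack(bytes: &[u8]) -> Result<&str, usize> {
    str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}

fn main() {
    // "café " followed by a lone 0xFF byte, which never occurs in valid UTF-8.
    let mostly_text: &[u8] = b"caf\xC3\xA9 \xFF";

    assert_eq!(as_str_haystack(b"plain ascii"), Ok("plain ascii"));
    // Only the first 6 bytes ("café ") are valid, so no `&str` can be formed.
    assert_eq!(as_str_haystack(mostly_text), Err(6));

    // A lossy conversion makes the data searchable as text, but it allocates
    // and rewrites the invalid byte to U+FFFD, so offsets no longer line up.
    let lossy = String::from_utf8_lossy(mostly_text);
    assert_eq!(lossy, "café \u{FFFD}");
}
```

With a bytes-oriented API, the haystack could be searched as-is, without validating or rewriting it first.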

(Note to self: If I ever have to convert a lot of b"foo" to &b"foo"[..] again, this is how to do it in vim for all matches on the current line: :s/b"\([^"]*\)"/\&b"\1"[..]/g.)
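
For context on that substitution, here is a small sketch of one plausible reason such a rewrite is needed; the thread does not state the actual reason, and `first_word` is a hypothetical helper. A byte string literal like b"foo" has the array type &[u8; 3], which does not always coerce to &[u8] where a comparison expects a slice.

```rust
/// Hypothetical helper: returns the prefix of `haystack` before the first
/// space, as a byte slice.
fn first_word(haystack: &[u8]) -> Option<&[u8]> {
    let end = haystack.iter().position(|&b| b == b' ')?;
    Some(&haystack[..end])
}

fn main() {
    let found = first_word(b"foo bar");

    // `Some(b"foo")` is an `Option<&[u8; 3]>`, and `Option<T>` only compares
    // against `Option<T>` of the same `T`, so this line is rejected:
    // assert_eq!(found, Some(b"foo"));

    // Slicing the literal yields `&[u8]`, which matches the return type.
    assert_eq!(found, Some(&b"foo"[..]));

    // Outside of a wrapper like `Option`, slice and array compare directly.
    assert_eq!(&b"foo"[..], b"foo");
    assert_eq!(first_word(b"nospace"), None);
}
```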

@01mf02
Contributor Author

01mf02 commented Oct 13, 2025

May I ask if there is any interest in merging this? Or is there some conflict?

I'm currently depending on my regex_lite fork in jaq via [patch.crates-io], but this way, I cannot publish my crate to crates.io. I'm considering publishing my fork on crates.io, but I'd like to avoid this if possible, because I do not see myself maintaining such a fork in the long run, particularly if there is no clear prospect of this functionality being merged upstream. Also, I'm not sure whether my approach has any problems that I'm not aware of, given that there has been no response to this PR so far.

The other alternative I see is to move to regex in jaq, but then, this would require caching in order to make up for its slower regex compilation speed (compared to regex_lite). And caching regexes in my scenario would be quite involved.

@BurntSushi
Member

Hi! Sorry for the late reply. Yes, this is definitely something I would like to add. But my review bandwidth is very small. I can't give you a timeline unfortunately.

For jaq specifically, I'm surprised that the search performance of regex-lite is good enough. It's a lot slower on that dimension. I'm curious if you could say more about how you're using regexes.

@01mf02
Contributor Author

01mf02 commented Oct 13, 2025

Hi @BurntSushi, thanks for answering so quickly! :) I understand about your review bandwidth.

I started using regex-lite in jaq after noticing that it reduces runtime drastically, especially for small haystacks / regexes; see 01mf02/jaq@0840b1b for a small example. This made a large difference for running jqjq, a jq interpreter written in jq itself, which uses quite a few regexes, but usually small ones.

regex-lite suits jaq well because, by jaq's design, I cannot easily cache regexes: at the point where I compile a regex, I have no place to store it for later use.

I did, however, once run an experiment in which I stored compiled regexes in a thread-local LRU cache, but this was only slightly faster than recompiling regexes with regex-lite every single time. While I could probably use this approach to get decent performance with regex, it would introduce a dependency on std, because std::thread_local is not in core.
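
For readers unfamiliar with the idea, a thread-local LRU cache of compiled regexes could look roughly like the sketch below. This is not the code from the experiment, which is not shown in this thread: `Compiled`, `compile`, `cached`, and `CAPACITY` are hypothetical stand-ins, and a real version would store the regex engine's own type instead of `Compiled`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Stand-in for a compiled regex (hypothetical type).
struct Compiled {
    pattern: String,
}

/// Stand-in for the expensive compilation step.
fn compile(pattern: &str) -> Compiled {
    Compiled { pattern: pattern.to_owned() }
}

/// Maximum number of cached entries (kept tiny for the demonstration).
const CAPACITY: usize = 2;

thread_local! {
    // Entries are ordered from least to most recently used.
    // A linear scan over a Vec is fine for a small capacity.
    static CACHE: RefCell<Vec<(String, Rc<Compiled>)>> = RefCell::new(Vec::new());
}

/// Returns the cached regex for `pattern`, compiling it on a miss.
fn cached(pattern: &str) -> Rc<Compiled> {
    CACHE.with(|cache| {
        let mut cache = cache.borrow_mut();
        if let Some(pos) = cache.iter().position(|(p, _)| p == pattern) {
            // Hit: move the entry to the most-recently-used end.
            let entry = cache.remove(pos);
            let re = Rc::clone(&entry.1);
            cache.push(entry);
            return re;
        }
        // Miss: evict the least recently used entry if the cache is full.
        if cache.len() == CAPACITY {
            cache.remove(0);
        }
        let re = Rc::new(compile(pattern));
        cache.push((pattern.to_owned(), Rc::clone(&re)));
        re
    })
}

fn main() {
    let first = cached("^-=");
    let second = cached("^-=");
    assert!(Rc::ptr_eq(&first, &second)); // second lookup reuses the entry

    cached("b");
    cached("c"); // cache is full, so "^-=" is evicted here
    let third = cached("^-=");
    assert!(!Rc::ptr_eq(&first, &third)); // recompiled after eviction
    assert_eq!(third.pattern, "^-=");
}
```

The `thread_local!` macro is what ties this approach to std, which is the dependency concern raised above.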

Of course, I could also work around that somehow, but all in all, I really like the idea behind a "lite" regex implementation. :) It fits the ideals of jaq.

@01mf02
Contributor Author

01mf02 commented Oct 13, 2025

For examples of regexes that appear within jqjq, see https://github.com/wader/jqjq/blob/736b06d8a6a2093d8a41527e67d7439247756803/jqjq.jq#L122 and the following 100 lines.

@BurntSushi
Member

OK, so I want to push back a little here. Please don't interpret this as pushing back against this PR. I'm mostly just trying to offer unsolicited help with your use case. :-) If you don't want that, I'm happy to back off. With that said, consider these things as "have you considered" or "have you tried." You understand your use cases much better than I do.

So first, I do want to say this: if your use is indeed that you are:

  1. Compiling regexes without an easy or low cost way to cache them.
  2. ... and the haystacks you are searching are typically very small...

then yeah, regex-lite could absolutely be a win for you. Because:

  1. If you compile a regex once for each search, then unless the haystack is quite large, it's likely that compiling the regex will dominate the runtime.
  2. If you are searching small haystacks, then even if you have a very fast regex engine (like regex over regex-lite), then the potential gains may be marginal even if you remove compilation of a regex from consideration (as via the LRU cache you tried). It's like comparing my Toyota Camry to a fast race car. On a small windy mountain pass, there won't be much difference. But if you put them on the freeway, the difference will be much clearer.
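
The trade-off in the two points above can be written down as a toy cost model: total cost = compile cost + per-byte search cost × haystack length. The sketch below is only an illustration. The `Engine` struct and every number in it are invented for the example and are not measurements of regex or regex-lite.

```rust
/// Per-engine cost parameters in nanoseconds (hypothetical values only).
#[derive(Clone, Copy)]
struct Engine {
    compile_ns: u64,
    search_ns_per_byte: u64,
}

impl Engine {
    /// Cost of compiling the regex once and then searching one haystack.
    fn total_ns(self, haystack_len: u64) -> u64 {
        self.compile_ns + self.search_ns_per_byte * haystack_len
    }
}

/// Smallest haystack length at which `fast_search` is strictly cheaper than
/// `fast_compile`, or `None` if its per-byte search cost is not lower.
fn break_even_len(fast_compile: Engine, fast_search: Engine) -> Option<u64> {
    if fast_search.search_ns_per_byte >= fast_compile.search_ns_per_byte {
        return None;
    }
    if fast_search.compile_ns < fast_compile.compile_ns {
        return Some(0); // cheaper on both axes, so it wins immediately
    }
    let extra_compile = fast_search.compile_ns - fast_compile.compile_ns;
    let saved_per_byte = fast_compile.search_ns_per_byte - fast_search.search_ns_per_byte;
    // Smallest integer strictly greater than extra_compile / saved_per_byte.
    Some(extra_compile / saved_per_byte + 1)
}

fn main() {
    // Invented numbers: one engine compiles fast, the other searches fast.
    let lite = Engine { compile_ns: 600, search_ns_per_byte: 10 };
    let full = Engine { compile_ns: 15_000, search_ns_per_byte: 1 };

    assert_eq!(break_even_len(lite, full), Some(1601));
    // On a 100-byte haystack, compilation dominates and `lite` wins.
    assert!(lite.total_ns(100) < full.total_ns(100));
    // Past the break-even point, faster search pays for slower compilation.
    assert!(full.total_ns(1601) < lite.total_ns(1601));
}
```

Under this model, compiling once per search only favors the faster search engine when haystacks are long enough to pass the break-even length.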

Indeed, if you look at my regex engine benchmarks, you'll see that regex-lite is at the bottom in terms of search performance, but near the top in terms of compile time. regex is almost flipped, although it's more middle of the pack in terms of compile time.

So given all of that, one thing I'd ask is: how confident are you that your haystacks are nearly always short? Do you have benchmarks where the haystacks are longer? The jqjq benchmark seems like a nice optimization target, but I'd guess probably not a good real world model? Not sure though.

As for the jqjq benchmark, one thing to note is that the interpretation of some of those regexes will differ between the default configurations of regex and regex-lite and some of these differences may impact compile times. For example, \s in regex includes Unicode whitespace but regex-lite does not. I'm not sure if that difference was accounted for or not (and whether it is actually correct). In any case, there are some knobs you can twiddle with regex that may make compilation time faster. But you'll need to drop down to regex-automata and use its meta::Regex:

  • Disable Config::onepass, since this is generally only useful in a subset of cases that involve capture groups. When enabled, there is some cost associated with determining whether a particular regex can use this optimization at all.
  • Disable Config::dfa. For small regexes, enabling this will cause a full DFA to be built. This can also be costly. (It's actually disabled by default for regex. But if you drop down to regex-automata in its default configuration, it will be enabled by default.)
  • Via meta::Builder::syntax, ensure that syntax::Config::unicode is disabled. This is what's happening anyway with regex-lite.

Disabling these options should also imply that you can disable some crate features of regex-automata to reduce compilation time and binary sizes.
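
To make the \s difference mentioned above concrete, here is a std-only sketch of Unicode-aware versus ASCII-only whitespace handling. It uses Rust's char predicates as a stand-in, which is an assumption for illustration: the exact character sets that regex and regex-lite use for \s may differ slightly from these predicates.

```rust
/// Splits on any Unicode whitespace character.
fn split_unicode(haystack: &str) -> Vec<&str> {
    haystack.split(char::is_whitespace).collect()
}

/// Splits on ASCII whitespace only.
fn split_ascii(haystack: &str) -> Vec<&str> {
    haystack.split(|c: char| c.is_ascii_whitespace()).collect()
}

fn main() {
    let em_space = '\u{2003}'; // EM SPACE, whitespace in Unicode but not ASCII
    assert!(em_space.is_whitespace());
    assert!(!em_space.is_ascii_whitespace());

    // The same haystack is split differently depending on the definition.
    let haystack = "a\u{2003}b c";
    assert_eq!(split_unicode(haystack), ["a", "b", "c"]);
    assert_eq!(split_ascii(haystack), ["a\u{2003}b", "c"]);
}
```

A Unicode-aware class covers far more code points than an ASCII-only one, which is part of why Unicode mode can make compilation more expensive.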

@BurntSushi
Member

Also, off topic, but there may be some small perf wins by switching chrono to jiff. You would use Timestamp instead of DateTime<Utc>. And maybe Zoned instead of DateTime<FixedOffset>.

@01mf02
Contributor Author

01mf02 commented Oct 14, 2025

Thanks a lot for your very detailed advice, @BurntSushi!

First of all, I do not really know on what kinds of inputs jaq users run regexes. The jqjq example was the first stress test for the regex routines in jaq, which is why I have tried to make its performance good.
To adapt your Toyota Camry analogy, I think that jaq mostly behaves like a squirrel: its algorithms generally start fast, at the expense of a low top speed. jaq, as it stands, also has no heuristics in it AFAIK, which makes its behaviour quite predictable. I'm therefore eager to include algorithms that share the same characteristics, to create a coherent whole.

I have tried your suggestions for regex and made a little benchmark to measure regex compilation performance, via cargo new regex-perf and cargo add regex-lite regex-automata regex. I left all default features enabled.
The comments indicate the runtime of each regex implementation (measured with cargo build --release && time target/release/regex-perf):

fn main() {
    let re = r"^-=";
    // Compile the same regex 100,000 times.
    // Exactly one variant is uncommented per run; the others stay disabled.
    for _ in 0..100_000 {
        // regex_lite: 0.062s
        //let _compiled = regex_lite::Regex::new(re).unwrap();

        // regex_automata, PikeVM only: 0.244s
        use regex_automata::{nfa::thompson, util::syntax};
        let _compiled = thompson::pikevm::PikeVM::builder()
            .syntax(syntax::Config::new().utf8(false))
            .thompson(thompson::Config::new().utf8(false))
            .build(re)
            .unwrap();

        // regex_automata, meta::Regex without Unicode, onepass and DFA: 0.635s
        /*
        use regex_automata::{meta, util::syntax};
        let _compiled = meta::Regex::builder()
            .syntax(syntax::Config::new().unicode(false))
            .configure(meta::Config::new().onepass(false).dfa(false))
            .build(re)
            .unwrap();
        */

        // regex: 1.500s
        //let _compiled = regex::Regex::new(re).unwrap();
    }
}

The outcome is that regex_lite is by far the fastest. It is followed by the PikeVM in regex_automata, even though that is about 4 times slower! The meta automaton is more than twice as slow as the PikeVM.
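
The numbers above come from timing the whole process with `time`, which also includes process startup. As a sketch of an alternative, the loop can be timed in-process with std only. The helper `time_builds` is hypothetical, and the closure passed to it in `main` is a stand-in workload that a real benchmark would replace with one of the regex constructors.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `build` `iterations` times and returns the total elapsed time.
fn time_builds<T>(iterations: u32, mut build: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        // `black_box` discourages the optimizer from discarding the result.
        black_box(build());
    }
    start.elapsed()
}

fn main() {
    let pattern = r"^-=";
    let iterations = 100_000;

    // Stand-in workload: copies the pattern instead of compiling a regex.
    let total = time_builds(iterations, || black_box(pattern).to_owned());

    println!("total: {total:?}, per build: {:?}", total / iterations);
}
```

Measuring inside the process makes it easier to compare per-compilation costs directly, although a dedicated benchmark harness would give more reliable results.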

I tried to find out the reason for the comparatively slow compilation performance of the PikeVM in regex_automata, so I made a flamegraph of the compilation:

[Image: flamegraph of PikeVM compilation in regex_automata]

The only thing that stands out to me is the large amount of time taken by ByteClassSet::byte_classes, amounting to about 15% of total execution time. A large part of that is taken by ByteSet::contains (11% of total runtime). Perhaps that might benefit from #[inline]?

Still, this alone does not explain why regex_lite is 4 times faster ... If you have any clue about how to match regex_lite's compilation performance with regex_automata, I'd be very interested.

I'll try to integrate regex_automata's PikeVM into jaq and see how the performance of jqjq evolves with that.
And I'll also look into jiff, thanks for the tip. :)

@01mf02
Contributor Author

01mf02 commented Oct 14, 2025

I have now made a prototype that (re-)integrates regex into jaq, using a PikeVM as the underlying engine.
Running jqjq with this prototype is about twice as slow as with regex-lite.
See 01mf02/jaq#342 (comment) for details.

If you have any idea which knobs I could turn to make this better, I'm all ears. :)

@BurntSushi
Member

BurntSushi commented Oct 14, 2025

Thanks for trying out all my ideas! Using just the PikeVM from regex-automata probably isn't worth it. It's going to be slower to compile than regex-lite (as you discovered) and likely not much faster at search time. The true test would be to use a meta::Regex and disable everything except for the PikeVM. That would give you literal optimizations, but you'd likely suffer another compile time hit for the various things that a meta::Regex does over a PikeVM. And whether the literal optimizations make up for that is of course work-load dependent.

The thing I was mostly trying to emphasize was whether there were use cases with regexes on large haystacks. If those use cases don't exist or are vanishingly rare, then regex-lite is probably the right choice here. If all your benchmarks are tiny haystacks where every search also compiles the regex, then I don't think regex-automata can ever compete on that dimension. It can't basically by design.

As for why just compiling the PikeVM in regex-automata is still slower than regex-lite... regex-lite is simplistic to its very core. From the parser on up to the matching engine, literally everything is basically the simplest it can be without optimization. This is why it can compile regexes so fast. In contrast, regex-automata is doing a shit-load of optimizations at almost every level. Including compiling the Thompson NFA. (Indeed, there are many optimizations at that level.) Optimizations generally require more work, and regex-automata is organized around the design philosophy that regex compilation can and should be amortized.

> If you have any idea which knobs I could turn to make this better, I'm all ears. :)

It doesn't look like you disabled Unicode mode. That's critical. You disabled utf8, but that's not the same.

@01mf02
Contributor Author

01mf02 commented Oct 15, 2025

> The thing I was mostly trying to emphasize was whether there were use cases with regexes on large haystacks. If those use cases don't exist or are vanishingly rare, then regex-lite is probably the right choice here. If all your benchmarks are tiny haystacks where every search also compiles the regex, then I don't think regex-automata can ever compete on that dimension. It can't basically by design.

I have thought about the use cases, and indeed, I think that in JSON data, you generally tend to have short strings. So yes, the haystacks will be mostly tiny/small, and I suppose that regexes are used for tiny tasks such as ZIP code matching or things like that.

> As for why just compiling the PikeVM in regex-automata is still slower than regex-lite... regex-lite is simplistic to its very core. From the parser on up to the matching engine, literally everything is basically the simplest it can be without optimization. This is why it can compile regexes so fast. In contrast, regex-automata is doing a shit-load of optimizations at almost every level. Including compiling the Thompson NFA. (Indeed, there are many optimizations at that level.) Optimizations generally require more work, and regex-automata is organized around the design philosophy that regex compilation can and should be amortized.

It seems that regex-lite shares exactly the same design goals as jaq. jaq also uses the simplest possible algorithms that always produce correct results in finite time (plus perhaps a sprinkle of very low-hanging optimisation fruit on top).
That's also why I think that they are a nice match. :)

> It doesn't look like you disabled Unicode mode. That's critical. You disabled utf8, but that's not the same.

Is this what you mean by disabling Unicode mode? (I searched the regex-automata docs for "Unicode mode", and the syntax::Config::unicode function was the only one that came up.)

// `re` is the pattern string from the benchmark; both `utf8` and `unicode` are disabled.
use regex_automata::{nfa::thompson, util::syntax};
let re = thompson::pikevm::PikeVM::builder()
    .syntax(syntax::Config::new().utf8(false).unicode(false))
    .thompson(thompson::Config::new().utf8(false))
    .build(re)
    .unwrap();

This does not affect the performance at all. Whatever values I pass to utf8() and unicode(), the performance remains stable up to the millisecond. (Remember, I'm not actually searching with the regex, just compiling it.)

By the way, something that I noticed when I read your rebar README and several other posts by you, such as your regex-lite introduction on Reddit: I found it often ambiguous when you wrote about "compile times", as in:

> It's for folks that want smaller binary sizes and/or shorter compile times.

Here, I suppose that binary size refers to the size of the Rust program using regex-lite, but I'm not confident about which compile time you are referring to. It could mean "time taken by rustc to compile main.rs", or "time taken by regex-lite to compile a regex". Given that you mention binary size just before, it is natural to assume that "compile time" means "rustc time".

Given that regex-lite is both faster to compile by rustc and compiles regexes faster, I think that it is good to be clear here. For me, it came as a real surprise to learn just how much faster regex-lite compiles regexes, and I think that this fact could also be quite interesting for other people (because they might think that only rustc compilation is faster).

@01mf02
Contributor Author

01mf02 commented Oct 15, 2025

So I think that I would like to keep regex-lite in jaq for the time being, given their nice match of design philosophy.

If it helps you with the reviewing of this PR: I literally just copied string.rs to bytes.rs and replaced a lot of &str by equivalent &[u8] parts. Using diff string.rs bytes.rs should be rather quick to breeze through.

@BurntSushi
Member

BurntSushi commented Oct 15, 2025

> Is this what you mean by disabling Unicode mode? (I searched the regex-automata docs for "Unicode mode", and the syntax::Config::unicode function was the only one that came up.)

I linked you to syntax::Config::unicode above. :-)

> This does not affect the performance at all. Whatever values I pass to utf8() and unicode(), the performance remains stable up to the millisecond. (Remember, I'm not actually searching with the regex, just compiling it.)

Oh wow that is interesting. But yeah, compiling is where I'd expect it to help.

> By the way, something that I noticed when I read your rebar README and several other posts by you, such as your regex-lite introduction on Reddit: I found it often ambiguous when you wrote about "compile times", as in:

Yeah I agree I should be clearer. In rebar, it's only talking about regex compile times. In the regex-lite crate docs, it's talking about rustc compile times. But I should probably mention regex compile times in the regex-lite crate docs. It's just that usually people don't care about regex compile times because they put them in a LazyLock or can otherwise amortize their cost. Many more care about rustc compile times. And indeed, rustc compile times and binary size were the primary motivations that led me to build regex-lite. The fact that this results in fast regex compile times is mostly incidental but also a fairly direct consequence of its design philosophy. Less code generally means less optimization and less time spent doing that optimization.
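
As a sketch of the LazyLock pattern mentioned above: the value is built on first use and then shared, so the compilation cost is paid once per process. To stay std-only, the example uses a hypothetical `Compiled` stand-in and a counter instead of a real regex type; with a regex crate, the closure would call its constructor instead.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how often the (stand-in) compilation actually runs.
static COMPILATIONS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a compiled regex (hypothetical type).
struct Compiled {
    pattern: &'static str,
}

/// Initialized on first access, then reused for the rest of the program.
static DATE_RE: LazyLock<Compiled> = LazyLock::new(|| {
    COMPILATIONS.fetch_add(1, Ordering::SeqCst);
    Compiled { pattern: r"^\d{4}-\d{2}-\d{2}$" }
});

/// Every caller goes through the shared, already-initialized value.
fn use_regex() -> &'static str {
    DATE_RE.pattern
}

fn main() {
    for _ in 0..1000 {
        use_regex();
    }
    // The initializer ran exactly once, no matter how many uses there were.
    assert_eq!(COMPILATIONS.load(Ordering::SeqCst), 1);
}
```

This only works when the pattern is known up front, which is not the case for a tool that receives its regexes at runtime.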

> If it helps you with the reviewing of this PR: I literally just copied string.rs to bytes.rs and replaced a lot of &str by equivalent &[u8] parts. Using diff string.rs bytes.rs should be rather quick to breeze through.

Thanks! But there are some subtle differences that I need to think/review carefully. And there is also how iteration is handled in the presence of invalid UTF-8. And making sure the tests are all wired up correctly.

I'll try to get to it soon, but I don't make any promises.

@01mf02
Contributor Author

01mf02 commented Oct 16, 2025

> Thanks! But there are some subtle differences that I need to think/review carefully. And there is also how iteration is handled in the presence of invalid UTF-8. And making sure the tests are all wired up correctly.
>
> I'll try to get to it soon, but I don't make any promises.

I understand. Take your time, and I think that I'll publish a crate that contains my current state, in the hope that I can at some point replace it with your upstream version again.
