-
Notifications
You must be signed in to change notification settings - Fork 485
Search &[u8]
haystacks in regex_lite
#1274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
May I ask if there is any interest in merging this? Or is there some conflict? I'm currently depending on my The other alternative I see is to move to |
Hi! Sorry for the late reply. Yes, this is definitely something I would like to add. But my review bandwidth is very small. I can't give you a timeline unfortunately. For jaq specifically, I'm surprised that the search performance of |
Hi @BurntSushi, thanks for answering so quickly! :) I understand about your review bandwidth. I started using The usage of I did, however, make an experiment once where I stored compiled regexes in a thread-local LRU cache, but this was only slightly more performant than recompiling regexes with Of course, I could also work around that somehow, but all in all, I really like the idea behind a "lite" regex implementation. :) It fits the ideals of jaq. |
For examples of regexes that appear within jqjq, see https://github.com/wader/jqjq/blob/736b06d8a6a2093d8a41527e67d7439247756803/jqjq.jq#L122 and the following 100 lines. |
OK, so I want to push back a little here. Please don't interpret this as pushing back against this PR. I'm mostly just trying to offer unsolicited help with your use case. :-) If you don't want that, I'm happy to back off. With that said, consider these things as "have you considered" or "have you tried." You understand your use cases much better than I do. So first, I do want to say this: if your use is indeed that you are:
then yeah,
Indeed, if you look at my regex engine benchmarks, you'll see that So given all of that, one thing I'd ask is: how confident are you that your haystacks are nearly always short? Do you have benchmarks where the haystacks are longer? The As for the
Disabling these options should also imply that you can disable some crate features of |
Also, off topic, but there may be some small perf wins by switching |
Thanks a lot for your so detailed advice, @BurntSushi! First of all, I do not really know on what kinds of inputs jaq users run regexes. The jqjq example was the first stress test for the regex routines in jaq, that's why I have tried to make its performance good. I have tried your suggestions for fn main() {
let re = r"^-=";
for i in 0..100000 {
// 0.062s
//let re = regex_lite::Regex::new(re).unwrap();
// 0.244s
use regex_automata::{Match, nfa::thompson, util::syntax};
let re = thompson::pikevm::PikeVM::builder()
.syntax(syntax::Config::new().utf8(false))
.thompson(thompson::Config::new().utf8(false))
.build(re)
.unwrap();
// 0.635s
/*
use regex_automata::{Match, meta, util::syntax};
let re = meta::Regex::builder()
.syntax(syntax::Config::new().unicode(false))
.configure(meta::Config::new().onepass(false).dfa(false))
.build(re)
.unwrap();
*/
// 1.500s
//let re = regex::Regex::new(re).unwrap();
}
} The outcome is that I tried to find out the reason for the comparatively slow compilation performance of the PikeVM in The only thing that stands out to me is the large amount of time taken by Still, this alone does not explain why I'll try to integrate |
I have now made a prototype that (re-)integrates If you have any idea which knobs I could turn to make this better, I'm all ears. :) |
Thanks for trying out all my ideas! Using just the The thing I was mostly trying to emphasize was whether there were use cases with regexes on large haystacks. If those use cases don't exist or are vanishingly rare, then As for why just compiling the
It doesn't look like you disabled Unicode mode. That's critical. You disabled |
I have thought about the use cases, and indeed, I think that in JSON data, you generally tend to have short strings. So yes, the haystacks will be mostly tiny/small, and I suppose that regexes are used for tiny tasks such as ZIP code matching or things like that.
It seems that
Is this what you mean by disabling Unicode mode? (I searched the use regex_automata::{Match, nfa::thompson, util::syntax};
let re = thompson::pikevm::PikeVM::builder()
.syntax(syntax::Config::new().utf8(false).unicode(false))
.thompson(thompson::Config::new().utf8(false))
.build(re)
.unwrap(); This does not affect the performance at all. Whatever values I pass to By the way, something that I noticed when I read your rebar README and several other posts by you, such as your regex-lite introduction on Reddit: I found it often ambiguous when you wrote about "compile times", as in:
Here, I suppose that binary size refers to the size of the Rust program using Given that |
So I think that I would like to keep If it helps you with the reviewing of this PR: I literally just copied |
I linked you to
Oh wow that is interesting. But yeah, compiling is where I'd expect it to help.
Yeah I agree I should be clearer. In rebar, it's only talking about regex compile times. In the
Thanks! But there are some subtle differences that I need to think/review carefully. And there is also how iteration is handled in the presence of invalid UTF-8. And making sure the tests are all wired up correctly. I'll try to get to it soon, but I don't make any promises. |
I understand. Take your time, and I think that I'll publish a crate that contains my current state, in the hope that I can at some point replace it with your upstream version again. |
To quote the
regex_lite
docs:This PR adds such an API in form of
regex_lite::bytes::Regex
, mirroringregex::bytes::Regex
.Thanks, @BurntSushi, for leaving some breadcrumbs in the source code that made this easier, in particular your hint in
interpolate::bytes
. This is much appreciated, as well as all your effort that went not only in the careful and beautiful API design, but also your very detailed and user-friendly documentation. Thanks a lot for your awesome work.I made this PR because I need this functionality as part of jaq, where I want to transition from
String
to (a variation of)Vec<u8>
, inspired by your "UTF-8 by convention" idea.I did not
cargo fmt
this PR in order to make reviewing as easy as possible.(Note to self: If I ever have to convert a lot of
b"foo"
to&b"foo"[..]
again, this is how to do it in vim for all matches on the current line::s/b"\([^"]*\)"/\&b"\1"[..]/g
.)