make it easier to maintain fork with additional regex engines #1488
Description
I needed a CLI tool to run a massive number of regexes (and improve my Rust at the same time), so I made a crate implementing the `Matcher` trait: https://git.sr.ht/~pierrenn/grep-hyperscan
From my (sporadic) tests, it starts to be useful when you have at least 1000 regexes to run over more than 10GB of data. I had 4.5k patterns on 150GB, so... Since the data comes from disk reads, using hyperscan basically limits your speed to your disk speed.
Also, ripgrep has a limit on the size of compiled expressions, and using hyperscan bypasses it.
Ideally it would be cool if this could be integrated into ripgrep. I preferred to open this issue before making a PR, to discuss it and gauge interest. Details are below.
Implementation
It's just an implementation of `find_at`, since hyperscan doesn't support capture groups (`new_captures` is implemented using `NoCaptures`).
I thought of three possible ways to implement `find_at`:
- hyperscan has a `HS_FLAG_SINGLEMATCH` flag which would be great for `find_at`. However, it is incompatible with `HS_FLAG_SOM_LEFTMOST`, which is required to get the `from` of the `Match`.
- you can force hyperscan to stop scanning a haystack after the first match by returning non-zero from the callback provided to hyperscan. However, this means a call to hyperscan each time we have a match. An (outdated...) implementation of this idea is available in the `ideas/single_match` branch.
- every time a new haystack is sent to `find_at`, scan it in one go with hyperscan, remember the matches in a `VecDeque`, and consume the deque on each subsequent call. I first assumed that once we return `Ok(None)` from `find_at`, the next haystack would be a new one. However (and this is weird?), `find_at` sometimes gets sent a new haystack while the "current" one is not finished. From testing, it seems to only be the same haystack with an EOL added at the end, or the right-most part of the original haystack. Thus, we also start a new hyperscan run whenever we see a haystack with a new length. According to my (sporadic...) benchmarks, avoiding the successive calls to hyperscan speeds up the overall match by 10-20%. Ideally, it would be great if we could require ripgrep to send each haystack only once (or send the minimal amount of data), but I have no idea how to do that.
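The third strategy can be sketched as follows. This is a minimal, self-contained illustration of the caching idea only: `scan_all` is a stand-in for a real hyperscan scan (here it just finds a literal needle), and `CachedMatcher` is a hypothetical name, not the crate's actual type.

```rust
use std::collections::VecDeque;

// Sketch of the match-caching strategy: scan the whole haystack once,
// cache all matches, and rescan only when the haystack length changes.
struct CachedMatcher {
    needle: Vec<u8>,
    cache: VecDeque<(usize, usize)>, // (start, end) pairs, leftmost first
    last_len: Option<usize>,         // length of the haystack last scanned
}

impl CachedMatcher {
    fn new(needle: &[u8]) -> Self {
        CachedMatcher { needle: needle.to_vec(), cache: VecDeque::new(), last_len: None }
    }

    // Stand-in for a single hyperscan scan over the whole haystack.
    fn scan_all(&self, haystack: &[u8]) -> VecDeque<(usize, usize)> {
        let mut out = VecDeque::new();
        let n = self.needle.len();
        if n == 0 || haystack.len() < n {
            return out;
        }
        for i in 0..=haystack.len() - n {
            if haystack[i..i + n] == self.needle[..] {
                out.push_back((i, i + n));
            }
        }
        out
    }

    // find_at: rebuild the cache when the haystack length changes (the
    // heuristic described above), then pop cached matches at or after `at`.
    fn find_at(&mut self, haystack: &[u8], at: usize) -> Option<(usize, usize)> {
        if self.last_len != Some(haystack.len()) {
            self.cache = self.scan_all(haystack);
            self.last_len = Some(haystack.len());
        }
        while let Some(&(s, e)) = self.cache.front() {
            self.cache.pop_front();
            if s >= at {
                return Some((s, e));
            }
        }
        None
    }
}

fn main() {
    let mut m = CachedMatcher::new(b"ab");
    let hay = b"xxabyyabzz";
    assert_eq!(m.find_at(hay, 0), Some((2, 4))); // first match from cache
    assert_eq!(m.find_at(hay, 4), Some((6, 8))); // no rescan, same length
    assert_eq!(m.find_at(hay, 8), None);         // cache exhausted
    println!("ok");
}
```

The point of the sketch is that only the first `find_at` call on a given haystack pays the scan cost; subsequent calls just drain the deque.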
Things to do for integration
I think the following tasks should be done for integration:
- add a way to switch regex engines easily. For testing I added `-Y`/`--hyperscan`, but this is kind of clumsy... IMHO it would be best to have an `--engine=` option which accepts `default|pcre2|hyperscan` (this would also make it easy to add other engines such as chimera). It would default to `default`; the `-e` shortcut is already taken, so... ?
- allow `-f` to read either a text file or a hyperscan database. Most of hyperscan's running time is actually spent compiling the list of text regex patterns into its own database format (see benchmarks below). Also, many databases come already compiled, and sometimes you want to rerun the same regexes on different files...
- hence, add a parameter to write the compiled hyperscan database to a file (after DB compilation/read and before matching): `-d`/`--hyper-write=filename`, disabled by default
- add a hyperscan-specific option to allow empty-buffer matches, see `HS_FLAG_ALLOWEMPTY` in https://intel.github.io/hyperscan/dev-reference/api_constants.html#pattern-flags (`--hyper-allow-empty`, default false)
- add a hyperscan-specific option to force UTF-8 mode for all regex patterns, see `HS_FLAG_UTF8` at the same URL (`--hyper-utf8`, default false)
- add a hyperscan-specific option to enable Unicode property support for all regex patterns, see `HS_FLAG_UCP` at the same URL (`--hyper-unicode-property`, default false)
- in the help text for the options above (plus case sensitivity, dotall, multiline, ...), note that when used with the hyperscan engine, they override the default behavior of each regex
- suggest using hyperscan when the regex size exceeds the limit
- docs and testing...
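For the first task, the engine selection could be sketched like this. Note that `Engine` and `parse_engine` are hypothetical names for illustration; this is not ripgrep's actual flag-parsing code, only a sketch of the proposed `--engine=` values.

```rust
// Hypothetical sketch of parsing the proposed --engine=<name> values.
#[derive(Debug, PartialEq)]
enum Engine {
    Default,
    Pcre2,
    Hyperscan,
}

fn parse_engine(arg: &str) -> Result<Engine, String> {
    match arg {
        "default" => Ok(Engine::Default),
        "pcre2" => Ok(Engine::Pcre2),
        "hyperscan" => Ok(Engine::Hyperscan),
        // An unknown name is an error; adding chimera later would just be
        // another match arm here.
        other => Err(format!("unknown engine: {}", other)),
    }
}

fn main() {
    assert_eq!(parse_engine("pcre2"), Ok(Engine::Pcre2));
    assert!(parse_engine("chimera").is_err());
    println!("ok");
}
```

A single `--engine=` enum like this avoids one boolean flag per engine, which is the main reason it scales better than `-Y`/`--hyperscan`.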
Do you see anything else? Is there anything to change, or anything that's not OK? I'll edit the task list accordingly.
Benchmarking
I had to run ~4500 regexes over ~150GB of scraped HTML webpages, so that was my benchmark. The pattern format for hyperscan differs from the default engine's, so I used two sets of regexes for benchmarking. There is also a limit on the number of regexes the default engine can handle, and my list of regexes has the shape:
some.domain1.com/@[\w.\+\-]+
some.other.domain2.net/@[\w.\+\-]+
...
where I have a list of 4.5k web domains (this is to find possible fediverse accounts in a webpage). A flat list like that is too big for the default engine, so I used a single alternation instead: `(some.domain1.com|some.other.domain2.net|...)/@[\w.\+\-]+`.
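The rewrite from the per-domain list to the single alternation is mechanical; a sketch (with a hypothetical `collapse` helper, and keeping the dots unescaped exactly as in the original patterns) could look like:

```rust
// Sketch: collapse a per-domain pattern list into the single alternation
// form that fits under the default engine's compiled-size limit.
// `collapse` is a hypothetical helper for illustration.
fn collapse(domains: &[&str], suffix: &str) -> String {
    format!("({}){}", domains.join("|"), suffix)
}

fn main() {
    let domains = ["some.domain1.com", "some.other.domain2.net"];
    let pat = collapse(&domains, r"/@[\w.\+\-]+");
    assert_eq!(pat, r"(some.domain1.com|some.other.domain2.net)/@[\w.\+\-]+");
    println!("{}", pat);
}
```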
The default-engine regexes are here: https://termbin.com/xdse ; the hyperscan regexes are here: https://termbin.com/62ov
I used a 15GB subset of the data for testing. With the default engine, parsing takes around 8:20 min (best case). With the hyperscan engine, it takes less than 30 seconds to scan the files (basically the speed of my SSD) plus 5 minutes to compile the regexes. That's why we need a flag to serialize/deserialize compiled patterns, to make the hyperscan engine easier to use.
Sorry for the (too!) long issue. The reasons I opened it are:
- code review: I'm using this as an excuse to learn Rust, so any comments on https://git.sr.ht/~pierrenn/grep-hyperscan are more than welcome
- gauging interest: do you think integrating this into ripgrep is a good idea?
- instructions on how to proceed, if there is interest: is the todo list above accurate? What could be improved/changed/added/removed?