make it easier to maintain fork with additional regex engines #1488
Description
I needed a CLI tool to run a massive number of regexes (and improve my Rust at the same time), so I made a crate implementing the `Matcher` trait: https://git.sr.ht/~pierrenn/grep-hyperscan
From my (sporadic) tests, it starts to be useful when you have at least 1000 regexes to run over more than 10GB of data. I had 4.5k patterns on 150GB, so... Since the data comes from disk reads, using hyperscan basically limits your speed to your disk speed.
Also, ripgrep has a limit on the size of compiled expressions, and using hyperscan bypasses it.
Ideally it would be cool if this could be integrated into ripgrep. I preferred to open this issue before making a PR, to discuss it and gauge interest. Details are below.
Implementation
It's just an implementation of `find_at`, since hyperscan doesn't support capture groups (`new_captures` is implemented using `NoCaptures`).
I thought of three possible ways to implement `find_at`:
- hyperscan has a `HS_FLAG_SINGLEMATCH` flag which would be great for `find_at`. However, it is incompatible with `HS_FLAG_SOM_LEFTMOST`, which is required to get the `from` of the `Match`.
- you can force hyperscan to stop scanning a haystack after the first match by returning non-zero from the callback provided to hyperscan. However, this means a call to hyperscan each time we have a match. An (outdated...) implementation of this idea is available in the `ideas/single_match` branch.
- every time a new haystack is sent to `find_at`, scan it in one go with hyperscan, remember the matches in a `VecDeque`, and consume the deque on each subsequent call. I first assumed that once we return `Ok(None)` from `find_at`, the next haystack would be a new one. However (and this is weird?), `find_at` sometimes gets sent a new haystack while the "current" one is not finished. From testing, it seems to only be the same haystack with an EOL added at the end, or the right-most part of the original haystack. Thus, we also start a new hyperscan run whenever we see a haystack with a new length. According to my (sporadic...) benchmarks, avoiding the successive calls to hyperscan speeds up the overall match by 10-20%. Ideally, it would be great if we could require ripgrep to send each haystack only once (or send the minimal amount of data), but I have no idea how to do that.
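The third strategy can be sketched as follows. This is a minimal, self-contained illustration of the caching idea only: `scan_all` is a stand-in for a real hyperscan scan (here it just finds a literal needle), and `CachedMatcher` is a hypothetical name, not the crate's actual type.

```rust
use std::collections::VecDeque;

// Sketch of the match-caching strategy: scan the whole haystack once,
// cache all matches, and rescan only when the haystack length changes.
struct CachedMatcher {
    needle: Vec<u8>,
    cache: VecDeque<(usize, usize)>, // (start, end) pairs, leftmost first
    last_len: Option<usize>,         // length of the haystack last scanned
}

impl CachedMatcher {
    fn new(needle: &[u8]) -> Self {
        CachedMatcher { needle: needle.to_vec(), cache: VecDeque::new(), last_len: None }
    }

    // Stand-in for a single hyperscan scan over the whole haystack.
    fn scan_all(&self, haystack: &[u8]) -> VecDeque<(usize, usize)> {
        let mut out = VecDeque::new();
        let n = self.needle.len();
        if n == 0 || haystack.len() < n {
            return out;
        }
        for i in 0..=haystack.len() - n {
            if haystack[i..i + n] == self.needle[..] {
                out.push_back((i, i + n));
            }
        }
        out
    }

    // find_at: rebuild the cache when the haystack length changes (the
    // heuristic described above), then pop cached matches at or after `at`.
    fn find_at(&mut self, haystack: &[u8], at: usize) -> Option<(usize, usize)> {
        if self.last_len != Some(haystack.len()) {
            self.cache = self.scan_all(haystack);
            self.last_len = Some(haystack.len());
        }
        while let Some(&(s, e)) = self.cache.front() {
            self.cache.pop_front();
            if s >= at {
                return Some((s, e));
            }
        }
        None
    }
}

fn main() {
    let mut m = CachedMatcher::new(b"ab");
    let hay = b"xxabyyabzz";
    assert_eq!(m.find_at(hay, 0), Some((2, 4))); // first match from cache
    assert_eq!(m.find_at(hay, 4), Some((6, 8))); // no rescan, same length
    assert_eq!(m.find_at(hay, 8), None);         // cache exhausted
    println!("ok");
}
```

The point of the sketch is that only the first `find_at` call on a given haystack pays the scan cost; subsequent calls just drain the deque.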
Things to do for integration
I think the following tasks should be done for integration:
- add a way to switch regex engines easily. For testing I added `-Y`/`--hyperscan`, but this is kind of clumsy... IMHO it would be best to have an `--engine=` option which accepts `default|pcre2|hyperscan` (this would also make it easy to add other engines such as chimera). It would default to `default`; the `-e` shortcut is already taken, so... ?
- allow `-f` to read either a text file or a hyperscan database. Most of hyperscan's running time is actually spent compiling the list of text regex patterns into its own database format (see benchmarks below). Also, many databases come already compiled, and sometimes you want to rerun the same regexes on different files...
- hence, add a parameter to write the compiled hyperscan database to a file (after DB compilation/read and before matching): `-d`/`--hyper-write=filename`, disabled by default
- add a hyperscan-specific option to allow empty-buffer matches, see `HS_FLAG_ALLOWEMPTY` in https://intel.github.io/hyperscan/dev-reference/api_constants.html#pattern-flags (`--hyper-allow-empty`, default false)
- add a hyperscan-specific option to force UTF-8 mode for all regex patterns, see `HS_FLAG_UTF8` at the same URL (`--hyper-utf8`, default false)
- add a hyperscan-specific option to enable Unicode property support for all regex patterns, see `HS_FLAG_UCP` at the same URL (`--hyper-unicode-property`, default false)
- in the help text for the options above (plus case sensitivity, dotall, multiline, ...), note that when used with the hyperscan engine, they override the default behavior of each regex
- suggest using hyperscan when the regex size exceeds the limit
- docs and testing...
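For the first task, the engine selection could be sketched like this. Note that `Engine` and `parse_engine` are hypothetical names for illustration; this is not ripgrep's actual flag-parsing code, only a sketch of the proposed `--engine=` values.

```rust
// Hypothetical sketch of parsing the proposed --engine=<name> values.
#[derive(Debug, PartialEq)]
enum Engine {
    Default,
    Pcre2,
    Hyperscan,
}

fn parse_engine(arg: &str) -> Result<Engine, String> {
    match arg {
        "default" => Ok(Engine::Default),
        "pcre2" => Ok(Engine::Pcre2),
        "hyperscan" => Ok(Engine::Hyperscan),
        // An unknown name is an error; adding chimera later would just be
        // another match arm here.
        other => Err(format!("unknown engine: {}", other)),
    }
}

fn main() {
    assert_eq!(parse_engine("pcre2"), Ok(Engine::Pcre2));
    assert!(parse_engine("chimera").is_err());
    println!("ok");
}
```

A single `--engine=` enum like this avoids one boolean flag per engine, which is the main reason it scales better than `-Y`/`--hyperscan`.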
Do you see anything else? Is there anything to change, or anything that's not OK? I'll edit the task list accordingly.
Benchmarking
I had to run ~4500 regexes over ~150GB of scraped HTML webpages, so that was my benchmark. The pattern format for hyperscan differs from the default engine's, so I used two sets of regexes for benchmarking. There is also a limit on the number of regexes the default engine can handle, and my list of regexes has the shape:
some.domain1.com/@[\w.\+\-]+
some.other.domain2.net/@[\w.\+\-]+
...
where I have a list of 4.5k web domains (this is to find possible fediverse accounts in a webpage). A flat list like that is too big for the default engine, so I used a single alternation instead: `(some.domain1.com|some.other.domain2.net|...)/@[\w.\+\-]+`.
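The rewrite from the per-domain list to the single alternation is mechanical; a sketch (with a hypothetical `collapse` helper, and keeping the dots unescaped exactly as in the original patterns) could look like:

```rust
// Sketch: collapse a per-domain pattern list into the single alternation
// form that fits under the default engine's compiled-size limit.
// `collapse` is a hypothetical helper for illustration.
fn collapse(domains: &[&str], suffix: &str) -> String {
    format!("({}){}", domains.join("|"), suffix)
}

fn main() {
    let domains = ["some.domain1.com", "some.other.domain2.net"];
    let pat = collapse(&domains, r"/@[\w.\+\-]+");
    assert_eq!(pat, r"(some.domain1.com|some.other.domain2.net)/@[\w.\+\-]+");
    println!("{}", pat);
}
```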
The default-engine regexes are here: https://termbin.com/xdse ; the hyperscan regexes are here: https://termbin.com/62ov
I used a 15GB subset of the data for testing. With the default engine, parsing takes around 8:20 min (best case). With the hyperscan engine, it takes less than 30 seconds to scan the files (basically the speed of my SSD) plus 5 minutes to compile the regexes. That's why we need a flag to serialize/deserialize compiled patterns, to make the hyperscan engine easier to use.
Sorry for the (too!) long issue. The reasons I opened it are:
- code review: I'm using this as an excuse to learn Rust, so any comments on https://git.sr.ht/~pierrenn/grep-hyperscan are more than welcome
- gauging interest: do you think integrating this into ripgrep is a good idea?
- instructions on how to proceed, if there is interest: is the todo list above accurate? What could be improved/changed/added/removed?