make it easier to maintain fork with additional regex engines #1488

@pierreN

Description

I needed a CLI tool to run a massive number of regexps (and improve my Rust at the same time), so I made a crate implementing the Matcher trait: https://git.sr.ht/~pierrenn/grep-hyperscan

From my (sporadic) tests, it starts to be useful once you have at least 1,000 regexps to run over more than 10 GB of data. I had 4.5k regexps on 150 GB, so... Since the data comes from disk reads, using Hyperscan basically limits your speed to your disk speed.

Plus, ripgrep has a limit on the size of the compiled expressions, and using Hyperscan allows bypassing that.

Ideally it would be cool if it could be integrated into ripgrep. I preferred to open this issue before doing a PR, to talk about it and gauge interest. Details are below.

Implementation

It's just an implementation of find_at, since Hyperscan doesn't support capture groups (new_captures is implemented using NoCaptures).
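For context, here is roughly what "only find_at, no captures" means in practice. This is a hedged sketch with simplified stand-in re-declarations of the trait and types (the real grep-matcher crate has more methods and different signatures), and a toy literal matcher standing in for the actual Hyperscan calls:

```rust
// Simplified stand-ins for grep-matcher's types (the real crate has more methods).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Match {
    start: usize,
    end: usize,
}

/// Sentinel type for matchers that support no capture groups.
struct NoCaptures;

trait Matcher {
    type Captures;
    /// Find the next match at or after byte offset `at`.
    fn find_at(&self, haystack: &[u8], at: usize) -> Option<Match>;
    /// Hyperscan has no capture groups, so this returns an empty sentinel.
    fn new_captures(&self) -> Self::Captures;
}

/// Toy matcher that finds a fixed, non-empty byte pattern; a real
/// implementation would call into Hyperscan's scan API here instead.
struct LiteralMatcher {
    pat: Vec<u8>,
}

impl Matcher for LiteralMatcher {
    type Captures = NoCaptures;

    fn find_at(&self, haystack: &[u8], at: usize) -> Option<Match> {
        haystack[at..]
            .windows(self.pat.len())
            .position(|w| w == &self.pat[..])
            .map(|i| Match { start: at + i, end: at + i + self.pat.len() })
    }

    fn new_captures(&self) -> NoCaptures {
        NoCaptures
    }
}
```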

I thought of three possible ways to implement find_at:

  • Hyperscan has an HS_FLAG_SINGLEMATCH flag which would be great for find_at. However, it is incompatible with HS_FLAG_SOM_LEFTMOST, which is required to get the start offset of the Match.
  • you can force Hyperscan to stop scanning a haystack after the first match by returning a non-zero value from the callback provided to the scan. However, this means a fresh call into Hyperscan every time we have a match. An (outdated...) implementation of this idea is available in the branch ideas/single_match.
  • every time find_at gets a new haystack, scan it in one go with Hyperscan, remember the matches in a VecDeque, and consume the deque on each subsequent call. I first guessed that when we return Ok(None) from find_at, the next haystack would be a new one. However (and this is weird?), find_at sometimes gets sent a new haystack while the "current" one is not finished. From testing, it seems to only be the same haystack with an EOL added at the end, or the right-most part of the original haystack. Thus, we also start a new Hyperscan run whenever we see a new haystack length. According to my (sporadic...) benchmarks, avoiding the repeated calls into Hyperscan speeds up the overall match by 10-20%. Ideally, it would be great if we could require ripgrep to send each haystack only once (or the minimal amount of data), but I have no idea how to do that.
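The third option (scan once, cache matches, detect a new haystack by a length change) can be sketched as follows. This is a simplified self-contained model, not the fork's actual code: scan_all stands in for a single Hyperscan pass over the whole haystack, and the length-change heuristic is the one described above (it would misfire if two distinct haystacks happened to have the same length):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Match {
    start: usize,
    end: usize,
}

/// Caches all matches from one full scan and replays them on later calls,
/// rescanning only when the haystack length changes.
struct CachedMatcher {
    pat: Vec<u8>,
    // (length of the last haystack seen, matches not yet consumed)
    state: RefCell<(usize, VecDeque<Match>)>,
}

impl CachedMatcher {
    fn new(pat: &[u8]) -> Self {
        CachedMatcher {
            pat: pat.to_vec(),
            state: RefCell::new((usize::MAX, VecDeque::new())),
        }
    }

    /// Stand-in for one Hyperscan scan over the entire haystack:
    /// collect every non-overlapping occurrence of `pat`.
    fn scan_all(&self, haystack: &[u8]) -> VecDeque<Match> {
        let mut out = VecDeque::new();
        let mut at = 0;
        while let Some(i) = haystack[at..]
            .windows(self.pat.len())
            .position(|w| w == &self.pat[..])
        {
            let start = at + i;
            out.push_back(Match { start, end: start + self.pat.len() });
            at = start + self.pat.len();
        }
        out
    }

    fn find_at(&self, haystack: &[u8], at: usize) -> Option<Match> {
        let mut state = self.state.borrow_mut();
        if state.0 != haystack.len() {
            // New haystack (detected by length change): do one full scan.
            *state = (haystack.len(), self.scan_all(haystack));
        }
        // Consume cached matches until one starts at or after `at`.
        while let Some(m) = state.1.pop_front() {
            if m.start >= at {
                return Some(m);
            }
        }
        None
    }
}
```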

Things to do for integration

I think the following tasks should be done for integration:

  • add a way to switch regex engines easily. For testing I added -Y/--hyperscan, but this is kind of clumsy... IMHO it would be best to have an --engine= option which accepts default|pcre2|hyperscan (this would also make it easy to add other engines such as Chimera). It would default to default, and the -e shortcut is already taken, so...?
  • allow -f to read either a text file or a Hyperscan database. Most of the running time spent by Hyperscan is actually compiling the list of text regexp patterns into its own database format (see benchmarks below). Plus, a lot of databases come in already-compiled form. Plus, sometimes you want to rerun the same regexps on different files...
  • hence, add a parameter to write the compiled Hyperscan database to a file (after database compilation/read and before matching) (-d/--hyper-write=filename, disabled by default)
  • add a Hyperscan-specific option to allow empty-buffer matches; see HS_FLAG_ALLOWEMPTY in https://intel.github.io/hyperscan/dev-reference/api_constants.html#pattern-flags (--hyper-allow-empty, default false)
  • add a Hyperscan-specific option to force UTF-8 mode for all regexp patterns; see HS_FLAG_UTF8 at the same URL (--hyper-utf8, default false)
  • add a Hyperscan-specific option to enable Unicode property support for all regexp patterns; see HS_FLAG_UCP at the same URL (--hyper-unicode-property, default false)
  • in the help text for the options above (plus case sensitivity, dotall, multiline, ...), note that when they are used with the Hyperscan engine, they override the default behavior of each regexp
  • suggest using Hyperscan when the regexp size exceeds the limit
  • docs and testing...
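To make the first task concrete, parsing the proposed engine switch could look something like this. This is only a sketch under the assumption of a hypothetical Engine enum and a plain string value; it is not ripgrep's actual flag handling, which goes through its clap-based argument layer:

```rust
#[derive(Debug, PartialEq)]
enum Engine {
    Default,
    Pcre2,
    Hyperscan,
}

/// Parse the value of a hypothetical `--engine=` flag.
/// Unknown names produce an error instead of silently falling back,
/// so adding new engines later (e.g. chimera) stays a local change.
fn parse_engine(value: &str) -> Result<Engine, String> {
    match value {
        "default" => Ok(Engine::Default),
        "pcre2" => Ok(Engine::Pcre2),
        "hyperscan" => Ok(Engine::Hyperscan),
        other => Err(format!("unrecognized engine: '{}'", other)),
    }
}
```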

Do you see anything else? Is there anything to change, or anything that is not OK? I'll edit the tasks accordingly.

Benchmarking

I had to run ~4,500 regexps over ~150 GB of scraped HTML webpages, so that was my benchmark. The pattern format for Hyperscan regexps and default-engine regexps is different, so I used two sets of regexps for benchmarking. Plus, there is a limit to the number of regexps the default engine can handle, and my list of regexps has the shape:

some.domain1.com/@[\w.\+\-]+
some.other.domain2.net/@[\w.\+\-]+
...

with a list of 4.5k web domains (this is to find possible fediverse accounts in a webpage). A flat list like that is too big for the default engine, so instead I used: (some.domain1.com|some.other.domain2.net|...)/@[\w.\+\-]+.
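For reference, collapsing such a domain list into the single alternation above is mechanical. A sketch (the domain names and suffix are placeholders; note that, as in the original patterns, the dots are left unescaped, so strictly they match any character):

```rust
/// Build a single alternation pattern from a list of domains
/// plus a shared suffix, i.e. "(d1|d2|...)suffix".
fn combine_domains(domains: &[&str], suffix: &str) -> String {
    format!("({}){}", domains.join("|"), suffix)
}
```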

The default-engine regexps are here: https://termbin.com/xdse ; the Hyperscan regexps are here: https://termbin.com/62ov

I used a 15 GB subset of the data for testing. With the default engine, the search takes around 8:20 min (best case). With the Hyperscan engine, it takes less than 30 seconds to search the files (that's basically the speed of my SSD) plus 5 minutes to compile the regexps. That's why we need flags to serialize/deserialize compiled databases, so that using the Hyperscan engine becomes easier.


Sorry for the (too!) long issue. The reasons I opened it are:

  • code review: I'm using this as an excuse to learn Rust, so any comment on https://git.sr.ht/~pierrenn/grep-hyperscan is more than welcome
  • gauging interest: do you think integrating this into ripgrep is a good idea?
  • instructions on how to proceed, if there is interest: is the to-do list above accurate? What could be improved/changed/added/removed?
