Skip to content

match bug #1319

@BurntSushi

Description

@BurntSushi
$ rg --version
ripgrep 11.0.1 (rev 7bf7ceb5d3)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

This matches:

$ echo 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC' | egrep 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C'
CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC

But this doesn't:

$ echo 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC' | rg 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C'

To minimize, this doesn't match:

$ rg 'TTGAGTCCAGGAG[ATCG]{2}C' /tmp/subject

But this does:

$ rg 'TGAGTCCAGGAG[ATCG]{2}C' /tmp/subject
1:CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC

The only difference between the latter two is that the latter removes the first
T from the regex.

From inspecting the --trace output, I note that from the former regex, it
says this:

DEBUG|grep_regex::literal|grep-regex/src/literal.rs:105: required literals found: [Complete(TTGAGTCCAGGAGAC), Complete(TTGAGTCCAGGAGCC), Complete(TTGAGTCCAGGAGGC), Complete(TTGAGTCCAGGAGTC)]
TRACE|grep_regex::matcher|grep-regex/src/matcher.rs:52: extracted fast line regex: (?-u:TTGAGTCCAGGAGAC|TTGAGTCCAGGAGCC|TTGAGTCCAGGAGGC|TTGAGTCCAGGAGTC)

But in the latter regex (the one that works), we have this:

DEBUG|grep_regex::literal|grep-regex/src/literal.rs:59: literal prefixes detected: Literals { lits: [Complete(TGAGTCCAGGAGAAC), Complete(TGAGTCCAGGAGCAC), Complete(TGAGTCCAGGAGGAC), Complete(TGAGTCC
AGGAGTAC), Complete(TGAGTCCAGGAGACC), Complete(TGAGTCCAGGAGCCC), Complete(TGAGTCCAGGAGGCC), Complete(TGAGTCCAGGAGTCC), Complete(TGAGTCCAGGAGAGC), Complete(TGAGTCCAGGAGCGC), Complete(TGAGTCCAGGAGGGC)
, Complete(TGAGTCCAGGAGTGC), Complete(TGAGTCCAGGAGATC), Complete(TGAGTCCAGGAGCTC), Complete(TGAGTCCAGGAGGTC), Complete(TGAGTCCAGGAGTTC)], limit_size: 250, limit_class: 10 }

Therefore, this is almost certainly a bug in literal extraction. Moreover,
this Rust program correctly prints true:

fn main() {
    let pattern = r"CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C";
    let haystack = "CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC";

    let re = regex::Regex::new(pattern).unwrap();
    println!("{:?}", re.is_match(haystack));
}

Which points the finger at grep-regex's inner literal extraction. Sigh.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugA bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions