Click here to Skip to main content
15,867,308 members
Articles / General Programming / Algorithms

Fastest Fulltext (Vector&Scalar) Exact Searcher?!

Rate me:
Please Sign up or sign in to vote.
4.33/5 (3 votes)
21 Oct 2020Public Domain2 min read 15.7K   187   4   20
A fulltext CLI tool reporting number of exact matches, FAST!
Benchmarking the basic task of finding a needle in a haystack, an overview, showcasing (scalar and vector, side-by-side) superfast memmem() functions written in C.

Introduction

It's vector time.
It is time to benchmark the basic task of finding a needle in a haystack, using vectors.

Having implemented a simplistic vector memmem() etude, glad to share it. Showcasing (scalar and vector, side-by-side) superfast memmem() functions written in C.

The console tool attached here is only a wrapper around these functions, being a skeleton itself, NyoTengu serves well as a blueprint (starting point) for further development of search tools - with more functionality.

Image 1

Background

The theme is all over the books and what not, yet, exact searching is not optimized, it is an ugly fact, I am not aware of function (let alone tool) working at 10GB/s rates on modern machines, see below how my mainstream laptop executes these shared C etudes - compiled with GCC and ICL.

Full-Text Exact Showdown — Fastest GREP vs NyoTengu

The testmachine: laptop running Windows 10, i5-7200U @2.5GHz, 8GB DDR4 2133MHz:

Haystack: linux-5.8.5.tar (983,992,320 bytes)
Needle:   "Linus Torvalds"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar needle1.txt                 | 11,714,194,285 B/s; 0.084 seconds |     602 |
| NyoTengu_YMM_GCC730_64bit.exe linux-5.8.5.tar needle1.txt                    | 11,441,771,162 B/s; 0.086 seconds |     602 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "Linus Torvalds" li...tar |               N.A.; 0.256 seconds |     602 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: linux-5.8.5.tar (983,992,320 bytes)
Needle:   "5.8.5"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar needle2.txt                 |  9,553,323,495 B/s; 0.103 seconds |   74028 |
| NyoTengu_YMM_GCC730_64bit.exe linux-5.8.5.tar needle2.txt                    |  9,461,464,615 B/s; 0.104 seconds |   74028 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "5.8.5" li...tar          |               N.A.; 0.212 seconds |   74028 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: enwiki-20201001-all-titles (1,193,068,592 bytes)
Needle:   "grep"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe enwiki-20201001-all-titles grep.txt         | 12,051,197,898 B/s; 0.099 seconds |     173 |
| NyoTengu_YMM_GCC730_64bit.exe enwiki-20201001-all-titles grep.txt            | 11,583,190,213 B/s; 0.103 seconds |     173 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "grep" enwiki-2...titles  |               N.A.; 0.378 seconds |     173 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: gcc-7.3.0.tar (615,567,360 bytes)
Needle:   "SIMD"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe gcc-7.3.0.tar SIMD.txt                      | 11,837,833,846 B/s; 0.052 seconds |    3606 |
| NyoTengu_YMM_GCC730_64bit.exe gcc-7.3.0.tar SIMD.txt                         | 11,192,133,818 B/s; 0.055 seconds |    3606 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "SIMD" gcc-7.3.0.tar      |               N.A.; 0.121 seconds |    3606 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28 (3,353,989,743 bytes)
Needle:   "ACGT"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA...-2019-02-28" 4.txt               |  3,508,357,471 B/s; 0.956 seconds | 1607850 |
| NyoTengu_YMM_GCC730_64bit.exe "SCB_DNA...-2019-02-28" 4.txt                  |  4,651,858,173 B/s; 0.721 seconds | 1607850 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "ACGT" "SCB_DNA...-28"    |               N.A.; 5.701 seconds | 1607850 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28 (3,353,989,743 bytes)
Needle:   "AACCGGTT"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA...-2019-02-28" 8.txt               |  2,187,860,236 B/s; 1.533 seconds |    2723 |
| NyoTengu_YMM_GCC730_64bit.exe "SCB_DNA...-2019-02-28" 8.txt                  |  1,968,303,839 B/s; 1.704 seconds |    2723 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "AACCGGTT" "SCB_...-28"   |               N.A.; 5.320 seconds |    2723 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

In order to reproduce the benchmark, here is the package (except the four corpora):

The four corpora are downloadable outside the package:

  1. https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.8.5.tar.xz
  2. ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz
  3. https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-all-titles.gz
  4. https://ftp.gnu.org/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.xz

Note: The fastest GREP - RIPGREP (https://github.com/BurntSushi/ripgrep) received 22,000 stars on GitHub, as far as I see, it is second only to the Linux kernel with 99,000 stars!

Verdict: The current NyoTengu is 2x-5x faster than current RIPGREP, this is the 256bit world, who knows what will happen in the 512bit one...

Image 2

I intend to replace the memcmp() instances with Railgun_Trolldom_64's comparator (not using external code)...

I believe, Railgun_Nyotengu_XMM_YMM_ZMM() is currently the FASTEST memmem() available, will be glad to be proven wrong. :P

Using the Code

The attached NyoTengu.c is 3022 lines long, it is compilable as follows:

On Windows, with ICL:

icl /O3 Nyotengu.c -DXMMtengu -D_WIN32_ENVIRONMENT_ 
        -D_N_HIGH_PRIORITY /FeNyotengu_XMM_IntelV150_32bit /FAcs
icl /O3 Nyotengu.c -DYMMtengu -D_WIN32_ENVIRONMENT_ 
        -D_N_HIGH_PRIORITY /FeNyotengu_YMM_IntelV150_32bit /FAcs
icl /O3 Nyotengu.c -DZMMtengu -D_WIN32_ENVIRONMENT_ 
        -D_N_HIGH_PRIORITY /FeNyotengu_ZMM_IntelV150_32bit /FAcs

On *nix, with GCC:

gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC_64bit.elf 
        -DYMMtengu -D_POSIX_ENVIRONMENT_

The following snippet was generated by GCC 7.3.0 in this way:

gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC730_64bit.exe 
        -DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY
gcc -O3 -mavx2 -fomit-frame-pointer -S NyoTengu.c -o NyoTengu_YMM_GCC730_64bit.asm 
        -DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY

so, the whole ASM fragment:

ASM
    .seh_proc    Railgun_Nyotengu_XMM_YMM_ZMM
Railgun_Nyotengu_XMM_YMM_ZMM:
    pushq    %r15
    .seh_pushreg    %r15
    pushq    %r14
    .seh_pushreg    %r14
    pushq    %r13
    .seh_pushreg    %r13
    pushq    %r12
    .seh_pushreg    %r12
    pushq    %rbp
    .seh_pushreg    %rbp
    pushq    %rdi
    .seh_pushreg    %rdi
    pushq    %rsi
    .seh_pushreg    %rsi
    pushq    %rbx
    .seh_pushreg    %rbx
    subq    $120, %rsp
    .seh_stackalloc    120
    .seh_endprologue
    movl    %r8d, %r13d
    movq    %rcx, %rsi
    movq    %rdx, %r10
    leaq    (%rcx,%r13), %r15
    cmpl    %r9d, %r8d
    jb    .L293
    movl    %r9d, %ebp
    cmpl    $3, %r9d
    jbe    .L321
    movl    (%rdx), %eax
    movl    %eax, 80(%rsp)
    cmpl    $127, %r8d
    ja    .L269
    movl    %eax, %edi
    movl    %eax, %r13d
    movl    %eax, %r14d
    movzbl    %al, %esi
    sall    $24, %edi
    leal    -1(%r9), %r9d
    sall    $8, %r13d
    leaq    (%rcx,%rbp), %rdx
    sall    $16, %r14d
    movl    %edi, 72(%rsp)
    movslq    %r9d, %rdi
    movzwl    %r13w, %r13d
    andl    $16711680, %r14d
    negq    %rbp
    movq    %rdi, 80(%rsp)
    movl    %eax, %r12d
    movl    %esi, 32(%rsp)
    jmp    .L275
    .p2align 4,,10
.L270:
    movl    %eax, %r8d
    movl    $1, %ecx
    andl    $65280, %r8d
    cmpl    %r13d, %r8d
    je    .L274
    movl    %eax, %r8d
    movl    $2, %ecx
    andl    $16711680, %r8d
    cmpl    %r14d, %r8d
    je    .L274
    xorl    %ecx, %ecx
    andl    $-16777216, %eax
    cmpl    72(%rsp), %eax
    setne    %cl
    addq    $3, %rcx
.L274:
    addq    %rcx, %rdx
    cmpq    %rdx, %r15
    jb    .L293
.L275:
    leaq    (%rdx,%rbp), %rbx
    movl    (%rbx), %eax
    cmpl    %r12d, %eax
    jne    .L270
    movq    %rdx, %r8
    movl    %r9d, %eax
    subq    80(%rsp), %r8
    movl    $1, %ecx
    xorl    %edi, %edi
    jmp    .L271
    .p2align 4,,10
.L273:
    leal    (%rax,%rdi), %esi
    cmpl    %esi, %r9d
    jne    .L272
    cmpl    %r11d, 32(%rsp)
    setne    %r11b
    movzbl    %r11b, %r11d
    addl    %r11d, %edi
.L272:
    addl    $1, %ecx
    addq    $1, %r8
    subl    $1, %eax
    je    .L319
.L271:
    movl    %ecx, %r11d
    movsbl    (%r10,%r11), %r11d
    cmpb    (%r8), %r11b
    je    .L273
    leal    1(%rdi), %ecx
    addq    %rcx, %rdx
    cmpq    %rdx, %r15
    jnb    .L275
    .p2align 4,,10
.L293:
    xorl    %ebx, %ebx
.L319:
    movq    %rbx, %rax
    addq    $120, %rsp
    popq    %rbx
    popq    %rsi
    popq    %rdi
    popq    %rbp
    popq    %r12
    popq    %r13
    popq    %r14
    popq    %r15
    ret
    .p2align 4,,10
.L321:
    leal    -1(%r9), %eax
    movsbl    (%rdx), %ebx
    addq    %rbp, %rsi
    movsbl    (%rdx,%rax), %eax
    sall    $8, %ebx
    addl    %eax, %ebx
    movzbl    %bh, %edi
    cmpl    $3, %r9d
    je    .L265
    .p2align 4,,10
.L261:
    movsbl    -2(%rsi), %eax
    movsbl    -1(%rsi), %ecx
    sall    $8, %eax
    addl    %ecx, %eax
    cmpl    %eax, %ebx
    je    .L322
    leaq    1(%rsi), %rax
    cmpb    %dil, %cl
    je    .L267
    addq    $2, %rsi
    cmpq    %rsi, %r15
    jnb    .L261
    jmp    .L293
    .p2align 4,,10
.L324:
    cmpb    %cl, 1(%r10)
    je    .L323
    leaq    1(%rsi), %rax
    cmpb    %cl, %dil
    je    .L286
.L325:
    movq    %rsi, %rax
    xorl    %esi, %esi
    cmpb    %dil, %dl
    setne    %sil
    leaq    2(%rsi,%rax), %rsi
.L263:
    cmpq    %rsi, %r15
    jb    .L293
.L265:
    movsbl    -3(%rsi), %eax
    movsbl    -1(%rsi), %r8d
    movzbl    -2(%rsi), %ecx
    sall    $8, %eax
    movl    %r8d, %edx
    addl    %r8d, %eax
    cmpl    %eax, %ebx
    je    .L324
    leaq    1(%rsi), %rax
    cmpb    %cl, %dil
    jne    .L325
.L286:
    movq    %rax, %rsi
    jmp    .L263
    .p2align 4,,10
.L267:
    cmpq    %rax, %r15
    jb    .L293
    movq    %rax, %rsi
    jmp    .L261
    .p2align 4,,10
.L269:
    shrl    $5, %r8d
    leaq    32(%rcx), %rbx
    vpbroadcastd    80(%rsp), %ymm4
    movq    %r15, 104(%rsp)
    leal    -1(%r8), %r14d
    movq    %rbx, 88(%rsp)
    leaq    4096(%rcx), %rax
    movl    %r9d, 216(%rsp)
    salq    $5, %r14
    movq    %rax, %r12
    movq    %r14, 72(%rsp)
    xorl    %r14d, %r14d
    movq    %r13, 96(%rsp)
    movq    %rdx, %r13
    vmovdqu    %ymm4, 32(%rsp)
.L278:
    vmovdqu    32(%rsp), %ymm3
    prefetcht0    (%r12)
    vpcmpeqd    1(%rsi,%r14), %ymm3, %ymm0
    vpcmpeqd    (%rsi,%r14), %ymm3, %ymm1
    vpcmpeqd    3(%rsi,%r14), %ymm3, %ymm2
    vpor    %ymm0, %ymm1, %ymm1
    vpcmpeqd    2(%rsi,%r14), %ymm3, %ymm0
    vpor    %ymm0, %ymm2, %ymm0
    vpor    %ymm0, %ymm1, %ymm0
    vmovmskps    %ymm0, %eax
    testl    %eax, %eax
    jne    .L326
.L276:
    addq    $32, %r14
    cmpq    72(%rsp), %r14
    jb    .L278
    movq    %r13, %r10
    movq    96(%rsp), %r13
    movq    104(%rsp), %r15
    movl    216(%rsp), %r12d
    subq    %r14, %r13
    cmpq    %r13, %rbp
    ja    .L294
    movl    80(%rsp), %ebx
    leal    -1(%r12), %r9d
    leaq    (%r14,%rbp), %rax
    negq    %rbp
    addq    %rsi, %rax
    movzbl    %bl, %edi
    movl    %ebx, %edx
    movl    %ebx, %r13d
    movl    %ebx, %r12d
    movl    %edi, 32(%rsp)
    sall    $8, %edx
    movl    %ebx, %edi
    sall    $16, %r13d
    sall    $24, %edi
    movzwl    %dx, %r14d
    andl    $16711680, %r13d
    movl    %edi, 72(%rsp)
    movslq    %r9d, %rdi
    movq    %rdi, 80(%rsp)
    jmp    .L284
    .p2align 4,,10
.L279:
    movl    %edx, %r8d
    movl    $1, %ecx
    andl    $65280, %r8d
    cmpl    %r14d, %r8d
    je    .L283
    movl    %edx, %r8d
    movl    $2, %ecx
    andl    $16711680, %r8d
    cmpl    %r13d, %r8d
    je    .L283
    xorl    %ecx, %ecx
    andl    $-16777216, %edx
    cmpl    72(%rsp), %edx
    setne    %cl
    addq    $3, %rcx
.L283:
    addq    %rcx, %rax
    cmpq    %rax, %r15
    jb    .L294
.L284:
    leaq    (%rax,%rbp), %rbx
    movl    (%rbx), %edx
    cmpl    %r12d, %edx
    jne    .L279
    movq    %rax, %r8
    movl    %r9d, %edx
    subq    80(%rsp), %r8
    movl    $1, %ecx
    xorl    %edi, %edi
    .p2align 4,,10
.L280:
    movl    %ecx, %r11d
    movsbl    (%r10,%r11), %r11d
    cmpb    (%r8), %r11b
    jne    .L327
    leal    (%rdx,%rdi), %esi
    cmpl    %esi, %r9d
    jne    .L281
    cmpl    %r11d, 32(%rsp)
    setne    %r11b
    movzbl    %r11b, %r11d
    addl    %r11d, %edi
.L281:
    addl    $1, %ecx
    addq    $1, %r8
    subl    $1, %edx
    jne    .L280
    vzeroupper
    jmp    .L319
    .p2align 4,,10
.L326:
    movq    88(%rsp), %rax
    leaq    (%r14,%rsi), %r15
    leaq    (%rax,%r14), %rdi
    vzeroupper
    jmp    .L277
    .p2align 4,,10
.L328:
    addq    $1, %r15
    cmpq    %r15, %rdi
    je    .L276
.L277:
    movq    %rbp, %r8
    movq    %r13, %rdx
    movq    %r15, %rcx
    movq    %r15, %rbx
    call    memcmp
    testl    %eax, %eax
    jne    .L328
    jmp    .L319
    .p2align 4,,10
.L322:
    leaq    -2(%rsi), %rbx
    jmp    .L319
    .p2align 4,,10
.L327:
    leal    1(%rdi), %ecx
    jmp    .L283
    .p2align 4,,10
.L323:
    leaq    -3(%rsi), %rbx
    jmp    .L319
    .p2align 4,,10
.L294:
    xorl    %ebx, %ebx
    vzeroupper
    jmp    .L319
    .seh_endproc

My wish is for other coders to refine it even further, having a toy to play with is something that quickly brings results.

Points of Interest

The performance of the AVX512 tool yet remains to be seen, if a fellow member can run the benchmark on a CPU supporting it, please do and share the output here in Comments Section below.

History

  • 17th October, 2020: Finished initial version

License

This article, along with any associated source code and files, is licensed under A Public Domain dedication


Written By
Other
Bulgaria Bulgaria
A Bulgarian old boy interested in search techniques, nothing special.

Comments and Discussions

 
Questionquestion regarding your YMMtengu C code Pin
Member 1172068125-Oct-20 15:35
Member 1172068125-Oct-20 15:35 
AnswerRe: question regarding your YMMtengu C code Pin
Sanmayce26-Oct-20 1:31
Sanmayce26-Oct-20 1:31 
GeneralRe: question regarding your YMMtengu C code Pin
Member 1172068126-Oct-20 3:53
Member 1172068126-Oct-20 3:53 
QuestionVery nice article Pin
Member 1172068122-Oct-20 6:48
Member 1172068122-Oct-20 6:48 
The DNA search example is a fantastic way to show how useful NyoTengu.c is for searching very large database.

It sure would be nice if somehow NyoTengu.c could be further developed to replace the Linux find
command.

Where does the name 'NyoTengu' come from ?
AnswerRe: Very nice article Pin
Sanmayce23-Oct-20 2:37
Sanmayce23-Oct-20 2:37 
GeneralRe: Very nice article Pin
Member 1172068123-Oct-20 3:53
Member 1172068123-Oct-20 3:53 
GeneralRe: Very nice article Pin
Member 1172068123-Oct-20 19:28
Member 1172068123-Oct-20 19:28 
GeneralRe: Very nice article Pin
Member 1172068123-Oct-20 19:47
Member 1172068123-Oct-20 19:47 
GeneralRe: Very nice article Pin
Sanmayce24-Oct-20 7:32
Sanmayce24-Oct-20 7:32 
GeneralRe: Very nice article Pin
Member 1172068124-Oct-20 8:20
Member 1172068124-Oct-20 8:20 
GeneralRe: Very nice article Pin
Sanmayce24-Oct-20 9:02
Sanmayce24-Oct-20 9:02 
GeneralRe: Very nice article Pin
Member 1172068124-Oct-20 9:44
Member 1172068124-Oct-20 9:44 
GeneralRe: Very nice article Pin
Sanmayce24-Oct-20 9:47
Sanmayce24-Oct-20 9:47 
GeneralRe: Very nice article Pin
Member 1172068124-Oct-20 10:38
Member 1172068124-Oct-20 10:38 
GeneralRe: Very nice article Pin
Sanmayce24-Oct-20 10:59
Sanmayce24-Oct-20 10:59 
GeneralRe: Very nice article Pin
Member 1172068124-Oct-20 11:33
Member 1172068124-Oct-20 11:33 
GeneralRe: Very nice article Pin
Sanmayce24-Oct-20 11:40
Sanmayce24-Oct-20 11:40 
GeneralRe: Very nice article Pin
Member 1172068124-Oct-20 20:39
Member 1172068124-Oct-20 20:39 
GeneralRe: Very nice article Pin
Member 1172068126-Oct-20 3:44
Member 1172068126-Oct-20 3:44 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.