My current understanding is that in some cases (when massive YMM reads/writes occur) a 2nd gen Intel CPU executes them improperly; when each YMM access is replaced by the corresponding four QWORD accesses, the code works. The test case:

/*
; 'Tsubame' decompression loop, 0x96-0x15+6=135 bytes long, 40 instructions:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-TP -O3 -QxSSE4.1 -D_N_YMM -D_N_prefetch_4096 -D_N_HIGH_PRIORITY -FAcs";

.B16.3::                        
  00015 41 0f 18 8b 00 
        10 00 00         prefetcht0 BYTE PTR [4096+r11]         
  0001d 41 8b 13         mov edx, DWORD PTR [r11]               
  00020 89 d1            mov ecx, edx                           
  00022 83 e1 03         and ecx, 3                             
  00025 75 34            jne .B16.7 
.B16.4::                        
  00027 0f b6 d2         movzx edx, dl                          
  0002a 85 d2            test edx, edx                          
  0002c 74 0a            je .B16.6 
.B16.5::                        
  0002e c4 c1 7e 6f 43 
        01               vmovdqu ymm0, YMMWORD PTR [1+r11]      
  00034 c5 fe 7f 00      vmovdqu YMMWORD PTR [rax], ymm0        
.B16.6::                        
  00038 89 d1            mov ecx, edx                           
  0003a 41 b9 01 00 00 
        00               mov r9d, 1                             
  00040 ba 00 00 00 00   mov edx, 0                             
  00045 41 0f 44 d1      cmove edx, r9d                         
  00049 c1 e9 03         shr ecx, 3                             
  0004c c1 e2 04         shl edx, 4                             
  0004f 03 d1            add edx, ecx                           
  00051 ff c1            inc ecx                                
  00053 48 03 c2         add rax, rdx                           
  00056 4c 03 d9         add r11, rcx                           
  00059 eb 38            jmp .B16.8 
.B16.7::                        
  0005b c1 e1 03         shl ecx, 3                             
  0005e 41 b9 ff ff ff 
        ff               mov r9d, -1                            
  00064 41 d3 e9         shr r9d, cl                            
  00067 44 23 ca         and r9d, edx                           
  0006a 83 e2 0c         and edx, 12                            
  0006d 41 c1 e9 04      shr r9d, 4                             
  00071 f7 da            neg edx                                
  00073 83 c2 10         add edx, 16                            
  00076 49 f7 d9         neg r9                                 
  00079 4c 03 c8         add r9, rax                            
  0007c c1 e9 03         shr ecx, 3                             
  0007f f7 d9            neg ecx                                
  00081 83 c1 04         add ecx, 4                             
  00084 c4 c1 7e 6f 01   vmovdqu ymm0, YMMWORD PTR [r9]         
  00089 c5 fe 7f 00      vmovdqu YMMWORD PTR [rax], ymm0        
  0008d 48 03 c2         add rax, rdx                           
  00090 4c 03 d9         add r11, rcx                           
.B16.8::                        
  00093 4d 3b d8         cmp r11, r8                            
  00096 0f 82 79 ff ff 
        ff               jb .B16.3 
*/
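For readers who do not want to trace the listing by hand, the match path (.B16.7) can be restated in C. This is only a sketch read off the disassembly above, not the author's source; the names `match_token` and `decode_match` are hypothetical, and the field meanings are my interpretation of the register arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of one match token, as read off the .B16.7 path of the listing.
   The struct and function names are hypothetical, not from the Tsubame source. */
typedef struct {
    uint32_t offset;   /* distance back from the output pointer (r9 = rax - offset) */
    uint32_t length;   /* bytes the output pointer advances: 16, 12, 8 or 4 */
    uint32_t consumed; /* bytes the input pointer advances: 3, 2 or 1 */
} match_token;

static match_token decode_match(uint32_t tok)
{
    uint32_t k = tok & 3u;                    /* and ecx, 3 (nonzero selects the match path) */
    assert(k != 0);
    uint32_t mask = 0xFFFFFFFFu >> (k * 8u);  /* shl ecx, 3 ; mov r9d, -1 ; shr r9d, cl */
    match_token m;
    m.offset   = (tok & mask) >> 4;           /* and r9d, edx ; shr r9d, 4 */
    m.length   = 16u - (tok & 12u);           /* and edx, 12 ; neg edx ; add edx, 16 */
    m.consumed = 4u - k;                      /* shr ecx, 3 ; neg ecx ; add ecx, 4 */
    return m;
}
```

For example, `decode_match(0x1235)` gives offset 0x123, length 12 and 3 consumed input bytes. Whatever the decoded length, the loop always moves 32 bytes with the `vmovdqu` pair and then advances the output pointer by only `length` bytes, so consecutive copies overlap.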


Since I have only a Core 2 and an i5-2540M, I cannot check whether the decompression function works properly on 3??? and later Intel CPUs, so I am asking someone to run this command line and report whether 'FAILED' appears:

D:\Tsubame\buggy_AVX_compile>Nakamichi_Tsubame_YMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe alice29.txt
Nakamichi 'Tsubame', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 0 MB ...
Allocating Target-Buffer 32 MB ...
Allocating Verification-Buffer 0 MB ...
Compressing 152,089 bytes ...
-; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 4
NumberOf(Tiny)Matches[Tiny]Window (4): 157
NumberOf(Short)Matches[Tiny]Window (8): 52
NumberOf(Medium)Matches[Tiny]Window (12): 11
RAM-to-RAM performance: 11 KB/s.
Compressed to 73,071 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x1366,78ee
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x8cec,be70
Decompressing 73,071 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1152 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks mismatch) FAILED!
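When the block comparison fails like this, the offset of the first differing byte can help narrow down which token went wrong. A minimal sketch, assuming both buffers are in memory; `first_mismatch` is a hypothetical helper and is not part of the Nakamichi source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the index of the first byte where a and b
   differ, or n if the first n bytes are identical. */
static size_t first_mismatch(const unsigned char *a, const unsigned char *b, size_t n)
{
    assert(a != NULL && b != NULL);
    size_t i = 0;
    while (i < n && a[i] == b[i])
        i++;
    return i;
}
```

Called on the source buffer and the verification buffer with the uncompressed size, a return value smaller than that size is the position of the first wrong output byte.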


The command line that interests me:

D:\Tsubame\buggy_AVX_compile>Nakamichi_Tsubame_YMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe alice29.txt


The test suite, 241KB zip file, executables & source & testdatafile[^]

I asked the same question on Intel's forum, sadly, no one seems to care:

YMMWORD != 4xQWORD[^]

What I have tried:

Laptop Toshiba i5-2540M, Windows 7, Intel C Optimizer v15.0
Posted
Updated 30-Jun-16 11:08am

And you are probably going to be unlucky here as well.
It's very, very unlikely that anyone here is going to download an EXE file from an unknown source and execute it: we have no idea what it may do, we don't know you, and this could well be ransomware or similar.
I'm sorry, but you are going to have to find a friend or colleague to run your test on - in the modern world it's not likely that anyone on the internet is going to want to run your application!
   
Comments
Sanmayce 30-Jun-16 16:33pm
   
> ... in the modern world it's not likely that anyone on the internet is going to want to run your application!

No, I am a man of faith, it is said "ask and you will receive", guess what, not in Christianity only, but 700 years earlier in China.

Guess your rationality will protect you from malign software, however it holds some coldness that also would prevent you from helping a coder in need.

To say "we don't know you" is also cold. I am a member of CodeProject and have enjoyed help from fellow members in my previous tests, so this is not a matter of suspicious activity but of good will. After all, the source code and the compile line are given in that zip file, so even GCC users can compile it and report the outcome if they lack the superb Intel Optimizer. Recently I read that the GCC 7 development version shows previously unseen boosts in such decompression snippets; to be exact, Yann's awesome Zstd decompressor showed some 20% speed boost compared to Microsoft's cl.

Last year, AFAIR, I proposed that CP offer a machine (a 5960X, for example) dedicated to helping members with their speed tests. Yes, the security problems are always there, but having a backup (mirror copy) of the installed OS and packages would ease the reinstallation a lot, just saying.
OriginalGriff 1-Jul-16 2:55am
   
Faith is a wonderful thing. But...it isn't the answer to everything.
You have to be aware that there are malicious people out there, who (through faith, or greed, or both) will happily try to get you to install their apps. Some of these guys target hospitals for elephants sake!
Now, I'm pretty solidly backed up, so if I ran it and it included ransomware I'd reload from my last backup and lose a few hours work. But...it'd take me half a day to do it.
By all means have faith - but back it up with sensible precautions!

"To say "we don't know you" is also cold, I am a member of CodeProject"
So are 12,000,000 others - and I know for a fact that some of those are malicious! You should *see* some of the posts that try to get in here. Or rather, you shouldn't - because volunteers like me try damn hard to keep them out!

In summary: protect yourself. Don't visit random sites, don't run random applications, do backup often and keep them offline. Otherwise, it's not a case of "if" you are going to have problems, it's "when".
It's not that they don't care, it's that they do. They care about their own machines.

Nobody in their right mind is going to download strange code off the internet from someone they don't know and run it.
   
Ok for me on Core i7 3960x, Windows 7

Nakamichi 'Tsubame', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 0 MB ...
Allocating Target-Buffer 32 MB ...
Allocating Verification-Buffer 0 MB ...
Compressing 152,089 bytes ...
-; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 4
NumberOf(Tiny)Matches[Tiny]Window (4): 157
NumberOf(Short)Matches[Tiny]Window (8): 52
NumberOf(Medium)Matches[Tiny]Window (12): 11
RAM-to-RAM performance: 18 KB/s.
Compressed to 73,071 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x1366,78ee
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x8cec,be70
Decompressing 73,071 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1664 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks match) OK.

LZSSE2: Compressing with LZSSE2 (level 17) 152,089 bytes ...
LZSSE2: Compressed to 56,526 bytes.
LZSSE2: RAM-to-RAM performance: 7072 KB/s.
LZSSE2: Decompressing 56,526 bytes (being the compressed stream) ...
LZSSE2: RAM-to-RAM performance: 18560 MB/s.
LZSSE2: Verification (input and output sizes match) OK.
LZSSE2: Verification (input and output blocks match) OK.

LZSSE2 vs Nakamichi 'Tsubame', tighter: 0.77:1
LZSSE2 vs Nakamichi 'Tsubame', quicker: 11.15:1
   
Comments
Sanmayce 30-Jun-16 16:50pm
   
Ugh, forgot my habit to salute with a song people who helped me, having seen your lyrics so much alike to the golden Eruption/Boney-M hit "One Way Ticket", I salute you with one dear to me hit from two awesome French DJs carrying the spirit of sadness-of-leaving-lovely-things/persons-behind:

Thievery Corporation - Is it Over?
https://www.youtube.com/watch?v=4WbxYzAG2I4

Cheers!
@jfriedman
Many thanks man!

You can see for yourself how difficult it is for me to test snippets that need CPUs I cannot afford to buy. The Internet is worse than real life: disbelief is the norm. Call me naive, even stupid, but asking for help with some tests and receiving "are you out of your mind to ask such things" shows how rampant the fear of someone hurting their data is, not how "not normal" it is to ask for help.

You bolded the needed lines, nice. Your 3rd gen Intel shows that intensive unaligned YMM fetches/stores with OVERLAPPING work properly, in contrast to my mobile 2nd gen. My money is scarce and I couldn't ask my friends for more favors: some don't have 3rd+ gen CPUs, and others have helped me several times already, so I am reluctant to bother them again. I was feeling kind of down; now I see that the 256-bit unaligned reads/writes with overlaps are executed in a buggy way on the i5-2540M. This laptop was bought by my brother from Newegg and is equipped with 16GB RAM. He gave it to me to prepare my heavy-decompression-textual showdown, and I was happy until this bug in the AVX occurred. As you can see, the QWORD and XMM counterparts work properly. Anyway, I felt cheated because I couldn't finish and present my tests with AVX code versus the rest of the superfast performers; I had to go with 4xQWORD operations instead of 1xYMM, which was the idea!
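As a side note on "1xYMM versus 4xQWORD": in plain C terms, one 32-byte load/store pair and four 8-byte load/store pairs are not interchangeable once the destination starts fewer than 32 bytes after the source, on any CPU. The sketch below only illustrates that general property; the function names are hypothetical, the fill pattern is arbitrary, and whether this is what happens in the failing run is not established here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One 32-byte load followed by one 32-byte store, like a vmovdqu pair:
   all 32 source bytes are read before any destination byte is written. */
static void copy32_single(unsigned char *dst, const unsigned char *src)
{
    unsigned char tmp[32];
    memcpy(tmp, src, 32);
    memcpy(dst, tmp, 32);
}

/* Four 8-byte load/store pairs in sequence: a later load can observe
   bytes written by an earlier store when dst - src < 32. */
static void copy32_four_qwords(unsigned char *dst, const unsigned char *src)
{
    for (size_t i = 0; i < 32; i += 8) {
        unsigned char tmp[8];
        memcpy(tmp, src + i, 8);
        memcpy(dst + i, tmp, 8);
    }
}

/* Returns 1 if the two strategies give different results for this fill
   pattern when dst starts `dist` bytes after src, otherwise 0. */
static int copies_differ(size_t dist)
{
    assert(dist >= 1 && dist <= 32);
    unsigned char a[96], b[96];
    for (size_t i = 0; i < sizeof a; i++)
        a[i] = b[i] = (unsigned char)(i < dist ? 'A' + i : 0);
    copy32_single(a + dist, a);
    copy32_four_qwords(b + dist, b);
    return memcmp(a, b, sizeof a) != 0;
}
```

With this pattern, `copies_differ(8)` returns 1 and `copies_differ(32)` returns 0: at distance 8 the QWORD version re-reads the bytes it has just written and so replicates the 8-byte pattern, while the single 32-byte copy reads the old contents.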

The crippled showdown/benchmark is made with 4xQWORDs and posted on my blog:
The 88 benchmark | Sanmayce's dumps[^]

In my eyes, Toshiba, or Intel for that matter, ought to replace my buggy laptop free of charge with one with similar characteristics but with a 3rd gen, or better, CPU. They made a product that fails to work! Had I had more money, I would have bought a new one without making any noise, but being moneyless is no fun.

Again, thanks a lot.
My guess is that this bug is fixed for good, that is, 4th, 5th and 6th generations are okay as well.
I will return to Intel's forum; there, Tim Prince, a veteran in compiler design and an expert on Intel's mojos, has tried to help me, but for some reason it didn't work out. Anyway, to finish my post I will cite a piece of wisdom from Napoleon, which hung framed on the wall and which I used to look at when I was in the army:
"The coherence is a top virtue!"
I like to think that I am chasing the goals steadily without "quitting" them.
   
Comments
Dave Kreskowiak 30-Jun-16 16:26pm
   
Considering I'm still rebuilding 9 of my 11 machines at work because of a zero-day virus that a small group of us got hit by last week, yeah, I'm justified in saying that it's ridiculous to ask strangers to download unknown code from an untrusted source and blindly execute it.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



