LinuxSecurity.com
Share your story
The central voice for Linux and Open Source security news
Home News Topics Advisories HOWTOs Features Newsletters About Register

Welcome!
Sign up!
EnGarde Community
Login
Polls
What is the most important Linux security technology?
 
Advisories
Community
Linux Events
Linux User Groups
Link to Us
Security Center
Book Reviews
Security Dictionary
Security Tips
SELinux
White Papers
Featured Blogs
All About Linux
DanWalsh LiveJournal
Securitydistro
Latest Newsletters
Linux Security Week: February 6th, 2012
Linux Advisory Watch: February 3rd, 2012
Subscribe
LinuxSecurity Newsletters
E-mail:
Choose Lists:
About our Newsletters
RSS Feeds
Get the LinuxSecurity news you want faster with RSS
Powered By

  
John the Ripper 1.7.2 Print E-mail
User Rating:      How can I rate this item?
Source: Openwall-announce -- Alexander Peslyak - Posted by Eric Lubow   
Security Projects John the Ripper 1.7.2 (a "development" version) adds bitslice DES assembly code for x86-64 making use of the 64-bit mode extended SSE2 with 16 XMM registers. You can download it at the usual location: http://www.openwall.com/john/.

The SSE2 code introduced in 1.7.1 was only available with 32-bit x86 builds of John. Now John 1.7.2 is able to use SSE2 with 64-bit builds as well, which in practice brings their performance on current 64-bit capable AMD processors to the same level that 1.7.1 had achieved with 32-bit builds running on those same processors. So if your operating system is 64-bit only (no 32-bit compatibility) and thus you were unable to make use of the SSE2 acceleration in 1.7.1, you should be able to do so now. :-)

There's no performance improvement for current 64-bit capable Intel processors with this version. Those were happy running native x86-64 code, achieving the same level of performance that they do with SSE2 code. There's, however, one advantage to using this version on Intel processors as well: the performance should be more stable, regardless of C compiler version and options.

Some technical detail for the curious:

Yes, the new 64-bit mode SSE2 code is also different. The DES S-box implementations are not derived from the 32-bit mode MMX/SSE2 ones, but rather they have been generated from Matthew Kwan's optimized S-box expressions anew, making use of the 16 XMM registers. A Perl script was written and used to compile the expressions into a virtual 3-operand architecture, convert to 2-operand and allocate virtual registers, allocate real registers and spill whatever didn't fit into registers to memory locations, and do some instruction scheduling based on published SSE2 instruction latencies for the Pentium 4 (which are larger than those for current AMD64 processors, so should be good for both camps). Then the script's output required some manual work to resolve the few occurrences of unsupported operand combinations. Finally, the individual S-boxes were subjected to "brute-force instruction scheduling", which tested thousands of different ways in which instructions could be re-ordered, measuring the actual execution times in clock cycles (yes, code was actually being run on the CPU multiple times for each of the thousands of code versions being tested).

Now, you would expect excellent performance from the above, wouldn't you? Well, it's not so great in practice. The 8-registers MMX code from which the non-64-bit-mode SSE2 code was derived is excellent, making it very hard to do things better - even with more registers. Yes, the availability of 16 registers helps save some moves and reduces the instruction count in S-boxes by 10%, but it turns out that accessing the added 8 XMM registers may be slower than accessing memory (actually, L1 cache) in some cases. I think that this has to do with limitations of the instruction decoder. For example, blindly converting the "old" 8-registers SSE2 code to use the extra 8 registers instead of temporaries in memory results in an 8% slowdown on an Athlon 64, but makes no difference on a Xeon - which is consistent with the "decoder theory". Brute-re-scheduling that code does get back some of the lost 8%, but it does not restore the original 8-registers code performance.

So for the time being, I am satisfied with the brand new 16-registers code being about as fast as the old 8-registers one is. In future versions, I might further improve the Perl script to do smarter register allocation, taking into account the fact that consecutive instructions which access registers XMM8-XMM15 have a higher cost on AMD64 processors. (In my testing, the slowdown is seen with 4+ consecutive instructions like that, but in real-world code there are more constraints.) I might also generate some hybrid native x86-64 + SSE2 code, which might make this issue irrelevant... or it might not. :-)

Only registered users can write comments.
Please login or register.

Powered by AkoComment!

 
< Prev   Next >
    
Partner

 

Latest Features
Password guessing with Medusa 2.0
Password guessing as an attack vector
Squid and Digest Authentication
Squid and Basic Authentication
Demystifying the Chinese Hacking Industry: Earning 6 Million a Night
Free Online security course (LearnSIA) - A Call for Help
What You Need to Know About Linux Rootkits
Review: A Practical Guide to Fedora and Red Hat Enterprise Linux - Fifth Edition
Using the sec-wall Security Proxy
sec-wall: Open Source Security Proxy
Yesterday's Edition
Hackers Hit Apple Supplier Foxconn, Leak Usernames And Passwords
Hackers Mug Google's Wallet App on Rooted Android Devices
Google Chrome will no longer check for revoked SSL certificates online
Have Your Users' Passwords Already Been Hacked?
DDoS Tools Flourish, Give Attackers Many Options
Partner Sponsor

Community | HOWTOs | Blogs | Features | Book Reviews | Networking
 Security Projects |  Latest News |  Newsletters |  SELinux |  Privacy |  Home
 Hardening |   About Us |   Advertise |   Legal Notice |   RSS |   Guardian Digital
(c)Copyright 2012 Guardian Digital, Inc. All rights reserved.