Forrest Heller

SHA1 in Verilog

Stealing my code

If you are going to use my code, just acknowledge me. All of this is original code except for the keycode to ASCII converter, which I was much too lazy to type myself. The conversion block is copied without modification from John Clayton's much more fancy PS/2 module. I generated large amounts of code with small C programs.

The Assignment (Boring)

Usually I do not post my school-related code or projects but I thought this project was too funny not to post. I am currently taking the Digital Logic class offered by the University of Colorado. This class is an introduction to digital design and includes a laboratory session utilizing Verilog with FPGAs. The most recent assignment was a project to "design, build, and demonstrate an interesting project." The design must be fully Combinational, save for latching user-input to overcome the limitations of our FPGA development board (Altera DE2). Many of my classmates designed arithmetic circuits--however, I thought it would be funny (and by funny I mean laugh-out-loud hilarious) if I managed to implement the SHA1 hash in such a manner.

Introduction to SHA1

SHA1 is a cryptographically secure hash function in widespread use. It takes data (commonly known as the "message") and produces a 40-byte hash ("digest") that is unique to every message. You can know a hash for a given message, but you can't know a message for a given hash--it takes too much computing power. Examples (note the hashes are given in hexadecimal notation--they are just numbers!): SHA1("The quick brown fox jumps over the lazy dog") = 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12 SHA1("kitty") = 95d79f53b52da1408cc79d83f445224a58355b13 SHA1("Kitty") = a33027811ed9a95dc65084c59ae0560b81734d54 (capitalized k) Example uses:

My idea

To implement the SHA1 hash I simply looked at the unomptimized--but easy to read--C implementation and translated it into Verilog code. The basic idea is to unroll the C loops into Verilog assign statements with an individual assign statement and wire for each line of C code where an assignment was made. For example, unsigned long E;...E = D; becomes wire [31:0] E [80:0];...assign E[1] = D[0]; I do not know how to do this concisely in Verilog so I wrote a C program to generate the otherwise tedious Verilog. By the time I was informed of Verilog's generate statement, it was too late. Also, after screwing around with the generate statement for five minutes, I couldn't get it to compile. Hence, it is pointless.

The key to SHA1 in Combinational logic

My program generated this Verilog module. As input, the module takes the currently calculated hash value for the message and the 512-bit message block to evaluate. As output, it produces the next intermediate (or final) hashes. It does not handle padding.

This is the process by which I translated the C code to Verilog. Example C loop iteration for SHA1 round 2: temp = SHA1CircularShift(5,A) + (B ^ C ^ D) + E + W[t] + K[1]; E = D; D = C; C = SHA1CircularShift(30,B); B = A; A = temp; Generated Verilog code for first iteration: assign temp[20] = `SHA1CircularShift(5,A[20]) + (B[20]^C[20]^D[20]) + E[20] + W[20] + `kone; assign E[20+1] = D[20]; assign D[20+1] = C[20]; assign C[20+1] = `SHA1CircularShift(30,B[20]); assign B[20+1] = A[20]; assign A[20+1] = temp[20][31:0]; Each assignment in the C code is mimicked by an individual bus assignment in the Verilog code. 20 of these Verilog blocks make up the C for loop above.

How well does it work?

When combined with an LCD module for displaying the SHA1 hash and a PS/2 keyboard interface parser to enable the user to enter a 55-byte message the logic takes 55% of logic elements available on an Altera Cyclone 2 (EP2C35). It takes around a half hour to synthesize and assemble the program for use an Altera DE2 board. The delay propagation from latching a message for digest and retrieving the hash is ~560+ ns. The Altera DE2 runs at 50 MHz so I had to find a way to delay the display of the hash until the SHA1 combinational logic settled. An overview of the setup Therefore, to get the SHA1 hash on an Altera DE2 board with my code on it you:

  1. Type in your message with a PS/2 keyboard
  2. Flip switch 1
  3. Wait at least 560 ns (560 billionths of a second)
  4. Flip switch 17
Flipping switch 1 triggers a padding of the SHA1 message block and latches the message block to the input of the combinational (Combinational?) SHA1 logic. Flipping switch 2 displays the logic on the LCD screen. However, the entire hash doesn't fit on the LCD screen, so the final 32-bit octet is displayed on on-board 8 7-segment displays.

An overview of the setup This is SHA1 the hash of "kitty".

Casualties

When the teaching assistant assigned to my lab section first saw my generated Verilog code (which at the time produced a hash, but not the correct one) he became visibly disturbed and told me to try and implement SHA1 with registers first. I did so. The resulting compiled design overheated an Altera DE2 FPGA development board. This is almost certainly because I ignored the "21000 Combinational loops" warning--in the software world only whiners and the underconfident pay attention to compiler warnings.

Simulation

I came to my senses and decided to simulate my code before destroying more boards. I used the open source Icarus Verilog. As opposed to waiting for Quartus to compile my code--30 minutes of my life--Icarus Verilog compiles and simulates it instantly.

Is it useful?

No. I believe the rate at which my synthesized design generates hashes--~1.8 million hashes/second according to the Quartus Timing Analyzer--does not compare to a 1 GHz CPU at these bytes per cycle. This is the primary reason why you can't find another combinational SHA1 implementation on the Internet.

Comparison

My design executes SHA1 in a 100% parallel fashion. If ever the delay times in an FPGA are reduced drastically enough then my code can be synthesized to beat all pipelined SHA1 implementations. This is super-duper unlikely. My code is an academic exercise. Currently, the only IC that produces an SHA1 hash is from CAST. CAST claims their chip runs at a maximum of clock rate of 500 MHz. They have pipelined the SHA1 calculation to 82 cycles on a 512-bit block. 500 MHz /82 cycles = 6.1 million hashes/ second. This almost quadruples my rate of 1.8 MHz--not to mention it requires many less logic gates and is fully synchronous. OpenCores has a SHA1 core by Paul Hartke. His performance claims also dwarf mine. I did not use these projects as resources while writing my own SHA1 implementation because mine is for humor while these are for real use.

Walkthrough of SHA1 code

Instead of walking through the generated sha1.v file I will walk through the C code that generates it. However, a cursory glance will reveal that I looked at the RFC implementation and then blindly mimicked the C code with my temporary wire assignments.

This code defines the constants found in the RFC for SHA1. Their magical properties are unknown to me, but they're probably prime numbers. printf("/* round constants */\n"); printf("`define kzero 32'h5A827999\n"); printf("`define kone 32'h6ED9EBA1\n"); printf("`define ktwo 32'h8F1BBCDC\n"); printf("`define kthree 32'hCA62C1D6\n"); This defines the circular hash function of SHA1 as a macro. Although it looks like I copied and pasted it from the RFC, I actually typed it. printf("`define SHA1CircularShift(bits,word) (((word)<<(bits))|((word)>>(32-(bits))))\n"); Setup module declaration: printf("module sha1(input wire [511:0] block ,\n input wire [159:0] current_hash,\n output wire [159:0] new_hash);\n\n"); This declares all the wires for the intermediate steps of the hash. They match the RFC naming conventions. printf("/*temporary wire variable to help with W initialization*/\n"); printf("wire [31:0] Wt [63:0];\n"); printf("/*mirrors W in SHA1 RFC reference*/\n"); printf("wire [31:0] W [79:0];\n"); printf("/* These variables mirror those in the SHA1 RFC reference*/\n"); printf("wire [31:0] temp [79:0];\n"); printf("wire [31:0] A [80:0];\n"); printf("wire [31:0] B [80:0];\n"); printf("wire [31:0] C [80:0];\n"); printf("wire [31:0] D [80:0];\n"); printf("wire [31:0] E [80:0];\n\n"); This does the same thing as the equivalent C loops. printf(" /* intialize first 16 words of W with the contents of block*/\n"); for(i = 0; i < 16; i++) { printf("assign W[%d] = (block[%d:%d]) | (block[%d:%d] << 8) | (block[%d:%d] << 16) |( block[%d:%d] << 24);\n",i,i*32+31,i*32+24,i*32+23,i*32+16,i*32+15,i*32+8,i*32+7,i*32); } printf("/* initialize rest of W */\n"); for (i = 16; i < 80; i++) { sprintf(str,"W[%d] ^ W[%d] ^ W[%d] ^ W[%d]",i-3,i-8,i-14,i-16); printf("assign Wt[%d]= %s;\n",i-16,str); sprintf(str,"Wt[%d]",i-16); circular_shift(str2,1,str); printf("assign W[%d] = %s;\n",i,str2); } This is the heart of the SHA1 hash--the four rounds of circular shifting. Notice the practice of assigning to temp[n] and then using temp[n] to calculate temp[n+1]. After the four rounds are complete we only care about A[80], B[80], C[80], D[80], and E[80] as these will be added to the current hash to create the final hash. printf("/* assign initial hashes*/\n"); printf("assign A[0] = current_hash[31:0];\n"); printf("assign B[0] = current_hash[63:32];\n"); printf("assign C[0] = current_hash[95:64];\n"); printf("assign D[0] = current_hash[127:96];\n"); printf("assign E[0] = current_hash[159:128];\n\n"); printf("/* * * * * First Rounds [0-19] * * * * */\n"); for(i = 0; i < 20; i++) { sprintf(str2,"A[%d]",i); circular_shift(str,5,str2); printf("assign temp[%d] = %s + ((B[%d] & C[%d]) | ((~B[%d]) & D[%d])) + E[%d] + W[%d] + `kzero;\n",i,str,i,i,i,i,i,i); print_round_iteration(i); } printf("\n/* * * * * Second Rounds [20-39] * * * * */\n"); for (i = 20; i < 40; i++) { sprintf(str2,"A[%d]",i); circular_shift(str,5,str2); printf("assign temp[%d] = %s + (B[%d]^C[%d]^D[%d]) + E[%d] + W[%d] + `kone;\n",i,str,i,i,i,i,i); print_round_iteration(i); } printf("\n/* * * * * Third Rounds [40-59] * * * * */\n"); for (i = 40; i < 60; i++) { sprintf(str2,"A[%d]",i); circular_shift(str,5,str2); printf("assign temp[%d] = %s + ((B[%d] & C[%d]) | (B[%d] & D[%d]) | (C[%d] & D[%d])) + E[%d] + W[%d] + `ktwo;\n",i,str,i,i,i,i,i,i,i,i); print_round_iteration(i); } printf("\n/* * * * * Fourth Rounds [60-79] * * * * */\n"); for (i = 60; i < 80; i++) { sprintf(str2,"A[%d]",i); circular_shift(str,5,str2); printf("assign temp[%d] = %s + (B[%d]^C[%d]^D[%d]) + E[%d] + W[%d] + `kthree;\n",i,str,i,i,i,i,i); print_round_iteration(i); } This is the final operation of an SHA1 hash: to add together the current hashes with the intermediate hashes (A[80],B[80],C[80],D[80],E[80]) to get a final hash. printf("\n/* finally, add the new values to the current hash for a new hash*/\n"); printf("assign new_hash[31:0] = current_hash[31:0] + A[80];\n"); printf("assign new_hash[63:32] = current_hash[63:32] + B[80];\n"); printf("assign new_hash[95:64] = current_hash[95:64] + C[80];\n"); printf("assign new_hash[127:96] = current_hash[127:96] + D[80];\n"); printf("assign new_hash[159:128]= current_hash[159:128] + E[80];\n"); printf("endmodule\n"); return 0; } This mirrors the C code after the main circular shift operation in one of the for loops comprising a round. void print_round_iteration(int i) { char str2[256],str[256]; printf("assign E[%d+1] = D[%d];\n",i,i); printf("assign D[%d+1] = C[%d];\n",i,i); sprintf(str2,"B[%d]",i); circular_shift(str,30,str2); printf("assign C[%d+1] = %s;\n",i,str); printf("assign B[%d+1] = A[%d];\n",i,i); printf("assign A[%d+1] = temp[%d][31:0];\n",i,i); } This seems like a dumb function but I didn't always have this defined as a macro. void circular_shift(char *str, int bits, char *word) { sprintf(str,"`SHA1CircularShift(%d,%s)",bits,word); }

Acknowledgements

Issues

The biggest issue was my assumption that I could be sneaky and replace the following lines of C code: W[t] = context->Message_Block[t * 4] << 24; W[t] |= context->Message_Block[t * 4 + 1] << 16; W[t] |= context->Message_Block[t * 4 + 2] << 8; W[t] |= context->Message_Block[t * 4 + 3]; with something like: W[t] = {M[t+3],M[t+2],M[t+1],M[t]}; Dumb move. I tried many different ways of summing the bits and to this day I don't understand (or really care) why summing didn't work. As soon as I replaced that code with the actual shifts: assign W[0] = (block[31:24]) | (block[23:16] << 8) | (block[15:8] << 16) |( block[7:0] << 24);

  1. my simulation produced the correct SHA1
  2. I went to the local King Soopers.
  3. I bought 1/2 gallon of chocolate milk.
  4. I drank all of it.
  5. I got sick.

This doubles as a lab report. Normally I wouldn't hold your hand through my C source code or tell you a boring story about my bit summing incompetience.

Page last modified on January 04 2015 20:43:12 (PST)