At the end of June Microsoft released many documents which aim to provide normative information about the various protocols and file formats used by Microsoft products. Among these documents is MS-OFFCRYPTO, which provides information about the algorithms used to encrypt Microsoft Office documents (new and old) and which is available under the terms of Microsoft’s Open Specification Promise.
The really useful information in this document is in the two sections that describe the key generation and password verification algorithms. The key generation section describes a pseudo function to generate an encryption key that can be used to decrypt the encrypted package (Office 2007 zip file). The password verification section describes the algorithm to use to determine if the supplied password is valid.
This post is to provide and document the code I’ve created to decrypt an encrypted Office 2007 document.
Initially I was unsuccessful in my attempt to write code to decrypt encrypted Office documents and I have to thank David LeBlanc for his help and patience guiding me to a solution. David wrote the MS-OFFCRYPTO document so I’ve been very fortunate to have had such an expert guide.
Update 2008-12-12: David LeBlanc has now published the corrected version of the Office Crypto documentation so I’m making my sample code available under a Creative Commons Share-Alike licence.
Update 2012-04-04: @Webie has created a C implementation of the password validation routines for use on Linux using OpenSSL and libgsf (to read OLE Storage files). At this time, support is provided for 2007 and 2010 Agile encryption. You will find his implementation here: https://github.com/magnumripper/magnum-jumbo
Reading the MS-OFFCRYPTO document
There’s a lot of stuff in the MS-OFFCRYPTO document which is necessary in theory, so Microsoft *has* to document it, but which is overkill when considering the needs of just decrypting an Office 2007 document.
While a normal Office 2007 document is a zip file, an encrypted document is an OLE storage file. While an Office 2007 document can be manipulated using the System.IO.Packaging.Package class, encrypted documents must be handled using the Windows API Storage interfaces and functions. The code that will be attached contains a class that wraps these interfaces and functions so you will be able to open and access the file contents from C#.
The reason the file is a storage file rather than a zip file seems to be that the Office 2007 team tried to use the Microsoft DRM to implement Office encryption. Microsoft’s DRM technology stores both the payload (the encrypted document) and other information describing the encryption algorithms and other transforms used to obfuscate the payload. A DRM compliant application can use this information to decode the payload (assuming the application knows the password or license). As a result the DRM technology allows a producer application to use any arbitrary encryption algorithms to create an encrypted payload and describe these algorithms so a capable consumer can decode the payload.
However, Office 2007 doesn’t really use the whole DRM infrastructure when encrypting Office documents. Presumably DRM is used so that if a company wants to use the DRM infrastructure to encrypt documents using some proprietary algorithm they can.
In some senses the Office team implements a proprietary encryption mechanism but, for some reason, they chose to do so in a way that is not (cannot be?) described in DRM compliant terms. A measure of the impact of this approach is that the System.IO.Packaging.Package class is unable to open an encrypted Office file.
On the plus side, it does mean there’s no need to plough through all the DRM encryption/transform descriptions. Instead, you can take a shortcut and read just two streams from the storage file and ignore the rest! The streams to read are EncryptionInfo and EncryptedPackage.
The code reads the storage file to access the EncryptionInfo stream, which is parsed to retrieve information such as the encryption and hashing algorithms to use, key sizes and various byte blocks used to verify the password.
Although the contents of this stream are documented in MS-OFFCRYPTO, to facilitate understanding of the structure, a hex dump of the content of a sample encryption info stream is included in the comments at the beginning of the code file and reproduced below.
The sample content in the dump is taken from a .xlsb storage file encrypted using the password “password” (without the quotes).
The code is a single C# class with a single entry point which can be called as:
Package OfficeCrypto.OpenEncryptedOfficeFile(string filename, string password);
It takes the name of the encrypted file and the password used to encrypt it, and returns a System.IO.Packaging.Package instance. If there are errors along the way – for example the password may be wrong – it will throw exceptions you need to catch.
I’ve tried to document it reasonably thoroughly and reference relevant parts of the key generation and password verification sections of MS-OFFCRYPTO. In fact, the code is deliberately not optimized so its structure and the variable names used can closely resemble the algorithm descriptions. Refactoring this code may make it appreciably quicker.
Because there is a lot of code, I’ve added *lots* of “regions” so you can start by collapsing all regions then “drill-in” to areas of the code to get a good idea of the structure and functions available.
AES and SHA1 implementations
Being managed code it uses the managed code implementations of SHA1 and AES (in the System.Security.Cryptography namespace). To verify these implementations return expected values when used, there are two test functions: one to test SHA1 and one to test AES. The test functions use known inputs and verify the results against expected return values. These test values are taken from articles on Wikipedia and links to these articles are included in the code comments.
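To illustrate the kind of known-answer test described above, here is a minimal sketch in Python (not the C# from the attached code). It uses three widely published SHA-1 test strings; the function name sha1_self_test is made up for this example, and no AES test is shown because the Python standard library has no AES implementation.

```python
import hashlib

# Widely published SHA-1 known-answer vectors (message -> hex digest).
SHA1_VECTORS = {
    b"": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
    b"abc": "a9993e364706816aba3e25717850c26c9cd0d89d",
    b"The quick brown fox jumps over the lazy dog":
        "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12",
}


def sha1_self_test() -> bool:
    """Return True only if every vector hashes to its expected digest."""
    return all(
        hashlib.sha1(message).hexdigest() == expected
        for message, expected in SHA1_VECTORS.items()
    )
```

Running a check like this before debugging the key derivation rules out the hash implementation as the source of a mismatch.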
Generating the encryption key
This operation is performed by the function GeneratePasswordHashUsingSHA1 and is the heart of the code. It’s also the piece of this code that I could not get to work without David LeBlanc’s insight.
The clue I think I’m OK providing is that you need to ignore the strictures of step 3 of the key generation section and include step 4(a) even if the algorithm you are instructed to use is AES128.
Update: Some of the following comments are no longer required because the new version of MS-OFFCRYPTO is updated to include the same comments.
It seems that, by default, Office 2007 documents are encrypted using AES128 and the encryption key is generated using SHA1.
An AES128 key is 16 (0x10) bytes long (128/8). SHA1 will always generate a 20 (0x14) byte hash. Step 3 of the key generation section says that if the hash is longer than the required key (which it is when AES128 and SHA1 are used) you should just take the first 16 (0x10) bytes of the SHA1 hash. However this doesn’t work for me. Including step 4(a) of the same section and using the first 16 (0x10) bytes of the hash generated by this step does work for me. Maybe it will work for you as well.
The other clue is that you should not try to use the CryptoAPI and should instead use the Rijndael (or AES) managed class (in the System.Security.Cryptography namespace). When you read the documentation about the EncryptionInfo stream contents you will see that the codes defining the encryption and hashing algorithms to use are exactly those defined in WinCrypt.h which might lead you toward the CryptoAPI.
However, although it’s not stated, the encryption/decryption algorithms do not use padding. Maybe it’s me, but I can’t figure out how to use the CryptoAPI without a padding mode. So far as I can tell, whenever a block cipher like AES is used the CryptoAPI will *always* use PKCS5 padding. When I try to use any other padding mode the API always returns an error.
By default the Rijndael (or AES) managed implementations also use this style of padding (PaddingMode.PKCS7 in .NET) so you need to explicitly set a padding mode of None. But at least this is an option with the managed code implementation.
While the documentation in MS-OFFCRYPTO does include descriptions of algorithms, it doesn’t contain any conformance data so when the decryption process doesn’t work and you need to debug your code it’s not possible to tell which aspect of the decryption process hasn’t worked correctly. The hex dumps below are provided to try and provide a simple form of conformance data.
!! HEALTH WARNING !!
These dumps are provided as examples and “as-is”. They work for me though there’s no saying they will work for you so I’m making no promises as to their fitness for any specific purpose.
Let me first reproduce the hex dump I include in the comments of the code. This is a dump of the EncryptionInfo stream created when I password protected an Office 2007 workbook using the password “password”:
/// 00000000 03 00 02 00 Version
/// 00000004 24 00 00 00 Flags (fCryptoAPI + fAES)
/// 00000008 A4 00 00 00 Header length
/// 0000000C 24 00 00 00 Flags (again)
/// 00000010 00 00 00 00 Size extra
/// 00000014 0E 66 00 00 Alg ID
///     AlgID 0x0000660E = 128-bit AES,
///     AlgID 0x0000660F = 192-bit AES,
///     AlgID 0x00006610 = 256-bit AES
/// 00000018 04 80 00 00 Alg hash ID 0x00008004 SHA1
/// 0000001C 80 00 00 00 Key size
///     0x00000080 = 128-bit
///     0x000000C0 = 192-bit
///     0x00000100 = 256-bit
/// 00000020 18 00 00 00 Provider type 0x00000018 AES
/// 00000024 A0 C7 DC 02 00 00 00 00 Reserved
/// 0000002C 4D 00 69 00 M?i? CSP Name
/// 00000030 63 00 72 00 6F 00 73 00 6F 00 66 00 74 00 20 00 c?r?o?s?o?f?t? ?
/// 00000040 45 00 6E 00 68 00 61 00 6E 00 63 00 65 00 64 00 E?n?h?a?n?c?e?d?
/// 00000050 20 00 52 00 53 00 41 00 20 00 61 00 6E 00 64 00 ?R?S?A? ?a?n?d?
/// 00000060 20 00 41 00 45 00 53 00 20 00 43 00 72 00 79 00 ?A?E?S? ?C?r?y?
/// 00000070 70 00 74 00 6F 00 67 00 72 00 61 00 70 00 68 00 p?t?o?g?r?a?p?h?
/// 00000080 69 00 63 00 20 00 50 00 72 00 6F 00 76 00 69 00 i?c? ?P?r?o?v?i?
/// 00000090 64 00 65 00 72 00 20 00 28 00 50 00 72 00 6F 00 d?e?r? ?(?P?r?o?
/// 000000A0 74 00 6F 00 74 00 79 00 70 00 65 00 29 00 00 00 t?o?t?y?p?e?)
/// 000000B0 10 00 00 00 Salt size
/// 000000B4 90 AC 68 0E 76 F9 43 2B 8D 13 B7 1D Salt
/// 000000C0 B7 C0 FC 0D
/// 000000C4 43 8B 34 B2 C6 0A A1 E1 0C 40 81 CE Encrypted verifier
/// 000000D0 83 78 F4 7A
/// 000000D4 14 00 00 00 Hash length
/// 000000D8 48 BF F0 D6 C1 54 5C 40 EncryptedVerifierHash
/// 000000E0 FE 7D 59 0F 8A D7 10 B4 C5 60 F7 73 99 2F 3C 8F
/// 000000F0 2C F5 6F AB 3E FB 0A D5
OK, it doesn’t look quite the same as in the code file because WordPress removes all the nice whitespace padding though I hope you can still easily see the structure.
This stream tells the code to use AES 128 (offset 0x00000014) and SHA1 (offset 0x00000018). It also specifies the salt size (offset 0x000000B0) and the salt used when encrypting the document (offsets 0x000000B4-0x000000C3). The 16 (0x10) byte encrypted verifier (offset 0x000000C4) and the 32 (0x20) byte encrypted hash of the verifier (offset 0x000000D8) are for use when verifying the password (see below).
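As a rough illustration of how the layout in the dump can be read, here is a minimal Python sketch (the attached code is C#; this is not a port of it). The field names, the EncryptionInfo tuple and the parse_encryption_info function are made up for this example, and the fixed sizes (16 byte encrypted verifier, 32 byte encrypted verifier hash, little-endian integers) are assumptions taken from the AES128/SHA1 sample above rather than from the full specification.

```python
import struct
from typing import NamedTuple


class EncryptionInfo(NamedTuple):
    """Hypothetical container for the fields shown in the hex dump."""
    version: tuple
    flags: int
    alg_id: int
    alg_id_hash: int
    key_size_bits: int
    provider_type: int
    csp_name: str
    salt: bytes
    encrypted_verifier: bytes
    verifier_hash_size: int
    encrypted_verifier_hash: bytes


def parse_encryption_info(data: bytes) -> EncryptionInfo:
    # Offsets 0x00-0x0B: version (two 16-bit values), flags, header length.
    major, minor, flags, header_len = struct.unpack_from("<HHII", data, 0)
    header_start = 12
    # Header: flags (again), size extra, alg ID, hash alg ID, key size,
    # provider type, then 8 reserved bytes and the UTF-16LE CSP name.
    (_hdr_flags, _size_extra, alg_id, alg_id_hash,
     key_size_bits, provider_type) = struct.unpack_from("<6I", data, header_start)
    name_start = header_start + 32
    header_end = header_start + header_len
    csp_name = data[name_start:header_end].decode("utf-16-le").rstrip("\x00")
    # Verifier block: salt size, salt, encrypted verifier, hash size, hash.
    pos = header_end
    (salt_size,) = struct.unpack_from("<I", data, pos)
    pos += 4
    salt = data[pos:pos + salt_size]
    pos += salt_size
    encrypted_verifier = data[pos:pos + 16]   # assumed fixed at 16 bytes
    pos += 16
    (verifier_hash_size,) = struct.unpack_from("<I", data, pos)
    pos += 4
    encrypted_verifier_hash = data[pos:pos + 32]  # assumed 32 bytes for AES
    return EncryptionInfo((major, minor), flags, alg_id, alg_id_hash,
                          key_size_bits, provider_type, csp_name, salt,
                          encrypted_verifier, verifier_hash_size,
                          encrypted_verifier_hash)
```

Fed the bytes from the dump, a parser along these lines should report the AES128 and SHA1 identifiers at the offsets listed above.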
The first step is to hash the salt and password. In this case it’s a 16 byte salt and 16 bytes of password (the unicode representation of “password”). The result is the following 20 (0x14) byte hash:
00000000 A1 21 9D 6D 2D 77 A1 92 EA 2F A2 E6 E3 7B C8 60
00000010 CF EF 5F DE
The algorithm then has to iterate from 0..49999 concatenating the iteration number (4 bytes) and the previous hash result to generate a new hash. After the zeroth iteration (i==0) this is what I see:
00000000 8B 33 F7 48 FA 35 AF BB 34 22 E8 AC D7 C6 DA E1
00000010 8A F1 81 78
At the end of the iteration (after hashing with i==49999) I see:
00000000 7D C5 97 D9 01 2A A3 E0 B8 56 3B 56 69 00 06 10
00000010 CC C3 A6 D4
Next the last hash generated by the iterator has to be hashed with four zero bytes (what the documentation calls “block 0”). In the iterator the hash is appended to the iterator count then hashed. Here the four zero bytes are appended to the hash. Anyway, here’s my result:
00000000 A6 65 59 03 FD 23 94 C8 83 1E 71 62 D7 8B 42 55
00000010 51 B9 14 E4
One of the clues given above is to include step 4(a) of the key derivation algorithm in all cases. After this step I see:
00000000 AC 7C 92 51 7C 31 2F B0 9F E9 32 E9 C0 62 D9 12
00000010 38 29 30 35
AES128 needs a key of 16 (0x10) bytes so take just the first 16 (0x10) bytes:
00000000 AC 7C 92 51 7C 31 2F B0 9F E9 32 E9 C0 62 D9 12
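The steps walked through above can be sketched in a few lines of Python (again, not the C# from the attached code). The function name derive_key and its parameters are made up for this example. The details not spelled out in the post, namely the little-endian 4 byte counter and the 64 byte buffer of 0x36 bytes used in step 4(a), come from my reading of MS-OFFCRYPTO and should be checked against the specification. The sketch only handles keys no longer than a SHA1 hash (20 bytes), which covers AES128.

```python
import hashlib
import struct


def derive_key(password: str, salt: bytes, key_bits: int = 128,
               spin_count: int = 50000) -> bytes:
    """Sketch of the SHA1 key derivation described in the post."""
    key_len = key_bits // 8
    if key_len > hashlib.sha1().digest_size:
        raise ValueError("this sketch only covers keys up to 20 bytes")
    # Initial hash: salt followed by the UTF-16LE password.
    h = hashlib.sha1(salt + password.encode("utf-16-le")).digest()
    # Iterations 0..spin_count-1: the 4 byte counter goes BEFORE the hash.
    for i in range(spin_count):
        h = hashlib.sha1(struct.pack("<I", i) + h).digest()
    # "Block 0": four zero bytes go AFTER the hash.
    h_final = hashlib.sha1(h + struct.pack("<I", 0)).digest()
    # Step 4(a) (assumed layout): XOR the hash into a 64 byte buffer
    # filled with 0x36, then hash the whole buffer.
    buf = bytearray(b"\x36" * 64)
    for index, value in enumerate(h_final):
        buf[index] ^= value
    x1 = hashlib.sha1(bytes(buf)).digest()
    # Keep only as many bytes as the cipher key needs (16 for AES128).
    return x1[:key_len]
```

Calling derive_key("password", salt) with the 16 byte salt from the EncryptionInfo dump gives a 16 byte value that can be compared against the key shown above; the intermediate hashes can be checked against the earlier dumps in the same way by printing h at each stage.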
Now you have a key that should successfully decrypt the payload (though don’t forget the first 8 bytes of the EncryptedPackage stream specify the length of the unencrypted content and should be removed before the payload is decrypted).
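Handling that 8 byte length prefix might look like the following Python sketch. The function names are made up for this example, and reading the prefix as a little-endian unsigned 64-bit integer is an assumption consistent with the other integers in the format. The trimming helper assumes the decrypted output can be longer than the stated length because the ciphertext is a whole number of cipher blocks.

```python
import struct


def split_encrypted_package(stream: bytes):
    """Separate the 8 byte size prefix from the encrypted payload."""
    if len(stream) < 8:
        raise ValueError("EncryptedPackage stream is too short")
    # Assumed: unsigned 64-bit little-endian length of the plain content.
    (plain_size,) = struct.unpack_from("<Q", stream, 0)
    return plain_size, stream[8:]


def trim_decrypted(plaintext: bytes, plain_size: int) -> bytes:
    """Drop any trailing block padding left over after decryption."""
    return plaintext[:plain_size]
```

The payload returned by split_encrypted_package is what gets passed to the cipher; trim_decrypted is applied to the cipher output before treating it as a zip package.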
However you can follow the algorithm in the password verification section to verify the key you’ve generated.
The first step is to decode the 16 (0x10) byte encrypted verifier which starts at 000000C4 in the hex dump of the encryption info stream above. After using the Rijndael managed cipher, setting a padding mode of None and specifying a block size of 16 (0x10) bytes or 128 (0x80) bits I see the following decrypted verifier:
00000000 11 92 99 99 FF 00 11 11 22 33 77 88 88 99 CC CC
Using the same cipher and decrypting the verifier hash at 000000D8 in the hex dump above I get:
00000000 A6 D5 6B D6 51 2C E2 01 AC 0E 82 E1 EE 43 79 32
00000010 6D 1C 1C BB
The final step is to hash the decrypted verifier (generated in the step before last) using SHA1.
00000000 A6 D5 6B D6 51 2C E2 01 AC 0E 82 E1 EE 43 79 32
00000010 6D 1C 1C BB
The last two results should be identical (and in my case they are) which confirms the key is valid and can be used to decode the payload.
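The final comparison can be sketched in Python as below. It starts from bytes that have already been decrypted, because the Python standard library has no AES implementation; the function name verifier_matches is made up for this example. Comparing only the first 20 bytes of the decrypted verifier hash is an assumption based on the 0x14 hash length in the dump (the encrypted hash is 32 bytes, a whole number of AES blocks).

```python
import hashlib
import hmac


def verifier_matches(decrypted_verifier: bytes,
                     decrypted_verifier_hash: bytes) -> bool:
    """Return True if SHA1(verifier) equals the stored verifier hash."""
    expected = hashlib.sha1(decrypted_verifier).digest()
    # Only the first 20 bytes are the hash; the rest is block padding.
    stored = decrypted_verifier_hash[:len(expected)]
    return hmac.compare_digest(expected, stored)
```

A False result here means either the password is wrong or the key derivation differs from what Office used, which is where the intermediate dumps above help narrow things down.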
And that, as they say, is all there is to it.