I've spent the last two weeks tearing my hair out at work, wrestling with what I originally believed to be network infrastructure issues. As part of our first big push to virtualize our forensic environment, we have purchased and begun experimenting with ESXi 5.5. During my preliminary evaluation of OSes for our forensic environment (we have historically used Windows 7), I immediately gravitated towards Windows 8/8.1, both because I've had excellent personal experiences with it (the UI is questionable but has gotten a lot better with 8.1) and because of all the core kernel improvements (especially SMB 3.0).
With that in mind, I recreated our forensic environment with Windows 8 as the base and began putting it through its paces, using it for daily forensic tasks. Note (background important to the story): we run a Netapp FAS2240-4 HA with roughly 140TB (raw) of storage for evidence and case work. It is configured with 10GbE mezzanine cards and attached directly to our ESXi box via SFP+ in an effort to keep traffic generated by evidence processing off our core (1GbE) network, where round trips are longer and the switching equipment is older and less reliable.
Both ESXi and 10GbE were new to our environment, each with its own pitfalls and learning curve, so when I first witnessed evidence on our NAS failing to verify properly in Encase or FTK Imager, I immediately assumed it was some sort of configuration issue. Evidence verification jobs were returning unbelievable numbers of bad sectors and segment CRC errors. Attempting to hash files in Encase gave very unpredictable results - not what you want out of a forensic tool. Most files would return valid (looking) hash values. However, during my review of the results, something caught my eye - groups of clearly dissimilar files (with distinctly different file sizes) were returning '93b885adfe0da089cdf634904fd59f71' (the MD5 of a single null byte) as their hash value. If I hadn't been looking closely for exactly these kinds of issues, I might never have noticed - unnerving to say the least. The fact that these dubious hashes occurred in runs of 50-100 files at a time made me think that Encase was having trouble reading parts of the image over the network (maybe losing its connection temporarily), much the same way it appeared to be having trouble reading image segments during verification, which would explain the large number of sector errors indicative of bad image segments. In FTK Imager (versions 3.1.4 and 3.2), verification speeds would start high (70+MB/s) and quickly drop off, slowly bleeding down to ~15MB/s by the end and returning 'mismatch' as the verification result.
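For anyone who wants to screen their own hash reports for the same symptom, here is a minimal sketch in Python. It assumes you can export (name, size, MD5) rows from your tool; the record layout and the function name `suspicious_hashes` are my own invention for illustration, not anything Encase provides. It flags any hash that is shared by files of different sizes, which is exactly the pattern described above.

```python
import hashlib
from collections import defaultdict

# MD5 of a single 0x00 byte, computed rather than hard-coded.
NULL_BYTE_MD5 = hashlib.md5(b"\x00").hexdigest()


def suspicious_hashes(records):
    """Return {md5: sorted file names} for hashes shared by files of different sizes.

    `records` is an iterable of (name, size_in_bytes, md5_hex) tuples.
    This layout is a hypothetical export format, not a real tool's schema.
    """
    names_by_hash = defaultdict(list)
    sizes_by_hash = defaultdict(set)
    for name, size, md5_hex in records:
        key = md5_hex.lower()
        names_by_hash[key].append(name)
        sizes_by_hash[key].add(size)
    # Files of different lengths should essentially never share an MD5,
    # so more than one distinct size under one hash points to a bad read.
    return {
        h: sorted(names_by_hash[h])
        for h, sizes in sizes_by_hash.items()
        if len(sizes) > 1
    }


if __name__ == "__main__":
    # Made-up sample rows for demonstration only.
    sample = [
        ("a.doc", 10240, NULL_BYTE_MD5),
        ("b.jpg", 48213, NULL_BYTE_MD5),
        ("c.txt", 5, hashlib.md5(b"hello").hexdigest()),
    ]
    for md5_hex, names in suspicious_hashes(sample).items():
        note = " (null-byte MD5)" if md5_hex == NULL_BYTE_MD5 else ""
        print(f"{md5_hex}{note}: {', '.join(names)}")
```

Genuine duplicate files share both hash and size, so they are not flagged; only the "different sizes, same hash" case is reported.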
Could it be an issue with IP-hash load balancing over Etherchannel/LACP? Maybe our new virtualized domain controller was timing out intermittently and user authentication to the NAS was being lost temporarily - some sort of Kerberos time source differential problem? Maybe 802.3x (flow control) was dropping frames? Was our 2240 silently corrupting data? Did Cryptolocker somehow make it onto our network and start overwriting various E01 segments?
I spent about a week working through all of the aforementioned possibilities with no resolution and was starting to lose hope - not a good feeling. But as part of any good troubleshooting process (and probably where I should have started), I thought - maybe it has something to do with Windows 8? Microsoft reworked the network sharing protocol (SMB) a lot between W7 and W8, so maybe something got knocked loose. However, as of the time of writing, Netapp's most recent ONTAP version (8.2.1) doesn't even support SMB 3.0 and will only negotiate client sessions to SMB 2.1, which has worked fine for us for a while now, so I wasn't readily considering it as a suspect. But as I exhausted other avenues, I had to face the fact that Windows 8 was the last remaining 'unknown' in our environment in this hellride - why would W8 make a difference if it's still using SMB 2.1 at the heart of its file sharing? Did the Windows file APIs change somehow?
I don't know, but recreating our forensic environment (exact same tools, exact same steps) with Windows 7 at the heart fixed the issue immediately. I am relieved, to say the least, but still nervous - why would this be the case? I am going to continue following up on this - I'm too invested at this point not to have an answer. I thought maybe Netapp was the issue - I evaluated the 'cifs stat' output from our filers but saw nothing unusual. At the time of writing, I am attempting to recreate the issue outside of our environment entirely, to see whether I get similar results and can pinpoint Windows 8 as the culprit.
Update (07/10/14): I forgot to share some of the things I tried during my research, none of which helped:
- Setting HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters\RequireSecureNegotiate to 0 (via http://support.microsoft.com/kb/2686098/en-us)
- Setting HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters\SessTimeout to 60 (via http://blogs.msdn.com/b/openspecification/archive/2013/03/27/smb-2-x-and-smb-3-0-timeouts-in-windows.aspx) - was 60s starting in Vista, changed to 20s in W8
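If you want to try the same two registry changes yourself, here is a rough sketch in Python that only builds the corresponding `reg add` command lines and prints them; it does not touch the registry. The REG_DWORD value type is my assumption (check the linked articles before applying anything), and the helper name `reg_add_command` is made up for this example.

```python
# Key path taken from the two items above.
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters"

# Value names and data from the list above; REG_DWORD is an assumption.
SETTINGS = (
    ("RequireSecureNegotiate", 0),
    ("SessTimeout", 60),
)


def reg_add_command(value_name, dword):
    """Build (but do not run) a `reg add` command line for one DWORD value."""
    return f'reg add "{KEY}" /v {value_name} /t REG_DWORD /d {int(dword)} /f'


if __name__ == "__main__":
    for name, data in SETTINGS:
        print(reg_add_command(name, data))
```

The printed commands would need to be run from an elevated prompt on the Windows 8 client. As noted above, neither setting made any difference in my case.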