Saturday, 28 May 2011

Scanning, Storage & RBS

Problem: The client has millions of physical documents that need to be available via SharePoint. In addition, documentation still arrives in physical form and needs to be scanned and classified.

Initial Hypothesis: SP2010 can store documents in the SQL database in blob format; however, it is not really made for large blob storage performance-wise, and SQL storage is expensive (RAID, HA). Remote Blob Storage (RBS) helps with storing blobs but does not get around the limitations imposed by MS guidance. RBS can reduce SQL storage and improve performance if your data involves a lot of large blobs (over 256 KB is a good size). My rough sums show a huge data requirement. For example, 600,000 customers transact with the client, and on average each customer has 3 physical documents a year, so we are talking about 1.8 million scanned documents a year.

Documents need to be scanned at 300 dpi so they can be printed and stored adequately. With compression and conversion to TIFF/PDF files, we are assuming an average of 1 MB per file. So 1.8 million scanned documents a year at 1 MB per file gives a storage requirement of 1,800 GB per year.

SP2010 has a restriction of 200 GB per content database (the threshold that MS will support up to), so we would require 9 new site collections, each on a new content db, per year to meet this requirement.
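The rough sums above can be checked with a few lines of Python. This is only a sketch of the arithmetic in this post; the variable names are mine, and it assumes 1 GB = 1,000 MB, as the figures above do.

```python
import math

# Figures taken from the estimate above.
CUSTOMERS = 600_000
DOCS_PER_CUSTOMER_PER_YEAR = 3
AVG_FILE_MB = 1              # assumed average after compression to TIFF/PDF
CONTENT_DB_LIMIT_GB = 200    # SP2010 supported threshold per content database

docs_per_year = CUSTOMERS * DOCS_PER_CUSTOMER_PER_YEAR
storage_gb_per_year = docs_per_year * AVG_FILE_MB / 1000  # 1 GB = 1,000 MB
content_dbs_per_year = math.ceil(storage_gb_per_year / CONTENT_DB_LIMIT_GB)

print(f"Documents per year:   {docs_per_year:,}")
print(f"Storage per year:     {storage_gb_per_year:,.0f} GB")
print(f"Content dbs per year: {content_dbs_per_year}")
```

Changing the average file size or the customer count shows quickly how sensitive the number of content databases is to the scanning settings.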

Tip: Also worth considering are the thresholds and boundaries provided by the SharePoint team. The maximum site collection size is 100 GB; this has a caveat in that a single site collection using a single document library/site supports up to 1 TB in the content database. You can have subsites nested in a site collection, but 2,000 per view is recommended. Max of 300 content dbs per Web Application. Max of 5,000 site collections per content database.

Our storage cost is much higher because our disks are RAID, so at a minimum we would use 3 times this in actual physical disk space. On top of this, my indexes will be about 25% of the storage requirement. So price and performance are getting out of control pretty quickly.
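As a rough illustration of how the RAID and index overheads stack up, here is a small sketch. The function name is hypothetical, and it assumes the 25% index figure is added on top of the RAID total without itself being multiplied by the RAID factor; if the indexes also sit on RAID disks, the total would be higher.

```python
def physical_storage_gb(logical_gb, raid_factor=3, index_ratio=0.25):
    """Estimate physical disk space for a given logical storage requirement.

    Assumption: index space is added once and is not multiplied by the RAID factor.
    """
    raid_gb = logical_gb * raid_factor    # at least 3x the logical size on RAID
    index_gb = logical_gb * index_ratio   # indexes at about 25% of the requirement
    return {"raid_gb": raid_gb, "index_gb": index_gb, "total_gb": raid_gb + index_gb}


estimate = physical_storage_gb(1800)  # 1,800 GB of scanned documents per year
print(estimate)
```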

Resolution: Using RBS, my estimate is that moving these blobs out will reduce the content database by 90%. However, content database size is calculated including RBS, so while our storage requirement will be cheaper using resilient RBS storage, the content database sizing will not be reduced by using RBS.
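To make the RBS point concrete, the split can be sketched as below. The 90% figure is my estimate from above, not a measured value, and the variable names are only for illustration.

```python
import math

TOTAL_GB_PER_YEAR = 1800     # scanned documents per year, from the sums above
RBS_PERCENT = 90             # estimated share of data moved out to the blob store
CONTENT_DB_LIMIT_GB = 200    # SP2010 supported threshold per content database

blob_store_gb = TOTAL_GB_PER_YEAR * RBS_PERCENT / 100   # held on cheaper RBS storage
sql_gb = TOTAL_GB_PER_YEAR - blob_store_gb              # what stays in SQL Server

# Content database size is calculated including the RBS blobs,
# so the number of content databases is based on the full total.
counted_gb = sql_gb + blob_store_gb
content_dbs = math.ceil(counted_gb / CONTENT_DB_LIMIT_GB)

print(f"In SQL: {sql_gb:.0f} GB, in RBS: {blob_store_gb:.0f} GB")
print(f"Content dbs still required per year: {content_dbs}")
```

So RBS changes where the bytes live and what they cost, but not how many content databases the sizing guidance requires.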

Updated: 21/07/2011 - RBS sizing Calc

Scanning tips for SP:
  • TIFF or PDF are the common base storage file types;
  • 300 dpi is good print quality; most requirements can be lower;
  • Black and white is far smaller than greyscale scanning;
  • PDFs, if stored correctly, can be indexed by the search crawler.
More Info Scanning:
http://www.psigen.com/ - scanning and capture for SP2010.
Capturx from www.adapx.com/sharepoint is a pen that automates data capture on forms.
CoSign does digital signatures and looks to have pretty decent integration with http://www.arx.com/digital-signature/sharepoint
www.kodak.com/go/sharepoint
http://www.goscan.com/connectors-sharepoint.php
http://www.kofax.com/solutions/microsoft.asp

More Info Sizing:
HP Sizer for SP2010
Capacity management for SP2010 - Software boundaries
