Tuesday 28 January 2014

Search stops working and CA Search screens error

Problem: In my redundant Search farm, search falls over.  It was working and nothing appears to have changed, but it suddenly stops working.  The problem surfaces in several places:
1.> In CA, going to "Search Administration" displays the following error message: Search Application Topology - Unable to retrieve topology component health states. This may be because the admin component is not up and running.


2.> Query and crawl stopped working.
3.> Using PowerShell I can't get the status of the Search Service Application
PS> $srchSSA = Get-SPEnterpriseSearchServiceApplication
PS> Get-SPEnterpriseSearchStatus -SearchApplication $srchSSA
Error: Get-SPEnterpriseSearchStatus : Failed to connect to system manager. SystemManagerLocations: net.tcp://sp2013-srch2/CD8E71/AdminComponent2/Management
4.> In the "Search Administration" page within CA, if I click "Content Sources" I get the error message: Sorry, something went wrong
The search application 'ef5552-7c93-4555-89ed-cd8f1555a96b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information.

I used PowerShell to get the ULS logs for the correlation Id returned on the screen via the CA error message
PS> Merge-SPLogFile -Path "d:\error.log" -Correlation "ef109872-7c93-4e6c-89ed-cd8f14bda96b"
The output shows:
Logging Correlation Data       Medium Name=Request (GET:http://sp2013-app1:2013/_admin/search/listcontentsources.aspx?appid=b555d269%255577%2D430d%2D80aa%2D30d55556dc57)
Authentication Authorization   Medium Non-OAuth request. IsAuthenticated=True, UserIdentityName=, ClaimsCount=0
Logging Correlation Data       Medium Site=/ 05555e9c-5555-555a-69e2-3f65555be9f4
Topology  Medium WcfSendRequest: RemoteAddress: 'https://sp2013-srch2:32844/8c2468a555594301abf555ac41a555b0/SearchAdmin.svc' Channel: 'Microsoft.Office.Server.Search.Administration.ISearchApplicationAdminWebService' Action: 'http://tempuri.org/ISearchApplicationAdminWebService/GetVersion' MessageId:
General    Medium Application error when access /_admin/search/listcontentsources.aspx, Error=The search application 'ef155572-7555-4e6c-89ed-cd8f14bda96b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information.  Server stack trace: at System.ServiceModel.Channels.ServiceChannel.ThrowIfFaultUnderstood(Message reply, MessageFault fault, String action, MessageVersion version, FaultConverter faultConverter) at System.ServiceModel.Channels.ServiceChannel.HandleReply(ProxyOperationRuntime operation, ProxyRpc& rpc) at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Ob General  Medium ...(IMethodCallMessage methodCall, ProxyOperationRuntime operation)  at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)  Exception rethrown at [0]: at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.ErrorHandler(Object sender, EventArgs e) at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.OnError(EventArgs e) at System.Web.UI.Page.HandleError(Exception e) at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) at System.Web.UI.Page.ProcessRequest(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) 
General      Medium ...em.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) 05b5559c-4215-0555-69e2-3f656555e9f4 Runtime   tkau Unexpected System.ServiceModel.FaultException`1[[System.ServiceModel.ExceptionDetail, System.ServiceModel, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]: The search application 'ef109555-7c93-444c-85555-cd8f14b5556b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information.   Server stack trace:      at System.ServiceModel.Channels.ServiceChannel.ThrowIfFaultUnderstood(Message reply, MessageFault fault, String action, MessageVersion version, FaultConverter faultConverter)     at System.ServiceModel.Channels.ServiceChannel.HandleReply(ProxyOperationRuntime operation, ProxyRpc& rpc) 
Runtime     tkau Unexpected ...s, TimeSpan timeout) at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation) at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)  Exception rethrown at [0]: at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.ErrorHandler(Object sender, EventArgs e) at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.OnError(EventArgs e) at System.Web.UI.Page.HandleError(Exception e) at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
Runtime       tkau Unexpected ...ProcessRequest() at System.Web.UI.Page.ProcessRequest(HttpContext context) at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously
General    ajlz0 High Getting Error Message for Exception System.ServiceModel.FaultException`1[System.ServiceModel.ExceptionDetail]: The search application 'ef10-555-a96b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information. (Fault Detail is equal to An ExceptionDetail, likely created by IncludeExceptionDetailInFaults=true, whose value is: System.Runtime.InteropServices.COMException: The search application '7c93-555c-89ed-cd8f555da6b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information. at icrosoft.Office.Server.Search.Administration.SearchApi.RunOnServer[T]General        ajlz0 High ...   at Microsoft.Office.Server.Search.Administration.SearchApi..ctor(String applicationName)     at Microsoft.Office.Server.Search.Administration.SearchAdminWebServiceApplication.GetVersion()     at SyncInvokeGetVersion(Object , Object[] , Object[] )     at System.ServiceModel.Dispatcher.SyncMethodInvoker.Invoke(Object instance, Object[] inputs, Object[]& outputs)     at System.ServiceModel.Dispatcher.DispatchOperationRuntime.InvokeBegin(MessageRpc& rpc) at System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMess...) Micro Trace uls4 Medium Micro Trace Tags: 0 nasq,5 agb9s,52 e5mc,21 8nca,0 tkau,0 ajlz0,0 aat87 05b16e9c-4215-00ba-69e2-3f656eabe9f4
Monitoring   b4ly Medium Leaving Monitored Scope (Request (GET:http://sp2013-app1:2013/_admin/search/listcontentsources.aspx?appid=be28d269%255577%2D430d%2D555a%2D30d55556dc57)). Execution Time=99.45349199409
Topology    e5mb Medium WcfReceiveRequest: LocalAddress: 'https://sp2013-srch2.demo.local:32844/8c2468a555943555bf42cac555f05b0/SearchAdmin.svc' Channel: 'System.ServiceModel.Channels.ServiceChannel' Action: 'http://tempuri.org/ISearchApplicationAdminWebService/GetVersion' MessageId: 'urn:uuid:c2d55565-bb59-4555-b34b-6555ef1d79a5'


I can see my Admin component on SP2013-SRCH2 is not working, so I turned off the machine hoping search would fail over to the other admin component.  Nothing changed and it kept giving me the same log errors.  I turned the server back on and reviewed the event log on the admin component server (SP2013-SRCH2).  The following errors occurred:
Application Server Administration job failed for service instance Microsoft.Office.Server.Search.Administration.SearchServiceInstance + Reason: The device is not ready. 
The Execute method of job definition Microsoft.Office.Server.Search.Administration.IndexingScheduleJobDefinition + The search application + on server SP2013-SRCH2 did not finish loading.

Examining the Windows application event logs on SP2013-SRCH2, I noticed event log errors relating to permissions:
A database error occurred. Source: .Net SqlClient Data Provider Code: 229 occurred 0 time(s) Description:  Error ordinal: 1 Message: The EXECUTE permission was denied on the object 'proc_MSS_GetConfigurationProperty', database 'SP_Search'
Unable to read lease from database - SystemManager, System.Data.SqlClient.SqlException (0x80131904): The EXECUTE permission was denied on the object 'proc_MSS_GetLease', database 'SP_Search', schema 'dbo'.

The event log showed me the account trying to execute these stored procs.  It was my Search service account e.g. demo\sp_searchservice

Initial Hypothesis: It looks like the Admin search service is not starting or failing over.  By tracing the permissions I can see the demo\SP_SearchService account no longer has EXECUTE permission on the stored procedures in the SP_Search database.

By opening the SP_Search database and looking at effective permissions I can see the "SPSearchDBAdmin" role has "effective" permissions over the failing stored procs (e.g. proc_MSS_GetConfigurationProperty).

If I look at the account calling the stored procs (demo\SP_SearchService), I can see it is not assigned to the role.  This leads me to conjecture that a management tool or someone/something on the farm has caused the permissions to be changed.

Resolution: Change the permissions on the database.  To keep permissions minimal, go to the SP_Search database and add the service account (demo\sp_SearchService) to the SPSearchDBAdmin role.
All the issues recorded in the problem statement were resolved within 5 minutes on the affected farm.
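
The role change above can also be scripted.  This is only a minimal sketch: it assumes the service account already exists as a user in the SP_Search database and that the SqlServer PowerShell module (for Invoke-Sqlcmd) is installed.  The helper name New-AddRoleMemberSql and the SQL instance name 'SQL01' are placeholders I made up for illustration.

```powershell
# Builds the T-SQL that adds a database user to a database role.
# New-AddRoleMemberSql is a hypothetical helper name, not a SharePoint cmdlet.
function New-AddRoleMemberSql {
    param(
        [Parameter(Mandatory)][string]$RoleName,
        [Parameter(Mandatory)][string]$UserName
    )
    # Escape closing brackets so the names stay valid inside [ ] delimiters.
    $role = $RoleName.Replace(']', ']]')
    $user = $UserName.Replace(']', ']]')
    return "ALTER ROLE [$role] ADD MEMBER [$user];"
}

$sql = New-AddRoleMemberSql -RoleName 'SPSearchDBAdmin' -UserName 'demo\sp_searchservice'
$sql   # ALTER ROLE [SPSearchDBAdmin] ADD MEMBER [demo\sp_searchservice];

# To apply it (assumption: SqlServer module installed, 'SQL01' is a placeholder instance):
# Invoke-Sqlcmd -ServerInstance 'SQL01' -Database 'SP_Search' -Query $sql
```

Building the statement separately from running it lets you review the exact T-SQL before it touches the database.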

Note: I have a UAT environment in exact sync with my PR environment; UAT has the correct permissions in place already.

Sunday 26 January 2014

Fixing SSRS on SP2013

    Error: "For more information about this error navigate to the report server on the local server machine, or enable remote errors".  When opening a report in SharePoint I get the error shown above.
    Perform these 2 steps to view the error:
    1.> In IE, navigate to the "Reports Library", click "Site Settings", then click "Reporting Services Site Settings".  Select "Enable remote errors in local mode".
    2.> Open CA > "Manage service applications" > Select the "SSRS" service application > System Settings > "Enable Remote Errors".
     TBC
     

How to test SharePoint 2013 applications


Overview: There are several forms of testing and this post aims to provide basic testing information to help you make sure your SharePoint applications are fit for purpose.  Basically, functional testing including the UI is essential, but performance testing is another big win for large enterprise applications on SharePoint.

Testing using Selenium for SharePoint
UI Testing and Monitoring

Create a Test Plan - consists of:
  1. Test configurations: The combinations of processor type, operating system version, and browser version on which you want to test the application.
  2. Test environment: The number and characteristics of the computers on which to run the tests. Set this if you are testing a web-based or distributed application.
  3. Test settings: The types of data that you want to collect while running the tests.
http://msdn.microsoft.com/en-us/library/vstudio/dd286583.aspx
http://visualstudiogallery.msdn.microsoft.com/e79e4a0f-f670-47c2-9b8a-3b6f664bf4ae/

Below are the results from a single server VM running a simple load test.  This is VS2012 on Windows 2008R2 against SP2013 with SQL 2012.





Burp Suite is a good, easy-to-use penetration testing tool, and as always Fiddler has numerous uses for testing.

HP's WebInspect is also a good penetration/security testing tool:
http://www8.hp.com/uk/en/software-solutions/webinspect-dynamic-analysis-dast/

Tip: SAST & DAST are application security testing methodologies used to find vulnerabilities.

Monday 13 January 2014

CU Upgrading On Prem Office Web Apps 2013

Problem: I need to upgrade my WCA farm from the RTM version to the March 2013 CU to allow PDFs to be displayed with Office Web Apps.


Steps to upgrade an Existing WCA farm:
  1. Copy the exe to each machine: OWA1 & OWA2 (D:\OfficeWebApps\March 2013 CU)
  2. Remove the secondary servers from the farm.  In my case this is WCA2: remote into SP-WCA2 and run PS> Remove-OfficeWebAppsMachine (on WCA2)
  3. Run the exe on the secondaries (WCA2), reboot, shut down and snapshot, then do the same on the primary (WCA1)
  4. Check the primary: verify the url works: http://wcauat.demo.dev/hosting/discovery.  This can be done on the local machine or using https from a client machine.
  5. Create a new OWA farm; this will run over the top of your existing farm.   PS> New-OfficeWebAppsFarm ... (on WCA1; command missing)
  6. Join the secondary to the farm    PS> New-OfficeWebAppsMachine -MachineToJoin "sp-wca1" (Primary)
  7. Activate WordPDF    PS> New-SPWOPIBinding -ServerName "wcauat.demo.dev" -Application "WordPDF" -AllowHTTP
Perform step 7 on the SharePoint farm.

More Info:





You can verify the version of your WCA farm using (not sure this reports the correct version):
(Invoke-WebRequest https://wcauat.demo.dev/m/met/participant.svc/jsonAnonymous/BroadcastPing).Headers["X-OfficeVersion"]

An easier approach is to use Fiddler to monitor the http/https traffic to figure out Office Web Apps farm version:
Open IE
Open Fiddler
In IE, go to the url https://wca.demo.dev/m/met/participant.svc, replacing "wca.demo.dev" with your Office Web Apps service url.
In Fiddler, review the response header, you will see the response header X-OfficeVersion with a version number.

 

Saturday 11 January 2014

SharePoint 2013 Search Limits with an example

Overview:  This post aims to provide guidelines for building SharePoint 2013 Search farms.  There are 6 Search components (labelled C1-C6 below) and 4 database types (labelled DB1-DB4).  Index partitions are a big factor in search planning.

Example: Throughout this post I provide an example of a 60 million item search farm with redundancy/High Availability (HA).

Index partitions: The MS recommendation is to add 1 index partition per 10 million items, although this really depends on IOPS and how the query is used.  A twinned partition (partition column) is needed for HA; this will improve query time over a single partition.
Example: Assuming a max of 10 million items per partition, an HA farm for 60 million items requires 6 partitions.

Index component (C1): 2 index components for each partition.
Example: 12 index components.

Query component (C2): Use 2 query processing components for HA/redundancy; add an additional 2 query processing components once you go above 80 million items.
Example: 2 Query components.

Crawl database (DB1): Use 1 crawl database per 20 million items.  This is probably the most commonly overlooked item in search farms.  The crawl database contains tracking and historical information about the crawled items, such as the last crawl id, crawl time and crawl history.  The crawl component feeds into the crawl database.  Medium usage should be under 100GB.  Add another crawl database before reaching 20 million items or 100GB database size.  My initial size is mdf 100 MB (growth 50MB) and the ldf is 300 MB (growth 50MB).
Example: 3 crawl databases at 20 million items each allows for a search farm containing 60 million items.

Link database (DB2): Use 1 link database per 60 million items.  I believe 1 link database will handle up to 100 million items.  Mdf 100 MB (growth 50 MB) and Ldf 25 MB (growth 25 MB).
Example: 1 link database.

Analytics reporting database (DB3): Add 1 search analytics reporting database for each 500,000 unique items viewed each day, or for every 10-20 million total items.  This is the heavy search database.  Add a new database to keep each Analytics reporting database under roughly 250GB.  Mdf 100 MB (growth 50 MB) and Ldf 25 MB (growth 25 MB).
Example: Start with 1 and grow as needed.

Analytics Processing Component (C3):
Example: 2-4 Analytics Processing components.

Content Processing Component (C4): processes crawled items and moves the item data to the index component.  Its function is to parse documents, perform property mapping and entity extraction, perform language processing, and ultimately turn crawled items into indexed items.
Example: 4 Content Processing components.

Admin component (C5): Use 1 administration component, or 2 for redundancy/HA.  This applies to all farm sizes.
Example: 2 Admin components.

Admin database (DB4): Low usage, even in big farms, you only need 1 database.  Should stay well under 100GB.  Holds crawled and managed properties, query rules, topology and history.  My initial size is mdf 100 MB (growth 10MB) and the ldf is 100 MB (growth 50MB).
Example: 1 Admin database. 

Crawl Component (C6):  The crawl component crawls content sources and delivers crawled items including metadata to the Content Processing component.  In SP2013 you don't specify the relationship between the crawl database and the crawl component.  The crawl component will distribute to all available crawl databases.  The 3 types of crawls available in SP2013 are: Full, Incremental and Continuous (only works for SP2013 content).  Schema changes still require a full crawl to pickup the change in SP2013.  Crawl does not do as much analysis as was the case in SP2010 so it is a much lighter/faster process.
Example: 2 Crawl components allow for HA and improved performance.
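
The per-item guidelines above are simple arithmetic, so here is a rough sketch that turns an item count into partition and database counts.  It only covers the rules that are stated per number of items in this post (10 million per partition, 20 million per crawl database, 60 million per link database); the function name Get-SearchFarmSizing is hypothetical and the output is a starting point, not a definitive design.

```powershell
# Rough sizing from the per-item guidelines in this post; a starting point only.
# Get-SearchFarmSizing is a hypothetical helper name.
function Get-SearchFarmSizing {
    param(
        [Parameter(Mandatory)][long]$ItemCount,
        [int]$ReplicasPerPartition = 2   # 2 index components per partition for HA
    )
    $partitions = [int][math]::Ceiling([double]$ItemCount / 10000000)  # 1 partition per 10 million items
    $crawlDbs   = [int][math]::Ceiling([double]$ItemCount / 20000000)  # 1 crawl database per 20 million items
    $linkDbs    = [int][math]::Ceiling([double]$ItemCount / 60000000)  # 1 link database per 60 million items
    [pscustomobject]@{
        IndexPartitions = $partitions
        IndexComponents = $partitions * $ReplicasPerPartition
        CrawlDatabases  = $crawlDbs
        LinkDatabases   = $linkDbs
    }
}

# The 60 million item example: 6 partitions, 12 index components, 3 crawl databases, 1 link database
Get-SearchFarmSizing -ItemCount 60000000
```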

Database Hardware: for the example use 8CPUs, 16GB of Ram, disk size depends on content but it is smaller than SP2010.

Placing components on VMs for the example
Group your search roles onto servers:
  • Index & Query Processing
  • Analytics & Content Processing
  • Crawl, Content processing & Search Admin 

Note: I have included suggested mdf and ldf sizing and growth assuming the full recovery model, as I use AOAG (if you are using the default simple recovery model, only worry about the mdf sizes).  These are based on my farm's usage so yours will need to vary, but it is a good guide for a starting point until you can monitor your own database growth patterns.  Change the ldf and mdf settings, as the default database settings are completely inappropriate.  Growth must be in fixed MB (never percentages) and you want as little ldf growth on the fly as possible.  A good guideline is 100 MB mdf initial size with 50 MB growth; ldfs are 25-50% of the size of the mdf, and I would use 50 MB as the minimum ldf initial size, then set the ldf growth to 50 MB on all 4 search databases.  Search ldfs are pretty hectic, so in this post you will notice much higher ldf settings than I mention in this note.  Also check out the SQL Checklist for SP2013 post.  Backup frequency affects log/ldf file usage, so check out this post to understand how your system database needs to be set (in short, more frequent backups allow smaller ldfs).

Database                           mdf     mdf growth  ldf     ldf growth
SP_Search_Admin                    100 MB  10 MB       100 MB  50 MB
SP_Search_CrawlStore               100 MB  50 MB       300 MB  100 MB
SP_Search_AnalyticsReportingStore  100 MB  50 MB       25 MB   25 MB
SP_Search_LinksStore               100 MB  50 MB       25 MB   25 MB
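
To apply sizes like the ones in the table, the T-SQL can be generated from PowerShell.  This is a sketch under assumptions: the helper name New-FileSizingSql is hypothetical, and the logical file names used in the example calls are placeholders (look up the real ones in sys.database_files for each database).  Note that SQL Server only accepts a SIZE that is larger than the file's current size.

```powershell
# Builds an ALTER DATABASE statement that sets a file's size and fixed-MB growth.
# New-FileSizingSql is a hypothetical helper; logical file names vary per farm,
# so the ones used below are placeholders.
function New-FileSizingSql {
    param(
        [Parameter(Mandatory)][string]$Database,
        [Parameter(Mandatory)][string]$LogicalFileName,
        [Parameter(Mandatory)][int]$SizeMB,
        [Parameter(Mandatory)][int]$GrowthMB
    )
    "ALTER DATABASE [$Database] MODIFY FILE (NAME = N'$LogicalFileName', SIZE = ${SizeMB}MB, FILEGROWTH = ${GrowthMB}MB);"
}

# Crawl store row from the table: mdf 100 MB (growth 50 MB), ldf 300 MB (growth 100 MB)
New-FileSizingSql -Database 'SP_Search_CrawlStore' -LogicalFileName 'SP_Search_CrawlStore' -SizeMB 100 -GrowthMB 50
New-FileSizingSql -Database 'SP_Search_CrawlStore' -LogicalFileName 'SP_Search_CrawlStore_log' -SizeMB 300 -GrowthMB 100
```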

More Info:

Troubleshooting Crawl

Thursday 9 January 2014

SP 2013 SSRS failing after RBS enabled and disabled

Problem: I have SSRS (SharePoint mode) enabled on my SharePoint 2013 farm using SQL 2012, which was working.  I enabled AvePoint's RBS provider on the farm and enabled RBS on 1 out of 2 web applications.  I then disabled RBS on the web application and assumed all was good, but SSRS stopped working and threw 1 of 2 errors on the web application where RBS had been enabled and then disabled:
"For more information about this error navigate to the report server on the local server machine, or enable remote errors" or ....

Note: Rdl files that were in the system before RBS was enabled work; rdl files added while RBS was enabled and rdl files added after disabling RBS both fail.

Initial Hypothesis/Error tracking:
1.> SSRS and WCA errors unfortunately don't get correlationId's, so I turned off all the SSRS SSA instances except 1 so I know which server to find the error on in ULS log. 
2.> I ran the request for the report (rdl, this report just displays a label so I know it is good) again so the error is captured in the ULS log.
3.> I painfully open the latest log using ULS viewer and scan for errors and I find: 
System.Data.SqlClient.SqlException (0x80131904): The EXECUTE permission was denied on the object 'rbs_fn_get_blob_reference', database 'SP_Content_PaulXX', schema 'mssqlrbs'.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)

4.> Now I am going nuts as RBS has been disabled and I decide to trace the request in SQL Profiler, I can't find the call in SQL Server profiler and while looking for it I realise the function is not in the database.  I also find a post suggesting changing permissions but as I don't have the function, permissions isn't my issue. 
5.> I start looking at RBS on the farm using PowerShell:
PS>
cls
# List every content database and whether it still has RBS flagged as enabled
$cdbs = Get-SPContentDatabase
foreach ($cdb in $cdbs)
{
    $rbs = $cdb.RemoteBlobStorageSettings
    Write-Host "Content DB:" $cdb.Name
    Write-Host "Enabled:" $rbs.Enabled
}
I notice that the content database  'SP_Content_PaulXX' mentioned in the ULS log has the RemoteBlobStorageEnabled flag set to true.

Resolution:

*************************************************
Problem: I have SP2013 + SQL 2012, and I am using SSRS in SP mode.  My app pool accounts for my web app and my SSRS SSA are different and on separate servers.  So for this to occur, you need SP2013, SSRS and RBS enabled (or a content database that still thinks RBS is enabled); additionally, the service account used by the SSRS SSA needs to have minimal permissions.  Existing rdl files display correctly, however any rdl files added throw an exception.  The diagram below further explains the scenario:
Initial Hypothesis: Trawling through the ULS logs shows the error: System.Data.SqlClient.SqlException (0x80131904): The EXECUTE permission was denied on the object 'rbs_fn_get_blob_reference', database...

A snippet of the ULS is shown below:

 
Resolution:  The app pool account used by the SSRS Service Application needs to have permissions to run the function. 
1.> Figure out the app pool account used by the SSRS SSA if you don't know it as shown below:

2.> Give the SP_Services account EXECUTE permission on the failing calls.  To prove it, give the account dbo rights.

 
Thanks to Sam Keytel and Mark Oburoh for looking at this with me.
 

 
 


Tuesday 7 January 2014

Office Web App Common Problems & Fixes

There is now an article on WCA on TechNet that includes troubleshooting.  Updated 30/04/2014.

Finding your issues: Office Web Apps (WCA) displays an error without a correlation Id.  If you have a small WCA farm, you can trawl the WCA ULS logs using ULS Viewer to find your issue.  However, on a busy or large farm finding the ULS errors is tedious.  You can use the IE developer tools, Fiddler or any other tool, provided you can find the correlation Id sent in the response from the WCA server.

The screenshot below shows how to use fiddler to find a correlationId returned from the WCA server.  In the browser view the http response and find the "X-CorrelationId" property.
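
Once you have the X-CorrelationId value, you can search the WCA ULS log files for it instead of scanning them by eye.  The sketch below is just one way to do that: Find-UlsCorrelation is a hypothetical helper name, and the log folder is passed in as a parameter because its location depends on how your WCA farm was set up.

```powershell
# Returns every ULS log line that mentions the given correlation Id.
# Find-UlsCorrelation is a hypothetical helper; the log folder is a parameter
# because the location depends on how the WCA farm was configured.
function Find-UlsCorrelation {
    param(
        [Parameter(Mandatory)][string]$LogFolder,
        [Parameter(Mandatory)][string]$CorrelationId
    )
    Get-ChildItem -Path $LogFolder -Filter '*.log' -File |
        Select-String -Pattern $CorrelationId -SimpleMatch |
        ForEach-Object { $_.Line }
}

# Example (folder path and Id are placeholders):
# Find-UlsCorrelation -LogFolder 'D:\Logs\ULS' -CorrelationId '<X-CorrelationId value>'
```

Run it on each WCA server in the farm, as the request could have been served by any of them.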

Tip: It is also worth verifying that the WCA farm is reachable and running using https://wcaservername/hosting/discovery

==========================================

Problem: When opening a word or pdf document I receive the following error popup "Sorry, there was a problem and we can't open this PDF. If this happens again, try opening the PDF in Microsoft Word.".
Initial Hypothesis:  WCA was working and pdf's were opening.  The error is similar to the error received when the networking is not correct (WCA machines can't access the SharePoint WFE's).  Opening the ULS logs on the WCA machines, I can see the following error message "WOPI CheckFile: Catch-All Failure [exception:Microsoft.Office.Web.Common.EnvironmentAdapters.UnexpectedErrorException: HttpRequest failed ---> Microsoft.Office.Web.Apps.Common.HttpRequestAsyncException: No Response in WebException ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 10.189.xx.15:443"   

Resolution:  My load balancer was not passing through the https traffic from the WCA servers to the WFEs.  I fixed the load balancer so https traffic is forwarded correctly.  The WCA servers need to speak to the WFE/SharePoint servers on either http or https depending on how the WCA farm is configured (the 3 options are SSL termination, with SSL, or without SSL).

===========================================

Problem:  Can't open any document in WCA and the WCA ULS is generating the following issue:
WOPI CheckFile: Catch-All Failure [exception:Microsoft.Office.Web.Common.EnvironmentAdapters.UnexpectedErrorException: HttpRequest failed ---> Microsoft.Office.Web.Apps.Common.HttpRequestAsyncException: No Response in WebException ---> System.Net.WebException: The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel. ---> System.Security.Authentication.AuthenticationException: The remote certificate is invalid according to the validation procedure.   

Initial Hypothesis:  On each specific WCA server, I try to open the SharePoint web application i.e. https://www.demo.dev, and the browser reports "Certificate error".  Opening up the certificate, the certificate chain looks correct.

Resolution:
1.> Run > MMC > File > Add snap-ins > Certificates > Add > Computer Account > Local Computer > Finish. OK.
2.> In the MMC console navigate to Certificates > Trusted Root Certification Authorities > Certificates > (Right click) All Tasks > Import (both the trusted root and the intermediary certificates are required).
After adding the missing certificates, open the browser and check the certificate.  Thanks to David C for figuring out and documenting that the issue is the certificate chain being used on the WCA servers.  Office Web Apps is working again.

===========================================

Problem: When Opening a word document I get the following error when Office Web Apps tries to render the document "There's a configuration problem preventing us from getting your document. If possible, try opening this document in Microsoft Word."

Initial Hypothesis: The error message doesn't tell me much, so I checked the ULS logs on the Web Front End (SharePoint Server).  I found the following issue in my ULS:
WOPI (CheckFile) - Invalid Proof Signature for file SandPit Environment Setup.docx url: http://web-sp2013-uat.demo.dev/Docs/_vti_bin/wopi.ashx/files/6d0f38c0d5554c87a655558da9cedcad?access_token...
Resolution: Run the following PS> Update-SPWOPIProofKey -ServerName "wca-uat.demo.dev"

More Info:
http://technet.microsoft.com/en-us/library/jj219460.aspx

===========================================

Problem:  When opening a docx file using WCA (Office Web Apps) I get the following error message: "Sorry, there was a problem and we can't open this document. If this happens again, try opening the document in Microsoft Word."
I then tried to open an excel document and got the error: "We couldn't find the file you wanted.
It's possible the file was renamed, moved or deleted."

Initial Hypothesis: Checked the ULS logs on the only OWA server and found this unexpected error:
 HttpRequestAsync, (WOPICheckFile,WACSERVER) no response [WebExceptionStatus:ConnectFailure, url:http://webuat.demo.dev/_vti_bin/wopi.ashx/...
This appears to be a networking related issue.  I have an NLB (KEMP) and I am using a wildcard certificate on the WCA address with SSL termination.

Resolution:  The error message tells me that the OWA server can't get back to the SharePoint WFE servers.  The request from the WCA/OWA1 server back to the SP front end server is made using http, not https, and my NLB can't deal with traffic on port 80.
I added a host entry on my OWA1 server so that traffic to the SharePoint WFE goes directly to a server by IP, and it works.  This means I don't have high availability.  An NLB service handling the web application on port 80 will fix my issue.

The OWA servers need to access the web app on port 80.  My NLB stopped all traffic on port 80.
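
The host entry workaround can be scripted so it is repeatable across WCA servers.  This is a minimal sketch: Add-HostsEntry is a hypothetical helper name, the IP address in the example is a placeholder for your WFE's address, and it needs to be run from an elevated prompt.

```powershell
# Appends a hosts-file entry unless the host name is already mapped.
# Add-HostsEntry is a hypothetical helper; run elevated on the WCA server.
function Add-HostsEntry {
    param(
        [Parameter(Mandatory)][string]$IpAddress,
        [Parameter(Mandatory)][string]$HostName,
        [string]$HostsPath = "$env:SystemRoot\System32\drivers\etc\hosts"
    )
    # Match a non-comment line that already lists this host name.
    $pattern = '^\s*[^#\s]+\s+(.*\s)?' + [regex]::Escape($HostName) + '(\s|$)'
    if (Test-Path $HostsPath) {
        if (Select-String -Path $HostsPath -Pattern $pattern -Quiet) { return $false }
    }
    Add-Content -Path $HostsPath -Value "$IpAddress`t$HostName"
    return $true
}

# Example (the IP is a placeholder for the WFE address):
# Add-HostsEntry -IpAddress '10.0.0.11' -HostName 'webuat.demo.dev'
```

The function returns $true when it adds a line and $false when the name was already present, so running it twice does not duplicate the entry.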

==========================================

Problem:  The above issue was temporarily fixed by adding a host entry on the WCA1 server so that using the url of the web application on port 80 would direct the user back to WFE1.  I turned on WCA2 and WFE2 so I now have 2 SharePoint web front ends & a 2 server WCA farm.  In my testing I have docx, doc and excel files.  From multiple locations I could open and edit the docx and doc files but opening the excel file gave me this issue: "Couldn't Open the Workbook
We're sorry. We couldn't open your workbook. You can try to open this file again, sometimes that helps."
 

Initial Hypothesis: Documents are cached and the internal balancing seemed to make Word documents available using Office Web Apps.  I assume the requests are coming out of the cache or from the OWA server that has the host entry.  I need to tell OWA where to go via a host entry or NLB entry.  Note: using a host entry won't make OWA highly available/redundant.  This is the same issue as mentioned in the problem above.

Resolution: I added a host entry to the WCA2 server, it points to the WFE1 machine.

==========================================

Problem: Opening docx or pptx files in Office Web Apps 2013 results in the error "Sorry, Word Web App ran into a problem opening this document. To view this document please open it in Microsoft Word."

 Resolution:  I don't like it but I had to remove the link to the WCA farm, rebuild the WCA farm and hook SP2013 back to the WCA farm.  [sic].  Other documents were opening and I realised that my bindings were incorrect after I rebuilt.

============================
Problem: Can't open word, pdf, pptx or excel documents using Office Web Apps.  ULS on the OWA servers included these log messages: WOPI (CheckFile) - Invalid Proof Signature for file.  WOPI Proof: All WOPI Signature verification attempts failed.  WOPI Signature verification attempt failed with public key. 
Also found in the logs: "Error message from host: Verifying signature failed, host correlation" and
"HttpRequestAsync (WOPICheckFile,WACSERVER), request failure [HttpResponseCode:NotFound, HttpResponseCodeDescription:Not Found, url:https://www.demo.dev/_vti_bin/wopi.ashx/files/8b07d55558955551beb5555bed545553?access_token=REDACTED_1014&access_token_ttl=1392256555993]"

Same issue ULS excerpt:  
Error message from host: Verifying signature failed, host correlation:
WOPI CheckFile: Catch-All Failure [exception:Microsoft.Office.Web.Common.EnvironmentAdapters.FileUnknownException: WOPI 404   
 at Microsoft.Office.Web.Apps.Common.WopiDocument.LogAndThrowWireException(HttpRequestAsyncResult result, HttpRequestAsyncException delayedException) 
 
FileUnknownException while loading the app.


Hypothesis: None, I can't understand why the SP and WCA farms are struggling to communicate.  I believe the cause is to do with the load balancing/network changing [sic - maybe].

Resolution: Remove the link between the SP farm and WCA
PS> Remove-SPWOPIBinding -All:$true
Connect SP to the WOPI farm
PS> $internalName = "wca.demo.dev"
PS> $internalZone = "internal-https"
PS> New-SPWOPIBinding -ServerName $internalName -AllowHTTP
PS> Set-SPWopiZone -zone $internalZone
 

==============================
 Problem: Intermittently, requests are not returning the pdf/word documents.  Most requests are working and occasionally 1 request doesn't work.  Every 4th request tries to get the pdf to display on Office Web Apps for a few minutes without any error message and then stops trying and displays the message "Sorry, Word Web App can't open this ... document because the service is busy."

Initial Hypothesis:  Originally I thought it was only happening to pdfs but it is happening to word and pdf documents (I don't have excel docs in my system).  My monitoring software, SolarWinds, is badly configured on my OWA servers: the monitor is showing green, but drilling into the server monitoring, the 2 application monitors are failing.  The server should go amber if either of the 2 applications fails and in turn red after 5 minutes.  At this point I notice that I can't log onto one of my 4 OWA/WCA servers.  Web requests are not being returned.  I look at my KEMP load balancer and it says all 4 WCA servers are working; I notice the health check is configured on ping rather than on web requests (not right), so the NLB/KEMP is merely sending every 4th request to the broken server.

Resolution: 
  1. Reboot the broken server.  Once it comes up I can make http requests directly to the server (http://wca.demo.dev/hosting/discovery) and it's all working again.
  2. SolarWinds monitoring is lousy - need to get it fixed
  3. The KEMP hardware load balancer needs to be changed from checking that the machine is on to checking each machine using a web request.
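
A web-request health check like the one the load balancer should be doing can be sketched in PowerShell and pointed at each WCA server in turn.  This is only an illustration: Test-WcaDiscovery is a hypothetical helper name and the server urls in the example are placeholders.

```powershell
# Returns $true only if the discovery endpoint answers with HTTP 200.
# Test-WcaDiscovery is a hypothetical helper name.
function Test-WcaDiscovery {
    param(
        [Parameter(Mandatory)][string]$BaseUrl,
        [int]$TimeoutSec = 10
    )
    try {
        $response = Invoke-WebRequest -Uri "$BaseUrl/hosting/discovery" -UseBasicParsing -TimeoutSec $TimeoutSec -ErrorAction Stop
        return ($response.StatusCode -eq 200)
    }
    catch {
        # Connection refused, timeout, TLS failure or an error status all count as unhealthy.
        return $false
    }
}

# Check each WCA server directly rather than through the load balancer
# (server names below are placeholders):
# 'http://sp-wca1', 'http://sp-wca2' | ForEach-Object { "{0}: {1}" -f $_, (Test-WcaDiscovery -BaseUrl $_) }
```

Unlike a ping, this fails when the server is powered on but the web application is not answering, which is the situation described above.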