Showing posts with label error. Show all posts

Wednesday 6 September 2023

App Insights for Power Platform - Part 11 - Custom Connector

Overview:  Power Automate can set retry policies on custom connectors, but a Canvas app using a Custom Connector does not have any retry configuration.  FYI, if the Custom Connector gets a 5xx error, the call is tried 4 times.  Proven using 500 and 502 errors.  408 (request timeout) and 429 (too many requests) errors also appear to be tried 4 times (the retry is driven by the canvas app; using the Custom Connector test rig, the call is only tried once).

My example:  My Canvas app uses a custom connector that calls my Azure Function, which in turn calls my APIM, and APIM calls the 3rd party.


My Azure Function returns a 500, 502, 408 (response timeout) or 429 if the response has not been received within 10 seconds, and I push the HTTP code back to the Custom Connector.  I can see the response from the 3rd party is taking around 35 seconds.  In the network traces I can see 4 invoked calls that all fail with 408 HTTP response codes.

Result:  The custom connector is called 4 times, so my Power App is locked for 40+ seconds.

Possible remediation

1. Return a 418 HTTP response code (I'm a teapot); my app then calls the function through the custom connector only once.  So 418 errors (also tested with 400, but 400 is not the right response) behave differently from the 5xx, 408 and 429 errors from the API.

Note: Originally I was caught out because I was triggering the call from the custom connector test rig, which only tries once.  When called from Power Apps, the call is tried four times.  Returning 408 is not a fix.  Returning 418 ensures I only try once, gives a better user experience, and I now have 418 logs that I add error details to.

2. 3rd party API should not take 35 seconds, improve it.  

3. I could set my timeout on the specific function to 40 seconds, however if the call starts going to 41 seconds, my canvas app will be locked for over 160 seconds.  

4. Go through all the APIs and, if they are fast, set the timeout as short as possible so the app does not get locked while waiting for the 4 responses.
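The arithmetic behind options 1, 3 and 4 can be sketched in a few lines of Python.  This is only a model of the behaviour observed in this post, not documented platform behaviour: the retried status codes and the 4 attempts come from the tests described above, and the names (RETRIED_STATUS_CODES, attempts_for, worst_case_lock_seconds) are made up for the illustration.

```python
# Model of the behaviour observed in this post (an assumption, not a documented contract):
# a canvas app was seen to call the custom connector 4 times when the API answers
# with 500, 502, 408 or 429, and only once for 418 or 400.
RETRIED_STATUS_CODES = {500, 502, 408, 429}
OBSERVED_ATTEMPTS = 4


def attempts_for(status_code: int) -> int:
    """Number of calls the canvas app was seen to make for a failing status code."""
    return OBSERVED_ATTEMPTS if status_code in RETRIED_STATUS_CODES else 1


def worst_case_lock_seconds(status_code: int, timeout_seconds: float) -> float:
    """Time the app is locked if every attempt runs into the function timeout."""
    return attempts_for(status_code) * timeout_seconds


if __name__ == "__main__":
    print(worst_case_lock_seconds(408, 10))  # 4 attempts x 10 s timeout
    print(worst_case_lock_seconds(408, 40))  # 4 attempts x 40 s timeout
    print(worst_case_lock_seconds(418, 10))  # 1 attempt x 10 s timeout
```

With these assumptions a 10 second timeout gives the 40 second lock described above, a 40 second timeout gives 160 seconds, and returning 418 keeps the wait to a single timeout.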

Summary: Examine the 3rd party APIs and get them stable, performant and within the agreed SLAs.  If you only want to try once, ensure that timeouts on the 3rd party are set to the SLA.  If you intercept the request, you can choose the timeout yourself: by examining the APIs you can see the optimum time to avoid timeouts, and by using the 418 response code the call only happens once.

Series

App Insights for Power Platform - Part 1 - Series Overview 

App Insights for Power Platform - Part 2 - App Insights and Azure Log Analytics 

App Insights for Power Platform - Part 3 - Canvas App Logging (Instrumentation key)

App Insights for Power Platform - Part 4 - Model App Logging

App Insights for Power Platform - Part 5 - Logging for APIM 

App Insights for Power Platform - Part 6 - Power Automate Logging

App Insights for Power Platform - Part 7 - Monitoring Azure Dashboards 

App Insights for Power Platform - Part 8 - Verify logging is going to the correct Log analytics

App Insights for Power Platform - Part 9 - Power Automate Licencing

App Insights for Power Platform - Part 10 - Custom Connector enable logging

App Insights for Power Platform - Part 11 - Custom Connector Behaviour from Canvas Apps Concern (this post)

Saturday 22 February 2020

Catch Error in Power Apps and App Insight Logging

Error Handling:
App Insights logging: https://sharepains.com/2019/01/24/powerapps-experimenting-with-error-handling/  Replaced as Microsoft have built in telemetry as of 3 Feb 2020.
https://powerapps.microsoft.com/en-us/blog/log-telemetry-for-your-apps-using-azure-application-insights/

Example Error capturing and tracing to Azure AppInsights:
IfError(
    // Perform API Call
    ,
    // Fallback so log here!
    Trace("Pauls Unique PowerApp", TraceSeverity.Error, {UserName:User().Email, Role:gblRole, ErrorMsg:ErrorInfo.Message, ErrorControl:ErrorInfo.Control, ErrorProperty:ErrorInfo.Property});
    Notify("Err message ..." & ErrorInfo.Message) // Display the error on the UI
)
More detail..

Possible Canvas Apps Error Handling Pattern:
  1. Ensure AppInsights key is added to each canvas app
  2. Use IfError() to check calls and logic
  3. Use the Trace method to write info to App Insights
  4. Do I want to enable the Experimental error handling features (great to trace by correlationId)
  5. Consider all Power Automate flows that use Power Apps (ensure you use the V2 Connector)
  6. Never use IfError to handle business logic
To Review your App Insights Logging:
Open your Azure Portal > Open your App Insights blade >
Click the "Search" navigation option > Free text entry e.g. "Loyalty PowerApp"
App Insights, finding Traces generated in Power Apps

Monitoring Tool within Power Apps

The Monitor tool in Power Apps is great for debugging and tracing.
Start a monitor on the open Power App.

Monitor Tool - Showing a GET via a custom Connector and the returned response

Function/Code Logging:
Server-side code should log to App Insights or your logging framework.
Ideally, this is used in conjunction with the Trace in Power Apps explained above, particularly around 3rd party API calls.

Overview: C# code needs to have logging. If an error occurs, an appropriate response must be bubbled up for the next layer.

Possible C# Error Handling Pattern:

  1. All catch blocks write the exception to Log Analytics or App Insights 
  2. Calls to data sources, Azure services, third party APIs and complex logic should ideally be wrapped in a try/catch that logs the error to App Insights using the C# App Insights SDK 
  3. The catch blocks ideally return the failure information so the calling code can deal with it using the output.  If you don't return a message the caller can deal with, simply log the exception and rethrow the error (this needs to be a conscious decision on each catch) 
  4. Catch specific errors first and respond accordingly: log, and if you don't pass info to the caller, rethrow the error if applicable (bubble).  Lastly use a catch-all.  This is heavy, so only add it to existing code where errors happen often or where we are having problems, i.e. be specific
  5. Don't use try/catch to deal with business logic

Thought: Bubble up means the code must log exceptions and return an appropriate reply to the caller.  If you don't send an appropriate reply, rethrow the exception after logging it so the caller has to deal with it.
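The pattern above is language independent, so here is a minimal sketch of it in Python rather than C#.  Everything in it is hypothetical (ThirdPartyTimeout, call_third_party, get_data and the shape of the reply), and the standard logging module stands in for the App Insights SDK.

```python
import logging

logger = logging.getLogger("api")


class ThirdPartyTimeout(Exception):
    """Hypothetical specific error raised when the 3rd party API is too slow."""


def call_third_party(fail_with=None):
    """Stand-in for a real API call; raises whatever exception it is given."""
    if fail_with is not None:
        raise fail_with
    return {"ok": True, "body": "data"}


def get_data(fail_with=None):
    try:
        return call_third_party(fail_with)
    except ThirdPartyTimeout as exc:
        # Specific error: log it and hand the caller a reply it can act on.
        logger.error("3rd party timed out: %s", exc)
        return {"ok": False, "error": "3rd party timed out"}
    except Exception:
        # Catch-all: there is no useful reply to give, so log and rethrow (bubble up).
        logger.exception("Unexpected failure calling 3rd party")
        raise
```

The timeout is a known failure, so the caller gets a reply it can use; anything unexpected is logged and rethrown so the next layer has to deal with it.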


Wednesday 11 April 2018

HTTP 400 bad request response

Problem:  I have an old SharePoint 2013 custom application that is only partially loading.  The application has not encountered this error in the several years that it has been running.  This is only happening for 1 user out of thousands and occurs on Chrome and IE.  In the IE developer toolbar I can see that some requests are showing 200 responses and some are showing 400 responses from the web servers.  The SP WFEs are load balanced, and all WFEs are showing the 400.

Initial Hypothesis:  Only 1 user has the issue.  Some URL requests work, and others are malformed (return 400 errors) on the same WFE.   The user still fails on a different machine and in a different browser, so the user is sending a malformed request.  It appears to be a problem with the specific user on a particular site collection and is likely to be the HTTP request header.
Using the browser settings, Fiddler or the dev toolbars, get the error details, i.e.
HTTP 400 – Bad Request - The size of the request headers is too long.
Alternatively, use the IE browser and turn on friendly HTTP error messages to identify whether the issue is that the HTTP request header is too long.

Possible Resolution: Look at the request header; it may be too long for the WFE to handle.  As making the header smaller is generally not an option, look to increase the request size that IIS allows for HTTP requests (HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\HTTP\Parameters).  As this is a production issue and I can't replicate it in a lower environment, I need to use a host entry so that my offending user only accesses a single WFE where the fix is applied.  By using the NLB and updating IIS, I can ensure the fix works without disrupting my user base.
See https://www.grouppolicy.biz/2013/06/how-to-configure-iis-to-support-large-ad-token-with-group-policy/



Tuesday 3 June 2014

Anonymous REST JS call error using SP2013

Problem: I am making an anonymous JavaScript REST call to an image library.  I get the error "Request failed.  The method GetItems of the type List with id ... is blocked by the administrator on the server.nundefined".
 
Resolution:  Change the ClientCallableSettings on the web application as shown below.
 
 

$wa = Get-SPWebApplication http://www.demo.dev
$wa.ClientCallableSettings.AnonymousRestrictedTypes.Remove([microsoft.sharepoint.splist],"GetItems")
$wa.Update()


More Info:
http://www.sharepoint-zone.com/search/label/ClientCallableSettings.AnonymousRestrictedTypes
 

Friday 7 March 2014

EventLog Error Fix

Overview: After building my farms I trawl through the ULS and event logs to look for log messages that identify any issues.  This post contains errors from my event logs that will hopefully help me in future.

Problem: My event log shows a Windows/IIS error whereby the IIS sites application pool uses a service account that does not have a user profile on the machine.  The error message reads "Event Id: 1511 Windows cannot find the local profile and is logging you on with a temporary profile. Changes you make to this profile will be lost when you log off."

Verify the issue:



Resolution: (IEDaddy's post gave me the resolution)
1.> Stop the processes that use the account (I stopped the web sites that used the application pool account "demo\OD_Srv")
2.> cmd prompt> net localgroup administrators demo\OD_Srv /add
3.> cmd prompt> runas /u:demo\OD_Srv /profile cmd
4.> in the new cmd prompt run > echo %userprofile%
5.> Check the user profiles and verify the profile store for the account (demo\OD_Srv) has a status of "Local"
6.> Remove the account from the local administrators group ie cmd> net localgroup administrators demo\OD_Srv /delete

More Info:
http://www.brainlitter.com/2010/06/08/how-to-resolve-event-id-1511windows-cannot-find-the-local-profile-on-windows-server-2008/
http://todd-carter.com/post/2010/05/03/give-your-application-pool-accounts-a-profile/

*************************

Problem: EVENT ID: 8321 - Task Category: Topology
A certificate validation operation took 120053.1569 milliseconds and has exceeded the execution time threshold. 

Resolution:  I performed various steps:
1.> Add the host entry:
127.0.0.1  crl.microsoft.com
2.> Trust the SP root cert
http://support.microsoft.com/kb/2625048

Import the Trusted certificate

3.>  Reduce the time when the crl check is done (not a fix but it will fail quicker and carry on)

This post may also help - but it wasn't my issue: http://stevesps.blogspot.co.uk/2013/01/sharepoint-foundation-event-id-8321.html

 *************************

Problem: Event Id: 8313 - Task Category: Topology
A failure was reported when trying to invoke a service application: EndpointFailure
Process Name: w3wp
Process ID: 5640
AppDomain Name: /LM/W3SVC/1647355528/ROOT-1-13036555135555957
AppDomain ID: 2
Service Application Uri: urn:schemas-microsoft-com:sharepoint:service:649a3e7c090555059555c7a101555576#authority=urn:uuid:55b29cf855594c76555658fca66dac65&authority=https://sv-sp-web1:32844/Topology/topology.svc
Active Endpoints: 2
Failed Endpoints:1
Affected Endpoint: http://sv-sp-app2:32843/649a3e7c0904495552e4c7a555d64555/MetadataWebService.svc

Initial Hypothesis: It looks like the web front ends cannot communicate with the MetadataWebService.svc.  Run mmc > File > Add/Remove Snap-in > snap-in "Certificates" > Add > Computer Account > Local Computer > OK.
Expand "Certificates" > SharePoint > Certificates.  Open the certs and check if they are verified.  In my case my WFEs are good but my app servers do not have a valid certificate as shown below.

Resolution:
PS> $rootCert = (Get-SPCertificateAuthority).RootCertificate
PS> $rootCert.Export("Cer") | Set-Content C:\SharEPointRootAutority.cer -Encoding Byte




http://khalidstech.blogspot.co.uk/2012/11/certificate-validation-errors-in.html

Automation to add the SharePoint Root Certificate is done very nicely in this post: http://lennytech.wordpress.com/2013/06/18/powershell-install-sp-root-cert-to-trusted-root/

 **************

Disable CRL check (I believe this is from AutoSPInstaller)
Set-ItemProperty -path "HKCU:\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing" -name State -value 146944
set-ItemProperty -path "REGISTRY::\HKEY_USERS\.Default\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing" -name State -value 146944
get-ChildItem REGISTRY::HKEY_USERS | foreach-object {set-ItemProperty -ErrorAction silentlycontinue -path ($_.Name + "\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing") -name State -value 146944 }

*************
 

Problem: Event Log is capturing EventId: 2159 Source: SharePoint Foundation Error message refers to Event 8306 within the ULS logs.

Resolution: Edit the web.config to allow the ULS to capture additional information relating to the error.  The resulting error shows the common SharePoint COM class factory error.  In this scenario, changing the "SecurityTokenService" app pool "Load User Profile" property to true corrects the underlying issue.

***************
 

Tuesday 28 January 2014

Search stops working and CA Search screens error

Problem: In my redundant Search farm, search falls over.  It was working and nothing appears to have changed, but it suddenly stopped working.  The problem surfaces in several places:
1.> In  CA going to "Search Administration" displays the following error message: Search Application Topology - Unable to retrieve topology component health states. This may be because the admin component is not up and running.


2.> Query and crawl stopped working.
3.> Using PowerShell I can't get the status of the Search Service Application
PS> $srchSSA = Get-SPEnterpriseSearchServiceApplication
PS> Get-SPEnterpriseSearchStatus -SearchApplication $srchSSA
Error: Get-SPEnterpriseSearchStatus : Failed to connect to system manager. SystemManagerLocations: net.tcp://sp2013-srch2/CD8E71/AdminComponent2/Management
4.> In the "Search Administration" page within CA, if I click "Content Sources" I get the error message: Sorry, something went wrong
The search application 'ef5552-7c93-4555-89ed-cd8f1555a96b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information.

I used PowerShell to get the ULS logs for the correlation Id returned on the screen via the CA error message
PS> Merge-SPLogFile -Path "d:\error.log" -Correlation "ef109872-7c93-4e6c-89ed-cd8f14bda96b"
Output shows..
Logging Correlation Data       Medium Name=Request (GET:http://sp2013-app1:2013/_admin/search/listcontentsources.aspx?appid=b555d269%255577%2D430d%2D80aa%2D30d55556dc57)
Authentication Authorization   Medium Non-OAuth request. IsAuthenticated=True, UserIdentityName=, ClaimsCount=0
Logging Correlation Data       Medium Site=/ 05555e9c-5555-555a-69e2-3f65555be9f4
Topology  Medium WcfSendRequest: RemoteAddress: 'https://sp2013-srch2:32844/8c2468a555594301abf555ac41a555b0/SearchAdmin.svc' Channel: 'Microsoft.Office.Server.Search.Administration.ISearchApplicationAdminWebService' Action: 'http://tempuri.org/ISearchApplicationAdminWebService/GetVersion' MessageId:
General    Medium Application error when access /_admin/search/listcontentsources.aspx, Error=The search application 'ef155572-7555-4e6c-89ed-cd8f14bda96b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information.  Server stack trace: at System.ServiceModel.Channels.ServiceChannel.ThrowIfFaultUnderstood(Message reply, MessageFault fault, String action, MessageVersion version, FaultConverter faultConverter) at System.ServiceModel.Channels.ServiceChannel.HandleReply(ProxyOperationRuntime operation, ProxyRpc& rpc) at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Ob General  Medium ...(IMethodCallMessage methodCall, ProxyOperationRuntime operation)  at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)  Exception rethrown at [0]: at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.ErrorHandler(Object sender, EventArgs e) at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.OnError(EventArgs e) at System.Web.UI.Page.HandleError(Exception e) at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) at System.Web.UI.Page.ProcessRequest(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) 
General      Medium ...em.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) 05b5559c-4215-0555-69e2-3f656555e9f4 Runtime   tkau Unexpected System.ServiceModel.FaultException`1[[System.ServiceModel.ExceptionDetail, System.ServiceModel, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]: The search application 'ef109555-7c93-444c-85555-cd8f14b5556b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information.   Server stack trace:      at System.ServiceModel.Channels.ServiceChannel.ThrowIfFaultUnderstood(Message reply, MessageFault fault, String action, MessageVersion version, FaultConverter faultConverter)     at System.ServiceModel.Channels.ServiceChannel.HandleReply(ProxyOperationRuntime operation, ProxyRpc& rpc) 
Runtime     tkau Unexpected ...s, TimeSpan timeout) at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation) at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)  Exception rethrown at [0]: at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.ErrorHandler(Object sender, EventArgs e) at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.OnError(EventArgs e) at System.Web.UI.Page.HandleError(Exception e) at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
Runtime       tkau Unexpected ...ProcessRequest() at System.Web.UI.Page.ProcessRequest(HttpContext context) at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously
General    ajlz0 High Getting Error Message for Exception System.ServiceModel.FaultException`1[System.ServiceModel.ExceptionDetail]: The search application 'ef10-555-a96b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information. (Fault Detail is equal to An ExceptionDetail, likely created by IncludeExceptionDetailInFaults=true, whose value is: System.Runtime.InteropServices.COMException: The search application '7c93-555c-89ed-cd8f555da6b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information. at icrosoft.Office.Server.Search.Administration.SearchApi.RunOnServer[T]General        ajlz0 High ...   at Microsoft.Office.Server.Search.Administration.SearchApi..ctor(String applicationName)     at Microsoft.Office.Server.Search.Administration.SearchAdminWebServiceApplication.GetVersion()     at SyncInvokeGetVersion(Object , Object[] , Object[] )     at System.ServiceModel.Dispatcher.SyncMethodInvoker.Invoke(Object instance, Object[] inputs, Object[]& outputs)     at System.ServiceModel.Dispatcher.DispatchOperationRuntime.InvokeBegin(MessageRpc& rpc) at System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMess...) Micro Trace uls4 Medium Micro Trace Tags: 0 nasq,5 agb9s,52 e5mc,21 8nca,0 tkau,0 ajlz0,0 aat87 05b16e9c-4215-00ba-69e2-3f656eabe9f4
Monitoring   b4ly Medium Leaving Monitored Scope (Request (GET:http://sp2013-app1:2013/_admin/search/listcontentsources.aspx?appid=be28d269%255577%2D430d%2D555a%2D30d55556dc57)). Execution Time=99.45349199409
Topology    e5mb Medium WcfReceiveRequest: LocalAddress: 'https://sp2013-srch2.demo.local:32844/8c2468a555943555bf42cac555f05b0/SearchAdmin.svc' Channel: 'System.ServiceModel.Channels.ServiceChannel' Action: 'http://tempuri.org/ISearchApplicationAdminWebService/GetVersion' MessageId: 'urn:uuid:c2d55565-bb59-4555-b34b-6555ef1d79a5'


I can see my Admin component on SP2013-SRCH2 is not working, so I turned off the machine hoping it would fail over to the other admin component.  Nothing changed and I kept getting the same log errors.  I turned the server back on and reviewed the event log on the admin component server (SP2013-SRCH2).  The following errors occurred:
Application Server Administration job failed for service instance Microsoft.Office.Server.Search.Administration.SearchServiceInstance + Reason: The device is not ready. 
The Execute method of job definition Microsoft.Office.Server.Search.Administration.IndexingScheduleJobDefinition + The search application + on server SP2013-SRCH2 did not finish loading.

Examining the Windows application event logs on SP2013-SRCH2, I noticed event log errors relating to permissions:
A database error occurred. Source: .Net SqlClient Data Provider Code: 229 occurred 0 time(s) Description:  Error ordinal: 1 Message: The EXECUTE permission was denied on the object 'proc_MSS_GetConfigurationProperty', database 'SP_Search'
Unable to read lease from database - SystemManager, System.Data.SqlClient.SqlException (0x80131904): The EXECUTE permission was denied on the object 'proc_MSS_GetLease', database 'SP_Search', schema 'dbo'.

The event log showed me the account trying to execute these stored procs.  It was my Search service account e.g. demo\sp_searchservice

Initial Hypothesis: It looks like the Admin search service is not starting or failing over.  By tracing the permissions I can see the demo\SP_SearchService account no longer has execute SP permissions on the SP_Search database. 

By opening the SP_Search database and looking at effective permissions I can see the "SPSearchDBadmin" role has "effective" permissions over the failing stored procs (Proc_Mss_GetConfigurationProperty).

If I look at the account calling the stored proc (demo\SP_SearchService), I can see it is not assigned to the role.  This examination leads me to conjecture that the management tool or someone/something on the farm has caused the permissions to be changed.

Resolution: Change the permissions on the database.  Give minimal permissions: go to the database and add the service account (demo\sp_SearchService) to the SPSearchDBAdmin role.
All the issues recorded in the problem statement were resolved within 5 minutes on the affected farm.

Note: I have a UAT environment in exact sync with my PR environment; UAT has the correct permissions in place already.

Tuesday 7 January 2014

Office Web App Common Problems & Fixes

There is now an article on WCA on TechNet that includes troubleshooting.  Updated 30/04/2014.

Finding your issues: Office Web Apps (WCA) displays an error without a correlation Id.  If you have a small WCA farm, you can trawl the WCA ULS logs using ULS eventviewer to find your issue.  However, on a busy or large farm finding the ULS errors is tedious.  You can use IE development tools, Fiddler or any other tool, provided you can find the correlation Id sent in the response from the WCA server.

The screenshot below shows how to use fiddler to find a correlationId returned from the WCA server.  In the browser view the http response and find the "X-CorrelationId" property.

Tip: It is also worth verifying the WCA farm is reachable and running using https://wcaservername/hosting/discovery
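If you copy the raw response headers out of Fiddler or the browser tools, pulling out the correlation Id can be scripted.  Below is a minimal Python sketch: the function name find_correlation_id and the sample headers (including the all-zero GUID) are made up for illustration, and only the "X-CorrelationId" header name comes from the text above.

```python
from typing import Optional


def find_correlation_id(raw_headers: str) -> Optional[str]:
    """Return the X-CorrelationId value from raw HTTP response headers, if present."""
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        # HTTP header names are case-insensitive; lines without a colon are skipped.
        if sep and name.strip().lower() == "x-correlationid":
            return value.strip()
    return None


if __name__ == "__main__":
    # Made-up sample response headers, as they might be pasted from a trace.
    sample = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html\r\n"
        "X-CorrelationId: 00000000-0000-0000-0000-000000000000\r\n"
    )
    print(find_correlation_id(sample))
```

The returned value is what you would then search for in the WCA ULS logs.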

==========================================

Problem: When opening a word or pdf document I receive the following error popup "Sorry, there was a problem and we can't open this PDF. If this happens again, try opening the PDF in Microsoft Word.".
Initial Hypothesis:  WCA was working and pdf's were opening.  The error is similar to the error received when the networking is not correct (WCA machines can't access the SharePoint WFE's).  Opening the ULS logs on the WCA machines, I can see the following error message "WOPI CheckFile: Catch-All Failure [exception:Microsoft.Office.Web.Common.EnvironmentAdapters.UnexpectedErrorException: HttpRequest failed ---> Microsoft.Office.Web.Apps.Common.HttpRequestAsyncException: No Response in WebException ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 10.189.xx.15:443"   

Resolution:  My load balancer was not passing the https traffic from the WCA servers through to the WFEs.  I fixed the load balancer so https traffic is forwarded correctly.  The WCA servers need to speak to the WFE/SharePoint servers on either http or https depending on how the WCA farm is configured (SSL termination, with SSL or without SSL are the 3 options).

===========================================

Problem:  Can't open any document in WCA and the WCA ULS is generating the following issue:
WOPI CheckFile: Catch-All Failure [exception:Microsoft.Office.Web.Common.EnvironmentAdapters.UnexpectedErrorException: HttpRequest failed ---> Microsoft.Office.Web.Apps.Common.HttpRequestAsyncException: No Response in WebException ---> System.Net.WebException: The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel. ---> System.Security.Authentication.AuthenticationException: The remote certificate is invalid according to the validation procedure.   

Initial Hypothesis:  On each WCA server, I try to open the SharePoint web application, i.e. https://www.demo.dev, and the browser displays that the certificate has errors ("Certificate error").  Opening up the certificate, the certificate chain looks correct.

Resolution:
1.> Run > MMC > File > Add snap-ins > Certificates > Add > Computer Account > Local Computer > Finish. OK.
2.> In the MMC console navigate to Certificates > Trusted Root Certificate Authorities > Certificates > (Right click) All Tasks > Import (both the trusted root and the intermediary certificates are required).
After adding the missing certificates, open the browser and check the certificate.  Thanks to David C for figuring out and documenting that the issue is the certificate chain being used on the WCA servers.  Office Web Apps is working again.

===========================================

Problem: When Opening a word document I get the following error when Office Web Apps tries to render the document "There's a configuration problem preventing us from getting your document. If possible, try opening this document in Microsoft Word."

Initial Hypothesis: The error message doesn't tell me much, so check the ULS logs on the Web Front End (SharePoint Server).  I found the following issue in my ULS:
WOPI (CheckFile) - Invalid Proof Signature for file SandPit Environment Setup.docx url: http://web-sp2013-uat.demo.dev/Docs/_vti_bin/wopi.ashx/files/6d0f38c0d5554c87a655558da9cedcad?access_token...
Resolution: Run the following PS> Update-SPWOPIProofKey -ServerName "wca-uat.demo.dev"

More Info:
http://technet.microsoft.com/en-us/library/jj219460.aspx

===========================================

Problem:  When opening a docx file using WCA (Office Web Apps) I get the following error message: "Sorry, there was a problem and we can't open this document. If this happens again, try opening the document in Microsoft Word."
I then tried to open an excel document and got the error: "We couldn't find the file you wanted.
It's possible the file was renamed, moved or deleted."

Initial Hypothesis: Checked the ULS logs on the only OWA server and found this unexpected error:
 HttpRequestAsync, (WOPICheckFile,WACSERVER) no response [WebExceptionStatus:ConnectFailure, url:http://webuat.demo.dev/_vti_bin/wopi.ashx/...
This appears to be a networking related issue.  I have an NLB (KEMP) and I am using a wildcard certificate on the WCA address with SSL termination.

Resolution:  The error message tells me that the OWA server can't get back to the SharePoint WFE servers.  The request from the WCA/OWA1 server back to the SP front end server is made using http, not https.  I have an issue as my NLB can't deal with traffic on port 80.
I added a host entry on my OWA1 server so that traffic to the SharePoint WFE goes directly to a server by IP, and it works.  This means I don't have high availability.  An NLB service dealing with the web application on port 80 will fix my issue.

The OWA server needs to access the web app on port 80.  My NLB stopped all traffic on port 80.

==========================================

Problem:  The above issue was temporarily fixed by adding a host entry on the WCA1 server so that using the url of the web application on port 80 would direct the request back to WFE1.  I turned on WCA2 and WFE2 so I now have 2 SharePoint web front ends and a 2 server WCA farm.  In my testing I have docx, doc and excel files.  From multiple locations I could open and edit the docx and doc files but opening the excel file gave me this issue: "Couldn't Open the Workbook
We're sorry. We couldn't open your workbook. You can try to open this file again, sometimes that helps."
 

Initial Hypothesis: Documents are cached and the internal balancing seemed to make word documents available using Office Web Apps.  I assume the requests are coming out of the cache or from the OWA server that has the host entry.  I need to tell OWA where to go via a host entry or NLB entry.  Note: using a host entry won't make the OWA farm highly available/redundant.  This is the same issue as mentioned in the problem above.

Resolution: I added a host entry to the WCA2 server, it points to the WFE1 machine.

==========================================

Problem: Opening docx or pptx files in Office Web Apps 2013 results in the error "Sorry, Word Web App ran into a problem opening this document. To view this document please open it in Microsoft Word."

 Resolution:  I don't like it but I had to remove the link to the WCA farm, rebuild the WCA farm and hook SP2013 back up to the WCA farm.  [sic].  Other documents were opening and I realised after I rebuilt that my bindings were incorrect.

============================
Problem: Can't open word, pdf, pptx or excel documents using Office Web Apps.  ULS on the OWA servers included these log messages: WOPI (CheckFile) - Invalid Proof Signature for file.  WOPI Proof: All WOPI Signature verification attempts failed.  WOPI Signature verification attempt failed with public key. 
Also found in the logs: "Error message from host: Verifying signature failed, host correlation" "
HttpRequestAsync (WOPICheckFile,WACSERVER), request failure [HttpResponseCode:NotFound, HttpResponseCodeDescription:Not Found, url:https://www.demo.dev/_vti_bin/wopi.ashx/files/8b07d55558955551beb5555bed545553?access_token=REDACTED_1014&access_token_ttl=1392256555993]"

Same issue ULS excerpt:  
Error message from host: Verifying signature failed, host correlation:
WOPI CheckFile: Catch-All Failure [exception:Microsoft.Office.Web.Common.EnvironmentAdapters.FileUnknownException: WOPI 404   
 at Microsoft.Office.Web.Apps.Common.WopiDocument.LogAndThrowWireException(HttpRequestAsyncResult result, HttpRequestAsyncException delayedException) 
 
FileUnknownException while loading the app.


Hypothesis: None, I can't understand why the SP and WCA farms are struggling to communicate.  I believe the cause is to do with the load balancing/network changing [sic - maybe]. 

Resolution: Remove the link between the SP farm and the WCA farm:
PS> Remove-SPWOPIBinding -All:$true
Then connect SP to the WOPI farm:
PS> $internalName = "wca.demo.dev"
PS> $internalZone = "internal-https"
PS> New-SPWOPIBinding -ServerName $internalName -AllowHTTP
PS> Set-SPWOPIZone -Zone $internalZone
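
After rebinding, it can help to confirm what SharePoint now holds.  The following is a minimal sketch, assuming it is run in the SharePoint 2013 Management Shell (or any session where the SharePoint snap-in can be loaded) on a farm server; it only reads the current zone and bindings and changes nothing.

```powershell
# Load the SharePoint cmdlets if this is a plain PowerShell session.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Show the WOPI zone the farm is currently using (e.g. internal-https).
Get-SPWOPIZone

# Summarise the bindings: which WCA server and zone each application is bound to.
Get-SPWOPIBinding |
    Group-Object ServerName, WopiZone, Application |
    Select-Object Count, Name |
    Sort-Object Name
```

If the ServerName column does not show the expected WCA address (wca.demo.dev in my case), the binding step needs to be repeated.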
 

==============================
Problem: Intermittent requests are not returning the PDF/Word documents.  Most requests work and occasionally one request doesn't.  Every 4th request tries for a few minutes to get the document to display in Office Web Apps without any error message, then stops trying and displays the message "Sorry, Word Web App can't open this ... document because the service is busy."

Initial Hypothesis:  Originally I thought it was only happening to PDFs, but it is happening to both Word and PDF documents (I don't have Excel docs in my system).  My monitoring software, SolarWinds, is badly configured on my OWA servers: the server monitor shows green, but drilling into the server's monitoring shows that the 2 application monitors are failing.  The server should go amber if either of the 2 applications fails and in turn red after 5 minutes.  At this point I notice that I can't log onto one of my 4 OWA/WCA servers, and web requests to it are not being returned.  I look at my KEMP load balancer and it says all 4 WCA servers are working.  I notice the health check is configured on ping rather than on web requests (not right), so the NLB/KEMP is merely redirecting every 4th request to the broken server.

Resolution: 
  1. Reboot the broken server.  Once it comes up, I can make HTTP requests directly to the server at http://wca.demo.dev/hosting/discovery and it's all working again.
  2. SolarWinds monitoring is lousy and needs to be fixed.
  3. The KEMP hardware load balancer needs to be changed from checking that each machine is on (ping) to checking each machine using a web request.
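
The same web-request check can be run by hand while the load balancer and monitoring are being fixed.  This is a minimal sketch, assuming PowerShell 3 or later (for Invoke-WebRequest) and that each WCA server answers on its own host name; the four server names below are hypothetical placeholders, not my real machines.

```powershell
# Hypothetical server names; replace with the real WCA/OWA machines.
$wcaServers = "wca1.demo.dev", "wca2.demo.dev", "wca3.demo.dev", "wca4.demo.dev"

foreach ($server in $wcaServers) {
    # The discovery endpoint only answers if the Office Web Apps service is actually serving requests.
    $url = "http://$server/hosting/discovery"
    try {
        $response = Invoke-WebRequest -Uri $url -UseBasicParsing -TimeoutSec 10
        Write-Output "$server : HTTP $($response.StatusCode)"
    }
    catch {
        # A ping-only check would miss this: the machine is up but the web request fails.
        Write-Output "$server : FAILED - $($_.Exception.Message)"
    }
}
```

A server that replies to ping but shows FAILED here is the one receiving every 4th request.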




 

Thursday 21 November 2013

Search Host Controller Service in "Starting" state

Problem: On my SharePoint 2013 farm, via CA, I see my "Search Host Controller Service" is stuck in the "Starting" state.

Initial Hypothesis: Check if any services are stuck in a provisioning state using the PS below:
Get-SPServiceInstance | sort TypeName | select TypeName, Status, Server | ? {$_.Status -eq "Provisioning"}  
Resolution:  After identifying the Search Host Controller instance(s) stuck in the Provisioning state, get them to the Online state.  This is quickly achieved using PowerShell.   Tip: I was forced to run my PS on the server that is running the offending service instance.

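# Find the primary Search Host Controller service instance
$inst = Get-SPServiceInstance | ? {$_.TypeName -eq "Search Host Controller Service" } | ? { $_.PrimaryHostController -eq $true }
# Find the service instance(s) that share its SearchServiceInstanceId
$sh = Get-SPServiceInstance | ? {$_.SearchServiceInstanceId -eq $inst.SearchServiceInstanceId.ToString()}
# Unprovision and then provision again, printing the status before and after each step
$sh.Status
$sh.Unprovision()
$sh.Status
$sh.Provision()
$sh.Status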


I also use this script to get a general understanding of the health of my Search Service Applications.

More Info:
http://mmman.itgroove.net/2012/12/search-host-controller-service-in-starting-state-sharepoint-2013-8/

Thursday 8 August 2013

SP2013 AutoSPInstaller Error Correction

This post contains a list of issues I have had to correct on multiple farm builds.

Problem:  After a remote offline install, the main CA box shows an upgrade is required.  CA > Manage servers in this farm shows "Upgrade Required".
Resolution: Run cmd > C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\15\BIN>PSConfig.exe -cmd upgrade -inplace b2b -force -cmd applicationcontent -install -cmd installfeatures

Problem: PU or CU is not getting installed.
Resolution: Ensure the .exe files are extracted, e.g. ubersrv2013-kb2817414-fullfile-x64-glb.exe, which is the June 2013 CU.  Ensure the March 2013 PU (ubersrvsp2013-kb2767999-fullfile-x64-glb.exe) is also extracted and in the same folder.  Ensure they are placed in the correct location; for me it is ..\2013\Updates\...  CUs are cumulative, so you only need the PU and then the latest CU.  Don't rename the CU - this caught me out.

Problem: I kept receiving the error: 17303 Extracted file error for the June CU (ubersrv2013-kb2817414-fullfile-x64-glb.exe).
Resolution:  It looks like the download was corrupt.  I re-downloaded the CU and extracted the contents, which fixed the issue.
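
One way to spot a damaged download before extracting it is to check the file's digital signature.  This is a minimal sketch rather than something I ran at the time; the folder in $cuPath is a hypothetical location, and on a server without Internet access the status may be affected by certificate chain checks, so treat anything other than Valid as a hint to re-download rather than as proof of corruption.

```powershell
# Hypothetical path; point this at wherever the CU was downloaded.
$cuPath = "C:\Install\ubersrv2013-kb2817414-fullfile-x64-glb.exe"

# Get-AuthenticodeSignature compares the file contents against its embedded signature.
$sig = Get-AuthenticodeSignature -FilePath $cuPath

Write-Output "Status : $($sig.Status)"
Write-Output "Detail : $($sig.StatusMessage)"

if ($sig.Status -ne "Valid") {
    Write-Warning "Signature is not valid for $cuPath - consider downloading the file again."
}
```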

Problem: I built a 4-server SP2013 farm.  I had 2 Search nodes set up via AutoSPInstaller and added another 2 search nodes (index partitions, each with its corresponding replicated index).  After a reboot I got into the following state and could not get WFE2 to join the search farm.
Using CA and looking at the Search topology on WFE2, I can see the error "Administrative status  The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel." and "Unable to retrieve topology component health states. This may be because the admin component is not up and running.".
 
Resolution: Check the routing.  My CA box is WFE, and I also had a CA role on WFE2.  While playing with the farm I added a host entry on WFE2 to work with CA there.  The issue is that this host entry uses the same name as the one where my Admin component sits.  Removing the host entry on WFE2 fixed the status of the index and query components.
 
 

Monday 24 December 2012

Digital Signatures and Install Software gotcha

Problem: In automating SQL Server and SharePoint images, the actual installation is taking a long time on my managed environment whereas my developer laptop is fast.  All installations are done without Internet access (offline).

I have a dev environment built on my laptop, which runs on SSD, and I run 3 VMs using VMware Workstation 9 (all use Windows 2008 R2 SP1).  I create 1) an AD with 1GB of RAM and 1 CPU, 2) SQL 2012 with 10GB RAM and 4 CPUs, and 3) SP2010 CU Aug 2012 with 10GB RAM and 4 CPUs.  All the installation is automated using slipstreamed images.

For simplicity, I will explain a simplified comparable setup on the CI environment.  I have 3 machines with the same roles; however, the SQL 2012 and SP2010 installs take considerably longer.  The CI environment is on ESX (Cisco blades and chassis, with Violin (SSD) storage).  Whether the CPU/compute is connected to the storage via SAS or Fibre Channel made no difference either.  I have summarised the results below:

                               SQL2012 (duration)    SP2010 (duration)
Laptop (VMware Workstation)    15 min                16 min
CI (ESX)                       22 min                92 min

Finding: My hardcore/good ESX infrastructure is taking 7 minutes longer to install SQL Server 2012 on better hardware and an amazing 76 minutes longer to install SP2010.

Update 21 Feb 2013: Don't use PowerShell 3 with AutoSPInstaller.  Running it with the version switch (i.e. -version 2) doesn't work out of the box, and even changing AutoSPInstaller's internal web calls fails.  It can be made to work with the version 2 switch, but it isn't worth the effort.

Initial Hypothesis:
After many, many hours between the service providers managing the infrastructure, we established it was not the hardware or the ESX configuration/setup.  However, if the network card on the VM is disabled, the performance improves to:

            SQL2012 (duration)    SP2010 (duration)
CI (ESX)    13 min                5 min 5 sec

Pretty hefty improvement.  Using netstat, it looks like there are requests to the Internet.  After adding Wireshark to monitor all traffic, I can see requests being sent to crl.microsoft.com (certificate revocation lists) and ctldl.windowsupdate.com.

Issue shown in Wireshark
Issue Shown in Fiddler
This is the 1st time I have seen this issue in a client's production environment.  If the WFEs/SP servers have Internet access (less preferable), or the servers clearly don't have access, the install works in a timely fashion.  The symptoms of the issue appear when the WFEs/SP servers don't have Internet access but think they do.  All the binaries are digitally signed and the install will try to validate the signatures despite this being an offline install.

I confirmed the problem is how the networking is set up.  My issue shows up on the VM NIC adapter: originally the IPv4 Connectivity has a status of "No Internet Access"; once I ping Google I get a reply and the status changes to "Internet".  I can ping Google but not browse to it.
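
To see whether a server is in this "thinks it has Internet but doesn't" state, the two hosts from the Wireshark trace can be probed directly.  This is a rough sketch, assuming the revocation traffic goes over plain HTTP on port 80 and using a 5 second timeout that I picked arbitrarily; a slow failure here (rather than an immediate one) is the behaviour that slows the install.

```powershell
# Hosts seen in the Wireshark trace during the slow install.
$crlHosts = "crl.microsoft.com", "ctldl.windowsupdate.com"

foreach ($crlHost in $crlHosts) {
    $client = New-Object System.Net.Sockets.TcpClient
    $timer = [System.Diagnostics.Stopwatch]::StartNew()
    try {
        # Start the connection and wait at most 5 seconds (assumed timeout) for it to complete.
        $async = $client.BeginConnect($crlHost, 80, $null, $null)
        $finished = $async.AsyncWaitHandle.WaitOne(5000)
        $connected = $finished -and $client.Connected
        Write-Output "$crlHost : connected=$connected after $($timer.ElapsedMilliseconds) ms"
    }
    catch {
        # Typically a DNS failure, which fails fast and does not slow the install.
        Write-Output "$crlHost : failed after $($timer.ElapsedMilliseconds) ms - $($_.Exception.Message)"
    }
    finally {
        $client.Close()
    }
}
```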


Resolution:  The problem is that executable code is digitally signed.  This is good; all code should be digitally signed so it can be authenticated.  However, in this situation a lot of requests are being sent out from the VM as the install tries to verify all the SharePoint compiled code.  The install on the local VM acts as if there is an Internet connection (which there is not).

It takes an unusual network setup to get into this state, but SharePoint, like any digitally signed code, will check the digital certificates.

There are a few fixes such as:
1.> Allow the servers to get out to the Internet, i.e. open the firewall or set a proxy on the local VM.
2.> Add host entries so the certificate check fails immediately and the install continues (this is not working for me).
3.> Make the following registry change:
set-ItemProperty -path "HKCU:\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing" -name State -value 146944
set-ItemProperty -path "REGISTRY::\HKEY_USERS\.Default\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing" -name State -value 146944
get-ChildItem REGISTRY::HKEY_USERS | foreach-object {set-ItemProperty -ErrorAction silentlycontinue -path ($_.Name + "\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing") -name State -value 146944}
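
To understand and verify the registry change: the State value is a set of WinTrust policy flags, and 146944 (0x23E00) differs from the usual default of 146432 (0x23C00) only by the 0x200 bit, which tells Windows to skip the publisher certificate revocation check.  The sketch below decodes that bit and reads the value back for the current user; the function names are my own, not built-in cmdlets.

```powershell
# 0x200 is the WinTrust "ignore revocation" policy flag (WTPF_IGNOREREVOKATION).
$IgnoreRevocationFlag = 0x200

# Hypothetical helper: returns $true when the given State value skips the revocation check.
function Test-RevocationCheckDisabled {
    param([Parameter(Mandatory = $true)][int]$State)
    return (($State -band $IgnoreRevocationFlag) -ne 0)
}

# Hypothetical helper: reads the current user's State value back from the registry (Windows only).
function Get-SoftwarePublishingState {
    $key = "HKCU:\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing"
    return (Get-ItemProperty -Path $key -Name State).State
}

Test-RevocationCheckDisabled -State 146944   # the value set above
Test-RevocationCheckDisabled -State 146432   # the usual default
```

On the server itself, Test-RevocationCheckDisabled -State (Get-SoftwarePublishingState) shows whether the change has taken effect for the logged-on user.  Setting State back to 146432 restores the revocation check once the install is done.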


More Information:

Certificate checks when installing software cause slow installs:
http://joelblogs.co.uk/2011/09/20/certificate-revocation-list-check-and-sharepoint-2010-without-an-internet-connection/

http://ddkonline.blogspot.co.uk/2010/05/fix-sharepoint-very-slow-to-start-after.html

If you want to verify whether a machine is having problems with a particular process, use Process Explorer (useful if a machine has high memory, CPU or IO issues).