Radimaging Ltd - Paul Beck's Technical Working Notes for Microsoft Technology: error

Thursday, 10 October 2024

Network calls intermittently do not occur in Published play mode on a Canvas app

Problem: A Canvas app does not run Patch or any other network call when published and in play mode. The network calls work in edit mode. Looking at the logs, I can see that some users are successfully updating data using network calls.

Figure 1. Monitoring shows the Edit/Play working and the Publish/Play not making any network calls.

Initial Hypothesis:

The Patch call works in edit mode and, intermittently, works for some users in Published/Play mode. This tells me the data in the Patch call is causing the network call to fail.

Using a Trace and running in play mode, I remove properties until I find the offending property.

Figure 2. In edit mode, a label control, CurrentDateValue_1.Text, is used.

I traced the value of CurrentDateValue_1, which is set from the global variable varCurrentDateTime. As the label sits on an arbitrary screen, if the user does not open that screen in Published/Play mode, the label control CurrentDateValue_1.Text is never set.

Resolution: Replace all the code using CurrentDateValue_1.Text with the variable varCurrentDateTime. When the label control has not been set at the time the Patch call is made, bizarrely, the network call never happens. I'd expect it to send regardless and let the error handling catch the issue.

Tuesday, 13 August 2024

Power Automate, calling child flow Bad Request Error

Problem: When I call a child flow that uses a connection, the parent calling flow receives the error "Bad Request" with the following details "Action 'Run_Child_Flow' failed: Failed to parse invoker connections from trigger 'manual' outputs. Exception: Could not find property 'headers.X-MS-APIM-Tokens' in the trigger outputs. Workflow has connection references '["shared_wordonlinebusiness","shared_sharepointonline"]' with invoker runtime source.".

Error shows in the parent flow runs.

Initial Hypothesis: My first reading of the message suggested that permissions are being lost between the parent and child flows. I did some Googling/Copiloting to see if anyone had seen the issue before, and Darren Lutchner has a video that explains the issue and fixes it perfectly (https://www.youtube.com/watch?v=Yuh3Nlf9wrs).

Resolution: Watch Darren's video for a great walkthrough to understand the issue fully.

Resolution Summary:

1. Go to the "Child Flow" and edit the "Run only users" settings.

2. I had two connections, Word Online (Business) and SharePoint. Change them both from "Provided by run-only user" to "Use this connection (...)". Save, and the parent flow starts working and calls the child flow correctly.

Tuesday, 6 August 2024

Power Automate Flows containing Approvals failing with XrmApprovalsUserRoleNotFound

Overview: The Power Automate Approval action on an existing flow stops working and throws the error: "'XrmApprovalsUserRoleNotFound'. Error Message: The 'Approvals User' role is missing from the linked XRM instance." I did not have control of the environment, so I needed others to perform the fix so that I could verify it.

Problem:

The 'Start and wait for an approval' action/connector used in a Power Automate flow has been failing for 3 weeks in Dev. I now need this functionality to change the workflow 'IDC-StartApprovalCertificateWorkflow'. The error shown is "Action 'Start_and_Wait_for_an_approval_xx' failed. The request failed: Error code: 'XrmApprovalsUserRoleNotFound'. Error Message: The 'Approvals User' role is missing from the linked XRM instance."

Initial Hypothesis:

All the runs of my flow have been failing for 3 weeks on the 'Start and wait for an approval' action in the Dev environment. I have tried creating a new vanilla flow using the 'Start and wait for an approval' action and it fails with the same issue.

When I trigger the flow in other environments, including production, the 'Start and wait for an approval' action works. I cannot see any difference except the environments. The error message "XrmApprovalsUserRoleNotFound" is basically telling me that my user should be in the 'Approvals User' role. I have the role assigned.

Env: Client-DEV
Type: Automated
Plan: This flow runs on owner's plan (paul@client.com)

Resolution:

Microsoft Support: Check that the 'Approvals User' role is correctly assigned to the user running the flow in the environment's user security roles.

Admin: The user running the flow already has the role assigned. We have re-assigned the role. We did not test, but got the developer/owner to test.

Developer/me/flow owner: The approval has started working again in the Dev environment. I just retested, and the flow that was not firing yesterday is now working again. New flows also fire/work correctly.

Summary: The user role in Dataverse was correctly assigned. It looks like a refresh of the user in the 'Approvals User' role corrected the issue.

Research:

https://www.linkedin.com/pulse/power-automate-approvals-flows-failing-adrian-colquhoun

Wednesday, 6 September 2023

App Insights for Power Platform - Part 11 - Custom Connector

Overview: Power Automate can set retry policies on custom connectors, but a Canvas app using a Custom Connector does not have any retry configuration. FYI, if the Custom Connector gets a 5xx error, the call is made 4 times; this was proven using 500 and 502 errors. 408 (timeout) and 429 (too many requests) errors also appear to be thrown 4 times. The retry is driven by the Canvas app; triggering the call from the Custom Connector test rig only tries once.

My example: My Canvas app uses a custom connector that calls my Azure Function. In turn, the function calls my APIM, and APIM calls the 3rd party.

My Azure Function returns a 500, 502, 408 (response timeout) or 429 if the response has not been received within 10 seconds, and I push the HTTP code back to the Custom Connector. I can see the response from the 3rd party is taking roughly 35 seconds. In the network traces I can see 4 invoked calls that all fail with 408 HTTP response codes.

Result: The custom connector makes the call 4 times, resulting in my Power App being locked for 40+ seconds.

Possible remediation:

1. Return a 418 HTTP response code (I'm a teapot). My app then calls the function using the custom connector only once. So 418 errors (also tested with 400, but 400 is not the right response) behave differently from the 5xx, 408 and 429 errors returned from the API.

Note: Originally I was caught out because I triggered the call using the custom connector test rig, which only tries once. When called from Power Apps, it tries four times. Returning 408 is not a fix. Returning 418 ensures I only try once, gives a better user experience, and I now have 418 logs that I add error details to.

2. The 3rd party API should not take 35 seconds; improve it.

3. I could set my timeout on the specific function to 40 seconds; however, if the call starts taking 41 seconds, my Canvas app will be locked for over 160 seconds.

4. Go through all the APIs and, if they are fast, set the timeout as short as possible so the app does not get locked out while waiting for the 4 responses.

Summary: Examine the 3rd party APIs and get them stable, performant and within the agreed SLAs. If you only want to try once, ensure that timeouts on the 3rd party are set to the SLA. If you intercept the request, you can choose the timeout; by examining the APIs you can find the optimum time to avoid timeouts. By using the 418 response code, the call only happens once.
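
The lock-out arithmetic in this post can be sketched in a few lines. This is a minimal Python illustration, not connector code: the function names are hypothetical, and the rule for which status codes are attempted 4 times simply encodes the observations above (only 500 and 502 were actually proven for the 5xx range).

```python
# Number of calls observed from a Canvas app when the response is retried.
# This encodes the observations in this post; it is not a documented contract.
ATTEMPTS_WHEN_RETRIED = 4


def attempts_for(status_code: int) -> int:
    """Return how many calls the Canvas app is assumed to make for a response code."""
    retried = status_code in (408, 429) or 500 <= status_code <= 599
    return ATTEMPTS_WHEN_RETRIED if retried else 1


def worst_case_lock_seconds(status_code: int, timeout_seconds: int) -> int:
    """Worst-case wait if every attempt runs all the way to the timeout."""
    return attempts_for(status_code) * timeout_seconds


print(worst_case_lock_seconds(408, 10))  # 4 attempts x 10 s timeout
print(worst_case_lock_seconds(408, 41))  # 4 attempts x 41 s timeout
print(worst_case_lock_seconds(418, 10))  # 418 is attempted once
```

With a 10 second timeout the 408 case gives 40 seconds, and with 41 seconds it gives 164 seconds, which matches the "over 160 seconds" concern in option 3.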

Series

App Insights for Power Platform - Part 1 - Series Overview

App Insights for Power Platform - Part 2 - App Insights and Azure Log Analytics

App Insights for Power Platform - Part 3 - Canvas App Logging (Instrumentation key)

App Insights for Power Platform - Part 4 - Model App Logging

App Insights for Power Platform - Part 5 - Logging for APIM

App Insights for Power Platform - Part 6 - Power Automate Logging

App Insights for Power Platform - Part 7 - Monitoring Azure Dashboards

App Insights for Power Platform - Part 8 - Verify logging is going to the correct Log analytics

App Insights for Power Platform - Part 9 - Power Automate Licencing

App Insights for Power Platform - Part 10 - Custom Connector enable logging

App Insights for Power Platform - Part 11 - Custom Connector Behaviour from Canvas Apps Concern (this post)

Saturday, 22 February 2020

Catch Error in Power Apps and App Insight Logging

Error Handling:
App Insights logging: https://sharepains.com/2019/01/24/powerapps-experimenting-with-error-handling/ This approach has been replaced, as Microsoft has had built-in telemetry since 3 Feb 2020:
https://powerapps.microsoft.com/en-us/blog/log-telemetry-for-your-apps-using-azure-application-insights/

Example Error capturing and tracing to Azure AppInsights:
IfError(
    // 1st argument: perform the API call here
    ,
    // 2nd argument: fallback, so log here!
    Trace("Pauls Unique PowerApp", TraceSeverity.Error, {UserName:User().Email,
        Role:gblRole, ErrorMsg:ErrorInfo.Message, ErrorControl:ErrorInfo.Control,
        ErrorProperty:ErrorInfo.Property});
    Notify("Err message ..." & ErrorInfo.Message) // Display the error on the UI
)

Possible Canvas Apps Error Handling Pattern:

Ensure AppInsights key is added to each canvas app
Use IfError() to check calls and logic
Use the Trace method to write info to App Insights
Decide whether to enable the Experimental error handling features (great for tracing by correlationId)
Consider all Power Automate flows that are called from Power Apps (ensure you use the V2 connector)
Never use IfError to handle business logic

To Review your App Insights Logging:
Open your Azure Portal > Open your App Insights blade >
Click the "Search" navigation option > Free text entry e.g. "Loyalty PowerApp"

App Insights, finding Traces generated in Power Apps

Monitoring Tool within Power Apps
The Monitor tool in Power Apps is great for debugging and tracing.

Start a monitor on the open Power App.

Monitor Tool - Showing a GET via a custom Connector and the returned response

Function/Code Logging:

Server-side code should log to App Insights or your logging framework.

Ideally, the Trace function within Power Apps explained above is used in conjunction with server-side logging around 3rd party API calls.

Overview: C# code needs to have logging. If an error occurs, an appropriate response must be bubbled up for the next layer.

Possible C# Error Handling Pattern:

All catch blocks write the exception to Log Analytics or App Insights
Calls to data sources, Azure services and third party APIs, and complex logic, should ideally be wrapped in a try/catch that logs the error to App Insights using the C# App Insights SDK
The catch blocks ideally return the failure information so the calling code can deal with it using the output. If you don't deal with the returned message, simply log the exception and rethrow the error (this needs to be a conscious decision in each catch)
Catch specific errors first and respond accordingly: log, and if you don't pass info to the caller, rethrow the error if applicable (bubble). Lastly, use a catch-all. This is heavy, so only add it to existing code where errors happen often or we are having problems, i.e. be specific
Don't use try/catch to deal with business logic

Thought: Bubble up means the code must log exceptions and return an appropriate reply to the caller. If you don't send an appropriate reply, rethrow the exception after logging it so the caller has to deal with it.
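
The pattern above is described for C#; the same shape is sketched here in Python purely for illustration. Everything in it is an assumption: `call_third_party` and `get_customer` are hypothetical names, and the standard `logging` module stands in for the App Insights SDK.

```python
import logging

logger = logging.getLogger("api")


def call_third_party(customer_id: str) -> dict:
    # Hypothetical stand-in for a data source or third party API call.
    if customer_id == "slow":
        raise TimeoutError("third party did not answer in time")
    if customer_id == "down":
        raise ConnectionError("third party unreachable")
    return {"id": customer_id, "name": "Example"}


def get_customer(customer_id: str) -> dict:
    try:
        return {"ok": True, "data": call_third_party(customer_id)}
    except TimeoutError:
        # Specific error: log it and return the failure information,
        # so the caller can act on the output.
        logger.warning("Timeout calling third party for %s", customer_id)
        return {"ok": False, "error": "timeout"}
    except Exception:
        # Catch-all: no appropriate reply to give, so log and rethrow
        # (bubble up) and let the caller deal with it.
        logger.exception("Unexpected error for %s", customer_id)
        raise


print(get_customer("42"))
print(get_customer("slow"))
```

Calling `get_customer("down")` logs the exception and then raises the original `ConnectionError` to the caller, which is the "bubble up" case.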

Wednesday, 11 April 2018

HTTP 400 bad request response

Problem: I have an old SharePoint 2013 custom application that is only partially loading. (The application has not encountered the error in the several years that it has been running.) This is only happening for 1 user out of thousands and occurs on Chrome and IE. I can see in the IE developer toolbar that some requests are showing 200 responses and some are showing 400 responses from the web servers. The SP WFEs are load balanced, and all WFEs are showing the 400.

Initial Hypothesis: Only 1 user has the issue. Some URL requests work, and others are malformed (return 400 errors) on the same WFE. The user still fails on a different machine and with a different browser, so the user is sending a malformed request. It appears to be a problem with the specific user on a particular site collection and is likely to be the HTTP request header.
Using the browser settings, Fiddler or the Dev toolbars, get the error details, i.e.
HTTP 400 – Bad Request - The size of the request headers is too long.

Alternatively, use the IE browser and its "Show friendly HTTP error messages" setting to identify whether the issue is that the HTTP request header is too long.

Possible Resolution: Look at the request header; it may be too long for the WFE to handle. As making the header smaller is generally not an option, look to increase the size of the requests IIS allows for HTTP requests (HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\HTTP\Parameters). As this is a production issue and I can't replicate it in a lower environment, I need to use a host entry so that my offending user only accesses a single WFE where the fix is applied. By using the NLB and updating IIS, I can ensure the fix works without disrupting my user base.
See https://www.grouppolicy.biz/2013/06/how-to-configure-iis-to-support-large-ad-token-with-group-policy/
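
To confirm that a header really is too long, the captured request headers (for example copied out of Fiddler) can be measured. The Python sketch below is illustrative only: the function names and the sample headers are hypothetical, and the 16384 byte limit is an assumed default that should be checked against the actual values under the registry key mentioned above.

```python
from typing import List, Mapping

# Assumed per-header limit in bytes (16 KB); verify the real setting on the
# server under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\HTTP\Parameters.
ASSUMED_MAX_FIELD_BYTES = 16384


def header_block_bytes(headers: Mapping[str, str]) -> int:
    """Approximate size of the header block, counting 'Name: value' plus CRLF per header."""
    return sum(len(f"{name}: {value}\r\n".encode("utf-8")) for name, value in headers.items())


def oversized_fields(headers: Mapping[str, str],
                     max_field_bytes: int = ASSUMED_MAX_FIELD_BYTES) -> List[str]:
    """Return the names of headers whose value alone exceeds the per-field limit."""
    return [name for name, value in headers.items()
            if len(value.encode("utf-8")) > max_field_bytes]


# Hypothetical request: a user with a very large authentication token.
sample = {"Host": "www.demo.dev", "Authorization": "Negotiate " + "A" * 20000}
print(header_block_bytes(sample))
print(oversized_fields(sample))
```

A user in many AD groups typically shows up here with one very large authentication header, which fits the "1 user out of thousands" symptom.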

Tuesday, 3 June 2014

Anonymous REST JS call error using SP2013

Problem: I am making an anonymous JavaScript REST call to an image library. I get the error "Request failed. The method GetItems of the type List with id ... is blocked by the administrator on the server.nundefined".

Resolution: Change the ClientCallableSettings on the web application as shown below.

$wa = Get-SPWebApplication http://www.demo.dev
$wa.ClientCallableSettings.AnonymousRestrictedTypes.Remove([Microsoft.SharePoint.SPList],"GetItems")
$wa.Update()

More Info:
http://www.sharepoint-zone.com/search/label/ClientCallableSettings.AnonymousRestrictedTypes

Friday, 7 March 2014

EventLog Error Fix

Overview: After building my farms I trawl through the ULS and event logs to look for log messages that identify any issues. This post contains errors from my event logs that will hopefully help me in future.

Problem: My event log shows a Windows/IIS error whereby the IIS site's application pool uses a service account that does not have a user profile on the machine. The error message reads "Event Id: 1511 Windows cannot find the local profile and is logging you on with a temporary profile. Changes you make to this profile will be lost when you log off."

Verify the issue:

Resolution: (IEDaddy's post gave me the resolution)
1.> Stop the processes that use the account (I stopped the web sites that used the application pool account "demo\OD_Srv")
2.> cmd prompt> net localgroup administrators demo\OD_Srv /add
3.> cmd prompt> runas /u:demo\OD_Srv /profile cmd
4.> in the new cmd prompt run > echo %userprofile%
5.> Check the user profiles and verify the profile store for the account (demo\OD_Srv) has a status of "Local"
6.> Remove the account from the local administrators group ie cmd> net localgroup administrators demo\OD_Srv /delete

More Info:
http://www.brainlitter.com/2010/06/08/how-to-resolve-event-id-1511windows-cannot-find-the-local-profile-on-windows-server-2008/
http://todd-carter.com/post/2010/05/03/give-your-application-pool-accounts-a-profile/

*************************

Problem: EVENT ID: 8321 - Task Category: Topology
A certificate validation operation took 120053.1569 milliseconds and has exceeded the execution time threshold.

Resolution: I performed various steps:
1.> Add the host entry:
127.0.0.1 crl.microsoft.com
2.> Trust the SP root cert
http://support.microsoft.com/kb/2625048

Import the Trusted certificate

3.> Reduce the time when the crl check is done (not a fix but it will fail quicker and carry on)

This post may also help - but it wasn't my issue: http://stevesps.blogspot.co.uk/2013/01/sharepoint-foundation-event-id-8321.html

*************************

Problem: Event Id: 8313 - Task Category: Topology
A failure was reported when trying to invoke a service application: EndpointFailure
Process Name: w3wp
Process ID: 5640
AppDomain Name: /LM/W3SVC/1647355528/ROOT-1-13036555135555957
AppDomain ID: 2
Service Application Uri: urn:schemas-microsoft-com:sharepoint:service:649a3e7c090555059555c7a101555576#authority=urn:uuid:55b29cf855594c76555658fca66dac65&authority=https://sv-sp-web1:32844/Topology/topology.svc
Active Endpoints: 2
Failed Endpoints:1
Affected Endpoint: http://sv-sp-app2:32843/649a3e7c0904495552e4c7a555d64555/MetadataWebService.svc

Initial Hypothesis: It looks like the web front ends cannot communicate with the MetadataWebService.svc. Run mmc > File > Add/Remove Snap-in > snap-in "Certificates" > Add > Computer Account > Local Computer > OK.
Expand "Certificates" > SharePoint > Certificates. Open the certs and check if they are verified. In my case my WFEs are good but my app servers do not have a valid certificate as shown below.

Resolution:
PS> $rootCert = (Get-SPCertificateAuthority).RootCertificate
PS> $rootCert.Export("Cer") | Set-Content C:\SharEPointRootAutority.cer -Encoding Byte

http://khalidstech.blogspot.co.uk/2012/11/certificate-validation-errors-in.html

Automation to add the SharePoint Root Certificate is done very nicely in this post: http://lennytech.wordpress.com/2013/06/18/powershell-install-sp-root-cert-to-trusted-root/

**************

Disable CRL check (I believe this is from AutoSPInstaller)
Set-ItemProperty -path "HKCU:\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing" -name State -value 146944
set-ItemProperty -path "REGISTRY::\HKEY_USERS\.Default\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing" -name State -value 146944
get-ChildItem REGISTRY::HKEY_USERS | foreach-object {set-ItemProperty -ErrorAction silentlycontinue -path ($_.Name + "\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing") -name State -value 146944}

*************

Problem: The event log is capturing EventId: 2159, Source: SharePoint Foundation. The error message refers to Event 8306 within the ULS logs.

Resolution: Edit the web.config to allow the ULS to capture additional information relating to the error. The resulting error shows the common SharePoint COM class factory error. In this scenario, changing the "SecurityTokenService" app pool "Load User Profile" property to true corrects the underlying issue.

***************

Tuesday, 28 January 2014

Search stops working and CA Search screens error

Problem: In my redundant Search farm, search falls over. It was working and nothing appears to have changed, but it suddenly stopped working. The problem surfaces in several places:
1.> In CA going to "Search Administration" displays the following error message: Search Application Topology - Unable to retrieve topology component health states. This may be because the admin component is not up and running.

2.> Query and crawl stopped working.
3.> Using PowerShell I can't get the status of the Search Service Application
PS> $srchSSA = Get-SPEnterpriseSearchServiceApplication
PS> Get-SPEnterpriseSearchStatus -SearchApplication $srchSSA
Error: Get-SPEnterpriseSearchStatus : Failed to connect to system manager. SystemManagerLocations: net.tcp://sp2013-srch2/CD8E71/AdminComponent2/Management

4.> In the "Search Administration" page within CA, if I click "Content Sources" I get the error message: Sorry, something went wrong
The search application 'ef5552-7c93-4555-89ed-cd8f1555a96b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information.
I used PowerShell to get the ULS logs for the correlation Id returned on the screen via the CA error message:
PS> Merge-SPLogFile -Path "d:\error.log" -Correlation "ef109872-7c93-4e6c-89ed-cd8f14bda96b"
Output shows:
Logging Correlation Data      Medium Name=Request (GET:http://sp2013-app1:2013/_admin/search/listcontentsources.aspx?appid=b555d269%255577%2D430d%2D80aa%2D30d55556dc57)
Authentication Authorization Medium Non-OAuth request. IsAuthenticated=True, UserIdentityName=, ClaimsCount=0
Logging Correlation Data      Medium Site=/ 05555e9c-5555-555a-69e2-3f65555be9f4
Topology  Medium WcfSendRequest: RemoteAddress: 'https://sp2013-srch2:32844/8c2468a555594301abf555ac41a555b0/SearchAdmin.svc' Channel: 'Microsoft.Office.Server.Search.Administration.ISearchApplicationAdminWebService' Action: 'http://tempuri.org/ISearchApplicationAdminWebService/GetVersion' MessageId:
General    Medium Application error when access /_admin/search/listcontentsources.aspx, Error=The search application 'ef155572-7555-4e6c-89ed-cd8f14bda96b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information. Server stack trace: at System.ServiceModel.Channels.ServiceChannel.ThrowIfFaultUnderstood(Message reply, MessageFault fault, String action, MessageVersion version, FaultConverter faultConverter) at System.ServiceModel.Channels.ServiceChannel.HandleReply(ProxyOperationRuntime operation, ProxyRpc& rpc) at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Ob General  Medium ...(IMethodCallMessage methodCall, ProxyOperationRuntime operation) at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message) Exception rethrown at [0]: at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.ErrorHandler(Object sender, EventArgs e) at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.OnError(EventArgs e) at System.Web.UI.Page.HandleError(Exception e) at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) at System.Web.UI.Page.ProcessRequest(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
General      Medium ...em.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) 05b5559c-4215-0555-69e2-3f656555e9f4 Runtime   tkau Unexpected System.ServiceModel.FaultException`1[[System.ServiceModel.ExceptionDetail, System.ServiceModel, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]: The search application 'ef109555-7c93-444c-85555-cd8f14b5556b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information.   Server stack trace:      at System.ServiceModel.Channels.ServiceChannel.ThrowIfFaultUnderstood(Message reply, MessageFault fault, String action, MessageVersion version, FaultConverter faultConverter)     at System.ServiceModel.Channels.ServiceChannel.HandleReply(ProxyOperationRuntime operation, ProxyRpc& rpc)
Runtime     tkau Unexpected ...s, TimeSpan timeout) at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation) at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message) Exception rethrown at [0]: at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.ErrorHandler(Object sender, EventArgs e) at Microsoft.Office.Server.Search.Internal.UI.SearchCentralAdminPageBase.OnError(EventArgs e) at System.Web.UI.Page.HandleError(Exception e) at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
Runtime       tkau Unexpected ...ProcessRequest() at System.Web.UI.Page.ProcessRequest(HttpContext context) at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously
General    ajlz0 High Getting Error Message for Exception System.ServiceModel.FaultException`1[System.ServiceModel.ExceptionDetail]: The search application 'ef10-555-a96b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information. (Fault Detail is equal to An ExceptionDetail, likely created by IncludeExceptionDetailInFaults=true, whose value is: System.Runtime.InteropServices.COMException: The search application '7c93-555c-89ed-cd8f555da6b' on server SP2013-SRCH2 did not finish loading. View the event logs on the affected server for more information. at icrosoft.Office.Server.Search.Administration.SearchApi.RunOnServer[T]General        ajlz0 High ...   at Microsoft.Office.Server.Search.Administration.SearchApi..ctor(String applicationName)     at Microsoft.Office.Server.Search.Administration.SearchAdminWebServiceApplication.GetVersion()     at SyncInvokeGetVersion(Object , Object[] , Object[] )     at System.ServiceModel.Dispatcher.SyncMethodInvoker.Invoke(Object instance, Object[] inputs, Object[]& outputs)     at System.ServiceModel.Dispatcher.DispatchOperationRuntime.InvokeBegin(MessageRpc& rpc) at System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMess...) Micro Trace uls4 Medium Micro Trace Tags: 0 nasq,5 agb9s,52 e5mc,21 8nca,0 tkau,0 ajlz0,0 aat87 05b16e9c-4215-00ba-69e2-3f656eabe9f4
Monitoring   b4ly Medium Leaving Monitored Scope (Request (GET:http://sp2013-app1:2013/_admin/search/listcontentsources.aspx?appid=be28d269%255577%2D430d%2D555a%2D30d55556dc57)). Execution Time=99.45349199409
Topology    e5mb Medium WcfReceiveRequest: LocalAddress: 'https://sp2013-srch2.demo.local:32844/8c2468a555943555bf42cac555f05b0/SearchAdmin.svc' Channel: 'System.ServiceModel.Channels.ServiceChannel' Action: 'http://tempuri.org/ISearchApplicationAdminWebService/GetVersion' MessageId: 'urn:uuid:c2d55565-bb59-4555-b34b-6555ef1d79a5'

I can see my Admin component on SP2013-SRCH2 is not working, so I turned off the machine hoping search would fail over to the other admin component. Nothing changed and I kept getting the same log errors. I turned the server back on and reviewed the event log on the admin component server (SP2013-SRCH2). The following errors occurred:
Application Server Administration job failed for service instance Microsoft.Office.Server.Search.Administration.SearchServiceInstance + Reason: The device is not ready.
The Execute method of job definition Microsoft.Office.Server.Search.Administration.IndexingScheduleJobDefinition + The search application + on server SP2013-SRCH2 did not finish loading.

Examining the Windows application event logs on SP2013-SRCH2, I noticed event log errors relating to permissions:
A database error occurred. Source: .Net SqlClient Data Provider Code: 229 occurred 0 time(s) Description: Error ordinal: 1 Message: The EXECUTE permission was denied on the object 'proc_MSS_GetConfigurationProperty', database 'SP_Search'
Unable to read lease from database - SystemManager, System.Data.SqlClient.SqlException (0x80131904): The EXECUTE permission was denied on the object 'proc_MSS_GetLease', database 'SP_Search', schema 'dbo'.
The event log showed me the account trying to execute these stored procs. It was my Search service account e.g. demo\sp_searchservice

Initial Hypothesis: It looks like the Admin search service is not starting or failing over. By tracing the permissions I can see the demo\SP_SearchService account no longer has execute SP permissions on the SP_Search database.

By opening the SP_Search database and looking at effective permissions I can see the "SPSearchDBAdmin" role has "effective" permissions over the failing stored procs (proc_MSS_GetConfigurationProperty).

If I look at the account calling the stored proc (demo\SP_SearchService), I can see it is not assigned to the role. This examination leads me to my conjecture that the management tool or someone/something on the farm has caused the permissions to be changed.

Resolution: Change the permissions on the database. To keep permissions minimal, go to the database and add the service account (demo\sp_SearchService) to the SPSearchDBAdmin role.

All the issues recorded in the problem statement were resolved within 5 minutes on the affected farm.

Note: I have a UAT environment in exact sync with my PR environment; UAT has the correct permissions in place already.

Tuesday, 7 January 2014

Office Web App Common Problems & Fixes

There is now an article on WCA on TechNet that includes troubleshooting. Updated 30/04/2014.

Finding your issues: Office Web Apps (WCA) displays an error without a correlation Id. If you have a small WCA farm, you can trawl the WCA ULS logs using ULS eventviewer to find your issue. However, on a busy or large farm finding the ULS errors is tedious. You can use IE developer tools, Fiddler or any other tool, provided you can find the correlation Id sent in the response from the WCA server.

The screenshot below shows how to use Fiddler to find a correlationId returned from the WCA server. In the browser, view the HTTP response and find the "X-CorrelationId" property.

Tip: It is also worth verifying that the WCA farm is reachable and running using https://wcaservername/hosting/discovery
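
Once the response headers have been captured, pulling out the correlation Id and filtering log lines on it is easy to script. The sketch below is a minimal Python illustration; the function names, the sample headers and the sample log lines are all hypothetical.

```python
from typing import Iterable, List, Mapping, Optional


def find_correlation_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-CorrelationId header value (name matched case-insensitively)."""
    for name, value in headers.items():
        if name.lower() == "x-correlationid":
            return value.strip()
    return None


def lines_for_correlation(log_lines: Iterable[str], correlation_id: str) -> List[str]:
    """Keep only the log lines that mention the given correlation Id."""
    needle = correlation_id.lower()
    return [line for line in log_lines if needle in line.lower()]


# Hypothetical captured headers and log lines, for illustration only.
headers = {
    "Content-Type": "text/html",
    "X-CorrelationId": " 11111111-2222-3333-4444-555555555555 ",
}
log_lines = [
    "WOPI CheckFile failed 11111111-2222-3333-4444-555555555555",
    "Unrelated entry aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
]

correlation_id = find_correlation_id(headers)
matches = lines_for_correlation(log_lines, correlation_id)
print(correlation_id)
print(matches)
```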

==========================================

Problem: When opening a Word or PDF document I receive the following error popup "Sorry, there was a problem and we can't open this PDF. If this happens again, try opening the PDF in Microsoft Word.".

Initial Hypothesis: WCA was working and pdf's were opening. The error is similar to the error received when the networking is not correct (WCA machines can't access the SharePoint WFE's). Opening the ULS logs on the WCA machines, I can see the following error message "WOPI CheckFile: Catch-All Failure [exception:Microsoft.Office.Web.Common.EnvironmentAdapters.UnexpectedErrorException: HttpRequest failed ---> Microsoft.Office.Web.Apps.Common.HttpRequestAsyncException: No Response in WebException ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 10.189.xx.15:443"

Resolution: My load balancer was not passing through the HTTPS traffic from the WCA servers to the WFEs. I fixed the load balancer so HTTPS traffic is forwarded correctly. The WCA servers need to speak to the WFE/SharePoint servers on either HTTP or HTTPS depending on how the WCA farm is configured (SSL termination, with SSL or without SSL are the 3 options).
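
A quick way to reproduce the "actively refused" symptom from a WCA server is a plain TCP connection test against the WFE address and port. This is a minimal Python sketch, assuming Python is available on the server; `can_connect` is a hypothetical helper name and the host in the usage comment is an example only.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage (hypothetical host name), run from each WCA server:
#   can_connect("www.demo.dev", 443)   # HTTPS path through the load balancer
#   can_connect("www.demo.dev", 80)    # HTTP path, if the farm is configured for it
```

A False result on the port the WCA farm is configured to use points at the network or load balancer rather than at Office Web Apps itself.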

===========================================

Problem: Can't open any document in WCA and the WCA ULS is generating the following issue:

WOPI CheckFile: Catch-All Failure [exception:Microsoft.Office.Web.Common.EnvironmentAdapters.UnexpectedErrorException: HttpRequest failed ---> Microsoft.Office.Web.Apps.Common.HttpRequestAsyncException: No Response in WebException ---> System.Net.WebException: The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel. ---> System.Security.Authentication.AuthenticationException: The remote certificate is invalid according to the validation procedure.

Initial Hypothesis: On each WCA server, I try to open the SharePoint web application, i.e. https://www.demo.dev, and the browser shows that the certificate has errors ("Certificate error"). Opening up the certificate, the certificate chain looks correct.

Resolution:
1.> Run > MMC > File > Add snap-ins > Certificates > Add > Computer Account > Local Computer > Finish. OK.
2.> In the MMC console navigate to Certificates > Trusted Root Certificate Authorities > Certificates > (Right click) All Tasks > Import (both the trusted root and the intermediary certificates are required).

After adding the missing certificates, open the browser and check the certificate. Thanks to David C for documenting and figuring out that the issue is the certificate chain being used on the WCA servers. Office Web Apps is working again.

===========================================

Problem: When Opening a word document I get the following error when Office Web Apps tries to render the document "There's a configuration problem preventing us from getting your document. If possible, try opening this document in Microsoft Word."

Initial Hypothesis: The error message doesn't tell me much, so I checked the ULS logs on the Web Front End (SharePoint server) and found the following:
WOPI (CheckFile) - Invalid Proof Signature for file SandPit Environment Setup.docx url: http://web-sp2013-uat.demo.dev/Docs/_vti_bin/wopi.ashx/files/6d0f38c0d5554c87a655558da9cedcad?access_token...

Resolution: Run the following PS> Update-SPWOPIProofKey -ServerName "wca-uat.demo.dev"

More Info:
http://technet.microsoft.com/en-us/library/jj219460.aspx

===========================================

Problem: When opening a docx file using WCA (Office Web Apps) I get the following error message: "Sorry, there was a problem and we can't open this document. If this happens again, try opening the document in Microsoft Word."
I then tried to open an Excel document and got the error: "We couldn't find the file you wanted. It's possible the file was renamed, moved or deleted."

Initial Hypothesis: I checked the ULS logs on the only OWA server and found this unexpected error:
HttpRequestAsync, (WOPICheckFile,WACSERVER) no response [WebExceptionStatus:ConnectFailure, url:http://webuat.demo.dev/_vti_bin/wopi.ashx/...
This appears to be a networking-related issue. I have an NLB (KEMP) and I am using a wildcard certificate on the WCA address with SSL termination.

Resolution: The error message tells me that the OWA server can't get back to the SharePoint WFE servers. The request from the WCA/OWA1 server back to the SP front end server is made over http, not https, and my NLB can't deal with traffic on port 80.
I added a host entry on my OWA1 server so that traffic to the SharePoint WFE goes directly to a server by IP, and it works. This means I don't have high availability. An NLB service handling the web application on port 80 will fix my issue.

The OWA servers need to access the web app on port 80. My NLB was blocking all traffic on port 80.
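
A quick way to confirm this kind of blockage is to test TCP reachability from the OWA server to the web application address on both ports. This is a minimal Python sketch under that assumption; `port_open` and `check_ports` are hypothetical helper names, and a successful TCP connection only shows the port is reachable, not that SharePoint answers the WOPI request correctly.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable.
        return False


def check_ports(host, ports=(80, 443)):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_open(host, port) for port in ports}


# Example (run on the OWA server; host name taken from the ULS entry above):
#   print(check_ports("webuat.demo.dev"))
```

If port 443 is reported open and port 80 is not, that matches the situation described here, where the load balancer only handles https.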

==========================================

Problem: The above issue was temporarily fixed by adding a host entry on the WCA1 server so that requests to the web application's URL on port 80 are directed to WFE1. I turned on WCA2 and WFE2, so I now have 2 SharePoint web front ends and a 2-server WCA farm. In my testing I have docx, doc and Excel files. From multiple locations I could open and edit the docx and doc files, but opening the Excel file gave me this issue: "Couldn't Open the Workbook
We're sorry. We couldn't open your workbook. You can try to open this file again, sometimes that helps."

Initial Hypothesis: Documents are cached, and the internal balancing seemed to make Word documents available through Office Web Apps. I assume the requests are being served from the cache or from the OWA server that has the host entry. I need to tell OWA where to go via a host entry or an NLB entry. Note: using a host entry won't make OWA highly available/redundant. This is the same issue as in the problem above.

Resolution: I added a host entry on the WCA2 server that points to the WFE1 machine.
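
Because the fix relies on every WCA server carrying the same host entry, it can help to check what each server's hosts file actually maps the web application name to. The sketch below is an illustration only: `hosts_addresses` is a hypothetical helper, the IP addresses in the sample are made up, and the path in the usage comment is the usual Windows hosts file location.

```python
def hosts_addresses(hosts_text, hostname):
    """Return the IP addresses that a hosts file maps to hostname."""
    wanted = hostname.lower()
    addresses = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], [name.lower() for name in fields[1:]]
        if wanted in names:
            addresses.append(ip)
    return addresses


# Example (hypothetical sample content, not the real entries):
#   sample = "10.0.0.11  webuat.demo.dev  # WFE1\n"
#   hosts_addresses(sample, "webuat.demo.dev")
#
# On a WCA server the real file is usually read with:
#   open(r"C:\Windows\System32\drivers\etc\hosts").read()
```

An empty result on one WCA server and an address on the other would explain why only some requests succeed.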

==========================================

Problem: Opening docx or pptx files in Office Web Apps 2013 results in the error "Sorry, Word Web App ran into a problem opening this document. To view this document please open it in Microsoft Word."

Resolution: I don't like it, but I had to remove the link to the WCA farm, rebuild the WCA farm and hook SP2013 back up to the WCA farm. Other documents were opening, and after the rebuild I realised that my bindings had been incorrect.

============================
Problem: Can't open Word, PDF, pptx or Excel documents using Office Web Apps. The ULS on the OWA servers included these log messages:
WOPI (CheckFile) - Invalid Proof Signature for file.
WOPI Proof: All WOPI Signature verification attempts failed.
WOPI Signature verification attempt failed with public key.
Also found in the logs:
Error message from host: Verifying signature failed, host correlation
HttpRequestAsync (WOPICheckFile,WACSERVER), request failure [HttpResponseCode:NotFound, HttpResponseCodeDescription:Not Found, url:https://www.demo.dev/_vti_bin/wopi.ashx/files/8b07d55558955551beb5555bed545553?access_token=REDACTED_1014&access_token_ttl=1392256555993]

Same issue ULS excerpt:
Error message from host: Verifying signature failed, host correlation:
WOPI CheckFile: Catch-All Failure [exception:Microsoft.Office.Web.Common.EnvironmentAdapters.FileUnknownException: WOPI 404
at Microsoft.Office.Web.Apps.Common.WopiDocument.LogAndThrowWireException(HttpRequestAsyncResult result, HttpRequestAsyncException delayedException)

FileUnknownException while loading the app.

Hypothesis: None. I can't understand why the SP and WCA farms are struggling to communicate. I believe the cause is related to load balancing or network changes, but I am not certain.

Resolution: Remove the link between the SP farm and the WCA farm:
PS> Remove-SPWOPIBinding -All:$true
Connect SP to the WOPI farm:
PS> $internalName = "wca.demo.dev"
PS> $internalZone = "internal-https"
PS> New-SPWOPIBinding -ServerName $internalName -AllowHTTP
PS> Set-SPWOPIZone -Zone $internalZone

==============================
Problem: Intermittently, requests are not returning the PDF/Word documents. Most requests work, but occasionally 1 request doesn't. Every 4th request tries to display the document in Office Web Apps for a few minutes without any error message, then stops trying and displays the message "Sorry, Word Web App can't open this ... document because the service is busy."

Initial Hypothesis: Originally I thought it was only happening to PDFs, but it is happening to both Word and PDF documents (I don't have Excel docs in my system). My monitoring software, SolarWinds, is badly configured on my OWA servers: the monitor shows green, but drilling into the server monitoring shows that the 2 application monitors are failing. The server should go amber if either of the 2 applications fails, and red after 5 minutes. At this point I notice that I can't log on to one of my 4 OWA/WCA servers, and web requests to it are not being returned. My KEMP load balancer reports that all 4 WCA servers are working, but its health check is configured on ping rather than on web requests (not right), so the NLB/KEMP is simply sending every 4th request to the broken server.

Resolution:

Reboot the broken server. Once it comes up, I can make http requests directly to the server at http://wca.demo.dev/hosting/discovery and it's all working again.
SolarWinds monitoring is lousy and needs to be fixed.
The KEMP hardware load balancer needs to be changed from checking that each machine is on (ping) to checking each machine with a web request.
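
To illustrate the difference between a ping check and a web-request check, here is a minimal Python sketch of the kind of test a health probe could perform against each WCA server. It is not the KEMP configuration itself. The helper names are hypothetical, and it assumes the discovery endpoint returns an XML document whose root element is `wopi-discovery`.

```python
import http.client
import urllib.request
import xml.etree.ElementTree as ET


def is_discovery_document(status, body):
    """True if the response looks like a WOPI discovery document."""
    if status != 200:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    return root.tag == "wopi-discovery"


def server_healthy(base_url, timeout=5.0):
    """GET <base_url>/hosting/discovery and check the answer, not just reachability."""
    url = base_url.rstrip("/") + "/hosting/discovery"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_discovery_document(response.status, response.read())
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and timeouts are OSError subclasses.
        return False


# Example (per-server names are hypothetical; use each machine's own address,
# not the load-balanced one, so a single broken server cannot hide):
#   for name in ("wca1.demo.dev", "wca2.demo.dev"):
#       print(name, server_healthy("http://" + name))
```

A server that still answers ping but has stopped serving web requests would fail this check, which is the case the ping-based configuration missed.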