
Wednesday 6 September 2023

App Insights for Power Platform - Part 11 - Custom Connector

Overview:  Power Automate can set retry policies on custom connectors, but a Canvas app using a Custom Connector has no retry configuration.  FYI: if the Custom Connector gets a 5xx error it retries 4 times, proven using 500 and 502 errors.  408 (timeout) and 429 (too busy) errors also appear to be retried 4 times (the retry is driven by the Canvas app; calling the Custom Connector from its test rig only tries once).

My example:  My Canvas app uses a custom connector that calls my Azure Function, which in turn calls my APIM, and APIM calls the 3rd party API.


My Azure Function returns a 500, 502, 408 (response timeout) or 429 if the response has not been received within 10 seconds, and I push the HTTP code back to the Custom Connector.  I can see the response from the 3rd party is taking roughly 35 seconds, and the network traces show 4 invoked calls that all fail with 408 HTTP response codes.

Result:  The custom connector retries 4 times resulting in my Power App being locked for 40+ seconds.
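A quick way to confirm this in the logs is to query the Azure Function's App Insights for the failing calls; a sketch, assuming each invocation lands in the requests table (bucketing by minute so the 4 retried calls from a single Canvas app action group together):

requests
| where timestamp > ago(1h)
| where resultCode in ("408", "429", "500", "502")
| summarize retries = sum(itemCount) by operation_Name, resultCode, bin(timestamp, 1m)
| order by timestamp desc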

Possible remediation

1. Return a 418 HTTP response code (I'm a teapot): my app then calls the function via the custom connector only once.  So 418 (also tested with 400, but 400 is not the right response) behaves differently from the 5xx, 408 and 429 errors from the API.

Note: Originally I was caught out because I was triggering via the custom connector test rig, which only tries once.  When called from Power Apps, it will try four times.  Returning 408 is not a fix.  Returning 418 ensures the call is only tried once, gives a better user experience, and now I have 418 logs that I can add error details to.

2. 3rd party API should not take 35 seconds, improve it.  

3. I could set the timeout on the specific function to 40 seconds; however, if the call starts taking 41 seconds, my Canvas app will be locked for over 160 seconds (4 retries x 40 seconds).

4. Go through all the APIs and, where they are fast, set the timeout as short as possible so the app does not get locked while waiting for the 4 responses.

Summary: Examine the 3rd party APIs and get them stable, performant and within the agreed SLAs.  If you only want to try once, ensure the timeouts on the 3rd party calls are set to the SLA, or, if you intercept the request, choose the timeout yourself; by examining the APIs you can see the optimum timeout to avoid retries, and by using the 418 response code the call only happens once.

Series

App Insights for Power Platform - Part 1 - Series Overview 

App Insights for Power Platform - Part 2 - App Insights and Azure Log Analytics 

App Insights for Power Platform - Part 3 - Canvas App Logging (Instrumentation key)

App Insights for Power Platform - Part 4 - Model App Logging

App Insights for Power Platform - Part 5 - Logging for APIM 

App Insights for Power Platform - Part 6 - Power Automate Logging

App Insights for Power Platform - Part 7 - Monitoring Azure Dashboards 

App Insights for Power Platform - Part 8 - Verify logging is going to the correct Log analytics

App Insights for Power Platform - Part 9 - Power Automate Licencing

App Insights for Power Platform - Part 10 - Custom Connector enable logging

App Insights for Power Platform - Part 11 - Custom Connector Behaviour from Canvas Apps Concern (this post)

Sunday 30 July 2023

Latency Metrics for API's

Overview:  More and more software is built on APIs, and we often need to know which are our slowest-performing APIs and how important they are.  Monitoring latency is how we determine performance and performance issues.  We need to know the fastest, most used, slowest, and average time to complete, and you need to look at all of these to get a full picture.  Latency percentile metrics let us know what percentage of requests fall into a given range.

For instance, if your API endpoint averages 1 second across all requests (10k) over an hour, that sounds okay; but if the majority of requests (say 90%) return little data quickly, the slowest 10% of requests could be averaging 5 seconds.  Percentile metrics take out the slowest percentage of requests and show the faster performers, so in this scenario the slowest 10% of requests are excluded.  This is often referred to by the percentage, e.g. P90 i.e. 90%.  I normally use 95%/P95, but it's becoming more common to use 99%/P99 or even P99.9.
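In App Insights the percentile figures can be pulled straight from the requests table; a rough sketch, assuming your API calls are logged as requests:

requests
| where timestamp > ago(24h)
| summarize requestCount = sum(itemCount),
            avgDuration = avg(duration),
            percentiles(duration, 50, 90, 95, 99) by operation_Name
| order by requestCount desc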

Sunday 2 January 2022

App Insights Overview for SaaS logging and tracing

Overview:  App Insights provides independent infrastructure for logging and tracing activities.  It is tightly coupled with Azure services, including PaaS, which allows for consistent, scalable logging.  App Insights now stores logs in Azure Log Analytics; these are all under the umbrella of Azure Monitor.

On a SaaS solution, I am looking for App Insights to log any errors and to have the ability to log trace information.  I want a unique correlationId (to allow for distributed tracing) on the front end so that, if there is an error, support can identify the exact issue/transactions.  A unique correlationId in the HTTP header allows a transaction to be identified, which is useful for tracing and performance monitoring.  Using the App Insights SDKs and implementing a common logging module is a good idea.  There are two common areas that need calling out to ensure the ability to trace transactions:

  1. SPAs (there is a requirement to generate a unique operation/correlationId per operation, not per pageview), and
  2. Long-running operations such as timer jobs or service bus calls.

Support & DevOps:

Having a correlationId allows first-line support to log the correlationId and quickly follow the request without asking for replication steps.  This context-tracing approach is common in newer applications.  Third-line support then has full traceability of an issue and can empirically see the perceived performance broken down, using the correlationId from the header.
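As a sketch, if the correlationId surfaced to the user maps to the App Insights operation_Id (adjust the filter if your logging module stores it in customDimensions instead), a single transaction can be followed end to end:

union requests, dependencies, traces, exceptions
| where timestamp > ago(7d)
| where operation_Id == "<correlationId reported by the user>"
| project timestamp, itemType, name, resultCode, duration
| order by timestamp asc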

Key APIs can be continuously monitored for errors and slowdowns in performance, and alerts can be configured around this monitoring.
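For example, a log-based alert rule could be backed by a query of this shape (a sketch; the 5% failure rate and 5-second thresholds are placeholders to tune per API):

requests
| where timestamp > ago(15m)
| summarize total = sum(itemCount),
            serverErrors = sumif(itemCount, toint(resultCode) >= 500),
            avgDurationMs = avg(duration) by operation_Name
| extend errorRate = todouble(serverErrors) / total
| where errorRate > 0.05 or avgDurationMs > 5000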

Building a first-line support tool that displays the errors in a hierarchy and has help scripts and knowledge bases is a good option for streamlining support.
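A sketch of the kind of query such a tool could sit on top of - group exceptions so recurring problems float to the top of the hierarchy:

exceptions
| where timestamp > ago(24h)
| summarize occurrences = count(), affectedOperations = dcount(operation_Id), latest = max(timestamp)
    by problemId, type
| order by occurrences desc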

App Insights has live monitoring, and the Kusto query language is useful for monitoring with specific queries.


Summary Report for Support

// I'm sure there are nicer ways to write/improve my Kusto, so pls let me know where the code can be improved
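// Summary report: compare the average duration of slow requests (grouped by performance bucket)
// for today, yesterday and two days ago.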
let dayminus0 = now();
let dayminus1 = ago(24h);
let dayminus2 = ago(48h);
let result0 = requests
    | where timestamp > dayminus1 and timestamp < dayminus0
    | summarize requestCount=sum(itemCount), avgDuration=avg(duration) by performanceBucket
    | where performanceBucket == "15sec-30sec" or performanceBucket == "7sec-15sec"
        or performanceBucket == "30sec-1-min" or performanceBucket == "1min-2min";
let dayminus1a = ago(24h);
let dayminus2a = ago(48h);
let result1 = requests
    | where timestamp > dayminus2a and timestamp < dayminus1a
    | summarize requestCount1=sum(itemCount), avgDuration1=avg(duration) by performanceBucket
    | where performanceBucket == "15sec-30sec" or performanceBucket == "7sec-15sec"
        or performanceBucket == "30sec-1-min" or performanceBucket == "1min-2min";
let dayminus1b = ago(2d);
let dayminus2b = ago(3d);
let result2 = requests
    | where timestamp > dayminus2b and timestamp < dayminus1b
    | summarize requestCount2=sum(itemCount), avgDuration2=avg(duration) by performanceBucket
    | where performanceBucket == "15sec-30sec" or performanceBucket == "7sec-15sec"
        or performanceBucket == "30sec-1-min" or performanceBucket == "1min-2min";
let resultTemp = result0
    | join kind=inner result1 on performanceBucket 
    | project performanceBucket, ['Today'] = avgDuration, ['Yesterday'] = avgDuration1;
resultTemp
| join kind=inner result2 on performanceBucket 
| project
    performanceBucket,
    ['1) Today']= (round(['Today'], -2) / 1000),
    ['2) Yesterday'] = (round(['Yesterday'], -2) / 1000),
    ['3) Two Days ago'] = (round(avgDuration2, -2) / 1000) 
| render columnchart
    with (
    kind=unstacked,
    ytitle="Seconds Taken",
    xtitle="Performance Group",
    title="Ensure the 'Today' bar is not significantly higher than pervious days");


Monitoring:  Azure dashboards are great for monitoring application health and performance.  They are easy to customise to make unique dashboards, and security is easy to control.  sentry.io monitors APIs, but I have not used it.  I like all the Azure tooling coming out for testing, and I feel continuously running Postman collections and reporting to App Insights is the best way to go.  Azure Dashboards can be limiting; Azure Grafana can be a great alternative/enhancement - check out Azure Managed Grafana.
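If the Postman/availability runs are pushed into App Insights as availability results (an assumption about the setup), uptime and response time can be charted per test:

availabilityResults
| where timestamp > ago(7d)
| summarize successRate = avg(todouble(success)) * 100, avgDurationMs = avg(duration)
    by name, bin(timestamp, 1d)
| order by timestamp desc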

Alerting: All too often I see an overuse of alerting, resulting in recipients ignoring a plethora of emails.  I believe in minimising alerts, especially via email and SMS-type messaging.  I like to create a dedicated channel for alerting that includes all DevOps members and either notify via a Teams card or, even easier, email the channel.  This can be broken down further, but to start I create an alerting channel for each DTAP environment.

Note: The default channel setup only allows members of the Teams channel to send email, so alerts from Azure Monitor alert rules won't be accepted.  On the channel, an admin needs to go to the "advanced settings" and change the option from "Only members of this Team" to "Anyone can send".

Options:  There are several great services for logging; my default tends to be Azure Monitor.  The main players in application & API observability and monitoring include: 

  • Microsoft: Azure Monitor includes Application Insights & Azure Log Analytics
  • Dynatrace (really good if you use multicloud) or Dynatrace with AWS CloudWatch.  The Dynatrace SaaS offering runs on AWS, and it can also be on-prem.  OneAgent is deployed on the compute, i.e. VM or Kubernetes.  It can import logs from other SIEMs or Azure Monitor, so you can eventually get Azure service logs such as App Service or Service Bus.  It does full-stack monitoring, including code-level, application and infrastructure monitoring, and can also show user monitoring.  Dynatrace offers scalable APIs that sit on Kubernetes.  "Davis" is the AI engine used to help figure out the problems.  Alerting is solid.
  • AWS: Amazon CloudWatch Synthetics
  • AppDynamics
  • Datadog (excellent)
  • New Relic
  • SolarWinds (excellent)

Tuesday 1 December 2020

Testing your home Internet Speed using your iPhone

Problem:  Broadband providers offer various speed options when purchasing; the actual speeds you get are usually well below those and depend on your specific installation. 

Initial Hypothesis:  iOS has multiple apps to monitor speed to your iPhone.  

Resolution:  Download "Speedtest" from the App Store on any Apple device.  5G performance is fantastic.

Below are my results; I live in South West London (Zone 4).


Connection     Download (Mbps)   Upload (Mbps)   Location
Sky phone      34.80             5.72            SW London
EE 4G - LTE    13.00             0.13            SW London
O2 4G - LTE    16.90             10.20           SW London
EE - 5G        372.00            19.80           Newcastle


Thoughts:
Speed tests vary greatly, so it is worth doing at least 3 to get an average.  70 Mbps download on EE 4G is very possible, so the download speeds can be as good as my Sky broadband.  5G performance is fantastic - the MiFi/5G routers are going to be awesome when 5G rolls out to my area.  Using O2 and EE at my home, O2 is faster down, and interestingly the upload speed on O2 is amazing (the O2 tower is far better positioned).

Update 15/03/2023
Nice package to check Internet upload and download speed on Mac, Windows and Linux that I got from Tobias Zimmergren

Install cmd PS>  npm install --location=global fast-cli
Run cmd> fast -u

Saturday 14 January 2017

Performance Testing SharePoint

Problem: Once again, performance testing has caused concerns on a project.  There are various methods for calculating how many users a system can deal with.

Description:

Non-functional requirements are key to determining how "performant" the SharePoint farm needs to be to deal with peak loads.

Load testing allows us to mimic various users and see when the site/farm performance starts to degrade.  A good idea is to identify all the possible actions the users will perform; items like Search are far more resource intensive than clicking on a link.

Average visits per hour = (5,000 average visitors/day) / 10 hours = 500
Page requests per hour = (Ave visits/hr * 5 ave page requests/visit) = 500 * 5 = 2,500 (roughly 0.7 page requests per second)
The example can be broken down further by assuming that of each user's 5 requests, 3 are for pages, 1 is a search and the last is viewing a document.
Recording this scenario with wait times provides a basic load test whereby the user numbers can be increased at 5-minute intervals.

Thursday 8 October 2015

Performance Testing Check List

Rough Notes

Performance Factors Checklist:

  1. Geography and networking - Australia and Africa users always tend to have issues regardless of the centralized SharePoint farms they are accessing.  Poor networking, especially in remote satellite offices, is not a SharePoint issue, but as enterprises grow these are the pains they need to work out.
  2. Security and network design
  3. Usage patterns (Working with documents in the UK after noon when the US comes online may have a peak usage rate 5 times higher than when Australia starts in the morning)
  4. Functionality (Search is heavier than displaying a static web page)
  5. Application design (too many web service calls when 1 call could do all the work) and application implementation.  The application needs to meet the functional and non-functional requirements.  Coding errors creep in; the implementation needs to meet the design, and unnecessary code should be improved.
  6. Platform (a 2-server VM farm will only take a set load no matter how well it is tweaked)


Figure out where the bottlenecks are and how big the impact is:

  • SQL is a common performance bottleneck; check you have optimized SQL
  • Monitor all servers in your farm; randomly throwing more RAM at the solution is pointless most of the time.
  • Size of lists in SP
  • Size of Content
  • Physical architecture
  • Search - design, how many items are crawled and are they all needed, multiple search farms, what is going on, AD groups are preferable to individual users
  • WAN latency Testing
  • Baseline SP testing
  • Peak loads and what is affected (Monitor and prove where the issues are).  Concurrency and usage patterns.
  • Archiving, document retention, document versioning, unwanted general storage area, recycle bin, can we clean up
  • Optimize highly used pages - for example, if your home page leverages search and the search service and farm are therefore under load, move the search component into a cached solution or pull it off the home page.
  • Auditing logs - do you log enough, do you log too much, and can you pull them out of the Content DB and store them elsewhere?
  • Encryption (TDE, SSL, Devices)
  • Web page basics: image size, CDNs, JavaScript optimisation, sprites

Thursday 20 August 2015

Non Functional Testing for SharePoint

Overview:  Functional requirements are the business requirements that the business defines for the application being built.  Non-functional testing is concerned with performance, reliability, scalability, recovery, load, security and usability testing.  For SharePoint it is a good idea to test this at a platform level and then verify that the individual application's non-functional testing is appropriate.

SharePoint Non Functional Testing:
All of these tests should be performed against your various SharePoint platforms and will dictate the SLAs offered to the business using SharePoint as a service.  Baseline testing is a good idea, as the differences can be used to determine the efficiency of the individual application being created.

Proxies:
Fiddler is my favourite (other tools for capturing web traffic include Charles, Burp Suite, Wireshark, and the developer tools shipped with the browsers).  tcpdump and gopacket are awesome for network monitoring.

Use Fiddler to:
  • Observe traffic (http/https requests, headers,..)
  • Replay sessions, 
  • Evaluate performance,
  • Set breakpoints
A common misconception is the difference between performance and load testing.  

Performance testing is primarily concerned with looking at typical user usage scenarios and seeing how long each page takes to load, so a recorded script with wait times between recorded calls is useful.  It's also worth looking at the same page with minimal data and with large amounts of data.  

Load testing involves recording the standard users' interactions (ensure some users query heavy and some light amounts of data), with no wait times.  The norm is to multiply this concurrent number of users by 100 to estimate the number of users the farm/business application can support, e.g. a concurrent load of 100 users where the performance is acceptable means the farm should handle 10,000 users (100 users times 100).  You run the 100 users in a steady state for a few hours.

Stress testing is similar to load testing, but you keep stepping up the number of concurrent users until the SharePoint farm starts running out of resources and throttling requests.  The point at which the system degrades and performance is no longer acceptable is pretty much how many users the application on the SharePoint farm can support.  So if performance throttles at 170 concurrent users, the farm can handle 17,000 normal-usage users.

References:
http://www.guru99.com/non-functional-testing.html

Friday 13 March 2015

Capturing NFRs for SharePoint

Problem: Gathering Non-Functional Requirements (NFRs) is always tricky in IT projects.  This is because it is always difficult to estimate how the system will be used before you build it.  I often get business users stating extreme NFRs in an attempt to negotiate or to show how world class they are (I generally think the opposite when hearing unreasonable NFRs). 

An example is a CIO at a fairly small NGO telling me the on-prem SP 2010 infrastructure needs to be up all the time, so an SLA of 99.99999%.  This equates to about 3.2 seconds of downtime a year.  In reality, higher SLAs start to cost a lot of money.  SP2013 and SQL 2012 introduced AlwaysOn Availability Groups (AOAG), which help improve SLA uptime, but this costs in licensing, infrastructure and management.  I need redundancy and the ability to deal with performance issues, so the smallest possible farm consists of 6 servers, 2 for each layer in SP, namely: WFE, App and SQL.

Here is an old post on SP2010 SLAs, but it is still relevant today.

The key is to gather your NFRs and ensure all your usage/applications on the production farm meet expected behaviours.  I have a checklist below.  Going through Microsoft's SP Boundaries, Limits and Thresholds document will help highlight any issues.

The high level items I cover include the following topics:
  • Availability
  • Capacity
  • Compatibility (Browser, device, mobile)
  • Concurrency
  • Performance
  • Disaster Recovery (RTO, RPO)
  • Scalability
  • Search
  • Security
  • SLA

Capacity Example

Item                       Day 1     Year 1       Year 3       Year 5
Site Collections           10        100          250          400
Database Size              < 1 GB    490 GB       1220 GB      1960 GB
Search Index Size          < 1 GB    120 GB       310 GB       490 GB
No of Content Databases    1         1            4            8
No of Search Items         10,000    10 Million   25 Million   40 Million
No of Index Partitions     1         1            3            4


Item               Day 1    Year 1    Year 2    Year 3
Number of Users    1,000    50,000    80,000    130,000

*Also calculate peak and average concurrency numbers

Average concurrency: for 20,000 users, the assumption is that 10% (2,000 users) will be actively using the solution at the same time, and that 1% of the total user base (200 users) will be actively making requests.  For performance testing you are therefore looking to handle 200 concurrent requests without delays and with a page response time of under 5 seconds.  This is based on the simple guideline I've always used from Microsoft.

Peak concurrency depends on your situation; for example, the NFL playoff game schedule when it is announced does not fit the simple 4-times-average-concurrency figure that would be suitable for most internal business applications, although this example may be considered a load spike rather than peak concurrency.  

It is also worth doing a usage distribution pattern for your users' experience: 80% may be light users who log in, read 10 pages on your site and perform a single search, with 1-minute gaps between interactions (wait times); the remaining 20% log in, upload a 100 KB document, view 10 pages and perform 2 searches.

RPO & RTO:

RPO - Max amount of lost data (in time)
RTO - Max time lost (rebuild farm and get the latest backups restored) to make the system operational again.   

SQL Server Sizing:
Option 1: Work out the bytes per row for each table, multiply by the number of rows, and then add the tables together to get the size.
Option 2: Assume 100 bytes for each row, count the number of rows and get the storage requirement, e.g. 10 million rows x 100 bytes is roughly 1 GB.

More Info:
https://technet.microsoft.com/en-us/library/ff758647.aspx

Sunday 11 January 2015

Minification Tooling

Overview: Minification is the process of combining multiple CSS or JS files and removing whitespace and comments to improve website performance.


Tools:
YUI Compressor (Yahoo)
Web Essentials (Microsoft)
Mavention (Microsoft)
Grunt
jscompress.com
Google Code Compress (Google)


I'd always go for one of the Microsoft tools, Web Essentials or Mavention, as they plug into Visual Studio; as a SharePoint guy this would be my preferred option.  Both of the MS tools appear to use the same engine, as the compression appears identical, working out to roughly 60% on both CSS and JS compression.

Thursday 28 August 2014

Monitoring SharePoint Public Websites

Overview:  This post is applicable to public websites, not just SharePoint; I have used the product for SharePoint and feel it is a good product.  The principle will apply to other monitoring products and services.

AlertFox is a SaaS monitoring service.  It allows me to monitor various websites using HTTP posts or complicated macros that perform various steps, such as logging into a website using ACS.  This differs from an internal monitoring service such as SolarWinds, but it definitely has its place.  I discuss various monitoring options in this post.

The benefits are:
  1. You are notified when the site is down and what the issue is from a web request point of view.
  2. You are monitoring externally, so you can see what your customers see.
  3. You can see if your response times are slowing down.
  4. You keep the IIS web servers warmed up (useful if you have an app pool recycle).
  5. Easy to monitor and you can set up alerts.
  6. Complex scenarios can be accounted for in testing so you know the complex parts of your site are working.
Image 1. See when you have problems, what the issue is and when it occurred.


Image 2. Verify the performance from around the world

Image 3. Check uptime

 

Thursday 31 October 2013

Monitoring SharePoint 2013

Overview: SharePoint farms have several dependencies so to effectively monitor your farm there are a lot of components to review.  Basically there are 2 forms of monitoring: Preventative & Reactive.  Also see my post on performance monitoring SP 2013.

Preventative monitoring is reviewing your SharePoint estate to try to identify issues before they occur.  A simple example would be identifying that your database is running out of storage space, allowing you to remediate before the system fails.

Reactive monitoring is ensuring you are notified as early as possible so you can identify the root cause and fix it with minimal downtime.  This can be as simple as waiting for the help desk to escalate that the farm is down.  In maturing this, a simple set of web requests that alerts the administrators as soon as a service is down is an improvement.

Monitoring needs to be a combination of preventative and reactive monitoring done via automation and manual verification.  As the automation piece improves, there is less reliance on the manual monitoring.

Tooling:
There is a wealth of tools such as SolarWinds and its competitors: "Enterprise IT management from such vendors as CA Technologies (Unicenter), BMC, IBM (Tivoli), and Hewlett-Packard (OpenView)" (Wikipedia).

SolarWinds Monitoring Screen

Idera has monitoring tools specifically for SharePoint.  SolarWinds is a good option for monitoring SP farms and their dependencies: Windows OS, machine resources, SQL, SP, WAC/Office Web Apps.  Couple this with web monitoring and you get a comprehensive reactive and preventative monitoring solution.  This will tell you, before collapse, if the server, OS, SQL or SP is slowing down or running out of resources.  If any partial or complete service stop occurs, the operations team is notified and it is highlighted where the error is, as opposed to "it's not working".

AvePoint's DocAve 6 has a solid monitoring tool for SharePoint and the servers, so if you already have DocAve this would be my choice.  The UI gets jumbled on big farms, but overall the tool is easy to use and does a solid job.

Metalogix's Diagnostics Manager looks like a nice tool.  It is very similar to the DocAve Monitor, but you don't need to deploy any pieces onto the farm.  The UI can be a bit busy, but it is definitely a product to look at.
Metalogix's Diagnostic Manager sample screen shots

Other tools include ExtraHop's traffic monitor: by checking how long responses take, it determines with minimal interference how well the components/nodes in the infrastructure are performing.  DocAve 6.3 has a good monitoring solution specific to SharePoint; it will monitor down to the OS and report on CPU and memory.

AlertFox is a monitoring service (SaaS).  It can perform HTTP GET requests at regular intervals (e.g. every 5 minutes).  This ensures your web servers stay warm, measures the response time from various locations around the world, and can check the speed of multi-step actions such as logging into your website and performing a search.  There are a lot of these services, but AlertFox is good.  It has dashboards, email and SMS notification included. 

SharePoint Best Practices Analyser - CA > Monitor > Health Analyser provides a good place to see common problems.

Event Viewer & ULS logs are also good places to do reactive and even preventative monitoring; however, these logs will need to be trawled manually.

Key items to monitor for me are:
1.> OS/VM: CPU, OS memory, OS disk capacity/utilisation.
2.> Windows Services: Each role needs a set of services, so your monitoring tool can verify they are working.  Examples of the services for SharePoint servers are shown in Appendix A.  If you have agents such as DocAve from AvePoint, verify these are running.  Office Web Apps 2013's service is WACSM.
3.> SQL Server: Verify the services are running, monitor SQL performance ...
4.> SharePoint: Verify web requests are returning results and measure TTL, as this may indicate a bottleneck is starting to occur.  If you are using multiple front-end servers, check each server is working.

Appendix A. SharePoint 2013 Services to Monitor

WFE & APP Roles
Service                              Name                              Status    Startup Type   Log On As
SharePoint Administration            SPAdminV4                         Started   Automatic      Local System
SharePoint Search Host Controller    SharePointSearchHostController    -         Disabled       Network Service
SharePoint Server Search 15          OSearch15                         -         Disabled       Local System
SharePoint Timer Service             SPTimerV4                         Started   Automatic      Demo\Sp_farm*
SharePoint Tracing Service           SPTraceV4                         Started   Automatic      Demo\Sp_Service*
SharePoint User Code Host            SPUserCodeV4                      -         Disabled       Demo\Sp_farm*

Search Role
Service                              Name                              Status    Startup Type   Log On As
SharePoint Administration            SPAdminV4                         Started   Automatic      Local System
SharePoint Search Host Controller    SharePointSearchHostController    Started   Automatic      Demo\SP_SearchService
SharePoint Server Search 15          OSearch15                         Started   Manual         Demo\SP_SearchService
SharePoint Timer Service             SPTimerV4                         Started   Automatic      Demo\Sp_farm*
SharePoint Tracing Service           SPTraceV4                         Started   Automatic      Demo\Sp_Service*
SharePoint User Code Host            SPUserCodeV4                      -         Disabled       Demo\Sp_farm*