Radimaging Ltd - Paul Beck's Technical Working Notes for Microsoft Technology: SolarWinds

Showing posts with label SolarWinds. Show all posts

Sunday, 2 January 2022

App Insights Overview for SaaS logging and tracing

Overview: App Insights provides independent infrastructure for logging and tracing activities. It is tightly coupled with Azure services including PaaS. This allows for consistent scalable logging. App Insights now stores logs in Azure Log Analytics, these are all under the umbrella of Azure Monitor,

On a SaaS solution, I am looking for App Insights to log any errors have the ability to log trace information. I want a unique correlationId (to allow for distributed tracing) on the front end if there is an error so support can identify the exact issue/transactions. A unique correlationId in the http header allows for identifying a transaction and this is useful for tracing and performance monitoring. Using the App Insights SDK's and implementing a common logging module is a good idea. There are two common areas that need call out to ensure the ability to trace transactions:

SPA's (Requirement to generate a unique operation/correlationId per operation not per pageview), and
Long running operation such as timer jobs or service bus calls.

Support & DevOps:

Having a correlationId allows first line to log the correlationId and quickly follow the request without asking for replication steps. This context tracing approach is common on newer applications. Third line support has full traceability of an issue to support who can empirically see the perceived performance parts broken down using the correlationId in the header.

Key API's can be continuously monitored for errors and slow down in performance, alerts can be configured around this monitoring.

Building a first line support tool that displays the errors in a hierarchy, has help scripts and knowledge bases is a good option for streamlining support.

App Insights has live monitoring and also has Kusto query language is useful for monitoring specific queries.

Summary Report for Support

// I'm sure there are nicer ways to write/improve my Kusto, so pls let me let me know where the code can be improved

let dayminus0 = datetime(now);

let dayminus1 = ago(24h);

let dayminus2 = ago(48h);

let result0 = requests

| where timestamp > dayminus1 and timestamp < dayminus0

| summarize requestCount=sum(itemCount), avgDuration=avg(duration) by performanceBucket

| where performanceBucket == "15sec-30sec" or performanceBucket == "7sec-15sec"

or performanceBucket == "30sec-1-min" or performanceBucket == "1min-2min";

let dayminus1a = ago(24h);

let dayminus2a = ago(48h);

let result1 = requests

| where timestamp > dayminus2a and timestamp < dayminus1a

| summarize requestCount1=sum(itemCount), avgDuration1=avg(duration) by performanceBucket

| where performanceBucket == "15sec-30sec" or performanceBucket == "7sec-15sec"

or performanceBucket == "30sec-1-min" or performanceBucket == "1min-2min";

let dayminus1b = ago(2d);

let dayminus2b = ago(3d);

let result2 = requests

| where timestamp > dayminus2b and timestamp < dayminus1b

| summarize requestCount2=sum(itemCount), avgDuration2=avg(duration) by performanceBucket

| where performanceBucket == "15sec-30sec" or performanceBucket == "7sec-15sec"

or performanceBucket == "30sec-1-min" or performanceBucket == "1min-2min";

let resultTemp = result0

| join kind=inner result1 on performanceBucket

| project performanceBucket, ['Today'] = avgDuration, ['Yesterday'] = avgDuration1;

let resultTemp2 = resultTemp;

resultTemp2

| join kind=inner result2 on performanceBucket

| project

performanceBucket,

['1) Today']= (round(['Today'], -2) / 1000),

['2) Yesterday'] = (round(['Yesterday'], -2) / 1000),

['3) Two Day ago'] = (round(avgDuration2, -2) / 1000)

| render columnchart

with (

kind=unstacked,

ytitle="Seconds Taken",

xtitle="Performance Group",

title="Ensure the 'Today' bar is not significantly higher than pervious days");

Monitoring: Azure dashboards are great for monitoring application health and performance. Easy to customise, make unique dashboards and security is easy to control. sentry.io monitors API's, I have not used it. I like all the Azure stuff coming out for testing and I feel continuously running Postman collections and reporting to App Insights is the best way to go. Azure Dashboards can be limiting, Azure Grafana can be a great alternative/enhancement. Check out Azure Managed Grafana.

source cloudiqtech

Alerting: I all to often see an overuse of alerting resulting in recipients ignoring a plethora of emails. I believe in minimising alerts especially via email, and SMS type messaging. For me, I like to create a dedicate channel for alerting that includes all DevOps members and either notify via a Teams card, and even easier is to email the channel. This can be broken down further but to start I create a channel for alerting for each DTAP environment.

Note: The default channel setup only allows members of the teams channel to send email so the alerts from Azure monitor using rules won't be accepted. On the channel, and admin needs to go to the "advance settings" and change the option from "Only members of this Team" and change it the setting to "Anyone can send".

Options: There are great services for logging so my default tends to be Azure Monitor. All the main vendors support Open Telemetry. The main players in Application & API observability and monitoring include:

Microsoft: Azure Monitor includes Application Insights & Azure Log Analytics
Dynatrace (really good if you use multicloud) or Dynatrace AWS cloudwatch, Dynatrace - Saas offering is on AWS. Can be on-prem. OneAgent is deployed on the Compute i.e. VM, Kubernetes. Can import logs from other SIEMs or Azure Monitor, so you can eventually get Azure service logs such as App Service or Service Bus. Does Full stack and includes code-level and applications and infrastructure monitoring, also can show User monitoring. Dynatrace offers scalable API's that are sitting on Kubernetes. "Davis" is the AI engine used to help figure out the problems. Alerting is solid.

High-level Architecture

Dynatrace Admin Monitoring

AWS: Amazon CloudWatch Synthetics
AppDynamics,
Datadog (excellent),
New Relic,
SolarWinds (excellent)

SolarWinds admin UI from circa 2013/2014

Dynatrace

Wednesday, 30 April 2014

OWA intermittently not returning office documents in Office Web Apps 2013

Problem: Intermittent requests are not returning the pdf/word documents. Most requests are working and occasionally 1 request doesn't work. Every 4th request tries to get the pdf to display on Office Web Apps for a few minutes without any error message and then stops trying and displays the message "Sorry, Word Web App can't open this ... document because the service is busy."

I have 4 OWA/WCA servers on a stretched farm being used by SP2013 etc.

Initial Hypothesis: Originally I thought it was only happening to pdfs but it is happening to word and pdf documents (I don't have excel docs in my system). My monitoring software SolarWinds is badly configured on my OWA servers as the monitor is showing green, drilling down into the servers monitoring; the 2 application monitors are both failing. The server should go amber if either of the 2 applications monitoring fails and in turn red after 5 minutes. At this point I notice that I can't log onto my 4 OWA/WCA server. Web request are not being returned. I look at my KEMP load balancer and it says all 4 WCA servers are working, I notice the configuration is not on web requests but on ping (not right) and the NLB/KEMP is merely redirecting every 4th request to the broken server.

Resolution:

Reboot the broken server, once it comes up I can make http requests directly to url http://wca.demo.dev/hosting/discovery on the rebooted server.
SolarWinds monitoring is lousy - need to fixed the monitoring.
Kemp hardware load balancing needs to be changed from checking the machine is "ON" to rather checking each machine using a web request.

SolarWinds Monitoring is not configured correctly