Radimaging Ltd - Paul Beck's Technical Working Notes for Microsoft Technology: RTO


Sunday, 7 November 2021

Figuring out SaaS licensing and SLAs

Overview: Buyers, whether B2B or B2C, will want to understand your licensing, the associated costs, and the level of service. Keep it simple and understandable, and make sure your Service Level Agreement (SLA) clearly outlines availability, performance, and what users may use your service for.

Licensing and pricing options: pay-per-use, one-off, yearly, pay-per-user (monthly or annual), and pay-per-consumption (e.g., Stripe).

SLA:

Availability: 99.9% or better is good. It really depends on what you are offering, but there are often penalties for missing an availability SLA. If I build a standard SaaS application that uses App Services, APIM (Standard tier; Premium with geo-distribution has a higher SLA) and Azure SQL, the combined SLA of those services falls short of 99.9%, and that is before counting AAD, patching, or application-caused downtime. At a SaaS product level, providing an actual 99.999% (5 nines) SLA is not as easy as the marketing and legal stakeholders might assume.
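A minimal sketch of the composite SLA arithmetic behind that claim: services that all sit in the request path multiply their availabilities together. The per-service percentages below are assumptions for illustration only; check the vendor's current SLA documents before relying on them.

```python
from math import prod

# Assumed per-service SLA percentages (illustrative, not taken from this post).
ASSUMED_SLAS = {
    "App Service": 99.95,
    "API Management (Standard)": 99.95,
    "Azure SQL Database": 99.99,
}


def composite_sla(sla_percentages):
    """Return the combined SLA (%) of services chained in series."""
    return prod(p / 100 for p in sla_percentages) * 100


total = composite_sla(ASSUMED_SLAS.values())
print(f"Composite SLA: {total:.3f}%")
```

With these assumed figures the composite lands just under 99.9%, before AAD or any application-caused downtime is counted.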

It becomes easier to offer 99.9%+ SLAs if you, as a company, assume the risk: it is unlikely that all the possible downtime will occur and hit you back to back, so offering money back is entirely feasible. Additionally, most SaaS companies require clients to claim their credits; outages are not monitored on the client's behalf, and credits are not automatically applied to the bill.

Support - channels (phone, bot, email), plus maximum time to respond and time to resolve.
B2B Monitoring - It is a good idea to monitor your SaaS provider and not just take their word for it. Technically, monitor the availability of individual services (websites or APIs); it is also useful for internal support to know when items outside your control (those with the SaaS vendor) are unavailable. Examples include page load times and login times, where you are looking at both availability and speed. How much of the service is down, and how much does this affect end customers? Use a 3rd-party monitoring tool, or write your own as a last resort. When relying on 3rd parties to provide services, do a hazard risk assessment: plan for what can go wrong, how you will respond, and how you will adjust.

SaaS Onboarding & Payment collections

SaaS Customer Experience

SLAs need to consider both RTO and RPO

Actual availability is calculated with this formula:

Availability = ((Total Minutes − Unplanned Downtime) / Total Minutes) × 100

Note: Planned downtime is not typically included in availability calculations, so check early what the SLA counts as planned downtime.
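As a rough illustration of the availability formula, here is a small Python sketch. The function name and the 30-day measurement window are assumptions made for the example; planned downtime is deliberately left out, as the note above describes.

```python
def availability_percent(total_minutes, unplanned_downtime_minutes):
    """Availability = ((total - unplanned downtime) / total) * 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - unplanned_downtime_minutes) / total_minutes * 100


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes, an assumed billing window

# 43.2 minutes of unplanned downtime in a 30-day window
print(f"{availability_percent(MINUTES_IN_30_DAYS, 43.2):.3f}%")  # prints 99.900%
```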

SLA vs SLO vs SLI:

SLA (Service Level Agreement) - contractual agreement the SaaS company makes with the customer.
SLO (Service Level Objective) - Goal availability (and acceptable performance) of the microservice or application. Measurement goal.
SLI (Service Level Indicator) - checks if SLO is achieved. Actual Measurement.
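To make the three terms concrete, here is a hedged sketch in which the SLI is measured as the share of successful requests. That request-based measure, the function names, and the target values are all assumptions for illustration; a time-based SLI works the same way.

```python
SLA_PERCENT = 99.9   # hypothetical contractual promise to the customer
SLO_PERCENT = 99.95  # hypothetical internal goal, set tighter than the SLA


def availability_sli(successful_requests, total_requests):
    """SLI: the actual measurement, here the % of requests that succeeded."""
    if total_requests == 0:
        return 100.0  # no traffic, so nothing failed
    return successful_requests / total_requests * 100


def meets_target(sli_percent, target_percent):
    """True when the measured SLI reaches the SLO (or SLA) target."""
    return sli_percent >= target_percent


sli = availability_sli(successful_requests=999_700, total_requests=1_000_000)
print(f"SLI {sli:.2f}% meets SLO: {meets_target(sli, SLO_PERCENT)}")
print(f"SLI {sli:.2f}% meets SLA: {meets_target(sli, SLA_PERCENT)}")
```

Keeping the SLO stricter than the SLA gives you a warning margin before a contractual penalty applies.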

As part of High Availability and scalability, it is a good idea to know how many instances are running and how autoscaling is set up. Here is an example for Azure App Services.

Scale Out (CPU or Memory) - Metric Threshold (Avg): 70, Duration: 5 Min, Cool down Time: 5 Min, Increase Count: 1

Scale In (CPU or Memory) - Metric Threshold (Avg): 40, Duration: 30 Min, Cool down Time: 10 Min, Decrease Count: 1
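The two rules can be read as a simple decision function. The sketch below is a simplified model of my own, not Azure's actual autoscale engine: it assumes one averaged CPU (or memory) sample per minute, and the `min_instances` and `max_instances` limits are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ScaleRule:
    threshold: float   # average metric value (%) that triggers the rule
    duration_min: int  # look-back window in minutes
    cooldown_min: int  # minimum minutes since the last scale action
    step: int          # instances to add (+) or remove (-)


SCALE_OUT = ScaleRule(threshold=70, duration_min=5, cooldown_min=5, step=+1)
SCALE_IN = ScaleRule(threshold=40, duration_min=30, cooldown_min=10, step=-1)


def decide(samples, instances, minutes_since_last_scale,
           min_instances=2, max_instances=10):
    """Return the new instance count for per-minute samples (oldest first)."""
    out_window = samples[-SCALE_OUT.duration_min:]
    if (len(out_window) == SCALE_OUT.duration_min
            and mean(out_window) > SCALE_OUT.threshold
            and minutes_since_last_scale >= SCALE_OUT.cooldown_min):
        return min(instances + SCALE_OUT.step, max_instances)

    in_window = samples[-SCALE_IN.duration_min:]
    if (len(in_window) == SCALE_IN.duration_min
            and mean(in_window) < SCALE_IN.threshold
            and minutes_since_last_scale >= SCALE_IN.cooldown_min):
        return max(instances + SCALE_IN.step, min_instances)

    return instances  # no rule fired


print(decide([80] * 5, instances=2, minutes_since_last_scale=10))   # prints 3
print(decide([30] * 30, instances=3, minutes_since_last_scale=10))  # prints 2
```

Note the asymmetry in the rules: scaling out reacts within 5 minutes, while scaling in waits for 30 minutes of low load, which avoids flapping between instance counts.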

Friday, 13 March 2015

Capturing NFRs for SharePoint

Problem: Gathering Non-Functional Requirements (NFRs) is a challenging task in IT projects. This is because it is always difficult to estimate how the system will be used before you build it. I often encounter business users who state extreme NFRs in an attempt to negotiate or demonstrate their world-class status (I generally think the opposite when hearing unreasonable NFRs).

An example is a CIO at a reasonably small NGO telling me that the on-premises SP 2010 infrastructure needs to be up all the time, so an SLA of 99.99999% is required, which is nearly impossible: it equates to about 3.2 seconds of downtime a year. In reality, higher SLAs start to cost a lot of money. SQL 2012 introduces Always On Availability Groups (AOAG), which SP2013 supports and which help improve uptime, but this comes at a cost in licensing, infrastructure, and management. I need redundancy and the ability to deal with performance issues, so the smallest possible farm consists of 6 servers, 2 for each layer in SP, namely WFE, App and SQL.
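To sanity-check a requested SLA, it helps to turn the percentage into a yearly downtime budget. A small sketch, assuming a 365-day year (the function name is illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def allowed_downtime_seconds(sla_percent, period_seconds=SECONDS_PER_YEAR):
    """Downtime budget in seconds for an SLA percentage over a period."""
    return (1 - sla_percent / 100) * period_seconds


for sla in (99.9, 99.99, 99.999, 99.99999):
    print(f"{sla}% -> {allowed_downtime_seconds(sla):,.1f} seconds per year")
```

The last line of output is the 99.99999% case, which comes to roughly 3.2 seconds a year.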

Here is an old post on SP2010 SLAs that is still relevant today.

The key is to gather your NFRs and ensure that all your applications on the production farm meet the expected behaviours. I have a checklist below. Reviewing Microsoft's SP Boundaries, Limits and Thresholds document should help highlight any issues.

The high-level items I cover include the following topics:

Availability
Capacity
Compatibility (Browser, device, mobile)
Concurrency
Performance
Disaster Recovery (RTO, RPO)
Scalability
Search
Security
SLA

Capacity Example

Item	Day 1	Year 1	Year 3	Year 5
Site Collections	10	100	250	400
Database Size in GB	> 1 GB	490 GB	1220 GB	1960 GB
Search Index Size in GB	> 1 GB	120 GB	310 GB	490 GB
No of Content Databases	1	1	4	8
No of Search Items	10,000	10 Million	25 Million	40 Million
No of Index Partitions	1	1	3	4

Item	Day 1	Year 1	Year 2	Year 3
Number of Users	1,000	50,000	80,000	130,000

*Also calculate peak and average concurrency numbers

Average concurrency for 20,000 users is based on the assumption that 10% (2,000) of these users will be actively using the solution simultaneously, and that 1% of the total user base (200 users) will be actively making requests. For performance testing, aim to handle 200 concurrent users without delays and achieve a page response time of under 5 seconds. This is based on a simple guideline from Microsoft that I have always used.

Peak concurrency depends on your situation. A simple 4 times the average concurrency is suitable for most internal business applications, but it would not hold for, say, the NFL playoff game schedule on the day it is announced, although that example may be better described as a load spike than as peak concurrency.
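The concurrency rules of thumb above can be sketched as a small calculation. The function and key names are illustrative, and the 10%, 1% and 4x defaults are the guideline values from the text, not measurements from a real system.

```python
def concurrency_estimate(total_users, active_share=0.10,
                         requesting_share=0.01, peak_multiplier=4):
    """Estimate active users and average/peak concurrent requesters."""
    active_users = round(total_users * active_share)
    average_concurrent = round(total_users * requesting_share)
    return {
        "active_users": active_users,
        "average_concurrent": average_concurrent,
        "peak_concurrent": average_concurrent * peak_multiplier,
    }


print(concurrency_estimate(20_000))
# {'active_users': 2000, 'average_concurrent': 200, 'peak_concurrent': 800}
```

For a public-facing load spike, swap `peak_multiplier` for a figure drawn from your own traffic history rather than the default of 4.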

It is also worth creating a usage distribution pattern for your users' experience. For instance, 80% of users may be light users who log in, read 10 pages on your site, and perform a single search with 1-minute gaps between interactions (wait times). The remaining 20% perform a login, upload a 100 KB document, view 10 pages and perform 2 searches.
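A usage distribution like this translates directly into the action mix for a load test. The sketch below uses the 80/20 split from the example; the profile names and dictionary layout are assumptions for illustration.

```python
# Per-session action counts for each profile, taken from the example above.
PROFILES = {
    "light": {"share": 0.80, "logins": 1, "page_views": 10,
              "searches": 1, "uploads": 0},
    "heavy": {"share": 0.20, "logins": 1, "page_views": 10,
              "searches": 2, "uploads": 1},
}


def actions_for_sessions(sessions, profiles=PROFILES):
    """Return the total number of each action across all sessions."""
    totals = {}
    for profile in profiles.values():
        users = sessions * profile["share"]
        for action, count in profile.items():
            if action == "share":
                continue  # the share is a weight, not an action
            totals[action] = totals.get(action, 0) + users * count
    return {action: round(total) for action, total in totals.items()}


print(actions_for_sessions(1_000))
# {'logins': 1000, 'page_views': 10000, 'searches': 1200, 'uploads': 200}
```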

RPO & RTO:

RPO (Recovery Point Objective) - Max amount of data that can be lost, measured in time.
RTO (Recovery Time Objective) - Max time allowed to make the system operational again (rebuild the farm and restore the latest backups).

SQL Server Sizing:

Option 1: Calculate the bytes required per row for each table, multiply by the number of rows in that table, and sum across the tables to determine the total size.

Option 2: Assume 100 bytes for each row, count the number of rows, and multiply to get the storage requirement.
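Both sizing options are simple arithmetic. Here is a rough sketch; the table figures are hypothetical, and the result covers raw row data only, so indexes, logs and growth headroom come on top.

```python
def size_from_schema(tables):
    """Option 1: sum bytes_per_row * row_count over (bytes, rows) pairs."""
    return sum(bytes_per_row * rows for bytes_per_row, rows in tables)


def size_from_row_count(total_rows, assumed_bytes_per_row=100):
    """Option 2: flat assumption of 100 bytes per row."""
    return total_rows * assumed_bytes_per_row


def to_gb(size_bytes):
    """Convert bytes to gigabytes (1 GB = 1024**3 bytes here)."""
    return size_bytes / 1024 ** 3


# Hypothetical tables: (bytes per row, number of rows)
example_tables = [(200, 1_000_000), (50, 4_000_000)]

print(f"Option 1: {to_gb(size_from_schema(example_tables)):.2f} GB")
print(f"Option 2: {to_gb(size_from_row_count(5_000_000)):.2f} GB")
```

Option 2 is quicker, but it drifts from Option 1 as soon as the real row widths differ much from 100 bytes.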

More Info:
https://technet.microsoft.com/en-us/library/ff758647.aspx