Amazon FSx for NetApp ONTAP – A Brief-ish Overview

Who is your daddy and what does he do?

What is it?

Amazon FSx is AWS’s managed service for file systems. Basically, if you want Windows File Server or Lustre out of the box – aka you don’t want to manually set up and support those in your environment – you can use FSx to deploy an AWS managed environment.

FSx NetApp ONTAP, FSx for ONTAP, FSx ONTAP, or FSxO (pick your preferred name out of a hat) is the same idea. It allows AWS users to provision a NetApp ONTAP environment as a service managed directly by AWS.

To blatantly copy from AWS’ deployment page…

Amazon FSx for NetApp ONTAP provides feature-rich, high-performance, and highly-reliable storage built on NetApp’s popular ONTAP file system and fully managed by AWS.

  • Broadly accessible from Linux, Windows, and macOS compute instances and devices (running on AWS or on-premises) via industry-standard NFS, SMB, and iSCSI protocols.
  • Provides ONTAP’s popular data management capabilities like Snapshots, SnapMirror (for data replication), FlexClone (for data cloning), and data compression / deduplication.
  • Automatically tiers infrequently-accessed data to capacity pool storage, a fully elastic storage tier that can scale to petabytes in size and is cost-optimized for infrequently-accessed data.
  • Offers highly-available and highly-durable multi-AZ storage with support for cross-region replication and built-in, fully managed backups.
  • Integrates with Microsoft Active Directory (AD) to support Windows-based environments and enterprises.
  • All file system data is automatically encrypted at rest.

No really, what is it?

At a high level, AWS is providing managed services for NetApp’s Cloud Volumes ONTAP (CVO).

Normally – perhaps ideally – if you wanted ONTAP services in AWS you would spin up CVO in your own infrastructure. It’s easy to implement, provides a lot of value over straight EBS/EFS services, but there are elements of that infrastructure you have to manage. With FSx you don’t have to manage anything. You just say how much capacity you want, what performance level, and everything is handled automagically by AWS.

FSx ONTAP is a two node, active/passive, high-availability architecture. In the backend it probably looks a lot like this, but hey, because it’s a managed service you don’t have to worry about any of that! During a failover AWS will automatically move the mount point IP address over to the new system.

FSx ONTAP will be deployed across two Availability Zones. Single AZ deployments are on the roadmap.

If you have an on-prem ONTAP system, or CVO in AWS or another cloud infrastructure, you can set up a connection between FSx ONTAP and that system to SnapMirror data back and forth.

What does it do?

If you’re familiar with ONTAP, or CVO, then you’ve got the idea.

Basically you’re bringing one of the best, if not the best, data management platforms into your cloud environment. That gives you the ability to provision workloads in an efficient manner and easily move those workloads around as requirements change.

How do you manage it?

If you want to use FSx ONTAP you can provision an environment either through the AWS Console, or through NetApp’s Cloud Manager service. If you provision through AWS, and want to integrate with Cloud Manager later, it’s easy to discover and bring it in to your working environment.

From a day-to-day management perspective you can administer it with AWS’ Console, AWS’ CLI/API, AWS CloudFormation, NetApp Cloud Manager, and ONTAP’s CLI/API.

The abilities on the AWS side of the house are pretty limited (as of right now – I’m sure more features will be added in the future); basically you have the ability to deploy and create NAS shares. If you want to replicate via SnapMirror, or clone volumes, you have to use Cloud Manager. If you really want to harness the power of ONTAP to do things like enable FPolicy or create QTrees then you need to use the ONTAP CLI/API.

There is no full featured GUI.

Software updates are all handled by AWS and rolled out non-disruptively without interaction. On the bright side you can define a 30 minute maintenance window on a weekly basis.
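If you end up setting that window through the API rather than the wizard, here is a minimal sketch of building the start time string. It assumes the FSx API format of "d:HH:MM" in UTC, with d running from 1 (Monday) through 7 (Sunday); the helper name weekly_window is my own, not anything AWS ships.

```python
# Sketch: build the weekly maintenance window start time for FSx.
# Assumption: the API expects "d:HH:MM" in UTC, where d is 1 (Monday) to 7 (Sunday).
# weekly_window is a hypothetical helper name, not part of any AWS SDK.

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def weekly_window(day: str, hour: int, minute: int = 0) -> str:
    """Return the 'd:HH:MM' start time for the 30 minute weekly window."""
    d = DAYS.index(day.lower()) + 1  # raises ValueError for an unknown day name
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"{d}:{hour:02d}:{minute:02d}"


print(weekly_window("Sunday", 3))  # Sunday at 03:00 UTC -> 7:03:00
```

Pick a slot when your applications won’t mind a failover, since the nodes get updated one at a time.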

As part of this service you also get automatic backups (30 min minimum window and max retention of 30 days).

When provisioning a FSx ONTAP environment you’ll be able to select from multiple performance SLAs (currently 512 MB/s, 1 GB/s, and 2 GB/s). That can be changed later without taking the environment offline.

You can also provision based on IOPS, but I currently don’t quite understand the resulting impact to the config or the pricing. By default you get 3 IOPS for every GB of SSD storage (SSD storage being the only option right now). You can provision additional IOPS, above the 3 per GB, and you’ll pay for those based on the average across a month.

When provisioning FSx ONTAP via the AWS Console the max IOPS is 80,000. However when doing so through Cloud Manager the max is 64,000 IOPS. Both are supposedly at 1ms of latency, but an IOP isn’t a standard unit, and I don’t see any info about the block size.

Who’s responsible for it?

AWS owns the underlying architecture, SLAs, and support services. You’re also billed through AWS.

You’re responsible for everything at the ONTAP layer and up – SVMs, volumes, shares, working up the stack.

You also won’t have full admin rights to the system. Instead you’ll get a user login (fsxadmin) that’s somewhere between the admin and svm admin from a permissions perspective. I’m not aware that it limits any general ONTAP functionality; it just limits what you can screw around with under the hood.

You can use either your own, or AWS provided, encryption keys for data at rest.

Data Protection

As noted, since it’s an HA config, there’s always going to be a second copy of your data out there. Which is good because AWS has been known to randomly lose backend storage.

You can also easily replicate that data to other FSx ONTAP instances in other regions, to other CVO instances (either within AWS or other providers like Azure and GCP), or back to on-prem environments.

At a simpler level you can take local snapshots which are instant and space efficient.

In addition data can be backed up within the cloud to a local object store. This is done via Cloud Backup, though sometimes you’ll see it labeled as “AWS Backup” for some reason (maybe it’s going to be integrated with AWS Backup?). Since I’ve only seen Cloud Backup managed through Cloud Manager, I’m not sure how restores would be managed through an AWS only environment.

You also get the same anti-virus, Cloud Data Sense, Cloud Insights/Cloud Secure, etc etc availability that you get out of CVO/on-prem environments.

Encryption

FSx ONTAP supports encryption at rest, and encryption in flight, depending on protocol.

For encryption at rest, it’s set up automatically using AWS managed AES-256 encryption keys (AWS KMS). I don’t see any support for using your own KMS.

As for in flight, replication via SnapMirror is encrypted via TLS 1.2 AES-256. Data traffic will depend on the protocol, e.g. SMB 3.0.

There’s currently no support for individually encrypted volumes (NetApp Volume Encryption/NVE).

Getting Started

The simplest way to get started is to bring up FSx in the AWS console. There you can choose Amazon FSx for NetApp ONTAP and go through a quick wizard.

There’s two options, Quick Create and Standard Create.

With Quick, all you have to do is provide a name, capacity and VPC where the data will be accessible. You can also choose to enable/disable storage efficiency. Best I can tell this doesn’t impact SLA performance.

This process creates a single SVM, single volume config, out of the gate. You can then go back and add additional volumes and SVMs as needed.

Standard Create gives you more options, including network, security, performance, encryption, SVM details, and backup & maintenance.
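If you’d rather script the Standard Create choices than click through them, here’s a rough sketch of assembling the arguments for boto3’s fsx.create_file_system call. The parameter names are the ones I understand the FSx API to use; every ID, name, and default value below is a made-up placeholder, and ontap_request is just my own helper. It only builds the dictionary, so nothing gets deployed (or billed) by running it.

```python
# Sketch only: assemble arguments for boto3's fsx.create_file_system.
# ontap_request is a hypothetical helper; all IDs and defaults are placeholders.

LAUNCH_THROUGHPUT_MBPS = (512, 1024, 2048)  # the three launch performance SLAs


def ontap_request(name, capacity_gib, throughput_mbps, preferred_subnet,
                  standby_subnet, security_groups, iops=None):
    """Return a kwargs dict for a multi-AZ FSx for ONTAP file system."""
    if throughput_mbps not in LAUNCH_THROUGHPUT_MBPS:
        raise ValueError("throughput must be 512, 1024 or 2048 MB/s")
    ontap = {
        "DeploymentType": "MULTI_AZ_1",           # two nodes across two AZs
        "ThroughputCapacity": throughput_mbps,
        "PreferredSubnetId": preferred_subnet,    # where the active node lives
        "AutomaticBackupRetentionDays": 7,        # placeholder retention
        "WeeklyMaintenanceStartTime": "7:03:00",  # placeholder: Sunday 03:00 UTC
    }
    if iops is not None:
        # Only needed when you want more than the default IOPS for the capacity.
        ontap["DiskIopsConfiguration"] = {"Mode": "USER_PROVISIONED", "Iops": iops}
    return {
        "FileSystemType": "ONTAP",
        "StorageCapacity": capacity_gib,          # SSD capacity in GiB
        "SubnetIds": [preferred_subnet, standby_subnet],
        "SecurityGroupIds": list(security_groups),
        "OntapConfiguration": ontap,
        "Tags": [{"Key": "Name", "Value": name}],
    }


request = ontap_request("demo-fsxo", 1024, 512, "subnet-aaaa", "subnet-bbbb", ["sg-cccc"])
print(request["OntapConfiguration"]["ThroughputCapacity"])

# To actually deploy (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("fsx").create_file_system(**request)
```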

If you already have NetApp’s Cloud Manager set up you can deploy by adding a working environment and selecting Amazon FSx for ONTAP. If you already have FSx ONTAP deployed you can discover it, allowing for simple drag and drop SnapMirror configuration.

Creating an instance within Cloud Manager is a bit more complex than AWS’ Quick Create, as you have to enter all your networking details, but in general the wizard is self explanatory.

Comparing the two options, it looks like deployment via the AWS Console provides the best experience, either for simplicity or fine tuned deployments. My recommendation would be to deploy via AWS, then discover via Cloud Manager.

You can find more on how to deploy via Cloud Manager on NetApp’s blog.

Limitations?

Well, bit of a moving target here. As with any product, especially at launch, there are current limitations and “limitations that won’t be limitations thanks to the roadmap.” Expect a fair amount of these to be improved upon over the next six to twelve months.

At launch you’re able to use 192 TiB (all the NetApp documentation I see says TB, but ONTAP reports in TiB and all the specialists talking about it said TiB – thankfully AWS uses TiB in their systems like sensible folk) of primary storage and “unlimited infrequent access” storage.
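For anyone who wants to see why the TB vs TiB distinction matters at this scale, here is a quick sketch of the conversion. Nothing in it is FSx specific; it’s just base 2 versus base 10.

```python
# Sketch: how far apart TiB (base 2) and TB (base 10) are at the launch limit.
TIB = 2**40   # bytes in one tebibyte
TB = 10**12   # bytes in one terabyte


def tib_to_tb(tib: float) -> float:
    """Convert tebibytes to terabytes."""
    return tib * TIB / TB


print(round(tib_to_tb(192), 1))  # 211.1
```

So a 192 TiB limit is roughly 211 TB, about a 10% gap depending on which unit the documentation actually meant.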

Confused?

ONTAP has the ability to tier cold blocks of data out to an object store. On prem it’s called FabricPool, in the CVO world it’s called Cloud Tiering (thanks again, marketing). In general this means instead of provisioning and paying for 100 TiB of SSD storage you could provision less and tier out all the cold data that doesn’t need that performance to S3. In this sense the amount of storage is unlimited* because there’s no restriction on the amount of data you can tier out. *Unlimited in the way Verizon Wireless has “Unlimited” plans… you consume metadata on that 192 TiB pool for all tiered data so you could potentially fill that up.

I don’t know whether “infrequent access” is being used as marketing speak for “cold data” or whether S3 IA is the only supported destination.

It’s also my understanding that you can’t provision more backend storage capacity beyond what’s initially configured (again, this is a launch limitation and is reportedly going away in future updates).

Since upgrades are rolled out automatically, and NFS/CIFS/iSCSI protocols being what they are, make sure to set the maintenance window for a period where your applications won’t mind being redirected to the standby FSx ONTAP node.

I’m 99.99% sure that the performance SLA is tied to whatever type of EC2 instance is provisioned in the background. When you change SLA the nodes will be brought down, one at a time, and brought back up on a different EC2 type. As with the upgrade, be mindful of how your applications will respond to this change.

9/16 Update – Note that as of right now, the amount of SSD capacity provisioned during deployment is what you’re stuck with. There’s no ability to provision more.

Performance Levels & Pricing

When you deploy FSx ONTAP you tell AWS what performance tier you want and AWS will automatically provision the proper EC2 and EBS resources in the backend to meet that objective.

As of Sept 2021, there are three different performance SLA offerings: 512 MB/s, 1 GB/s, and 2 GB/s.

You also get 3 IOPS per GB (should this be per GiB? AWS specifically says “gigabyte”) by default but can increase that to a total of either 64,000 or 80,000 depending where you look. If my math is right, that’s a default of up to 58,824 IOPS. Thus, again if my math is right, at 80,000 max IOPS, you can only max out at 4 IOPS per GB.

Yeah… I’m gonna try and get clarification on that.

9/7 Update – Sounds like official performance benchmarks are a few weeks out. Once they’re released that should help clarify where the IOPS metric comes into play.

9/17 Update – I was finally able to get enough clarification for an update here.

For starters, the 64k max IOPS limit in Cloud Manager is a bug and will be sorted out in an upcoming release. 80k is the max you can manually set.

Back to one of the original points of confusion, the total IOPS. AWS is currently saying it’s 3,000 IOPS per TiB of provisioned SSD capacity by default. If you see it documented somewhere that it’s 3 IOPS per GB (or *choke* GiB) remember that NetApp marketing doesn’t understand Base 10 vs Base 2. That means if you provision 100 TiB you get, by default, 300,000 disk IOPS. If you provision the max capacity that’s 576,000 IOPS. So why would you want to ask for a paltry 80k? If you’re going to provision sub 27 TiB (if my math is right) and you want more IOPS than the standard, you can provision more for that workload. Think of a small transactional database that’s more interested in performance than capacity.
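To check the math, here is a small sketch using only the figures above: 3,000 default IOPS per TiB of SSD and an 80,000 manual maximum. The function names are mine.

```python
# Sketch: default disk IOPS by provisioned SSD capacity, per the figures above.
DEFAULT_IOPS_PER_TIB = 3_000   # default IOPS per TiB of provisioned SSD
MAX_MANUAL_IOPS = 80_000       # highest value you can set by hand


def default_iops(ssd_tib: float) -> float:
    """Default IOPS you get for a given SSD capacity in TiB."""
    return ssd_tib * DEFAULT_IOPS_PER_TIB


def break_even_tib() -> float:
    """Capacity at which the default already matches the manual maximum."""
    return MAX_MANUAL_IOPS / DEFAULT_IOPS_PER_TIB


print(default_iops(100))            # 300000
print(default_iops(192))            # 576000
print(round(break_even_tib(), 2))   # 26.67
```

Below roughly 26.67 TiB the default is under 80,000, which is the only range where manually provisioning extra IOPS buys you anything.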

9/21 Update – This is the topic that will never die, huh?

I made a mistake by not looking at this holistically. With CVO, performance has always been limited by the EC2 and EBS resources that AWS was able to serve up. AWS running CVO for you doesn’t change those background limitations. Thus, even if the math says we’ll get to 576,000 IOPS, no single EC2 instance type will hit that. So, what about FSx ONTAP?

Problem is, because it’s a managed service, no one wants to talk about what resources AWS is provisioning to support it. In theory, it doesn’t matter, because you pick your performance SLA and you’re off and running. But this kind of head-in-the-sand approach completely obscures the real performance limits. That’s something of a key concern if you’re trying to consolidate or run performance-sensitive workloads.

As of right now, 9/21/21, there are three EC2 instance types being provisioned across the three performance SLAs. I think it’s m5d.4xlarge, m5d.12xlarge, and m5d.24xlarge respectively. Per the AWS EBS optimization documentation the m5d.24xlarge can only push a max of 80,000 IOPS.

Consider a scenario where you don’t need a lot of capacity, or network throughput, but you put the max 80,000 IOPS into the FSx ONTAP provisioning wizard. Even though a 4xlarge instance might fit the first two parameters, because of the IOPS requirement AWS will provision m5d.24xlarge EC2 instances for you.
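Here is a rough sketch of that sizing logic as I understand it, not how AWS actually does it. The instance types are my guess from above, the 80,000 IOPS ceiling for the m5d.24xlarge comes from the EBS optimization documentation, and the 18,750 and 40,000 ceilings for the smaller two are figures I’m assuming from that same table, so verify them before relying on this.

```python
# Sketch: pick the smallest backend instance that satisfies both requirements.
# Assumptions: instance types are a guess; the 18,750 and 40,000 IOPS ceilings
# are unverified figures, only the 80,000 ceiling is stated in the article.
TIERS = [
    # (instance type, throughput SLA in MB/s, max EBS IOPS)
    ("m5d.4xlarge", 512, 18_750),
    ("m5d.12xlarge", 1024, 40_000),
    ("m5d.24xlarge", 2048, 80_000),
]


def pick_instance(throughput_mbps: int, iops: int) -> str:
    """Return the first (smallest) tier meeting the throughput and IOPS asks."""
    for name, max_throughput, max_iops in TIERS:
        if throughput_mbps <= max_throughput and iops <= max_iops:
            return name
    raise ValueError("no single instance type satisfies this request")


print(pick_instance(512, 10_000))   # m5d.4xlarge
print(pick_instance(512, 80_000))   # m5d.24xlarge
```

The second call is the scenario above: modest throughput, but the IOPS ask alone pushes you onto the largest instance.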

At the end of the day, if you have any performance requirements talk with your AWS rep. Ask them what instances and drive types are being supported by FSx ONTAP. What EC2/EBS resources are available today can change tomorrow as AWS constantly improves their services and adds more FSx ONTAP support.

One thing I’m still waiting on is how that provisioned SSD capacity results in cost of ownership. Will report back after some testing… I’m being told that it won’t.

Anyhow, like all AWS services, your end cost is going to be the result of various factors including the above performance variables. For capacity, there’s the cost of primary storage and tiered storage. That will change depending on workload dynamics but for budgetary estimates a 20/80 split is a decent starting point.
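As a starting point for that budgetary estimate, here is a tiny sketch of the 20/80 split. I’m reading it as 20% on primary SSD and 80% tiered out to the capacity pool; no prices are included because those belong to the AWS pricing page, not my guesswork.

```python
# Sketch: split a total data footprint into primary SSD and tiered capacity.
# Assumption: "20/80" means 20% primary SSD and 80% capacity pool.

def split_capacity(total_tib: float, ssd_percent: int = 20):
    """Return (ssd_tib, tiered_tib) for a given total and SSD percentage."""
    if not 0 <= ssd_percent <= 100:
        raise ValueError("ssd_percent must be between 0 and 100")
    ssd_tib = total_tib * ssd_percent / 100
    return ssd_tib, total_tib - ssd_tib


print(split_capacity(100))  # (20.0, 80.0)
```

Multiply each half by the current per-GB rate for its tier and you have a rough monthly capacity number to argue about with your account team.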

If you opt for backups, that’s an additional cost based on the amount of space used. Backups are incremental with efficiencies built in, so your final cost will vary depending on data types and change rates.

Eventually I’ll come up with a Drake Equation equivalent for how this will all price out, but for now I’m just going to punt to the AWS pricing documentation. Different AWS folks have different pricing discount schemes, so if in doubt your best bet is to reach out to your AWS account team and have them run the numbers.

Is _________ supported?

Frankly, this bit is a huge mess. In theory everything supported with CVO will work with FSx ONTAP. I haven’t seen any information that suggests that the restraints created by the top permission fsxadmin account will prevent the usual features and functionality from working as expected.

As far as “technically supported” is concerned, NetApp is punting that toward AWS. As AWS is responsible for running and administering the system, and handling support, it’s up to AWS to make those calls.

To use Cloud Volumes ONTAP, or FSx ONTAP?

For me, this comes down to what do you value most: ease of deployment or control of your environment. I consider both FSx ONTAP and CVO to be equally easy to deploy. The real learning curve is setting up the AWS environment to support it. FSx ONTAP gives you the ability to deploy and let AWS automatically configure the back end infrastructure for you.

Conversely, CVO provides a few additional administration benefits. There’s currently more flexibility from a deployment option standpoint (i.e. more EC2 and EBS options). You can also get more granular with the networking configuration. If you’re not CLI savvy, CVO’s System Manager provides more GUI options for configuration than FSx ONTAP currently offers.

Resources & Additional Information

 

Article Updates

  • 9/2 – Publication
  • 9/7 – Minor tweaks and clarifications
  • 9/16 – Supportability section added, encryption section added, notes on capacity expansion
  • 9/17 – Updated information on IOPS provisioning
  • 9/21 – Even more updates to the performance section