Cloud Formation: Portal for ArcGIS - High Availability

Ranga_Tolapi · ‎07-26-2022

I was referring to an Esri video on YouTube titled "Deploying ArcGIS Enterprise in Amazon Web Services".

At 27:05 in the video, what type of HA is configured for Portal for ArcGIS in this deployment architecture? Is it primary-primary or primary-standby?

For ArcGIS Data Store it was clearly depicted that it is primary-standby. However, it is not clear for Portal for ArcGIS.

Can someone please shed some light on this?

@DavidCordes @KanikaKumar @PankajChaudhari @Anonymous User

Brian_Wilson · ‎07-26-2022

I don't want to watch the video but I think I can answer anyway.

There is no "primary/standby" for Portal (or Server), only for DataStore. That picture shows two copies of Portal running behind a load balancer. They are peers, serving (hopefully!) exactly the same content, and (if things work correctly) no one outside will know you have more than one. They see one server because the load balancer (tries to) spread the traffic out and direct it to the Portal that is doing the least amount of work. If one Portal fails, the load balancer stops sending traffic to it. That's the HA part.

The exact same thing goes for Server -- you can have more than one behind a load balancer, if your license allows it. I have tested with Server but not with Portal. We are small and can't justify the budget for another license.
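To make the load balancer behaviour concrete, here is a toy sketch. It is not how the AWS load balancer in the video (or any real product) is implemented; the class names and machine names are made up, and "least amount of work" is approximated by a simple count of active requests.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """One Portal machine behind the load balancer (hypothetical names)."""
    name: str
    healthy: bool = True
    active_requests: int = 0


class LeastBusyBalancer:
    """Toy balancer: send each request to the healthy backend doing the least work."""

    def __init__(self, backends):
        self.backends = list(backends)

    def route(self):
        # Failed machines are skipped entirely; that is the HA part.
        candidates = [b for b in self.backends if b.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends available")
        target = min(candidates, key=lambda b: b.active_requests)
        target.active_requests += 1
        return target.name


portal_a = Backend("portal-a")
portal_b = Backend("portal-b")
balancer = LeastBusyBalancer([portal_a, portal_b])

first_two = [balancer.route(), balancer.route()]      # traffic is spread across both
portal_a.healthy = False                              # simulate one Portal failing
after_failure = [balancer.route() for _ in range(3)]  # everything goes to the survivor
```

Note that the balancer itself is still a single object here, which is the "you just move the failure points around" problem.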

(Also, if your load balancer fails... well, that can never happen! You never eliminate the failure points, you just move them around. And you don't know that until one fails. And they are designed to fail only when you are on vacation.)

Tip #1: If you have a license for only one copy of Server you can still have a primary and standby DataStore. (I am unconvinced it works but that's what I ended up with here. It is probably slower because of the replication going on, but I have not noticed any change. The data in the DataStore is not being edited here so I am unsure why I care if it has live replication. But you could be using it more than we are.)

Tip #2: If you have virtual servers then it's pretty cheap to put the components on different servers. If you have a license for only 4 CPUs like we do, and your virtual server host has 32 CPUs, then putting the components into different VMs means you get to use up to 16 cores and still stay within the license constraints (4 each for Portal, Server, DataStore, and SQL Server). We've since moved SQL Server to a completely separate dedicated server with its own storage, so we actually have more than 16 cores in use now.

Tip #3: It's painful to migrate from a single server to multiple servers and keep everything operational (in spite of what tech support says! Things will break!). If you have not deployed yet, trust me, set up multiple machines now.
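The core arithmetic in Tip #2 can be spelled out as below. This takes the post's reading of the license at face value (a 4-core cap applied per VM) and is not licensing advice; the variable names are just for illustration.

```python
LICENSED_CORES_PER_VM = 4   # from the post: licensed for 4 CPUs
HOST_CORES = 32             # from the post: the virtual server host has 32 CPUs

# One VM per component, each sized to the 4-core cap described in the post.
vm_cores = {"Portal": 4, "Server": 4, "DataStore": 4, "SQL Server": 4}

total_cores_used = sum(vm_cores.values())
over_cap = [name for name, cores in vm_cores.items() if cores > LICENSED_CORES_PER_VM]
fits_on_host = total_cores_used <= HOST_CORES

print(f"cores in use: {total_cores_used} of {HOST_CORES}")
print(f"VMs over the per-VM cap: {over_cap}")
```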

ChristopherPawlyszyn · ‎07-26-2022

Just to add some nuance to this thorough answer: while the web tier of Portal for ArcGIS is active-active, there is a database component that runs in primary-standby mode and fails over in the event of a database failure on the existing primary machine. That is why write transactions will fail for a short time during failover, but recovery shouldn't require administrative intervention once failover is complete.

That is why you'll see primary and standby roles listed in the Machines API of Portal Admin, but both machines (when healthy) are able to receive web traffic and respond to requests while writing to the current primary database.
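As a rough mental model of the two paragraphs above (a toy simulation, not Portal's actual implementation or its Admin API; every class, method, and machine name here is hypothetical):

```python
class PortalSite:
    """Toy model: active-active web tier on top of a primary-standby database."""

    def __init__(self, machines):
        self.machines = list(machines)      # every healthy machine takes web traffic
        self.db_primary = self.machines[0]  # only one machine hosts the primary database
        self.failing_over = False

    def roles(self):
        # Roughly the primary/standby split a machines listing would show.
        return {m: ("primary" if m == self.db_primary else "standby")
                for m in self.machines}

    def write(self, via):
        # Any machine can accept the request, but the write lands on the primary database.
        if via not in self.machines:
            raise ValueError(f"{via} is not a healthy machine in this site")
        if self.failing_over:
            raise RuntimeError("write failed: database failover in progress")
        return f"{via} wrote to the database on {self.db_primary}"

    def start_failover(self):
        # The machine hosting the primary database is lost.
        self.machines.remove(self.db_primary)
        self.failing_over = True

    def finish_failover(self):
        # The standby is promoted automatically; no admin action in this model.
        self.db_primary = self.machines[0]
        self.failing_over = False


site = PortalSite(["portal-a", "portal-b"])
roles_before = site.roles()
before = site.write(via="portal-b")

site.start_failover()
try:
    during = site.write(via="portal-b")
except RuntimeError as err:
    during = str(err)

site.finish_failover()
after = site.write(via="portal-b")
```

So the web tier keeps answering from the surviving machine the whole time, and only writes are interrupted until the standby database takes over.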

-- Chris Pawlyszyn

Brian_Wilson · ‎07-26-2022

One more comment -- it's engineered for sites 1000 times bigger than ours. 🙂

I deployed on separate machines because it makes maintenance and support easier, not for performance. When tech support says "roll back to yesterday's snapshot" on one component, I can now, because I won't be wiping out changes made to the others.