portal HA 10.5 issue

01-22-2017 10:28 AM
AhmadAwada1
New Contributor II
In a Portal 10.5 high availability scenario, I noticed that when the standby portal machine detects a failure on the primary machine, it communicates with the failed primary machine, then drops its own "c:\arcgisportal\db" directory and creates a new one (maybe based on information from the failed primary's db folder)!
I noticed that IP communication between the two machines must remain available during failover! In my case, if the primary machine is shut down, the standby machine will not start up unless it regains network access to the failed machine (even if the portal service on that machine is down).
Conclusion:
Is portal high availability only at the service level? That is, if both machines are up (network-wise) but one of the portal services is down, failover takes place; on the other hand, if network access to the failed primary machine is lost, the standby machine will not start?
What is wrong with my configuration?
Here is the error I get when the standby machine has no network access to the failed primary machine:
"The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.)."
23 Replies
JonathanQuinn
Esri Notable Contributor

So when you publish any hosted service, they all get created as "Stopped"? What happens if you attempt to start it manually through the Admin API? Even though the icon is greyed out in Manager, you'll be able to start it in the Admin API. Do you see any errors in the Server logs when publishing the service? Can you publish a non-hosted service that copies data into the ArcGIS Data Store? You may want to create a new thread about this issue as well.

EirikBuraas
New Contributor II

Hi Jonathan,

We have an ArcGIS Enterprise 10.6 HA-setup where it is critical that the system is up as close to 24/7 as possible. We also experience that failover takes several minutes, but due to the architecture and use of the system, a complete upgrade of the infrastructure to 10.6.1 is currently not possible. Is there a way to tweak any settings to get closer to the failover time we would get by upgrading to 10.6.1?

Best,

Eirik M. Buraas

JonathanQuinn
Esri Notable Contributor

Unfortunately, most of the time is spent restarting the web server to account for new configuration settings during failover. This step is removed at 10.6.1, which is where most of the performance benefit lies (on top of better stability, as we don't need to modify the configuration settings anymore).

Short answer, there isn't much you'll be able to do at 10.6 to improve the failover time.

AhmadAwada1
New Contributor II

Hi Jonathan,

ArcGIS Server suddenly starts as a service, but no ArcSOC processes appear in Task Manager. Permissions for the arcgis account are set properly on the folders. The server logs are not recording anything, and service.log stops at the message "invoke afterstart()". Can you give me any clue or something to check, please? The situation is so painful.

JonathanQuinn
Esri Notable Contributor

I'd check the service logs under <install directory>\framework\etc\service\logs. The service-error-0.txt file should give you an indication of what's wrong.
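
As a rough aid for the check described above, the sketch below prints the last few lines of service-error-0.txt. The function name `tail_service_error_log` and the `max_lines` parameter are hypothetical, and the log directory must be passed in by the caller because the install directory differs per machine.

```python
from pathlib import Path


def tail_service_error_log(log_dir, file_name="service-error-0.txt", max_lines=20):
    """Return the last `max_lines` lines of the service error log.

    `log_dir` is the ...\\framework\\etc\\service\\logs folder under the
    install directory (supplied by the caller; not guessed here).
    Returns an empty list when the file does not exist.
    """
    log_path = Path(log_dir) / file_name
    if not log_path.is_file():
        return []
    # errors="replace" keeps the script from failing on odd encodings.
    text = log_path.read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-max_lines:]


if __name__ == "__main__":
    import sys

    # Usage: python tail_log.py "<install directory>\framework\etc\service\logs"
    if len(sys.argv) > 1:
        for line in tail_service_error_log(sys.argv[1]):
            print(line)
```

The most recent entries are usually the relevant ones when a service starts but never finishes initializing, which is why only the tail is shown.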

ThomasKringstad
New Contributor

Hi Jonathan.

How about promoting the secondary to primary while both machines are up and running? Any downtime on the newly demoted (now secondary) machine is then not an issue. We would then promote it back to primary after the restart. This is for cases where a single node needs a planned reboot (patches etc.), not for crashes and errors.

Best Regards Thomas

JonathanQuinn
Esri Notable Contributor

Even if flipping the roles of the portals were possible in the software by some means other than stopping the Portal service, there would still be downtime, as the standby's web server needs to restart.

AhmadAwada1
New Contributor II

Dear Jonathan,

In the primary machine's db folder, I can see recovery.done only (no recovery.conf).

In the standby machine's db folder, I can see both (recovery.done and recovery.conf).

The recovery.done on the primary machine contains the following:

#----------------------------------------------------
# STANDBY SERVER PARAMETERS
#---------------------------------------------------
standby_mode = 'on' # This represents a standby server
#
primary_conninfo = 'host=webgis2.domain.com port=7654 user=repuser1484908555377 password=s0qi37INGNUwOJ3puQ8yCQEWHlRFr7ZWxH8OYdDYSOg='
#
#
trigger_file = 'c:/arcgisportal/db/promote.dat'

while the recovery.done on the standby machine contains the following:

#----------------------------------------------------
# STANDBY SERVER PARAMETERS
#---------------------------------------------------
standby_mode = 'on' # This represents a standby server
#
primary_conninfo = 'host=webgis2.domain.com port=7654 user=repuser1484908555377 password=s0qi37INGNUwOJ3puQ8yCQEWHlRFr7ZWxH8OYdDYSOg='
#
#
trigger_file = 'c:/arcgisportal/db/promote.dat'

and the recovery.conf on the standby machine contains the following:

#----------------------------------------------------
# STANDBY SERVER PARAMETERS
#---------------------------------------------------
standby_mode = 'on' # This represents a standby server
#
primary_conninfo = 'host=webgis1.domain.com port=7654 user=repuser1484908555377 password=s0qi37INGNUwOJ3puQ8yCQEWHlRFr7ZWxH8OYdDYSOg='
#
#
trigger_file = 'c:/arcgisportal/db/promote.dat'
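
The listings above are identical apart from the host named in primary_conninfo, so a small script can make the comparison less error-prone. This is only a sketch: the function name `primary_host` is hypothetical, and the parsing assumes the `key = 'value'` layout shown in the listings.

```python
import re


def primary_host(recovery_text):
    """Return the host named in primary_conninfo, or None if not present.

    Comment lines (starting with '#') are skipped. The layout is assumed
    to match the recovery.conf / recovery.done listings shown above.
    """
    for line in recovery_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        if not stripped.startswith("primary_conninfo"):
            continue
        # Host ends at the next whitespace or closing quote.
        match = re.search(r"host=([^\s']+)", stripped)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    # Placeholder credentials; substitute the contents of a real file.
    sample = (
        "standby_mode = 'on' # This represents a standby server\n"
        "primary_conninfo = 'host=webgis1.domain.com port=7654 "
        "user=repuser password=secret'\n"
        "trigger_file = 'c:/arcgisportal/db/promote.dat'\n"
    )
    print(primary_host(sample))
```

Running it against each file shows at a glance which machine that file treats as the replication source.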

Note that having the .done file on the standby machine does not affect the failover scenario as long as I do not fall into the "forbidden scenario", i.e. shutting down both machines and powering on the last known standby before the primary, which you said is a limitation for now. Do you think having the .done file on the standby might have an impact?

I hope that Esri takes this serious limitation into consideration: if the primary server suffers a critical failure, the standby server will not function.

Thank you again Jonathan.

JonathanQuinn
Esri Notable Contributor

The standby machine shouldn't have a .done file, only a .conf file and .conf.bak file.  If it has a .done but it's still the standby, that indicates that while failover was happening, the standby was stopped/restarted as well, which disrupted the failover.  In this case, failover will never fully complete because that file indicates that the DB has failed over, which is why you see those errors in the logs.  If you remove the .done file so you only have the .conf and .conf.bak, you should see a proper failover when you stop the primary.
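
A quick way to spot the state described above is to check which recovery files exist in the portal's db folder. The sketch below is illustrative only: the function name and the returned labels are hypothetical, and the interpretation follows this thread (a standby has recovery.conf and no recovery.done; the primary in this thread had recovery.done only).

```python
from pathlib import Path


def recovery_file_state(db_dir):
    """Classify a portal db folder by which recovery files are present.

    Labels are informal and based only on the behaviour described in
    this thread, not on documented product behaviour.
    """
    db = Path(db_dir)
    has_conf = (db / "recovery.conf").is_file()
    has_done = (db / "recovery.done").is_file()
    if has_conf and has_done:
        # Both files present: the case described as a disrupted failover.
        return "conf-and-done"
    if has_conf:
        # Expected layout for a standby.
        return "conf-only"
    if has_done:
        # Layout observed on the primary in this thread.
        return "done-only"
    return "neither"


if __name__ == "__main__":
    # Path taken from the thread; adjust if the content directory differs.
    print(recovery_file_state(r"c:\arcgisportal\db"))
```

If the standby reports "conf-and-done", that matches the situation where removing the .done file is suggested; back up the folder before deleting anything.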

Todd_Metzler
Occasional Contributor III

Hello Jonathan,

My comments pertain to ArcGIS Enterprise Portal 10.4.1.  Might apply to 10.5.x but I'm not sure.

COMMENT:  Thank you for the suggested shutdown/startup order.  We'll find that useful.

QUESTION:  Now, if the suggested shutdown/startup order isn't followed, and the secondary portal becomes primary, how do we "gracefully" switch back to our desired original primary once both portal instances are back up and running?  I have been stopping the portal service on the secondary server (now really the primary) and waiting the 500 seconds for the portal to switch.  There must be a better way that I don't know about.

Thank you,

Todd
