SAP ASE Cluster failed to start instance

Former Member
0 Kudos

Hi,

One of our customers has newly implemented ASE-CE 16.0 SP01/EBF 24928 in a production setup. They recently had to change the IP addresses of both nodes, and after the change the following error was reported during startup of the ASE cluster.

File: SCC agent log on node1

[RMI TCP Connection(21)- <ipaddress>] - Error encountered in starting cluster <CLUSTER NAME>, instance <instance 2>

java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: <HOSTNAME 2>; nested exception is:

  java.net.ConnectException: Connection timed out]; nested exception is:

  java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: <HOSTNAME 2>; nested exception is:

  java.net.ConnectException: Connection timed out]

File: ASE instance 2 error log

kernel  Instance '<instance 2>' (2) failed to get reply from coordinator '<instance 1>' (1) after sending JOIN request 15 times.  Communication links may be down, or the cluster may have abruptly exited.

kernel  ASE will check the quorum disk heartbeat values to determine if a cluster takeover is possible. This may take up to 1 second(s).

kernel  Quorum heartbeat values have changed. Cluster takeover is not advised.

kernel  ueshutdown: exiting

kernel  Main thread performing final shutdown.

kernel  Blocking call queue shutdown.

So basically the second ASE instance failed to start. Can anybody advise what is going wrong here?

Thanks.

Accepted Solutions (0)

Answers (1)

sap_mk
Active Participant
0 Kudos

Hi Arun,

How was the IP address change made? Did they just update the interfaces file or did they use sybcluster and issue the set instance commands? Can you review output from "show cluster config" and see if it looks correct?

If sybcluster is not working, you can use qrmutil to check "Host node":

qrmutil -Q <quorum device> --display=instance --instance=<instance name>

If it is wrong, shut down SCC and any instances that are running, then use:

qrmutil -Q <quorum device> --extract-config=<filename>

Edit the file, look for "node" under each of the [instance] sections, and make the necessary changes:

vi <filename>

Upload the changes back to the quorum device:

qrmutil -Q <quorum device> --config-file=<filename>

It's always best to back up your quorum device before making any changes.

Regards,

Mark

Former Member
0 Kudos

Hi Mark,

Thanks for the details. Please review my answers below:

  1. The IP address was changed at the network level and /etc/hosts was updated.
  2. The interfaces file uses host names, so it was not modified.
  3. The output of "show cluster config" was not captured, but we have the output of:

qrmutil -Q <quorum device> --extract-config=<filename>

     Please see the attached file "qconfig.txt". The output looks OK, since host names are used for the cluster configuration.

  4. Also attached is the ASE node 2 error log, with about 70 more lines of detail.


Currently we do not have access to the production server. If you advise a solution, we will have to go onsite to try it.
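Since both the interfaces file and the cluster configuration refer to host names, one thing that can be checked on each node is whether those names now resolve to the new addresses. The following is a minimal sketch of such a check, not an SAP tool: the function names are made up for illustration, and the IP addresses in the usage comment are placeholders from the documentation range, not the real ones.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def check_hosts(expected):
    """expected maps host name -> expected IPv4 address.

    Returns a list of (host, expected_ip, resolved_ips) for every host
    whose resolved addresses do not include the expected one.
    """
    problems = []
    for host, ip in expected.items():
        try:
            actual = resolve_ipv4(host)
        except socket.gaierror:
            actual = []  # the name does not resolve at all
        if ip not in actual:
            problems.append((host, ip, actual))
    return problems


# Hypothetical usage, to be run on BOTH nodes (addresses are placeholders):
# print(check_hosts({"PEDWBIR04": "192.0.2.14", "PEDWBIR05": "192.0.2.15"}))
```

An empty list means every name resolves to the expected address on that node. Any entry in the list points at a stale /etc/hosts line or a resolver that still returns the old address.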

Thanks,

Shiva

sap_mk
Active Participant
0 Kudos

Hi Shiva,

This cluster is defined with the primary interconnect using the public network.

primary address = PEDWBIR04
primary address = PEDWBIR05


Since these are the same as the host names, the interconnect will use the public network. It should not be configured this way. The interconnects (primary and secondary) should be on a private network using a separate NIC and IP address/DNS name, connected to each other via a private switch or a VLAN. The secondary interconnect is not defined; I would call it optional for a non-production environment, but required for prod. This is an HA solution, so we don't want the interconnect to be a single point of failure.
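The same check can be run mechanically against a file produced by --extract-config. This is a rough sketch that assumes the extracted file is INI-like, with repeated [instance] sections containing "key = value" lines such as "node" and "primary address" (both mentioned in this thread). The "name" and "secondary address" keys and both function names are assumptions made for illustration.

```python
def parse_instances(text):
    """Collect the key/value pairs of every [instance] section.

    Assumes an INI-like layout; sections may repeat, so the standard
    configparser module is deliberately not used here.
    """
    instances = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            if line[1:-1].strip().lower() == "instance":
                current = {}
                instances.append(current)
            else:
                current = None  # some other section, ignore its keys
            continue
        if current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip().lower()] = value.strip()
    return instances


def interconnect_on_public(instances):
    """Flag interconnect addresses that are identical to the node's host name."""
    flagged = []
    for inst in instances:
        node = inst.get("node", "")
        for key in ("primary address", "secondary address"):  # second key is assumed
            addr = inst.get(key, "")
            if addr and addr.lower() == node.lower():
                flagged.append((inst.get("name", "?"), key, addr))
    return flagged
```

An interconnect address that matches the node name, as "primary address = PEDWBIR04" does here, means interconnect traffic goes over the same public interface as client connections.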

Is there any firewall that might be blocking access to the 151xx ports on either node? The "connection refused" indicates a network issue over the primary interconnect (which, as mentioned, is using the public network).

Mark

Former Member
0 Kudos

Hi Mark,

Thanks, we have noted your recommendations. We will verify access to the ports and update you if there are any further issues.

Thanks

Shiva

PS: My earlier login had some issues, so I am using another one to reply to you.

Former Member
0 Kudos

Hi Mark,

We have got a very emphatic "all ports open" from the customer. What else could the issue be, and how can we confirm what the customer is saying? Can we check using

$ telnet PEDWBIR05 15100

If this works, does it mean the port is open? Do you think setting up a private network will solve the issue? Although a private network is recommended, the current environment is not yet live. They use blade servers and mentioned that there is a private network available between the systems. If you insist, we can try to set that up.
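Where telnet is not installed, roughly the same TCP check can be scripted. The sketch below only attempts a TCP connect and reports whether it succeeded, was refused, or timed out, which are the two failure modes named in the SCC agent log. The function name is made up for illustration. A result of "open" says only that a TCP listener accepted the connection, and nothing about UDP traffic on the same port number.

```python
import socket


def tcp_port_status(host, port, timeout=5.0):
    """Attempt a TCP connect and classify the outcome.

    Returns 'open', 'refused', 'timeout', or 'error: <details>'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host reachable, nothing listening (or an active reject)
    except socket.timeout:
        return "timeout"   # packets silently dropped, or host unreachable
    except OSError as exc:
        return f"error: {exc}"  # e.g. the host name does not resolve


# Usage mirroring the telnet command above:
# print(tcp_port_status("PEDWBIR05", 15100))
```

A "timeout" usually points at filtering or routing somewhere on the path, while "refused" usually means the host was reached but no process is listening on that port.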

I am Shiva's colleague; his account here is having some issues, so he is unable to reply.

Regards, Amal

sap_mk
Active Participant
0 Kudos

Hi Amal,

Yes, you can use telnet to verify the port is reachable on the opposite node. I don't expect the private network to resolve this, but it needs to be in place before this becomes live. We use UDP for communication on the interconnect. I'm not a network admin, so please ensure nothing is blocking UDP (telnet won't verify this part).
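Because telnet only exercises TCP, a separate probe is needed to see whether UDP datagrams pass between the nodes. The following is a minimal sketch of a generic UDP echo test, not an ASE utility: one node waits for a datagram and echoes it back, the other sends and waits for the echo. All function names and the payload are made up for illustration, and the port to test is left to the reader. It should be a port that is free at the time, for example while the instances are shut down.

```python
import socket
import threading


def echo_once(sock, timeout=30.0):
    """Wait for one datagram on an already bound UDP socket and echo it back.

    Returns the sender's (ip, port), or None if nothing arrived in time.
    """
    sock.settimeout(timeout)
    try:
        data, peer = sock.recvfrom(2048)
    except socket.timeout:
        return None
    sock.sendto(data, peer)
    return peer


def udp_probe(host, port, payload=b"udp-probe", timeout=5.0):
    """Send one datagram and report whether the same payload came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _ = sock.recvfrom(2048)
        except OSError:  # covers timeouts and ICMP-driven errors
            return False
        return data == payload


def loopback_selftest():
    """Run both halves on one machine to confirm the script itself works."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as rx:
        rx.bind(("127.0.0.1", 0))          # bind first so no datagram is lost
        port = rx.getsockname()[1]
        worker = threading.Thread(target=echo_once, args=(rx, 5.0))
        worker.start()
        ok = udp_probe("127.0.0.1", port, timeout=2.0)
        worker.join()
        return ok
```

On the receiving node, bind a UDP socket to the chosen port and pass it to echo_once; on the sending node, call udp_probe with the other node's host name and the same port. The test should be run in both directions. A False result does not say which hop dropped the datagram or the reply.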

Regards,

Mark

Former Member
0 Kudos

Hello Mark,

We have confirmation from the network admins that all ports are open. Is it possible that the IP address is stored in some file when the cluster is configured? I did a grep on all files in the SCC-3_3 (Agent) directory, and the old IP shows up only in some log files. Is there any chance that the IP is saved somewhere we are not looking?
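To widen that search beyond one directory, a recursive scan can be scripted. This is a small sketch equivalent to a recursive grep, with a guard so that searching for 10.1.2.3 does not also match 10.1.2.30. The function name, the size limit, and the example paths are assumptions for illustration only.

```python
import os
import re


def find_ip_references(root, old_ip, max_bytes=5_000_000):
    """Return (path, line_number) for every line under root that contains old_ip.

    Files larger than max_bytes and unreadable files are skipped.
    """
    # Do not match when the address is embedded in a longer dotted number.
    pattern = re.compile(rb"(?<![\d.])" + re.escape(old_ip.encode("ascii")) + rb"(?!\d)")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) > max_bytes:
                    continue
                with open(path, "rb") as handle:
                    for lineno, line in enumerate(handle, 1):
                        if pattern.search(line):
                            hits.append((path, lineno))
            except OSError:
                continue  # unreadable file, broken symlink, etc.
    return hits


# Hypothetical usage (path and address are placeholders):
# for path, lineno in find_ip_references("/opt/sap/SCC-3_3", "192.0.2.14"):
#     print(f"{path}:{lineno}")
```

Hits in log files only record past activity. Hits in configuration or property files are the ones worth a closer look.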

I think the issue is with one of the agents; maybe they are not starting properly. Please let us know of any other possible reasons for this issue.

Thanks

Amal

sap_mk
Active Participant
0 Kudos

Hi Amal,

To be clear, the second ASE instance is starting. When it gets to the point of trying to join the cluster, there is no response from instance 1. The agent has already done its work in kicking off the dataserver binary. Instance 2 is sending a message across the CIPC (private interconnect) to instance 1, but there is no response. Are there any messages in the instance 1 errorlog to indicate that instance 2 is trying to join?
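To look for such messages without reading the whole instance 1 errorlog, the file can be filtered for a few keywords. The sketch below is a plain case-insensitive line filter. The function name, the keyword choice, and the file path in the usage comment are assumptions, and the exact wording of any join-related messages on the coordinator side is not known here.

```python
def matching_lines(lines, keywords):
    """Return (line_number, text) for lines containing any keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, 1)
        if any(keyword in line.lower() for keyword in lowered)
    ]


# Hypothetical usage; substitute the real errorlog path and instance 2's name:
# with open("/path/to/instance1.log", errors="replace") as handle:
#     for number, text in matching_lines(handle, ["join", "<instance 2>"]):
#         print(number, text)
```

If nothing at all shows up around the time instance 2 was started, that is consistent with the join requests never reaching instance 1.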

Thanks,

Mark