One of the most frequent problems when SAP fails to start is that disp+work shows the status "running but not connected to message server". This can have various causes. I am posting my problem-solving strategy here.
When this error occurs, disp+work is not able to connect to the message server. The message server should have been started, and if it crashes, it is restarted automatically. Still, the first step is to check whether the message server is actually running. I am assuming Linux here for the OS commands:
Step 1
$ ps -A | grep ms
20639 ? 00:00:02 ms.sapBW1_ASCS0
$ sapcontrol -nr 01 -function GetProcessList
10.04.2014 23:44:46
GetProcessList
OK
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
msg_server, MessageServer, GREEN, Running, 2014 04 10 12:53:27, 10:51:19, 20639
enserver, EnqueueServer, GREEN, Running, 2014 04 10 12:53:27, 10:51:19, 20640
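If you want to script this check, here is a minimal sketch. It assumes the comma-separated GetProcessList layout shown above; msg_server_is_green is a name I made up for illustration, not an SAP tool.

```shell
# Hypothetical helper: reads GetProcessList output on stdin and succeeds
# only if the msg_server row reports the dispstatus GREEN.
# Assumes the "name, description, dispstatus, ..." layout shown above.
msg_server_is_green() {
    awk -F', ' '$1 == "msg_server" && $3 == "GREEN" { ok = 1 } END { exit !ok }'
}

# Possible usage (instance number 01 as in the example above):
#   sapcontrol -nr 01 -function GetProcessList | msg_server_is_green \
#       && echo "message server reports GREEN"
```

Keep in mind that GREEN only tells you what the system thinks, so the following steps are still needed.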
Step 2
OK, we passed the smoke test. The message server is there and (the system thinks it is) running. The next test is to ask it which application servers it knows about:
$ lgtst -S sapmsBW1 -H localhost
using trcfile: dev_lg
list of reachable application servers
-------------------------------------
[appsrv1_BW1_40] [appsrv1] [10.67.69.87] [triomotion] [3240] [DIA BTC ICM ]
[appsrv1_BW1_30] [appsrv1] [10.67.69.87] [sftdst-port] [3230] [DIA BTC ICM ]
[cisrv_BW1_20] [cisrv] [10.67.69.85] [xnm-ssl] [3220] [DIA BTC ICM ]
[cisrv_BW1_10] [cisrv] [10.67.69.85] [sapdp10] [3210] [DIA BTC ICM ]
[cisrv_BW1_00] [cisrv] [10.67.69.85] [sapdp00] [3200] [DIA UPD BTC SPO UP2 ICM ]
Very often you will spot the error right here. A typo in the hostname (e.g. cisrv) or the instance name (cisrv_BW1_20, for example) can destroy all dreams of a starting SAP system. In this case look at your profiles in /usr/sap/SID/SYS/profile, correct them and restart SAP.
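To make typos easier to spot, the bracketed list can be reduced to three columns. This is only a sketch that assumes the lgtst output format shown above; lgtst_servers is a hypothetical helper name.

```shell
# Hypothetical helper: reads lgtst output on stdin and prints
# "instance hostname ip" for every application server line.
# Splitting on "[" and "]" puts the bracketed values into the even fields.
lgtst_servers() {
    awk -F'[][]' '/^\[/ { print $2, $4, $6 }'
}

# Possible usage:
#   lgtst -S sapmsBW1 -H localhost | lgtst_servers
```

The resulting hostnames and instance names can then be compared against the values in your profiles.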
Step 3
If that still does not work, take a look at the dev traces (developer traces). They are in the work directory, typically something like /usr/sap/SID/DVEBMGS00/work. Traces are log files; "trace" is just a confusing second name for them. Notable files are dev_disp (the dispatcher's log), dev_ms (the message server's log) and dev_w0 (the log of the first work process). On a system with a separate ASCS instance, as in this example, dev_ms lives in that instance's work directory. Reading the traces line by line is good because you get an understanding of what happened. However, searching them for ERROR can be more efficient; then search the web for the error messages you find. Note that you can sort your file list by modification time with the command
ls -ltr
The most recently modified files appear at the bottom, so you can easily see whether dev_w0 or dev_disp has the more recent entries.
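Both ideas, searching for ERROR and looking at the newest files first, can be combined in a few lines. A minimal sketch; trace_errors is a hypothetical helper name, and the work directory is whatever applies to your instance.

```shell
# Hypothetical helper: prints "file:line:text" for every line containing
# ERROR in the dev_* traces of the given work directory, starting with the
# most recently modified file.
trace_errors() {
    workdir=$1
    # ls -t sorts newest first; this assumes paths without spaces.
    for f in $(ls -t "$workdir"/dev_* 2>/dev/null); do
        # The extra /dev/null makes grep always print the file name.
        grep -n 'ERROR' "$f" /dev/null || true
    done
}

# Possible usage (path pattern as in the text above):
#   trace_errors /usr/sap/SID/DVEBMGS00/work
```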
Step 4
Now it's time to check the basics. Is the hard disk full? Does a reboot help? Does the problem go away after 5 minutes of waiting? Is the system completely overloaded (check with top)? Also ask your colleagues who changed what.
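The disk check is easy to script. The sketch below assumes the POSIX output format of df -P (capacity in column 5, mount point in column 6); full_filesystems and the default limit of 95 percent are my own choices for illustration.

```shell
# Hypothetical helper: reads "df -P" output on stdin and prints the mount
# point and capacity of every file system at or above the given percentage.
full_filesystems() {
    limit=${1:-95}
    awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # "97%" -> "97"
        if (use + 0 >= limit) print $6, $5
    }'
}

# Possible usage:
#   df -P | full_filesystems 95
```

Mount points containing spaces would confuse this simple column split.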
Step 5
If nothing has helped so far, I would use a web search engine to find out more about the error. By the time you are reading this, there may be a new cause for this error.
Step 6
If all this did not help, I would start strace'ing the message server process to see what data it receives and whether it reacts at all to requests from disp+work. Let me show you how this works; you will not find this walkthrough in any man page:
$ ps -A | grep ms
20639 ? 00:00:03 ms.sapBW1_ASCS0
$ strace -s 9999 -p 20639
Process 20639 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=1833, ...}) = 0
poll([{fd=6, events=POLLIN|POLLPRI}, {fd=8, events=POLLIN|POLLPRI}, {fd=9, events=POLLIN|POLLPRI}, {fd=10, events=POLLIN|POLLPRI}, {fd=11, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN|POLLPRI}, {fd=15, events=POLLIN|POLLPRI}, {fd=16, events=POLLIN|POLLPRI}, {fd=17, events=POLLIN|POLLPRI}, {fd=18, events=POLLIN|POLLPRI}, {fd=19, events=POLLIN|POLLPRI}, {fd=20, events=POLLIN|POLLPRI}, {fd=21, events=POLLIN|POLLPRI}, {fd=22, events=POLLIN|POLLPRI}, {fd=23, events=POLLIN|POLLPRI}, {fd=24, events=POLLIN|POLLPRI}], 18, 20000) = 0 (Timeout)
fstat(4, {st_mode=S_IFREG|0644, st_size=1833, ...}) = 0
You should know that strace lists all syscalls a process makes. The first line (restart_syscall) is about an interrupted syscall and we ignore it. The second line is about the syscall fstat. What does this do? The command man 2 fstat explains: it gets the file status. Of which file? Of file descriptor 4 in process 20639. Let's find out what this file descriptor is about (ll is a common shell alias for ls -l):
$ ll /proc/20639/fd/4
l-wx------ 1 bw1adm sapsys 64 Apr 10 15:04 /proc/20639/fd/4 -> /usr/sap/BW1/ASCS01/work/dev_ms
I like this. You can look up any syscall with man 2 followed by its name, and you can resolve file descriptors to files just like that. OK, going on: the next syscall is poll (man 2 poll tells us this is "wait for some event on a file descriptor"). Here it waits on 18 file descriptors with a timeout of 20000 ms and returns 0 because nothing happened in that time. The first descriptor in the list is 6, which is a socket:
$ ll /proc/20639/fd/6
lrwx------ 1 bw1adm sapsys 64 Apr 10 15:04 /proc/20639/fd/6 -> socket:[429984]
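Instead of looking at the descriptors one by one, you can list them all at once. A minimal sketch, assuming a Linux /proc file system and the readlink command; list_fds is a hypothetical helper name, and its optional second argument only exists so that the function can be tried against a directory other than /proc.

```shell
# Hypothetical helper: prints "fd -> target" for every open file descriptor
# of the given process by reading the symlinks in /proc/PID/fd.
list_fds() {
    pid=$1
    proc=${2:-/proc}
    for link in "$proc/$pid/fd"/*; do
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink "$link")"
    done
}

# Possible usage (PID from the example above; run as the owner of the
# process or as root):
#   list_fds 20639
```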
Now where does this socket go? lsof tells us:
$ lsof | grep 429984
ms.sapBW1 20639 bw1adm 6u IPv4 429984 0t0 UDP localhost:64998
Even more interesting, I find, is file descriptor 8, aka socket 432018:
$ lsof | grep 432018
ms.sapBW1 20639 bw1adm 8u IPv4 432018 0t0 TCP *:sapmsBW1 (LISTEN)
It listens via TCP on all local addresses, port sapmsBW1. What is this port? /etc/services will have it:
$ grep sapmsBW1 /etc/services
sapmsBW1 3601/tcp # SAP System Message Server Port
sapmsBW1 3601/udp
It is port 3601, the message server port of instance 01.
The next call is fstat again, which has already been discussed. In summary, you can do a wonderful analysis of programs without even having their source code. Show me how this works on Windows. Googling the solution may be faster, but this method is definitely more fun :-)
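If you need the port number in a script, the lookup can be wrapped like this. It is a sketch that assumes the usual /etc/services layout of "name port/protocol"; service_port is a hypothetical helper name.

```shell
# Hypothetical helper: prints the TCP port registered for a service name.
# The optional second argument is the services file to read.
service_port() {
    name=$1
    file=${2:-/etc/services}
    awk -v name="$name" '$1 == name && $2 ~ /\/tcp$/ {
        sub(/\/tcp$/, "", $2)        # "3601/tcp" -> "3601"
        print $2
        exit
    }' "$file"
}

# Possible usage:
#   service_port sapmsBW1
```

On systems with getent, the command getent services sapmsBW1 should give similar information.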
Step 7
If nothing has helped so far, take a look at the SAP Notes; you can find them at https://service.sap.com/notes. You will need an SAP user ID to log in.