on 2012 Sep 27 2:28 AM
Folks,
We have an intermittent hang when starting dbeng12 via the C API (dbcapi) on OS X. It is not 100% repeatable, but it affects our automated integration tests fairly frequently.
It does not happen with the same tests running on Windows.
Is there any way to get any diagnostic information out of the dbeng12 process to try and narrow down why it is hanging?
Cheers, Dan
So, we finally seem to have reached the bottom of this issue and I thought I'd document it in case anyone else ran into it.
Basically, the issue is that the dynamic library load call we were using has some undesirable behaviour in the debugger when used with a particular flag. In the samples provided with the dbcapi the load library call is as follows:
handle = dlopen(name, RTLD_LAZY);
This is the recommended way of calling this function; however, there are some potential issues with debugging, as documented here: http://tldp.org/HOWTO/Program-Library-HOWTO/dl-libraries.html
So we changed the code to: handle = dlopen(name, RTLD_NOW);
This gets us past the problems documented above.
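For anyone wanting to try the same change, here is a minimal sketch of the load call with RTLD_NOW plus a dlerror() check. The helper name load_library_now is hypothetical and is not part of the dbcapi samples; only dlopen, dlerror and RTLD_NOW are real.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical helper (not from the dbcapi samples): load a shared library
// with RTLD_NOW so that all undefined symbols are resolved before dlopen
// returns, instead of lazily on first call as with RTLD_LAZY.
void* load_library_now(const char* name) {
    dlerror();  // clear any stale error state
    void* handle = dlopen(name, RTLD_NOW);
    if (handle == nullptr) {
        const char* msg = dlerror();
        std::fprintf(stderr, "dlopen(%s) failed: %s\n",
                     name ? name : "(main program)",
                     msg ? msg : "unknown error");
    }
    return handle;
}
```

Calling load_library_now(name) in place of the sample's dlopen(name, RTLD_LAZY) call means unresolved symbols show up as a load failure at startup rather than at some later call.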
Thanks to all for help/suggestions.
Cheers, Dan
Ok, finally got one to hang, and unfortunately there was no log file created.
The process is stopped at the instruction marked with **:
    libsystem_kernel.dylib`__wait4:
    0x7fff8e46214c: movl $33554439, %eax
    0x7fff8e462151: movq %rcx, %r10
    0x7fff8e462154: syscall
    **0x7fff8e462156: jae 0x7fff8e46215d ; __wait4 + 17**
    0x7fff8e462158: jmpq 3743
    0x7fff8e46215d: ret
    0x7fff8e46215e: nop
    0x7fff8e46215f: nop
The assembly code that is running at the point it hangs is:
    0x1079b9322: jne 0x1079b9338 ; sqlany_new_connection_ex + 1416
    0x1079b9324: movq 8(%rdi), %rdi
    0x1079b9328: callq 0x1079becfd ; symbol stub for: db_init
    0x1079b932d: testl %eax, %eax
    0x1079b932f: je 0x1079b93a0 ; sqlany_new_connection_ex + 1520
    0x1079b9331: movl $1, 28(%rbx)
    0x1079b9338: movq (%rbx), %rax
    0x1079b933b: movq 144(%rax), %rdx
    0x1079b9342: movq 8(%rbx), %rdi
    0x1079b9346: xorl %esi, %esi
    0x1079b9348: callq 0x1079bed03 ; symbol stub for: db_set_property
    0x1079b934d: movq 8(%rbx), %rdi
    0x1079b9351: movq %r12, %rsi
    0x1079b9354: callq 0x1079bed09 ; symbol stub for: db_string_connect
    **0x1079b9359: movq 8(%rbx), %rdi**
    0x1079b935d: movl 12(%rdi), %eax
    0x1079b9360: movl %eax, 32(%rbx)
    0x1079b9363: leaq 36(%rbx), %rsi
    0x1079b9367: movl $256, %edx
    0x1079b936c: callq 0x1079bede7 ; symbol stub for: sqlerror_message
Looking in "Activity Monitor" I can see a dbeng12 process.
It seems like you were having some troubles passing in the "-o" parameter initially to the engine, but may have found a way around that. Is anything reported in the "-o" output from the database server when this happens?
We managed to get this stack trace out of gdb:
I'm replying here to get some better text formatting... I can't figure out how to do the formatting in the comments
I should clarify a few things here. Firstly, this hang only manifests itself in integration testing, either in a debugger on a local machine or on the build server. It is particularly frequent in the debugger using Xcode 4.4. Outside of those environments it all works fine.
The code we use to start and connect to the server is as follows:
    {
        _Logger->LogDebug( "Opening database with connection string: " + GetConnectionString() );

        // create a new sqlany connection object, and connect to our configured database.
        _Connection = _Api.sqlany_new_connection();
        if ( !_Api.sqlany_connect(_Connection, GetConnectionString().c_str() ) )
        {
            char err_msg [512];
            _Api.sqlany_error(_Connection, err_msg, sizeof(err_msg) );

            std::stringstream err;
            err << "sqlanywhere err: " << err_msg << std::endl;
            err << "when attempting to open: " << GetConnectionString();
            _Logger->LogError( err.str() );

            /* failed to connect */
            CatchError("Failed to connect");

            /* SQL Anywhere's API requires us to go through the full
             * disconnection process even if a connection attempt failed. */
            Close();
            return false;
        }
        return true;
    }
An example of a connection string that we're using to start the server and connect to a database is as follows:
ENG=vartmptmp2792eUqhK;uid=DBA;pwd=sql;dbf=/var/tmp/tmp.279.2eUqhK;START=dbeng12 -ga -qi -n vartmptmp2792eUqhK -o /var/tmp/tmp.279.2eUqhK.log.txt
That connection string is for an integration test, which is why the database and server have a semi-random name to stop file and db server conflicts on the build server.
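To illustrate how such a string can be assembled so that ENG and -n always agree, here is a rough sketch. Both function names are hypothetical, and the rule for deriving the server name (drop every non-alphanumeric character from the database path) is only an assumption inferred from the example above, not our actual implementation.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: derive a server name from the database file path by
// keeping only alphanumeric characters (assumed rule, inferred from the
// example string, e.g. "/var/tmp/tmp.279.2eUqhK" -> "vartmptmp2792eUqhK").
std::string server_name_from_path(const std::string& db_path) {
    std::string name;
    for (unsigned char c : db_path) {
        if (std::isalnum(c)) {
            name.push_back(static_cast<char>(c));
        }
    }
    return name;
}

// Hypothetical helper: build the connection string from a single name
// variable so the ENG value and the -n switch in START cannot diverge.
std::string build_connection_string(const std::string& db_path) {
    const std::string name = server_name_from_path(db_path);
    return "ENG=" + name + ";uid=DBA;pwd=sql;dbf=" + db_path +
           ";START=dbeng12 -ga -qi -n " + name +
           " -o " + db_path + ".log.txt";
}
```

Passing "/var/tmp/tmp.279.2eUqhK" to build_connection_string reproduces the example string shown above.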
I'd suggest adding the LOG=filename connection parameter to get some diagnostic output on the apparently failing attempts to connect to the engine. You could also add -z to the engine command line.
As Jeff has asked: Does the engine named vartmptmp2792eUqhK really start here? (Well, if not, -z won't help any further...)
FWIW, specifying both ENG and the -n in the START connection parameter is somewhat error-prone: If both are not identical, you may start a different engine than you're trying to connect to... cf. Graeme's explanation here. - Yes, I'm aware that you have set both to the same value - I'm just trying to give a hint.
We've tried the LOG= connection parameter and it doesn't appear to produce a log file either. The way the code is written guarantees that the ENG and -n parameter are the same. However, I wonder if this could be some of the problem. If we have a race condition somewhere in thread startup we might indeed end up with a server that we can't connect to despite the names being the same.
From memory, we tried the -n without the ENG= parameter and it wouldn't connect. According to that link you posted, it seems we tried it the wrong way round. If we specified ENG= and left the -n off, would we connect to the named server?
I also read somewhere that ENG= was deprecated. Is SERVER= the correct replacement?
Also of note is that if the application is compiled with gcc we don't seem to get this issue (well, it hasn't manifested yet). It seems to be under clang that we get it consistently. Sadly we can't switch to gcc because then our Objective-C code doesn't compile.
I'll give the LOG= and the -z another whirl.
Cheers, Dan
...and with this connection string:
Server=vartmptmp2rYWlPv;uid=DBA;pwd=sql;dbf=/var/tmp/tmp.2.rYWlPv;LOG=/var/tmp/tmp.2.rYWlPv.log.txt;START=dbeng12 -ga -qi -z
I get this log file:
    Tue Oct 02 2012 18:29:55
    18:29:55 Attempting to connect using: UID=DBA;PWD=**;DBF=/var/tmp/tmp.2.rYWlPv;ServerName=vartmptmp2rYWlPv;START='dbeng12 -ga -qi -z';LOG=/var/tmp/tmp.2.rYWlPv.log.txt
    18:29:55 Attempting to connect to a running server...
    18:29:55 Attempting SharedMemory connection (no sasrv.ini cached address)
    18:29:55 Failed to connect over SharedMemory
    18:29:55 No server found, attempting to run START line...
...and the process has hung. There is a dbeng12 in the task list with the correct server name.
We want it to work with shared memory. Most of the time, this does work exactly as we expect it to. This 'hang' only happens during integration testing and in the Xcode debugger (quite frequently in 4.4, not so much in 4.3). We really don't want to use TCP/IP because we're deploying SA12 as an embedded DB.
In the case of the log line above, I would expect it NOT to find the server because we've deliberately given it a unique name for this session only. So, that part I'm not worried about. The worry is that when it gets to the "attempting to run START line"... it never comes back...
My read of the above information is that the database server process is starting up ("dbeng12 in the task list"), but the "-o" log is never created, meaning there is something happening on server start-up that we haven't been able to capture (particularly if the server normally starts okay).
Running a "dtruss -f" of the process using the SQL Anywhere C API, or capturing a core file of the engine process that started would be the next step. If you haven't already, I'd highly recommend opening a technical support case so that we can help you go over this information directly.