OpenEdge processes sometimes hang or crash after receiving HANGUP or SIGUSR1

URL Name: OpenEdge-processes-hang-or-crash-after-receiving-HANGUP-or-USR1-signal
Article Number: 000182560
Environment
Product: OpenEdge
Version: 10.2B, 11.x
OS: UNIX, Linux
Question/Problem Description
An OpenEdge client process hangs or crashes after receiving a HANGUP/SIGHUP signal (sent when the shell that spawned it is shut down) or a USR1/SIGUSR1 signal (sent to force generation of a protrace file).

Session hangs after a 'kill -SIGUSR1' command is sent to the process to generate a protrace.
Session hangs after repeated SIGUSR1 signals are sent to an ABL client process with kill.

An incomplete protrace file is generated, or no protrace file at all.
A stack dump of the hanging process, obtained with gstack (part of the gdb debugger package), shows the OS localtime() and related ctime functions on the stack.

Database stalls after a self-service client deadlocks when it receives an interrupt signal at the wrong time
All other processes already connected to a database stall / hang and may terminate if they respond to signals
Any new client processes or database utilities (promon, proshut) connecting to the database may also hang while connecting, or shortly after connecting.

PROMON reports (latches, buffer queues) show that the hanging usernum still has resources locked in shared memory
Hanging OpenEdge client session has no CPU activity
The hanging process does not respond to further signals apart from kill -9 which causes an ABNORMAL shutdown


Crashing clients trigger an Abnormal Shutdown of the database when they hold latches in shared memory
A session that receives a HANGUP or SIGUSR1 request together with a simultaneous interrupt signal fails with a memory violation (49) segmentation fault. Either no protrace file is produced, or the C-stack information in the protrace shows localtime() functions.
When a process receives a SIGUSR1 request to dump its stack, or a SIGHUP to terminate, and simultaneously receives an interrupt signal such as SIGSTOP, SIGKILL, or SIGSEGV, it can crash with SYSTEM ERROR 49, which may also bring the database down.
The crashing process often leaves orphaned OpenEdge processes (_progres, _proapsv) on the system with parent PID 1 ("init").
Orphaned process does not respond to most OS signals (kill -1 / SIGHUP, kill -15 / SIGTERM, kill -2 / SIGINT, kill -3 / SIGQUIT etc.)


No signal messages are written to the database log (.lg) file; for example, there is no "HANGUP signal received. (562)" message.
Session writes the 562 message twice within the same millisecond, with no transaction backout.
Database log shows login message for the hanging process, but does not show logout or error messages for it.
 
Steps to Reproduce
1. Connect a client session and take note of the PID:
$ _progres dbname -zp

2. Send SIGUSR1 signals to that PID from a shell loop (PID 14181 in this example):
for (( i=1; i<=500 ; i++ )); do sudo /bin/kill -SIGUSR1 14181; sleep 5; done

3. After several minutes (approximately 10), the OpenEdge client session hangs, with no CPU activity.
Clarifying Information
If the process holds latches in the database's shared memory when the issue occurs, killing it (or its crashing) triggers an Abnormal Shutdown of the database. In that case there is either no protrace file for the PID, or a protrace file that lacks the C-stack dump.

gdb stack of a hanging _progres process:
 
(gdb) where
#0  0xf778b430 in __kernel_vsyscall ()
#1  0xf7579a33 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0xf751a56b in _L_lock_2223 () from /lib/libc.so.6
#3  0xf751a371 in __tz_convert () from /lib/libc.so.6
#4  0xf751868f in localtime () from /lib/libc.so.6
#5  0xf7518581 in ctime () from /lib/libc.so.6
#6  0x0861a29b in uttrace ()
#7  0x080e2e5f in drProTrace ()
#8  0x080e23fd in drSigDo1 ()
#9  0x080e2552 in drSigDispatch ()
#10 <signal handler called>
#11 0xf778b430 in __kernel_vsyscall ()
#12 0xf7558a33 in __xstat64@GLIBC_2.1 () from /lib/libc.so.6
#13 0xf751acde in __tzfile_read () from /lib/libc.so.6
#14 0xf751a248 in tzset_internal () from /lib/libc.so.6
#15 0xf751a391 in __tz_convert () from /lib/libc.so.6
#16 0xf751868f in localtime () from /lib/libc.so.6
#17 0x08619450 in utlocaltime ()
#18 0x08088290 in fdGetTime ()
#19 0x0837fee1 in fmtime_2 ()
#20 0x080928a9 in fmETIME_ ()
#21 0x080ae245 in fmeval ()
#22 0x080adf48 in fmeval ()
#23 0x083b9e70 in rnfasterifstmt ()
#24 0x083ba41f in rnexec_entry ()
#25 0x083bb813 in rninterpret ()
#26 0x080c5b7b in rnrq ()
#27 0x0807a567 in main ()
 
Error Message
Defect Number: Defect PSC00347434
Enhancement Number
Cause
When an OS signal comes in at the wrong moment, an OpenEdge process can crash or it can deadlock itself on the OS resources involved.

The signal handler, which calls backtrace, uses OS functions that are not async-signal-safe. In particular, OpenEdge calls the OS localtime() function to get the current timestamp, and the functions that localtime() calls to retrieve timezone information are not guaranteed to be safe under asynchronous signal handling.

Whether the problem appears depends partly on how the OS implements the ctime family of functions. This is why the issue may or may not occur on a given platform, and why it does not occur every time a signal is sent to a process.

For example, when the OpenEdge client receives a HANGUP signal:
  • It tries to write "HANGUP signal received. (562)" to the database log file.
  • To do so, it calls the localtime function to get the correct timestamp.
  • If this happens while the same OS functions are already on the stack, the client deadlocks itself and keeps hold of any shared-memory resources it has in use.
  • Any other process that then tries to use the same resource stalls, because the resource is never released.
  • An example of an interrupt signal arriving at the same time is described in Article SE 49 while writing "HANGUP signal received" to the lg file
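
The self-deadlock in the list above can be modeled in a few lines of C. This is a toy model under stated assumptions, not OpenEdge or C library code: `tz_lock_held` is a hypothetical stand-in for the internal lock the C library takes inside localtime() (the `__tz_convert` frames in the gdb stack), and the handler only records the conflict at the point where a real handler calling ctime() or localtime() would block forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical stand-in for the non-recursive lock a C library may take
 * inside localtime(); not the real library internals. */
static volatile sig_atomic_t tz_lock_held = 0;
static volatile sig_atomic_t would_deadlock = 0;

/* Models a handler that needs a timestamp for its log message. A real
 * handler calling localtime() would wait on the lock here and never
 * return; the model only records that the lock was already held. */
static void model_handler(int signo)
{
    (void)signo;
    if (tz_lock_held)
        would_deadlock = 1;
}

/* Models the interrupted code path: take the lock, receive SIGUSR1 at
 * the worst moment, release the lock. Returns 1 if the handler ran
 * while the lock was held, 0 if not, -1 on setup failure. */
static int simulate_interrupted_localtime(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = model_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    would_deadlock = 0;
    tz_lock_held = 1;   /* "inside localtime()" */
    raise(SIGUSR1);     /* the handler runs before raise() returns */
    tz_lock_held = 0;   /* unreachable if the handler had blocked */
    return would_deadlock;
}
```

In the model, simulate_interrupted_localtime() returns 1: the handler ran on top of the code that held the lock, so the lock holder cannot resume and release it until the handler returns. With a blocking lock, neither side can make progress.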
Resolution
Upgrade to OpenEdge 11.6.3 or later.

To avoid the deadlocks, signal handling is now delayed when a signal is received while OpenEdge internals are running functions that rely on localtime() and related functions.
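
One general way to delay signal handling around such calls is to block the signals for the duration of the call with sigprocmask(). The sketch below illustrates that technique only. The article does not describe how the OpenEdge fix is implemented, so the function names here are hypothetical and a single-threaded process is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Counts handler invocations; sig_atomic_t is safe to update in a handler. */
static volatile sig_atomic_t usr1_seen = 0;

static void count_usr1(int signo)
{
    (void)signo;
    usr1_seen++;
}

/* Installs the counting handler for SIGUSR1. Returns 0 on success. */
static int install_usr1_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = count_usr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Blocks SIGHUP and SIGUSR1 and stores the previous mask in *saved.
 * Signals sent while blocked stay pending instead of being handled. */
static int defer_signals(sigset_t *saved)
{
    sigset_t block;
    sigemptyset(&block);
    sigaddset(&block, SIGHUP);
    sigaddset(&block, SIGUSR1);
    return sigprocmask(SIG_BLOCK, &block, saved);
}

/* Restores the previous mask; a signal that became pending in the
 * meantime is delivered before this call returns. */
static int resume_signals(const sigset_t *saved)
{
    return sigprocmask(SIG_SETMASK, saved, NULL);
}
```

A caller would wrap each timestamp lookup between defer_signals(&saved) and resume_signals(&saved). A SIGUSR1 sent in between stays pending, and its handler runs only once the time functions are no longer on the stack.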

If running on Linux, upgrade to OpenEdge 11.6.4, 11.7 or later. An additional problem with asynchronous calls to malloc causes similar behavior. Refer to Article:
As a side effect of the fix, messages from signal handlers always show UTC time instead of the database's timezone.
Article:
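
The article does not say why the fix switches to UTC. One plausible reason, stated here as an assumption, is that a UTC timestamp can be derived from the epoch value by integer arithmetic alone, with no timezone lookup, lock, or memory allocation. The sketch below shows such a conversion; `format_utc` and `put_digits` are hypothetical names, and the code handles only non-negative epoch values with four-digit years.

```c
/* Writes `width` zero-padded decimal digits of `value` into dst. */
static void put_digits(char *dst, long long value, int width)
{
    for (int i = width - 1; i >= 0; i--) {
        dst[i] = (char)('0' + value % 10);
        value /= 10;
    }
}

/* Formats `t` (seconds since 1970-01-01 00:00:00 UTC, t >= 0) as
 * "YYYY-MM-DD HH:MM:SS" plus a terminating NUL. Uses integer
 * arithmetic only, so it calls nothing that takes a lock. */
static void format_utc(long long t, char buf[20])
{
    long long days = t / 86400;
    long long secs = t % 86400;

    /* Day count to civil date in the proleptic Gregorian calendar. */
    long long z = days + 719468;            /* re-base to 0000-03-01 */
    long long era = z / 146097;             /* whole 400-year cycles */
    long long doe = z - era * 146097;       /* day of era */
    long long yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
    long long doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    long long mp = (5 * doy + 2) / 153;     /* month index, March = 0 */
    long long day = doy - (153 * mp + 2) / 5 + 1;
    long long month = mp < 10 ? mp + 3 : mp - 9;
    long long year = yoe + era * 400 + (month <= 2 ? 1 : 0);

    put_digits(buf, year, 4);               buf[4] = '-';
    put_digits(buf + 5, month, 2);          buf[7] = '-';
    put_digits(buf + 8, day, 2);            buf[10] = ' ';
    put_digits(buf + 11, secs / 3600, 2);   buf[13] = ':';
    put_digits(buf + 14, secs % 3600 / 60, 2); buf[16] = ':';
    put_digits(buf + 17, secs % 60, 2);     buf[19] = '\0';
}
```

A signal handler could obtain the epoch value from time() and emit the result with write(), both of which POSIX lists as async-signal-safe, whereas localtime() and ctime() are not on that list.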
 
Workaround
Stop the database and ensure that all client processes are removed from the system before restarting.

If the database cannot be stopped normally, kill the orphaned OpenEdge client with kill -9.
That will trigger
"User <user number> died holding <Number of locks> shared memory locks. (2522)",
"User died with buffers locked. (2523)" and/or similar errors,
followed by an Abnormal Shutdown condition in the database, which will go through crash recovery when next opened.
Notes
Last Modified Date: 11/25/2020 3:08 PM