It is possible, although very rare, that the DBServer process becomes frozen and does not respond to any client (ViewX shows offline for that server, server icon on the desktop is grey and any other servers cannot connect successfully to it). In these scenarios it is likely that a database lock has not been released correctly by a database thread and so that when another database threads requires a databack lock they must wait, and wait, and wait! Database locks are part of standard ClearSCADA functionality and typically database locks last only a few milliseconds so usually do not cause any problems or delays in system operation. However, when the database has been in the frozen state for more than 10 minutes it is very unlikely that it ever recovers on its own – the DBServer process has become deadlocked. Other performance problems could appear to be similar to deadlock – including (faulty) hardware causing IO delays and memory leak issues that will result in large virtual memory usage and cause a large usage of page File and RAM transfers.
ClearSCADA 2015 R2 (v77) contains deadlock detection of a Main server from the other hot-standby servers, as part of the polling between servers. In v78 a watch dog mechanism which allows detection and automated restarting of the affected server will be available.
Usually you need to kill the DBServer.exe process using Task Manager and manually start the service for the server to return to operation. Killing the DBServer.exe process is a severe action which will lose any configuration changes or value data not yet flushed to disk or to the Standby servers. It is also possible that the database became deadlocked midway through saving the database to disk, in which case the database may also be considered corrupt at startup.
Only 'you' know your system and how it behaves under typical usage; before killing a server which is/was Main it is worth discussing what was happening on the system that could have caused the problem. Standby servers will resync from the Main server, assuming it is operating correctly, when restarted so the risk in this case is a lot less. Although, in this scenario, the urgency to recover the Standby server is also usually lower.
However, before you kill the process there are some steps to take to ensure that enough information is gathered to ensure that technical support are better able to diagnose the problem and provide resolution.
As documented in Problem Reporting - Getting the Right Information#Server Stability, Crashes, Exceptions you need to collect the database log files and snapshot files from the server that is experiencing issues, ideally at least an hour before the server froze to at least five minutes into the freeze. However, if more information either side is available more information may be better.
Additional logging, although this cannot be enabled if ClearSCADA is currently deadlocked:
- 'Read/Write Lock Diagnostics' is a ServerStatus and DBSnapshot logging feature which should be enabled (via ServerStatus | Database | Read/Write Lock Diagnostics' and right-click column headings in the main pane) which is a modest logging overhead and provides summary information about database lock usage.
- Lock (LCK) database class logging having been enabled before the freeze starts is extremely useful however is not enabled by default and when enabled will generate a lot more logging reducing the coverage of the log files. If this problem keeps occuring then on subsequent restarts ensure this LCK logging is enabled via the ServerStatus program, and if possible the number of DB log files are increased to ensure coverage is sufficient.
If the server is in a Standby state, also collect the log files from the Main server for the same time period where it is possible to ensure that any information from the Main that could have cause the Standby server to freeze is caught for analysis.
To allow for a better understanding of what the DBServer process was doing at the time of the deadlock a minidump needs to be generated for analysis. The follow section discusses three options, ranging from simple (Task Manager) through to advanced (DebugDiag).
The size of the dump will depend on the tool used and will typically be the size of the virtual bytes of DBServer.exe, depending on the options selected.
Producing a dump (MDMP or DMP) file can be achieved via the Windows Task Manager or, often more conveniently and successfully, via an external diagnostic tool created for the purpose. Windows Task Manager on a 64-bit Windows operating system will have two versions, a 32-bit and 64-bit version, and the correct one must be used for target process being dumped.
Procdump.exe is a command line application from SysInternals (http://technet.microsoft.com/en-us/sysinternals/default) that will dump a process' current details to a .mdmp file. This file can then be provided to technical support to aid in the analysis of the log files. You will also need psexec.exe, also from SysInternals, to be able to generate a minidump file for DBServer.exe.
The following process assumes you are comfortable and understand the usage of the Windows Command Prompt.
Using this process on a healthy process will cause it to lock up whilst the minidump is generated, which may take a few minutes. Do so at your own risk!
1. Copy procdump.exe and psexec.exe to a suitable location on your server. Ensure sufficient disk space exists to store a file size the size of the current virtual memory usage of DBServer. Also install a suitable file compression tool if needed.
2. Open up a Command Prompt. If you're running Windows Server 2008 R2 or Windows Server 2012 R2 then run it as administrator by right-clicking the choosing the 'Run as Administrator' option.
3. Change directories to the location that you copied the procdump and psexec executables to.
4. Using psexec run a Command Prompt as the 'SYSTEM' user. This can be done with the following command:
psexec -s cmd
5. If successful, you will likely be changed to the %WINDIR%\System32 folder. Change back to the directory with the SysInternal executables.
6. Run the following command:
procdump -accepteula dbserver.exe normal.mdmp
This should quickly generate a small minidump file of a few MB in size. The -accepteula switch will automatically accept the dialogue that would normally be shown to a user to accept before the process is ran however as you're running as the SYSTEM account there is nowhere to show the dialogue and instead the process will appear to hang. Refer to the Microsoft website for the EULA for this software, or temporarily run the procdump.exe process as your user to get the dialogue shown.
7. Run the following command:
procdump -accepteula -ma dbserver.exe full.mdmp
This will generate a much larger minidump file, basically the entire memory of DBServer.exe and depending on how much memory DBServer.exe is using may take a long time.
8. Provide the normal.mdmp file, along with the log files, to technical support for analysis.
9. Recover the server to operational usage.
10. Compress the full.mdmp (it can be compressed down to about 5% of size using certain compression methods) and store it in a safe place should it be required for analysis.
Note that some company security policies may require the full minidump files be stored in a secure location similar to ClearSCADA database backups as the data contained within that minidump may contain sensitive information such as IP addresses and user names should someone want to data mine the contents
DebugDiag will provide similar functionality to the above tools, as well as including more information within the minidump file allowing a more detailed analysis.
The tool can be attached to a problematic DBServer process and extra information collected and added to DMP file(s) once installed and can be downloaded from: Debug Diagnostic Tool v2 Update 2 (https://www.microsoft.com/en-us/download/details.aspx?id=49924)
In addition to permission to install the DebugDiag tool, consideration needs to be given to the storage of report files and the very large DMP files output. A DMP file will consist of the Virtual Memory (whole footprint of the process) and the extra information collected, likely to be many GB in size. How will these large DMP files be transferred?
Configuration / Execution
This description covers the approach of creating a rule for the DBServer process, so that the folder location of the reports and dump files can be configured. Using the simple approach of initiating memory analysis from the process list could result in a default location being used and that disk/partition being overrun with DMP file content and affecting the whole host's performance.
Open the collection tool from the Windows, Start, All Programs:
On the 'Rule' tab, 'Add Rule':
Select 'Native (non-NET) Memeory and handle Leaks':
Find the target process, 'DBServer.exe', and scroll across the columns …
To reveal the Virtual Bytes – the likely size of the DMP file:
Next configure the production of a DMP after a period of time, here a DMP will be created 30 minutes after starting the rule. We're expecting two DMP files, one after 30 minutes and then a final one on closing down the DBServer.
Next, naming the rule and configuring where the dump files will be stored – space (based upon Virtual Bytes size) and security issues will influence the location of the DMP files.
It is quite likely that the C: Drive should not be used – pick a partition/drive without a key role in the operation of the host.
The Rule can be activated immediately:
In this example, a DMP file containing information on memory allocation will be produced 30 minutes after activating the rule. That DMP will contin false-positive memory leaks because some of the processing of DBServer will be partially complete at the time the DMP is created and will complete (and release memory) after the DMP has been created.
Periodic DMP created while the process is running will:
- Not stop the process
- May contain false-positive leaks
Hence, the second approach would be to shutdown the DBServer process normally while the memory leak tracking is attached. This will produce a more accurate and useable DMP file for TRSS development, however it will be an inconvenience to the customer. The activation can be checked and stopped from the 'Process' tab:
Notice the 'Tracking' qualifier for the DBServer process, in the 'Leak Tracking Status' column.
Right-clicking the DBServer entry if you need to stop the tracking (however, it is likely a shutdown of DBServer will be used to produce another DMP and stop the collection at the same time):