The maintenance app is used for automatic and manual error recovery. Below, the functions for the maintenance user are described in more detail.
Events that fail are caught by the DCM Maintenance App and are either retried automatically or presented to a maintenance user for error recovery. Below is a schematic overview of how failed events are handled by the application.
This section gives an overview of the tabs of the maintenance GUI, covering case information and failed event handling. The left pane of the Maintenance App is used to search for cases and error situations. The right pane shows the details of one selected case.
Items in the table on the left pane of the application are clickable when the identifier is known. Clicking an item opens the details pane on the right, showing the details of that case.
All DCM Events that fail to be processed are retried automatically, as configured. When processing still fails after the last retry, the event becomes visible in the list of failed events in the maintenance app. Note that an action from the maintenance user is required to recover from the error and fix the case, which makes this an important list for the maintenance user.
Events can be restored by retrying them one by one, or all items in the list at once.
The failed events list can be seen as a worklist for the maintenance user. All items require manual attention. An empty list is an indication that the system is functioning well.
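The retry-then-fail behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the Maintenance App's actual implementation: the names `Event`, `process_with_retries` and `MAX_RETRIES` are hypothetical, and the retry limit stands in for whatever is configured.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_RETRIES = 3  # assumption: stands in for the configured retry limit


@dataclass
class Event:
    # Hypothetical, simplified representation of a DCM event.
    event_id: str
    case_id: str
    attempts: int = 0


def process_with_retries(
    event: Event,
    handler: Callable[[Event], None],
    failed_events: List[Event],
    max_retries: int = MAX_RETRIES,
) -> bool:
    """Try the handler once, then retry up to max_retries times.

    Returns True on success. After the last failed retry the event is
    appended to failed_events, where it waits for manual recovery.
    """
    for _ in range(1 + max_retries):
        event.attempts += 1
        try:
            handler(event)
            return True
        except Exception:
            continue  # a real system would typically wait before retrying
    failed_events.append(event)
    return False
```

In this sketch, an empty `failed_events` list corresponds to the "system is functioning well" situation, and every entry in it represents work for the maintenance user.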
Sometimes DCM Events have to wait for each other. In particular, while a case is locked, events that are consumed by the case engine cannot be processed until the case lock has been released. These events therefore remain in the delayed events list and wait for a case unlock event. When the case engine releases a case lock, a message is sent to the Maintenance App, and the events waiting on that specific case lock are republished automatically. In a normal situation these events do not need any manual attention, as they will be retried eventually.
The maintenance user can, however, retry individual messages (or all messages) manually if desired. If the case is still locked, the message will end up in the delayed events list again.
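The idea of parking events until a case is unlocked can be sketched as below. The class `DelayedEvents` and its methods are hypothetical and only illustrate the mechanism; the `publish` callback stands in for republishing to the message broker.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class DelayedEvents:
    """Holds events per locked case until an unlock message arrives."""

    def __init__(self, publish: Callable[[Any], None]) -> None:
        self._publish = publish  # assumption: republishes one event
        self._waiting: DefaultDict[str, List[Any]] = defaultdict(list)

    def delay(self, case_id: str, event: Any) -> None:
        # Called when an event cannot be processed because the case is locked.
        self._waiting[case_id].append(event)

    def on_case_unlocked(self, case_id: str) -> int:
        # Republish only the events that waited on this specific case lock.
        events = self._waiting.pop(case_id, [])
        for event in events:
            self._publish(event)
        return len(events)

    def waiting_for(self, case_id: str) -> List[Any]:
        return list(self._waiting.get(case_id, []))
```

A manual retry of a delayed event would amount to calling `publish` directly; if the case is still locked, the consumer would call `delay` again and the event returns to the list.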
This is an overview of all cases that are currently locked. A case lock can be caused by end users executing tasks or by automatic task execution. In a normal situation, locks are always released automatically over time (when the action is finished or aborted). A lock that was created a long time ago may be worth checking, since a lock should only last as long as the user action that caused it. The list is sorted by lock creation date, ascending.
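As an illustration of how such an overview could be used to spot locks that deserve attention, the sketch below sorts locks by creation date (oldest first) and filters those older than a threshold. The `CaseLock` type and the one-hour threshold are assumptions made for the example, not values taken from the application.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class CaseLock:
    # Hypothetical, simplified representation of a case lock.
    case_id: str
    created: datetime


def locks_overview(locks: List[CaseLock]) -> List[CaseLock]:
    """Return the locks sorted by creation date, ascending (oldest first)."""
    return sorted(locks, key=lambda lock: lock.created)


def suspicious_locks(
    locks: List[CaseLock],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),  # assumption: tune per environment
) -> List[CaseLock]:
    """Return locks older than max_age, oldest first."""
    return [lock for lock in locks_overview(locks) if now - lock.created > max_age]
```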
Cases can be searched using various case attributes:
When a list is returned, the items can be sorted by clicking on a column header.
The case details pane is designed to give an overview of a specific case. It can serve as a starting point for debugging any case-related error. The case details view consists of the following views:
Refreshing the screen reloads the data.
The Maintenance user can perform different actions to restore cases when an error has occurred.
Events that failed to be processed can be republished to the original exchange. The Retry event endpoint is triggered by the retry button, either in the overview screen or in the details screen. When the message has been republished successfully, a confirmation is shown in the UI.
The "Retry all" button republishes all events in the given list. This can be useful, for example, after fixing problems in the infrastructure: many events may have failed at once, and they can be processed again once the error has been solved. Any event that fails again, with the same error or a different one, will end up in this list again (with an increased "times seen" number).
Retrying all events is usually a safe option: events will either be processed successfully, or fail again and re-enter the same list. However, it might impact the performance or throughput of the case engine, especially when the number of events is high. The retried events are sent to the same queue as the originals, where they compete with new incoming events.
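The "Retry all" semantics can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: in the real application events are republished to an exchange and failures come back asynchronously, whereas here a hypothetical `handler` callable is invoked directly. The `FailedEvent` type and the `retry_all` function are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class FailedEvent:
    # Hypothetical entry in the failed events list.
    event: Any
    times_seen: int = 1
    last_error: str = ""


def retry_all(
    failed: List[FailedEvent],
    handler: Callable[[Any], None],
) -> List[FailedEvent]:
    """Retry every entry; return the entries that failed again.

    Entries that succeed drop out of the list. Entries that fail again
    are kept with an increased times_seen and the most recent error.
    """
    still_failed: List[FailedEvent] = []
    for entry in failed:
        try:
            handler(entry.event)
        except Exception as exc:
            still_failed.append(
                FailedEvent(entry.event, entry.times_seen + 1, str(exc))
            )
    return still_failed
```

Because every entry is either removed or returned to the same list, running this repeatedly does no harm beyond the extra load on the consumer, which matches the trade-off described above.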
The reopen task button resets a started task to "open", effectively canceling the task. This can be useful when sessions are not closed properly and the task is stuck at "started" (and cannot be finished by the user). Automatic tasks that cannot be processed for some reason will be added to the Failed events list again.
When a task has been reopened, any data that was altered during task execution (in the Blueriq context) is lost. The task cannot be completed in the old session, and should be restarted.
A failed automatic task will lead to a failed event, and could be restored by retrying the event instead of reopening the task.
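The effect of reopening a task can be pictured as a small state transition. The sketch below is only a model of the described behaviour: the `Task` type, the status strings, and the rule that only "started" tasks can be reopened are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Task:
    # Hypothetical, simplified representation of a case task.
    task_id: str
    status: str  # assumption: "open" or "started" in this sketch
    session_data: Dict[str, Any] = field(default_factory=dict)


def reopen_task(task: Task) -> Task:
    """Reset a started task to "open" and discard data from the old session."""
    if task.status != "started":
        raise ValueError(f"task {task.task_id} is not started: {task.status}")
    task.status = "open"
    task.session_data.clear()  # changes made during execution are lost
    return task
```

After the transition, the task has to be started again in a new session; nothing from the canceled execution is carried over.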
Unlock case
Unlock case will only delete the case lock from the system. This could be useful when the case-unlock-action failed for some reason. Be careful with this operation, because normally another action is the root cause for the stuck case lock. The problem should be fixed at the root cause and not at case lock level. Some examples are:
Unlock case should only be used as a last resort. Fix the problem at the root cause, since other recovery actions might not work after unlocking a case.
The DCM Maintenance App can be used to view audit logs. Please refer to Viewing audit logs in DCM Maintenance Application for more information.