This feature is part of the ongoing effort to support more DCM features natively in the Blueriq Runtime (working title "DCM 2.0").

Introduction

For DCM 2.0, Blueriq uses asynchronous messaging. Message handling can fail for various reasons. The DCM Maintenance Application gives the functional administrator the insight and control needed to maintain a DCM system of applications, and the means to recover from errors.

The DCM maintenance application:

Automatically retries a message whose handling failed
Gives insight and control over messages that still fail after the configured number of retries
Gives insight and control over crashed (stale) locks
Gives insight into overdue scheduler tasks
Gives insight into the state of a case

The DCM Maintenance Application consists of a Spring Boot service that performs the work and comes with an Angular-based user interface.

Functions of the Maintenance Application

The maintenance app is used for automatic and manual error recovery. The functions available to the maintenance user are described in more detail below.

Search cases and investigate error situations

Events will be caught by the DCM Maintenance App and retried, or presented to a maintenance user for error recovery. Below is a schematic overview of how failed events will be handled by the application.

Below, an overview is given of the different tabs of the maintenance GUI, based on case information and the failed event handling. The left pane of the Maintenance Application is used to search for cases and error situations. The right pane contains the details of one selected case.

Items in the left pane of the application are clickable when the identifier is known. Clicking an item opens the details pane on the right, with the details of the given case.

Failed events

All DCM events that fail to be processed will be retried automatically as configured. When an event still fails after the last retry, it is shown in this list of events. Action from the maintenance user is then required to recover from the error and fix the case, so this is an important list for the maintenance user.

Events can be restored by retrying them one by one, or all items in the list at once.

The failed events list can be seen as a worklist for the maintenance user. All items require manual attention. An empty list is an indication that the system is functioning well.
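
The flow described in this section (automatic retries first, then the failed events list) can be sketched as follows. This is a minimal illustrative model in Python, not the Maintenance Application's actual implementation: `MAX_RETRIES`, `Event`, `FailedEventList` and `handle_with_retries` are hypothetical names, and the real number of retries comes from configuration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_RETRIES = 3  # hypothetical value; the real retry count is configured


@dataclass
class Event:
    event_id: str
    case_id: str
    attempts: int = 0


@dataclass
class FailedEventList:
    """Worklist of events that require manual attention."""
    items: List[Event] = field(default_factory=list)


def handle_with_retries(
    event: Event,
    handler: Callable[[Event], None],
    failed: FailedEventList,
    max_retries: int = MAX_RETRIES,
) -> bool:
    """Run the handler once, plus up to max_retries automatic retries.

    Returns True when handling succeeded. When every attempt fails,
    the event is added to the failed events list and False is returned.
    """
    for _ in range(1 + max_retries):
        event.attempts += 1
        try:
            handler(event)
            return True
        except Exception:
            continue  # automatic retry
    failed.items.append(event)  # now visible to the maintenance user
    return False
```

In this model, an empty `FailedEventList` corresponds to the healthy state described above.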

Delayed events

Sometimes DCM events have to wait for each other. In particular, when a case is locked, events that are consumed by the case engine cannot be processed until the case lock has been released. These events therefore remain in this list and wait for a case unlock event. When the case engine releases a case lock, a message is sent to the Maintenance App, and the events waiting on that specific case lock are republished. In a normal situation these events do not need any manual attention, as they will be retried eventually.

The maintenance user can, however, retry any message (or all messages) manually if desired. When the case is still locked, the message will end up in the delayed events list again.
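
The interplay between case locks and delayed events can be modelled roughly as below. This is a simplified in-memory sketch, assuming events are identified by plain strings; the class `DelayedEvents` and its method names are hypothetical and do not reflect the real service or its messaging setup.

```python
from collections import defaultdict
from typing import Dict, List, Set


class DelayedEvents:
    """Parks events for locked cases and republishes them on unlock."""

    def __init__(self) -> None:
        self.locked_cases: Set[str] = set()
        self.waiting: Dict[str, List[str]] = defaultdict(list)
        self.published: List[str] = []  # stands in for the message exchange

    def lock(self, case_id: str) -> None:
        self.locked_cases.add(case_id)

    def publish(self, case_id: str, event_id: str) -> None:
        """Publish the event, or park it while the case is locked."""
        if case_id in self.locked_cases:
            self.waiting[case_id].append(event_id)
        else:
            self.published.append(event_id)

    def on_case_unlocked(self, case_id: str) -> None:
        """Called when the case engine reports that the lock is released."""
        self.locked_cases.discard(case_id)
        for event_id in self.waiting.pop(case_id, []):
            self.publish(case_id, event_id)

    def retry(self, case_id: str, event_id: str) -> None:
        """Manual retry; the event is parked again if the case is still locked."""
        self.waiting[case_id].remove(event_id)
        self.publish(case_id, event_id)
```

A manual `retry` while the case is still locked leaves the event in `waiting`, which mirrors the behaviour described above.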

Case locks

This is an overview of all cases that are currently locked. A lock can be caused by end users executing tasks or by automatic task execution. In a normal situation, locks are always released automatically over time (when the action is finished). A lock that was created long ago may be worth checking, since a lock should only last as long as the user action. The list is sorted by lock creation date, ascending.
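
As an illustration of how such an overview could be used to spot stale locks, the sketch below sorts locks by creation date and flags those older than a threshold. The `CaseLock` type, the function names and the one-hour `max_age` are assumptions made for this example only; the application itself does not necessarily apply a threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class CaseLock:
    case_id: str
    created: datetime


def oldest_first(locks: List[CaseLock]) -> List[CaseLock]:
    """Sort locks by creation date, ascending, as in the overview."""
    return sorted(locks, key=lambda lock: lock.created)


def possibly_stale(
    locks: List[CaseLock],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),  # hypothetical threshold
) -> List[CaseLock]:
    """Return locks older than max_age, oldest first."""
    return [lock for lock in oldest_first(locks) if now - lock.created > max_age]
```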

Cases

Cases can be searched using various case attributes:

process id, returns the case with the given process id, which corresponds with the case id in the process SQL database
task id, returns the case belonging to the specific task id
dossier id, returns the case belonging to the dossier aggregate id
metadata id, returns the case belonging to the metadata aggregate id
last updated date, returns all cases last updated during the given period. Notice that only one of the two parameters is required for a search
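
The "last updated" search with an optional bound can be pictured as follows. This sketch assumes that the two parameters are the start and end of the period and that both bounds are inclusive; neither detail is confirmed by this page, and `Case` and `updated_between` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Case:
    process_id: str
    last_updated: datetime


def updated_between(
    cases: List[Case],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> List[Case]:
    """Filter cases on last update time; at least one bound is required."""
    if start is None and end is None:
        raise ValueError("at least one of start or end is required")
    return [
        case
        for case in cases
        if (start is None or case.last_updated >= start)
        and (end is None or case.last_updated <= end)
    ]
```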

When a list is returned, it is possible to sort the items by clicking on the column.

Case details

The case details pane is designed to give an overview of a specific case. This can be a starting point for debugging any case-related error. The case details pane consists of the following views:

Details, containing details stored by the case engine; these might be used to correlate IDs to the database records
Lock information (when present)
Failed events filtered on this case
Delayed events filtered on this case
Case profile as stored in the process SQL store (the active profile state in the process database)
Scheduled jobs, this overview shows all scheduled events for this case. Events whose scheduled time lies in the past turn red; they should be waiting in the queue
Tasks as stored in the process SQL store, which might be useful to check the state of the case

Maintenance actions

The Maintenance user can perform different actions to restore cases when an error has occurred.

Retry message

Messages that failed to be processed can be republished to the original exchange. The Retry message endpoint can be triggered by the button, either in the overview screen or in the details screen. When the message has been republished successfully, a confirmation will be visible in the UI.

Retry all messages

The "Retry all" button will republish all messages in the given list. This could for example be useful after fixing problems in the infrastructure. Many events could have failed at once, and it should be possible to process them after the error has been solved. Any message that fails again for the same reason will be re-added to the same list.

Retrying all events is mostly a safe option: events will either be processed successfully, or fail again and re-enter the same list. However, this might impact the performance or throughput of the case engine, especially when the number of messages is high.
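
The outcome of "Retry all" can be illustrated with the sketch below. It deliberately collapses republishing and processing into a single synchronous call, which is a simplification: in the real system messages are republished to an exchange and processed asynchronously. The function `retry_all` and the `republish` callback are hypothetical.

```python
from typing import Callable, List


def retry_all(
    failed_events: List[str],
    republish: Callable[[str], None],
) -> List[str]:
    """Retry every event in the list and return the new failed events list.

    Events that are processed successfully leave the list; events that
    fail again are kept, in their original order.
    """
    still_failed: List[str] = []
    for event_id in list(failed_events):  # iterate over a copy
        try:
            republish(event_id)
        except Exception:
            still_failed.append(event_id)
    return still_failed
```

Because every event ends up either processed or back in the list, repeating the call is harmless apart from the extra load it places on the case engine.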

Reopen Task

The Reopen task button resets the started task to "open", effectively canceling the task. This can be useful when sessions are not closed properly and the task is stuck at started (and cannot be finished by the user). An automatic task that cannot be processed for some reason will result in an event in the failed events list.

When a task has been reopened, any data that was changed during task execution (in the Blueriq context) is lost. The task cannot be completed, and should be restarted. Automatic tasks will be republished to the queue to be processed again.
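
The effect of reopening a task can be summarised in a small state model. The sketch below is illustrative only: the `Task` type, its `session_data` field and the `queue` list are hypothetical stand-ins for the Blueriq context and the message queue.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    task_id: str
    automatic: bool
    state: str = "open"
    # Stand-in for data changed during task execution
    session_data: Dict[str, str] = field(default_factory=dict)


def reopen_task(task: Task, queue: List[str]) -> None:
    """Reset a started task to "open", discarding in-progress changes."""
    if task.state != "started":
        raise ValueError(f"task {task.task_id} is not started")
    task.session_data.clear()  # data changed during execution is lost
    task.state = "open"
    if task.automatic:
        queue.append(task.task_id)  # automatic tasks are republished
```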

Unlock case

Unlock case only deletes the case lock from the system. This could be useful when the case unlock action failed for some reason. Be careful with this operation, because normally another action is the root cause of the stuck case lock. The problem should be fixed at the root cause and not at case lock level. Some examples:

A case is locked due to a failed DCM event: fix the DCM event, and the lock will be cleared as part of this action
A case is locked due to a started task that has not been finished properly: use the "reopen task" endpoint to reset the state of the task, and the case lock will be removed as part of this action

Unlock case is a last resort. Fix the problem at the root cause, since other recovery actions might not work after unlocking a case.

Viewing audit logs

The DCM Maintenance App can be used to view audit logs. Please refer to Viewing audit logs in DCM Maintenance Application [editor] for more information.
