You are viewing the documentation for Blueriq 17. Documentation for other versions is available in our documentation directory.

Introduction

In the DCM Architecture, Blueriq relies on asynchronous messaging. Message handling can fail for various reasons. The DCM Maintenance Application gives the functional administrator the insight and control needed to maintain a DCM system of applications, and the means to recover from errors.

The DCM maintenance application:

  • Automatically retries messages whose handling failed
  • Gives insight and control over messages that still fail after the configured number of retries
  • Gives insight and control over crashed (stale) locks
  • Gives insight into overdue scheduler tasks
  • Gives insight into the state of a case

The DCM Maintenance Application consists of a Spring Boot service that performs the work, and comes with an Angular-based user interface. For information about the installation, please refer to: DCM Maintenance Application installation


Functions of the Maintenance Application

The maintenance app is used for automatic and manual error recovery. Below, the functions for the maintenance user are described in more detail.

Search cases and investigate error situations

Failed events are caught by the DCM Maintenance App and are either retried automatically or presented to a maintenance user for error recovery. Below is a schematic overview of how failed events are handled by the application.

Below, an overview is given of the different tabs of the maintenance GUI, based on case information, and the failed event handling. The left pane of the Maintenance application is used to search for any case and error situation. The right pane contains the details for one selected case.

Items in the table on the left pane of the application are clickable when the identifier is known. This will open the details pane on the right, with the details of the given case.


Failed events

All DCM Events that fail to be processed are retried automatically, as configured. When processing still fails after the last retry, the event appears in the list of failed events in the maintenance app. An action from the maintenance user is then required to recover from the error and fix the case, which makes this an important list for the maintenance user.
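The retry-then-park behaviour described above can be illustrated with a minimal sketch. This is not the Maintenance App's implementation: the `Event` class, `handle_with_retries`, and the `max_retries` parameter are hypothetical names chosen for illustration, and the real service works with asynchronous messages rather than direct function calls.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stand-in for a DCM event."""
    event_id: str
    attempts: int = 0


def handle_with_retries(event, handler, max_retries, failed_events):
    """Try the handler once plus up to max_retries retries.

    On success return True. If every attempt fails, append the event
    to failed_events (the maintenance user's worklist) and return False.
    """
    for _ in range(max_retries + 1):
        event.attempts += 1
        try:
            handler(event)
            return True
        except Exception:
            continue  # assumption: any exception counts as a failed attempt
    failed_events.append(event)
    return False
```

With `max_retries=2`, a handler that always raises is attempted three times before the event lands in the failed events list.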

Events can be restored by retrying them one by one, or all items in the list at once.

The failed events list can be seen as a worklist for the maintenance user. All items require manual attention. An empty list is an indication that the system is functioning well.

Delayed events

Sometimes DCM Events have to wait on each other. In particular, when a case is locked, events that are consumed by the case engine cannot be processed until the case lock has been released. These events remain in this list and wait for a case unlock event. When the case engine releases a case lock, a message is sent to the Maintenance App, and the events waiting on that specific case lock are republished automatically. In a normal situation these events therefore do not need manual attention, as they will be retried eventually.

The maintenance user can, however, retry any message (or all messages) manually if desired. When the case is still locked, the message will end up in the delayed events list again.
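As a rough sketch of this mechanism, the following in-memory model parks events for locked cases and republishes them on unlock. The class and method names are hypothetical; in the real application, locks and events live in the case engine and the message broker.

```python
from collections import defaultdict


class DelayedEvents:
    """Hypothetical in-memory model of the delayed events list."""

    def __init__(self):
        self.locked_cases = set()
        self.waiting = defaultdict(list)  # case id -> events waiting on the lock
        self.published = []               # events handed on for processing

    def lock(self, case_id):
        self.locked_cases.add(case_id)

    def publish(self, case_id, event):
        # An event for a locked case is delayed instead of processed.
        if case_id in self.locked_cases:
            self.waiting[case_id].append(event)
        else:
            self.published.append(event)

    def on_case_unlocked(self, case_id):
        # The unlock message triggers automatic republishing.
        self.locked_cases.discard(case_id)
        self.retry_manually(case_id)

    def retry_manually(self, case_id):
        # If the case is still locked, the events are delayed again.
        for event in self.waiting.pop(case_id, []):
            self.publish(case_id, event)
```

A manual retry while the case is still locked simply moves the events back into `waiting`, which mirrors the behaviour described above.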

Case locks

This is an overview of all cases that are currently locked. A case lock can be caused by end users executing tasks or by automatic task execution. In a normal situation, locks are released automatically when the action is finished or aborted. A lock that was created long ago may be worth checking, since a lock should only exist for as long as the action that created it. The list is sorted by lock creation date, ascending.
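To illustrate how such an overview could be used to spot suspicious locks, the sketch below sorts locks by creation date and filters those older than a threshold. The dictionary keys (`case_id`, `created`) and the one-hour threshold are assumptions made for the example, not part of the product.

```python
from datetime import datetime, timedelta


def locks_oldest_first(locks):
    """Sort locks by creation date ascending, like the overview."""
    return sorted(locks, key=lambda lock: lock["created"])


def locks_older_than(locks, now, max_age):
    """Return locks (oldest first) whose age exceeds max_age."""
    return [
        lock for lock in locks_oldest_first(locks)
        if now - lock["created"] > max_age
    ]


now = datetime(2024, 1, 1, 12, 0)
locks = [
    {"case_id": "case-2", "created": now - timedelta(minutes=5)},
    {"case_id": "case-1", "created": now - timedelta(hours=6)},
]
# Assumed threshold: locks older than one hour deserve a closer look.
candidates = locks_older_than(locks, now, timedelta(hours=1))
```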

Cases

Cases can be searched using various case attributes:

  • case id, returns the case with a given case id, corresponds to the case identifier in the cases collection (case-engine)
  • process id, returns the case with the given process id, which corresponds with the case id in the process SQL database
  • task id, returns the case belonging to the specific task id
  • dossier id, returns the case belonging to the dossier aggregate id
  • metadata id, returns the case belonging to the metadata aggregate id
  • last updated date, returns all cases last updated during the given period. Notice that only one of the two parameters (start or end of the period) is required for a search
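The "only one of the two parameters is required" rule for the period search can be sketched as follows. The function name, the `last_updated` key, and the choice of inclusive bounds are assumptions for illustration only.

```python
from datetime import datetime


def cases_updated_in_period(cases, start=None, end=None):
    """Filter cases on last update time; at least one bound must be given."""
    if start is None and end is None:
        raise ValueError("provide a start, an end, or both")
    return [
        case for case in cases
        if (start is None or case["last_updated"] >= start)
        and (end is None or case["last_updated"] <= end)
    ]


cases = [
    {"case_id": "a", "last_updated": datetime(2024, 1, 10)},
    {"case_id": "b", "last_updated": datetime(2024, 2, 10)},
    {"case_id": "c", "last_updated": datetime(2024, 3, 10)},
]
# Only a start date: everything updated on or after 1 February.
recent = cases_updated_in_period(cases, start=datetime(2024, 2, 1))
```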

When a list is returned, it is possible to sort the items by clicking on the column.

Case details

The case details pane is designed to give an overview of a specific case. This could be a starting point of debugging any case related error. The case details view consists of the following views:

  • Details, contains details stored by the case engine (source MongoDB). These might be used to correlate IDs to the database records
  • Lock information (when present)
  • Failed events filtered on this case (when present)
  • Delayed events filtered on this case (when present)
  • Case process profile as stored in the process SQL store (the active profile state in the process database)
  • Scheduled jobs, this overview shows all scheduled events for this case. Events whose scheduled time is already in the past are shown in red: either they are still waiting in the queue, or something went wrong
  • Tasks as stored in the process SQL store. It might be useful to check the state of the case when debugging
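The overdue check behind the red highlighting of scheduled jobs amounts to comparing each scheduled time with the current time. A minimal sketch, assuming hypothetical `job_id` and `scheduled_at` fields:

```python
from datetime import datetime, timedelta


def overdue_jobs(jobs, now):
    """Return the jobs whose scheduled time lies before now."""
    return [job for job in jobs if job["scheduled_at"] < now]


now = datetime(2024, 1, 1, 12, 0)
jobs = [
    {"job_id": "timer-1", "scheduled_at": now - timedelta(minutes=30)},
    {"job_id": "timer-2", "scheduled_at": now + timedelta(hours=2)},
]
overdue = overdue_jobs(jobs, now)  # only timer-1 would be shown in red
```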

Refreshing the screen will cause the data to be refreshed.

Maintenance actions

The Maintenance user can perform different actions to restore cases when an error has occurred.

Retry event

Events that failed to be processed can be republished to the original exchange. The Retry event endpoint can be triggered by the button, either in the overview screen or in the details screen. When the message has been republished successfully, a confirmation will be visible in the UI.

Retry all events

The "Retry all" button republishes all events in the given list. This can be useful, for example, after fixing problems in the infrastructure: many events may have failed at once, and they should be processable again once the error has been solved. Any event that fails again, with the same error or a different one, will end up in this list again (with an increased "times seen" number).

Retrying all events is usually a safe option: events will either be processed successfully, or fail again and re-enter the same list. However, it might impact the performance or throughput of the case engine, especially when the number of events is high. The retried events are sent to the same queue as the originals, where they compete with new incoming events.
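The effect of "Retry all" on the list can be sketched as below. This is a simplified, synchronous model: `retry_all`, the `handler` callback, and the `times_seen` key are hypothetical, whereas the real application republishes messages to the original exchange and learns about failures asynchronously.

```python
def retry_all(failed_events, handler):
    """Retry every event; return the events that failed again.

    Events that fail again keep their place in the worklist with an
    incremented times_seen counter. Successful events drop out.
    """
    still_failing = []
    for event in failed_events:
        try:
            handler(event)
        except Exception:
            event["times_seen"] += 1
            still_failing.append(event)
    return still_failing


def handler(event):
    # Assumed example: only events for case "c2" still hit an error.
    if event["case_id"] == "c2":
        raise RuntimeError("still broken")


worklist = [
    {"case_id": "c1", "times_seen": 1},
    {"case_id": "c2", "times_seen": 1},
]
worklist = retry_all(worklist, handler)
```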

Reopen Task

The reopen task button resets a started task to "open", effectively canceling the task. This can be useful when sessions are not closed properly and the task is stuck in the started state (and cannot be finished by the user). Automatic tasks that cannot be processed for some reason will be added to the Failed events list (again).

When a task has been reopened, any data that was altered during task execution (in the Blueriq context) is lost. The task cannot be completed in the old session, and should be restarted.

A failed automatic task will lead to a failed event, and could be restored by retrying the event instead of reopening the task.


Unlock case

Unlock case will only delete the case lock from the system. This could be useful when the case-unlock-action failed for some reason. Be careful with this operation, because normally another action is the root cause for the stuck case lock. The problem should be fixed at the root cause and not at case lock level. Some examples are:

  • A case is locked due to a failed DCM event. Fix and retry the DCM event; the lock will be cleared as part of this action
  • A case is locked due to a started task that has not been finished properly. Use the "reopen task" endpoint to reset the state of the task; the case lock will be removed as part of this action

Unlock case should only be used as a last resort. Fix the problem at the root cause, since other recovery actions might not work after unlocking a case.


Schedule oldest open automatic task

This button tells the case engine to schedule the oldest open automatic task of the specific case for execution. It can happen that something goes wrong in the case engine; for example, a message event to start an automatic task could be lost. Normally you would need to wait for the case engine to complete an action, such as completing a task, before the message event is resent. With this button you can schedule the task directly and don't have to wait.
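Selecting "the oldest open automatic task" can be pictured as a filter followed by a minimum on creation time. The field names (`status`, `automatic`, `created`) are assumptions for this sketch and do not reflect the actual process store schema.

```python
from datetime import datetime


def oldest_open_automatic_task(tasks):
    """Return the oldest task that is open and automatic, or None."""
    candidates = [
        task for task in tasks
        if task["status"] == "open" and task["automatic"]
    ]
    return min(candidates, key=lambda task: task["created"], default=None)


tasks = [
    {"task_id": "t1", "status": "started", "automatic": True, "created": datetime(2024, 1, 1)},
    {"task_id": "t2", "status": "open", "automatic": True, "created": datetime(2024, 1, 3)},
    {"task_id": "t3", "status": "open", "automatic": True, "created": datetime(2024, 1, 2)},
    {"task_id": "t4", "status": "open", "automatic": False, "created": datetime(2023, 12, 1)},
]
selected = oldest_open_automatic_task(tasks)
```

Here `t3` is selected: `t1` is not open, `t4` is not automatic, and `t2` was created later.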
