Even the most innovative AI models can achieve nothing on their own without the right training data - and productive trainers.
No Pain - No Gain? Programming is no longer necessary - but training is
The inserve team has already achieved its goal of covering almost all necessary processing steps in its own Intelligent Document Processing platform with adaptive AI modules (e.g. batch separation, document classification, information extraction). This massively reduces the effort required for automation and increases adaptability.
In addition, possibilities have emerged that were previously not feasible through programming and certainly not manually, such as finding clusters that are as homogeneous as possible across hundreds of thousands of pages. These even work unsupervised (without labels).
For some steps, however, training by human specialists is of course necessary. Even if the training requirements and performance of the AI models have already been optimized, the training process must still be designed as efficiently as possible:
- How can the most informative training examples be identified?
- How can unbalanced distributions (across many classes) be covered?
- How can the user be guided through the training process as comfortably as possible?
- How does the user know how well the model they are training already generalizes?
Human in the Loop & Confidence
The Human-in-the-Loop principle provides the right framework for this by using the confidence of the AI models' predictions to filter out uncertain predictions and refer them to the human specialist, both during initial training and in ongoing productive use. Active Learning takes this principle to the extreme by automatically guiding the user through the training.
As few training examples as possible, but as informative as possible
The basic idea of Active Learning consists of the following iterative approach:
- Find a manageable amount of training examples that are as informative as possible.
- Allow the user to label them as easily and quickly as possible.
- Retrain the AI model completely or incrementally.
- Decide - automated or by the user - if the training is sufficient.
- If not, start again from step one (see above).
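The loop above can be sketched in a few lines of Python. Everything here is an illustrative assumption - the synthetic data, the scikit-learn classifier, and the least-confidence selection - not the inserve platform's actual implementation:

```python
# Minimal sketch of an Active Learning loop with least-confidence sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 0: start with a small labeled seed set; the rest is the unlabeled pool.
labeled = list(rng.choice(len(X), size=10, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for iteration in range(5):
    # 1. Retrain on everything labeled so far (full retrain, not incremental).
    model.fit(X[labeled], y[labeled])
    # 2. Score the unlabeled pool by uncertainty (1 - top class probability).
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)
    # 3. Select a manageable batch of the most informative examples.
    batch = np.argsort(uncertainty)[-10:]
    picked = [unlabeled[i] for i in batch]
    # 4. The human labels them; here the oracle is the known ground truth y.
    labeled.extend(picked)
    unlabeled = [i for i in unlabeled if i not in picked]
    # 5. Decide (automated or by the user) whether training suffices.
```

Each iteration adds only the examples the model is least sure about, instead of a random batch.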
Active Learning Procedure (simplified)
Relatively simple in principle, but there are still a few pitfalls to solve in the details:
What are the most informative examples?
As always, research and the literature offer different approaches, which are more or less intuitively comprehensible and sometimes combined with each other:
- Confidence-based: For which examples is the model most uncertain?
- Committee-based: On which examples do several different models disagree the most?
- Density-based: In which regions of the data space are the fewest examples present so far?
- Diversity-based: Which data points are most different from those already available?
- Change-based: Where does the model change the most?
- Learner-based: Another model estimates which examples provide the greatest performance improvement.
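The confidence-based family from the list above can be illustrated with three common scores. The function names are our own, and all three are defined here so that a higher value means a more informative example:

```python
# Three common confidence-based informativeness scores (higher = more informative).
import numpy as np

def least_confidence(proba):
    # How far the top class probability is from certainty.
    return 1.0 - proba.max(axis=1)

def margin_score(proba):
    # Small gap between the two most likely classes means the model wavers.
    s = np.sort(proba, axis=1)
    return 1.0 - (s[:, -1] - s[:, -2])

def entropy_score(proba):
    # Entropy of the full predicted distribution.
    p = np.clip(proba, 1e-12, None)
    return -(p * np.log(p)).sum(axis=1)

proba = np.array([[0.90, 0.05, 0.05],   # confident prediction
                  [0.40, 0.35, 0.25]])  # uncertain prediction
```

All three scores rank the second (uncertain) row higher, but they can diverge on multi-class distributions, which is why they are sometimes combined.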
However, as with all fascinating ideas, practical relevance - processing speed and real, measurable benefits - must always be kept in mind, so care is required in selection and implementation.
How fast is the technical processing?
On the one hand, the process must be as fast as possible for the user; on the other hand, all the steps mentioned cost processing time: the selection (sampling) must check a sufficient number of data points and then pick particularly informative ones. Depending on the model, training can be done incrementally or the model must be completely retrained; the more complex the model, the more time this requires.
What is the optimal interaction for the user?
Ideally, the application automatically presents the selected examples one after the other in such a way that the trainer can evaluate them as well as possible and assign the correct label as ergonomically as possible. The optimal interaction varies greatly with the task at hand - classifying documents or information on a page, separating batches, recognizing entities.
When was enough training done?
In Active Learning, the user (or the system) must be able to judge when, relative to the totality of the data (document collections or streams), the training appears sufficiently good, or when the efficiency (performance improvement per training example) drops too low.
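One simple, hypothetical stopping rule along these lines: stop when the performance gain per newly labeled example falls below a threshold. All function names, numbers, and thresholds here are invented for illustration:

```python
# Stop when the metric gain per newly labeled example becomes too small.
def should_stop(score_history, batch_size, min_gain_per_example=1e-3):
    """score_history: metric (e.g. MCC) measured after each iteration."""
    if len(score_history) < 2:
        return False  # need at least two measurements to compute a gain
    gain = score_history[-1] - score_history[-2]
    return gain / batch_size < min_gain_per_example

# Example trajectory: strong early gains that flatten out.
mcc_per_iteration = [0.42, 0.55, 0.63, 0.66, 0.665]
stops = [should_stop(mcc_per_iteration[:i + 1], batch_size=10)
         for i in range(len(mcc_per_iteration))]
```

Real systems typically smooth the metric over several iterations first, since a single noisy measurement should not end the training.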
Concrete practical example: document classification
For clarity, we deliberately chose a document set with many (43), very unbalanced, and in some cases difficult to distinguish business transaction types (classes). Different maximum proportions of the documents (80% or 20%) were used for training, with the rest kept as fixed validation data. The results with the full training data are already excellent, but in the Active Learning simulation only the examples selected per iteration are used for training.
Faster training results
Development of MCC for max. 80% / 20% training data and different sampling strategies
As a comparative reference, purely random sampling was compared with a confidence-based strategy using the primary metric MCC. One measurement was made on the data from the training stock not yet labeled by the trainer ("remaining") and one on the fixed validation stock.
In all cases, the performance of confidence-based sampling is slightly to significantly better. Most importantly, the maximum achievable performance is approached much earlier. In practice, this means much faster training success!
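For reference, MCC (Matthews correlation coefficient) is well suited to unbalanced multi-class problems because it accounts for all cells of the confusion matrix. A minimal scikit-learn example with invented labels:

```python
# MCC on a small invented multi-class prediction (values in [-1, 1]).
from sklearn.metrics import matthews_corrcoef

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]  # 6 of 8 correct
mcc = matthews_corrcoef(y_true, y_pred)
```

Unlike accuracy, MCC stays near zero for a classifier that simply predicts the majority class, which matters with 43 unbalanced classes.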
When was enough training done?
In reality, of course, labels are not available for all data, but cross-validation on the available training data can be used to estimate the achieved performance.
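Such a cross-validation estimate can be sketched with scikit-learn; the synthetic data and classifier are illustrative, with MCC as the scorer as in the text:

```python
# Estimate current model quality via 5-fold CV on the labeled data, scored by MCC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_val_score

# Stand-in for "the examples labeled so far".
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring=make_scorer(matthews_corrcoef))
cv_mcc = scores.mean()
```

Note that when the labeled pool was assembled by confidence-based sampling, it over-represents hard examples - which is exactly why this CV estimate tends to be pessimistic, as discussed next.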
Comparison of actual with measurable performance
This visualization shows that the cross-validation (CV) MCC systematically underestimates the actual performance of the confidence strategy almost throughout the training progress (we observed this for all the variants above). Intuitively, this is understandable, since this strategy selects precisely the examples with the most uncertain predictions in each iteration.
This can be addressed by considering the increase in model performance, evaluating on additional randomly selected examples, tracking the development of confidence, or a combination of these approaches.
Higher confidence and different distribution
Last but not least, two aspects are interesting:
Naturally, with confidence-based sampling, the confidence on the documents of a fixed stock not seen by the trainer increases much faster and reaches higher values:
Comparison of confidence development on unseen data (chance/confidence)
Furthermore, the distribution across the classes also differs: the more frequent classes are sampled more selectively, while the less frequent ones receive (as far as possible) a systematically higher share:
Comparison of class distribution with different training progress (random/confidence)
The Human-in-the-Loop principle and Active Learning show once again that a pure focus on (complex) AI models does not always make sense for practical applications. The decisive factor is the integration into an overall concept consisting of efficient training mechanisms, ergonomic user interfaces, and scalable, high-performance execution of training and inference.
The inserve team is therefore currently integrating Active Learning concepts into its platform in order to further expand the platform's added value and competitive advantages in processing complex document inventories and streams.