This page explains and illustrates the HALFADO algorithm. HALFADO is a computationally feather-light yet provably powerful algorithm for detecting suspicious events in an information stream, based on a few examples of human judgement. That is, assume we have a human expert who can inspect a few cases thoroughly, while current settings produce a steady stream of, say, 1000 cases per second. What can be done without having to invest in serious hardware or in extended annotation efforts?
The HALFADO algorithm works as follows (see inset). We start with an active set A containing all m experts. Each expert can judge a transaction as ‘suspicious’ or ‘normal’. When progressing through the information flow (for t=1,2,3, …), each passing transaction is predicted as ‘suspicious’ or ‘not suspicious’ based on the majority vote of the experts in A.
After manual inspection of only those transactions deemed suspicious, this prediction can be judged accurate or wrong (a mistake). If a mistake was made, then the majority of the experts in A got it wrong, and those experts can hence be removed from A. This leaves us with an A less than half the size of the previous one (‘halving’). One does this for each transaction in the flow on which at least one of the experts in the current A alarms. Hence, assuming a perfect expert is among the m, the number of mistakes is at most of the order of log(m). This in turn implies that the number of false alarms is bounded.
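The halving scheme above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' released code; the interfaces assumed here (experts as callables returning True for ‘suspicious’, and `inspect` as the human judgement) are assumptions made for the sake of the example.

```python
def halfado(stream, experts, inspect):
    """Illustrative sketch of the halving scheme described above.

    experts : list of callables mapping a transaction to True
              ('suspicious') or False ('normal').
    inspect : the human judgement, consulted only for transactions
              on which at least one active expert alarms.
    """
    active = list(experts)      # the active set A, initially all m experts
    mistakes = 0
    for tx in stream:
        votes = [e(tx) for e in active]
        if not any(votes):      # nobody alarms: no manual inspection needed
            continue
        prediction = sum(votes) > len(active) / 2   # majority vote of A
        truth = inspect(tx)     # manual inspection of this case
        if prediction != truth:
            # the majority of A got it wrong: keep only the experts
            # that agreed with the human judgement ('halving')
            mistakes += 1
            active = [e for e, v in zip(active, votes) if v == truth]
    return active, mistakes
```

Each mistake removes at least half of A, so with m experts, one of them perfect, at most about log2(m) mistakes can occur before A contains only experts consistent with the human judgements seen so far.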
- Description of a semi-supervised learning algorithm that can run on very modest hardware, able to handle high-frequency streams of information.
- Use of the expert-framework, allowing explicitly for re-use of existing expert systems (‘legacy solutions’) in this context.
- Theoretical guarantee of performance under reasonable assumptions.
- Two case studies: (1) one in the area of FinTech for real-time payment systems, and (2) one on the detection of hate speech in social media flows.
- 5-15 minutes with the manuscript suffices to ‘get it’, and theoretical results are immediate (‘do not require advanced mathematical tools nor access to highly specialised experts’).
- This work directly answers industrial and academic demands for such an approach.
- The experiments are performed on modest computational hardware, and do not require investments in, or assistance from, large technology companies. This goes against the current trend in AI/ML holding that advanced machine learning requires advanced investments, hence excluding many.
- The approach is semi-supervised, alleviating the need for expensive annotation: the algorithm only asks for labels on the examples it judges relevant. This ensures human oversight without sacrificing the computational benefits. Semi-supervised learning combines human and computational intelligence for a common goal, and can result in a massive saving of resources. This ensures readiness for future regulation regarding advanced automation.
- The algorithm (dubbed a ‘smart filter’) adapts to human judgement as collected while progressing through the stream. This information is passed by example, rather than by full description. Moreover, one only has to inspect (‘label’) examples for which the algorithm found some evidence of being fraudulent, with no effort wasted on examples that are clearly normal.
- The filter does not need to be pre-trained but can be used from day zero.
Below we provide the resources for reproducing the two reported case studies.
Case 1: Detecting hate speech in social media flows.
This study works with text messages collected from Twitter, using the public standard search API.
Python code to reproduce the experiment is here. Unfortunately, we cannot link to the text files we used in our experiments, as that would violate privacy regulations. Instead, our example retrieves real-time tweets and processes them on-the-fly. This will not reproduce the reported computational speeds, since throughput depends on how fast text messages can be retrieved over your internet connection. For more information about the standard API as used, see here.
After unzipping the content, type in a terminal
$ python tweets.py
For comparison, we also include code for the performance of single-word-detectors based on the words in the Oxford 3000 list. To run this example, type in a terminal
$ python singleworddetectors.py
The results are written to a CSV file called ‘tweets.csv’.
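As a rough illustration of what a single-word detector could look like, the sketch below builds one expert per vocabulary word, alarming exactly when that word occurs in a message. The interface and tokenisation are assumptions for this example; the released singleworddetectors.py and the Oxford 3000 list are the authoritative source.

```python
import re

def make_word_detector(word):
    """One expert per vocabulary word: alarm ('suspicious') iff the word
    occurs in the message. Illustrative sketch only; the released
    singleworddetectors.py may tokenise differently."""
    word = word.lower()
    def detector(message):
        # crude tokenisation: lowercase alphabetic runs
        return word in re.findall(r"[a-z']+", message.lower())
    return detector

# a tiny vocabulary stands in for the full Oxford 3000 list
experts = [make_word_detector(w) for w in ("attack", "hate", "weather")]
print([e("I hate this weather") for e in experts])   # [False, True, True]
```

Each such detector can serve directly as one of the m experts in the active set A.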
Case 2: Detecting fraud in a TMS in FinTech.
This study builds a detector of fraud in a Transaction Monitoring System (TMS). This TMS handles a high-frequency stream of financial transactions. The Python code for this example is tx2. Again, the reported computational speeds are reduced in this experiment by the (computationally involved) pipe operation in the terminal. After unzipping, the example is run by typing in a terminal
$ python generateTx.py 1000 | python halfado.py
(Do replace 1000 with the desired maximum number of transactions; stop the experiment at any time by hitting CTRL-C.) The implementation based on a fixed file is here. This implementation achieves the reported computational speeds (e.g. 60,000 tx/second), and can be started by unzipping and typing in a terminal
$ python halfado.py
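The pipe in the first variant means the filter consumes transactions line by line from standard input, which is what slows it down relative to the fixed-file version. A minimal sketch of that consumer side (illustrative only; the released halfado.py may parse its input differently):

```python
import sys

def transactions_from_stdin():
    """Yield one transaction per line from the pipe, as in
    'python generateTx.py 1000 | python halfado.py'.
    Illustrative sketch, not the released script."""
    for line in sys.stdin:
        line = line.rstrip("\n")
        if line:                # skip empty lines
            yield line
```

Reading from a fixed file instead of a live pipe removes the inter-process overhead, which is why the file-based variant reaches the reported throughput.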