When studying machine learning, you will sometimes come across the term "online learning". In supervised learning methods such as SVM, for example, training is first performed on all of the sample data at once. But depending on the volume of sample data or the application, it may not be appropriate to learn from all the data in one batch. As mentioned above, businesses operate under time constraints, and there are also system limitations on the volume of data that can be handled, such as the computer's processing capacity and memory capacity. There are also cases where the sample data is provided in stages. In these cases, it is convenient to optimize the parameters on each batch of data as it is provided, and then revise them through further training. This kind of method is called online learning; you could also call it "sequential learning". Typical online learning methods are the perceptron, CW, AROW and SCW. There is also a method called Streaming Random Forests, which is Random Forest adapted to online learning.
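As a minimal sketch of the idea, here is the classic perceptron update applied one sample at a time, as it would be in an online setting. The feature vectors and labels below are made up purely for illustration:

```python
def perceptron_online_update(w, b, x, y, lr=1.0):
    """Single online step: adjust weights only when x is misclassified.

    x is a feature vector, y is a label of +1 or -1.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y * score <= 0:  # misclassified (or on the boundary): update
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b

# Samples arrive one at a time (e.g. from a stream), not as a full batch.
w, b = [0.0, 0.0], 0.0
stream = [([1.0, 2.0], 1), ([2.0, 0.5], -1), ([0.5, 1.5], 1)]
for x, y in stream:
    w, b = perceptron_online_update(w, b, x, y)
```

The model is revised after each sample, so memory use is independent of how much data eventually arrives, which is exactly the property that matters under the system limitations described above.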
Considering the importance of machine learning for NLP as shown in this article, Rakuten has developed a morphological analyzer with a learning feature and has released it as OSS (ref: https://github.com/rakuten-nlp/rakutenma). The analyzer treats morphological analysis as character-level sequence labeling, and uses SCW (exact soft confidence-weighted learning), an online learning method, for the learning part. SCW extends CW (confidence weighting), an algorithm that takes account of differences in the frequency of occurrence of features in the training data, with improved robustness to noise. It has further advantages, such as soft-margin optimization and a proven bound on the maximum loss.
Reinforcement learning, an AI technology associated with the development of robots, is also being applied to e-commerce. In contrast to supervised learning, reinforcement learning uses no training data. Instead, feedback is received as a reward after each action, and that reward drives further learning. Reinforcement learning assumes an environment with uncertainty: unlike training data, the reward may be noisy and may arrive with a delay. This makes it an appropriate method for gradually optimizing a system or service while incorporating the reactions of its users. In recommendation and product search, reinforcement learning is used to incorporate data on customers' reactions (whether users clicked on a product, and whether they did so straight away) and then improve the recommendation or search results. In addition, multi-armed bandit algorithms are used for ad personalization and AB testing. A multi-armed bandit algorithm seeks to obtain as much reward (positive reactions from users) as possible with limited resources. It maximizes the reward by combining two types of behavior: exploitation of past experience, and exploration of untried options that may reveal greater rewards. In other words, within a fixed number of trials, it searches for the best option by balancing what is already known against what has not yet been tried but should be. Typical methods are epsilon-greedy, UCB and Thompson sampling.
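To make the exploitation/exploration trade-off concrete, here is a minimal epsilon-greedy sketch on a simulated two-arm problem. The click-through rates are assumed purely for the simulation; they are not real figures:

```python
import random

def epsilon_greedy(values, epsilon=0.1, rng=random):
    """Explore a random arm with probability epsilon, else exploit the best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))               # exploration
    return max(range(len(values)), key=lambda i: values[i])  # exploitation

def update(counts, values, arm, reward):
    """Incremental-mean update of the chosen arm's estimated reward."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

# Hypothetical setup: two ad designs with unknown click-through rates.
random.seed(0)
true_ctr = [0.05, 0.12]           # assumed CTRs for design A and design B
counts, values = [0, 0], [0.0, 0.0]
for _ in range(5000):
    arm = epsilon_greedy(values)
    reward = 1 if random.random() < true_ctr[arm] else 0
    update(counts, values, arm, reward)
# After many trials, most impressions should flow to the better design.
```

The fixed `epsilon` is the simplest possible policy; UCB and Thompson sampling replace it with smarter rules for deciding when exploration is still worthwhile.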
For example, suppose you create two advertisement designs (A and B) and show each to 50% of users in an AB test. The KPI is simply the click rate per impression. If the click counts for A and B are about the same, there is no problem; but if the click count for design B is significantly lower, then showing design B to 50% of users while the AB test is underway, on a live service, becomes an opportunity loss. If we use a multi-armed bandit algorithm in this case, when design B appears to be performing badly, the percentage of users shown B is decreased and the percentage shown A is increased, so that the KPI is maximized and the loss is kept as small as possible. With only two designs a human could do this, but with, say, 60 designs, or in projects where designs are matched to different user segments, it would be impossible for a human to adjust each impression and improve the overall KPI. As such, there are advantages to applying reinforcement learning methods and automating the process (ref: http://blog.marketing.rakuten.com/2014/04/tech-talk-testing-with-bandits).
While reinforcement learning enables us to reduce opportunity loss compared to manual adjustment, there is always a trade-off between exploiting experience and exploring for reward, so the question remains of how, and how far, we should explore in order to minimize lost profit. Whatever choice we make depends on past data and imperfect assumptions. The UCB algorithm, one type of multi-armed bandit algorithm, answers this question with a provable bound on the maximum loss, and is widely used. In practice, however, Thompson sampling is known to perform better, so at Rakuten we use Thompson sampling (ref: http://blog.marketing.rakuten.com/2014/04/tech-talk-testing-with-bandits).
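A minimal sketch of Thompson sampling for click/no-click rewards: each arm keeps a Beta posterior over its click rate, and on every impression we show the design whose sampled rate is highest. The click rates and priors below are assumptions for the simulation, not values from any real campaign:

```python
import random

def thompson_sample(alphas, betas, rng=random):
    """Pick the arm whose sampled Beta-posterior click rate is highest."""
    samples = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
    return max(range(len(samples)), key=lambda i: samples[i])

# Hypothetical two-design experiment with uniform Beta(1, 1) priors.
random.seed(42)
true_ctr = [0.05, 0.12]        # assumed click rates for the simulation
alphas, betas = [1, 1], [1, 1]
pulls = [0, 0]
for _ in range(5000):
    arm = thompson_sample(alphas, betas)
    click = random.random() < true_ctr[arm]
    pulls[arm] += 1
    # Bayesian update: clicks add to alpha, non-clicks to beta.
    alphas[arm] += click
    betas[arm] += not click
```

Note that exploration here is automatic: an arm with few observations has a wide posterior, so it still gets sampled occasionally, and the traffic split shifts toward the better design as evidence accumulates.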