*Option pricing has progressed through increasing complexity in models of stochastic volatility. Here, Marcos Costa Santos Carriera, co-author of “Brazilian Derivatives and Securities” and PhD Candidate at École Polytechnique, gives a brief introduction to Q-learning and its impact on model-free option pricing.*

Last year (2018), at the QuantMinds International conference in Lisbon, I was fortunate to watch Richard Turner’s presentation on “Applications of Deep Learning to Systematic Trading”. After a brief history of Reinforcement Learning (where agents take actions at each state in order to maximize expected future rewards) and some examples (the cart pole from the OpenAI gym), we learned the justification for using deep neural networks: they approximate the system, because it’s almost impossible to have full knowledge of states, transitions and rewards in the real world; toy models are always missing something.

One of the main advantages of using toy models and realistic rules is to be able to test ideas. In financial markets, actions can be considered limited to “buy”, “hold” or “sell” different amounts of a limited number of securities in response to particular states (price and volume histories, news and events, etc.).

One can define the reward as simply the P&L of the strategy, or add some sort of risk-reward measurement that might lead to more stable results. But the risk of overfitting a strategy to a limited dataset or dynamic is real.

Richard’s presentation focused on the model described in the paper “The price dynamics of common trading strategies” (Farmer and Joshi, 2002), and the main goal was to study how different kinds of agents would interact in a market sensitive to the net order (liquidity is taken into account); this kind of agent-based model is a useful tool to understand how complex dynamics can emerge from simple rules and behaviors, and a worthy addition to a risk management toolbox.

After the presentation, Richard was kind enough to point me to “QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds”, a paper by Igor Halperin that applies Q-Learning to the problem of option pricing.

### So, first, what is Q-Learning?

For that we will look at the recently published 2nd edition of the classic “Reinforcement Learning” by Sutton and Barto.

Imagine that you’re learning what the optimal policy is in a given problem with a finite horizon. If you’re using Monte Carlo methods, you would follow the policy until the end, check the final reward and update your estimates for the value of each state based on the average returns after visiting that state.

Temporal-Difference (TD) combines Monte Carlo (no model of the world is necessary, learning comes from experience) with Dynamic Programming (estimates are updated with the help of other learned estimates – bootstrapping).
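To make the contrast concrete, here are the two value updates side by side, written in the usual textbook notation. The symbols are assumptions introduced here: $V$ is the estimated state value, $\alpha$ a step size, $\gamma$ the discount factor, $R_{t+1}$ the next reward and $G_t$ the full return observed at the end of the episode. The Monte Carlo line uses a constant step size rather than a plain average, which is a common variant.

$$
\begin{aligned}
\text{Monte Carlo:} \quad & V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] \\
\text{TD(0):} \quad & V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]
\end{aligned}
$$

The Monte Carlo target $G_t$ is only known once the episode ends, whereas the TD target $R_{t+1} + \gamma V(S_{t+1})$ is available after a single step because it bootstraps from the current estimate of the next state's value.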

Using TD methods, one can choose between on-policy and off-policy control methods. Policies are the maps that show the probabilities of selecting each possible action given a particular state. On-policy methods evaluate and improve the same policy that is used to make decisions, so what they learn is a near-optimal policy that still explores. Off-policy methods use two policies: a target policy (where learning is stored) and a behavior policy (more exploratory). The tradeoff between exploration and exploitation is present in any learning method; give an agent too much curiosity and it might get stuck in an irrelevant branch that presents endless noise (yes, social media is an example of this problem).
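One common way to balance exploration and exploitation is an epsilon-greedy behavior policy paired with a greedy target policy. The sketch below is only an illustration: the function names `epsilon_greedy` and `greedy` and the parameter `epsilon` are assumptions made here, not something taken from the presentation or the papers.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Behavior policy: pick an action index from a list of estimated action values.

    With probability epsilon the action is chosen uniformly at random (explore);
    otherwise one of the highest-valued actions is chosen (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties at random so that no action is favoured just by its index.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

def greedy(q_values):
    """Target policy: always pick the first action with the highest estimate."""
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Setting `epsilon` to 0 makes the behavior policy purely exploitative, while a value of 1 makes it purely exploratory; values in between trade one off against the other.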

One off-policy TD control algorithm is Q-Learning; the Q refers to the action-value function Q(s,a,p), which reflects the expected return starting from the current state s, taking action a, and after that following policy p. In Q-Learning, the learned Q is a direct approximation of the optimal action-value function, independent of the policy being followed; by learning Q, the agent can decide what to do quite easily.
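Written out, with $s$ the state, $a$ the action and $p$ the policy, the definition reads as follows. The discount factor $\gamma$, the rewards $R_k$ and the horizon $T$ are standard textbook notation assumed here for illustration.

$$
Q(s, a, p) = \mathbb{E}_{p}\left[ \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k \;\middle|\; S_t = s,\ A_t = a \right],
\qquad
Q^{*}(s, a) = \max_{p} Q(s, a, p)
$$

Once $Q^{*}$ is known, deciding what to do in state $s$ is just a matter of picking $\arg\max_{a} Q^{*}(s, a)$.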

### How does this work?

We update the Q-Table (a table with the estimated values of the Q function, with all possible states associated with rows and all possible actions associated with columns) based on: the previous value, the reward for choosing this particular action, the maximum expected future reward given the new state, a learning rate and the discount factor for future rewards. These values are related using the Bellman equation, a fundamental result in Dynamic Programming.
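The update can be sketched in a few lines of tabular Q-Learning. Everything specific in this example is an assumption made for illustration: the five-state chain environment, the `step` and `train` helpers, and the values chosen for the learning rate `alpha`, the discount factor `gamma` and the exploration rate `epsilon`.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Hypothetical toy environment: walk along a chain, reward 1 on reaching the right end."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Q-Table: one row per state, one column per action, initialised at zero.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behavior policy: epsilon-greedy with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(q[s])
                a = rng.choice([i for i in ACTIONS if q[s][i] == best])
            s2, r, done = step(s, a)
            # Target: reward plus discounted best estimate at the new state
            # (no future value once the episode has ended).
            target = r if done else r + gamma * max(q[s2])
            # Move the previous value towards the target by the learning rate.
            q[s][a] += alpha * (target - q[s][a])
            s = s2
    return q
```

Calling `train()` returns the learned table. In this toy chain the optimal value of moving right from state `s` is `gamma ** (3 - s)`, so the learned entries can be checked against a known answer, and the greedy action in every non-terminal state should be "right".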

Given many opportunities to explore the environment and adequate parameters to balance exploration and exploitation, we should be able to learn the optimal policy (given the way we designed our rewards). But if the feedback is sparse, or if it is difficult to learn the long-term consequences of the agent’s actions, we’ll have problems in applying reinforcement learning. When the state space is too large to tabulate, Deep Q-Learning is useful: a neural network approximates the Q-Table and learns the optimal Q-values.

### Okay, and now how do we use it?

One way of looking at the fair value of a derivative contract is by looking at the hedging portfolio; we expect that, on average, the price we pay (receive) for the contract will be offset by the cashflow of the hedging portfolio (minus costs, fees, price impact, etc.). We can then frame the problem of the fair value of a derivative contract like an option as a reinforcement learning problem, where (quoting Halperin) *pricing is done by learning to dynamically optimise risk-adjusted returns for an option replicating portfolio*.
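The hedging-portfolio view of fair value can be checked with a small simulation. This is a sketch under stated assumptions, not the QLBS algorithm: it assumes a Black-Scholes world with constant volatility and no costs, hedges a call with the Black-Scholes delta at discrete dates, and uses hypothetical parameter values throughout. The helper names `bs_call` and `replication_costs` are introduced here for illustration.

```python
import math
import random
import statistics

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(s, k, r, sigma, tau):
    """Black-Scholes price and delta of a European call with time to expiry tau > 0."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    price = s * norm_cdf(d1) - k * math.exp(-r * tau) * norm_cdf(d2)
    return price, norm_cdf(d1)

def replication_costs(n_paths=2000, n_steps=50, s0=100.0, k=100.0,
                      r=0.02, mu=0.08, sigma=0.2, t=1.0, seed=0):
    """Per-path initial capital needed to replicate the call, with and without hedging."""
    rng = random.Random(seed)
    dt = t / n_steps
    hedged, unhedged = [], []
    for _ in range(n_paths):
        s, gains = s0, 0.0
        for i in range(n_steps):
            _, delta = bs_call(s, k, r, sigma, t - i * dt)
            z = rng.gauss(0.0, 1.0)
            # The stock follows a geometric Brownian motion with real-world drift mu.
            s_next = s * math.exp((mu - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)
            # Discounted gain on the delta shares held over this time step.
            gains += delta * (math.exp(-r * (i + 1) * dt) * s_next
                              - math.exp(-r * i * dt) * s)
            s = s_next
        payoff = math.exp(-r * t) * max(s - k, 0.0)
        unhedged.append(payoff)          # discounted payoff alone
        hedged.append(payoff - gains)    # discounted payoff net of hedging gains
    return hedged, unhedged
```

Comparing `statistics.mean(hedged)` with the price returned by `bs_call(100.0, 100.0, 0.02, 0.2, 1.0)` shows the two landing close together even though the drift `mu` differs from the rate `r`, and `statistics.stdev(hedged)` comes out far smaller than `statistics.stdev(unhedged)`. The residual dispersion comes from hedging at discrete dates, which is the kind of risk a risk-adjusted reward is meant to account for.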

We have explicit pricing formulas for some particular models of the world, and even formulas for the theoretically optimal hedge ratios in these models (an excellent reference is “Stochastic Volatility Modeling” by Lorenzo Bergomi). Using Q-Learning, we are able not only to recover these values (although in a discrete setting) but also to come up with similar strategies on real trading data, which we know is not going to fit any model; remember that Q-Learning is model-free.

Because in the QLBS model the option price is the optimal Q-function and the optimal hedge is also an argument of this function, pricing and hedging are learned together. And the ability to work with simulated data and check convergence with theoretical results and then switch to real-world data and adjust pricing and hedging is a big advantage of this framework.

One interesting question is how much we should “overfit” to real data. Suppose our training data has 7 months of low volatility and prices trending up, followed by 7 months of high volatility and prices tumbling down. Is my ideal model supposed to switch between different behaviors (like in the papers “Regimes of Volatility” by Emanuel Derman and “Delta-Hedging Vega Risk” by Stéphane Crépey)? The learned hedge could then differ from a model-implied optimal delta.

Other recent papers on the subject include “Dynamic Replication and Hedging: A Reinforcement Learning Approach”, by Petter Kolm and Gordon Ritter, and “Deep Hedging”, by Buehler, Gonon, Teichmann and Wood.

In a later post (and at QuantMinds International 2019 in Vienna) I’ll discuss how this idea can be applied to interest rate interpolation.