Paper: Universal Trading for Order Execution with Oracle Policy Distillation
Implementation: https://github.com/microsoft/qlib/tree/high-freq-execution/examples/trade/

Introduction

Order execution in algorithmic trading is difficult for reinforcement learning because the agent only observes noisy, imperfect market information, which makes it hard to learn an efficient execution policy. This paper introduces a framework that improves order execution through oracle policy distillation: a teacher policy with access to perfect (future) information guides a student policy that only sees realistic observations, yielding significant improvements over traditional methods.

[Figure: overview of the oracle policy distillation framework]

Formulation of Order Execution

Order execution is modeled over a fixed time horizon, such as an hour or a day, divided into \( T \) discrete time steps. At each step \( t \) with market price \( p_t \), the trader decides the volume \( q_t \) to trade, subject to fully liquidating the target volume \( Q = \sum_{t=1}^{T} q_t \). The objective is to maximize revenue \( \sum_{t} q_t \cdot p_t \), i.e., to maximize the average execution price \( \bar{p} = \sum_{t} \frac{q_t}{Q} p_t \) when selling, or to minimize it when buying, within the predefined horizon.

Given its sequential decision-making nature, order execution can be formulated as a Markov Decision Process.
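To make the objective concrete, here is a minimal sketch with toy numbers (not from the paper) computing the revenue and average execution price of a selling schedule:

import numpy as np

prices = np.array([10.0, 10.2, 9.9, 10.1])  # market price p_t at each time step
q = np.array([300, 400, 100, 200])           # shares sold at each step; sums to Q = 1000

Q = q.sum()
revenue = (q * prices).sum()        # total revenue from liquidating Q shares
avg_exec_price = revenue / Q        # average execution price, the quantity a seller maximizes
print(revenue, avg_exec_price)      # -> 10090.0 10.09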

Reinforcement Learning Component

State, \( s_{t} \)

The information given to the model includes private and public variables.

Private: the elapsed time \( t \) and the remaining inventory \( (Q - \sum_{i=1}^{t} q_i) \)
Public: historical market information, including the open, high, low, close, average price, and transaction volume of each time step

# Build the next observation from private information (elapsed time, position vs. target)
# and public market features, then return the usual gym-style (state, reward, done, info) tuple.
self.state = self.obs(
    self.raw_df,
    self.feature_dfs,
    self.t,
    self.interval,
    self.position,
    self.target,
    self.is_buy,
    self.max_step_num,
    self.interval_num,
    action,
)
return self.state, reward, self.done, {}

source
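To make the observation concrete, here is a minimal sketch (the feature layout and names are assumptions for illustration, not the repo's actual observation format) that stacks the private and public variables into a flat vector:

import numpy as np

def build_observation(t, max_step, position, target, ohlcva_window):
    """Toy observation: private variables (elapsed-time ratio, remaining-inventory ratio)
    plus a flattened window of public features (open, high, low, close, avg price, volume)."""
    private = np.array([t / max_step, position / target])
    public = np.asarray(ohlcva_window, dtype=float).ravel()
    return np.concatenate([private, public])

obs = build_observation(t=3, max_step=8, position=600, target=1000,
                        ohlcva_window=[[10.0, 10.2, 9.9, 10.1, 10.05, 1500]])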

Action, \( a_{t} \)

Each action \( a_t \) is a standardized (proportional) trading volume, so the volume to be executed at the next time step is simply \( q_{t+1} = a_t \cdot Q \).

  action:
    name: Static_Action
    config:
      action_num: 5                       # number of discrete actions
      action_map: [0, 0.25, 0.5, 0.75, 1] # each action maps to a proportion of the target volume Q

source
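As a small illustration of how a discrete action index turns into a trade volume (using the action_map above; capping by the remaining inventory is an assumption for this sketch, not necessarily the repo's behavior):

action_map = [0, 0.25, 0.5, 0.75, 1]

def action_to_volume(action_idx, target, remaining):
    """q_{t+1} = a_t * Q, capped by the remaining inventory."""
    return min(action_map[action_idx] * target, remaining)

print(action_to_volume(2, target=1000, remaining=400))  # -> 400 (0.5 * 1000 capped at 400)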

Reward, \( R_{t} \)

The reward for order execution balances two practically conflicting aspects: trading profitability and a market impact penalty.

The per-step reward combines these two terms as \( R_t = \frac{q_{t+1}}{Q}\left(\frac{p_{t+1}}{\tilde{p}} - 1\right) - \alpha \left(\frac{q_{t+1}}{Q}\right)^2 \), where \( \tilde{p} \) is the day's average market price and \( \alpha \) weights the market impact penalty (this matches the implementation below, with the sign flipped for buying).

The reward is not included in the state, so it does not leak future information (such as the day's average price) into the agent's observations.

import numpy as np

class VP_Penalty_small_vec(VP_Penalty_small):
    def get_reward(self, performance_raise, v_t, target, *args):
        """
        :param performance_raise: Abs(vv_ratio_t - 1) * 10000.
        :param v_t: The traded volume
        :param target: Target volume
        """
        assert target > 0
        # Profitability term: price advantage scaled by the fraction of the target that was traded
        reward = performance_raise * v_t.sum() / target
        # Market impact penalty: quadratic in the per-step traded proportion
        reward -= self.penalty * ((v_t / target) ** 2).sum()
        assert not (np.isnan(reward) or np.isinf(reward)), f"{performance_raise}, {v_t}, {target}"
        return reward / 100

source

# performance_raise feeds the reward; PA_t is the evaluation metric.
# day_vwap / day_twap are the day's volume- and time-weighted average prices.
if self.is_buy:
    performance_raise = (1 - vwap_t / self.day_vwap) * 10000
    PA_t = (1 - vwap_t / self.day_twap) * 10000
else:
    performance_raise = (vwap_t / self.day_vwap - 1) * 10000
    PA_t = (vwap_t / self.day_twap - 1) * 10000

source
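For reference, here is a minimal sketch (not the repo code) of the baseline prices used in the excerpt above, assuming arrays of per-step market prices and traded volumes for the day:

import numpy as np

prices = np.array([10.0, 10.2, 9.9, 10.1])           # hypothetical per-step market prices
volumes = np.array([100, 300, 200, 400])              # hypothetical per-step market volumes

day_twap = prices.mean()                               # time-weighted average price
day_vwap = (prices * volumes).sum() / volumes.sum()    # volume-weighted average price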

Policy Distillation and Optimization

[Figure: the teacher-student oracle policy distillation framework]

• The teacher acts as an oracle whose goal is to learn the optimal trading policy \( \tilde{\pi}_{\varphi}(\cdot \mid \tilde{s}_t) \) by interacting with the environment given perfect information \( \tilde{s}_t \) (including future market data), where \( \varphi \) denotes the parameters of the teacher policy.
• The student learns by interacting with the environment to optimize a common policy \( \pi_{\theta}(\cdot \mid s_t) \) with parameters \( \theta \), given only the imperfect information \( s_t \).

The Proximal Policy Optimization (PPO) algorithm (Schulman et al. 2017) is used in an actor-critic style to optimize the policy toward directly maximizing the expected reward of an episode.

As a result, the overall objective of the student combines the policy loss \( L_p \), the value-function loss \( L_v \), and the policy distillation loss \( L_d \):

\( L(\theta) = L_p + \lambda L_v + \mu L_d \), where the distillation loss is the negative log-likelihood of the teacher's action \( \tilde{a}_t \) under the student policy, \( L_d = -\log \pi_{\theta}(\tilde{a}_t \mid s_t) \), and \( \lambda \), \( \mu \) weight the value and distillation terms (corresponding to _w_vf and sup_coef in the implementation below).

# Distillation loss: NLL of the teacher's action under the student's action probabilities
supervision_loss = F.nll_loss(logits.log(), b.teacher_action)
# Total loss: clipped PPO policy loss + weighted value loss - entropy bonus + KL term + distillation term
loss = clip_loss + self._w_vf * vf_loss - self._w_ent * e_loss + self.kl_coef * kl_loss
loss += self.sup_coef * supervision_loss

source
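For clarity, here is a self-contained sketch (not the repo code; the loss weights and tensor shapes are arbitrary placeholders) showing how the distillation term is computed with dummy tensors:

import torch
import torch.nn.functional as F

probs = torch.softmax(torch.randn(8, 5), dim=-1)   # student action probabilities (batch of 8, 5 discrete actions)
teacher_action = torch.randint(0, 5, (8,))         # oracle teacher's chosen actions for the same states
clip_loss, vf_loss, entropy = torch.tensor(0.10), torch.tensor(0.05), torch.tensor(1.2)  # placeholder PPO terms

distill_loss = F.nll_loss(probs.log(), teacher_action)                   # L_d: NLL of teacher actions under the student
loss = clip_loss + 0.5 * vf_loss - 0.01 * entropy + 1.0 * distill_loss   # weighted sum as in the excerpt above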

Learning Algorithm

[Figure: the learning algorithm from the paper]
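As an illustrative outline only, assuming a teacher-then-student two-stage scheme consistent with the teacher/student description above (the helper functions below are hypothetical placeholders, not the repo's API):

def collect_rollouts(policy, perfect_info):
    """Placeholder: interact with the market environment and return trajectories."""
    return []

def ppo_update(policy, rollouts, teacher=None):
    """Placeholder: one PPO update; when a teacher is given, add the distillation loss L_d."""
    pass

def train_opd(teacher_policy, student_policy, epochs=10):
    # Stage 1: train the oracle teacher with PPO on perfect-information states (future data available).
    for _ in range(epochs):
        ppo_update(teacher_policy, collect_rollouts(teacher_policy, perfect_info=True))
    # Stage 2: train the student with PPO on imperfect-information states,
    # distilling from the teacher's actions on the same underlying market data.
    for _ in range(epochs):
        ppo_update(student_policy, collect_rollouts(student_policy, perfect_info=False),
                   teacher=teacher_policy)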

Network Architecture

[Figure: the network architecture from the paper]

Results

Evaluation Metrics

  1. Reward: the accumulated reward defined above
  2. Price Advantage (PA): the relative revenue gained by the strategy compared to a baseline price \( \tilde{p} \) (e.g., the day's TWAP), measured in basis points: \( \text{PA} = 10^4 \cdot \left( \frac{\bar{p}_{\text{strategy}}}{\tilde{p}} - 1 \right) \) for selling, with the sign flipped for buying
  3. Gain-loss ratio (GLR): the ratio of the expected positive PA to the expected magnitude of negative PA, \( \mathbb{E}[\text{PA} \mid \text{PA} > 0] \, / \, \mathbb{E}[|\text{PA}| \mid \text{PA} < 0] \) (both metrics are sketched in code below)
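A minimal sketch (not the repo code) of PA and GLR as described above; PA follows the excerpt shown earlier, while the GLR formula here is the standard gain/loss formulation and is an assumption on my part:

import numpy as np

def price_advantage(exec_price, baseline_price, is_buy=False):
    """PA in basis points relative to a baseline price such as the day's TWAP."""
    ratio = exec_price / baseline_price
    return (1 - ratio) * 10000 if is_buy else (ratio - 1) * 10000

def gain_loss_ratio(pa_values):
    """Expected positive PA divided by expected magnitude of negative PA."""
    pa = np.asarray(pa_values, dtype=float)
    gains, losses = pa[pa > 0], pa[pa < 0]
    return gains.mean() / np.abs(losses).mean()

pa_per_order = [price_advantage(10.05, 10.00), price_advantage(9.98, 10.00)]
print(pa_per_order, gain_loss_ratio(pa_per_order))  # ~[50.0, -20.0] and GLR ~2.5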

Performance

[Figures: performance comparison results reported in the paper]

Conclusion

This paper combines reinforcement learning and imitation learning through oracle policy distillation: a teacher with perfect information guides a student that must act on imperfect market observations.
Interesting, has potential.