Ernie.QT

Common Alpha 010: Feature Importance & Generation

Added 2025-01-29 13:52:47 +0000 UTC

This post covers some of the packages I have found useful when building trading strategies with a machine learning framework. It is split into three parts: faster computation in Python, feature generation, and feature importance.

Faster computation helps a lot when generating new features on a large dataset, because it reduces the time spent on pre-processing the data.

Faster Computation (Numba) - Python Package

Background Info:

Numba is a just-in-time (JIT) compiler that can significantly accelerate numerical computations in Python
It compiles decorated functions to native machine code, making loops and custom functions run much faster
It works best on NumPy arrays, so it fits alongside pandas, SciPy, scikit-learn, etc. once the numeric data is passed in as arrays
It helps you get a performance boost without rewriting large chunks of code in C/C++
Website: https://numba.pydata.org/
GitHub: https://github.com/numba/numba

Set-up:

pip install numba

Usage:

Apply the ‘@njit’ decorator to your function
Best suited to numerical calculations, such as loops over arrays and scalars

Example:
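A minimal sketch of the decorator in use. The function name `mean_abs_change` and the sample prices are made up for illustration, and the try/except fallback is only there so the snippet still runs where numba is not installed (without any speed-up).

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback: run uncompiled if numba is missing (no speed-up)
    def njit(func):
        return func


@njit
def mean_abs_change(values):
    """Average absolute difference between consecutive elements."""
    total = 0.0
    for i in range(1, values.shape[0]):
        total += abs(values[i] - values[i - 1])
    return total / (values.shape[0] - 1)


# Hypothetical sample data; pass NumPy arrays rather than pandas objects.
prices = np.array([100.0, 101.0, 99.5, 102.0])

# The first call compiles the function; later calls reuse the compiled code.
print(mean_abs_change(prices))
```

With a pandas DataFrame, the usual pattern is to hand the function a column converted with `.to_numpy()`.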

Comparison:

Example 1:

Pure Python for-loop

numba version
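The original code for this comparison is not reproduced here, so the following is only a sketch of what such a comparison could look like: the same sum-of-squares loop written once in pure Python and once with ‘@njit’. The function names, the array size, and the timing approach are assumptions, and the timings will vary by machine.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:  # fallback: run uncompiled if numba is missing (no speed-up)
    def njit(func):
        return func


def sum_of_squares_py(values):
    """Pure Python for-loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total


@njit
def sum_of_squares_nb(values):
    """Identical loop, compiled by numba."""
    total = 0.0
    for v in values:
        total += v * v
    return total


data = np.random.default_rng(0).random(500_000)

# Warm-up call so that compilation time is not counted in the measurement.
sum_of_squares_nb(data)

start = time.perf_counter()
py_result = sum_of_squares_py(data)
py_time = time.perf_counter() - start

start = time.perf_counter()
nb_result = sum_of_squares_nb(data)
nb_time = time.perf_counter() - start

print(f"pure Python: {py_time:.4f}s | numba: {nb_time:.4f}s")
```

The warm-up call matters: without it, the first timed call would include the one-off compilation cost.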

Example 2:

Pure Python Loops vs. NumPy Vectorized vs. Numba Loops
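As a hedged illustration of this three-way comparison (not the original benchmark), the sketch below computes a rolling mean three ways: explicit Python loops, a NumPy cumulative-sum version, and the same loop function compiled with numba. The rolling-mean task, the function names, and the data size are assumptions chosen for the example.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:  # fallback: run uncompiled if numba is missing (no speed-up)
    def njit(func):
        return func


def rolling_mean_loop(values, window):
    """Rolling mean with explicit loops; NaN until a full window is available."""
    n = values.shape[0]
    out = np.full(n, np.nan)
    for i in range(window - 1, n):
        total = 0.0
        for j in range(i - window + 1, i + 1):
            total += values[j]
        out[i] = total / window
    return out


def rolling_mean_numpy(values, window):
    """Same result using cumulative sums instead of Python loops."""
    out = np.full(values.shape[0], np.nan)
    csum = np.concatenate((np.zeros(1), np.cumsum(values)))
    out[window - 1:] = (csum[window:] - csum[:-window]) / window
    return out


# The numba version is the very same loop function, compiled.
rolling_mean_numba = njit(rolling_mean_loop)


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


prices = np.random.default_rng(1).random(50_000)
rolling_mean_numba(prices[:100], 20)  # warm-up: triggers compilation

loop_out, loop_t = timed(rolling_mean_loop, prices, 20)
numpy_out, numpy_t = timed(rolling_mean_numpy, prices, 20)
numba_out, numba_t = timed(rolling_mean_numba, prices, 20)

print(f"loops: {loop_t:.4f}s | numpy: {numpy_t:.4f}s | numba: {numba_t:.4f}s")
```

Vectorized NumPy is often competitive for simple cases like this one; compiled loops tend to pay off more when the calculation is recursive or branch-heavy and hard to vectorize.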

Feature Generation

Feature generation is the process of transforming raw financial market data into meaningful, structured inputs (“features”) that a machine learning model can understand and learn from
Features highlight relationships or behaviors that are less obvious in the raw data
They enable models to learn predictive patterns more effectively

Examples (on a Kline, i.e. candlestick OHLCV, dataset):

RSI (Relative Strength Index)
ATR (Average True Range)
Bollinger Bands
Stochastic Oscillator
MACD
…

Here I use ‘numba’ to write custom functions that compute these common features. If you are backtesting on a larger dataset, such as 5-minute or even 1-minute resolution, the speed-up can make a huge difference.
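A sketch of what two such functions might look like, RSI and ATR, written as numba-compiled loops. This is not the original implementation: the function names, the 14-bar period, the choice of Wilder's smoothing (other conventions exist), and the synthetic OHLC data are all assumptions for illustration.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback: run uncompiled if numba is missing (no speed-up)
    def njit(func):
        return func


@njit
def _rsi_value(avg_gain, avg_loss):
    """Convert average gain/loss into an RSI value between 0 and 100."""
    if avg_loss == 0.0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)


@njit
def rsi(close, period):
    """RSI with Wilder's smoothing; the first `period` entries are NaN."""
    n = close.shape[0]
    out = np.full(n, np.nan)
    if n <= period:
        return out
    avg_gain = 0.0
    avg_loss = 0.0
    # Seed the averages with a simple mean over the first `period` changes.
    for i in range(1, period + 1):
        change = close[i] - close[i - 1]
        if change > 0.0:
            avg_gain += change
        else:
            avg_loss -= change
    avg_gain /= period
    avg_loss /= period
    out[period] = _rsi_value(avg_gain, avg_loss)
    for i in range(period + 1, n):
        change = close[i] - close[i - 1]
        gain = change if change > 0.0 else 0.0
        loss = -change if change < 0.0 else 0.0
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period
        out[i] = _rsi_value(avg_gain, avg_loss)
    return out


@njit
def atr(high, low, close, period):
    """ATR with Wilder's smoothing; the first `period - 1` entries are NaN."""
    n = close.shape[0]
    out = np.full(n, np.nan)
    if n < period:
        return out
    tr = np.empty(n)
    tr[0] = high[0] - low[0]  # no previous close for the first bar
    for i in range(1, n):
        tr[i] = max(high[i] - low[i],
                    abs(high[i] - close[i - 1]),
                    abs(low[i] - close[i - 1]))
    out[period - 1] = tr[:period].mean()
    for i in range(period, n):
        out[i] = (out[i - 1] * (period - 1) + tr[i]) / period
    return out


# Synthetic OHLC data (an assumption; replace with real Kline columns as arrays).
rng = np.random.default_rng(7)
close = 100.0 + np.cumsum(rng.normal(size=1_000))
high = close + np.abs(rng.normal(size=1_000))
low = close - np.abs(rng.normal(size=1_000))

rsi_14 = rsi(close, 14)
atr_14 = atr(high, low, close, 14)
print(rsi_14[-1], atr_14[-1])
```

Both indicators are recursive (each value depends on the previous one), which is the kind of calculation that is awkward to vectorize and where compiled loops tend to help most.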

Feature Importance

Feature importance calculates a score for each input feature of a given model. The scores represent the “importance” of each feature: a higher score means that the feature has a larger effect on the model's predictions of the target variable.

Below is an illustrative example of how to perform a time-series split for Random Forest Classification model training/testing, extract feature importance scores, and optionally carry out feature selection (via Recursive Feature Elimination, RFE) in a quant-trading scenario.

Assumptions:

Predict whether the next close is greater than the current close: label it 1 if so, else 0
Use only the features generated in the previous part
Use Random Forest Classification Model for prediction
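The labelling rule in the first assumption can be sketched as follows. The helper name `make_labels` and the sample closes are hypothetical. Note that the last bar has no next close, so it gets no label and the feature rows must be trimmed to match.

```python
import numpy as np


def make_labels(close):
    """Return 1 where the next close is above the current close, else 0.

    The result is one element shorter than `close` because the final bar
    has no next close to compare against.
    """
    close = np.asarray(close, dtype=float)
    return (close[1:] > close[:-1]).astype(np.int64)


# Hypothetical closes for illustration.
close = np.array([10.0, 10.5, 10.2, 10.2, 11.0])
labels = make_labels(close)
print(labels)  # [1 0 0 1]

# Align features with labels: drop the last feature row, which has no label.
features = close[:-1].reshape(-1, 1)
assert features.shape[0] == labels.shape[0]
```

An unchanged close (10.2 followed by 10.2) is labelled 0, since the rule requires the next close to be strictly greater.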

1. Split the dataset iteratively using TimeSeriesSplit (without shuffling), fitting a Random Forest on each split.
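A minimal sketch of this step using scikit-learn. The data here is synthetic and the feature names are hypothetical placeholders (in practice they would be the indicator columns and the next-close labels); the number of splits and trees are also arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical feature names and synthetic data, for illustration only.
feature_names = ["rsi_14", "atr_14", "bb_width", "stoch_k", "macd_hist"]
rng = np.random.default_rng(42)
n_samples = 600
X = rng.normal(size=(n_samples, len(feature_names)))
# Synthetic label driven mostly by the first and last columns plus noise.
noise = rng.normal(scale=0.5, size=n_samples)
y = (X[:, 0] + 0.5 * X[:, 4] + noise > 0).astype(int)

tscv = TimeSeriesSplit(n_splits=5)  # training folds always precede test folds
fold_importances = []
fold_bounds = []

for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    fold_importances.append(model.feature_importances_)
    fold_bounds.append((train_idx.max(), test_idx.min()))
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)} acc={acc:.3f}")

# Average the impurity-based importances across folds and rank the features.
mean_importance = np.mean(fold_importances, axis=0)
ranking = sorted(zip(feature_names, mean_importance), key=lambda p: -p[1])
for name, score in ranking:
    print(f"{name:10s} {score:.3f}")
```

`feature_importances_` is the impurity-based importance of the fitted forest; averaging it over folds gives a rough view of which features matter consistently over time rather than in a single window.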

2. Recursive Feature Elimination (RFE) iteratively removes the least important features until a desired number of features remain.

Make a single forward split between older data (training) and newer data (test).
Use RFE on the training set.
Evaluate on the test set with only the selected features.
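The three steps above could be sketched like this with scikit-learn's `RFE`. As before, the data is synthetic, the feature names are hypothetical, and the 80/20 split and the choice to keep two features are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score

# Hypothetical feature names and synthetic data, for illustration only.
feature_names = ["rsi_14", "atr_14", "bb_width", "stoch_k", "macd_hist"]
rng = np.random.default_rng(42)
n_samples = 600
X = rng.normal(size=(n_samples, len(feature_names)))
noise = rng.normal(scale=0.5, size=n_samples)
y = (X[:, 0] + 0.5 * X[:, 4] + noise > 0).astype(int)

# Single forward split: the oldest 80% for training, the newest 20% for testing.
split = int(n_samples * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# RFE drops the least important feature one at a time (step=1) until two remain.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    n_features_to_select=2,
    step=1,
)
selector.fit(X_train, y_train)
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print("selected features:", selected)

# Refit on the selected columns only, then evaluate on the held-out newer data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(selector.transform(X_train), y_train)
acc = accuracy_score(y_test, model.predict(selector.transform(X_test)))
print(f"test accuracy with selected features: {acc:.3f}")
```

Fitting the selector on the training portion only keeps information from the test period out of the feature choice.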

The above example is the template I started with for quant backtesting using a machine learning approach. There is still follow-up work to be done on avoiding overfitting and evaluating the model. Meanwhile, the feature importance results can serve as a reference when developing rule-based strategies, since they indicate which features are useful from a machine learning perspective.