Random Subset Feature Selection: A dimensionality reduction approach

A brief explanation of dimensionality reduction and the algorithm.

gksriharsha
7 min read · Aug 4, 2021
Photo by Tamas Tuzes-Katai on Unsplash

In this article, I have shared my knowledge on Dimensionality Reduction and an algorithm called Random Subset Feature Selection (RSFS). Let’s (metaphorically) walk through the forest of knowledge and understand what this is all about.

“Data is the new oil”, “data is growing exponentially” … these are phrases we have been hearing for some time now. Such an explosion in data generation and transfer is no accident; it is driven by a myriad of factors playing out in parallel. One of the most crucial is the improvement in storage density: the same physical volume now holds more data than ever before. Storing this extra data is worthwhile because computation power is now high enough to use it to build more complex and accurate models.

A 3D save button?

Remember these? The amount of data a floppy disk can hold is so small that there is only space for the essential information. If it had been possible to store more information to improve the user experience, companies would have done it, because user experience and satisfaction are crucial aspects that drive business.

Dimensionality Reduction

The definition of dimensionality reduction from Wikipedia is as follows:

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.

I am sorry what?

An intuition for dimensionality reduction can be built from a seven-segment display.

A typical seven-segment display

Many circuits inside elevators and traffic signals use seven-segment displays. Each number displayed is made up of seven segments (LED bars), so every digit corresponds to seven On/Off values. If these combinations of On/Off values are written out in tabular format, every digit becomes a row of seven 0s and 1s. For example, the number 8 has all segments A through G turned on. We ignore the decimal point (DP) because we are only interested in integers in the following examples.

If an elevator uses this seven-segment display in a building that only has three floors (floors 1, 2 and 3), segment F is never lit. For such a case, we could use a modified display with that segment permanently switched off. Intuitively, we were able to reduce the columns (segments) required to describe a digit, as the toy sketch below shows. Such intuition is lost when the data is massive.
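As a toy illustration (a minimal Python sketch, not part of any package discussed here), the segment patterns for digits 1, 2 and 3 can be written as a small 0/1 table, and the unused segment shows up as a column that is never lit:

import numpy as np

# Segment patterns for digits 1, 2 and 3 on a standard seven-segment display
segments = ["A", "B", "C", "D", "E", "F", "G"]
patterns = np.array([
    [0, 1, 1, 0, 0, 0, 0],  # digit 1
    [1, 1, 0, 1, 1, 0, 1],  # digit 2
    [1, 1, 1, 1, 0, 0, 1],  # digit 3
])

# A segment that is never lit for these digits carries no information here
unused = patterns.sum(axis=0) == 0
print([s for s, u in zip(segments, unused) if u])  # ['F']

reduced = patterns[:, ~unused]  # drop that column: 6 segments instead of 7
print(reduced.shape)            # (3, 6)

Dropping the never-used column is exactly the reduction we did by eye above; the point of dimensionality reduction algorithms is to do this when the table is far too large to inspect.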

Confused expert trying to figure it out.

Translating the seven-segment example to a real-world application, the digits are the labels, and we have reduced the number of segments required to identify a digit (dimensionality reduction). In this case we know the mapping between the attributes (segments A through G) and the outcomes. For many real-world applications, such a mapping may be only partially known or not known at all, and reducing the dimensions then becomes a challenging task. Many systems have thousands, if not millions, of features/attributes, so expecting an expert to know the relationships between all of them is next to impossible. Instead, an algorithm reduces the dataset objectively and hands it over to the expert, who can then run the required experiments on the reduced set of features to finalize the model.

The simplest and most popular algorithm for this task is Principal Component Analysis (PCA). However, many practical applications are far too complex for it.

Types of Dimensionality Reduction

Let’s not wander too much into the wilderness of knowledge

There are broadly two types of dimensionality reduction algorithms.

Feature transformation algorithms

PCA is a feature transformation algorithm. In other words, it transforms the data so that the smallest possible number of features can explain the underlying pattern. However, the algorithm does not directly decide (select) which of the original features are the most or least important. Instead, it forms linear combinations of the original features, producing hybrid features that, although fewer in number, can still explain the data.

A six-segment display

Continuing the same example, this is similar to replacing the entire seven-segment display with the six-segment display shown above. The dimensionality (the number of segments) has been decreased by transforming the whole dataset. The biggest shortcoming of this method is that there is no one-to-one mapping from one domain to the other: no remarks can be made about the original features (the seven-segment display) by analyzing the transformed features (the six-segment display). In other words, even if a segment (feature) is unnecessary in the six-segment display, there may or may not be a corresponding segment that is unnecessary in the seven-segment display. The sketch below illustrates this.
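A minimal PCA sketch (using scikit-learn on random data, purely for illustration) makes this concrete: each principal component is a weighted combination of all the original features, so there is no single original column to point at afterwards.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))   # e.g. 7 "segment-like" features

pca = PCA(n_components=6)       # transform 7 features into 6 components
Z = pca.fit_transform(X)

print(Z.shape)                  # (200, 6): fewer columns, but they are new ones
print(pca.components_.shape)    # (6, 7): each component mixes all 7 originals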

Feature selection algorithms

These algorithms retain the original features and perform the reduction by selecting the crucial ones. Since the original features are kept, the data scientist/expert can see which of them actually have an impact. The selected features are extracted and sent to an expert, and once approved, the model can be limited to those features only, just like the earlier example of the elevator going to floors 1, 2 and 3. A generic sketch of this idea follows.
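As a generic illustration of feature selection (this uses scikit-learn's univariate selection, not RSFS), note that the output is a subset of the original columns, which the expert can still inspect by name:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

print(selector.get_support(indices=True))  # indices of the retained original features
X_reduced = selector.transform(X)
print(X_reduced.shape)                     # (150, 2): same rows, fewer original columns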

To determine the best features this way, a classifier would have to search through all combinations of features, which is computationally prohibitive for large datasets: with n features there are 2^n − 1 non-empty subsets to try. This computation is reduced by introducing a learning algorithm that needs far fewer evaluations to reach an optimum point (a good feature set). One such algorithm is Random Subset Feature Selection, proposed by Okko Räsänen and Jouni Pohjalainen in their paper.

Random Subset Feature Selection: RSFS

The original algorithm was developed in MATLAB; I have re-written it in Python. I have used the NumPy library (which takes advantage of SIMD instructions) wherever possible to mimic MATLAB’s vectorization capabilities and keep vector-based operations fast. The implemented package can be installed using the following command:

pip install rsfs

The algorithm is as follows:

  1. The dataset is divided into test and train sections.
  2. A calculated number of features (roughly the square root of the total number of features) is randomly selected from the dataset.
  3. With the randomly selected features as a feature subset, classification is performed.
  4. The inputs to the classifier are the selected-feature columns of the Train and Test sections. All rows are kept and only columns are dropped, so the labels column is not affected by the feature selection.
  5. Once the trained classifier predicts the output labels for the test feature set, the predictions are compared to the true test labels.
  6. For each label, a count of correct and wrong predictions is maintained.
  7. Two metrics are calculated to assess the performance of the feature subsets, namely the performance criterion and the expected criterion. The performance criterion is the average classification performance of the current iteration, while the expected criterion is the average of this measure across all iterations so far.
  8. The net performance (feature relevancy) is calculated by subtracting the expected criterion from the performance criterion. This value is added to the score of each feature in the subset. Example: if the group performance of features 5, 6, 7, 8 is 10 and the average performance up to this iteration is 6, the relevance of features 5, 6, 7 and 8 is increased by 4 (10 − 6).
  9. The relevancy of a feature therefore rises or falls according to how well the subsets containing it perform.
  10. The same update is also applied to dummy features (a mock dataset with random values). These features act as a control group that helps differentiate a genuinely informative feature from a noise/random feature.
  11. In every iteration, the relevancy scores are modeled as a probability vector, and the features with values greater than a cutoff value are selected.
  12. Every Threshold iterations, the number of selected features per iteration is noted. If this number does not vary significantly (within 5% for all tracked statistical measures), the algorithm is said to have converged.
  13. The resultant feature subset is the group of features that have a scaled probability value greater than the cutoff. The corresponding relevance value is the relative importance of the feature in the dataset.
  14. This reduced feature-set is returned to the calling function.

Note: The words in bold are the variable names used in the code.

Tl;dr: Instead of testing all combinations, the algorithm tests random combinations many times. Over many iterations, the most important features keep appearing in above-average subsets and accumulate relevance, so they end up in the final reduced subset. A compressed sketch of this loop is shown below.
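The following is a compressed sketch of that loop, assuming a KNN classifier and the breast-cancer dataset purely for illustration. It is not the exact code of the rsfs package, and it omits the dummy features and the convergence test described in steps 10 to 12.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # step 1

n_feats = X.shape[1]
subset_size = int(round(np.sqrt(n_feats)))  # step 2: ~ sqrt of total features
relevance = np.zeros(n_feats)               # running score per feature
history = []                                # performance of every iteration

rng = np.random.default_rng(0)
for _ in range(300):
    subset = rng.choice(n_feats, size=subset_size, replace=False)           # steps 2-3
    clf = KNeighborsClassifier().fit(X_train[:, subset], y_train)           # step 4
    perf = balanced_accuracy_score(y_test, clf.predict(X_test[:, subset]))  # steps 5-7
    history.append(perf)
    expected = np.mean(history)              # average across all iterations so far
    relevance[subset] += perf - expected     # step 8: reward above-average subsets

cutoff = 0.0                                 # illustrative cutoff, not the paper's value
print(np.flatnonzero(relevance > cutoff))    # step 13: features scoring above the cutoff

Features that consistently appear in better-than-average subsets accumulate positive relevance, while uninformative ones hover around zero; that is the intuition behind the algorithm.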

Conclusion

In this article, I have briefly explained dimensionality reduction and a package I have written that performs it on large datasets. The usage of this package is described in the README file of the GitHub repository.

Thank you for reading my article on dimensionality reduction and one of its implementations. If you enjoyed it, please clap/like this article. Let me know your questions in the comment section or message me on LinkedIn.
