TCAV 101

November 26, 2020

TCAV Introduction

Understanding deep learning models remains an open challenge in machine learning research. The concept of “understanding” itself is subjective and varies significantly depending on one’s technical background and perspective. This complexity is frequently addressed in research papers, where different approaches to model interpretability yield different insights.

What makes a model truly interpretable? Several excellent blog posts have explored this question:

While these resources offer valuable perspectives on model interpretation, TCAV (Testing with Concept Activation Vectors) takes a unique approach. It builds on the observation that specific neural network layers become activated by distinct features or “concepts” more than others.

Most research in model interpretability focuses on image-based tasks, where concepts are visually intuitive. However, there’s limited exploration of interpretability techniques for less visualizable data types, such as fraud detection or medical diagnosis datasets. This gap in research motivated my exploration of TCAV with the SWAT dataset - a time-series dataset where each entry contains multiple features and a binary target indicating the presence or absence of an attack.

Understanding TCAV (Testing with Concept Activation Vectors)

The sensitivity score in TCAV is calculated using the following formula:

$$ \nabla \textcolor{green}{ h_{l, k} ( \textcolor{blue}{ f_l( \textcolor{black}{X_{input}}) } ) } \cdot \textcolor{red}{v_C^l} $$

Here, $f_l$ represents the model up to layer l, while $h_{l,k}$ represents the model from layer l to the output class k. The vector $v_C^l$ is derived from the linear classifier trained on layer l activations - specifically, it’s the vector orthogonal to the classification boundary.

The intuition behind this formula is straightforward: the dot product is the directional derivative of the class-k output along the concept vector, so the higher it is, the more the prediction for class k responds when the layer's activations are nudged toward the concept. When a linear classifier can clearly separate concept examples from counterexamples at a particular layer, and the gradient of the subsequent layers along that direction is substantial, we can conclude that this layer has learned to recognize signals related to our concept of interest.
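To make the directional-derivative reading concrete, here is a minimal NumPy sketch. It assumes a toy stand-in for $h_{l,k}$ (a single logistic unit with made-up weights `w` and `b`) and a made-up concept vector `cav`; none of these come from the actual model. It checks that the gradient dotted with the concept vector matches a finite-difference step along that vector.

```python
import numpy as np

def h(activation, w, b):
    """Toy 'rest of the network': a logistic unit mapping layer-l activations to a class score."""
    return 1.0 / (1.0 + np.exp(-(activation @ w + b)))

def grad_h(activation, w, b):
    """Analytic gradient of h with respect to the activations."""
    p = h(activation, w, b)
    return p * (1.0 - p) * w

def sensitivity(activation, cav, w, b):
    """Directional derivative of h along the unit-normalised concept vector."""
    unit_cav = cav / np.linalg.norm(cav)
    return float(grad_h(activation, w, b) @ unit_cav)

# Hypothetical values, chosen only for illustration.
w = np.array([2.0, -1.0, 0.5])
b = 0.1
activation = np.array([0.3, 0.8, -0.2])
cav = np.array([1.0, 0.0, 1.0])

s = sensitivity(activation, cav, w, b)

# Finite-difference check: move the activations a tiny step toward the concept.
eps = 1e-6
unit_cav = cav / np.linalg.norm(cav)
fd = float((h(activation + eps * unit_cav, w, b) - h(activation, w, b)) / eps)

print(f"analytic sensitivity: {s:.6f}, finite difference: {fd:.6f}")
```

A positive value means that moving the activations toward the concept raises the toy class score; flipping the sign of `cav` flips the sign of the sensitivity.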

For image-based tasks, this interpretation is intuitive. Consider a zebra classifier: layers that recognize stripes will show high sensitivity to the “stripes” concept. However, for non-visual datasets like our SWAT time-series data, concepts become more abstract. For instance, a “valve attack” concept might manifest as patterns in sensor readings that resemble normal operations but are actually malicious.

While I understand how the sensitivity and the TCAV score are calculated, I am still working out what a given sensitivity value implies and how useful it is, both compared with other interpretability scores and when comparing sensitivities with one another.

Calculating TCAV

TCAV’s implementation follows a systematic process that leverages the layer-wise nature of neural networks. Here’s the step-by-step breakdown:

1. Train your base neural network
2. Select a layer to serve as your “bottleneck”
3. Feed concept examples and counterexamples through the model to get their activations at the bottleneck layer
4. Train a linear classifier on these activations to distinguish between concept and non-concept examples
5. Calculate gradients and sensitivity scores using the formula above
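The steps above can be sketched end to end on toy data. This is a hedged illustration in plain NumPy, not the implementation discussed later: the "trained" network is a stand-in with fixed, made-up weights (`W1`, `w2`), the data are random, and `fit_linear_classifier` is a hypothetical helper that runs a simple logistic regression by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (stand-in): a "trained" network with fixed, hypothetical weights.
W1 = np.eye(3)                      # input -> bottleneck
w2 = np.array([1.0, -0.5, 0.25])    # bottleneck -> class-k score

def f(x):
    """Step 2: the network up to the bottleneck layer."""
    return np.tanh(x @ W1)

def h(a):
    """The network from the bottleneck to the class-k output."""
    return 1.0 / (1.0 + np.exp(-(a @ w2)))

def grad_h(a):
    """Gradient of h with respect to the bottleneck activations (one row per example)."""
    p = h(a)
    return (p * (1.0 - p))[:, None] * w2

def fit_linear_classifier(x, y, lr=0.5, steps=500):
    """Step 4: plain logistic regression trained by gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (x.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Step 3: activations for concept examples and counterexamples.
concepts = rng.normal(size=(200, 3)) + np.array([3.0, 0.0, 0.0])
counterexamples = rng.normal(size=(200, 3))
acts = np.concatenate([f(concepts), f(counterexamples)])
y = np.concatenate([np.ones(200), np.zeros(200)])

w, b = fit_linear_classifier(acts, y)
accuracy = float(np.mean(((acts @ w + b) > 0) == (y == 1)))
cav = w / np.linalg.norm(w)   # unit vector pointing toward the concept

# Step 5: per-example sensitivities and the fraction that are positive.
class_inputs = rng.normal(size=(100, 3))
sensitivities = grad_h(f(class_inputs)) @ cav
tcav_score = float(np.mean(sensitivities > 0))

print(f"classifier accuracy: {accuracy:.2f}, TCAV score: {tcav_score:.2f}")
```

Because the toy head is a single logistic unit, every example's sensitivity has the same sign, so the score here can only be 0 or 1. A real network with nonlinear layers after the bottleneck can give intermediate values.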

One crucial consideration is the selection of counterexamples. For image datasets, random noise serves as an effective counterexample since it contains no meaningful patterns. However, for other data types like time-series, counterexamples must be carefully chosen to represent meaningful contrasts to your concept of interest.
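For time-series data, one reasonable choice (an assumption here, not necessarily what the implementation below does) is to draw counterexample windows from normal operation, with the same length and count as the concept windows. A minimal sketch with a hypothetical helper `sample_windows` and random placeholder data:

```python
import numpy as np

def sample_windows(values, length, n_windows, rng):
    """Draw n_windows random contiguous windows of `length` rows from a 2-D array."""
    max_start = len(values) - length
    if max_start < 0:
        raise ValueError("series is shorter than the window length")
    # high is exclusive, so the last valid start index is max_start
    starts = rng.integers(0, max_start + 1, size=n_windows)
    return np.stack([values[s : s + length] for s in starts])

rng = np.random.default_rng(0)
normal_readings = rng.normal(size=(1000, 5))   # placeholder for attack-free sensor rows
attack_readings = rng.normal(size=(120, 5))    # placeholder for rows from one attack type

window_length = 10
concept_windows = sample_windows(attack_readings, window_length, 50, rng)
counterexample_windows = sample_windows(
    normal_readings, window_length, len(concept_windows), rng
)

print(concept_windows.shape, counterexample_windows.shape)
```

Matching the number of windows keeps the linear classifier's two classes balanced, which is the same goal the `stride` argument serves in the usage code later on.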

[Figure from the original TCAV paper]

Implementation Overview

While official implementations exist for both TensorFlow and Keras (targeting TensorFlow ≤ 2.0), I chose to create my own implementation to better understand the mechanics. Most implementations use TensorFlow, though PyTorch implementations are possible using hooks to capture activations and gradients instead of explicit model splitting. Interestingly, this is one of the rare cases where the TensorFlow approach proves more intuitive than its PyTorch counterpart.

My implementation is available here. Let’s examine its core components:

First, we need to split a trained neural network into two components ($f_l$ and $h_{l,k}$) using Keras’s functional API:

def use_bottleneck(self, bottleneck: int):
    """split the model into pre and post models for tcav linear model

    Args:
        bottleneck (int): index of the layer at which to split the nn model
    """

    # the bottleneck cannot be the last layer: model_h needs at least one layer after it
    if bottleneck < 0 or bottleneck >= len(self.model.layers) - 1:
        raise ValueError("Bottleneck layer must be greater than or equal to 0 and less than the index of the last layer!")

    # model f: original input -> bottleneck activations
    self.model_f = tf.keras.Model(inputs=self.model.input, outputs=self.model.layers[bottleneck].output)

    # model h: bottleneck activations -> output, rebuilt by chaining the remaining layers
    # (this assumes the layers after the bottleneck form a simple sequential stack)
    model_h_input = tf.keras.layers.Input(self.model.layers[bottleneck + 1].input_shape[1:])
    model_h = model_h_input
    for layer in self.model.layers[bottleneck + 1 :]:
        model_h = layer(model_h)
    self.model_h = tf.keras.Model(inputs=model_h_input, outputs=model_h)
    self.bottleneck_layer = self.model.layers[bottleneck]

Now that the original model is split in two, we need to train the linear classifier whose coefficients give us the CAV:

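def train_cav(self, concepts, counterexamples):
    # activations at the bottleneck for both groups
    concept_activations = self.model_f.predict(concepts)
    counterexamples_activations = self.model_f.predict(counterexamples)

    # flatten each example's activations into a single feature vector
    x = np.concatenate([concept_activations, counterexamples_activations])
    x = x.reshape(x.shape[0], -1)

    # concept examples are labelled 1, counterexamples 0
    y = np.concatenate([np.ones(len(concept_activations)), np.zeros(len(counterexamples_activations))])

    # self.lm is the linear classifier set up elsewhere in the class (not shown here)
    self.lm.fit(x, y)
    self.coefs = self.lm.coef_
    # the coefficients are negated; calculate_sensitivty takes the gradient of the loss,
    # which flips the sign back
    self.cav = np.transpose(-1 * self.coefs)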

The resulting Concept Activation Vector consists of the (negated) coefficients from our trained linear classifier, so it lies along the direction in activation space that separates concept examples from counterexamples. How well the classifier separates the two groups, for example its accuracy on held-out activations, indicates whether the layer has learned meaningful signals related to our concept of interest. Coefficient magnitude on its own is a weaker indicator, because it also depends on the scale of the activations and on the classifier's regularization. However, this is just one component of TCAV - we still need to calculate the gradient and sensitivity score to complete our analysis:

def calculate_sensitivty(self, concepts, concepts_labels, counterexamples, counterexamples_labels):
    """the sensitivity scores come from dot product of the gradients with the CAV

    Labels are expected to be one-hot encoded, one row per example.
    """
    # bottleneck activations for every example, in the same order as the labels
    activations = np.concatenate([self.model_f.predict(concepts), self.model_f.predict(counterexamples)])
    labels = np.concatenate([concepts_labels, counterexamples_labels])

    grad_vals = []

    for x, y in zip(activations, labels):
        # add a batch dimension so a single example can go through model_h
        x = tf.convert_to_tensor(np.expand_dims(x, axis=0), dtype=tf.float32)
        y = tf.convert_to_tensor(np.expand_dims(y, axis=0), dtype=tf.float32)

        with tf.GradientTape() as tape:
            tape.watch(x)

            y_out = self.model_h(x)
            loss = tf.keras.backend.categorical_crossentropy(y, y_out)

        # gradient of the loss (not of the class output) with respect to the bottleneck activations
        grad_vals.append(tape.gradient(loss, x).numpy())

    grad_vals = np.array(grad_vals).squeeze()

    # one sensitivity per example: flattened gradient dotted with the CAV
    self.sensitivity = np.dot(grad_vals.reshape(grad_vals.shape[0], -1), self.cav)
    self.labels = labels
    self.grad_vals = grad_vals

def sensitivity_score(self):
    """Return, for each class k, the fraction of its examples with positive sensitivity"""
    num_classes = self.labels.shape[-1]

    sens_for_class_k = {}
    for k in range(0, num_classes):
        class_idxs = np.where(self.labels[:, k] == 1)
        if len(class_idxs[0]) == 0:
            sens_for_class_k[k] = None
        else:
            sens_for_class = self.sensitivity[class_idxs[0]]
            sens_for_class_k[k] = len(sens_for_class[sens_for_class > 0]) / len(sens_for_class)

    return sens_for_class_k
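One detail worth spelling out: the code above differentiates the cross-entropy loss rather than the class output, and pairs it with a negated coefficient vector. A minimal NumPy sketch of why those two sign flips cancel, assuming a softmax head with made-up weights `W` (not the real model): for a one-hot label on class k the loss is $-\log p_k$, so the loss gradient dotted with the negated vector equals the gradient of $\log p_k$ dotted with the original vector.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def grad_loss(a, W, k):
    """Gradient of the cross-entropy loss -log p_k with respect to activations a."""
    p = softmax(W @ a)
    one_hot = np.eye(len(p))[k]
    return W.T @ (p - one_hot)

def grad_log_prob(a, W, k):
    """Gradient of log p_k with respect to activations a."""
    p = softmax(W @ a)
    return W[k] - W.T @ p

# Hypothetical head (3 activations -> 2 classes) and hypothetical classifier coefficients.
W = np.array([[1.0, -2.0, 0.5],
              [-0.5, 1.5, 2.0]])
a = np.array([0.2, -0.4, 0.9])
concept_direction = np.array([1.0, 0.0, -1.0])
cav = -1 * concept_direction          # negated, as in train_cav

k = 1
loss_based = float(grad_loss(a, W, k) @ cav)
log_prob_based = float(grad_log_prob(a, W, k) @ concept_direction)

print(loss_based, log_prob_based)
```

Under this reading, a positive sensitivity means that moving the activations toward the concept increases the log-probability of the labelled class. Note that this is the gradient of $\log p_k$, which is closely related to but not identical to the gradient of a raw class output $h_{l,k}$.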

This gives us the sensitivity of each class to a concept, measured against the provided counterexamples. To use it, we can do something like this:

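import numpy as np
import tensorflow as tf

# get_attack_info_df, create_concept, TCAV, df, TARGETCOL, TIMESERIES_LENGTH,
# model_df_dir, model_path and logger are defined elsewhere in the project
attack_info_df = get_attack_info_df(pdf_path=model_df_dir / "docs/List_of_attacks_Final.pdf")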

concept, counterexamples = create_concept(df, attack_info_df, [10, 11])

concepts_gen = tf.keras.preprocessing.sequence.TimeseriesGenerator(
    concept.drop(TARGETCOL, axis=1).values,
    concept[TARGETCOL].values,
    length=TIMESERIES_LENGTH,
    batch_size=1,
    shuffle=True,
)

# use a stride on the counterexamples to roughly balance the number of samples;
# max(1, ...) guards against a stride of 0 when there are fewer counterexamples than concept rows
counterexamples_gen = tf.keras.preprocessing.sequence.TimeseriesGenerator(
    counterexamples.drop(TARGETCOL, axis=1).values,
    counterexamples[TARGETCOL].values,
    length=TIMESERIES_LENGTH,
    batch_size=1,
    stride=max(1, round(len(counterexamples) / len(concept))),
    shuffle=True,
)

concepts_x = []
concepts_y = []
for x, y in concepts_gen:
    concepts_x.append(x)
    concepts_y.append(y)

counterexamples_x = []
counterexamples_y = []
for x, y in counterexamples_gen:
    counterexamples_x.append(x)
    counterexamples_y.append(y)

concepts_x = np.array(concepts_x).squeeze()
concepts_y = np.array(concepts_y).squeeze()
counterexamples_x = np.array(counterexamples_x).squeeze()
counterexamples_y = np.array(counterexamples_y).squeeze()

model = tf.keras.models.load_model(model_path)
tcav = TCAV(model)


for layer_n in range(1, len(tcav.model.layers) - 1):

    tcav.use_bottleneck(layer_n)
    tcav.train_cav(concepts_x, counterexamples_x)

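    # labels are expected to be one-hot encoded (sensitivity_score indexes them by class)
    tcav.calculate_sensitivty(concepts_x, concepts_y, counterexamples_x, counterexamples_y)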
    sensitivity_score = tcav.sensitivity_score()

    logger.info(f"=== === ===")
    logger.info(f"sensitivity scores for LAYER: {layer_n} of type: {tcav.bottleneck_layer.name}")
    logger.info(f"[class 0 to concept] ==> {sensitivity_score[0]}")
    logger.info(f"[class 1 to concept] ==> {sensitivity_score[1]}")
    logger.info(f"=== === ===")

Full Implementation

The complete implementation and dataset used in this analysis can be found in this repository: https://gitlab.com/besiktas/falcon_tcav

Experimental Results

I conducted experiments using the SWAT dataset, a time-series dataset where different attack types were treated as concepts. While the results showed promise, interpreting them proved more challenging than with image-based models where concepts are visually intuitive.

Key findings from the experiments:

The relationship between layer depth and concept sensitivity is non-monotonic
Sensitivity scores showed unexpected peaks in both early and late layers
The same concept could have varying sensitivity patterns across different model architectures

These results suggest that TCAV can be applied beyond image classification, though interpreting its scores becomes more nuanced with abstract data types.

Future Research Directions

Several promising areas for future investigation emerged from this work:

Transformer Architectures: Can we extract meaningful sensitivity scores from transformer-based models? For example, could TCAV help us understand how language models process concepts like toxicity or sincerity in text?
Score Interpretation: What do different magnitudes of TCAV scores tell us about concept learning? We need better frameworks for interpreting these scores, especially for non-image domains.
Adversarial Robustness: The original TCAV paper hints at connections to adversarial attacks. Could TCAV scores help detect or prevent such attacks by monitoring concept sensitivity patterns?

[Figure: adversarial example from the original TCAV paper]

Conclusion

TCAV offers a promising approach to model interpretability beyond traditional image-based applications. While our experiments with time-series data showed mixed results, they highlight both the potential and limitations of concept-based interpretation methods. As we push towards more complex models and diverse data types, tools like TCAV will be crucial for understanding how our models actually learn and make decisions.

Your feedback and questions are welcome! Feel free to reach out.