
Analyzer

Analyzer Entry Point

The Analyzer class is part of the jarvais.analyzer module. It provides tools for exploring a dataset and surfacing data-quality issues such as missing values and infrequent categorical outliers.
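
A minimal quick-start sketch (the CSV path and column names are illustrative placeholders):

import pandas as pd

from jarvais.analyzer import Analyzer

df = pd.read_csv("clinical.csv")  # placeholder dataset

analyzer = Analyzer(
    data=df,
    output_dir="output",
    categorical_columns=["gender", "tumor_stage"],
    continuous_columns=["age", "tumor_size"],
    target_variable="death",
    task="classification",
)
analyzer.run()  # writes tableone.csv, updated_data.csv, figures and a PDF report to output/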

jarvais.analyzer.Analyzer

Analyzer class for data visualization and exploration.

Parameters:

    data (DataFrame, required): The input data to be analyzed.
    output_dir (str | Path, required): The output directory for saving the analysis report and visualizations.
    categorical_columns (list[str] | None, default: None): List of categorical columns. If None, all remaining columns are considered categorical.
    continuous_columns (list[str] | None, default: None): List of continuous columns. If None, all remaining columns are considered continuous.
    date_columns (list[str] | None, default: None): List of date columns. If None, no date columns are considered.
    boolean_columns (list[str] | None, default: None): List of boolean columns. If None, no boolean columns are considered.
    target_variable (str | None, default: None): The target variable for analysis. If None, analysis is performed without a target variable.
    task (str | None, default: None): The type of task, e.g. classification, regression, or survival. If None, analysis is performed without a task.
    generate_report (bool, default: True): Whether to generate a PDF report of the analysis.
    group_outliers (bool, default: True): Whether to group infrequent categories into a single "Other" category.

Attributes:

    data (DataFrame): The input data to be analyzed.
    missingness_module (MissingnessModule): Module for handling missing data.
    outlier_module (OutlierModule): Module for detecting outliers.
    encoding_module (OneHotEncodingModule): Module for encoding categorical variables.
    boolean_module (BooleanEncodingModule): Module for encoding boolean variables.
    visualization_module (VisualizationModule): Module for generating visualizations.
    dashboard_module (DashboardModule): Module for generating the analysis dashboard.
    settings (AnalyzerSettings): Settings for the analyzer, including output directory and column specifications.

Source code in src/jarvais/analyzer/analyzer.py
class Analyzer():
    """
    Analyzer class for data visualization and exploration.

    Parameters:
        data (pd.DataFrame): The input data to be analyzed.
        output_dir (str | Path): The output directory for saving the analysis report and visualizations.
        categorical_columns (list[str] | None): List of categorical columns. If None, all remaining columns will be considered categorical.
        continuous_columns (list[str] | None): List of continuous columns. If None, all remaining columns will be considered continuous.
        date_columns (list[str] | None): List of date columns. If None, no date columns will be considered.
        boolean_columns (list[str] | None): List of boolean columns. If None, no boolean columns will be considered.
        target_variable (str | None): The target variable for analysis. If None, analysis will be performed without a target variable.
        task (str | None): The type of task for analysis, e.g. classification, regression, survival. If None, analysis will be performed without a task.
        generate_report (bool): Whether to generate a PDF report of the analysis. Default is True.
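        group_outliers (bool): Whether to group infrequent categories into a single 'Other' category. Default is True.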

    Attributes:
        data (pd.DataFrame): The input data to be analyzed.
        missingness_module (MissingnessModule): Module for handling missing data.
        outlier_module (OutlierModule): Module for detecting outliers.
        encoding_module (OneHotEncodingModule): Module for encoding categorical variables.
        boolean_module (BooleanEncodingModule): Module for encoding boolean variables.
        visualization_module (VisualizationModule): Module for generating visualizations.
        settings (AnalyzerSettings): Settings for the analyzer, including output directory and column specifications.
    """
    def __init__(
            self, 
            data: pd.DataFrame,
            output_dir: str | Path,
            categorical_columns: list[str] | None = None, 
            continuous_columns: list[str] | None = None,
            date_columns: list[str] | None = None,
            boolean_columns: list[str] | None = None,
            target_variable: str | None = None,
            task: str | None = None,
            generate_report: bool = True,
            group_outliers: bool = True
        ) -> None:
        self.data = data

        # Infer all types if none provided
        if not categorical_columns and not continuous_columns and not date_columns:
            categorical_columns, continuous_columns, date_columns, boolean_columns = infer_types(self.data)
            # Treat booleans as categorical downstream
            # categorical_columns = list(sorted(set(categorical_columns) | set(boolean_columns)))
        else:
            categorical_columns = categorical_columns or []
            continuous_columns = continuous_columns or []
            date_columns = date_columns or []

            specified_cols = set(categorical_columns + continuous_columns + date_columns)
            remaining_cols = set(self.data.columns) - specified_cols

            if not categorical_columns:
                logger.warning("Categorical columns not specified. Inferring from remaining columns.")
                categorical_columns = list(remaining_cols)

            elif not continuous_columns:
                logger.warning("Continuous columns not specified. Inferring from remaining columns.")
                continuous_columns = list(remaining_cols)

            elif not date_columns:
                logger.warning("Date columns not specified. Inferring from remaining columns.")
                date_columns = list(remaining_cols)        

        self.missingness_module = MissingnessModule.build(
            categorical_columns=categorical_columns, 
            continuous_columns=continuous_columns,
        )
        self.outlier_module = OutlierModule.build(
            categorical_columns=categorical_columns, 
            continuous_columns=continuous_columns,            
            group_outliers=group_outliers
        )
        self.encoding_module = OneHotEncodingModule.build(
            categorical_columns=categorical_columns, 
            target_variable=target_variable
        )
        self.boolean_module = BooleanEncodingModule.build(
            boolean_columns=boolean_columns
        )
        self.dashboard_module = DashboardModule.build(
            output_dir=Path(output_dir),
            continuous_columns=continuous_columns,
            categorical_columns=categorical_columns
        )
        self.visualization_module = VisualizationModule.build(
            output_dir=Path(output_dir),
            continuous_columns=continuous_columns,
            categorical_columns=categorical_columns,
            task=task,
            target_variable=target_variable
        )

        self.settings = AnalyzerSettings(
            output_dir=Path(output_dir),
            categorical_columns=categorical_columns,
            continuous_columns=continuous_columns,
            date_columns=date_columns,
            target_variable=target_variable,
            task=task,
            generate_report=generate_report,
            missingness=self.missingness_module,
            outlier=self.outlier_module,
            visualization=self.visualization_module,
            encoding=self.encoding_module,
            boolean=self.boolean_module,
            dashboard=self.dashboard_module
        )

    @classmethod
    def from_settings(
            cls, 
            data: pd.DataFrame, 
            settings_dict: dict
        ) -> "Analyzer":
        """
        Initialize an Analyzer instance with a given settings dictionary. Settings are validated by pydantic.

        Args:
            data (pd.DataFrame): The input data for the analyzer.
            settings_dict (dict): A dictionary containing the analyzer settings.

        Returns:
            Analyzer: An analyzer instance with the given settings.

        Raises:
            ValueError: If the settings dictionary is invalid.
        """
        try:
            settings = AnalyzerSettings.model_validate(settings_dict)
        except Exception as e:
            raise ValueError("Invalid analyzer settings") from e

        analyzer = cls(
            data=data,
            output_dir=settings.output_dir,
        )

        analyzer.missingness_module = settings.missingness
        analyzer.outlier_module = settings.outlier
        analyzer.visualization_module = settings.visualization
        analyzer.encoding_module = settings.encoding
        analyzer.boolean_module = settings.boolean
        analyzer.dashboard_module = settings.dashboard

        analyzer.settings = settings

        return analyzer

    def run(self) -> None:
        """
        Runs the analyzer pipeline.

        This function runs the following steps:
            1. Creates a TableOne summary of the input data.
            2. Runs the data cleaning modules.
            3. Runs the visualization module.
            4. Runs the encoding module.
            5. Saves the updated data.
            6. Generates a PDF report of the analysis results.
            7. Saves the settings to a JSON file.
        """

        # Create Table One
        self.mytable = TableOne(
            self.data[self.settings.continuous_columns + self.settings.categorical_columns].copy(), 
            categorical=self.settings.categorical_columns, 
            continuous=self.settings.continuous_columns,
            pval=False
        )
        print(self.mytable.tabulate(tablefmt = "grid"))
        self.mytable.to_csv(self.settings.output_dir / 'tableone.csv')

        # Run Modules
        self.input_data = self.data.copy()
        self.data = (
            self.data
            .pipe(self.missingness_module)
            .pipe(self.outlier_module)
            .pipe(self.visualization_module)
            .pipe(self.dashboard_module)
            .pipe(self.encoding_module)
            .pipe(self.boolean_module)
        )

        # Save Data
        self.data.to_csv(self.settings.output_dir / 'updated_data.csv', index=False)

        # Generate Report
        if self.settings.generate_report:
            generate_analysis_report_pdf(
                outlier_analysis=self.outlier_module.report,
                multiplots=self.visualization_module._multiplots,
                categorical_columns=self.settings.categorical_columns,
                continuous_columns=self.settings.continuous_columns,
                output_dir=self.settings.output_dir
            )
        else:
            logger.warning("Skipping report generation.")

        # Save Settings
        self.settings.settings_schema_path = self.settings.output_dir / 'analyzer_settings.schema.json'
        with self.settings.settings_schema_path.open("w") as f:
            json.dump(self.settings.model_json_schema(), f, indent=2)

        self.settings.settings_path = self.settings.output_dir / 'analyzer_settings.json'
        with self.settings.settings_path.open('w') as f:
            json.dump({
                "$schema": str(self.settings.settings_schema_path.relative_to(self.settings.output_dir)),
                **self.settings.model_dump(mode="json") 
            }, f, indent=2)

    def __rich_repr__(self) -> rich.repr.Result:
        yield self.settings

    def __repr__(self) -> str:
        return f"Analyzer(settings={self.settings.model_dump_json(indent=2)})"

from_settings(data, settings_dict) classmethod

Initialize an Analyzer instance with a given settings dictionary. Settings are validated by pydantic.

Parameters:

    data (DataFrame, required): The input data for the analyzer.
    settings_dict (dict, required): A dictionary containing the analyzer settings.

Returns:

    Analyzer: An analyzer instance with the given settings.

Raises:

    ValueError: If the settings dictionary is invalid.
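
A sketch of reusing settings written by a previous run (assumes output/analyzer_settings.json was produced by an earlier call to run()):

import json

import pandas as pd

from jarvais.analyzer import Analyzer

df = pd.read_csv("clinical.csv")  # placeholder dataset

with open("output/analyzer_settings.json") as f:
    settings_dict = json.load(f)

settings_dict["generate_report"] = False  # optionally tweak a field before re-running

analyzer = Analyzer.from_settings(data=df, settings_dict=settings_dict)
analyzer.run()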

Source code in src/jarvais/analyzer/analyzer.py

run()

Runs the analyzer pipeline.

This method performs the following steps:
  1. Creates a TableOne summary of the input data.
  2. Runs the data cleaning modules (missingness and outliers).
  3. Runs the visualization and dashboard modules.
  4. Runs the encoding modules (one-hot and boolean).
  5. Saves the updated data to updated_data.csv.
  6. Generates a PDF report of the analysis results.
  7. Saves the settings to a JSON file.
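
Continuing the quick-start above, a sketch of inspecting what run() wrote (figure filenames vary with the plots and modules enabled):

from pathlib import Path

for path in sorted(Path("output").iterdir()):
    print(path)
# output/analyzer_settings.json
# output/analyzer_settings.schema.json
# output/figures
# output/tableone.csv
# output/updated_data.csv
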
Source code in src/jarvais/analyzer/analyzer.py

Analyzer Modules

The Analyzer class contains the following modules:

jarvais.analyzer.modules.MissingnessModule

Bases: AnalyzerModule

Source code in src/jarvais/analyzer/modules/missingness.py
class MissingnessModule(AnalyzerModule):

    categorical_strategy: Dict[str, Literal['unknown', 'knn', 'mode']] = Field(
        description="Missingness strategy for categorical columns.",
        title="Categorical Strategy",
        examples=[{"gender": "unknown", "treatment_type": "knn", "tumor_stage": "mode"}]
    )
    continuous_strategy: Dict[str, Literal['mean', 'median', 'mode']] = Field(
        description="Missingness strategy for continuous columns.",
        title="Continuous Strategy",
        examples=[{"age": "median", "tumor_size": "mean", "survival_rate": "median"}]
    )

    @classmethod
    def build(
            cls, 
            continuous_columns: list[str], 
            categorical_columns: list[str],
        ) -> "MissingnessModule":
        return cls(
            continuous_strategy={col: 'median' for col in continuous_columns},
            categorical_strategy={col: 'unknown' for col in categorical_columns}
        )

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame: # noqa: PLR0912
        if not self.enabled:
            logger.warning("Missingness analysis is disabled.")
            return df

        logger.info("Performing missingness analysis...")

        df = df.copy()

        # Handle continuous columns
        for col, cont_strategy in self.continuous_strategy.items():
            if col not in df.columns:
                continue
            if cont_strategy == "mean":
                df[col] = df[col].fillna(df[col].mean())
            elif cont_strategy == "median":
                df[col] = df[col].fillna(df[col].median())
            elif cont_strategy == "mode":
                df[col] = df[col].fillna(df[col].mode().iloc[0])
            else:
                msg = f"Unsupported strategy for continuous column: {cont_strategy}"
                raise ValueError(msg)

        # Handle categorical columns
        for col, cat_strategy in self.categorical_strategy.items():
            if col not in df.columns:
                continue
            if cat_strategy == "unknown":
                # cast via object so NaN survives the cast and is replaced; astype(str) would coerce NaN to the string "nan"
                df[col] = df[col].astype(object).fillna("Unknown").astype("category")
            elif cat_strategy == "mode":
                df[col] = df[col].fillna(df[col].mode().iloc[0])
            elif cat_strategy == "knn":
                df = self._knn_impute(df, col)
            else:
                df[col] = df[col].fillna(cat_strategy)

        return df

    def _knn_impute(self, df: pd.DataFrame, target_col: str) -> pd.DataFrame:
        df = df.copy()
        df_encoded = df.copy()

        # Encode categorical columns for KNN
        cat_cols = df_encoded.select_dtypes(include="category").columns
        encoders = {col: {k: v for v, k in enumerate(df_encoded[col].dropna().unique())} for col in cat_cols}
        for col in cat_cols:
            df_encoded[col] = df_encoded[col].map(encoders[col])

        df_imputed = pd.DataFrame(
            KNNImputer(n_neighbors=3).fit_transform(df_encoded),
            columns=df.columns,
            index=df.index
        )

        # Decode imputed categorical column
        if target_col in encoders:
            inverse = {v: k for k, v in encoders[target_col].items()}
            df[target_col] = (
                df_imputed[target_col]
                .round()
                .astype(int)
                .map(inverse)
                .astype("category")
            )
        else:
            df[target_col] = df_imputed[target_col]

        return df
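
The module can also be applied on its own; a minimal sketch with made-up data:

import numpy as np
import pandas as pd

from jarvais.analyzer.modules import MissingnessModule

df = pd.DataFrame({
    "age": [62.0, np.nan, 48.0],
    "gender": pd.Series(["M", None, "F"], dtype="category"),
})

module = MissingnessModule.build(
    continuous_columns=["age"],
    categorical_columns=["gender"],
)
imputed = module(df)  # "age" filled with its median (55.0), "gender" with "Unknown"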

jarvais.analyzer.modules.OutlierModule

Bases: AnalyzerModule

Source code in src/jarvais/analyzer/modules/outlier.py
class OutlierModule(AnalyzerModule):

    categorical_strategy: Dict[str, Literal['frequency']] = Field(
        description="Outlier strategy for categorical columns.",
        title="Categorical Strategy",
        examples=[{"treatment_type": "frequency"}]
    )
    continuous_strategy: Dict[str, Literal['none']] = Field(
        description="Outlier strategy for continuous columns (currently unsupported).",
        title="Continuous Strategy",
        examples=[{"age": "none"}]
    )
    threshold: float = Field(
        default=0.01,
        description="Frequency threshold below which a category is considered an outlier.",
        title="Threshold",
    )

    categorical_mapping: Dict[str, Dict[str, str]] = Field(
        default_factory=dict,
        description="Mapping from categorical column names to outlier handling details."
        "Generated after running outlier analysis. If a mapping is already provided, it will be used directly.",
        title="Categorical Outlier Mapping"
    )
    group_outliers: bool = Field(
        default=True,
        description="Whether to group outliers into a single category named 'Other'.",
        title="Group Outliers"
    )
    _outlier_report: str = PrivateAttr(default="")

    @classmethod
    def build(
            cls, 
            categorical_columns: list[str],
            continuous_columns: list[str] | None = None, 
            group_outliers: bool = True
        ) -> "OutlierModule":
        return cls(
            categorical_strategy={col: "frequency" for col in categorical_columns},
            continuous_strategy={col: "none" for col in continuous_columns} if continuous_columns is not None else {},
            group_outliers=group_outliers
        )

    @property
    def report(self) -> str:
        return self._outlier_report

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        if not self.enabled:
            logger.warning("Outlier analysis is disabled.")
            return df

        logger.info("Performing outlier analysis...")

        df = df.copy()

        # Handle continuous outliers
        # for col, strategy in self.continuous_strategy.items(): 
        #     continue

        # Handle categorical outliers
        for col, strategy in self.categorical_strategy.items():
            if col not in df.columns or strategy != 'frequency':
                continue

            # If a mapping is already provided, use it directly
            if col in self.categorical_mapping and self.categorical_mapping[col]:
                logger.warning(f"Using provided categorical mapping for column: {col}")
                mapping = self.categorical_mapping[col]
            else:
                # Otherwise, compute the mapping based on frequency threshold
                value_counts = df[col].value_counts()
                threshold = int(len(df) * self.threshold)
                outliers = value_counts[value_counts < threshold].index 

                mapping = {
                    val: ("Other" if val in outliers else val)
                    for val in value_counts.index
                }
                mapping["Other"] = "Other"

                self.categorical_mapping[col] = dict(mapping)

                if len(outliers) > 0:
                    outliers_msg = [f'{o}: {value_counts[o]} out of {df[col].count()}' for o in outliers]
                    self._outlier_report += f'  - Outliers found in {col}: {outliers_msg}\n'
                else:
                    self._outlier_report += f'  - No Outliers found in {col}\n'

            # Apply the mapping (whether passed or computed)
            df[col] = df[col].map(mapping).astype("category")

        if self._outlier_report:
            print(f"\nOutlier Report:\n{self._outlier_report}")

        if self.group_outliers:
            for col in self.categorical_mapping:
                df[col] = df[col].apply(lambda x: self.categorical_mapping[col][x])

        return df
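
A sketch of the frequency-based grouping on a toy column (data and threshold are illustrative):

import pandas as pd

from jarvais.analyzer.modules import OutlierModule

df = pd.DataFrame({"treatment_type": ["chemo"] * 60 + ["radiation"] * 39 + ["experimental"]})

module = OutlierModule.build(categorical_columns=["treatment_type"])
module.threshold = 0.05  # categories below 5% of rows are grouped into "Other"
result = module(df)      # "experimental" (1 of 100 rows) is mapped to "Other"
print(module.report)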

jarvais.analyzer.modules.VisualizationModule

Bases: AnalyzerModule

Source code in src/jarvais/analyzer/modules/visualization.py
class VisualizationModule(AnalyzerModule):

    plots: list[str] = Field(
        description="List of plots to generate.",
        title="Plots",
        examples=["corr", "pairplot", "frequency_table", "multiplot", "umap"]
    )
    output_dir: str | Path = Field(
        description="Output directory.",
        title="Output Directory",
        examples=["output"],
        repr=False
    )
    continuous_columns: list[str] = Field(
        description="List of continuous columns.",
        title="Continuous Columns",
        examples=["age", "tumor_size", "survival_rate"],
        repr=False
    )
    categorical_columns: list[str] = Field(
        description="List of categorical columns.",
        title="Categorical Columns",
        examples=["gender", "treatment_type", "tumor_stage"],
        repr=False
    )
    task: str | None = Field(
        description="Task to perform.",
        title="Task",
        examples=["classification", "regression", "survival"],
        repr=False
    )
    target_variable: str | None = Field(
        description="Target variable.",
        title="Target Variable",
        examples=["death"],
        repr=False
    )
    save_to_json: bool = Field(
        default=False,
        description="Whether to save plots as JSON files."
    )

    _figures_dir: Path = PrivateAttr(default=Path("."))
    _multiplots: list[str] = PrivateAttr(default_factory=list)
    _umap_data: np.ndarray | None = PrivateAttr(default=None)

    def model_post_init(self, context: Any) -> None: 

        self._figures_dir = Path(self.output_dir) / "figures"
        self._figures_dir.mkdir(exist_ok=True, parents=True)

        plot_order = ["corr", "pairplot", "umap", "frequency_table", "multiplot", "kaplan_meier"]
        self.plots = [p for p in plot_order if p in self.plots] # Need UMAP before frequency table

    @classmethod
    def validate_plots(cls, plots: list[str]) -> list[str]:
        plot_registry = ["corr", "pairplot", "frequency_table", "multiplot", "umap", "kaplan_meier"]
        invalid = [p for p in plots if p not in plot_registry]
        if invalid:
            msg = f"Invalid plots: {invalid}. Available: {plot_registry}"
            raise ValueError(msg)
        return plots

    @classmethod
    def build(
            cls,
            output_dir: str | Path,
            continuous_columns: list[str],
            categorical_columns: list[str],
            task: str | None,
            target_variable: str | None
        ) -> "VisualizationModule":
        plots = ["corr", "pairplot", "frequency_table", "multiplot", "umap"]

        if task == "survival":
            plots.append("kaplan_meier")

        return cls(plots=plots, 
                   output_dir=output_dir,
                   continuous_columns=continuous_columns,
                   categorical_columns=categorical_columns,
                   task=task,
                   target_variable=target_variable
                )   

    def __call__(self, data: pd.DataFrame) -> pd.DataFrame:
        if not self.enabled:
            logger.warning("Visualization is disabled.")
            return data

        original_data = data.copy()

        if self.save_to_json:
            logger.warning("Saving plots as JSON files is enabled. This feature is experimental.")

        for plot in self.plots:
            try:
                match plot:
                    case "corr":
                        logger.info("Plotting Correlation Matrix...")
                        self._plot_correlation(data)
                    case "pairplot":
                        logger.info("Plotting Pairplot...")
                        self._plot_pairplot(data)
                    case "frequency_table":
                        logger.info("Plotting Frequency Table...")
                        plot_frequency_table(data, self.categorical_columns, self._figures_dir, self.save_to_json)
                    case "umap":
                        logger.info("Plotting UMAP...")
                        self._umap_data = plot_umap(data, self.continuous_columns, self._figures_dir)
                        if self.save_to_json:
                            with open(self._figures_dir / 'umap_data.json', 'w') as f:
                                json.dump(self._umap_data.tolist(), f)
                    case "kaplan_meier":
                        logger.info("Plotting Kaplan Meier Curves...")
                        self._plot_kaplan_meier(data)
            except Exception as e:
                logger.info(f"Skipping {plot} due to error: {e}")

        if 'multiplot' in self.plots:
            if self._umap_data is None:
                raise ValueError("Cannot plot multiplot without UMAP data.")

            logger.info("Plotting Multiplot...")
            self._plot_multiplot(data)

        return original_data

    def _plot_correlation(self, data: pd.DataFrame) -> None:
        p_corr = data[self.continuous_columns].corr(method="pearson")
        s_corr = data[self.continuous_columns].corr(method="spearman")
        size = 1 + len(self.continuous_columns)*1.2
        plot_corr(p_corr, size, file_name='pearson_correlation.png', output_dir=self._figures_dir, title="Pearson Correlation")
        plot_corr(s_corr, size, file_name='spearman_correlation.png', output_dir=self._figures_dir, title="Spearman Correlation")

        if self.save_to_json:
            p_corr.to_json(self._figures_dir / 'pearson_correlation.json')
            s_corr.to_json(self._figures_dir / 'spearman_correlation.json')

    def _plot_pairplot(self, data: pd.DataFrame) -> None:
        if self.target_variable in self.categorical_columns:
            plot_pairplot(data, self.continuous_columns, output_dir=self._figures_dir, target_variable=self.target_variable)
        else:
            plot_pairplot(data, self.continuous_columns, output_dir=self._figures_dir)

        if self.save_to_json:
            data.to_json(self._figures_dir / 'pairplot.json')

    def _plot_multiplot(self, data: pd.DataFrame) -> None:
        (self._figures_dir / 'multiplots').mkdir(parents=True, exist_ok=True)
        self._multiplots = Parallel(n_jobs=-1)(
            delayed(plot_one_multiplot)(
                data,
                self._umap_data,
                var,
                self.continuous_columns,
                self._figures_dir,
                self.save_to_json
            ) for var in self.categorical_columns
        )

    def _plot_kaplan_meier(self, data: pd.DataFrame) -> None:
        data_x = data.drop(columns=['time', 'event'])
        data_y = data[['time', 'event']]
        categorical_columns = [cat for cat in self.categorical_columns if cat != 'event']
        plot_kaplan_meier_by_category(data_x, data_y, categorical_columns, self._figures_dir / 'kaplan_meier')
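
A sketch of building the module directly (column names are placeholders):

from jarvais.analyzer.modules import VisualizationModule

viz = VisualizationModule.build(
    output_dir="output",
    continuous_columns=["age", "tumor_size"],
    categorical_columns=["gender", "tumor_stage"],
    task="survival",  # appends "kaplan_meier" to the plot list
    target_variable="death",
)
print(viz.plots)  # reordered so "umap" runs before "frequency_table" and "multiplot"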

jarvais.analyzer.modules.OneHotEncodingModule

Bases: AnalyzerModule

Source code in src/jarvais/analyzer/modules/encoding.py
class OneHotEncodingModule(AnalyzerModule):
    columns: list[str] | None = Field(
        default=None,
        description="List of categorical columns to one-hot encode. If None, all columns are used."
    )
    target_variable: str | None = Field(
        default=None,
        description="Target variable to exclude from encoding."
    )
    prefix_sep: str = Field(
        default="|",
        description="Prefix separator used in encoded feature names."
    )

    @classmethod
    def build(
        cls,
        categorical_columns: list[str],
        target_variable: str | None = None,
        prefix_sep: str = "|",
    ) -> "OneHotEncodingModule":
        return cls(
            columns=[col for col in categorical_columns if col != target_variable],
            target_variable=target_variable,
            prefix_sep=prefix_sep
        )

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        if not self.enabled:
            logger.warning("One-hot encoding is disabled.")
            return df

        df = df.copy()
        return pd.get_dummies(
            df,
            columns=self.columns,
            dtype=float,
            prefix_sep=self.prefix_sep
        )
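
A sketch of the resulting feature names with the default "|" separator (toy data):

import pandas as pd

from jarvais.analyzer.modules import OneHotEncodingModule

df = pd.DataFrame({"gender": ["M", "F"], "death": [1, 0]})

encoder = OneHotEncodingModule.build(
    categorical_columns=["gender", "death"],
    target_variable="death",  # excluded from encoding
)
encoded = encoder(df)
print(list(encoded.columns))  # ['death', 'gender|F', 'gender|M']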

Analyzer Settings

The AnalyzerSettings class is used to configure the Analyzer class.

jarvais.analyzer.settings.AnalyzerSettings

Bases: BaseModel

Source code in src/jarvais/analyzer/settings.py
class AnalyzerSettings(BaseModel):
    output_dir: Path = Field(
        description="Output directory.",
        title="Output Directory",
        examples=["output"],
    )
    categorical_columns: list[str] = Field(
        description="List of categorical columns.",
        title="Categorical Columns",
        examples=["gender", "treatment_type", "tumor_stage"],
    )
    continuous_columns: list[str] = Field(
        description="List of continuous columns.",
        title="Continuous Columns",
        examples=["age", "tumor_size", "survival_rate"],
    )
    date_columns: list[str] = Field(
        description="List of date columns.",
        title="Date Columns",
        examples=["date_of_treatment"],
    )
    task: str | None = Field(
        description="Task to perform.",
        title="Task",
        examples=["classification", "regression", "survival"],
    )
    target_variable: str | None = Field(
        description="Target variable.",
        title="Target Variable",
        examples=["death"],
    )
    generate_report: bool = Field(
        default=True,
        description="Whether to generate a pdf report."
    )
    settings_path: Path | None = Field(
        default=None,
        description="Path to settings file.",
    )
    settings_schema_path: Path | None = Field(
        default=None,
        description="Path to settings schema file.",
    )

    missingness: MissingnessModule
    outlier: OutlierModule
    encoding: OneHotEncodingModule
    visualization: VisualizationModule
    boolean: BooleanEncodingModule
    dashboard: DashboardModule

    def model_post_init(self, context: Any) -> None:
        self.output_dir.mkdir(parents=True, exist_ok=True)

    @classmethod
    def validate_task(cls, task: str | None) -> str | None:
        if task not in ['classification', 'regression', 'survival', None]:
            raise ValueError("Invalid task parameter. Choose one of: 'classification', 'regression', 'survival'.")
        return task
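
As the source above shows, validate_task accepts only the three task names or None; a brief sketch:

from jarvais.analyzer.settings import AnalyzerSettings

AnalyzerSettings.validate_task("survival")    # returns "survival"
AnalyzerSettings.validate_task(None)          # returns None
AnalyzerSettings.validate_task("clustering")  # raises ValueError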