ANALYSIS TOOL BOX

17 functions

Data Processing

Clean, transform, and prepare data for analysis.

The most-used module in the library. Its seventeen functions cover the full data preparation workflow, from first look to analysis-ready dataset.

Highlight functions

CreateDataOverview

A comprehensive first look at any DataFrame. Returns data types, missing value rates, and distribution plots for every column.

from analysistoolbox.data_processing import CreateDataOverview

CreateDataOverview(dataframe=df, plot_missingness=True)
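For intuition, the tabular part of such an overview can be approximated in plain pandas. This is only a sketch under stated assumptions, not the library's implementation: the helper name `summarize_columns` and the sample data are hypothetical, and the plots are left out.

```python
import pandas as pd


def summarize_columns(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and share of missing values.

    Hypothetical helper for illustration; not part of analysistoolbox.
    """
    return pd.DataFrame({
        "dtype": dataframe.dtypes.astype(str),
        "missing_rate": dataframe.isna().mean(),
    })


# Hypothetical sample data with one missing value in each column.
sample = pd.DataFrame({
    "revenue": [120.0, None, 95.5, 101.0],
    "region": ["east", "west", None, "east"],
})
overview = summarize_columns(sample)
print(overview)
```

Each row of `overview` describes one column of the input, so a quick scan of `missing_rate` shows where cleaning effort is needed first.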

ConductEntityMatching

Fuzzy matching between two datasets using Levenshtein distance, Jaro-Winkler similarity, or cosine similarity. Handles the name disambiguation problem common in anti-money laundering (AML), compliance, and deduplication work.

from analysistoolbox.data_processing import ConductEntityMatching

matches = ConductEntityMatching(
    dataframe=df,
    column_to_match="company_name",
    list_to_match_against=known_entities,
    minimum_similarity_score=0.85
)
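To show what a similarity threshold such as 0.85 means in practice, here is a minimal standard-library sketch of Levenshtein-based matching. It is not how the library computes its scores: the helpers `levenshtein_distance`, `similarity`, and `best_matches`, the normalization by the longer string's length, the lowercasing step, and the sample names are all assumptions made for illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a 0-1 score, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def best_matches(names, candidates, minimum_score):
    """For each name, keep its best-scoring candidate if it meets minimum_score."""
    results = {}
    for name in names:
        score, match = max(
            (similarity(name.lower(), candidate.lower()), candidate)
            for candidate in candidates
        )
        if score >= minimum_score:
            results[name] = (match, round(score, 3))
    return results


# Hypothetical sample data.
names = ["Acme Holdings Ltd", "Globex Corp"]
known = ["ACME Holdings Ltd.", "Initech LLC"]
matched = best_matches(names, known, minimum_score=0.85)
print(matched)
```

Here "Acme Holdings Ltd" differs from its candidate only by case and a trailing period, so it clears the threshold, while "Globex Corp" has no candidate close enough and is dropped.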

AddTukeyOutlierColumn

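Adds a boolean column identifying outliers using Tukey's IQR fence method: a value is flagged when it falls below Q1 - k × IQR or above Q3 + k × IQR, where k is the fence multiplier.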

from analysistoolbox.data_processing import AddTukeyOutlierColumn

df = AddTukeyOutlierColumn(
    dataframe=df,
    column_name="revenue",
    fence_multiplier=1.5
)
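The fence calculation itself is short enough to sketch in pandas. The helper name `flag_tukey_outliers`, the output column name, and the sample data are hypothetical, and the quartiles use pandas' default linear interpolation, which may differ from the library's choice.

```python
import pandas as pd


def flag_tukey_outliers(values: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Return True where a value lies outside the Tukey fences.

    Hypothetical helper for illustration; not part of analysistoolbox.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    return (values < lower_fence) | (values > upper_fence)


# Hypothetical sample data: five typical values and one extreme value.
sample = pd.DataFrame({"revenue": [10, 12, 11, 13, 12, 95]})
sample["revenue_is_outlier"] = flag_tukey_outliers(sample["revenue"])
print(sample)
```

With this data the quartiles are 11.25 and 12.75, so the fences sit at 9.0 and 15.0 and only the value 95 is flagged. A larger multiplier (3.0 is a common choice for "far out" values) widens the fences and flags fewer rows.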

ImputeMissingValuesUsingNearestNeighbors

KNN-based imputation that preserves multivariate relationships better than mean/median fills.

from analysistoolbox.data_processing import ImputeMissingValuesUsingNearestNeighbors

df_imputed = ImputeMissingValuesUsingNearestNeighbors(
    dataframe=df,
    number_of_neighbors=5
)
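The idea behind nearest-neighbor imputation can be sketched with NumPy alone. This is an illustration under assumptions, not the library's implementation: the helper `knn_impute` is hypothetical, distances are root mean squared differences over the columns both rows have observed, neighbors are averaged with equal weight, and the sample values are made up.

```python
import numpy as np


def knn_impute(data, number_of_neighbors: int = 2) -> np.ndarray:
    """Fill each NaN with the column mean over the nearest rows that have a value.

    Hypothetical helper for illustration; not part of analysistoolbox.
    """
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    missing = np.isnan(data)
    for row in np.flatnonzero(missing.any(axis=1)):
        # Columns observed both in this row and in each other row.
        shared = ~missing[row] & ~missing
        diffs = np.where(shared, data - data[row], 0.0)
        counts = shared.sum(axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            distances = np.sqrt((diffs ** 2).sum(axis=1) / counts)
        distances[counts == 0] = np.inf  # no shared columns: not comparable
        distances[row] = np.inf          # a row is never its own neighbor
        for column in np.flatnonzero(missing[row]):
            # Only rows with an observed value in this column can donate.
            candidates = np.flatnonzero(~missing[:, column] & np.isfinite(distances))
            order = np.argsort(distances[candidates])[:number_of_neighbors]
            nearest = candidates[order]
            if nearest.size:
                filled[row, column] = data[nearest, column].mean()
    return filled


# Hypothetical sample data: the second row is missing its third value.
sample = np.array([
    [1.0, 2.0, 10.0],
    [1.1, 2.1, np.nan],
    [0.9, 1.9, 12.0],
    [8.0, 9.0, 50.0],
])
imputed = knn_impute(sample, number_of_neighbors=2)
print(imputed)
```

The missing value is filled with 11.0, the average of its two closest rows, rather than the column mean of 24.0 that the distant fourth row would pull upward. That is the sense in which neighbor-based fills preserve relationships between columns.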

All functions

| Function | Description |
|---|---|
| AddDateNumberColumns | Extract year, month, quarter, week, day from a date column |
| AddLeadingZeros | Pad numeric columns with leading zeros |
| AddRowCountColumn | Row numbers within groups |
| AddTPeriodColumn | Time period columns for time series analysis |
| AddTukeyOutlierColumn | Flag outliers using Tukey's IQR method |
| CleanTextColumns | Strip leading/trailing whitespace |
| ConductAnomalyDetection | Z-score anomaly detection |
| ConductEntityMatching | Fuzzy matching between datasets |
| ConvertOddsToProbability | Convert betting odds to probabilities |
| CountMissingDataByGroup | Missing value counts by group |
| CreateBinnedColumn | Discretize continuous variables |
| CreateDataOverview | Comprehensive dataset summary |
| CreateRandomSampleGroups | Random validation splits |
| CreateRareCategoryColumn | Flag low-frequency categories |
| CreateStratifiedRandomSampleGroups | Stratified random sampling |
| ImputeMissingValuesUsingNearestNeighbors | KNN imputation |
| VerifyGranularity | Check dataset granularity |