The most-used module in the library: seventeen functions covering the full data preparation workflow, from first look to analysis-ready dataset.
Highlight functions
CreateDataOverview
A comprehensive first look at any DataFrame. Returns data types, missing value rates, and distribution plots for every column.
from analysistoolbox.data_processing import CreateDataOverview
CreateDataOverview(dataframe=df, plot_missingness=True)
ConductEntityMatching
Fuzzy matching between two datasets using Levenshtein distance, Jaro-Winkler, or cosine similarity. Handles the name disambiguation problem common in AML, compliance, and deduplication work.
from analysistoolbox.data_processing import ConductEntityMatching
matches = ConductEntityMatching(
    dataframe=df,
    column_to_match="company_name",
    list_to_match_against=known_entities,
    minimum_similarity_score=0.85
)
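To give some intuition for what a threshold such as 0.85 means, here is a minimal sketch of a normalized Levenshtein similarity in plain Python. It is not the library's implementation: `levenshtein_distance`, `similarity`, and `match_against` are hypothetical helper names, and scaling the score as one minus the edit distance divided by the longer string's length is an assumption about how such a score might be normalized.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def match_against(name, candidates, minimum_similarity_score=0.85):
    """Return (candidate, score) pairs at or above the threshold (hypothetical helper)."""
    scored = [(c, similarity(name.lower(), c.lower())) for c in candidates]
    return [(c, s) for c, s in scored if s >= minimum_similarity_score]


matches_sketch = match_against("Acme Corp", ["Acme Corp.", "Zenith Ltd"])
```

In this toy example only "Acme Corp." clears the threshold, since it differs from the query by a single character. Lowering the threshold trades precision for recall, which is usually the main tuning decision in deduplication work.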
AddTukeyOutlierColumn
Adds a boolean column identifying outliers using Tukey's IQR fence method.
from analysistoolbox.data_processing import AddTukeyOutlierColumn
df = AddTukeyOutlierColumn(
    dataframe=df,
    column_name="revenue",
    fence_multiplier=1.5
)
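The fence rule itself is short enough to sketch in plain Python. The following is an illustration of Tukey's method, not the library's code: `tukey_outlier_flags` is a hypothetical name, and the quartile interpolation used here (the standard library's "inclusive" method) is an assumption that may differ from what the library computes.

```python
import statistics


def tukey_outlier_flags(values, fence_multiplier=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], where k is the fence multiplier."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - fence_multiplier * iqr
    upper_fence = q3 + fence_multiplier * iqr
    return [v < lower_fence or v > upper_fence for v in values]


# Toy revenue figures with one obviously extreme value.
revenue = [10, 12, 11, 13, 12, 14, 11, 95]
flags = tukey_outlier_flags(revenue)
```

With these values the quartiles are 11 and 13.25, so the fences sit at 7.625 and 16.625 and only the final value is flagged. A multiplier of 3.0 is the conventional choice when only extreme outliers should be marked.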
ImputeMissingValuesUsingNearestNeighbors
KNN-based imputation that preserves multivariate relationships better than mean/median fills.
from analysistoolbox.data_processing import ImputeMissingValuesUsingNearestNeighbors
df_imputed = ImputeMissingValuesUsingNearestNeighbors(
    dataframe=df,
    number_of_neighbors=5
)
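To show why a neighbor-based fill can respect relationships between columns, here is a simplified sketch of the idea in plain Python. It is an illustration under stated assumptions rather than the library's implementation: `knn_impute` is a hypothetical name, missing values are represented as `None`, and distances are computed only over columns observed in both rows.

```python
import math


def knn_impute(rows, number_of_neighbors=2):
    """Fill None entries with the mean of that column among the nearest rows."""

    def distance(a, b):
        # Compare rows only on the columns that are observed in both.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that have this column observed.
            donors = [
                (distance(row, other), other[col])
                for j, other in enumerate(rows)
                if j != i and other[col] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda pair: pair[0])
            nearest = donors[:number_of_neighbors]
            if nearest:
                imputed[i][col] = sum(v for _, v in nearest) / len(nearest)
    return imputed


rows = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, None],  # the second column is missing here
    [9.0, 90.0],
]
filled = knn_impute(rows, number_of_neighbors=2)
```

The missing entry is filled with 15.0, the average of its two nearest rows, whereas a column-mean fill would have used 40.0. In practice, columns on very different scales should be standardized first so that no single column dominates the distance.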
All functions
| Function | Description |
|---|---|
| AddDateNumberColumns | Extract year, month, quarter, week, day from a date column |
| AddLeadingZeros | Pad numeric columns with leading zeros |
| AddRowCountColumn | Row numbers within groups |
| AddTPeriodColumn | Time period columns for time series analysis |
| AddTukeyOutlierColumn | Flag outliers using Tukey's IQR method |
| CleanTextColumns | Strip leading/trailing whitespace |
| ConductAnomalyDetection | Z-score anomaly detection |
| ConductEntityMatching | Fuzzy matching between datasets |
| ConvertOddsToProbability | Convert betting odds to probabilities |
| CountMissingDataByGroup | Missing value counts by group |
| CreateBinnedColumn | Discretize continuous variables |
| CreateDataOverview | Comprehensive dataset summary |
| CreateRandomSampleGroups | Random validation splits |
| CreateRareCategoryColumn | Flag low-frequency categories |
| CreateStratifiedRandomSampleGroups | Stratified random sampling |
| ImputeMissingValuesUsingNearestNeighbors | KNN imputation |
| VerifyGranularity | Check dataset granularity |
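
As a rough illustration of the Z-score approach listed for ConductAnomalyDetection, the sketch below flags values whose distance from the mean exceeds a chosen number of standard deviations. The function name `z_score_flags`, the default threshold of 3.0, and the use of the population standard deviation are all assumptions made for this example, not documented library behavior.

```python
import statistics


def z_score_flags(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        # All values are identical, so nothing can be anomalous.
        return [False] * len(values)
    return [abs((v - mean) / std) > threshold for v in values]


readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 50]
anomalies = z_score_flags(readings)
```

Only the final reading is flagged here. Z-scores on small samples are bounded (a single point among n values cannot exceed roughly the square root of n), so a threshold of 3.0 can never trigger on very short series; the Tukey fence method is often a more robust choice in that situation.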