pywrangler.pandas package

Submodules

pywrangler.pandas.base module

This module contains the pandas base wrangler.

class pywrangler.pandas.base.PandasSingleNoFit[source]

Bases: pywrangler.pandas.base.PandasWrangler

Mixin class defining fit and fit_transform for all wranglers that take a single dataframe as input, return a single dataframe as output, and require no fitting.

fit(df: pandas.core.frame.DataFrame)[source]

Do nothing and return the wrangler unchanged.

This method exists only to implement the usual API so that the wrangler works in pipelines.

Parameters:df (pd.DataFrame) –
fit_transform(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Apply fit and transform in sequence at once.

Parameters:df (pd.DataFrame) –
Returns:result
Return type:pd.DataFrame
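
The no-fit behaviour described above can be illustrated with a minimal sketch. It does not import pywrangler: NoFitMixin and AddOne are hypothetical stand-ins, and the assumption that fit_transform simply chains fit and transform comes from the description above, not from the library source.

```python
import pandas as pd


class NoFitMixin:
    """Hypothetical stand-in for the no-fit behaviour; not pywrangler's code."""

    def fit(self, df: pd.DataFrame) -> "NoFitMixin":
        # Nothing is learned from the data: return the wrangler unchanged.
        return self

    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Apply fit and transform in sequence.
        return self.fit(df).transform(df)


class AddOne(NoFitMixin):
    """Hypothetical stateless wrangler used only for illustration."""

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return df + 1


df = pd.DataFrame({"a": [1, 2]})
wrangler = AddOne()
result = wrangler.fit_transform(df)
```

Because fit returns the wrangler itself, calls can be chained as in wrangler.fit(df).transform(df), which is what pipeline-style APIs typically expect.
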
class pywrangler.pandas.base.PandasWrangler[source]

Bases: pywrangler.base.BaseWrangler

Pandas wrangler base class.

computation_engine

pywrangler.pandas.benchmark module

This module contains benchmarking utilities for pandas wranglers.

class pywrangler.pandas.benchmark.PandasMemoryProfiler(wrangler: pywrangler.pandas.base.PandasWrangler, repetitions: int = 5, interval: float = 0.01)[source]

Bases: pywrangler.benchmark.MemoryProfiler

Approximate memory usage that a pandas wrangler instance requires to execute the fit_transform step.

The key metric is ratio. It estimates how much memory the fit_transform step requires relative to the size of its input: the memory usage increase during function execution divided by the memory usage of the input dataframes. For example, with a 1GB input dataframe and a ratio of 5, fit_transform needs 5GB of free memory to succeed. A ratio of 0.5 with a 2GB input dataframe requires 1GB of free memory for the computation.
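
The arithmetic behind the ratio can be written out as a small sketch. The function names memory_ratio and required_free_memory are hypothetical helpers for illustration only and are not part of pywrangler.

```python
GB = 1024 ** 3  # bytes per gigabyte (binary convention, chosen for the example)


def memory_ratio(usage_increase_bytes: float, input_bytes: float) -> float:
    """Memory increase during execution divided by input memory usage."""
    if input_bytes <= 0:
        raise ValueError("input memory usage must be positive")
    return usage_increase_bytes / input_bytes


def required_free_memory(ratio: float, input_bytes: float) -> float:
    """Free memory needed for a given ratio and input size, in bytes."""
    return ratio * input_bytes


# The two cases from the description above.
case_a = required_free_memory(5, 1 * GB)    # 1GB input, ratio 5
case_b = required_free_memory(0.5, 2 * GB)  # 2GB input, ratio 0.5
```
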

Parameters:
  • wrangler (pywrangler.pandas.base.PandasWrangler) – The wrangler instance to be profiled.
  • repetitions (int) – The number of measurements for memory profiling.
  • interval (float, optional) – Defines interval duration between consecutive memory usage measurements in seconds.
measurements

The actual profiling measurements in bytes.

Type:list
best

The best measurement in bytes.

Type:float
median

The median of measurements in bytes.

Type:float
worst

The worst measurement in bytes.

Type:float
std

The standard deviation of measurements in bytes.

Type:float
runs

The number of measurements.

Type:int
baseline_change

The median change in baseline memory usage across all runs in bytes.

Type:float
input

Memory usage of input dataframes in bytes.

Type:int
output

Memory usage of output dataframes in bytes.

Type:int
ratio

The amount of memory required for computation in units of input memory usage.

Type:float
profile()[source]

Contains the actual profiling implementation.

report()

Print a simple report consisting of the best, median, and worst measurements, the standard deviation, and the number of measurements.

profile_report()

Calls profile and report in sequence.

input

Returns the memory usage of the input dataframes in bytes.

output

Returns the memory usage of the output dataframes in bytes.

profile(*dfs, **kwargs)[source]

Profiles the actual memory usage given input dataframes dfs which are passed to fit_transform.

ratio

The amount of memory required to execute the fit_transform step, expressed relative to the input memory usage: the memory usage increase during function execution divided by the memory usage of the input dataframes. For example, with a 1GB input dataframe and a ratio of 5, fit_transform needs 5GB of free memory to succeed. A ratio of 0.5 with a 2GB input dataframe requires 1GB of free memory for the computation.

class pywrangler.pandas.benchmark.PandasTimeProfiler(wrangler: pywrangler.pandas.base.PandasWrangler, repetitions: Union[None, int] = None)[source]

Bases: pywrangler.benchmark.TimeProfiler

Approximate time that a pandas wrangler instance requires to execute the fit_transform step.

Parameters:
  • wrangler (pywrangler.pandas.base.PandasWrangler) – The wrangler instance to be profiled.
  • repetitions (None, int, optional) – Number of repetitions. If None, timeit.Timer.autorange will determine a sensible default.
measurements

The actual profiling measurements in seconds.

Type:list
best

The best measurement in seconds.

Type:float
median

The median of measurements in seconds.

Type:float
worst

The worst measurement in seconds.

Type:float
std

The standard deviation of measurements in seconds.

Type:float
runs

The number of measurements.

Type:int
profile()

Contains the actual profiling implementation.

report()

Print a simple report consisting of the best, median, and worst measurements, the standard deviation, and the number of measurements.

profile_report()

Calls profile and report in sequence.
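
The time profiling described above (repetitions falling back to timeit.Timer.autorange, then best, median, worst and standard deviation over the measurements) can be sketched with the standard library alone. This is not pywrangler's implementation: time_measurements and summarize are hypothetical names, and using the loop count returned by autorange as the number of repetitions, as well as the population standard deviation, are assumptions made for the example.

```python
import statistics
import timeit


def time_measurements(func, repetitions=None):
    """Return one timing in seconds per repetition of func."""
    timer = timeit.Timer(func)
    if repetitions is None:
        # autorange returns (number_of_loops, total_time_taken); the loop
        # count is used here as the default number of repetitions.
        repetitions, _ = timer.autorange()
    # Each measurement times a single call of func.
    return timer.repeat(repeat=repetitions, number=1)


def summarize(measurements):
    """Collect the summary statistics listed for the profiler."""
    return {
        "best": min(measurements),
        "median": statistics.median(measurements),
        "worst": max(measurements),
        "std": statistics.pstdev(measurements),  # population std (assumption)
        "runs": len(measurements),
    }


stats = summarize(time_measurements(lambda: sum(range(1000)), repetitions=5))
```

For very fast functions autorange can return a large loop count, so passing repetitions explicitly keeps the run time predictable.
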

pywrangler.pandas.util module

This module contains utility functions (e.g. validation) commonly used by pandas wranglers.

pywrangler.pandas.util.groupby(df: pandas.core.frame.DataFrame, groupby_columns: Union[str, Iterable[str], None]) → pandas.core.groupby.generic.DataFrameGroupBy[source]

Convenience function to group a dataframe while taking care of optional groupby columns. Always returns a DataFrameGroupBy object.

Parameters:
  • df (pd.DataFrame) – Dataframe to be grouped.
  • groupby_columns (TYPE_COLUMNS) – Columns to be grouped by.
Returns:groupby
Return type:DataFrameGroupBy
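
One plausible way to always return a DataFrameGroupBy object, even when no groupby columns are given, is to group on a constant key so that all rows fall into a single group. The sketch below shows that idea under this assumption; groupby_optional is a hypothetical name and the code is not taken from pywrangler.

```python
import numpy as np
import pandas as pd


def groupby_optional(df, groupby_columns=None):
    """Group df by the given columns, or into one group if none are given."""
    # Normalize None / str / iterable of str to a list of column names.
    if groupby_columns is None:
        columns = []
    elif isinstance(groupby_columns, str):
        columns = [groupby_columns]
    else:
        columns = list(groupby_columns)

    if columns:
        return df.groupby(columns)
    # No columns given: a constant key puts every row into a single group.
    return df.groupby(np.zeros(len(df), dtype=int))


df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3]})
single_group = groupby_optional(df)      # one group containing all rows
by_column = groupby_optional(df, "a")    # one group per value of "a"
```

Callers can then apply the same groupby operations in both cases without branching on whether grouping columns were supplied.
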

pywrangler.pandas.util.sort_values(df: pandas.core.frame.DataFrame, order_columns: Union[str, Iterable[str], None], ascending: Union[bool, Iterable[bool]]) → pandas.core.frame.DataFrame[source]

Convenience function to return a sorted dataframe while taking care of optional order columns and sort order (ascending/descending).

Parameters:
  • df (pd.DataFrame) – Dataframe to be sorted.
  • order_columns (TYPE_COLUMNS) – Columns to sort by.
  • ascending (TYPE_ASCENDING) – Sort order (ascending or descending) for the order columns.
Returns:df_sorted
Return type:pd.DataFrame
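
Handling optional order columns can be sketched as follows. The assumption here is that the dataframe is returned unsorted when no order columns are given; sort_values_optional is a hypothetical name and the code is not pywrangler's implementation.

```python
import pandas as pd


def sort_values_optional(df, order_columns=None, ascending=True):
    """Sort df by order_columns, or return it as-is if none are given."""
    # Normalize None / str / iterable of str to a list of column names.
    if order_columns is None:
        columns = []
    elif isinstance(order_columns, str):
        columns = [order_columns]
    else:
        columns = list(order_columns)

    if not columns:
        return df
    # pandas accepts a single bool or a list with one bool per column.
    if not isinstance(ascending, bool):
        ascending = list(ascending)
    return df.sort_values(columns, ascending=ascending)


df = pd.DataFrame({"a": [1, 2, 1], "b": [1, 3, 2]})
unsorted = sort_values_optional(df)
mixed = sort_values_optional(df, ["a", "b"], ascending=(True, False))
```
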

pywrangler.pandas.util.validate_columns(df: pandas.core.frame.DataFrame, columns: Union[str, Iterable[str], None])[source]

Check that the columns exist in the dataframe and raise an error otherwise.

Parameters:
  • df (pd.DataFrame) – Dataframe to check against.
  • columns (iterable[str]) – Columns to be validated.
pywrangler.pandas.util.validate_empty_df(df: pandas.core.frame.DataFrame)[source]

Check for an empty dataframe. By definition, wranglers operate on non-empty dataframes. Therefore, raise an error if the dataframe is empty.

Parameters:df (pd.DataFrame) – Dataframe to check against.
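
The two validation checks can be sketched as below. The documentation does not state which exception is raised, so ValueError is an assumption; check_columns and check_not_empty are hypothetical names rather than pywrangler's functions.

```python
import pandas as pd


def check_columns(df, columns):
    """Raise if any requested column is missing from df."""
    if columns is None:
        return
    if isinstance(columns, str):
        columns = [columns]
    missing = [column for column in columns if column not in df.columns]
    if missing:
        # ValueError is an assumed error type for this sketch.
        raise ValueError(f"Columns not found in dataframe: {missing}")


def check_not_empty(df):
    """Raise if df has no rows or no columns."""
    if df.empty:
        # ValueError is an assumed error type for this sketch.
        raise ValueError("Dataframe is empty.")


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
check_columns(df, ["a", "b"])  # passes silently
check_not_empty(df)            # passes silently
```
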

Module contents