pywrangler.pandas package
Subpackages
Submodules
pywrangler.pandas.base module
This module contains the pandas base wrangler.
-
class pywrangler.pandas.base.PandasSingleNoFit
Bases: pywrangler.pandas.base.PandasWrangler
Mixin class defining fit and fit_transform for all wranglers that take a single dataframe as input, return a single dataframe as output, and require no fitting.
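The role of such a mixin can be sketched as follows. This is not pywrangler's source: the classes `SketchWrangler`, `SketchSingleNoFit` and `DropEmptyRows` are hypothetical stand-ins, and a plain list of dicts stands in for a dataframe so that the example has no dependencies.

```python
class SketchWrangler:
    """Hypothetical stand-in for a wrangler base class (not pywrangler's)."""

    def transform(self, df):
        raise NotImplementedError


class SketchSingleNoFit(SketchWrangler):
    """Hypothetical mixin: one input, one output, nothing to learn."""

    def fit(self, df):
        # No state is derived from the data, so fitting only returns self.
        return self

    def fit_transform(self, df):
        # Fitting is a no-op, so this is equivalent to calling transform.
        return self.fit(df).transform(df)


class DropEmptyRows(SketchSingleNoFit):
    """Hypothetical concrete wrangler that only implements transform."""

    def transform(self, df):
        # Keep every non-empty record.
        return [row for row in df if row]


rows = [{"a": 1}, {}, {"a": 2}]
wrangler = DropEmptyRows()
result = wrangler.fit_transform(rows)
```

Under this assumption, a concrete wrangler only has to implement transform, while fit and fit_transform come from the mixin.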
-
class pywrangler.pandas.base.PandasWrangler
Bases: pywrangler.base.BaseWrangler
Pandas wrangler base class.
-
computation_engine
-
pywrangler.pandas.benchmark module
This module contains benchmarking utilities for pandas wranglers.
-
class pywrangler.pandas.benchmark.PandasMemoryProfiler(wrangler: pywrangler.pandas.base.PandasWrangler, repetitions: int = 5, interval: float = 0.01)
Bases: pywrangler.benchmark.MemoryProfiler
Approximate memory usage that a pandas wrangler instance requires to execute the fit_transform step.
The key metric computed is ratio. It estimates how much additional memory the fit_transform step needs relative to the size of its input: the memory usage increase during function execution divided by the memory usage of the input dataframes. For example, with a 1GB input dataframe and a ratio of 5, fit_transform needs 5GB of free memory to succeed. A ratio of 0.5 with a 2GB input dataframe would require 1GB of free memory for computation.
-
report()
Print a simple report consisting of the best, median, and worst measurement, the standard deviation, and the number of measurements.
-
profile_report()
Calls profile and report in sequence.
-
input
Returns the memory usage of the input dataframes in bytes.
-
output
Returns the memory usage of the output dataframes in bytes.
-
profile(*dfs, **kwargs)
Profiles the actual memory usage for the input dataframes dfs, which are passed to fit_transform.
-
ratio
The amount of memory required to execute the fit_transform step, expressed relative to the input size: the memory usage increase during function execution divided by the memory usage of the input dataframes. For example, with a 1GB input dataframe and a ratio of 5, fit_transform needs 5GB of free memory to succeed. A ratio of 0.5 with a 2GB input dataframe would require 1GB of free memory for computation.
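The arithmetic behind the two examples can be written out as a small sketch. The helper names `memory_ratio` and `required_free_memory` are hypothetical and not part of pywrangler; they only restate the definition given above.

```python
GB = 1024 ** 3  # bytes per gigabyte, used only for readability


def memory_ratio(memory_increase_bytes: float, input_bytes: float) -> float:
    """Memory increase during execution divided by input memory usage."""
    return memory_increase_bytes / input_bytes


def required_free_memory(input_bytes: float, ratio: float) -> float:
    """Free memory (in bytes) needed for an input of the given size."""
    return input_bytes * ratio


# A 1GB input with a ratio of 5 needs 5GB of free memory.
example_a = required_free_memory(1 * GB, 5)

# A 2GB input with a ratio of 0.5 needs 1GB of free memory.
example_b = required_free_memory(2 * GB, 0.5)
```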
-
-
class pywrangler.pandas.benchmark.PandasTimeProfiler(wrangler: pywrangler.pandas.base.PandasWrangler, repetitions: Union[None, int] = None)
Bases: pywrangler.benchmark.TimeProfiler
Approximate time that a pandas wrangler instance requires to execute the fit_transform step.
-
profile()
Contains the actual profiling implementation.
-
report()
Print a simple report consisting of the best, median, and worst measurement, the standard deviation, and the number of measurements.
-
profile_report()
Calls profile and report in sequence.
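The statistics named in the report description can be illustrated with a standalone sketch that uses only the standard library. The helpers `time_calls` and `summarize` are hypothetical and do not reproduce the profiler's implementation; in particular, the use of the population standard deviation is an assumption.

```python
import statistics
import time


def time_calls(func, repetitions=5):
    """Return wall-clock durations (in seconds) of repeated calls to func."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(measurements):
    """Collect best, median, worst, standard deviation and count."""
    return {
        "best": min(measurements),
        "median": statistics.median(measurements),
        "worst": max(measurements),
        # Assumption: population standard deviation of the measurements.
        "std": statistics.pstdev(measurements),
        "count": len(measurements),
    }


timings = time_calls(lambda: sum(range(10_000)), repetitions=5)
summary = summarize(timings)

# Fixed input to make the statistics easy to check by hand.
fixed = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
```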
-
pywrangler.pandas.util module
This module contains utility functions (e.g. validation) commonly used by pandas wranglers.
-
pywrangler.pandas.util.groupby(df: pandas.core.frame.DataFrame, groupby_columns: Union[str, Iterable[str], None]) → pandas.core.groupby.generic.DataFrameGroupBy
Convenience function to group a dataframe while taking care of optional groupby columns. Always returns a DataFrameGroupBy object.
Parameters:
- df (pd.DataFrame) – Dataframe to be grouped.
- groupby_columns (TYPE_COLUMNS) – Columns to be grouped by.
Returns: groupby
Return type: DataFrameGroupBy
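One way to obtain this behaviour is sketched here. The function `groupby_sketch` is hypothetical, and grouping on a constant key when no columns are given is an assumption about how a DataFrameGroupBy object could always be returned, not a statement about pywrangler's implementation.

```python
import numpy as np
import pandas as pd


def groupby_sketch(df, groupby_columns=None):
    """Group df by the given columns, or into one group if none are given."""
    if groupby_columns is None:
        columns = []
    elif isinstance(groupby_columns, str):
        columns = [groupby_columns]
    else:
        columns = list(groupby_columns)

    if columns:
        return df.groupby(columns)

    # Assumption: a constant key puts every row into a single group, so the
    # caller still receives a DataFrameGroupBy object.
    return df.groupby(np.zeros(len(df)))


df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
by_key = groupby_sketch(df, "key")
no_key = groupby_sketch(df, None)
```

With this approach, downstream code can call the same groupby methods whether or not grouping columns were supplied.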
-
pywrangler.pandas.util.sort_values(df: pandas.core.frame.DataFrame, order_columns: Union[str, Iterable[str], None], ascending: Union[bool, Iterable[bool]]) → pandas.core.frame.DataFrame
Convenience function to return a sorted dataframe while taking care of optional order columns and sort order (ascending/descending).
Parameters:
- df (pd.DataFrame) – Dataframe to be sorted.
- order_columns (TYPE_COLUMNS) – Columns to sort by.
- ascending (TYPE_ASCENDING) – Sort order of the columns (ascending or descending).
Returns: df_sorted
Return type: pd.DataFrame
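A minimal sketch of sorting with optional order columns follows. The function `sort_values_sketch` is hypothetical, and returning the dataframe unchanged when no order columns are given is an assumption rather than documented behaviour.

```python
import pandas as pd


def sort_values_sketch(df, order_columns=None, ascending=True):
    """Return df sorted by order_columns, or unchanged if none are given."""
    if order_columns is None:
        return df
    if isinstance(order_columns, str):
        columns = [order_columns]
    else:
        columns = list(order_columns)
    if not columns:
        return df
    # ascending may be a single bool or one bool per order column.
    return df.sort_values(columns, ascending=ascending)


df = pd.DataFrame({"a": [2, 1, 2], "b": [1, 5, 3]})

# Sort by "a" ascending, then by "b" descending.
sorted_df = sort_values_sketch(df, ["a", "b"], ascending=[True, False])

# Without order columns the input is passed through.
unsorted_df = sort_values_sketch(df, None, ascending=True)
```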
-
pywrangler.pandas.util.validate_columns(df: pandas.core.frame.DataFrame, columns: Union[str, Iterable[str], None])
Check that columns exist in the dataframe and raise an error otherwise.
Parameters:
- df (pd.DataFrame) – Dataframe to check against.
- columns (iterable[str]) – Columns to be validated.
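A check of this kind could look like the following sketch. The function `validate_columns_sketch` is hypothetical; raising ValueError and skipping validation when columns is None are assumptions, since the description above names neither the error type nor the handling of None.

```python
import pandas as pd


def validate_columns_sketch(df, columns=None):
    """Raise an error if any of the given columns is missing from df."""
    if columns is None:
        # Assumption: nothing to validate when no columns are given.
        return
    names = [columns] if isinstance(columns, str) else list(columns)
    missing = [name for name in names if name not in df.columns]
    if missing:
        # Assumption: ValueError is used here for illustration only.
        raise ValueError(f"Columns not found in dataframe: {missing}")


df = pd.DataFrame({"a": [1], "b": [2]})

# All columns exist, so no error is raised.
validate_columns_sketch(df, ["a", "b"])

# A missing column triggers the error.
try:
    validate_columns_sketch(df, ["a", "c"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```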