pywrangler package

Submodules

pywrangler.base module

This module contains the BaseWrangler definition and the wrangler base classes including wrangler descriptions and parameters.

class pywrangler.base.BaseWrangler[source]

Bases: abc.ABC

Defines the basic interface common to all data wranglers.

In analogy to sklearn transformers (see link below), all wranglers have to implement fit, transform and fit_transform methods. In addition, parameters (e.g. column names) need to be provided via the __init__ method. Furthermore, get_params and set_params methods are required for grid search and pipeline compatibility.

The fit method contains optional fitting (e.g. computing mean and variance for scaling) which sets training data dependent transformation behaviour. The transform method contains the actual computational transformation. The fit_transform method either applies the former methods in sequence or provides a combined implementation of both with better performance. The __init__ method should contain any logic behind parameter parsing and conversion.

In contrast to sklearn, wranglers only accept dataframe-like objects (e.g. pandas/pyspark/dask dataframes) as inputs to fit and transform. The relevant columns and their respective meaning are provided via the __init__ method. In addition, wranglers may accept multiple input dataframes with different shapes, and the number of samples may change between input and output (which is not allowed in sklearn). The preserves_sample_size attribute indicates whether the sample size (number of rows) may change during transformation.

The computation engine employed by the wrangler is given via computation_engine.

See also

https://scikit-learn.org/stable/developers/contributing.html
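
The following minimal sketch illustrates the interface with a hypothetical pandas-based wrangler; the class, its column parameters and the chosen computation_engine value are illustrative and not part of pywrangler:

    import pandas as pd

    from pywrangler.base import BaseWrangler


    class ColumnAdder(BaseWrangler):
        """Hypothetical wrangler which sums two columns into a target column."""

        computation_engine = "pandas"
        preserves_sample_size = True

        def __init__(self, left: str, right: str, target: str = "total"):
            # parameter parsing and conversion belongs into __init__
            self.left = left
            self.right = right
            self.target = target

        def fit(self, df: pd.DataFrame) -> "ColumnAdder":
            # nothing needs to be learned from the training data here
            return self

        def transform(self, df: pd.DataFrame) -> pd.DataFrame:
            return df.assign(**{self.target: df[self.left] + df[self.right]})

        def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
            # apply fit and transform in sequence
            return self.fit(df).transform(df)
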
computation_engine
fit(*args, **kwargs)[source]
fit_transform(*args, **kwargs)[source]
get_params() → dict[source]

Retrieve all wrangler parameters set within the __init__ method.

Returns: param_dict – Parameter names as keys and corresponding values as values.
Return type: dict
preserves_sample_size
set_params(**params)[source]

Set wrangler parameters.

Parameters: params (dict) – Dictionary containing new values to be updated on the wrangler. Keys have to match parameter names of the wrangler.
Return type: self
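
Assuming the hypothetical ColumnAdder sketched above, a round trip through get_params and set_params could look as follows; the commented dictionary merely illustrates the documented behaviour:

    wrangler = ColumnAdder("price", "tax", target="gross")

    wrangler.get_params()
    # expected: {'left': 'price', 'right': 'tax', 'target': 'gross'}

    wrangler.set_params(target="total")  # keys must match __init__ parameter names
    wrangler.get_params()["target"]
    # expected: 'total'
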
transform(*args, **kwargs)[source]

pywrangler.benchmark module

This module contains benchmarking utility.

class pywrangler.benchmark.BaseProfiler[source]

Bases: object

Base class defining the interface for all profilers.

Subclasses have to implement profile (the actual profiling method) and less_is_better (defining the ranking of profiling measurements).

The private attribute _measurements is assumed to be set by profile.

measurements

The actual profiling measurements.

Type:list
best

The best measurement.

Type:float
median

The median of measurements.

Type:float
worst

The worst measurement.

Type:float
std

The standard deviation of measurements.

Type:float
runs

The number of measurements.

Type:int
profile()[source]

Contains the actual profiling implementation.

report()[source]

Print a simple report consisting of best, median, worst, standard deviation and the number of measurements.

profile_report()[source]

Calls profile and report in sequence.

best

Returns the best measurement.

less_is_better

Defines ranking of measurements.

measurements

Return measurements of profiling.

median

Returns the median of measurements.

profile(*args, **kwargs)[source]

Contains the actual profiling implementation and has to set self._measurements. Always returns self.

profile_report(*args, **kwargs)[source]

Calls profile and report in sequence.

report()[source]

Print a simple report consisting of best, median, worst, standard deviation and the number of measurements.

runs

Return number of measurements.

std

Returns the standard deviation of measurements.

worst

Returns the worst measurement.
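
The subclass contract described above (profile sets _measurements and returns self, less_is_better defines the ranking) can be sketched as follows; the profiler name and its wall-clock metric are illustrative, and the real base class may impose requirements not listed in this reference:

    import time
    from typing import Callable, List

    from pywrangler.benchmark import BaseProfiler


    class WallClockProfiler(BaseProfiler):
        """Hypothetical profiler measuring wall-clock seconds per run."""

        def __init__(self, func: Callable, repetitions: int = 5):
            self._func = func
            self._repetitions = repetitions

        @property
        def less_is_better(self) -> bool:
            # lower wall-clock time ranks better
            return True

        def profile(self, *args, **kwargs) -> "WallClockProfiler":
            measurements: List[float] = []
            for _ in range(self._repetitions):
                start = time.perf_counter()
                self._func(*args, **kwargs)
                measurements.append(time.perf_counter() - start)

            # contract: profile sets _measurements and always returns self
            self._measurements = measurements
            return self


    WallClockProfiler(lambda: sum(range(100_000))).profile_report()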

class pywrangler.benchmark.MemoryProfiler(func: Callable, repetitions: int = 5, interval: float = 0.01)[source]

Bases: pywrangler.benchmark.BaseProfiler

Approximate the increase in memory usage when calling a given function. Memory increase is defined as the difference between the maximum memory usage during function execution and the baseline memory usage before function execution.

In addition, compute the mean increase in baseline memory usage between repetitions, which might indicate memory leakage.

Parameters:
  • func (callable) – Callable object to be memory profiled.
  • repetitions (int, optional) – Number of repetitions.
  • interval (float, optional) – Defines interval duration between consecutive memory usage measurements in seconds.
measurements

The actual profiling measurements in bytes.

Type:list
best

The best measurement in bytes.

Type:float
median

The median of measurements in bytes.

Type:float
worst

The worst measurement in bytes.

Type:float
std

The standard deviation of measurements in bytes.

Type:float
runs

The number of measurements.

Type:int
baseline_change

The median change in baseline memory usage across all runs in bytes.

Type:float
profile()[source]

Contains the actual profiling implementation.

report()

Print a simple report consisting of best, median, worst, standard deviation and the number of measurements.

profile_report()

Calls profile and report in sequence.

Notes

The implementation is based on memory_profiler and is inspired by the IPython %memit magic which additionally calls gc.collect() before executing the function to get more stable results.

baseline_change

Returns the median change in baseline memory usage across all runs. The baseline memory usage is defined as the memory usage before function execution.

baselines

Returns the absolute baseline memory usages for each run in bytes. The baseline memory usage is defined as the memory usage before function execution.

less_is_better

Less memory consumption is better.

max_usages

Returns the absolute maximum memory usages for each run in bytes.

profile(*args, **kwargs)[source]

Executes the actual memory profiling.

Parameters:
  • args (iterable, optional) – Optional positional arguments passed to func.
  • kwargs (mapping, optional) – Optional keyword arguments passed to func.
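
A possible usage, pairing MemoryProfiler with the allocate_memory helper documented below; the commented values only indicate the expected order of magnitude:

    from pywrangler.benchmark import MemoryProfiler, allocate_memory

    # Profile the memory increase caused by allocating roughly 128 MiB.
    profiler = MemoryProfiler(allocate_memory, repetitions=3, interval=0.01)
    profiler.profile(128)       # positional arguments are passed through to func
    profiler.report()           # best/median/worst/std in bytes plus number of runs

    profiler.median             # roughly 128 * 2 ** 20 bytes
    profiler.baseline_change    # median change in baseline memory usage across runs
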
class pywrangler.benchmark.TimeProfiler(func: Callable, repetitions: Union[None, int] = None)[source]

Bases: pywrangler.benchmark.BaseProfiler

Approximate the time required to execute a function call.

By default, the number of repetitions is estimated if not set explicitly.

Parameters:
  • func (callable) – Callable object to be time profiled.
  • repetitions (None, int, optional) – Number of repetitions. If None, timeit.Timer.autorange will determine a sensible default.
measurements

The actual profiling measurements in seconds.

Type:list
best

The best measurement in seconds.

Type:float
median

The median of measurements in seconds.

Type:float
worst

The worst measurement in seconds.

Type:float
std

The standard deviation of measurements in seconds.

Type:float
runs

The number of measurements.

Type:int
profile()[source]

Contains the actual profiling implementation.

report()

Print a simple report consisting of best, median, worst, standard deviation and the number of measurements.

profile_report()

Calls profile and report in sequence.

Notes

The implementation is based on the standard library’s timeit module.

less_is_better

Less time required is better.

profile(*args, **kwargs)[source]

Executes the actual time profiling.

Parameters:
  • args (iterable, optional) – Optional positional arguments passed to func.
  • kwargs (mapping, optional) – Optional keyword arguments passed to func.
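
A possible usage with both the automatic and an explicit number of repetitions; the profiled function is arbitrary:

    from pywrangler.benchmark import TimeProfiler


    def sort_reversed():
        return sorted(range(100_000), reverse=True)


    # With repetitions=None, timeit.Timer.autorange picks the number of runs.
    TimeProfiler(sort_reversed).profile_report()

    # Explicit repetitions with access to the individual statistics.
    profiler = TimeProfiler(sort_reversed, repetitions=10)
    profiler.profile()
    profiler.best, profiler.median, profiler.std    # all reported in seconds
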
pywrangler.benchmark.allocate_memory(size: float) → numpy.ndarray[source]

Helper function to approximately allocate memory by creating a numpy array of the given size in MiB.

Numpy is used deliberately to define the used memory via the dtype.

Parameters: size (float) – Size in MiB to be occupied.
Returns: memory_holder
Return type: np.ndarray
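
For illustration, the returned array's byte count should approximately match the requested size; the dtype used internally is an implementation detail:

    import numpy as np

    from pywrangler.benchmark import allocate_memory

    memory_holder = allocate_memory(64)

    isinstance(memory_holder, np.ndarray)   # True
    memory_holder.nbytes                    # approximately 64 * 2 ** 20 bytes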

pywrangler.exceptions module

The module contains package wide custom exceptions and warnings.

exception pywrangler.exceptions.NotProfiledError[source]

Bases: ValueError, AttributeError

Exception class to raise if profiling results are acquired before calling profile.

This class inherits from both ValueError and AttributeError to help with exception handling.
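
Assuming that accessing profiling results before calling profile raises this exception, as the description suggests, it can be handled either explicitly or via its generic base classes:

    from pywrangler.benchmark import TimeProfiler
    from pywrangler.exceptions import NotProfiledError

    profiler = TimeProfiler(lambda: None)

    try:
        profiler.measurements           # results requested before profiling
    except NotProfiledError:
        pass                            # handle missing results explicitly

    # Due to the double inheritance, broader handlers catch it as well:
    try:
        profiler.measurements
    except (ValueError, AttributeError):
        pass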

pywrangler.wranglers module

This module contains computation engine independent wrangler interfaces and corresponding descriptions.

class pywrangler.wranglers.IntervalIdentifier(marker_column: str, marker_start: Any, marker_end: Any = <object object>, marker_start_use_first: bool = False, marker_end_use_first: bool = True, orderby_columns: Union[str, Iterable[str], None] = None, groupby_columns: Union[str, Iterable[str], None] = None, ascending: Union[bool, Iterable[bool]] = None, result_type: str = 'enumerated', target_column_name: str = 'iids')[source]

Bases: pywrangler.base.BaseWrangler

Defines the reference interface for the interval identification wrangler.

An interval is defined as a range of values beginning with an opening marker and ending with a closing marker (e.g. the interval daylight may be defined as all events/values occurring between sunrise and sunset). Start and end marker may be identical.

The interval identification wrangler assigns ids to values such that values belonging to the same interval share the same interval id. For example, all values of the first daylight interval are assigned id 1, all values of the second daylight interval are assigned id 2, and so on.

Values which do not belong to any valid interval are assigned the value 0 by default (please refer to result_type for different result types). If start and end marker are identical or the end marker is not provided, invalid values are only possible before the first start marker is encountered.

Due to messy data, start and end markers may occur multiple times in sequence until their counterpart is reached. Therefore, intervals may have different spans based on different task requirements. For example, either the very first or the very last start marker may define the correct start of an interval. Accordingly, four interval definitions can be selected by setting marker_start_use_first and marker_end_use_first. The resulting intervals are as follows:

  • first start / first end
  • first start / last end (longest interval)
  • last start / first end (shortest interval)
  • last start / last end

Opening and closing markers are included in their corresponding interval.

Parameters:
  • marker_column (str) – Name of column which contains the opening and closing markers.
  • marker_start (Any) – A value defining the start of an interval.
  • marker_end (Any, optional) – A value defining the end of an interval. This value is optional. If not given, the end marker equals the start marker.
  • marker_start_use_first (bool) – Identifies if the first occurring marker_start of an interval is used. Otherwise the last occurring marker_start is used. Default is False.
  • marker_end_use_first (bool) – Identifies if the first occurring marker_end of an interval is used. Otherwise the last occurring marker_end is used. Default is True.
  • orderby_columns (str, Iterable[str], optional) – Column names which define the order of the data (e.g. a timestamp column). Sort order can be defined with the parameter ascending.
  • groupby_columns (str, Iterable[str], optional) – Column names which define how the data should be grouped/split into separate entities. For distributed computation engines, groupby columns should ideally reference partition keys to avoid data shuffling.
  • ascending (bool, Iterable[bool], optional) – Sort ascending vs. descending. Specify a list for multiple sort orders. If a list is specified, its length must equal the length of orderby_columns. Default is True.
  • result_type (str, optional) – Defines the content of the returned result. If ‘raw’, interval ids will be in arbitrary order with no distinction made between valid and invalid intervals. Intervals are distinguishable by interval id but the interval id may not provide any more information. If ‘valid’, the result is the same as ‘raw’ but all invalid intervals are set to 0. If ‘enumerated’, the result is the same as ‘valid’ but interval ids increase in ascending order (as defined by order) in steps of one.
  • target_column_name (str, optional) – Name of the resulting target column.
preserves_sample_size
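
The interval semantics can be illustrated with a small pandas frame; the concrete, engine-specific implementation which would produce the iids column is not part of this module, and the expected ids below merely follow the description above:

    import pandas as pd

    # A small event stream ordered by `time`; `marker` holds the markers.
    df = pd.DataFrame({
        "time":   [1, 2, 3, 4, 5, 6, 7],
        "marker": ["noise", "start", "x", "end", "noise", "start", "end"],
    })

    # Parameters a concrete, engine-specific subclass would typically be given:
    params = dict(
        marker_column="marker",
        marker_start="start",
        marker_end="end",
        orderby_columns="time",
        result_type="enumerated",
        target_column_name="iids",
    )

    # Expected interval ids for result_type='enumerated': the two 'noise'
    # values belong to no interval and get 0, markers are included in their
    # intervals, and interval ids increase in steps of one.
    expected_iids = [0, 1, 1, 1, 0, 2, 2]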

Module contents