pywrangler package

Subpackages
- pywrangler.dask package
- pywrangler.pandas package
- pywrangler.pyspark package
- pywrangler.util package

Submodules

pywrangler.base module
This module contains the BaseWrangler definition and the wrangler base classes including wrangler descriptions and parameters.
class pywrangler.base.BaseWrangler
Bases: abc.ABC

Defines the basic interface common to all data wranglers.
In analogy to sklearn transformers (see link below), all wranglers have to implement fit, transform and fit_transform methods. In addition, parameters (e.g. column names) need to be provided via the __init__ method. Furthermore, get_params and set_params methods are required for grid search and pipeline compatibility.
The fit method contains optional fitting (e.g. compute mean and variance for scaling) which sets training data dependent transformation behaviour. The transform method includes the actual computational transformation. The fit_transform either applies the former methods in sequence or adds a new implementation of both with better performance. The __init__ method should contain any logic behind parameter parsing and conversion.
In contrast to sklearn, wranglers only accept dataframe-like objects (such as pandas/pyspark/dask dataframes) as inputs to fit and transform. The relevant columns and their respective meaning are provided via the __init__ method. In addition, wranglers may accept multiple input dataframes with different shapes, and the number of samples may change between input and output (which is not allowed in sklearn). The preserves_sample_size attribute indicates whether the sample size (number of rows) may change during transformation.
The wrangler’s employed computation engine is given via computation_engine.
See also: https://scikit-learn.org/stable/developers/contributing.html
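The contract described above can be illustrated with a hypothetical wrangler. ColumnRenamer and its mapping parameter are invented for this sketch (they are not part of pywrangler), and a plain dict of column name to values stands in for a real pandas/pyspark/dask dataframe:

```python
class ColumnRenamer:
    """Illustrative wrangler following the BaseWrangler contract."""

    def __init__(self, mapping: dict):
        # Parameter parsing/conversion belongs in __init__.
        self.mapping = dict(mapping)

    def get_params(self) -> dict:
        # Expose __init__ parameters for grid search / pipeline compatibility.
        return {"mapping": self.mapping}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, df):
        # Renaming needs no training-data-dependent state; fit is a no-op.
        return self

    def transform(self, df):
        # The actual computational transformation: rename columns.
        return {self.mapping.get(col, col): values for col, values in df.items()}

    def fit_transform(self, df):
        # Default: apply fit and transform in sequence.
        return self.fit(df).transform(df)

df = {"a": [1, 2], "b": [3, 4]}
renamed = ColumnRenamer({"a": "x"}).fit_transform(df)
print(renamed)  # {'x': [1, 2], 'b': [3, 4]}
```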
- computation_engine
- get_params() → dict – Retrieve all wrangler parameters set within the __init__ method.
  Returns: param_dict (dict) – Parameter names as keys and corresponding values as values.
- preserves_sample_size
pywrangler.benchmark module
This module contains benchmarking utility.
class pywrangler.benchmark.BaseProfiler
Bases: object
Base class defining the interface for all profilers.
Subclasses have to implement profile (the actual profiling method) and less_is_better (defining the ranking of profiling measurements).
The private attribute _measurements is assumed to be set by profile.
- best – Returns the best measurement.
- less_is_better – Defines the ranking of measurements.
- measurements – Returns the measurements of the profiling.
- median – Returns the median of the measurements.
- profile(*args, **kwargs) – Contains the actual profiling implementation and has to set self._measurements. Always returns self.
- profile_report(*args, **kwargs) – Calls profile and report in sequence.
- report() – Prints a simple report consisting of best, median, worst, standard deviation and the number of measurements.
- runs – Returns the number of measurements.
- std – Returns the standard deviation of the measurements.
- worst – Returns the worst measurement.
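How these members relate can be sketched in plain Python, assuming _measurements holds the raw profiling values and lower values rank better. SketchProfiler and its hard-coded measurements are invented for illustration and are not part of pywrangler:

```python
import statistics

class SketchProfiler:
    """Toy profiler implementing the BaseProfiler interface."""

    def __init__(self, less_is_better=True):
        self.less_is_better = less_is_better
        self._measurements = None

    def profile(self, *args, **kwargs):
        # Stand-in for a real profiling run; must set self._measurements
        # and always return self.
        self._measurements = [12.0, 10.0, 11.0, 15.0]
        return self

    @property
    def best(self):
        return min(self._measurements) if self.less_is_better else max(self._measurements)

    @property
    def worst(self):
        return max(self._measurements) if self.less_is_better else min(self._measurements)

    @property
    def median(self):
        return statistics.median(self._measurements)

    @property
    def std(self):
        return statistics.pstdev(self._measurements)

    @property
    def runs(self):
        return len(self._measurements)

    def report(self):
        print(f"best {self.best}, median {self.median}, worst {self.worst}, "
              f"std {self.std:.2f}, runs {self.runs}")

profiler = SketchProfiler().profile()
print(profiler.best, profiler.median, profiler.worst, profiler.runs)
# 10.0 11.5 15.0 4
```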
class pywrangler.benchmark.MemoryProfiler(func: Callable, repetitions: int = 5, interval: float = 0.01)
Bases: pywrangler.benchmark.BaseProfiler
Approximate the increase in memory usage when calling a given function. Memory increase is defined as the difference between the maximum memory usage during function execution and the baseline memory usage before function execution.
In addition, compute the mean increase in baseline memory usage between repetitions which might indicate memory leakage.
Parameters:
- func (Callable) – Function to be memory profiled.
- repetitions (int, optional) – Number of profiling repetitions. Default is 5.
- interval (float, optional) – Interval in seconds between consecutive memory usage measurements. Default is 0.01.

- report() – Prints a simple report consisting of best, median, worst, standard deviation and the number of measurements.
- profile_report() – Calls profile and report in sequence.
Notes
The implementation is based on memory_profiler and is inspired by the IPython %memit magic which additionally calls gc.collect() before executing the function to get more stable results.
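The baseline-versus-peak idea can be sketched with the standard library's tracemalloc instead of memory_profiler. The memory_increase helper below is invented for illustration and only tracks Python-level allocations, but it follows the same scheme: collect garbage, record a baseline, run the function, and compare against the peak usage during execution:

```python
import gc
import tracemalloc

def memory_increase(func, *args, **kwargs):
    """Return bytes allocated above the baseline while calling func."""
    gc.collect()  # collect first for more stable results, as %memit does
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()   # memory before execution
    func(*args, **kwargs)
    _, peak = tracemalloc.get_traced_memory()       # maximum during execution
    tracemalloc.stop()
    return peak - baseline

increase = memory_increase(lambda n: bytearray(n), 1_000_000)
print(increase >= 1_000_000)  # True: at least the allocated buffer shows up
```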
- baseline_change – Returns the median change in baseline memory usage across all runs. The baseline memory usage is defined as the memory usage before function execution.
- baselines – Returns the absolute baseline memory usages for each run in bytes. The baseline memory usage is defined as the memory usage before function execution.
- less_is_better – Less memory consumption is better.
- max_usages – Returns the absolute maximum memory usages for each run in bytes.
- profile(*args, **kwargs) – Executes the actual memory profiling.
  Parameters:
  - args (iterable, optional) – Optional positional arguments passed to func.
  - kwargs (mapping, optional) – Optional keyword arguments passed to func.
class pywrangler.benchmark.TimeProfiler(func: Callable, repetitions: Union[None, int] = None)
Bases: pywrangler.benchmark.BaseProfiler
Approximate the time required to execute a function call.
By default, the number of repetitions is estimated if not set explicitly.
Parameters:
- func (Callable) – Function to be time profiled.
- repetitions (None, int, optional) – Number of repetitions. If None, the number of repetitions is estimated automatically. Default is None.

- report() – Prints a simple report consisting of best, median, worst, standard deviation and the number of measurements.
- profile_report() – Calls profile and report in sequence.
Notes
The implementation is based on standard library’s timeit module.
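A rough sketch of what such a timeit-based profiler does (time_profile is an invented helper, not the pywrangler API): repeat the call a fixed number of times and keep the per-run timings as the measurements.

```python
import timeit

def time_profile(func, repetitions=5):
    """Return one wall-clock timing per repetition, in seconds."""
    timer = timeit.Timer(func)
    # repeat() returns one total time per repetition (here: 1 call each)
    return timer.repeat(repeat=repetitions, number=1)

measurements = time_profile(lambda: sum(range(10_000)), repetitions=5)
print(len(measurements), min(measurements) >= 0.0)  # 5 True
```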
- less_is_better – Less time required is better.
- profile(*args, **kwargs) – Executes the actual time profiling.
  Parameters:
  - args (iterable, optional) – Optional positional arguments passed to func.
  - kwargs (mapping, optional) – Optional keyword arguments passed to func.
pywrangler.benchmark.allocate_memory(size: float) → numpy.ndarray
Helper function to approximately allocate memory by creating a numpy array of the given size in MiB. Numpy is used deliberately to define the used memory via dtype.
Parameters: size (float) – Size in MiB to be occupied.
Returns: memory_holder (np.ndarray)
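The MiB-to-bytes arithmetic behind such a helper can be sketched without numpy using a plain bytearray. allocate_memory_sketch is invented for illustration; a numpy version would instead divide the byte count by the dtype's item size (8 bytes for float64) to get the element count:

```python
def allocate_memory_sketch(size: float) -> bytearray:
    """Approximately allocate `size` MiB of memory."""
    n_bytes = int(size * (1 << 20))  # 1 MiB = 2**20 bytes
    return bytearray(n_bytes)

holder = allocate_memory_sketch(2.5)
print(len(holder))  # 2621440
```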
pywrangler.exceptions module

This module contains package-wide custom exceptions and warnings.
exception pywrangler.exceptions.NotProfiledError
Bases: ValueError, AttributeError

Exception class to raise if profiling results are acquired before calling profile. This class inherits from both ValueError and AttributeError to help with exception handling.
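The effect of the dual inheritance can be demonstrated with a structurally identical sketch exception (NotProfiledSketch is invented for illustration): handlers expecting either base class will catch it.

```python
class NotProfiledSketch(ValueError, AttributeError):
    """Toy exception inheriting from both bases, like NotProfiledError."""

caught = []
for base in (ValueError, AttributeError):
    try:
        raise NotProfiledSketch("profile() has not been called yet")
    except base:
        # The same exception is caught by handlers for either base class.
        caught.append(base.__name__)

print(caught)  # ['ValueError', 'AttributeError']
```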
pywrangler.wranglers module
This module contains computation engine independent wrangler interfaces and corresponding descriptions.
class pywrangler.wranglers.IntervalIdentifier(marker_column: str, marker_start: Any, marker_end: Any = <object object>, marker_start_use_first: bool = False, marker_end_use_first: bool = True, orderby_columns: Union[str, Iterable[str], None] = None, groupby_columns: Union[str, Iterable[str], None] = None, ascending: Union[bool, Iterable[bool]] = None, result_type: str = 'enumerated', target_column_name: str = 'iids')
Bases: pywrangler.base.BaseWrangler
Defines the reference interface for the interval identification wrangler.
An interval is defined as a range of values beginning with an opening marker and ending with a closing marker (e.g. the interval daylight may be defined as all events/values occurring between sunrise and sunset). Start and end marker may be identical.
The interval identification wrangler assigns ids to values such that values belonging to the same interval share the same interval id. For example, all values of the first daylight interval are assigned with id 1. All values of the second daylight interval will be assigned with id 2 and so on.
By default, values which do not belong to any valid interval are assigned the value 0 (please refer to result_type for different result types). If the start and end marker are identical, or the end marker is not provided, invalid values are only possible before the first start marker is encountered.
Due to messy data, start and end markers may occur multiple times in sequence before their counterpart is reached. Intervals may therefore have different spans depending on the task at hand: for example, either the very first or the very last start marker may define the correct start of an interval. Accordingly, four interval variants can be selected by setting marker_start_use_first and marker_end_use_first. The resulting intervals are as follows:
- first start / first end
- first start / last end (longest interval)
- last start / first end (shortest interval)
- last start / last end
Opening and closing markers are included in their corresponding interval.
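The id assignment can be sketched in plain Python over a list of marker values. enumerate_intervals is invented for illustration; it works on a list instead of a dataframe, assumes distinct start and end markers, and uses the first occurring start and first occurring end of each interval (the "enumerated" result type: ids increase in steps of one, invalid values get 0):

```python
def enumerate_intervals(values, marker_start, marker_end):
    """Assign ascending interval ids; 0 marks values outside any interval."""
    ids, current = [], []
    interval_id, open_id = 0, None
    for value in values:
        if open_id is None and value == marker_start:
            open_id = interval_id + 1  # tentatively open a new interval
        if open_id is None:
            ids.append(0)              # outside any interval -> invalid (0)
            continue
        current.append(value)          # markers are included in the interval
        if value == marker_end:
            interval_id = open_id      # close at the first end marker seen
            ids.extend([interval_id] * len(current))
            current, open_id = [], None
    ids.extend([0] * len(current))     # an interval never closed is invalid
    return ids

values = ["noise", "start", "x", "end", "noise", "start", "end"]
print(enumerate_intervals(values, "start", "end"))
# [0, 1, 1, 1, 0, 2, 2]
```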
Parameters: - marker_column (str) – Name of column which contains the opening and closing markers.
- marker_start (Any) – A value defining the start of an interval.
- marker_end (Any, optional) – A value defining the end of an interval. This value is optional. If not given, the end marker equals the start marker.
- marker_start_use_first (bool) – Identifies if the first occurring marker_start of an interval is used. Otherwise the last occurring marker_start is used. Default is False.
- marker_end_use_first (bool) – Identifies if the first occurring marker_end of an interval is used. Otherwise the last occurring marker_end is used. Default is True.
- orderby_columns (str, Iterable[str], optional) – Column names which define the order of the data (e.g. a timestamp column). Sort order can be defined with the parameter ascending.
- groupby_columns (str, Iterable[str], optional) – Column names which define how the data should be grouped/split into separate entities. For distributed computation engines, groupby columns should ideally reference partition keys to avoid data shuffling.
- ascending (bool, Iterable[bool], optional) – Sort ascending vs. descending. Specify a list for multiple sort orders. If a list is specified, its length must equal the length of orderby_columns. Default is True.
- result_type (str, optional) – Defines the content of the returned result. If ‘raw’, interval ids will be in arbitrary order with no distinction made between valid and invalid intervals. Intervals are distinguishable by interval id but the interval id may not provide any more information. If ‘valid’, the result is the same as ‘raw’ but all invalid intervals are set to 0. If ‘enumerated’, the result is the same as ‘valid’ but interval ids increase in ascending order (as defined by order) in steps of one.
- target_column_name (str, optional) – Name of the resulting target column.
- preserves_sample_size