pywrangler.util.testing package

Submodules

pywrangler.util.testing.datatestcase module

This module contains the DataTestCase class.

class pywrangler.util.testing.datatestcase.DataTestCase(engine: Optional[str] = None)[source]

Bases: object

Represents a data focused test case which has 3 major goals. First, it aims to unify and standardize test data formulation across different computation engines. Second, test data should be as readable as possible and should be maintainable in pure python. Third, it intends to make writing data centric tests as easy as possible while reducing the need for test case related boilerplate code.

To accomplish these goals, (1) it provides an abstraction layer for a computation engine independent data representation via PlainFrame. Test data is formulated once and automatically converted into the target computation engine representation. To ensure readability (2), test data may be formulated in column or row format with pure python objects. To reduce boilerplate code (3), it provides automatic assertion test functionality for all computation engines via EngineTester. Additionally, it allows defining mutants of the input data which should cause the test to fail (hence covering multiple distinct but similar test data scenarios within the same data test case).

Every data test case implements input and output methods. They resemble the data given to a test function and the computed data expected from the corresponding test function, respectively. Since the data needs to be formulated in a computation engine independent format, the PlainFrame is used. For convenience, a PlainFrame may also be instantiated from a dict or a tuple in multiple ways.

A dict requires typed column names as keys and values as values, which resembles the column format (define values column wise):

>>> result = {"col1:int": [1, 2, 3], "col2:str": ["a", "b", "c"]}

A tuple may be returned in 2 variants. Both represent the row format (define values row wise). The most verbose way is to include data, column names and dtypes:

>>> data = [[1, "a"],
>>>         [2, "b"],
>>>         [3, "b"]]
>>> columns = ["col1", "col2"]
>>> dtypes = ["int", "str"]
>>> result = (data, columns, dtypes)

Second, dtypes may be provided simultaneously with column names as typed column annotations:

>>> data = [[1, "a"], [2, "b"], [3, "b"]]
>>> columns = ["col1:int", "col2:str"]
>>> result = (data, columns)

In any case, you may also provide a PlainFrame directly.

input

Represents the data input given to a data transformation function to be tested.

It needs to be implemented by every data test case.

mutants

Mutants describe modifications to the input data which should cause the test to fail.

Mutants can be defined in various formats. You can provide a single mutant like:

>>> return ValueMutant(column="col1", row=0, value=3)

This is identical to the dictionary notation:

>>> return {("col1", 0): 3}

If you want to provide multiple mutations within one mutant at once, you can use the MutantCollection or simply rely on the dictionary notation:

>>> return {("col1", 2): 5, ("col2", 1): "asd"}

If you want to provide multiple mutants at once, you may provide multiple dictionaries within a list:

>>> [{("col1", 2): 5}, {("col1", 2): 3}]

Overall, all subclasses of BaseMutant are allowed to be used. You may also mix a specialized mutant with the dictionary notation:

>>> [RandomMutant(), {("col1", 0): 1}]

output

Represents the data output expected from the data transformation function to be tested.

It needs to be implemented by every data test case.
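
For illustration, here is a minimal sketch of a complete data test case; the test case name and the increment function are hypothetical and merely show how input, output and mutants interact with the engine specific assertions exposed via the test attribute:

>>> from pywrangler.util.testing.datatestcase import DataTestCase
>>> class IncrementTest(DataTestCase):
>>>     """Hypothetical test case for a function which adds 1 to col1."""
>>>     def input(self):
>>>         return {"col1:int": [1, 2, 3]}
>>>     def output(self):
>>>         return {"col1:int": [2, 3, 4]}
>>>     def mutants(self):
>>>         # changing the first value of col1 must make the test fail
>>>         return {("col1", 0): 10}
>>> def increment(df):
>>>     # hypothetical pandas function under test (dataframe as first argument)
>>>     return df.assign(col1=df["col1"] + 1)
>>> IncrementTest().test.pandas(increment)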

class pywrangler.util.testing.datatestcase.EngineTester(parent: pywrangler.util.testing.datatestcase.DataTestCase)[source]

Bases: object

Composite of DataTestCase which resembles a collection of engine specific assertion functions. More concretely, for each computation engine, the input data from the parent data test case is passed to the function to be tested. The result is then compared to the output data of the parent data test case. Each engine may additionally provide engine specific functionality (like repartition for pyspark).

generic_assert(test_func: Callable, test_kwargs: Optional[Dict[str, Any]], output_func: Callable)[source]

Generic assertion function for all computation engines which requires a computation engine specific output generation function.

Parameters:
  • test_func (callable) – A function that takes a pandas dataframe as the first keyword argument.
  • test_kwargs (dict, optional) – Keyword arguments which will be passed to test_func.
  • output_func (callable) – Output generation function which is computation engine specific.
generic_assert_mutants(func_generate_output: Callable)[source]

Given a computation engine specific output generation function func_generate_output, iterate all available mutants and confirm their test assertion.

Parameters:func_generate_output (callable) – Computation engine specific function that creates output PlainFrame given a mutant.
Raises:AssertionError is raised if a mutant is not killed.
pandas(test_func: Callable, test_kwargs: Optional[Dict[str, Any]] = None, merge_input: Optional[bool] = False, force_dtypes: Optional[Dict[str, str]] = None)[source]

Assert test data input/output equality for a given test function. Input data is passed to the test function and the result is compared to output data.

Some data test cases require the test function to add new columns to the input dataframe where correct row order is mandatory. In those cases, pandas test functions may only return new columns instead of adding columns to the input dataframe (modifying the input dataframe may result in performance penalties and hence should be prevented). This is special to pandas since it provides an index containing the row order information and does not require the input dataframe to be modified. However, data test cases are formulated to include the input dataframe within the output dataframe when row order matters because other engines may not have an explicit index column (e.g. pyspark). To account for this pandas specific behaviour, merge_input can be activated to make the assertion behave appropriately.

Parameters:
  • test_func (callable) – A function that takes a pandas dataframe as the first keyword argument.
  • test_kwargs (dict, optional) – Keyword arguments which will be passed to test_func.
  • merge_input (bool, optional) – Merge input dataframe to the computed result of the test function (inner join on index).
  • force_dtypes (dict, optional) – Enforce specific dtypes for the returned result of the pandas test function. This may be necessary due to float casts when NaN values are present.
Raises:

AssertionError is thrown if computed and expected results do not match.
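
The following is a hedged sketch of merge_input (test case and function names are illustrative): the test function returns only the newly computed column, and merge_input=True joins it back onto the input via the index before comparison with the expected output.

>>> class AddFlagTest(DataTestCase):
>>>     def input(self):
>>>         return {"col1:int": [1, 2, 3]}
>>>     def output(self):
>>>         # expected output contains the input column plus the new column
>>>         return {"col1:int": [1, 2, 3], "flag:bool": [False, True, True]}
>>> def add_flag(df):
>>>     # returns only the new column instead of modifying the input dataframe
>>>     return df["col1"].gt(1).to_frame("flag")
>>> AddFlagTest().test.pandas(add_flag, merge_input=True)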

pyspark(test_func: Callable, test_kwargs: Optional[Dict[str, Any]] = None, repartition: Union[int, List[str], None] = None)[source]

Assert test data input/output equality for a given test function. Input data is passed to the test function and the result is compared to output data.

Pyspark’s partitioning may be explicitly varied to test against different partitioning settings via repartition.

Parameters:
  • test_func (callable) – A function that takes a pandas dataframe as the first keyword argument.
  • test_kwargs (dict, optional) – Keyword arguments which will be passed to test_func.
  • repartition (int, list, optional) – Repartition input dataframe.
Raises:

AssertionError is thrown if computed and expected results do not match.
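
As a sketch, reusing the illustrative AddFlagTest from the pandas example above, the same test case can be asserted against pyspark while varying the partitioning:

>>> from pyspark.sql import functions as F
>>> def add_flag_spark(df):
>>>     # illustrative pyspark counterpart of the function under test
>>>     return df.withColumn("flag", F.col("col1") > 1)
>>> AddFlagTest().test.pyspark(add_flag_spark)
>>> AddFlagTest().test.pyspark(add_flag_spark, repartition=2)
>>> AddFlagTest().test.pyspark(add_flag_spark, repartition=["col1"])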

class pywrangler.util.testing.datatestcase.TestCollection(datatestcases: Sequence[pywrangler.util.testing.datatestcase.DataTestCase], test_kwargs: Optional[Dict[str, Dict[KT, VT]]] = None)[source]

Bases: object

Contains one or more DataTestCases. Provides convenience functions to test them as a group (e.g. with pytest).

testcases

List of collected DataTestCase instances.

Type:List[DataTestCase]
test_kwargs

A dict of optional parameter configurations which can be applied to the collected DataTestCase instances. Keys refer to configuration names. Values are dicts which in turn represent keyword arguments.

Type:dict, optional
names
pytest_parametrize_kwargs(identifier: str) → Callable[source]

Convenient decorator to access provided test_kwargs and wrap them into pytest.mark.parametrize.

Parameters:identifier (str) – The name of the test kwargs.

Examples

In the following example, conf1 represents an available configuration to be tested. param1 and param2 will be passed to the actual test function.

>>> kwargs= {"conf1": {"param1": 1, "param2": 2}}
>>> test_collection = TestCollection([test1, test2])
>>> @test_collection.pytest_parametrize_testcases
>>> @test_collection.pytest_parametrize_kwargs("conf1")
>>> def test_dummy(testcase, conf1):
>>>     testcase().test.pandas(some_func, test_kwargs=conf1)
pytest_parametrize_testcases(arg: Union[str, Callable]) → Callable[source]

Convenient decorator to wrap a test function which will be parametrized with all available DataTestCases in a pytest compliant manner.

The decorator can be called before wrapping the test function to supply a custom parameter name, or it can be used directly with the default parameter name (testcase). See the examples for more.

Parameters:arg (str, callable) – Name of the argument that will be used within the wrapped test function if the decorator gets called.

Examples

If not used with a custom parameter name, testcase is used by default:

>>> test_collection = TestCollection([test1, test2])
>>> @test_collection.pytest_parametrize_testcases
>>> def test_dummy(testcase):
>>>     testcase().test.pandas(some_func)

If a custom parameter name is provided, it will be used:

>>> test_collection = TestCollection([test1, test2])
>>> @test_collection.pytest_parametrize_testcases("customname")
>>> def test_dummy(customname):
>>>     customname().test.pandas(some_func)
class pywrangler.util.testing.datatestcase.TestDataConverter(name, bases, nmspc)[source]

Bases: type

Metaclass for DataTestCase. Its main purpose is to simplify the usage of DataTestCase and to avoid boilerplate code.

Essentially, it wraps and modifies the results of the input, output and mutants methods of DataTestCase.

For input and output, it converts the result to PlainFrame. For mutants, it converts the result to BaseMutant. Additionally, methods are wrapped as properties for simple dot notation access.

pywrangler.util.testing.datatestcase.convert_method(func: Callable, convert: Callable) → Callable[source]

Helper function to wrap a given function with a given converter function.

pywrangler.util.testing.mutants module

This module contains the data mutants and mutation classes.

class pywrangler.util.testing.mutants.BaseMutant[source]

Bases: object

Base class for all mutants. A mutant produces one or more mutations.

classmethod from_dict(raw: dict) → Union[pywrangler.util.testing.mutants.ValueMutant, pywrangler.util.testing.mutants.MutantCollection][source]

Factory method to conveniently convert a raw value into a Mutant instance. This is used for easy Mutant creation in dict format to avoid boilerplate code. Essentially, the dict format understands value mutations only. The key consists of a tuple of column and row and the value represents the actual new value, as follows:

>>> {("col1", 1): 0}

is identical to

>>> ValueMutant(column="col1", row=1, value=0)

Moreover, multiple mutations may be provided:

>>> {("col1", 1): 0, ("col1", 2): 1}

will result into

>>> MutantCollection([ValueMutant(column="col1", row=1, value=0),
>>>                   ValueMutant(column="col1", row=2, value=1)])
Parameters:raw (dict) – Raw value mutant definitions.
Returns:mutant
Return type:ValueMutant, MutantCollection
classmethod from_multiple_any(raw: Union[dict, BaseMutant, List[BaseMutant], None]) → List[pywrangler.util.testing.mutants.BaseMutant][source]

Factory method to conveniently convert raw values into a list of Mutant objects.

Mutants can be defined in various formats. You can provide a single mutant like:

>>> return ValueMutant(column="col1", row=0, value=3)

This is identical to the dictionary notation:

>>> return {("col1", 0): 3}

If you want to provide multiple mutations within one mutant at once, you can use the MutantCollection or simply rely on the dictionary notation:

>>> return {("col1", 2): 5, ("col2", 1): "asd"}

If you want to provide multiple mutants at once, you may provide multiple dictionaries within a list:

>>> [{("col1", 2): 5}, {("col1", 2): 3}]

Overall, all subclasses of BaseMutant are allowed to be used. You may also mix a specialized mutant with the dictionary notation:

>>> [RandomMutant(), {("col1", 0): 1}]

Parameters:raw (TYPE_RAW_MUTANTS) –
Returns:mutants – List of converted mutant instances.
Return type:list
generate_mutations(df: pywrangler.util.testing.plainframe.PlainFrame) → List[pywrangler.util.testing.mutants.Mutation][source]

Returns all mutations produced by a mutant given a PlainFrame. Needs to be implemented by every Mutant. This is essentially the core of every mutant.

Parameters:df (PlainFrame) – PlainFrame to generate mutations from.
Returns:mutations – List of Mutation instances.
Return type:list
get_params() → Dict[str, Any][source]

Retrieve all parameters set within the __init__ method.

Returns:param_dict – Parameter names as keys and corresponding values as values
Return type:dictionary
mutate(df: pywrangler.util.testing.plainframe.PlainFrame) → pywrangler.util.testing.plainframe.PlainFrame[source]

Modifies the given PlainFrame with its inherent mutations and returns a new, modified PlainFrame.

Parameters:df (PlainFrame) – PlainFrame to be modified.
Returns:modified
Return type:PlainFrame
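
For illustration, a minimal sketch of applying a mutant to a PlainFrame (values are arbitrary):

>>> from pywrangler.util.testing.mutants import ValueMutant
>>> from pywrangler.util.testing.plainframe import PlainFrame
>>> df = PlainFrame.from_plain(data=[[1, "a"], [2, "b"]],
>>>                            columns=["col1:int", "col2:str"])
>>> mutant = ValueMutant(column="col1", row=0, value=99)
>>> mutated = mutant.mutate(df)   # returns a new, modified PlainFrame
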
class pywrangler.util.testing.mutants.FunctionMutant(func: Callable)[source]

Bases: pywrangler.util.testing.mutants.BaseMutant

Represents a Mutant which wraps a function that essentially generates mutations.

func

A function to be used as a mutation generation method.

Type:callable
generate_mutations(df: pywrangler.util.testing.plainframe.PlainFrame) → List[pywrangler.util.testing.mutants.Mutation][source]

Delegates the mutation generation to a custom function to allow all possible mutation generation.

Parameters:df (PlainFrame) – PlainFrame to generate mutations from.
Returns:mutations – List of Mutation instances.
Return type:list
class pywrangler.util.testing.mutants.ImmutableMutation(column, row, value)

Bases: tuple

column

Alias for field number 0

row

Alias for field number 1

value

Alias for field number 2

class pywrangler.util.testing.mutants.MutantCollection(mutants: Sequence[T_co])[source]

Bases: pywrangler.util.testing.mutants.BaseMutant

Represents a collection of multiple Mutant instances.

mutants

List of mutants.

Type:sequence
generate_mutations(df: pywrangler.util.testing.plainframe.PlainFrame) → List[pywrangler.util.testing.mutants.Mutation][source]

Collects all mutations generated by included Mutants.

Parameters:df (PlainFrame) – PlainFrame to generate mutations from.
Returns:mutations – List of Mutation instances.
Return type:list
class pywrangler.util.testing.mutants.Mutation[source]

Bases: pywrangler.util.testing.mutants.ImmutableMutation

Resembles a single mutation of a dataframe which essentially represents a data modification of a single cell of a dataframe. Hence, a mutation is fully specified via three values: a column, a row and a new value.

The column is always given via label (string). The row is always given via an index (integer) because PlainFrame does not have labeled indices. The row index starts with 0. The new value may be of any type.

key
class pywrangler.util.testing.mutants.RandomMutant(count: int = 1, columns: Sequence[str] = None, rows: Sequence[int] = None, seed: int = 1)[source]

Bases: pywrangler.util.testing.mutants.BaseMutant

Creates random mutations with naive values for supported dtypes of PlainFrame. Randomness is controlled via an explicit seed to allow reproducibility. Mutation generation may be narrowed to given rows or columns. The number of distinct mutations may also be specified.

count

The number of mutations to be executed.

Type:int, optional
columns

Restrict mutations to provided columns, if given.

Type:sequence, optional
rows

Restrict mutations to provided rows, if given.

Type:sequence, optional
seed

Set the seed for the random generator.

Type:int, optional
generate_mutation(df: pywrangler.util.testing.plainframe.PlainFrame, column: str, row: int) → pywrangler.util.testing.mutants.Mutation[source]

Generates a single mutation from a given PlainFrame for a given candidate. A candidate is specified via column name and row index.

Parameters:
  • df (PlainFrame) – PlainFrame to generate mutations from.
  • column (str) – Identifies relevant column of mutation.
  • row (int) – Identifies relevant row of mutation.
Returns:

mutation

Return type:

Mutation

generate_mutations(df: pywrangler.util.testing.plainframe.PlainFrame) → List[pywrangler.util.testing.mutants.Mutation][source]

Generates population of all possible mutations and draws a sample of it.

Parameters:df (PlainFrame) – PlainFrame to generate mutations from.
Returns:mutations – List of Mutation instances.
Return type:list
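
A minimal usage sketch, assuming the illustrative PlainFrame df from the BaseMutant.mutate example above:

>>> from pywrangler.util.testing.mutants import RandomMutant
>>> mutant = RandomMutant(count=2, columns=["col1"], rows=[0, 1], seed=42)
>>> mutations = mutant.generate_mutations(df)   # list of Mutation instances
>>> mutated = mutant.mutate(df)                 # reproducible via the seed
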
class pywrangler.util.testing.mutants.ValueMutant(column: str, row: int, value: Any)[source]

Bases: pywrangler.util.testing.mutants.BaseMutant

Represents a Mutant with a single mutation.

column

Name of the column.

Type:str
row

Index of the row.

Type:int
value

The new value to be used.

Type:Any
generate_mutations(df: pywrangler.util.testing.plainframe.PlainFrame) → List[pywrangler.util.testing.mutants.Mutation][source]

Returns a single mutation.

Parameters:df (PlainFrame) – PlainFrame to generate mutations from.
Returns:mutations – List of Mutation instances.
Return type:list

pywrangler.util.testing.plainframe module

This module contains the PlainFrame and PlainColumn classes.

class pywrangler.util.testing.plainframe.ConverterFromPandas(df: pandas.core.frame.DataFrame)[source]

Bases: object

Converts a pandas dataframe into a PlainFrame.

convert_series(column: str, dtype: str) → List[Union[bool, int, float, str, datetime.datetime, pywrangler.util.testing.plainframe.NullValue]][source]

Converts a column of a pandas dataframe into a PlainFrame readable format with the specified dtype (np.NaN to NULL, timestamps to datetime.datetime).

Parameters:
  • column (str) – Identifier for column.
  • dtype (str) – Dtype identifier.
Returns:

values – Converted pandas series as plain python objects.

Return type:

list

static force_dtype(series: pandas.core.series.Series, dtype: str) → List[Union[bool, int, float, str, datetime.datetime, pywrangler.util.testing.plainframe.NullValue]][source]

Attempts to convert values to provided type.

Parameters:
  • series (pd.Series) – Values in pandas representation.
  • dtype (str) – Dtype identifier.
Returns:

values – Converted pandas series as plain python objects.

Return type:

list

get_forced_dtypes(dtypes: Union[List[str], Dict[str, str]]) → Dict[str, str][source]

Validate user provided dtypes parameter.

Parameters:dtypes (list, dict) – If list is provided, each value represents a dtype and maps to one column of the dataframe in order. If dict is provided, keys refer to column names and values represent dtypes.
Returns:dtypes_forced – Keys refer to column names and values represent dtypes.
Return type:dict
get_inferred_dtypes(dtypes_validated: Dict[str, str]) → Dict[str, str][source]

Get dtypes for all columns which have not been provided yet. Assumes that columns of dtype object are not present. Raises type error otherwise.

Parameters:dtypes_validated (dict) – Represents already given column/dtype pairs. Keys refer to column names and values represent dtypes.
Returns:dtypes_inferred – Keys refer to column names and values represent dtypes.
Return type:dict
get_object_dtypes(dtypes_validated: Dict[str, str]) → Dict[str, str][source]

Inspect all columns of dtype object and ensure no mixed dtypes are present. Raises type error otherwise. Ignores columns for which dtypes are already explicitly set.

Parameters:dtypes_validated (dict) – Represents already given column/dtype pairs. Keys refer to column names and values represent dtypes.
Returns:dtypes_object – Keys refer to column names and values represent dtypes.
Return type:dict
static inspect_dtype(series: pandas.core.series.Series) → str[source]

Get appropriate dtype of pandas series. Checks against bool, int, float and datetime. If dtype object is encountered, raises type error.

Parameters:series (pd.Series) – pandas series to be inspected.
Returns:dtype – Inferred dtype as string.
Return type:str
inspect_dtype_object(column: str) → str[source]

Inspect a series of dtype object and ensure no mixed dtypes are present. Tries to infer the actual dtype after removing np.NaN, distinguishing between dtypes bool and str.

Parameters:column (str) – Identifier for column.
Returns:dtype – Inferred dtype as string.
Return type:str
class pywrangler.util.testing.plainframe.ConverterFromPySpark(df: pyspark.sql.DataFrame)[source]

Bases: object

Convert pyspark dataframe into PlainFrame.

TYPE_MAPPING = {'bigint': 'int', 'boolean': 'bool', 'date': 'datetime', 'double': 'float', 'float': 'float', 'int': 'int', 'smallint': 'int', 'string': 'str', 'timestamp': 'datetime'}
static convert_null(values: Iterable[T_co]) → list[source]

Substitutes python None with NULL values.

Parameters:values (iterable) –
get_column_dtypes() → Tuple[List[str], List[str]][source]

Get column names and corresponding dtypes.

class pywrangler.util.testing.plainframe.ConverterToPandas(parent: pywrangler.util.testing.plainframe.PlainColumn)[source]

Bases: object

Collection of pandas conversion methods as a composite of PlainColumn. It handles pandas specifics like the missing distinction between NULL and NaN.

sanitized

Replaces any Null values with np.NaN to conform to pandas' missing value convention.

class pywrangler.util.testing.plainframe.ConverterToPySpark(parent: pywrangler.util.testing.plainframe.PlainColumn)[source]

Bases: object

Collection of pyspark conversion methods as a composite of PlainColumn. It handles spark specifics like NULL as None and proper type matching.

sanitized

Replaces Null values with None to conform to pyspark's missing value convention.

class pywrangler.util.testing.plainframe.EqualityAsserter(parent: pywrangler.util.testing.plainframe.PlainFrame)[source]

Bases: object

Collection of equality assertions as a composite of PlainFrame. It contains equality tests with regard to the number of rows, columns, dtypes etc.

class pywrangler.util.testing.plainframe.NullValue[source]

Bases: object

Represents null values. Provides operator comparison functions to allow sorting which is required to determine row order of data tables.

class pywrangler.util.testing.plainframe.PlainColumn(*args, **kwargs)[source]

Bases: pywrangler.util.testing.plainframe._ImmutablePlainColumn

Represents an immutable column of a PlainFrame consisting of a name, dtype and values. Ensures type validity.

Instantiation should be performed via from_plain factory method which adds preprocessing steps to ensure type correctness.

In addition, it contains conversion methods for all supported computation engines.

classmethod from_plain(name: str, dtype: str, values: Sequence[T_co]) → pywrangler.util.testing.plainframe.PlainColumn[source]

Factory method to instantiate PlainColumn from plain objects. Adds preprocessing steps for float and datetime types.

Parameters:
  • name (str) – Name of the column.
  • dtype (str) – Data type of the column. Must be one of bool, int, float, str or datetime.
  • values (sequence) – sequence of values
Returns:

plaincolumn

Return type:

PlainColumn

has_nan

Signals presence of NaN values.

has_null

Signals presence of NULL values.

modify(modifications: Dict[int, Any]) → pywrangler.util.testing.plainframe.PlainColumn[source]

Modifies PlainColumn and returns new instance. Modification does not change dtype, name or the number of values. One or more values will be modified.

Parameters:modifications (dict) – Dictionary containing modifications with keys representing row indices and values representing new values.
Returns:modified
Return type:PlainColumn
to_pandas

Composite for conversion functionality to pandas.

to_pyspark

Composite for conversion functionality to pyspark.

typed_column

Return typed column annotation of PlainColumn.

class pywrangler.util.testing.plainframe.PlainFrame(*args, **kwargs)[source]

Bases: pywrangler.util.testing.plainframe._ImmutablePlainFrame

Resembles an immutable dataframe in plain python. Its main purpose is to represent test data that is independent of any computation engine specific characteristics. It serves as a common baseline format. However, in order to be usable for all engines, it can be converted to and from any computation engine's data representation. This allows test data to be formulated in an engine independent way only once and to be employed for all computation engines simultaneously.

The main focus lies on simple but correct data representation. This includes explicit values for NULL and NaN. Each column needs to be typed. Available types are integer, boolean, string, float and datetime. For simplicity, all values will be represented as plain python types (no 3rd party). Hence, it is not intended to be used for large amounts of data due to its representation in plain python objects.

There are several limitations. No index column is supported (as in pandas). Mixed dtypes are not supported (like dtype object in pandas). No distinction is made between int32/int64 or single/double floats. Only primitive/atomic types are supported (pyspark’s ArrayType or MapType are currently not supported).

Essentially, a PlainFrame consists of only 3 attributes: column names, column types and column values. In addition, it provides conversion methods for all computation engines. It does not offer any computation methods itself because it only represents data.
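
A minimal conversion sketch (values are arbitrary; to_pyspark additionally assumes an active SparkSession):

>>> from pywrangler.util.testing.plainframe import PlainFrame
>>> df = PlainFrame.from_plain(data=[[1, "a"], [2, "b"]],
>>>                            columns=["col1:int", "col2:str"])
>>> pandas_df = df.to_pandas()
>>> spark_df = df.to_pyspark()
>>> roundtrip = PlainFrame.from_pandas(pandas_df)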

assert_equal

Return equality assertion composite.

columns

Return column names of PlainFrame.

data

Return data of PlainFrame row wise.

dtypes

Return dtypes of columns of PlainFrame.

classmethod from_any(raw: Union[PlainFrame, dict, tuple, pandas.core.frame.DataFrame, pyspark.sql.DataFrame]) → PlainFrame[source]

Instantiate PlainFrame from any possible type supported.

Checks the following scenarios: if a PlainFrame is given, it is simply passed through. If a dict is given, from_dict is called. If a tuple is given, from_plain is called. If a pandas dataframe is given, from_pandas is called. If a pyspark dataframe is given, from_pyspark is called.

Parameters:raw (TYPE_ANY_PF) – Input to be converted.
Returns:plainframe
Return type:PlainFrame
classmethod from_dict(data: collections.OrderedDict[str, Sequence]) → PlainFrame[source]

Instantiate PlainFrame from ordered dict. Assumes keys to be column names with type annotations and values to be values.

Parameters:data (dict) – Keys represent typed column annotations and values represent data values.
Returns:plainframe
Return type:PlainFrame
classmethod from_pandas(df: pandas.core.frame.DataFrame, dtypes: Union[List[str], Dict[str, str]] = None) → pywrangler.util.testing.plainframe.PlainFrame[source]

Instantiate PlainFrame from pandas DataFrame.

Parameters:
  • df (pd.DataFrame) – Dataframe to be converted.
  • dtypes (list, dict, optional) – If list is provided, each value represents a dtype and maps to one column of the dataframe in given order. If dict is provided, keys refer to column names and values represent dtypes.
Returns:

datatable – Converted dataframe

Return type:

PlainFrame

classmethod from_plain(data: Sequence[Sequence[T_co]], columns: Sequence[str], dtypes: Optional[Sequence[str]] = None, row_wise: bool = True)[source]

Instantiate PlainFrame from plain python objects. Dtypes have to be provided either via columns as typed column annotations or directly via dtypes. Typed column annotations are a convenient way to omit the dtypes parameter while specifying dtypes directly with the columns parameter.

An example of a typed column annotation is as follows:

>>> columns = ["col_a:int", "col_b:str", "col_c:float"]

Abbreviations may also be used like:

>>> columns = ["col_a:i", "col_b:s", "col_c:f"]

For a complete abbreviation mapping, please see TYPE_ABBR.

Parameters:
  • data (list) – List of iterables representing the input data.
  • columns (list) – List of strings representing the column names. Typed column annotations may be used here and will be evaluated if dtypes is not provided.
  • dtypes (list, optional) – List of column types.
  • row_wise (bool, optional) – By default, assumes data is provided in row wise format. All values belonging to the same row are stored in the same array. In contrast, if row_wise is False, column wise alignment is assumed. In this case, all values belonging to the same column are stored in the same array.
Returns:

plainframe

Return type:

PlainFrame
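
For column wise alignment, the same data can be passed with row_wise=False (a minimal sketch):

>>> df = PlainFrame.from_plain(
>>>     data=[[1, 2, 3], ["a", "b", "c"]],
>>>     columns=["col1", "col2"],
>>>     dtypes=["int", "str"],
>>>     row_wise=False)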

classmethod from_pyspark(df: pyspark.sql.DataFrame) → PlainFrame[source]

Instantiate PlainFrame from a pyspark DataFrame.

Parameters:df (pyspark.sql.DataFrame) – Dataframe to be converted.
Returns:datatable – Converted dataframe
Return type:PlainFrame
get_column(name: str) → pywrangler.util.testing.plainframe.PlainColumn[source]

Convenient access to PlainColumn via column name.

Parameters:name (str) – Label identifier for the column.
Returns:column
Return type:PlainColumn
modify(modifications: Dict[str, Dict[int, Any]]) → pywrangler.util.testing.plainframe.PlainFrame[source]

Modifies PlainFrame and returns new instance. Modification does not change dtype, name or the number of values of defined columns. One or more values of one or more columns will be modified.

Parameters:modifications (dict) – Contains modifications. Keys represent column names and values represent column specific modifications.
Returns:modified
Return type:PlainFrame
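
A minimal sketch, assuming the illustrative PlainFrame df from the from_plain example above:

>>> # change row 0 of col1 and row 2 of col2; returns a new PlainFrame
>>> modified = df.modify({"col1": {0: 99}, "col2": {2: "z"}})
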
n_cols

Return the number of columns.

n_rows

Return the number of rows.

to_dict() → collections.OrderedDict[str, tuple][source]

Converts PlainFrame into an ordered dictionary with typed column annotations as keys and data as values.

Returns:table_dict
Return type:OrderedDict
to_pandas() → pandas.core.frame.DataFrame[source]

Converts PlainFrame into a pandas dataframe.

to_plain() → Tuple[List[List[T]], List[str], List[str]][source]

Converts PlainFrame into tuple with 3 values (data, columns, dtypes).

Returns:
Return type:data, columns, dtypes
to_pyspark()[source]

Converts PlainFrame into a pyspark dataframe.

pywrangler.util.testing.util module

pywrangler.util.testing.util.concretize_abstract_wrangler(abstract_class: Type[CT_co]) → Type[CT_co][source]

Makes abstract wrangler classes instantiable for testing purposes by implementing abstract methods of BaseWrangler.

Parameters:abstract_class (Type) – Class object to inherit from while overriding abstract methods.
Returns:concrete_class – Concrete class usable for testing.
Return type:Type
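
A hedged usage sketch; MyAbstractWrangler is a hypothetical abstract BaseWrangler subclass whose abstract methods would otherwise prevent direct instantiation:

>>> from pywrangler.util.testing.util import concretize_abstract_wrangler
>>> ConcreteWrangler = concretize_abstract_wrangler(MyAbstractWrangler)
>>> instance = ConcreteWrangler()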

Module contents