Developer guide

Setting up developer environment

Create and activate environment

First, create a separate virtual environment for pywrangler using the tool of your choice (e.g. conda):

conda create -n pywrangler_dev python=3.6

Next, make sure to activate your environment, or explicitly use the python interpreter of the newly created environment for the following commands:

source activate pywrangler_dev
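
To verify that the correct interpreter is picked up, you can check its version, which should report python 3.6 as specified above:

python --version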

Clone and install pywrangler

Install all dependencies

To clone pywrangler’s master branch into the current working directory and to install it in development mode (editable) with all dependencies, run the following command:

pip install -e git+https://github.com/mansenfranzen/pywrangler.git@master#egg=pywrangler[all] --src ''

You may separate cloning and installing:

git clone https://github.com/mansenfranzen/pywrangler.git
cd pywrangler
pip install -e .[all]

Install selected dependencies

You may not want to install all dependencies because some may be irrelevant for you. If you want to install only the minimal development dependencies required to develop pyspark data wranglers, replace [all] with [dev,pyspark]:

pip install -e git+https://github.com/mansenfranzen/pywrangler.git@master#egg=pywrangler[dev,pyspark] --src ''

All available dependency groups are listed in setup.cfg under options.extras_require.
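
Such an extras section follows the standard setuptools format. The group contents below are only an illustrative sketch, not a copy of pywrangler’s actual configuration:

[options.extras_require]
# illustrative groups -- consult the real setup.cfg
dev =
    pytest
    tox
pyspark =
    pyspark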

Running tests

pywrangler uses pytest as a testing framework and tox for providing different testing environments.

Using pytest

If you want to run tests within your currently activated python environment, just run pytest (assuming you are currently in pywrangler’s root directory):

pytest

This will run all tests. However, you may want to run only tests which are related to pyspark:

pytest -m pyspark

The same works for pandas and dask:
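
pytest -m pandas
pytest -m dask

This selection relies on pytest markers. Assuming pywrangler registers one marker per computation engine (which the commands above imply), a test is tagged like this illustrative sketch:

import pytest

@pytest.mark.pyspark  # assumed marker name, matching pytest -m pyspark
def test_my_pyspark_wrangler():
    ...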

Using tox

pywrangler specifies many different test environments to ensure that it works as expected across multiple python versions and varying computation engine versions.

If you want to test against all environments, simply run tox:

tox

If you want to run tests within a specific environment (e.g. the most current computation engines for python 3.7), you need to provide the environment abbreviation directly:

tox -e py37-master

Please refer to the tox.ini to see all available environments.
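
The relevant part of a tox.ini could look like the following sketch (apart from py37-master, the names and settings here are illustrative, not pywrangler’s actual configuration):

[tox]
envlist = py{36,37}-master

[testenv]
extras = all
commands = pytest {posargs}

Running tox -l also prints the environments declared in the envlist.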

Writing tests for data wranglers

When writing tests for data wranglers, it is highly recommended to use pywrangler’s DataTestCase. It allows test cases to be formulated independently of the computation engine, with three major goals in mind:

  • Unify and standardize test data formulation across different computation engines.
  • Let test data be as readable and maintainable as possible.
  • Make writing data centric tests easy while reducing boilerplate code.

Note

Once a test is formulated with the DataTestCase, you may easily convert it to any computation backend. Behind the scenes, a computation engine independent dataframe called PlainFrame converts the provided test data to the specific test engine.

Example

Let’s start with a simple example. Imagine a data transformation for time series which increases a counter each time it encounters a specific target signal.

Essentially, a data transformation focused test case requires two things: first, the input data which needs to be processed; second, the output data which is expected as a result of the data wrangling stage:

from pywrangler.util.testing import DataTestCase

class IncreaseOneTest(DataTestCase):

   def input(self):
      """Provide the data given to the data wrangler."""

      cols = ["order:int", "signal:str"]
      data = [[    1,       "noise"],
              [    2,       "target"],
              [    3,       "noise"],
              [    4,       "noise"],
              [    5,       "target"]]

      return data, cols

   def output(self):
      """Provide the data expected from the data wrangler."""

      cols = ["order:int", "signal:str", "result:int"]
      data = [[    1,       "noise",          0],
              [    2,       "target",         1],
              [    3,       "noise",          1],
              [    4,       "noise",          1],
              [    5,       "target",         2]]

      return data, cols

That’s all you need to do in order to define a data test case. As you can see, typed columns (specified as "name:type") are provided along with the corresponding data in a human-readable format.

Next, let’s write two different implementations using pandas and pyspark and test them against the IncreaseOneTest:

import pandas as pd
from pyspark.sql import functions as F, DataFrame, Window

def transform_pandas(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values("order")
    result = df["signal"].eq("target").cumsum()

    return df.assign(result=result)

def transform_pyspark(df: DataFrame) -> DataFrame:
    target = F.col("signal").eqNullSafe("target").cast("integer")
    result = F.sum(target).over(Window.orderBy("order"))

    return df.withColumn("result", result)

# instantiate test case
test_case = IncreaseOneTest()

# perform test assertions for given computation backends
test_case.test.pandas(transform_pandas)
test_case.test.pyspark(transform_pyspark)

The single test case IncreaseOneTest can be used to test multiple implementations based on different computation engines.

The DataTestCase and PlainFrame offer much more functionality, which is covered in the corresponding reference pages. For example, you may use PlainFrame to seamlessly convert between pandas and pyspark dataframes. DataTestCase also allows you to formulate mutants of the input data which should cause the test to fail (hence covering multiple distinct but similar test data scenarios within the same data test case).
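
As a sketch of that conversion (assuming PlainFrame is importable from pywrangler.util.testing alongside DataTestCase; treat the names from_plain, to_pandas and to_pyspark as assumptions to verify in the reference pages):

from pywrangler.util.testing import PlainFrame

# build an engine independent dataframe from plain python objects,
# using the same typed column syntax as in the test case above
df = PlainFrame.from_plain(
    data=[[1, "noise"], [2, "target"]],
    columns=["order:int", "signal:str"],
)

pandas_df = df.to_pandas()   # materialize as a pandas dataframe
spark_df = df.to_pyspark()   # materialize as a pyspark dataframe (requires spark)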

Note

DataTestCase currently supports only single input, single output data wranglers. Data wranglers requiring multiple input dataframes or computing multiple output dataframes are not yet supported.

Building & writing docs

Design & architecture