d6tjoin package

Submodules

d6tjoin.top1 module

class d6tjoin.top1.MergeTop1(df1, df2, fuzzy_left_on=None, fuzzy_right_on=None, exact_left_on=None, exact_right_on=None, fun_diff=None, top_limit=None, is_keep_debug=False, use_multicore=True)

Bases: object

Left best-match join. For each join key in the left dataframe, it applies a difference function and joins the right-dataframe key with the smallest difference.

Parameters:
  • df1 (dataframe) – left dataframe onto which the right dataframe is joined
  • df2 (dataframe) – right dataframe
  • fuzzy_left_on (list) – join keys for similarity match, left dataframe
  • fuzzy_right_on (list) – join keys for similarity match, right dataframe
  • exact_left_on (list, default None) – join keys for exact match, left dataframe
  • exact_right_on (list, default None) – join keys for exact match, right dataframe
  • fun_diff (list, default None) – list of difference functions to be applied for each fuzzy key
  • top_limit (list, default None) – list of values to cap similarity matches
  • is_keep_debug (bool) – keep diagnostics columns, good for debugging

Note

  • fun_diff: list of difference functions used to find the best match, i.e. the one with the minimum distance
    • By default, the function is determined automatically depending on whether the key is a string or a date/number
    • Use None to keep the default for a key, for example [None, lambda x, y: x-y]
    • Functions in the list are applied to the fuzzy join keys in the same order
    • Each function must be a difference function, so lower is better. For similarity measures such as Jaccard, where higher is better, invert the score (for example 1 - similarity)
  • top_limit: limits matches to those with a difference below that value. For example, if two strings differ by 3 but top_limit is 2, that match is ignored
    • For dates you can use pd.offsets.Day(1) or similar
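
The two notes above can be illustrated with a small standalone sketch. It does not call d6tjoin; the helper names (jaccard_similarity, jaccard_difference, best_match) are hypothetical and only show how a "higher is better" similarity might be turned into a difference function, and how a cap in the spirit of top_limit could discard weak matches. Whether the library keeps a match whose difference equals the limit is not stated here, so the sketch assumes it does.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of the character sets of two strings (1.0 = same set)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def jaccard_difference(a, b):
    """Invert the similarity so that lower means a better match."""
    return 1.0 - jaccard_similarity(a, b)


def best_match(key, candidates, fun_diff, top_limit=None):
    """Return (candidate, difference) for the smallest difference.

    Returns None when there are no candidates or when the best difference
    is above top_limit (assumption: a difference equal to the limit is kept).
    """
    if not candidates:
        return None
    diff, candidate = min((fun_diff(key, c), c) for c in candidates)
    if top_limit is not None and diff > top_limit:
        return None
    return candidate, diff


if __name__ == "__main__":
    print(best_match("apple", ["appel", "banana"], jaccard_difference))
    print(best_match("apple", ["banana"], jaccard_difference, top_limit=0.5))
```

A function shaped like jaccard_difference(x, y) is the kind of callable that could be passed in the fun_diff list, with None in the positions where the default should be kept.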
merge()

Executes merge

Returns: dict with keys ‘merged’, which holds the merged dataframe, and ‘top1’, which holds the best matches by fuzzy_left_on. See example notebooks for details
Return type: dict
class d6tjoin.top1.MergeTop1Diff(df1, df2, fuzzy_left_on, fuzzy_right_on, fun_diff=None, exact_left_on=None, exact_right_on=None, top_limit=None, topn=1, fun_preapply=None, fun_postapply=None, is_keep_debug=False, use_multicore=True)

Bases: object

Top1 minimum difference join. Mostly used for strings. Helper for MergeTop1.

merge()
top1_diff()
class d6tjoin.top1.MergeTop1Number(df1, df2, fuzzy_left_on, fuzzy_right_on, exact_left_on=None, exact_right_on=None, direction='nearest', top_limit=None, is_keep_debug=False)

Bases: object

Top1 minimum difference join for numbers. Helper for MergeTop1.

merge()
top1_diff()
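
The direction parameter is not described in this reference. As a rough illustration only, the sketch below assumes it behaves like the direction argument of pandas.merge_asof ('backward', 'forward', 'nearest') and shows what a minimum-difference match on numbers could look like in plain Python. The function nearest_value is hypothetical and is not part of d6tjoin.

```python
import bisect


def nearest_value(key, sorted_values, direction="nearest", top_limit=None):
    """Pick the value in sorted_values that best matches key.

    direction (assumed semantics, borrowed from pandas.merge_asof):
      'backward' -> largest value <= key
      'forward'  -> smallest value >= key
      'nearest'  -> value with the smallest absolute difference
    top_limit caps the allowed absolute difference (assumed inclusive).
    """
    if not sorted_values:
        return None
    pos = bisect.bisect_left(sorted_values, key)
    below = sorted_values[pos - 1] if pos > 0 else None  # largest value < key
    above = sorted_values[pos] if pos < len(sorted_values) else None  # smallest value >= key

    if above is not None and above == key:
        match = above  # an exact match satisfies every direction
    elif direction == "backward":
        match = below
    elif direction == "forward":
        match = above
    elif direction == "nearest":
        candidates = [v for v in (below, above) if v is not None]
        # On a tie, min() keeps the first candidate, i.e. the lower value.
        match = min(candidates, key=lambda v: abs(v - key))
    else:
        raise ValueError("direction must be 'backward', 'forward' or 'nearest'")

    if match is None:
        return None
    if top_limit is not None and abs(match - key) > top_limit:
        return None
    return match


if __name__ == "__main__":
    values = [10, 20, 30]
    print(nearest_value(14, values))
    print(nearest_value(14, values, direction="forward"))
    print(nearest_value(14, values, top_limit=3))
```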

d6tjoin.utils module

d6tjoin.utils.ncharTokenCount(dfs, nchars=None, overlapping=False, mincount=2, minlength=1)

Tokenize a series of strings by splitting each string into tokens of nchars length, then count the occurrences of each token in the series.

Parameters:
  • dfs (pd.series) – pd.series of values
  • nchars (int) – number of characters in each token
  • overlapping (bool) – make overlapping tokens
  • mincount (int) – discard tokens with count less than mincount
  • minlength (int) – discard tokens with string length less than minlength
Returns:

count of tokens

Return type:

dataframe
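
As a minimal sketch of the idea, assuming plain Python lists instead of a pandas series and a dict instead of a dataframe, fixed-length tokenization and counting might look like the following. The helpers nchar_tokens and nchar_token_count are hypothetical; details such as how a trailing short token is handled are assumptions, not the library's documented behavior.

```python
from collections import Counter


def nchar_tokens(text, nchars, overlapping=False):
    """Split text into tokens of nchars characters.

    Non-overlapping: "abcde", 2 -> ["ab", "cd", "e"] (the last token may be shorter).
    Overlapping:     "abcd", 2  -> ["ab", "bc", "cd"] (window slides by one character).
    """
    if overlapping:
        starts = range(0, max(len(text) - nchars + 1, 1))
    else:
        starts = range(0, len(text), nchars)
    tokens = [text[i:i + nchars] for i in starts]
    return [t for t in tokens if t]  # drop empty tokens from empty input


def nchar_token_count(values, nchars, overlapping=False, mincount=2, minlength=1):
    """Count tokens across all values, applying the two filters."""
    counts = Counter()
    for text in values:
        counts.update(
            t for t in nchar_tokens(text, nchars, overlapping) if len(t) >= minlength
        )
    return {token: n for token, n in counts.items() if n >= mincount}


if __name__ == "__main__":
    ids = ["ABC-001", "ABC-002", "XYZ-001"]
    print(nchar_token_count(ids, nchars=3))
    print(nchar_token_count(ids, nchars=3, minlength=3))
```

Counting tokens this way can help spot shared prefixes or suffixes in join keys before deciding how to match them.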

d6tjoin.utils.splitcharTokenCount(dfs, splitchars='[^a-zA-Z0-9]+', mincount=2, minlength=1)

Tokenize a series of strings by splitting each string on a set of characters, then count the occurrences of each token in the series.

Parameters:
  • dfs (pd.series) – pd.series of values
  • splitchars (str) – regex by which to split each string into tokens. For example “[^a-zA-Z0-9]+” for anything not alphanumeric or “[ _|-]+” for common ID separators (the hyphen is placed last so that it is read literally rather than as a range).
  • mincount (int) – discard tokens with count less than mincount
  • minlength (int) – discard tokens with string length less than minlength
Returns:

count of tokens

Return type:

dataframe
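
A minimal sketch of regex-based tokenization, again assuming plain Python lists and a dict result rather than pandas objects. The helpers split_tokens and splitchar_token_count are hypothetical. The sketch also shows why the position of the hyphen inside a character class matters: written between a space and an underscore it denotes a range that covers digits and uppercase letters.

```python
import re
from collections import Counter


def split_tokens(text, splitchars="[^a-zA-Z0-9]+"):
    """Split text on the regex and drop empty tokens."""
    return [t for t in re.split(splitchars, text) if t]


def splitchar_token_count(values, splitchars="[^a-zA-Z0-9]+", mincount=2, minlength=1):
    """Count tokens across all values, applying the two filters."""
    counts = Counter()
    for text in values:
        counts.update(t for t in split_tokens(text, splitchars) if len(t) >= minlength)
    return {token: n for token, n in counts.items() if n >= mincount}


if __name__ == "__main__":
    names = ["ACME Corp", "ACME Ltd", "Beta Corp"]
    print(splitchar_token_count(names))

    # " -_" inside brackets is a range from space to underscore, which
    # includes digits and uppercase letters, so they are consumed as separators.
    print(split_tokens("AB-12_cd", "[ -_|]+"))
    # With the hyphen last it is a literal hyphen.
    print(split_tokens("AB-12_cd", "[ _|-]+"))
```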

d6tjoin.utils.tokenCount(dfs, fun, mincount=2, minlength=1)

Tokenize a series of strings and count the occurrences of each string token.

Parameters:
  • dfs (pd.series) – pd.series of values
  • fun (function) – tokenize function
  • mincount (int) – discard tokens with count less than mincount
  • minlength (int) – discard tokens with string length less than minlength
Returns:

count of tokens

Return type:

dataframe

d6tjoin.utils.typeDataFrame(df)

Find the type of each column of a pandas dataframe

Parameters: df (pd.dataframe) – pandas dataframe
Returns: column, type
Return type: dict
d6tjoin.utils.typeSeries(dfs)

Find the type of a pandas series

Parameters: dfs (pd.series) – pd.series of values
Returns: type
Return type: str
d6tjoin.utils.unique_contains(dfs, strlist)

Find values which contain a set of substrings

Parameters:
  • dfs (pd.series) – pd.series of values
  • strlist (list) – substrings to find
Returns:

unique values which contain substring

Return type:

list
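
The description leaves open whether a value must contain any or all of the substrings. The sketch below assumes "any" and works on a plain Python list; unique_values_containing is a hypothetical helper, not the library function.

```python
def unique_values_containing(values, substrings):
    """Return unique values, in first-seen order, that contain at least
    one of the given substrings (assumption: 'any', not 'all')."""
    seen = set()
    result = []
    for value in values:
        if value in seen:
            continue
        if any(sub in value for sub in substrings):
            seen.add(value)
            result.append(value)
    return result


if __name__ == "__main__":
    names = ["ACME Corp", "ACME Corp", "Beta Ltd", "Gamma Corp"]
    print(unique_values_containing(names, ["Corp"]))
```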

Module contents