d6tjoin package

Submodules

d6tjoin.top1 module

class d6tjoin.top1.MergeTop1(df1, df2, fuzzy_left_on=None, fuzzy_right_on=None, exact_left_on=None, exact_right_on=None, fun_diff=None, top_limit=None, is_keep_debug=False, use_multicore=True)

Bases: object

Left best-match join. For each join key in the left dataframe, it applies a difference function and selects the key in the right dataframe with the smallest difference.

Parameters:

df1 (dataframe) – left dataframe onto which the right dataframe is joined
df2 (dataframe) – right dataframe
fuzzy_left_on (list) – join keys for similarity match, left dataframe
fuzzy_right_on (list) – join keys for similarity match, right dataframe
exact_left_on (list, default None) – join keys for exact match, left dataframe
exact_right_on (list, default None) – join keys for exact match, right dataframe
fun_diff (list, default None) – list of difference functions to be applied for each fuzzy key
top_limit (list, default None) – list of values to cap similarity matches
is_keep_debug (bool) – keep diagnostic columns, useful for debugging

Note

fun_diff: the difference function used to find the best match, that is, the match with the minimum distance
- By default it is determined automatically, depending on whether the key is a string or a date/number
- Use None to keep the default for a key, for example [None, lambda x, y: x-y]
- Functions in the list are applied to the fuzzy join keys in the same order
- It must be a difference function, so lower is better. For similarity functions such as Jaccard, higher is better, so you need to adjust for that
top_limit: keeps only matches whose difference is below this value. For example, if two strings differ by 3 but top_limit is 2, that match is ignored
- For dates you can use pd.offsets.Day(1) or similar
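
The notes above can be illustrated with a minimal pure-Python sketch. It is not the d6tjoin implementation: jaccard_distance and top1_match are hypothetical helpers, and the sketch assumes that a match is dropped when its difference is greater than top_limit.

```python
def jaccard_distance(a, b):
    """1 - Jaccard similarity on character sets, so lower means more similar."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def top1_match(left_keys, right_keys, fun_diff, top_limit=None):
    """Hypothetical helper: map each left key to (best right key, difference)."""
    matches = {}
    for left in left_keys:
        # Score every candidate and keep the one with the smallest difference.
        best = min(right_keys, key=lambda right: fun_diff(left, right))
        diff = fun_diff(left, best)
        # Assumption: a match whose difference exceeds top_limit is ignored.
        if top_limit is not None and diff > top_limit:
            matches[left] = (None, None)
        else:
            matches[left] = (best, diff)
    return matches


left = ["AAPL", "MSFT", "XYZ"]
right = ["AAPL US", "MSFT US", "GOOG US"]
result = top1_match(left, right, jaccard_distance, top_limit=0.7)
print(result)
```

Here "XYZ" shares no characters with any right key, so its best difference is 1.0 and the match is discarded by top_limit.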

merge()

Executes merge

Returns:	dict in which key ‘merged’ holds the merged dataframe and key ‘top1’ holds the best matches by fuzzy_left_on. See the example notebooks for details
Return type:	dict

class d6tjoin.top1.MergeTop1Diff(df1, df2, fuzzy_left_on, fuzzy_right_on, fun_diff=None, exact_left_on=None, exact_right_on=None, top_limit=None, topn=1, fun_preapply=None, fun_postapply=None, is_keep_debug=False, use_multicore=True)

Bases: object

Top1 minimum difference join. Mostly used for strings. Helper for MergeTop1.

merge()

top1_diff()

class d6tjoin.top1.MergeTop1Number(df1, df2, fuzzy_left_on, fuzzy_right_on, exact_left_on=None, exact_right_on=None, direction='nearest', top_limit=None, is_keep_debug=False)

Bases: object

Top1 minimum difference join for numbers. Helper for MergeTop1.

merge()

top1_diff()
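
As a rough illustration of a nearest-value join for numbers, the sketch below uses the standard bisect module. It is not the MergeTop1Number code: nearest_match is a hypothetical helper, right_values is assumed to be non-empty, and top_limit is assumed to discard matches whose absolute difference is greater than the limit.

```python
import bisect


def nearest_match(left_values, right_values, top_limit=None):
    """Hypothetical helper: map each left value to the closest right value."""
    candidates = sorted(right_values)
    matches = {}
    for value in left_values:
        pos = bisect.bisect_left(candidates, value)
        # Only the neighbours on either side of the insertion point can be nearest.
        neighbours = candidates[max(pos - 1, 0):pos + 1]
        best = min(neighbours, key=lambda c: abs(c - value))
        # Assumption: matches further away than top_limit are ignored.
        if top_limit is not None and abs(best - value) > top_limit:
            best = None
        matches[value] = best
    return matches


result = nearest_match([10.2, 19.6, 50.0], [10, 20, 30], top_limit=5)
print(result)
```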

d6tjoin.utils module

d6tjoin.utils.ncharTokenCount(dfs, nchars=None, overlapping=False, mincount=2, minlength=1)

Tokenize a series of strings by splitting each string into tokens of nchars length, then count the occurrences of each token in the series.

Parameters:

dfs (pd.series) – pd.series of values
nchars (int) – number of characters in each token
overlapping (bool) – make overlapping tokens
mincount (int) – discard tokens with count less than mincount
minlength (int) – discard tokens with string length less than minlength

Returns:	count of tokens
Return type:	dataframe
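
A minimal sketch of the idea in plain Python, using collections.Counter instead of pandas. The helper names are hypothetical, the result is a dict rather than a dataframe, and the treatment of overlapping windows is an assumption, not the library's code.

```python
from collections import Counter


def nchar_tokens(text, nchars, overlapping=False):
    """Split text into tokens of nchars characters, optionally overlapping."""
    step = 1 if overlapping else nchars
    # Overlapping windows stop once the window would run past the end of the string.
    last_start = len(text) - nchars + 1 if overlapping else len(text)
    return [text[i:i + nchars] for i in range(0, max(last_start, 1), step)]


def nchar_token_count(values, nchars, overlapping=False, mincount=2, minlength=1):
    """Count tokens across all values, then apply the mincount and minlength filters."""
    counts = Counter()
    for value in values:
        counts.update(nchar_tokens(value, nchars, overlapping))
    return {tok: n for tok, n in counts.items()
            if n >= mincount and len(tok) >= minlength}


ids = ["AB12", "AB34", "CD12"]
counts = nchar_token_count(ids, nchars=2)
print(counts)
```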

d6tjoin.utils.splitcharTokenCount(dfs, splitchars='[^a-zA-Z0-9]+', mincount=2, minlength=1)

Tokenize a series of strings by splitting each string on a set of characters, then count the occurrences of each token in the series.

Parameters:

dfs (pd.series) – pd.series of values
splitchars (str) – regex by which to split string into tokens. For example “[^a-zA-Z0-9]+” for anything not alpha-numeric or “[ -_\|]+” for common ID tokens.
mincount (int) – discard tokens with count less than mincount
minlength (int) – discard tokens with string length less than minlength

Returns:	count of tokens
Return type:	dataframe
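
The same counting can be sketched with a regex split from the standard re module. splitchar_token_count is a hypothetical stand-in that returns a dict, not the library function itself.

```python
import re
from collections import Counter


def splitchar_token_count(values, splitchars="[^a-zA-Z0-9]+", mincount=2, minlength=1):
    """Split each value on the regex, count tokens, then filter by count and length."""
    pattern = re.compile(splitchars)
    counts = Counter()
    for value in values:
        counts.update(pattern.split(value))
    # minlength=1 also drops the empty tokens produced by leading or trailing separators.
    return {tok: n for tok, n in counts.items()
            if n >= mincount and len(tok) >= minlength}


names = ["ACME-US_2020", "ACME-UK_2020", "BETA-US"]
counts = splitchar_token_count(names)
print(counts)
```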

d6tjoin.utils.tokenCount(dfs, fun, mincount=2, minlength=1)

Tokenize a series of strings and count the occurrences of each string token.

Parameters:

dfs (pd.series) – pd.series of values
fun (function) – tokenize function
mincount (int) – discard tokens with count less than mincount
minlength (int) – discard tokens with string length less than minlength

Returns:	count of tokens
Return type:	dataframe

d6tjoin.utils.typeDataFrame(df)

Find the type of each column of a pandas dataframe

Parameters:	df (pd.dataframe) – pandas dataframe
Returns:	column, type
Return type:	dict

d6tjoin.utils.typeSeries(dfs)

Find the type of a pandas series

Parameters:	dfs (pd.series) – pd.series of values
Returns:	type
Return type:	str

d6tjoin.utils.unique_contains(dfs, strlist)

Find values which contain a set of substrings

Parameters:

dfs (pd.series) – pd.series of values
strlist (list) – substrings to find

Returns:	unique values which contain substring
Return type:	list
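
A plain-Python sketch of the lookup. unique_containing is a hypothetical helper, and it assumes that a value qualifies when it contains any one of the substrings; the library may behave differently.

```python
def unique_containing(values, substrings):
    """Return unique values, in first-seen order, that contain any of the substrings."""
    seen = set()
    matches = []
    for value in values:
        # Assumption: matching any single substring is enough.
        if value not in seen and any(sub in value for sub in substrings):
            matches.append(value)
        seen.add(value)
    return matches


values = ["ACME US", "ACME UK", "BETA US", "ACME US"]
result = unique_containing(values, ["UK", "BETA"])
print(result)
```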
