d6tjoin package
Submodules
d6tjoin.top1 module
class d6tjoin.top1.MergeTop1(df1, df2, fuzzy_left_on=None, fuzzy_right_on=None, exact_left_on=None, exact_right_on=None, fun_diff=None, top_limit=None, is_keep_debug=False, use_multicore=True)
Bases: object
Left best match join. For each join key in the left dataframe, it applies a difference function to find the key in the right dataframe with the smallest difference.
Parameters:
- df1 (dataframe) – left dataframe onto which the right dataframe is joined
- df2 (dataframe) – right dataframe
- fuzzy_left_on (list) – join keys for similarity match, left dataframe
- fuzzy_right_on (list) – join keys for similarity match, right dataframe
- exact_left_on (list, default None) – join keys for exact match, left dataframe
- exact_right_on (list, default None) – join keys for exact match, right dataframe
- fun_diff (list, default None) – list of difference functions to be applied for each fuzzy key
- top_limit (list, default None) – list of values to cap similarity matches
- is_keep_debug (bool) – keep diagnostics columns, good for debugging
Note
- fun_diff: list of difference functions used to find the best match, i.e. the one with the minimum distance
  - By default, the function is chosen automatically depending on whether the key is a string or a date/number
  - Use None to keep the default for a key, for example [None, lambda x, y: x-y]
  - Functions in the list are applied to the fuzzy join keys in the same order
  - Each must be a difference function, so lower is better. For similarity measures like Jaccard, where higher is better, you need to adjust for that
- top_limit: caps the difference allowed for a match. For example, if two strings differ by 3 but top_limit is 2, that match is ignored
  - For dates you can use pd.offsets.Day(1) or similar
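The note on Jaccard can be made concrete with a small sketch. It is not part of d6tjoin: `jaccard_similarity` and `jaccard_distance` are hypothetical helper names, and the similarity is computed on character sets purely for illustration. The point is only that a "higher is better" score is flipped into a "lower is better" difference before it is placed in the `fun_diff` list.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between the character sets of two strings (higher is better)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def jaccard_distance(a, b):
    """Flip the similarity so that lower is better, as a difference function requires."""
    return 1.0 - jaccard_similarity(a, b)


# One entry per fuzzy key, in key order: None keeps the default for the
# first key, the custom distance is used for the second key.
fun_diff = [None, jaccard_distance]

same = jaccard_distance("abc", "abc")   # identical strings
close = jaccard_distance("abc", "abd")  # two of four distinct characters shared
far = jaccard_distance("abc", "xyz")    # no characters shared
```

With this adjustment, identical strings get the smallest difference and unrelated strings the largest, which is the ordering a minimum-distance match needs.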
class d6tjoin.top1.MergeTop1Diff(df1, df2, fuzzy_left_on, fuzzy_right_on, fun_diff=None, exact_left_on=None, exact_right_on=None, top_limit=None, topn=1, fun_preapply=None, fun_postapply=None, is_keep_debug=False, use_multicore=True)
Bases: object
Top1 minimum difference join. Mostly used for strings. Helper for MergeTop1.
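The idea of a top-1 minimum difference match can be sketched in a few lines of pandas. This is an illustration of the concept only, not the class's implementation: `top1_min_diff` is a hypothetical function, the output column names are made up, and ties are resolved by taking the first candidate, which is an assumption.

```python
import pandas as pd


def top1_min_diff(left_keys, right_keys, fun_diff):
    """For each unique left key, pick the right key with the smallest difference."""
    # dict.fromkeys removes duplicates while preserving order
    left_unique = list(dict.fromkeys(left_keys))
    right_unique = list(dict.fromkeys(right_keys))
    rows = []
    for left_key in left_unique:
        # min() returns the first candidate on ties (assumption for this sketch)
        best = min(right_unique, key=lambda right_key: fun_diff(left_key, right_key))
        rows.append({"left": left_key, "right": best, "diff": fun_diff(left_key, best)})
    return pd.DataFrame(rows)


matches = top1_min_diff([1, 5, 10], [2, 9, 4], lambda x, y: abs(x - y))
```

The resulting table of key pairs could then be used as a lookup to join the two dataframes exactly.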
d6tjoin.utils module
d6tjoin.utils.ncharTokenCount(dfs, nchars=None, overlapping=False, mincount=2, minlength=1)
Tokenize a series of strings by splitting them into tokens of nchars length, then count the occurrences of each token in the series.
Parameters:
- dfs (pd.Series) – series of values
- nchars (int) – number of characters in each token
- overlapping (bool) – make overlapping tokens
- mincount (int) – discard tokens with count less than mincount
- minlength (int) – discard tokens with string length less than minlength
Returns: count of tokens
Return type: dataframe
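To illustrate what fixed-length tokenizing and counting looks like, here is a minimal standalone sketch. It does not call d6tjoin; `nchar_tokens` and `count_nchar_tokens` are hypothetical names, and details such as keeping a shorter trailing token in the non-overlapping case and the output column names are assumptions.

```python
from collections import Counter

import pandas as pd


def nchar_tokens(text, nchars, overlapping=False):
    """Split text into tokens of nchars characters."""
    step = 1 if overlapping else nchars
    # Overlapping tokens are always full length; non-overlapping ones may
    # end with a shorter trailing token (assumption for this sketch).
    last_start = len(text) - nchars if overlapping else len(text) - 1
    return [text[i:i + nchars] for i in range(0, last_start + 1, step)]


def count_nchar_tokens(values, nchars, overlapping=False, mincount=2, minlength=1):
    """Count tokens across all values, dropping rare or short tokens."""
    counts = Counter()
    for value in values:
        counts.update(nchar_tokens(str(value), nchars, overlapping))
    rows = [(token, count) for token, count in counts.items()
            if count >= mincount and len(token) >= minlength]
    result = pd.DataFrame(rows, columns=["token", "count"])
    return result.sort_values("count", ascending=False).reset_index(drop=True)


token_counts = count_nchar_tokens(pd.Series(["abab", "abcd"]), nchars=2)
```

In this example "ab" occurs three times and "cd" once, so only "ab" survives the default mincount of 2.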
d6tjoin.utils.splitcharTokenCount(dfs, splitchars='[^a-zA-Z0-9]+', mincount=2, minlength=1)
Tokenize a series of strings by splitting them on a set of characters, then count the occurrences of each token in the series.
Parameters:
- dfs (pd.Series) – series of values
- splitchars (str) – regex by which to split each string into tokens. For example "[^a-zA-Z0-9]+" for anything not alphanumeric, or "[ _|-]+" for common ID separators (the hyphen goes last in the character class so it is not read as a range).
- mincount (int) – discard tokens with count less than mincount
- minlength (int) – discard tokens with string length less than minlength
Returns: count of tokens
Return type: dataframe
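A comparable sketch for regex-based splitting, again independent of d6tjoin: `count_split_tokens` is a hypothetical name, and returning a plain dict rather than a dataframe is a simplification for this example.

```python
import re
from collections import Counter


def count_split_tokens(values, splitchars="[^a-zA-Z0-9]+", mincount=2, minlength=1):
    """Split each value on the regex and count the resulting tokens."""
    pattern = re.compile(splitchars)
    counts = Counter()
    for value in values:
        # re.split can yield empty strings at the edges; skip them
        counts.update(token for token in pattern.split(str(value)) if token)
    return {token: count for token, count in counts.items()
            if count >= mincount and len(token) >= minlength}


id_token_counts = count_split_tokens(["ACME-Corp 01", "ACME_Inc|01"])
```

Here "ACME" and "01" each appear in both strings, while "Corp" and "Inc" appear once and are dropped by the default mincount of 2.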
d6tjoin.utils.tokenCount(dfs, fun, mincount=2, minlength=1)
Tokenize a series of strings and count the occurrences of each string token.
Parameters:
- dfs (pd.Series) – series of values
- fun (function) – tokenize function
- mincount (int) – discard tokens with count less than mincount
- minlength (int) – discard tokens with string length less than minlength
Returns: count of tokens
Return type: dataframe
d6tjoin.utils.typeDataFrame(df)
Find the type of each column of a pandas dataframe.
Parameters:
- df (pd.DataFrame) – pandas dataframe
Returns: column, type
Return type: dict
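A column-to-type mapping of this kind can be approximated with plain pandas. The sketch below is an assumption about the general shape of such a result, not the function's actual output: `column_types` is a hypothetical name, and it reports pandas dtype names as strings.

```python
import pandas as pd


def column_types(df):
    """Map each column name to the name of its pandas dtype."""
    return {column: str(dtype) for column, dtype in df.dtypes.items()}


example = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [1.5, 2.5]})
types = column_types(example)
```

The exact dtype names depend on the pandas version, so the example does not state them.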