macpie.pandas.date_proximity

macpie.pandas.date_proximity(left, right, id_on=None, id_left_on=None, id_right_on=None, date_on=None, date_left_on=None, date_right_on=None, get: str = 'all', when: str = 'earlier_or_later', days: int = 90, left_link_id=None, dropna: bool = False, drop_duplicates: bool = False, duplicates_indicator: bool = False, merge='partial', merge_suffixes=('_x', '_y'), prepend_levels=(None, None)) → DataFrame

Links data across two pandas.DataFrame objects by date proximity.

Specifically, a “left” DataFrame contains a timepoint anchor, and a “right” DataFrame is linked to it by retrieving all rows that match on a specified id column and whose specified date fields fall within a given number of days of each other.
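The linking idea described above can be sketched in plain pandas. This is only an illustration of the concept, not macpie's implementation; the column names pidn, dcdate, and score and the helper column diff_days are made up for the example.

```python
import pandas as pd

# Hypothetical data: "pidn" (id), "dcdate" (date) and "score" are made-up names.
left = pd.DataFrame({
    "pidn": [1, 2],
    "dcdate": pd.to_datetime(["2020-01-10", "2020-06-01"]),
})
right = pd.DataFrame({
    "pidn": [1, 1, 2],
    "dcdate": pd.to_datetime(["2020-01-20", "2021-01-01", "2020-05-15"]),
    "score": [10, 20, 30],
})

# Pair every left row with every right row that shares the same id.
pairs = left.merge(right, on="pidn", suffixes=("_x", "_y"))

# Signed distance in days from the left anchor date to the right date.
pairs["diff_days"] = (pairs["dcdate_y"] - pairs["dcdate_x"]).dt.days

# Keep only pairs whose dates are within 90 days of each other.
linked = pairs[pairs["diff_days"].abs() <= 90].reset_index(drop=True)
```

Here the right row dated 2021-01-01 is dropped because it lies almost a year after its anchor, while the other two rows stay linked.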

Parameters:
left: DataFrame

Contains the timepoint anchor

right: DataFrame

To be linked to left

id_on: str

Primary column to join on. It must be found in both DataFrames.

id_left_on: str

Primary column to join on in the left DataFrame

id_right_on: str

Primary column to join on in the right DataFrame

date_on: str

Date column to use for timepoint matching. It must be found in both DataFrames, and the one on the left acts as the timepoint anchor.

date_left_on: str

Date column in left DataFrame to act as timepoint anchor.

date_right_on: str

Date column in right DataFrame to compare with left’s timepoint anchor

get: {‘all’, ‘closest’}, default ‘all’

Indicates which rows of the right DataFrame to link in reference to the timepoint anchor:

  • all: keep all rows

  • closest: get only the row closest to the timepoint anchor, provided it falls within the number of days set by the days parameter
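The difference between the two options can be sketched with pandas on a table of candidate pairs. This is an assumption-laden illustration rather than macpie's code: anchor_id, right_row, and diff_days are hypothetical columns, and how macpie breaks ties between equally close rows is not specified here.

```python
import pandas as pd

# Hypothetical candidate pairs: one row per (anchor, right row) combination,
# where diff_days is the right date minus the anchor date, in days.
pairs = pd.DataFrame({
    "anchor_id": [1, 1, 1, 2],
    "right_row": ["a", "b", "c", "d"],
    "diff_days": [-40, 5, 200, -10],
})
days = 90

# Candidates inside the window, regardless of direction.
within = pairs[pairs["diff_days"].abs() <= days]

# Analogue of get='all': every candidate inside the window.
all_rows = within

# Analogue of get='closest': per anchor, the candidate with the smallest
# absolute distance (idxmin returns the first such row if there is a tie).
closest_idx = within["diff_days"].abs().groupby(within["anchor_id"]).idxmin()
closest = within.loc[closest_idx]
```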

when: {‘earlier’, ‘later’, ‘earlier_or_later’}, default ‘earlier_or_later’

Indicates which rows of the right DataFrame to link in temporal relation to the timepoint anchor:

  • earlier: get only rows that are earlier than the timepoint anchor

  • later: get only rows that are later (more recent) than the timepoint anchor

  • earlier_or_later: get rows that are earlier or later than the timepoint anchor
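As a rough sketch, the three options can be thought of as masks over the signed day difference (right date minus anchor date). The helper window_mask below is hypothetical, and treating a same-day row as matching both earlier and later is an assumption made for this example, not documented macpie behavior.

```python
import pandas as pd

def window_mask(diff_days: pd.Series, when: str, days: int) -> pd.Series:
    """Boolean mask of candidates allowed by `when` and `days` (illustrative only)."""
    in_range = diff_days.abs() <= days
    if when == "earlier":
        # Assumption: a same-day row (difference of 0) counts as earlier.
        return in_range & (diff_days <= 0)
    if when == "later":
        # Assumption: a same-day row (difference of 0) counts as later.
        return in_range & (diff_days >= 0)
    if when == "earlier_or_later":
        return in_range
    raise ValueError(f"unknown when: {when!r}")

# Signed differences in days for four hypothetical candidate rows.
diff_days = pd.Series([-30, 45, 120, -100])

earlier = window_mask(diff_days, "earlier", 90)
later = window_mask(diff_days, "later", 90)
either = window_mask(diff_days, "earlier_or_later", 90)
```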

days: int, default 90

Maximum number of days allowed between the timepoint anchor and the date in the right DataFrame

left_link_id: str, optional

The id column in the left DataFrame to act as the primary key of that data. This helps to ensure there are no duplicates in the left DataFrame (i.e. rows with the same id_left_on and date_left_on)

dropna: bool, default False

Whether to exclude rows that did not find any match

drop_duplicates: bool, default False

If True and more than one matching row is found in the right DataFrame, all but the last one are dropped.

duplicates_indicator: bool or str, default False

If True, adds a column to the output DataFrame called “_mp_duplicates” denoting which rows are duplicates. The column can be given a different name by providing a string argument.
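The two duplicate-handling options can be approximated with standard pandas calls. The sketch below assumes that a "duplicate" means more than one right row linked to the same anchor; anchor_id and right_row are hypothetical columns, and macpie's exact definition may differ.

```python
import pandas as pd

# Hypothetical linked result: anchor 1 matched two right rows, anchor 2 matched one.
linked = pd.DataFrame({
    "anchor_id": [1, 1, 2],
    "right_row": ["a", "b", "c"],
})

# Analogue of duplicates_indicator=True: flag every row whose anchor
# matched more than once.
linked["_mp_duplicates"] = linked.duplicated(subset="anchor_id", keep=False)

# Analogue of drop_duplicates=True: keep only the last match per anchor.
deduped = linked.drop_duplicates(subset="anchor_id", keep="last")
```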

merge: {‘partial’, ‘full’}, default ‘partial’

Indicates which columns to include in result

  • partial: include only columns from the right DataFrame

  • full: include all columns from both left and right DataFrames

merge_suffixes: list-like, default is (“_x”, “_y”)

A length-2 sequence where the first element is the suffix to add to the left DataFrame columns, and the second element is the suffix to add to the right DataFrame columns.

prepend_levels: list-like, default is (None, None)

A length-2 sequence where each element is optionally a string naming a top-level index to add to the column indexes of left and right respectively (thus creating a pandas.MultiIndex if needed). Pass None instead of a string to indicate that the column index in left or right should be left as-is. At least one of the values must not be None.
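To show what prepending a top-level column index looks like, the sketch below uses pandas.concat with a dict key. It only demonstrates the resulting pandas.MultiIndex shape; the labels "visit" and "test" and the column names are made up, and this is not necessarily how macpie builds the levels internally.

```python
import pandas as pd

# Hypothetical frames with made-up column names.
left = pd.DataFrame({"pidn": [1], "dcdate": ["2020-01-10"]})
right = pd.DataFrame({"pidn": [1], "score": [10]})

# Wrapping each frame in a dict key adds that key as a top-level column label.
left_lvl = pd.concat({"visit": left}, axis=1)
right_lvl = pd.concat({"test": right}, axis=1)

# Side by side, the identically named "pidn" columns stay distinguishable
# through their top-level labels.
combined = pd.concat([left_lvl, right_lvl], axis=1)
```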

Returns:
DataFrame

A DataFrame of the two linked objects.