macpie.pandas.date_proximity
- macpie.pandas.date_proximity(left, right, id_on=None, id_left_on=None, id_right_on=None, date_on=None, date_left_on=None, date_right_on=None, get: str = 'all', when: str = 'earlier_or_later', days: int = 90, left_link_id=None, dropna: bool = False, drop_duplicates: bool = False, duplicates_indicator: bool = False, merge='partial', merge_suffixes=('_x', '_y'), prepend_levels=(None, None)) → DataFrame
Links data across two pandas.DataFrame objects by date proximity.
Specifically, a “left” DataFrame contains a timepoint anchor, and a “right” DataFrame is linked to the left by retrieving all rows that match on a specified id column and whose specified date fields are within a certain time range of each other.
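To make the linking idea concrete, here is a minimal plain-pandas sketch of the behaviour described above: pair rows that share an id, then keep the pairs whose dates are within a given number of days. It is not macpie's implementation. The helper name link_within_days, the column names pidn, dcdate and score, and the sample data are all hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
left = pd.DataFrame({
    "pidn": [1, 2],
    "dcdate": pd.to_datetime(["2020-01-01", "2020-06-01"]),  # timepoint anchors
})
right = pd.DataFrame({
    "pidn": [1, 1, 2],
    "dcdate": pd.to_datetime(["2020-02-15", "2020-08-01", "2020-05-20"]),
    "score": [10, 20, 30],
})

def link_within_days(left, right, id_col, date_col, days=90):
    """Sketch only: pair rows sharing an id whose dates are within `days` of each other."""
    # Pair every left row with every right row that has the same id.
    merged = left.merge(right, on=id_col, suffixes=("_x", "_y"))
    # Signed gap in days: negative means the right row is earlier than the anchor.
    diff = (merged[date_col + "_y"] - merged[date_col + "_x"]).dt.days
    merged["diff_days"] = diff
    # Keep only the pairs that fall inside the time range.
    return merged[diff.abs() <= days].reset_index(drop=True)

linked = link_within_days(left, right, "pidn", "dcdate", days=90)
print(linked)
```

In this sketch the 2020-08-01 row for id 1 is dropped because it lies more than 90 days from that id's anchor, while the other two right-hand rows are kept.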
- Parameters:
- left: DataFrame
Contains the timepoint anchor
- right: DataFrame
To be linked to left
- id_on: str
Primary column to join on. This must be found in both DataFrames.
- id_left_on: str
Primary column to join on in the left DataFrame
- id_right_on: str
Primary column to join on in the right DataFrame
- date_on: str
Date column to use for timepoint matching. It must be found in both DataFrames, and the one on the left will act as the timepoint anchor.
- date_left_on: str
Date column in left DataFrame to act as timepoint anchor.
- date_right_on: str
Date column in right DataFrame to compare with left’s timepoint anchor
- get: {‘all’, ‘closest’}, default ‘all’
Indicates which rows of the right DataFrame to link in reference to the timepoint anchor:
all: keep all rows
closest: get only the closest row that falls within the range set by the days parameter
- when: {‘earlier’, ‘later’, ‘earlier_or_later’}, default ‘earlier_or_later’
Indicates which rows of the right DataFrame to link in temporal relation to the timepoint anchor
earlier: get only rows that are earlier than the timepoint anchor
later: get only rows that are later (more recent) than the timepoint anchor
earlier_or_later: get rows that are earlier or later than the timepoint anchor
- days: int, default 90
The time range measured in days
- left_link_id: str, optional
The id column in the left DataFrame to act as the primary key of that data. This helps to ensure there are no duplicates in the left DataFrame (i.e. rows with the same id_left_on and date_left_on).
- dropna: bool, default False
Whether to exclude rows that did not find any match
- drop_duplicates: bool, default False
If True, then if more than one row in the right DataFrame is found, all will be dropped except the last one.
- duplicates_indicator: bool or str, default False
If True, adds a column to the output DataFrame called “_mp_duplicates” denoting which rows are duplicates. The column can be given a different name by providing a string argument.
- merge: {‘partial’, ‘full’}, default ‘partial’
Indicates which columns to include in result
partial: include only columns from the right DataFrame
full: include all columns from both left and right DataFrames
- merge_suffixes: list-like, default (“_x”, “_y”)
A length-2 sequence where the first element is the suffix to add to the left DataFrame columns, and the second element is the suffix to add to the right DataFrame columns.
- prepend_levels: list-like, default (None, None)
A length-2 sequence where each element is optionally a string indicating a top-level index to add to the column indexes in left and right respectively (thus creating a pandas.MultiIndex if needed). Pass a value of None instead of a string to indicate that the column index in left or right should be left as-is. At least one of the values must not be None.
- Returns:
- DataFrame
A DataFrame of the two linked objects.
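The when and get options described above can be illustrated with a small plain-pandas sketch that works on a table of candidate pairs. It is not macpie's implementation. The helper name select, the column names anchor_id, right_row and diff_days, and the sample data are hypothetical. Two assumptions are made here that the documentation does not state: a same-day row (gap of 0) counts as both earlier and later, and ties for closest are broken by original row order.

```python
import pandas as pd

# Hypothetical candidate pairs: one row per (anchor row, right row) combination.
# diff_days is the right date minus the anchor date, already limited to +/- 90 days.
pairs = pd.DataFrame({
    "anchor_id": [1, 1, 1, 2],
    "right_row": ["a", "b", "c", "d"],
    "diff_days": [-40, -5, 10, 60],
})

def select(pairs, when="earlier_or_later", get="all"):
    """Sketch only: mimic the documented `when` and `get` options."""
    if when == "earlier":
        # Assumption: a gap of 0 days counts as earlier.
        pairs = pairs[pairs["diff_days"] <= 0]
    elif when == "later":
        # Assumption: a gap of 0 days counts as later.
        pairs = pairs[pairs["diff_days"] >= 0]
    if get == "closest":
        # Order by absolute gap, then keep the first (smallest gap) row per anchor.
        order = pairs["diff_days"].abs().sort_values(kind="stable").index
        pairs = pairs.loc[order].drop_duplicates("anchor_id", keep="first").sort_index()
    return pairs.reset_index(drop=True)

print(select(pairs, when="earlier"))
print(select(pairs, get="closest"))
print(select(pairs, when="later", get="closest"))
```

In this sketch an anchor with no qualifying row simply disappears from the result, which resembles dropna=True; with the documented default of dropna=False, unmatched rows would not be excluded.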