macpie
command#
Commands for manipulating and analyzing data contained in Excel and/or CSV/text files.
Main Options#
These main options help macpie
know which columns in your files are the key columns
(such as columns containing your primary IDs and dates and/or any secondary IDs).
If they are not specified using these options, then defaults (as described below) will be used.
- -i <STRING>, --id-col=<STRING>#
Default=InstrID
. ID column header. The column header of the primary ID column. In general, this column contains the primary key/index (unique identifiers) of the dataset. In a research data management system, this is typically the ID of a specific data form or assessment.
- -d <STRING>, --date-col=<STRING>#
Default=DCDate
. Date column header. The column header of the primary Date column. In a research data management system, this is typically the date the form or assessment was completed or collected.
- -j <STRING>, --id2-col=<STRING>#
Default=PIDN
. ID2 column header. The column header of the primary ID2 column. In general, this column contains the secondary key/index of the dataset. In a research data management system, this is typically the ID of the patient, subject, or participant who completed the form or assessment.
- -v, --verbose#
Verbose messages. Output more details on what the executed command is doing or has done.
macpie keepone#
This command groups rows that have the same macpie --id2-col
value, and allows you to keep
only the earliest or latest row in each group as determined by the macpie --date-col
values
(discarding the other rows in the group).
Usage#
$ macpie keepone [OPTIONS] PRIMARY
Options#
- -k <STRING>, --keep=<STRING> (all|earliest|latest)#
Specify which rows of the
PRIMARY
file to keep.all
(default): keep all rowsearliest
: for each unique value in the column specified by themacpie --id2-col
option, keep only the earliest row (determined by the values in themacpie --date-col
column)latest
: for each unique value in the column specified by themacpie --id2-col
option, keep only the latest row (determined by the values in themacpie --date-col
column)
Arguments#
- PRIMARY#
Required. A list of filenames and/or directories.
Output#
The results of each dataset will be stored in a corresponding worksheet inside the results file.
Examples#
For each
PIDN
, keep only the earliest CDR record as determined by itsDCDate
.$ macpie keepone --keep=earliest cdr.csv
Equivalent command but using shorter single-dash option names for brevity:
$ macpie keepone -k earliest cdr.csv
For each
VID
(a column containing Visit IDs), keep the latest record as determined by itsVDate
(a column containing the Visit Dates) values.$ macpie --id2-col=VID --date-col=VDate keepone --keep=latest visits.csv
Equivalent command but using shorter single-dash option names for brevity:
$ macpie -j VID -d VDate keepone -k earliest visits.csv
API#
macpie keepone#
This command groups rows that have the same --id2-col
value, and
allows you to keep only the earliest or latest row in each group as
determined by the --date-col
values (discarding the other rows
in the group).
- primarypathlib.Path
A file path
macpie keepone [OPTIONS] [PRIMARY]...
Options
- -k, --keep <keep>#
- Options:
all | earliest | latest
Arguments
- PRIMARY#
Optional argument(s)
macpie link#
This command links data across multiple datasets using a specified timepoint anchor and time range.
Specifically, a single PRIMARY
dataset contains the timepoint anchor (the macpie --date-col
column).
Then, one or more SECONDARY
datasets is linked by retrieving all rows that match on the
PRIMARY
dataset’s macpie --id2-col
field and whose macpie --date-col
fields are within a certain
time range of each other.
Usage#
$ macpie link [OPTIONS] PRIMARY [SECONDARY]
Options#
- -k <STRING>, --primary-keep=<STRING> (all|earliest|latest)#
Specify which rows of the
PRIMARY
file to keep. These rows will serve as the timepoint anchor.all
(default): keep all rowsearliest
: for each group of uniquemacpie --id2-col
values, keep the earliest row, as determined by themacpie --date-col
valueslatest
: for each group of uniquemacpie --id2-col
value, keep the latest row, as determined by themacpie --date-col
values
- -g <STRING>, --secondary-get=<STRING> (all|closest)#
Specify which rows of the
[SECONDARY]
file(s) to get:all
(default): get all rows that are within--secondary-days
days of the correspondingPRIMARY
timepoint anchorclosest
: get only the closest row that is within--secondary-days
days of the correspondingPRIMARY
timepoint anchor
- -t <INTEGER>, --secondary-days=<INTEGER>#
Default=90
. Specify the time range measured in days.
- -w <STRING>, --secondary-when=<STRING> (earlier|later|earlier_or_later)#
Specify which rows of the
[SECONDARY]
file(s) to get:earlier
: get only rows that are earlier than the timepoint anchorlater
: get only rows that are later (more recent) than the timepoint anchorearlier_or_later
(default): get rows that are earlier or later (more recent) than the timepoint anchor
- -i <STRING>, --secondary-id-col=<STRING>#
Default =
macpie --id-col
value. Secondary ID column header. The column header of the secondary ID column, if different from the primary ID column.
- -d <STRING>, --secondary-date-col=<STRING>#
Default =
macpie --date-col
value. Secondary Date column header. The column header of the secondary date column, if different from the primary Date column.
- -j <STRING>, --secondary-id2-col=<STRING>#
Default =
macpie --id2-col
value. Secondary ID2 column header. The column header of the secondary ID2 column, if different from the primary ID2 column.
- --merge-results/--no-merge-results#
Default=--merge-results
. Whether the linked results should be merged into one dataset. Otherwise, the linked datasets will remain in their worksheets.
- --help#
Show a short summary of the usage and options.
Arguments#
- PRIMARY#
Required. Filename of the primary dataset. One and only one must be specified.
- SECONDARY#
Optional. Filenames of the secondary dataset(s), delimited by a space. An unlimited number of files can be specified.
Output#
In the results file, the primary dataset will have the suffix _anchor
, and every linked
secondary dataset will have the suffix _linked
.
IMPORTANT NOTE REGARDING DUPLICATES: Each secondary dataset result will have an extra column
_duplicates
indicating whether that row is part of a duplicate set of rows (i.e. True
if it is
a duplicate, False
otherwise); that row will also be highlighted yellow. It is up to you to
remove any duplicates and keep the single record you consider the most valid or most useful to your dataset.
Duplicates can occur if more than one secondary record was found satisfying the time range criteria. Some common conditions producing duplicates include:
A patient completes the same assessment on the same day but for two different projects; since there are two assessments completed on the same day, both are valid as being the closest to the primary timepoint anchor date.
If the criteria is to find all records within 90 days of the timepoint anchor, it is possible that a patient completed two or more assessments within 90 days of each other
If a patient cancels a visits and comes in for a visit a few days later, the data entry staffer may have forgotten to remove the assessment that were automatically created as part of the cancelled visit.
A good way to know whether you are ready to combine your datasets into a single dataset is if each secondary dataset has the same number of rows as the primary anchor.
Examples#
For every row in
cdr.csv
, find all the records infaq.csv
that match on thePIDN
field and whoseDCDate
is within 90 days of each other.$ macpie link cdr.csv faq.csv
This is equivalent to the two commands below, with the defaults specified instead of implied.
The first command uses the longer double-dash option names syntax for clairty.
$ macpie --id-col=InstrID --date-col=DCDate --id2-col=PIDN link --keep=all --secondary-get=all --secondary-days=90 --secondary-when=earlier_or_later cdr.csv faq.csv
The second command uses the shorter single-dash option names for brevity.
$ macpie -i InstrID -d DCDate -j PIDN link -k all -g all -d 90 -w earlier_or_later cdr.csv faq.csv
Similar to above but uses a combination of defaults and option specifications. In this example, we are finding the records in
faq.csv
that match on thePIDN
field and whoseDCDate
is the closest one within 60 days of and earlier than theDCDate
incdr.csv
$ macpie link -g closest -d 60 -w earlier cdr.csv faq.csv
API#
macpie link#
Link command
macpie link [OPTIONS] PRIMARY [SECONDARY]...
Options
- -k, --primary-keep <primary_keep>#
- Options:
all | earliest | latest
- -g, --secondary-get <secondary_get>#
- Options:
all | closest
- -t, --secondary-days <secondary_days>#
- -w, --secondary-when <secondary_when>#
- Options:
earlier | later | earlier_or_later
- -i, --secondary-id-col <secondary_id_col>#
Secondary ID Column Header
- -d, --secondary-date-col <secondary_date_col>#
Secondary Date Column Header
- -j, --secondary-id2-col <secondary_id2_col>#
Secondary ID2 Column Header
- --merge-results, --no-merge-results#
- --keep-original, --no-keep-original#
Arguments
- PRIMARY#
Required argument
- SECONDARY#
Optional argument(s)
macpie merge#
This command is a common follow-up to the link command, as it allows you to select specific fields across various datasets to merge together into one dataset (thereby removing unwanted fields, which can be many).
The output file of the link
command includes a worksheet named _available_fields
. This provides
a view of all the fields across all the datasets that you input into the link
command. By placing an "x"
next to a particular field, the merge
command will attempt to merge only those fields you marked into one single dataset.
The linking fields (i.e. id_col_name
, date_col_name
, and id2_col
of the primary argument in the link
command,
e.g. PIDN
, DCDate
, InstrID
) will always be included.
NOTE: The output file of this command can also be an input to this same command.
Usage#
$ macpie merge PRIMARY
Options#
- --help#
Show a short summary of the usage and options.
Arguments#
- PRIMARY#
Required. Filename of the results file created by the
link
command OR this command.
Output#
In the results file, all the merged fields will be in a single worksheet. Any dataset that was not merged (by choice or because there were duplicates), will remain in its own worksheet. If a dataset could not be merged because there were duplicates, you can remove the duplicates, save the file, and use this same command to attempt the merge again.
Examples#
After linking
cdr.csv
andfaq.csv
together, I decide only want the the following fields in my dataset:CDRTot
andBoxScore
fromcdr.csv
FAQTot
fromfaq.csv
So first, open the results file from the
link
command and navigate to the_available_fields
worksheet.Mark an
"x"
next to those fields.Save the file.
Run the following command:
$ macpie merge results_XXX.xlsx
API#
macpie merge#
macpie merge [OPTIONS] PRIMARY
Options
- --keep-original, --no-keep-original#
Arguments
- PRIMARY#
Required argument