How ripeds uses filters

When ipeds_filter() is used in a call chain, ripeds distinguishes between filters that use variables from wide data files and filters that use variables from long (or narrow) data files.

Wide

In wide data, each row represents a unique institution.

unitid  instnm
100654  Alabama A & M University
100663  University of Alabama at Birmingham

Long

In long (or narrow) data, unique institutions (here represented by their unitid) may have more than one observation (row) in the data set. In this case, each institution has one observation for each of five categories of efdelev.

unitid  efdelev  efdetot
100654        1     6007
100654        2     5206
100654        3     5196
100654       11       10
100654       12      801
100663        1    21639
100663        2    13032
100663        3    12776
100663       11      256
100663       12     8607
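
The wide/long distinction can be checked directly by testing whether unitid repeats. The following is a minimal base R sketch using the two example tables above; the helper is_wide() is a hypothetical name introduced for illustration and is not part of ripeds.

```r
# Toy data copied from the two example tables above
wide <- data.frame(
  unitid = c(100654, 100663),
  instnm = c("Alabama A & M University",
             "University of Alabama at Birmingham")
)
long <- data.frame(
  unitid  = rep(c(100654, 100663), each = 5),
  efdelev = rep(c(1, 2, 3, 11, 12), times = 2),
  efdetot = c(6007, 5206, 5196, 10, 801,
              21639, 13032, 12776, 256, 8607)
)

# Hypothetical helper: TRUE when no unitid appears more than once
is_wide <- function(df) !anyDuplicated(df$unitid)

is_wide(wide)       # TRUE: one row per institution
is_wide(long)       # FALSE: several rows per institution
table(long$unitid)  # five rows for each unitid
```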

In general, ripeds will apply filters as follows:

  • wide: filter(s) applied in a preprocessing step before the primary data selection
  • long: filter(s) applied in a postprocessing step after the primary data selection

Preprocessing filter

Preprocessing filters require that a single data.frame containing all filtering variables be built prior to the main data request. Once it is constructed, all filters are applied and a single data.frame with two columns, UNITID and year, is returned. This data.frame is then used in the primary data selection stage to keep only those institution/year pairs returned by the filter.
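
The steps above can be sketched in base R. This is only an illustration of the idea, not the package's internal code: the data.frame names, the control and value columns, and all values below are made up for the example.

```r
# Hypothetical filtering data.frame: one row per institution/year,
# holding every variable used by the wide filters (values are made up)
filter_vars <- data.frame(
  UNITID  = c(100654, 100663, 100654, 100663),
  year    = c(2016, 2016, 2017, 2017),
  control = c(1, 1, 1, 2)
)

# Step 1: apply the filter(s) and keep only the two key columns
keep <- unique(filter_vars[filter_vars$control == 1, c("UNITID", "year")])

# Step 2: hypothetical output of the primary data selection (values made up)
primary <- data.frame(
  UNITID = c(100654, 100663, 100654, 100663),
  year   = c(2016, 2016, 2017, 2017),
  value  = c(10, 20, 30, 40)
)

# Step 3: keep only the institution/year pairs returned by the filter
result <- merge(primary, keep, by = c("UNITID", "year"))
result  # three rows; the pair (100663, 2017) has been dropped
```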

Postprocessing filter

Because long files are not unique by UNITID and year (see example above), the filtering data.frame produced by a preprocessing filter cannot adequately handle a filter like efdelev == 3: it can keep or drop an institution/year pair, but it cannot select individual rows within that pair.

A postprocessing filter is applied in the case of variables sourced from long data files. After the preprocessing filter and primary data call, any filter containing a variable from a long data file will be applied to the output requested by the user.
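
A short base R sketch, using the long example table from earlier in this article, shows why a key of institutions is not enough for efdelev == 3 and why the filter has to run on the rows of the output instead. The object names are chosen for illustration only.

```r
# Long example data from the table earlier in this article
long <- data.frame(
  unitid  = rep(c(100654, 100663), each = 5),
  efdelev = rep(c(1, 2, 3, 11, 12), times = 2),
  efdetot = c(6007, 5206, 5196, 10, 801,
              21639, 13032, 12776, 256, 8607)
)

# Preprocessing-style key: it can only say which institutions to keep
key <- unique(long[long$efdelev == 3, "unitid", drop = FALSE])
via_key <- merge(long, key, by = "unitid")
nrow(via_key)   # 10: every row of both institutions survives

# Postprocessing-style filter: applied to the rows of the output itself
via_rows <- long[long$efdelev == 3, ]
nrow(via_rows)  # 2: only the efdelev == 3 rows remain
```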

The result will depend on the complexity of the filter(s) and the type of output chosen by the user:

  • [join = TRUE (default)] Filter(s) will be applied to the single joined data.frame, which should always be successful assuming a properly formatted filter
  • [join = FALSE, bind = TRUE|FALSE] Filter(s) will be applied as applicable to the individual data.frames contained in the output list. This means that if a filter contains a variable found in a particular long data file, there will be an attempt to apply the filter to that data file. This may not be successful.

A postprocessing filter will fail when a complex filter containing variables across multiple data files is requested alongside join = FALSE. Since the complex filter is applied to only one data file or data file type (e.g., HD*) at a time, the filter will not be able to find some of the variables. In this situation, an error message alerts the user and the unfiltered data.frame is returned.

When this error message is returned, the user should either set join = TRUE (the default), use a less complex filter, or filter the results in a separate process.
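
The failure mode can be imitated in base R. The sketch below is an assumption-laden illustration, not the package's implementation: apply_filter() is a hypothetical helper, the list elements hd and ef stand in for unjoined output, and the control values are made up.

```r
# Hypothetical helper: apply a filter only if the data.frame has every
# variable the filter names; otherwise report and return it unfiltered
apply_filter <- function(df, expr) {
  missing_vars <- setdiff(all.vars(expr), names(df))
  if (length(missing_vars) > 0) {
    message("filter not applied; missing: ",
            paste(missing_vars, collapse = ", "))
    return(df)
  }
  df[eval(expr, df), , drop = FALSE]
}

# Stand-in for unjoined output: one wide and one long data.frame
out <- list(
  hd = data.frame(unitid = c(100654, 100663), control = c(1, 2)),
  ef = data.frame(
    unitid  = rep(c(100654, 100663), each = 5),
    efdelev = rep(c(1, 2, 3, 11, 12), times = 2)
  )
)

# A complex filter that needs variables from both files
complex_filter <- quote(efdelev == 3 & control == 1)

# Unjoined: neither data.frame has both variables, so both come back as-is
filtered_list <- lapply(out, apply_filter, expr = complex_filter)

# Joined: all variables sit in one data.frame, so the filter succeeds
joined <- merge(out$hd, out$ef, by = "unitid")
filtered_joined <- apply_filter(joined, complex_filter)
```

Joining first, or filtering each data.frame separately with a simpler filter, avoids the missing-variable problem in this sketch.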

Considerations

When applying filters using ipeds_filter(), a user should keep in mind the trade-off between complex filters with large data requests, memory / time required to fulfill the request, and the potential for errors.

As a filter or set of filters becomes more complex, the data.frame needed for the preprocessing step will grow in size. When long data files are used, so too will the time and/or memory needed for the postprocessing steps. And in cases where the user chooses not to join the final output, there is an increased chance of error or unexpected behavior. These issues will only grow as more variables and data years are requested.

In situations requiring complex filtering behavior alongside large data requests, the user may benefit from breaking the request into smaller chunks, or using fewer filters, and then wrangling the results with other R tools (such as those in the Tidyverse). In all cases, the user should confirm that the output matches expectations.
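
One way to chunk a request is by year: fetch each year separately, filter it, and bind the pieces afterward. The sketch below uses base R only; get_chunk() is a hypothetical stand-in for a single-year data request, and its values are placeholders.

```r
# Hypothetical stand-in for a single-year data request (placeholder values)
get_chunk <- function(year) {
  data.frame(
    unitid  = c(100654, 100654),
    year    = year,
    efdelev = c(3, 12),
    efdetot = c(1, 2)
  )
}

years <- 2015:2017

# Fetch and filter one year at a time so each piece stays small
chunks <- lapply(years, function(y) {
  df <- get_chunk(y)
  df[df$efdelev == 3, , drop = FALSE]
})

# Combine the filtered pieces into one data.frame
result <- do.call(rbind, chunks)

# Confirm the output matches expectations before using it
stopifnot(all(result$efdelev == 3))
stopifnot(setequal(result$year, years))
```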