Skip to contents

IPEDS structure

IPEDS does not have a true API for accessing its data files. For users who want complete data files, there are two primary download options:

  1. Microsoft Access data bases
  2. Complete data files

Each version has benefits and drawbacks. Whereas Access databases helpfully provide all data tables for a given year, the files are large. They also come in a proprietary database format that may require the installation of additional software for some users. Finally, IPEDS Access databases only go back to the 2004-2005 data collection period.

On the other hand, the complete data files are smaller — zipped CSV versions of individual database tables. They also include all available data, which in some cases goes back to 1980. However, it can be difficult to know which files contain variables of interest without downloading multiple dictionary files. Once the correct files are identified, there are potentially many files to download.

ripeds simulates an API by smartly downloading, reading, and wrangling complete data files. For users who already have the complete data files they need on their computer, ripeds can skip downloading and instead use those files, saving bandwidth and the time necessary to download.

Using complete data files rather than Access databases allows for complete coverage of IPEDS data and doesn’t require users to install extra external database tools. To aid users not already familiar with the structure of IPEDS, ripeds also provides a searchable data dictionary that can be used to find and request variables of interest.

Design principles

ripeds is built with the following design principles, in general order of importance:

  1. Support users who don’t already know IPEDS structure
  2. Allow the user maximum flexibility in terms of variables selected, filters, and how output is returned
  3. Try to give the user what they request and, if not possible, give an informative error
  4. Be as memory and bandwidth efficient as possible

In some instances, these design principles are at odds with each other. Allowing users more flexibility in choosing filters means that there might be instances in which many files need to be downloaded (breaking principle 4) or that the request to return unjoined files in a list cannot also correctly apply the filter (breaking principle 3). Based on the complexity of IPEDS, there is unfortunately no way around some incommensurable requests. Users who understand these design choices, however, can find ways to mitigate unintended output and ultimately gather the data they need in the format required for their analysis.

How it works

The general process for returning data from a call is as follows:

  1. Scrape the IPEDS complete data file site to generate a table of currently available files if not already in memory.
  2. Using the table alongside an internal data dictionary, generate a list of complete data files that contain selected variables alongside chosen years
  3. Read complete data files into memory — downloading if necessary — selecting variables as requested
  4. Return output: list of each individual file used, list of files in which like files across years have been binded, or single joined data frame

As an example, take the following call:

ipeds_init() |>
  ipeds_select(instnm, stabbr) |>
  ipeds_year(2023) |>
  ipeds_get()

This call would generate the following work flow:

  1. Check memory for table of available files; if not available, scrape IPEDS site (only needs to do this once per session)
  2. Filter potential data files to those released for 2023, and check data dictionary for instnm and stabbr
    • both are found in HD2023 data file
  3. Check tempdir() for HD2023.zip. If it’s not there, download to tempdir() since no local directory to search is given
    • HD2023.zip is either found or downloaded to tempdir()
  4. Read in hd2023.csv (or revised version, hd2023_rv.csv, if it exists by default), selecting instnm and stabbr as well as unitid
  5. Return data frame

Using a filter

With a filter, the process becomes a little more complex. Because filtering variables might be located in different data files, these data files must first be read into memory and combined as necessary so that the filter can be applied.

As an added wrinkle, while most complete data files are in a wide format, with each table row representing a unique institution (by the variable UNITID), some complete data files are in long format, in which each institution can have multiple rows in the table.

For more detailed information on how filters work with ripeds, see the vignette on filtering behavior.

Return

By default, ripeds will attempt to combine all files into a single data.frame. This is a two step process:

  1. Row bind like files, meaning those data.frame coming from complete files that contain the same variables, but in different years (e.g., HD2020, HD2021, HD2022, etc)
  2. Join different file types by UNITID and year using a full or outer join so as not to artificially subset the resulting data.frame based on the order of the joins — this may result in missing values in some data.frame cells depending the specific nature of the call

The user can also choose to return files row bound only with like files, but not joined across files or entirely individually. Both of these options will result in the return of a list. For example, if a user sets join = FALSE but bind = TRUE and the data call requires the complete data files HD2020, HD2021, IC2020, and IC2021, the a list of two items will be returned: a data.frame in which the results from the two HD* files are bound together and a data.frame in which the results from the two IC* files are bound together.

If the user chooses only to bind like files, but the result is a single bound data.frame, output will be a list with one item. For example,

ipeds_init() |>
  ipeds_select(instnm, stabbr) |>
  ipeds_year(2022:2023) |>
  ipeds_get(join = FALSE)

returns a list with one item: a data.frame with two observations per unique UNITID containing the UNITID, institution name, state abbreviation, year, and file name.

Choosing not to bind (which by default also sets join = FALSE),

ipeds_init() |>
  ipeds_select(instnm, stabbr) |>
  ipeds_year(2022:2023) |>
  ipeds_get(bind = FALSE)

returns a list of two items: separate data.frames containing UNITID, institution name, state abbreviation, file name, and year for each year.

Returning to the original call, choosing either join = FALSE or bind = FALSE,

ipeds_init() |>
  ipeds_select(instnm, stabbr) |>
  ipeds_year(2023) |>
  ipeds_get(bind = FALSE)

will return a single data.frame in a list. Even though this is the same data (with the addition of the file name in the data.frame) as was returned in the first call — since all the requested data come from a single complete data file — ripeds returns a list instead of a data.frame object for the sake of consistency.

Speed and memory

The first data request requires a table of current complete data files to be in the tempdir(). If ipeds_init() does not find the table, one will be created by scraping the IPEDS website, munging the results, and storing the table in tempdir(). This step does not take long and once completed, will not need to be taken again for the rest of the session. Subsequent data pulls of the same size will be a little faster for not having to repeat this step.

When making a data call, ripeds will first check whether any of the required complete data files are located in the tempdir() (this is assuming the user does not have the files on their machine and did not supply ipeds_init() with a local_dir). Any missing files will be downloaded to the tempdir(). For the rest of the session, these files will be available for subsequent data requests. This is especially useful in interactive sessions in which the user makes multiple similar calls: data in memory will not need to be downloaded again. This should make additional calls faster as more necessary files are in memory.

Individual complete data files are small: both the tables themselves and the zipped file in which they are downloaded. For most computers and situations, users should not find that they are taking up too much memory via temporary file storage in tempdir(). For older computers or situations in which memory must be kept open, however, users may wish to clear out their tempdir() within a session. If a user wants to keep the downloaded files, ipeds_tmp_to_disk() will save all IPEDS files in memory to the chosen directory. If a user does not care to save the files, they can just close the session and reopen a clean one. Alternatively, they may want to use something like:

file.remove(list.files(tempdir(), full.names = TRUE, pattern = ".zip"))

WARNING The code above will remove all *.zip files from tempdir(). If a user has saved other zipped files to tempdir() during the session, these will be deleted as well.