IPEDS structure
IPEDS does not have a true API for accessing its data files. For users who want complete data files, there are two primary download options:
- Access databases
- Complete data files
Each version has benefits and drawbacks. Whereas Access databases helpfully provide all data tables for a given year, the files are large. They also come in a proprietary database format that may require the installation of additional software for some users. Finally, IPEDS Access databases only go back to the 2004-2005 data collection period.
On the other hand, the complete data files are smaller — zipped CSV versions of individual database tables. They also include all available data, which in some cases goes back to 1980. However, it can be difficult to know which files contain variables of interest without downloading multiple dictionary files. Once the correct files are identified, there are potentially many files to download.
ripeds simulates an API by smartly downloading, reading, and wrangling complete data files. For users who already have the complete data files they need on their computer, ripeds can skip downloading and instead use those files, saving bandwidth and the time necessary to download.
Using complete data files rather than Access databases allows for complete coverage of IPEDS data and doesn’t require users to install extra external database tools. To aid users not already familiar with the structure of IPEDS, ripeds also provides a searchable data dictionary that can be used to find and request variables of interest.
Design principles
ripeds is built with the following design principles, in general order of importance:
1. Support users who don’t already know IPEDS structure
2. Allow the user maximum flexibility in terms of variables selected, filters, and how output is returned
3. Try to give the user what they request and, if not possible, give an informative error
4. Be as memory and bandwidth efficient as possible
In some instances, these design principles are at odds with each other. Allowing users more flexibility in choosing filters means that there might be instances in which many files need to be downloaded (breaking principle 4) or in which a request to return unjoined files in a list cannot also correctly apply the filter (breaking principle 3). Given the complexity of IPEDS, there is unfortunately no way around some incompatible requests. Users who understand these design choices, however, can find ways to mitigate unintended output and ultimately gather the data they need in the format required for their analysis.
How it works
The general process for returning data from a call is as follows:
- Scrape the IPEDS complete data file site to generate a table of currently available files if not already in memory.
- Using the table alongside an internal data dictionary, generate a list of complete data files that contain the selected variables for the chosen years.
- Read complete data files into memory (downloading if necessary), selecting variables as requested.
- Return output: a list of each individual file used, a list in which like files across years have been bound together, or a single joined data frame.
As an example, take the following call:
ipeds_init() |>
ipeds_select(instnm, stabbr) |>
ipeds_year(2023) |>
ipeds_get()
This call would generate the following workflow:
- Check memory for table of available files; if not available, scrape IPEDS site (only needs to do this once per session)
- Filter potential data files to those released for 2023, and check data dictionary for instnm and stabbr
  - both are found in HD2023 data file
- Check tempdir() for HD2023.zip. If it’s not there, download to tempdir() since no local directory to search is given
  - HD2023.zip is either found or downloaded to tempdir()
- Read in hd2023.csv (or revised version, hd2023_rv.csv, if it exists by default), selecting instnm and stabbr as well as unitid
- Return data frame
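The "check tempdir(), download only if missing" step can be sketched in base R. This is a minimal illustration of the caching idea, not the ripeds implementation: get_cached_zip() and fake_downloader() are hypothetical names, and the stand-in downloader writes a placeholder file rather than contacting IPEDS.

```r
# Hypothetical helper (not part of ripeds): return the path to a zip file,
# calling the downloader only when the file is not already cached.
get_cached_zip <- function(stub, cache_dir = tempdir(), downloader) {
  path <- file.path(cache_dir, paste0(stub, ".zip"))
  if (!file.exists(path)) {
    # Cache miss: fetch the file into the cache directory
    downloader(stub, path)
  }
  path
}

# Stand-in downloader that writes a placeholder file and counts its calls;
# a real one would fetch the file, e.g., with utils::download.file()
n_downloads <- 0
fake_downloader <- function(stub, dest) {
  n_downloads <<- n_downloads + 1
  writeLines(paste("placeholder for", stub), dest)
}

# Use a dedicated subdirectory so the demo does not touch other temp files
cache_dir <- file.path(tempdir(), "ipeds_cache_demo")
dir.create(cache_dir, showWarnings = FALSE)
unlink(file.path(cache_dir, "HD2023.zip"))  # make sure the first call is a miss

first  <- get_cached_zip("HD2023", cache_dir, fake_downloader)  # miss: downloads
second <- get_cached_zip("HD2023", cache_dir, fake_downloader)  # hit: reuses file
```

Both calls return the same path, but the downloader runs only once, which is why repeated requests within a session tend to be faster.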
Using a filter
With a filter, the process becomes a little more complex. Because filtering variables might be located in different data files, these data files must first be read into memory and combined as necessary so that the filter can be applied.
As an added wrinkle, while most complete data files are in a wide format, with each table row representing a unique institution (by the variable UNITID), some complete data files are in long format, in which each institution can have multiple rows in the table.
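To make the wide/long distinction concrete, here is a small base R sketch with made-up data. The column names level and count are invented for illustration and do not correspond to real IPEDS variables.

```r
# Toy wide-format table: one row per institution
wide <- data.frame(
  unitid = c(100, 200),
  instnm = c("College A", "College B"),
  stabbr = c("KY", "TN")
)

# Toy long-format table: several rows per institution
long <- data.frame(
  unitid = c(100, 100, 200, 200, 200),
  level  = c(1, 2, 1, 2, 3),
  count  = c(50, 20, 75, 30, 10)
)

# A table is "wide" in this sense when no institution ID repeats
is_wide <- function(df, id = "unitid") anyDuplicated(df[[id]]) == 0

is_wide(wide)  # TRUE
is_wide(long)  # FALSE

# Joining wide to long repeats each institution's wide values on every
# matching long row, so the result has as many rows as the long table
joined <- merge(wide, long, by = "unitid")
nrow(joined)  # 5
```

Because the joined result is no longer one row per institution, a filter on a long-format variable can keep some of an institution's rows and drop others, which is part of what makes filtering across file types more involved.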
For more detailed information on how filters work with ripeds, see the vignette on filtering behavior.
Return
By default, ripeds will attempt to combine all files into a single data.frame. This is a two-step process:
- Row bind like files, meaning those data.frames coming from complete data files that contain the same variables, but in different years (e.g., HD2020, HD2021, HD2022, etc.)
- Join different file types by UNITID and year using a full (outer) join so as not to artificially subset the resulting data.frame based on the order of the joins. This may result in missing values in some data.frame cells, depending on the specific nature of the call.
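The bind-then-join logic can be illustrated with base R and toy data. This is a sketch of the general technique (rbind() followed by a full join via merge(all = TRUE)), not the package's internal code; the variable var_x is made up.

```r
# Toy stand-ins for two file types across two years
hd2020 <- data.frame(unitid = c(100, 200), year = 2020,
                     instnm = c("College A", "College B"))
hd2021 <- data.frame(unitid = c(100, 200, 300), year = 2021,
                     instnm = c("College A", "College B", "College C"))
ic2020 <- data.frame(unitid = c(100, 200), year = 2020, var_x = c(1, 2))
ic2021 <- data.frame(unitid = c(100, 200), year = 2021, var_x = c(1, 1))

# Step 1: row bind like files across years
hd <- rbind(hd2020, hd2021)  # 5 rows
ic <- rbind(ic2020, ic2021)  # 4 rows

# Step 2: full (outer) join on unitid and year so no rows are dropped
out <- merge(hd, ic, by = c("unitid", "year"), all = TRUE)

# unitid 300 appears only in the 2021 HD file, so its var_x is NA
out[out$unitid == 300, ]
```

With an inner or left join, the result would depend on which table came first or on which institutions appear in every file; the full join keeps every unitid-year pair at the cost of some missing values.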
The user can also choose to return files row bound only with like files, but not joined across files, or to return every file individually. Both of these options will result in the return of a list. For example, if a user sets join = FALSE but bind = TRUE and the data call requires the complete data files HD2020, HD2021, IC2020, and IC2021, then a list of two items will be returned: a data.frame in which the results from the two HD* files are bound together and a data.frame in which the results from the two IC* files are bound together.
If the user chooses only to bind like files, but the result is a single bound data.frame, output will be a list with one item. For example,
ipeds_init() |>
ipeds_select(instnm, stabbr) |>
ipeds_year(2022:2023) |>
ipeds_get(join = FALSE)
returns a list with one item: a data.frame with two observations per unique UNITID containing the UNITID, institution name, state abbreviation, year, and file name.
Choosing not to bind (which by default also sets join = FALSE),
ipeds_init() |>
ipeds_select(instnm, stabbr) |>
ipeds_year(2022:2023) |>
ipeds_get(bind = FALSE)
returns a list of two items: separate data.frames containing UNITID, institution name, state abbreviation, file name, and year for each year.
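When output comes back as a list of per-year data.frames, the pieces can still be combined by hand later. The sketch below uses a hand-built list that mimics the shape described above; the column names year and file are assumptions made for illustration, and the values are made up.

```r
# Hand-built stand-in for a list of per-year results (not real output)
res <- list(
  data.frame(unitid = c(100, 200),
             instnm = c("College A", "College B"),
             stabbr = c("KY", "TN"),
             year = 2022, file = "HD2022"),
  data.frame(unitid = c(100, 200),
             instnm = c("College A", "College B"),
             stabbr = c("KY", "TN"),
             year = 2023, file = "HD2023")
)

# Inspect each piece separately
rows_each <- vapply(res, nrow, integer(1))

# Row bind the pieces once they are known to share the same columns
combined <- do.call(rbind, res)

# Two observations per institution, one for each year
table(combined$unitid)
```

Keeping the files separate first can be useful for checking each year on its own before stacking them.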
Returning to the original call, choosing either join = FALSE or bind = FALSE,
ipeds_init() |>
ipeds_select(instnm, stabbr) |>
ipeds_year(2023) |>
ipeds_get(bind = FALSE)
will return a single data.frame in a list. Even though this is the same data (with the addition of the file name in the data.frame) as was returned in the first call, since all the requested data come from a single complete data file, ripeds returns a list instead of a data.frame object for the sake of consistency.
Speed and memory
The first data request requires a table of current complete data files to be in the tempdir(). If ipeds_init() does not find the table, one will be created by scraping the IPEDS website, munging the results, and storing the table in tempdir(). This step does not take long and, once completed, will not need to be repeated for the rest of the session. Subsequent data pulls of the same size will be a little faster for not having to repeat this step.
When making a data call, ripeds will first check whether any of the required complete data files are located in the tempdir() (this assumes the user does not have the files on their machine and did not supply ipeds_init() with a local_dir). Any missing files will be downloaded to the tempdir(). For the rest of the session, these files will be available for subsequent data requests. This is especially useful in interactive sessions in which the user makes multiple similar calls: files already in tempdir() will not need to be downloaded again. This should make additional calls faster as more of the necessary files become available locally.
Individual complete data files are small: both the tables themselves and the zipped files in which they are downloaded. For most computers and situations, users should not find that temporary file storage in tempdir() takes up too much space. For older computers or situations in which storage must be kept free, however, users may wish to clear out their tempdir() within a session. If a user wants to keep the downloaded files, ipeds_tmp_to_disk() will save all IPEDS files in tempdir() to the chosen directory. If a user does not care to save the files, they can just close the session and reopen a clean one. Alternatively, they may want to use something like:
file.remove(list.files(tempdir(), full.names = TRUE, pattern = "\\.zip$"))
WARNING: The code above will remove all *.zip files from tempdir(). If a user has saved other zipped files to tempdir() during the session, these will be deleted as well.
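A more conservative cleanup removes only files whose names are known to be IPEDS downloads. The sketch below demonstrates the idea in a throwaway subdirectory with empty placeholder files; the vector ipeds_stubs is a hypothetical list the user would assemble from their own requests, not something ripeds provides.

```r
# Set up a demo directory with placeholder files (no real data)
demo_dir <- file.path(tempdir(), "ipeds_cleanup_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(
  demo_dir,
  c("HD2023.zip", "IC2023.zip", "my_notes.zip", "unzip_log.txt")
))

# Hypothetical list of IPEDS file stubs used during the session
ipeds_stubs <- c("HD2023", "IC2023")

# Build exact paths instead of pattern matching, and keep only those
# that actually exist so file.remove() does not warn
targets <- file.path(demo_dir, paste0(ipeds_stubs, ".zip"))
targets <- targets[file.exists(targets)]

removed <- file.remove(targets)

# Unrelated files, including other zip files, are left in place
remaining <- list.files(demo_dir)
```

Listing the targets before calling file.remove() also gives the user a chance to review what will be deleted.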