CAIDA's Application to the UCSD Human Research Protections Program
Human Research Protections Program
University of California, San Diego
La Jolla, CA 92093-0052
17 October 2008
Cover Letter
Dear Committee on Investigations Involving Human Subjects:
Attached please find the completed Social and Behavioral Application submitted by The Cooperative Association for Internet Data Analysis (CAIDA) for your review and approval.
Therein we describe the policy and procedural risk controls CAIDA has implemented in adhering to privacy and confidentiality standards for research involving human subjects. Our risk controls are accountability-based and guided by objective reasonableness. Risk thresholds are assessed and maintained with the due care of both domain-specific researchers and the ordinary prudent person.
In light of these controls, we suggest that the potential risks and threats to the privacy and confidentiality of human subjects are minimal, reasonable, and tolerable. Specifically, the only data collected, maintained or disseminated that may present a potential risk are Internet Protocol Addresses (IPA) of the source and destination computers involved in network research measurement. This information identifies a specific host on the Internet, not a specific user. No authoritative, executive, judicial or legislative rulemaking entity in the United States has defined IPA to be personally identifiable information.
Nevertheless, this information could possibly be mapped back to a specific computer, user account and eventually an identifiable person. Whether obtained through valid legal process or illegal breach of our secured research data, the effort, resources and coordination required for this mapping to occur would significantly outweigh any potential benefit gained from that disclosure. The risk to the individual who might be associated with the IP address is that his/her single Internet transaction (e.g., visit to a website) would be disclosed. Given the extremely limited quality and quantity of information, and the difficulty in correlating the IP address to an individual, the disclosure risk presents little potential social stigmatization or psychological injury, and no physical risk to an identifiable subject.
To further diminish any identification and disclosure risks posed by IPAs, we implement multiple layers of additional risk controls. We mitigate the risk of correlating IP addresses with individual users by modifying the IP addresses using standard anonymization techniques, as well as by simply not collecting portions of traffic data known to contain personally identifying information. In addition, CAIDA requires both internal and external researchers to provide descriptions of intended usage and to sign data privacy agreements (MOAs) that restrict the distribution, use and disclosure of shared research data that carries any risk to the privacy or confidentiality of an identifiable subject. Researchers wishing to work with non-anonymized data must physically visit and be vetted by CAIDA to carry out their research.
Finally, we contend that the controls implemented are proportionate to the reasonably foreseeable risk while not obstructing our research goal of improving generalizable knowledge. The risks described above are minimal, and represent an impact on privacy that is no greater than the risk that individuals face in their regular use of the Internet. This application presents issues of first impression for all involved, and our risk assessment has uncovered authoritative opinion that this research likely qualifies for exempt status. Nevertheless, we embrace full transparency with the IRB and believe there is great value in engaging in dialogue with the IRB and obtaining its formal approval, so that we may collectively establish a framework for all stakeholders to achieve their respective goals while protecting their respective interests. We would also welcome the opportunity to extend the training course on protection of human research subjects to include a network research context, or to develop a separate course.
Application
The Cooperative Association for Internet Data Analysis (CAIDA)
Application to the University of California, San Diego
Human Research Protections Program
For Review by the Institutional Review Board
1. FACILITIES
CAIDA's primary facilities for managing and coordinating our research and measurements are located at the San Diego Supercomputer Center (SDSC) on the UCSD campus. Our research involves measuring the Internet both actively (via probes) and passively (via taps), using measurement servers located in facilities around the globe.
2. DURATION
CAIDA intends this application to address various methods by which we measure the Internet. The group conducts limited duration measurement experiments and data collection events as well as ongoing efforts that span more than ten years in order to recognize longer term trends. Due to the nature of these longitudinal studies, we hope to maintain ongoing approval for an indefinite period of time, filing supplemental and/or renewal information to this application to reflect any changes in the research design and as deemed necessary by the IRB.
3. SPECIFIC AIMS
CAIDA provides data, tools and analyses promoting the engineering and maintenance of a robust, scalable global Internet infrastructure. CAIDA investigates both practical and theoretical aspects of the Internet, with particular focus on topics that:
- are macroscopic in nature and provide enhanced insight into the function of Internet infrastructure worldwide,
- improve the integrity of the field of Internet science,
- improve the integrity of operational Internet measurement and management, and
- inform science, technology, and communications public policies.
4. BACKGROUND AND SIGNIFICANCE
For over ten years CAIDA has undertaken various approaches to narrowing a gap that now impedes the field of network research as well as telecommunications policy: a dearth of available empirical data on the public Internet since the infrastructure has undergone privatization.
As an Internet data analysis and research group largely supported with public funding to apply measurement and analysis toward understanding and solving globally relevant Internet engineering problems, we accept a responsibility to seek, analyze, and communicate the salient features of the best available data about the Internet.
5. PROGRESS REPORT/PRELIMINARY STUDIES
As mentioned above, CAIDA has conducted measurement experiments and coordinated collection events for over ten years. The list of publications resulting from the collection, maintenance, and analysis of these datasets is too numerous to catalog in this application. The complete list of publications, presentations, and visualizations can be found on the CAIDA web site at https://www.caida.org/publications/.
To date, we do not know of any untoward effects on any individual(s) as a result of our measurements or data collection, distribution, or publication efforts.
6. RESEARCH DESIGN AND METHODS
CAIDA conducts measurements of the Internet using two distinctly different methods: active and passive. The following addresses these two methods for data collection, data analysis and data interpretation, respectively. In neither method do we publish data that reveals any personal information about users.
(a) Active methods: CAIDA's Macroscopic Topology Project [1] actively measures the connectivity and latency data for a wide cross-section of the commodity Internet. These measurements contribute to the study of the topology or graph of the Internet.
For over ten years, CAIDA has maintained a set of monitors (servers), hosted in sites around the globe, that send probes that trace the route that packets travel on their way through the numerous routers to random destination hosts in the Internet address space. The monitors collect the responses and send them to a central server located at SDSC on the UCSD campus. These monitors act in teams, coordinated by the central server, to distribute the work of probing the millions of destinations.
The collected research data is neither sought for nor predicated on identification with an individually identifiable living human subject. To explicate, the selection of the destination hosts (identified by IP address) is random, with the only deliberate target being the broad, routed networks on the Internet. This approach ensures that we sample from all routed network prefixes so that our topology measurements are complete and accurately reflect the full scope of the commodity Internet. The destination addresses within each routed network prefix are arbitrarily selected and probed approximately every 48 hours (one probing cycle), with each probing cycle involving the selection of another randomly selected host. We do this random probing within a large scope of possible hosts (i.e., there are roughly 250,000 routed network prefixes that we separate into roughly 7.4 million targeted networks each containing 256 IP addresses). At the time of writing, approximately 6.5% of probed IP addresses respond.
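The target selection described above can be illustrated with a minimal sketch. It assumes each routed prefix is split into networks of 256 addresses (/24s) and that one address is drawn uniformly at random from each such network per probing cycle. With the figures given above, that is roughly 7.4 million × 256 ≈ 1.9 billion candidate addresses. The function names, the example prefixes and the seed are hypothetical and are not taken from CAIDA's measurement software.

```python
import ipaddress
import random


def split_into_24s(prefix):
    """Split a routed prefix into networks of 256 addresses (/24s)."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen >= 24:
        # Already a /24 or smaller: treat it as a single target network.
        return [net]
    return list(net.subnets(new_prefix=24))


def pick_targets(prefixes, rng):
    """Choose one random destination address in every /24 for one cycle."""
    targets = []
    for prefix in prefixes:
        for net24 in split_into_24s(prefix):
            offset = rng.randrange(net24.num_addresses)
            targets.append(net24[offset])
    return targets


# Hypothetical prefixes from reserved documentation/benchmark ranges,
# used here only for illustration (they are not real probe targets).
routed_prefixes = ["198.18.0.0/22", "198.51.100.0/24", "203.0.113.0/24"]

rng = random.Random(2008)  # fixed seed so the sketch is repeatable
cycle_one = pick_targets(routed_prefixes, rng)
cycle_two = pick_targets(routed_prefixes, rng)  # a fresh random draw per cycle
```

Because every cycle draws again, the same address is unlikely to be selected in consecutive cycles, which is consistent with the discontinuous probing described in this section.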
Any devices that do receive probes are likely to be intermediary devices that effectively mask identification of a specific individual's computer (i.e. IP addresses often resolve to router or NAT (network address translation) machines; and Internet service providers frequently use DHCP (dynamic host configuration protocol) which renders IP addresses non-static and random). In other words, it is likely that there will not be a human on a computer at the end of the random probe. In the unlikely event that the randomly selected address is connected to an end user's computer, that address only identifies the computer, not any of the user accounts associated with the computer or the specific human who operates the device. Furthermore, probing is discontinuous, which means that there is no continuous tracking, repeated monitoring or surveillance of the randomly chosen address. It is worth noting that we use a standard, widely accepted diagnostic protocol, Internet Control Message Protocol (ICMP), for conducting the probes.
Finally, the collected probe responses do not contain any contents of the packets collected from the destination, only the traffic ("transactional routing control") data. To clarify further, the Internet is a packet-switched network in which the basic unit of transmission is a "packet." Each packet contains a "header" and a "payload." Much like a letter sent through the U.S. Postal Service, the payload of each packet is like the contents of the letter and the header acts like the envelope. In a computer network, applications generate the payload of packets (the "contents") and the operating system generates the header (the "addressing information"). For these preceding reasons, there is no "selecting" or "enrolling" a human subject.
Once collected, we copy this traffic data back to the central server where we store and bundle based on the probing cycle. We then annotate the data with information from the Domain Name Service (DNS) which maps the IP address to a human readable domain (network) name. We also generate derived datasets that describe the links between the systems (Autonomous Systems, AS) that route the collected traffic, their inferred relationships with each other, and their taxonomy. We publish these datasets as well as our interpretation of this data in the form of a visualization topology map of the AS-level Internet graph [11].
(b) Passive methods: CAIDA collaborates with organizations that provide local and wide-area network infrastructure. Through these collaborations, we have explicit authorization to passively "tap" heavily aggregated links that provide data packet transport to and from local, state-wide, national and international research and education networks as well as the commodity Internet. [3] The tap involves instrumenting these links with specialized measurement equipment to collect packets, anonymize IP addresses, and analyze packet header traces acquired from these networks. Additionally, we publish web-based reports of aggregated traffic statistics on the monitored link. [12] None of the published information contains any personal information about users.
When we measure a network link, we capture all of the packet headers, or a statistically representative sample of them, using a passive network tap that splits the traffic and provides a copy of all the packets to a host that records the data. The packet headers contain IP addresses that can be used to identify the originating computer (the "source address") and the destination computer (the "destination address") of each packet.
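The header/payload distinction can be shown with a short sketch that reads the source and destination addresses from an IPv4 header and discards everything after the header. It is an illustration under simplifying assumptions (IPv4 only, IP header only, a synthetic packet built in the example); `header_only` is a hypothetical name, and this is not the capture software used on the monitored links.

```python
import socket
import struct


def header_only(packet: bytes) -> dict:
    """Return selected IPv4 header fields and drop the payload."""
    if len(packet) < 20:
        raise ValueError("too short for an IPv4 header")
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if version != 4 or header_len < 20 or len(packet) < header_len:
        raise ValueError("not a well-formed IPv4 packet")
    (total_length,) = struct.unpack("!H", packet[2:4])
    return {
        "src": socket.inet_ntoa(packet[12:16]),  # source address
        "dst": socket.inet_ntoa(packet[16:20]),  # destination address
        "protocol": packet[9],
        "total_length": total_length,
        "header_bytes": packet[:header_len],  # payload is not retained
    }


# Synthetic packet: a 20-byte IPv4 header followed by 5 payload bytes.
# Addresses come from reserved documentation ranges.
sample_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 25, 0, 0, 64, 6, 0,
    socket.inet_aton("192.0.2.10"),
    socket.inet_aton("198.51.100.7"),
)
record = header_only(sample_header + b"hello")
```

In this sketch `record` holds the addressing information (the "envelope") while the five payload bytes (the "letter") are never stored.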
The packet header traces are used to build empirical models of network traffic. These models apply to a variety of analyses, such as understanding how applications use networks and how such use changes over time. These models also assist researchers who need to generate synthetic network traffic to test and experiment with new hardware and software under conditions representative of real-world networks.
In addition to authorized passive header collection with collaborators, we employ the UCSD Network Telescope. A network telescope (aka a black hole, an Internet sink, or a darknet) typically has few or often no real computers attached to it and carries almost no legitimate traffic. It serves research value as a monitoring point for anomalous traffic which comprises a significant portion of Internet activity. The network telescope may capture phenomena from a wide range of events, including misconfiguration, malicious scanning of address space, backscatter from random source denial-of-service attacks, and automated spread of malicious software. [4, 5, 6] The network telescope is thus a tool to help researchers identify root causes of this anomalous traffic and has already proven successful in uncovering denial-of-service attack victims and tracking the automated spread of worms.
7. HUMAN SUBJECTS
The proposed subject population includes all potential users of the Internet regardless of geography, age, sex, race, ethnicity or health status. We note that the data stored in, and the information gleaned from, the repository is collected from aggregated Internet links, with no intentional or consequential targeting of any specific demographic, and that specific measures are in place to prevent correlation of any subset of the data with individually identifiable persons.
8. INFORMED CONSENT
As described in Section 6, there is no explicit or implicit recruitment, enrolling, selecting, or identification of human subjects. Therefore, plans or procedures for obtaining informed consent are not applicable. Even assuming that the described research activities were deemed to involve the identification of individual human subjects by virtue of identifying a person from an IP address, CAIDA is practically precluded from soliciting consent from those individuals for two reasons, one technical and one methodological. Technically, we measure backbone network links for aggregated populations of millions of users, and it is not viable to request consent. Methodologically, informing users of monitoring experiments, e.g., posting placards on computer kiosks on campus stating "This terminal is being monitored as part of a research project" will create a biased sample of usage. Since such bias would invalidate the research, we do not expect to request or receive individual waivers or consents.
9. POTENTIAL RISKS
The datasets we will collect and make available to researchers represent activities of a large sample of ordinary individuals using the Internet. Different types of data will be collected, each with its own degree of risk (or non-risk). For the Institutional Review Board to adequately assess this risk, we provide below a general description of how the normal functioning of the Internet generates datasets that can be aggregated and published. In general, we expect that all information that could potentially identify or link an individual with a certain portion of the data will be modified to remove any risk to individuals.
The data that is generated in the normal functioning of the Internet can include three data types that may have human subject sensitivities:
1. Internet addresses of the source and destination computers.
This information often does not identify a specific host on the Internet due to the use of Network Address Translation (NAT) devices and other firewalls. An IP address is even less likely to identify a specific user. While this information could possibly be mapped to a specific computer, or even a specific individual, it does not reveal the identity of individuals in and of itself. This information would need to be cross-matched with records from various, private and independently held ISP records, all of which require deliberate, time and resource-intensive legal process to accomplish. In the unlikely situation that the IP address were to be linked to an identifiable individual, this research would present a nominal legal risk to the confidentiality of that IP address, since an entity could undertake legal process to request this data. The risk to the individual who might be associated with the IP address is that his/her one-time Internet transaction (e.g., visit to a website) would be disclosed. Given the extremely limited quality and quantity of information, and the difficulty in correlating the IP address to an individual, the disclosure risk presents little potential social stigmatization or psychological injury, and no physical risk.
2. Application type information.
It is possible to determine the specific application behind a given packet or set of packets. However, validating application classification requires researchers to collect application header information, which is deemed content (payload) from the perspective of the TCP/IP header although not from the perspective of the application. Regardless of the header/content posture chosen, we will not preserve any address information that may be associated with application data; therefore, there is no legal disclosure, psychological, social or physical risk posed to an identifiable subject.
3. Application payload information.
This could include everything from the content of email messages to the contents of web pages. At the time of application submission, CAIDA does not collect any payload information from Internet traffic links. In fact, we go to great lengths to strip payload from the packet header traces we do collect. However, because of the potential value that payload data adds to network research, we receive frequent requests for payload data. This is an area of research design that is subject to change, but one that will certainly be steered by legal and policy research and whose implementation is subject to the discretion and consent of the IRB via supplements to this application. At this time, there are no legal disclosure, psychological, social or physical risks posed to an identifiable subject.
10. RISK MANAGEMENT
Throughout CAIDA's fifteen years of experience with Internet data analysis and in conjunction with evolving legal, social, and industry norms and standards, we have implemented various policies and procedures for protecting against and minimizing potential risks associated with the collection, analysis, storage and use of measurement data for network research. The result is a dynamic set of privacy and confidentiality risk controls that address the people, policies and technologies in the lifecycle of our research activities.
CAIDA collects and publishes datasets in all three categories mentioned above with careful attention paid to prevent any risk to identifiable human subjects. We mitigate the risk of correlating IP addresses with individual users by modifying the IP addresses using standard anonymization techniques [2], e.g., removing some of the identifying information from the IP address, similar to blacking out the last four digits of a phone number. This technique prevents identifying an individual user on the network while still allowing researchers to reach conclusions about related groups of users.
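The phone-number analogy above can be made concrete with a small sketch that zeroes the low-order bits of an IPv4 address, so that all hosts in the same network map to one value. This shows simple truncation only, chosen because it matches the analogy; it is not the cryptographic prefix-preserving scheme of [2], and the function name and the default of 24 retained bits are assumptions made for illustration.

```python
import ipaddress


def truncate_ipv4(address: str, keep_bits: int = 24) -> str:
    """Zero all but the leading keep_bits bits of an IPv4 address."""
    if not 0 <= keep_bits <= 32:
        raise ValueError("keep_bits must be between 0 and 32")
    value = int(ipaddress.IPv4Address(address))
    mask = (0xFFFFFFFF << (32 - keep_bits)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(value & mask))


# Two hosts in the same documentation-range network become indistinguishable.
first = truncate_ipv4("192.0.2.77")
second = truncate_ipv4("192.0.2.200")
```

Both calls return the same network-level value, so a researcher can still group traffic by network without seeing which host within that network sent it.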
CAIDA requires both internal and external researchers to provide descriptions of intended usage and sign data privacy agreements. This agreement restricts them from distributing the data beyond authorized users and requires that publications will anonymize, aggregate or summarize the data and will not publish any personally identifiable information to protect the privacy of end users. Researchers wishing to work with non-anonymized data must physically visit CAIDA to conduct the research.
We restrict access by users from export-restricted countries. We follow the guidelines provided by the Export Administration Regulations (EAR), International Traffic in Arms Regulations (ITAR), and the Office of Foreign Assets Control (OFAC).
11. POTENTIAL BENEFITS
As national utility infrastructures become intertwined with emerging global data networks, the stability and integrity of the two have become synonymous. This connection, while necessary, leaves network assets vulnerable to the rapidly moving threats of today's Internet. These new threats have impact beyond the scope of the individual enterprise, not only infecting vulnerable hosts (i.e., the individual computers on the network) with malicious code but also denying service to legitimate network users. Fast-spreading worms have disrupted financial institutions and emergency services. Inadvertent routing configuration changes have crashed national ISP networks. The enterprise, its upstream ISP(s), and the global Internet community address these threats differently because each has a separate view of the network. Unfortunately, research into Internet-wide or infrastructure-level attacks is hampered by a lack of macroscopic datasets. While researchers can often study individual packet traces and compromised machine forensics, these datasets rarely reflect system-level behaviors. Available datasets are often fragmented or difficult to correlate because of missing meta-data or wildly disparate time frames. As part of its participation in the PREDICT virtual center, CAIDA will help address this gap by providing a repository of rich, correlated datasets representing Internet-scale behaviors, which will enable qualified cybersecurity researchers to test and prototype novel attack mitigation techniques. Data available from this virtual repository will include both infrastructure data and data from distributed forensic tools.
12. RISK/BENEFIT RATIO
Based on Section 9: Potential Risks and Section 10: Risk Management, we contend that the controls implemented are proportionate to the risk while not obstructing our research goals. The risks described above are minimal, and represent an impact on privacy that is no greater than the risk that individuals face in their regular use of the Internet. The datasets included in the repository represent large aggregations and pose no additional risk to individual privacy. The datasets that do have the potential to affect individual privacy will be anonymized prior to distribution to researchers using the standard anonymization techniques described in Section 10, and therefore represent an acceptable risk. As mentioned above, we hope the availability of these datasets will strengthen research in network security, workload, traffic classification, performance, topology discovery, and routing.
We have engaged an information technology law advisor, view the risk assessment as an ongoing and iterative process, and invite inquiries and dialogue with the IRB to help establish a framework for all stakeholders to achieve their respective goals while protecting their respective interests.
13. BIBLIOGRAPHY
- 1. CAIDA's Macroscopic Topology Project,
https://www.caida.org/projects/macroscopic/
- 2. J. Xu, J. Fan, and M. H. Ammar, "Prefix-Preserving IP Address Anonymization: Measurement-based Security Evaluation and a New Cryptography-based Scheme", IEEE ICNP, 2002
- 3. CAIDA Internet Data -- Passive Data Sources,
https://www.caida.org/data/passive/
- 4. DeBaecke, D., "Denial of Service Tools and Techniques", University of Memphis, 2006
http://umdrive.memphis.edu/ddebaeck/public/Denial%20of%20Service%20Tools%20and%20Techniques.ppt
- 5. Rajab, M., Monrose, F., Terzis, A., "Worm Evolution Tracking via Timing Analysis", ACM WORM 2006
- 6. Zesheng Chen; Chuanyi Ji, "Measuring Network-Aware Worm Spreading Ability", INFOCOM 2007, 26th IEEE International Conference on Computer Communications, 6-12 May 2007, Page(s): 116-124
- 7. kc claffy, "Ten Things Lawyers Should Know About Internet Research", CAIDA, 2008
https://catalog.caida.org/paper/2008_lawyers_top_ten/
- 8. kc claffy, "According to the Best Available Data"
https://blog.caida.org/
- 9. Simson L. Garfinkel, "IRBs and Security Research: Myths, Facts and Mission Creep", USENIX, UPSEC, 2008
http://www.usenix.org/events/upsec08/tech/full_papers/garfinkel/garfinkel_html/
- 10. CAIDA:Publications:Bibliography:Networking:Anonymization
https://www.caida.org/archive/networkingbib/bytopic#anonymization
- 11. Visualizing IPv4 Internet Topology at a Macroscopic Scale
https://www.caida.org/projects/as-core/
- 12. Passive Network Monitors,
https://www.caida.org/data/realtime/passive/
14. OTHER FUNDING
Indicate whether this project is supported by federal, state, or another source. Provide the UCSD grant number and inclusive dates of support. If you have indicated on the face sheet that there is NO funding support for this project, you will need to explain just how the project is to be supported.
2006-0375 - NSF CNS 0551542 CRI: Toward Community Oriented Network Measurement Infrastructure 9/1/06 - 8/31/11
2007-2459 - DHS NBCHC070133 PREDICT Contract 8/1/07 - 7/31/12
2008-0644 - DHS SPAWAR Contract N66001-08-C-2029 Leveraging the Science and Technology of Internet Mapping for Homeland Security 3/21/08 - 9/20/10
2008-0444
15. CONFLICT OF INTEREST. PRINCIPAL INVESTIGATOR'S STATEMENT OF ECONOMIC INTEREST (Form 730-U)
16. Copies of questionnaires,* survey instruments, testing instruments that are not part of standard clinical or educational practice, must be submitted with the application.
We have no questionnaires or surveys to submit at this time.