Sci2 Manual : 4.2 Data Acquisition and Preparation
This page last changed on Mar 28, 2011 by dapolley.
Typically, about 80 percent of the total project effort is spent on data acquisition and preprocessing, yet well-prepared data is mandatory to arrive at high-quality results. Datasets might be acquired via questionnaires, crawled from the Web, downloaded from a database, or accessed as a continuous data stream. Datasets differ in their coverage and resolution of time (days, months, years), geography (languages and/or countries considered), and topics (disciplines and selected journal sets). Their size ranges from several bytes to terabytes (trillions of bytes) of data. They might be high-quality materials curated by domain experts or random content retrieved from the Web. Based on a detailed needs analysis and deep knowledge of existing databases, the best-suited yet affordable datasets have to be selected, filtered, integrated, and augmented. It may also be necessary to extract networks (see section 4.7 Network Analysis for details).
4.2.1 Datasets: Publications
4.2.1.1 Refer/BibIX/enw
Refer was one of the first digital reference managers, developed by Bell Labs in 1978. Refer's file output format has since been adopted by many tools and web services, including BibIX for UNIX, early versions of EndNote, CiteSeerX, and Zotero.
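For readers who want to inspect a Refer-style file outside the Sci2 Tool, the following is a minimal sketch in Python. It assumes the common Refer conventions: each field starts with '%' followed by a one-letter tag (for example %A for author, %T for title, %D for date), and records are separated by blank lines. The function name parse_refer and the sample records are hypothetical and not part of Sci2.

```python
from collections import defaultdict


def parse_refer(text):
    """Split Refer-style text into records.

    Each record is a dict mapping a one-letter tag to a list of values,
    because tags such as %A (author) may repeat within one record.
    """
    records = []
    current = defaultdict(list)
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(dict(current))
                current = defaultdict(list)
            last_tag = None
        elif line.startswith("%") and len(line) >= 2:
            last_tag = line[1]
            current[last_tag].append(line[2:].strip())
        elif last_tag is not None:
            # Continuation line: extend the most recent value.
            current[last_tag][-1] += " " + line.strip()
    if current:
        records.append(dict(current))
    return records


# Hypothetical sample data, for illustration only.
sample = """%A Jane Doe
%A John Roe
%T A hypothetical title
%D 1999

%A Ann Other
%T Another hypothetical title
%D 2001
"""

records = parse_refer(sample)
for record in records:
    print(record["D"][0], record["T"][0], "by", "; ".join(record["A"]))
```

Values are kept as lists so that repeated tags, such as several authors, are not overwritten.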
4.2.1.2 BibTeX
Like Refer, BibTeX provides a standard reference file format used by many tools and web services, including CiteSeerX, citeulike, BibSonomy, and Google Scholar.
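As a rough illustration of what a BibTeX record contains, the sketch below pulls the entry type, citation key, and simple fields out of a single entry using regular expressions. It is deliberately limited: it assumes field values are written as {braces} without nesting, "quotes", or bare numbers, so it is not a full BibTeX parser. The function name parse_bibtex_entry and the sample entry are hypothetical.

```python
import re

# "@type{key," at the start of an entry.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# name = {value} | name = "value" | name = 1234 (no nested braces handled).
FIELD_RE = re.compile(r"(\w+)\s*=\s*(?:\{([^{}]*)\}|\"([^\"]*)\"|(\d+))")


def parse_bibtex_entry(entry):
    """Return (entry_type, citation_key, fields) for one simple BibTeX entry."""
    header = ENTRY_RE.search(entry)
    if header is None:
        raise ValueError("no BibTeX entry header found")
    fields = {}
    for name, braced, quoted, number in FIELD_RE.findall(entry[header.end():]):
        # Exactly one of the three alternatives is non-empty for a match.
        fields[name.lower()] = braced or quoted or number
    return header.group(1).lower(), header.group(2), fields


# Hypothetical sample entry, for illustration only.
sample = """@article{doe1999,
  author = {Jane Doe and John Roe},
  title = "A hypothetical title",
  year = 1999
}"""

entry_type, key, fields = parse_bibtex_entry(sample)
print(entry_type, key, fields)
```

For real-world files with nested braces, string macros, or comments, a dedicated BibTeX library is the safer choice.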
4.2.1.3 ISI Web of Science
ISI Web of Science (WoS) is a leading citation database cataloging over 10,000 journals and over 120,000 conferences. Access it via the "Web of Science" tab at http://www.isiknowledge.com (note: access to this database requires a paid subscription). Along with Scopus, ISI WoS provides some of the most useful datasets for scientometric analysis.
Download the first 500 records using the output box at the bottom of the page. Enter records '1' to '500', select 'Full Record' and 'plus Cited Reference', select 'Save to Plain Text' in the drop-down menu, and then click 'Save'. Wait for the processing to complete, and then save the file as GarfieldE.isi. The resulting file can be seen in Figure 4.3.
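To get a feel for the structure of the saved plain-text file, the following Python sketch reads ISI-style tagged records. It assumes the usual layout of such exports: each field starts with a two-character tag (for example PT, AU, TI, PY, CR), continuation lines are indented, a line containing ER ends a record, and FN, VR, and EF are file-level markers. The function name parse_isi and the sample record are hypothetical.

```python
def parse_isi(text):
    """Parse ISI-style tagged text into a list of records.

    Each record maps a two-character tag to a list of lines, so that
    multi-line fields such as AU (authors) and CR (cited references)
    keep one entry per line. Wrapped single-value fields such as TI
    can be re-joined with " ".join(record["TI"]).
    """
    records = []
    current = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        head, value = line[:2], line[3:].strip()
        if head == "ER":
            # End of record.
            records.append(current)
            current, tag = {}, None
        elif head in ("FN", "VR", "EF"):
            # File header and end-of-file markers carry no record data.
            continue
        elif head.strip():
            tag = head
            current.setdefault(tag, []).append(value)
        elif tag is not None:
            # Indented continuation line belongs to the previous tag.
            current[tag].append(value)
    return records


# Hypothetical sample record, for illustration only.
sample = """FN ISI Export Format
VR 1.0
PT J
AU Doe, J
   Roe, J
TI A hypothetical title
PY 1999
CR SMITH A, 1990, J HYPOTHETICAL, V1, P1
   JONES B, 1995, J HYPOTHETICAL, V2, P2
ER

EF
"""

records = parse_isi(sample)
print(len(records), "record(s);", len(records[0]["CR"]), "cited references")
```

Keeping the cited references (CR) as a list per record is what later makes it possible to build citation-based networks from the same file.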
ISI files are loosely based on the RIS file format, and data in this format can be used for the following types of analyses:
4.2.1.4 Scopus
Elsevier's Scopus, like ISI Web of Science, has an extensive catalog of citations and abstracts from journals and conferences. Subscribers to Scopus can access the service via http://www.scopus.com.
At the output window, select 'Comma separated file, .csv' (e.g. Excel) and 'Complete format' from the drop-down menus and choose 'Export'. Save the file as WattsStrogatz.scopus. The resulting file can be seen in Figure 4.5.
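Because the exported file is ordinary comma-separated text, it can be given a quick sanity check before loading it into the Sci2 Tool. The sketch below counts records per value of one column using Python's standard csv module. The column names 'Authors', 'Title', and 'Year' in the sample are assumptions about the export header, and the function name count_by_column is hypothetical; check the first line of your own file for the actual names.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count how many rows share each value of the given column.

    Rows where the column is missing or empty are skipped.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader if row.get(column))


# Hypothetical sample data with assumed column names, for illustration only.
sample = io.StringIO(
    'Authors,Title,Year\n'
    '"Doe J., Roe J.",A hypothetical title,1998\n'
    '"Other A.",Another hypothetical title,1998\n'
    '"Doe J.",A third hypothetical title,2001\n'
)

counts = count_by_column(sample, "Year")
for year, n in sorted(counts.items()):
    print(year, n)
```

To run this on a real export, replace the io.StringIO sample with an opened file, for example open('WattsStrogatz.scopus', newline=''); the appropriate text encoding depends on how the file was exported.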
Data in Scopus files can be used for the following types of analyses:
4.2.1.5 Google Scholar
Google Scholar data can be acquired using Publish or Perish (Harzing, 2008), which can be freely downloaded from http://www.harzing.com/pop.htm. A query for papers by Albert-László Barabási run on Sept. 21, 2008 returned 111 papers that have been cited 14,343 times (see Figure 4.6).
To save records, select 'File > Save' from the menu and then choose the appropriate file format (*.csv, *.enl, or *.bib) in the 'Choose File' pop-up window. All three file formats can be read by the Sci2 Tool. The result in all three formats, named 'barabasi.', is also available in the respective subdirectories of 'yoursci2directory/sampledata/scientometrics/' and will be used later in this tutorial.
4.2.2 Datasets: Funding
4.2.2.1 NSF Award Search
Funding data provided by the National Science Foundation (NSF) can be retrieved via the Award Search site (http://www.nsf.gov/awardsearch). Search by PI name, institution, and many other fields, see Figure 4.7.
To retrieve all projects funded under the Science of Science and Innovation Policy (SciSIP) program, select the 'Program Information' tab, do an 'Element Code Lookup', enter '7626' into the 'Element Code' field, and click the 'Search' button. On Sept 21st, 2008, exactly 50 awards were found. Award records can be downloaded in csv, Excel, or XML format. Save the file in csv format, and change the file extension from .csv to .nsf. A sample .nsf file is available in 'yoursci2directory/sampledata/scientometrics/nsf/BethPlale.nsf'. In the Sci2 Tool, load the file using 'File > Load File'. Select "NSF csv format" in the "Load" pop-up window. A table with all records will appear in the Data Manager, from where the file can be viewed in Excel.
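The extension change described above can also be scripted, which is convenient when several award downloads need to be prepared. The following Python sketch copies a downloaded csv file to a sibling file with the .nsf extension, leaving the original in place. The function name copy_as_nsf, the file name awards.csv, and the two column names in the demo file are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_as_nsf(csv_path):
    """Copy a downloaded .csv file to the same location with a .nsf extension."""
    src = Path(csv_path)
    dst = src.with_suffix(".nsf")
    shutil.copyfile(src, dst)
    return dst


# Demo in a temporary directory with hypothetical file content.
with tempfile.TemporaryDirectory() as tmp:
    download = Path(tmp) / "awards.csv"
    download.write_text("Award Number,Title\n0000000,Hypothetical award\n")
    nsf_file = copy_as_nsf(download)
    suffix = nsf_file.suffix
    same_content = nsf_file.read_text() == download.read_text()
    print(nsf_file.name, "created; content unchanged:", same_content)
```

Copying rather than renaming keeps the original download available in case the file also needs to be opened as a regular csv.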
4.2.2.2 NIH RePORTER
Funding data provided by the National Institutes of Health (NIH), and associated publications and patents, can be retrieved via the NIH RePORTER site (http://projectreporter.nih.gov/reporter.cfm). The database draws from eRA, Medline, PubMed Central, NIH Intramural, and iEdison. Search by location, PI name, category, etc., see Figure 4.8.
A sample search of "Epidemic" in the 'Public Health Relevance' field displays 205 results as of November 11th, 2009. Up to 500 results can be exported into csv or Excel format using the "Export" button at the top of the page. Save the file as a .csv and load it into the Sci2 Tool using 'File > Load File' to perform temporal or topical analyses.
4.2.3 Datasets: Scholarly Database
Medline, U.S. patent, and funding data provided by the National Science Foundation and the National Institutes of Health can be downloaded from the Scholarly Database (SDB) at Indiana University. SDB supports keyword-based cross-search of the different data types, and data can be downloaded in bulk; see Figures 4.10 and 4.11 for interface snapshots. Register to get a free account or use 'Email: nwb@indiana.edu' and 'Password: nwb' to try out the functionality.
Results are displayed in sets of 20 records, ordered by a Solr internal matching score. The first column represents the record source, the second the creators, third comes the year, then title and finally the matching score. Datasets can be downloaded in different subsets and formats for future analysis.
Data from the SDB can be used in a great number of ways. The following is an abridged list of suggested uses:
Document generated by Confluence on May 31, 2011 15:16