Erasmus University Rotterdam (EUR)

Files (32 in total; the first 20 are listed here with their deposited sizes):

  • README.pdf (63.5 kB)
  • Variable_Description.ods (7.37 kB)
  • FAQ.docx (34.59 kB)
  • Compustat_2021.xlsx (3.97 MB)
  • 01 Collect frontpages.py (23.84 kB)
  • URLs_1_deeper.csv (398.19 MB)
  • 02 Collect further pages.py (14.34 kB)
  • scrapedURLs.csv (427.84 MB)
  • HTML.zip (48.63 GB)
  • 03 Convert HTML to plaintext.py (10.66 kB)
  • TXT_uncleaned.zip (4.01 GB)
  • input_categorization_allpages.csv (486.17 MB)
  • 04 GPT application.py (8.89 kB)
  • categorization_applied.csv (354.5 MB)
  • exclusion_list.xlsx (325.55 kB)
  • 05 Clean and select.py (15.78 kB)
  • metadata.csv (255.73 MB)
  • TXT_cleaned.zip (1.89 GB)
  • TXT_combined.zip (1.08 GB)
  • 06 Topic model.R (3.49 kB)

CompuCrawl: Full database and code

online resource
posted on 2024-11-20, 12:08, authored by Richard Haans and Marc Mertens


This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020—representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.

The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write the scraped HTML files to a dedicated folder. "HTML.zip" is that folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore listed below it.
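
As an indication of what code files 01 and 02 do, the sketch below fetches a front page and collects same-domain links one level deeper, writing them to a small CSV. It assumes the requests and BeautifulSoup packages and a plain live-web fetch; the function and file names are illustrative only, and the actual scripts additionally maintain the tracking files described above and handle many more edge cases.

    # Illustrative sketch of the front-page scraping step (code files 01/02).
    # Function and file names are hypothetical; the real scripts keep far more state.
    import csv
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def collect_deeper_urls(front_url: str, timeout: int = 10) -> list[str]:
        """Fetch a front page and return same-domain links one level deeper."""
        response = requests.get(front_url, timeout=timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        domain = urlparse(front_url).netloc
        deeper = {
            urljoin(front_url, a["href"])
            for a in soup.find_all("a", href=True)
            if urlparse(urljoin(front_url, a["href"])).netloc == domain
        }
        deeper.discard(front_url)
        return sorted(deeper)

    if __name__ == "__main__":
        # Mirrors the role of "URLs_1_deeper.csv" for a single example domain.
        with open("URLs_1_deeper_sample.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["url"])
            writer.writerows([u] for u in collect_deeper_urls("https://www.example.com"))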

The full set of files, in order of use, is as follows:

  • Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
  • 01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
  • URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
  • 02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
  • scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
  • HTML.zip: Archived version of the set of individual HTML files.
  • 03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext (sketched after this list).
  • TXT_uncleaned.zip: Archived version of the converted but not yet cleaned plaintext files.
  • input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
  • 04 GPT application.py: Python script using OpenAI’s API to classify selected pages according to their HTML title and URL (sketched after this list).
  • categorization_applied.csv: Output file containing classification of selected pages.
  • exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
  • 05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year (sketched after this list).
  • metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
  • TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
  • TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
  • 06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
  • TM_125.RData: RData file containing the results of the 125-topic model.
  • loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
  • 125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
  • 07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and then aligns them so that word embeddings can be compared across time periods (sketched after this list).
  • Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
  • 08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms “sustainability” and “profitability” over time.
  • 99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
  • URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.
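
To make the conversion step more concrete, the following is a simplified sketch of what code file 03 does: stripping markup from an HTML page and normalising whitespace. It assumes BeautifulSoup and reads directly from "HTML.zip"; the function names are illustrative and the real script applies additional checks.

    # Simplified sketch of the HTML-to-plaintext step in code file 03.
    import zipfile
    from bs4 import BeautifulSoup

    def html_to_text(html: str) -> str:
        """Strip markup, scripts, and styles; return whitespace-normalised text."""
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()
        return " ".join(soup.get_text(separator=" ").split())

    with zipfile.ZipFile("HTML.zip") as archive:
        for name in archive.namelist():
            if name.lower().endswith((".html", ".htm")):
                raw = archive.read(name).decode("utf-8", errors="ignore")
                out_name = name.rsplit("/", 1)[-1] + ".txt"
                with open(out_name, "w", encoding="utf-8") as out:
                    out.write(html_to_text(raw))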
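
Code file 04 sends page titles and URLs to OpenAI’s API for categorization. The sketch below shows the general shape of such a call with the openai Python package (v1-style client); the prompt wording, category labels, and model name are placeholders rather than the script's actual settings.

    # Hypothetical sketch of the GPT-based page categorization in code file 04.
    # Prompt, labels, and model name are placeholders, not the actual settings.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def categorize_page(title: str, url: str) -> str:
        """Ask the model for a single category label for one page."""
        prompt = (
            "Classify the webpage below into one category "
            "(for example: products, about, news, careers, other).\n"
            f"Title: {title}\nURL: {url}\nCategory:"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content.strip()

    print(categorize_page("Our Products | Example Corp", "https://www.example.com/products"))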
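
The combining step in code file 05 reduces the page-level data to one observation per GVKEY/year. A minimal pandas sketch of that aggregation is given below; the column names are assumptions about the intermediate data layout, not the script's actual variables.

    # Minimal sketch of the combine step in code file 05: concatenate all
    # selected pages of a firm into a single text per GVKEY/year.
    import pandas as pd

    pages = pd.DataFrame(
        {
            "gvkey": ["001234", "001234", "005678"],
            "year": [2015, 2015, 2015],
            "text": ["front page text", "about page text", "front page text"],
        }
    )

    combined = (
        pages.groupby(["gvkey", "year"])["text"]
        .agg(" ".join)  # one combined document per firm/year
        .reset_index()
    )
    print(combined)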
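
Code files 07 and 08 train Word2Vec models per period and align them so that embeddings are comparable over time. The sketch below illustrates the general idea with gensim and an orthogonal Procrustes rotation on a toy corpus; corpus handling, hyperparameters, and vocabulary intersection are heavily simplified relative to the actual scripts.

    # Rough sketch of Word2Vec training and alignment (code files 07 and 08),
    # using gensim and an orthogonal Procrustes rotation on a toy corpus.
    import numpy as np
    from gensim.models import Word2Vec

    period_a = [["we", "focus", "on", "profitability"], ["annual", "report"]]
    period_b = [["we", "focus", "on", "sustainability"], ["annual", "report"]]

    model_a = Word2Vec(sentences=period_a, vector_size=50, min_count=1, seed=1)
    model_b = Word2Vec(sentences=period_b, vector_size=50, min_count=1, seed=1)

    # Align model_b onto model_a over the shared vocabulary.
    shared = [w for w in model_a.wv.index_to_key if w in model_b.wv.key_to_index]
    A = np.vstack([model_a.wv[w] for w in shared])
    B = np.vstack([model_b.wv[w] for w in shared])
    u, _, vt = np.linalg.svd(B.T @ A)
    model_b.wv.vectors = model_b.wv.vectors @ (u @ vt)  # rotate every vector in model_b

    # After alignment, the same word's vectors can be compared across periods.
    print(float(np.dot(model_a.wv["report"], model_b.wv["report"])))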

For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.
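
For those working directly with these archives, the following example reads documents straight from "TXT_combined.zip" with Python's standard zipfile module, without extracting the full archive; the exact file naming inside the archive is an assumption and may differ.

    # Inspect the combined firm/year texts inside TXT_combined.zip.
    import zipfile

    with zipfile.ZipFile("TXT_combined.zip") as archive:
        txt_names = [n for n in archive.namelist() if n.lower().endswith(".txt")]
        print(f"{len(txt_names)} combined firm/year documents")
        # Read one document in memory rather than extracting the ~1 GB archive.
        first_doc = archive.read(txt_names[0]).decode("utf-8", errors="ignore")
        print(first_doc[:500])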

The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found at https://haans-mertens.github.io/ and in the accompanying article “The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data” by Richard F.J. Haans and Marc J. Mertens, published in Organizational Research Methods. The full paper can be accessed here.


Encoding format

  • CSV
  • HTML
  • Python
  • R code
  • TXT
  • PDF (not preferred)
  • DOCX (not preferred)
  • XLSX (not preferred)

Content size

62.8 GB

Conditions of access

  • Open access

Language

English

Temporal coverage

1996/2020

Spatial coverage

North America

Does your data contain sensitive data

  • No
