Erasmus University Rotterdam (EUR)

Files (32 in total; the first 20 are listed here with their deposited sizes):

  • README.pdf (63.5 kB)
  • Variable_Description.ods (7.37 kB)
  • FAQ.docx (34.59 kB)
  • Compustat_2021.xlsx (3.97 MB)
  • 01 Collect frontpages.py (23.84 kB)
  • URLs_1_deeper.csv (398.19 MB)
  • 02 Collect further pages.py (14.34 kB)
  • scrapedURLs.csv (427.84 MB)
  • HTML.zip (48.63 GB)
  • 03 Convert HTML to plaintext.py (10.66 kB)
  • TXT_uncleaned.zip (4.01 GB)
  • input_categorization_allpages.csv (486.17 MB)
  • 04 GPT application.py (8.89 kB)
  • categorization_applied.csv (354.5 MB)
  • exclusion_list.xlsx (325.55 kB)
  • 05 Clean and select.py (15.78 kB)
  • metadata.csv (255.73 MB)
  • TXT_cleaned.zip (1.89 GB)
  • TXT_combined.zip (1.08 GB)
  • 06 Topic model.R (3.49 kB)

CompuCrawl: Full database and code

online resource
posted on 2024-11-20, 12:08, authored by Richard Haans and Marc Mertens


This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020—representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.

The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write the scraped HTML files to a dedicated folder. "HTML.zip" is that folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore listed below it.
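
As an indication of what code files 01 and 02 do, the sketch below fetches a front page and collects same-domain links one level deeper, writing them to a small CSV. It assumes the requests and BeautifulSoup packages and a plain live-web fetch; the function and file names are illustrative only, and the actual scripts additionally maintain the tracking files described above and handle many more edge cases.

    # Illustrative sketch of the front-page scraping step (code files 01/02).
    # Function and file names are hypothetical; the real scripts keep far more state.
    import csv
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def collect_deeper_urls(front_url: str, timeout: int = 10) -> list[str]:
        """Fetch a front page and return same-domain links one level deeper."""
        response = requests.get(front_url, timeout=timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        domain = urlparse(front_url).netloc
        deeper = {
            urljoin(front_url, a["href"])
            for a in soup.find_all("a", href=True)
            if urlparse(urljoin(front_url, a["href"])).netloc == domain
        }
        deeper.discard(front_url)
        return sorted(deeper)

    if __name__ == "__main__":
        # Mirrors the role of "URLs_1_deeper.csv" for a single example domain.
        with open("URLs_1_deeper_sample.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["url"])
            writer.writerows([u] for u in collect_deeper_urls("https://www.example.com"))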

The full set of files, in order of use, is as follows:

  • Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
  • 01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
  • URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
  • 02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
  • scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
  • HTML.zip: Archived version of the set of individual HTML files.
  • 03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext (sketched after this list).
  • TXT_uncleaned.zip: Archived version of the converted but not yet cleaned plaintext files.
  • input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
  • 04 GPT application.py: Python script using OpenAI’s API to classify selected pages according to their HTML title and URL (sketched after this list).
  • categorization_applied.csv: Output file containing classification of selected pages.
  • exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
  • 05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year (sketched after this list).
  • metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
  • TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
  • TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
  • 06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
  • TM_125.RData: RData file containing the results of the 125-topic model.
  • loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
  • 125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
  • 07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and then aligns them so that word embeddings can be compared across time periods (sketched after this list).
  • Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
  • 08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms “sustainability” and “profitability” over time.
  • 99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
  • URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.
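
To make the conversion step more concrete, the following is a simplified sketch of what code file 03 does: stripping markup from an HTML page and normalising whitespace. It assumes BeautifulSoup and reads directly from "HTML.zip"; the function names are illustrative and the real script applies additional checks.

    # Simplified sketch of the HTML-to-plaintext step in code file 03.
    import zipfile
    from bs4 import BeautifulSoup

    def html_to_text(html: str) -> str:
        """Strip markup, scripts, and styles; return whitespace-normalised text."""
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()
        return " ".join(soup.get_text(separator=" ").split())

    with zipfile.ZipFile("HTML.zip") as archive:
        for name in archive.namelist():
            if name.lower().endswith((".html", ".htm")):
                raw = archive.read(name).decode("utf-8", errors="ignore")
                out_name = name.rsplit("/", 1)[-1] + ".txt"
                with open(out_name, "w", encoding="utf-8") as out:
                    out.write(html_to_text(raw))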
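
Code file 04 sends page titles and URLs to OpenAI’s API for categorization. The sketch below shows the general shape of such a call with the openai Python package (v1-style client); the prompt wording, category labels, and model name are placeholders rather than the script's actual settings.

    # Hypothetical sketch of the GPT-based page categorization in code file 04.
    # Prompt, labels, and model name are placeholders, not the actual settings.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def categorize_page(title: str, url: str) -> str:
        """Ask the model for a single category label for one page."""
        prompt = (
            "Classify the webpage below into one category "
            "(for example: products, about, news, careers, other).\n"
            f"Title: {title}\nURL: {url}\nCategory:"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content.strip()

    print(categorize_page("Our Products | Example Corp", "https://www.example.com/products"))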
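
The combining step in code file 05 reduces the page-level data to one observation per GVKEY/year. A minimal pandas sketch of that aggregation is given below; the column names are assumptions about the intermediate data layout, not the script's actual variables.

    # Minimal sketch of the combine step in code file 05: concatenate all
    # selected pages of a firm into a single text per GVKEY/year.
    import pandas as pd

    pages = pd.DataFrame(
        {
            "gvkey": ["001234", "001234", "005678"],
            "year": [2015, 2015, 2015],
            "text": ["front page text", "about page text", "front page text"],
        }
    )

    combined = (
        pages.groupby(["gvkey", "year"])["text"]
        .agg(" ".join)  # one combined document per firm/year
        .reset_index()
    )
    print(combined)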
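
Code files 07 and 08 train Word2Vec models per period and align them so that embeddings are comparable over time. The sketch below illustrates the general idea with gensim and an orthogonal Procrustes rotation on a toy corpus; corpus handling, hyperparameters, and vocabulary intersection are heavily simplified relative to the actual scripts.

    # Rough sketch of Word2Vec training and alignment (code files 07 and 08),
    # using gensim and an orthogonal Procrustes rotation on a toy corpus.
    import numpy as np
    from gensim.models import Word2Vec

    period_a = [["we", "focus", "on", "profitability"], ["annual", "report"]]
    period_b = [["we", "focus", "on", "sustainability"], ["annual", "report"]]

    model_a = Word2Vec(sentences=period_a, vector_size=50, min_count=1, seed=1)
    model_b = Word2Vec(sentences=period_b, vector_size=50, min_count=1, seed=1)

    # Align model_b onto model_a over the shared vocabulary.
    shared = [w for w in model_a.wv.index_to_key if w in model_b.wv.key_to_index]
    A = np.vstack([model_a.wv[w] for w in shared])
    B = np.vstack([model_b.wv[w] for w in shared])
    u, _, vt = np.linalg.svd(B.T @ A)
    model_b.wv.vectors = model_b.wv.vectors @ (u @ vt)  # rotate every vector in model_b

    # After alignment, the same word's vectors can be compared across periods.
    print(float(np.dot(model_a.wv["report"], model_b.wv["report"])))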

For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.
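
For those working directly with these archives, the following example reads documents straight from "TXT_combined.zip" with Python's standard zipfile module, without extracting the full archive; the exact file naming inside the archive is an assumption and may differ.

    # Inspect the combined firm/year texts inside TXT_combined.zip.
    import zipfile

    with zipfile.ZipFile("TXT_combined.zip") as archive:
        txt_names = [n for n in archive.namelist() if n.lower().endswith(".txt")]
        print(f"{len(txt_names)} combined firm/year documents")
        # Read one document in memory rather than extracting the ~1 GB archive.
        first_doc = archive.read(txt_names[0]).decode("utf-8", errors="ignore")
        print(first_doc[:500])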

The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found at https://haans-mertens.github.io/ and in the accompanying article “The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data” by Richard F.J. Haans and Marc J. Mertens, published in Organizational Research Methods. The full paper can be accessed here.


Encoding format

  • CSV
  • HTML
  • Python
  • R code
  • TXT
  • PDF (not preferred)
  • DOCX (not preferred)
  • XLSX (not preferred)

Content size

62.8 GB

Conditions of access

  • Open access

Language

English

Temporal coverage

1996/2020

Spatial coverage

North America

Does your data contain sensitive data

  • No
