About the Dataset
Introduction
The TechText dataset provides a novel measure of firm-level technological associations by applying positive and unlabeled (PU) machine learning to patent descriptions and business descriptions. Unlike traditional patent datasets, which record only patent ownership, TechText estimates the usefulness of each patent to each firm. This reveals, for the first time, the technological profiles of non-patenting firms, which comprise over 85% of public companies. The dataset offers an unprecedented combination of scale (all U.S. public firms), scope (thousands of technology categories), span (nearly three decades), and specificity (continuous usefulness probabilities for each firm-patent pair).
Methodology
We construct usefulness probabilities using a two-stage approach. First, we measure textual similarities between patent descriptions and firm business descriptions using natural language processing techniques, including term frequency-inverse document frequency (TF-IDF) and modern transformer-based models (Sentence-BERT). Second, we apply positive and unlabeled machine learning to estimate the probability that each patent would be useful to each firm. This approach treats patent ownership as a positive signal of usefulness without assuming patents are useless to firms that do not own them: firm-patent pairs without ownership are treated as unlabeled rather than as negative examples. The resulting usefulness probabilities enable researchers to study technological spillovers, innovation diffusion, and the technological positioning of all firms, not just those that patent.
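The two stages can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the toy firm and patent texts, the whitespace tokenizer, the smoothed IDF formula, the single-feature logistic regression, and in particular the choice of the Elkan-Noto correction as the positive and unlabeled estimator (the description above does not say which estimator TechText uses). Stage 1 scores each firm-patent pair by TF-IDF cosine similarity. Stage 2 fits a classifier for "is this pair owned?" and rescales its output by the average score among owned pairs to approximate "is this patent useful?".

```python
import math
from collections import Counter

# Toy inputs, invented for illustration only (not taken from TechText).
firms = {
    "battery_maker": "we design lithium battery cells and battery packs for electric vehicles",
    "drug_developer": "we develop antibody therapies for cancer and run clinical trials",
    "ev_startup": "we build electric vehicles and charging stations",  # owns no patents
}
patents = {
    "P1": "lithium battery cell with improved cathode for electric vehicles",
    "P2": "antibody composition for treating cancer",
    "P3": "fast charging method for battery packs in electric vehicles",
}
owned = {("battery_maker", "P1"), ("battery_maker", "P3"), ("drug_developer", "P2")}


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            term: (n / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, n in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Stage 1: textual similarity for every firm-patent pair.
names = list(firms) + list(patents)
vectors = dict(zip(names, tfidf_vectors([*firms.values(), *patents.values()])))
pairs = [(f, p) for f in firms for p in patents]
similarity = {(f, p): cosine(vectors[f], vectors[p]) for f, p in pairs}

# Stage 2a: logistic regression for P(owned | similarity), fitted by gradient descent.
# Owned pairs are positives; all other pairs are unlabeled, not negatives.
labels = {pair: 1.0 if pair in owned else 0.0 for pair in pairs}
w, b = 0.0, 0.0
for _ in range(5000):
    grad_w = grad_b = 0.0
    for pair in pairs:
        err = sigmoid(w * similarity[pair] + b) - labels[pair]
        grad_w += err * similarity[pair]
        grad_b += err
    w -= 1.0 * grad_w / len(pairs)
    b -= 1.0 * grad_b / len(pairs)
labeled_prob = {pair: sigmoid(w * similarity[pair] + b) for pair in pairs}

# Stage 2b: Elkan-Noto correction (assumed estimator). c estimates
# P(owned | useful); a held-out set of owned pairs would normally be used here.
c = sum(labeled_prob[pair] for pair in owned) / len(owned)
usefulness = {pair: min(1.0, labeled_prob[pair] / c) for pair in pairs}

# Rank patents for the non-patenting firm by estimated usefulness.
ranking = sorted(patents, key=lambda p: usefulness[("ev_startup", p)], reverse=True)
```

In this toy setup the firm `ev_startup` owns no patents, yet it still receives a usefulness score for every patent, which is the property that lets non-patenting firms be profiled. A real pipeline would use far richer features (for example Sentence-BERT embeddings) and a held-out sample of owned pairs to estimate `c`.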
Data Coverage
The dataset combines two primary sources: business descriptions extracted from SEC annual reports (Forms 10-K and 20-F) and patent descriptions from the USPTO PatentsView database. Coverage spans 1996 to 2025. It encompasses the universe of public firms in the United States, around 8,000 unique firms per year on average, and a sample of 50,000 utility patents per year, representing over 20% of all USPTO utility patents granted. We plan to expand the sample to 100% of USPTO patent grants. Patent classifications follow the Cooperative Patent Classification (CPC) system, providing standardized technology categories at multiple levels of granularity.
Citation Guidelines
When using this dataset in academic work, please cite:
Gorrin, J. and Mullen, R. (2025). Beyond Patent Ownership: Learning About Technological Usefulness. Working Paper. Available at: rorymullen.net.
Documentation Downloads
Dataset Documentation
Comprehensive guide including data dictionary, variable definitions, and technical specifications
Contact Form
Have questions about the dataset or need assistance? This form is the best way to get in touch. Please note that as a small academic project, responses typically take 5-7 business days.