Documentation | TechText Platform

About the Dataset

Introduction

The TechText dataset provides a novel measure of firm-level technological associations by applying positive and unlabeled machine learning to patent descriptions and business descriptions. Unlike traditional patent datasets that focus only on patent ownership, TechText estimates the usefulness of each patent to each firm, revealing for the first time the technological profiles of non-patenting firms—which comprise over 85% of public companies. The dataset offers an unprecedented combination of scale (all U.S. public firms), scope (thousands of technology categories), span (nearly three decades), and specificity (continuous usefulness probabilities for each firm-patent pair).

Methodology

We construct usefulness probabilities using a two-stage approach. First, we measure textual similarities between patent descriptions and firm business descriptions using natural language processing techniques, including term frequency-inverse document frequency (TF-IDF) and modern transformer-based models (Sentence-BERT). Second, we apply positive and unlabeled machine learning to estimate the probability that each patent would be useful to each firm. This approach treats patent ownership as a positive signal of usefulness without assuming patents are useless to firms that do not own them. The resulting usefulness probabilities enable researchers to study technological spillovers, innovation diffusion, and the technological positioning of all firms, not just those that patent.
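The two stages above can be illustrated with a minimal sketch. It is not the TechText implementation: the toy texts, the function names (tfidf_vectors, cosine, pu_adjust), and the raw scores are hypothetical. The source does not say which positive and unlabeled estimator is used. The second stage here uses the Elkan-Noto correction as one well-known option, which assumes that owned patents are a random sample of the useful ones.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so a term found in every document keeps a positive weight.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pu_adjust(scores, owned):
    """Elkan-Noto style correction (an assumed choice, not confirmed by the source).

    scores[i] is a classifier's estimate of P(firm owns patent | pair i).
    owned[i] marks the labeled positives. The constant c = P(owned | useful)
    is estimated as the mean score over owned pairs, and each score is
    rescaled to an estimate of P(useful | pair i), capped at 1.
    """
    labeled = [s for s, is_owned in zip(scores, owned) if is_owned]
    if not labeled:
        raise ValueError("need at least one owned (positive) pair")
    c = sum(labeled) / len(labeled)
    return [min(1.0, s / c) for s in scores]


# Stage 1: text similarity between toy patent texts and a toy firm description.
patents = [
    "lithium battery electrode coating",
    "neural network image classification",
]
firm = "we manufacture lithium battery cells for electric vehicles"
vecs = tfidf_vectors(patents + [firm])
sims = [cosine(vecs[i], vecs[-1]) for i in range(len(patents))]

# Stage 2: rescale hypothetical ownership scores into usefulness probabilities.
raw_scores = [0.4, 0.2, 0.1]
owned = [True, False, False]
usefulness = pu_adjust(raw_scores, owned)
```

In this toy example the battery patent scores higher than the image classification patent for the battery manufacturer, and the unowned pairs receive a usefulness probability above their raw ownership score rather than zero, which mirrors the point that non-ownership is not treated as evidence of uselessness.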

Data Coverage

The dataset combines two primary sources: business descriptions extracted from SEC annual reports (Forms 10-K and 20-F) and patent descriptions from the USPTO PatentsView database. Coverage spans 1997 to 2022. It encompasses the universe of public firms in the United States, around 8,000 unique firms per year on average, and a sample of 50,000 utility patents per year (over 20% of all USPTO utility patents granted), with plans to expand to 100% of USPTO patent grants. Patent classifications follow the Cooperative Patent Classification (CPC) system, providing standardized technology categories at multiple levels of granularity.

Citation Guidelines

When using this dataset in academic work, please cite:

Gorrin, J. and Mullen, R. (2025). Beyond Patent Ownership: Learning About Technological Usefulness. Working Paper.

Documentation Downloads

Dataset Documentation

Comprehensive guide including data dictionary, variable definitions, and technical specifications

2.4 MB | Last updated: December 2024

Frequently Asked Questions

General Questions

Technical Questions

Usage Questions

Contact Form

Have questions about the dataset or need assistance? This form is the best way to get in touch. Because this is a small academic project, responses typically take 5-7 business days.