Tags: text, literature, books, public-domain, gutenberg, NLP, digital-humanities, genre-classification, multilingual, library-of-congress, bibliometrics, author-metadata, literary-history

Public Domain Books Catalog — 75,000+ Literary Works (1971–2025)

Category: Text
Records: 75,545 rows
Format: CSV
Update Frequency: One-time snapshot
Collection Method: Uploaded
PII: None detected
Downloads: 3

About this data

Cross-national catalog of 75,545 public domain literary works from Project Gutenberg, enriched with genre classification, literary era mapping, and Library of Congress subject area categorization. Covers works in 58+ languages from ancient texts to early 20th-century literature.

**Sources:**
- Project Gutenberg digital library catalog (primary metadata: titles, authors, dates, subjects, Library of Congress Classification)
- Library of Congress Classification scheme (subject area mapping)
- Literary period taxonomy (era classification from Medieval through Contemporary)
- Custom NLP-derived genre classification across 20+ categories

**Schema (23 columns):**
- `gutenberg_id`: Unique Project Gutenberg text identifier
- `title`: Full title of the work
- `author`: Primary author name (normalized to "First Last" format)
- `author_birth_year` / `author_death_year`: Author life dates
- `num_authors`: Number of credited authors
- `language_code`: ISO language code
- `language`: Full language name
- `issued_date`: Date digitized/added to Project Gutenberg
- `primary_subject`: Primary subject heading
- `subject_count`: Total number of subject headings
- `locc_classification`: Library of Congress Classification code(s)
- `locc_area`: Mapped LoCC broad subject area
- `genre`: Derived genre (Fiction, Poetry, History, Science Fiction, Mystery, etc.)
- `literary_era`: Estimated literary period (Medieval, Renaissance, Romantic, Victorian, Modern, Contemporary)
- `bookshelf`: Project Gutenberg bookshelf category
- `source`: Data source identifier
- `url`: Direct link to the work
- `license`: License type (all Public Domain)
- `title_word_count`: Number of words in title
- `has_author`: Whether author is known (1/0)
- `is_english`: English language flag (1/0)
- `has_classification`: Has LoCC classification (1/0)

**Coverage:** 75,545 unique works across 58+ languages: 60K+ English works plus significant French (4K), Finnish (3.5K), German (2.3K), and 50+ other language collections. Literary eras span from Ancient/Medieval through Contemporary.

**Use cases:** Literary analysis, NLP training data catalogs, bibliometric research, digital humanities, author network analysis, genre classification benchmarking, language diversity studies, cultural heritage research.
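
The coverage figures above could be re-derived from the CSV with a few lines of standard-library Python. The sketch below is illustrative only: the rows in `SAMPLE_CSV` are placeholders invented for this example (not records from the dataset), the file name `catalog.csv` is an assumption, and only the column names `language`, `genre`, and `is_english` come from the schema.

```python
import csv
import io
from collections import Counter

# Placeholder rows for illustration only; these are NOT records from the dataset.
SAMPLE_CSV = """gutenberg_id,title,language,genre,is_english
1,Example Title A,English,Fiction,1
2,Example Title B,English,Poetry,1
3,Example Title C,French,Fiction,0
4,Example Title D,Finnish,History,0
"""


def count_by(rows, column):
    """Count rows per distinct value of `column`, treating blanks as 'Unknown'."""
    return Counter(
        (row.get(column) or "").strip() or "Unknown" for row in rows
    )


def english_share(rows):
    """Fraction of rows whose `is_english` flag is the string '1'."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get("is_english") == "1") / len(rows)


# With the real file, replace io.StringIO(SAMPLE_CSV) with
# open("catalog.csv", newline="", encoding="utf-8")  (file name is an assumption).
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

language_counts = count_by(rows, "language")
genre_counts = count_by(rows, "genre")

print(language_counts.most_common())
print(genre_counts.most_common())
print(f"English share: {english_share(rows):.0%}")
```

The same `count_by` helper works for `literary_era` or `locc_area`, which is one way to sanity-check the era and subject-area mappings before relying on them.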

Schema

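| Name | Type | Description |
|---|---|---|
| gutenberg_id | string | |
| title | string | |
| author | string | |
| author_birth_year | string | |
| author_death_year | string | |
| num_authors | string | |
| language_code | string | |
| language | string | |
| issued_date | string | |
| primary_subject | string | |
| subject_count | string | |
| locc_classification | string | |
| locc_area | string | |
| genre | string | |
| literary_era | string | |
| bookshelf | string | |
| source | string | |
| url | string | |
| license | string | |
| title_word_count | string | |
| has_author | string | |
| is_english | string | |
| has_classification | string | |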
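
Because every column is typed as string, the year, count, and flag fields need converting before any numeric analysis. The following is a minimal sketch of one way to do that, not a prescribed loader: the sample record is a placeholder, and treating empty strings as missing values is an assumption about how gaps are encoded in the file.

```python
from typing import Optional

# Columns that the schema describes as years or counts.
INT_COLUMNS = (
    "author_birth_year", "author_death_year", "num_authors",
    "subject_count", "title_word_count",
)
# Columns that the schema describes as 1/0 flags.
FLAG_COLUMNS = ("has_author", "is_english", "has_classification")


def to_int(value: Optional[str]) -> Optional[int]:
    """Parse an integer-like string; return None for blanks or unparsable text."""
    if value is None or not value.strip():
        return None
    try:
        # Go through float() so that values such as "1812.0" also parse.
        return int(float(value))
    except (ValueError, OverflowError):
        return None


def coerce_record(record: dict) -> dict:
    """Return a copy of `record` with numeric and flag columns converted."""
    typed = dict(record)
    for col in INT_COLUMNS:
        if col in typed:
            typed[col] = to_int(typed[col])
    for col in FLAG_COLUMNS:
        if col in typed:
            typed[col] = to_int(typed[col]) == 1
    return typed


# Placeholder record for illustration only; not taken from the dataset.
record = {
    "gutenberg_id": "1",
    "title": "Example Title A",
    "author_birth_year": "1800",
    "author_death_year": "",
    "num_authors": "1",
    "title_word_count": "3",
    "has_author": "1",
    "is_english": "0",
}
typed = coerce_record(record)
print(typed)
```

`gutenberg_id` is left as a string here on purpose, since it is an identifier rather than a quantity.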

Free

Open dataset

Quality: 4.8 / 5
Seller: waseemahmad

For AI Agents

Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "Public Domain Books Catalog — " })
// Found: 9e20a575-5493-47d9-b71b-ad0dc12be01a
get_download_url({ dataset_id: "9e20a575-5493-47d9-b71b-ad0dc12be01a" })  // free — no API key needed

Via REST API
# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/9e20a575-5493-47d9-b71b-ad0dc12be01a/download-url
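
The same request could be issued from Python with the standard library. This is a sketch under stated assumptions: the host, path, and dataset ID are copied from the curl example, while the helper names `download_url_endpoint` and `fetch_download_info` are hypothetical. The response format is not described in this listing, so the body is returned as plain text rather than parsed.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"  # host taken from the curl example
DATASET_ID = "9e20a575-5493-47d9-b71b-ad0dc12be01a"


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the endpoint URL used in the curl example."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_info(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the response body as text.

    The response schema is an unknown here, so no JSON parsing is attempted.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


print(download_url_endpoint(DATASET_ID))
# To perform the request (requires network access):
# print(fetch_download_info(DATASET_ID))
```

Inspecting the raw body first is the cautious choice; once its structure is known, the caller can decide how to extract the actual file link.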