Replace config-driven HtmlScraperPlugin with specific archive classes

Each archive scraper now has its own class with hardcoded URL and parsing
logic; config only carries auto_queue, timeout, and rate_limit_seconds.

- html_scraper: refactor to base class with public shared utilities
  (YEAR_RE, AUTHOR_PREFIX_PAT, cls_inner_texts, img_alts)
- rusneb.py (new): RusnebPlugin extracts year per list item rather than
  globally, eliminating wrong page-level dates
- alib.py (new): AlibPlugin extracts year from within each <p><b> entry
  rather than globally, fixing nonsensical year values
- shpl.py (new): ShplPlugin retains the dead ШПИЛ endpoint with hardcoded
  params; config type updated from html_scraper to shpl
- config: remove config: subsections from rusneb, alib_web, shpl entries;
  update type fields to rusneb, alib_web, shpl respectively
- plugins/__init__.py: register new specific types, remove html_scraper
- tests: use specific plugin classes; assert all CandidateRecord fields
  (source, title, author, year, isbn, publisher) with appropriate
  constraints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
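
The difference between global and per-entry year extraction described for rusneb.py and alib.py can be illustrated with a minimal sketch. It is not the code from this commit: the `YEAR_RE` pattern, `ENTRY_RE`, the helper names, and the sample markup are all assumptions chosen for illustration, and the real parsers may work differently.

```python
import re

# Assumption: a plausible four-digit year pattern; the real YEAR_RE may differ.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")
# Assumption: entries look like <p><b>Author. Title</b> imprint text</p>.
ENTRY_RE = re.compile(r"<p><b>(?P<head>.*?)</b>(?P<rest>.*?)</p>", re.S)


def first_year(text):
    """Return the first year-like token in text, or an empty string."""
    match = YEAR_RE.search(text)
    return match.group(1) if match else ""


def years_global(html):
    """Global approach: one year taken from the whole page, reused for every entry."""
    year = first_year(html)
    return [(m.group("head"), year) for m in ENTRY_RE.finditer(html)]


def years_per_entry(html):
    """Per-entry approach: search for the year only inside each entry's own text."""
    return [
        (m.group("head"), first_year(m.group("rest")))
        for m in ENTRY_RE.finditer(html)
    ]


# Hypothetical page: the banner carries an unrelated year before the entries.
SAMPLE = (
    "<div>Catalogue updated 2024</div>"
    "<p><b>Pushkin A. Poems</b> Moscow, 1937. 320 p.</p>"
    "<p><b>Gogol N. Dead Souls</b> St. Petersburg, 1900. 410 p.</p>"
)

print(years_global(SAMPLE))     # both entries receive the banner year
print(years_per_entry(SAMPLE))  # each entry receives its own imprint year
```

In this sample the global search picks up the banner year for every record, while the per-entry search stays inside each `<p><b>` block, which is the failure mode the commit message attributes to page-level dates.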
@@ -57,28 +57,17 @@ functions:
 
 rusneb:
   name: "НЭБ"
-  type: html_scraper
+  type: rusneb
   auto_queue: true
   rate_limit_seconds: 5
   timeout: 8
-  config:
-    url: "https://rusneb.ru/search/"
-    search_param: q
-    img_alt: true
-    author_class: "search-list__item_subtext"
 
 alib_web:
   name: "Alib (web)"
-  type: html_scraper
+  type: alib_web
   auto_queue: false
   rate_limit_seconds: 5
   timeout: 8
-  config:
-    url: "https://www.alib.ru/find3.php4"
-    search_param: tfind
-    extra_params: {f: "5", s: "0"}
-    encoding: "cp1251"
-    bold_text: true
 
 nlr:
   name: "НЛР"
@@ -91,13 +80,9 @@ functions:
     query_prefix: "title="
 
 shpl:
+  # Endpoint currently returns HTTP 404; retained for future re-enablement.
   name: "ШПИЛ"
-  type: html_scraper
+  type: shpl
   auto_queue: false
   rate_limit_seconds: 5
   timeout: 8
-  config:
-    url: "https://www.shpl.ru/cgi-bin/irbis64/cgiirbis_64.exe"
-    search_param: S21ALL
-    extra_params: {C21COM: S, I21DBN: BIBL, P21DBN: BIBL, S21FMT: briefWebRus, Z21ID: ""}
-    brief_class: "brief"
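
As a rough illustration of the config change, the sketch below shows one way a `type` value could be resolved to a dedicated plugin class while the entry itself carries only `auto_queue`, `timeout`, and `rate_limit_seconds`. The URLs come from the removed `config:` subsections; `PluginSettings`, `PLUGIN_TYPES`, `build_plugin`, and the constructor signature are hypothetical names, and the actual registration in plugins/__init__.py may look different.

```python
from dataclasses import dataclass


@dataclass
class PluginSettings:
    """Hypothetical container for the three settings that remain in config."""
    auto_queue: bool
    timeout: int
    rate_limit_seconds: int


class HtmlScraperPlugin:
    """Assumed shape of the shared base class; subclasses hardcode their URL."""
    url = ""

    def __init__(self, name, settings):
        self.name = name
        self.settings = settings


class RusnebPlugin(HtmlScraperPlugin):
    url = "https://rusneb.ru/search/"


class AlibPlugin(HtmlScraperPlugin):
    url = "https://www.alib.ru/find3.php4"


class ShplPlugin(HtmlScraperPlugin):
    # The commit notes this endpoint currently returns HTTP 404.
    url = "https://www.shpl.ru/cgi-bin/irbis64/cgiirbis_64.exe"


# Hypothetical registry keyed by the config `type` value.
PLUGIN_TYPES = {
    "rusneb": RusnebPlugin,
    "alib_web": AlibPlugin,
    "shpl": ShplPlugin,
}


def build_plugin(entry):
    """Instantiate the class registered for entry['type'] from a config entry."""
    entry = dict(entry)  # copy so the caller's mapping is not mutated
    type_name = entry.pop("type")
    name = entry.pop("name")
    try:
        cls = PLUGIN_TYPES[type_name]
    except KeyError:
        raise ValueError(f"unknown plugin type: {type_name!r}") from None
    return cls(name, PluginSettings(**entry))


plugin = build_plugin({
    "name": "Alib (web)",
    "type": "alib_web",
    "auto_queue": False,
    "rate_limit_seconds": 5,
    "timeout": 8,
})
print(type(plugin).__name__, plugin.url, plugin.settings.timeout)
```

With a registry like this, an entry that still says `type: html_scraper` fails fast with a `ValueError` instead of silently falling back to generic scraping.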