Reference

Documentation for a selection of our system’s common internal tools.

Commands

accessibility

Save the accessibility tree of the provided site.

accessibility [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--verbose

Arguments

HANDLE

Required argument

adstxt

Save the raw ads.txt of the provided site.

adstxt [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--timeout <timeout>
--verbose

Arguments

HANDLE

Required argument

analyze

analyze [OPTIONS] COMMAND [ARGS]...

cli

Analyze the Drudge Report.

drudge-entities

Analyze Drudge entities.

analyze drudge-entities [OPTIONS]

Options

-o, --output-dir <output_dir>

cli

Analyze Lighthouse reports.

lighthouse

Analyze Lighthouse scores.

analyze lighthouse [OPTIONS]

Options

-o, --output-dir <output_dir>

cli

Analyze US right wing sources.

archive

Save assets to an archive.org collection.

archive [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>
--latest

Crosspost to the latest archive.org item

--verbose

Display the upload progress to archive.org

--timeout <timeout>

Arguments

HANDLE

Required argument

batch

Print a batch of sites.

batch [OPTIONS] COMMAND [ARGS]...

sites-by-batch

Print site handles in the provided batch as a JSON list.

batch sites-by-batch [OPTIONS] BATCH

Options

-b, --batches <batches>

Arguments

BATCH

Required argument

sites-by-bundle

Print site handles in the provided bundle as a JSON list.

batch sites-by-bundle [OPTIONS] BUNDLE

Arguments

BUNDLE

Required argument

sites-by-country

Print site handles in the provided country as a JSON list.

batch sites-by-country [OPTIONS] COUNTRY

Arguments

COUNTRY

Required argument

site

site [OPTIONS] COMMAND [ARGS]...

cli

Create a page ranking sites by Lighthouse accessibility score.

accessibility-ranking

Create a page ranking sites by Lighthouse accessibility score.

site accessibility-ranking [OPTIONS]

cli

Create bundle detail pages.

bundle-detail

Create bundle detail pages.

site bundle-detail [OPTIONS]

cli

Create bundle list.

bundle-list

Create bundle list.

site bundle-list [OPTIONS]

cli

Create country detail pages.

country-detail

Create country detail pages.

site country-detail [OPTIONS]

cli

Create country list.

country-list

Create country list.

site country-list [OPTIONS]

cli

Create a page ranking sites by appearance on drudgereport.com.

drudge

Create a page ranking sites by appearance on drudgereport.com.

site drudge [OPTIONS]

cli

Create language detail pages.

language-detail

Create language detail pages.

site language-detail [OPTIONS]

cli

Create language list.

language-list

Create language list.

site language-list [OPTIONS]

cli

Create a page showing all of the latest screenshots.

latest-screenshots

Create a page showing all of the latest screenshots.

site latest-screenshots [OPTIONS]

cli

Create a page tracking AI blockers based on most recently scraped robots.txt files.

openai

Create a page tracking AI blockers based on most recently scraped robots.txt files.

site openai [OPTIONS]

Options

--no-cache

cli

Create a page ranking sites by Lighthouse performance score.

performance-ranking

Create a page ranking sites by Lighthouse performance score.

site performance-ranking [OPTIONS]

cli

Create source detail pages.

site-detail

Create source detail pages.

site site-detail [OPTIONS]

cli

Create source list.

source-list

Create source list.

site source-list [OPTIONS]

cli

Create a status report.

status-report

Create a status report.

site status-report [OPTIONS]

extract

extract [OPTIONS] COMMAND [ARGS]...

cli

Consolidate Internet Archive metadata into CSV files.

consolidate

Consolidate Internet Archive metadata into CSV files.

extract consolidate [OPTIONS]

Options

-o, --output-dir <output_dir>

cli

Download items from our archive.org collection as JSON.

items

Download items from our archive.org collection as JSON.

extract items [OPTIONS] HANDLE

Options

-y, --year <year>
-o, --output-dir <output_dir>

Arguments

HANDLE

Required argument

cli

Download and parse the provided site’s accessibility files.

accessibility

Download and parse the provided site’s accessibility files.

extract accessibility [OPTIONS] HANDLE

Arguments

HANDLE

Required argument

cli

Download and parse the provided site’s hyperlinks files.

cli

Download and parse the provided site’s Lighthouse files.

lighthouse

Download and parse the provided site’s Lighthouse files.

extract lighthouse [OPTIONS]

Options

--site <site>
--country <country>
--language <language>
--bundle <bundle>
--days <days>
-o, --output-path <output_path>

cli

Download and parse the provided site’s robots.txt files.

robotstxt

Download and parse archived robots.txt files.

extract robotstxt [OPTIONS]

Options

--site <site>
--country <country>
--language <language>
--bundle <bundle>
--days <days>
--latest
-o, --output-path <output_path>
--no-cache
--verbose

cli

Download and parse the provided site’s Wayback Machine files.

wayback

Download and parse the provided site’s Wayback Machine files.

extract wayback [OPTIONS] HANDLE

Arguments

HANDLE

Required argument

mosaic

Create image mosaics.

mosaic [OPTIONS] COMMAND [ARGS]...

gif

Combine images into a mosaic GIF.

mosaic gif [OPTIONS]

Options

-i, --input-dir <input_dir>
-o, --output-dir <output_dir>

jpg

Combine images into JPGs ready for Twitter.

mosaic jpg [OPTIONS]

Options

-i, --input-dir <input_dir>
-o, --output-dir <output_dir>

robotstxt

Save the raw robots.txt of the provided site.

robotstxt [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--timeout <timeout>
--verbose

Arguments

HANDLE

Required argument

rss

Create RSS feeds.

rss [OPTIONS] COMMAND [ARGS]...

bundles

Create bundle feeds.

rss bundles [OPTIONS]

countries

Create country feeds.

rss countries [OPTIONS]

opml

Create an OPML file with all site feeds.

rss opml [OPTIONS]

sites

Create site feeds.

rss sites [OPTIONS]

screenshot

Screenshot the provided homepage.

screenshot [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
-w, --wait <wait>
-x, --width <width>
-y, --height <height>
-f, --full-page

Screenshot the whole page

--verbose

Print verbose output

Arguments

HANDLE

Required argument

slack

Post image to Slack channel.

slack [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>
-v, --verbose
-t, --timeout <timeout>

Arguments

HANDLE

Required argument

telegrammer

Send a Telegram message.

telegrammer [OPTIONS] COMMAND [ARGS]...

bundle

Send a bundle of sources.

telegrammer bundle [OPTIONS] SLUG

Options

-i, --input-dir <input_dir>

Arguments

SLUG

Required argument

country

Send all sources from a single country.

telegrammer country [OPTIONS] CODE

Options

-i, --input-dir <input_dir>

Arguments

CODE

Required argument

mosaic

Send a mosaic GIF.

telegrammer mosaic [OPTIONS]

Options

-i, --input-dir <input_dir>

single

Send a single source.

telegrammer single [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>

Arguments

HANDLE

Required argument

toot

Send a toot.

toot [OPTIONS] COMMAND [ARGS]...

bundle

Toot sources in batches of four.

toot bundle [OPTIONS] SLUG

Options

-i, --input-dir <input_dir>

Arguments

SLUG

Required argument

single

Toot a single source.

toot single [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>

Arguments

HANDLE

Required argument

wayback

Archive a URL in the Wayback Machine.

wayback [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--verbose

Arguments

HANDLE

Required argument

Utilities

The utils module contains a variety of functions used by our commands.

newshomepages.utils.batch(li: list, n: int)

Yield n sequential chunks from li.
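As an illustration of splitting a list into n sequential chunks, here is a minimal sketch. The name batch_sketch and the even-as-possible sizing rule are assumptions made for this example; the library's own function may distribute leftover items differently.

```python
from typing import Iterator


def batch_sketch(li: list, n: int) -> Iterator[list]:
    """Yield n sequential chunks from li, sized as evenly as possible."""
    # Every chunk gets `size` items; the first `extra` chunks get one more.
    # This sizing rule is an assumption, not confirmed library behavior.
    size, extra = divmod(len(li), n)
    start = 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        yield li[start:end]
        start = end


# Seven items split into three chunks.
print(list(batch_sketch([1, 2, 3, 4, 5, 6, 7], 3)))
```

Because the function is a generator, wrap the call in list() when you need all the chunks at once.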

newshomepages.utils.chunk(iterable: list, length: int) list[list]

Split the provided list into chunks of the provided length.

Parameters:
  • iterable (list) – The master list to split.

  • length (int) – The size of the chunks you want

Returns a list of lists.
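The behavior described above can be sketched with list slicing. The name chunk_sketch is hypothetical, and this is one plausible way to get the documented result rather than the library's actual code.

```python
def chunk_sketch(iterable: list, length: int) -> list:
    """Split a list into consecutive chunks of at most `length` items."""
    # Step through the list `length` items at a time; the final chunk
    # may be shorter when the list does not divide evenly.
    return [iterable[i : i + length] for i in range(0, len(iterable), length)]


print(chunk_sketch(["a", "b", "c", "d", "e"], 2))
```

Unlike batch, which fixes the number of chunks, this approach fixes the size of each chunk.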

newshomepages.utils.download_url(url: str, output_path: Path, timeout: int = 180)

Download the provided URL to the provided path.

newshomepages.utils.get_accessibility_df() DataFrame

Get the full list of accessibility files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_accessibility_list() list[dict[str, Any]]

Get the full list of accessibility files from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_bundle(slug: str) dict

Get the metadata for the provided bundle.

Parameters:

slug (str) – The unique string identifier of the bundle.

Returns a dictionary.

newshomepages.utils.get_bundle_list() list[dict]

Get the full list of site bundles.

Returns a list of dictionaries.

newshomepages.utils.get_country(code: str) dict

Get the metadata for the provided country.

Parameters:

code (str) – The two-letter alpha code of the country.

Returns a dictionary.

newshomepages.utils.get_country_df() DataFrame

Get the list of countries.

Returns a pandas DataFrame.

newshomepages.utils.get_country_list() list[dict]

Get the full list of countries.

Returns a list of dictionaries.

newshomepages.utils.get_extract_df(name: str, use_cache: bool = True, **kwargs) DataFrame

Read in the requested extract CSV as a DataFrame.

newshomepages.utils.get_flag_emoji(alpha2: str) str

Get the flag emoji for the provided country.

Parameters:

alpha2 (str) – The two-letter ISO code for the country.

Returns:

A string containing the emoji.
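Flag emoji are commonly built from a two-letter code by mapping each letter to its Unicode regional indicator symbol. The sketch below shows that technique under a hypothetical name, flag_emoji_sketch; whether the library uses this approach or a lookup table is not stated here.

```python
def flag_emoji_sketch(alpha2: str) -> str:
    """Build a flag emoji from a two-letter ISO country code."""
    # Regional indicator symbols start at U+1F1E6, which corresponds to "A".
    offset = 0x1F1E6 - ord("A")
    return "".join(chr(ord(letter) + offset) for letter in alpha2.upper())


print(flag_emoji_sketch("us"))
```

A pair of regional indicators renders as a flag only where the font supports it; elsewhere the two letters appear as boxed characters.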

Get the full list of hyperlink files from our extracts.

Returns a DataFrame.

Get the full list of hyperlinks from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_javascript(handle: str) str | None

Get the JavaScript file to run before the screenshot, if it exists.

Parameters:

handle (str) – The Twitter handle of the site you want.

Returns a JavaScript string ready to be run. Or None, if no file exists.

newshomepages.utils.get_json_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False) Any

Get JSON data from the provided URL.

Parameters:
  • url (str) – The URL to request

  • timeout (int) – How long to wait before timing out

  • user_agent (str) – The user agent to provide in the request headers. None by default.

  • verbose (bool) – Whether or not to print a verbose output

Returns:

The JSON response as a Python object.

newshomepages.utils.get_language_df() DataFrame

Get the list of languages.

Returns a pandas DataFrame.

newshomepages.utils.get_language_list() list[dict]

Get the list of languages.

Returns a list of dictionaries.

newshomepages.utils.get_lighthouse_df() DataFrame

Get the full list of Lighthouse files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_lighthouse_list() list[dict[str, Any]]

Get the full list of Lighthouse audits from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_local_time(site: dict) datetime

Get the current time in the provided site’s timezone.

Parameters:

site (dict) – A site’s data dictionary.

Returns the current time as a timezone-aware datetime object.

newshomepages.utils.get_robotstxt_df(use_cache: bool = True, verbose: bool = False) DataFrame

Get the full list of robots.txt files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_screenshot_df() DataFrame

Get the full list of screenshot files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_screenshot_list() list[dict[str, Any]]

Get the full list of screenshots from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_screenshots_by_site(site: dict) list[dict]

Get the list of screenshots for the provided site.

Returns a list of dictionaries.

newshomepages.utils.get_site(handle: str) dict

Get the metadata for the provided site.

Parameters:

handle (str) – The handle of the site you want.

Returns a dictionary.

newshomepages.utils.get_site_df() DataFrame

Get the full list of sites.

Returns a DataFrame.

newshomepages.utils.get_site_list() list[dict]

Get the full list of supported sites.

Returns a list of dictionaries.

newshomepages.utils.get_sites_in_batch(batch_number: int, batches: int = 10) list[dict]

Get all the sites in the provided batch.

Parameters:
  • batch_number (int) – The number of the batch to pull.

  • batches (int) – The total number of batches.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_bundle(slug: str) list[dict]

Get all the sites in the provided bundle.

Parameters:

slug (str) – The unique string identifier of the bundle.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_country(slug: str) list[dict]

Get all the sites in the provided country.

Parameters:

slug (str) – The two-letter alpha code of the country.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_language(code: str) list[dict]

Get all the sites in the provided language.

Parameters:

code (str) – The two-letter alpha code of the language.

Returns a list of site dictionaries.

newshomepages.utils.get_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False) Response

Get the provided URL.

Parameters:
  • url (str) – The URL to fetch.

  • timeout (int) – The number of seconds to wait before timing out. (Default: 30)

  • user_agent (str) – The user agent to use in the request. (Default: None)

  • verbose (bool) – Whether or not to log the action prior to execution. (Default: False)

Returns a requests.Response object.

newshomepages.utils.get_user_agent() str

Provide a user-agent string.

Returns a string ready to use as a header in a web request.

newshomepages.utils.get_wayback_df() DataFrame

Get the full list of wayback files from our extracts.

Returns a DataFrame.

newshomepages.utils.intcomma(value: int | str) str

Convert an integer to a string containing commas every three digits.

For example, 3000 becomes ‘3,000’ and 45000 becomes ‘45,000’.

Parameters:

value (int | str) – The integer to format, given as a number or a numeric string.

Returns a string with the result.
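The formatting described above can be reproduced with Python's built-in thousands separator. The sketch below uses a hypothetical name, intcomma_sketch, and assumes string input holds a plain integer.

```python
from typing import Union


def intcomma_sketch(value: Union[int, str]) -> str:
    """Format an integer with a comma every three digits."""
    # int() accepts both integers and numeric strings such as "45000";
    # the "," format specifier inserts the thousands separators.
    return f"{int(value):,}"


print(intcomma_sketch(3000))
print(intcomma_sketch("45000"))
```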

newshomepages.utils.numoji(number: int) str

Convert a number into a series of emojis for Slack.

Parameters:

number (int) – The number to convert into emoji

Returns an emoji string.
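One way to turn a number into Slack emoji is to map each digit to a shortcode such as :one:. The sketch below assumes that mapping and a non-negative integer input. The name numoji_sketch is hypothetical, and the exact string the library emits is not confirmed here.

```python
# Slack shortcode names for the digits 0 through 9 (assumed mapping).
DIGIT_NAMES = [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]


def numoji_sketch(number: int) -> str:
    """Convert a non-negative integer into a run of Slack digit emoji."""
    # Walk the decimal digits left to right and wrap each name in colons.
    return "".join(f":{DIGIT_NAMES[int(digit)]}:" for digit in str(number))


print(numoji_sketch(42))
```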

newshomepages.utils.parse_archive_url(url: str) dict

Parse the handle and timestamp from an archive.org URL.

Parameters:

url (str) – An archive.org URL

Returns a dictionary with the identifier, handle and timestamp parsed out.

newshomepages.utils.safe_ia_handle(handle: str) str

Sanitize a handle so it’s safe to use as an archive.org slug.

Parameters:

handle (str) – The unique string identifier of the site.

Returns a lowercase string that’s ready to use.

newshomepages.utils.write_csv(dict_list: list[dict], path: Path, verbose: bool = True) None

Write a list of dictionaries to a CSV file at the provided path.

Parameters:
  • dict_list (list[dict]) – The list of dictionaries to write, one per row.

  • path (Path) – The filesystem Path where the CSV should be written.

  • verbose (bool) – Whether or not to log the action prior to execution. (Default: True)

Returns nothing.
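Writing a list of dictionaries to CSV can be sketched with the standard library's csv.DictWriter. The name write_csv_sketch, the choice to take column names from the first row, and the log message are assumptions for this example.

```python
import csv
import tempfile
from pathlib import Path


def write_csv_sketch(dict_list: list, path: Path, verbose: bool = True) -> None:
    """Write a list of dictionaries to a CSV file at the provided path."""
    if verbose:
        print(f"Writing {len(dict_list)} rows to {path}")
    # Assumption: every row shares the keys of the first row.
    fieldnames = list(dict_list[0].keys()) if dict_list else []
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(dict_list)


# Usage: write two rows to a temporary file and read the text back.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "sites.csv"
    write_csv_sketch([{"handle": "a", "rank": 1}, {"handle": "b", "rank": 2}], out)
    print(out.read_text(encoding="utf-8"))
```

Passing newline="" when opening the file lets the csv module control line endings, which avoids blank rows on Windows.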

newshomepages.utils.write_json(data: Any, path: Path, indent: int = 2, verbose: bool = True) None

Write JSON data to the provided path.

Parameters:
  • data (Any) – Any Python object ready to be serialized as JSON.

  • path (Path) – The filesystem Path where the object should be written.

  • indent (int) – The indentation level to use in the JSON. (Default: 2)

  • verbose (bool) – Whether or not to log the action prior to execution. (Default: True)

Returns nothing.
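A function with this signature can be sketched with the standard library's json.dump. The name write_json_sketch and the log message are hypothetical, and the sketch assumes indent is passed straight through to the serializer.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def write_json_sketch(data: Any, path: Path, indent: int = 2, verbose: bool = True) -> None:
    """Serialize data as JSON and write it to the provided path."""
    if verbose:
        print(f"Writing JSON to {path}")
    with open(path, "w", encoding="utf-8") as fh:
        # Assumption: `indent` maps directly onto json.dump's indent argument.
        json.dump(data, fh, indent=indent)


# Usage: round-trip a small object through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "site.json"
    write_json_sketch({"handle": "example", "batches": [1, 2]}, out)
    print(json.loads(out.read_text(encoding="utf-8")))
```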