Reference

Documentation for a selection of our system’s common internal tools.

Commands

accessibility

Save the accessibility tree of the provided site.

Usage

accessibility [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--verbose

Arguments

HANDLE

Required argument

adstxt

Save the raw ads.txt of the provided site.

Usage

adstxt [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--timeout <timeout>
--verbose

Arguments

HANDLE

Required argument

analyze

Usage

analyze [OPTIONS] COMMAND [ARGS]...

cli

Analyze the Drudge Report.

drudge-entities

Analyze Drudge entities.

Usage

analyze drudge-entities [OPTIONS]

Options

-o, --output-dir <output_dir>

cli

Analyze Lighthouse reports.

lighthouse

Analyze Lighthouse scores.

Usage

analyze lighthouse [OPTIONS]

Options

-o, --output-dir <output_dir>

cli

Analyze US right wing sources.

archive

Save assets to an archive.org collection.

Usage

archive [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>
--latest

Crosspost to the latest archive.org item

--verbose

Display the upload progress to archive.org

--timeout <timeout>

Arguments

HANDLE

Required argument

batch

Print a batch of sites.

Usage

batch [OPTIONS] COMMAND [ARGS]...

sites-by-batch

Print site handles in the provided batch as a JSON list.

Usage

batch sites-by-batch [OPTIONS] BATCH

Options

-b, --batches <batches>

Arguments

BATCH

Required argument

sites-by-bundle

Print site handles in the provided bundle as a JSON list.

Usage

batch sites-by-bundle [OPTIONS] BUNDLE

Arguments

BUNDLE

Required argument

sites-by-country

Print site handles in the provided country as a JSON list.

Usage

batch sites-by-country [OPTIONS] COUNTRY

Arguments

COUNTRY

Required argument

site

Usage

site [OPTIONS] COMMAND [ARGS]...

accessibility-ranking

Create a page ranking sites by Lighthouse accessibility score.

Usage

site accessibility-ranking [OPTIONS]

bundle-detail

Create bundle detail pages.

Usage

site bundle-detail [OPTIONS]

bundle-list

Create bundle list.

Usage

site bundle-list [OPTIONS]

country-detail

Create country detail pages.

Usage

site country-detail [OPTIONS]

country-list

Create country list.

Usage

site country-list [OPTIONS]

drudge

Create a page ranking sites by appearance on drudgereport.com.

Usage

site drudge [OPTIONS]

language-detail

Create language detail pages.

Usage

site language-detail [OPTIONS]

language-list

Create language list.

Usage

site language-list [OPTIONS]

latest-screenshots

Create a page showing all of the latest screenshots.

Usage

site latest-screenshots [OPTIONS]

openai

Create a page tracking AI blockers based on most recently scraped robots.txt files.

Usage

site openai [OPTIONS]

Options

--no-cache

performance-ranking

Create a page ranking sites by Lighthouse performance score.

Usage

site performance-ranking [OPTIONS]

site-detail

Create source detail pages.

Usage

site site-detail [OPTIONS]

source-list

Create source list.

Usage

site source-list [OPTIONS]

status-report

Create a status report.

Usage

site status-report [OPTIONS]

extract

Usage

extract [OPTIONS] COMMAND [ARGS]...

consolidate

Consolidate Internet Archive metadata into CSV files.

Usage

extract consolidate [OPTIONS]

Options

-o, --output-dir <output_dir>

items

Download items from our archive.org collection as JSON.

Usage

extract items [OPTIONS] HANDLE

Options

-y, --year <year>
-o, --output-dir <output_dir>

Arguments

HANDLE

Required argument

accessibility

Download and parse the provided site’s accessibility files.

Usage

extract accessibility [OPTIONS] HANDLE

Arguments

HANDLE

Required argument

cli

Download and parse the provided site’s hyperlinks files.

lighthouse

Download and parse the provided site’s Lighthouse files.

Usage

extract lighthouse [OPTIONS]

Options

--site <site>
--country <country>
--language <language>
--bundle <bundle>
--days <days>
-o, --output-path <output_path>

cli

Download and parse the provided site’s robots.txt files.

robotstxt

Download and parse archived robots.txt files.

Usage

extract robotstxt [OPTIONS]

Options

--site <site>
--country <country>
--language <language>
--bundle <bundle>
--days <days>
--latest
--monthly
-o, --output-path <output_path>
--no-cache
--verbose

status-report

Create a status report.

Usage

extract status-report [OPTIONS]

Options

-o, --output-dir <output_dir>
-c, --use-cache

wayback

Download and parse the provided site’s Wayback Machine files.

Usage

extract wayback [OPTIONS] HANDLE

Arguments

HANDLE

Required argument

mosaic

Create image mosaics.

Usage

mosaic [OPTIONS] COMMAND [ARGS]...

gif

Combine images into a mosaic GIF.

Usage

mosaic gif [OPTIONS]

Options

-i, --input-dir <input_dir>
-o, --output-dir <output_dir>

jpg

Combine images into jpgs ready for Twitter.

Usage

mosaic jpg [OPTIONS]

Options

-i, --input-dir <input_dir>
-o, --output-dir <output_dir>

robotstxt

Save the raw robots.txt of the provided site.

Usage

robotstxt [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--timeout <timeout>
--verbose

Arguments

HANDLE

Required argument

screenshot

Screenshot the provided homepage.

Usage

screenshot [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
-w, --wait <wait>
-x, --width <width>
-y, --height <height>
-f, --full-page

Screenshot the whole page

--verbose

Print verbose output

Arguments

HANDLE

Required argument

slack

Post image to Slack channel.

Usage

slack [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>
-v, --verbose
-t, --timeout <timeout>

Arguments

HANDLE

Required argument

toot

Send a toot.

Usage

toot [OPTIONS] COMMAND [ARGS]...

bundle

Toot sources in batches of four.

Usage

toot bundle [OPTIONS] SLUG

Options

-i, --input-dir <input_dir>

Arguments

SLUG

Required argument

single

Toot a single source.

Usage

toot single [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>

Arguments

HANDLE

Required argument

wayback

Archive a URL in the Wayback Machine.

Usage

wayback [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--verbose

Arguments

HANDLE

Required argument

Utilities

The utils module contains a variety of functions used by our commands.

newshomepages.utils.batch(li: list, n: int)

Yield n sequential chunks from li.
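
As a rough illustration of the behavior described above, here is a minimal sketch rather than the module's own implementation. The name batch_sketch is hypothetical, and the rule that leftover items go to the earliest chunks is an assumption; the real function may divide the list differently.

```python
from typing import Iterator


def batch_sketch(li: list, n: int) -> Iterator[list]:
    """Yield n sequential chunks from li, as evenly sized as possible."""
    if n < 1:
        raise ValueError("n must be at least 1")
    size, remainder = divmod(len(li), n)
    start = 0
    for i in range(n):
        # Assumption: the first `remainder` chunks each take one extra item.
        end = start + size + (1 if i < remainder else 0)
        yield li[start:end]
        start = end


if __name__ == "__main__":
    for part in batch_sketch(list(range(10)), 3):
        print(part)
```

Because the sketch is a generator, wrap the call in list() when all chunks are needed at once.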

newshomepages.utils.chunk(iterable: list, length: int) → list[list]

Split the provided list into chunks of the provided length.

Args:

iterable (list): The master list to split.
length (int): The size of the chunks you want.

Returns a list of lists.
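
The fixed-length split described above can be sketched with list slicing. This is an illustrative stand-in under the assumption that the final chunk may be shorter than the rest; chunk_sketch is a hypothetical name, not the module's function.

```python
def chunk_sketch(iterable: list, length: int) -> list[list]:
    """Split a list into consecutive chunks of at most `length` items."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # Step through the list in strides of `length`; the last slice may be short.
    return [iterable[i : i + length] for i in range(0, len(iterable), length)]


if __name__ == "__main__":
    print(chunk_sketch([1, 2, 3, 4, 5], 2))
```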

newshomepages.utils.download_url(url: str, output_path: Path, timeout: int = 180)

Download the provided URL to the provided path.

newshomepages.utils.get_accessibility_df(use_cache: bool = True, verbose: bool = False) → DataFrame

Get the full list of accessibility files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_accessibility_list() → list[dict[str, Any]]

Get the full list of accessibility files from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_bundle(slug: str) → dict

Get the metadata for the provided bundle.

Args:

slug (str): The unique string identifier of the bundle.

Returns a dictionary.

newshomepages.utils.get_bundle_list() → list[dict]

Get the full list of site bundles.

Returns a list of dictionaries.

newshomepages.utils.get_country(code: str) → dict

Get the metadata for the provided country.

Args:

code (str): The two-letter alpha code of the country.

Returns a dictionary.

newshomepages.utils.get_country_df() → DataFrame

Get the list of countries.

Returns a pandas DataFrame.

newshomepages.utils.get_country_list() → list[dict]

Get the full list of countries.

Returns a list of dictionaries.

newshomepages.utils.get_extract_df(name: str, use_cache: bool = True, **kwargs) → DataFrame

Read in the requested extract CSV as a DataFrame.

newshomepages.utils.get_flag_emoji(alpha2: str) → str

Get the flag emoji for the provided country.

Args:

alpha2 (str): The two-letter ISO code for the country.

Returns:

A string containing the emoji.
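
One common way to build a flag emoji from a two-letter code is to map each letter to its Unicode regional indicator symbol. The sketch below shows that technique; it is an assumption about how such a lookup could work, not a description of the module's internals, and flag_emoji_sketch is a hypothetical name.

```python
# Offset between "A" and REGIONAL INDICATOR SYMBOL LETTER A (U+1F1E6).
REGIONAL_INDICATOR_OFFSET = 0x1F1E6 - ord("A")


def flag_emoji_sketch(alpha2: str) -> str:
    """Convert a two-letter country code into a flag emoji string."""
    code = alpha2.strip().upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("alpha2 must be two ASCII letters")
    # Each letter becomes one regional indicator; the pair renders as a flag.
    return "".join(chr(ord(c) + REGIONAL_INDICATOR_OFFSET) for c in code)


if __name__ == "__main__":
    print(flag_emoji_sketch("us"))
```

Whether the pair displays as a flag or as two boxed letters depends on the font and platform rendering the string.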

Get the full list of hyperlink files from our extracts.

Returns a DataFrame.

Get the full list of hyperlinks from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_javascript(handle: str) → str | None

Get the JavaScript file to run before the screenshot, if it exists.

Args:

handle (str): The Twitter handle of the site you want.

Returns a JavaScript string ready to be run, or None if no file exists.

newshomepages.utils.get_json_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False) → Any

Get JSON data from the provided URL.

Args:

url (str): The URL to request.
timeout (int): How long to wait before timing out.
user_agent (str): The user agent to provide in the request headers. None by default.
verbose (bool): Whether or not to print verbose output.

Returns:

The JSON response as a Python object.

newshomepages.utils.get_language_df() → DataFrame

Get the list of languages.

Returns a pandas DataFrame.

newshomepages.utils.get_language_list() → list[dict]

Get the list of languages.

Returns a list of dictionaries.

newshomepages.utils.get_lighthouse_df(use_cache: bool = True, verbose: bool = False) → DataFrame

Get the full list of Lighthouse files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_lighthouse_list() → list[dict[str, Any]]

Get the full list of Lighthouse audits from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_local_time(site: dict) → datetime

Get the current time in the provided site’s timezone.

Args:

site (dict): A site’s data dictionary.

Returns the current time as a timezone-aware datetime object.

newshomepages.utils.get_robotstxt_df(use_cache: bool = True, verbose: bool = False) → DataFrame

Get the full list of robots.txt files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_screenshot_df(use_cache: bool = True, verbose: bool = False) → DataFrame

Get the full list of screenshot files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_screenshot_list() → list[dict[str, Any]]

Get the full list of screenshots from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_screenshots_by_site(site: dict) → list[dict]

Get the list of screenshots for the provided site.

Returns a list of dictionaries.

newshomepages.utils.get_site(handle: str) → dict

Get the metadata for the provided site.

Args:

handle (str): The handle of the site you want.

Returns a dictionary.

newshomepages.utils.get_site_df() → DataFrame

Get the full list of sites.

Returns a DataFrame.

newshomepages.utils.get_site_list() → list[dict]

Get the full list of supported sites.

Returns a list of dictionaries.

newshomepages.utils.get_sites_in_batch(batch_number: int, batches: int = 10) → list[dict]

Get all the sites in the provided batch.

Args:

batch_number (int): The number of the batch to pull.
batches (int): The total number of batches.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_bundle(slug: str) → list[dict]

Get all the sites in the provided bundle.

Args:

slug (str): The unique string identifier of the bundle.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_country(slug: str) → list[dict]

Get all the sites in the provided country.

Args:

slug (str): The two-letter alpha code of the country.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_language(code: str) → list[dict]

Get all the sites in the provided language.

Args:

code (str): The alpha code of the language.

Returns a list of site dictionaries.

newshomepages.utils.get_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False) → Response

Get the provided URL.

Args:

url (str): The URL to fetch.
timeout (int): The number of seconds to wait before timing out. (Default: 30)
user_agent (str): The user agent to use in the request. (Default: None)
verbose (bool): Whether or not to log the action prior to execution. (Default: False)

Returns a requests.Response object.

newshomepages.utils.get_user_agent() → str

Provide a user-agent string.

Returns a string ready to use as a header in a web request.

newshomepages.utils.get_wayback_df(use_cache: bool = True, verbose: bool = False) → DataFrame

Get the full list of wayback files from our extracts.

Returns a DataFrame.

newshomepages.utils.intcomma(value: int | str) → str

Convert an integer to a string containing commas every three digits.

For example, 3000 becomes ‘3,000’ and 45000 becomes ‘45,000’.

Args:

value (int | str): The integer, or a string containing one, to format.

Returns a string with the result.
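
The comma grouping in the examples above matches what Python's built-in format specification produces with the "," option. The sketch below uses that approach; intcomma_sketch is a hypothetical name, and converting string input with int() is an assumption about how the real function treats the str case.

```python
def intcomma_sketch(value: "int | str") -> str:
    """Format an integer with a comma between each group of three digits."""
    # Assumption: string input holds a plain integer such as "45000".
    return f"{int(value):,}"


if __name__ == "__main__":
    print(intcomma_sketch(3000))
    print(intcomma_sketch("45000"))
```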

newshomepages.utils.numoji(number: int) → str

Convert a number into a series of emojis for Slack.

Args:

number (int): The number to convert into emoji

Returns an emoji string.
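
For readers unfamiliar with the idea, a digit-by-digit conversion could look like the sketch below. It assumes the output uses Slack's standard digit shortcodes (":zero:" through ":nine:") and that only non-negative integers are accepted; neither detail is confirmed by this reference, and numoji_sketch is a hypothetical name.

```python
# Assumed mapping from each digit to its Slack emoji shortcode.
DIGIT_SHORTCODES = {
    "0": ":zero:",
    "1": ":one:",
    "2": ":two:",
    "3": ":three:",
    "4": ":four:",
    "5": ":five:",
    "6": ":six:",
    "7": ":seven:",
    "8": ":eight:",
    "9": ":nine:",
}


def numoji_sketch(number: int) -> str:
    """Convert a non-negative integer into a run of Slack digit shortcodes."""
    if number < 0:
        raise ValueError("number must be non-negative")
    # Walk the decimal digits left to right and join their shortcodes.
    return "".join(DIGIT_SHORTCODES[digit] for digit in str(number))


if __name__ == "__main__":
    print(numoji_sketch(42))
```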

newshomepages.utils.parse_archive_url(url: str) → dict

Parse the handle and timestamp from an archive.org URL.

Args:

url (str): An archive.org URL

Returns a dictionary with the identifier, handle and timestamp parsed out.

newshomepages.utils.safe_ia_handle(handle: str) → str

Sanitize a handle so it’s safe to use as an archive.org slug.

Args:

handle (str): The unique string identifier of the site.

Returns a lowercase string that’s ready to use.

newshomepages.utils.write_csv(dict_list: list[dict], path: Path, verbose: bool = True) → None

Write a list of dictionaries to a CSV file at the provided path.

Args:

dict_list (list[dict]): The list of dictionaries to write, one per row.
path (Path): The filesystem Path where the CSV should be written.
verbose (bool): Whether or not to log the action prior to execution. (Default: True)

Returns nothing.
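
Writing a list of dictionaries to CSV is commonly done with csv.DictWriter from the standard library. The sketch below shows that pattern; it is not the module's code. The name write_csv_sketch, the use of print for verbose logging, and taking the header from the first row's keys are all assumptions.

```python
import csv
from pathlib import Path


def write_csv_sketch(dict_list: list[dict], path: Path, verbose: bool = True) -> None:
    """Write a list of dictionaries to a CSV file, one dictionary per row."""
    if verbose:
        print(f"Writing CSV to {path}")
    # Assumption: every row shares the keys of the first row.
    fieldnames = list(dict_list[0].keys()) if dict_list else []
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(dict_list)


if __name__ == "__main__":
    # The rows and the filename here are made-up sample data.
    write_csv_sketch([{"handle": "example", "score": 1}], Path("example.csv"))
```

Values read back from the file are strings, so numeric columns need converting after loading.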

newshomepages.utils.write_json(data: Any, path: Path, indent: int = 2, verbose: bool = True) → None

Write JSON data to the provided path.

Args:

data (Any): Any Python object ready to be serialized as JSON.
path (Path): The filesystem Path where the object should be written.
indent (int): The number of indentations to include in the JSON. (Default: 2)
verbose (bool): Whether or not to log the action prior to execution. (Default: True)

Returns nothing.
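
A function with this signature can be sketched with json.dump from the standard library. This is an illustration under stated assumptions, not the module's implementation: write_json_sketch is a hypothetical name, and logging with print is a guess at what verbose does.

```python
import json
from pathlib import Path
from typing import Any


def write_json_sketch(data: Any, path: Path, indent: int = 2, verbose: bool = True) -> None:
    """Serialize data as JSON and write it to the provided path."""
    if verbose:
        print(f"Writing JSON to {path}")
    with open(path, "w", encoding="utf-8") as fh:
        # `indent` sets how many spaces each nesting level is indented by.
        json.dump(data, fh, indent=indent)


if __name__ == "__main__":
    # The payload and the filename here are made-up sample data.
    write_json_sketch({"handle": "example", "batches": [1, 2]}, Path("example.json"))
```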