Reference

Documentation for a selection of our system’s common internal tools.

Commands

accessibility

Save the accessibility tree of the provided site.

accessibility [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--verbose

Arguments

HANDLE

Required argument

adstxt

Save the raw ads.txt of the provided site.

adstxt [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--timeout <timeout>
--verbose

Arguments

HANDLE

Required argument

analyze

analyze [OPTIONS] COMMAND [ARGS]...

cli

Analyze the Drudge Report.

drudge-entities

Analyze Drudge entities.

analyze drudge-entities [OPTIONS]

Options

-o, --output-dir <output_dir>

cli

Analyze Lighthouse reports.

lighthouse

Analyze Lighthouse scores.

analyze lighthouse [OPTIONS]

Options

-o, --output-dir <output_dir>

cli

Analyze US right wing sources.

archive

Save assets to an archive.org collection.

archive [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>
--latest

Crosspost to the latest archive.org item

--verbose

Display the upload progress to archive.org

--timeout <timeout>

Arguments

HANDLE

Required argument

batch

Print a batch of sites.

batch [OPTIONS] COMMAND [ARGS]...

sites-by-batch

Print site handles in the provided batch as a JSON list.

batch sites-by-batch [OPTIONS] BATCH

Options

-b, --batches <batches>

Arguments

BATCH

Required argument

sites-by-bundle

Print site handles in the provided bundle as a JSON list.

batch sites-by-bundle [OPTIONS] BUNDLE

Arguments

BUNDLE

Required argument

sites-by-country

Print site handles in the provided country as a JSON list.

batch sites-by-country [OPTIONS] COUNTRY

Arguments

COUNTRY

Required argument

site

site [OPTIONS] COMMAND [ARGS]...

accessibility-ranking

Create a page that ranks sites by Lighthouse accessibility score.

site accessibility-ranking [OPTIONS]

bundle-detail

Create bundle detail pages.

site bundle-detail [OPTIONS]

bundle-list

Create bundle list.

site bundle-list [OPTIONS]

country-detail

Create country detail pages.

site country-detail [OPTIONS]

country-list

Create country list.

site country-list [OPTIONS]

drudge

Create a page that ranks sites by appearance on drudgereport.com.

site drudge [OPTIONS]

language-detail

Create language detail pages.

site language-detail [OPTIONS]

language-list

Create language list.

site language-list [OPTIONS]

latest-screenshots

Create a page showing all of the latest screenshots.

site latest-screenshots [OPTIONS]

openai

Create a page tracking AI blockers based on most recently scraped robots.txt files.

site openai [OPTIONS]

Options

--no-cache

performance-ranking

Create a page that ranks sites by Lighthouse performance score.

site performance-ranking [OPTIONS]

site-detail

Create source detail pages.

site site-detail [OPTIONS]

source-list

Create source list.

site source-list [OPTIONS]

status-report

Create a status report.

site status-report [OPTIONS]

mosaic

Create image mosaics.

mosaic [OPTIONS] COMMAND [ARGS]...

gif

Combine images into a mosaic GIF.

mosaic gif [OPTIONS]

Options

-i, --input-dir <input_dir>
-o, --output-dir <output_dir>

jpg

Combine images into JPGs ready for Twitter.

mosaic jpg [OPTIONS]

Options

-i, --input-dir <input_dir>
-o, --output-dir <output_dir>

robotstxt

Save the raw robots.txt of the provided site.

robotstxt [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--timeout <timeout>
--verbose

Arguments

HANDLE

Required argument

rss

Create RSS feeds.

rss [OPTIONS] COMMAND [ARGS]...

bundles

Create bundle feeds.

rss bundles [OPTIONS]

countries

Create country feeds.

rss countries [OPTIONS]

opml

Create an OPML file with all site feeds.

rss opml [OPTIONS]

sites

Create site feeds.

rss sites [OPTIONS]

screenshot

Screenshot the provided homepage.

screenshot [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
-w, --wait <wait>
-x, --width <width>
-y, --height <height>
-f, --full-page

Screenshot the whole page

--verbose

Print verbose output

Arguments

HANDLE

Required argument

slack

Post image to Slack channel.

slack [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>
-v, --verbose
-t, --timeout <timeout>

Arguments

HANDLE

Required argument

toot

Send a toot.

toot [OPTIONS] COMMAND [ARGS]...

bundle

Toot sources in batches of four.

toot bundle [OPTIONS] SLUG

Options

-i, --input-dir <input_dir>

Arguments

SLUG

Required argument

single

Toot a single source.

toot single [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>

Arguments

HANDLE

Required argument

wayback

Archive a URL in the Wayback Machine.

wayback [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--verbose

Arguments

HANDLE

Required argument

Utilities

The utils module contains a variety of functions used by our commands.

newshomepages.utils.batch(li: list, n: int)

Yield n sequential chunks from li.
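
One plausible reading of "n sequential chunks" is a split into n contiguous slices whose sizes differ by at most one. The sketch below illustrates that reading. The name split_into_n is hypothetical, and how the real function handles leftover items is an assumption.

```python
from typing import Iterator


def split_into_n(items: list, n: int) -> Iterator[list]:
    """Yield n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    start = 0
    for i in range(n):
        # Assumption: the first `extra` chunks each absorb one leftover item.
        end = start + size + (1 if i < extra else 0)
        yield items[start:end]
        start = end


print(list(split_into_n([1, 2, 3, 4, 5, 6, 7], 3)))
# prints [[1, 2, 3], [4, 5], [6, 7]]
```

Here n fixes the number of chunks rather than their length, which is the difference from the chunk utility.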

newshomepages.utils.chunk(iterable: list, length: int) list[list]

Split the provided list into chunks of the provided length.

Args:

iterable (list): The master list to split.

length (int): The size of the chunks you want.

Returns a list of lists.
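
A minimal sketch of fixed-length chunking, for illustration only. The name chunk_list is hypothetical, and letting the last chunk be shorter than the rest is an assumption about the behaviour.

```python
def chunk_list(items: list, length: int) -> list[list]:
    """Split items into consecutive slices of at most `length` elements."""
    # Step through the list `length` items at a time; the final slice
    # may be shorter when the list does not divide evenly.
    return [items[i : i + length] for i in range(0, len(items), length)]


print(chunk_list([1, 2, 3, 4, 5], 2))
# prints [[1, 2], [3, 4], [5]]
```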

newshomepages.utils.download_url(url: str, output_path: Path, timeout: int = 180)

Download the provided URL to the provided path.

newshomepages.utils.get_accessibility_df(use_cache: bool = True, verbose: bool = False) DataFrame

Get the full list of accessibility files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_accessibility_list() list[dict[str, Any]]

Get the full list of accessibility files from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_bundle(slug: str) dict

Get the metadata for the provided bundle.

Args:

slug (str): The unique string identifier of the bundle.

Returns a dictionary.

newshomepages.utils.get_bundle_list() list[dict]

Get the full list of site bundles.

Returns a list of dictionaries.

newshomepages.utils.get_country(code: str) dict

Get the metadata for the provided country.

Args:

code (str): The unique string identifier of the country.

Returns a dictionary.

newshomepages.utils.get_country_df() DataFrame

Get the list of countries.

Returns a pandas DataFrame.

newshomepages.utils.get_country_list() list[dict]

Get the full list of countries.

Returns a list of dictionaries.

newshomepages.utils.get_extract_df(name: str, use_cache: bool = True, **kwargs) DataFrame

Read in the requested extract CSV as a DataFrame.

newshomepages.utils.get_flag_emoji(alpha2: str) str

Get the flag emoji for the provided country.

Args:

alpha2 (str): The two-letter ISO code for the country.

Returns:

A string containing the emoji.
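
Flag emoji are commonly built by mapping each letter of the two-letter code to its Unicode regional indicator symbol. The sketch below shows that standard technique. The name flag_emoji is hypothetical, and whether the library function works this way is an assumption.

```python
def flag_emoji(alpha2: str) -> str:
    """Map a two-letter country code to its flag emoji."""
    # Regional indicator symbols start at U+1F1E6, which corresponds to "A".
    offset = 0x1F1E6 - ord("A")
    return "".join(chr(ord(letter) + offset) for letter in alpha2.upper())


print(flag_emoji("us") == "\U0001F1FA\U0001F1F8")
# prints True
```

Whether the pair renders as a flag depends on the font and platform. Where flags are unsupported, the two indicator letters are shown instead.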

Get the full list of hyperlink files from our extracts.

Returns a DataFrame.

Get the full list of hyperlinks from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_javascript(handle: str) str | None

Get the JavaScript file to run before the screenshot, if it exists.

Args:

handle (str): The Twitter handle of the site you want.

Returns a JavaScript string ready to be run, or None if no file exists.

newshomepages.utils.get_json_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False) Any

Get JSON data from the provided URL.

Args:

url (str): The URL to request.

timeout (int): How long to wait before timing out.

user_agent (str): The user agent to provide in the request headers. None by default.

verbose (bool): Whether or not to print verbose output.

Returns:

The JSON response as a Python object.

newshomepages.utils.get_language_df() DataFrame

Get the list of languages.

Returns a pandas DataFrame.

newshomepages.utils.get_language_list() list[dict]

Get the list of languages.

Returns a list of dictionaries.

newshomepages.utils.get_lighthouse_df(use_cache: bool = True, verbose: bool = False) DataFrame

Get the full list of Lighthouse files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_lighthouse_list() list[dict[str, Any]]

Get the full list of Lighthouse audits from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_local_time(site: dict) datetime

Get the current time in the provided site’s timezone.

Args:

site (dict): A site’s data dictionary.

Returns the current time as a timezone-aware datetime object.

newshomepages.utils.get_robotstxt_df(use_cache: bool = True, verbose: bool = False) DataFrame

Get the full list of robots.txt files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_screenshot_df(use_cache: bool = True, verbose: bool = False) DataFrame

Get the full list of screenshot files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_screenshot_list() list[dict[str, Any]]

Get the full list of screenshots from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_screenshots_by_site(site: dict) list[dict]

Get the list of screenshots for the provided site.

Returns a list of dictionaries.

newshomepages.utils.get_site(handle: str) dict

Get the metadata for the provided site.

Args:

handle (str): The handle of the site you want.

Returns a dictionary.

newshomepages.utils.get_site_df() DataFrame

Get the full list of sites.

Returns a DataFrame.

newshomepages.utils.get_site_list() list[dict]

Get the full list of supported sites.

Returns a list of dictionaries.

newshomepages.utils.get_sites_in_batch(batch_number: int, batches: int = 10) list[dict]

Get all the sites in the provided batch.

Args:

batch_number (int): The number of the batch to pull. batches (int): The total number of batches.

Returns a list of site dictionaries.
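
To show how batch_number and batches might relate, the sketch below picks one contiguous slice out of an evenly divided list. Everything here is an assumption: the name sites_in_batch is hypothetical, the site list is passed in rather than loaded, and batch_number is treated as zero-based, which the reference does not confirm.

```python
def sites_in_batch(sites: list[dict], batch_number: int, batches: int = 10) -> list[dict]:
    """Return the slice of `sites` that falls in the given batch."""
    size, extra = divmod(len(sites), batches)
    # Assumption: batch_number is zero-based and earlier batches
    # absorb the leftover sites one at a time.
    start = batch_number * size + min(batch_number, extra)
    end = start + size + (1 if batch_number < extra else 0)
    return sites[start:end]


demo_sites = [{"handle": f"site{i}"} for i in range(7)]
print([s["handle"] for s in sites_in_batch(demo_sites, 1, batches=3)])
# prints ['site3', 'site4']
```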

newshomepages.utils.get_sites_in_bundle(slug: str) list[dict]

Get all the sites in the provided bundle.

Args:

slug (str): The unique string identifier of the bundle.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_country(slug: str) list[dict]

Get all the sites in the provided country.

Args:

slug (str): The two-letter alpha code of the country.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_language(code: str) list[dict]

Get all the sites in the provided language.

Args:

code (str): The two-letter alpha code of the language.

Returns a list of site dictionaries.

newshomepages.utils.get_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False) Response

Get the provided URL.

Args:

url (str): The URL to fetch.

timeout (int): The number of seconds to wait before timing out. (Default: 30)

user_agent (str): The user agent to use in the request. (Default: None)

verbose (bool): Whether or not to log the action prior to execution. (Default: False)

Returns a requests.Response object.

newshomepages.utils.get_user_agent() str

Provide a user-agent string.

Returns a string ready to use as a header in a web request.

newshomepages.utils.get_wayback_df(use_cache: bool = True, verbose: bool = False) DataFrame

Get the full list of wayback files from our extracts.

Returns a DataFrame.

newshomepages.utils.intcomma(value: int | str) str

Convert an integer to a string containing commas every three digits.

For example, 3000 becomes ‘3,000’ and 45000 becomes ‘45,000’.

Args:

value (int | str): The integer to format.

Returns a string with the result.
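
Python's format mini-language can produce the same grouping with the "," option. The sketch below shows that approach. The name add_commas is hypothetical, and it does not claim to match the library's implementation.

```python
def add_commas(value) -> str:
    """Format an integer, or a string of digits, with thousands separators."""
    # int() accepts both ints and numeric strings; "," groups by three digits.
    return f"{int(value):,}"


print(add_commas(3000))
# prints 3,000
print(add_commas("45000"))
# prints 45,000
```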

newshomepages.utils.numoji(number: int) str

Convert a number into a series of emojis for Slack.

Args:

number (int): The number to convert into emoji

Returns an emoji string.
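
Slack names its digit emoji with shortcodes such as :one: and :two:, so one way to build such a string is digit by digit. The sketch below assumes that output format and a non-negative integer input. The name number_to_slack_emoji is hypothetical.

```python
DIGIT_NAMES = [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]


def number_to_slack_emoji(number: int) -> str:
    """Spell a non-negative integer as a run of Slack digit shortcodes."""
    # Assumption: each digit maps to a shortcode like ":four:".
    return "".join(f":{DIGIT_NAMES[int(digit)]}:" for digit in str(number))


print(number_to_slack_emoji(42))
# prints :four::two:
```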

newshomepages.utils.parse_archive_url(url: str) dict

Parse the handle and timestamp from an archive.org URL.

Args:

url (str): An archive.org URL

Returns a dictionary with the identifier, handle and timestamp parsed out.

newshomepages.utils.safe_ia_handle(handle: str) str

Sanitize a handle so it’s safe to use as an archive.org slug.

Args:

handle (str): The unique string identifier of the site.

Returns a lowercase string that’s ready to use.

newshomepages.utils.write_csv(dict_list: list[dict], path: Path, verbose: bool = True) None

Write a list of dictionaries to a CSV file at the provided path.

Args:

dict_list (list[dict]): The list of dictionaries to write, one per row.

path (Path): The filesystem Path where the CSV should be written.

verbose (bool): Whether or not to log the action prior to execution. (Default: True)

Returns nothing.
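
Writing a list of dictionaries to CSV is typically done with the standard library's csv.DictWriter. The sketch below shows that pattern. The name write_rows is hypothetical, and taking the header from the first dictionary's keys is an assumption.

```python
import csv
from pathlib import Path


def write_rows(dict_list: list[dict], path: Path) -> None:
    """Write one CSV row per dictionary, with a header row first."""
    # Assumption: every dictionary shares the keys of the first one.
    fieldnames = list(dict_list[0].keys()) if dict_list else []
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(dict_list)
```

Calling write_rows([{"handle": "a"}, {"handle": "b"}], Path("sites.csv")) would produce a file with a handle header and two data rows.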

newshomepages.utils.write_json(data: Any, path: Path, indent: int = 2, verbose: bool = True) None

Write JSON data to the provided path.

Args:

data (Any): Any Python object ready to be serialized as JSON.

path (Path): The filesystem Path where the object should be written.

indent (int): The indentation level to use in the JSON output. (Default: 2)

verbose (bool): Whether or not to log the action prior to execution. (Default: True)

Returns nothing.
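
A minimal sketch of what a JSON writer with this signature might do, using only the standard library. The name dump_json is hypothetical, and creating missing parent directories is an assumption rather than documented behaviour.

```python
import json
from pathlib import Path
from typing import Any


def dump_json(data: Any, path: Path, indent: int = 2) -> None:
    """Serialize data as JSON to the given path."""
    # Assumption: create parent directories so the write does not fail.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=indent)
```

In json.dump, an integer indent sets the number of spaces per nesting level, so the default of 2 yields compact but readable output.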