Reference

Documentation for a selection of our system’s common internal tools.

Commands

accessibility

Save the accessibility tree of the provided site.

accessibility [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--timeout <timeout>

Arguments

HANDLE

Required argument

analyze

analyze [OPTIONS] COMMAND [ARGS]...

cli

Analyze the Drudge Report.

drudge-entities

Analyze Drudge entities.

analyze drudge-entities [OPTIONS]

Options

-o, --output-dir <output_dir>

cli

Analyze Lighthouse reports.

lighthouse

Analyze Lighthouse scores.

analyze lighthouse [OPTIONS]

Options

-o, --output-dir <output_dir>

cli

Analyze US right wing sources.

archive

Save assets to an archive.org collection.

archive [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>
--latest

Crosspost to the latest archive.org item

--verbose

Display the upload progress to archive.org

--timeout <timeout>

Arguments

HANDLE

Required argument

batch

Print a batch of sites.

batch [OPTIONS] COMMAND [ARGS]...

sites-by-batch

Print site handles in the provided batch as a JSON list.

batch sites-by-batch [OPTIONS] BATCH

Options

-b, --batches <batches>

Arguments

BATCH

Required argument

sites-by-bundle

Print site handles in the provided bundle as a JSON list.

batch sites-by-bundle [OPTIONS] BUNDLE

Arguments

BUNDLE

Required argument

sites-by-country

Print site handles in the provided country as a JSON list.

batch sites-by-country [OPTIONS] COUNTRY

Arguments

COUNTRY

Required argument

discorder

Post images to Discord.

discorder [OPTIONS] COMMAND [ARGS]...

bundle

Post all images for a bundle.

discorder bundle [OPTIONS] SLUG

Options

-i, --input-dir <input_dir>

Arguments

SLUG

Required argument

country

Post all images for a country.

discorder country [OPTIONS] CODE

Options

-i, --input-dir <input_dir>

Arguments

CODE

Required argument

single

Post a single source.

discorder single [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>

Arguments

HANDLE

Required argument

site

site [OPTIONS] COMMAND [ARGS]...

accessibility-ranking

Create a page ranking sites by Lighthouse accessibility score.

site accessibility-ranking [OPTIONS]

bundle-detail

Create bundle detail pages.

site bundle-detail [OPTIONS]

bundle-list

Create bundle list.

site bundle-list [OPTIONS]

country-detail

Create country detail pages.

site country-detail [OPTIONS]

country-list

Create country list.

site country-list [OPTIONS]

drudge

Create a page ranking sites by appearance on drudgereport.com.

site drudge [OPTIONS]

language-detail

Create language detail pages.

site language-detail [OPTIONS]

language-list

Create language list.

site language-list [OPTIONS]

latest-screenshots

Create a page showing all of the latest screenshots.

site latest-screenshots [OPTIONS]

openai

Create the OpenAI page based on the most recently scraped robots.txt files.

site openai [OPTIONS]

performance-ranking

Create a page ranking sites by Lighthouse performance score.

site performance-ranking [OPTIONS]

site-detail

Create source detail pages.

site site-detail [OPTIONS]

source-list

Create source list.

site source-list [OPTIONS]

status-report

Create a status report.

site status-report [OPTIONS]

html

Save HTML for the provided homepage.

html [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
-w, --wait <wait>

Arguments

HANDLE

Required argument

mosaic

Create image mosaics.

mosaic [OPTIONS] COMMAND [ARGS]...

gif

Combine images into a mosaic GIF.

mosaic gif [OPTIONS]

Options

-i, --input-dir <input_dir>
-o, --output-dir <output_dir>

jpg

Combine images into jpgs ready for Twitter.

mosaic jpg [OPTIONS]

Options

-i, --input-dir <input_dir>
-o, --output-dir <output_dir>

robotstxt

Save the raw robots.txt of the provided site.

robotstxt [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
--timeout <timeout>
--verbose

Arguments

HANDLE

Required argument

rss

Create RSS feeds.

rss [OPTIONS] COMMAND [ARGS]...

bundles

Create bundle feeds.

rss bundles [OPTIONS]

countries

Create country feeds.

rss countries [OPTIONS]

opml

Create an OPML file with all site feeds.

rss opml [OPTIONS]

sites

Create site feeds.

rss sites [OPTIONS]

screenshot

Screenshot the provided homepage.

screenshot [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>
-w, --wait <wait>
-x, --width <width>
-y, --height <height>
-f, --full-page

Screenshot the whole page

Arguments

HANDLE

Required argument

slack

Post an image to a Slack channel.

slack [OPTIONS] ARTIFACT_PATH

Arguments

ARTIFACT_PATH

Required argument

telegrammer

Send a Telegram message.

telegrammer [OPTIONS] COMMAND [ARGS]...

bundle

Send a bundle of sources.

telegrammer bundle [OPTIONS] SLUG

Options

-i, --input-dir <input_dir>

Arguments

SLUG

Required argument

country

Send all sources from a single country.

telegrammer country [OPTIONS] CODE

Options

-i, --input-dir <input_dir>

Arguments

CODE

Required argument

mosaic

Send a mosaic GIF.

telegrammer mosaic [OPTIONS]

Options

-i, --input-dir <input_dir>

single

Send a single source.

telegrammer single [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>

Arguments

HANDLE

Required argument

wayback

Archive a URL in the Wayback Machine.

wayback [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>

Arguments

HANDLE

Required argument

Utilities

The utils module contains a variety of functions used by our commands.

newshomepages.utils.batch(li: list, n: int)

Yield n sequential chunks from li.
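As a rough illustration of splitting a list into n sequential chunks, the standalone sketch below uses a hypothetical batch_sketch function. It is not the library's implementation. In particular, the way leftover items are spread across the earlier chunks is an assumption.

```python
from typing import Iterator


def batch_sketch(li: list, n: int) -> Iterator[list]:
    """Yield n sequential chunks from li, as evenly sized as possible."""
    # divmod gives the base chunk size and how many chunks get one extra item.
    # Assumption: extra items go to the earliest chunks.
    size, extra = divmod(len(li), n)
    start = 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        yield li[start:end]
        start = end


chunks = list(batch_sketch([1, 2, 3, 4, 5, 6, 7], 3))
print(chunks)  # [[1, 2, 3], [4, 5], [6, 7]]
```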

newshomepages.utils.chunk(iterable: list, length: int) → list[list]

Split the provided list into chunks of the provided length.

Parameters
  • iterable (list) – The master list to split.

  • length (int) – The size of the chunks you want

Returns a list of lists.
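The behavior described above can be approximated with list slicing. The chunk_sketch name below is hypothetical and the code is only a minimal stand-in for the documented function. It assumes the final chunk may be shorter than the requested length.

```python
def chunk_sketch(iterable: list, length: int) -> list[list]:
    """Split a list into consecutive chunks of at most `length` items."""
    # Step through the list in strides of `length` and slice out each chunk.
    return [iterable[i : i + length] for i in range(0, len(iterable), length)]


handles = ["a", "b", "c", "d", "e"]  # hypothetical site handles
print(chunk_sketch(handles, 2))  # [['a', 'b'], ['c', 'd'], ['e']]
```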

newshomepages.utils.download_url(url: str, output_path: pathlib.Path, timeout: int = 180)

Download the provided URL to the provided path.

newshomepages.utils.get_accessibility_df() → pandas.core.frame.DataFrame

Get the full list of accessibility files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_accessibility_list() → list[dict[str, typing.Any]]

Get the full list of accessibility files from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_bundle(slug: str) → dict

Get the metadata for the provided bundle.

Parameters

slug (str) – The unique string identifier of the bundle.

Returns a dictionary.

newshomepages.utils.get_bundle_list() → list[dict]

Get the full list of site bundles.

Returns a list of dictionaries.

newshomepages.utils.get_country(code: str) → dict

Get the metadata for the provided country.

Parameters

code (str) – The two-letter alpha code of the country.

Returns a dictionary.

newshomepages.utils.get_country_df() → pandas.core.frame.DataFrame

Get the list of countries.

Returns a pandas DataFrame.

newshomepages.utils.get_country_list() → list[dict]

Get the full list of countries.

Returns a list of dictionaries.

newshomepages.utils.get_extract_df(name: str, use_cache: bool = True, **kwargs) → pandas.core.frame.DataFrame

Read in the requested extract CSV as a DataFrame.

Get the full list of hyperlink files from our extracts.

Returns a DataFrame.

Get the full list of hyperlinks from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_javascript(handle: str) → str | None

Get the JavaScript file to run before the screenshot, if it exists.

Parameters

handle (str) – The Twitter handle of the site you want.

Returns a JavaScript string ready to be run. Or None, if no file exists.

newshomepages.utils.get_json_url(url: str)

Get JSON data from the provided URL.

newshomepages.utils.get_language_df() → pandas.core.frame.DataFrame

Get the list of languages.

Returns a pandas DataFrame.

newshomepages.utils.get_language_list() → list[dict]

Get the list of languages.

Returns a list of dictionaries.

newshomepages.utils.get_lighthouse_df() → pandas.core.frame.DataFrame

Get the full list of Lighthouse files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_lighthouse_list() → list[dict[str, typing.Any]]

Get the full list of Lighthouse audits from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_local_time(site: dict) → datetime.datetime

Get the current time in the provided site’s timezone.

Parameters

site (dict) – A site’s data dictionary.

Returns the current time as a timezone-aware datetime object.

newshomepages.utils.get_robotstxt_df() → pandas.core.frame.DataFrame

Get the full list of robots.txt files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_screenshot_df() → pandas.core.frame.DataFrame

Get the full list of screenshot files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_screenshot_list() → list[dict[str, typing.Any]]

Get the full list of screenshots from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_screenshots_by_site(site: dict) → list[dict]

Get the list of screenshots for the provided site.

Returns a list of dictionaries.

newshomepages.utils.get_site(handle: str) → dict

Get the metadata for the provided site.

Parameters

handle (str) – The Twitter handle of the site you want.

Returns a dictionary.

newshomepages.utils.get_site_df() → pandas.core.frame.DataFrame

Get the full list of sites.

Returns a DataFrame.

newshomepages.utils.get_site_list() → list[dict]

Get the full list of supported sites.

Returns a list of dictionaries.

newshomepages.utils.get_sites_in_batch(batch_number: int, batches: int = 10) → list[dict]

Get all the sites in the provided batch.

Parameters
  • batch_number (int) – The number of the batch to pull.

  • batches (int) – The total number of batches.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_bundle(slug: str) → list[dict]

Get all the sites in the provided bundle.

Parameters

slug (str) – The unique string identifier of the bundle.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_country(slug: str) → list[dict]

Get all the sites in the provided country.

Parameters

slug (str) – The two-letter alpha code of the country.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_language(code: str) → list[dict]

Get all the sites in the provided language.

Parameters

code (str) – The two-letter alpha code of the language.

Returns a list of site dictionaries.

newshomepages.utils.get_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False)

Get the provided URL.

Parameters
  • url (str) – The URL to fetch.

  • timeout (int) – The number of seconds to wait before timing out. (Default: 30)

  • user_agent (str) – The user agent to use in the request. (Default: None)

  • verbose (bool) – Whether or not to log the action prior to execution. (Default: False)

Returns a requests.Response object.

newshomepages.utils.get_user_agent() → str

Provide a user-agent string.

Returns a string ready to use as a header in a web request.

newshomepages.utils.get_wayback_df() → pandas.core.frame.DataFrame

Get the full list of wayback files from our extracts.

Returns a DataFrame.

newshomepages.utils.intcomma(value: int | str) → str

Convert an integer to a string containing commas every three digits.

For example, 3000 becomes ‘3,000’ and 45000 becomes ‘45,000’.

Parameters

value (int | str) – The integer to format.

Returns a string with the result.
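The same formatting can be reproduced with Python's built-in format specification. The intcomma_sketch helper below is hypothetical and only mirrors the documented behavior. It assumes string input holds a plain integer.

```python
def intcomma_sketch(value) -> str:
    """Format an integer (or integer-like string) with thousands separators."""
    # The "," format specifier inserts a comma every three digits.
    return f"{int(value):,}"


print(intcomma_sketch(3000))     # 3,000
print(intcomma_sketch("45000"))  # 45,000
```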

newshomepages.utils.numoji(number: int) → str

Convert a number into a series of emojis for Slack.

Parameters

number (int) – The number to convert into emoji

Returns an emoji string.
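One plausible way to build such a string is to map each digit to a Slack emoji shortcode. The sketch below is an assumption about the output format, not the library's code. The numoji_sketch name and the DIGIT_EMOJI table are hypothetical, and only non-negative integers are handled.

```python
# Hypothetical lookup table from digit characters to Slack emoji shortcodes.
DIGIT_EMOJI = {
    "0": ":zero:",
    "1": ":one:",
    "2": ":two:",
    "3": ":three:",
    "4": ":four:",
    "5": ":five:",
    "6": ":six:",
    "7": ":seven:",
    "8": ":eight:",
    "9": ":nine:",
}


def numoji_sketch(number: int) -> str:
    """Convert a non-negative integer into a run of digit emoji shortcodes."""
    # Walk the decimal digits left to right and join their shortcodes.
    return "".join(DIGIT_EMOJI[digit] for digit in str(number))


print(numoji_sketch(42))  # :four::two:
```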

newshomepages.utils.parse_archive_url(url: str) → dict

Parse the handle and timestamp from an archive.org URL.

Parameters

url (str) – An archive.org URL

Returns a dictionary with the identifier, handle and timestamp parsed out.

newshomepages.utils.safe_ia_handle(handle: str) → str

Sanitize a handle so it’s safe to use as an archive.org slug.

Parameters

handle (str) – The unique string identifier of the site.

Returns a lowercase string that’s ready to use.

newshomepages.utils.write_csv(dict_list: list[dict], path: Path, verbose: bool = True) → None

Write a list of dictionaries to a CSV file at the provided path.

Parameters
  • dict_list (list[dict]) – The list of dictionaries to write, one per row.

  • path (Path) – The filesystem Path where the CSV should be written.

  • verbose (bool) – Whether or not to log the action prior to execution. (Default: True)

Returns nothing.
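A minimal sketch of this kind of helper, built on the standard library's csv.DictWriter, follows. The write_csv_sketch name, the sample rows and the choice to create missing parent directories are assumptions for illustration. The sketch also assumes a non-empty list whose dictionaries share the same keys.

```python
import csv
import tempfile
from pathlib import Path


def write_csv_sketch(dict_list: list[dict], path: Path, verbose: bool = True) -> None:
    """Write a list of dictionaries to a CSV file, using the first row's keys as the header."""
    if verbose:
        print(f"Writing CSV to {path}")
    # Assumption: create the parent directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = list(dict_list[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(dict_list)


# Hypothetical rows, written to a throwaway directory and read back.
rows = [
    {"handle": "example_one", "country": "US"},
    {"handle": "example_two", "country": "GB"},
]
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "sites.csv"
    write_csv_sketch(rows, out, verbose=False)
    csv_lines = out.read_text(encoding="utf-8").splitlines()

print(csv_lines)
```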

newshomepages.utils.write_json(data: Any, path: pathlib.Path, indent: int = 2, verbose: bool = True) → None

Write JSON data to the provided path.

Parameters
  • data (Any) – Any Python object ready to be serialized as JSON.

  • path (Path) – The filesystem Path where the object should be written.

  • indent (int) – The number of spaces per indentation level in the JSON. (Default: 2)

  • verbose (bool) – Whether or not to log the action prior to execution. (Default: True)

Returns nothing.
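For comparison, a helper with this signature can be sketched with the standard library's json.dump. The write_json_sketch name and the sample data are hypothetical. Whether the real function creates missing parent directories is not stated above, so that step is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def write_json_sketch(data: Any, path: Path, indent: int = 2, verbose: bool = True) -> None:
    """Serialize data as JSON and write it to the provided path."""
    if verbose:
        print(f"Writing JSON to {path}")
    # Assumption: create the parent directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=indent)


# Hypothetical payload, written to a throwaway directory and read back.
payload = {"handle": "example", "batch": 1}
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "nested" / "site.json"
    write_json_sketch(payload, out, verbose=False)
    json_text = out.read_text(encoding="utf-8")

round_trip = json.loads(json_text)
print(json_text)
```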