News Homepages

Reference¶

Documentation for a selection of our system’s common internal tools.

Commands ¶

accessibility ¶

Save the accessibility tree of the provided site.

accessibility [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>¶

--verbose¶

Arguments

HANDLE¶: Required argument

adstxt ¶

Save the raw ads.txt of the provided site.

adstxt [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>¶

--timeout <timeout>¶

--verbose¶

Arguments

HANDLE¶: Required argument

analyze ¶

analyze [OPTIONS] COMMAND [ARGS]...

cli¶

Analyze the Drudge Report.

drudge-entities¶

Analyze Drudge entities.

analyze drudge-entities [OPTIONS]

Options

-o, --output-dir <output_dir>¶

drudge-hyperlinks¶

Analyze Drudge hyperlinks.

analyze drudge-hyperlinks [OPTIONS]

Options

-o, --output-dir <output_dir>¶

cli¶

Analyze Lighthouse reports.

lighthouse¶

Analyze Lighthouse scores.

analyze lighthouse [OPTIONS]

Options

-o, --output-dir <output_dir>¶

cli¶

Analyze US right-wing sources.

us-right-wing-hyperlinks¶

Analyze US right-wing hyperlinks.

analyze us-right-wing-hyperlinks [OPTIONS]

Options

-o, --output-dir <output_dir>¶

archive ¶

Save assets to an archive.org collection.

archive [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>¶

--latest¶: Crosspost to the latest archive.org item

--verbose¶: Display the upload progress to archive.org

--timeout <timeout>¶

Arguments

HANDLE¶: Required argument

batch ¶

Print a batch of sites.

batch [OPTIONS] COMMAND [ARGS]...

sites-by-batch¶

Print site handles in the provided batch as a JSON list.

batch sites-by-batch [OPTIONS] BATCH

Options

-b, --batches <batches>¶

Arguments

BATCH¶: Required argument

sites-by-bundle¶

Print site handles in the provided bundle as a JSON list.

batch sites-by-bundle [OPTIONS] BUNDLE

Arguments

BUNDLE¶: Required argument

sites-by-country¶

Print site handles in the provided country as a JSON list.

batch sites-by-country [OPTIONS] COUNTRY

Arguments

COUNTRY¶: Required argument

site ¶

site [OPTIONS] COMMAND [ARGS]...

cli¶

Create page ranking sites by Lighthouse accessibility score.

accessibility-ranking¶

Create page ranking sites by Lighthouse accessibility score.

site accessibility-ranking [OPTIONS]

cli¶

Create bundle detail pages.

bundle-detail¶

Create bundle detail pages.

site bundle-detail [OPTIONS]

cli¶

Create bundle list.

bundle-list¶

Create bundle list.

site bundle-list [OPTIONS]

cli¶

Create country detail pages.

country-detail¶

Create country detail pages.

site country-detail [OPTIONS]

cli¶

Create country list.

country-list¶

Create country list.

site country-list [OPTIONS]

cli¶

Create page ranking sites by appearance on drudgereport.com.

drudge¶

Create page ranking sites by appearance on drudgereport.com.

site drudge [OPTIONS]

cli¶

Create language detail pages.

language-detail¶

Create language detail pages.

site language-detail [OPTIONS]

cli¶

Create language list.

language-list¶

Create language list.

site language-list [OPTIONS]

cli¶

Create a page showing all of the latest screenshots.

latest-screenshots¶

Create a page showing all of the latest screenshots.

site latest-screenshots [OPTIONS]

cli¶

Create a page tracking AI blockers based on most recently scraped robots.txt files.

openai¶

Create a page tracking AI blockers based on most recently scraped robots.txt files.

site openai [OPTIONS]

Options

--no-cache¶

cli¶

Create page ranking sites by Lighthouse performance score.

performance-ranking¶

Create page ranking sites by Lighthouse performance score.

site performance-ranking [OPTIONS]

cli¶

Create source detail pages.

site-detail¶

Create source detail pages.

site site-detail [OPTIONS]

cli¶

Create source list.

source-list¶

Create source list.

site source-list [OPTIONS]

cli¶

Create a status report.

status-report¶

Create a status report.

site status-report [OPTIONS]

extract ¶

extract [OPTIONS] COMMAND [ARGS]...

cli¶

Consolidate Internet Archive metadata into CSV files.

consolidate¶

Consolidate Internet Archive metadata into CSV files.

extract consolidate [OPTIONS]

Options

-o, --output-dir <output_dir>¶

cli¶

Download items from our archive.org collection as JSON.

items¶

Download items from our archive.org collection as JSON.

extract items [OPTIONS] HANDLE

Options

-y, --year <year>¶

-o, --output-dir <output_dir>¶

Arguments

HANDLE¶: Required argument

cli¶

Download and parse the provided site’s accessibility files.

accessibility¶

Download and parse the provided site’s accessibility files.

extract accessibility [OPTIONS] HANDLE

Arguments

HANDLE¶: Required argument

cli¶

Download and parse the provided site’s hyperlinks files.

hyperlinks¶

Download and parse the provided site’s hyperlinks files.

extract hyperlinks [OPTIONS]

Options

-s, --site <site>¶

-c, --country <country>¶

-l, --language <language>¶

-b, --bundle <bundle>¶

-d, --days <days>¶

-u, --use-cache¶

-o, --output-path <output_path>¶

cli¶

Download and parse the provided site’s Lighthouse files.

lighthouse¶

Download and parse the provided site’s Lighthouse files.

extract lighthouse [OPTIONS]

Options

--site <site>¶

--country <country>¶

--language <language>¶

--bundle <bundle>¶

--days <days>¶

-o, --output-path <output_path>¶

cli¶

Download and parse the provided site’s robots.txt files.

robotstxt¶

Download and parse archived robots.txt files.

extract robotstxt [OPTIONS]

Options

--site <site>¶

--country <country>¶

--language <language>¶

--bundle <bundle>¶

--days <days>¶

--latest¶

--monthly¶

-o, --output-path <output_path>¶

--no-cache¶

--verbose¶

cli¶

Create a status report.

status-report¶

Create a status report.

extract status-report [OPTIONS]

Options

-o, --output-dir <output_dir>¶

-c, --use-cache¶

cli¶

Download and parse the provided site’s Wayback Machine files.

wayback¶

Download and parse the provided site’s Wayback Machine files.

extract wayback [OPTIONS] HANDLE

Arguments

HANDLE¶: Required argument

hyperlinks ¶

Save all of a site’s hyperlinks as JSON.

hyperlinks [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>¶

-v, --verbose¶

Arguments

HANDLE¶: Required argument

mosaic ¶

Create image mosaics.

mosaic [OPTIONS] COMMAND [ARGS]...

gif¶

Combine images into a mosaic GIF.

mosaic gif [OPTIONS]

Options

-i, --input-dir <input_dir>¶

-o, --output-dir <output_dir>¶

jpg¶

Combine images into jpgs ready for Twitter.

mosaic jpg [OPTIONS]

Options

-i, --input-dir <input_dir>¶

-o, --output-dir <output_dir>¶

robotstxt ¶

Save the raw robots.txt of the provided site.

robotstxt [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>¶

--timeout <timeout>¶

--verbose¶

Arguments

HANDLE¶: Required argument

screenshot ¶

Screenshot the provided homepage.

screenshot [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>¶

-w, --wait <wait>¶

-x, --width <width>¶

-y, --height <height>¶

-f, --full-page¶: Screenshot the whole page

--verbose¶: Print verbose output

Arguments

HANDLE¶: Required argument

slack ¶

Post image to Slack channel.

slack [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>¶

-v, --verbose¶

-t, --timeout <timeout>¶

Arguments

HANDLE¶: Required argument

toot ¶

Send a toot.

toot [OPTIONS] COMMAND [ARGS]...

bundle¶

Toot sources in batches of four.

toot bundle [OPTIONS] SLUG

Options

-i, --input-dir <input_dir>¶

Arguments

SLUG¶: Required argument

single¶

Toot a single source.

toot single [OPTIONS] HANDLE

Options

-i, --input-dir <input_dir>¶

Arguments

HANDLE¶: Required argument

wayback ¶

Archive a URL in the Wayback Machine.

wayback [OPTIONS] HANDLE

Options

-o, --output-dir <output_dir>¶

--verbose¶

Arguments

HANDLE¶: Required argument

Utilities ¶

The utils module contains a variety of functions used by our commands.

newshomepages.utils.batch(li: list, n: int)¶: Yield n sequential chunks from li.
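
One way to read "n sequential chunks" is sketched below. This is not the library's implementation: `batch_sketch` is a hypothetical stand-in, and the choice to give the earliest chunks the extra items when the list does not divide evenly is an assumption.

```python
from typing import Iterator


def batch_sketch(li: list, n: int) -> Iterator[list]:
    """Yield n sequential chunks from li (illustrative stand-in, not the real function)."""
    # Assumption: chunk sizes differ by at most one, with larger chunks first.
    size, remainder = divmod(len(li), n)
    start = 0
    for i in range(n):
        end = start + size + (1 if i < remainder else 0)
        yield li[start:end]
        start = end


chunks = list(batch_sketch(list(range(10)), 3))
print(chunks)
```

Because the chunks are sequential slices, concatenating them in order gives back the original list.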

newshomepages.utils.chunk(iterable: list, length: int) → list[list]¶

Split the provided list into chunks of the provided length.

Args::
iterable (list): The master list to split.
length (int): The size of the chunks you want.

Returns a list of lists.
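
The behavior described for `chunk` can be sketched with plain slicing. The name `chunk_sketch` is hypothetical, and the assumption here is that a final, shorter chunk is kept rather than dropped.

```python
def chunk_sketch(iterable: list, length: int) -> list[list]:
    """Split a list into chunks of at most `length` items (illustrative only)."""
    # Step through the list in strides of `length` and slice out each chunk.
    return [iterable[i : i + length] for i in range(0, len(iterable), length)]


pieces = chunk_sketch(["a", "b", "c", "d", "e"], 2)
print(pieces)
```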

newshomepages.utils.download_url(url: str, output_path: Path, timeout: int = 180)¶: Download the provided URL to the provided path.

newshomepages.utils.get_accessibility_df(use_cache: bool = True, verbose: bool = False) → DataFrame¶

Get the full list of accessibility files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_accessibility_list() → list[dict[str, Any]]¶

Get the full list of accessibility files from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_bundle(slug: str) → dict¶

Get the metadata for the provided bundle.

Args:: slug (str): The unique string identifier of the bundle.

Returns a dictionary.

newshomepages.utils.get_bundle_list() → list[dict]¶

Get the full list of site bundles.

Returns a list of dictionaries.

newshomepages.utils.get_country(code: str) → dict¶

Get the metadata for the provided country.

Args:: code (str): The two-letter alpha code of the country.

Returns a dictionary.

newshomepages.utils.get_country_df() → DataFrame¶

Get the list of countries.

Returns a pandas DataFrame.

newshomepages.utils.get_country_list() → list[dict]¶

Get the full list of countries.

Returns a list of dictionaries.

newshomepages.utils.get_extract_df(name: str, use_cache: bool = True, **kwargs) → DataFrame¶: Read in the requested extract CSV as a DataFrame.

newshomepages.utils.get_flag_emoji(alpha2: str) → str¶

Get the flag emoji for the provided country.

Args:: alpha2 (str): The two-letter ISO code for the country.
Returns:: A string containing the emoji.
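
A common way to build a flag emoji from a two-letter ISO code is to map each letter to its Unicode regional indicator symbol. The sketch below shows that technique under the assumption that the input is two ASCII letters. `flag_emoji_sketch` is a hypothetical name, and the library may do this differently.

```python
def flag_emoji_sketch(alpha2: str) -> str:
    """Map a two-letter ISO country code to its flag emoji (illustrative only)."""
    # Regional indicator symbols start at U+1F1E6, which corresponds to "A".
    offset = 0x1F1E6 - ord("A")
    return "".join(chr(ord(letter) + offset) for letter in alpha2.upper())


flag = flag_emoji_sketch("us")
print(flag)
```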

newshomepages.utils.get_hyperlink_df(use_cache: bool = True, verbose: bool = False) → DataFrame¶

Get the full list of hyperlink files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_hyperlink_list() → list[dict[str, Any]]¶

Get the full list of hyperlink files from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_javascript(handle: str) → str | None¶

Get the JavaScript file to run before the screenshot, if it exists.

Args:: handle (str): The Twitter handle of the site you want.

Returns a JavaScript string ready to be run. Or None, if no file exists.

newshomepages.utils.get_json_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False) → Any¶

Get JSON data from the provided URL.

Args::
url (str): The URL to request.
timeout (int): How long to wait before timing out.
user_agent (str): The user agent to provide in the request headers. None by default.
verbose (bool): Whether or not to print verbose output.
Returns:: The JSON response as a Python object.

newshomepages.utils.get_language_df() → DataFrame¶

Get the list of languages.

Returns a pandas DataFrame.

newshomepages.utils.get_language_list() → list[dict]¶

Get the list of languages.

Returns a list of dictionaries.

newshomepages.utils.get_lighthouse_df(use_cache: bool = True, verbose: bool = False) → DataFrame¶

Get the full list of Lighthouse files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_lighthouse_list() → list[dict[str, Any]]¶

Get the full list of Lighthouse audits from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_local_time(site: dict) → datetime¶

Get the current time in the provided site’s timezone.

Args:: site (dict): A site’s data dictionary.

Returns the current time as a timezone-aware datetime object.

newshomepages.utils.get_robotstxt_df(use_cache: bool = True, verbose: bool = False) → DataFrame¶

Get the full list of robots.txt files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_screenshot_df(use_cache: bool = True, verbose: bool = False) → DataFrame¶

Get the full list of screenshot files from our extracts.

Returns a DataFrame.

newshomepages.utils.get_screenshot_list() → list[dict[str, Any]]¶

Get the full list of screenshots from our extracts.

Returns a list of dictionaries.

newshomepages.utils.get_screenshots_by_site(site: dict) → list[dict]¶

Get the list of screenshots for the provided site.

Returns a list of dictionaries.

newshomepages.utils.get_site(handle: str) → dict¶

Get the metadata for the provided site.

Args:: handle (str): The handle of the site you want.

Returns a dictionary.

newshomepages.utils.get_site_df() → DataFrame¶

Get the full list of sites.

Returns a DataFrame.

newshomepages.utils.get_site_list() → list[dict]¶

Get the full list of supported sites.

Returns a list of dictionaries.

newshomepages.utils.get_sites_in_batch(batch_number: int, batches: int = 10) → list[dict]¶

Get all the sites in the provided batch.

Args::
batch_number (int): The number of the batch to pull.
batches (int): The total number of batches.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_bundle(slug: str) → list[dict]¶

Get all the sites in the provided bundle.

Args:: slug (str): The unique string identifier of the bundle.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_country(slug: str) → list[dict]¶

Get all the sites in the provided country.

Args:: slug (str): The two-letter alpha code of the country.

Returns a list of site dictionaries.

newshomepages.utils.get_sites_in_language(code: str) → list[dict]¶

Get all the sites in the provided language.

Args:: code (str): The two-letter alpha code of the language.

Returns a list of site dictionaries.

newshomepages.utils.get_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False) → Response¶

Get the provided URL.

Args::
url (str): The URL to fetch.
timeout (int): The number of seconds to wait before timing out. (Default: 30)
user_agent (str): The user agent to use in the request. (Default: None)
verbose (bool): Whether or not to log the action prior to execution. (Default: False)

Returns a requests.Response object.

newshomepages.utils.get_user_agent() → str¶

Provide a user-agent string.

Returns a string ready to use as a header in a web request.

newshomepages.utils.get_wayback_df(use_cache: bool = True, verbose: bool = False) → DataFrame¶

Get the full list of wayback files from our extracts.

Returns a DataFrame.

newshomepages.utils.intcomma(value: int | str) → str¶

Convert an integer to a string containing commas every three digits.

For example, 3000 becomes ‘3,000’ and 45000 becomes ‘45,000’.

Args:: value (int | str): The integer to format.

Returns a string with the result.
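
The formatting described above can be reproduced with Python's built-in thousands separator. This is a minimal sketch under the assumption that string inputs are plain base-10 integers. `intcomma_sketch` is a hypothetical name, not the library function.

```python
from typing import Union


def intcomma_sketch(value: Union[int, str]) -> str:
    """Format an integer with commas every three digits (illustrative only)."""
    # The "," format specifier inserts a comma as the thousands separator.
    return f"{int(value):,}"


print(intcomma_sketch(3000))
print(intcomma_sketch("45000"))
```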

newshomepages.utils.numoji(number: int) → str¶

Convert a number into a series of emojis for Slack.

Args:: number (int): The number to convert into emoji

Returns an emoji string.
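
The exact output format of `numoji` is not documented here. The sketch below assumes it emits one Slack digit shortcode (such as `:four:`) per digit and that the number is non-negative. Both are assumptions, and `numoji_sketch` is a hypothetical name.

```python
# Slack's standard shortcodes for the digits 0 through 9.
DIGIT_NAMES = [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]


def numoji_sketch(number: int) -> str:
    """Turn a non-negative integer into a run of Slack digit emoji (illustrative only)."""
    # Assumption: each decimal digit maps to its own shortcode, in order.
    return "".join(f":{DIGIT_NAMES[int(digit)]}:" for digit in str(number))


emoji_string = numoji_sketch(42)
print(emoji_string)
```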

newshomepages.utils.parse_archive_url(url: str) → dict¶

Parse the handle and timestamp from an archive.org URL.

Args:: url (str): An archive.org URL

Returns a dictionary with the identifier, handle and timestamp parsed out.

newshomepages.utils.safe_ia_handle(handle: str) → str¶

Sanitize a handle so it’s safe to use as an archive.org slug.

Args:: handle (str): The unique string identifier of the site.

Returns a lowercase string that’s ready to use.

newshomepages.utils.write_csv(dict_list: list[dict], path: Path, verbose: bool = True) → None¶

Write a list of dictionaries to a CSV file at the provided path.

Args::
dict_list (list[dict]): The list of dictionaries to write, one per row.
path (Path): The filesystem Path where the file should be written.
verbose (bool): Whether or not to log the action prior to execution. (Default: True)

Returns nothing.
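
A function with this signature could be built on the standard library's `csv.DictWriter`. The sketch below assumes that the header comes from the first dictionary's keys and that parent directories are created as needed. Neither detail is confirmed by the reference above, and `write_csv_sketch` is a hypothetical name.

```python
import csv
import tempfile
from pathlib import Path


def write_csv_sketch(dict_list: list[dict], path: Path, verbose: bool = True) -> None:
    """Write a list of dictionaries to a CSV file (illustrative only)."""
    if verbose:
        print(f"Writing {len(dict_list)} rows to {path}")
    # Assumption: make sure the target directory exists before writing.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: the first row's keys define the column order.
    fieldnames = list(dict_list[0].keys()) if dict_list else []
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(dict_list)


# Usage with a made-up row written to a temporary directory.
out_path = Path(tempfile.mkdtemp()) / "sites.csv"
write_csv_sketch([{"handle": "example", "country": "US"}], out_path, verbose=False)
print(out_path.read_text())
```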

newshomepages.utils.write_json(data: Any, path: Path, indent: int = 2, verbose: bool = True) → None¶

Write JSON data to the provided path.

Args::
data (Any): Any Python object ready to be serialized as JSON.
path (Path): The filesystem Path where the object should be written.
indent (int): The number of indentations to include in the JSON. (Default: 2)
verbose (bool): Whether or not to log the action prior to execution. (Default: True)

Returns nothing.
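
The same signature can be sketched with `json.dump`. As with the other examples, `write_json_sketch` is a hypothetical stand-in, and creating the parent directory is an assumption rather than documented behavior.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def write_json_sketch(data: Any, path: Path, indent: int = 2, verbose: bool = True) -> None:
    """Serialize data as JSON at the provided path (illustrative only)."""
    if verbose:
        print(f"Writing JSON to {path}")
    # Assumption: make sure the target directory exists before writing.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as fh:
        json.dump(data, fh, indent=indent)


# Usage with a made-up record written to a temporary directory.
json_path = Path(tempfile.mkdtemp()) / "site.json"
write_json_sketch({"handle": "example"}, json_path, verbose=False)
print(json_path.read_text())
```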
