Reference¶
Documentation for a selection of our system’s common internal tools.
Commands¶
accessibility¶
Save the accessibility tree of the provided site.
accessibility [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>¶
- --verbose¶
Arguments
- HANDLE¶
Required argument
adstxt¶
Save the raw ads.txt of the provided site.
adstxt [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>¶
- --timeout <timeout>¶
- --verbose¶
Arguments
- HANDLE¶
Required argument
analyze¶
analyze [OPTIONS] COMMAND [ARGS]...
cli¶
Analyze the Drudge Report.
drudge-entities¶
Analyze Drudge entities.
analyze drudge-entities [OPTIONS]
Options
- -o, --output-dir <output_dir>¶
drudge-hyperlinks¶
Analyze Drudge hyperlinks.
analyze drudge-hyperlinks [OPTIONS]
Options
- -o, --output-dir <output_dir>¶
cli¶
Analyze Lighthouse reports.
lighthouse¶
Analyze Lighthouse scores.
analyze lighthouse [OPTIONS]
Options
- -o, --output-dir <output_dir>¶
cli¶
Analyze US right wing sources.
us-right-wing-hyperlinks¶
Analyze US Right Wing hyperlinks.
analyze us-right-wing-hyperlinks [OPTIONS]
Options
- -o, --output-dir <output_dir>¶
archive¶
Save assets to an archive.org collection.
archive [OPTIONS] HANDLE
Options
- -i, --input-dir <input_dir>¶
- --latest¶
Crosspost to the latest archive.org item
- --verbose¶
Display the upload progress to archive.org
- --timeout <timeout>¶
Arguments
- HANDLE¶
Required argument
batch¶
Print a batch of sites.
batch [OPTIONS] COMMAND [ARGS]...
sites-by-batch¶
Print site handles in the provided batch as a JSON list.
batch sites-by-batch [OPTIONS] BATCH
Options
- -b, --batches <batches>¶
Arguments
- BATCH¶
Required argument
sites-by-bundle¶
Print site handles in the provided bundle as a JSON list.
batch sites-by-bundle [OPTIONS] BUNDLE
Arguments
- BUNDLE¶
Required argument
sites-by-country¶
Print site handles in the provided country as a JSON list.
batch sites-by-country [OPTIONS] COUNTRY
Arguments
- COUNTRY¶
Required argument
site¶
site [OPTIONS] COMMAND [ARGS]...
accessibility-ranking¶
Create page ranking sites by Lighthouse accessibility score.
site accessibility-ranking [OPTIONS]
bundle-detail¶
Create bundle detail pages.
site bundle-detail [OPTIONS]
bundle-list¶
Create bundle list.
site bundle-list [OPTIONS]
country-detail¶
Create country detail pages.
site country-detail [OPTIONS]
country-list¶
Create country list.
site country-list [OPTIONS]
drudge¶
Create page ranking sites by appearance on drudgereport.com.
site drudge [OPTIONS]
language-detail¶
Create language detail pages.
site language-detail [OPTIONS]
language-list¶
Create language list.
site language-list [OPTIONS]
latest-screenshots¶
Create a page showing all of the latest screenshots.
site latest-screenshots [OPTIONS]
openai¶
Create a page tracking AI blockers based on most recently scraped robots.txt files.
site openai [OPTIONS]
Options
- --no-cache¶
performance-ranking¶
Create page ranking sites by Lighthouse performance score.
site performance-ranking [OPTIONS]
site-detail¶
Create source detail pages.
site site-detail [OPTIONS]
source-list¶
Create source list.
site source-list [OPTIONS]
status-report¶
Create a status report.
site status-report [OPTIONS]
hyperlinks¶
Save all of a site’s hyperlinks as JSON.
hyperlinks [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>¶
- -v, --verbose¶
Arguments
- HANDLE¶
Required argument
mosaic¶
Create image mosaics.
mosaic [OPTIONS] COMMAND [ARGS]...
gif¶
Combine images into a mosaic GIF.
mosaic gif [OPTIONS]
Options
- -i, --input-dir <input_dir>¶
- -o, --output-dir <output_dir>¶
jpg¶
Combine images into JPGs ready for Twitter.
mosaic jpg [OPTIONS]
Options
- -i, --input-dir <input_dir>¶
- -o, --output-dir <output_dir>¶
robotstxt¶
Save the raw robots.txt of the provided site.
robotstxt [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>¶
- --timeout <timeout>¶
- --verbose¶
Arguments
- HANDLE¶
Required argument
rss¶
Create RSS feeds.
rss [OPTIONS] COMMAND [ARGS]...
bundles¶
Create bundle feeds.
rss bundles [OPTIONS]
countries¶
Create country feeds.
rss countries [OPTIONS]
opml¶
Create an OPML file with all site feeds.
rss opml [OPTIONS]
sites¶
Create site feeds.
rss sites [OPTIONS]
screenshot¶
Screenshot the provided homepage.
screenshot [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>¶
- -w, --wait <wait>¶
- -x, --width <width>¶
- -y, --height <height>¶
- -f, --full-page¶
Screenshot the whole page
- --verbose¶
Print verbose output
Arguments
- HANDLE¶
Required argument
slack¶
Post image to Slack channel.
slack [OPTIONS] HANDLE
Options
- -i, --input-dir <input_dir>¶
- -v, --verbose¶
- -t, --timeout <timeout>¶
Arguments
- HANDLE¶
Required argument
toot¶
Send a toot.
toot [OPTIONS] COMMAND [ARGS]...
bundle¶
Toot sources in batches of four.
toot bundle [OPTIONS] SLUG
Options
- -i, --input-dir <input_dir>¶
Arguments
- SLUG¶
Required argument
single¶
Toot a single source.
toot single [OPTIONS] HANDLE
Options
- -i, --input-dir <input_dir>¶
Arguments
- HANDLE¶
Required argument
wayback¶
Archive a URL in the Wayback Machine.
wayback [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>¶
- --verbose¶
Arguments
- HANDLE¶
Required argument
Utilities¶
The utils module contains a variety of functions used by our commands.
- newshomepages.utils.batch(li: list, n: int)¶
Yield n sequential chunks from li.
- newshomepages.utils.chunk(iterable: list, length: int) list[list] ¶
Split the provided list into chunks of the provided length.
- Args:
iterable (list): The master list to split.
length (int): The size of the chunks you want.
Returns a list of lists.
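The chunking behavior described above can be illustrated with a minimal sketch. This is a stand-in written for this reference, not the library's source: the name chunk_list is hypothetical, and the choice to raise on a non-positive length is an assumption.

```python
from __future__ import annotations


def chunk_list(iterable: list, length: int) -> list[list]:
    """Split a list into consecutive chunks of at most `length` items.

    Hypothetical stand-in for newshomepages.utils.chunk.
    """
    # Assumption: a chunk size below 1 is treated as a caller error.
    if length < 1:
        raise ValueError("length must be at least 1")
    # Step through the list in strides of `length`; the last chunk may be shorter.
    return [iterable[i : i + length] for i in range(0, len(iterable), length)]


print(chunk_list(["a", "b", "c", "d", "e"], 2))
# [['a', 'b'], ['c', 'd'], ['e']]
```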
- newshomepages.utils.download_url(url: str, output_path: Path, timeout: int = 180)¶
Download the provided URL to the provided path.
- newshomepages.utils.get_accessibility_df(use_cache: bool = True, verbose: bool = False) DataFrame ¶
Get the full list of accessibility files from our extracts.
Returns a DataFrame.
- newshomepages.utils.get_accessibility_list() list[dict[str, Any]] ¶
Get the full list of accessibility files from our extracts.
Returns a list of dictionaries.
- newshomepages.utils.get_bundle(slug: str) dict ¶
Get the metadata for the provided bundle.
- Args:
slug (str): The unique string identifier of the bundle.
Returns a dictionary.
- newshomepages.utils.get_bundle_list() list[dict] ¶
Get the full list of site bundles.
Returns a list of dictionaries.
- newshomepages.utils.get_country(code: str) dict ¶
Get the metadata for the provided country.
- Args:
code (str): The two-letter alpha code of the country.
Returns a dictionary.
- newshomepages.utils.get_country_df() DataFrame ¶
Get the list of countries.
Returns a pandas DataFrame.
- newshomepages.utils.get_country_list() list[dict] ¶
Get the full list of countries.
Returns a list of dictionaries.
- newshomepages.utils.get_extract_df(name: str, use_cache: bool = True, **kwargs) DataFrame ¶
Read in the requested extract CSV as a DataFrame.
- newshomepages.utils.get_flag_emoji(alpha2: str) str ¶
Get the flag emoji for the provided country.
- Args:
alpha2 (str): The two-letter ISO code for the country.
- Returns:
A string containing the emoji.
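Flag emoji are built from two Unicode regional indicator symbols, one per letter of the country code. The sketch below shows that mapping. The function name flag_emoji and its input validation are assumptions made for illustration and may differ from what the library does.

```python
def flag_emoji(alpha2: str) -> str:
    """Map a two-letter ISO country code to its flag emoji.

    Hypothetical stand-in for newshomepages.utils.get_flag_emoji.
    """
    code = alpha2.strip().upper()
    # Assumption: anything other than two ASCII letters is rejected.
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError(f"Expected a two-letter code, got {alpha2!r}")
    # Regional indicator symbols start at U+1F1E6, which corresponds to "A".
    offset = 0x1F1E6 - ord("A")
    return "".join(chr(ord(letter) + offset) for letter in code)


print(flag_emoji("us"))
```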
- newshomepages.utils.get_hyperlink_df(use_cache: bool = True, verbose: bool = False) DataFrame ¶
Get the full list of hyperlink files from our extracts.
Returns a DataFrame.
- newshomepages.utils.get_hyperlink_list() list[dict[str, Any]] ¶
Get the full list of hyperlink files from our extracts.
Returns a list of dictionaries.
- newshomepages.utils.get_javascript(handle: str) str | None ¶
Get the JavaScript file to run before the screenshot, if it exists.
- Args:
handle (str): The Twitter handle of the site you want.
Returns a JavaScript string ready to be run. Or None, if no file exists.
- newshomepages.utils.get_json_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False) Any ¶
Get JSON data from the provided URL.
- Args:
url (str): The URL to request.
timeout (int): How long to wait before timing out.
user_agent (str): The user agent to provide in the request headers. None by default.
verbose (bool): Whether or not to print verbose output.
- Returns:
The JSON response as a Python object.
- newshomepages.utils.get_language_df() DataFrame ¶
Get the list of languages.
Returns a pandas DataFrame.
- newshomepages.utils.get_language_list() list[dict] ¶
Get the list of languages.
Returns a list of dictionaries.
- newshomepages.utils.get_lighthouse_df(use_cache: bool = True, verbose: bool = False) DataFrame ¶
Get the full list of Lighthouse files from our extracts.
Returns a DataFrame.
- newshomepages.utils.get_lighthouse_list() list[dict[str, Any]] ¶
Get the full list of lighthouse audits from our extracts.
Returns a list of dictionaries.
- newshomepages.utils.get_local_time(site: dict) datetime ¶
Get the current time in the provided site’s timezone.
- Args:
site (dict): A site’s data dictionary.
Returns the current time as a timezone-aware datetime object.
- newshomepages.utils.get_robotstxt_df(use_cache: bool = True, verbose: bool = False) DataFrame ¶
Get the full list of robots.txt files from our extracts.
Returns a DataFrame.
- newshomepages.utils.get_screenshot_df(use_cache: bool = True, verbose: bool = False) DataFrame ¶
Get the full list of screenshot files from our extracts.
Returns a DataFrame.
- newshomepages.utils.get_screenshot_list() list[dict[str, Any]] ¶
Get the full list of screenshots from our extracts.
Returns a list of dictionaries.
- newshomepages.utils.get_screenshots_by_site(site: dict) list[dict] ¶
Get the list of screenshots for the provided site.
Returns a list of dictionaries.
- newshomepages.utils.get_site(handle: str) dict ¶
Get the metadata for the provided site.
- Args:
handle (str): The handle of the site you want.
Returns a dictionary.
- newshomepages.utils.get_site_df() DataFrame ¶
Get the full list of sites.
Returns a DataFrame.
- newshomepages.utils.get_site_list() list[dict] ¶
Get the full list of supported sites.
Returns a list of dictionaries.
- newshomepages.utils.get_sites_in_batch(batch_number: int, batches: int = 10) list[dict] ¶
Get all the sites in the provided batch.
- Args:
batch_number (int): The number of the batch to pull.
batches (int): The total number of batches.
Returns a list of site dictionaries.
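One plausible way to pull a batch is to split the site list into the requested number of near-equal sequential groups and return the group at the requested position. The sketch below does that under stated assumptions: the helper names are hypothetical, batch_number is treated as zero-based, and the site list is passed in rather than loaded from the library. The actual function may index or divide the list differently.

```python
from __future__ import annotations


def split_into_batches(items: list, batches: int) -> list[list]:
    """Split items into `batches` sequential groups of near-equal size."""
    if batches < 1:
        raise ValueError("batches must be at least 1")
    size, extra = divmod(len(items), batches)
    groups = []
    start = 0
    for i in range(batches):
        # The first `extra` groups each take one additional item.
        end = start + size + (1 if i < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def sites_in_batch(sites: list[dict], batch_number: int, batches: int = 10) -> list[dict]:
    """Return one group of sites. Assumption: batch_number is zero-based."""
    if not 0 <= batch_number < batches:
        raise ValueError("batch_number must be between 0 and batches - 1")
    return split_into_batches(sites, batches)[batch_number]


# Made-up site dictionaries, for illustration only.
example_sites = [{"handle": f"site{i}"} for i in range(7)]
print([s["handle"] for s in sites_in_batch(example_sites, 0, batches=3)])
# ['site0', 'site1', 'site2']
```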
- newshomepages.utils.get_sites_in_bundle(slug: str) list[dict] ¶
Get all the sites in the provided bundle.
- Args:
slug (str): The unique string identifier of the bundle.
Returns a list of site dictionaries.
- newshomepages.utils.get_sites_in_country(slug: str) list[dict] ¶
Get all the sites in the provided country.
- Args:
slug (str): The two-letter alpha code of the country.
Returns a list of site dictionaries.
- newshomepages.utils.get_sites_in_language(code: str) list[dict] ¶
Get all the sites in the provided language.
- Args:
code (str): The two-letter alpha code of the language.
Returns a list of site dictionaries.
- newshomepages.utils.get_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False) Response ¶
Get the provided URL.
- Args:
url (str): The URL to fetch.
timeout (int): The number of seconds to wait before timing out. (Default: 30)
user_agent (str): The user agent to use in the request. (Default: None)
verbose (bool): Whether or not to log the action prior to execution. (Default: False)
Returns a requests.Response object.
- newshomepages.utils.get_user_agent() str ¶
Provide a user-agent string.
Returns a string ready to use as a header in a web request.
- newshomepages.utils.get_wayback_df(use_cache: bool = True, verbose: bool = False) DataFrame ¶
Get the full list of wayback files from our extracts.
Returns a DataFrame.
- newshomepages.utils.intcomma(value: int | str) str ¶
Convert an integer to a string containing commas every three digits.
For example, 3000 becomes ‘3,000’ and 45000 becomes ‘45,000’.
- Args:
value (int | str): The integer to format.
Returns a string with the result.
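The formatting described above can be reproduced with Python's built-in thousands-separator format specifier. This is a minimal sketch, not the library's source. The name format_with_commas is hypothetical, and converting string input with int() is an assumption based on the int | str signature.

```python
from __future__ import annotations


def format_with_commas(value: int | str) -> str:
    """Format an integer with a comma every three digits.

    Hypothetical stand-in for newshomepages.utils.intcomma.
    """
    # Assumption: string input holds a plain integer such as "45000".
    return f"{int(value):,}"


print(format_with_commas(3000))     # 3,000
print(format_with_commas("45000"))  # 45,000
```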
- newshomepages.utils.numoji(number: int) str ¶
Convert a number into a series of emojis for Slack.
- Args:
number (int): The number to convert into emoji
Returns an emoji string.
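A conversion like this can be done digit by digit. The sketch below assumes the output uses Slack's digit shortcodes (:zero: through :nine:) and that negative numbers are rejected. Both are guesses for illustration, and the name number_to_slack_emoji is hypothetical.

```python
# Slack shortcodes for the digits 0 through 9, indexed by digit.
DIGIT_NAMES = [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]


def number_to_slack_emoji(number: int) -> str:
    """Spell out each digit of a non-negative integer as a Slack emoji shortcode.

    Hypothetical stand-in for newshomepages.utils.numoji.
    """
    # Assumption: only non-negative integers are supported.
    if number < 0:
        raise ValueError("number must not be negative")
    return "".join(f":{DIGIT_NAMES[int(digit)]}:" for digit in str(number))


print(number_to_slack_emoji(42))  # :four::two:
```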
- newshomepages.utils.parse_archive_url(url: str) dict ¶
Parse the handle and timestamp from an archive.org URL.
- Args:
url (str): An archive.org URL
Returns a dictionary with the identifier, handle and timestamp parsed out.
- newshomepages.utils.safe_ia_handle(handle: str) str ¶
Sanitize a handle so it’s safe to use as an archive.org slug.
- Args:
handle (str): The unique string identifier of the site.
Returns a lowercase string that’s ready to use.
- newshomepages.utils.write_csv(dict_list: list[dict], path: Path, verbose: bool = True) None ¶
Write a list of dictionaries to a CSV file at the provided path.
- Args:
dict_list (list[dict]): The list of dictionaries to write, one per row.
path (Path): The filesystem Path where the CSV should be written.
verbose (bool): Whether or not to log the action prior to execution. (Default: True)
Returns nothing.
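A writer with this signature can be sketched with the standard library's csv.DictWriter. The version below is illustrative only: the name write_rows_to_csv is hypothetical, and the choices to take column names from the first row, create missing parent directories and log with print are all assumptions.

```python
from __future__ import annotations

import csv
import tempfile
from pathlib import Path


def write_rows_to_csv(dict_list: list[dict], path: Path, verbose: bool = True) -> None:
    """Write a list of dictionaries to a CSV file, one row per dictionary.

    Hypothetical stand-in for newshomepages.utils.write_csv.
    """
    if verbose:
        print(f"Writing {len(dict_list)} rows to {path}")
    # Assumption: missing parent directories are created on demand.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: the first row's keys define the column order.
    fieldnames = list(dict_list[0]) if dict_list else []
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(dict_list)


# Usage: write made-up rows to a temporary directory and read them back.
rows = [{"handle": "example", "country": "US"}]
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "extracts" / "sites.csv"
    write_rows_to_csv(rows, out, verbose=False)
    with open(out, newline="", encoding="utf-8") as fh:
        csv_round_trip = list(csv.DictReader(fh))
```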
- newshomepages.utils.write_json(data: Any, path: Path, indent: int = 2, verbose: bool = True) None ¶
Write JSON data to the provided path.
- Args:
data (Any): Any Python object ready to be serialized as JSON.
path (Path): The filesystem Path where the object should be written.
indent (int): The number of spaces used to indent the JSON. (Default: 2)
verbose (bool): Whether or not to log the action prior to execution. (Default: True)
Returns nothing.
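The same pattern applies to JSON output. This minimal sketch uses json.dump from the standard library. The name write_object_to_json is hypothetical, and creating parent directories and logging with print are assumptions rather than documented behavior.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any


def write_object_to_json(data: Any, path: Path, indent: int = 2, verbose: bool = True) -> None:
    """Serialize a Python object as JSON at the provided path.

    Hypothetical stand-in for newshomepages.utils.write_json.
    """
    if verbose:
        print(f"Writing JSON to {path}")
    # Assumption: missing parent directories are created on demand.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=indent)


# Usage: write a made-up payload to a temporary directory and read it back.
payload = {"handle": "example", "links": ["https://example.com/"]}
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "extracts" / "example.json"
    write_object_to_json(payload, out, verbose=False)
    json_text = out.read_text(encoding="utf-8")
    json_round_trip = json.loads(json_text)
```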