Reference¶
Documentation for a selection of our system’s common internal tools
Commands¶
accessibility¶
Save the accessibility tree of the provided site.
accessibility [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>
- --timeout <timeout>
Arguments
- HANDLE: Required argument
analyze¶
analyze [OPTIONS] COMMAND [ARGS]...
cli¶
Analyze the Drudge Report.
cli¶
Analyze Lighthouse reports.
archive¶
Save assets to an archive.org collection.
archive [OPTIONS] HANDLE
Options
- -i, --input-dir <input_dir>
- --latest: Crosspost to the latest archive.org item
- --verbose: Display the upload progress to archive.org
- --timeout <timeout>
Arguments
- HANDLE: Required argument
batch¶
Print a batch of sites.
batch [OPTIONS] COMMAND [ARGS]...
sites-by-batch¶
Print site handles in the provided batch as a JSON list.
batch sites-by-batch [OPTIONS] BATCH
Options
- -b, --batches <batches>
Arguments
- BATCH: Required argument
discorder¶
Post images to Discord.
discorder [OPTIONS] COMMAND [ARGS]...
bundle¶
Post all images for a bundle.
discorder bundle [OPTIONS] SLUG
Options
- -i, --input-dir <input_dir>
Arguments
- SLUG: Required argument
site¶
site [OPTIONS] COMMAND [ARGS]...
cli¶
Create a page ranking sites by Lighthouse accessibility score.
accessibility-ranking¶
Create a page ranking sites by Lighthouse accessibility score.
site accessibility-ranking [OPTIONS]
cli¶
Create bundle detail pages.
cli¶
Create country detail pages.
cli¶
Create a page ranking sites by appearance on drudgereport.com.
cli¶
Create languages detail pages.
cli¶
Create a page showing all of the latest screenshots.
latest-screenshots¶
Create a page showing all of the latest screenshots.
site latest-screenshots [OPTIONS]
cli¶
Create the openai page based on the most recently scraped robots.txt files.
openai¶
Create the openai page based on the most recently scraped robots.txt files.
site openai [OPTIONS]
cli¶
Create a page ranking sites by Lighthouse performance score.
performance-ranking¶
Create a page ranking sites by Lighthouse performance score.
site performance-ranking [OPTIONS]
cli¶
Create source detail pages.
html¶
Save HTML for the provided homepage.
html [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>
- -w, --wait <wait>
Arguments
- HANDLE: Required argument
hyperlinks¶
Save all hyperlinks as JSON for a site or bundle.
hyperlinks [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>
- --timeout <timeout>
Arguments
- HANDLE: Required argument
mosaic¶
Create image mosaics.
mosaic [OPTIONS] COMMAND [ARGS]...
robotstxt¶
Save the raw robots.txt of the provided site.
robotstxt [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>
- --timeout <timeout>
- --verbose
Arguments
- HANDLE: Required argument
screenshot¶
Screenshot the provided homepage.
screenshot [OPTIONS] HANDLE
Options
- -o, --output-dir <output_dir>
- -w, --wait <wait>
- -x, --width <width>
- -y, --height <height>
- -f, --full-page: Screenshot the whole page
Arguments
- HANDLE: Required argument
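To illustrate how the screenshot options combine, here is a minimal Python sketch that assembles an argument list from the flags documented above. The `build_screenshot_args` helper and the `example_handle` value are hypothetical, and this reference does not say how the command is installed or launched, so the sketch only builds the arguments and does not run anything.

```python
from __future__ import annotations


def build_screenshot_args(
    handle: str,
    output_dir: str | None = None,
    wait: int | None = None,
    width: int | None = None,
    height: int | None = None,
    full_page: bool = False,
) -> list[str]:
    """Build an argument list matching `screenshot [OPTIONS] HANDLE`."""
    args = ["screenshot"]
    # Only add an option when the caller supplied a value for it.
    if output_dir is not None:
        args += ["--output-dir", output_dir]
    if wait is not None:
        args += ["--wait", str(wait)]
    if width is not None:
        args += ["--width", str(width)]
    if height is not None:
        args += ["--height", str(height)]
    if full_page:
        args.append("--full-page")
    # The required HANDLE argument goes last.
    args.append(handle)
    return args


print(build_screenshot_args("example_handle", output_dir="out", full_page=True))
```

The resulting list could be passed to `subprocess.run` once the correct way to invoke the tool in your environment is known.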
slack¶
Post image to Slack channel.
slack [OPTIONS] ARTIFACT_PATH
Arguments
- ARTIFACT_PATH: Required argument
telegrammer¶
Send a Telegram message.
telegrammer [OPTIONS] COMMAND [ARGS]...
bundle¶
Send a bundle of sources.
telegrammer bundle [OPTIONS] SLUG
Options
- -i, --input-dir <input_dir>
Arguments
- SLUG: Required argument
Utilities¶
The utils module contains a variety of functions used by our commands.
- newshomepages.utils.batch(li: list, n: int)
Yield n sequential chunks from li.
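One way to read "n sequential chunks" is sketched below. The `batch_sketch` function is a hypothetical stand-in, not the module's code, and it assumes that leftover items are handed to the earliest chunks; the description above does not say how the real function distributes them.

```python
from typing import Iterator


def batch_sketch(li: list, n: int) -> Iterator[list]:
    """Yield n sequential chunks from li, as evenly sized as possible."""
    size, remainder = divmod(len(li), n)
    start = 0
    for i in range(n):
        # Assumption: the first `remainder` chunks each take one extra item.
        end = start + size + (1 if i < remainder else 0)
        yield li[start:end]
        start = end


print(list(batch_sketch(list(range(10)), 3)))
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```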
- newshomepages.utils.chunk(iterable: list, length: int) → list[list]
Split the provided list into chunks of the provided length.
- Parameters
iterable (list) – The master list to split.
length (int) – The size of the chunks you want
Returns a list of lists.
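The behaviour described for chunk can be approximated with list slicing. This is a sketch only: `chunk_sketch` is a hypothetical name, and it assumes the final chunk is simply shorter when the list does not divide evenly.

```python
from __future__ import annotations


def chunk_sketch(iterable: list, length: int) -> list[list]:
    """Split a list into consecutive chunks of at most `length` items."""
    # Step through the list in strides of `length` and slice out each chunk.
    return [iterable[i : i + length] for i in range(0, len(iterable), length)]


print(chunk_sketch([1, 2, 3, 4, 5], 2))
# [[1, 2], [3, 4], [5]]
```

Note the contrast with batch: chunk fixes the size of each piece, while batch fixes the number of pieces.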
- newshomepages.utils.download_url(url: str, output_path: pathlib.Path, timeout: int = 180)
Download the provided URL to the provided path.
- newshomepages.utils.get_accessibility_df() → pandas.core.frame.DataFrame
Get the full list of accessibility files from our extracts.
Returns a DataFrame.
- newshomepages.utils.get_accessibility_list() → list[dict[str, typing.Any]]
Get the full list of accessibility files from our extracts.
Returns a list of dictionaries.
- newshomepages.utils.get_bundle(slug: str) → dict
Get the metadata for the provided bundle.
- Parameters
slug (str) – The unique string identifier of the bundle.
Returns a dictionary.
- newshomepages.utils.get_bundle_list() → list[dict]
Get the full list of site bundles.
Returns a list of dictionaries.
- newshomepages.utils.get_country(code: str) → dict
Get the metadata for the provided country.
- Parameters
code (str) – The unique string identifier of the country.
Returns a dictionary.
- newshomepages.utils.get_country_df() → pandas.core.frame.DataFrame
Get the list of countries.
Returns a pandas DataFrame.
- newshomepages.utils.get_country_list() → list[dict]
Get the full list of countries.
Returns a list of dictionaries.
- newshomepages.utils.get_extract_df(name: str, use_cache: bool = True, **kwargs) → pandas.core.frame.DataFrame
Read in the requested extract CSV as a DataFrame.
- newshomepages.utils.get_hyperlink_df() → pandas.core.frame.DataFrame
Get the full list of hyperlink files from our extracts.
Returns a DataFrame.
- newshomepages.utils.get_hyperlink_list() → list[dict[str, typing.Any]]
Get the full list of hyperlink files from our extracts.
Returns a list of dictionaries.
- newshomepages.utils.get_javascript(handle: str) → str | None
Get the JavaScript file to run before the screenshot, if it exists.
- Parameters
handle (str) – The Twitter handle of the site you want.
Returns a JavaScript string ready to be run. Or None, if no file exists.
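The "string or None" contract can be sketched as a simple file lookup. Everything about the storage layout here is an assumption: the `script_dir` parameter, the `<handle>.js` naming scheme and the `get_javascript_sketch` name are hypothetical, since this reference does not say where the scripts live.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def get_javascript_sketch(handle: str, script_dir: Path) -> str | None:
    """Return the contents of <script_dir>/<handle>.js, or None if it is absent."""
    # Assumption: one optional script per site, named after its handle.
    path = script_dir / f"{handle}.js"
    if path.exists():
        return path.read_text(encoding="utf-8")
    return None


# Demonstrate both outcomes in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    script_dir = Path(tmp)
    (script_dir / "example_handle.js").write_text("window.scrollTo(0, 0);", encoding="utf-8")
    found = get_javascript_sketch("example_handle", script_dir)
    missing = get_javascript_sketch("another_handle", script_dir)

print(found)
print(missing)
```

Returning None rather than raising lets a caller skip the pre-screenshot step for sites that have no script.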
- newshomepages.utils.get_json_url(url: str)
Get JSON data from the provided URL.
- newshomepages.utils.get_language_df() → pandas.core.frame.DataFrame
Get the list of languages.
Returns a pandas DataFrame.
- newshomepages.utils.get_language_list() → list[dict]
Get the list of languages.
Returns a list of dictionaries.
- newshomepages.utils.get_lighthouse_df() → pandas.core.frame.DataFrame
Get the full list of Lighthouse files from our extracts.
Returns a DataFrame.
- newshomepages.utils.get_lighthouse_list() → list[dict[str, typing.Any]]
Get the full list of Lighthouse audits from our extracts.
Returns a list of dictionaries.
- newshomepages.utils.get_local_time(site: dict) → datetime.datetime
Get the current time in the provided site’s timezone.
- Parameters
site (dict) – A site’s data dictionary.
Returns the current time as a timezone-aware datetime object.
- newshomepages.utils.get_robotstxt_df() → pandas.core.frame.DataFrame
Get the full list of robots.txt files from our extracts.
Returns a DataFrame.
- newshomepages.utils.get_screenshot_df() → pandas.core.frame.DataFrame
Get the full list of screenshot files from our extracts.
Returns a DataFrame.
- newshomepages.utils.get_screenshot_list() → list[dict[str, typing.Any]]
Get the full list of screenshots from our extracts.
Returns a list of dictionaries.
- newshomepages.utils.get_screenshots_by_site(site: dict) → list[dict]
Get the list of screenshots for the provided site.
Returns a list of dictionaries.
- newshomepages.utils.get_site(handle: str) → dict
Get the metadata for the provided site.
- Parameters
handle (str) – The Twitter handle of the site you want.
Returns a dictionary.
- newshomepages.utils.get_site_df() → pandas.core.frame.DataFrame
Get the full list of sites.
Returns a DataFrame.
- newshomepages.utils.get_site_list() → list[dict]
Get the full list of supported sites.
Returns a list of dictionaries.
- newshomepages.utils.get_sites_in_batch(batch_number: int, batches: int = 10) → list[dict]
Get all the sites in the provided batch.
- Parameters
batch_number (int) – The number of the batch to pull.
batches (int) – The total number of batches.
Returns a list of site dictionaries.
- newshomepages.utils.get_sites_in_bundle(slug: str) → list[dict]
Get all the sites in the provided bundle.
- Parameters
slug (str) – The unique string identifier of the bundle.
Returns a list of site dictionaries.
- newshomepages.utils.get_sites_in_country(slug: str) → list[dict]
Get all the sites in the provided country.
- Parameters
slug (str) – The two-letter alpha code of the country.
Returns a list of site dictionaries.
- newshomepages.utils.get_sites_in_language(code: str) → list[dict]
Get all the sites in the provided language.
- Parameters
code (str) – The two-letter alpha code of the language.
Returns a list of site dictionaries.
- newshomepages.utils.get_url(url: str, timeout: int = 30, user_agent: str | None = None, verbose: bool = False)
Get the provided URL.
- Parameters
url (str) – The URL to fetch.
timeout (int) – The number of seconds to wait before timing out. (Default: 30)
user_agent (str) – The user agent to use in the request. (Default: None)
verbose (bool) – Whether or not to log the action prior to execution. (Default: False)
Returns a requests.Response object.
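The parameters above map naturally onto a `requests.get` call, as the following sketch suggests. The function names are hypothetical, and the handling of `user_agent=None` is an assumption: here the request falls back to the default User-Agent of requests, which may not be what the real function does.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


def build_request_kwargs(timeout: int = 30, user_agent: str | None = None) -> dict:
    """Translate the documented parameters into keyword arguments for requests.get."""
    kwargs: dict = {"timeout": timeout}
    if user_agent is not None:
        # Only override the header when a user agent was supplied.
        kwargs["headers"] = {"User-Agent": user_agent}
    return kwargs


def get_url_sketch(
    url: str,
    timeout: int = 30,
    user_agent: str | None = None,
    verbose: bool = False,
):
    """Fetch a URL and return the requests.Response object."""
    import requests  # third-party dependency, imported lazily

    if verbose:
        logger.info("Fetching %s", url)
    return requests.get(url, **build_request_kwargs(timeout, user_agent))


print(build_request_kwargs(timeout=10, user_agent="example-agent/1.0"))
# {'timeout': 10, 'headers': {'User-Agent': 'example-agent/1.0'}}
```

Keeping the keyword-argument construction in its own function makes it possible to check the timeout and header logic without touching the network.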
- newshomepages.utils.get_user_agent() → str
Provide a user-agent string.
Returns a string ready to use as a header in a web request.
- newshomepages.utils.get_wayback_df() → pandas.core.frame.DataFrame
Get the full list of wayback files from our extracts.
Returns a DataFrame.
- newshomepages.utils.intcomma(value: int | str) → str
Convert an integer to a string containing commas every three digits.
For example, 3000 becomes ‘3,000’ and 45000 becomes ‘45,000’.
- Parameters
value (int | str) – The integer to format.
Returns a string with the result.
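The formatting described above can be reproduced with Python's built-in thousands separator. This is a sketch with a hypothetical name, `intcomma_sketch`; it assumes string inputs contain a plain integer, and the real function may treat other inputs differently.

```python
from __future__ import annotations


def intcomma_sketch(value: int | str) -> str:
    """Format an integer, or a string holding one, with commas every three digits."""
    # The "," format specifier inserts the thousands separators.
    return f"{int(value):,}"


print(intcomma_sketch(3000))   # 3,000
print(intcomma_sketch(45000))  # 45,000
```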
- newshomepages.utils.numoji(number: int) → str
Convert a number into a series of emojis for Slack.
- Parameters
number (int) – The number to convert into emoji.
Returns an emoji string.
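A digit-by-digit conversion might look like the sketch below. It assumes the output uses Slack's `:zero:` through `:nine:` shortcodes and that the input is non-negative; the reference does not state whether the real function emits shortcodes or Unicode emoji, and `numoji_sketch` is a hypothetical name.

```python
DIGIT_NAMES = [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]


def numoji_sketch(number: int) -> str:
    """Convert a non-negative integer into a string of Slack digit shortcodes."""
    if number < 0:
        raise ValueError("This sketch only handles non-negative integers")
    # Map each decimal digit to its shortcode, e.g. 4 -> ":four:".
    return "".join(f":{DIGIT_NAMES[int(digit)]}:" for digit in str(number))


print(numoji_sketch(42))
# :four::two:
```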
- newshomepages.utils.parse_archive_url(url: str) → dict
Parse the handle and timestamp from an archive.org URL.
- Parameters
url (str) – An archive.org URL.
Returns a dictionary with the identifier, handle and timestamp parsed out.
- newshomepages.utils.safe_ia_handle(handle: str) → str
Sanitize a handle so it’s safe to use as an archive.org slug.
- Parameters
handle (str) – The unique string identifier of the site.
Returns a lowercase string that’s ready to use.
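A sanitizer of this kind could be sketched as follows. The exact rules are an assumption: this version lowercases the handle, keeps only letters, digits, periods, underscores and hyphens, and strips leading punctuation so the result starts with a letter or digit. The real function may apply different rules, and `safe_ia_handle_sketch` is a hypothetical name.

```python
import re


def safe_ia_handle_sketch(handle: str) -> str:
    """Return a lowercase, slug-friendly version of a site handle."""
    # Lowercase first, then drop anything outside the assumed safe character set.
    cleaned = re.sub(r"[^a-z0-9._-]", "", handle.lower())
    # Assumption: a slug should not begin with punctuation.
    return cleaned.lstrip("._-")


print(safe_ia_handle_sketch("_Example_News"))
# example_news
```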
- newshomepages.utils.write_csv(dict_list: list[dict], path: Path, verbose: bool = True) → None
Write a list of dictionaries to a CSV file at the provided path.
- Parameters
dict_list (list[dict]) – The list of dictionaries to write as rows.
path (Path) – The filesystem Path where the CSV should be written.
verbose (bool) – Whether or not to log the action prior to execution. (Default: True)
Returns nothing.
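Writing a list of dictionaries to CSV is commonly done with `csv.DictWriter`, as in this sketch. It assumes the header row comes from the keys of the first dictionary and that logging is a simple printed message; `write_csv_sketch` is a hypothetical stand-in for the real function.

```python
from __future__ import annotations

import csv
import tempfile
from pathlib import Path


def write_csv_sketch(dict_list: list[dict], path: Path, verbose: bool = True) -> None:
    """Write a list of dictionaries to a CSV file, one row per dictionary."""
    if verbose:
        print(f"Writing CSV to {path}")
    # Assumption: every row shares the keys of the first dictionary.
    fieldnames = list(dict_list[0].keys()) if dict_list else []
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(dict_list)


rows = [
    {"handle": "site_a", "score": 90},
    {"handle": "site_b", "score": 85},
]

# Write to a throwaway directory and read the result back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "scores.csv"
    write_csv_sketch(rows, out_path, verbose=False)
    written_text = out_path.read_text(encoding="utf-8")

print(written_text)
```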
- newshomepages.utils.write_json(data: Any, path: pathlib.Path, indent: int = 2, verbose: bool = True) → None
Write JSON data to the provided path.
- Parameters
data (Any) – Any Python object ready to be serialized as JSON.
path (Path) – The filesystem Path where the object should be written.
indent (int) – The indentation to use in the JSON output. (Default: 2)
verbose (bool) – Whether or not to log the action prior to execution. (Default: True)
Returns nothing.
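The parameters above line up with the standard library's `json.dump`, which the following sketch uses. It assumes `indent` is passed straight through to `json.dump` and that logging is a printed message; `write_json_sketch` is a hypothetical name, not the module's code.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def write_json_sketch(data: Any, path: Path, indent: int = 2, verbose: bool = True) -> None:
    """Serialize data as JSON and write it to the provided path."""
    if verbose:
        print(f"Writing JSON to {path}")
    with open(path, "w", encoding="utf-8") as fh:
        # Assumption: indent is forwarded unchanged to json.dump.
        json.dump(data, fh, indent=indent)


payload = {"handle": "example_handle", "score": 90}

# Write to a throwaway directory, then read the file back to confirm the round trip.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "example.json"
    write_json_sketch(payload, out_path, verbose=False)
    json_text = out_path.read_text(encoding="utf-8")
    loaded = json.loads(json_text)

print(json_text)
```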