Releases: autogram-is/spidergram
v0.10.0 — Ham
This release is dedicated to Peter Porker of Earth-8311, an innocent pig raised by animal scientist May Porker. After a freak accident with the world's first atomic powered hairdryer, Peter was bitten by the scientist and transformed into a crime-fighting superhero pig.
New Additions
- Custom queries and multi-query reports can be defined in the Spidergram config files; Spidergram now ships with a handful of simple queries and an overview report as part of its core configuration.
- Spidergram can run an Axe Accessibility Report on every page as it crawls a site; this behavior can be turned on and off via the `spider.auditAccessiblity` config property.
- Spidergram can now save cookies, performance data, and remote API requests made during page load using the `config.spider.saveCookies`, `.savePerformance`, and `.saveXhr` config properties.
- Spidergram can identify and catalog design patterns during the post-crawl page analysis process; pattern definitions can also include rules for extracting pattern properties like a card's title and CTA link.
- Resources with attached downloads can be processed using file parsing plugins; Spidergram 0.10.0 comes with support for PDF and .docx content and metadata, image EXIF metadata, and audio/video metadata in a variety of formats.
- The `config.spider.seed` setting lets you set one or more URLs as the default starting points for crawling.
- For large crawls, an experimental `config.offloadBodyHtml` settings flag has been added to Spidergram's global configuration. When it's set to 'db', all body HTML is stored in a dedicated key-value collection rather than the `resources` collection. On sites with many large pages (50k+ pages of 500k+ HTML apiece), this can significantly improve the speed of filtering, queries, and reporting. A config sketch covering these new settings follows this list.
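For orientation, here's a minimal config sketch pulling the new settings together. The property names (`offloadBodyHtml`, `spider.seed`, `spider.auditAccessiblity`, `spider.saveCookies`, `spider.savePerformance`, `spider.saveXhr`) come from these notes; the file name and export shape are assumptions, not canonical.

```ts
// spidergram.config.ts -- a hedged sketch; only the property names below
// are taken from these release notes, and the export shape is assumed.
export default {
  offloadBodyHtml: 'db', // experimental: store body HTML in a key-value collection
  spider: {
    seed: ['https://example.com'], // default starting point(s) for crawls
    auditAccessiblity: true,       // run an Axe report on every crawled page
    saveCookies: true,
    savePerformance: true,
    saveXhr: true,
  },
};
```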
Changes
- Spidergram's CLI commands have been overhauled; vestigial commands from the 0.5.0 era have been removed and replaced. Of particular interest:
  - `spidergram status` summarizes the current config and DB state
  - `spidergram init` generates a fresh configuration file in the current directory
  - `spidergram ping` tests a remote URL using the current analysis settings
  - `spidergram query` displays and saves filtered snapshots of the saved crawl graph
  - `spidergram report` outputs a collection of query results as a combined workbook or JSON file
  - `spidergram go` crawls one or more URLs, analyzes the crawled files, and generates a report in a single step
  - `spidergram url test` tests a URL against the current normalizer and filter settings
  - `spidergram url tree` replaces the old `urls` command for building site hierarchies
- CLI consistency is significantly improved. For example, `analyze`, `query`, `report`, and `url tree` all support the same `--filter` syntax for controlling which records are loaded from the database.
Fixes and under-the-hood improvements
- URL matching and filtering has been smoothed out, and a host of tests have been added to ensure things stay solid. Previously, filter strings were treated as globs matched against the entire URL. Now, objects like `{ property: 'hostname', glob: '*.foo.com' }` can be used to explicitly specify glob or regex matches against individual URL components.
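For completeness, a hedged illustration of the two filter styles: the `glob` form is taken verbatim from the note above, while the `regex` property name is an assumption based on the description of regex matching against URL components.

```ts
// Glob match against a single URL component (from the notes above).
const hostFilter = { property: 'hostname', glob: '*.foo.com' };

// Regex match against a component -- the `regex` key is an assumption.
const pathFilter = { property: 'pathname', regex: '^/blog/\\d{4}/' };
```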
v0.9.0
Spidergram 0.9.0: Gwen
This release is dedicated to teen crime-fighter Gwen Stacy of Earth-65. She juggles high school, her band, and wisecracking web-slinging until her boyfriend Peter Parker becomes infatuated with Spider-Woman. Unable to reveal her secret identity, Spider-Woman is blamed for Peter's tragic lizard-themed death on prom night… and Gwen goes on the run.
Like Gwen Stacy's social calendar, this version of Spidergram has a lot going on. Hold onto your seats!
Major Changes
- `Vertice` and `Edge` have been renamed to `Entity` and `Relationship` to avoid confusion with ArangoDB graph traversal and storage concepts. With the arrival of the `Dataset` and `KeyValueStore` classes (see below), we also needed the clarity when dealing with full-fledged Entities vs random datatypes.
- Improved report/query helpers. The `GraphWorker` and `VerticeQuery` — both of which relied on raw snippets of AQL for filtering — have been replaced by a new query-builder system. A unified `Query` class can take a query definition in JSON format, or construct one piecemeal using fluent methods like `filterBy()` and `sort()`. A related `EntityQuery` class returns pre-instantiated Entity instances to eliminate boilerplate code, and a `WorkerQuery` class executes a worker function against each query result while emitting progress events for easy monitoring. (A brief sketch follows this list.)
- `HtmlTools.getPageContent()` and `.getPageData()` are both async now, allowing them to use some of the async parsing and extraction tools in our toolbox. If your extracted data and content suddenly appear empty, make sure you're awaiting the results of these two calls in your handlers and scripts.
- The `Project` class has been replaced by the `Spidergram` class, as part of the configuration management overhaul mentioned below. In most code, changing `const project = await Project.config();` to `const spidergram = await Spidergram.load();` and `const db = await project.graph();` to `const db = spidergram.arango;` should be sufficient.
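As mentioned in the query-helpers item above, here's a minimal sketch of the fluent builder. The `Query` class and the `filterBy()`/`sort()` methods come from these notes; the constructor argument, `run()` method, and filter signature are assumptions for illustration only.

```ts
import { Query } from 'spidergram';

// Hedged sketch: build a query piecemeal, then execute it.
const query = new Query('resources') // assumed: collection name in the constructor
  .filterBy('code', 200)             // fluent filtering, as described above
  .sort('url');

const results = await query.run();   // assumed execution method
console.log(`${results.length} matching records`);
```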
New Additions
- Spidergram configuration can now live in .json, .js, or .ts files — and can control a much wider variety of internal behaviors. JS and TS configuration files can also pass in custom functions where appropriate, like the `urlNormalizer` and `spider.requestHandlers` settings. Specific environment variables, or `.env` files, can also be used to supply or override sensitive properties like API account credentials.
- Ad-hoc data storage with the `Dataset` and `KeyValueStore` classes. Both offer static `open` methods that give quick access to default or named data stores -- creating new storage buckets if needed, or pulling up existing ones. Datasets offer `pushItem(anyData)` and `getItems()` methods, while KeyValueStores offer `setItem(key, value)` and `getItem(key)` methods. Behind the scenes, they create and manage dedicated ArangoDB collections that can be used in custom queries. (See the sketch after this list.)
- PDF and DocX parsing via `FileTools.Pdf` and `FileTools.Document`, based on the pdf-parse and mammoth projects. Those two formats are a first trial run for more generic parsing/handling of arbitrary formats; both can return filetype-specific metadata and plaintext versions of file contents. For consistency, the Spreadsheet class has also been moved to `FileTools.Spreadsheet`.
- Site technology detection via `BrowserTools.Fingerprint`. Fingerprinting is currently based on the Wappalyzer project and uses markup, script, and header patterns to identify the technologies and platforms used to build/host a page.
- CLI improvements. The new `spidergram report` command can pull up filtered, aggregated, and formatted versions of Spidergram crawl data. It can output tabular overviews on the command line, raw JSON files for use with data visualization tools, or ready-to-read Excel worksheets. The `spidergram probe` command allows the new Fingerprint tool to be run from the command line, as well.
- Groundwork for cleaner CLI code. While it's not as obvious to end users, we're moving more and more code away from the Oclif-dependent `SgCommand` class and putting it into the shared `SpiderCli` helper class where it can be used in more contexts. In the next version, we'll be leveraging these improvements to make Spidergram's built-in CLI tools take better advantage of the new global configuration settings.
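As promised in the storage item above, a minimal sketch of the `Dataset` and `KeyValueStore` APIs. The static `open()` methods and the `pushItem()`/`getItems()`/`setItem()`/`getItem()` calls come straight from these notes; the import path and sample data are assumptions.

```ts
import { Dataset, KeyValueStore } from 'spidergram';

// Open (or create) a named dataset and push arbitrary records into it.
const issues = await Dataset.open('broken-links');
await issues.pushItem({ url: 'https://example.com/missing', code: 404 });
const allIssues = await issues.getItems();

// Key-value storage works similarly, keyed by string.
const cache = await KeyValueStore.open('fingerprints');
await cache.setItem('example.com', { cms: 'WordPress' });
const cached = await cache.getItem('example.com');
```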
Fixes and minor improvements
- Internal errors (e.g., pre-request DNS problems or errors thrown during response processing) save a wider range of error codes rather than mapping everything to `-1`. Any thrown errors are also saved in `Resource.errors` for later reference.
- A subtle but long-standing issue with the `downloadHandler` (and by extension `sitemapHandler` and `robotsTxtHandler`) meant it choked on most downloads but "properly" persisted status records rather than erroring out. The improved error handling caught it, and downloads now work consistently.
- A handful of request handlers were `await`ing promises unnecessarily, clogging up the Spider's request queue. Crawls with multiple concurrent browser sessions will see some performance improvements.
v0.8.3
v0.8.0
Spidergram 0.8.0: Miles
Brooklyn teenager Miles Morales successfully juggled school, friends, and family — until his uncle Aaron's break-in at an Osborne Labs facility brought a genetically engineered spider to Miles' doorstep. Once bitten, Miles developed a range of varyingly spider-related powers and a much, much busier day planner.
Thrust into the role of super-hero by the death of Peter Parker, Miles is forced to balance the safety of his loved ones against his responsibilities as a crime-fighter, yoinked from his home reality in a sweeping Multiversal disaster, and cloned a bunch of times because Marvel. Miles is the protagonist of — and this is a fact, not opinion — the best Spider-Man movie ever produced.
What's Changed
Lots of quality of life improvements, including bug fixes for report generation and finessing of return/input types that made a number of helper functions difficult to use in conjunction with each other.
Streamlined structured data parsing
The `HtmlTools` collection of helpers now includes a one-shot `getPageData()` helper function that attempts to parse out all the standard HTML stuff: HEAD subtags like `<base>` and `<title>`, meta tags, `<script>` and `<style>` tags, JSON and LDJSON data present in any script tags, etc. Options to toggle various chunks of that data on and off can be passed into the function to avoid spamming yourself.
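A quick sketch of the one-shot helper. The option and property names here are illustrative assumptions, and note that in this release the call was synchronous (0.9.0 later made it async).

```ts
import { HtmlTools } from 'spidergram';

// Hedged sketch: option and result property names are assumptions.
const html = '<html><head><title>Example</title></head><body>…</body></html>';
const data = HtmlTools.getPageData(html, {
  meta: true,    // parse <meta> tags
  json: true,    // parse JSON / LDJSON found in <script> tags
  styles: false, // skip <style> contents
});
console.log(data.title, data.meta);
```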
Streamlined content extraction
Similarly, `HtmlTools.getPageContent()` can now act as a quick wrapper for standard extraction of on-page content. Pass in a list of CSS selectors to help it find a page's "primary content" and it will spit out a scrubbed plaintext version, then calculate a readability score.
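A hedged sketch of content extraction under the same caveats: the selector option name and the result properties are assumptions based on the description above.

```ts
import { HtmlTools } from 'spidergram';

// Hedged sketch: pass selectors that locate the page's primary content.
const html = '<html><body><main><h1>Hello</h1><p>Body text…</p></main></body></html>';
const content = HtmlTools.getPageContent(html, {
  selector: ['main', 'article', '#content'], // assumed option name
});
console.log(content.text);        // scrubbed plaintext (assumed property)
console.log(content.readability); // readability score (assumed property)
```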
In the next release we'll be adding some general-purpose "find element X on the page, and if it's there, add its text to the content results" helper functions to `HtmlTools`; when that happens, it will be possible to include those instructions in the `getPageContent()` options, allowing custom analyzer code to use that function for most garden-variety extraction.
Improved pattern/component extraction
Finding and saving sub-page patterns to their own pool of data for querying is a bit simpler, and also uses the same underlying code as `getPageData()` for extracting element attributes and content. The `HtmlTools.findPattern()` function accepts an array of pattern descriptions, each of which can include CSS selectors, instructions on what data to pull from the markup that's found, and (optionally) a callback to tweak the data before it's returned. The resulting found pattern instances can be saved straight to the `fragments` collection, with links back to the pages they occurred on, for easy analysis.
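A sketch of a pattern description for `HtmlTools.findPattern()`. The array-of-descriptions shape, the CSS selector, and the optional callback are from the notes above; every other property name and the call signature are assumptions.

```ts
import { HtmlTools } from 'spidergram';

// Hedged sketch: describe a 'card' component and what to pull from it.
const cardPattern = {
  name: 'card',                                 // assumed identifier property
  selector: 'div.card',                         // CSS selector, as described above
  properties: { title: 'h3', cta: 'a.button' }, // assumed extraction syntax
  fn: (data: Record<string, unknown>) => data,  // optional tweak callback
};

const html = '<div class="card"><h3>Title</h3><a class="button">Go</a></div>';
const found = HtmlTools.findPattern(html, [cardPattern]); // assumed signature
```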
URL hierarchy parsing and reporting
A new `HierarchyTools` utility pack has been added, with base classes to simplify hierarchy parsing, manipulation, and reporting code. The first example is `UrlHierarchyBuilder`, which accepts a giant array of strings (or any array of objects that have 'url' properties). It will use the URL paths to construct a complete tree, with configurable options for filling gaps in the tree, dealing with multiple subdomains under a single TLD, and so on.
The resulting hierarchy has a host of helper functions and convenience properties for pulling out top-level root URLs, orphans disconnected from the rest of the tree, leaf and branch nodes, etc. Every item in the hierarchy also has a `render()` function that can output a nicely-formatted tree view of the item and its children; a number of rendering presets are included, from 'only show me directories, but summarize how many leaf URLs are in each directory' to 'show me everything, and format it as markdown'.
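A hedged sketch of building and rendering a URL tree. `UrlHierarchyBuilder` and `render()` appear in the notes above; the method and accessor names used to feed and walk the tree are assumptions.

```ts
import { HierarchyTools } from 'spidergram';

// Hedged sketch: construct a tree from raw URL strings, then render it.
const urls = [
  'https://example.com/',
  'https://example.com/blog/',
  'https://example.com/blog/2023/hello-world',
];
const builder = new HierarchyTools.UrlHierarchyBuilder(); // options omitted
builder.add(urls);                  // assumed: feeds URLs into the tree
for (const root of builder.roots) { // assumed accessor for top-level items
  console.log(root.render());       // tree view, as described above
}
```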
Finally, the `spidergram urls` CLI command has been updated to use the new hierarchy tools. It can quickly spit out a summary view of the URLs that have been crawled (or discovered), and it can pull in a URL tree from a text, CSV, or sitemap.xml file for formatting.
Full Changelog: v0.7.1...v0.8.0
0.7.1
A handful of minor fixes and a license update. Spidergram is released under the GPL, but previous NPM releases had a dangling MIT license in package.json. If you keep your old MIT copy of Spidergram, it may eventually be quite valuable to collectors on the secondary market.
v0.7.0
Spidergram 0.7.0: Cindy
While attending a public exhibition on safe handling of nuclear waste, teenager Cindy Moon was bitten by a particle-accelerator irradiated spider. After manifesting the usual laundry list of spider powers, Cindy was locked in an underground bunker to protect her from trans-dimensional vampire spider hunters. Peter Parker discovered the bunker thirteen years later, opened it, and was immediately attacked by Cindy. She subsequently created a cool costume, started fighting crime as the superhero Silk, and secured a job as social media manager for the Daily Bugle. Cindy is one of the few superheroes with a full-time digital content gig.
What's New in Spidergram 0.7.0
- Design pattern/component extraction
- Companion `create-spidergram` project with example project templates
- More helpers for common analysis tasks
- Schema.org page metadata
- Reusable query-builders and data visualizations
- Google Analytics queries
- Sitemap and Robots.txt parsing
- Internal improvements (linting and formatting rules, limited tests, fewer dependencies)
- Still a mind-boggling lack of documentation
Full Changelog: 0.6.0...0.7.0
0.6.0 (Peter)
Initial semi-public version of Spidergram. Please do not fold, spindle, or mutilate.