Sometimes you might find that, after a CMS migration, some things are missing.
Obviously nobody would do such a thing on purpose, right? But it can happen.
Suddenly, instead of something complete, you’re left with just guitars.
Let’s do a quick & dirty hack to check for important, lost URLs.
# List the 50 most-captured HTML pages from December 2024 that no longer return a 2xx or 3xx status
curl "http://web.archive.org/cdx/search/cdx?url=colored.house.com*&output=txt&from=20241201&to=20241231" --output - \
| grep " text/html 200 " | awk '{print $3}' | sed 's/[?].*//' | sort | uniq -c | sort -nr \
| awk '{print $2}' | head -n 50 | xargs -I {} curl -s -o /dev/null -w "%{http_code} %{url_effective}\\n" {} \
| grep -v "^[23]"
There you are: the top “important” URLs from the site (the ones the Wayback Machine captured most often) that worked in December but don’t work now, neither redirecting nor returning content.
A serious site owner could then go through the list and decide, for each URL, whether to restore the content (perhaps it is historical), redirect to a new URL (perhaps one department has replaced another), or simply leave it as a 404.
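If the one-liner is hard to tweak, the ranking step can be pulled out into a small function and tried on saved data before hitting the live site. The following is only a rough sketch: the function name `rank_captured_urls` is made up for illustration, and it assumes the CDX text output uses the default field order (urlkey, timestamp, original, mimetype, statuscode, digest, length). It also matches on the mimetype and status fields with awk instead of the plain grep used above, which is slightly stricter.

```shell
#!/bin/sh
# Hypothetical helper, not part of any tool: rank archived URLs by capture count.
# Reads space-separated CDX lines on stdin, assuming the default field order:
#   urlkey timestamp original mimetype statuscode digest length
# Prints the top N original URLs (default 50), most-captured first.
rank_captured_urls() {
  top_n="${1:-50}"
  # Keep only HTML captures that returned 200, and drop any query string.
  awk '$4 == "text/html" && $5 == "200" { sub(/[?].*/, "", $3); print $3 }' \
    | sort | uniq -c \
    | sort -k1,1nr -k2 \
    | awk '{ print $2 }' \
    | head -n "$top_n"
}

# Possible usage (needs network access, so it is left commented out):
#   curl -s "http://web.archive.org/cdx/search/cdx?url=colored.house.com*&output=txt&from=20241201&to=20241231" \
#     | rank_captured_urls 50
```

Splitting it this way means the Wayback Machine response can be saved to a file once and re-ranked with different cutoffs, without re-fetching it each time.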
SEOs might want to maintain the value of some of the old URLs, even if the replacements are just kinda close.