How do I extract all the external links of a web page and save them to a file?
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Hypnotic Puzzle4
--
Chapters
00:00 How Do I Extract All The External Links Of A Web Page And Save Them To A File?
00:17 Accepted Answer Score 25
00:34 Answer 2 Score 17
00:47 Answer 3 Score 1
01:55 Answer 4 Score 0
02:11 Answer 5 Score 0
02:20 Thank you
--
Full question
https://superuser.com/questions/372155/h...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#linux #commandline
#avk47
ACCEPTED ANSWER
Score 25
You will need two tools, lynx and awk. Try this:
$ lynx -dump http://www.google.com.br | awk '/http/{print $2}' > links.txt
If you need the lines numbered, add the nl command. Try this:
$ lynx -dump http://www.google.com.br | awk '/http/{print $2}' | nl > links.txt
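Note that this prints every URL lynx finds, including links back to the same site. If you only want external links, a minimal sketch that filters out your own domain with grep -vE (assuming www.google.com.br is the site being scraped):
$ lynx -dump http://www.google.com.br | awk '/http/{print $2}' | grep -vE '//([^/]*\.)?google\.com\.br' > links.txt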
ANSWER 2
Score 17
Here's an improvement on lelton's (accepted) answer: you don't need awk at all, because lynx has some useful options of its own.
lynx -listonly -nonumbers -dump http://www.google.com.br
If you want the links numbered:
lynx -listonly -dump http://www.google.com.br
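To save the list to a file, as the question asks, just redirect the output (same example URL as above):
lynx -listonly -nonumbers -dump http://www.google.com.br > links.txt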
ANSWER 3
Score 1
As discussed in other answers, Lynx is a great option, but there are many others in nearly every programming language and environment.
Another choice is xmllint. Sample usage:
$ curl -sS "https://superuser.com" \
| xmllint --html --xpath '//a[starts-with(@href, "http")]/@href' 2>/dev/null - \
| sed 's/^ href="\|"$//g' \
| tail -3
https://linkedin.com/company/stack-overflow
https://www.instagram.com/thestackoverflow
https://stackoverflow.com/help/licensing
Additionally, Perl offers HTML::Parser:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
use LWP::Simple;
# Print the href of each <a> tag that points to an absolute http(s) URL
sub start {
    my $href = shift->{href};
    print "$href\n" if $href && $href =~ /^https?:\/\//;
}

my $url = shift @ARGV or die "No argument URL provided";
my $parser = HTML::Parser->new(api_version => 3, start_h => [\&start, "attr"]);
$parser->report_tags(["a"]);    # only report <a> start tags
$parser->parse(get($url) or die "Failed to GET $url");
Sample usage (including writing to a file, per the OP's request; usage is the same for any script here with a shebang):
$ ./scrape_links https://superuser.com > links.txt \
&& cat links.txt | tail -3
https://linkedin.com/company/stack-overflow
https://www.instagram.com/thestackoverflow
https://stackoverflow.com/help/licensing
Ruby has the nokogiri gem:
#! /usr/bin/env ruby
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('https://superuser.com'))
doc.xpath('//a[starts-with(@href, "http")]/@href').each do |link|
  puts link.content
end
NodeJS has cheerio:
const axios = require("axios");
const cheerio = require("cheerio");
(async () => {
  const $ = cheerio.load((await axios.get("https://superuser.com")).data);
  $("a").each((i, e) => console.log($(e).attr("href")));
})();
Python's BeautifulSoup hasn't been shown yet in this thread:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("https://superuser.com").text, "lxml")
for x in soup.find_all("a", href=True):
    if x["href"].startswith("http"):
        print(x["href"])
ANSWER 4
Score 0
- Use Beautiful Soup to retrieve the web pages in question.
- Use awk to find all URLs that do not point to your domain (a sketch follows below).
I would recommend Beautiful Soup over raw screen-scraping techniques.
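A minimal sketch of the awk step, assuming the links have already been extracted one per line into links.txt (for example with the BeautifulSoup snippet in the previous answer) and that example.com is your own domain:
$ awk '!/^https?:\/\/([^\/]*\.)?example\.com/' links.txt > external_links.txt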
ANSWER 5
Score 0
If the command line is not a requirement, you can use the Copy All Links Firefox extension.