The Computer Oracle

How do I extract all the external links of a web page and save them to a file?

--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Hypnotic Puzzle4

--

Chapters
00:00 How Do I Extract All The External Links Of A Web Page And Save Them To A File?
00:17 Accepted Answer Score 25
00:34 Answer 2 Score 17
00:47 Answer 3 Score 1
01:55 Answer 4 Score 0
02:11 Answer 5 Score 0
02:20 Thank you

--

Full question
https://superuser.com/questions/372155/h...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#linux #commandline

#avk47



ACCEPTED ANSWER

Score 25


You will need two tools, lynx and awk. Try this:

$ lynx -dump http://www.google.com.br | awk '/http/{print $2}' > links.txt

If you need the lines numbered, add the nl command:

$ lynx -dump http://www.google.com.br | awk '/http/{print $2}' | nl > links.txt
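Since the question asks for external links specifically, here is a minimal variant that drops links pointing back to the site being scraped (google.com.br in this example), assuming a simple substring match on the domain is good enough:

$ lynx -dump http://www.google.com.br | awk '/http/{print $2}' | grep -v 'google\.com\.br' > links.txt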



ANSWER 2

Score 17


Here's an improvement on lelton's answer: you don't need awk at all, since lynx has some useful options of its own.

lynx -listonly -nonumbers -dump http://www.google.com.br

If you want the lines numbered:

lynx -listonly -dump http://www.google.com.br
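And to save the result to a file, as the question asks, just redirect the output:

lynx -listonly -nonumbers -dump http://www.google.com.br > links.txt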



ANSWER 3

Score 1


As discussed in other answers, Lynx is a great option, but there are many others in nearly every programming language and environment.

Another choice is xmllint. Sample usage:

$ curl -sS "https://superuser.com" \
| xmllint --html --xpath '//a[starts-with(@href, "http")]/@href' 2>/dev/null - \
| sed 's/^ href="\|"$//g' \
| tail -3
https://linkedin.com/company/stack-overflow
https://www.instagram.com/thestackoverflow
https://stackoverflow.com/help/licensing

Additionally, Perl offers HTML::Parser:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser;
use LWP::Simple;

sub start {
    my $href = shift->{href};
    print "$href\n" if $href && $href =~ /^https?:\/\//;
}

my $url = shift @ARGV or die "No argument URL provided";
my $parser = HTML::Parser->new(api_version => 3, start_h => [\&start, "attr"]);
$parser->report_tags(["a"]);
$parser->parse(get($url) or die "Failed to GET $url");

Sample usage (including writing to a file, as the OP requested; invocation is the same for any script here with a shebang):

$ ./scrape_links https://superuser.com > links.txt \
&& cat links.txt | tail -3
https://linkedin.com/company/stack-overflow
https://www.instagram.com/thestackoverflow
https://stackoverflow.com/help/licensing

Ruby has the nokogiri gem:

#! /usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('https://superuser.com'))

doc.xpath('//a[starts-with(@href, "http")]/@href').each do |link|
  puts link.content
end

NodeJS has cheerio:

const axios = require("axios");
const cheerio = require("cheerio");

(async () => {
  const $ = cheerio.load((await axios.get("https://superuser.com")).data);
  $("a").each((i, e) => console.log($(e).attr("href")));
})();

Python's BeautifulSoup hasn't been shown yet in this thread:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://superuser.com").text, "lxml")

for x in soup.find_all("a", href=True):
    if x["href"].startswith("http"):
        print(x["href"])



ANSWER 4

Score 0


  1. Use Beautiful Soup (together with a downloader such as requests or urllib) to parse the web pages in question and pull out their links.
  2. Use awk to keep only the URLs that do not point to your own domain.

I would recommend Beautiful Soup over ad hoc screen-scraping techniques.
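A minimal sketch of that two-step idea in Python, assuming the requests and beautifulsoup4 packages are installed; the domain filter is done here with urllib.parse instead of awk, and superuser.com stands in for your own domain:

#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

OWN_DOMAIN = "superuser.com"  # placeholder: replace with your own domain

html = requests.get("https://superuser.com").text
soup = BeautifulSoup(html, "html.parser")

with open("links.txt", "w") as out:
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # keep only absolute links whose host is outside your own domain
        if href.startswith("http") and OWN_DOMAIN not in urlparse(href).netloc:
            out.write(href + "\n")

After running it, links.txt contains only the off-site links.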




ANSWER 5

Score 0


If the command line is not a requirement, you can use the Copy All Links Firefox extension.