feat(elfeed): the final version

This commit is contained in:
Pavel Korytov 2022-05-10 20:34:34 +03:00
parent 36db756393
commit 605978603f
5 changed files with 712 additions and 38 deletions

View file

@ -0,0 +1,678 @@
+++
title = "Extending elfeed entries with a PDF viewer and subtitles fetcher"
author = ["Pavel Korytov"]
date = 2022-05-09
tags = ["emacs", "org-mode"]
draft = true
+++
## Intro {#intro}
[elfeed](https://github.com/skeeto/elfeed) is one of the most popular Emacs packages, and it's also one in which I ended up investing a lot of effort. I wrote about the [EMMS integration](https://sqrtminusone.xyz/posts/2021-09-07-emms/) and even made a [custom frontpage](https://github.com/SqrtMinusOne/elfeed-summary) to my liking.
However, sites frequently limit the amount of information shipped in the RSS feed. Oftentimes the entry doesn't include the entire content (of which, by the way, this blog was guilty).
Also, there's non-textual content, of which in this post I consider YouTube subscriptions. It's possible to watch YouTube from elfeed, for instance with the aforementioned EMMS integration, but we can do more.
So, the plan for the post is to discuss:
- using [rdrview](https://github.com/eafer/rdrview) to extend elfeed articles;
- using [pandoc](https://pandoc.org) and LaTeX to convert articles to PDFs;
- using [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api) to download YouTube subtitles and [subed](https://github.com/sachac/subed) to control the MPV playback;
Also, heads up! You'll need lexical binding enabled for the code blocks. The easiest way to accomplish this is to add the following to the first line of `init.el`:
```emacs-lisp
;;; -*- lexical-binding: t -*-
```
## rdrview {#rdrview}
[rdrview](https://github.com/eafer/rdrview) is a command-line tool to strip webpages from clutter, extracting only parts related to the actual content. It's a standalone port of the corresponding feature of Firefox, called [Reader View](https://support.mozilla.org/en-US/kb/firefox-reader-view-clutter-free-web-pages).
It seems like the tool [isn't available](https://repology.org/project/rdrview/versions) in a whole lot of package repositories, but it's pretty easy to compile. I've put together a [Guix definition](https://github.com/SqrtMinusOne/channel-q/blob/master/rdrview.scm), which _one day_ I'll submit to upstream.
### Integrating rdrview with Emacs {#integrating-rdrview-with-emacs}
Let's start by integrating `rdrview` with Emacs. In the general case, we want to fetch both metadata and the actual content from the page.
However, the interface of `rdrview` is a bit awkward in this part, so we have the following options:
- call `rdrview` two times: with `-M` flag to fetch the metadata, and without the flag to fetch the HTML;
- call `rdrview` with `-T` flag to append the metadata to the resulting HTML.
I've decided to go with the second option. Here is a function that calls rdrview with the required flags:
```emacs-lisp
(defun my/rdrview-get (url callback)
"Get the rdrview representation of URL.
Call CALLBACK with the output."
(let* ((buffer (generate-new-buffer "rdrview"))
(proc (start-process "rdrview" buffer "rdrview"
url "-T" "title,sitename,body"
"-H")))
(set-process-sentinel
proc
(lambda (process _msg)
(let ((status (process-status process))
(code (process-exit-status process)))
(cond ((and (eq status 'exit) (= code 0))
(progn
(funcall callback
(with-current-buffer (process-buffer process)
(buffer-string)))
(kill-buffer (process-buffer process))) )
((or (and (eq status 'exit) (> code 0))
(eq status 'signal))
(let ((err (with-current-buffer (process-buffer process)
(buffer-string))))
(kill-buffer (process-buffer process))
(user-error "Error in rdrview: %s" err)))))))
proc))
```
The function calls `callback` with the output of `rdrview`. This usually doesn't take long, but it's still nice to avoid freezing Emacs that way.
Now we have to parse the output. The `-T` flag puts the title in the `<h1>` tag, the site name site in the `<h2>` tag, and the content in a `<div>`. What's more, headers of the content are often shifted, e.g. the top-level header may well end up being and `<h2>` or `<h3>`, which does not look great in LaTeX.
With that said, here's a function that does the required changes:
```emacs-lisp
(defun my/rdrview-parse (dom-string)
(let ((dom (with-temp-buffer
(insert dom-string)
(libxml-parse-html-region (point-min) (point-max)))))
(let (title sitename content (i 0))
(dolist (child (dom-children (car (dom-by-id dom "readability-page-1"))))
(when (listp child)
(cond
((eq (car child) 'h1)
(setq title (dom-text child)))
((eq (car child) 'h2)
(setq sitename (dom-text child)))
((eq (car child) 'div)
(setq content child)))))
(while (and
(not (dom-by-tag content 'h1))
(dom-search
content
(lambda (el)
(when (listp el)
(pcase (car el)
('h2 (setf (car el) 'h1))
('h3 (setf (car el) 'h2))
('h4 (setf (car el) 'h3))
('h5 (setf (car el) 'h4))
('h6 (setf (car el) 'h5))))))))
`((title . ,title)
(sitename . ,sitename)
(content . ,(with-temp-buffer
(dom-print content)
(buffer-string)))))))
```
### Using rdrview from elfeed {#using-rdrview-from-elfeed}
Because I didn't find a smart way to advise the desired behavior into elfeed, here's a modification of the `elfeed-show-refresh--mail-style` function with two changes:
- it uses `rdrview` to fetch the HTML;
- it saves the resulting HTML into a buffer-local variable (we'll need that later).
<!--listend-->
```emacs-lisp
(defvar-local my/elfeed-show-rdrview-html nil)
(defun my/rdrview-elfeed-show ()
(interactive)
(unless elfeed-show-entry
(user-error "No elfeed entry in this buffer!"))
(my/rdrview-get
(elfeed-entry-link elfeed-show-entry)
(lambda (result)
(let* ((data (my/rdrview-parse result))
(inhibit-read-only t)
(title (elfeed-entry-title elfeed-show-entry))
(date (seconds-to-time (elfeed-entry-date elfeed-show-entry)))
(authors (elfeed-meta elfeed-show-entry :authors))
(link (elfeed-entry-link elfeed-show-entry))
(tags (elfeed-entry-tags elfeed-show-entry))
(tagsstr (mapconcat #'symbol-name tags ", "))
(nicedate (format-time-string "%a, %e %b %Y %T %Z" date))
(content (alist-get 'content data))
(feed (elfeed-entry-feed elfeed-show-entry))
(feed-title (elfeed-feed-title feed))
(base (and feed (elfeed-compute-base (elfeed-feed-url feed)))))
(erase-buffer)
(insert (format (propertize "Title: %s\n" 'face 'message-header-name)
(propertize title 'face 'message-header-subject)))
(when elfeed-show-entry-author
(dolist (author authors)
(let ((formatted (elfeed--show-format-author author)))
(insert
(format (propertize "Author: %s\n" 'face 'message-header-name)
(propertize formatted 'face 'message-header-to))))))
(insert (format (propertize "Date: %s\n" 'face 'message-header-name)
(propertize nicedate 'face 'message-header-other)))
(insert (format (propertize "Feed: %s\n" 'face 'message-header-name)
(propertize feed-title 'face 'message-header-other)))
(when tags
(insert (format (propertize "Tags: %s\n" 'face 'message-header-name)
(propertize tagsstr 'face 'message-header-other))))
(insert (propertize "Link: " 'face 'message-header-name))
(elfeed-insert-link link link)
(insert "\n")
(cl-loop for enclosure in (elfeed-entry-enclosures elfeed-show-entry)
do (insert (propertize "Enclosure: " 'face 'message-header-name))
do (elfeed-insert-link (car enclosure))
do (insert "\n"))
(insert "\n")
(if content
(elfeed-insert-html content base)
(insert (propertize "(empty)\n" 'face 'italic)))
(setq-local my/elfeed-show-rdrview-html content)
(goto-char (point-min))))))
```
That way, calling `M-x my/rdrview-elfeed-show` replaces the original content with one from `rdrview`.
### How well does it work? {#how-well-does-it-work}
Rather ironically, it works well with sites that already ship with proper RSS, like [Protesilaos Stavrou's](https://protesilaos.com/) or [Karthik Chikmagalur's](https://karthinks.com/software/simple-folding-with-hideshow/) blogs or [The Atlantic](https://www.theatlantic.com/world/) magazine.
Of my other subscriptions, it does a pretty good job with [The Verge](https://www.theverge.com/), which by default sends entries truncated by the words "Read the full article". For [Ars Technica](https://arstechnica.com/), it works only if the story is not large enough, otherwise the site returns its HTML-based pagination interface.
For paywalled sites such as [New York Times](https://www.nytimes.com/) or [The Economist](https://www.economist.com/), this usually doesn't work (by the way, what's the problem with providing individual RSS feeds for subscribers?). If you need this kind of thing, I'd suggest using the [RSS-Bridge](https://github.com/RSS-Bridge/rss-bridge) project. And if something is not available, contributing business logic there definitely makes more sense than implementing workarounds in Emacs Lisp.
## LaTeX and pandoc {#latex-and-pandoc}
However, I also find that I'm not really a fan of reading articles from Emacs. Somehow what works for program code doesn't work that well for natural text. When I have to, I usually switch the Emacs theme to a light one.
But the best solution I've found so far is to render the required articles as PDFs. I may even print out some large articles I want to read.
### Template {#template}
So first, we need a LaTeX template. Pandoc already ships with one, but I don't like it too much, so I've put up a template from my LaTeX styles, targeting my preferred XeLaTeX engine.
I'll add the code here for completeness' sake, but if you use LaTeX, you'll probably be better off using your own setup. Be sure to define the following variables:
- `main-lang` and `other-lang` for polyglossia (or remove them if you have only one language)
- `title`
- `subtitle`
- `author`
- `date`
<!--listend-->
```latex
\documentclass[a4paper, 12pt]{extarticle}
% ====== Math ======
\usepackage{amsmath} % Math stuff
\usepackage{amssymb}
\usepackage{mathspec}
% ====== List ======
\usepackage{enumitem}
\usepackage{etoolbox}
\setlist{nosep, topsep=-10pt} % Remove sep-s beetween list elements
\setlist[enumerate]{label*=\arabic*.}
\setlist[enumerate,1]{after=\vspace{0.5\baselineskip}}
\setlist[itemize,1]{after=\vspace{0.5\baselineskip}}
\AtBeginEnvironment{itemize}{%
\setlist[enumerate]{label=\arabic*.}
\setlist[enumerate,1]{after=\vspace{0\baselineskip}}
}
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
% ====== Link ======
\usepackage{xcolor}
\usepackage{hyperref} % Links
\hypersetup{
colorlinks=true,
citecolor=blue,
filecolor=blue,
linkcolor=blue,
urlcolor=blue,
}
% Linebreaks for urls
\expandafter\def\expandafter\UrlBreaks\expandafter{\UrlBreaks% save the current one
\do\a\do\b\do\c\do\d\do\e\do\f\do\g\do\h\do\i\do\j%
\do\k\do\l\do\m\do\n\do\o\do\p\do\q\do\r\do\s\do\t%
\do\u\do\v\do\w\do\x\do\y\do\z\do\A\do\B\do\C\do\D%
\do\E\do\F\do\G\do\H\do\I\do\J\do\K\do\L\do\M\do\N%
\do\O\do\P\do\Q\do\R\do\S\do\T\do\U\do\V\do\W\do\X%
\do\Y\do\Z}
% ====== Table ======
\usepackage{array}
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{multirow}
\usepackage{calc}
% ====== Images ======
\usepackage{graphicx} % Pictures
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
\newcommand{\noimage}{%
\setlength{\fboxsep}{-\fboxrule}%
\fbox{\phantom{\rule{150pt}{100pt}}}% Framed box
}
\makeatletter
\patchcmd{\Gin@ii}
{\begingroup}% <search>
{\begingroup\renewcommand{\@latex@error}[2]{\noimage}}% <replace>
{}% <success>
{}% <failure>
\makeatother
% ====== Misc ======
\usepackage{fancyvrb}
\usepackage{csquotes}
\usepackage[normalem]{ulem}
% Quotes and verses style
\AtBeginEnvironment{quote}{\singlespacing}
\AtBeginEnvironment{verse}{\singlespacing}
% ====== Text spacing ======
\usepackage{setspace} % String spacing
\onehalfspacing{}
\usepackage{indentfirst}
\setlength\parindent{0cm}
\setlength\parskip{6pt}
% ====== Page layout ======
\usepackage[ % Margins
left=2cm,
right=2cm,
top=2cm,
bottom=2cm
]{geometry}
% ====== Document sectioning ======
\usepackage{titlesec}
\titleformat*{\section}{\bfseries}
\titleformat*{\subsection}{\bfseries}
\titleformat*{\subsubsection}{\bfseries}
\titleformat*{\paragraph}{\bfseries}
\titleformat*{\subparagraph}{\bfseries\itshape}% chktex 6
\titlespacing*{\section}{0cm}{12pt}{3pt}
\titlespacing*{\subsection}{0cm}{12pt}{3pt}
\titlespacing*{\subsubsection}{0cm}{12pt}{0pt}
\titlespacing*{\paragraph}{0pt}{6pt}{6pt}
\titlespacing*{\subparagraph}{0pt}{6pt}{3pt}
\makeatletter
\providecommand{\subtitle}[1]{
\apptocmd{\@title}{\par {\large #1 \par}}{}{}
}
\makeatother
% ====== Pandoc =======
$if(highlighting-macros)$
$highlighting-macros$
$endif$
% ====== Language ======
\usepackage{polyglossia}
\setdefaultlanguage{$main-lang$}
\setotherlanguage{$other-lang$}
\defaultfontfeatures{Ligatures={TeX}}
\setmainfont{Open Sans}
\newfontfamily\cyrillicfont{Open Sans}
\setmonofont[Scale=0.9]{DejaVu Sans Mono}
\newfontfamily{\cyrillicfonttt}{DejaVu Sans Mono}[Scale=0.8]
\usepackage{bidi}
\usepackage{microtype}
\setlength{\emergencystretch}{3pt}
$if(title)$
\title{$title$}
$endif$
$if(subtitle)$
\subtitle{$subtitle$}
$endif$
$if(author)$
\author{$for(author)$$author$$sep$ \and $endfor$}
$endif$
$if(date)$
\date{$date$}
$endif$
\begin{document}
\maketitle{}
$body$
\end{document}
```
### Invoking pandoc {#invoking-pandoc}
Now that we have the template, let's save it somewhere and store the path to a variable:
```emacs-lisp
(setq my/rdrview-template (expand-file-name
(concat user-emacs-directory "rdrview.tex")))
```
And let's invoke pandoc. We need to pass the following flags:
- `--pdf-engine=xelatex`, of course
- `--template <path-to-template>`;
- `-o <path-to-pdf>`;
- `--variable key=value`.
In fact, pandoc is a pretty awesome tool in the sense that it allows for feeding custom variables to rich-language templates.
So, the rendering function is as follows:
```emacs-lisp
(cl-defun my/rdrview-render (content type variables callback
&key file-name overwrite)
"Render CONTENT with pandoc.
TYPE is a file extension as supported by pandoc, for instance,
html or txt. VARIABLES is an alist that is fed into the
template. After the rendering is complete successfully, CALLBACK
is called with the resulting PDF.
FILE-NAME is a path to the resulting PDF. If nil it's generated
randomly.
If a file with the given FILE-NAME already exists, the function will
invoke CALLBACK straight away without doing the rendering, unless
OVERWRITE is non-nil."
(unless file-name
(setq file-name (format "/tmp/%d.pdf" (random 100000000))))
(let (params
(temp-file-name (format "/tmp/%d.%s" (random 100000000) type)))
(cl-loop for (key . value) in variables
when value
do (progn
(push "--variable" params)
(push (format "%s=%s" key value) params)))
(setq params (nreverse params))
(if (and (file-exists-p file-name) (not overwrite))
(funcall callback file-name)
(with-temp-file temp-file-name
(insert content))
(let ((proc (apply #'start-process
"pandoc" (get-buffer-create "*Pandoc*") "pandoc"
temp-file-name "-o" file-name
"--pdf-engine=xelatex" "--template" my/rdrview-template
params)))
(set-process-sentinel
proc
(lambda (process _msg)
(let ((status (process-status process))
(code (process-exit-status process)))
(cond ((and (eq status 'exit) (= code 0))
(progn
(message "Done!")
(funcall callback file-name)))
((or (and (eq status 'exit) (> code 0))
(eq status 'signal))
(user-error "Error in pandoc. Check the *Pandoc* buffer")))))))))
```
### Opening elfeed entries {#opening-elfeed-entries}
Now we have everything required to open elfeed entries.
Also, in my case elfeed entries come in two languages, so I have to set `main-lang` and `other-lang` variables accordingly. Here's the main function:
```emacs-lisp
(setq my/elfeed-pdf-dir (expand-file-name "~/.elfeed/pdf/"))
(defun my/elfeed-open-pdf (entry overwrite)
"Open the current elfeed ENTRY with a pdf viewer.
If OVERWRITE is non-nil, do the rendering even if the resulting
PDF already exists."
(interactive (list elfeed-show-entry current-prefix-arg))
(let ((authors (mapcar (lambda (m) (plist-get m :name)) (elfeed-meta entry :authors)))
(feed-title (elfeed-feed-title (elfeed-entry-feed entry)))
(tags (mapconcat #'symbol-name (elfeed-entry-tags entry) ", "))
(date (format-time-string "%a, %e %b %Y"
(seconds-to-time (elfeed-entry-date entry))))
(content (elfeed-deref (elfeed-entry-content entry)))
(file-name (concat my/elfeed-pdf-dir
(elfeed-ref-id (elfeed-entry-content entry))
".pdf"))
(main-language "english")
(other-language "russian"))
(unless content
(user-error "No content!"))
(setq subtitle
(cond
((seq-empty-p authors) feed-title)
((and (not (seq-empty-p (car authors)))
(string-match-p (regexp-quote (car authors)) feed-title)) feed-title)
(t (concat (string-join authors ", ") "\\\\" feed-title))))
(when (member 'ru (elfeed-entry-tags entry))
(setq main-language "russian")
(setq other-language "english"))
(my/rdrview-render
(if (bound-and-true-p my/elfeed-show-rdrview-html)
my/elfeed-show-rdrview-html
content)
(elfeed-entry-content-type entry)
`((title . ,(elfeed-entry-title entry))
(subtitle . ,subtitle)
(date . ,date)
(tags . ,tags)
(main-lang . ,main-language)
(other-lang . ,other-language))
(lambda (file-name)
(start-process "xdg-open" nil "xdg-open" file-name))
:file-name file-name
:overwrite current-prefix-arg)))
```
If the `my/elfeed-show-rdrview-html` variable is bound and true, then the content in this buffer was retrieved via `rdrview`, so we'll use that instead of the output of `elfeed-deref`.
Now we can open elfeed entries in a PDF viewer, which I find much nicer to read. Given that RSS feeds generally ship with simpler HTML than the regular websites, results usually look awesome:
{{< figure src="/ox-hugo/pdf-prot.png" >}}
### Opening arbitrary sites {#opening-arbitrary-sites}
As you may have noticed, we also can display arbitrary web pages with this setup, so let's go ahead and implement that:
```emacs-lisp
(defun my/get-languages (url)
(let ((main-lang "english")
(other-lang "russian"))
(when (string-match-p (rx ".ru") url)
(setq main-lang "russian"
other-lang "english"))
(list main-lang other-lang)))
(defun my/rdrview-open (url overwrite)
(interactive
(let ((url (read-from-minibuffer
"URL: "
(if (bound-and-true-p elfeed-show-entry)
(elfeed-entry-link elfeed-show-entry)))))
(when (string-empty-p url)
(user-error "URL is empty"))
(list url current-prefix-arg)))
(my/rdrview-get
url
(lambda (res)
(let ((data (my/rdrview-parse res))
(langs (my/get-languages url)))
(my/rdrview-render
(alist-get 'content data)
'html
`((title . ,(alist-get 'title data))
(subtitle . ,(alist-get 'sitename data))
(main-lang . ,(nth 0 langs))
(other-lang . ,(nth 1 langs)))
(lambda (file-name)
(start-process "xdg-open" nil "xdg-open" file-name)))))))
```
Unfortunately, this part doesn't work that well, so we can't just uninstall Firefox or Chromium and browse the web from a PDF viewer.
The most common problem I've encountered is incorrectly formed pictures, such as `.png` files without the boundary info. I'm sure you've also come across this if you ever tried to insert a lot of Internet pictures into a LaTeX document.
However, sans the pictures issue, for certain sites like Wikipedia this is usable. For instance, here's how the Emacs page looks:
![](/ox-hugo/pdf-emacs.png)
## YouTube transcripts {#youtube-transcripts}
### Getting subtitles {#getting-subtitles}
Finally, let's get to transcripts.
In principle, the YouTube API allows for downloading subtitles, but I've found [this awesome Python script](https://github.com/jdepoix/youtube-transcript-api) which does the same. You can install it from `pip`, or here's mine [Guix definition](https://github.com/SqrtMinusOne/channel-q/blob/master/youtube-transcript-api.scm) once again.
Much like the previous cases, we need to invoke the program and save the output. The [WebVTT](https://en.wikipedia.org/wiki/WebVTT) format will work well enough for our purposes. Here comes the function:
```emacs-lisp
(cl-defun my/youtube-subtitles-get (video-id callback &key file-name overwrite)
"Get subtitles for VIDEO-ID in WebVTT format.
Call CALLBACK when done.
FILE-NAME is a path to the resulting WebVTT file. If nil it's
generated randomly.
If a file with the given FILE-NAME already exists, the function will
invoke CALLBACK straight away without doing the rendering, unless
OVERWRITE is non-nil."
(unless file-name
(setq file-name (format "/tmp/%d.vtt" (random 100000000))))
(if (and (file-exists-p file-name) (not overwrite))
(funcall callback file-name)
(let* ((buffer (generate-new-buffer "youtube-transcripts"))
(proc (start-process "youtube_transcript_api" buffer
"youtube_transcript_api" video-id
"--format" "webvtt")))
(set-process-sentinel
proc
(lambda (process _msg)
(let ((status (process-status process))
(code (process-exit-status process)))
(cond ((and (eq status 'exit) (= code 0))
(progn
(with-current-buffer (process-buffer process)
(setq buffer-file-name file-name)
(save-buffer))
(kill-buffer (process-buffer process))
(funcall callback file-name)))
((or (and (eq status 'exit) (> code 0))
(eq status 'signal))
(let ((err (with-current-buffer (process-buffer process)
(buffer-string))))
(kill-buffer (process-buffer process))
(user-error "Error in youtube_transcript_api: %s" err)))))))
proc)))
```
### elfeed and subed {#elfeed-and-subed}
Now that we have a standalone function, let's invoke it with the current `elfeed-show-entry`:
```emacs-lisp
(setq my/elfeed-srt-dir (expand-file-name "~/.elfeed/srt/"))
(defun my/elfeed-youtube-subtitles (entry &optional arg)
"Get subtitles for the current elfeed ENTRY.
Works only in the entry is a YouTube video.
If ARG is non-nil, re-fetch the subtitles regardless of whether
they were fetched before."
(interactive (list elfeed-show-entry current-prefix-arg))
(let ((video-id (cadr
(assoc "watch?v"
(url-parse-query-string
(substring
(url-filename
(url-generic-parse-url (elfeed-entry-link entry)))
1))))))
(unless video-id
(user-error "Can't get video ID from the entry"))
(my/youtube-subtitles-get
video-id
(lambda (file-name)
(with-current-buffer (find-file-other-window file-name)
(setq-local elfeed-show-entry entry)
(goto-char (point-min))))
:file-name (concat my/elfeed-srt-dir
(elfeed-ref-id (elfeed-entry-content entry))
".vtt")
:overwrite arg)))
```
That opens up a `.vtt` buffer with the subtitles for the current video, which means now we can use the functionality of Sacha Chua's awesome package called [subed](https://github.com/sachac/subed).
This package, besides syntax highlighting, allows for controlling the MPV playback, for instance by moving the cursor in the subtitles buffer. Using that requires having the URL of the video in this buffer, which necessitates the line with `setq-local` in the previous function.
Also, the package launches its own instance of MPV to control it via JSON-IPC, so there seems to be no easy way to integrate it with EMMS. But at least I can reuse the `emms-player-mpv-parameters` variable, the method of setting which I've discussed in a [previous blog post](https://sqrtminusone.xyz/posts/2021-09-07-emms/). The function is as follows:
```emacs-lisp
(defun my/subed-elfeed (entry)
"Open the video file from elfeed ENTRY in MPV.
This has to be launched from inside the subtitles buffer, opened
by the `my/elfeed-youtube-subtitles' function."
(interactive (list elfeed-show-entry))
(unless entry
(user-error "No entry!"))
(unless (derived-mode-p 'subed-mode)
(user-error "Not subed mode!"))
(setq-local subed-mpv-arguments
(seq-uniq
(append subed-mpv-arguments emms-player-mpv-parameters)))
(setq-local subed-mpv-video-file (elfeed-entry-link entry))
(subed-mpv--play subed-mpv-video-file))
```
And here's how it looks when used (the video on the screenshot is [this System Crafters' stream](https://www.youtube.com/watch?v=qjAIXCmhCQQ)):
![](/ox-hugo/pdf-subed.png)
Keep in mind that this function has to be launched inside the buffer opened by the `my/elfeed-youtube-subtitles` function.

View file

@ -1,7 +1,7 @@
#+HUGO_SECTION: posts
#+HUGO_BASE_DIR: ../
#+TITLE: Extending elfeed entries with a PDF viewer, subtitles fetcher, and a speech recognition engine
#+DATE: 2022-05-08
#+TITLE: Extending elfeed entries with a PDF viewer and subtitles fetcher
#+DATE: 2022-05-09
#+HUGO_TAGS: emacs
#+HUGO_TAGS: org-mode
#+HUGO_DRAFT: true
@ -9,17 +9,14 @@
* Intro
[[https://github.com/skeeto/elfeed][elfeed]] is one of the most popular Emacs packages, and it's also one in which I ended up investing a lot of effort. I wrote about the [[https://sqrtminusone.xyz/posts/2021-09-07-emms/][EMMS integration]] and even made a [[https://github.com/SqrtMinusOne/elfeed-summary][custom frontpage]] to my liking.
One general issue with RSS readers is that sites often limit the amount of information shipped in the RSS feed. For instance, oftentimes an RSS entry doesn't include the entire content.
However, sites frequently limit the amount of information shipped in the RSS feed. Oftentimes the entry doesn't include the entire content (of which, by the way, this blog was guilty).
Also, there's non-textual content. If you're subscribed to video or audio feeds, for instance, YouTube channels or podcasts, you have to use video and audio players correspondingly. Which elfeed allows, for instance with the aforementioned EMMS integration. But we can do more.
Also, there's non-textual content, of which in this post I consider YouTube subscriptions. It's possible to watch YouTube from elfeed, for instance with the aforementioned EMMS integration, but we can do more.
Which is why in this post I consider the following tricks to extend elfeed entries:
So, the plan for the post is to discuss:
- using [[https://github.com/eafer/rdrview][rdrview]] to extend elfeed articles;
- using [[https://pandoc.org][pandoc]] and LaTeX to convert articles to PDFs;
- using [[https://github.com/jdepoix/youtube-transcript-api][youtube-transcript-api]] to download YouTube subtitles and [[https://github.com/sachac/subed][subed]] to control the MPV playback;
- using [[https://github.com/alphacep/vosk-api][Vosk]] to get transcripts of podcasts.
That's a lot of stuff that I even considered splitting into two separate posts, but I think a long one will also do just fine.
Also, heads up! You'll need lexical binding enabled for the code blocks. The easiest way to accomplish this is to add the following to the first line of =init.el=:
#+begin_src emacs-lisp
@ -35,10 +32,10 @@ It seems like the tool [[https://repology.org/project/rdrview/versions][isn't av
Let's start by integrating =rdrview= with Emacs. In the general case, we want to fetch both metadata and the actual content from the page.
However, the interface of =rdrview= is a bit awkward in this part, so we have the following options:
- call =rdrview= two times: with =-M= flag to fetch the metadata and without it;
- call =rdrview= two times: with =-M= flag to fetch the metadata, and without the flag to fetch the HTML;
- call =rdrview= with =-T= flag to append the metadata to the resulting HTML.
I've decided to go with the second option. So, here is a function that calls rdrview with the required flags:
I've decided to go with the second option. Here is a function that calls rdrview with the required flags:
#+begin_src emacs-lisp
(defun my/rdrview-get (url callback)
"Get the rdrview representation of URL.
@ -68,9 +65,9 @@ Call CALLBACK with the output."
proc))
#+end_src
The function calls =callback= with the output of =rdrview=. Generally, it doesn't take much time, but it's still nice to avoid freezing Emacs that way.
The function calls =callback= with the output of =rdrview=. This usually doesn't take long, but it's still nice to avoid freezing Emacs that way.
Now, we have to parse the output. The =-T= flag put the title in a =<h1>= flag and the name of the site in a =<h2>= flag and the content in a =<div>=. What's more, headers of the content are often shifted, e.g. the top-level header may well end up being and =<h2>= or =<h3>=, which does not look great in LaTeX.
Now we have to parse the output. The =-T= flag puts the title in the =<h1>= tag, the site name site in the =<h2>= tag, and the content in a =<div>=. What's more, headers of the content are often shifted, e.g. the top-level header may well end up being and =<h2>= or =<h3>=, which does not look great in LaTeX.
With that said, here's a function that does the required changes:
#+begin_src emacs-lisp
@ -108,7 +105,7 @@ With that said, here's a function that does the required changes:
#+end_src
** Using rdrview from elfeed
Because I didn't find a smart way to advise the wanted behavior into elfeed, here's a modification of the =elfeed-show-refresh--mail-style= function with two changes:
Because I didn't find a smart way to advise the desired behavior into elfeed, here's a modification of the =elfeed-show-refresh--mail-style= function with two changes:
- it uses =rdrview= to fetch the HTML;
- it saves the resulting HTML into a buffer-local variable (we'll need that later).
@ -169,20 +166,20 @@ Because I didn't find a smart way to advise the wanted behavior into elfeed, her
That way, calling =M-x my/rdrview-elfeed-show= replaces the original content with one from =rdrview=.
** How well does it work?
Rather ironically, it works well with sites that already ship with a proper RSS, like [[https://protesilaos.com/][Protesilaos Stavrou's]] or [[https://karthinks.com/software/simple-folding-with-hideshow/][Karthik Chikmagalur's]] blogs, or [[https://www.theatlantic.com/world/][The Atlantic]] magazine.
Rather ironically, it works well with sites that already ship with proper RSS, like [[https://protesilaos.com/][Protesilaos Stavrou's]] or [[https://karthinks.com/software/simple-folding-with-hideshow/][Karthik Chikmagalur's]] blogs or [[https://www.theatlantic.com/world/][The Atlantic]] magazine.
From other my subscriptions, it does a pretty good job with [[https://www.theverge.com/][The Verge]], which by default sends entries truncated by the words "Read the full article". For [[https://arstechnica.com/][Ars Technica]], it works only if the story is not large enough, because otherwise, the site returns its HTML-based pagination interface.
Of my other subscriptions, it does a pretty good job with [[https://www.theverge.com/][The Verge]], which by default sends entries truncated by the words "Read the full article". For [[https://arstechnica.com/][Ars Technica]], it works only if the story is not large enough, otherwise the site returns its HTML-based pagination interface.
For paywalled sites, like [[https://www.nytimes.com/][New York Times]] or [[https://www.economist.com/][The Economist]], it usually doesn't work (by the way, what's the problem with providing individual RSS feeds for subscribers?). If you want stuff like that, I'd advise using the [[https://github.com/RSS-Bridge/rss-bridge][RSS-Bridge]] project. And if something is not available, contributing business logic there definitely makes more sense than implementing workarounds in Emacs Lisp.
For paywalled sites such as [[https://www.nytimes.com/][New York Times]] or [[https://www.economist.com/][The Economist]], this usually doesn't work (by the way, what's the problem with providing individual RSS feeds for subscribers?). If you need this kind of thing, I'd suggest using the [[https://github.com/RSS-Bridge/rss-bridge][RSS-Bridge]] project. And if something is not available, contributing business logic there definitely makes more sense than implementing workarounds in Emacs Lisp.
* LaTeX and pandoc
However, I find that I'm not really a fan of reading articles from Emacs. Somehow what works for program code doesn't work that well with the natural text. When I have to, I usually switch the Emacs theme to the light one.
However, I also find that I'm not really a fan of reading articles from Emacs. Somehow what works for program code doesn't work that well for natural text. When I have to, I usually switch the Emacs theme to a light one.
But the best solution I've found so far is to render the required articles as PDFs. I may even print out some large articles I want to read.
** Template
So, first, we need a LaTeX template. Pandoc already ships with one, but I don't like it too much, so I've put up a template from my LaTeX styles, targeting my preferred XeLaTeX engine.
So first, we need a LaTeX template. Pandoc already ships with one, but I don't like it too much, so I've put up a template from my LaTeX styles, targeting my preferred XeLaTeX engine.
I'll add the code here for completeness' sake, but if you use LaTeX, you'll probably end up better using your own setup. Be sure to define the following variables:
I'll add the code here for completeness' sake, but if you use LaTeX, you'll probably be better off using your own setup. Be sure to define the following variables:
- =main-lang= and =other-lang= for polyglossia (or remove them if you have only one language)
- =title=
- =subtitle=
@ -234,9 +231,6 @@ I'll add the code here for completeness' sake, but if you use LaTeX, you'll prob
\do\O\do\P\do\Q\do\R\do\S\do\T\do\U\do\V\do\W\do\X%
\do\Y\do\Z}
% ====== Captions ======
% TODO
% ====== Table ======
\usepackage{array}
\usepackage{booktabs}
@ -369,13 +363,13 @@ Now that we have the template, let's save it somewhere and store the path to a v
(concat user-emacs-directory "rdrview.tex")))
#+end_src
Now let's invoke pandoc. We need to pass the following flags:
And let's invoke pandoc. We need to pass the following flags:
- =--pdf-engine=xelatex=, of course
- =template <path-to-template>=;
- =--template <path-to-template>=;
- =-o <path-to-pdf>=;
- =--variable key=value=.
In fact, pandoc is a pretty awesome tool in the sense that it allows for feeding custom variables to templates and using a rich templating language.
In fact, pandoc is a pretty awesome tool in the sense that it allows for feeding custom variables to rich-language templates.
So, the rendering function is as follows:
#+begin_src emacs-lisp
@ -430,7 +424,7 @@ OVERWRITE is non-nil."
** Opening elfeed entries
Now we have everything required to open elfeed entries.
Also, in my case elfeed entries come in two languages, so I have to set =main-lang= and =other-lang= variables accordingly. Here's the function:
Also, in my case elfeed entries come in two languages, so I have to set =main-lang= and =other-lang= variables accordingly. Here's the main function:
#+begin_src emacs-lisp
(setq my/elfeed-pdf-dir (expand-file-name "~/.elfeed/pdf/"))
@ -443,7 +437,8 @@ PDF already exists."
(let ((authors (mapcar (lambda (m) (plist-get m :name)) (elfeed-meta entry :authors)))
(feed-title (elfeed-feed-title (elfeed-entry-feed entry)))
(tags (mapconcat #'symbol-name (elfeed-entry-tags entry) ", "))
(date (format-time-string "%a, %e %b %Y" (seconds-to-time (elfeed-entry-date entry))))
(date (format-time-string "%a, %e %b %Y"
(seconds-to-time (elfeed-entry-date entry))))
(content (elfeed-deref (elfeed-entry-content entry)))
(file-name (concat my/elfeed-pdf-dir
(elfeed-ref-id (elfeed-entry-content entry))
@ -478,14 +473,15 @@ PDF already exists."
:overwrite current-prefix-arg)))
#+end_src
If the =my/elfeed-show-rdrview-html= variable is bound and true, then the content in this buffer was retrieved by =rdrview=, so we'll use that instead of the output of =elfeed-dered=.
If the =my/elfeed-show-rdrview-html= variable is bound and true, then the content in this buffer was retrieved via =rdrview=, so we'll use that instead of the output of =elfeed-deref=.
So, we can open elfeed entries in a PDF viewer, which I find much nicer to read. Given that RSS feeds generally ship with much simpler HTML than the proper websites, results usually look awesome:
Now we can open elfeed entries in a PDF viewer, which I find much nicer to read. Given that RSS feeds generally ship with simpler HTML than the regular websites, results usually look awesome:
[[./images/pdf-prot.png]]
** Opening arbitrary sites
As you might've noticed, we also can renderer arbitrary web pages with this setup, so let's go ahead and implement that:
As you may have noticed, we also can display arbitrary web pages with this setup, so let's go ahead and implement that:
#+begin_src emacs-lisp
(defun my/get-languages (url)
(let ((main-lang "english")
@ -522,17 +518,17 @@ As you might've noticed, we also can renderer arbitrary web pages with this setu
Unfortunately, this part doesn't work that well, so we can't just uninstall Firefox or Chromium and browse the web from a PDF viewer.
The most common problem I faced is incorrectly formed pictures, for instance, =.png= files without the boundary info. I'm sure you've encountered this if you ever tried to insert a lot of Internet pictures into a LaTeX document.
The most common problem I've encountered is incorrectly formed pictures, such as =.png= files without the boundary info. I'm sure you've also come across this if you ever tried to insert a lot of Internet pictures into a LaTeX document.
However, sans the pictures issue, it works nicely with Wikipedia pages. For instance, here's how the Emacs page looks:
However, sans the pictures issue, for certain sites like Wikipedia this is usable. For instance, here's how the Emacs page looks:
[[./images/pdf-emacs.png]]
* YouTube transcripts
** Getting subtitles
Now, let's get to transcripts.
Finally, let's get to transcripts.
In principle, the YouTube API allows for downloading subtitles, but I've found [[https://github.com/jdepoix/youtube-transcript-api][this awesome Python script]] which does the same. You can install it from =pip=, or here's mine [[https://github.com/SqrtMinusOne/channel-q/blob/master/youtube-transcript-api.scm][Guix definition]] once again.
Much like the previous cases, we need to invoke the program and save the output. The [[https://en.wikipedia.org/wiki/WebVTT][WebVTT]] format will work well enough for our purposes. Here goes the function:
Much like the previous cases, we need to invoke the program and save the output. The [[https://en.wikipedia.org/wiki/WebVTT][WebVTT]] format will work well enough for our purposes. Here comes the function:
#+begin_src emacs-lisp
(cl-defun my/youtube-subtitles-get (video-id callback &key file-name overwrite)
"Get subtitles for VIDEO-ID in WebVTT format.
@ -608,11 +604,11 @@ they were fetched before."
:overwrite arg)))
#+end_src
That opens up a =.vtt= buffer with subtitles for the current video, which means now we can use the functionality of an awesome package of Sacha Chua called [[https://github.com/sachac/subed][subed]].
That opens up a =.vtt= buffer with the subtitles for the current video, which means now we can use the functionality of Sacha Chua's awesome package called [[https://github.com/sachac/subed][subed]].
This package, besides syntax highlighting, allows for controlling the MPV playback, for instance by moving the cursor in the subtitles buffer. Using that requires having the URL of the video in the subtitles buffer, which is why the string with =setq-local= in the previous function is necessary.
This package, besides syntax highlighting, allows for controlling the MPV playback, for instance by moving the cursor in the subtitles buffer. Using that requires having the URL of the video in this buffer, which necessitates the line with =setq-local= in the previous function.
Also, the package launches its own instance of MPV to control it via JSON-IPC, so there seems to be no easy way to integrate it with EMMS. But at least I can reuse the =emms-player-mpv-parameters= variable, the method of setting which I've discussed in a [[https://sqrtminusone.xyz/posts/2021-09-07-emms/][previous blog post]]. So, here's the function:
Also, the package launches its own instance of MPV to control it via JSON-IPC, so there seems to be no easy way to integrate it with EMMS. But at least I can reuse the =emms-player-mpv-parameters= variable, the method of setting which I've discussed in a [[https://sqrtminusone.xyz/posts/2021-09-07-emms/][previous blog post]]. The function is as follows:
#+begin_src emacs-lisp
(defun my/subed-elfeed (entry)
"Open the video file from elfeed ENTRY in MPV.
@ -631,7 +627,7 @@ by the `my/elfeed-youtube-subtitles' function."
(subed-mpv--play subed-mpv-video-file))
#+end_src
And here's how using it looks:
And here's how it looks when used (the video on the screenshot is [[https://www.youtube.com/watch?v=qjAIXCmhCQQ][this System Crafters' stream]]):
[[./images/pdf-subed.png]]
Keep in mind that this function has to be launched inside the buffer opened by the =my/elfeed-youtube-subtitles= function.

Binary file not shown.

After

Width:  |  Height:  |  Size: 344 KiB

BIN
static/ox-hugo/pdf-prot.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 671 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 724 KiB