1 sfeed
2 -----
3
4 RSS and Atom parser (and some format programs).
5
6 It converts RSS or Atom feeds from XML to a TAB-separated file. There are
7 formatting programs included to convert this TAB-separated format to various
8 other formats. There are also some programs and scripts included to import and
9 export OPML and to fetch, filter, merge and order feed items.
10
11
12 Build and install
13 -----------------
14
15 $ make
16 # make install
17
18
19 To build sfeed without sfeed_curses set SFEED_CURSES to an empty string:
20
21 $ make SFEED_CURSES=""
22 # make SFEED_CURSES="" install
23
24
25 To change the theme for sfeed_curses you can set SFEED_THEME. See the themes/
26 directory for the theme names.
27
28 $ make SFEED_THEME="templeos"
29 # make SFEED_THEME="templeos" install
30
31
32 Usage
33 -----
34
35 Initial setup:
36
37 mkdir -p "$HOME/.sfeed/feeds"
38 cp sfeedrc.example "$HOME/.sfeed/sfeedrc"
39
40 Edit the sfeedrc(5) configuration file and change any RSS/Atom feeds. This file
41 is included and evaluated as a shellscript for sfeed_update, so its functions
42 and behaviour can be overridden:
43
44 $EDITOR "$HOME/.sfeed/sfeedrc"
45
46 or you can import existing OPML subscriptions using sfeed_opml_import(1):
47
48 sfeed_opml_import < file.opml > "$HOME/.sfeed/sfeedrc"
49
50     An example to export from another RSS/Atom reader called newsboat and import
51     it for sfeed_update:
52
53 newsboat -e | sfeed_opml_import > "$HOME/.sfeed/sfeedrc"
54
55     An example to export from another RSS/Atom reader called rss2email (3.x+) and
56     import it for sfeed_update:
57
58 r2e opmlexport | sfeed_opml_import > "$HOME/.sfeed/sfeedrc"
59
60     Update feeds; this script merges the new items. See sfeed_update(1) for more
61     information on what it can do:
62
63 sfeed_update
64
65 Format feeds:
66
67 Plain-text list:
68
69 sfeed_plain $HOME/.sfeed/feeds/* > "$HOME/.sfeed/feeds.txt"
70
71 HTML view (no frames), copy style.css for a default style:
72
73 cp style.css "$HOME/.sfeed/style.css"
74 sfeed_html $HOME/.sfeed/feeds/* > "$HOME/.sfeed/feeds.html"
75
76 HTML view with the menu as frames, copy style.css for a default style:
77
78 mkdir -p "$HOME/.sfeed/frames"
79 cp style.css "$HOME/.sfeed/frames/style.css"
80 cd "$HOME/.sfeed/frames" && sfeed_frames $HOME/.sfeed/feeds/*
81
82     To automatically update your feeds periodically and format them in a way you
83     like, you can make a wrapper script and add it as a cronjob.
84
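A minimal sketch of such a wrapper script (the script path and crontab schedule
below are only examples, adjust them to your setup):

	#!/bin/sh
	# update the feeds and regenerate the formatted output.
	sfeed_update
	sfeed_plain "$HOME/.sfeed/feeds/"* > "$HOME/.sfeed/feeds.txt"
	sfeed_html "$HOME/.sfeed/feeds/"* > "$HOME/.sfeed/feeds.html"

An example crontab(5) entry to run it every hour:

	0 * * * * /path/to/sfeed_wrapper.sh
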
85     Most protocols are supported because curl(1) is used by default, and proxy
86     settings from the environment (such as the $http_proxy environment variable)
87     are also used.
88
89 The sfeed(1) program itself is just a parser that parses XML data from stdin
90 and is therefore network protocol-agnostic. It can be used with HTTP, HTTPS,
91 Gopher, SSH, etc.
92
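For example, to fetch a feed with curl(1) and parse it with sfeed(1) directly
(the URL and output filename are only examples):

	curl -s "https://codemadness.org/atom.xml" | sfeed > "$HOME/.sfeed/feeds/codemadness"
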
93     See the section "Usage and examples" below and the man-pages for more
94     information on how to use sfeed(1) and the additional tools.
95
96
97 Dependencies
98 ------------
99
100 - C compiler (C99).
101 - libc (recommended: C99 and POSIX >= 200809).
102
103
104 Optional dependencies
105 ---------------------
106
107 - POSIX make(1) for the Makefile.
108 - POSIX sh(1),
109 used by sfeed_update(1) and sfeed_opml_export(1).
110 - POSIX utilities such as awk(1) and sort(1),
111 used by sfeed_content(1), sfeed_markread(1), sfeed_opml_export(1) and
112 sfeed_update(1).
113 - curl(1) binary: https://curl.haxx.se/ ,
114 used by sfeed_update(1), but can be replaced with any tool like wget(1),
115 OpenBSD ftp(1) or hurl(1): https://codemadness.org/hurl.html
116 - iconv(1) command-line utilities,
117   used by sfeed_update(1). If the text in your RSS/Atom feeds is already UTF-8
118 encoded then you don't need this. For a minimal iconv implementation:
119 https://git.etalabs.net/cgit/noxcuse/tree/src/iconv.c
120 - xargs with support for the -P and -0 options,
121 used by sfeed_update(1).
122 - mandoc for documentation: https://mdocml.bsd.lv/
123 - curses (typically ncurses), otherwise see minicurses.h,
124 used by sfeed_curses(1).
125 - a terminal (emulator) supporting UTF-8 and the used capabilities,
126 used by sfeed_curses(1).
127
128
129 Optional run-time dependencies for sfeed_curses
130 -----------------------------------------------
131
132 - xclip for yanking the URL or enclosure. See $SFEED_YANKER to change it.
133 - xdg-open, used as a plumber by default. See $SFEED_PLUMBER to change it.
134 - awk, used by the sfeed_content and sfeed_markread script.
135 See the ENVIRONMENT VARIABLES section in the man page to change it.
136 - lynx, used by the sfeed_content script to convert HTML content.
137 An alternative: webdump: https://codemadness.org/webdump.html
138 See the ENVIRONMENT VARIABLES section in the man page to change it.
139
140
141 Formats supported
142 -----------------
143
144 sfeed supports a subset of XML 1.0 and a subset of:
145
146 - Atom 1.0 (RFC 4287): https://datatracker.ietf.org/doc/html/rfc4287
147 - Atom 0.3 (draft, historic).
148 - RSS 0.90+.
149 - RDF (when used with RSS).
150 - MediaRSS extensions (media:).
151 - Dublin Core extensions (dc:).
152
153 Other formats like JSON Feed, twtxt or certain RSS/Atom extensions are
154 supported by converting them to RSS/Atom or to the sfeed(5) format directly.
155
156
157 OS tested
158 ---------
159
160 - Linux,
161 compilers: clang, gcc, chibicc, cproc, lacc, pcc, scc, tcc,
162 libc: glibc, musl.
163 - OpenBSD (clang, gcc).
164 - NetBSD (with NetBSD curses).
165 - FreeBSD
166 - DragonFlyBSD
167 - GNU/Hurd
168 - Illumos (OpenIndiana).
169 - Windows (cygwin gcc + mintty, mingw).
170 - HaikuOS
171 - SerenityOS
172 - FreeDOS (djgpp, Open Watcom).
173 - FUZIX (sdcc -mz80, with the sfeed parser program).
174
175
176 Architectures tested
177 --------------------
178
179 amd64, ARM, aarch64, HPPA, i386, MIPS32-BE, RISCV64, SPARC64, Z80.
180
181
182 Files
183 -----
184
185 sfeed - Read XML RSS or Atom feed data from stdin. Write feed data
186 in TAB-separated format to stdout.
187 sfeed_atom - Format feed data (TSV) to an Atom feed.
188 sfeed_content - View item content, for use with sfeed_curses.
189 sfeed_curses - Format feed data (TSV) to a curses interface.
190 sfeed_frames - Format feed data (TSV) to HTML file(s) with frames.
191 sfeed_gopher - Format feed data (TSV) to Gopher files.
192 sfeed_html - Format feed data (TSV) to HTML.
193 sfeed_json - Format feed data (TSV) to JSON Feed.
194 sfeed_opml_export - Generate an OPML XML file from a sfeedrc config file.
195 sfeed_opml_import - Generate a sfeedrc config file from an OPML XML file.
196 sfeed_markread - Mark items as read/unread, for use with sfeed_curses.
197 sfeed_mbox - Format feed data (TSV) to mbox.
198 sfeed_plain - Format feed data (TSV) to a plain-text list.
199 sfeed_twtxt - Format feed data (TSV) to a twtxt feed.
200 sfeed_update - Update feeds and merge items.
201 sfeed_web - Find URLs to RSS/Atom feeds from a webpage.
202 sfeed_xmlenc - Detect character-set encoding from an XML stream.
203 sfeedrc.example - Example config file. Can be copied to $HOME/.sfeed/sfeedrc.
204 style.css - Example stylesheet to use with sfeed_html(1) and
205 sfeed_frames(1).
206
207
208 Files read at runtime by sfeed_update(1)
209 ----------------------------------------
210
211 sfeedrc - Config file. This file is evaluated as a shellscript in
212 sfeed_update(1).
213
214 At least the following functions can be overridden per feed:
215
216 - fetch: to use wget(1), OpenBSD ftp(1) or another download program.
217 - filter: to filter on fields.
218 - merge: to change the merge logic.
219 - order: to change the sort order.
220
221 See also the sfeedrc(5) man page documentation for more details.
222
223    The feeds() function is called to process the feeds. The default feed()
224    function is executed concurrently as a background job for each feed in your
225    sfeedrc(5) config file to make updating faster. The variable maxjobs can be
226    changed to limit or increase the number of concurrent jobs (8 by default).
227
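A minimal sfeedrc sketch showing the feeds() function and the maxjobs variable
(the feed names and URLs are only examples, see sfeedrc.example and sfeedrc(5)):

	# maximum amount of feeds to update concurrently.
	maxjobs=8

	# list of feeds to fetch:
	feeds() {
		# feed <name> <feedurl> [basesiteurl] [encoding]
		feed "codemadness" "https://codemadness.org/atom_content.xml"
		feed "xkcd" "https://xkcd.com/atom.xml"
	}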
228
229 Files written at runtime by sfeed_update(1)
230 -------------------------------------------
231
232 feedname - TAB-separated format containing all items per feed. The
233 sfeed_update(1) script merges new items with this file.
234 The format is documented in sfeed(5).
235
236
237 File format
238 -----------
239
240 man 5 sfeed
241 man 5 sfeedrc
242 man 1 sfeed
243
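The awk examples in the section below reference the sfeed(5) TSV fields by
number; as a quick reference (see sfeed(5) for the authoritative description):

	field 1: UNIX timestamp    field 4: content         field 7: author
	field 2: title             field 5: content-type    field 8: enclosure
	field 3: link              field 6: id              field 9: category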
244
245 Usage and examples
246 ------------------
247
248 Find RSS/Atom feed URLs from a webpage:
249
250 url="https://codemadness.org"; curl -L -s "$url" | sfeed_web "$url"
251
252 output example:
253
254 https://codemadness.org/atom.xml application/atom+xml
255 https://codemadness.org/atom_content.xml application/atom+xml
256
257 - - -
258
259    Make sure your sfeedrc config file exists; see the sfeedrc.example file. To
260    update your feeds (the configfile argument is optional):
261
262 sfeed_update "configfile"
263
264 Format the feeds files:
265
266 # Plain-text list.
267 sfeed_plain $HOME/.sfeed/feeds/* > $HOME/.sfeed/feeds.txt
268 # HTML view (no frames), copy style.css for a default style.
269 sfeed_html $HOME/.sfeed/feeds/* > $HOME/.sfeed/feeds.html
270 # HTML view with the menu as frames, copy style.css for a default style.
271 mkdir -p somedir && cd somedir && sfeed_frames $HOME/.sfeed/feeds/*
272
273 View formatted output in your browser:
274
275 $BROWSER "$HOME/.sfeed/feeds.html"
276
277 View formatted output in your editor:
278
279 $EDITOR "$HOME/.sfeed/feeds.txt"
280
281 - - -
282
283 View formatted output in a curses interface. The interface has a look inspired
284 by the mutt mail client. It has a sidebar panel for the feeds, a panel with a
285 listing of the items and a small statusbar for the selected item/URL. Some
286 functions like searching and scrolling are integrated in the interface itself.
287
288    Just like the other format programs included in sfeed, you can run it like this:
289
290 sfeed_curses ~/.sfeed/feeds/*
291
292 ... or by reading from stdin:
293
294 sfeed_curses < ~/.sfeed/feeds/xkcd
295
296    By default sfeed_curses marks the items of the last day as new/bold. This limit
297    can be overridden by setting the environment variable $SFEED_NEW_AGE to the
298    desired maximum age in seconds. To manage read/unread items in a different way,
299    a plain-text file with a list of the read URLs can be used. To enable this
300    behaviour, set the environment variable $SFEED_URL_FILE to the path of this
301    file:
302
303 export SFEED_URL_FILE="$HOME/.sfeed/urls"
304 [ -f "$SFEED_URL_FILE" ] || touch "$SFEED_URL_FILE"
305 sfeed_curses ~/.sfeed/feeds/*
306
307 It then uses the shellscript "sfeed_markread" to process the read and unread
308 items.
309
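The script can also be used directly. A sketch that marks a single URL as read
(the URL is only an example, see sfeed_markread(1) for the exact usage):

	echo "https://codemadness.org/some_item.html" | \
		sfeed_markread read "$SFEED_URL_FILE"
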
310 - - -
311
312 Example script to view feed items in a vertical list/menu in dmenu(1). It opens
313 the selected URL in the browser set in $BROWSER:
314
315 #!/bin/sh
316 url=$(sfeed_plain "$HOME/.sfeed/feeds/"* | dmenu -l 35 -i | \
317 sed -n 's@^.* \([a-zA-Z]*://\)\(.*\)$@\1\2@p')
318 test -n "${url}" && $BROWSER "${url}"
319
320 dmenu can be found at: https://git.suckless.org/dmenu/
321
322 - - -
323
324 Generate a sfeedrc config file from your exported list of feeds in OPML
325 format:
326
327 sfeed_opml_import < opmlfile.xml > $HOME/.sfeed/sfeedrc
328
329 - - -
330
331 Export an OPML file of your feeds from a sfeedrc config file (configfile
332 argument is optional):
333
334 sfeed_opml_export configfile > myfeeds.opml
335
336 - - -
337
338 The filter function can be overridden in your sfeedrc file. This allows
339 filtering items per feed. It can be used to shorten URLs, filter away
340 advertisements, strip tracking parameters and more.
341
342 # filter fields.
343 # filter(name, url)
344 filter() {
345 case "$1" in
346 "tweakers")
347 awk -F '\t' 'BEGIN { OFS = "\t"; }
348 # skip ads.
349 $2 ~ /^ADV:/ {
350 next;
351 }
352 # shorten link.
353 {
354 if (match($3, /^https:\/\/tweakers\.net\/[a-z]+\/[0-9]+\//)) {
355 $3 = substr($3, RSTART, RLENGTH);
356 }
357 print $0;
358 }';;
359 "yt BSDNow")
360 # filter only BSD Now from channel.
361 awk -F '\t' '$2 ~ / \| BSD Now/';;
362 *)
363 cat;;
364 esac | \
365 # replace youtube links with embed links.
366 sed 's@www.youtube.com/watch?v=@www.youtube.com/embed/@g' | \
367
368 awk -F '\t' 'BEGIN { OFS = "\t"; }
369 function filterlink(s) {
370 # protocol must start with HTTP, HTTPS or Gopher.
371 if (match(s, /^(http|https|gopher):\/\//) == 0) {
372 return "";
373 }
374
375 # shorten feedburner links.
376 if (match(s, /^(http|https):\/\/[^\/]+\/~r\/.*\/~3\/[^\/]+\//)) {
377 s = substr(s, RSTART, RLENGTH);
378 }
379
380 # strip tracking parameters
381 # urchin, facebook, piwik, webtrekk and generic.
382 gsub(/\?(ad|campaign|fbclid|pk|tm|utm|wt)_([^&]+)/, "?", s);
383 gsub(/&(ad|campaign|fbclid|pk|tm|utm|wt)_([^&]+)/, "", s);
384
385 gsub(/\?&/, "?", s);
386 gsub(/[\?&]+$/, "", s);
387
388 return s
389 }
390 {
391 $3 = filterlink($3); # link
392 $8 = filterlink($8); # enclosure
393
394 # try to remove tracking pixels: <img/> tags with 1px width or height.
395 gsub("<img[^>]*(width|height)[[:space:]]*=[[:space:]]*[\"'"'"' ]?1[\"'"'"' ]?[^0-9>]+[^>]*>", "", $4);
396
397 print $0;
398 }'
399 }
400
401 - - -
402
403 Aggregate feeds. This filters new entries (maximum one day old) and sorts them
404 by newest first. Prefix the feed name in the title. Convert the TSV output data
405 to an Atom XML feed (again):
406
407 #!/bin/sh
408 cd ~/.sfeed/feeds/ || exit 1
409
410 awk -F '\t' -v "old=$(($(date +'%s') - 86400))" '
411 BEGIN { OFS = "\t"; }
412 int($1) >= old {
413 $2 = "[" FILENAME "] " $2;
414 print $0;
415 }' * | \
416 sort -k1,1rn | \
417 sfeed_atom
418
419 - - -
420
421 To have a "tail(1) -f"-like FIFO stream filtering for new unique feed items and
422 showing them as plain-text per line similar to sfeed_plain(1):
423
424 Create a FIFO:
425
426 fifo="/tmp/sfeed_fifo"
427 mkfifo "$fifo"
428
429 On the reading side:
430
431 # This keeps track of unique lines so might consume much memory.
432 # It tries to reopen the $fifo after 1 second if it fails.
433 while :; do cat "$fifo" || sleep 1; done | awk '!x[$0]++'
434
435 On the writing side:
436
437 feedsdir="$HOME/.sfeed/feeds/"
438 cd "$feedsdir" || exit 1
439 test -p "$fifo" || exit 1
440
441 # 1 day is old news, don't write older items.
442 awk -F '\t' -v "old=$(($(date +'%s') - 86400))" '
443 BEGIN { OFS = "\t"; }
444 int($1) >= old {
445 $2 = "[" FILENAME "] " $2;
446 print $0;
447 }' * | sort -k1,1n | sfeed_plain | cut -b 3- > "$fifo"
448
449 cut -b is used to trim the "N " prefix of sfeed_plain(1).
450
451 - - -
452
453    For some podcast feeds the following code can be used to filter the latest
454    enclosure URL (probably an audio file):
455
456 awk -F '\t' 'BEGIN { latest = 0; }
457 length($8) {
458 ts = int($1);
459 if (ts > latest) {
460 url = $8;
461 latest = ts;
462 }
463 }
464 END { if (length(url)) { print url; } }'
465
466 ... or on a file already sorted from newest to oldest:
467
468 awk -F '\t' '$8 { print $8; exit }'
469
470 - - -
471
472    Over time your feed files might become quite big. You can archive a feed and
473    keep only the items of (roughly) the last week by doing for example:
474
475 awk -F '\t' -v "old=$(($(date +'%s') - 604800))" 'int($1) > old' < feed > feed.new
476 mv feed feed.bak
477 mv feed.new feed
478
479 This could also be run weekly in a crontab to archive the feeds. Like throwing
480 away old newspapers. It keeps the feeds list tidy and the formatted output
481 small.
482
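A sketch of such a weekly cronjob: the commands above wrapped in a script that
loops over all feed files (the script path is only an example):

	#!/bin/sh
	# archive_feeds.sh: keep only the items of (roughly) the last week per feed.
	cd "$HOME/.sfeed/feeds" || exit 1
	for feed in *; do
		# skip backups from previous runs.
		case "$feed" in *.bak|*.new) continue;; esac
		awk -F '\t' -v "old=$(($(date +'%s') - 604800))" 'int($1) > old' \
			< "$feed" > "$feed.new" &&
		mv "$feed" "$feed.bak" &&
		mv "$feed.new" "$feed"
	done

An example crontab(5) entry to run it every Sunday at 03:00:

	0 3 * * 0 /path/to/archive_feeds.sh
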
483 - - -
484
485 Convert mbox to separate maildirs per feed and filter duplicate messages using the
486 fdm program.
487 fdm is available at: https://github.com/nicm/fdm
488
489 fdm config file (~/.sfeed/fdm.conf):
490
491 set unmatched-mail keep
492
493 account "sfeed" mbox "%[home]/.sfeed/mbox"
494 $cachepath = "%[home]/.sfeed/fdm.cache"
495 cache "${cachepath}"
496 $maildir = "%[home]/feeds/"
497
498 # Check if message is in the cache by Message-ID.
499 match case "^Message-ID: (.*)" in headers
500 action {
501 tag "msgid" value "%1"
502 }
503 continue
504
505 # If it is in the cache, stop.
506 match matched and in-cache "${cachepath}" key "%[msgid]"
507 action {
508 keep
509 }
510
511 # Not in the cache, process it and add to cache.
512 match case "^X-Feedname: (.*)" in headers
513 action {
514 # Store to local maildir.
515 maildir "${maildir}%1"
516
517 add-to-cache "${cachepath}" key "%[msgid]"
518 keep
519 }
520
521 Now run:
522
523 $ sfeed_mbox ~/.sfeed/feeds/* > ~/.sfeed/mbox
524 $ fdm -f ~/.sfeed/fdm.conf fetch
525
526 Now you can view feeds in mutt(1) for example.
527
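For example, to open one of the generated maildirs (the feed name is only an
example):

	mutt -f ~/feeds/somefeedname
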
528 - - -
529
530    Read from the mbox, filter duplicate messages using the fdm program and deliver
531    them to an SMTP server. This works similarly to the rss2email program.
532 fdm is available at: https://github.com/nicm/fdm
533
534 fdm config file (~/.sfeed/fdm.conf):
535
536 set unmatched-mail keep
537
538 account "sfeed" mbox "%[home]/.sfeed/mbox"
539 $cachepath = "%[home]/.sfeed/fdm.cache"
540 cache "${cachepath}"
541
542 # Check if message is in the cache by Message-ID.
543 match case "^Message-ID: (.*)" in headers
544 action {
545 tag "msgid" value "%1"
546 }
547 continue
548
549 # If it is in the cache, stop.
550 match matched and in-cache "${cachepath}" key "%[msgid]"
551 action {
552 keep
553 }
554
555 # Not in the cache, process it and add to cache.
556 match case "^X-Feedname: (.*)" in headers
557 action {
558 # Connect to a SMTP server and attempt to deliver the
559 # mail to it.
560 # Of course change the server and e-mail below.
561 smtp server "codemadness.org" to "hiltjo@codemadness.org"
562
563 add-to-cache "${cachepath}" key "%[msgid]"
564 keep
565 }
566
567 Now run:
568
569 $ sfeed_mbox ~/.sfeed/feeds/* > ~/.sfeed/mbox
570 $ fdm -f ~/.sfeed/fdm.conf fetch
571
572 Now you can view feeds in mutt(1) for example.
573
574 - - -
575
576 Convert mbox to separate maildirs per feed and filter duplicate messages using
577 procmail(1).
578
579 procmail_maildirs.sh file:
580
581 maildir="$HOME/feeds"
582 feedsdir="$HOME/.sfeed/feeds"
583 procmailconfig="$HOME/.sfeed/procmailrc"
584
585 # message-id cache to prevent duplicates.
586 mkdir -p "${maildir}/.cache"
587
588 if ! test -r "${procmailconfig}"; then
589 printf "Procmail configuration file \"%s\" does not exist or is not readable.\n" "${procmailconfig}" >&2
590 echo "See procmailrc.example for an example." >&2
591 exit 1
592 fi
593
594 find "${feedsdir}" -type f -exec printf '%s\n' {} \; | while read -r d; do
595 name=$(basename "${d}")
596 mkdir -p "${maildir}/${name}/cur"
597 mkdir -p "${maildir}/${name}/new"
598 mkdir -p "${maildir}/${name}/tmp"
599 printf 'Mailbox %s\n' "${name}"
600 sfeed_mbox "${d}" | formail -s procmail "${procmailconfig}"
601 done
602
603 Procmailrc(5) file:
604
605 # Example for use with sfeed_mbox(1).
606 # The header X-Feedname is used to split into separate maildirs. It is
607 # assumed this name is sane.
608
609 MAILDIR="$HOME/feeds/"
610
611 :0
612 * ^X-Feedname: \/.*
613 {
614 FEED="$MATCH"
615
616 :0 Wh: "msgid_$FEED.lock"
617 | formail -D 1024000 ".cache/msgid_$FEED.cache"
618
619 :0
620 "$FEED"/
621 }
622
623 Now run:
624
625 $ procmail_maildirs.sh
626
627 Now you can view feeds in mutt(1) for example.
628
629 - - -
630
631    The fetch function can be overridden in your sfeedrc file. This allows replacing
632    the default curl(1) used by sfeed_update with any other client to fetch the
633    RSS/Atom data, or changing the default curl options:
634
635 # fetch a feed via HTTP/HTTPS etc.
636 # fetch(name, url, feedfile)
637 fetch() {
638 hurl -m 1048576 -t 15 "$2" 2>/dev/null
639 }
640
641 - - -
642
643 Caching, incremental data updates and bandwidth saving
644
645    For servers that support it, incremental updates and bandwidth saving can be
646    done by using the "ETag" HTTP header.
647
648 Create a directory for storing the ETags and modification timestamps per feed:
649
650 mkdir -p ~/.sfeed/etags ~/.sfeed/lastmod
651
652 The curl ETag options (--etag-save and --etag-compare) can be used to store and
653 send the previous ETag header value. curl version 7.73+ is recommended for it
654 to work properly.
655
656    The curl -z option can be used to send the modification date of a local file as
657    an HTTP "If-Modified-Since" request header. The server can then respond whether
658    the data is modified or not, or respond with only the incremental data.
659
660 The curl --compressed option can be used to indicate the client supports
661 decompression. Because RSS/Atom feeds are textual XML content this generally
662 compresses very well.
663
664 These options can be set by overriding the fetch() function in the sfeedrc
665 file:
666
667 # fetch(name, url, feedfile)
668 fetch() {
669 basename="$(basename "$3")"
670 etag="$HOME/.sfeed/etags/${basename}"
671 lastmod="$HOME/.sfeed/lastmod/${basename}"
672 output="${sfeedtmpdir}/feeds/${basename}.xml"
673
674 curl \
675 -f -s -m 15 \
676 -L --max-redirs 0 \
677 -H "User-Agent: sfeed" \
678 --compressed \
679 --etag-save "${etag}" --etag-compare "${etag}" \
680 -R -o "${output}" \
681 -z "${lastmod}" \
682 "$2" 2>/dev/null || return 1
683
684 # successful, but no file written: assume it is OK and Not Modified.
685 [ -e "${output}" ] || return 0
686
687 # use server timestamp from curl -R to set Last-Modified.
688 touch -r "${output}" "${lastmod}" 2>/dev/null
689 cat "${output}" 2>/dev/null
690 # use the exit status of writing the output; other errors are ignored here.
691 fetchstatus="$?"
692 rm -f "${output}" 2>/dev/null
693 return "${fetchstatus}"
694 }
695
696    These options can come at the cost of some privacy, because they expose
697    additional metadata from the previous request.
698
699 - - -
700
701 CDNs blocking requests due to a missing HTTP User-Agent request header
702
703 sfeed_update will not send the "User-Agent" header by default for privacy
704 reasons. Some CDNs like Cloudflare or websites like Reddit.com don't like this
705 and will block such HTTP requests.
706
707 A custom User-Agent can be set by using the curl -H option, like so:
708
709 curl -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
710
711 The above example string pretends to be a Windows 10 (x86-64) machine running
712 Firefox 78.
713
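A sketch of a fetch() override in the sfeedrc file that sends such a User-Agent
header; the other curl options mirror the examples elsewhere in this README:

	# fetch(name, url, feedfile)
	fetch() {
		curl -L --max-redirs 0 -f -s -m 15 \
			-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0' \
			"$2" 2>/dev/null
	}
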
714 - - -
715
716 Page redirects
717
718    For security and efficiency reasons redirects are not allowed by default and
719    are treated as an error.
720    
721    For example, this prevents hijacking an unencrypted http:// to https:// redirect
722    and avoids the added time of an unnecessary page redirect on each request. It is
723    encouraged to use the final redirected URL in the sfeedrc config file.
724    
725    If you want to ignore this advice you can override the fetch() function in the
726    sfeedrc file and change the curl options "-L --max-redirs 0" (sketch below).
727
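A sketch of such a fetch() override that allows a few redirects, if you accept
the trade-off described above:

	# fetch(name, url, feedfile)
	fetch() {
		curl -L --max-redirs 3 -H "User-Agent:" -f -s -m 15 "$2" 2>/dev/null
	}
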
728 - - -
729
730 Shellscript to handle URLs and enclosures in parallel using xargs -P.
731
732    This can be used to download and process URLs: for example to download podcasts
733    and webcomics, download and convert webpages, mirror videos, etc. It uses a
734 plain-text cache file for remembering processed URLs. The match patterns are
735 defined in the shellscript fetch() function and in the awk script and can be
736 modified to handle items differently depending on their context.
737
738 The arguments for the script are files in the sfeed(5) format. If no file
739 arguments are specified then the data is read from stdin.
740
741 #!/bin/sh
742 # sfeed_download: downloader for URLs and enclosures in sfeed(5) files.
743 # Dependencies: awk, curl, flock, xargs (-P), yt-dlp.
744
745 cachefile="${SFEED_CACHEFILE:-$HOME/.sfeed/downloaded_urls}"
746 jobs="${SFEED_JOBS:-4}"
747 lockfile="${HOME}/.sfeed/sfeed_download.lock"
748
749 # log(feedname, s, status)
750 log() {
751 if [ "$1" != "-" ]; then
752 s="[$1] $2"
753 else
754 s="$2"
755 fi
756 printf '[%s]: %s: %s\n' "$(date +'%H:%M:%S')" "${s}" "$3"
757 }
758
759 # fetch(url, feedname)
760 fetch() {
761 case "$1" in
762 *youtube.com*)
763 yt-dlp "$1";;
764 *.flac|*.ogg|*.m3u|*.m3u8|*.m4a|*.mkv|*.mp3|*.mp4|*.wav|*.webm)
765 # allow 2 redirects, hide User-Agent, connect timeout is 15 seconds.
766 curl -O -L --max-redirs 2 -H "User-Agent:" -f -s --connect-timeout 15 "$1";;
767 esac
768 }
769
770 # downloader(url, title, feedname)
771 downloader() {
772 url="$1"
773 title="$2"
774 feedname="${3##*/}"
775
776 msg="${title}: ${url}"
777
778 # download directory.
779 if [ "${feedname}" != "-" ]; then
780 mkdir -p "${feedname}"
781 if ! cd "${feedname}"; then
782 log "${feedname}" "${msg}: ${feedname}" "DIR FAIL" >&2
783 return 1
784 fi
785 fi
786
787 log "${feedname}" "${msg}" "START"
788 if fetch "${url}" "${feedname}"; then
789 log "${feedname}" "${msg}" "OK"
790
791 # append it safely in parallel to the cachefile on a
792 # successful download.
793 (flock 9 || exit 1
794 printf '%s\n' "${url}" >> "${cachefile}"
795 ) 9>"${lockfile}"
796 else
797 log "${feedname}" "${msg}" "FAIL" >&2
798 return 1
799 fi
800 return 0
801 }
802
803 if [ "${SFEED_DOWNLOAD_CHILD}" = "1" ]; then
804 # Downloader helper for parallel downloading.
805 # Receives arguments: $1 = URL, $2 = title, $3 = feed filename or "-".
806 # It should write the URI to the cachefile if it is successful.
807 downloader "$1" "$2" "$3"
808 exit $?
809 fi
810
811 # ...else parent mode:
812
813 tmp="$(mktemp)" || exit 1
814 trap "rm -f ${tmp}" EXIT
815
816 [ -f "${cachefile}" ] || touch "${cachefile}"
817 cat "${cachefile}" > "${tmp}"
818 echo >> "${tmp}" # force it to have one line for awk.
819
820 LC_ALL=C awk -F '\t' '
821 # fast prefilter what to download or not.
822 function filter(url, field, feedname) {
823 u = tolower(url);
824 return (match(u, "youtube\\.com") ||
825 match(u, "\\.(flac|ogg|m3u|m3u8|m4a|mkv|mp3|mp4|wav|webm)$"));
826 }
827 function download(url, field, title, filename) {
828 if (!length(url) || urls[url] || !filter(url, field, filename))
829 return;
830 # NUL-separated for xargs -0.
831 printf("%s%c%s%c%s%c", url, 0, title, 0, filename, 0);
832 urls[url] = 1; # print once
833 }
834 {
835 FILENR += (FNR == 1);
836 }
837 # lookup table from cachefile which contains downloaded URLs.
838 FILENR == 1 {
839 urls[$0] = 1;
840 }
841 # feed file(s).
842 FILENR != 1 {
843 download($3, 3, $2, FILENAME); # link
844 download($8, 8, $2, FILENAME); # enclosure
845 }
846 ' "${tmp}" "${@:--}" | \
847 SFEED_DOWNLOAD_CHILD="1" xargs -r -0 -L 3 -P "${jobs}" "$(readlink -f "$0")"
848
849 - - -
850
851 Shellscript to export existing newsboat cached items from sqlite3 to the sfeed
852 TSV format.
853
854 #!/bin/sh
855 # Export newsbeuter/newsboat cached items from sqlite3 to the sfeed TSV format.
856 # The data is split per file per feed with the name of the newsboat title/url.
857 # It writes the URLs of the read items line by line to a "urls" file.
858 #
859 # Dependencies: sqlite3, awk.
860 #
861 # Usage: create some directory to store the feeds then run this script.
862
863 # newsboat cache.db file.
864 cachefile="$HOME/.newsboat/cache.db"
865 test -n "$1" && cachefile="$1"
866
867 # dump data.
868 # .mode ascii: Columns/rows delimited by 0x1F and 0x1E
869 # get the first fields in the order of the sfeed(5) format.
870 sqlite3 "$cachefile" <<!EOF |
871 .headers off
872 .mode ascii
873 .output
874 SELECT
875 i.pubDate, i.title, i.url, i.content, i.content_mime_type,
876 i.guid, i.author, i.enclosure_url,
877 f.rssurl AS rssurl, f.title AS feedtitle, i.unread
878 -- i.id, i.enclosure_type, i.enqueued, i.flags, i.deleted, i.base
879 FROM rss_feed f
880 INNER JOIN rss_item i ON i.feedurl = f.rssurl
881 ORDER BY
882 i.feedurl ASC, i.pubDate DESC;
883 .quit
884 !EOF
885 # convert to sfeed(5) TSV format.
886 LC_ALL=C awk '
887 BEGIN {
888 FS = "\x1f";
889 RS = "\x1e";
890 }
891 # normal non-content fields.
892 function field(s) {
893 gsub("^[[:space:]]*", "", s);
894 gsub("[[:space:]]*$", "", s);
895 gsub("[[:space:]]", " ", s);
896 gsub("[[:cntrl:]]", "", s);
897 return s;
898 }
899 # content field.
900 function content(s) {
901 gsub("^[[:space:]]*", "", s);
902 gsub("[[:space:]]*$", "", s);
903 # escape chars in content field.
904 gsub("\\\\", "\\\\", s);
905 gsub("\n", "\\n", s);
906 gsub("\t", "\\t", s);
907 return s;
908 }
909 function feedname(feedurl, feedtitle) {
910 if (feedtitle == "") {
911 gsub("/", "_", feedurl);
912 return feedurl;
913 }
914 gsub("/", "_", feedtitle);
915 return feedtitle;
916 }
917 {
918 fname = feedname($9, $10);
919 if (!feed[fname]++) {
920 print "Writing file: \"" fname "\" (title: " $10 ", url: " $9 ")" > "/dev/stderr";
921 }
922
923 contenttype = field($5);
924 if (contenttype == "")
925 contenttype = "html";
926 else if (index(contenttype, "/html") || index(contenttype, "/xhtml"))
927 contenttype = "html";
928 else
929 contenttype = "plain";
930
931 print $1 "\t" field($2) "\t" field($3) "\t" content($4) "\t" \
932 contenttype "\t" field($6) "\t" field($7) "\t" field($8) "\t" \
933 > fname;
934
935 # write URLs of the read items to a file line by line.
936 if ($11 == "0") {
937 print $3 > "urls";
938 }
939 }'
940
941 - - -
942
943 Progress indicator
944 ------------------
945
946    The below sfeed_update wrapper script counts the number of feeds in a sfeedrc
947 config. It then calls sfeed_update and pipes the output lines to a function
948 that counts the current progress. It writes the total progress to stderr.
949 Alternative: pv -l -s totallines
950
951 #!/bin/sh
952 # Progress indicator script.
953
954 # Pass lines as input to stdin and write progress status to stderr.
955 # progress(totallines)
956 progress() {
957 total="$(($1 + 0))" # must be a number, no divide by zero.
958 test "${total}" -le 0 -o "$1" != "${total}" && return
959 LC_ALL=C awk -v "total=${total}" '
960 {
961 counter++;
962 percent = (counter * 100) / total;
963 printf("\033[K") > "/dev/stderr"; # clear EOL
964 print $0;
965 printf("[%s/%s] %.0f%%\r", counter, total, percent) > "/dev/stderr";
966 fflush(); # flush all buffers per line.
967 }
968 END {
969 printf("\033[K") > "/dev/stderr";
970 }'
971 }
972
973 # Counts the feeds from the sfeedrc config.
974 countfeeds() {
975 count=0
976 . "$1"
977 feed() {
978 count=$((count + 1))
979 }
980 feeds
981 echo "${count}"
982 }
983
984 config="${1:-$HOME/.sfeed/sfeedrc}"
985 total=$(countfeeds "${config}")
986 sfeed_update "${config}" 2>&1 | progress "${total}"
987
988 - - -
989
990 Counting unread and total items
991 -------------------------------
992
993    It can be useful to show the counts of unread and total items, for example in a
994    window manager or statusbar.
995
996 The below example script counts the items of the last day in the same way the
997 formatting tools do:
998
999 #!/bin/sh
1000 # Count the new items of the last day.
1001 LC_ALL=C awk -F '\t' -v "old=$(($(date +'%s') - 86400))" '
1002 {
1003 total++;
1004 }
1005 int($1) >= old {
1006 totalnew++;
1007 }
1008 END {
1009 print "New: " totalnew;
1010 print "Total: " total;
1011 }' ~/.sfeed/feeds/*
1012
1013 The below example script counts the unread items using the sfeed_curses URL
1014 file:
1015
1016 #!/bin/sh
1017 # Count the unread and total items from feeds using the URL file.
1018 LC_ALL=C awk -F '\t' '
1019 # URL file: amount of fields is 1.
1020 NF == 1 {
1021 u[$0] = 1; # lookup table of URLs.
1022 next;
1023 }
1024 # feed file: check by URL or id.
1025 {
1026 total++;
1027 if (length($3)) {
1028 if (u[$3])
1029 read++;
1030 } else if (length($6)) {
1031 if (u[$6])
1032 read++;
1033 }
1034 }
1035 END {
1036 print "Unread: " (total - read);
1037 print "Total: " total;
1038 }' ~/.sfeed/urls ~/.sfeed/feeds/*
1039
1040 - - -
1041
1042 sfeed.c: adding new XML tags or sfeed(5) fields to the parser
1043 -------------------------------------------------------------
1044
1045 sfeed.c contains definitions to parse XML tags and map them to sfeed(5) TSV
1046 fields. Parsed RSS and Atom tag names are first stored as a TagId, which is a
1047 number. This TagId is then mapped to the output field index.
1048
1049 Steps to modify the code:
1050
1051 * Add a new TagId enum for the tag.
1052
1053 * (optional) Add a new FeedField* enum for the new output field or you can map
1054 it to an existing field.
1055
1056 * Add the new XML tag name to the array variable of parsed RSS or Atom
1057 tags: rsstags[] or atomtags[].
1058
1059 These must be defined in alphabetical order, because a binary search is used
1060 which uses the strcasecmp() function.
1061
1062 * Add the parsed TagId to the output field in the array variable fieldmap[].
1063
1064 When another tag is also mapped to the same output field then the tag with
1065 the highest TagId number value overrides the mapped field: the order is from
1066   least important to most important.
1067
1068 * If this defined tag is just using the inner data of the XML tag, then this
1069 definition is enough. If it for example has to parse a certain attribute you
1070 have to add a check for the TagId to the xmlattr() callback function.
1071
1072 * (optional) Print the new field in the printfields() function.
1073
1074 Below is a patch example to add the MRSS "media:content" tag as a new field:
1075
1076 diff --git a/sfeed.c b/sfeed.c
1077 --- a/sfeed.c
1078 +++ b/sfeed.c
1079 @@ -50,7 +50,7 @@ enum TagId {
1080 RSSTagGuidPermalinkTrue,
1081 /* must be defined after GUID, because it can be a link (isPermaLink) */
1082 RSSTagLink,
1083 - RSSTagEnclosure,
1084 + RSSTagMediaContent, RSSTagEnclosure,
1085 RSSTagAuthor, RSSTagDccreator,
1086 RSSTagCategory,
1087 /* Atom */
1088 @@ -81,7 +81,7 @@ typedef struct field {
1089 enum {
1090 FeedFieldTime = 0, FeedFieldTitle, FeedFieldLink, FeedFieldContent,
1091 FeedFieldId, FeedFieldAuthor, FeedFieldEnclosure, FeedFieldCategory,
1092 - FeedFieldLast
1093 + FeedFieldMediaContent, FeedFieldLast
1094 };
1095
1096 typedef struct feedcontext {
1097 @@ -137,6 +137,7 @@ static const FeedTag rsstags[] = {
1098 { STRP("enclosure"), RSSTagEnclosure },
1099 { STRP("guid"), RSSTagGuid },
1100 { STRP("link"), RSSTagLink },
1101 + { STRP("media:content"), RSSTagMediaContent },
1102 { STRP("media:description"), RSSTagMediaDescription },
1103 { STRP("pubdate"), RSSTagPubdate },
1104 { STRP("title"), RSSTagTitle }
1105 @@ -180,6 +181,7 @@ static const int fieldmap[TagLast] = {
1106 [RSSTagGuidPermalinkFalse] = FeedFieldId,
1107 [RSSTagGuidPermalinkTrue] = FeedFieldId, /* special-case: both a link and an id */
1108 [RSSTagLink] = FeedFieldLink,
1109 + [RSSTagMediaContent] = FeedFieldMediaContent,
1110 [RSSTagEnclosure] = FeedFieldEnclosure,
1111 [RSSTagAuthor] = FeedFieldAuthor,
1112 [RSSTagDccreator] = FeedFieldAuthor,
1113 @@ -677,6 +679,8 @@ printfields(void)
1114 string_print_uri(&ctx.fields[FeedFieldEnclosure].str);
1115 putchar(FieldSeparator);
1116 string_print_trimmed_multi(&ctx.fields[FeedFieldCategory].str);
1117 + putchar(FieldSeparator);
1118 + string_print_trimmed(&ctx.fields[FeedFieldMediaContent].str);
1119 putchar('\n');
1120
1121 if (ferror(stdout)) /* check for errors but do not flush */
1122 @@ -718,7 +722,7 @@ xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl,
1123 }
1124
1125 if (ctx.feedtype == FeedTypeRSS) {
1126 - if (ctx.tag.id == RSSTagEnclosure &&
1127 + if ((ctx.tag.id == RSSTagEnclosure || ctx.tag.id == RSSTagMediaContent) &&
1128 isattr(n, nl, STRP("url"))) {
1129 string_append(&tmpstr, v, vl);
1130 } else if (ctx.tag.id == RSSTagGuid &&
1131
1132 - - -
1133
1134 Running custom commands inside the sfeed_curses program
1135 -------------------------------------------------------
1136
1137 Running commands inside the sfeed_curses program can be useful for example to
1138   sync items or mark all items across all feeds as read. It can be convenient to
1139 have a keybind for this inside the program to perform a scripted action and
1140 then reload the feeds by sending the signal SIGHUP.
1141
1142 In the input handling code you can then add a case:
1143
1144 case 'M':
1145 forkexec((char *[]) { "markallread.sh", NULL }, 0);
1146 break;
1147
1148 or
1149
1150 case 'S':
1151 forkexec((char *[]) { "syncnews.sh", NULL }, 1);
1152 break;
1153
1154 The specified script should be in $PATH or be an absolute path.
1155
1156 Example of a `markallread.sh` shellscript to mark all URLs as read:
1157
1158 #!/bin/sh
1159 # mark all items/URLs as read.
1160 tmp="$(mktemp)" || exit 1
1161 (cat ~/.sfeed/urls; cut -f 3 ~/.sfeed/feeds/*) | \
1162 awk '!x[$0]++' > "$tmp" &&
1163 mv "$tmp" ~/.sfeed/urls &&
1164 pkill -SIGHUP sfeed_curses # reload feeds.
1165
1166 Example of a `syncnews.sh` shellscript to update the feeds and reload them:
1167
1168 #!/bin/sh
1169 sfeed_update
1170 pkill -SIGHUP sfeed_curses
1171
1172
1173 Running programs in a new session
1174 ---------------------------------
1175
1176 By default processes are spawned in the same session and process group as
1177 sfeed_curses. When sfeed_curses is closed this can also close the spawned
1178 process in some cases.
1179
1180 When the setsid command-line program is available the following wrapper command
1181   can be used to run the program in a new session, for example as a plumb program:
1182
1183 setsid -f xdg-open "$@"
1184
1185 Alternatively the code can be changed to call setsid() before execvp().
1186
1187
1188 Open an URL directly in the same terminal
1189 -----------------------------------------
1190
1191 To open an URL directly in the same terminal using the text-mode lynx browser:
1192
1193 SFEED_PLUMBER=lynx SFEED_PLUMBER_INTERACTIVE=1 sfeed_curses ~/.sfeed/feeds/*
1194
1195
1196 Yank to tmux buffer
1197 -------------------
1198
1199 This changes the yank command to set the tmux buffer, instead of X11 xclip:
1200
1201 SFEED_YANKER="tmux set-buffer \`cat\`"
1202
1203
1204 Alternative for xargs -P and -0
1205 -------------------------------
1206
1207 Most xargs implementations support the options -P and -0.
1208   GNU and *BSD have supported them for over 20 years!
1209
1210 These functions in sfeed_update can be overridden in sfeedrc, if you don't want
1211 to use xargs:
1212
1213 feed() {
1214 # wait until ${maxjobs} are finished: will stall the queue if an item
1215 # is slow, but it is portable.
1216 [ ${signo} -ne 0 ] && return
1217 [ $((curjobs % maxjobs)) -eq 0 ] && wait
1218 [ ${signo} -ne 0 ] && return
1219 curjobs=$((curjobs + 1))
1220
1221 _feed "$@" &
1222 }
1223
1224 runfeeds() {
1225 # job counter.
1226 curjobs=0
1227 # fetch feeds specified in config file.
1228 feeds
1229 # wait till all feeds are fetched (concurrently).
1230 [ ${signo} -eq 0 ] && wait
1231 }
1232
1233
1234 Known terminal issues
1235 ---------------------
1236
1237   Below is a list of some bugs or missing features in terminals that were found
1238   while testing sfeed_curses. Some of them might already be fixed upstream:
1239
1240 - cygwin + mintty: the xterm mouse-encoding of the mouse position is broken for
1241 scrolling.
1242 - HaikuOS terminal: the xterm mouse-encoding of the mouse button number of the
1243 middle-button, right-button is incorrect / reversed.
1244 - putty: the full reset attribute (ESC c, typically `rs1`) does not reset the
1245 window title.
1246 - Mouse button encoding for extended buttons (like side-buttons) in some
1247   terminals is unsupported or maps to the same button: for example side-buttons 7
1248 and 8 map to the scroll buttons 4 and 5 in urxvt.
1249
1250
1251 License
1252 -------
1253
1254 ISC, see LICENSE file.
1255
1256
1257 Author
1258 ------
1259
1260 Hiltjo Posthuma <hiltjo@codemadness.org>