Fixpoint

2021-11-30

Healing my old #ossasepia log links

Filed under: Data, Historia — Jacob Welsh @ 05:17

I got a bit of a nerve pinched the other day by a seemingly innocent offer of help fixing a broken link. The pushback I gave at the time still looks valid from a certain angle, if not that important; yet the exchange exposed the sore spot that I hadn't even taken a relatively simple step in maintaining my old content, never mind consistency of new production. So, since there wasn't otherwise anything presently on fire - at least not besides the couple things on a low smolder in the corner where I can see them - I set aside those plans for the day that could be set aside and dug into updating my links to the #ossasepia channel logs formerly served by ossabot and its flask of pythons, based on the pointers provided by Diana. Since I took the time to work it out, I'll expand on those pointers to show my full process in hopes that it may be of use.

1. I downloaded, inspected and extracted the linked oldnewloglinks.tar.gz.(i)

2. It appeared that the old and new message ID numbers matched as expected, but I verified:

sed 's/[^# ]*#/#/g' ossapub.txt eulorapub.txt | awk '$1 != $2 { print }'

3. Thus, I chose to simplify the mapping and consequently reduce the number of blog-wide search-and-replaces by removing the line numbers and deduplicating; then to this I added a substitution to correct URL prefixes for the #eulora links:

sed 's/#[0-9]\+/#/g' ossapub.txt | uniq >ossapub-simp.txt
sed 's/#[0-9]\+/#/g' eulorapub.txt | uniq | sed 's:/ossasepia/:/eulora/:' >eulorapub-simp.txt

This brought the count from 206376 down to 1688.

4. I expanded this into a big update script using a template SQL command:

awk -v q="'" '{ print "update posts set post_content=replace(post_content," q $1 q "," q $2 q ");" }' ossapub-simp.txt eulorapub-simp.txt >replacements.sql

5. After a sanity check on the result I uploaded it to the server, then fed it to the database:

mysql fixpoint <replacements.sql

For my modestly sized blog this step took 34 seconds.

6. I checked some articles manually to make sure the results were as expected, but still wanted a way to see exactly what was changed by the hidden workings of the machine, so I did a fresh backup and set to compare it with the one made before. One trick to make "diff" sorta-work with MySQL dumps or other files with really long lines is to preprocess the input, breaking it up on syntactic or word boundaries to force it into diff's line-based view of the world. The following example of such munging also cleans up the linefeed characters that MySQL so uninvitedly backslashifies (and I haven't a clue how carriage returns got in there too):

sed -r 's/,/,\n/g; s/([^\\])\\n/\1\n/g; s/([^\\])\\r/\1/g'

This being hideous but likely to come in handy again, I named it as a script on my system.

All in all, the results had come out to a whole... 35 replacements. The #trilema links will be a tougher but perhaps juicier target to automate.

In case you'd like to work directly from my simplified mappings: ossapub-simp.txt, eulorapub-simp.txt.

  1. I usually do a tar tvf FILENAME prior to extracting an unfamiliar tarball, to check for any weird paths that might clobber something unexpectedly in the working directory. [^]

4 Comments »

  1. Perhaps not an issue for Fixpoint but for completeness, if you set out to replace the links, you might want to at least look as well at the post_comments table not just at post_content (ie for any links that might appear in the comments rather than in an article's content).

    Comment by Diana Coman — 2021-11-30 @ 19:46

  2. Thanks @Diana Coman, good point and it had popped into my head at some point in the process but was forgotten. One dodgy case that comes up when I try it - though nothing to do with comments per se - is the bridged logs where it seems to me preferable to replace only the final hrefs rather than altering URLs in the record of what was said (because what services we were using at the time *is* a part of the history after all).

    Comment by Jacob Welsh — 2021-12-01 @ 04:13

  3. Approximate (conservative) approach to that, now avoiding the "awk" overkill for perhaps less confusing quoting:

    cat ossapub-simp.txt eulorapub-simp.txt | while read old new; do
    	echo "update posts set post_content=replace(post_content, '<a href=\"http://$old', '<a href=\"http://$new');"
    	echo "update comments set comment_content=replace(comment_content, '<a href=\"http://$old', '<a href=\"http://$new');"
    done >replacements.sql
    

    This done, I found the linked example was in fact the only instance among my comments (that is, of raw URLs not touched by the above replacements), and there it was mp-wp making the links clickable rather than any markup I'd put in the comments themselves, so I patched them up manually.

    Comment by Jacob Welsh — 2021-12-01 @ 05:09

  4. [...] and unpublished. An upside is I won't be starting from zero, so getting some articles published and improvements implemented soon shouldn't be far [...]

    Pingback by Little by little, a little less retarded. « Dorion Mode — 2022-09-26 @ 21:22

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by MP-WP. Copyright Jacob Welsh.