This is the talk page for discussing improvements to the Link rot page.
There are several articles on Wikipedia which use YouTube videos as a reference. Since YouTube videos get blocked and/or deleted frequently, I tried to archive some YouTube videos using the Wayback Machine and archive.today, but the Wayback Machine doesn't let me view the video on the archived webpage and archive.today fails to archive a YouTube video every time. Is there any other way to archive YouTube videos to prevent link rot on Wikipedia articles? -- Preceding unsigned comment added by Tech2009Girl (talk • contribs) 10:56, 29 January 2022 (UTC)
Hello, there is a note on alexa.com that it will be retiring on 1 May 2022. See https://support.alexa.com/hc/en-us/articles/4410503838999. We currently have just over 800 links to the site. Keith D (talk) 12:45, 12 March 2022 (UTC)
I found a bunch of references to one dead site, but after poking around, I've found out that all the content is still there, just under a slightly tweaked website name. It's even retained the exact same URL structure as before, it's literally just the precise details of the website's name that's changed -- update that, and the links spring back to life. Is there any way to collect the articles which still cite the old URL, so I can correct them en masse using AWB? Searching the regular way shows me there are about 2.6k articles still using it (although some may have valid archives, which I'd leave untouched), but no easy way to convert that into one grand list for inputting into AWB. Buttons to Push Buttons (talk | contribs) 17:15, 18 May 2022 (UTC)
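For the rewriting step itself (once a page list exists), a minimal Python sketch of the host swap might look like the following. It assumes, as described above, that only the host name changed and the path structure is identical. `old.example.org` and `new.example.org` are placeholder hosts, and `retarget_links` is a hypothetical helper, not an AWB feature. It leaves copies of the old URL that are embedded inside an archive URL alone, so existing snapshots keep pointing at what was actually captured; it does not decide whether a citation that already has an archive should be skipped entirely.

```python
import re


def retarget_links(wikitext: str, old_host: str, new_host: str) -> str:
    """Swap old_host for new_host in live URLs, leaving archive URLs alone."""
    pattern = re.compile(
        r"(?<!/)"               # not preceded by "/", i.e. not nested in an archive URL
        r"(https?://)"          # keep whichever scheme the citation used
        + re.escape(old_host)
        + r"(?=[/\s|\]}]|$)"    # the host must end here (path, space, template syntax)
    )
    return pattern.sub(lambda m: m.group(1) + new_host, wikitext)


# Placeholder citation: the live URL is rewritten, the archived copy is not.
sample = (
    "{{cite web |url=http://old.example.org/a/b |title=Example "
    "|archive-url=https://web.archive.org/web/20150101000000/http://old.example.org/a/b}}"
)
print(retarget_links(sample, "old.example.org", "new.example.org"))
```

The same pattern could be pasted into a regex find-and-replace tool, but that is an untested assumption about how such tools treat lookbehinds.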
I am not sure what would be the right place to ask about this - I have tried here. As far as I can tell, CiteSeerX changed the scheme they're using. For example, Template:CiteSeerX offers 10.1.1.34.2426 as an example. The link is dead, but looking in the Wayback Machine, we can see that it was the paper William D. Harvey, Matthew L. Ginsberg: Limited Discrepancy Search. In the new scheme, pid/efa56b710ff3c6d8b2666971d07c311eeb6c5b40 or pid/d8b76a9af36448b775997ef0a960e4b0fa585beb seem like the most likely candidates. Is there any chance to fix the old links in some other way than checking them one-by-one and replacing them with the new links manually? Is somebody aware of an announcement from CiteSeerX containing details about the old identifiers and the new ones? --Kompik (talk) 10:27, 6 January 2023 (UTC)
At 2019 Military World Games, clicking the archive link in the following citation loads a blank archive.org page. Bad snapshot maybe?
<ref>{{Cite web |url=https://results.wuhan2019mwg.cn/index.htm#/organisation |title=Archived copy |access-date=2019-10-19 |archive-url=https://web.archive.org/web/20191028234735/https://results.wuhan2019mwg.cn/index.htm#/organisation |archive-date=2019-10-28 |url-status=dead }}</ref>
What should be done? Is there a way to repair this? Should it be deleted? Marked with something? Thanks. –Novem Linguae (talk) 21:54, 4 April 2023 (UTC)
After asking at the wrong place, I ask here: is there any need for this type of edit? Shouldn't we just archive dead or unfit sources? Unlike the former, this edit actually makes sense because it did rescue sources. What I usually do is to rescue sources manually, this prevents outdated archives, i.e. archived pages that present (very) old information compared to live sources (e.g. a page showing information from 2023 and another showing information from 2015.) SLBedit (talk) 16:13, 7 June 2023 (UTC)
Worst-case scenario in adding archives automatically: A source is archived several times throughout the years; in 2000, 2010, and 2020. A user adds that source to an article in 2023 and a bot adds a link to the 2000 archive. Source becomes dead in 2024 and the article points the reader to an outdated source. Conclusion: After an automatic archive, someone needs to make sure the article links to the proper archive, as there may be different versions of the original source. SLBedit (talk) 21:08, 8 June 2023 (UTC)
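The scenario above can be made concrete with a short Python sketch. It assumes one possible selection rule, "take the latest snapshot on or before the date the editor accessed the source"; the exact snapshot days are invented for illustration (only the years come from the comment), and `pick_snapshot` is a hypothetical helper, not a description of how any bot actually chooses.

```python
from datetime import date
from typing import List, Optional


def pick_snapshot(snapshots: List[date], access_date: date) -> Optional[date]:
    """Return the latest snapshot on or before access_date, or None if there is none."""
    eligible = [snap for snap in snapshots if snap <= access_date]
    return max(eligible, default=None)


# Snapshots from 2000, 2010 and 2020; the source was added to the article in 2023.
snapshots = [date(2000, 1, 1), date(2010, 1, 1), date(2020, 1, 1)]
print(pick_snapshot(snapshots, date(2023, 6, 8)))  # 2020-01-01, not the 2000 copy
```

Under this rule the 2000 snapshot would only be chosen if the access date fell before 2010, which is the case where it is arguably the right one.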
access-date= parameter and add the most recent archived version that does not postdate the access-date. Folly Mox (talk) 21:38, 8 June 2023 (UTC)
I frequently find myself manually finding archived content from a citation in order to verify a claim, but I don't always have time to update the citation accordingly to aid future readers. Is there a tool that automates adding the necessary three(?) parameters to a citation if I already have the archive link in hand? I would prefer this over using a bot, for the reasons given above. Orange Suede Sofa (talk) 00:20, 8 August 2023 (UTC)
To wrap up the question for myself, and in case this helps anyone else, I created an AppleScript to take a Wayback URL from the clipboard, parse the URL for the date, and then automatically add the |archive-url= and |archive-date= parameters to an existing citation, pre-filled and with no additional typing needed. More info here. Orange Suede Sofa (talk) 02:40, 15 August 2023 (UTC)
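For readers without AppleScript, the parsing step described above might be sketched in Python roughly as follows. This is not the script mentioned in the comment, only an illustration of the same idea. It assumes the common Wayback Machine URL shape with a 14-digit timestamp (`/web/YYYYMMDDhhmmss/`), emits the date in ISO format, and uses a hypothetical helper name, `archive_params`.

```python
import re
from datetime import datetime

# Assumed shape: https://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def archive_params(wayback_url: str) -> str:
    """Build |archive-url= and |archive-date= text from a Wayback Machine URL."""
    match = WAYBACK_RE.match(wayback_url)
    if match is None:
        raise ValueError("not a Wayback snapshot URL with a 14-digit timestamp")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return f"|archive-url={wayback_url} |archive-date={captured.date().isoformat()}"


# The snapshot URL from the 2019 Military World Games citation quoted earlier on this page.
url = "https://web.archive.org/web/20191028234735/https://results.wuhan2019mwg.cn/index.htm#/organisation"
print(archive_params(url))
```

The output still has to be pasted into the citation by hand, and a |url-status= value would need to be chosen separately.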
This discussion emerges from those on Billjones94's talk page, my talk page, and other previous discussions (1 again @ Billjones94, 2 again @ Billjones94, 3 @ Wikipedia:Bots, 4 @ Wikipedia talk:Link rot, 5 @ Village pump). Tags: Billjones94, Rhododendrites, Scyrme, Novem Linguae, ActivelyDisinterested, Izno, Kuzma, GreenC, Folly Mox, Dhtwiki, DMacks, and Cyberpower678. Please tag others.
I propose an addition to this page as follows:
After the words "in general, do not" in the second paragraph of the lede, insert "(with automated tools or otherwise) add archive links for live websites or".
The paragraph would then read: "In general, do not (with automated tools or otherwise) add archive links for live websites or delete cited information solely because the URL to the source does not work any longer."
My understanding, for the record, is that links cited on the English Wikipedia are automatically archived. Hitting the check mark in the IA Bot to add those archived links for live sites does not archive anything. It does not actually archive those pages nor does it update those archives. It just adds the links themselves to the article text. Moreover, archive links are automatically substituted for links that become dead.
What these archive links for live websites do is profoundly clutter the editor. This makes it very difficult for humans to parse. An example of this is the old version of Julius Caesar. This was a single citation (for a half of a sentence about Caesar's wife) therein:
Suetonius, ''Julius'' [https://penelope.uchicago.edu/Thayer/E/Roman/Texts/Suetonius/12Caesars/Julius*.html#1 1] {{Webarchive|url=https://archive.today/20120530163202/http://penelope.uchicago.edu/Thayer/E/Roman/Texts/Suetonius/12Caesars/Julius*.html#1 |date=30 May 2012 }}; Plutarch, ''Caesar'' [https://penelope.uchicago.edu/Thayer/E/Roman/Texts/Plutarch/Lives/Caesar*.html#1 1] {{Webarchive|url=http://webarchive.loc.gov/all/20180213130122/http://penelope.uchicago.edu/Thayer/e/roman/texts/plutarch/lives/caesar%2A.html#1 |date=13 February 2018 }}; Velleius Paterculus, ''Roman History'' [https://penelope.uchicago.edu/Thayer/E/Roman/Texts/Velleius_Paterculus/2B*.html#41 2.41] {{Webarchive|url=https://web.archive.org/web/20220731043323/https://penelope.uchicago.edu/Thayer/E/Roman/Texts/Velleius_Paterculus/2B%2A.html#41 |date=31 July 2022 }}
When I removed these archive links en masse, the page shortened by over 28,000 characters (probably upward of 35,000 after including all of my edits). Again, these additions are not necessary to preserve the text of the cited source. This is a live website. And if it became dead the archive URL would be automatically inserted. The costs are, however, substantial for active editors. Just finding real article text, as opposed to background mark up, in articles packed with these archive links becomes difficult.
Moreover, removing these archive links is significantly more difficult than adding them. It is almost trivial for someone to add unnecessary archive URLs. Not to pick on Billjones94 (the selection is merely because this series of discussions emerges from an edit on Roman Republic), the following edits were all done within a single hour:
Billjones94 contribs log, excerpt |
---|
03:01, 16 April 2022 +5,778 Mohammedan SC (Dhaka) Rescuing 28 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:58, 16 April 2022 +2,468 Churchill Brothers FC Goa Rescuing 14 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:54, 16 April 2022 +15,029 Pune FC Rescuing 78 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:49, 16 April 2022 +8,917 Salgaocar FC Rescuing 48 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:46, 16 April 2022 +425 Sreenidi Deccan FC Rescuing 2 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:45, 16 April 2022 +1,984 Moinuddin Khan (footballer) Rescuing 9 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:44, 16 April 2022 +1,246 Punjab FC Rescuing 6 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:41, 16 April 2022 +1,937 Sudeva Delhi FC Rescuing 10 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:39, 16 April 2022 +8,071 Sporting Clube de Goa Rescuing 42 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:36, 16 April 2022 +1,841 FC Kerala Rescuing 9 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:32, 16 April 2022 +3,159 Mohammed Rahmatullah Rescuing 16 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:30, 16 April 2022 +7,003 Kerala United FC Rescuing 36 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:27, 16 April 2022 +5,427 Peerless SC Rescuing 26 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:25, 16 April 2022 +3,337 NEROCA FC Rescuing 16 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:21, 16 April 2022 +2,285 Aizawl FC Rescuing 12 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:19, 16 April 2022 +1,720 TRAU FC Rescuing 9 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:16, 16 April 2022 +1,783 FC Kochin Rescuing 9 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:14, 16 April 2022 +11,674 Dempo SC Rescuing 63 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:10, 16 April 2022 +4,811 Mahindra United FC Rescuing 27 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:07, 16 April 2022 +3,616 South United FC Rescuing 19 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:05, 16 April 2022 +3,532 Hindustan Aeronautics Limited SC Rescuing 18 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
02:02, 16 April 2022 +4,435 ONGC FC Rescuing 25 sources and tagging 0 as dead.) #IABot (v2.0.8.7 Tag: IABotManagementConsole [1.2] |
Not a single source was tagged as dead. The average edit added 4,567 bytes of text, and in total this single hour of triggering IA Bot added 100,478 bytes to Wikipedia's servers. I also firmly believe that these edits fall within the scope of WP:MEATBOT and WP:FAITACCOMPLI. Undoing them one by one after intervening edits is extremely difficult; Billjones94 has been repeatedly tagged and told that these mass additions are controversial, with absolutely no response beyond "Thanks" on talk page edits. Nor do I believe for a second that anyone can review 11,674 bytes of additions – Dempo SC; around 3,000 bytes reviewed per minute – in the four minutes that elapsed since the previous edit.
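The totals quoted here can be rechecked directly from the log excerpt. The short Python sketch below simply copies the byte deltas of the 22 listed edits (newest first) and recomputes the count, the sum, the mean, and the per-minute rate for the Dempo SC edit; nothing beyond the figures in the log is assumed.

```python
from statistics import mean

# Byte deltas copied from the 22 log entries above, newest first.
deltas = [5778, 2468, 15029, 8917, 425, 1984, 1246, 1937, 8071, 1841, 3159,
          7003, 5427, 3337, 2285, 1720, 1783, 11674, 4811, 3616, 3532, 4435]

total = sum(deltas)
print(f"edits: {len(deltas)}")
print(f"total bytes added: {total:,}")
print(f"mean bytes per edit: {mean(deltas):,.0f}")

# Dempo SC: 11,674 bytes, saved four minutes after the preceding edit (02:10 to 02:14).
print(f"Dempo SC bytes per minute: {11674 / 4:,.0f}")
```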
Moreover, the archive links are also generated for paywalled sources hosted on Jstor (other services like Cambridge Core or Oxford Academic suffer similarly). For example, at Roman Republic:
{{Cite journal |last=Steel |first=Catherine |date=2014 |title=The Roman senate and the post-Sullan "res publica" |url=https://www.jstor.org/stable/24432812 |journal=Historia: Zeitschrift für Alte Geschichte |volume=63 |issue=3 |pages=323–339 |doi=10.25162/historia-2014-0018 |jstor=24432812 |s2cid=151289863 |issn=0018-2311 |access-date=26 May 2022 |archive-date=26 May 2022 |archive-url=https://web.archive.org/web/20220526152815/https://www.jstor.org/stable/24432812 |url-status=live }}
In those cases, the archive links do not preserve anything at all. Going to the archive URL loads a single front page of the article. On my computer the image thereof does not even load, leaving a blank page with the citation at the right. Given the stability of Jstor, there are functionally no benefits to these paywalled archive links. The cost to the editability of these articles remains, however. Inasmuch as nothing is added for readers, editing ought to take priority.
Concluding, I want to emphasise three things. First, link rot is a semi-solved problem in which these WP:MEATBOT-esque additions do not help. Second, the enormous volume and rapidity of these WP:FAITACCOMPLI additions make them both harmful to actual content contribution and difficult to remove. Third, many times these archive links add nothing between the website still being live and paywalled sources' archives still being paywalled. We should edit the guidelines to reflect these facts and require adding archive links for live URLs to be justified instead of accepted by default. Ifly6 (talk) 16:11, 21 September 2023 (UTC)
Notes
|jstor= exists. -- LCU ActivelyDisinterested ∆transmissions∆ °co-ords° 16:41, 21 September 2023 (UTC)
|jstor= parameter, citations to jstor in particular should not get urls either, since that encourages the addition of crufty archives and access dates.
I don't really buy the argument that addition of archive urls to live links requires human review, except for urls whose contents change frequently, in which case the archived version may not match the cited version. An online edition of some translation of an ancient or classical text, as used in the example, is not likely to change, so any date of the archive is probably correct, although currently unnecessary. I notice that none of the example references from the Julius Caesar article mentions the date of composition of the original work, the date of publication of the translation, nor the identity of the translator or the publisher of the translation, all of which are more serious issues than an unnecessary archive.
Prose can be located within citation syntax with the use of syntax highlighting, and there's at least one gadget that does this. Moving to list-defined references or shortened footnotes also fixes that issue, and while I'm not a fan of adding archives to live links for many domains, I don't think guidance should be changed because the laziest referencing style of just fully defining the reference at its initial point clutters the source. That's more of an argument to encourage better citation styles.
I also think the exact proposed wording is a bit ambiguous, in one sentence talking about live links and then dead links, without a clear separation. It could be read as discussing cases of adding archives to live websites... where the URL no longer works, which is contradictory. I must stress this is not an argument against the idea, just the proposed text.
I have, in at least one case, been forced to add archives to live links, due to the way IABot defines "live links". I was cleaning up citations on an article about some museum in Germany. The article cited something like 140 different pages on the museum's site, but they had restructured their domain such that all the links now resolved to a custom 404 page, which automated tools understand as a "live link". After updating the first sixteen or so manually over an hour or two, I despaired and asked IABot to add archives to live links, since it couldn't understand that the links were actually useless.
Some domains do drop articles with some frequency. Local news sites, sohu.com, obituary sites, etc. When I'm citing a source like this, I'll add the archive to the live site prophylactically, since I deem it unlikely to work within a year or so. Folly Mox (talk) 17:09, 21 September 2023 (UTC)
Do not add archive links for live websites (using automated tools or otherwise) without a justification specific to the circumstances? Ifly6 (talk) 18:00, 21 September 2023 (UTC)
"Moreover, archive links are automatically substituted for links that become dead." Did you not mean "...are [not] automatically substituted...", which is my experience and is in line with the point you're making? I'm certainly in favor of changing this article's wording to discourage mass archive-link additions, especially since this guideline is often used as justification for such activity. I've also heard people justify mass additions as being required by featured article reviewers, although I've only seen recommendations to that effect. Perhaps we should clarify things with the FAC page. Dhtwiki (talk) 04:35, 22 September 2023 (UTC)
"archive links are automatically substituted for links that become dead" – is that it reflects this on the opposite (non-talk) side:
"there is a Wikipedia bot ... that automates fixing link rot. It runs continuously, checking all articles on Wikipedia if a link is dead, adding archives to Wayback Machine (if not yet there), and replacing dead links in the wikitext with an archived version." Ifly6 (talk) 06:10, 22 September 2023 (UTC)
|access-date= and pick archives prior to or around that date. Ifly6 (talk) 22:07, 25 September 2023 (UTC)
The above discussion (Mass additions of archive links for live sites) discusses why indiscriminate archiving of live links is a problem and a broader mechanism – don't add archive links for live websites unless you have an actual and specific reason – for resolving them.
Per Folly Mox's minimalist framing of the question at Talk:Citation bot, I propose the following (with appropriate wording to be determined):
permalive prevent the check box in IA Bot from adding archive URLs?
As to the utility of archive links of paywalled landing pages, they are not useful and they do not provide full text. They are currently being added if you hit the check mark in the IA Bot management console. The resulting links are largely blank. They do not archive anything or trigger anything to archive anything while introducing extremely large amounts of markup with no value. Ifly6 (talk) 05:21, 2 October 2023 (UTC)
permalive should not affect the availability of its contents at all, which are mostly paywalls to as far as Internet Archive is concerned anyway. The "open access case" can be untangled later, if it is deemed important enough to bother about.
What I'm hoping to get from this discussion is consensus that we shouldn't have archives of paywall landing pages. The permalive for User:IABot is an implementation detail, and how we can go about cleaning up this cruft is also a follow up. Any bot task requires consensus first (unless speedy approved by a BAG member). Folly Mox (talk) 06:52, 2 October 2023 (UTC)
(crosspost from Wikipedia:Teahouse#external links: URLs that were broken due to editing errors)
Hello maintainers. I made a list of ~10,000 broken URLs, User:ⵓ/Worklist brocken URLs, with Quarry:query/78127 (feel free to fork it).
The SQL query filters for non-existent top-level domains, so all URLs in this list are broken. The domain in the list is in reversed order (el_to_domain_index).
Most of the cases are easy to fix (e.g. remove a whitespace character). In some cases I needed a URL decoder (meyerweb.com/eric/tools/dencoder/). More difficult cases can only be solved with the help of the version history or with the help of web archives and Google search (e.g. https://www.duhoctrungquoc.vn/wiki/index.php?lang=en&q=2018%E2%80%9319_Ukrainian_First_League&diff=prev&oldid=1185545810).
I fixed this kind of error in the German Wikipedia, so Quarry:query/77794 is clean. But I am not able to do this in the English Wikipedia. ⵓ (talk) 17:57, 17 November 2023 (UTC)
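Two of the manual steps mentioned above (reading the reversed domain and decoding a percent-encoded URL) can be sketched in Python. This is only an illustration: it assumes that el_to_domain_index values look like `https://org.example.www.` (scheme, then host labels in reverse order with a trailing dot), and both helper names are hypothetical. Ports or other suffixes in the index value are not handled.

```python
from urllib.parse import unquote


def unreverse_domain_index(value: str) -> str:
    """Turn a reversed-host index value back into a normal scheme://host string."""
    scheme, sep, reversed_host = value.partition("://")
    labels = [label for label in reversed_host.split(".") if label]
    return scheme + sep + ".".join(reversed(labels))


def tidy_url(raw: str) -> str:
    """Drop stray whitespace characters, then percent-decode the URL for inspection."""
    return unquote("".join(raw.split()))


print(unreverse_domain_index("https://org.example.www."))
print(tidy_url("https://www.example .org/wiki/2018%E2%80%9319_Ukrainian_First_League"))
```

The decoded form is meant for reading and comparing by eye; whether it can be written back into an article unchanged depends on the characters involved.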
This article uses material from the Wikipedia English article Wikipedia talk:Link rot, which is released under the Creative Commons Attribution-ShareAlike 3.0 license ("CC BY-SA 3.0"); additional terms may apply (view authors). Content is available under CC BY-SA 4.0 unless otherwise noted. Images, videos and audio are available under their respective licenses.