Archived Internet pages. How to extract unique content from a web archive

Greetings, dear blog readers. Do you want to know how to get information about what was on any site a year ago or a month ago, but has already been deleted today? Then read the article and apply the knowledge in practice. I’ll show you how to see old site posts that are hidden by the owner.

Surely there are many people who have wondered how to view a site's archive on the Internet. This feature will be useful to anyone who has been running their own resource for many years, who has several sites, or who wants to restore an old site. Oddly enough, such a possibility exists and has been around for a long time.

Archive.org is an archive of all Internet websites that acts as an online library. The project dates back to 1996, and its place of origin is San Francisco. At that time, the service was not only unique but also practically useless for many, because the Internet was not yet widespread and there were very few websites.

As the World Wide Web spread, the archive gained great popularity and became a kind of time machine, since it made it possible to see what sites looked like in the past, even sites that no longer exist.

Now the Archive.org library has a huge amount of storage and offers free access to files for everyone. By 2017, the library already contained almost 90 billion web pages, but despite this, you can find data about any site almost instantly by entering its address in the search bar.

When and why a site ends up in the archive of Internet sites

After creating a site, it can end up in Archive.org either immediately or after some time, and it happens that even a functioning site is not there. The conditions for your Internet resource to be included in the Archive are as follows:

  • absence in the robots.txt file of a command prohibiting its indexing (User-agent: ia_archiver / Disallow: /);

  • the presence on the resource of links to search engines or popular services;
  • visits to the site by other users via search engines.
How can the archive be used?

Archive.org stores:

  • text materials;
  • audio files;
  • video files;
  • photos and pictures;
  • links.

The archive allows you to:

  • Explore the entire history of your site. If the information on it is updated periodically and the site contains dozens of pages, it can sometimes be difficult to find any information. This is where the archive of Internet sites will come to the rescue.
  • Restore the site itself or some of its pages if you did not make backups.
  • Find unique content for your site. True, this very content can only be taken from resources that no longer exist, since what is presented on existing ones, as we know, will not be unique. In addition, you need to know the site address in order to find it and take any information from the archive.
Instructions for working with Archive.org

The operating principle of the Archive.org service is very simple. To find data about a site, you just need to enter its address in the Wayback Machine search bar.

Let's look at a site's archive using my blog as an example: enter its address and press Enter.

Note. If we type the address instead of pasting it at once, other sites with similar names appear under the search bar. This function is useful, for example, if you have forgotten the exact name of the resource you were looking for.

A page with data opens. Under the site name we see information about how many times the site was archived and when. As you can see, the first archiving took place on June 18, 2014, and the last one on October 2, 2016. These dates are in no way related to the changes taking place on the site itself, because when archiving takes place is determined by the Web Archive itself.
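By the way, you do not have to use the web interface for such a check. The Wayback Machine also has a public availability API; here is a minimal sketch of a command-line check (the domain and date are only placeholders, not taken from the article):

# ask the archive for the snapshot closest to the given date
curl -s "https://archive.org/wayback/available?url=site.ru&timestamp=20160101"
# the JSON answer contains a "closest" snapshot with its URL and timestamp, if one exists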

To take a closer look at all the changes or see the original view of the site, simply select the year, and then click on the date and month in the calendar.

Click on the oldest date. The system will take us to the blog itself, where its original interface and content will be visible. I have since changed the design of some elements, and the very first articles are now lost, so it would not be easy to find them quickly any other way.

This way you can see all the changes that have ever occurred in the blog, or find necessary information.

How to Find Unique Content Using Webarchive Machine

The ability to view an archive of old sites allows anyone to use the data that was on them without fear that it will not be unique. The fact is that after the “death” of a site, its content is no longer checked by search engines, which means that it is unique again, and the only remaining problem is finding these sites.

If you want to take content from your old resource or site that you used, but which no longer exists, there will be no problems, because you probably remember the address. Well, if you intend to search among all the “dead” sites, you can use special services that provide lists with vacant domains, that is, with addresses of sites that no longer exist.

I opened one of these services, copied the first domain and entered it into the Wayback Machine on Archive.org, but it didn't give any result. The same story repeated itself with the next four domains. Finally, a search for the sixth address turned up information about the site.
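If you are checking released domains in bulk, it is faster to script this than to paste each address into the search bar by hand. A minimal sketch, assuming the list has been saved to a file called domains.txt (the file name is my assumption, not part of the original workflow):

# domains.txt is an assumed file with one released domain per line
while read -r domain; do
  # a snapshot exists if the availability API returns a "closest" entry
  if curl -s "https://archive.org/wayback/available?url=$domain" | grep -q '"closest"'; then
    echo "$domain: snapshots found in the Web Archive"
  else
    echo "$domain: nothing in the archive"
  fi
done < domains.txt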

The fact that Archive.org does not have every dead site may be due to several reasons. Perhaps the domain was purchased, but the site itself was never filled with any content; this is the case with most of the domain names on release lists. Another reason is that the site's creator removed his resource from the Archive itself; this also happens. And finally, perhaps the site was never in the Web Archive at all.

So, we finally managed to find a “dead” site from which we can read information. As you can see, the site has existed since 1999, and over that time 269 archives have been made on it.

We can open the resource and take information from there. To do this, as is the case with existing sites, we simply select any date. This action will open the main page as it was on the date we selected. If there is nothing useful here, you should check other dates.

Once useful content is found, you should definitely check it for uniqueness first, since, firstly, someone before you could already have used it, and secondly, it may still be tracked by search engines.

How to restore a site using Archive.org

Those who have been running websites for a long time know that they need to make backups periodically. But those who have not thought about this may face the loss of web pages or of the functionality of an entire site. In this case, the Archive will come to the rescue again, but if you have a resource with a large number of pages, restoration will take a very long time. Another problem that may arise is that some information may be lost or the design may be distorted.

If you do decide to restore your site using Archive.org, you will need to perform operations on each page, hence the waste of time.

So, to restore the site, we need to replace the internal page links with the original ones. If we look at the address bar, the link will look like http://web.archive.org/web/20161002194015/http://site/, which means information from such a page cannot simply be copied.

To make this possible, you can simply remove the beginning of the links manually, but when there are hundreds of pages, this becomes a rather painstaking task. Therefore, we will use the Archive's own ability to rewrite links. To do this, insert "id_" in the address bar right after the digits and press Enter. That is, instead of the original link, the line should contain: http://web.archive.org/web/20161002194015id_/http://site/.

Now the links are original and you can simply copy texts, pictures and other files from the page source in the Archive. We carry out the same operation with the other pages of the site. Of course, even this option will take a lot of time, but if there are no backup copies, it is unlikely that the site can be restored in any other way.
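If the archived pages have been saved to disk, the same prefix stripping can be done in bulk with sed. This is only a rough sketch (the folder name and the exact regular expression are my assumptions, so test it on a copy first):

# remove the http://web.archive.org/web/<timestamp>/ prefix (with optional id_, im_ and
# similar modifiers) from every link in the downloaded HTML files
find ./site -type f -name "*.html" -exec sed -i 's|https\?://web\.archive\.org/web/[0-9]*[a-z_]*/||g' {} \;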

How to remove a site from Archive.org

Most website creators want their resource to end up in the Archive, but there are also cases when, on the contrary, you need to make sure that it either does not get there or is deleted from it. The Internet Archive itself offers a very simple method for this. You just need to give the service's robot a command that the site does not need to be included in the Archive, that is, write the following in robots.txt:

User-agent: ia_archiver
Disallow: /

Thus, an archive of all sites is a great help for many Internet users in finding information and restoring old resources. It is precisely for the purpose of preserving information that Archive.org was created, and that is why it keeps archives of currently existing sites and provides the opportunity to use data from “dead” or abandoned resources.

I hope the material was useful and you will not forget to repost the article and subscribe to the blog newsletter. All the best -))).

Sincerely, Galiulin Ruslan.

I stumbled upon a broken link. The link was to a manual for setting up backups for a site. The topic was so interesting that I went to archive.org to see what kind of manual it was. There I discovered the blog of a man who was once involved in website building and various Internet topics. But apparently he gave it all up. The blog existed until December 2013, then there was a stub for another year. I went ahead and checked the site's domain. It turned out to be free. The thing is, I have been interested in such sites for a long time; from time to time I go to Telderi and look for an inexpensive IT-related site to buy. So far I haven't found anything suitable in terms of price/quality.

Why do I need such a site? I'm hatching a plan to do some kind of merger or acquisition: connect such a site with this one to increase its traffic and get other goodies. Someone will say: what about diversification? Of course, diversification is a good thing. But there is nothing yet to diversify; we need to develop something first. And so, I see the idea of merging sites as very promising.

So, that's all the background. I decided to restore the site I found. It turned out to be about 300 pages. I registered the domain and started looking for a tool to download the site.

How to restore a website from a web archive?

The procedure is simple: take it and download it. But the matter is complicated by the fact that there are many pages, and all of them will be in the form of static HTML files. Downloading them manually would be torture. I started asking people who had been involved in this kind of work. People recommended r-tools.org. It turned out to be paid. I started googling, because I knew it was a simple procedure and I didn't want to pay for it, even such a small fee. The solution was found very quickly in the form of a Ruby application. As I expected, everything is very simple, and instructions are included.

Install a utility for restoring sites from archive.org

Without thinking twice, I install everything on the server and start the recovery.

#install ruby:

apt-get install ruby

#Install the tool itself:

gem install wayback_machine_downloader

We start downloading the site from the web archive

wayback_machine_downloader http://www.site.ru --timestamp 20131209110704

Here you can specify the snapshot mark in the timestamp option, because a site may have dozens or hundreds of snapshots in the web archive. I specify the last one from when the site was still alive, which seems logical. The utility immediately determines the number of pages and prints the downloaded pages to the console.

Everything is downloaded and saved, and we get a scattering of static files in a folder. Create a folder in the right place and put the downloaded files there. I like to use rsync:

rsync -avh ./websites/www.site.com/ /var/www/site.com/

By the way, if you do not want to dig into the console yourself, this kind of work can be ordered from a freelancer, for example on Kwork. If you are not familiar with it yet, I recommend it. This is an exchange from Mirafox, which you may already know from other projects for webmasters (Telderi, Miralinks, Gogetlinks). On Kwork, freelancers are not selected based on proposals posted by potential customers; instead, they post offers themselves, and the customer chooses among them. The “trick” of the service is that the base cost of any kwork (as freelancer offers are called) is always 500 rubles.

Well, and for those who want to sort out all these letters with unfamiliar commands and scripts and do it themselves, let's continue.

Creating nginx configuration for the restored site

I am making a universal config, with an eye to the future: PHP processing. You may need it if you want to revive the site and extend its functionality, for example with forms for sending messages or subscriptions.

But in general, the minimum configuration for a static site will look something like this:

server {
    server_name site.ru www.site.ru *.site.ru;
    root /var/www/site.ru;
    index index.html;

    gzip on;
    gzip_disable "msie6";
    gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript application/javascript;

    location = /robots.txt {
        allow all;
        log_not_found off;
        access_log off;
    }

    location ~* \.(js|css|png|jpg|jpeg|gif|ico|woff)$ {
        expires max;
        log_not_found off;
    }
}

This configuration also includes compression and caching in the browser.

Restart the webserver:

service nginx restart
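If nginx refuses to restart, the configuration syntax can be checked first (a standard nginx check, not part of the original article):

nginx -t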

How to check a website without changing DNS?

In principle, you can wait for the DNS to update after registering the domain. But I want to see the result as soon as possible, and work can begin right away. There is a simple way to do this: add the server IP for the desired domain to the hosts file, with a record like this:

10.10.1.1 site.ru

After this, the desired site will open exclusively on your computer.

Like this. I feel like a necromancer :)

The site will be shown exactly as its users saw it. All links will work as long as you have all the necessary files. Some of them may be broken, and somewhere there will be missing images, styles or something else. But this is not the point: the most important thing for any site is its content, and it will most likely remain.
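A quick way to see which pages still contain references back to the web archive (and therefore potentially broken links) is to grep the restored folder; the path here matches the nginx config above:

# list pages that still contain links into the web archive
grep -rl "web.archive.org" /var/www/site.ru/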

Cleaning the code of the restored site

But that is not all, although you could leave it this way. To achieve a better effect, it makes sense to tidy up the restored site a little. This is generally the hardest part of the whole job. The fact is that since the site will be displayed the way its users saw it, there will be a bunch of all kinds of garbage in the page code. This is primarily advertising, banners and counters, as well as elements that are useless on a static site: a link to log into the site's admin area, forms for sending comments and subscriptions, various buttons and other things inherited from the dynamic CMS the site ran on before. In my case it was WordPress.

How to remove fragments of HTML code on many static pages?

How can all this be removed? Very simply: look in the code and remove what is unnecessary. That is easy to say, but we have several hundred pages, which is why a bit of magic is needed here.

find ./site.ru/ -type f -name "*.html" -exec sed -i 's|<a href="...">Entrance</a>||g' {} \;

The simplest construction of all removes ALL HTML tags from a file; you will then be left with plain text files:

sed -e "s/<[^>]*>//g" test.html

That approach is fine if you just want to download the content and then use only the useful parts for something else: for writing new articles, for doorways, or whatever else.

But this doesn't suit me: I want to recreate the site completely first and see how it comes back to life and whether it will live at all. Therefore, cleaning up the code takes me a couple of hours of painstaking work. I open the site pages, examine the page source with the browser's developer tools, and find the JavaScript, banners, counters and forms that I don't need.

This is how I remove the Liveinternet counter from all pages of my static site:

find site.ru/ -type f -name "*.html" -exec sed -i "/<!--LiveInternet counter-->/,/<!--\/LiveInternet-->/d" {} \;

find site.ru/ -type f -name "*.html" -exec sed -i "s|...||g" {} \;

Despite these constructions, which may seem scary to the uninitiated, these are quite simple things, since this counter has unique comment tags, by which we determine the part of the code to be deleted, specifying them as patterns.

In some cases, you have to rack your brains to cut out what is unnecessary without touching what you need, because some elements may be repeated across pages. For example, to delete the Google Analytics counter I had to write something like this:

First, I delete the line from which the counter begins. This command removes the line above the var gaJsHost pattern, since I only need to remove it in this place and not touch it anywhere else:

find site.ru/ -type f -name "*.html" -exec sed -i -n '/var gaJsHost/{x;d;};1h;1!{x;p;};${x;p;}' {} \;

Now we cut out the rest of the part, which becomes easy to identify by the unique patterns in the first and last lines:

find site.ru/ -type f -name "*.html" -exec sed -i "/var gaJsHost/,/catch(err)/d" {} \;

Similarly, I remove the form for adding comments:

I clear 4 lines with non-unique closing tags after the line with a unique pattern:

find theredhaired.ru/ -type f -iname "*.html" -exec sed -i "/block_links/{N;N;N;N;s/\n.*//;}" {} \;

And now I cut out a fairly large block of about 30 lines, specifying the unique patterns of its first and last lines:

find theredhaired.ru/ -type f -iname "*.html" -exec sed -i "/ Subscription/,/block_links/d" {} \;

You could, of course, try to solve these last couple of cases using multiline patterns, but I never mastered them, no matter how much I googled. I found a lot of multiline examples, but they are all simple, with no special characters or escaped characters (tabs, line breaks).

Perhaps all this cleaning would be easier to do in PHP or even Perl, which are designed for text processing. But, unfortunately, I don't know them, so I use bash and sed.

I did all this on a separate copy of the site, with a bunch of iterations and tests, so that it was always possible to roll back changes; I saved copies after each significant change, again using rsync.

How to bulk edit titles and other elements on a static website?

Since my goal is not just to resurrect the site, but to get it indexed, ranked in search, and even bringing in search traffic, I need to think about at least some SEO. The original titles definitely don't suit me, so I want to change them. From WordPress the site inherited the %sitename% » %postname% scheme, and the sitename here is unclear: it is simply the site's domain. The easiest option would be to cut out the first part of the title, but that doesn't work for me either. So I'll change this part of the title to a carefully chosen query. This is how I do it:
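The author's exact commands are not reproduced here, so below is only a minimal sketch of this kind of title rewrite with sed; the title pattern and the replacement phrase are my assumptions and need to be adapted to the real markup:

# assumption: WordPress left titles like <title>site.ru » Post name</title>;
# replace the sitename part with a query-like phrase, keeping the post name
find site.ru/ -type f -name "*.html" -exec sed -i 's|<title>site\.ru »|<title>Restoring a site from the web archive »|g' {} \;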

In practice there are a lot of checks and iterations, but in the end the titles become what they need to be. You can guess that I have started trying to collect traffic to this site from queries about restoring sites from a web archive. Why do I need this? I'm going to offer a paid service for restoring such sites. As you can see, in this case it's quite easy to make the replacement. It would have been possible not to bother with several variants and reduce everything to one, but I wanted to remove or change the unnecessary fragments, and since there were several variants, I changed them to several of my own. Such is SEO.

Now I'm going to add Yandex Metrica to all the HTML files of my site, and at the same time move it from the old www scheme to the version without www.
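The article does not show the exact command for inserting the counter, so here is just a sketch of one possible approach: keep the Metrica snippet in a separate file and reference it before the closing </head> tag on every page (the file name /metrika.js is my assumption):

# insert a reference to the counter script into every page, right before </head>
find site.ru/ -type f -name "*.html" -exec sed -i 's|</head>|<script src="/metrika.js"></script></head>|' {} \;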

How to convert a static website from www to non-www?

This is done by simply replacing:

find ./ -type f -iname '*.html' -exec sed -i 's/http:\/\/www.site.ru/http:\/\/site.ru/g' {} \;

Then, just in case, we will add a redirect for the www variant to the nginx configuration:

server {
server_name www.site.ru;
return 301 $scheme://site.ru$request_uri;
}

How to create a sitemap.xml for a static site?

This will be needed when we add the site to search engines. It is very important given that our site has been restored: it may lack some navigation, and there may be no links at all to some pages. The sitemap smoothes out this point: even if a page cannot be reached by browsing the site itself, listing it in sitemap.xml will allow it to be indexed, which can potentially bring traffic from search directly to that page.
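The article does not include the commands for this step, so here is a rough sketch of how such a sitemap can be assembled from the static files with bash; the site address and paths are assumptions and should be replaced with your own:

#!/bin/bash
# build a simple sitemap.xml from all static html files of the restored site
SITE="http://site.ru"
ROOT="/var/www/site.ru"
OUT="$ROOT/sitemap.xml"

echo '<?xml version="1.0" encoding="UTF-8"?>' > "$OUT"
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' >> "$OUT"
# turn each file path into a URL and wrap it in a <url> entry
find "$ROOT" -type f -name "*.html" | sed "s|^$ROOT|$SITE|" | while read -r url; do
  echo "  <url><loc>$url</loc></url>" >> "$OUT"
done
echo '</urlset>' >> "$OUT"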

In addition, after some time I will analyze the results that I achieve with this site: traffic, leads or something else. So stay tuned to the site; in 2-6 months you will see the continuation of the story. I'll show you the statistics, if there are any, and so on. If you are reading this article six months later and there is still no link to the continuation, please remind me of it in the comments :)

Got it figured out, right?

If you are inspired, have figured it out and are going to do it yourself - low bow and respect to you. I like people who want to understand and comprehend everything.

Hello, dear readers of the blog site. Not long ago I wrote about a project that certainly deserves all sorts of flattering epithets, despite its small shortcomings and the criticism of its articles from the scientific community.

The very fact that a non-profit project has been working for the benefit of the entire Internet community for decades deserves great respect. But there is also a similar large-scale project on the Internet, which, without receiving any income from it, performs a very important role - it preserves archives of websites, videos, audio and printed materials.

What's noteworthy is that the last column of this list (which can be opened in Excel) displays the number of archives created for each site in the Web Archive (you can also check whether a domain is present in the web archive in a number of online services).

A list of foreign domain names that are being released or have already been released can be downloaded from this link. We then look through the contents of the sites that were saved by the Web Archive and try to find something worthwhile. Then we check the uniqueness of these materials (I provided the link just above) and, if successful, publish them on our resource or sell them somewhere.

Yes, the method is tedious and I have not tested it personally. But I think that with some degree of automation and brainpower it can produce good results. Probably someone has already put this on stream. What do you think?

Good luck to you! See you soon on the pages of the blog site


Every site is a story that has a beginning and an end. But how can you trace the stages of a project's formation, its life cycle? For these purposes there is a special service called a web archive. In this article we will talk about what such resources are, how to use them and what they can do.

What is a web archive and why is it needed?

A web archive is a specialized site that is designed to collect information about various Internet resources. The robot saves copies of projects in automatic or manual mode; it all depends on the site and the data collection system.

Currently, there are several dozen sites with similar mechanics and tasks. Some of them are considered private, others are non-profit projects open to the public. The resources also differ from each other in the frequency of visits, the completeness of the information stored and the possibilities of using the received history.

As some experts note, pages storing information flows are considered an important component of Web 2.0, that is, part of the ideology of the development of the Internet, which is in constant evolution. The collection mechanics are fairly crude, but there are no more advanced methods or analogues. Using a web archive, you can solve several problems: tracking information over time, restoring a lost site, and searching for information.

How to use web archive?

As noted above, a web archive is a site that provides a kind of search service through site history. To use the project, you must:

  • Go to a specialized resource (for example, web.archive.org).
  • Enter the information to search for in the special field. It could be a domain name or a keyword.
  • Get the relevant results. This will be one or more sites, each of which has a fixed crawl date.
  • By clicking on a date, go to the corresponding resource and use the information for personal purposes.

    We'll talk about specialized sites for searching the historical records of projects later, so stay with us.

    Projects that provide site history

    Today there are several projects that provide services for finding saved copies. Here are some of them:

  • The most popular and in-demand among users is web.archive.org. This site is considered the oldest on the Internet; its creation dates back to 1996. The service collects data automatically and manually, and all the information is hosted on huge foreign servers.
  • The second most popular site is peeep.us. The resource is very interesting, because it can be used to save a copy of an information flow that is available only to you. Note that the project works with all domain names and expands the boundaries of the use of web archives. As for the completeness of the information, the presented site does not save pictures and frames. Since 2015, it has also been included in the list of resources prohibited in Russia.
  • A similar project to the one described above is archive.is. The differences include the completeness of information collection, as well as the ability to save pages from social networks. So if you have lost a post or interesting information, you can search for it through the web archive.

    Possibilities of using web archives

    Now everyone knows what a web archive is and what sites provide services for saving copies of projects. But many still do not understand how to use the information presented. The capabilities of archival data are expressed as follows:

  • Choosing a domain name. It is no secret that many webmasters use domains that already have a history. It is worth understanding that experienced users track not only the target parameters, but also the history of previous use. Every network user wants to know what they are purchasing: whether there were previously prohibitions or sanctions, whether the project fell under filters.
  • Restoring a site from archives. Sometimes a disaster happens that threatens the existence of your own project. A lack of timely backups in the hosting profile and an accidental error can lead to tragedy. If this happens, don't be upset, because you can use the web archive. We'll talk about the recovery process below.
  • Searching for unique content. Every day, sites filled with content die on the Internet. This happens with particular consistency, which is why a huge flow of information is lost. Over time, such pages fall out of the index, and a resourceful webmaster can borrow the information for a personal project. Of course, there is a search problem, but that is a secondary concern.

    We have looked at the main features that web archives provide; now it is time to move on to a more detailed study of individual elements.

    Restoring a website from a web archive

    No one is immune from problems with websites. Most of them are solved using backups. But what if there is no saved copy on the hosting server? Use the web archive. To do this you should:

  • Go to the specialized resource we talked about earlier.
  • Enter your own domain name into the search bar and open the project in a new window.
  • Choose the most suitable snapshot, one that is close to the problem date and looks complete.
  • Correct the internal links so that they point directly to your site. To do this, use a link of the form "http://web.archive.org/web/any_sequence_number_id_/Site name".
  • Copy the lost information or design data that needs to be applied for restoration.

    Note that the process is somewhat tedious, given the speed of the archive. Therefore, we recommend that owners of large web resources make backups more often, which will save time and nerves.

    We are looking for unique content for our own website

Some webmasters use an interesting way of getting new, unclaimed content. Every day hundreds of sites go into oblivion, and information is lost along with them. To become the owner of such content, you need to do the following:

  • Enter the URL https://www.nic.ru/auction/forbuyer/download_list.shtml#buying in the address bar.
  • On the domain name auction website, download the files with the ru domain lists.
  • Open the received files using Excel and begin selecting domains based on the information you are interested in.
  • Enter the domains you find into the search on the web archive page.
  • Open a snapshot and get access to the information flow.
  • We recommend checking the content for plagiarism; this will allow you to find truly worthy texts.

    And that's all! Now everyone knows about the possibilities and methods of using a web archive. Use this knowledge wisely and profitably.


    The web archive is a free platform that collects all the sites ever created whose owners have not prohibited their preservation.



    This is a real library in which anyone can open a web resource that interests them and look at its contents on the date on which the web archive visited the site and saved a copy.

    Introduction to archive org or how Valery found old texts from the web archive
    In 2010, Valery created a website where he wrote articles about Internet marketing. One of them, about advertising on Google (AdWords), he wrote in the form of a short summary. A few years later he needed this information, but he had mistakenly deleted the page with the texts some time earlier. It happens to everyone.

    However, Valery knew how to get out of the situation. He confidently opened the web archive service and entered the address he needed into the search bar. A few moments later he was already reading the material he needed, and a little later he restored the texts on his website.

    History of the creation of the Internet Archive

    In 1996, Brewster Kahle, an American programmer, created the Internet Archive, where he began collecting copies of websites with all the information contained in them. These were fully preserved pages in their real form, as if you had opened the required site in a browser.

    Anyone can use the web archive data completely free of charge. When creating it, Brewster Kahle had one main goal: to preserve the cultural and historical values of the Internet space and create an extensive electronic library.

    In 2001, the Internet Archive's main service, the Wayback Machine, was created; it can still be found today at https://archive.org. This is where copies of all saved pages are freely available for viewing.

    In order not to be limited to a collection of sites, in 1999 they began archiving texts, images, sound recordings, videos and software.

    In March 2010, at the annual Free Software Awards, the Internet Archive was awarded the title of winner in the Project of Social Benefit category.

    The library is growing every year, and by August 2016 the Webarchive volume amounted to 502 billion copies of web pages. All of them are stored on very large servers in San Francisco, New Alexandria and Amsterdam.

    Everything about archive.org: how to use the service and how to get a site from a web archive

    Brewster Kahle created the Internet Archive Wayback Machine service, without which it is impossible to imagine the work of modern Internet marketing. Viewing the history of any portal, seeing what certain pages looked like earlier, restoring your old web resource or finding necessary and interesting content: all this can be done using Webarchive.

    How to view site history on archive.org

    Thanks to its crawler robots, the web archive library stores most Internet sites with all their pages. It also saves all of their changes. Thus, you can view the history of any web resource, even if it has not existed for a long time.

    To do this, you need to go to https://web.archive.org/ and enter the address of the web resource in the search bar.

    After some time, the web archive will display a calendar with the dates of changes to this page and information about its creation and the number of changes for the entire period.

    From the information received, we can see that the home page of our site was first found by the service on May 24, 2014. From that time until today, a copy of it has been saved 38 times. The dates of changes on the page are marked in blue on the calendar. In order to view the history of changes and see what a certain section of the web resource looked like on the day you are interested in, select the required year, and then a date in the calendar from those offered by the service.

    In a moment, the web archive will open the requested version on its platform, where you can see what our site looked like in its original form.

    Next, using the calendar with arrows at the very top of the screen, you can flip through the pages in the chronological order of their changes in order to track how the site's appearance and content changed.

    Thus, you can dive into the past and see all the changes that have happened to the site throughout its existence.

    Why you may not be able to find out on Webarchive what a site looked like before

    It happens that a web site cannot be found using the Internet Archive Wayback Machine service. This happens for several reasons:

    • the copyright holder decided to delete all copies;
    • the web resource was closed in accordance with the law on the protection of intellectual property;
    • a ban was introduced in the root directory of the Internet platform through the robots.txt file.

    In order for a site to be in the web archive at any time, it is recommended to take precautions and save it in the Webarchive library yourself. To do this, in the Save Page Now section, enter the address of the web resource that you want to archive and click the Save Page button.

    Thus, for the safety of all the information, this procedure should be repeated after each change. This will give a 100% guarantee that your pages will be preserved for a long time.
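    If there are many pages, this manual saving can also be scripted: the archive accepts requests of the form https://web.archive.org/save/page_address. A minimal sketch, assuming the page addresses are listed in a file urls.txt (the file name is my assumption):

    # ask the Wayback Machine to save each page listed in urls.txt
    while read -r url; do
      curl -s "https://web.archive.org/save/$url" > /dev/null
      echo "requested archiving of $url"
    done < urls.txt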

    How to restore an inactive website from a web archive

    There are situations when the browser reports that such and such a web service no longer exists, but the data still needs to be retrieved. Webarchive will help.

    And for this there are two options. The first is suitable for old sites that are small in size and well indexed: simply extract the data from the required version, then review the page code and manually fix the links. The process is somewhat labor-intensive in terms of time and steps, so there is another, more convenient way.

    The second option is ideal for those who want to save time and solve the download issue as quickly and easily as possible. To do this, you need to open a site recovery service for Webarchive, such as RoboTools. Enter the domain name of the portal you are interested in and indicate the date of its saved version. After some time, the task will be completed in full, with all the pages filled in.

    How to find content from a web archive

    Webarchive is a wonderful source for filling web resources with full texts. There are many sites that, for a number of reasons, have ceased to exist but contained useful and necessary information that is no longer included in search engine indexes and is therefore essentially unique.

    So, there are freed-up domains that store a lot of interesting material. All you need to do is find suitable content and check its uniqueness. This is very profitable, both financially, because you will not need to pay for the authors' work, and in terms of time, because all the content has already been written.

    How to prevent a site from being included in the web archive library

    There are situations when the owner of an Internet site values the information posted on his portal and does not want it to become available to a wide audience. In such situations, there is one simple way out: write a prohibiting directive for Webarchive in the robots.txt file. After this change in the settings, the web machine will no longer create copies of such a web resource.