What we define as “digital history” is a field of study which bloomed during the last decade, affirming itself as a fundamental component of the digital humanities environment.
Following Brügger’s interpretation of Cohen and Rosenzweig, we can identify two major tasks on which contributions on this topic mainly focus: using computer science technologies as tools to improve historical research (in order to find, access, search, manipulate and preserve sources) and using the web as a platform to share the results of this work (by creating websites, timelines, interactive visualizations, etc.).
However, as Brügger remarked in the same paper, “digital history largely equates a ‘digital source’ with a source which was previously analog but has now been digitized. In other words, digital sources have so far been limited to traditional analog sources in digital form” and “very little attention has been paid to the new digital media as historical sources”.
Meanwhile, as Noiret and Milligan underlined, a huge number of “born-digital” sources have been produced in the last twenty years: websites, blogs, videos on YouTube, tweets, pictures on Instagram, etc. These documents first attracted the attention of archivists interested in their preservation, and then of other scholars, who started to employ them as primary sources in their work.
Taking all these points into consideration, the focus of this work, which was inspired by the research mentioned above, is a methodological analysis of the issues that arise when working with “born-digital” primary sources for historical purposes. In particular, the intent is to discuss how the processes of finding, selecting and analyzing sources change when dealing with “born-digital” ones.
The specific cases selected are the websites of four Italian universities (Polytechnic University of Turin, University of Rome 2, University of Trento and University of Bologna): the aim is to analyze difficulties that are encountered when trying to rediscover their digital past (mainly by using documents from the Internet Archive) and to suggest possible solutions.
In addition, the paper will remark on the importance of integrating computational methods, such as Natural Language Processing and Text Mining techniques, into this kind of historical research, in order to allow a fruitful exploitation of large amounts of digital data.
The final part of this paper will not only discuss the need for interdisciplinary skills for the historian, but also implicitly raise a deeper question, namely whether it is still meaningful to distinguish between “humanities” and “sciences” now that computer science tools and quantitative methods are the unavoidable starting point for the study of what is already called “big digital history”.
1. Introduction: rediscovering the past of Italian university websites
Italian universities have a strong tradition of innovation in communication technologies. On the 30th of April 1986, the University of Pisa activated its connection with ARPANET, thereby making Italy the third country in Europe to be “online”, after Norway and the United Kingdom; in the same years Italy also became the first European country on BITNET. In 1987 the National Research Council (CNR) registered the first “.it” domain and in 1991 the Center for Advanced Studies, Research and Development in Sardinia (CRS4) created the first Italian website (www.crs4.it), which was the second one in Europe. Another fruitful digital environment bloomed in Bologna in the early nineties, thanks to the collaborations between the University, the Municipality, various small IT companies and the CINECA (a non-profit consortium, currently composed of almost 70 Italian universities). Here, in 1995, “Iperbole” was created, one of the first Civic Networks in the world.
With this early interest of academia in extending its presence on the Internet, a robust and active web community was also born. In fact, even though in 1994 there were just 15,000 Internet users in Italy, already in those years a considerable number of people were curious about this new technology, as can be seen from the articles published by some of the most important newspapers, which offered advice about the Internet and clarified common doubts.
Today, even though less than 60% of Italians use the Internet, the country nevertheless has more than 30 million active users, who are both the target and the producers of a vast amount of daily information.
As is well known, the web keeps changing without leaving traces of its past; a “National Web Archive” could therefore be a fundamental tool for studying Italian web history. However, despite an important collaboration between the National Libraries of Florence and Rome, which started in 2006, a web platform is still not available. Web historians are therefore forced to rely solely on international web archives.
1.1 How the study was set up
As Gomes, Miranda and Costa, as well as Niu, have shown, of the forty most important web archival projects only three have an international scope; therefore only these three could have consistently preserved the “Italian web sphere”. They are the Internet Archive (created in 1996), the Internet Memory (founded in 2004 under the name of European Archive) and the California Digital Library Web Archiving Service (WAS, 2005). In this study the first one has been used the most, as it offers an enormous number of snapshots compared to the Internet Memory, and because Italian institutions have not been preserved by WAS.
Before moving on, it is important to remark on some features of the Internet Archive that are relevant for academic purposes. First of all, it is only possible to retrieve information through a URL search tool: this means that we can only look up web pages by their address, as keyword search is not supported.
Secondly, the results of a query will be displayed in chronological order on a calendar; for this reason all the dates mentioned in this research will be “preservation-dates”, and they could be different from the dates of real layout changes.
Finally, as it is very complicated to preserve a website in its entirety, the sources consulted have to be considered as “reborn digital materials”; this means that “what is archived is almost never a copy on a 1:1 scale of what was once online; it is rather a collection of unique versions that did not exist before the act of archiving”.
Taking into consideration all these features, in the first part of this research 20 Italian universities were analyzed: for each one of them the first snapshot available on the Internet Archive and every major layout change in its homepage were identified.
Table 1. The results of 17 Italian university websites analyzed during March 2014 are presented here. University of Bologna, University of Foreigners – Perugia and University of Rome 2 – Tor Vergata have been excluded for reasons that will be explained later. The issue related to the first snapshot of the Polytechnic of Turin will be described in the next section.
Future work could aim at looking for similarities in the nature of these changes (are they caused by technological or structural reasons?), at comparing the Italian situation to the international one (a subset of American and European university websites has already been analyzed) and at investigating the influence of Italian university reform acts on these changes. However, in the following paragraphs the attention will be focused on four specific issues that emerged during the study presented above, and which make the consultation of the snapshots particularly challenging.
2. Different issues that arise when using the Internet Archive
In the next pages, four different issues with snapshots of universities preserved by the Internet Archive, which make their consultation particularly challenging, are presented. They concern the websites of the Polytechnic University of Turin, the University of Rome 2 – Tor Vergata, the University of Trento and the University of Bologna. In the conclusion of this part both the importance and the limitations of international web archives in this kind of historical research will be emphasized.
2.1 Conflicting Dates with Polytechnic University of Turin ’97 snapshots
If we type the URL of the Polytechnic University of Turin (http://www.polito.it/) in the search tool we will find the first snapshot available of its homepage and the date on which it was taken. However, the Internet Archive is not always a reliable source, as we can see by analyzing this document: apparently it was archived on the 22nd of January 1997, but at the bottom of the page the “last modification date” indicates the 8th of July 1997 (Fig. 1). Therefore this snapshot cannot have been harvested in January, but on the 8th of July at the earliest.
Unfortunately, as websites rarely display a last modification date, we cannot be sure whether other pages are affected by the same inconsistency.
Fig. 1. The first snapshot available of the Polytechnic University of Turin, which reveals the “conflicting dates issue” here described.
The same issue appears in other cases (Duke University: 19/02/97 – 18/06/97; University of Edinburgh: 04/01/97 – 07/05/97). Nielsen, in her work, underlined a similar problem with snapshots from 1997, and Brügger, while presenting several issues that emerge in dealing with archived websites, remarked on the necessity of “methodological principles, rules and recommendations for a future critical textual philology of the website”.
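The consistency check described in this section can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and for Duke and Edinburgh we assume that the first date listed in the text is the archive date and the second is the last modification date shown on the page.

```python
from datetime import date

# (archive capture date, "last modification" date displayed on the page).
# The pairing for Duke and Edinburgh is an assumption: the text lists two
# dates per site without labelling them.
SNAPSHOTS = {
    "polito.it": (date(1997, 1, 22), date(1997, 7, 8)),
    "Duke University": (date(1997, 2, 19), date(1997, 6, 18)),
    "University of Edinburgh": (date(1997, 1, 4), date(1997, 5, 7)),
}


def has_conflicting_dates(capture, last_modified):
    """A page cannot have been archived before the modification date it shows."""
    return capture < last_modified


def conflict_gap_days(capture, last_modified):
    """Number of days by which the capture date precedes the page's own date."""
    return max((last_modified - capture).days, 0)


for site, (capture, modified) in SNAPSHOTS.items():
    if has_conflicting_dates(capture, modified):
        gap = conflict_gap_days(capture, modified)
        print(f"{site}: archive date precedes page date by {gap} days")
```

Such a check only works for pages that display a modification date, which is exactly the limitation noted above.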
2.2 What happened to http://www.uniroma2.it/ between November ’96 and December ’98? A digital investigation
Searching the history of the website of the University of Rome 2 – Tor Vergata on the Internet Archive, we notice that there are no snapshots available before December 1998. This is very unusual because, as we can see in Table 1, all the other Italian university websites analyzed had been preserved at least since the end of January 1998. Thus, it seems very unlikely that the Internet Archive crawler had not found Tor Vergata’s website in almost two years.
For the same reason, as every other Italian university had a website in those years, it seems improbable that Tor Vergata created its digital platform only in 1998. However, Tor Vergata, like all the other Italian universities analyzed, did not offer specific information about the “history of the website”, especially regarding who led the project and what changes had been made to the platform. Thus, using only these snapshots as sources, it is not possible to know whether the website was already online in 1997.
A link offered on the Tor Vergata homepage, “Per i visitatori: Università Italiane” (Information for visitors: other Italian universities), turned out to be very helpful.
It leads to another website, created by CILEA (Italian Universities Consortium), which offered links to all the Italian university websites online in those years.
As the Internet Archive preserved an earlier snapshot of the “CILEA” page, taken on the 25th October 1997, the information needed could finally be found. In short, if the University of Tor Vergata is not present in this CILEA list, it means that the platform was created later; if it is present, then the explanation is that the Internet Archive crawler did not find the Tor Vergata website for almost two years.
However, what was discovered is a more obvious reason: the University of Rome 2 – Tor Vergata is in the list, but the URL (http://www.utovrm.it/) is different from the current one (http://www.uniroma2.it/). This change could be due to a decision to standardize the addresses of all Italian universities to a common form: “uni + the initials of the city”.
Even though a complete change of URL appeared only once in this entire research, this issue could be identified as one of the most difficult problems for web historians. Without the fundamental help of a reliable external source, such as the CILEA website, it could have been a really complicated problem to solve.
2.3 Digital bonds between the University of Trento and the Istituto Trentino di Cultura: a diachronic analysis
The University of Trento is one of the centers of excellence of the Italian higher education system. Located in Trento and in Rovereto, it is the main node of a fruitful cultural ecosystem, which involves the Bruno Kessler Foundation (originally founded in 1962 under the name of Istituto Trentino di Cultura), the Microsoft Bioinformatics Research Center COSBI (opened in 2005) and the Center of Integrative Biology CIBIO (2007).
For these reasons, its website has evidently played an important role during the last twenty years in attracting outstanding international scholars to the university departments and in establishing enriching collaborations. Therefore it could be interesting to study how these bonds with other research centers have been presented on the website.
If we search the URL “http://www.unitn.it/” in the Internet Archive, we will receive a substantial number of snapshots after the middle of 2004 and a decent number of them going back to 2000.
The situation is not as well documented between the 17th October 1997, the date of the first available snapshot, and December 1999. During these first years the University of Trento offered users an almost identical English version of its website (even though the section “General Information” is not available in English).
If the intent is to analyze how the University of Trento has emphasized its connections with local research centers, primarily in order to attract international scholars, it is important to notice that the link to the Istituto Trentino di Cultura was offered in the section “Other Links” in this version of the website.
The first significant change in the layout of the homepage is on the 27th January 1998. On this date we can notice that the link to the Istituto Trentino di Cultura has been moved to the section “Pagine utili” (Useful pages). However, if we visit the English version, we will also notice that this specific link is not present in the “Useful pages” section.
Fig. 2. The evolution of the University of Trento’s homepage. Preservation issues are evident after 2006.
On the 16th May 2000 the website changes again. Now the link to the Istituto Trentino di Cultura is offered in a little box on the right of the section “Ricerca” (Research), without a description. Surprisingly, from this date on, there is no link to the English version of the website. Moreover, as the old English version has been preserved until the first months of 2000, it is not possible to be sure whether a new English version was subsequently created.
Between the 2nd July 2001 and the 2nd April 2002 the website changed again. An English version is now present, available from the 9th February 2002. Studying both the Italian and the English versions of the website, we notice that the Istituto Trentino di Cultura is no longer mentioned.
From November 2006 on, the user is redirected to the page http://portale.unitn.it/, which has not been preserved by the Internet Archive because of its robots.txt file.
After a total blackout in preservation of almost four years, the website became available again on the Internet Archive on the 8th of July 2010. However, since that date it has often been preserved poorly, so it is not possible to continue this investigation. Moreover, this issue prevents a comparative analysis of how the openings of Microsoft’s COSBI and of CIBIO were presented on the University of Trento’s website.
Similar preservation issues appeared in this study a few times, for instance with the University of Manchester website after the 26th March 2004 and with the University of Pisa after the 8th July 2011.
An automatic tool able to detect the major layout changes of a webpage and flag them in the Internet Archive calendar could be very useful for web historians, as these modifications can be very hard to track.
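One possible way to approximate such a tool is sketched below, under strong simplifying assumptions: each snapshot is reduced to the sequence of its HTML tags (its structural “skeleton”), and a layout change is flagged when two consecutive skeletons differ beyond a threshold. The snapshot labels, the HTML fragments and the threshold of 0.6 are hypothetical; this is not a feature of the Internet Archive.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Collects the sequence of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def skeleton(html):
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return parser.tags


def layout_changes(snapshots, threshold=0.6):
    """Return the labels of snapshots whose structure differs strongly
    from the snapshot immediately before them (chronological input)."""
    changes = []
    previous = None
    for label, html in snapshots:
        current = skeleton(html)
        if previous is not None:
            similarity = SequenceMatcher(None, previous, current).ratio()
            if similarity < threshold:
                changes.append(label)
        previous = current
    return changes


# Hypothetical, heavily simplified homepage versions.
LIST_LAYOUT = ("<html><body><h1>{}</h1><ul><li><a href='a'>A</a></li>"
               "<li><a href='b'>B</a></li></ul></body></html>")
TABLE_LAYOUT = ("<html><body><table><tr><td><img src='x'></td>"
                "<td><p>News</p></td></tr></table></body></html>")

EXAMPLE = [
    ("snapshot-1", LIST_LAYOUT.format("University")),
    ("snapshot-2", LIST_LAYOUT.format("University: new courses")),
    ("snapshot-3", TABLE_LAYOUT),
]

print(layout_changes(EXAMPLE))
```

Comparing tag sequences ignores purely textual updates, such as news items, which is the intended behaviour here; a real tool would also have to cope with incomplete snapshots and missing stylesheets.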
2.4 How the University of Bologna has offered educational information on the Web
Given the number of both Italian and international students who use the website and its various resources, it was decided to focus this part of the research on analyzing how the University of Bologna has used the potential of the web to share educational information. Thus, the intent is to rediscover the course programs offered online by the University during the last twenty years. To do so, both the materials still present on http://www.unibo.it and snapshots of its website preserved in the Internet Archive could be used.
Before moving on, it is important to remark that, even though we decided to focus our research especially on the use of born-digital documents (which allow the direct use of computational methods, avoiding the intermediate digitization step), traditional sources from university archives could be as useful as digital ones in several steps of this historical reconstruction.
For more than ten years, educational information has been offered on individual department pages. However, because of a major general website update in 2005, these materials are often no longer available on those pages. Therefore, it is presently possible to access educational information only from the academic year 2004/2005 onwards. Moreover, without the help of a web archive, we cannot be certain that the layout offered today is the same as ten years ago and we cannot guarantee the preservation of educational materials linked from those pages.
As previously remarked, Italy does not have a National Web Archive; the Internet Archive is therefore practically the only platform available for this task. However, the University of Bologna’s website is not preserved there (Fig. 3); presumably it does not allow Internet Archive crawlers. Moreover, as the Internet Memory offers only one snapshot of the University of Bologna website (taken in 2006), this seems an unsolvable problem.
Fig. 3. University of Bologna is not preserved in the Internet Archive.
3. How to solve these issues and how to deal with new ones
As regards the University of Bologna, the first attempt will be to resolve this issue together with the Internet Archive, in order to understand whether the problem lies in a robots.txt file that has been blocking the crawlers over the years or whether someone has explicitly asked for the website to be removed from the Archive.
However, in order to carry on with this research there are other viable options: the first one is to use other national web archives, which may from time to time have preserved part of the Italian web sphere. The second one is to contact and interview the people who initially created and managed this website, hoping that they will help to answer research questions through their memories or, maybe, with the aid of old backups. Finally, other more traditional sources, such as newspapers or magazines, could be useful as well.
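To illustrate the first hypothesis, the sketch below uses Python’s standard urllib.robotparser module to test whether a given robots.txt file would block a crawler. The file content and the URLs are invented for the example and are not the actual robots.txt of the University of Bologna; “ia_archiver” is assumed here to be the user-agent name addressed by sites wishing to exclude the Internet Archive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one crawler entirely, and every other
# crawler only from a private area.
EXAMPLE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def crawler_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


homepage = "http://www.example.org/index.html"
print(crawler_allowed(EXAMPLE_ROBOTS, "ia_archiver", homepage))
print(crawler_allowed(EXAMPLE_ROBOTS, "SomeOtherBot", homepage))
```

A check of this kind only describes the present state of the file: it cannot tell us which rules were in force in past years, which is why direct contact with the Internet Archive remains necessary.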
3.1 Italian Websites in other National Web Archives
Since 1996 several national libraries have begun preserving their national web past. PANDORA, started in 1996 by the National Library of Australia, the UK Web Archive (2004), the Netarkivet (2005) in Denmark and the Portuguese Web Archive (2011) are some of the most interesting examples. The process of finding and selecting “national websites” is often complicated, as Brügger remarked. To better understand how this works, here is an example from the Dutch web archive:
“What is a “Dutch website”? It is a Dutch website, if it is:
- Dutch language, and registered in the Netherlands;
- Any language, and registered in the Netherlands;
- Dutch language, registered outside the Netherlands;
- Any language, registered outside the Netherlands, with subject matter related to the Netherlands.”
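The four criteria quoted above can be read as a simple decision rule. The sketch below encodes them one by one; the Website record and its fields are hypothetical, and in practice the judgement on “subject matter” is made by a curator rather than computed.

```python
from dataclasses import dataclass


@dataclass
class Website:
    language: str            # e.g. "nl", "en", "it"
    registered_in: str       # country of registration, e.g. "NL", "IT"
    about_netherlands: bool  # curator's judgement on the subject matter


def is_dutch_website(site):
    """Apply the four quoted criteria in the order in which they are listed."""
    dutch_language = site.language == "nl"
    registered_in_nl = site.registered_in == "NL"
    return (
        (dutch_language and registered_in_nl)
        or registered_in_nl
        or dutch_language
        or (not registered_in_nl and site.about_netherlands)
    )


print(is_dutch_website(Website("en", "NL", False)))
print(is_dutch_website(Website("it", "IT", False)))
```

Written this way, it becomes clear that the first criterion is already covered by the second and third: a site qualifies if it is registered in the Netherlands, or written in Dutch, or about the Netherlands.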
However, during this archival process part of the non-national web will also end up being unintentionally preserved, as a consequence of the way a crawler works. To illustrate, let us imagine a crawler which is set to go at most 10 links away from a specific starting URL (chosen by the curator): it will probably also crawl non-national content, simply because it systematically follows all the hyperlinks. For this reason, if the University of Bologna organized a Summer School and the University of Amsterdam linked to it from its website, the University of Bologna website (or at least part of it) would be accidentally preserved in the Netherlands Web Archive.
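This behaviour can be reproduced with a minimal breadth-first crawl over an in-memory link graph. The sketch is an illustration under stated assumptions: the page addresses are invented, no network access takes place, and real archival crawlers apply many further rules (scope filters, politeness delays, robots.txt).

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINK_GRAPH = {
    "uva.example/home": ["uva.example/summer-schools"],
    "uva.example/summer-schools": ["unibo.example/summer-school"],
    "unibo.example/summer-school": ["unibo.example/home"],
    "unibo.example/home": [],
}


def crawl(link_graph, seed, max_depth):
    """Breadth-first crawl; returns each reached page with its link distance
    from the seed. Pages further than max_depth links away are not visited."""
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in link_graph.get(page, []):
            if target not in depth_of:
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


captured = crawl(LINK_GRAPH, "uva.example/home", max_depth=2)
for page, depth in captured.items():
    print(depth, page)
```

With a limit of two links, the Bologna summer school page is captured although it is not a Dutch website, while the Bologna homepage, three links away from the seed, is not: the crawler preserves whatever lies within reach, regardless of nationality.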
This characteristic of the archival process, which is problematic for National Web Archives, actually comes as an advantage for this research: in fact, a few snapshots of the University of Bologna (between 2006 and 2013) and the University of Trento (between 2006 and 2007), that could be useful for this study, have been discovered in the Danish Web Archive. Moreover, the Portuguese Web Archive uses mostly Internet Archive snapshots, but also offers a text search tool which gives the possibility of retrieving pages, related to the University of Bologna and preserved by this institution, searching by keywords and not only by URL.
3.2 Backups and “traditional” sources
The second approach to solving this issue is to directly contact the people who managed these websites in the past. For instance, thanks to the collaboration of CESIA (Center for the Management and Development of Services at the University of Bologna) it was possible to retrieve images of the most important changes in the homepage of the website, since 1998. Obviously these sources are not complete snapshots of the webpages but just images, so it is not possible to follow the links and navigate these websites; however, they are still a good starting point.
Another way of finding information related to the old web is to search in newspaper archives to see if there are articles which mention or describe a specific website. The practice of using print media to retrieve information about the web of the past has already been described, for instance in Brügger, and it was also adopted earlier in this paper to present information on the first years of the Italian web sphere.
Using the digital archive of the newspaper La Stampa (http://www.archiviolastampa.it/) a few articles published between 1996 and 1999 were retrieved, which describe the use of the web by universities, in particular focusing on the activities online of the Polytechnic and the University of Turin.
3.3 Italian Web Archive
In this paper four different issues with the Internet Archive, which complicate its use in this research, have been emphasized. However, the intent of this study was not merely to criticize the Internet Archive project. On the contrary: preserving digital information is a multifaceted and complicated task, and it is only thanks to projects like the Internet Archive that research such as that presented here is possible.
The main objective of this paper was instead to underline that web historians cannot depend solely on the Internet Archive and it is their duty to stress the importance of National web archival projects and to participate actively in their development, as they could be the only doors to our digital past.
Therefore it would be important for Italy to become part of the projects led by the International Internet Preservation Consortium and start a discussion with the most interesting and advanced projects in the field, for instance the RESAW project, led by the NetLab of the University of Aarhus. Creating a bond between the archivists, the IT developers and the researchers could help Italy in preserving, studying and therefore better understanding its first twenty years online.
4. The Future of this Research: From Scarcity to Abundance
When working with born-digital sources there is always a thin line between the total absence of materials and an enormous and unmanageable amount of data. As an example, Twitter is not preserved by the Internet Archive and is usually archived by national institutions only for specific purposes. Therefore, even though Twitter offered its entire archive to the Library of Congress in 2010, it is currently almost impossible for researchers to recreate and analyze, for instance, a “national Twitter sphere” during the 2009 European Election and compare it to the 2014 one.
At the same time, an enormous part of all the tweets sent is still online and apps like Topsy offer the possibility of retrieving information using a full-text search tool, which is not the perfect solution for researchers, but it could be a first step in dealing with big data sources from social media.
4.1 Content analysis and link analysis: big data approaches
If we consider the sources available on university websites we see this thin line again. It is very difficult to reconstruct the course programs shared by the University of Bologna from the Nineties until today but, at the same time, we can currently consult all the programs from 2004/2005 onwards, which number over 6,000 per year. It is evident that, if the intent is to extract useful information from this enormous amount of data, a computational approach has to be employed (in Wong et al., 2012, several different methods are presented). As already mentioned, a Natural Language Processing technique such as Named Entity Recognition could help with the identification of people, places, organizations, etc. mentioned in a text. Combining this approach with other statistical methods, such as Topic Modeling – which makes it possible to identify the underlying topics in a text – and integrating all the information extracted with a knowledge base like Freebase, DBpedia or WordNet, could help to find similarities between courses from different departments, highlight the most recurrent topics by year, and therefore discover changes in teaching activities.
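As a very rough illustration of what “finding similarities between courses” can mean computationally, the sketch below compares course descriptions by the cosine similarity of their word counts. This is a deliberately simpler stand-in for the techniques named above, not Named Entity Recognition or Topic Modeling, and the course descriptions are invented for the example.

```python
import math
import re
from collections import Counter

# Hypothetical course descriptions (not taken from the unibo.it programs).
PROGRAMS = {
    "history": "introduction to digital history and web archives",
    "informatics": "introduction to text mining and web data",
    "botany": "plant physiology laboratory",
}


def term_counts(text):
    """Bag-of-words representation: lowercase word -> number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    dot = sum(a[term] * b[term] for term in set(a) & set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = {name: term_counts(text) for name, text in PROGRAMS.items()}
for other in ("informatics", "botany"):
    score = cosine_similarity(vectors["history"], vectors[other])
    print(f"history vs {other}: {score:.2f}")
```

Even this toy measure shows why more refined methods are needed: the two related courses are matched partly through function words such as “to” and “and”, a kind of noise that stop-word removal, entity recognition or topic models are designed to reduce.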
Moreover, the approach applied when analyzing the relationship between the University of Trento and the Istituto Trentino di Cultura was simply to look methodically at every version of the website, searching for the presence of a specific link. But if we imagine a large-scale approach, and if we are able to deal with the preservation issues described earlier, all the links from and to the University of Trento during the last twenty years could be studied. A method like this has been employed by Hale and others to study the relationships between British universities on the web from 1999 to 2014.
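The manual procedure described here, checking each version of a homepage for one specific link, is easy to automate once the snapshots are available as HTML. The following sketch uses Python’s standard html.parser module; the snapshot labels, the HTML fragments and the address “itc.example” are hypothetical placeholders, not archived data.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def link_presence(snapshots, needle):
    """For each (label, html) snapshot, report whether any link contains needle."""
    return {
        label: any(needle in link for link in extract_links(html))
        for label, html in snapshots
    }


# Hypothetical snapshots of a homepage at three moments in time.
EXAMPLE = [
    ("version-1", "<a href='http://www.itc.example/'>ITC</a> <a href='/en/'>English</a>"),
    ("version-2", "<div><a href='http://www.itc.example/'>ITC</a></div>"),
    ("version-3", "<a href='/research/'>Ricerca</a>"),
]

print(link_presence(EXAMPLE, "itc.example"))
```

Collecting all links instead of searching for a single one, and repeating the procedure for every preserved page, is the step that turns this diachronic reading into the large-scale link analysis mentioned above.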
4.2 The limitations of these methods
It is important to keep a few things in mind when employing big data approaches on historical born-digital sources. First of all, it is fundamental to remark that most of the time researchers are studying reborn digital materials, which are often not completely preserved, were archived over a long period of time and are always just a part, a selection, of the web of the past. Therefore, as has already been underlined in this paper, the results and conclusions will always be closely tied to this specific corpus, and every kind of generalization to the “whole web” has to be declared clearly.
Secondly, it is also important to know the limitations and the problems of the tools employed in the research, especially when dealing with semantic analyses. Researchers always have to consider the reliability of these techniques: for example, it is often difficult to apply Named Entity Recognition to Twitter, Topic Modeling results are sometimes complex to interpret and, even though there has been great progress in this field of study, the precision and the recall of these tools still have to be improved.
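For readers less familiar with these two measures, the sketch below computes them for a toy entity-recognition result. Precision is the share of the entities proposed by the tool that are correct; recall is the share of the entities actually present that the tool found. Both the “gold” annotations and the tool output are invented for the example.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predictions against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: entities a human annotator marked in a text...
gold = {"University of Trento", "Istituto Trentino di Cultura", "Bologna", "CINECA"}
# ...and entities returned by an imaginary recognition tool.
predicted = {"University of Trento", "Bologna", "Iperbole"}

precision, recall = precision_recall(predicted, gold)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case the tool proposes three entities, two of which are correct, and misses two of the four entities in the text; reporting both values lets readers judge how far conclusions drawn from such output can be trusted.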
5. Conclusion: From the Future of Web History to the Historians of the Future
In order to write the history of the last two decades researchers must rely on born-digital primary sources, which offer useful first-hand information on our relationship with the web.
At the same time it is important that historians cooperate ever more closely with archivists and computer scientists in order to build structures to preserve these sources and tools to analyze them. As was underlined earlier in this paper, the Italian situation perfectly represents the risks of lacking an interdisciplinary project like this: without a national web archive researchers are forced to use only international ones, which present several issues. Likewise, without collaboration between different experts it is difficult to develop computational methods for this kind of research.
However, one important point has not yet come up in this discussion: the university education of students in history, especially in Italy, is slowly integrating the first courses on Digital Humanities. But it is clear that being able to cooperate with computer scientists in employing (and understanding) NLP approaches, or to discuss with archivists how a crawler works, is far more demanding. Other humanistic and social science disciplines, like linguistics, sociology and psychology, started integrating statistics and computational methods decades ago in both bachelor’s and master’s degree courses. Now that, as Milligan has already aptly remarked, history is facing this sudden change from “paper” to digital primary sources, universities have to be ready to integrate new courses into their educational system, which will give the new generations the possibility of carrying our work forward.
I am grateful to the Centre for Internet Studies at Aarhus University for hosting me in June ’14, to the Netarkivet for giving me access to their archive and in particular to Niels Brügger, Meghan Dougherty and Janne Nielsen for their precious comments, which helped me to sharpen the arguments in this article.
 Cfr. S. Vitali, Passato digitale: le fonti dello storico nell’era del computer, Pearson Italia Spa, 2004; D.J. Cohen, R. Rosenzweig, “Digital history: A guide to gathering, preserving, and presenting the past on the web”, University of Pennsylvania Press, 2006; D. Seefeldt, W.G. Thomas III, “What is digital history? A look at some exemplar projects”, Intersections: History and New Media, 2009; G. Monina, “Storia digitale. Il dibattito storiografico in Italia”, MEMORIA E RICERCA, 2013.
 We can notice this by looking at the number of publications on Google Scholar which mention “digital history”: between 1996-2000 there were only 105, 2001-2005: 415, 2006-2010: 1,080, 2011-August 2014: 1,540.
 Cfr. T. Scheinfeldt, “The Dividends of Difference: Recognizing Digital Humanities’ Diverse Family Tree/s”, 7/04/2014, http://www.foundhistory.org/2014/04/07/the-dividends-of-difference-recognizing-digital-humanities-diverse-family-trees/; S. Robertson, “The Differences between Digital History and Digital Humanities”, 23/05/2014.
 N. Brügger, “Digital history and a register of websites: an old practice with new implications”, The Long History of New Media: Technology, Historiography, and Contextualizing Newness, pp. 283-298, 2011.
 D.J. Cohen, R. Rosenzweig, “Digital history: A guide to gathering, preserving, and presenting the past on the web”, University of Pennsylvania Press, 2006.
 For instance during the 80s Manfred Thaller conducted a pioneering work in this field of study focusing on the use of databases for historical computing, as described by Schreibman, Siemens and Unsworth in S. Schreibman, R. Siemens, J. Unsworth, A companion to digital humanities, John Wiley & Sons, 2008.
 S. Noiret, “Digital History: the new craft of (Public) Historians”, 28/05/2012, http://sergenoiret.blogspot.it/2012/05/digital-history-new-craft-of-public.html
 I. Milligan, “Mining the ‘Internet Graveyard’: Rethinking the Historians’ Toolkit”, Journal of the Canadian Historical Association/Revue de la Société historique du Canada, 23.2, pp. 21-64, 2012.
 B. Kahle, “Preserving the internet”, Scientific American, 276-3, pp. 82-83, 1997; N. Brügger, Archiving Websites. General Considerations and Strategies, Aarhus: Centre for Internet Research, 2005; Dougherty, Meghan, et al, “Researcher engagement with web archives: State of the art”, Final Report for the JISC-funded project “Researcher Engagement with Web Archives”, 2010; Dougherty, Meghan, and Eric T. Meyer, “Community, tools, and practices in web archiving: The state of the art in relation to social science and humanities research needs”, Journal of the Association for Information Science and Technology, 2014.
 N. Brügger, “Website history and the website as an object of study”, New Media & Society, 11.1-2, pp. 115-132, 2009; Foot, Kirsten, et al, “Candidates’ Web Practices in the 2002 US House, Senate, and Gubernatorial Elections”, Journal of Political Marketing, 8.2, pp. 147-167, 2009; E. Thorsen, “BBC News Online: a brief history of past and present”, in N. Brügger, (ed) Web History, Peter Lang, 2009; M.S. Ankerson, “Writing web histories with an eye on the analog past”, New Media & Society, 14.3, pp. 384-400, 2011; A. Ben-David, “The Emergence of the Palestinian Web-Space: a Digital History of a Digital Landscape”, Paper presented at MIT7 Unstable Platforms: The Promise and Peril of Transition, Cambridge (MA) 13-15 May 2011, http://web.mit.edu/comm-forum/mit7/papers/MIT7_bendavid.pdf. It is also important to mention the organization of specific panels dedicated to this topic during recent digital humanities meetings, as an example: http://www.digitalhumanities.lu/?page_id=335
These approaches are part of a multidisciplinary research area concerned with the automatic processing of human language, as described for instance here: https://hlt.fbk.eu/
 D.J. Cohen et al., “Interchange: The promise of digital history”, The Journal of American History, pp. 452-491, 2008; D. Seefeldt, W.G. Thomas III, “What is digital history? A look at some exemplar projects”, Intersections: History and New Media, 2009; S. Noiret, “Storia Digitale: sulle risorse di rete per gli storici”, La Macchina del Tempo. Studi di informatica umanistica in onore di Tito Orlandi, Le Lettere, pp. 201-258, 2011; I. Milligan, “Mining the ‘Internet Graveyard’: Rethinking the Historians’ Toolkit”, Journal of the Canadian Historical Association/Revue de la Société historique du Canada, 23.2, pp. 21-64, 2012.
 S. Graham, I. Milligan, S. Weingart, “The Historian’s Macroscope” (working title), under contract with Imperial College Press, Open Draft Version, Autumn 2013, http://themacroscope.org
 M. Chiapparini, “Interattività e nuove tecnologie: il caso di Internet”, MS Thesis in “Teoria dell’informazione”, University of Bologna, A.A. 1995-96, http://bit.ly/1g89oxc; A. Valentini, “Il pioniere del web che spalancò all’Italia le vie del cyberspazio”, Il Tirreno, 04/05/2011.
 M. Chiapparini, ibidem.
 As described on its website: http://www.iit.cnr.it/servizi/registro_it
 A. Pinna, “Soru: un incontro con Rubbia, così nacque il web in Sardegna”, Il Corriere della Sera, 28/12/1999, p. 24
 S. Chiara, “La telematica e la città. Il progetto Iperbole a Bologna”, MA Thesis in “Comunicazioni di massa”, University of Bologna, A.A. 1997-98, http://bit.ly/1XIllf4
 G. Romagnoli, “Noi, i sovversivi del computer”, La Stampa, 18/08/1994, p. 11 [this information appears in a box]
 M. Miccoli, “Cosa serve per collegarsi ad Internet”, La Repubblica, 03/10/1994; S.A. Merciai, “Scriveteci siamo su Internet”, La Stampa – Tutto Scienze, 15/02/1995, p. 2; “La finestra che cambierà il personal”, Il Corriere della Sera, 24/08/1995, p. 18
 N. Brügger, Archiving Websites. General Considerations and Strategies, Aarhus: Centre for Internet Research, 2005; N. Brügger, “Web archiving—Between past, present, and future”, in M. Consalvo & C. Ess (Eds.), The handbook of Internet studies, Oxford, England: Wiley-Blackwell, pp. 24-42, 2011; J. Masanès, “Web archiving: issues and methods”, Web Archiving, Springer Berlin Heidelberg, 2006; D. Smith, “Websites ‘must be saved for history’”, The Observer, 25/01/2009, http://www.theguardian.com/technology/2009/jan/25/preserving-digital-archive; M. Dougherty et al., “Researcher engagement with web archives: State of the art”, Final Report for the JISC-funded project “Researcher Engagement with Web Archives”, 2010; D. Gomes, J. Miranda, M. Costa, “A survey on web archiving initiatives”, Research and Advanced Technology for Digital Libraries, Springer Berlin Heidelberg, pp. 408-420, 2011.
 A national web archive is a project that intends to preserve snapshots of websites belonging to a “national web sphere”. There are several examples of national web archives around the world, such as Pandora in Australia, Netarkivet in Denmark, and BnF Web Legal Deposit and INA in France.
 D. Gomes, J. Miranda, M. Costa, “A survey on web archiving initiatives”, Research and Advanced Technology for Digital Libraries, Springer Berlin Heidelberg, pp. 408-420, 2011
 J. Niu, “An overview of web archiving”, D-Lib Magazine, 18.3/4, 2012.
 N. Brügger, Archiving Websites. General Considerations and Strategies, Aarhus: Centre for Internet Research, 2005.
 N. Brügger, N.O. Finnemann, “The Web and digital humanities: Theoretical and methodological concerns”, Journal of Broadcasting & Electronic Media, 57.1, pp. 66-80, 2013.
 J. Nielsen, “DR’s education across media: A historical study of media interplaying in the Danish Broadcasting Corporation”, Doctoral dissertation, Aarhus University, 2014.
 N. Brügger, “The archived website and website philology: A new type of historical document?”, Nordicom Review, 29.2, 2008 and “Web archiving—Between past, present, and future”, in M. Consalvo & C. Ess (Eds.), The handbook of Internet studies, Oxford, England: Wiley-Blackwell, pp. 24-42, 2011.
 A similar problem occurs with the University for Foreigners of Perugia.
 A Web crawler is an Internet software application (bot) that systematically browses the World Wide Web, typically for the purpose of Web indexing.
 As opposed to American ones, for instance: https://web.archive.org/web/19970518021303/http://www.utexas.edu/teamweb/history/
Other small changes emerged during this work; e.g., http://www.unipv.eu was previously http://www.unipv.it
 For more on the problem of changing domain names in historical research, see N. Brügger, “Digital history and a register of websites: an old practice with new implications”, The Long History of New Media: Technology, Historiography, and Contextualizing Newness, pp. 283-298, 2011.
Ateneo, nasce il centro di biologia
 As the Italian version of the website has been preserved more often than the English one, the dates mentioned refer to the former.
 The robots.txt protocol is a convention for advising cooperating web crawlers and other web robots about accessing all or part of a website which is otherwise publicly viewable.
 For instance, this is the old version of the homepage of the Department of Classic and Medieval Philology, one of the few pre-change department pages still available on the live web: http://www2.classics.unibo.it/
 It is important to note that this academic year is also present in the old version of the department pages.
 However, it is important to note that the message displayed differs from the one in the snapshot of the University of Trento website in Fig. 2.
 For example, Prof. Francesca Tomasi has been one of the curators of the Department of Classic and Medieval Philology web page at the University of Bologna for a few years.
 N. Brügger, “Probing a nation’s web sphere: A new approach to web history and a new kind of historical source”, 2014.
 R. Rogers, Digital methods, MIT Press, 2013.
 N. Brügger, “Web archiving—Between past, present, and future”, in M. Consalvo & C. Ess (Eds.), The handbook of Internet studies, Oxford, England: Wiley-Blackwell, pp. 24-42, 2011.
 E.g., the article “Anche l’università via Internet”, written by Giovanna Favro and published on 14 May 1998, available in the Archivio La Stampa.
 The expression “scarcity and abundance” in relation to the history of digital media was coined by Roy Rosenzweig. R. Rosenzweig, “Scarcity or abundance? Preserving the past in a digital era”, The American Historical Review, 108.3, pp. 735-762, 2003.
 For instance, Henrik Sivertsen is the person responsible, within the Resaw Project, for the preservation of digital materials related to the 2014 Eurovision Song Contest: http://resaw.eu/projects/eurovision-song-contest/.
 Update during the peer-review process: in November 2014 Twitter offered access to its complete archive online, opening completely new scenarios for researchers. https://blog.twitter.com/2014/building-a-complete-tweet-index
 As mentioned earlier, we decided to focus our research on born-digital documents because they guarantee the direct applicability of computational methods, without the intermediate digitization step. However, traditional sources from university archives could be as useful as digital ones in this specific part of the study.
 This information is available in the “Rapporto di Valutazione”: http://www.unibo.it/nucleodivalutazione/default.aspx
 W. Wong, W. Liu, M. Bennamoun, “Ontology learning from text: A look back and into the future”, ACM Computing Surveys (CSUR), 44.4, 2012.
 F. Nanni, “Managing Educational Information on University Websites: a proposal for Unibo.it”, Collaborative Research Practices and Shared Infrastructures for Humanities Computing, Proceedings of the 2nd AIUCD Annual Conference AIUCD2013 (Padua, Italy, 11-12 December 2013), CLEUP, pp. 279-286, 2014.
 S. Hale, et al., “Mapping the UK Webspace: Fifteen Years of British Universities on the Web”, Proceedings of WebSci, 2014.
 A. Ritter, S. Clark, O. Etzioni, “Named entity recognition in tweets: an experimental study”, Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011; C. Li et al., “TwiNER: named entity recognition in targeted twitter stream”, Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, ACM, 2012.
 B.M. Schmidt, “Words alone: dismantling topic models in the humanities”, Journal of Digital Humanities, 2.1, pp. 49-65, 2012.
 J. Chang et al., “Reading tea leaves: How humans interpret topic models”, Advances in neural information processing systems, 2009; A. Gangemi, “A comparison of knowledge extraction tools for the semantic web”, The Semantic Web: Semantics and Big Data, Springer Berlin Heidelberg, pp. 351-366, 2013.
 D. Fiormonte, “Informatica Umanistica…quindici anni dopo”, 30/03/2014, http://infouma.hypotheses.org/56
 I. Milligan, “Mining the ‘Internet Graveyard’: Rethinking the Historians’ Toolkit”, Journal of the Canadian Historical Association/Revue de la Société historique du Canada, 23.2, pp. 21-64, 2012.