Digital archivists are responsible for preserving our history; in the era of Web 2.0, this is not an easy undertaking. Focusing on weblogs (blogs), I will examine the archiving and preservation of these digitally born materials. Aspects inherent in the nature of blogs can make their preservation problematic for those in the archiving field. We must recognize the historic importance of these documents and begin preserving them now for long-term research, reference, and scholarly pursuit.
Archivists are now faced with mass amounts of digitally born media, much of which is in need of preservation. In 2010, Web 2.0 practices have become part of our everyday life, with about three-quarters of the estimated 142 million American Web users engaging in some form of social media interaction. Web 2.0 applications are now an essential aspect of our society, our culture, our economics, and now our history. The World Wide Web can be seen as evolving organically, transforming and reconfiguring our concepts of community. Users participate together in a form of collective intelligence; there is a synergy between users, a community. This “community” is never more prevalent than among users of weblogs and Twitter, in what is now called “the blogosphere” (Baudoin, 2008).
Weblogging, and “microblogging” as on Twitter, can be seen as the digital form of diaries, personal journals, and private letters, all of which are significant to our cultural memory. These items offer glimpses into a moment in time and space, and can provide a consistent and reliable form of information regarding our history and past, but only if we preserve them now. These resources will provide future researchers a greater understanding of life in the late 20th and early 21st centuries.
The creators of the World Wide Web are seen as innovators of information access and believers in information for all. As Derek M. Powazek states in his essay “What the Hell Is a Weblog and Why Won’t They Leave Me Alone?”, featured in the book We’ve Got Blog, “This was the anti-television. Digital democracy.” (3) In the late 1990s, weblogs started to emerge on the web, opening up a wealth of personal expression to whoever had access to a computer and the will to express whatever they wanted, whenever they wanted. The weblog arena quickly became a community of links, information, and personal opinion geared toward a participatory, interactive public.
Exactly who should be credited with the first weblog is still up for debate. Catherine O’Sullivan, in her essay for The American Archivist titled Diaries, On-line Diaries, and the Future Loss to Archives: Or, Blogs and the Blogging Bloggers Who Blog Them, cites Carolyn Burke as having created the first “on-line diary entry” in January of 1995. In her online journal, called Carolyn’s Diary, Burke posted this:
I’ve found that hope is the thing that keeps me interested in anything, including living, including good books, including people. The hopeless make me feel separate, trapped. Hope. What was her brother, the emperor’s name?…Hope is the sun in the sky feeding all of us. Hope is the feeling that my life will have a purpose for me. With it I can feel the future rising up fuzzily unpredictable, and yet malleable. My own will uses hope to sculpt what will be my present.
This posting by Burke is a deeply personal reflection and becomes an important part of our cultural memory: technologically, for being the first posting of its kind; socially, because it hearkens back to the era of diary writing as personal expression, now set in a digitally born format.
However, the position of Carolyn’s Diary as the first “official” blog is still debated, and credit is generally given to Justin Hall’s Links from the Underground, conceived in 1994, as the first blog that would be recognizable to the modern blogger, featuring new site links and “dated” commentary. This format came to be known as “filter-style” weblogging. Hall’s site was followed by Dave Winer, who wrote a weblog that chronicled the 24 Hours of Democracy Project in early 1996. In 1997 Jorn Barger coined the term “weblog,” which Peter Merholz then transformed into “wee-blog,” later shortened to simply “blog.” In 1998, Cameron Barrett began what could be considered the first weblog archive with CamWorld, consisting of a maintained list of updated URLs. (Riley, 2005)
Blogging really took root in 1999. Early that year, Bridgette Eaton established Eaton Web Portal, the first portal dedicated to blog listings. Eaton’s only criterion for inclusion was that entries needed to be dated. In August 1999, the launch of Blogger, created by Pyra Labs and eventually bought by Google in 2003, revolutionized the whole concept of blogging. Blogger created a “free-form interface” in which anyone with a web browser could easily create a blog, consisting of any content they wanted, without needing a background in writing HTML. (Blood, 2002)
According to Rebecca Blood, in her 2000 essay Weblogs: A History and Perspective, the creation of Blogger not only opened the door to free-form publishing, but also altered the definition of weblogs. What was once considered a list of links with some commentary became a site that is updated frequently, with new material posted at the top of the page. The form went from what was termed a “filter-style” weblog, which sifts through a large amount of information and posts the results as links, to a “blog-style” weblog. However, both styles still persist.
By June 2003, at the time of Google’s acquisition of Blogger and the startup of WordPress, there were an estimated 2.4 to 2.9 million active blogs online (“active” meaning that the blog had been updated within the last 90 days). (O’Sullivan, 2005) Blogger interfaces included spaces to inform readers about the author, in the “about me,” “bio,” and “profile” sections. Blogging became interactive, giving voices to those who were not computer scientists, programmers, or coders. For the first time in history, media became a “participatory endeavor.” (Blood, 2002)
By April of 2007, there were an estimated 15.5 million active blogs with about 1.5 million posts a day. In 2010 the blogging rate hit a plateau. Heather Green of BusinessWeek Online reports that the reason growth is slowing is twofold. First, those who want to blog are already doing so. Second, the increased use of other social media, such as video, podcasts, social networks, and Twitter, fills the same function as traditional blogs. (Green, 2007)
Twitter entered the social media scene in 2006, envisioned as a platform for “micro-blogging” via SMS texting. Each post (tweet) may contain up to 140 characters. Twitter users follow other Twitter users, organizations, and companies. At the company’s outset, no one predicted the popularity of this approach. According to Alexa: The Web Information Company, Twitter is ranked the number twelve website in the world. Nearly 50 million tweets are sent by about 15 million active Twitter users. Interestingly, the highest ranked country for Twitter use is not the United States, but South Africa, followed by the Philippines and India. Twitter is a prime example of how social networks form and evolve over time.
The Library of Congress has recognized the cultural significance of Twitter to our history. On April 14, 2010, it was announced that Twitter would donate all of its public tweets from 2006 to the present. The Library of Congress enthusiastically accepted stewardship of the donation to enhance its current and growing set of “born-digital materials.” In the press release on its website, the Library openly acknowledged the significance of Twitter to our culture, economically, socially, and politically.
Many important events relevant to our recent history have been tweeted about. The 2008 Congressional and Presidential elections culminated in the historic tweet from the newly elected President Barack Obama, announcing his victory. Twitter was used during the historic 2009 Iranian elections as a protest tool, allowing users to communicate with the outside world even after the government had blocked most communication methods. Twitter has been used as a tool for social change, protest, politics, and dissent in other countries as well. An example cited by the Library of Congress is the case of James Buck and his Egyptian translator, who were arrested in Egypt while reporting on the 2008 elections in that country. Buck tweeted about his arrest to his 48 followers, who contacted UC Berkeley, where Buck was a student, and the U.S. Embassy in Cairo. Buck continued to send updates while being detained by the Egyptian government.
The Library of Congress anticipates that select tweets will eventually be made available to the public, searchable by topic from its website.
Blogs inherently archive themselves by nature of their design; now we need archivists who believe in the importance of this medium to archive the blogs. While the importance of preserving digitally born material becomes increasingly evident, there is still debate as to the value of preserving weblogs and Twitter’s tweets. Digitally born objects, like blogs, can seem like intangible documents, living in organic flux and an ever-transforming state. They operate as an unconstrained medium, but this is also the dilemma for preservationists. How do we preserve free-form, intangible publishing? What portion should be archived? Who will decide what is relevant for inclusion in a repository, and what will their criteria be? And who will pay for the preservation endeavor?
The Problem Solvers
The Internet Archive was founded in San Francisco in 1996, with the mission of providing permanent and free access to web information and history. (O’Sullivan, 2005) According to its site, the Internet Archive offers “access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format.” Since its founding, the organization has been accepting data donations from organizations such as Alexa, a company that specializes in intelligent web navigation via web crawls, collective participation, and the use of its “toolbar” by its users.
The Internet Archive includes text, audio, moving images, and software. Most important to preservationists is that the Internet Archive collects web pages using what it calls the Wayback Machine. This program features several of the organization’s archiving pursuits: Archive-it, its K-12 Web Archiving Program, Around the World in 2 Billion Pages, and its growing web collection. These subsidiaries of the Internet Archive work together to create a holistically accessible archive. Archive-it allows users to “harvest, catalog, and archive their collections, and then search and browse the collections when complete.” The site does not explicitly give a subscription rate, but alludes to there being one. Archive-it depends heavily on user participation to decide what gets archived and how it is maintained; it is a collective working to preserve the past for future use. A blog creator who would like to use the system can archive their own work as well. According to its site, Archive-it has collected 1,667,446,800 URLs for 1,035 public collections.
The Wayback Machine allows archives from the Web to be accessed, providing a “snapshot” version of a site. The query is conducted using a URL to search its extensive database of collected sites. The results can be used to see what a site’s page looked like previously, to gather “original source” code for further development, to visit sites no longer in operation, and to gather historical background. Not every website ever created is listed; owners have the option to have their site excluded or removed. The Internet Archive and its Wayback Machine are among the most extensive and collaborative efforts working to preserve the web for future use. They have created a wealth of information for professionals, students, and patrons alike, operating like a public library whose physical space is the World Wide Web.
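To make the URL-based query described above concrete, here is a minimal sketch in Python. It assumes the Internet Archive's public "availability" endpoint at https://archive.org/wayback/available and a JSON response containing an "archived_snapshots" object with a "closest" entry; both are my understanding of that service rather than something documented in this essay, and the function names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Internet Archive's public "availability" lookup endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(site_url: str, timestamp: Optional[str] = None) -> str:
    """Build a lookup URL; `timestamp` (e.g. YYYYMMDD) asks for the nearest snapshot."""
    params = {"url": site_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the archived snapshot URL from a response, or None if none is listed."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(site_url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Perform the actual network request (requires connectivity)."""
    with urlopen(build_availability_url(site_url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

Calling lookup("portal.eatonweb.com", "20010101") would ask for the snapshot nearest to the start of 2001. A result of None covers both sites that were never crawled and sites whose owners asked to be excluded.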
In early 1999 Bridgette Eaton developed the Eaton Web Portal. Since 2001, it has transformed and become known as Eaton Web: The Blog Directory. A snapshot of the first generation of Eaton is still accessible through the Internet Archive’s Wayback Machine, and because of this we can track the history of her site. It is informative, allowing us to examine the transformation she made in her archiving methodology. According to the site, the criterion for inclusion in the web portal was “loosely… something that’s organized chronologically and updated fairly regularly.” She validated each site herself, choosing sites that had been in operation for at least a month. She categorized the weblogs alphabetically, by category, by language, and by country. The site also had a search field. To have a weblog listed, basic information had to be submitted: blog name, URL, description, author, bio, and email.
Eaton morphed her site in 2001, making it a far more participatory, collective community of bloggers managing their own blog archive. The process has basically the same structure: a user must register on the site and submit their blog’s title, URL, feed URL, and blog tags. However, the biggest change is that the user must now “donate.” The nominal annual fee of $34.99 allows the user to manage two blogs. Eaton’s site works as a “blog manager,” offering more interactive tools for promotion, monetizing your site, and maintaining your work. The new site is far more commerce based than the “labor of love” that was the initial Eaton Web Portal, but it fills a different niche in the preservation of the blogosphere.
Technorati, founded in November of 2002, claims to be the first “blog search engine.” The company promotes and uses “open source” software to run its operations. A user must register and “claim” their blog by submitting their blog and feed URLs, which must then be evaluated, verified, and reviewed before being listed in the Technorati index. The system operates using an “Authority and Rank” system. The site defines authority as the number of unique links to your blog, and rank as how far your authority is from the number one position. The higher the authority, the closer to number one you will be. In recent years, Technorati has dwindled in popularity and use as it added and removed features in attempts to expand its service base.
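The “Authority and Rank” idea can be illustrated with a small Python sketch. This is only my reading of the definitions quoted above, not Technorati's actual implementation: I assume authority is a count of distinct linking URLs, that rank is simply a blog's position when blogs are sorted by authority, and that ties are broken alphabetically. The function names and sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def authority(inbound_links: Iterable[str]) -> int:
    """Authority: the number of unique links pointing at a blog (duplicates count once)."""
    return len(set(inbound_links))


def rank_blogs(links_by_blog: Dict[str, Iterable[str]]) -> List[Tuple[int, str, int]]:
    """Return (rank, blog, authority) tuples; rank 1 has the highest authority."""
    scores = {blog: authority(links) for blog, links in links_by_blog.items()}
    # Sort by authority descending; break ties alphabetically (an assumption).
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(position, blog, score)
            for position, (blog, score) in enumerate(ordered, start=1)]


# Hypothetical example data: each list holds the URLs that link to the blog.
example_links = {
    "blog-a": ["http://x.example/1", "http://y.example/2", "http://x.example/1"],
    "blog-b": ["http://z.example/9"],
    "blog-c": ["http://x.example/1", "http://y.example/2", "http://z.example/9"],
}
ranking = rank_blogs(example_links)
```

In this example blog-c holds the number one position with three unique links, while blog-a's repeated link is counted only once.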
Google blog search professes a firm belief in “self publishing” and states that its “blog search” makes it easier to explore this community of self-publishing more effectively. The index is searchable as far back as January 2000, but most entries do not date further back than June 2005, when Google blog search began this endeavor. Users can conduct their query using a standard basic search or an advanced search. The service also offers its own set of search operators: inblogtitle, inposttitle, inpostauthor, and blogurl. Blog search also offers access through the many search options featured on its home page, including hot queries, recent posts, top stories, top videos, and a genre search (politics, entertainment, technology, science…), and it is available in numerous languages. Google’s only criterion for “automatic” inclusion in its blog search is that the blog must have an RSS feed attached to the site. If the site does not have an RSS feed, submissions for inclusion are accepted directly and easily on the site. Google indexes through a site’s feed, checking often for updates. This technology allows for relevant, timely, dated responses to search queries.
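As an illustration of how the four operators named above might be combined into a single query, here is a short Python sketch. The operator names come from the text; the operator:value syntax and the quoting of multi-word values are my assumptions, and build_query is a hypothetical helper, not part of any Google tool.

```python
# Operator names as listed on the blog search site; the "name:value" syntax
# used below is an assumption for illustration.
OPERATORS = ("inblogtitle", "inposttitle", "inpostauthor", "blogurl")


def build_query(keywords: str = "", **filters: str) -> str:
    """Combine operator filters and free keywords into one query string."""
    parts = []
    for name, value in filters.items():
        if name not in OPERATORS:
            raise ValueError(f"unknown operator: {name}")
        if " " in value:
            # Quote multi-word values so they stay attached to their operator.
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    if keywords:
        parts.append(keywords)
    return " ".join(parts)


query = build_query("preservation", inblogtitle="archives", inpostauthor="Jane Doe")
```

Here query holds a string that restricts results to blogs with “archives” in the title and posts by the hypothetical author Jane Doe, followed by the free keyword.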
Lastly, I want to mention the Library of Congress as it pertains to Twitter and archiving. The stated mission of the Library, found on its website, is “to make its resources available and useful to the Congress and the American people and to sustain and preserve a universal collection of knowledge and creativity for future generations.” Its digital collection has been available to the public since 1994, specializing in material that is unavailable anywhere else: a substantial assemblage of digitized photography, manuscripts, maps, sound files, motion pictures, books, and “born digital” materials. The Library of Congress employs researchers, librarians, and library professionals to aid the public in their pursuit of knowledge; its mission keeps to the core ideals of librarianship. The Library is a leader in its approach to digital preservation, taking into consideration “copyright issues, metadata and retrieval standards, preservation, scanning and conversion, and text mark-up.”
The Obstacles and Solutions
There is not yet an ideal method for archiving the Web, let alone the multilayered idiosyncrasies of the blogosphere, which consists of unstable, constantly changing, and disappearing blogs. The organizations I mention here have begun great work toward a practice of archiving this material, but they are not without flaws.
Eaton Web: The Blog Directory and Archive-it require a fee for their services, and some creators do not want to pay or invest the time it takes to maintain their blog. Both Eaton and Technorati update their sites regularly, but they also dispose of outdated material; the focus is more on real-time content and freshness. (Helmond, 2007) Google blog search, Eaton Web: The Blog Directory, and Technorati offer no past history of their own sites; there are no “snapshots” of their own place in history on any given day. The Internet Archive has the ability to preserve a permanent history, but for now we see only glimpses into that history. It is ideal for showing dead links from the past, providing us that “snapshot” in history, but it only goes so far. We are still missing content and context; insight into the broader historical and cultural relevance is still lacking. (O’Sullivan, 2005) The Library of Congress is the only institution preserving with a holistic approach, particularly with the recent Twitter donation. This opportunity will be a learning experience for the institution, and we will be the beneficiaries, but the Library is extremely exclusive about what it will include in its collection.
As the creator of a blog, you should take matters into your own hands. If you believe your work is worthy of long-term preservation, then do something about it. Pay those minimal subscription or donation fees and get your blog out there. Sign up with Technorati. Make sure you are showing up in Google blog search, and if you are not, take steps to make sure you are. Loss is a big issue, so make sure you back up your work. Your work, your history, whatever you are passionate about can be lost with a few keystrokes or a faulty host provider, or your blog may even be shut down for reasons such as “violation of terms.” If you have a proper backup plan, your loss will be only of your time. One choice, if you are a WordPress blog user, is VaultPress. New on the market and still in the experimental phase, it claims to back up not only your posts but your site as a whole: your plugins, dashboard, themes, and comments.
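For bloggers who want a simple do-it-yourself starting point, the following Python sketch shows one possible backup approach: saving the posts listed in a blog's RSS 2.0 feed to a local JSON file. It is not how VaultPress or any other service works. It assumes the feed uses the standard RSS 2.0 item elements (title, link, pubDate, description), and the function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, List


def parse_rss_items(feed_xml: str) -> List[Dict[str, str]]:
    """Extract the posts listed in an RSS 2.0 feed document."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "body": item.findtext("description", default=""),
        })
    return posts


def save_backup(posts: List[Dict[str, str]], destination: Path) -> Path:
    """Write the posts to a JSON file, creating parent folders if needed."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_text(
        json.dumps(posts, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return destination
```

A feed usually lists only the most recent posts and carries no themes, plugins, or comments, so a backup like this would need to be run regularly and should be treated as a supplement to a full-site backup, not a replacement for one.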
If you are a professional archivist or a member of a preservation team, action within your institution needs to happen now to determine which blogs are relevant to your particular institution and worthy of preservation. An acquisition policy should be developed that constitutes a sound method of appraisal. In the field of preservation there are digital repository systems and software at every level, both proprietary and open source, available to meet your organization’s needs. Meta Archive is one of many options available for digitally born collection development. The homepage of its site offers a fitting statement concerning the state of digital archiving at the moment: “The greatest threat to digital assets is not fire, flood or theft. It’s the assumption that cultural memory organizations have taken the requisite steps to preserve them.” Meta Archive claims to take a community-based approach to its preservation system, which lends itself well to the blogosphere community. It specializes in low-cost, long-term digital preservation for universities, libraries, museums, and other cultural heritage institutions.
Conclusion and Future Work
There needs to be a commitment in this collective Web 2.0 community. It needs to involve the creators and authors of blogs, archivists, librarians, administrators, and those of like mind to preserve these important shards of history. There are levels as to what should be preserved, and debate about what content is relevant. Care and thought should be given to the appraisal for inclusion, because sometimes the inclusion criteria themselves can rewrite history.
As harvesters of potential history, we cannot be intransigent about subject matter, content selection, and preservation determination on the frontier of information. Archivists must attempt to consider why a blog was created, define its significance to our cultural memory, act now to determine the relevance and worthiness of those blogs, and then strive to preserve them for long-term use far into the future.
The information gathered for this essay brought up other issues concerning the preservation of the blogosphere. Further research could first address loss recovery for blogs that were improperly backed up, deleted, or suspended due to some form of “violation of terms.” Are they recoverable? The second issue, which was really the basis of this essay, is inclusion. This is important, and it makes me wonder whether we should be working to preserve it all.
1. Alexa: the web information company. (2010, May 10). Retrieved from http://www.alexa.com/siteinfo/twitter.com
2. Archive-it. (2010). Retrieved from http://www.archive-it.org/
3. Baudoin, Patsy. (2008). On preserving our blogs for future generations. The Serials Librarian, 53(4).
4. Blood, R. (Ed.). (2002). We’ve got blog. Cambridge: Perseus.
5. Capra, R.G., Lee, C.A., Marchionini, G., Russill, T., & Shah, C. (2008). Selection and context scoping for digital video collections. Unpublished manuscript, School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina. Retrieved from http://fredstutzman.com/pubs/stutzman_jcdl2008.pdF
6. Eaton, B. (n.d.). Eaton web: the blog directory. Retrieved from http://portal.eatonweb.com/
7. Google. (2010). Google blogs. Retrieved from http://blogsearch.google.com/
8. Green, H. (2007, April 25). With 15.5 million active blogs, new Technorati data shows that blogging growth seems to be peaking. BusinessWeek: Blogspotting, Retrieved from http://www.businessweek.com/the_thread/blogspotting/archives/2007/04/blogging rowth.html
9. Helmond, Anne. (2007, December 03). Archiving blogs and the blogosphere. The Blog Herald, Retrieved from http://www.blogherald.com/2007/12/03/archiving-blogs-and-the-blogosphere/
10. Library of Congress. (2010, April 15). Twitter donates entire tweet archive to Library of Congress. Retrieved from http://www.loc.gov/today/pr/2010/10-081.html
12. O’Sullivan, Catherine. (2005). Diaries, on-line diaries, and the future loss to archives: or blogs and the blogging bloggers who blog them. The American Archivist, 68(1), Retrieved from http://www.jstor.org/stable.40294257
13. Riley, D. (2005, March 06). A Short history of blogging. The Blog Herald, Retrieved from http://www.blogherald.com/author/duncan/
14. Riley, D. (2007, November 05). Technorati drops content older than 6 months old. TechCrunch, Retrieved from http://techcrunch.com/2007/11/05/technorati-drops-content-olderthan-6-months-old/
15. Samouelian, M. (2009). Embracing web 2.0: archives and the newest generation of web applications. The American Archivist, 72(Spring/Summer), Retrieved from http://archivists.metapress.com/content/k73112x7n0773111/
16. Twitter. (2010). Retrieved from http://twitter.com/about
17. Vault press. (2010). Retrieved from http://vaultpress.com/