SRCCON 2018 • June 28 & 29 in MPLS

Session Transcript:
Archiving News Websites for the Long Long Term

Session facilitator(s): Albert Sun, Kathleen Hansen

Day & Time: Thursday, 11:45am-1pm

Room: Ski-U-Mah

Albert: Okay. Hello, everybody. I am wondering – can everyone hear if we just talk? Or do we need to talk into a microphone?

Louder, or the microphone.

You can use the lapel one, too.

Kathleen: I don’t need a microphone. I am used to talking to 19 year olds.

Albert: Hello? Powerful. So, is that good? My name is Albert Sun. This is my co-facilitator, Kathy Hansen. I work at the New York Times and I have worked on web archiving projects there.

Most recently, archive.nytimes.com.

Kathleen: I am a faculty member here at the University of Minnesota. I have been on the faculty for 37 years. I remember when there was no Internet. My colleague and I have been thinking about archiving for a long time and wrote a book called “Future-Proofing the News” that outlined, going back to colonial newspapers, all the ways things haven’t been archived.

Albert: Kathy has the idea of how we should be doing things, and I have the up-close look in the newsroom at what it looks like and the nitty-gritty of why news websites aren’t archiving their stuff. I see a bunch of familiar faces in the room. I think if everyone could take a little bit of time – did everyone already introduce themselves to the people at the table and generally get to know everyone? The way this will run is we have a couple of slides, then we will break up into small group discussions and come back at the end, and the goal at the end is everyone will have, what was it? Three good ideas and, like, a stack of handouts or talking points.

Kathleen: The handouts are on the Etherpad site.

Kathleen: This session is being transcribed so if anyone needs to go off the record just say off the record and then back on the record.

Albert: And one more. Maybe by a show of hands, how many people work for a news organization in like a sort of technical or developer role? And then how many are more, like, a reporter or editor or more news gathering? And then anyone who works for like a cultural institution or someone interested in news content.

Kathleen: Just to orient everybody. There are three kinds of born-digital content, and each lives in its own universe in terms of being archived. The first is from organizations that have a print element to their publication; that content has been digital for more than 30 years. That content has been archived, in a way, through some of the vendor services that have contracts with your organization. The people that used to be in charge of that digital content, the news researchers or news librarians, were the first to get fired, so most of them are gone, and the kind of things that used to happen to make sure those databases were being managed properly are mostly not happening anymore. That has historically been a source of some revenue for the news organizations that would share that content. The thing to remember with those databases is they were never the entire content of the newspaper.

The visuals, graphics, illustrations and agate, any of the freelance stuff is not in there. That has never been a complete archive of the print publication.

The second kind is what I think most of you are here thinking about and that is the content that appears on the web.

That includes everything from the sort of repurposed stuff in the print publication but also all of the original content created specifically for that electronic delivery system. It includes thinking about not just capturing the content but the look, the feel, the functionality. We know the news apps are created outside the content management system, so there is no archive process for a lot of that content, and that is the stuff that is at most risk. In fact, if you go back and try to see what somebody’s website looked like 15 years ago, you will not be able to find what you are looking for.

The third kind is the content being delivered on other platforms, like the social media platforms or user content, and that stuff lives on somebody else’s server. It is a vendor’s server, or content you don’t necessarily have control over internal to your news organization, and that in many cases is yet a different kind of archiving problem. There are many different types of digital content that we are talking about, and I think the first thing is to be clear about those three different kinds of content when we are talking about archiving born digital.

Albert: For each of those three different types do you know what your organization is doing? Or do you know who knows?

Kathleen: When we were working on our book we were talking to a large news organization in Minnesota, which will go unnamed. We were in a room with seven people at a table, most of whom had never talked to each other before we asked them to come together to talk to us. That is not atypical. If you are in that kind of situation, the first thing you have to ask is what do we have, who’s got it, and who am I not talking to right now.

Albert: There is no one person whose job it is to manage the archives. There have been proposals for people to do that, but we have a photo morgue, and the Snapchat team archives their stories their way. We have an internal search tool for reporters, and these are independent silos where there is effort and someone managing that. We don’t have a cohesive whole view of it generally.

The most recent new archive project was the web content archive, and that was created out of the technology department because some previous servers were slated to be shut down, so it was a mad dash into the burning building to rescue things before it all disappeared.

Kathleen: That is one of the things you need to start with; what is here and who is responsible for it? There are pockets of things I guarantee you you don’t know about. That is the beginning point.

Albert: So, I think, for everyone on a site, are there things that you remember being created before and are they still up there? I know from my past work and before, this is something I periodically check on and the results are not always pretty. That is one of our biggest motivations for doing archiving. What are these things? What are the – think back in your organizational history, what are the important turning points in the digital evolution? Or in the creation of whatever you were publishing, is it still there? Can you refer back to it? What have you learned from it? How would you reference it in the future?

Kathleen: In some cases it may be there but the functionality is gone or it may be there but it relied on three previous CMS to run. There are things that you can salvage but they will require work. We want to run through a couple organizations that are trying to do something, briefly.

Albert: Many are familiar with the Internet Archive, which is the premier organization.

They are a snapshot service that takes snapshots of sites over time; they have tons of captures and they have gone back 20-plus years now. So that is a great resource if it works for your pages, which for many of the most interactive things it doesn’t do that well.

Kathleen: Yeah, the Internet Archive recognizes that what they are doing with what they call the Wayback Machine is not an archive. It is a site with some captures. So they are trying to manually capture full sites, but there are lots and lots of intellectual property issues with that and they pretty much bury it on the Internet Archive.

It is very hard to find. They are archiving television news and that is searchable because they are capturing the closed captioning. That is a valuable tool and I think they have been doing that for about seven years.

Unfortunately they are starting to shut that down a little bit. They are changing the way it is going to be working.

Kathleen: There is somebody trying to do something, but it is still not what we are talking about. The Library of Congress has a web archiving project they have been working on for a long time. They have something called general news on the internet, which is 36 websites that they capture, but it is not accessible for one year from the publication, again, because of IP and other issues, and it is really buried on the Library of Congress website and almost impossible to find. I am a librarian by training and I can almost never find it.

They have something called public policy topics, on specific subjects, and they are using the Internet Archive technology to capture those websites. Some of that includes news content but, again, it is very limited in terms of what they are doing.

They have a dump of tweets as well but there is no way to access it.

Kathleen: I was just going to get to that and that is an interesting project, too.

Universities are trying to step in and do something.

Stanford web archive includes some news sites. This is from the Mercury News which you would think they want to pay attention to but they have one capture for 2016. They are trying to do something, but it is not significant. University of Missouri has their journalism digital news archive project.

They are trying to develop prototypes of systems that would capture content but, again, very early days. Historical societies in your area might be trying to do something. Public libraries might be trying to do something. I want to run through these quickly, but the slides are all up on the Etherpad and you have handouts there, too.

And there is something called Time Travel, or PastPages? These are projects where people are trying to capture something about a website, but it is not an archive. Anybody know of any other kinds of projects like this that you are familiar with?

Larger, big picture kind of archival projects?

Albert: I think outside the U.S. there are some also. Some countries have legal deposit laws where you have to submit to the national library but it is just getting started.

Does Google News have this stuff in an archive anywhere?

Kathleen: No. No.

It has the cache but that sticks around.

In the research group that I am in at MIT we are capturing talk radio, and we will have around 3,000 stations by the end of the year; however, because of IP reasons we don’t know if we will be able to make it accessible.

Kathleen: Public radio has a wonderful project going on where they are capturing and digitizing public radio content. That is another resource. Again, it is called the American Archive of Public Broadcasting (AAPB). It is something that is essentially designed to salvage all of the tapes that are deteriorating. They are making that content available as some sort of archival system.

Anyone remember the Rogers photo archive? They gave photos to this guy in Arizona – Arkansas and he promised to upload them.

If you remember what happened with Gawker: Gawker went out of business and the database of the entire run of Gawker was up for grabs. Archive-It is the only record of Gawker. Here is the Twitter issue. In 2017 the Library of Congress announced they are going to continue to preserve their existing content, but it is remaining under embargo and no one can have access to it. Great to have the archive, but nobody can get to it because there is no way to access it.

These are always going to be playing catch-up. There is always going to be a problem unless the news organizations themselves are taking this seriously and working on it as an internal issue, because the external organizations don’t have the ability to create a full database.

Albert: Why aren’t news organizations doing that themselves? For a print archive, you can take each copy and PDF and stash it somewhere and keep it safe. What is the digital equivalent of that? And why hasn’t it happened so far? I think a big reason is cultural and Kathy in her book talks about how this is a longstanding problem and how the focus is always on the next story and not so much about preserving yesterday’s stories.

There are technical issues. Every time you change your content management system or come up with a new publishing format as the technology evolves, what happens to Flash or Real Media Player and formats that don’t work anymore? There are economic issues, like who can afford to do all this work? It is a lot of work. The revenue stream for it is a little uncertain. Print archives have a licensing path, but what is the path for digital content? Who would be willing to pay for images of old websites? So far that question is a little unanswered. Then there are the legal, copyright, and intellectual property issues, and I think that is one of the big reasons why it is hard for outside organizations to do it if news organizations themselves don’t pay attention or don’t have a strategy or approach to it.

Kathleen: So where do you start? This is one of those wicked problems, right? The journalism digital news archive project at Missouri helped create something called Guidelines for Digital Newspaper Preservation Readiness. We have links on the Etherpad site. It is a way for the news organization to begin with an inventory. What kind of formats are we dealing with, and what are the standards that need to be in place if we are going to try to archive it for the long term? I think that is a useful starting point and a tool for people in terms of where to even begin to tackle this huge problem.
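
As an illustration of the inventory step described here (this was not shown in the session), a minimal Python sketch might walk a directory of published files and count them by extension, as a rough first answer to “what kind of formats are we dealing with?” The function name `inventory_formats` and the idea that the content sits in one directory tree are assumptions made for the example.

```python
from collections import Counter
from pathlib import Path


def inventory_formats(root):
    """Count the files under `root` by lowercased file extension.

    Hypothetical helper for illustration; a real inventory would also
    cover databases, CMS exports, and third-party platforms.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under one label.
            extension = path.suffix.lower() or "(no extension)"
            counts[extension] += 1
    return counts
```

Calling `inventory_formats("/path/to/site").most_common()` would list the most frequent extensions first, which makes legacy formats such as `.swf` easy to spot.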

Albert: I feel like that was our attempt to give a little scope and the current state of the problem. It is kind of a hard problem. There are not exactly answers for it. We are hoping today we can come together and think about how you would approach or start to tackle this in, like, concrete ways that you can take back to your organizations now, or that you can start doing personally to open your own material. We have come up with a set of discussion questions, and we are thinking we can split up into groups based on interest in them and work for about 25 minutes. For each group there is a specific deliverable that we will ask everyone to present back to the group.

Kathleen: We don’t want to suggest these are the only questions. There could be something you are really interested in that is not on here, but these are four we thought might get things started. The first is what are some of the concrete actions that might take place. What are the technical tools you are using if you are doing archiving? And what would you like to have in your hands to do that?

Albert: The next is how do you go about finding people that care about this in your building, recruit people to worry about the cause, and pitch it to people who hold the purse strings who might also be invested.

Albert: The third is allies outside the building –

Kathleen: – these can take every conceivable format but think about if you are trying to seek help from another cultural institution or technical institution what would that look like and how would you pitch a collaboration to them? What would that, sort of, call for help look like?

Albert: And finally, like, what are the marquee use cases?

How can you collect the archive material out there that would be interesting to turn back into your news coverage or into different projects that then you can put out into the world.

Kathleen: Are there other questions that need to be up here that you would like to suggest as a group for one of the groups? Things you are struggling with?

Yeah?

I don’t know if this is worth a whole subgroup, but should everything be archived?

Kathleen: That is a really important question and one of the things we have on the prompts. What is worth saving?

Albert: I think that most closely falls under number four.

Kathleen: And also the concrete actions. If you are going to start doing something, what is it worth trying to save?

Is the issue what is worth saving, or is it that there are things that explicitly shouldn’t be saved? I say everything is worth saving if it is cheap enough, but are there ethical reasons? The third thing is how do you work out that relationship with that vendor? There are lots of reasons why something might be worth not saving.

Albert: So maybe everyone can stand up, get up, and we will move around the room and say that number one will be this corner here. Two can be in the back by the doors. Three will be over in that corner. And four up here.

We might need to rearrange furniture depending on the group sizes.

Kathleen: One, two, three, and four.

Albert: There is a link to the Etherpad with everything we talked about and more specific prompts for each thing.

[Group activity]

Kathleen: We have a deliverable for each group, if that is what you would like to focus on. If you choose something different, that is fine. For the first group, a list of concrete actions and tools to investigate using, either personally or as an organization, and a plan for the first step to start doing that for each one.

For the second group, a list of people in our organizations, or maybe job titles, and a 30-second elevator pitch if you find yourself talking to an executive who could be an internal ally. The third is a list of organizations you could work with on a project and a draft of an e-mail proposing a collaboration. The fourth is a list of stories and projects that could use an archive and what kind of archival material they would need. The prompts are just a way to get the juices flowing and have people thinking about what the issues are.

Everybody good?

[Group table activity]

Kathleen: We are going to go with the discussion for another, maybe, eight minutes, and then we will want to hear from each group before we finish at 1:00.

Kathleen: Okay, guys, we hate to end up with conversations hanging, but we do want to give each group a chance to share all the good stuff you came up with.

Which table wants to go first?

I volunteer to distribute. We were group four. My name is Jason by the way. Hi. We were looking at the use cases for our archival material. What was our prompt?

Albert: A list of projects that would use archiving material and what kinds of archiving material.

Perfect. What we kind of talked through was it might be a good idea to start with projects that had a monetization feature.

Something that could be re-sellable to museums or mass media or scholastic book publishing or sports content.

Things that are going to be the easiest to re-monetize to solve the technical issues. If there is money coming in you will solve the technical issues and then you can go into larger projects.

Kathleen: And these are your notes? You have your notes in Etherpad?

They are in Etherpad.

Kathleen: Fantastic. This is great.

Albert: Were there specific story ideas? What is the most monetize-able? Anything off the top of your mind?

You brought up the point of being able to syndicate the material.

We also talked about, kind of, more holistically some of this might not be immediately monetize-able but sort of the greater good and promote ideas that you don’t realize until they are there. We talked about artistic uses of the material.

Also a bit of maybe self-reflection of a news organization on their own archives.

Kathleen: Yup. Great. Who wants to go next?

I will do it. I am happy to report back because I took the notes but we have a real life person working on this issue.

Please interrupt me when I get something wrong.

We sort of rejiggered our deliverable from “we will send an e-mail to an outside ally” and instead made a list of things to think about when you are trying to identify, or think about whether, an outside ally would be good.

The number one most important thing is is somebody already doing this and if so talk to them before you reinvent the wheel or try to do anything.

The answer is maybe they won’t be the right person. We heard stories about organizations overwhelmed by donations of a lot of physical stuff. If you are like, “I heard you are doing this and I mailed you a lot of crap,” they are going to be pretty mad or at least not be able to do what you want.

That was the biggest one. They will definitely have tips for you, though. Find people who are already doing this work.

Second is does a collaborator have the experience to help you and the reputation for not messing this up or screwing with other people.

Albert: Who might that be?

Three was one I cared about a lot. Do you have any reason to believe this organization will continue to exist for whatever number of years seems reasonable? If not, or even if so, do they have a handoff plan?

And I think you would be best to talk about four which is what does your organization want to get out of this.

[Off the record]

Kathleen: That is a real good point. When the Denver Public Library drew up an agreement with Scripps for the Rocky Mountain News, they spent three years working on that agreement of donation, and there is an example on the Etherpad site of the agreement between Scripps Howard and the Denver Public Library. Understanding what the memorandum of agreement is, if you are working with an external organization, is crucial because the cultural clash could be striking.

And we added to do internal research because if you are the person that signs the agreement you know what it means but that doesn’t mean the –

Kathleen: The other users do.

The reporters who have been relying on this for years understand what you have agreed to.

Kathleen: Group two?

I will start because I took the notes, but the others feel free to jump in. If you go down, we started at line 129. Basically our task was to make a list of who the potential internal allies could be in your building or newsroom, figure out what their motivations would be if you were to make the pitch, and what things you should be appealing to. When it came to talking points for the pitch, we broke them out by use case underneath. If you have news researchers, it is pretty easy to pitch because they want to be able to research the news, but not everybody has the luxury of having those in their newsroom. Journalists and content creators want to make sure their work is preserved; editors would love to research past work or strategize on it. The heads of development are people that care, but they need to be talked to about the technical debt they currently have, that this could alleviate technical debt, and that it is worth putting the effort into the archives because it will save you time. Several are repeated because the points are universal, but that is how we tackled it. If there is anything anybody else wanted to add.

There is a business case for archives, and it might help save time to call it evergreen content; in a lot of cases that could be recast and recirculated for the audience, helping you do more with the existing archive. It is not just for the pure altruism of people who might be doing research. You can make it a more craven thing you can put ads on.

Kathleen: The craven is the only way it will get done.

I work at the “Wall Street Journal” and one of our sister products is an archive product we sell.

Kathleen: That has been out there for 30 years and has a track record of value.

Even people outside the newsroom, on the business side, there is the case to sell this.

Kathleen: Group one?

Our deliverable was we were supposed to have a list of concrete actions and tools to investigate using either in-person or as an organization and plan for first steps of starting to use them.

Starting around line 51 to 117 we talked about things that are currently being done which went on. There are tools there, other resources, people just talking about how they were currently using archive stuff.

This sort of wrapped it up. We went around the table after coming up with a list of things that can be done and that was, you know, making sure your archives can be used – I am reading this backwards. I am sorry.

Yeah… I guess you could say our deliverable list of concrete actions was just to make sure that you get everything into one system so it can be easily searched and indexed, because then it is browsable and people can use your archives to find information. Automate your backup creation, because this is something we talked about where they lost a site and the backup wasn’t there, or they changed their CMS and didn’t have a solid backup of the previous CMS. A couple of people talked about having offsite backups: “we changed our CMS in 2008 and lost everything published before 2008 because that was gone.” Or in the case of the photo library, it was flooded and there was no backup there. Make sure you have your backup, and that there is a backup someplace else, so that in the event of fires, floods, the case of the Canadian environment libraries or the government changing, or organizational sabotage, someone else has a copy.
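
To make the “automate your backup and keep a copy someplace else” advice concrete, here is a minimal Python sketch (not something presented in the session). It assumes the content to protect lives in one directory and that `offsite_dir` is a stand-in for a location on separate hardware or a mounted remote volume; the function names `sha256_of` and `back_up` are hypothetical. It packs the directory into a compressed archive, copies it, and compares checksums so a corrupted copy is caught immediately.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source_dir, primary_dir, offsite_dir, label):
    """Archive source_dir, copy the archive elsewhere, verify the copy.

    Hypothetical helper: primary_dir and offsite_dir must sit outside
    source_dir, otherwise the archive would try to include itself.
    """
    primary_dir = Path(primary_dir)
    offsite_dir = Path(offsite_dir)
    primary_dir.mkdir(parents=True, exist_ok=True)
    offsite_dir.mkdir(parents=True, exist_ok=True)

    # make_archive appends ".tar.gz" to the base name and returns the path.
    archive = Path(
        shutil.make_archive(str(primary_dir / label), "gztar", root_dir=source_dir)
    )
    copy = Path(shutil.copy2(archive, offsite_dir / archive.name))

    checksum = sha256_of(archive)
    if sha256_of(copy) != checksum:
        raise IOError(f"offsite copy of {archive.name} does not match the original")

    # Store the checksum next to the copy so it can be re-verified years later.
    (offsite_dir / (archive.name + ".sha256")).write_text(
        f"{checksum}  {archive.name}\n"
    )
    return archive, copy, checksum
```

Run from a scheduler with a dated `label`, something like this addresses the “the backup wasn’t there” failure; it does not by itself protect against both copies sharing one building or one account.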

Something else was to mark down why we made the decisions on what to archive and what not to archive, and to interview the people who made the decisions.

Even if you have to go pull back meeting notes from four years ago where the decisions were made, at least have a way of recording it for posterity.

And another thing was don’t build dynamic sites. Build things that are static that don’t require the continued existence of a server or third party. It should be as simple as possible if you want the archives to continue working.
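
One way to act on the “static, no third party” advice is to check a saved page for resources it still loads from other hosts. The following Python sketch (an illustration, not a tool mentioned in the session) uses the standard library HTML parser to list scripts, images, stylesheets, frames, and media whose URLs name a host; the class and function names are hypothetical, and the tag list is an assumption that will miss things such as URLs inside CSS or JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags that pull in a resource, mapped to the attribute holding its URL.
# Ordinary <a href> links are ignored: they are citations, not dependencies.
DEPENDENCY_ATTRS = {
    "script": "src",
    "img": "src",
    "iframe": "src",
    "embed": "src",
    "video": "src",
    "audio": "src",
    "source": "src",
    "link": "href",
}


class ExternalDependencyFinder(HTMLParser):
    """Collect (tag, url) pairs for resources that name a host."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        wanted = DEPENDENCY_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # A non-empty netloc means the URL names a host rather than
            # a path relative to the archived page itself.
            if name == wanted and value and urlparse(value).netloc:
                self.external.append((tag, value))


def find_external_dependencies(html_text):
    """Return the host-qualified resources referenced by an HTML document."""
    finder = ExternalDependencyFinder()
    finder.feed(html_text)
    finder.close()
    return finder.external
```

An empty result suggests the page can keep rendering from a flat file; each entry in a non-empty result is something to copy locally before the other server goes away.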

Kathleen: Vint Cerf has talked about the concept of digital vellum. Anybody heard that phrase? He is saying you create a box that has the content of the interactive but also includes all of the stuff that made that interactive work, and that becomes a static thing within which you have got an interactive whatever. That is something that I think is not necessarily technically possible at this moment.

It is. The Internet Archive has a whole bunch of old computer games for DOS and I think Windows ‘95. When you want to play an archived video game, they boot up a virtual machine running in your browser and give you the entire system to run the game. People are saying that is how you will have to interact with Flash content. Spin up a virtual machine containing a browser and an old version of Adobe Flash.

Kathleen: You have to have the emulator for the thing you are archiving.

Just one caveat on that. That works if you can scope the amount of data you need access to. For example, you will not reproduce an entire search engine. You depend on that for external services.

If it is less than a gigabyte you can stick it in a single file. You could.

Right. But you need to prepare to do that and there are datasets that are larger and if you are relying on internal basis.

Are you going to duplicate all of Twitter?

That is a scale it goes beyond but it is pretty big.

Albert: Anything else from concrete actions?

Kathleen: Albert, do you want to give us the 23,000 mile version of what you were working on and the problems you faced?

Albert: It sounds like there are a ton of different avenues for archiving which is sort of the issue we all face; there is the very specific and how you archive that one specific project you care about and then there is the strategy of how do you make someone care about it or I archived it but how do I not lose the hard drive I archived it on so you don’t lose it.

I think we are fortunate we are embedded within an institution that is quite long-lived, within an organization that has people who care about process and long-term thinking, and that has people that have been around for a long time. We were able to create a flat file system for a lot of our web content, with the idea that we moved the things from the shelves and now we can go through the restoration process to fix the old broken interactives, and have a system for newer web pages so we can put them on the shelves in the same sort of system.

Kathleen: And you were crowd sourcing fixing things; right?

Albert: Internally crowd sourcing fixing things and then put up a public e-mail address saying does this look right?

E-mail us and we will put it in the long backlog of things. We get 3-4 e-mails a day about someone who has found one of these old pages –

Kathleen: Find the genealogy people in your community. They are crazy people but they love news sites. Find the group in your community that really cares about what you are doing and they are going to be thrilled that you are thinking about this. They are the people that are most upset about having lost as much as we have already lost.

Albert: Yeah. And then there is this other project we worked on, the Overlooked project, which is about women and people of color who never got an obituary in the New York Times. We don’t have a great way of looking back at past obituaries. There was no digital version of it. The last print version was published in the ’70s. So we were able to partner with the New York genealogy society and they scanned this old material for us, and from that we have been able to run counts and figure out what the percentage of female names and titles in the archives was. I hope people have thoughts and specific ideas they can take away from this and go and do, whether it is just going back and screenshotting your old things, or maybe pitching or forming a working group wherever you are at. Does anyone have other thoughts or last questions to toss out?

Kathleen: Please add to the Etherpad. We will go in and try to organize it a little bit, but we don’t want to mess up what you have done. You have done great work. If you have other ideas, if things come to you and you have a conversation and come up with something, please go in.

You can find the link on the schedule. It just says here are the notes from this session. Add what you can. We are thrilled that you came up with so much good work in such a short amount of time. Thank you, everybody.

[APPLAUSE]