Search Results

Other Borealis Collections Logo
Borealis
Ruest, Nick 2022-01-10 <p>2,661,117 tweet ids for #healthcanada #NACI #fordnation #medicalfreedom #covid19 #covid19vaccines #protectourfamilies #protectyourchildren #holdtheline tweets, collected with <a href="http://www.docnow.io/" target="_blank">Documenting the Now's</a> twarc. Tweets can be “<a href="https://medium.com/on-archivy/on-forgetting-e01a2b95272" target="_blank">rehydrated</a>” with Documenting the Now’s <a href="https://github.com/DocNow/twarc" target="_blank">twarc</a>, or <a href="https://github.com/DocNow/hydrator" target="_blank">Hydrator.</a></p> <p> <code>twarc hydrate tweet-ids.txt > tweets.jsonl</code> </p> <p>ID files are available for all hashtags or some individual hashtags: <ul> <li>covid19-ids.txt</li> <li>covid19vaccines-ids.txt</li> <li>fordnation-ids.txt</li> <li>healthcanada-ids.txt</li> <li>healthcanada-NACI-fordnation-medicalfreedom-covid19-covid19vaccines-protectourfamilies-protectyourchildren-holdtheline-ids.txt</li> <li>holdtheline-ids.txt</li> <li>medicalfreedom-ids.txt</li> <li>NACI-ids.txt</li> <li>protectyourchildren-ids.txt</li> </ul> </p> <p>Tweets were collected via the <a href="https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets" target="_blank">Standard Search API</a> on: <ul> <li>November 18, 2021</li> <li>November 21, 2021</li> <li>November 26, 2021</li> <li>December 1, 2021</li> </ul> </p>
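Because a single tweet may carry several of these hashtags, the per-hashtag ID files may overlap. Below is a minimal Python sketch of merging a few of them into one deduplicated list before hydrating, assuming each file holds one tweet id per line. The helper names `merge_ids` and `merge_id_files` and the output name `merged-ids.txt` are hypothetical; they are not part of the dataset or of twarc.

```python
from pathlib import Path
from typing import Iterable, List


def merge_ids(lines: Iterable[str]) -> List[str]:
    """Return unique tweet ids in first-seen order, skipping blank lines."""
    seen = set()
    merged = []
    for line in lines:
        tweet_id = line.strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            merged.append(tweet_id)
    return merged


def merge_id_files(paths: Iterable[str], out_path: str) -> int:
    """Merge several id files into out_path; return the number of unique ids."""
    def all_lines():
        for path in paths:
            with open(path, encoding="utf-8") as handle:
                yield from handle

    merged = merge_ids(all_lines())
    Path(out_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return len(merged)
```

For example, `merge_id_files(["covid19-ids.txt", "fordnation-ids.txt"], "merged-ids.txt")` would write a combined file that could then be passed to `twarc hydrate` in place of `tweet-ids.txt`.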
Ruest, Nick; Sala, Christine; Thurman, Alex 2020-02-16 <p>Web archive derivatives of the&nbsp;<a href="https://archive-it.org/collections/1757">Avery Library Historic Preservation and Urban Planning</a> collection from <a href="https://archive-it.org/home/Columbia">Columbia University Libraries</a>. The derivatives were created with the <a href="https://github.com/archivesunleashed/aut/">Archives Unleashed Toolkit</a> and <a href="https://cloud.archivesunleashed.org/">Archives Unleashed Cloud</a>.</p> <p>The&nbsp;<strong>cul-1757-parquet.tar.gz</strong> derivatives&nbsp;are&nbsp;in&nbsp;the <a href="https://parquet.apache.org/">Apache&nbsp;Parquet format</a>,&nbsp;which&nbsp;is&nbsp;a <a href="http://en.wikipedia.org/wiki/Column-oriented_DBMS">columnar&nbsp;storage</a> format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See <a href="https://github.com/archivesunleashed/notebooks/blob/master/datathon-nyc/parquet_pandas_stonewall.ipynb">this</a> notebook for examples.</p> <p><strong>Domains</strong></p> <pre> <code class="language-java">.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)</code></pre> <p>Produces&nbsp;a&nbsp;DataFrame&nbsp;with&nbsp;the&nbsp;following&nbsp;columns:</p> <ul> <li>domain</li> <li>count</li> </ul> <p><strong>Web&nbsp;Pages</strong></p> <pre> <code class="language-java">.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))</code></pre> <p>Produces&nbsp;a&nbsp;DataFrame&nbsp;with&nbsp;the&nbsp;following&nbsp;columns:</p> <ul> <li>crawl_date</li> <li>url</li> <li>mime_type_web_server</li> <li>mime_type_tika</li> <li>content</li> </ul> <p><strong>Web&nbsp;Graph</strong></p> <pre> <code class="language-java">.webgraph()</code></pre> <p>Produces&nbsp;a&nbsp;DataFrame&nbsp;with&nbsp;the&nbsp;following&nbsp;columns:</p> <ul> 
<li>crawl_date</li> <li>src</li> <li>dest</li> <li>anchor</li> </ul> <p><strong>Image&nbsp;Links</strong></p> <pre> <code class="language-java">.imageLinks()</code></pre> <p>Produces&nbsp;a&nbsp;DataFrame&nbsp;with&nbsp;the&nbsp;following&nbsp;columns:</p> <ul> <li>src</li> <li>image_url<br /> &nbsp;</li> </ul> <p>The <strong>cul-1757-auk.tar.gz </strong>derivatives<strong> </strong>are the <a href="https://cloud.archivesunleashed.org/derivatives">standard set of web archive derivatives</a> produced by the Archives Unleashed Cloud.</p> <ul> <li><strong>Gephi </strong>file, which can be loaded into <a href="https://gephi.org/">Gephi</a>. It will have basic characteristics already computed and a basic layout.</li> <li><strong>Raw Network</strong> file, which can also be loaded into <a href="https://gephi.org/">Gephi</a>. You will have to use that network program to lay it out yourself.</li> <li><strong>Full text</strong> file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.</li> <li><strong>Domains count</strong> file. A text file containing the frequency count of domains captured within your web archive.</li> </ul> <p>Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with <code>cat</code>. For example: <pre><code class="bash">cat cul-1757-parquet.tar.gz.part* > cul-1757-parquet.tar.gz</code></pre> </p>
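For readers who do not use Spark, the idea behind the Domains derivative (group by domain, count, sort descending) can be approximated in plain Python. This is only a sketch of that idea, not the Archives Unleashed Toolkit's implementation: `count_domains` is a hypothetical helper, and it takes the host from `urllib.parse.urlsplit` without any further normalisation the toolkit may apply.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def count_domains(urls: Iterable[str]) -> List[Tuple[str, int]]:
    """Count URLs per host and return (domain, count) pairs, most frequent first."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # None when the string has no host part
        if host:
            counts[host] += 1
    return counts.most_common()
```

Fed the `url` column of the Web Pages derivative, this would yield rows comparable to the `domain` and `count` columns listed above.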
Ruest, Nick 2016-03-03 Derivative data for #MakeDonaldDrumpfAgain tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate MakeDonaldDrumpfAgain-tweet-ids.txt > MakeDonaldDrumpfAgain.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter. This dataset is the combination of hydrated http://hdl.handle.net/10864/11310 tweet ids, and http://hdl.handle.net/10864/11270.
Ruest, Nick 2019-11-23 <p>2,944,525 tweet ids for #elxn43 tweets, collected with <a href="http://www.docnow.io/" target="_blank">Documenting the Now's</a> twarc. Tweets can be “<a href="https://medium.com/on-archivy/on-forgetting-e01a2b95272" target="_blank">rehydrated</a>” with Documenting the Now’s <a href="https://github.com/DocNow/twarc" target="_blank">twarc</a>, or <a href="https://github.com/DocNow/hydrator" target="_blank">Hydrator.</a></p> <p> <code>twarc hydrate elxn43-ids.txt > elxn43.jsonl</code>. </p> <p>Tweets were collected via the <a href="https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets" target="_blank">Standard Search API</a> on a cron job every five days from September 9, 2019 - November 23, 2019.</p> https://creativecommons.org/licenses/by/2.0/ca/
Ruest, Nick 2016-12-31 228,086 tweet ids for "TheHip, hipinkingston" captured during the Tragically Hip's final concert in Kingston, Ontario in August 2016. Tweets can be "rehydrated" with Documenting the Now's twarc (https://github.com/DocNow/twarc). twarc.py --hydrate th_final_concert_kingston_tweet_ids.txt > th_final_concert_kingston.json
Ruest, Nick 2016-08-21 Tweet ids for #YMMfire tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate ymmfire-ids.txt > ymmfire-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter.
Ruest, Nick 2016-04-13 Tweet ids for #thechalkening tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate thechalkening-ids-20160412.txt > thechalkening-20160412-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter.
Ruest, Nick 2016-04-13 <p>Tweet ids for #panamapapers tweets, collected with <a href="http://www.docnow.io/" target="_blank">Documenting the Now's</a> twarc. Tweets can be "<a href="https://medium.com/on-archivy/on-forgetting-e01a2b95272" target="_blank">rehydrated</a>" with Documenting the Now's <a href="https://github.com/DocNow/twarc" target="_blank">twarc</a>, or <a href="https://github.com/DocNow/hydrator" target="_blank">Hydrator.</a></p> <p> <code>twarc hydrate panamapapers-ids.txt > panamapapers-tweets.jsonl</code>. </p> <p>Hydrating will recreate the original tweet(s) in JSON format, provided the content is still available on Twitter.</p>
Ruest, Nick 2017-01-29 14,478,518 tweet ids for #WomensMarch collected with Documenting the Now's twarc from January 21-28, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py --hydrate WomensMarch_tweet_ids.txt > WomensMarch.json Also included are the log files for the Filter API and Search API queries. The Filter API log records the cumulative number of dropped tweets. https://creativecommons.org/licenses/by/2.0/ca/
Ruest, Nick 2017-05-03 681,668 tweet ids for #climate collected with Documenting the Now's twarc from January 22-26, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py hydrate climatemarch_tweet_ids.txt > climatemarch.json. https://creativecommons.org/licenses/by/2.0/ca/
Ruest, Nick 2017-12-10 <p>362,464,578 tweet ids for tweets directed at Donald Trump (@realDonaldTrump), collected with <a href="http://www.docnow.io/" target="_blank">Documenting the Now's</a> twarc. Tweets can be “<a href="https://medium.com/on-archivy/on-forgetting-e01a2b95272" target="_blank">rehydrated</a>” with Documenting the Now’s <a href="https://github.com/DocNow/twarc" target="_blank">twarc</a>, or <a href="https://github.com/DocNow/hydrator" target="_blank">Hydrator.</a></p> <p> <code>twarc hydrate to_realdonaldtrump_20210120_ids.txt > to_realdonaldtrump_20210120.jsonl</code>. </p> <p>Collection notes: <ul> <li>Tweets from May 7, 2017 - October 16, 2018 of the dataset used a combination of the Filter (Streaming) API and Search API.</li> <li>The Filter API failed on June 21, 2017.</li> <li>From June 23, 2017 forward only the Search API was used to collect.</li> <li>Collection was done every 5 days on a cron job, and periodically deduplicated.</li> <li>There is a data gap from <code>Tue Jul 28 13:53:50 +0000 2020</code> through <code>Thu Aug 06 09:36:23 +0000 2020</code> due to a collection error.</li> </ul> </p> <p>This dataset also includes a number of derivative csv files from the original <code>jsonl</code> collected. 
This includes: <ul> <li>A user csv file created with <a href="https://stedolan.github.io/jq/" target="_blank">jq</a> (see below).</li> <li><a href="https://github.com/archivesunleashed/twut/blob/main/docs/usage.md#extract-user-information" target="_blank">twut userInfo</a></li> <li><a href="https://github.com/archivesunleashed/twut/blob/main/docs/usage.md#extract-tweet-language" target="_blank">twut language</a></li> <li><a href="https://github.com/archivesunleashed/twut/blob/main/docs/usage.md#extract-tweet-times" target="_blank">twut times</a></li> <li><a href="https://github.com/archivesunleashed/twut/blob/main/docs/usage.md#extract-tweet-sources" target="_blank">twut sources</a></li> <li><a href="https://github.com/archivesunleashed/twut/blob/main/docs/usage.md#extract-hashtags">twut hashtags</a></li> <li><a href="https://github.com/archivesunleashed/twut/blob/main/docs/usage.md#extract-urls" target="_blank">twut urls</a></li> <li><a href="https://github.com/archivesunleashed/twut/blob/main/docs/usage.md#extract-animated-gif-urls" target="_blank">twut animatedGifUrls</a></li> <li><a href="https://github.com/archivesunleashed/twut/blob/main/docs/usage.md#extract-image-urls" target="_blank">twut imageUrls</a></li> <li><a href="https://github.com/archivesunleashed/twut/blob/main/docs/usage.md#extract-media-urls" target="_blank">twut mediaUrls</a></li> <li><a href="https://github.com/archivesunleashed/twut/blob/main/docs/usage.md#extract-video-urls" target="_blank">twut videoUrls</a></li> </ul> </p> <p>User csv:</p> <p><code>jq -r '[.id_str, .created_at, .user.screen_name, .retweeted_status != null] | @csv' to_realdonaldtrump_20190130.jsonl > to_realdonaldtrump_20190130_users.jsonl</code> </p> https://creativecommons.org/licenses/by/2.0/ca/
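The jq filter in this record emits one row per tweet: the tweet id, its creation time, the author's screen name, and whether the tweet is a retweet. A rough Python equivalent is sketched here; `user_rows` is a hypothetical helper, and it assumes each input line is one tweet JSON object carrying the fields the jq filter reads.

```python
import json
from typing import Iterable, Iterator, List


def user_rows(jsonl_lines: Iterable[str]) -> Iterator[List[object]]:
    """Yield [id_str, created_at, screen_name, is_retweet] for each tweet line."""
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        tweet = json.loads(line)
        yield [
            tweet.get("id_str"),
            tweet.get("created_at"),
            (tweet.get("user") or {}).get("screen_name"),
            # Mirrors `.retweeted_status != null` in the jq filter.
            tweet.get("retweeted_status") is not None,
        ]
```

The rows can be handed to `csv.writer(...).writerows(...)` from the standard library. Note that Python's `csv` module writes booleans as `True`/`False`, whereas jq's `@csv` writes `true`/`false`.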
Ruest, Nick 2021-11-08 <p>2,075,645 tweet ids for #elxn44 tweets, collected with <a href="http://www.docnow.io/" target="_blank">Documenting the Now's</a> twarc. Tweets can be “<a href="https://medium.com/on-archivy/on-forgetting-e01a2b95272" target="_blank">rehydrated</a>” with Documenting the Now’s <a href="https://github.com/DocNow/twarc" target="_blank">twarc</a>, or <a href="https://github.com/DocNow/hydrator" target="_blank">Hydrator.</a></p> <p> <code>twarc hydrate elxn44-tweet-ids.txt > elxn44.jsonl</code>. </p> <p>Tweets were collected via the <a href="https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets" target="_blank">Standard Search API</a> on a cron job every five days from July 28, 2021 - November 05, 2021.</p>
Ruest, Nick 2020-04-19 <p>425,227 tweet ids for Wet'suwet'en tweets, collected with <a href="http://www.docnow.io/" target="_blank">Documenting the Now's</a> twarc. Tweets can be “<a href="https://medium.com/on-archivy/on-forgetting-e01a2b95272" target="_blank">rehydrated</a>” with Documenting the Now’s <a href="https://github.com/DocNow/twarc" target="_blank">twarc</a>, or <a href="https://github.com/DocNow/hydrator" target="_blank">Hydrator.</a></p> <p> <code>twarc hydrate wetsuweten-20210115-ids.txt > wetsuweten.jsonl</code> </p> <p>Tweets were collected via the <a href="https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets" target="_blank">Standard Search API</a> on a cron job every five days beginning on February 18, 2020. Collection is ongoing.</p> <p>The account that was used to collect these Tweets failed to collect Tweets for the period from Sun Jul 26 02:00:21 +0000 2020 through Fri Aug 07 20:05:54 +0000 2020.</p>
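The gap boundaries in this record use the same timestamp layout as the `created_at` field of hydrated tweets, so they can be parsed directly. A minimal sketch, assuming an English/C locale for the day and month names; the constant and function names are hypothetical:

```python
from datetime import datetime

# Layout of timestamps such as "Sun Jul 26 02:00:21 +0000 2020".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Gap boundaries copied from the record above.
GAP_START = datetime.strptime("Sun Jul 26 02:00:21 +0000 2020", TWITTER_TIME_FORMAT)
GAP_END = datetime.strptime("Fri Aug 07 20:05:54 +0000 2020", TWITTER_TIME_FORMAT)


def in_collection_gap(created_at: str) -> bool:
    """Return True when a created_at timestamp falls inside the documented gap."""
    moment = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    return GAP_START <= moment <= GAP_END
```

A check like this can help flag the affected period when building a time series from the hydrated tweets, so that the drop in volume is not misread as a drop in activity.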
Ruest, Nick 2020-04-19 <p>80,264 tweet ids for Tyendinaga tweets, collected with <a href="http://www.docnow.io/" target="_blank">Documenting the Now's</a> twarc. Tweets can be “<a href="https://medium.com/on-archivy/on-forgetting-e01a2b95272" target="_blank">rehydrated</a>” with Documenting the Now’s <a href="https://github.com/DocNow/twarc" target="_blank">twarc</a>, or <a href="https://github.com/DocNow/hydrator" target="_blank">Hydrator.</a></p> <p> <code>twarc hydrate tyendinaga-20210115-ids.txt > tyendinaga.jsonl</code>. </p> <p>Tweets were collected via the <a href="https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets" target="_blank">Standard Search API</a> on a cron job every five days beginning on February 24, 2020. Collection is ongoing.</p> <p>The account that was used to collect these Tweets failed to collect Tweets for the period from Sun Jul 26 03:46:41 +0000 2020 through Fri Aug 07 20:33:27 +0000 2020.</p>
Milligan, Ian; Ruest, Nick; Lin, Jimmy 2015-12-01 <p>This contains derivative data for the Canadian Political Parties and Interest Groups collection.</p> <p>If you cite this material, please use:</p> <p>University of Toronto Libraries, Canadian Political Parties and Interest Groups, Archive-It Collection 227, Canadian Action Party, http://wayback.archive-it.org/227/20051004191340/http://canadianactionparty.ca/Default2.asp</p>
Ruest, Nick 2016-06-27 Tweet ids for #jcdl2016 tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate jcdl2016-tweet-ids.txt > jcdl2016-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter.
Ruest, Nick 2015-12-12 Tweet ids for #paris #Bataclan #parisattacks #porteouverte tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate paris-tweet-ids.txt > paris-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter.
Ruest, Nick; Library and Archives Canada 2015-12-07 Tweet ids for #elxn42 tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate elxn42-tweet-ids.txt > elxn42-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter. This dataset is the combination of hydrated http://hdl.handle.net/10864/11310 tweet ids, and http://hdl.handle.net/10864/11270.

Map search instructions

1. Turn on the map filter by clicking the “Limit by map area” toggle.
2. Move the map to display your area of interest. Holding the shift key and clicking to draw a box allows for zooming in on a specific area. Search results change as the map moves.
3. Access a record by clicking on an item in the search results or by clicking on a location pin and the linked record title.
Note: Clusters are intended to provide a visual preview of data location. Because there is a maximum of 50 records displayed on the map, they may not be a completely accurate reflection of the total number of search results.