Twitter-Archive-Parser Analysis

During the past few weeks, I have been developing a Python script that aims to uncover Twitter data through the data archive that a user can download. This is slightly different from a cloud extraction, such as with Magnet axioms Cloud Pull feature, because doing an archive download sends the user a zip file with many JavaScript files containing their information. This archive has  more information about a user than the Axiom pull, but is harder to manually search through. This is why I created a script to parse out some of the many files in the archive.

Upon running the script, which took less than 4 minutes,  data is parsed and outputted to the user specified output SQLite3 database. Run time of the script is impacted by the number of account IDs needed to be requested and the general content size of the account. This is due to the fact no account usernames or handles are recorded in this archive, only numeric account IDs. this can be overcome by one of two ways: implement the Twitter API, or make HTTP requests to Twitter. For this script, I used HTTP requests. To lessen the amount of HTTPs requests needed to be made, a dictionary is made with the account ID as the key, and username + handle as the values. Each time an account ID is parsed, the script checks the dictionary to see if it is a known account ID with an already parsed out username + handle, if so, no HTTPS request is made, and if not, an HTTPS request is made, then data appended to the dictionary. This method lessened run time by over 500% on my personal Twitter archive. If you’re interested, download it here.


The outputted database has the following fields, and the data inside each field:


Account Info

Contains data obtained from the account.js file. Which is parsed from & into the following:



Blocked Users

Contains data obtained from the block.js file. This data is useful for an investigation such as a harassment case, as you can see that the user in question is actively blocking a potential suspect. The file is parsed from & into the following:











Connected Applications

Contains data from connected-application.js which shows what apps were given permission to use the user accounts data. This has less functionality than Axioms, where its able to tell an investigator which Tweet was sent by what device.


Direct Messages

Contains data from direct-message.js which contains all data associated with direct messages, and is parsed out from & into the following:



Followers & Following

Contains data from followers.js & following.js which has account IDs for all users that follow the user in question. This data can be used by investigators to see who the suspect was potentially able to be in contact with.





Imported Contacts

Contains data from contacts.js which are all contacts that a user chooses to import from their personal contact application, into Twitter. Unfortunately, I could not find a way to correlate the Contact_ID/Phone Number with a Twitter Account_ID, as the Contact_ID number is not present in any other file in the archive, and Twitter’s API does not allow access to Contact_IDs or phone numbers. This data can be good material to corroborate an analysis done on a suspect’s phone to implicate the suspect did have this number in their contacts. Data is parsed into the following:




This contains data from ip-audit.js and is parsed into account ID, username, and handle of the suspect user, and more importantly, the IP address used to login for that specific Twitter session, and the date at which it occurred This is very interesting information, as with the public IP address of each session, it is possible to determine general location data with the IP address as well.



This contains data from tweet.js and is parsed into the following:




Many more artifacts reside in this archive, however, I just wanted a working script that can be able to obtain more artifacts than Axiom’s pull. More features are planned to be released with more optimizations. More info about the twitter-archive-parser can be found on my GitHub Repo.

How Much Data Does Twitter Have On You?

What does the Bird know?

Brief History of Twitter

Twitter’s inception occurred in 2006, by Jack Dorsey and a handful of other members; Evan Williams, Biz Stone, and Noah Glass. The idea was originally supposed to be based on SMS messages as the primary means of communication, instead of internet based as we know of it today. Twitter wasn’t always called Twitter as well, it used to be named Twttr, due to naming trends at the time (Tumblr, Flickr, etc). Shortening up words as shown previously in texting became popular at this time, which saw many tech companies emulating the trend.

How To Obtain Your Twitter Archives

To download your Twitter archive, follow the steps here, which are given by Twitter. Be careful, as there are two types of archives you can download, a comprehensive archive of all your data, which was linked above, which is much more difficult to parse and read. The less comprehensive Archive only contains your Tweets, but Twitter gives you a nice HTML page to easily view your Tweets in a timeline. This less comprehensive Twitter archive can be downloaded by following Twitter’s guide here.

Data in the Comprehensive Twitter Archive

In the comprehensive Twitter archive discussed above, it is clear to see how much data Twitter actively collects and stores on its users. What follows is an overview of data obtained through this comprehensive Twitter archive:  
Phone Number Account Suspension Direct Messages IPs Used to Log In Periscope Account Info
Email Ad Engagements Direct Message Media Liked Tweets Protected History
Username Ad Impressions Profile Media Created Lists Saved Searches
Account ID Date of Birth Direct Message Headers Member Lists Screen Name Change
Creation Date Inputted Age Change of Email Address Subscribed Lists Verification Status
Display Name Ad Inferred Age Facebook Connections Moments
Timezone Blocked Users Followers Muted Accounts/Words
Creation IP Address Connected Applications Following Devices Used

Analysis of the Comprehensive Archive

Twitter-Archive-Parser is a Python script I created to parse out some of these artifacts. For more information on this script, please visit my blog post about it and the GitHub repository.