Scraping MyFitnessPal with Python

UPDATE: Updated some methods for recent MFP site changes including JavaScript handling.

MyFitnessPal is a great website and app to log nutritional intake and other health metrics. To really delve deep into the numbers and find patterns, I wanted to import the data into Python.

In sum, the code below produces a SQLite database with a table mfp_profile that includes

  • Age, Gender, Height
  • Caloric Goals
  • Macronutrient Ratio Goals (Carb, Protein, Fat limits)
  • Goal Weight
  • Desired rate of weight loss
  • Activity Level
  • Preference for imperial vs metric

as well as a table mfp_data indexed by date including

  • Weight
  • Calories
  • Protein
  • Fat
  • Carbs
  • Fiber
  • Sugar
  • Sodium
  • Saturated, Trans, Poly, and Monounsaturated Fat
  • Cholesterol
  • Potassium, Calcium, Iron, Vitamin A, Vitamin C
  • Minutes spent Exercising
  • Calories Burned
  • Net Calories
  • Custom entry categories (body fat in the code example below)

Usage:

loadMfp('USERNAME', 'PASSWORD', 'database_name.db', 'start_date')

The start date is a string of the form 'YYYY-MM-DD'. If no date is entered, the loader will automatically try to pull the last 1500 days.
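
The date handling described above can be sketched on its own. This is a minimal illustration rather than the code used later in the post: the helper name resolve_window and its today and default_days parameters are made up for the example, while the 'YYYY-MM-DD' format, the 1500-day default and the inclusive day count (num_days.days+1 in the scraper) come from the code below.

```python
import datetime


def resolve_window(start=None, today=None, default_days=1500):
    """Return (start_date, end_date, n_days) for a scrape window.

    resolve_window is a hypothetical helper; `today` exists only so the
    example is reproducible.
    """
    end_date = today or datetime.date.today()
    if start is None:
        # no start date given: fall back to the default look-back period
        start_date = end_date - datetime.timedelta(days=default_days)
    else:
        # start date given as a 'YYYY-MM-DD' string
        start_date = datetime.datetime.strptime(start, "%Y-%m-%d").date()
    if start_date > end_date:
        raise ValueError("start date lies after the end date")
    # both endpoints are counted, matching num_days.days + 1 in the scraper
    n_days = (end_date - start_date).days + 1
    return start_date, end_date, n_days


start_date, end_date, n_days = resolve_window(
    "2020-01-01", today=datetime.date(2020, 1, 10)
)
print(start_date, end_date, n_days)
```

Counting both endpoints means the default window actually covers 1501 calendar days, which is also what the scraper requests.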

In [1]:
import urllib  # accessing site
import http.cookiejar as cookielib  # storing cookies for access
from bs4 import BeautifulSoup  # for reading data from site
import time  # for sleep timers between page requests
import datetime  # organizing dates to scrape
import pandas as pd  # dataframe workhorse
import json # for json structured calorie/macro goals from MFP
import sqlite3 # for loading data into db
from selenium import webdriver
import selenium.webdriver.support.ui  # so that webdriver.support.ui.Select resolves below
import warnings
warnings.filterwarnings('ignore')

Big thanks to martinjc for providing most of the MfpExtractor class.

In [2]:
class MfpExtractor(object):

    def __init__(self, username, password):

        # url for website
        self.base_url = 'http://www.myfitnesspal.com'
        # login action we want to post data to
        self.login_action = '/account/login'
        # file for storing cookies
        self.cookie_file = 'mfp.cookies'

        # user provided username and password
        self.username = username
        self.password = password

        # set up a cookie jar to store cookies
        self.cj = cookielib.MozillaCookieJar(self.cookie_file)

        # set up opener to handle cookies, redirects etc
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPRedirectHandler(),
            urllib.request.HTTPHandler(debuglevel=0),
            urllib.request.HTTPSHandler(debuglevel=0),
            urllib.request.HTTPCookieProcessor(self.cj)
        )

        self.opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    def access_page(self, path, num_days):

        # strip the path
        path = path.lstrip('/')
        path = path.rstrip('/')

        # construct the url
        url = self.base_url + '/' + path + '/' + str(num_days)
        print('Retrieving the page ' + str(url))

        # retrieve the web page
        try:
            response = self.opener.open(url)
        except urllib.error.HTTPError as e:
            raise e
        except urllib.error.URLError as e:
            raise e

        # return the data from the page
        return response.read()

    # method to do login
    def login(self):

        # open the front page of the website to set and save initial cookies
        response = self.opener.open(self.base_url)
        soup = BeautifulSoup(response)
        token = soup.find('input', attrs={'name': 'authenticity_token'})['value']

        login_data = urllib.parse.urlencode({
            'username': self.username,
            'password': self.password,
            'remember_me': True,
            'authenticity_token': token
        }).encode('ascii')

        # construct the url
        login_url = 'https://www.myfitnesspal.com' + self.login_action
        # then open it
        try:
            self.opener.open(login_url, login_data)
        except urllib.error.URLError as e:
            raise e
        # save the cookies
        self.cj.save()

    # method to get progress data i.e. weight and any custom entries
    def get_progress_report(self, path, num_days):
        report_path = 'reports/results/progress/' + path
        return self.access_page(report_path, num_days)

    # method to get nutrition data i.e. macros, micros, cals
    def get_nutrition_report(self, path, num_days):
        report_path = 'reports/results/nutrition/' + path
        return self.access_page(report_path, num_days)

    def get_fitness_report(self, path, num_days):
        report_path = 'reports/results/fitness/' + path
        return self.access_page(report_path, num_days)

Scraping Profile and Nutrition Data

Profile includes goal weight, age, goal calories, macronutrients, and desired rate of weight loss

Nutrition includes daily calories, weight, carbs, fat, protein, fiber, micronutrients and any other custom entries

In [11]:
def start_extract(un, pw, *args):
    """
    This will scrape MFP for all user's data

    :param un: username(str)
    :param pw: password(str)
    :param args: optional start date (str, 'YYYY-MM-DD');
                 defaults to the last 1500 days if none is entered
    :return: pandas dataframe with a single row of MFP profile info
             pandas dataframe indexed by date containing all nutritional+etc MFP info
    """

    username = un
    password = pw

    # check how many days to retrieve
    if len(args) == 1:
        # start date specified
        start_date = datetime.datetime.strptime(args[0], "%Y-%m-%d").date()
        end_date = datetime.date.today()
        num_days = (end_date - start_date)
#        start_date = end_date - datetime.timedelta(args[0])    
    else:
        # no dates specified, default to 1500 days from today (about 4 years)
        end_date = datetime.date.today()
        num_days = datetime.timedelta(days=1500.0)
        start_date = end_date - num_days

    print('Retrieving data for %s days' % str(num_days.days+1))

    # initialise an MfpExtractor and login to the website
    mfp = MfpExtractor(username, password)
    mfp.login()

    def scrapeProfile():
        profiledict = {}

        # guided_goals
        time.sleep(3)
        html = mfp.access_page('account/change_goals_guided/','')
        soup = BeautifulSoup(html)

        # determine whether imperial or metric
        profiledict['Imperial'] = [True]
        if soup.find('label', {'for':'weight_value_display_value'}).string != 'lbs':
            profiledict['Imperial'] = [False]
        profiledict['GoalWeight'] = [float(soup.find('input', {'id':'profile_goal_weight_display_value'})['value'])]
        
        # determine gender
        profiledict['Gender'] = ['male']
        if soup.find_all('input', {'checked':'checked', 'name':'profile[sex]'})[0]['value'] != 'M':
            profiledict['Gender'] = ['female']


        # determine activity level
        activitydict = {'1': 'sedentary', '2': 'light', '3': 'moderate', '4': 'heavy'}
        profiledict['Activity'] = [activitydict[soup.find_all('input', {'checked': 'checked', 'name': 'profile[pal]'})[0]['value']]]

        # determine weekly weight loss desired
        profiledict['WeeklyChange'] = [float(soup.find('select', {'id': 'profile_goal_loss_per_week'}).find('option', {'selected':'selected'})['value'])]

        # determine height in inches or cm
        if profiledict['Imperial'] == [True]:
            ftquery = float(soup.find_all('input', {'id': 'profile_height_large_value'})[0]['value'])
            inquery = float(soup.find_all('input', {'id': 'profile_height_small_value'})[0]['value'])
            profiledict['Height'] = [round(ftquery*12 + inquery, 1)]
        else:
            profiledict['Height'] = [float(soup.find_all('input', {'id': 'profile_height_display_value'})[0]['value'])]


        # determine age
        # UPDATED FOR NEW CHANGES
        url = 'profile/' + un + '/'
        html = mfp.access_page(url, '')
        soup = BeautifulSoup(html)
        profiledict['Age'] = soup.find_all('h5')[0].get_text().split(" ")[0]
#         dob = [int(x['value']) for x in soup.find_all('option', {'selected':'selected'}, limit=3)]
#         dob = datetime.date(dob[2], dob[0], dob[1])
#         today = datetime.date.today()
#         profiledict['Age'] = [today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))]

            
        # macro goals
        
        # Obsolete due to changes in site to javascript
#         time.sleep(3)
#         html = mfp.access_page('account/my_goals/','')
#         soup = BeautifulSoup(html)

#         query = soup.find_all(lambda tag: tag.name == 'script' and len(tag.attrs) == 0)#[1].string[70:-105]
#         jsonprep = '{%s}' % (query.split('{', 1)[1].rsplit('}', 1)[0],)
#         jsoned = json.loads(jsonprep)

#         profiledict['CalGoal'] = [jsoned['user']['goal_preferences']['daily_energy_goal']['value']]
#         profiledict['CarbRatio'] = [jsoned['user']['goal_preferences']['carb_ratio']]
#         profiledict['ProteinRatio'] = [jsoned['user']['goal_preferences']['protein_ratio']]
#         profiledict['FatRatio'] = [jsoned['user']['goal_preferences']['fat_ratio']]
        
        url = 'http://www.myfitnesspal.com/account/my_goals/daily_nutrition_goals'
        driver = webdriver.PhantomJS()
        driver.set_window_size(1120, 550)

        driver.get(url)
        driver.implicitly_wait(10)

        if driver.current_url == url:
            # we are logged in getting results
            pass
        else:
            # log in if necessary using this engine
            # print("Logging In")
            try:
                # send_keys returns None, so there is nothing worth assigning
                driver.find_element_by_id("username").send_keys(un)
                driver.find_element_by_id("password").send_keys(pw)

                driver.find_element_by_xpath("//input [@type='submit' and  @value='Log In']").click()
                time.sleep(10)
            except:
                pass

        # check that we are in the right place now, else redirect
        if driver.current_url == url:
            pass
        else:
            driver.get(url)
            time.sleep(10)

        profiledict['CalGoal'] = driver.find_element_by_xpath("//input [@type='text' \
                                    and  @id='ember1677']").get_attribute("value")
        profiledict['CarbRatio'] = webdriver.support.ui.Select(driver.find_element_by_id('ember1728'))\
                                    .first_selected_option.text.strip("%")
        profiledict['FatRatio'] = webdriver.support.ui.Select(driver.find_element_by_id('ember1780'))\
                                    .first_selected_option.text.strip("%")
        profiledict['ProteinRatio'] = webdriver.support.ui.Select(driver.find_element_by_id('ember1824'))\
                                    .first_selected_option.text.strip("%")

        profiledf = pd.DataFrame.from_dict(profiledict, dtype='float')

        return profiledf

    def scrapeData():
        dates = [start_date + datetime.timedelta(days=x) for x in range(0, num_days.days+1)]

        totaldict = {}

        # fitness measurements
        fitness_paths = {'ExerciseMins': 'Exercise%20Minutes', 'CalsBurned': 'Calories%20Burned'}
        for key in fitness_paths:
            time.sleep(3)
            html = mfp.get_fitness_report(fitness_paths[key], num_days.days+1)
            soup = BeautifulSoup(html)
            vals = soup.find_all('number')
            temp = []
            for val in vals:
                temp.append(val.string)
            totaldict[key] = temp

        '''
        progress measurements including custom entries, for custom entries go to
        "view-source:http://www.myfitnesspal.com/reports" in your browser and search for 
        "MFP.Reports.menu.init". Immediately following you will find a similar dict 
        structure with any custom variables. Below I include an example using my custom 
        entry "Body Fat"
        '''
        prog_paths = {'Weight': '1', 'Bodyfat': '94738698'}
        # bodyfat is a custom measurement example, remove or replace here and below in orderedcols
        for key in prog_paths:
            time.sleep(3)
            try:
                html = mfp.get_progress_report(prog_paths[key], num_days.days+1)
                soup = BeautifulSoup(html)
                vals = soup.find_all('number')
                temp = []
                for val in vals:
                    temp.append(val.string)
                totaldict[key] = temp
            except:
                print('No ' + str(key) + ' found.')

        # nutrition measurements
        nutr_paths = {'Calories': 'Calories', 'Carbs': 'Carbs', 'Fat': 'Fat', 'Protein': 'Protein',
                      'Fiber': 'Fiber', 'Sugar': 'Sugar', 'SatFat': 'Saturated%20Fat',
                      'PolyFat': 'Polyunsaturated%20Fat', 'MonoFat': 'Monounsaturated%20Fat',
                      'TransFat': 'Trans%20Fat', 'Cholesterol': 'Cholesterol', 'Sodium': 'Sodium',
                      'Potassium': 'Potassium', 'VitA': 'Vitamin%20A', 'VitC': 'Vitamin%20C',
                      'Iron': 'Iron', 'Calcium': 'Calcium', 'NetCals': 'Net%20Calories'}
        for key in nutr_paths:
            time.sleep(3)
            html = mfp.get_nutrition_report(nutr_paths[key], num_days.days+1)
            soup = BeautifulSoup(html)
            vals = soup.find_all('number')
            temp = []
            for val in vals:
                temp.append(val.string)
            totaldict[key] = temp
            
        mfp_daily = pd.DataFrame.from_dict(totaldict, orient='columns', dtype='float')
        mfp_daily.index = dates
        mfp_daily.index.name = 'Date'

        # note that bodyfat is a custom column below, remove if only using default MFP categories
        orderedcols = ['Weight', 'Calories', 'Protein', 'Fat', 'Carbs', 'Fiber', 'Sugar', 'Sodium', 'SatFat',
                       'TransFat', 'PolyFat', 'MonoFat', 'Cholesterol', 'Potassium', 'VitA', 'VitC',
                       'Iron', 'Calcium', 'Bodyfat', 'ExerciseMins', 'CalsBurned', 'NetCals']

        mfp_daily = mfp_daily[orderedcols]

        print("Done scraping " + str(num_days.days+1) + " days worth of data.")

        return mfp_daily

    return scrapeProfile(), scrapeData()
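
Each report loop above boils down to collecting the text of every number tag and lining the values up with consecutive dates. The following is a rough standalone sketch of that step. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs without extra packages, and the sample payload is invented: only the number tag name comes from the scraper, so the real report markup may well look different. The helper names are hypothetical.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical payload: only the <number> tag name is taken from the scraper;
# the wrapper elements and the values are made up for illustration.
SAMPLE_REPORT = (
    "<chart><row>"
    "<number>181.4</number><number>0</number><number>180.9</number>"
    "</row></chart>"
)


def extract_numbers(payload):
    """Return every <number> value in document order as a float."""
    root = ET.fromstring(payload)
    return [float(node.text) for node in root.iter("number")]


def pair_with_dates(values, start_date):
    """Attach consecutive dates to the values, oldest first,
    as the scraper does when it sets the dataframe index."""
    return [
        (start_date + datetime.timedelta(days=offset), value)
        for offset, value in enumerate(values)
    ]


values = extract_numbers(SAMPLE_REPORT)
series = pair_with_dates(values, datetime.date(2013, 8, 6))
for day, value in series:
    print(day, value)
```

Pairing values with dates explicitly also makes it easy to check that a report returned as many values as there are days in the window before building the dataframe.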

Loader function encapsulates the above:

In [4]:
def loadMfp(username, password, mfpDB="mfp_clean.db", *args):
    conn = sqlite3.connect(mfpDB)
    mfp_profile, mfp_data = start_extract(username, password, *args)
    mfp_profile.to_sql('mfp_profile', conn, if_exists='replace')
    mfp_data.to_sql('mfp_data', conn, if_exists='replace')
    conn.commit()
    conn.close()

The start date is optional; if none is entered, it will scrape the last 1500 days (about 4 years).

Note that the progress categories 94738698 (a custom body fat entry) and 1 (weight) are iterated through in the run below.
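
If a custom category such as body fat does not exist on an account, the progress loop only prints a message, and the later mfp_daily[orderedcols] selection would then most likely fail with a KeyError because the column was never created. One possible way around this, sketched here with a hypothetical helper name and made-up sample values, is to keep only the ordered columns that were actually scraped:

```python
def available_columns(ordered, scraped):
    """Keep the preferred column order but drop names that were not scraped.

    available_columns is a hypothetical helper, not part of the scraper.
    """
    return [name for name in ordered if name in scraped]


# shortened stand-ins for orderedcols and totaldict; the values are made up
orderedcols = ['Weight', 'Calories', 'Bodyfat', 'NetCals']
totaldict = {'Calories': ['2176'], 'NetCals': ['2176'], 'Weight': ['181.4']}

cols = available_columns(orderedcols, totaldict)
print(cols)
```

In the scraper this would amount to selecting mfp_daily[available_columns(orderedcols, totaldict)] instead of mfp_daily[orderedcols].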

In [12]:
loadMfp('USERNAME', 'PASSWORD', "mfp_clean.db", '2011-07-10')
Retrieving data for 2060 days
Retrieving the page http://www.myfitnesspal.com/account/change_goals_guided/
Retrieving the page http://www.myfitnesspal.com/profile/USERNAME/
Retrieving the page http://www.myfitnesspal.com/reports/results/fitness/Exercise%20Minutes/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/fitness/Calories%20Burned/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/progress/94738698/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/progress/1/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Vitamin%20C/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Saturated%20Fat/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Trans%20Fat/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Iron/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Monounsaturated%20Fat/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Vitamin%20A/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Polyunsaturated%20Fat/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Potassium/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Sodium/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Calories/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Cholesterol/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Protein/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Carbs/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Fat/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Fiber/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Sugar/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Calcium/2060
Retrieving the page http://www.myfitnesspal.com/reports/results/nutrition/Net%20Calories/2060
Done scraping 2060 days worth of data.

Examining our new database of MyFitnessPal data

In [13]:
conn = sqlite3.connect("mfp_clean.db")
cursor = conn.cursor()
df = pd.read_sql_query("SELECT * FROM sqlite_master", conn)
print(df)
    type                  name     tbl_name  rootpage  \
0  table           mfp_profile  mfp_profile         2   
1  index  ix_mfp_profile_index  mfp_profile         3   
2  table              mfp_data     mfp_data         4   
3  index      ix_mfp_data_Date     mfp_data         5   

                                                 sql  
0  CREATE TABLE "mfp_profile" (\n"index" INTEGER,...  
1  CREATE INDEX "ix_mfp_profile_index"ON "mfp_pro...  
2  CREATE TABLE "mfp_data" (\n"Date" DATE,\n  "We...  
3  CREATE INDEX "ix_mfp_data_Date"ON "mfp_data" (...  

Goal weight, macronutrient ratios, calories, etc. are stored in the mfp_profile table

In [14]:
df = pd.read_sql_query("SELECT * FROM mfp_profile", conn)
print(df)
   index   Activity   Age  CalGoal  CarbRatio  FatRatio Gender  GoalWeight  \
0      0  sedentary  30.0   1500.0       30.0      25.0   male       150.0   

   Height  Imperial  ProteinRatio  WeeklyChange  
0    68.0       1.0          45.0           2.0  

Daily information is stored in the mfp_data table

In [15]:
df = pd.read_sql_query("SELECT * FROM mfp_data", conn)
print(df.columns)
print(df.iloc[758:763])
Index(['Date', 'Weight', 'Calories', 'Protein', 'Fat', 'Carbs', 'Fiber',
       'Sugar', 'Sodium', 'SatFat', 'TransFat', 'PolyFat', 'MonoFat',
       'Cholesterol', 'Potassium', 'VitA', 'VitC', 'Iron', 'Calcium',
       'Bodyfat', 'ExerciseMins', 'CalsBurned', 'NetCals'],
      dtype='object')
           Date  Weight  Calories  Protein   Fat  Carbs  Fiber  Sugar  Sodium  \
758  2013-08-06     0.0    2176.0    169.0  72.0  228.0   21.0   51.0  5910.0   
759  2013-08-07     0.0    1699.0    185.0  44.0  167.0   37.0   71.0  2303.0   
760  2013-08-08     0.0    1294.0    145.0  36.0  115.0   14.0   41.0  3419.0   
761  2013-08-09     0.0    2781.0    138.0  77.0  132.0   18.0   41.0  3592.0   
762  2013-08-10     0.0    1244.0    102.0  50.0  127.0   24.0   28.0  1715.0   

     SatFat   ...     Cholesterol  Potassium  VitA   VitC  Iron  Calcium  \
758    25.0   ...           462.0     1958.0  50.0  332.0  55.0    228.0   
759    12.0   ...           363.0     1253.0   7.0   17.0  45.0     84.0   
760    11.0   ...           349.0     1287.0  10.0   18.0  43.0    102.0   
761    20.0   ...           321.0     2769.0  14.0   75.0  28.0      6.0   
762    18.0   ...           195.0     1912.0  14.0    0.0  20.0    134.0   

     Bodyfat  ExerciseMins  CalsBurned  NetCals  
758      0.0           0.0         0.0   2176.0  
759      0.0           0.0         0.0   1699.0  
760      0.0           0.0         0.0   1294.0  
761      0.0           0.0         0.0   2781.0  
762      0.0           0.0         0.0   1244.0  

[5 rows x 23 columns]
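
In the rows above, Weight and Bodyfat are 0.0 on days that appear to have no entry, so a plain average over those columns would be dragged toward zero. Assuming a zero really does mean "nothing logged", SQL's NULLIF can turn it into NULL before aggregating. The sketch below uses a throwaway in-memory database with a cut-down mfp_data table; the calorie figures are copied from the output above, but the two non-zero weights are invented for the example.

```python
import sqlite3

# throwaway in-memory database with a cut-down mfp_data table
demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE mfp_data ("Date" DATE, "Weight" REAL, "Calories" REAL)')
demo.executemany(
    "INSERT INTO mfp_data VALUES (?, ?, ?)",
    [
        ("2013-08-06", 0.0, 2176.0),    # no weigh-in logged
        ("2013-08-07", 181.0, 1699.0),  # weight invented for the example
        ("2013-08-08", 179.0, 1294.0),  # weight invented for the example
    ],
)

# NULLIF(Weight, 0) yields NULL for zero rows, and AVG/COUNT skip NULLs
naive_avg, logged_avg, logged_days = demo.execute(
    "SELECT AVG(Weight), AVG(NULLIF(Weight, 0)), COUNT(NULLIF(Weight, 0)) "
    "FROM mfp_data"
).fetchone()
demo.close()

print(naive_avg, logged_avg, logged_days)
```

The same NULLIF expression can be used in the pd.read_sql_query calls against mfp_clean.db, provided zeros are never legitimate values for the column in question.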

With access to detailed nutritional information we can delve deeper into analysis. More to follow.