Jefit ETL with Python

The fitness app market is enormous, but some are clearly superior than others for a given use pattern. Jefit fits this niche perfectly for me, which makes it a pity it tries its hardest to limit exporting your own data. Only by making a backup within the app (an option unavailable on its fully fledged web interface), transferring it to a computer and then decrypting it can one access the SQL database inside.

Performing the backup and transferring it are trivial, and a huge thanks to josama who posted a .jar to decrypt the file at the Jefit support forums.

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import warnings
import re
warnings.filterwarnings('ignore')

Raw export SQL below after decryption

In [2]:
conn = sqlite3.connect("jefit.bak")
cursor = conn.cursor()
df = pd.read_sql_query("SELECT * FROM sqlite_master", conn)
print(df)
     type                 name             tbl_name  rootpage  \
0   table      sqlite_sequence      sqlite_sequence         5   
1   table       exerciseRecord       exerciseRecord         9   
2   table          cardioUnits          cardioUnits         6   
3   table              profile              profile        11   
4   table       deletetracking       deletetracking         4   
5   table                notes                notes        13   
6   table              setting              setting        10   
7   table             exercise             exercise        14   
8   table       routinePackage       routinePackage        12   
9   table              workOut              workOut         8   
10  table  workOutExerciseList  workOutExerciseList         3   
11  table         exerciseLogs         exerciseLogs         7   
12  table           cardioLogs           cardioLogs        16   
13  index       eid_besys_date           cardioLogs        15   
14  table     android_metadata     android_metadata        18   
15  table               photos               photos        17   

                                                  sql  
0              CREATE TABLE sqlite_sequence(name,seq)  
1   CREATE TABLE exerciseRecord (edit_time INT(10)...  
2   CREATE TABLE cardioUnits (edit_time INT(10), _...  
3   CREATE TABLE profile (edit_time INT(10), _id i...  
4   CREATE TABLE deletetracking (_id INTEGER PRIMA...  
5   CREATE TABLE notes (mydate VARCHAR(10), title ...  
6   CREATE TABLE "setting" (Legal_Note TEXT, gende...  
7   CREATE TABLE "exercise" (rating REAL, descript...  
8   CREATE TABLE "routinePackage" (edit_time INT(1...  
9   CREATE TABLE "workOut" (edit_time INT(10),pack...  
10  CREATE TABLE "workOutExerciseList" (mysort INT...  
11  CREATE TABLE "exerciseLogs" (edit_time INT(10)...  
12  CREATE TABLE cardioLogs (lap REAL,duration INT...  
13  CREATE UNIQUE INDEX eid_besys_date ON cardioLo...  
14             CREATE TABLE android_metadata (locale)  
15  CREATE TABLE "photos" ("_id" integer PRIMARY K...  

This table tracks weight and other dimensional measurements. Note that whether one enters data through the app vs the website can result in not imputing previous values, resulting in a lot of 0 values.

In [3]:
df = pd.read_sql_query("SELECT * FROM profile", conn)
print(df.tail())
      edit_time     _id      mydate  weight  fatpercent  chest  arms  waist  \
325  1439757215  100289  2015-07-29   163.2       16.93  38.25  12.5  34.50   
326  1438365807  100290  2015-07-31   164.2        0.00   0.00   0.0   0.00   
327  1438895600  100291  2015-08-06   162.6        0.00   0.00   0.0   0.00   
328  1439756984  100292  2015-08-14   161.4       16.30  38.00  12.5  34.00   
329  1439757190  100293  2015-08-04   162.2       17.00  38.25  12.5  34.25   

     calves  height   hips  thighs  shoulders  neck  forearms  
325    13.5    68.0  39.00   21.25      44.25  14.5     11.25  
326     0.0     0.0   0.00    0.00       0.00   0.0      0.00  
327     0.0     0.0   0.00    0.00       0.00   0.0      0.00  
328    13.5    68.0  38.25   21.00      44.50  14.5     11.25  
329    13.5    68.0  39.00   20.75      44.00  14.5     11.25  

This table keeps track of all-time records for each exercise

In [4]:
df = pd.read_sql_query("SELECT * from exerciseRecord", conn)
print(df.head())
    edit_time  _id  belongSys  record  eid  target1RM
0  1439674710    1          1   245.0   12      249.0
1  1436843223    2          1     0.0  467        0.0
2  1438099707    3          1    53.0   41       90.0
3  1436843223    4          1   145.0  300        0.0
4  1436843223    5          1    14.0   23        0.0

As can be seen below, the measurements for time based activities (planks), cardio (running), and exercises with weights all follow a different scheme. In fact, some of the time based activities can have an arbitrary number of columns that need to be parsed dynamically.

In [5]:
df = pd.read_sql_query("SELECT * from exerciseLogs", conn)
print(df.tail())
print(df[df['ename'] == 'Treadmill Running'])
      edit_time  belongSys     _id      mydate  eid  \
488  1439672171          1  100044  2015-08-10   82   
489  1439672182          1  100045  2015-08-10  159   
490  1439672191          1  100046  2015-08-10  874   
491  1439672199          1  100047  2015-08-10  160   
492  1439672221          1  100048  2015-08-10  631   

                           ename                  logs  record  day_item_id  
488                      Chin Up        35x5,35x5,35x5   40.83            0  
489  Cable Rope Triceps Pushdown            20x8,30x10   40.00            0  
490         Cable Rope Face Pull           15x12,20x12   28.00            0  
491              Lying Leg Curls       62.5x12,62.5x12   87.50            0  
492                        Plank  0x0,0x0,0x0,0x0,0x63    0.00            0  
      edit_time  belongSys     _id      mydate  eid              ename  \
9    1436843223          1      10  2013-06-12  321  Treadmill Running   
16   1436843223          1      17  2013-06-15  321  Treadmill Running   
43   1436843223          1      44  2013-06-30  321  Treadmill Running   
463  1436843223          1  100018  2013-06-08  321  Treadmill Running   
464  1436843223          1  100019  2013-06-09  321  Treadmill Running   
466  1436843223          1  100021  2013-06-24  321  Treadmill Running   
469  1436843223          1  100024  2013-06-29  321  Treadmill Running   
471  1436843223          1  100026  2013-07-02  321  Treadmill Running   

                              logs  record  day_item_id  
9    0x0.0,0x0.0,0x4.7,0x0.0,0x540     0.0       100000  
16   0x0.0,0x0.0,0x4.7,0x0.0,0x960     0.0       100000  
43   0x0.0,0x0.0,0x6.5,0x0.0,0x480     0.0       100000  
463        0x0,0x0,0x4.5,0x0,0x480     0.0            0  
464        0x0,0x0,0x4.5,0x0,0x540     0.0            0  
466          0x0,0x0,0x5,0x0,0x900     0.0            0  
469        0x0,0x0,0x6.5,0x0,0x480     0.0            0  
471        0x0,0x0,0x6.5,0x0,0x600     0.0            0  

Function to parse and clean dimensional measurements

In [6]:
def load_jefit_measurements(jefitbak="jefit.bak"):
    # init sqlite
    conn = sqlite3.connect(jefitbak)

    # extract weight/bodyfat/tape measurements
    profile_logs_query = "SELECT mydate,weight,fatpercent,chest,arms,waist,calves,hips, \
                            thighs,shoulders,neck,forearms FROM profile;"
    measurements = pd.read_sql(profile_logs_query, conn, parse_dates=['mydate'])

    measurements.columns = ['Date', 'Weight', 'Bodyfat', 'Chest', 'Arms', 'Waist', 
                            'Calves', 'Hips', 'Thighs', 'Shoulders', 'Neck', 'Forearms']
    measurements = measurements.set_index('Date', drop=True).sort()

    # remove any rows with no data, separate into 3, and then nan missing data
    measurements = measurements.loc[(measurements!=0).any(axis=1)]
    bodyfat = pd.DataFrame(measurements.Bodyfat.replace(0, np.nan).dropna())
    weight = pd.DataFrame(measurements.Weight.replace(0, np.nan).dropna())
    measurements = measurements.drop(['Weight', 'Bodyfat'],axis=1).replace(0, np.nan).dropna()

    return weight, bodyfat, measurements

weight, bodyfat, measurements = load_jefit_measurements()
In [7]:
print(weight.head())
            Weight
Date              
2013-05-18   184.6
2013-05-19   183.6
2013-05-20   183.2
2013-05-21   183.0
2013-05-22   184.2
In [8]:
print(bodyfat.head())
            Bodyfat
Date               
2013-06-15     25.0
2013-06-16     25.0
2013-06-16     25.0
2013-06-17     25.0
2013-06-18     25.0
In [9]:
print(measurements.head())
            Chest  Arms  Waist  Calves  Hips  Thighs  Shoulders   Neck  \
Date                                                                     
2013-06-19   39.0  13.0  37.25    14.0  40.0    21.0       46.0  15.25   
2013-06-20   39.0  13.0  37.25    14.0  40.0    21.0       46.0  15.25   
2013-06-21   39.0  13.0  37.25    14.0  40.0    21.0       46.0  15.25   
2013-06-22   39.0  13.0  37.25    14.0  40.0    21.0       46.0  15.25   
2013-06-23   39.0  13.0  37.25    14.0  40.0    21.0       46.0  15.25   

            Forearms  
Date                  
2013-06-19      11.5  
2013-06-20      11.5  
2013-06-21      11.5  
2013-06-22      11.5  
2013-06-23      11.5  

Helper functions to clean up the different types of exercise data

In [10]:
def _time_based_df(exercises_df, time_based_list):
    # extract time based exercises from full list
    time_df = exercises_df.query('ExerciseName in %s' %time_based_list).set_index(['Date', 'ExerciseName']).sort()

    # clean log data from 0x0.0, 0x0.0 etc to simply Time
    time_df = time_df['Logs'].apply(lambda x: pd.Series(re.split('x|,', x))) # , dtype=np.float64))

    # dropping empty columns
    time_df.replace(0, np.nan, inplace=True)
    time_df.replace(0.0, np.nan, inplace=True)
    time_df = time_df.dropna(how='all', axis=1)

    # count how many columns left (i.e. max # of sets done for time-based exercises)
    num_cols = len(time_df.columns)
    cols = ['Set' + str(idx) + 'Time' for idx in range(1, num_cols+1)]
    cols.insert(0, 'ExerciseName')
    
    # time in seconds
    time_df = time_df.reset_index().set_index(['Date'])
    time_df.index = pd.to_datetime(time_df.index)
    time_df.columns = cols
    time_df['Sets'] = 1

    return time_df
In [11]:
def _cardio_based_df(exercises_df, cardio_based_list):
    # extract cardio based exercises from full list
    cardio_df = exercises_df.query('ExerciseName in %s' %cardio_based_list).set_index(['Date', 'ExerciseName']).sort()

    # clean log data from 0x0.0, 0x0.0 etc to appropriate columns
    # split , and x's using regular expressions
    cardio_df = cardio_df['Logs'].apply(lambda x: pd.Series(re.split('x|,',x)))
    cardio_df.drop(cardio_df.columns[[0,2,4,6,7,8]], axis=1, inplace=True)
    # sometimes a random tenth column with a few nan's is present.. inexplicable
    cardio_df = cardio_df.dropna(how='any', axis=1)
    
    cardio_df = cardio_df.reset_index().set_index(['Date'])
    cardio_df.index = pd.to_datetime(cardio_df.index)
    # cals, miles, mph, seconds
    cardio_df.columns = ['ExerciseName', 'CardioCalsBurned', 'CardioDistance', 'CardioSpeed', 'CardioTime']
    cardio_df['Sets'] = 1

    return cardio_df
In [12]:
def _lifts_based_df(exercises_df, non_lift_list):
    # extract lift exercises, removing cardio/time based
    lifts_df = exercises_df.query('ExerciseName not in %s' %non_lift_list).set_index(['Date', 'ExerciseName']).sort()

    # cleaning log data from 0x0.0,0x0.0,0x0,0x30,0x... garbage to weight x reps
    # first step, split by , to isolate each set
    lifts_df = lifts_df['Logs'].apply(lambda x: pd.Series(re.split(',',x)))

    # stack here makes separate row for each wgt x rep set
    lifts_df = pd.DataFrame(lifts_df.stack())

    # stacking adds another index level we don't want
    lifts_df.index = lifts_df.index.droplevel(-1)
    lifts_df.columns = ['Logs']

    # now we split each set up into wgt x rep
    lifts_df = pd.DataFrame(lifts_df.Logs.str.split('x').tolist(), columns=['Weight', 'Reps'], 
                            index=lifts_df.index, dtype=float)
    lifts_df = lifts_df.reset_index().set_index(['Date'])

    lifts_df.index = pd.to_datetime(lifts_df.index)
    
    # add sets, volume, 1rm column to all rows to help calculations later
    lifts_df['Sets'] = 1
    lifts_df['Volume'] = lifts_df.Weight * lifts_df.Reps
    lifts_df['1RM'] = (lifts_df.Volume * 0.033) + lifts_df.Weight

    return lifts_df

Combining the exercise helper functions into one call

This function will extract and transform all the exercise data from a Jefit backup file

In [13]:
def load_jefit_exercises(jefitbak="jefit.bak", cardiolist = ['Treadmill Running', 'Walking'], 
                         timelist = ['Plank']):
    # init sqlite
    conn = sqlite3.connect(jefitbak)

    # extract lifting/cardio/time based logs
    exercise_logs_query = "SELECT mydate, ename, logs FROM exerciseLogs;"
    exercises = pd.read_sql(exercise_logs_query, conn, parse_dates=['mydate'])
    exercises.columns = ['Date', 'ExerciseName', 'Logs']

    # clean all logs return in 3 dataframes
    time_df = _time_based_df(exercises, timelist)
    cardio_df = _cardio_based_df(exercises, cardiolist)
    lifts_df = _lifts_based_df(exercises, (timelist+cardiolist))
    
    # merge time_based back into lifts
    lifts_df = pd.merge(lifts_df.reset_index(), time_df.reset_index(), 
                        on=['Date', 'ExerciseName', 'Sets'], how='outer')
    lifts_df = lifts_df.set_index(['Date']).sort()
    
    # merge cardio back into lifts
    lifts_df = pd.merge(lifts_df.reset_index(), cardio_df.reset_index(), 
                        on=['Date', 'ExerciseName', 'Sets'], how='outer')
    lifts_df = lifts_df.set_index(['Date']).sort()

    # extract bodypart for each exercise
    bodypart_query = "SELECT exercisename, bodypart FROM workOutExerciseList;"
    bodyparts = pd.read_sql(bodypart_query, conn)
    bodyparts.columns = ['ExerciseName', 'Bodypart']
    
    # manually adding exercises since removed from jefit's database
    bodyparts.loc[len(bodyparts)] = ['Front Two Dumbell Raise', '6']

    # some duplicates present if have multiple routines with same exercise
    bodyparts.drop_duplicates('ExerciseName', inplace=True)
    
    lifts_df = pd.merge(lifts_df.reset_index(), bodyparts, on='ExerciseName', 
                        how='left').set_index(['Date']).sort()

    bodypart_dict = {0: 'Abs', 1: 'Back', 2: 'Biceps', 3: 'Chest', 4: 'Forearms', 
                     5: 'Glutes', 6: 'Shoulders', 7: 'Triceps', 8: 'Upper Legs', 
                     9: 'Lower Legs', 10: 'Cardio', np.nan: 'Unknown'}
    lifts_df.Bodypart = lifts_df.Bodypart.replace(bodypart_dict)
    
    lifts_df.fillna(0, inplace=True)
        
    return lifts_df
lifts_df = load_jefit_exercises()
print(lifts_df.columns)
print(lifts_df.tail())
Index(['ExerciseName', 'Weight', 'Reps', 'Sets', 'Volume', '1RM', 'Set1Time',
       'Set2Time', 'Set3Time', 'Set4Time', 'Set5Time', 'Set6Time', 'Set7Time',
       'Set8Time', 'Set9Time', 'Set10Time', 'Set11Time', 'Set12Time',
       'Set13Time', 'Set14Time', 'Set15Time', 'Set16Time', 'Set17Time',
       'Set18Time', 'Set19Time', 'Set20Time', 'CardioCalsBurned',
       'CardioDistance', 'CardioSpeed', 'CardioTime', 'Bodypart'],
      dtype='object')
                           ExerciseName  Weight  Reps  Sets  Volume      1RM  \
Date                                                                           
2015-08-15         Barbell Glute Bridge    20.0  12.0     1   240.0   27.920   
2015-08-15                Barbell Squat   155.0   5.0     1   775.0  180.575   
2015-08-15                Barbell Squat   165.0   5.0     1   825.0  192.225   
2015-08-15  Cable Rope Triceps Pushdown    40.0   8.0     1   320.0   50.560   
2015-08-15  Barbell Incline Bench Press   170.0   5.0     1   850.0  198.050   

           Set1Time Set2Time Set3Time Set4Time     ...     Set16Time  \
Date                                               ...                 
2015-08-15        0        0        0        0     ...             0   
2015-08-15        0        0        0        0     ...             0   
2015-08-15        0        0        0        0     ...             0   
2015-08-15        0        0        0        0     ...             0   
2015-08-15        0        0        0        0     ...             0   

           Set17Time Set18Time Set19Time Set20Time CardioCalsBurned  \
Date                                                                  
2015-08-15         0         0         0         0                0   
2015-08-15         0         0         0         0                0   
2015-08-15         0         0         0         0                0   
2015-08-15         0         0         0         0                0   
2015-08-15         0         0         0         0                0   

           CardioDistance CardioSpeed CardioTime    Bodypart  
Date                                                          
2015-08-15              0           0          0      Glutes  
2015-08-15              0           0          0  Upper Legs  
2015-08-15              0           0          0  Upper Legs  
2015-08-15              0           0          0     Triceps  
2015-08-15              0           0          0       Chest  

[5 rows x 31 columns]

Finally, we encapsulate everything with the Load function, which will store all the dataframes as tables in a database

In [14]:
def loadJefit(jefitbak="jefit.bak", jefitDB="jefit_clean.db", cardiolist = ['Treadmill Running', 'Walking'], timelist = ['Plank']):
    conn = sqlite3.connect(jefitDB)
    lifts = load_jefit_exercises(jefitbak, cardiolist, timelist)
    weight, bodyfat, measurements = load_jefit_measurements()
    lifts.to_sql('lifts', conn, if_exists='replace')
    weight.to_sql('weight', conn, if_exists='replace')
    bodyfat.to_sql('bodyfat', conn, if_exists='replace')
    measurements.to_sql('measurements', conn, if_exists='replace')
    conn.commit()
    conn.close()
loadJefit()

The resulting transformed and cleaned database

In [15]:
conn = sqlite3.connect("jefit_clean.db")
cursor = conn.cursor()
df = pd.read_sql_query("SELECT * FROM sqlite_master", conn)
print(df)
    type                  name      tbl_name  rootpage  \
0  table                 lifts         lifts         2   
1  index         ix_lifts_Date         lifts         5   
2  table                weight        weight         6   
3  index        ix_weight_Date        weight         7   
4  table               bodyfat       bodyfat        19   
5  index       ix_bodyfat_Date       bodyfat        22   
6  table          measurements  measurements        26   
7  index  ix_measurements_Date  measurements        27   

                                                 sql  
0  CREATE TABLE "lifts" (\n"Date" TIMESTAMP,\n  "...  
1    CREATE INDEX "ix_lifts_Date"ON "lifts" ("Date")  
2  CREATE TABLE "weight" (\n"Date" TIMESTAMP,\n  ...  
3  CREATE INDEX "ix_weight_Date"ON "weight" ("Date")  
4  CREATE TABLE "bodyfat" (\n"Date" TIMESTAMP,\n ...  
5  CREATE INDEX "ix_bodyfat_Date"ON "bodyfat" ("D...  
6  CREATE TABLE "measurements" (\n"Date" TIMESTAM...  
7  CREATE INDEX "ix_measurements_Date"ON "measure...  

Confirming everything is still as it should be

In [16]:
df = pd.read_sql_query("SELECT * FROM lifts", conn)
print(df.tail(2))
                     Date                 ExerciseName  Weight  Reps  Sets  \
1087  2015-08-15 00:00:00  Cable Rope Triceps Pushdown    40.0   8.0     1   
1088  2015-08-15 00:00:00  Barbell Incline Bench Press   170.0   5.0     1   

      Volume     1RM Set1Time Set2Time Set3Time   ...    Set16Time Set17Time  \
1087   320.0   50.56        0        0        0   ...            0         0   
1088   850.0  198.05        0        0        0   ...            0         0   

     Set18Time Set19Time Set20Time CardioCalsBurned CardioDistance  \
1087         0         0         0                0              0   
1088         0         0         0                0              0   

     CardioSpeed CardioTime Bodypart  
1087           0          0  Triceps  
1088           0          0    Chest  

[2 rows x 32 columns]

Now that we finally have access to our Jefit data, we can begin having some fun deriving insights! More to follow.

Leave a Reply

Your email address will not be published. Required fields are marked *