米国フライトデータの元データの取得

Data Expo 2009でコンテストの題材とされた米国のフライトデータは,R界隈でも大規模データセットの例として頻繁に使用されます.

上記のData Expo 2009のページで配布されているデータは,コンテストのために必要なデータ項目だけが抽出されています.元データは,米国運輸省の調査・革新技術庁(RITA : Research and Innovative Technology Administration)の下部組織である米国運輸統計局(BTS : Bureau of Transportation Statistics)のページから取得が可能です.なお,BTSは,フライトデータだけでなく,高速道路や鉄道など運輸全般のデータを公開しています.

BTSは,フライトデータの元データを月ごとにzipファイルで提供しており,以下のスクリプトにより取得できます(参考ページ).

getdata.sh

:

# BTSのデータ公開のURL
BASE_URL="http://transtats.bts.gov/Download/On_Time_On_Time_Performance"

# 1987年は9月以前はヘッダのみのデータのzipファイルが提供されているので,10月以降を取得
YEAR=1987
for MONTH in {10..12}
do
  echo $MONTH
  wget ${BASE_URL}_${YEAR}_${MONTH}.zip
done

# 1988年以降は各月に対してzipファイルが提供されている
for YEAR in {1988..2013}
do
  echo $YEAR
  for MONTH in {1..12}
  do
    wget ${BASE_URL}_${YEAR}_${MONTH}.zip
  done
done

こうして取得したデータは,以下のスクリプトにより一つのファイルに集約できます.

makedata.sh

:

BASE_IFN="On_Time_On_Time_Performance"
OFN="airline.csv"

# 出力ファイルが既に存在する場合は削除する
if [ -e $OFN ]; then
  rm $OFN
fi

# 1987年10月はヘッダごと出力ファイルにコピー,以降はヘッダを除いてコピーする
YEAR=1987 
echo $YEAR
for MONTH in {10..12}
do
  ifn=${BASE_IFN}_${YEAR}_${MONTH}
  unzip -o $ifn.zip
  if [ ! -e $OFN ]; then
    cp $ifn.csv $OFN
  else
    tail -n +2 $ifn.csv >> $OFN
  fi
done
  
# 1988年以降は各月に対して,ヘッダを除いてコピーする
for YEAR in {1988..2013}
do
  echo $YEAR
  for MONTH in {1..12}
  do
    ifn=${BASE_IFN}_${YEAR}_${MONTH}
    if [ -e $ifn.zip ]; then
      unzip -o $ifn.zip
      tail -n +2 $ifn.csv >> $OFN
    fi
  done
done

このようして作成したファイルは,約1億5,000レコード,64GBというサイズになっています.

$ wc -l airline.csv
152636070 airline.csv
$ ls -lh airline.csv
-rw-r--r-- 1 sfchaos sfchaos 64G Dec 31 19:40 airline.csv

Data Expo 2009で提供されたデータは29項目しかありませんでしたが,元データは112項目となっています.

$ head airline.csv
"Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum","OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips","OriginStateName","OriginWac","DestAirportID","DestAirportSeqID","DestCityMarketID","Dest","DestCityName","DestState","DestStateFips","DestStateName","DestWac","CRSDepTime","DepTime","DepDelay","DepDelayMinutes","DepDel15","DepartureDelayGroups","DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime","ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15","ArrivalDelayGroups","ArrTimeBlk","Cancelled","CancellationCode","Diverted","CRSElapsedTime","ActualElapsedTime","AirTime","Flights","Distance","DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay","FirstDepTime","TotalAddGTime","LongestAddGTime","DivAirportLandings","DivReachedDest","DivActualElapsedTime","DivArrDelay","DivDistance","Div1Airport","Div1AirportID","Div1AirportSeqID","Div1WheelsOn","Div1TotalGTime","Div1LongestGTime","Div1WheelsOff","Div1TailNum","Div2Airport","Div2AirportID","Div2AirportSeqID","Div2WheelsOn","Div2TotalGTime","Div2LongestGTime","Div2WheelsOff","Div2TailNum","Div3Airport","Div3AirportID","Div3AirportSeqID","Div3WheelsOn","Div3TotalGTime","Div3LongestGTime","Div3WheelsOff","Div3TailNum","Div4Airport","Div4AirportID","Div4AirportSeqID","Div4WheelsOn","Div4TotalGTime","Div4LongestGTime","Div4WheelsOff","Div4TailNum","Div5Airport","Div5AirportID","Div5AirportSeqID","Div5WheelsOn","Div5TotalGTime","Div5LongestGTime","Div5WheelsOff","Div5TailNum",
1987,4,10,19,1,1987-10-19,"CO",19704,"CO","","597",10821,1082102,30852,"BWI","Baltimore, MD","MD","24","Maryland",35,12266,1226601,31453,"IAH","Houston, TX","TX","48","Texas",74,"1650","1649",-1.00,0.00,0.00,-1,"1600-1659",,"","",,"1850","1858",8.00,8.00,0.00,0,"1800-1859",0.00,"",0.00,180.00,189.00,,1.00,1235.00,5,,,,,,"",,,,,,,,"",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","",
1987,4,10,20,2,1987-10-20,"CO",19704,"CO","","597",10821,1082102,30852,"BWI","Baltimore, MD","MD","24","Maryland",35,12266,1226601,31453,"IAH","Houston, TX","TX","48","Texas",74,"1650","1651",1.00,1.00,0.00,0,"1600-1659",,"","",,"1850","1905",15.00,15.00,1.00,1,"1800-1859",0.00,"",0.00,180.00,194.00,,1.00,1235.00,5,,,,,,"",,,,,,,,"",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","",
1987,4,10,21,3,1987-10-21,"CO",19704,"CO","","597",10821,1082102,30852,"BWI","Baltimore, MD","MD","24","Maryland",35,12266,1226601,31453,"IAH","Houston, TX","TX","48","Texas",74,"1650","1653",3.00,3.00,0.00,0,"1600-1659",,"","",,"1850","1916",26.00,26.00,1.00,1,"1800-1859",0.00,"",0.00,180.00,203.00,,1.00,1235.00,5,,,,,,"",,,,,,,,"",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","",
1987,4,10,22,4,1987-10-22,"CO",19704,"CO","","597",10821,1082102,30852,"BWI","Baltimore, MD","MD","24","Maryland",35,12266,1226601,31453,"IAH","Houston, TX","TX","48","Texas",74,"1650","1649",-1.00,0.00,0.00,-1,"1600-1659",,"","",,"1850","1854",4.00,4.00,0.00,0,"1800-1859",0.00,"",0.00,180.00,185.00,,1.00,1235.00,5,,,,,,"",,,,,,,,"",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","",
1987,4,10,23,5,1987-10-23,"CO",19704,"CO","","597",10821,1082102,30852,"BWI","Baltimore, MD","MD","24","Maryland",35,12266,1226601,31453,"IAH","Houston, TX","TX","48","Texas",74,"1650","1647",-3.00,0.00,0.00,-1,"1600-1659",,"","",,"1850","1916",26.00,26.00,1.00,1,"1800-1859",0.00,"",0.00,180.00,209.00,,1.00,1235.00,5,,,,,,"",,,,,,,,"",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","",
1987,4,10,24,6,1987-10-24,"CO",19704,"CO","","597",10821,1082102,30852,"BWI","Baltimore, MD","MD","24","Maryland",35,12266,1226601,31453,"IAH","Houston, TX","TX","48","Texas",74,"1650","1650",0.00,0.00,0.00,0,"1600-1659",,"","",,"1850","1857",7.00,7.00,0.00,0,"1800-1859",0.00,"",0.00,180.00,187.00,,1.00,1235.00,5,,,,,,"",,,,,,,,"",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","",
1987,4,10,25,7,1987-10-25,"CO",19704,"CO","","597",10821,1082102,30852,"BWI","Baltimore, MD","MD","24","Maryland",35,12266,1226601,31453,"IAH","Houston, TX","TX","48","Texas",74,"1650","1653",3.00,3.00,0.00,0,"1600-1659",,"","",,"1850","1906",16.00,16.00,1.00,1,"1800-1859",0.00,"",0.00,180.00,193.00,,1.00,1235.00,5,,,,,,"",,,,,,,,"",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","",
1987,4,10,26,1,1987-10-26,"CO",19704,"CO","","597",10821,1082102,30852,"BWI","Baltimore, MD","MD","24","Maryland",35,12266,1226601,31453,"IAH","Houston, TX","TX","48","Texas",74,"1650","1651",1.00,1.00,0.00,0,"1600-1659",,"","",,"1850","1857",7.00,7.00,0.00,0,"1800-1859",0.00,"",0.00,180.00,186.00,,1.00,1235.00,5,,,,,,"",,,,,,,,"",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","",
1987,4,10,27,2,1987-10-27,"CO",19704,"CO","","597",10821,1082102,30852,"BWI","Baltimore, MD","MD","24","Maryland",35,12266,1226601,31453,"IAH","Houston, TX","TX","48","Texas",74,"1650","1648",-2.00,0.00,0.00,-1,"1600-1659",,"","",,"1850","1858",8.00,8.00,0.00,0,"1800-1859",0.00,"",0.00,180.00,190.00,,1.00,1235.00,5,,,,,,"",,,,,,,,"",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","","",,,"",,,"","",

それでは来年も良いお年を!