This file accompanies the FIC industry database and describes where the data comes from, the papers that should be cited when providing academic references, and some very important technical details regarding its usage. Please read the technical details in full before using this data, as they are critical to proper usage. The data is at the FIC-300 industry x year level.

**********************************************************************************************************************************
******************************************* General Background on FIC industries *************************************************
**********************************************************************************************************************************

For an extensive description of this data, please read the data and methodology sections of the studies noted below. Here is a brief description.

This data is based on web crawling and text parsing algorithms that process the text in the business descriptions of 10-K annual filings on the SEC Edgar website from 1996 to 2008. These product descriptions are legally required to be accurate, as Item 101 of Regulation S-K requires that firms describe the significant products they offer to the market. The descriptions must also be updated and representative of the current fiscal year of the 10-K.
We merge each firm's text product description to the CRSP/COMPUSTAT universe using the central index key (CIK). [We thank the Wharton Research Data Service (WRDS) for providing us with an expanded historical mapping of SEC CIK to COMPUSTAT gvkey, as the base CIK variable in COMPUSTAT only contains current links.] Our resulting database is based on all publicly traded firms (domestic firms traded on either NYSE, AMEX, or NASDAQ) for which we have COMPUSTAT and CRSP data.

We calculate our firm-by-firm pairwise similarity scores by parsing the product descriptions from the firm 10-Ks and forming word vectors for each firm. From these vectors we compute continuous measures of product similarity for every pair of firms in our sample in each year (a pairwise similarity matrix). This is done using the cosine similarity method, applied after basic screens that eliminate common words (see studies noted below). For any two firms i and j, we thus have a product similarity, which is a real number in the interval [0,1] describing how similar the words used by firms i and j are. For any given year, if there are 5000 firms, this would be ((5000*5000)-5000)/2 = 12,497,500 pairwise similarities (the lower off-diagonal triangle of a square matrix).

The FIC classification is based on a clustering algorithm that groups firms together to maximize within-industry similarity while achieving a goal of N industries. To maintain the fixed location properties of other FIC industries such as SIC or NAICS, our FIC industries are constructed using the 1997 data alone, and then the same set of industries is held fixed over time. We use 1997 as this is the earliest year for which we have full coverage in Edgar. The clustering algorithm is also run over the subset of firms excluding conglomerates to identify pure-play product markets accurately. The algorithm reduces the set of all firms to N industries using the within-industry similarity maximization procedure described in the papers below.
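The pairwise similarity calculation described above can be illustrated with a minimal sketch. It is not the procedure used to build the database: the stop-word list, the tokenization, and the use of binary (word present or absent) vectors are all assumptions made for illustration, and the function names are hypothetical. The actual screens and vector construction are documented in the studies noted below.

```python
import math
from itertools import combinations

# Hypothetical screen for common words; the real screens are described in the studies.
COMMON_WORDS = {"the", "and", "of", "a", "to", "in", "we", "our"}


def word_set(description):
    """Return the set of distinct words in a description, minus common words.

    A set is equivalent to a binary word vector (assumption for this sketch).
    """
    words = {w.strip(".,;:()") for w in description.lower().split()}
    return words - COMMON_WORDS - {""}


def cosine_similarity(words_i, words_j):
    """Cosine similarity of two binary word vectors; always lies in [0, 1]."""
    if not words_i or not words_j:
        return 0.0
    overlap = len(words_i & words_j)
    return overlap / math.sqrt(len(words_i) * len(words_j))


def pairwise_similarities(descriptions):
    """Similarity for every unordered pair of firms (one triangle of the matrix)."""
    vectors = {firm: word_set(text) for firm, text in descriptions.items()}
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i, j in combinations(sorted(vectors), 2)
    }


def pair_count(n_firms):
    """Number of unordered firm pairs: (n*n - n) / 2."""
    return (n_firms * n_firms - n_firms) // 2


# Toy descriptions, invented for illustration only.
toy = {
    "A": "We sell cloud software and database tools.",
    "B": "Our cloud database software serves banks.",
    "C": "We drill oil wells.",
}
sims = pairwise_similarities(toy)
```

In this toy example, firms A and B share several product words and receive a positive score, while A and C share none and receive a score of zero.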
Because the algorithm adjusts industry memberships after each iteration, it is possible that a FIC classification designed to have N industries might end up having only N-1 industries. The attached file includes FIC-500, FIC-400, FIC-300, FIC-200, and FIC-100 industries. A count of industries reveals that these classifications have 499, 400, 299, 200, and 100 industries, respectively. The incidence of having one fewer industry than targeted for the 500-classification and the 300-classification is a natural result of the algorithm and should not be viewed as being problematic.

One last note is that although we fix the classifications based on 1997 data, we do assign all firms to this fixed set of industries for the full length of our sample. Firms are thus evaluated each year, and a firm's industry assignment can change each year. That is, we use firm i's 2003 10-K to assign it to one of the N 1997 fixed location industries in 2003. This is done for each year. Because firm 10-Ks can change over time, and because the industries are fixed over time, a given firm's industry assignment can thus change as its 10-K evolves. This is analogous to the possibility that a firm can move from one SIC code to another over time. Hence, our FIC industries are designed to offer the same properties as other FIC industries like SIC and NAICS, but with FREQUENT updating based on how firms' product descriptions change over time.

Note that all FIC industries miss out on the enhanced flexibility offered by VIC industries. If your analysis can benefit from time-varying industry locations, or from the full knowledge of how similar firms and industries are to one another (see papers below), please use the VIC industry data that is also now available on the web.
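The two points above (counting the industries actually present, and firms changing industry from one year to the next) can be checked with a short sketch once the classification file is loaded. The rows below are invented for illustration, and the layout of (gvkey, year, icode300) tuples is an assumption about how the data might be held in memory, not a description of the file format.

```python
from collections import defaultdict

# Hypothetical rows of (gvkey, year, icode300); values are invented.
assignments = [
    (1001, 2002, 17), (1001, 2003, 17), (1001, 2004, 42),
    (1002, 2002, 42), (1002, 2003, 42),
]


def count_industries(rows):
    """Number of distinct industry codes that actually appear in the rows."""
    return len({code for _, _, code in rows})


def industry_switches(rows):
    """List (gvkey, year, old_code, new_code) where a firm's code differs
    from its code in the immediately preceding year."""
    by_firm = defaultdict(dict)
    for gvkey, year, code in rows:
        by_firm[gvkey][year] = code
    switches = []
    for gvkey, codes in by_firm.items():
        for year in sorted(codes):
            previous = codes.get(year - 1)
            if previous is not None and previous != codes[year]:
                switches.append((gvkey, year, previous, codes[year]))
    return sorted(switches)


n_industries = count_industries(assignments)
switches = industry_switches(assignments)
```

Run on the full FIC-300 file, a count of this kind would be expected to match the 299 industries noted above.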
**************************************************************************************************************
********************************************** Citations *****************************************************
**************************************************************************************************************

Please cite the following study when using this HHI data:

Dynamic Text-Based Industry Classifications and Endogenous Product Differentiation
Gerard Hoberg and Gordon Phillips, University of Maryland Working Paper.

* If using FIC or VIC data beyond HHIs, please cite the studies referred to in the readme file associated with FIC or VIC industry classification data.
**********************************************************************************************************************
********************************************** Technical Details *****************************************************
**********************************************************************************************************************

Please read the following carefully to ensure proper usage of this data.

Technical Note 1) Because our own research reveals that firms and industries move considerably within the product space over time, we view VIC industries as far more informative and useful than FIC classifications, including SIC, NAICS, or even these FIC industries. Also please read the final paragraph in the background section above in this file, which makes it clear that VIC data is needed to derive more economic content about how similar firms are within an industry, or how similar they are across industries.

Technical Note 2) Each file contains a year and icode300 industry/year identifier. The FIC300HHI variable is the concentration measure. Two steps must be used to fully integrate this into a COMPUSTAT table. First, you must merge your COMPUSTAT data using gvkey to the FIC industry classifications themselves. This is a separate database that is available in the Hoberg-Phillips data library (industry classifications section). Second, you would sort by icode300 and year, and then merge to this database.
You will then be left with a fic300hhi value for all available observations. Please carefully read the readme file associated with the FIC-300 classification itself to get more details on this procedure.

Technical Note 3) These HHI data are computed using VIC designations that include the firm itself in part of the HHI calculation. All HHIs are based on firm sales data from COMPUSTAT, and are computed using the Herfindahl-Hirschman sum of squared market shares formulation.

Technical Note 4) This data is not lagged. If lags are needed, the researcher must lag the data.
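The two-step merge in Technical Note 2, the sum of squared market shares in Technical Note 3, and the lagging mentioned in Technical Note 4 can be sketched as follows. This is only an illustration under stated assumptions: the in-memory layouts (dictionaries keyed by (gvkey, year) and (icode300, year)), the function names, the 0-to-1 scale of the HHI, and the choice to lag by matching year t observations to the industry's year t-1 value are all hypothetical. The numbers are invented, and the readme for the FIC-300 classification remains the reference for the actual procedure.

```python
def hhi(sales):
    """Herfindahl-Hirschman index: sum of squared market shares.

    Returned on a 0-1 scale (assumption; the database's scale may differ).
    """
    total = sum(sales)
    if total <= 0:
        return None
    return sum((s / total) ** 2 for s in sales)


def attach_hhi(compustat_rows, fic_lookup, hhi_lookup, lag_years=0):
    """Attach icode300 and fic300hhi to each firm-year row.

    Step 1: look up the firm's icode300 by (gvkey, year).
    Step 2: look up the industry's HHI by (icode300, year - lag_years).
    Rows without a match keep None, so unmatched observations stay visible.
    """
    merged = []
    for row in compustat_rows:
        icode = fic_lookup.get((row["gvkey"], row["year"]))
        value = None
        if icode is not None:
            value = hhi_lookup.get((icode, row["year"] - lag_years))
        merged.append({**row, "icode300": icode, "fic300hhi": value})
    return merged


# Invented toy inputs standing in for the three data sets.
compustat = [
    {"gvkey": 1001, "year": 2003, "sale": 50.0},
    {"gvkey": 1001, "year": 2004, "sale": 60.0},
    {"gvkey": 9999, "year": 2004, "sale": 5.0},   # firm with no FIC assignment
]
fic_classification = {(1001, 2003): 17, (1001, 2004): 42}
hhi_database = {(17, 2003): 0.21, (42, 2003): 0.35, (42, 2004): 0.33}

contemporaneous = attach_hhi(compustat, fic_classification, hhi_database)
lagged = attach_hhi(compustat, fic_classification, hhi_database, lag_years=1)
```

Keeping unmatched rows with a missing value, rather than dropping them, makes it easy to see how many observations lack an industry assignment or an HHI before any estimation is run.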