Input file
sample.xml:
----------
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>
// load
scala> val df_xml = sqlContext.read.format("xml").option("rowTag","book").load("/home/hadoop/Desktop/sample.xml")
df_xml: org.apache.spark.sql.DataFrame = [_id: string, author: string ... 5 more fields]
scala> df_xml.show()
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
| _id| author| description| genre|price|publish_date| title|
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
|bk101|Gambardella, Matthew|An in-depth look ...| Computer|44.95| 2000-10-01|XML Developer's G...|
|bk102| Ralls, Kim|A former architec...| Fantasy| 5.95| 2000-12-16| Midnight Rain|
|bk103| Corets, Eva|After the collaps...| Fantasy| 5.95| 2000-11-17| Maeve Ascendant|
|bk104| Corets, Eva|In post-apocalyps...| Fantasy| 5.95| 2001-03-10| Oberon's Legacy|
|bk105| Corets, Eva|The two daughters...| Fantasy| 5.95| 2001-09-10| The Sundered Grail|
|bk106| Randall, Cynthia|When Carla meets ...| Romance| 4.95| 2000-09-02| Lover Birds|
|bk107| Thurman, Paula|A deep sea diver ...| Romance| 4.95| 2000-11-02| Splish Splash|
|bk108| Knorr, Stefan|An anthology of h...| Horror| 4.95| 2000-12-06| Creepy Crawlies|
|bk109| Kress, Peter|After an inadvert...|Science Fiction| 6.95| 2000-11-02| Paradox Lost|
|bk110| O'Brien, Tim|Microsoft's .NET ...| Computer|36.95| 2000-12-09|Microsoft .NET: T...|
|bk111| O'Brien, Tim|The Microsoft MSX...| Computer|36.95| 2000-12-01|MSXML3: A Compreh...|
|bk112| Galos, Mike|Microsoft Visual ...| Computer|49.95| 2001-04-16|Visual Studio 7: ...|
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
// Transformation : filter
scala> val dfXML_filtered = df_xml.where("genre='Fantasy'")
dfXML_filtered: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_id: string, author: string ... 5 more fields]
scala> dfXML_filtered.show()
+-----+-----------+--------------------+-------+-----+------------+------------------+
| _id| author| description| genre|price|publish_date| title|
+-----+-----------+--------------------+-------+-----+------------+------------------+
|bk102| Ralls, Kim|A former architec...|Fantasy| 5.95| 2000-12-16| Midnight Rain|
|bk103|Corets, Eva|After the collaps...|Fantasy| 5.95| 2000-11-17| Maeve Ascendant|
|bk104|Corets, Eva|In post-apocalyps...|Fantasy| 5.95| 2001-03-10| Oberon's Legacy|
|bk105|Corets, Eva|The two daughters...|Fantasy| 5.95| 2001-09-10|The Sundered Grail|
+-----+-----------+--------------------+-------+-----+------------+------------------+
// Transformation : Filter with multiple conditions
scala> val dfXML_filtered = df_xml.where("genre='Fantasy' and author='Ralls, Kim'")
dfXML_filtered: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_id: string, author: string ... 5 more fields]
// Action
scala> dfXML_filtered.show()
+-----+----------+--------------------+-------+-----+------------+-------------+
| _id| author| description| genre|price|publish_date| title|
+-----+----------+--------------------+-------+-----+------------+-------------+
|bk102|Ralls, Kim|A former architec...|Fantasy| 5.95| 2000-12-16|Midnight Rain|
+-----+----------+--------------------+-------+-----+------------+-------------+
// Write into File System
scala> dfXML_filtered.write.format("xml").save("file:///home/hadoop/Desktop/xmloutput")
Subscribe to:
Post Comments (Atom)
Flume - Simple Demo
// create a folder in hdfs : $ hdfs dfs -mkdir /user/flumeExa // Create a shell script which generates : Hadoop in real world <n>...
-
How to fetch Spark Application Id programmaticall while running the Spark Job? scala> spark.sparkContext.applicationId res124: String = l...
-
input data: ---------- customerID, itemID, amount 44,8602,37.19 35,5368,65.89 2,3391,40.64 47,6694,14.98 29,680,13.08 91,8900,24.59 ...
-
pattern matching is similar to switch statements in C#, Java no fall-through - at least one condition matched no breaks object PatternExa { ...
No comments:
Post a Comment